使用R分析16s擴(kuò)增子α,β多樣性

這部分教程依然來(lái)自Happy Belly Bioinformatics網(wǎng)站, 主要學(xué)習(xí)一下使用DADA2獲得ASVs數(shù)據(jù)之后的進(jìn)一步分析。譬如,微生物群落多樣性,樣品間的微生物群落差異分析以及系統(tǒng)進(jìn)化等。

1. 工作環(huán)境設(shè)置和數(shù)據(jù)導(dǎo)入

> setwd("C:\\Users\\Administrator\\Documents\\R_work\\amplicon_example_workflow\\R_working_dir")
> getwd()
[1] "C:/Users/Administrator/Documents/R_work/amplicon_example_workflow/R_working_dir"
> list.files()
[1] "amplicon_example_analysis.R" "ASVs_counts.txt"            
[3] "ASVs_taxonomy.txt"           "sample_info.txt"  
 
#載入所需包
> library("phyloseq")
> library("vegan")
> library("DESeq2")
載入需要的程輯包:SummarizedExperiment
載入需要的程輯包:DelayedArray
載入需要的程輯包:matrixStats

載入程輯包:‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

載入需要的程輯包:BiocParallel

載入程輯包:‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

    aperm, apply

> library("ggplot2")
> library("dendextend")
> library("tidyr")
> library("viridis")
> library("reshape")

#導(dǎo)入數(shù)據(jù)
> count_tab <- read.table("ASVs_counts.txt", header=T, row.names=1, check.names=F)
> tax_tab <- as.matrix(read.table("ASVs_taxonomy.txt", header=T, row.names=1, check.names=F, na.strings="", sep="\t"))
> sample_info_tab <- read.table("sample_info.txt", header=T, row.names=1, check.names=F)
> sample_info_tab
      temp    type      char
B1    <NA>   blank      <NA>
B2    <NA>   blank      <NA>
B3    <NA>   blank      <NA>
B4    <NA>   blank      <NA>
BW1    2.0   water     water
BW2    2.0   water     water
R10   13.7    rock    glassy
R11    7.3    rock    glassy
R11BF  7.3 biofilm   biofilm
R12   <NA>    rock   altered
R1A    8.6    rock   altered
R1B    8.6    rock   altered
R2     8.6    rock   altered
R3    12.7    rock   altered
R4    12.7    rock   altered
R5    12.7    rock   altered
R6    12.7    rock   altered
R7    <NA>    rock carbonate
R8    13.5    rock    glassy
R9    13.7    rock    glassy

2. 數(shù)據(jù)清洗

# first we need to get a sum for each ASV across all 4 blanks and all 16 samples
blank_ASV_counts <- rowSums(count_tab[,1:4])
sample_ASV_counts <- rowSums(count_tab[,5:20])
blank_ASV_counts
sample_ASV_counts

# now we normalize them, here by dividing the samples' total by 4 ??? as there are 4x as many samples (16) as there are blanks (4)
norm_sample_ASV_counts <- sample_ASV_counts/4

# here we're getting which ASVs are deemed likely contaminants based on the threshold noted above:
blank_ASVs <- names(blank_ASV_counts[blank_ASV_counts * 10 > norm_sample_ASV_counts])
length(blank_ASVs) # this approach identified about 50 out of ~1550 that are likely to have orginated from contamination

# now we normalize them, here by dividing the samples' total by 4 ??? as there are 4x as many samples (16) as there are blanks (4)
norm_sample_ASV_counts <- sample_ASV_counts/4

# here we're getting which ASVs are deemed likely contaminants based on the threshold noted above:
blank_ASVs <- names(blank_ASV_counts[blank_ASV_counts * 10 > norm_sample_ASV_counts])

length(blank_ASVs) # this approach identified about 50 out of ~1550 that are likely to have orginated from contamination

# looking at the percentage of reads retained for each sample after removing these presumed contaminant ASVs shows that the blanks lost almost all of their sequences, while the samples, other than one of the bottom water samples, lost less than 1% of their sequences, as would be hoped
colSums(count_tab[!rownames(count_tab) %in% blank_ASVs, ]) / colSums(count_tab) * 100

# now that we've used our extraction blanks to identify ASVs that were likely due to contamination, we're going to trim down our count table by removing those sequences and the blank samples from further analysis
filt_count_tab <- count_tab[!rownames(count_tab) %in% blank_ASVs, -c(1:4)]

# and here make a filtered sample info table that doesn't contain the blanks
filt_sample_info_tab<-sample_info_tab[-c(1:4), ]

# and let's add some colors to the sample info table that are specific to sample types and characteristics that we will use when plotting things
# we'll color the water samples blue: 
filt_sample_info_tab$color[filt_sample_info_tab$char == "water"] <- "blue"

# the biofilm sample a darkgreen:
filt_sample_info_tab$color[filt_sample_info_tab$char == "biofilm"] <- "darkgreen"

# the basalts with highly altered, thick outer rinds (>1 cm) brown ("chocolate4" is the best brown I can find...):
filt_sample_info_tab$color[filt_sample_info_tab$char == "altered"] <- "chocolate4"
# the basalts with smooth, glassy, thin exteriors black:
filt_sample_info_tab$color[filt_sample_info_tab$char == "glassy"] <- "black"
# and the calcified carbonate sample an ugly yellow:
filt_sample_info_tab$color[filt_sample_info_tab$char == "carbonate"] <- "darkkhaki"

# and now looking at our filtered sample info table we can see it has an addition column for color
filt_sample_info_tab

3. β多樣性分析

主要計(jì)算樣品之間的距離或者不同點(diǎn),一般通過(guò)計(jì)算樣品間的歐幾里得距離來(lái)展示樣品間的異同。但在不同樣品的測(cè)序深度并不一致,這將影響歐幾里得距離的計(jì)算,因此我們首先要將所有樣品測(cè)序深度標(biāo)準(zhǔn)化。
一般的方法是,以最低測(cè)序深度樣品為基準(zhǔn),將所有樣品測(cè)序深度通過(guò)二次抽樣調(diào)整到最低測(cè)序深度水平;或者是將每個(gè)樣品測(cè)序的數(shù)量轉(zhuǎn)化為比例值。作者這里推薦使用 DESeq2 包的來(lái)標(biāo)準(zhǔn)化樣品數(shù)據(jù)。

# first we need to make a DESeq2 object
> deseq_counts <- DESeqDataSetFromMatrix(filt_count_tab, colData = filt_sample_info_tab, design = ~type)
> deseq_counts_vst <- varianceStabilizingTransformation(deseq_counts)
# pulling out our transformed table 
> vst_trans_count_tab <- assay(deseq_counts_vst)
# and calculating our Euclidean distance matrix
> euc_dist <- dist(t(vst_trans_count_tab))

> euc_clust <- hclust(euc_dist, method="ward.D2")

# plot(euc_clust) # hclust objects like this can be plotted as is, but i like to change them to dendrograms for two reasons:
  # 1) it's easier to color the dendrogram plot by groups
  # 2) if you want you can rotate clusters with the rotate() function of the dendextend package

> euc_dend <- as.dendrogram(euc_clust, hang=0.1)
> dend_cols <- filt_sample_info_tab$color[order.dendrogram(euc_dend)]
> labels_colors(euc_dend) <- dend_cols

> plot(euc_dend, ylab="VST Euc. dist.")

根據(jù)樣品間的歐幾里德距離,對(duì)樣品進(jìn)行層次聚類。我作出來(lái)的跟作者的結(jié)果不完全一致,不知道問(wèn)題出在哪兒。作者是使用Usearch分析得到的OTU數(shù)據(jù),我是使用DADA2得到的ASVs數(shù)據(jù),然后多樣性的分析是參照的作者??赡苁荱search和DADA2分析結(jié)果不完全一致,也可能是我代碼輸入過(guò)程中出了一些未能察覺(jué)的差錯(cuò)導(dǎo)致的。


PCoA分析

> # making our phyloseq object with transformed table
> vst_count_phy <- otu_table(vst_trans_count_tab, taxa_are_rows=T)
> sample_info_tab_phy <- sample_data(filt_sample_info_tab)
> vst_physeq <- phyloseq(vst_count_phy, sample_info_tab_phy)
> # generating and visualizing the PCoA with phyloseq
> vst_pcoa <- ordinate(vst_physeq, method="MDS", distance="euclidean")
> eigen_vals <- vst_pcoa$values$Eigenvalues # allows us to scale the axes according to their magnitude of separating apart the samples
> plot_ordination(vst_physeq, vst_pcoa, color="char") + 
+   labs(col="type") + geom_point(size=1) + 
+   geom_text(aes(label=rownames(filt_sample_info_tab), hjust=0.3, vjust=-0.4)) + 
+   coord_fixed(sqrt(eigen_vals[2]/eigen_vals[1])) + ggtitle("PCoA") + 
+   scale_color_manual(values=unique(filt_sample_info_tab$color[order(filt_sample_info_tab$char)])) + 
+   theme(legend.position="none")

結(jié)果依然跟作者不完全一致!

4. α多樣性分析

樣品稀薄曲線

> rarecurve(t(filt_count_tab), step=100, col=filt_sample_info_tab$color, lwd=2, ylab="ASVs")
abline(v=(min(rowSums(t(filt_count_tab)))))

群落豐富度和多樣性計(jì)算

# first we need to create a phyloseq object using our un-transformed count table
> count_tab_phy <- otu_table(filt_count_tab, taxa_are_rows=T)
> tax_tab_phy <- tax_table(tax_tab)
> ASV_physeq <- phyloseq(count_tab_phy, tax_tab_phy, sample_info_tab_phy)
# first we need to create a phyloseq object using our un-transformed count table
> count_tab_phy <- otu_table(filt_count_tab1, taxa_are_rows=T)
> tax_tab_phy <- tax_table(tax_tab)
> ASV_physeq <- phyloseq(count_tab_phy, tax_tab_phy, sample_info_tab_phy)
# and now we can call the plot_richness() function on our phyloseq object
> plot_richness(ASV_physeq, color="char", measures=c("Chao1", "Shannon")) + 
+   scale_color_manual(values=unique(filt_sample_info_tab$color[order(filt_sample_info_tab$char)])) +
+   theme(legend.title = element_blank())

phyloseq包繪圖

> plot_richness(ASV_physeq, x="type", color="char", measures=c("Chao1", "Shannon")) + 
+   scale_color_manual(values=unique(filt_sample_info_tab$color[order(filt_sample_info_tab$char)])) +
+   theme(legend.title = element_blank())

至此,16s擴(kuò)增子的α和β多樣性分析完成。我只是簡(jiǎn)單的過(guò)了一遍作者的流程和代碼,其中的可視化部分還有待精雕細(xì)琢。下一步是微生物群落組成和豐度差異分析。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

友情鏈接更多精彩內(nèi)容