轉(zhuǎn)錄組分析實(shí)戰(zhàn)第四節(jié):轉(zhuǎn)錄組分析中的技術(shù)重復(fù)和生物學(xué)重復(fù)檢查

前期的到的基因表達(dá)量矩陣,可以得到每個基因的表達(dá)量,然而由于我們在做實(shí)驗(yàn)過程中的重復(fù)(包括技術(shù)重復(fù)與生物學(xué)重復(fù))理論上來講是可以保持表達(dá)量在重復(fù)中的一致性。因此我們也可通過這個工作來檢查我們是否有正確的重復(fù)數(shù)據(jù)。

Trinity工具包提供了一些可以用于檢測重復(fù)一致性的腳本。我們今天就通過這些腳本進(jìn)行檢查。

在這個工作之前需要兩個數(shù)據(jù):

1. 基因表達(dá)的counts.matrix 文件
2. 生物學(xué)重復(fù)的表文件
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/downstr/RSEMout/RSEMout$ l *counts.matrix
RSEM.gene.counts.matrix  RSEM.isoform.counts.matrix
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/downstr/RSEMout/RSEMout$ cat samples.txt 
B25 B251
B25 B252
R25 R251
R25 R252
W25 W251
W25 W252
需要注意的是:samples.txt中的名字需要和matrix中的名字一致,否則沒辦法識別
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout$ $TRINITY_HOME/Analysis/DifferentialExpression/PtR \ #調(diào)用PtR腳本
--matrix RSEM.isoform.counts.matrix \#指定給定的matrix
--samples samples.txt \#樣品重復(fù)信息
--log2 \#做一個對數(shù)處理
--min_rowSums 10 \#過濾數(shù)據(jù)指標(biāo)
--compare_replicates #輸出的圖像參數(shù)
為了作為補(bǔ)充,我們獲取這個腳本的幫助文件
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~$ $TRINITY_HOME/Analysis/DifferentialExpression/PtR --help

#################################################################################### 
#
#######################
# Inputs and Outputs: #
#######################
#
#  --matrix <string>        matrix.RAW.normalized.FPKM
#
#  Optional:
#
#  Sample groupings:
#
#  --samples <string>      tab-delimited text file indicating biological replicate relationships.
#                                   ex.
#                                        cond_A    cond_A_rep1
#                                        cond_A    cond_A_rep2
#                                        cond_B    cond_B_rep1
#                                        cond_B    cond_B_rep2
#
#  --gene_factors <string>   tab-delimited file containing gene-to-factor relationships.
#                               ex.
#                                    liver_enriched <tab> gene1
#                                    heart_enriched <tab> gene2
#                                    ...
#                            (use of this data in plotting is noted for corresponding plotting options)
#
#
#  --output <string>        prefix for output file (default: "${matrix_file}.heatmap")
#
#  --save                   save R session (as .RData file)
#  --no_reuse               do not reuse any existing .RData file on initial loading
#
#####################
#  Plotting Actions #
#####################
#
#  --compare_replicates        provide scatter, MA, QQ, and correlation plots to compare replicates.
#
#   
#
#  --barplot_sum_counts        generate a barplot that sums frag counts per replicate across all samples.
#
#  --boxplot_log2_dist <float>        generate a boxplot showing the log2 dist of counts where counts >= min fpkm
#
#  --sample_cor_matrix         generate a sample correlation matrix plot
#    --sample_cor_scale_limits <string>    ex. "-0.2,0.6"
#    --sample_cor_sum_gene_factor_expr <factor=string>    instead of plotting the correlation value, plot the sum of expr according to gene factor
#                                                         requires --gene_factors 
#
#  --sample_cor_subset_matrix <string>  plot the sample correlation matrix, but create a disjoint set for rows,cols.
#                                       The subset of the samples to provide as the columns is provided as parameter.
#
#  --gene_cor_matrix           generate a gene-level correlation matrix plot
#
#  --indiv_gene_cor <string>   generate a correlation matrix and heatmaps for '--top_cor_gene_count' to specified genes (comma-delimited list)
#      --top_cor_gene_count <int>   (requires '--indiv_gene_cor with gene identifier specified')
#      --min_gene_cor_val <float>   (requires '--indiv_gene_cor with gene identifier specified')
#
#  --heatmap                   genes vs. samples heatmap plot
#      --heatmap_scale_limits "<int,int>"  cap scale intensity to low,high  (ie.  "-5,5")
#      --heatmap_colorscheme <string>  default is 'purple,black,yellow'
#                                      a popular alternative is 'green,black,red'
#                                      Specify a two-color gradient like so: "black,yellow".
#
#     # sample (column) labeling order
#      --lexical_column_ordering        order samples by column name lexical order.
#      --specified_column_ordering <string>  comma-delimited list of column names (must match matrix exactly!)
#      --order_columns_by_samples_file  order the columns in the heatmap according to replicate name ordering in the samples file.
#
#     # gene (row) labeling order
#      --order_by_gene_factor           order the genes by their factor (given --gene_factors)
#
#  --gene_heatmaps <string>    generate heatmaps for just one or more specified genes
#                              Requires a comma-delimited list of gene identifiers.
#                              Plots one heatmap containing all specified genes, then separate heatmaps for each gene.
#                                 if --gene_factors set, will include factor annotations as color panel.
#                                 else if --prin_comp set, will include include principal component color panel.
#
#  --prin_comp <int>           generate principal components, include <int> top components in heatmap  
#      --add_prin_comp_heatmaps <int>  draw heatmaps for the top <int> features at each end of the prin. comp. axis.
#                                      (requires '--prin_comp') 
#      --add_top_loadings_pc_heatmap <int>  draw a heatmap containing the <int> top feature loadings across all PCs.
#      --R_prin_comp_method <string>        options: princomp, prcomp (default: prcomp)
#
#  --mean_vs_sd               expression variability plot. (highlight specific genes by category via --gene_factors )
#
#  --var_vs_count_hist <vartype=string>        create histogram of counts of samples having feature expressed within a given expression bin.
#                                              vartype can be any of 'sd|var|cv|fano'
#      --count_hist_num_bins <int>  number of bins to distribute counts in the histogram (default: 10)
#      --count_hist_max_expr <float>  maximum value for the expression histogram (default: max(data))
#      --count_hist_convert_percentages       convert the histogram counts to percentage values.
#
#
#  --per_gene_plots                   plot each gene as a separate expression plot (barplot or lineplot)
#    --per_gene_plot_width <float>     default: 2.5
#    --per_gene_plot_height <float>    default: 2.5
#    --per_gene_plots_per_row <int>   default: 1
#    --per_gene_plots_per_col <int>   default: 2
#    --per_gene_plots_incl_vioplot    include violin plots to show distribution of rep vals
#
########################################################
#  Data Filtering, in order of operation below:  #########################################################
#
#
## Column filters:
#
#  --restrict_samples <string>   comma-delimited list of samples to restrict to (comma-delim list)
#
#  --top_rows <int>         only include the top number of rows in the matrix, as ordered.
#
#  --min_colSums <float>      min number of fragments, default: 0
#
#  --min_expressed_genes <int>           minimum number of genes (rows) for a column (replicate) having at least '--min_gene_expr_val'
#       --min_gene_expr_val <float>   a gene must be at least this value expressed across all samples.  (default: 1)
#
#
## Row Filters:
#
#  --min_rowSums <float>      min number of fragments, default: 0
#
#  --gene_grep <string>     grep on string to restrict to genes
#
#  --min_across_ALL_samples_gene_expr_val <int>   a gene must have this minimum expression value across ALL samples to be retained.
#
#  --min_across_ANY_samples_gene_expr_val <int>   a gene must have at least this expression value across ANY single sample to be retained.
#
#  --min_gene_prevalence <int>   gene must be found expressed in at least this number of columns
#       --min_gene_expr_val <float>   a gene must be at least this value expressed across all samples.  (default: 1)
#
#  --minValAltNA <float>    minimum cell value after above transformations, otherwise convert to NA
#
#  --top_genes <int>        use only the top number of most highly expressed transcripts
#
#  --top_variable_genes <int>      Restrict to the those genes with highest coeff. of variability across samples (use median of replicates)
#
#      --var_gene_method <string>   method for ranking top variable genes ( 'coeffvar|anova', default: 'anova' )
#           --anova_maxFDR <float>    if anova chose, require FDR value <= anova_maxFDR  (default: 0.05)
#            or
#           --anova_maxP <float>    if set, over-rides anova_maxQ  (default, off, uses --anova_maxQ)
#
#  --top_variable_via_stdev_and_mean_expr    perform filtering based on the stdev vs. mean expression plot.
#      Requires both:               (note, if you used --log2 and/or --Zscale, settings below should use those transformed values)
#         --min_stdev_expr <float>       minimum standard deviation in expression
#         --min_mean_expr  <float>       minimum mean expression value 
#
######################################
#  Data transformations:             #
######################################
#
#  --CPM                    convert to counts per million (uses sum of totals before filtering)
#  --CPK                    convert to counts per thousand
#
#  --binary                 all values > 0 are set to 1.  All values < 0 are set to zero.
#
#  --log2
#
#  --center_rows            subtract row mean from each data point. (only used under '--heatmap' )
#
#  --Zscale_rows            Z-scale the values across the rows (genes)  
#
#########################
#  Clustering methods:  #
#########################
#
#  --gene_dist <string>        Setting used for --heatmap (samples vs. genes)
#                                  Options: euclidean, gene_cor
#                                           maximum, manhattan, canberra, binary, minkowski
#                                  (default: 'euclidean')  Note: if using 'gene_cor', set method using '--gene_cor' below.
#
#
#  --sample_dist <string>      Setting used for --heatmap (samples vs. genes)
#                                  Options: euclidean, sample_cor
#                                           maximum, manhattan, canberra, binary, minkowski
#                                  (default: 'euclidean')  Note: if using 'sample_cor', set method using '--sample_cor' below.
#
#
#  --gene_clust <string>       ward, single, complete, average, mcquitty, median, centroid, none (default: complete)
#  --sample_clust <string>     ward, single, complete, average, mcquitty, median, centroid, none (default: complete)
#
#  --gene_cor <string>             Options: pearson, spearman  (default: pearson)
#  --sample_cor <string>           Options: pearson, spearman  (default: pearson)
#
####################
#  Image settings: #
####################
#
#
#  --imgfmt <string>           image type (pdf,svg) with default: pdf
#
#  --img_width <int>           image width
#  --img_height <int>          image height
#
################
# Misc. params #
################
#
#  --write_intermediate_data_tables         writes out the data table after each transformation.
#
#  --show_pipeline_flowchart                describe order of events and exit.
#
####################################################################################
但是在這個過程中會報錯,原因是本地的R包沒有安裝好,然后回頭去安裝R包,有些R包在Bioconductor上有些就在CRAN里面。R腳本如下
source("https://bioconductor.org/biocLite.R")
biocLite("Biobase")
installed.packages()
biocLite("qvalue")
help(package='qvalue')
install.packages('fastcluster')
最后結(jié)果就是關(guān)于一個處理中生物學(xué)重復(fù)之間的相關(guān)性的幾個圖,放在一個PDF上的
對于圖的講解有機(jī)會再講。(因?yàn)槲乙膊恢烙惺裁匆饬x)我先放出來:
image 1 .png

image 2 .png

image 3 .png

image 4 .png
關(guān)于這幾張圖的解釋請大家多多指教,另外我后期通過學(xué)習(xí)也可以晚上對這個圖的解讀與分析。

=======================

下面進(jìn)行跨樣本間的相關(guān)性檢測與作圖
$TRINITY_HOME/Analysis/DifferentialExpression/PtR \
          --matrix RSEM.isoform.counts.matrix \
          --min_rowSums 10 \
          --samples samples.txt \ #
          --log2 \ #數(shù)據(jù)轉(zhuǎn)換參數(shù)
          --CPM \ #數(shù)據(jù)轉(zhuǎn)換參數(shù)
          --sample_cor_matrix  #輸出樣品相關(guān)性矩陣圖
這個代碼做出來的結(jié)果是不同樣本間的數(shù)據(jù)一致性熱圖
image 5 .png
熱圖反應(yīng)處理之間和處理內(nèi)部的重復(fù)之間的一致性

======================

最后一個結(jié)果是通過PCA分析對樣品重復(fù)關(guān)系進(jìn)行檢測。
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout$ $TRINITY_HOME/Analysis/DifferentialExpression/PtR \ 
--matrix RSEM.isoform.counts.matrix \
--samples samples.txt \
--log2 \
--min_rowSums 10 \
--CPM \
--center_rows \
--prin_comp 3
輸出結(jié)果為PCA分析圖(這個圖我也看不懂)
PCA Plot
以后有機(jī)會在進(jìn)行解讀吧。

重點(diǎn)是我看不懂這些圖,請大家多多指教!

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

友情鏈接更多精彩內(nèi)容