前言:
觀看B站孟浩巍的系列視頻2017-08-04-高通量測序技術交流錄像,內容頗多,直接記錄筆記代碼,沒有經過本地運行,僅作記錄,他日再做梳理。
RNA-seq分析的流程及軟件
- raw data quality control
fastqc;cutadapt;fastx_trimmer - mapping to the genome
tophat2;STAR; hisat2 - remove the duplication
picard - calculate RPKM
cufflinks -
find different expression gene
cuffdiff
RNAseq pipeline.png
1. fastqc:
fastqc -t 2 -o ./fastqc_result/ -q ./raw.fast/*gz &
輸出結果包含:
html、fastqc.zip
2. cutadapt.sh
for case_name in SRR1033853
do
fastq_1 = ./raw.fastq/${case_name}_1.fastq.gz
fastq_2 = ./raw.fastq/${case_name}_2.fastq.gz
log_file = ./raw.fastq/${case_name}_cutadapt.log
out_fastq_1 = ./raw.fastq/${case_name}_1_cutadapt.fastq.gz
out_fastq_2 = ./raw.fastq/${case_name}_2_cutadapt.fastq.gz
nohup cutadapt --times 1 -e 0.1 -O 3 --qualigy-cutoff 6 -m 75 -a AGATTTGGGG -A AGTTGGAATC -o $out_fastq_1 -p $out_fastq_2 $fastq_1 $fastq_2 > &log_file 2>&1 &
done
參數(shù)說明:
--times 1 :去掉一次
-e 0.1 允許錯誤率
-O 3 至少有三個bp與reads序列overlap上
-m 75 去adapter 后最小長度
-a reads1需要去的序列
-A reads2需要去的序列
后面依次是輸出1,2,輸入1,2
log_file 屏幕內容輸出端到文件中
2>&1 報錯內容也輸出到屏幕上
nohup &不掛起,退出不終止
- trim.sh
for case_name in SRR1033853
do
fastq_1 = ./raw.fastq/${case_name}_1_cutadapt.fastq.gz
fastq_2 = ./raw.fastq/${case_name}_2_cutadapt.fastq.gz
out_fastq_1 = ./raw.fastq/${case_name}_R1_trim.fq.gz
out_fastq_2 = ./raw.fastq/${case_name}_R2_trim.fq.gz
zcat $fastq_1 | fastx_trimmer -f 6 -l 75 -z -o $out_fastq_1 &
zcat $fastq_2 | fastx_trimmer -f 6 -l 75 -z -o $out_fastq_2 &
done
參數(shù)說明:
zcat 解壓縮gz文件
-f 6 -l 75 從第6個堿基一直trimmer到第75個堿基
-z 輸出文件也用gzip壓縮
05. mm10_mapping.sh
mm10-index = /home/reference/mm10_fasta/mm10_genome_bt2
for case-name in SRR1033853
do
fastq_1 = ./raw.fastq/${case_name}_R1_trim.fq.gz
fastq_2 = ./raw.fastq/${case_name}_R2_trim.fq.gz
out_put_dir = ./${case_name}_tophat_result
log= ./${case_name}_tophat_result/${case-name}_tophat.log
mkdir -p $output_dir
nohup tophat2 -p 32 -o $output_dir $mm10-index $fastq_1 $fastq_2 >$log 2>&1 &
done
06. rmdup.sh
對于單細胞測序,去除重復
mm10-index = /home/reference/mm10_fasta/mm10_genome_bt2
for case-name in SRR1033853
do
input_bam = ./${case_name}_tophat_result/accepted_hits.bam
output_bam = ./${case_name}_tophat_result/accepted_hits_rmdup.bam
matrix_file = ./${case_name}_tophat_result/accepted_hists_rmdup.matrix
log_file= ./${case_name}_tophat_result/accepted_hists_rmdup.log
nohup java -Xms16g -Xmx32g -XX:ParallelGCThreads=32 -jar /home/my_software/picard/picard.jar MarkDuplicates INPUT= $input_bam OUTPUT=$out_bam METRICS_FILE=$matrix_file ASSUME_SORTED=true REMOVE_DUPLICATES=true >$log_file 2>&1 &
done
參數(shù)說明:
-Xms16g -Xmx32g 線程最小16g,最大32g
-XX:ParallelGCThreads=32 32核
METRICS_FILE=$matrix_file 指定一個文件名
ASSUME_SORTED=true 把完全一樣的序列去掉,先排序再運行
07. cufflink.sh
mm10_gtf = /home/reference/mm10_fasta/mm10_RefSeq.fix.gtf
for case-name in SRR1033853
do
cufflink_dir= ./${case_name}_cufflink_result/
log= ./${case_name}_cufflink_result/cufflink.log
bam_file = ./${case_name}_tophat_result/accepted_hits_rmdup.bam
mkdir -p $cufflink_dir
nohup cufflinks -o $cufflink_dir -p 16 -G $mm10_gtf $bam_file > $log 2>&1 &
done
參數(shù)說明:
cufflinks計算fpkm
-G 基因的注釋文件
RSEM計算表達量,用統(tǒng)計回歸去計算擬合表達量
08. R中分析
主成分分析PCA
principal component analysis 簡單來說,就是在尋找變量中最能代表這組數(shù)據(jù)的變量,用少數(shù)幾個變量的線性變化來代表原始數(shù)據(jù),從而叨叨數(shù)據(jù)降維的目的。
#load fpkm table
rm(list=ls())
combine_fpkm_table = read.table(file="./nature_2014-single/", headr = T, sep="\t")
dim(combine_fpkm_table)
#R自帶的princcomp(input_matrix)
input_matrix = combine_fpkm_table[,c(2:ncol(combine_fpkm_table))]
princomp(input_matrix )
#install.packages("psych")
install.packages("pysch")
library(psych)
input_matrix = combine_fpkm_table[,c(2:ncol(combine_fpkm_table))]
pca_result = principal(input_matrix, nfactors = 3)
dim(pca_result$scores)
pca_result$values
pca_result$scores
pca_result$weights
plot(pca_result$scores[,1],pca_result$scores[,2],xlim=c(0,50),ylim=c(0,50))
09.繪圖
1. boxplot

Q1-Q3:25%與75%差中位數(shù)
IQR:interquartile range
boxplot(log2(gene_fpkm_table.select$FPKM+1),col="orange")
2. vioplot

可以看出基因的豐度
rm (list=ls())
library(vioplot)
vioplot::vioplot(log2(gene_fpkm_table.select$FPKM+1),col="orange")
3. pheatmap
pheatmap(log2(input_matrix[c(1:500),]+1))
