寫在前面
- 很榮幸參加生信技能樹全國巡講-第一站-重慶站的RNA-seq線下培訓(xùn)課程。
- 以下為線下課結(jié)合線上課所做的筆記,希望能幫到你。
一些小練習(xí)題
R語言的練習(xí)題
初級10 個題目,盡量根據(jù)參考代碼理解及完成:http://www.bio-info-trainee.com/3793.html
中級要求是:http://www.bio-info-trainee.com/3750.html
高級要求是完成20題: http://www.bio-info-trainee.com/3415.html
LINUX的練習(xí)題:
最低要求是完成我的 linux 20題 http://www.bio-info-trainee.com/2900.html
其次完成生物信息學(xué)數(shù)據(jù)格式的習(xí)題(blast/blat/fa-fq/sam-bam/vcf/bed/gtf-gff),收集這些格式的說明書。
fasta和fastq格式文件的shell小練習(xí) http://www.bio-info-trainee.com/3575.html
sam和bam格式文件的shell小練習(xí) http://www.bio-info-trainee.com/3578.html
VCF格式文件的shell小練習(xí) http://www.bio-info-trainee.com/3577.html
RNA-seq只有小考核 http://www.bio-info-trainee.com/3920.html
下面開始RNA-seq流程
1.軟件下載
1.1下載Miniconda3
- 清華鏡像關(guān)閉后,只有選擇其它下載:
[官網(wǎng)]:https://docs.conda.io/en/latest/miniconda.html - 參考教程: conda的安裝與使用
# 下載Miniconda3
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
## 選擇miniconda2/3都可以,可以通過創(chuàng)建另一個版本Python環(huán)境。
# 安裝Miniconda3
$ bash Miniconda3-latest-Linux-x86_64.sh
$ source ~/.bashrc
(base) vip39 13:30:42 ~ #表示激活miniconda
# 配置鏡像,清華鏡像掛掉后懸著這個吧
conda config --add channels bioconda #生信軟件
conda config --add channels conda-forge
# 查看一下配置的鏡像
$ vim ~/.condarc
channels:
- conda-forge
- bioconda
- defaults
show_channel_urls: true
# 創(chuàng)建命名為rna的python2的環(huán)境
$ conda create -n rna python=2
Collecting package metadata: /
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
# 激活/進入conda的rna環(huán)境,避免每次用-n rna
$ source activate rna
(rna) vip39 13:47:00 ~
1.2安裝軟件
jimmy老師給出的安裝列表
- 數(shù)據(jù)資源下載,參考基因組及參考轉(zhuǎn)錄組
gtf ,genome,fa - 質(zhì)控,需要fastqc及multiqc
trimmomatic,cutadapt,trim-galore - 對比
star,HISAT2,TOPHAT2, bowtie,bwa,subread - 計數(shù)
featureCounts,htseq-counts,bedtools - nomalization 歸一化,差異分析等
DEseq2,edgeR,limma
$ conda install -y fastqc trim-galore star bwa hisat2 bowtie2 subread tophat htseq bedtools deeptools salmon cutadapt multiqc sra-tools
$ conda install -c bioconda trimmomatic
- 清華鏡像掛掉后,就多用google,或者https://anaconda.org/bioconda/
2.下載數(shù)據(jù)
- 找到SRA數(shù)據(jù) [可修改SRP號]
- 找到SRA數(shù)據(jù)集,找到_List.txt
#得到_List.txt后,我們直接用循環(huán)下載數(shù)據(jù)
$ cat SRR_Acc_List.txt | while read id; do (prefetch ${id});done
##上面程序掛為后臺運行的方式為:
cat SRR_Acc_List.txt | while read id; do (prefetch $id &);done
## 殺死上面這個循壞
ps -ef |grep prefecth awk '{print $2}'|while read id ;do kill $ id ;done
# 下載得到數(shù)據(jù),這里我們下載得到4個數(shù)據(jù)
$ tree
.
├── SRR1039508.sra
├── SRR1039509.sra
├── SRR1039510.sra
└── SRR1039514.sra
# 3. SRA數(shù)據(jù)轉(zhuǎn)成fastq.gz格式
## 使用循環(huán)來實現(xiàn)
$ mkdir 1.fastq_qc
$ cd 1.fastq_qc
$ ls ~/ncbi/public/sra/*|while read id ;do (nohup fastq-dump --gzip --split-3 -O ./ $id &);done
## 你也可以手動實現(xiàn)轉(zhuǎn)換
fastq-dump --gzip --split-3 -O ~/ ./SRR1039508.sra
3.質(zhì)控
3.1了解數(shù)據(jù)大小、質(zhì)量
$ ls *gz |xargs fastqc -t 10
## multiqc整合結(jié)果
$ multiqc ./
...
[INFO ] fastqc : Found 8 reports #找到8個,對應(yīng)上我上面的4個
[INFO ] multiqc : Report : multiqc_report.html #將這個結(jié)果報告下載下來查看。
[INFO ] multiqc : MultiQC complete
-
查看multiqc_report.html
multiqc_report.html
3.2過濾數(shù)據(jù)
- 去除接頭
- 去除兩端低質(zhì)量堿基(-q 25)
- 最大允許錯誤率(默認(rèn)-e 0.1)
- 去除<36的reads(--length 36)
- 切除index的overlap>3的堿基
- reads去除以對為單位(--paired)
##過濾數(shù)據(jù)
###step3:filter the bad quality reads and remove adaptors。
$ cd 1.fastq_qc/
$ ls ~/1.fastq_qc/*_1.fastq.gz
/home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz
$ ls ~/1.fastq_qc/*_1.fastq.gz >1
$ ls ~/1.fastq_qc/*_2.fastq.gz >2
$ paste 1 2
$ paste 1 2 >config
$ cat config
/home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039508_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039509_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039510_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz /home/vip51/1.fastq_qc/SRR1039514_2.fastq.gz
## 建立好config后,接下來就可以進行批量質(zhì)控了
## 建立批量處理腳本
$ cat >qc.sh
source activate rna
bin_trim_galore=trim_galore
mkdir filter-data
dir='/home/vip51/filter-data'
cat config |while read id
do
arr=($id)
fq1=${arr[0]}
fq2=${arr[1]}
nohup $bin_trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $dir $fq1 $fq2 &
done
source deactivate
## 拿到config和qc.sh后
$ bash qc.sh config
4比對-HISAT2 mapping
4.1參考基因組hg38及構(gòu)建索引
#hisat軟件建立索引文件
cd ~/reference
mkdir -p index/hisat && cd index/hisat
nohup wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz &
nohup tar zxvf grcm38.tar.gz &
## 解壓后
(rna) vip51 00:19:08 ~/reference/index/hisat/hg38
$ ls
genome.1.ht2 genome.2.ht2 genome.3.ht2 genome.4.ht2 genome.5.ht2 genome.6.ht2 genome.7.ht2 genome.8.ht2 make_hg38.sh
# 這里表示構(gòu)建索引成功,簡單的說,就是去官網(wǎng)下載index,然后解壓即可成功
- 其實構(gòu)建索引的時候,我花了很多時間,就在于
~/reference/index/hisat/hg38/genome這個的理解不到位,所以跑去Google-構(gòu)建索引失敗才發(fā)現(xiàn)問題.
hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log
4.2 比對
- bam查看方法
samtools view -h smaple.bam
samtools view smaple.bam
## 將文件名為*gq.gz改為*fq,head-1000取的前1000行,這一步非必要的
$ ls ~/filter-data/*gz|while read id ;do (zcat $id|head -1000> $(basename $id ".gz"));done
$ (rna) vip51 20:30:12 ~/filter-data
$ ls -lh
total 12G
-rw-rw-r-- 1 vip51 vip51 2.9K Jun 2 15:29 SRR1039508_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 vip51 vip51 64K Jun 6 16:53 SRR1039508_1_val_1.fq #取的前1000行,所以數(shù)據(jù)很小。
-rw-rw-r-- 1 vip51 vip51 1.3G Jun 2 15:44 SRR1039508_1_val_1.fq.gz
## hisat2 比對
id=SRR1039508
hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log
## hisat批量比對
$ ls *gz|cut -d "_" -f 1|sort -u
SRR1039508
SRR1039509
SRR1039510
SRR1039514
## 建立一個hisat.sh
$ cat > hisat.sh
cd filter-data/
ls *gz|cut -d "_" -f 1|sort -u |while read id ;do
cd ~/4.mapping/
ls -lh ${id}_1_val_1.fq.gz ${id}_2_val_2.fq.gz
hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq.gz -2 ~/filter-data/${id}_2_val_2.fq.gz -S ${id}.hisat.sam
done
###批量對比
$ source hisat.sh
(rna) vip51 23:15:04 ~/4.mapping
$ ls -lh
total 56G
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 12G Jun 7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 21G Jun 7 23:08 SRR1039514.hisat.sam
##上面看出.sam文件還是很大的
## 如果跑的是這個
# hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam
## 上面對對應(yīng)的就是head -1000,數(shù)據(jù)會很小
###.sam轉(zhuǎn)化為.bam
$ ls *.sam |while read id;do (samtools sort -O bam -@ 5 -o $(basename ${id} ".sam").bam ${id});done
(rna) vip51 23:34:31 ~/4.mapping
$ ls -lh
total 66G
-rw-rw-r-- 1 vip51 vip51 2.1G Jun 7 23:21 SRR1039508.hisat.bam
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 1.9G Jun 7 23:24 SRR1039509.hisat.bam
-rw-rw-r-- 1 vip51 vip51 12G Jun 7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 2.1G Jun 7 23:28 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 3.7G Jun 7 23:34 SRR1039514.hisat.bam
-rw-rw-r-- 1 vip51 vip51 21G Jun 7 23:08 SRR1039514.hisat.sam
# 可以看出.bam文件要小得多
4.3 比對策略統(tǒng)計比對效率
ls *.bam |while read id ;do ( samtools flagstat -@ 1 $id > $(basename ${id} ".bam").flagstat );done
(rna) vip51 23:42:28 ~/4.mapping
$ ls -lh
total 66G
-rw-rw-r-- 1 vip51 vip51 2.1G Jun 7 23:21 SRR1039508.hisat.bam
-rw-rw-r-- 1 vip51 vip51 448 Jun 7 23:38 SRR1039508.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 1.9G Jun 7 23:24 SRR1039509.hisat.bam
-rw-rw-r-- 1 vip51 vip51 448 Jun 7 23:38 SRR1039509.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51 12G Jun 7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 2.1G Jun 7 23:28 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip51 vip51 449 Jun 7 23:39 SRR1039510.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51 13G Jun 7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 3.7G Jun 7 23:34 SRR1039514.hisat.bam
-rw-rw-r-- 1 vip51 vip51 449 Jun 7 23:39 SRR1039514.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51 21G Jun 7 23:08 SRR1039514.hisat.sam
$ ls *.bam|xargs -i samtools index {} #構(gòu)建索引,拿到IGV里面看
###
(rna) vip51 23:43:56 ~/4.mapping
$ mkdir stat
$ mv *flagstat stat/
$ cd stat/
$ ls
SRR1039508.hisat.flagstat SRR1039509.hisat.flagstat SRR1039510.hisat.flagstat SRR1039514.hisat.flagstat
(rna) vip51 00:24:55 ~/4.mapping/stat
$ wc -l *
13 SRR1039508.hisat.flagstat
13 SRR1039509.hisat.flagstat
13 SRR1039510.hisat.flagstat
13 SRR1039514.hisat.flagstat
52 total
(rna) vip51 00:25:41 ~/4.mapping/stat
$ cat *|awk '{print $1}' |paste - - - - - - - - - - - - -
50094593 5350127 0 0 49304507 44744466 22372233 22372233 43122616 43495718 458662 121216 89837
45068638 4229594 0 0 44198454 40839044 20419522 20419522 39169816 39531396 437464 121706 92949
49310701 5020389 0 0 48454263 44290312 22145156 22145156 42601074 42978162 455712 143714 111364
82290684 6889352 0 0 81370594 75401332 37700666 37700666 73172032 73858618 622624 395008 343293
$ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
$ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
in total (QC-passed reads
secondary
supplementary
duplicates
mapped (98.42% : N/A)
paired in sequencing
read1
read2
properly paired (96.38% : N/A)
with itself and mate mapped
singletons (1.03% : N/A)
with mate mapped to a different chr
with mate mapped to a different chr (mapQ>=5)
# 如下表

hisat比對結(jié)果
- multiqc ./hisat.flagstat*
## 上面的SRR1039508.hisat.flagstat SRR1039509.hisat.flagstat SRR1039510.hisat.flagstat SRR1039514.hisat.flagstat 也可以如下multiqc生成網(wǎng)頁版報告
$ multiqc ./
(rna) vip51 00:43:27 ~/4.mapping/stat
$ ls
multiqc_data multiqc_report.html
-
查看multiqc_report.html
multiqc_report.html
5.差異分析-featureCounts
5.1featureCounts 需要兩個輸入文件:
- 比對產(chǎn)生的BAM/ SAM文件(上面我們已經(jīng)的到)
$ ls *bam
SRR1039508.hisat.bam SRR1039509.hisat.bam SRR1039510.hisat.bam SRR1039514.hisat.bam
- 區(qū)間注釋文件,這個文件需要自己創(chuàng)建
$ cd ~/reference/gtf/
## 下載最新的
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz
-
去Gencode網(wǎng)站下載,這里我們下載最新的,下載GTF文件。
Release 30 - 解壓縮,得到Load annotation file gencode.v30.annotation.gtf
$ gunzip gencode.v30.annotation.gtf.gz
## 得到
$ ls
gencode.v30.annotation.gtf human.gene.positions nohup.out
6.2 featureounts
$ cd ~/6.featureCounts/
$ featureCounts -T 5 -p -t exon -g gene_id -a ~/reference/gtf/gencode.v30.annotation.gtf -o ~/all.id.txt ~/4.mapping/*.bam
## 得到如下:
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v1.6.4
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 4 BAM files ||
|| P SRR1039508.hisat.bam ||
|| P SRR1039509.hisat.bam ||
|| P SRR1039510.hisat.bam ||
|| P SRR1039514.hisat.bam ||
|| ||
|| Output file : all.id.txt ||
|| Summary : all.id.txt.summary ||
|| Annotation : gencode.v30.annotation.gtf (GTF) ||
|| Dir for temp files : /home/vip51 ||
|| ||
|| Threads : 5 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file gencode.v30.annotation.gtf ... ||
|| Features : 1279686 ||
|| Meta-features : 58870 ||
|| Chromosomes/contigs : 25 ||
|| ||
|| Process BAM file SRR1039508.hisat.bam... ||
|| Paired-end reads are included. ||
|| Assign alignments (paired-end) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Pairing up the read pairs. ||
|| ||
|| Total alignments : 25134856 ||
|| Successfully assigned alignments : 18588891 (74.0%) ||
|| Running time : 0.41 minutes ||
|| ||
|| Process BAM file SRR1039509.hisat.bam... ||
|| Paired-end reads are included. ||
|| Assign alignments (paired-end) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Pairing up the read pairs. ||
|| ||
|| Total alignments : 22612196 ||
|| Successfully assigned alignments : 17040035 (75.4%) ||
|| Running time : 0.36 minutes ||
|| ||
|| Process BAM file SRR1039510.hisat.bam... ||
|| Paired-end reads are included. ||
|| Assign alignments (paired-end) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Pairing up the read pairs. ||
|| ||
|| Total alignments : 24741721 ||
|| Successfully assigned alignments : 18349007 (74.2%) ||
|| Running time : 0.32 minutes ||
|| ||
|| Process BAM file SRR1039514.hisat.bam... ||
|| Paired-end reads are included. ||
|| Assign alignments (paired-end) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Pairing up the read pairs. ||
|| ||
|| Total alignments : 41252925 ||
|| Successfully assigned alignments : 31979144 (77.5%) ||
|| Running time : 0.54 minutes ||
|| ||
|| ||
|| Summary of counting results can be found in file "/home/vip51/all.id.txt. ||
|| summary" ||
|| ||
\\============================================================================//
- 得到下面兩個結(jié)果
# all.id.txt all.id.txt.summary
## 我們對all.id.txt.summary 進行multiqc一下,看看Counts情況
$ multiqc all.id.txt.summary
...
[INFO ] multiqc : Report : multiqc_report_1.html
[INFO ] multiqc : Data : multiqc_data_1
[INFO ] multiqc : MultiQC complete
-
查看上面得的到的multiqc_report_1.html
General Statistics
featureCounts




