生信技能樹線上課程:RNA-seq

寫在前面

  • 很榮幸參加生信技能樹全國巡講-第一站-重慶站的RNA-seq線下培訓(xùn)課程。
  • 以下為線下課結(jié)合線上課所做的筆記,希望能幫到你。

一些小練習(xí)題

R語言的練習(xí)題
初級10 個題目,盡量根據(jù)參考代碼理解及完成:http://www.bio-info-trainee.com/3793.html  
中級要求是:http://www.bio-info-trainee.com/3750.html
高級要求是完成20題: http://www.bio-info-trainee.com/3415.html

LINUX的練習(xí)題:
最低要求是完成我的 linux 20題 http://www.bio-info-trainee.com/2900.html   
其次完成生物信息學(xué)數(shù)據(jù)格式的習(xí)題(blast/blat/fa-fq/sam-bam/vcf/bed/gtf-gff),收集這些格式的說明書。
fasta和fastq格式文件的shell小練習(xí) http://www.bio-info-trainee.com/3575.html 
sam和bam格式文件的shell小練習(xí) http://www.bio-info-trainee.com/3578.html
VCF格式文件的shell小練習(xí) http://www.bio-info-trainee.com/3577.html 

RNA-seq只有小考核 http://www.bio-info-trainee.com/3920.html


下面開始RNA-seq流程

1.軟件下載

1.1下載Miniconda3

# 下載Miniconda3
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
## 選擇miniconda2/3都可以,可以通過創(chuàng)建另一個版本Python環(huán)境。

# 安裝Miniconda3
$ bash Miniconda3-latest-Linux-x86_64.sh 
$ source ~/.bashrc
(base) vip39 13:30:42 ~ #表示激活miniconda

# 配置鏡像,清華鏡像掛掉后懸著這個吧
conda config --add channels bioconda #生信軟件
conda config --add channels conda-forge
# 查看一下配置的鏡像
$ vim ~/.condarc
channels:
  - conda-forge
  - bioconda
  - defaults
show_channel_urls: true

# 創(chuàng)建命名為rna的python2的環(huán)境
$ conda create -n rna python=2 
Collecting package metadata: /

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

# 激活/進入conda的rna環(huán)境,避免每次用-n rna
$ source activate rna
(rna) vip39 13:47:00 ~

1.2安裝軟件

jimmy老師給出的安裝列表
  • 數(shù)據(jù)資源下載,參考基因組及參考轉(zhuǎn)錄組
    gtf ,genome,fa
  • 質(zhì)控,需要fastqc及multiqc
    trimmomatic,cutadapt,trim-galore
  • 對比
    star,HISAT2,TOPHAT2, bowtie,bwa,subread
  • 計數(shù)
    featureCounts,htseq-counts,bedtools
  • nomalization 歸一化,差異分析等
    DEseq2,edgeR,limma
$ conda install -y fastqc trim-galore star bwa hisat2 bowtie2 subread tophat htseq bedtools deeptools salmon cutadapt multiqc sra-tools
$ conda install -c bioconda trimmomatic 

2.下載數(shù)據(jù)

#得到_List.txt后,我們直接用循環(huán)下載數(shù)據(jù)
$ cat SRR_Acc_List.txt | while read id; do (prefetch ${id});done
##上面程序掛為后臺運行的方式為:
cat SRR_Acc_List.txt | while read id; do (prefetch $id &);done
## 殺死上面這個循壞
ps -ef |grep prefecth awk '{print $2}'|while read id ;do kill $ id ;done

# 下載得到數(shù)據(jù),這里我們下載得到4個數(shù)據(jù)
$ tree
.
├── SRR1039508.sra
├── SRR1039509.sra
├── SRR1039510.sra
└── SRR1039514.sra

# 3. SRA數(shù)據(jù)轉(zhuǎn)成fastq.gz格式
## 使用循環(huán)來實現(xiàn)
$ mkdir 1.fastq_qc
$ cd 1.fastq_qc
$ ls ~/ncbi/public/sra/*|while read id ;do (nohup fastq-dump --gzip --split-3 -O ./ $id &);done
## 你也可以手動實現(xiàn)轉(zhuǎn)換
fastq-dump --gzip --split-3 -O ~/ ./SRR1039508.sra

3.質(zhì)控

3.1了解數(shù)據(jù)大小、質(zhì)量

$ ls *gz |xargs fastqc -t 10 
## multiqc整合結(jié)果
$ multiqc ./
...
[INFO   ]          fastqc : Found 8 reports #找到8個,對應(yīng)上我上面的4個
[INFO   ]         multiqc : Report      : multiqc_report.html #將這個結(jié)果報告下載下來查看。
[INFO   ]         multiqc : MultiQC complete
  • 查看multiqc_report.html


    multiqc_report.html

3.2過濾數(shù)據(jù)

  • 去除接頭
  • 去除兩端低質(zhì)量堿基(-q 25)
  • 最大允許錯誤率(默認(rèn)-e 0.1)
  • 去除<36的reads(--length 36)
  • 切除index的overlap>3的堿基
  • reads去除以對為單位(--paired)
##過濾數(shù)據(jù)
###step3:filter the bad quality reads and remove adaptors。
$ cd 1.fastq_qc/
$ ls ~/1.fastq_qc/*_1.fastq.gz
/home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz  /home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz  /home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz  /home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz
$ ls ~/1.fastq_qc/*_1.fastq.gz >1
$ ls ~/1.fastq_qc/*_2.fastq.gz >2
$ paste 1 2
$ paste 1 2 >config
$ cat config 
/home/vip51/1.fastq_qc/SRR1039508_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039508_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039509_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039509_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039510_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039510_2.fastq.gz
/home/vip51/1.fastq_qc/SRR1039514_1.fastq.gz    /home/vip51/1.fastq_qc/SRR1039514_2.fastq.gz
## 建立好config后,接下來就可以進行批量質(zhì)控了
## 建立批量處理腳本
$ cat >qc.sh
source activate rna
bin_trim_galore=trim_galore
mkdir filter-data
dir='/home/vip51/filter-data'
cat config |while read id
do
    arr=($id)
    fq1=${arr[0]}
    fq2=${arr[1]}
nohup $bin_trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $dir $fq1 $fq2 &
done
source deactivate
## 拿到config和qc.sh后
$ bash qc.sh config

4比對-HISAT2 mapping

4.1參考基因組hg38及構(gòu)建索引

#hisat軟件建立索引文件
cd ~/reference
mkdir -p index/hisat && cd index/hisat
nohup wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz  &
nohup tar zxvf grcm38.tar.gz &
## 解壓后
(rna) vip51 00:19:08 ~/reference/index/hisat/hg38
$ ls 
genome.1.ht2  genome.2.ht2  genome.3.ht2  genome.4.ht2  genome.5.ht2  genome.6.ht2  genome.7.ht2  genome.8.ht2  make_hg38.sh
# 這里表示構(gòu)建索引成功,簡單的說,就是去官網(wǎng)下載index,然后解壓即可成功
  • 其實構(gòu)建索引的時候,我花了很多時間,就在于~/reference/index/hisat/hg38/genome這個的理解不到位,所以跑去Google-構(gòu)建索引失敗才發(fā)現(xiàn)問題.
hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log

4.2 比對

  • bam查看方法
samtools view -h smaple.bam
samtools view smaple.bam
## 將文件名為*gq.gz改為*fq,head-1000取的前1000行,這一步非必要的
$ ls ~/filter-data/*gz|while read id ;do (zcat $id|head -1000> $(basename $id ".gz"));done
$ (rna) vip51 20:30:12 ~/filter-data
$ ls -lh
total 12G
-rw-rw-r-- 1 vip51 vip51 2.9K Jun  2 15:29 SRR1039508_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 vip51 vip51  64K Jun  6 16:53 SRR1039508_1_val_1.fq #取的前1000行,所以數(shù)據(jù)很小。
-rw-rw-r-- 1 vip51 vip51 1.3G Jun  2 15:44 SRR1039508_1_val_1.fq.gz
## hisat2 比對
id=SRR1039508
hisat2 -p 2 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 2>${id}.sam.log

## hisat批量比對
$ ls *gz|cut -d "_" -f 1|sort -u
SRR1039508
SRR1039509
SRR1039510
SRR1039514
## 建立一個hisat.sh

$ cat > hisat.sh
cd filter-data/
ls *gz|cut -d "_" -f 1|sort -u |while read id ;do
cd ~/4.mapping/
ls -lh ${id}_1_val_1.fq.gz  ${id}_2_val_2.fq.gz
hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq.gz -2 ~/filter-data/${id}_2_val_2.fq.gz -S ${id}.hisat.sam 
done
###批量對比
$ source hisat.sh
(rna) vip51 23:15:04 ~/4.mapping
$ ls -lh
total 56G
-rw-rw-r-- 1 vip51 vip51 13G Jun  7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 12G Jun  7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 13G Jun  7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 21G Jun  7 23:08 SRR1039514.hisat.sam
##上面看出.sam文件還是很大的
## 如果跑的是這個
# hisat2 -p 10 -x ~/reference/index/hisat/hg38/genome -1 ~/filter-data/${id}_1_val_1.fq -2 ~/filter-data/${id}_2_val_2.fq -S ${id}.hisat.sam 
## 上面對對應(yīng)的就是head -1000,數(shù)據(jù)會很小

###.sam轉(zhuǎn)化為.bam
$ ls *.sam |while read id;do (samtools sort -O bam -@ 5 -o $(basename ${id} ".sam").bam ${id});done
(rna) vip51 23:34:31 ~/4.mapping
$ ls -lh
total 66G
-rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:21 SRR1039508.hisat.bam
-rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 1.9G Jun  7 23:24 SRR1039509.hisat.bam
-rw-rw-r-- 1 vip51 vip51  12G Jun  7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:28 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 3.7G Jun  7 23:34 SRR1039514.hisat.bam
-rw-rw-r-- 1 vip51 vip51  21G Jun  7 23:08 SRR1039514.hisat.sam
# 可以看出.bam文件要小得多

4.3 比對策略統(tǒng)計比對效率

ls *.bam |while read id ;do ( samtools flagstat -@ 1 $id > $(basename ${id} ".bam").flagstat );done

(rna) vip51 23:42:28 ~/4.mapping
$ ls -lh
total 66G
-rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:21 SRR1039508.hisat.bam
-rw-rw-r-- 1 vip51 vip51  448 Jun  7 23:38 SRR1039508.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:44 SRR1039508.hisat.sam
-rw-rw-r-- 1 vip51 vip51 1.9G Jun  7 23:24 SRR1039509.hisat.bam
-rw-rw-r-- 1 vip51 vip51  448 Jun  7 23:38 SRR1039509.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51  12G Jun  7 22:51 SRR1039509.hisat.sam
-rw-rw-r-- 1 vip51 vip51 2.1G Jun  7 23:28 SRR1039510.hisat.bam
-rw-rw-r-- 1 vip51 vip51  449 Jun  7 23:39 SRR1039510.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51  13G Jun  7 22:58 SRR1039510.hisat.sam
-rw-rw-r-- 1 vip51 vip51 3.7G Jun  7 23:34 SRR1039514.hisat.bam
-rw-rw-r-- 1 vip51 vip51  449 Jun  7 23:39 SRR1039514.hisat.flagstat
-rw-rw-r-- 1 vip51 vip51  21G Jun  7 23:08 SRR1039514.hisat.sam

$ ls *.bam|xargs -i samtools index {} #構(gòu)建索引,拿到IGV里面看
### 
(rna) vip51 23:43:56 ~/4.mapping
$ mkdir stat
$ mv *flagstat stat/
$ cd stat/
$ ls
SRR1039508.hisat.flagstat  SRR1039509.hisat.flagstat  SRR1039510.hisat.flagstat  SRR1039514.hisat.flagstat

(rna) vip51 00:24:55 ~/4.mapping/stat
$ wc -l *
  13 SRR1039508.hisat.flagstat
  13 SRR1039509.hisat.flagstat
  13 SRR1039510.hisat.flagstat
  13 SRR1039514.hisat.flagstat
  52 total
(rna) vip51 00:25:41 ~/4.mapping/stat
$ cat *|awk '{print $1}' |paste - - - - - - - - - - - - - 
50094593    5350127 0   0   49304507    44744466    22372233    22372233    43122616    43495718    458662  121216  89837
45068638    4229594 0   0   44198454    40839044    20419522    20419522    39169816    39531396    437464  121706  92949
49310701    5020389 0   0   48454263    44290312    22145156    22145156    42601074    42978162    455712  143714  111364
82290684    6889352 0   0   81370594    75401332    37700666    37700666    73172032    73858618    622624  395008  343293
$ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
$ cut -d"+" -f 2 SRR1039508.hisat.flagstat |cut -d" " -f 3-90
in total (QC-passed reads 
secondary
supplementary
duplicates
mapped (98.42% : N/A)
paired in sequencing
read1
read2
properly paired (96.38% : N/A)
with itself and mate mapped
singletons (1.03% : N/A)
with mate mapped to a different chr
with mate mapped to a different chr (mapQ>=5)
# 如下表
hisat比對結(jié)果
  • multiqc ./hisat.flagstat*
## 上面的SRR1039508.hisat.flagstat  SRR1039509.hisat.flagstat  SRR1039510.hisat.flagstat  SRR1039514.hisat.flagstat 也可以如下multiqc生成網(wǎng)頁版報告
$ multiqc ./

(rna) vip51 00:43:27 ~/4.mapping/stat
$ ls 
multiqc_data multiqc_report.html
  • 查看multiqc_report.html


    multiqc_report.html

5.差異分析-featureCounts

5.1featureCounts 需要兩個輸入文件:

  • 比對產(chǎn)生的BAM/ SAM文件(上面我們已經(jīng)的到)
$ ls *bam
SRR1039508.hisat.bam  SRR1039509.hisat.bam  SRR1039510.hisat.bam  SRR1039514.hisat.bam
  • 區(qū)間注釋文件,這個文件需要自己創(chuàng)建
$ cd ~/reference/gtf/
## 下載最新的
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz
  • 去Gencode網(wǎng)站下載,這里我們下載最新的,下載GTF文件。
    Release 30
  • 解壓縮,得到Load annotation file gencode.v30.annotation.gtf
$ gunzip gencode.v30.annotation.gtf.gz
## 得到
$ ls
gencode.v30.annotation.gtf  human.gene.positions  nohup.out

6.2 featureounts

$ cd ~/6.featureCounts/
$ featureCounts -T 5 -p -t exon -g gene_id -a ~/reference/gtf/gencode.v30.annotation.gtf -o ~/all.id.txt ~/4.mapping/*.bam
## 得到如下:

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 4 BAM files                                      ||
||                           P SRR1039508.hisat.bam                           ||
||                           P SRR1039509.hisat.bam                           ||
||                           P SRR1039510.hisat.bam                           ||
||                           P SRR1039514.hisat.bam                           ||
||                                                                            ||
||             Output file : all.id.txt                                       ||
||                 Summary : all.id.txt.summary                               ||
||              Annotation : gencode.v30.annotation.gtf (GTF)                 ||
||      Dir for temp files : /home/vip51                                      ||
||                                                                            ||
||                 Threads : 5                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file gencode.v30.annotation.gtf ...                        ||
||    Features : 1279686                                                      ||
||    Meta-features : 58870                                                   ||
||    Chromosomes/contigs : 25                                                ||
||                                                                            ||
|| Process BAM file SRR1039508.hisat.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 25134856                                             ||
||    Successfully assigned alignments : 18588891 (74.0%)                     ||
||    Running time : 0.41 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR1039509.hisat.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 22612196                                             ||
||    Successfully assigned alignments : 17040035 (75.4%)                     ||
||    Running time : 0.36 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR1039510.hisat.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 24741721                                             ||
||    Successfully assigned alignments : 18349007 (74.2%)                     ||
||    Running time : 0.32 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR1039514.hisat.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 41252925                                             ||
||    Successfully assigned alignments : 31979144 (77.5%)                     ||
||    Running time : 0.54 minutes                                             ||
||                                                                            ||
||                                                                            ||
|| Summary of counting results can be found in file "/home/vip51/all.id.txt.  ||
|| summary"                                                                   ||
||                                                                            ||
\\============================================================================//

  • 得到下面兩個結(jié)果
# all.id.txt         all.id.txt.summary
## 我們對all.id.txt.summary 進行multiqc一下,看看Counts情況
$ multiqc all.id.txt.summary 
...
[INFO   ]         multiqc : Report      : multiqc_report_1.html
[INFO   ]         multiqc : Data        : multiqc_data_1
[INFO   ]         multiqc : MultiQC complete
  • 查看上面得的到的multiqc_report_1.html


    General Statistics

    featureCounts


拿到all.id.txt,就可以到R中進行下游分析了

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

友情鏈接更多精彩內(nèi)容