簡(jiǎn)介
Hisat能夠?qū)NA-seq reads 比對(duì)到參考基因組上,并且發(fā)現(xiàn)轉(zhuǎn)錄本的剪接位點(diǎn),具有運(yùn)行速度快,占用計(jì)算機(jī)內(nèi)存資源少的特點(diǎn)。StringTie將這些比對(duì)結(jié)果組合成完整的轉(zhuǎn)錄本,并估計(jì)所有基因和轉(zhuǎn)錄本的表達(dá)水平(Pertea et al. 2015)。Ballgown從StringTie結(jié)果中提取轉(zhuǎn)錄本和表達(dá)水平,并應(yīng)用嚴(yán)格的統(tǒng)計(jì)方法來(lái)確定它們的差異表達(dá)??偟膩?lái)說(shuō),上述三個(gè)軟件聯(lián)合的RNA-seq 分析流程是比較 完善的,如下圖(Pertea et al. 2016)。

測(cè)試數(shù)據(jù)
包含12個(gè)樣本的RNA-seq序列共2.0G,樣本分別來(lái)自GBR(英國(guó)人來(lái)自英格蘭)和YRI(約魯巴人來(lái)自尼日利亞伊巴丹)。(Pertea et al. 2016)
ftp下載鏈接: ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
fastp質(zhì)控
一般fastp 可以在質(zhì)控后直接生成質(zhì)控前,質(zhì)控后數(shù)據(jù)的變化情況,此處依然沿用fastqc + multiqc + fastp的質(zhì)控路線。
參數(shù):
RawFastQC:Raw reads 的 FastQC 輸出路徑
RequiredCPU:線程數(shù)(int)
# fastqc
fastqc -o RawFastQC -t RequiredCPU sampleID_1.fq.gz sampleID_2.fq.gz
# multiqc - summary report
multiqc RawFastQC
# fastp - Quality control
fastp -i sampleID_1.fq.gz -I sampleID_2.fq.gz \
-o CleanReads/sampleID_1.fq.gz \
-O CleanReads/sampleID_2.fq.gz \
-w RequiredCPU \
-j FastpDir/sampleID.json -h FastpDir/sampleID.html
由于每個(gè)樣本都生成了一個(gè)fastp.json 文件,我們可以批量解析質(zhì)控結(jié)果,從而在整體水平上評(píng)估質(zhì)控效果。
#!/usr/bin/python
"""
@ Usage: python fastp_parse.py input_file input_file
@ input_file:the file of the fastp.jsons
@ Wu 2020.12.30
"""
import sys, os
import json
import pandas as pd
input_file = sys.argv[1]
input_file = sys.argv[2]
fastp_list = os.listdir(input_file)
result_dict = {}
merge_result_dict = {}
for i in fastp_list:
if i.endswith('.json'):
with open(input_file + i, 'r') as f:
result_dict[i] = json.load(f)
key = i.replace(".json", "")
key1 = key + '+before_filtering'
key2 = key + '+after_filtering'
merge_result_dict[key1] = {'total_reads' : result_dict[i]["summary"]['before_filtering']['total_reads'],
"total_bases": result_dict[i]["summary"]['before_filtering']['total_bases'],
'q20_rate' : result_dict[i]["summary"]['before_filtering']['q20_rate'],
'q30_rate' : result_dict[i]["summary"]['before_filtering']['q30_rate']
}
merge_result_dict[key2] = {'total_reads' : result_dict[i]["summary"]['after_filtering']['total_reads'],
"total_bases": result_dict[i]["summary"]['after_filtering']['total_bases'],
'q20_rate' : result_dict[i]["summary"]['after_filtering']['q20_rate'],
'q30_rate' : result_dict[i]["summary"]['after_filtering']['q30_rate']
}
df = pd.DataFrame(merge_result_dict).T
df.to_csv(output_file, sep="\t", index= True,header=True)
Hisat Mapping
對(duì)于一些模式生物或者常見(jiàn)生物,其Hisat index可能已經(jīng)構(gòu)建好了,可以直接去下載即可。
Url page: https://ccb.jhu.edu/software/hisat2/index.shtml, ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/
對(duì)于上述鏈接中不存在Hisat index的情況,則需要我們自己進(jìn)行構(gòu)建;值得注意的是,不同于bwa index,Hisat index 可以為參考基因組提供注釋信息以便獲得更高的效率和更好的跨外顯子邊界和剪接位點(diǎn) mapping 。
Data:1. reference genome(reference.fa);2. clean RNAseq reads 3. ref_annotation.gff
# extract splice-site and exon information
hisat2/bin/extract_splice_sites.py ref_annotation.gff >ref.ss
hisat2/bin/extract_exons.py ref_annotation.gff >ref.exon
# building the index
hisat2-build -p 8 --ss ref.ss --exon ref.exon reference.fa index\rna_ref_index(out_file_name)
# mapping the NGS reads
hisat2 -p 8 --dta -x index/rna_ref_index -1 CleanReads/sampleID_1.fq.gz \
-2 CleanReads/sampleID_2.fq.gz -S sampleID.sam
# sortting BAM
samtools view -S -b sampleID.sam | samtools sort -@ 10 -o sampleID_sorted.bam
Mapping quality
對(duì)于mapping quality,可以通過(guò)IGV和samtools tview 進(jìn)行可視化查看,亦可以通過(guò)samtools flagstat 獲得,并提取所有樣本比對(duì)信息作圖比較。
# samtools flagstat
samtools flagstat sampleID_sorted.bam > flagstat/sampleID_sorted.bam
Transcriptome Assembly
在這里有兩種模式可以選擇,一種是 ‘reference only’ 模式,該模式運(yùn)行速度更快、更簡(jiǎn)單,而且它會(huì)避免數(shù)據(jù)受到其他庫(kù)的污染,但可能無(wú)法發(fā)現(xiàn)新的轉(zhuǎn)錄本;另一種是預(yù)測(cè)模式,在該模式下轉(zhuǎn)錄本將跨過(guò)不同的樣本數(shù)據(jù)去進(jìn)行比較,可能會(huì)找到新的轉(zhuǎn)錄本,但在不了解自己樣本數(shù)據(jù)的前提下最好不用;
參考:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
# reference only
stringtie --rf -p 8 -G ref_annotation.gff -e -B \
-o ballgown/sampleID/sampleID.gtf \
-A ballgown/sampleID/sampleID_gene_abundances.tsv \
sampleID_sorted.bam
表達(dá)矩陣:t_data.ctab,gene_abundances.tsv
ballgown
有表達(dá)矩陣數(shù)據(jù)后,除了可以分別讀入、合并和分析外,亦可以用ballgow 包進(jìn)行全局性的數(shù)據(jù)管理和分析。
library(ballgown)
library(tidyverse)
# load the sample information
pheno_data <- read.table("RNAseq/phenodata.txt", header = TRUE, sep ='\t')
# create a ballgown object
ballgown_obj <- ballgown(dataDir = "RNAseq/ballgown",
# regular expression identifying the sample files
# "" means all FILES
samplePattern = "",
pData = pheno_data)
# getting the gene, transcript, exon, and intron expression levels
# by using gexpr(), texpr(), eexpr(), and iexpr()
gene_data <- gexpr(ballgown_obj)
# Load all attributes including gene name
trans_data <- texpr(ballgown_obj, 'all')
Reference
Pertea, Mihaela, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell, and Steven L Salzberg. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome from Rna-Seq Reads.” Nature Biotechnology 33 (3): 290–95.
Pertea, Mihaela, Daehwan Kim, Geo M Pertea, Jeffrey T Leek, and Steven L Salzberg. 2016. “Transcript-Level Expression Analysis of Rna-Seq Experiments with Hisat, Stringtie and Ballgown.” Nature Protocols 11 (9): 1650.
Stringtie:https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
ballgown:https://www.bioconductor.org/packages/release/bioc/html/ballgown.html