RNA- seq 的數(shù)據(jù)處理主要分為以下幾個過程:
1. 測序數(shù)據(jù)質(zhì)控,以及參考基因組和注釋文件下載;
2. 序列比對,將測序得到的短 reads 序列往參考基因組上 mapping;
3. 差異表達基因鑒定。

更詳細的內(nèi)容見軟件官網(wǎng)?
HISAT2: graph-based alignment of next generation sequencing reads to a population of genomes
一、建索引
下載參考基因組和相應(yīng)的注釋文件:
wget ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
mv Homo_sapiens.GRCh38.dna.primary_assembly.fa genome.fa
wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz $ gzip -d Homo_sapiens.GRCh38.84.gtf.gz$ mv Homo_sapiens.GRCh38.84.gtf genome.gtf
make exon, splicesite file:
hisat2_extract_splice_sites.py genome.gtf > genome.ss
hisat2_extract_exons.py genome.gtf > genome.exon
Building indexes
hisat2-build? ?-p? 16 --exon? genome.exon --ss? genome.ss? genome.fa? genome_tran
# -p? 線程數(shù);
二、序列比對
hisat2? ?[options]*? ?-x? ? <hisat2-idx>? ? ?{-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>}? ? ?[-S <hit>]?? --dta
-x? ?? ? # The basename of the index for the reference genome.
-1? ?? ? ? ? ? ? ? ?#?Comma-separated list of files containing mate 1s (filename usually includes?_1), e.g.?-1 flyA_1.fq,flyB_1.fq.
-2? ? ? ? ? ? ? ? ?#?Comma-separated list of files containing mate 2s (filename usually includes?_2), e.g.?-2 flyA_2.fq,flyB_2.fq.
-S? ? ?? ? ? ? ? ? ?#?File to write SAM alignments to.
--dta? ? ? ? ? ?#??Report alignments tailored for transcript assemblers including StringTie. With this option, HISAT2 requires longer anchor lengths for de novo discovery of splice sites. This leads to fewer alignments with short-anchors, which helps transcript assemblers improve significantly in computation and memory usage.
上述 4 個參數(shù)是基本參數(shù),如果輸入文件是 FASTQ 文件,并按第一步建了索引文件,即可直接進行比對。
三、結(jié)果文件解讀
hisat2? ? -p? ?8? ? ? -x? ? $RNA_REF_INDEX? ? -1? ? ?BB.read1.fastq.gz? ? ?-2? ? ? BB.read2.fastq.gz? ? -S? ? ?./HBR_Rep3.sam
生成的 sam 文件格式如下:

第1列:reads名稱;
第2列:Flag標簽;Flag標簽是二進制數(shù)字之和,不同數(shù)字代表了不同的意義。Sum of all applicable flags.
1: The read is one of a pair
2: The alignment is one end of a proper paired-end alignment
4: The read has no reported alignments
8: The read is one of a pair and has no reported alignments
16: The alignment is to the reverse reference strand
32: The other mate in the paired-end alignment is aligned to the reverse reference strand
64: The read is mate 1 in a pair
128: The read is mate 2 in a pair
第3列:比對到的染色體信息;
第4列:比對到參考基因組物理位置;
第5列:比對質(zhì)量值(0-60);
第6列:CIAGR(記錄插入、缺失等);CIAGR中包含的是比對結(jié)果信息,表明了一條reads所有堿基的比對情況。比如CIGAR = 150M表示150bp的reads都比對到參考基因組上;
第7列:配對reads比對到的染色體,=表示相同;
第8列:配對reads比對到的染色體物理位置;
第9列:文庫插入序列大?。?/p>
第10列:Reads序列;
第11列:質(zhì)量值。
第12列:Optional fields.
四、格式轉(zhuǎn)換并排序
使用 samtools 對文件進行排序
# 轉(zhuǎn)換為 bam 文件格式
samtools view -S? test.sam -b > test.bam
# 排序并建索引
samtools sort test.bam -o test_sort.bam
samtools index test_sort.bam
比對過程還算是比較簡單,接下來就是檢測差異表達基因了。
后續(xù)處理可能需要結(jié)合其它軟件使用,我自己有用到的話會持續(xù)更新,歡迎翻閱寫的其它筆記!