Reference paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown #a must-read!
Manual: http://ccb.jhu.edu/software/stringtie/gff.shtml
Reference article: http://www.itdecent.cn/p/5b104830751b #done with cuffcompare, the similar Cufflinks utility
Reference article: http://www.itdecent.cn/p/1f5d13cc47f8 #gffcompare was not used, which left a large number of unknown transcripts
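I. Introduction
To compare transcript quantification across samples, the transcript information must first be stored in a common format; assemblers generally output GTF or GFF. Assembly produces a large number of new transcripts whose only annotation is their start position on the chromosome, and inspecting that by eye clearly cannot reveal their biological meaning. We therefore compare them with a file of known transcript annotations (annotation.gtf) to link each newly assembled transcript to the annotated ones, which makes it easier to identify novel transcripts. This is the job gffcompare does. It was developed from cuffcompare, a utility in the Cufflinks package, so much of its logic and its output file formats resemble those of cuffcompare.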


II. Usage and parameters
Usage: gffcompare [options] gtf.file(s)
Typical invocation: gffcompare -G -r annotation.gtf -o output.prefix input.gtf(s)
Commonly used parameters:
-r the annotated reference GTF file
-G compare all transcripts in the input GTF, even those that may be redundant
-o prefix for the output files
-i when there are many GTF files, pass a text file listing them with -i
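The typical invocation above can be assembled programmatically when many samples are involved. The following is a minimal sketch, not part of gffcompare itself: the helper name build_gffcompare_cmd and the file names are hypothetical, and only the flags described above (-G, -r, -o, -i) are used.

```python
from typing import List, Optional, Sequence


def build_gffcompare_cmd(
    annotation: str,
    out_prefix: str,
    inputs: Sequence[str] = (),
    list_file: Optional[str] = None,
) -> List[str]:
    """Assemble a gffcompare argument list (hypothetical helper)."""
    # Exactly one input source is expected: GTF files on the command
    # line, or a text file listing them (passed with -i).
    if bool(inputs) == bool(list_file):
        raise ValueError("give either input GTF files or a list file, but not both")
    cmd = ["gffcompare", "-G", "-r", annotation, "-o", out_prefix]
    if list_file:
        cmd += ["-i", list_file]
    else:
        cmd += list(inputs)
    return cmd


if __name__ == "__main__":
    # File names are placeholders taken from the usage line above.
    print(" ".join(build_gffcompare_cmd("annotation.gtf", "output.prefix", ["input.gtf"])))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where gffcompare is installed.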
All options
gffcompare v0.11.2
-----------------------------
Usage:
gffcompare [-r <reference_mrna.gtf> [-R]] [-T] [-V] [-s <seq_path>]
[-o <outprefix>] [-p <cprefix>]
{-i <input_gtf_list> | <input1.gtf> [<input2.gtf> .. <inputN.gtf>]}
GffCompare provides classification and reference annotation mapping and
matching statistics for RNA-Seq assemblies (transfrags) or other generic
GFF/GTF files.
GffCompare also clusters and tracks transcripts across multiple GFF/GTF
files (samples), writing matching transcripts (identical intron chains) into
<outprefix>.tracking, and a GTF file <outprefix>.combined.gtf which
contains a nonredundant set of transcripts across all input files (with
a single representative transfrag chosen for each clique of matching transfrags
across samples).
Options:
-v display gffcompare version (also --version)
-i provide a text file with a list of (query) GTF files to process instead
of expecting them as command line arguments (useful when a large number
of GTF files should be processed)
-r reference annotation file (GTF/GFF)
--strict-match : the match code '=' is only assigned when all exon boundaries
match; code '~' is assigned for intron chain match or single-exon
-R for -r option, consider only the reference transcripts that
overlap any of the input transfrags (Sn correction)
-Q for -r option, consider only the input transcripts that
overlap any of the reference transcripts (Precision correction);
(Warning: this will discard all "novel" loci!)
-M discard (ignore) single-exon transfrags and reference transcripts
-N discard (ignore) single-exon reference transcripts
-D discard "duplicate" query transfrags (i.e. those with the same
intron chain) within a single sample (disable "annotation" mode)
-S like -D, but stricter duplicate checking: only discard matching query
or reference transcripts (same intron chain) if their boundaries are fully
contained within other, larger or identical transfrags; if --strict-match
is also given, exact matching of all exon boundaries is required
--no-merge : disable close-exon merging (default: merge exons separated by
"introns" shorter than 5 bases
-s path to genome sequences (optional); this can be either a multi-FASTA
file or a directory containing single-fasta files (one for each contig);
repeats must be soft-masked (lower case) in order to be able to classify
transfrags as repeats
-T do not generate .tmap and .refmap files for each input file
-e max. distance (range) allowed from free ends of terminal exons of
reference transcripts when assessing exon accuracy (100)
-d max. distance (range) for grouping transcript start sites (100)
-V verbose processing mode (also shows GFF parser warnings)
--chr-stats: the .stats file will show summary and accuracy data
for each reference contig/chromosome separately
--debug : enables -V and generates additional files:
<outprefix>.Q_discarded.lst, <outprefix>.missed_introns.gff,
<outprefix>.R_missed.lst
Options for the combined GTF output file:
-p the name prefix to use for consensus transcripts in the
<outprefix>.combined.gtf file (default: 'TCONS')
-C discard matching and "contained" transfrags in the GTF output
(i.e. collapse intron-redundant transfrags across all query files)
-A like -C but does not discard intron-redundant transfrags if they start
with a different 5' exon (keep alternate TSS)
-X like -C but also discard contained transfrags if transfrag ends stick out
within the container's introns
-K for -C/-A/-X, do NOT discard any redundant transfrag matching a reference
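III. Output files
1. class codes
These are codes that describe the relationship between a transcript in the input and a transcript in the annotation; the code-to-relationship mapping is shown in the figure below
[Figure: gffcompare class codes]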
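For quick lookup in scripts, the class codes can be kept in a small table. The sketch below paraphrases the descriptions from the gffcompare manual as I understand them; the exact set of codes can differ between versions, so check it against the manual for the version you run. The names CLASS_CODES and describe are hypothetical.

```python
# Class codes with descriptions paraphrased from the gffcompare manual.
# Assumption: this set matches your gffcompare version; verify before use.
CLASS_CODES = {
    "=": "complete, exact match of the intron chain",
    "c": "contained in the reference (intron compatible)",
    "k": "contains the reference (reverse containment)",
    "m": "retained intron(s), all introns matched or retained",
    "n": "retained intron(s), not all introns matched or covered",
    "j": "multi-exon with at least one junction match",
    "e": "single-exon transfrag partially covering an intron (possible pre-mRNA fragment)",
    "o": "other same-strand overlap with reference exons",
    "s": "intron match on the opposite strand (likely a mapping error)",
    "x": "exonic overlap on the opposite strand",
    "i": "fully contained within a reference intron",
    "y": "contains a reference within its intron(s)",
    "p": "possible polymerase run-on (no actual overlap)",
    "r": "repeat",
    "u": "none of the above (unknown, intergenic)",
}


def describe(code: str) -> str:
    """Return the description for a class code, or a fallback string."""
    return CLASS_CODES.get(code, "unrecognized class code")


if __name__ == "__main__":
    for code in "ijoux":
        print(code, "->", describe(code))
```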

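2. There are six output files. The location of the first four can be chosen, while the last two are saved next to the input GTF files; all of them start with the prefix given by -o
gffcmp.annotated.gtf: contains the class code information; this file is usually passed on to the subsequent stringtie step
gffcmp.stats: contains summary statistics for the features, including the numbers of novel exons and introns found. It reports two statistics, sensitivity and precision, defined as follows: "Sensitivity is defined as the proportion of genes from the annotation that are correctly reconstructed, whereas precision (also known as positive predictive value) captures the proportion of the output that overlaps the annotation"
gffcompare.loci: see the manual
gffcompare.tracking: see the manual
gffcompare_result.refmap: this file has four columns. Column 1, ref_gene_id, is the gene symbol, or the Ensembl gene ID when there is no symbol; column 2, ref_id, is the Ensembl transcript ID; column 3, class_code, is "=" or "c"; column 4 is cuff_id_list. This file lists the assembled transcripts that match the reference transcripts almost exactly
gffcompare_result.tmap: contains quantification information for the transcripts, such as cov and FPKM; it can be used for quantification or for screening novel transcripts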
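The sensitivity and precision reported in the .stats file can be expressed as simple ratios. The sketch below assumes the usual counting: TP is the number of query features that match a reference feature, FN the number of reference features with no match, and FP the number of query features with no match. The function names and example counts are illustrative only, not values from a real run.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of reference features that were correctly reconstructed."""
    total = tp + fn
    return tp / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of query features that agree with the reference."""
    total = tp + fp
    return tp / total if total else 0.0


if __name__ == "__main__":
    # Illustrative counts only.
    tp, fn, fp = 80, 20, 40
    print(f"sensitivity = {sensitivity(tp, fn):.3f}")
    print(f"precision   = {precision(tp, fp):.3f}")
```

A large set of novel transcripts lowers precision by this definition even when the assembly is good, which is why the -R and -Q options in the help text above exist as corrections.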
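IV. How to find novel transcripts
1. Upstream: hisat2 + stringtie + stringtie --merge
2. Midstream: gffcompare
3. Downstream: stringtie + the gffcompare result
4. Further downstream: ballgown for quantification and differential analysis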
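The first three stages can be written out as command lines. This is a rough sketch for one paired-end sample, loosely following the HISAT/StringTie/Ballgown protocol cited at the top. The function name, the file naming scheme, the samtools sorting step, and the thread count are all assumptions, and the Ballgown stage is omitted because it runs in R.

```python
from typing import List


def pipeline_commands(sample: str, index: str, annotation: str, threads: int = 8) -> List[List[str]]:
    """Return the commands for one sample in execution order (hypothetical helper)."""
    t = str(threads)
    sam, bam = f"{sample}.sam", f"{sample}.bam"
    return [
        # Upstream: align (--dta tailors alignments for assemblers), sort, assemble.
        ["hisat2", "-p", t, "--dta", "-x", index,
         "-1", f"{sample}_1.fastq.gz", "-2", f"{sample}_2.fastq.gz", "-S", sam],
        ["samtools", "sort", "-@", t, "-o", bam, sam],
        ["stringtie", "-p", t, "-G", annotation, "-o", f"{sample}.gtf", bam],
        # Merge the per-sample assemblies; mergelist.txt lists their GTF files.
        # With several samples, this step and the next run once, not per sample.
        ["stringtie", "--merge", "-p", t, "-G", annotation,
         "-o", "stringtie_merged.gtf", "mergelist.txt"],
        # Midstream: compare the merged assembly with the annotation.
        ["gffcompare", "-G", "-r", annotation, "-o", "gffcmp", "stringtie_merged.gtf"],
        # Downstream: re-estimate abundances against the gffcompare result;
        # -e restricts to the given transcripts, -B writes Ballgown input tables.
        ["stringtie", "-e", "-B", "-p", t, "-G", "gffcmp.annotated.gtf",
         "-o", f"ballgown/{sample}/{sample}.gtf", bam],
    ]


if __name__ == "__main__":
    for cmd in pipeline_commands("sample1", "genome_index", "annotation.gtf"):
        print(" ".join(cmd))
```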
Characteristics of a novel transcript (taken from other people's articles):
1. The class code meets the criteria, e.g. it is one of "i, j, o, u, x"
2. The statistics pass the thresholds, e.g. FPKM >= 0.5, coverage > 1, length > 200
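The two criteria above can be applied to a .tmap file with a short filter. This is a minimal sketch that assumes the file is tab-delimited with a header row containing the columns class_code, qry_id, FPKM, cov and len; check the header of your own file, since column names may differ between versions. The function name and the sample rows are made up for illustration.

```python
import csv
import io
from typing import List, TextIO

# Class codes treated as candidate novel transcripts (from the list above).
NOVEL_CODES = {"i", "j", "o", "u", "x"}


def select_novel(handle: TextIO, min_fpkm: float = 0.5,
                 min_cov: float = 1.0, min_len: int = 200) -> List[str]:
    """Return query transcript IDs that pass the class-code and threshold filters."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["class_code"] not in NOVEL_CODES:
            continue
        # FPKM >= 0.5, coverage > 1, length > 200
        if float(row["FPKM"]) < min_fpkm:
            continue
        if float(row["cov"]) <= min_cov:
            continue
        if int(row["len"]) <= min_len:
            continue
        kept.append(row["qry_id"])
    return kept


if __name__ == "__main__":
    # Made-up rows; only the columns used by the filter are included.
    sample = (
        "class_code\tqry_id\tFPKM\tcov\tlen\n"
        "=\tMSTRG.1.1\t12.0\t30.5\t1500\n"
        "u\tMSTRG.2.1\t3.2\t8.1\t950\n"
        "j\tMSTRG.3.1\t0.1\t0.4\t700\n"
    )
    print(select_novel(io.StringIO(sample)))
```

With a real file, replace the in-memory sample with open("gffcompare_result.tmap") or whatever name your run produced.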