How to use gffcompare

Reference: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown # a must-read!
Manual: http://ccb.jhu.edu/software/stringtie/gff.shtml
Related post: http://www.itdecent.cn/p/5b104830751b # done with cuffcompare, the analogous Cufflinks utility
Related post: http://www.itdecent.cn/p/1f5d13cc47f8 # gffcompare was not used, which left a large number of unknown transcripts

I. Introduction

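Comparing transcript quantification across samples requires the transcript information to be stored in a common format first; assemblers generally output GTF or GFF. Assembly produces a large number of new transcripts, and inspecting their only annotation (the start and end positions on the chromosome) by eye clearly cannot reveal their biological meaning. We therefore compare them against a known transcript annotation file (annotation.gtf) to link the newly assembled transcripts to the annotated ones, which makes it easier to identify genuinely novel transcripts. This is the job gffcompare does. Because it was developed from cuffcompare, a utility in the Cufflinks package, much of its logic and its output file formats resemble those of cuffcompare.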


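Without gffcompare, the only information available is the position on the chromosome.

With gffcompare, you also get each transcript's relationship to the reference transcripts.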

II. Usage and options

Usage: gffcompare [options] gtf.file(s)
Typical command: gffcompare -G -r annotation.gtf -o output.prefix input.gtf(s)

Common options:

-r reference annotation file (GTF)
-G compare all transcripts in the input GTF, even those that might be redundant
-o prefix for the output files
-i when there are many GTF files, pass a text file that lists them
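
As a minimal sketch, the common options above can be assembled into a command line from Python. The helper name `build_gffcompare_cmd` and the file names are hypothetical; only the flags `-G`, `-r`, `-o` and `-i` come from the text above. The command is only built and printed here; running it (for example with `subprocess.run(cmd, check=True)`) assumes gffcompare is installed and on the PATH.

```python
import shlex


def build_gffcompare_cmd(reference_gtf, out_prefix, query_gtfs=(),
                         gtf_list=None, compare_all=True):
    """Assemble a gffcompare argument list from the common options.

    Hypothetical helper: pass either query_gtfs (GTF paths given directly)
    or gtf_list (a text file listing GTF paths, passed via -i), not both.
    """
    if bool(query_gtfs) == bool(gtf_list):
        raise ValueError("give exactly one of query_gtfs or gtf_list")
    cmd = ["gffcompare"]
    if compare_all:
        cmd.append("-G")                 # compare all input transcripts
    cmd += ["-r", reference_gtf,         # reference annotation
            "-o", out_prefix]            # prefix for the output files
    if gtf_list:
        cmd += ["-i", gtf_list]          # file listing the query GTFs
    else:
        cmd += list(query_gtfs)
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_gffcompare_cmd("annotation.gtf", "gffcmp",
                           ["sample1.gtf", "sample2.gtf"])
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.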

All options

gffcompare v0.11.2
-----------------------------
Usage:
gffcompare [-r <reference_mrna.gtf> [-R]] [-T] [-V] [-s <seq_path>]
    [-o <outprefix>] [-p <cprefix>]
    {-i <input_gtf_list> | <input1.gtf> [<input2.gtf> .. <inputN.gtf>]}

 GffCompare provides classification and reference annotation mapping and
 matching statistics for RNA-Seq assemblies (transfrags) or other generic
 GFF/GTF files.
 GffCompare also clusters and tracks transcripts across multiple GFF/GTF
 files (samples), writing matching transcripts (identical intron chains) into
 <outprefix>.tracking, and a GTF file <outprefix>.combined.gtf which
 contains a nonredundant set of transcripts across all input files (with
 a single representative transfrag chosen for each clique of matching transfrags
 across samples).

 Options:
 -v display gffcompare version (also --version)
 -i provide a text file with a list of (query) GTF files to process instead
    of expecting them as command line arguments (useful when a large number
    of GTF files should be processed)

 -r reference annotation file (GTF/GFF)
 --strict-match : the match code '=' is only assigned when all exon boundaries
    match; code '~' is assigned for intron chain match or single-exon

 -R for -r option, consider only the reference transcripts that
    overlap any of the input transfrags (Sn correction)
 -Q for -r option, consider only the input transcripts that
    overlap any of the reference transcripts (Precision correction);
    (Warning: this will discard all "novel" loci!)
 -M discard (ignore) single-exon transfrags and reference transcripts
 -N discard (ignore) single-exon reference transcripts
 -D discard "duplicate" query transfrags (i.e. those with the same
    intron chain) within a single sample (disable "annotation" mode)
 -S like -D, but stricter duplicate checking: only discard matching query
    or reference transcripts (same intron chain) if their boundaries are fully
        contained within other, larger or identical transfrags; if --strict-match
    is also given, exact matching of all exon boundaries is required
 --no-merge : disable close-exon merging (default: merge exons separated by
    "introns" shorter than 5 bases)

 -s path to genome sequences (optional); this can be either a multi-FASTA
    file or a directory containing single-fasta files (one for each contig);
    repeats must be soft-masked (lower case) in order to be able to classify
    transfrags as repeats

 -T do not generate .tmap and .refmap files for each input file
 -e max. distance (range) allowed from free ends of terminal exons of
    reference transcripts when assessing exon accuracy (100)
 -d max. distance (range) for grouping transcript start sites (100)
 -V verbose processing mode (also shows GFF parser warnings)
 --chr-stats: the .stats file will show summary and accuracy data
   for each reference contig/chromosome separately
 --debug : enables -V and generates additional files:
    <outprefix>.Q_discarded.lst, <outprefix>.missed_introns.gff,
    <outprefix>.R_missed.lst

Options for the combined GTF output file:
 -p the name prefix to use for consensus transcripts in the
    <outprefix>.combined.gtf file (default: 'TCONS')
 -C discard matching and "contained" transfrags in the GTF output
    (i.e. collapse intron-redundant transfrags across all query files)
 -A like -C but does not discard intron-redundant transfrags if they start
    with a different 5' exon (keep alternate TSS)
 -X like -C but also discard contained transfrags if transfrag ends stick out
    within the container's introns
 -K for -C/-A/-X, do NOT discard any redundant transfrag matching a reference

III. Output files

1. Class codes

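Class codes are single-character codes that describe how each transcript in the input relates to the transcripts in the annotation. The mapping of codes to relationships is shown in the figure below.

[Figure: class code]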
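
For reference in scripts, the class codes can be kept in a small lookup table. The descriptions below are paraphrased from memory of the gffcompare manual, not copied from this post's figure, so treat them as an assumption and check them against the manual linked at the top before relying on them. The helper name `describe` is hypothetical.

```python
# Class codes as paraphrased from the gffcompare manual (assumption: verify
# against the manual for the gffcompare version you are running).
CLASS_CODES = {
    "=": "complete, exact match of the intron chain",
    "c": "contained in the reference (intron compatible)",
    "k": "contains the reference (reverse containment)",
    "m": "retained intron(s); all introns matched or retained",
    "n": "retained intron(s); not all introns matched or retained",
    "j": "multi-exon with at least one junction match",
    "e": "single-exon transfrag partially covering an intron "
         "(possible pre-mRNA fragment)",
    "o": "other same-strand overlap with reference exons",
    "s": "intron match on the opposite strand (likely a mapping error)",
    "x": "exonic overlap on the opposite strand",
    "i": "fully contained within a reference intron",
    "y": "contains a reference within its intron(s)",
    "p": "possible polymerase run-on (no actual overlap)",
    "r": "repeat (based on soft-masked reference bases)",
    "u": "none of the above (unknown, intergenic)",
}


def describe(code):
    """Return the description for a class code, or a fallback string."""
    return CLASS_CODES.get(code, "unrecognized class code")


print(describe("u"))
```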

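2. There are six output files. The first four can be written to a location of your choice (through the path given in the -o prefix); the last two are saved in the same directory as the input GTF files. All of them are named starting with the prefix given by -o.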

gffcmp.annotated.gtf: contains the class code information; this file is generally used as input for the subsequent StringTie step
gffcmp.stats: summary statistics for each feature type, including the number of novel exons and introns found. It reports two measures, sensitivity and precision, defined as follows: "Sensitivity is defined as the proportion of genes from the annotation that are correctly reconstructed, whereas precision (also known as positive predictive value) captures the proportion of the output that overlaps the annotation"
gffcompare.loci: see the manual
gffcompare.tracking: see the manual
gffcompare_result.refmap: this file has four columns. Column 1, ref_gene_id, is the gene symbol (or the Ensembl gene ID when there is no symbol); column 2, ref_id, is the Ensembl transcript ID; column 3, class_code, is either "=" or "c"; column 4 is cuff_id_list. The file lists the assembled transcripts that match a reference transcript almost exactly
gffcompare_result.tmap: contains per-transcript quantification such as cov and FPKM; it can be used for quantification or for filtering novel transcripts
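
The two measures quoted for the .stats file reduce to simple ratios. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP); the counts are made up for illustration, and the exact matching rules gffcompare applies at each feature level are not reproduced here.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of reference features that were correctly reconstructed."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def precision(true_pos, false_pos):
    """Fraction of assembled features that agree with the reference."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


# Hypothetical counts: 80 reference transcripts matched, 20 reference
# transcripts missed, 120 assembled transcripts with no reference match.
matched, missed_ref, unmatched_query = 80, 20, 120

print(f"sensitivity = {sensitivity(matched, missed_ref):.1%}")      # 80.0%
print(f"precision   = {precision(matched, unmatched_query):.1%}")   # 40.0%
```

A low precision is expected when an assembly contains many novel transcripts, because anything absent from the annotation counts against it.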

IV. How to find novel transcripts

1. Upstream: hisat2 + stringtie + stringtie --merge
2. Midstream: gffcompare
3. Downstream: stringtie + the gffcompare result
4. Further downstream: quantification and differential analysis with ballgown
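
The four stages above can be sketched as a list of command lines for one sample. This is an outline under stated assumptions, not a tested pipeline: the flags follow the HISAT/StringTie/Ballgown protocol paper cited at the top, the `samtools sort` step is added because StringTie needs a sorted BAM, and every file name, the index name and the function `pipeline_commands` are hypothetical. Following this post, the final StringTie run takes the gffcompare annotated GTF as its `-G` guide; the protocol paper uses the merged GTF there instead. Ballgown runs in R and is not shown.

```python
def pipeline_commands(sample, index, ref_gtf, prefix="gffcmp", threads=8):
    """Return the per-sample command lines as argument lists (not executed)."""
    t = str(threads)
    sam, bam = f"{sample}.sam", f"{sample}.bam"
    return [
        # 1. Upstream: align, sort, assemble, then merge assemblies.
        ["hisat2", "-p", t, "--dta", "-x", index,
         "-1", f"{sample}_1.fastq.gz", "-2", f"{sample}_2.fastq.gz",
         "-S", sam],
        ["samtools", "sort", "-@", t, "-o", bam, sam],
        ["stringtie", "-p", t, "-G", ref_gtf,
         "-o", f"{sample}.gtf", "-l", sample, bam],
        ["stringtie", "--merge", "-p", t, "-G", ref_gtf,
         "-o", "stringtie_merged.gtf", "mergelist.txt"],
        # 2. Midstream: compare the merged assembly with the annotation.
        ["gffcompare", "-G", "-r", ref_gtf, "-o", prefix,
         "stringtie_merged.gtf"],
        # 3. Downstream: re-estimate abundances, writing Ballgown tables (-B).
        ["stringtie", "-e", "-B", "-p", t, "-G", f"{prefix}.annotated.gtf",
         "-o", f"ballgown/{sample}/{sample}.gtf", bam],
    ]


cmds = pipeline_commands("sample1", "genome_index", "annotation.gtf")
for cmd in cmds:
    print(" ".join(cmd))
```

The merge and gffcompare steps run once for the whole experiment, after every sample has been assembled; they appear in the per-sample list only to show the order.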

Characteristics of a novel transcript (following other authors' articles):

1. The class code meets the criteria, e.g. it is one of "i", "j", "o", "u", "x"
2. The statistics pass the thresholds, e.g. FPKM >= 0.5, coverage > 1, length > 200
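
The two criteria above can be applied to a .tmap file with a few lines of Python. This is a sketch under assumptions: the column names (`class_code`, `qry_id`, `FPKM`, `cov`, `len`) are the tmap header names as I recall them from the gffcompare manual, so check them against the header of your own file, and the sample rows are invented for illustration. The function name `novel_transcripts` is hypothetical.

```python
import csv
import io

# Class codes treated as "novel" in the criteria above.
NOVEL_CODES = {"i", "j", "o", "u", "x"}


def novel_transcripts(tmap_lines, min_fpkm=0.5, min_cov=1.0, min_len=200):
    """Return query transcript IDs that pass the class-code and
    FPKM / coverage / length thresholds.

    tmap_lines: an open .tmap file or any iterable of tab-separated lines
    whose first line is the header.
    """
    reader = csv.DictReader(tmap_lines, delimiter="\t")
    keep = []
    for row in reader:
        if (row["class_code"] in NOVEL_CODES
                and float(row["FPKM"]) >= min_fpkm
                and float(row["cov"]) > min_cov
                and int(row["len"]) > min_len):
            keep.append(row["qry_id"])
    return keep


# Invented rows standing in for a real .tmap file.
sample = io.StringIO(
    "ref_gene_id\tref_id\tclass_code\tqry_gene_id\tqry_id\tnum_exons\t"
    "FPKM\tTPM\tcov\tlen\tmajor_iso_id\tref_match_len\n"
    "-\t-\tu\tMSTRG.1\tMSTRG.1.1\t2\t1.2\t2.0\t3.5\t850\tMSTRG.1.1\t-\n"
    "GENE1\tTX1\t=\tMSTRG.2\tMSTRG.2.1\t4\t9.0\t12.0\t20.1\t1500\tMSTRG.2.1\t1500\n"
    "GENE2\tTX2\tj\tMSTRG.3\tMSTRG.3.1\t3\t0.2\t0.3\t0.8\t400\tMSTRG.3.1\t900\n"
)
print(novel_transcripts(sample))  # ['MSTRG.1.1']
```

With a real file, pass the open handle instead of the sample, e.g. `with open(path, newline="") as fh: ids = novel_transcripts(fh)`. The thresholds are keyword arguments so they can be tuned per dataset.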
