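Author: Evil Genius
Exome bioinformatics analysis pipeline
The basic analysis is mainly used to obtain mutation information for the sample genome. First, the raw sequencing data (Raw data) are quality-controlled to obtain high-quality Clean data. Next, the Clean data are aligned to the human reference genome to produce BAM files. Finally, somatic mutations are called and annotated from the BAM files, giving the basic mutation results needed to study disease mechanisms. The basic analysis report mainly presents the somatic results. The flowchart of the basic analysis is shown below: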

Sequencing data filtering
Raw sequencing data contain a small number of reads that carry adapter sequence or have low base quality. To ensure the quality and reliability of the analysis, the raw data must be filtered. Trimmomatic is typically used for this step.
Sequencing error rate distribution check
The Phred score (Qphred) of each base is derived from the sequencing error rate via Formula 1. The error rate itself is computed during base calling by a model that estimates the probability of a miscall. The correspondence between the two values is shown in the table below.
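The conversion referred to as Formula 1 can be sketched in a few lines. This assumes the formula is the standard Phred definition, Qphred = -10 · log10(e), where e is the per-base error probability; the function names are illustrative and not part of any pipeline tool.

```python
import math


def error_rate_to_phred(error_rate):
    """Convert a per-base error probability e into Q = -10 * log10(e)."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


def phred_to_error_rate(q):
    """Inverse conversion: e = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Print the error rate for a few commonly quoted Phred scores.
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error rate {phred_to_error_rate(q):.4%}")
```

Under this definition an error rate of 0.1% corresponds to Q30, the threshold used in the quality checks of this report.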

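The sequencing error rate is related to base quality and is influenced by several factors, including the sequencer itself, the sequencing reagents, and the sample. On Illumina high-throughput platforms, the error rate distribution has two characteristic features:
* (1) The error rate rises as sequencing proceeds, because factors such as incomplete cleavage of the fluorescent labels cause the fluorescence signal to decay, which increases the error rate;
* (2) The first few bases of each read also show a higher error rate, because in the initial phase of sequencing-by-synthesis the sequencer's optical sensor is slow to focus, so the fluorescence images are of lower quality and base calling is less accurate.
In this part of the analysis, a sample passes if 80% of its reads have an error rate below 0.1%.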
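GC content distribution
The proportion of guanine (G) and cytosine (C) in a nucleotide sequence is called the GC content. GC content is to some extent species-specific, but the 6 bp random primers used during reverse transcription introduce a bias in the nucleotide composition of the first few bases, producing normal fluctuation that then stabilizes. For the standard NEB library preparation method, because fragmentation is random and the two strands are complementary, the G and C fractions, and likewise the A and T fractions, should in theory be equal at each read position and stay essentially constant over the whole run, appearing as horizontal lines. For strand-specific libraries, since only single-strand information is retained, A/T separation or G/C separation may appear.
In the plot for this part, the four bases should normally occur at similar frequencies with no positional differences, so for a good sample the four lines should be parallel and close together. When the base proportions are biased at some positions, that is, the four lines tangle at certain positions, this often indicates contamination by overrepresented sequences. When all base proportions are consistently biased, that is, the four lines are parallel but separated, this usually reflects bias in the library (from library preparation or intrinsic to the sample) or a systematic sequencing error. The result is warn when, at any position, the A and T proportions or the G and C proportions differ by more than 10%, and fail when they differ by more than 20%.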
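The warn/fail rule above can be illustrated with a small sketch. It assumes the reads are available as plain strings and that the rule compares A against T and G against C at each position; the function names and the in-memory input are hypothetical simplifications, not the implementation of any QC tool.

```python
from collections import Counter


def per_position_fractions(reads):
    """Return, for each read position, the fraction of A, C, G and T."""
    length = max(len(r) for r in reads)
    fractions = []
    for i in range(length):
        counts = Counter(r[i] for r in reads if i < len(r))
        total = sum(counts[b] for b in "ACGT")  # ignore N and other symbols
        fractions.append(
            {b: (counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return fractions


def base_content_status(fractions, warn=0.10, fail=0.20):
    """Apply the 10% / 20% thresholds to the largest A-T or G-C gap."""
    worst = 0.0
    for f in fractions:
        worst = max(worst, abs(f["A"] - f["T"]), abs(f["G"] - f["C"]))
    if worst > fail:
        return "fail"
    if worst > warn:
        return "warn"
    return "pass"
```

For example, `base_content_status(per_position_fractions(reads))` returns one of `"pass"`, `"warn"` or `"fail"` for a list of read sequences.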
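Sequencing quality distribution
A sample passes if more than 80% of its reads have a sequencing quality of Q30 or higher, which is sufficient for the downstream analysis to proceed normally. Because of how the sequencing technology works, bases at the end of a read generally have lower quality than those at the start.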
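A minimal sketch of the Q30 check follows. It assumes the quality strings come from FASTQ records with Phred+33 encoding and, as one possible reading of the criterion, measures the fraction of bases at or above Q30; the function name is illustrative.

```python
def q30_fraction(quality_strings, offset=33, threshold=30):
    """Fraction of bases whose Phred score is at least `threshold`.

    `offset=33` assumes Phred+33 (Sanger / Illumina 1.8+) encoding.
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


def passes_q30(quality_strings, min_fraction=0.80):
    """Apply the 80% threshold described in the text."""
    return q30_fraction(quality_strings) > min_fraction
```

In Phred+33 encoding the character `?` represents Q30 and `I` represents Q40, so a read whose quality string is all `I` passes, while one made mostly of `5` (Q20) does not.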
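Sequencing depth and coverage statistics
Valid sequencing data are aligned to the reference genome with BWA, producing the initial alignments in BAM format. The BAM file then undergoes duplicate marking and related processing to give the final BAM alignments, and coverage statistics are computed from the valid data aligned to the reference genome. Typically, reads from human samples reach a mapping rate above 98%, and SNVs called at sites covered by reads at a depth of 10× or more are relatively reliable.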
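The coverage statistics can be sketched as follows. This assumes per-position depths are available as tab-separated lines of chromosome, position and depth (the three-column layout that `samtools depth` writes for a single BAM); the function name and the returned keys are illustrative.

```python
def coverage_summary(depth_lines, min_depth=10):
    """Summarise per-position depth lines of the form 'chrom<TAB>pos<TAB>depth'."""
    depths = [int(line.split("\t")[2]) for line in depth_lines if line.strip()]
    n = len(depths)
    if n == 0:
        return {"positions": 0, "mean_depth": 0.0, "fraction_covered": 0.0}
    return {
        "positions": n,
        "mean_depth": sum(depths) / n,
        # Fraction of positions at or above the 10x reliability threshold.
        "fraction_covered": sum(d >= min_depth for d in depths) / n,
    }
```

Note that positions with zero depth only count toward the denominator if they are present in the input, so the depth file should list every target position.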
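Germline variant detection
Germline mutations arise mainly from mutations in germ cells. In males the germ cells are the sperm-producing cells and the mutations occur in the testes; in females the germ cells are egg cells and the mutations occur in the ovaries. In most cases germline mutations are silent and do not affect the parent unless they affect the production of gametes. Although these mutations can have negative effects, such as rare diseases or even cancer, they can still contribute to genetic diversity that benefits human health.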
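Germline Indel detection results
InDel stands for Insertion and Deletion and refers to the insertion or deletion of small fragments. Insertions and deletions in coding regions or at splice sites may change protein translation. In a frameshift variant, the length of the inserted or deleted bases is not a multiple of 3, so the whole reading frame may be altered. Compared with non-frameshift variants, frameshift mutations have a larger effect on gene function and are also under stronger selective pressure. Here the MutScan module is used to detect germline InDels; an example of the resulting InDel output is shown below: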

Notes:
*   #CHROM: chromosome
*   POS: genomic position of the indel
*   ID: indel ID
*   REF: reference allele
*   ALT: alternate allele
*   QUAL: indel quality
*   FILTER: filter information
*   INFO: indel information; this field has many entries, whose meanings are given in the header comments of the result file in the folder
*   FORMAT: format information
    *   DP: Read depth for tier1
    *   DP2: Read depth for tier2
    *   TAR: Reads strongly supporting alternate allele for tiers 1,2
    *   TIR: Reads strongly supporting indel allele for tiers 1,2
    *   TOR: Other reads (weak support or insufficient indel breakpoint overlap) for tiers 1,2
    *   DP50: Average tier1 read depth within 50 bases
    *   FDP50: Average tier1 number of basecalls filtered from original read depth within 50 bases
    *   SUBDP50: Average number of reads below tier1 mapping quality threshold aligned across sites within 50 bases
    *   BCN50: Fraction of filtered reads within 50 bases of the indel.
*   NORMAL: GT information for the normal sample
*   TUMOR: GT information for the tumor sample
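The frameshift rule described in this section (an inserted or deleted length that is not a multiple of 3) can be sketched directly from the REF and ALT columns. This is a simplified illustration that assumes the indel lies entirely within a coding region; the function name is hypothetical, and the returned labels follow the ExonicFunc naming used in the annotation output.

```python
def classify_indel(ref, alt):
    """Label an indel as frameshift or nonframeshift from REF/ALT lengths."""
    diff = len(alt) - len(ref)
    if diff == 0:
        raise ValueError("REF and ALT have equal length; not an indel")
    kind = "insertion" if diff > 0 else "deletion"
    # A length change divisible by 3 keeps the reading frame intact.
    frame = "nonframeshift" if abs(diff) % 3 == 0 else "frameshift"
    return f"{frame} {kind}"
```

For instance, `classify_indel("A", "AT")` inserts one base and is reported as a frameshift insertion, while `classify_indel("ATCG", "A")` removes three bases and is reported as a nonframeshift deletion.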
Germline SNV detection results
SNV stands for single nucleotide variant, a variation caused by the substitution of a single nucleotide in the genome. GATK Mutect2 is the main tool used to find somatic SNVs.

Notes:
*   CHROM: chromosome
*   POS: genomic position of the SNV
*   ID: SNV ID
*   REF: reference allele
*   ALT: alternate allele
*   QUAL: quality value
*   FILTER: filter information
*   INFO: SNV information; this field has many entries, each explained in the header comments of the file in the folder
*   FORMAT: format information
    *   GT: Genotype;
    *   AD: Allelic depths for the ref and alt alleles in the order listed;
    *   AF: Allele fractions of alternate alleles in the tumor;
    *   DP: Approximate read depth; some reads may have been filtered;
    *   F1R2: Count of reads in F1R2 pair orientation supporting each allele;
    *   F2R1: Count of reads in F2R1 pair orientation supporting each allele;
    *   SB: Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.
*   sample name: the values of the FORMAT fields for that sample
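To make the relationship between the FORMAT column and the per-sample column concrete, here is a small sketch that pairs the two colon-separated fields. The example values are invented for illustration, and the helper names are hypothetical; production code would more likely use a dedicated VCF parser.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with the matching values of one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))


def alt_allele_fraction(sample):
    """Compute alt / (ref + alt) from the AD field (first alt allele only)."""
    ref_depth, alt_depth = (int(x) for x in sample["AD"].split(",")[:2])
    total = ref_depth + alt_depth
    return alt_depth / total if total else 0.0


if __name__ == "__main__":
    # Invented example record: 45 reference reads and 5 alternate reads.
    sample = parse_sample("GT:AD:AF:DP", "0/1:45,5:0.1:50")
    print(sample["GT"], sample["DP"], alt_allele_fraction(sample))
```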
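Somatic variant detection
Somatic mutations are mutations that occur in somatic cells, that is, cells other than germline cells; they arise in the normal cells of the body, for example in skin or organs. Somatic mutations are neither inherited from the parents nor passed on to offspring, but they can change the genetic makeup of some cells in the current generation. Somatic mutations, and driver mutations in particular, are very important for explaining how tumors arise and progress, and the development of tumor drug resistance is also linked to somatic mutations. Somatic mutations are therefore the focus of cancer genome research and a feature that distinguishes it from the study of other diseases. In the analysis, tumor tissue is compared with normal tissue to filter out the germline mutations present in the normal tissue, keeping only the somatic mutations carried by tumor cells.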
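Somatic SNV detection results
SNV stands for single nucleotide variant, a variation caused by the substitution of a single nucleotide in the genome. GATK Mutect2 is the main tool used to find somatic SNVs.
Filter conditions: read count ≥ 30, variant allele fraction > 1%, and variant sites annotated as "Pathogenic" or "Likely_pathogenic".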
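The filter conditions above can be expressed as a single predicate. This sketch assumes the read count is the site depth (DP), the variant allele fraction is the AF value, and the pathogenicity label is the ClinVar CLNSIG annotation; the function and parameter names are illustrative, and exact label matching is a simplification.

```python
PATHOGENIC_LABELS = {"Pathogenic", "Likely_pathogenic"}


def passes_somatic_filter(depth, allele_fraction, clnsig,
                          min_depth=30, min_af=0.01):
    """Return True when a variant meets all three filter conditions."""
    return (
        depth >= min_depth              # read count >= 30
        and allele_fraction > min_af    # allele fraction strictly above 1%
        and clnsig in PATHOGENIC_LABELS
    )
```

Combined labels such as `Pathogenic/Likely_pathogenic` would need extra handling if they should also be retained.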

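Germline/Somatic mutation annotation
ANNOVAR (http://www.openbioinformatics.org/annovar/) is used to annotate the detected somatic mutations.
*   1) Annotate the gene structure at each variant site using RefSeq; gene types include mRNA, non-coding RNA, and others;
*   2) Describe the genomic features of the variant site; mutations located within repetitive genomic segments should be treated with caution;
*   3) Assess the effect of nonsynonymous mutations on disease/cancer with SIFT, PolyPhen, MutationTaster and other methods;
*   4) Provide annotation from dbSNP, the 1000 Genomes SNP database, the COSMIC database of known cancer somatic mutations, the esp6500 variant database and others, so that variants can be filtered by any combination of these;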

Note: database information is given at the end of the article.
*   The annotation results contain more than 200 columns; the more important ones are described here:
*   Column 1: (Chr) chromosome
*   Column 2: (Start) start position of the variant
*   Column 3: (End) end position of the variant
*   Column 4: (Ref) reference allele
*   Column 5: (Alt) alternate allele
*   Column 6: (Func.refGene) genomic region in which the variant lies
*   Column 7: (Gene.refGene) gene to which the variant is annotated
*   Column 8: (GeneDetail.refGene) describes variants in UTR, splicing, ncRNA_splicing or intergenic regions. When the Func column is exonic, ncRNA_exonic, intronic, ncRNA_intronic, upstream, downstream, upstream;downstream, ncRNA_UTR3 or ncRNA_UTR5, this column is empty. When Func is exonic;splicing, the site lies in an exonic region of some transcripts and a splicing region of others; in that case GeneDetail gives the effect of the site on transcript splicing. For example, NM_1524XX:exon3:c.232C>T means the variant is on transcript NM_1524XX, exon3 is the third exon, and c.232C>T is a C-to-T change at cDNA position 232. When Func is intergenic, the column has the form dist=1322;dist=12414, giving the distance from the variant to the flanking gene on each side.
*   Column 9: (ExonicFunc.refGene) type of exonic SNV or InDel. SNV types are synonymous_SNV, missense_SNV, stopgain, stoploss and unknown; InDel types are frameshift insertion, frameshift deletion, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion and unknown.
*   Column 10: (AAChange.refGene) amino acid change; populated only when Func is exonic or exonic;splicing.
*   Column 11: (Interpro_domain) domain annotation from InterPro, the protein sequence and classification database
*   Column 12: (avsnp150) ID of the variant in dbSNP
*   Column 13: (snp138) ID of the variant in dbSNP138
*   Column 14: (1000g2015aug_all) allele frequency of the alternate base at this site across all populations in the 1000 Genomes Project data (August 2015 release)
*   Column 15: (ExAC_ALL) allele frequency of the alternate base at this site across all populations
*   Column 16: (ExAC_AFR) allele frequency of the variant in the African population in ExAC
*   Column 17: (ExAC_AMR) allele frequency of the variant in the Admixed American population in ExAC
*   Column 18: (ExAC_EAS) allele frequency of the variant in the East Asian population in ExAC
*   Column 19: (ExAC_FIN) allele frequency of the variant in the Finnish population in ExAC
*   Column 20: (ExAC_NFE) allele frequency of the variant in the non-Finnish European population in ExAC
*   Column 21: (ExAC_OTH) allele frequency of the variant in ExAC populations other than those listed
*   Column 22: (ExAC_SAS) allele frequency of the variant in the South Asian population in ExAC
*   Column 23: (CLNALLELEID) the ClinVar Allele ID
*   Column 24: (CLNDN) ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB
*   Column 25: (CLNDISDB) Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN
*   Column 26: (CLNREVSTAT) ClinVar review status for the Variation ID
*   Column 27: (CLNSIG) Clinical significance for this single variant
*   Column 28: (InterVar_automated) clinical interpretation of the genetic variant by the InterVar database
*   Columns 29-56: (PVS1-BP7) In 2015 the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) issued guidance and standards for genetic testing in clinical laboratories (PMID: 25741868). The guideline applies mainly to variants in genes associated with Mendelian disorders, or to germline variants. It recommends describing variants with the uniform Human Genome Variation Society (HGVS) nomenclature and classifies variants into five tiers based on population frequency (population data), software prediction (computational data), functional assays (functional data) and other parameters: pathogenic, likely pathogenic, uncertain significance, likely benign and benign. The guideline lists the evidence supporting a pathogenic or likely pathogenic call, in decreasing order of strength: very strong (PVS1), strong (PS1-4; the numbers do not indicate differences in strength, only different kinds of evidence at the same strength, and likewise below), moderate (PM1-6) and supporting (PP1-5). For benign or likely benign calls, the evidence in decreasing order of strength is: stand-alone (BA1), strong (BS1-4) and supporting (BP1-7).
*   Column 57: (cosmic95) annotation from the COSMIC database, showing whether the variant has been reported in the literature before, in which cancer types, and how many times.
*   Column 58: (SIFT_score) SIFT score, indicating the effect of the variant on the protein sequence; lower scores are more "deleterious", meaning the SNP is more likely to alter protein structure or function;
*   Column 59: (SIFT_pred) D: Deleterious (sift<=0.05); T: tolerated (sift>0.05)
*   Column 60: (Polyphen2_HDIV_score) PolyPhen2 prediction, based on the HumanDiv database, of the variant's effect on the protein sequence; used for complex diseases. Higher values are more "deleterious", meaning the SNP is more likely to alter protein structure or function;
*   Column 61: (Polyphen2_HDIV_pred) D, P or B (D: Probably damaging (>=0.957); P: possibly damaging (0.453<=pp2_hdiv<=0.956); B: benign (pp2_hdiv<=0.452))
*   Column 62: (Polyphen2_HVAR_score) PolyPhen2 prediction, based on the HumanVar database, of the variant's effect on the protein sequence; used for monogenic diseases. Higher values are more "deleterious", meaning the SNP is more likely to alter protein structure or function;
*   Column 63: (Polyphen2_HVAR_pred) D, P or B (D: Probably damaging (>=0.909); P: possibly damaging (0.447<=pp2_hvar<=0.909); B: benign (pp2_hvar<=0.446))
*   Column 64: (LRT_score) LRT score, indicating the effect of the variant on the protein sequence; higher values are more "deleterious", meaning the SNP is more likely to alter protein structure or function.
*   Column 65: (LRT_pred) D, N or U (D: Deleterious; N: Neutral; U: Unknown).
*   Column 66: (MutationTaster_score) MutationTaster score, indicating the effect of the variant on the protein sequence; higher values are more "deleterious", meaning the SNP is more likely to alter protein structure or function.
*   Column 67: (MutationTaster_pred) "A" ("disease_causing_automatic"); "D" ("disease_causing"); "N" ("polymorphism"); "P" ("polymorphism_automatic").
*   Column 68: (MutationAssessor_score) pathogenicity score predicted by MutationAssessor
*   Column 69: (MutationAssessor_pred) classification assigned by MutationAssessor from score thresholds: H is a high-confidence deleterious site, M a medium-confidence deleterious site, L a low-confidence deleterious site, and N a harmless site
*   Column 70: (FATHMM_score) pathogenicity score predicted by FATHMM
*   Column 71: (FATHMM_pred) classification assigned by FATHMM from score thresholds: D is a high-confidence deleterious site, P a moderate-confidence deleterious site
*   Column 72: (PROVEAN_score) Protein Variation Effect Analyzer, higher values are more deleterious
*   Column 73: (PROVEAN_pred) Protein Variation Effect Analyzer, D: Deleterious; N: Neutral
*   Column 74: (VEST3_score) Variant effect scoring tool; Random forest classifier, higher values are more deleterious
*   Column 75: (CADD_raw) CADD raw score
*   Column 76: (CADD_phred) CADD phred-like score, higher values are more deleterious
*   Column 77: (DANN_score) Deleterious Annotation of genetic variants using Neural Networks, higher values are more deleterious
*   Column 78: (fathmm-MKL_coding_score) Predicting the effects of both coding and non-coding variants using nucleotide-based HMMs, Score >= 0.5: D; Score < 0.5: T
*   Column 79: (fathmm-MKL_coding_pred) Predicting the effects of both coding and non-coding variants using nucleotide-based HMMs, D: Deleterious; T: Tolerated.
*   Column 80: (MetaSVM_score) Using Support vector machine. MetaSVM score, higher scores are more deleterious.
*   Column 81: (MetaSVM_pred) Using Support vector machine. MetaSVM prediction, D: Deleterious; T: Tolerated.
*   Column 82: (MetaLR_score) Using Logistic regression. MetaLR score, higher scores are more deleterious.
*   Column 83: (MetaLR_pred) Using Logistic regression. MetaLR prediction. D: Deleterious; T: Tolerated.
*   Column 84: (integrated_fitCons_score) Integrate functional assays like ChIP-Seq with conservation measure of transcription factor binding sites. Fitness consequences of functional annotation. Higher scores are more deleterious.
*   Column 85: (integrated_confidence_value) Fitness consequences of functional annotation, confidence level. Higher scores are more deleterious.
*   Column 86: (GERP++_RS) GERP++ "rejected substitutions" (RS) score, higher scores are more deleterious
*   Column 87: (phyloP7way_vertebrate) Phylogenetic p-values for 7 vertebrate species. Higher scores are more deleterious
*   Column 88: (phyloP20way_mammalian) Phylogenetic p-values for 20 mammalian species. Higher scores are more deleterious
*   Column 89: (phastCons7way_vertebrate) PhastCons score for 7 vertebrate species. Higher scores are more deleterious.
*   Column 90: (phastCons20way_mammalian) PhastCons score for 20 mammalian species. Higher scores are more deleterious.
*   Column 91: (SiPhy_29way_logOdds) SiPhy log odds score for 29 species. Higher scores are more deleterious
*   Columns 92-104: (Otherinfo1-13) other annotation information.
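The score-to-category rules listed for the SIFT and PolyPhen2 HDIV columns can be sketched as simple threshold functions. The cutoffs are the ones quoted in the column descriptions above; the function names are hypothetical, and the annotation output already contains the `_pred` columns, so this only illustrates how the score and prediction columns relate.

```python
def sift_pred(score):
    """D (deleterious) when the SIFT score is <= 0.05, otherwise T (tolerated)."""
    return "D" if score <= 0.05 else "T"


def polyphen2_hdiv_pred(score):
    """Map a PolyPhen2 HDIV score to D, P or B using the listed cutoffs."""
    if score >= 0.957:
        return "D"  # probably damaging
    if score >= 0.453:
        return "P"  # possibly damaging
    return "B"      # benign
```

The two tools point in opposite directions: a low SIFT score and a high PolyPhen2 score both indicate a likely damaging change, so the raw scores should not be compared without converting them to categories first.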
Information on the 30 annotation databases used by ANNOVAR
| Database | Description |
|---|---|
| refGene | FASTA sequences for all annotated transcripts in RefSeq Gene |
| avsift | whole-exome SIFT scores for non-synonymous variants (obsolete and should not be used any more) |
| abraom | 2.3 million Brazilian genomic variants |
| cadd13gt20 | CADD version 1.3 score>20 |
| clinvar_20210123 | Clinvar version 20210123 with separate columns (CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG) |
| dbscsnv11 | dbscSNV version 1.1 for splice site prediction by AdaBoost and Random Forest |
| esp6500siv2_all | alternative allele frequency in All subjects in the NHLBI-ESP project with 6500 exomes, including the indel calls and the chrY calls. This is lifted over from hg19 by myself. |
| esp6500siv2_ea | alternative allele frequency in European American subjects in the NHLBI-ESP project with 6500 exomes, including the indel calls and the chrY calls. This is lifted over from hg19 by myself |
| exac03 | ExAC 65000 exome allele frequency data for ALL, AFR (African), AMR (Admixed American), EAS (East Asian), FIN (Finnish), NFE (Non-Finnish European), OTH (other), SAS (South Asian). Version 0.3. Left normalization done. |
| exac03nontcga | ExAC on non-TCGA samples (updated header) |
| gene4denovo201907 | gene4denovo database |
| avsnp150 | dbSNP150 with allelic splitting and left-normalization |
| gerp++elem | conserved genomic regions by GERP++ |
| gme | Great Middle East allele frequency including NWA (northwest Africa), NEA (northeast Africa), AP (Arabian peninsula), Israel, SD (Syrian desert), TP (Turkish peninsula) and CA (Central Asia) |
| gwava | whole genome GWAVA_region_score and GWAVA_tss_score (GWAVA_unmatched_score has bug in file) |
| hrcr1 | 40 million variants from 32K samples in haplotype reference consortium |
| cadd13gt10 | CADD version 1.3 score>10 |
| intervar_20180118 | InterVar: clinical interpretation of missense variants (indels not supported) |
| mcap13 | M-CAP scores v1.3 |
| mitimpact24 | pathogenicity predictions of human mitochondrial missense variants |
| nci60 | NCI-60 human tumor cell line panel exome sequencing allele frequency data |
| revel | REVEL scores for non-synonymous variants |
| gnomad211_genome | gnomAD genome collection (v2.1.1), with "AF AF_popmax AF_male AF_female AF_raw AF_afr AF_sas AF_amr AF_eas AF_nfe AF_fin AF_asj AF_oth non_topmed_AF_popmax non_neuro_AF_popmax non_cancer_AF_popmax controls_AF_popmax" header |
| gerp++gt2 | whole-genome GERP++ scores greater than 2 (RS score threshold of 2 provides high sensitivity while still strongly enriching for truly constrained sites. ) |
| icgc28 | International Cancer Genome Consortium version 28 |
| kaviar_20150923 | 170 million Known VARiants from 13K genomes and 64K exomes in 34 projects |
| ljb26_all | whole-exome SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, CADD, GERP++, PhyloP and SiPhy scores from dbNSFP version 2.6 |
| eigen | whole-genome Eigen scores |
| popfreq_all_20150413 | A database containing all allele frequency from 1000G, ESP6500, ExAC and CG46 |
| regsnpintron | prioritize the disease-causing probability of intronic SNVs |