Part1: gnomAD-LOEUF數(shù)據(jù)下載
? ? 上一篇文章已經(jīng)講解了gnomAD的flagship paper,文章中新建的評(píng)估基因?qū)LoF突變?nèi)淌芰Φ哪P汀狶OEUF,想必大家肯定都很想趕緊用起來(lái)。
? ? 這個(gè)模型的數(shù)據(jù)結(jié)果即可以在文章的supplymentary中找到(就是supplementary_dataset_11_full_constraint_metrics.tsv.gz)。也可以爬上梯子,在gnomAD的官網(wǎng) → 右上角Downloads → gnomad.v2.1.1系列的Constraint,鏈接到相應(yīng)位置進(jìn)行下載,官網(wǎng)提供了:①只有經(jīng)典轉(zhuǎn)錄本的gene list表格、②包含了多個(gè)轉(zhuǎn)錄本的transcript list表格(和文章附表內(nèi)容相同)、③按人群數(shù)量做了降采樣的E.O值。

Part2: 表格數(shù)據(jù)意義(翻譯一下)
? 以full_constraint_metrics,即lof_metrics.by_transcript為例。
該文檔共80950行,其中包含了19600+個(gè)基因的經(jīng)典轉(zhuǎn)錄本和其他常見(jiàn)轉(zhuǎn)錄本。共78列,每列的header解釋如下(詳見(jiàn)supplementary information文檔第74頁(yè)):
gene: Gene name,基因名稱
transcript: Ensembl transcript ID (Gencode v19),轉(zhuǎn)錄本編號(hào)
canonical: Boolean indicator as to whether the transcript is the canonical transcript for the gene,是否是該基因的經(jīng)典轉(zhuǎn)錄本
obs_XXX: Number of observed XXX variants in transcript,在該轉(zhuǎn)錄本上觀察到XXX突變的數(shù)量(XXX=mis錯(cuò)義、syn同義、lof功能缺失)
exp_XXX: Number of expected XXX variants in transcript,在該轉(zhuǎn)錄本上預(yù)測(cè)到XXX突變的數(shù)量
oe_XXX: Observed over expected ratio for XXX variants in transcript (obs_XXX divided by exp_XXX),在該轉(zhuǎn)錄本上觀察到XXX變異超出預(yù)期的比率
mu_XXX: Mutation rate summed across all possible XXX variants in transcript,該轉(zhuǎn)錄本中所有可能的XXX變異的突變率總和
possible_XXX: Number of possible XXX variants in transcript,該轉(zhuǎn)錄中可能的XXX突變的數(shù)量(其實(shí)不是很理解這個(gè))
obs_XXX_pphen: Number of observed XXX variants in transcript predicted "probably damaging" by PolyPhen-2,被PolyPhen-2預(yù)測(cè)為“可能有害”的、觀察到的XXX突變數(shù)量
exp_XXX_pphen: Number of expected XXX variants in transcript predicted "probably damaging" by PolyPhen-2,被PolyPhen-2預(yù)測(cè)為“可能有害”的、預(yù)測(cè)察到的XXX突變數(shù)量
oe_XXX_pphen: Observed over expected ratio for PolyPhen-2 predicted "probably damaging" XXX variants in transcript (obs_mis_pphen divided by exp_mis_pphen),被PolyPhen-2預(yù)測(cè)為“可能有害”的XXX突變,觀察到超過(guò)預(yù)期的比率
possible_XXX_pphen: Number of possible missense variants in transcript that are predicted "probably damaging" by PolyPhen-2,被PolyPhen-2預(yù)測(cè)為“可能有害”的、可能的XXX突變的數(shù)量(其實(shí)也不是很理解這個(gè))
oe_XXX_lower: Lower bound of 90% confidence interval for o/e ratio for XXX variants,XXX突變的o/e比率90%置信區(qū)間的下界
oe_XXX_upper: Upper bound of 90% confidence interval for o/e ratio for XXX variants,XXX突變的o/e比率90%置信區(qū)間的上界
XXX_z: Z score for XXX variants in gene. Higher (more positive) Z scores indicate that the transcript is more intolerant of variation (more constrained). Extreme values of XXX_z indicate likely data quality issues,基因中XXX突變的Z-score。Z-score越高(越陽(yáng)性)表明該轉(zhuǎn)錄本越不耐受XXX變異(越受限制)。XXX_z的極端值表示可能存在數(shù)據(jù)質(zhì)量問(wèn)題。
pLI: Probability of loss-of-function intolerance; probability that transcript falls into distribution of haploinsufficient genes (~9% o/e pLoF ratio; computed from gnomAD data),用gnomAD數(shù)據(jù)計(jì)算出來(lái)的pLI
pRec: Probability that transcript falls into distribution of recessive genes (~46% o/e pLoF ratio; computed from gnomAD data),該轉(zhuǎn)錄本屬于隱性基因的概率
pNull: Probability that transcript falls into distribution of unconstrained genes (~100% o/e pLoF ratio; computed from gnomAD data),該轉(zhuǎn)錄本屬于非約束基因的概率
oe_lof_upper_rank: Transcript’s rank of LOEUF value compared to all transcripts (lower values indicate more constrained),與所有轉(zhuǎn)錄本相比,該轉(zhuǎn)錄本的LOEUF值的排名(較低的值表示更受限制)
oe_lof_upper_bin: Decile bin of LOEUF for given transcript (lower values indicate more constrained),該轉(zhuǎn)錄本在十分位分類中的位置(較低的值表示更受限制)
(以上2個(gè)主要是是表示LOEUF的排序、decile 分類指標(biāo))
oe_lof_upper_bin_6: Sextile bin of LOEUF for given transcript (lower values indicate more constrained),該轉(zhuǎn)錄本在六分位分類中的位置(較低的值表示更受限制)
n_sites: Number of distinct pLoF variant sites in the transcript,該轉(zhuǎn)錄體中不同lof突變位點(diǎn)的數(shù)量
classic_caf: Sum of allele frequencies of pLoFs in the transcript,該轉(zhuǎn)錄本中的pLoFs的等位基因頻率的總和
max_af: Maximum allele frequency of any pLoF in the transcript,該轉(zhuǎn)錄本中的任一pLoF的最大等位基因頻率
no_lofs: The number of individuals with no observed pLoF variants in the transcript,在該轉(zhuǎn)錄本中觀察到pLoF變異的個(gè)體數(shù)量
obs_het_lof: The number of individuals with at least one observed heterozygous pLoF variant, but no homozygous pLoF variants, in the transcript,在該轉(zhuǎn)錄本中觀察到至少一個(gè)雜合pLoF變異,但沒(méi)有純合,的個(gè)體數(shù)量
obs_hom_lof: The number of individuals with at least one observed homozygous pLoF in the transcript,在該轉(zhuǎn)錄本中觀察到至少一個(gè)純合pLoF變異的個(gè)體數(shù)量
defined: The number of individuals where at least one high-quality genotype (including homozygous reference) is observed at a called site annotated as a pLoF variant,至少有觀察到一個(gè)高質(zhì)量的pLoF突變的個(gè)體數(shù)量
p: The estimated proportion of haplotypes with a pLoF variant. Defined as: 1 - sqrt(no_lofs / defined) 一個(gè)pLoF突變的單倍型的估計(jì)比例。
exp_hom_lof: The expected number of individuals with at least one homozygous pLoF variant based on the frequency of pLoF haplotypes. Defined as: defined * p2,根據(jù)pLoF的單倍型頻率計(jì)算,至少有一個(gè)純合pLoF突變的預(yù)期個(gè)體數(shù)量。
classic_caf_POP: Sum of allele frequencies of pLoFs in the transcript among POP individuals,POP人群中pLoFs的等位基因頻率的總和
p_POP: The computation of `p` repeated among only POP individuals,只在POP群體中重復(fù)的'p'值
transcript_type: Transcript biotype (https://www.gencodegenes.org/pages/biotypes.html),轉(zhuǎn)錄本生物型
gene_id: Ensembl gene ID,Ensembl 的基因編號(hào)
transcript_level: Transcript level from Gencode (https://www.gencodegenes.org/pages/data_format.html),來(lái)自Gencode的轉(zhuǎn)錄水平
cds_length: Length of coding sequence in gene,該基因的編碼序列長(zhǎng)度
num_coding_exons: Number of coding exons in gene,該基因上編碼外顯子的數(shù)量
gene_type: Gene biotype (https://www.gencodegenes.org/pages/biotypes.html),基因生物型
gene_length: Length of gene,基因的長(zhǎng)度
exac_pLI: pLI score calculated from ExAC,在ExAC中計(jì)算得到的pLI值
exac_obs_lof: Number of observed pLoF variants in gene in ExAC,在ExAC中pLoF突變的觀察數(shù)量
exac_exp_lof: Number of expected pLoF variants in gene in ExAC,在ExAC中pLoF突變的預(yù)測(cè)數(shù)量
exac_oe_lof: Observed to expected ratio of pLoF variants in ExAC,在ExAC中pLoF突變的觀察與預(yù)期的比率
brain_expression: Expression of gene in brain from GTEx data,GTEx數(shù)據(jù)中該基因在腦部的表達(dá)
chromosome: Chromosome name, 染色體
start_position: Start position of gene,該基因的起始位置
end_position: End position of gene,該基因的終止始位置)
Part3: 使用annovar注釋
注釋數(shù)據(jù)庫(kù)文件制作:
? ? 根據(jù)對(duì)header的理解,選用了默認(rèn)canonical=TRUE的"gnomad.v2.1.1.lof_metrics.by_gene.txt"作為數(shù)據(jù)庫(kù)數(shù)據(jù)來(lái)源;
? ? 方便起見(jiàn),將最后3列移到了最前面;然后挑選了一些我自己認(rèn)為對(duì)我做疾病分析可能比較重要的列:
gene,gene_id,transcript,oe_lof_upper_bin,oe_lof_upper_rank, pLI,pRec,pNull,transcript_level
? ? 制作成了一個(gè)12列的文件,命名為“hg19_LOEUF.txt”。數(shù)據(jù)量不是很大的話,做不做該文件的annovar_index都可以,想做的話,構(gòu)建索引的程序index_annovar.pl可以給王凱老師發(fā)郵件獲得。
? ? 另外,由于ANNOVAR的region_base注釋數(shù)據(jù)庫(kù)不能有header,而且所有數(shù)據(jù)會(huì)被合并到一列里面,需要后續(xù)再自己拆開(kāi),所以上面這個(gè)順序需要另外記錄好哦~
ANNOVAR注釋腳本的修改:?
? ? 這里感謝TSY小伙伴之前的嘗試和經(jīng)驗(yàn)分享,在annotate_variation.pl文件的3084行加入下面這段elseif:

注釋一下:
$annovar/convert2annovar.pl -format vcf4 sample.vcf > sample.avinput
$annovar/table_annovar.pl sample.avinput $humandb --buildver hg19 -out sample_anno_LOEUF? -remove -protocol refGene,ensGene,LOEUF -operation g,g,r --nastring . -csvout
同時(shí)把refGene和ensGene注釋上去了,為了比較一下注釋結(jié)果中的gene名和ID是不是都能匹配上。

? ? 這個(gè)結(jié)果可以按需要再進(jìn)行分列處理。如果對(duì)其他列的信息也感興趣想進(jìn)行注釋,可以參考上面的步驟進(jìn)行修改和使用哦。
To Wang Lab 小伙伴:
? ? ANNOVAR比較適合做突變位點(diǎn)的注釋,在這個(gè)region_base的注釋中,只能覆蓋到這些基因的基因內(nèi)區(qū)域,如coding區(qū)和intron區(qū),如果你們的SNP是在基因間區(qū)的話應(yīng)該不能直接注釋。可以基于最原始的表格,直接用你們感興趣的基因去做匹配,主要看“oe_lof_upper_bin”列的值。