GWAS Analysis in Clinical Bioinformatics

Today I am sharing Chapter 5, "Bioinformatics Challenges in Genome-Wide Association Studies (GWAS)", from the collection Clinical Bioinformatics, a laboratory protocols volume.

De R., Bush W.S., Moore J.H. (2014) Bioinformatics Challenges in Genome-Wide Association Studies (GWAS). In: Trent R. (eds) Clinical Bioinformatics. Methods in Molecular Biology (Methods and Protocols), vol 1168. Humana Press, New York, NY

http://www.springer.com/series/7651

A one-page mind map summarizes the chapter (download link: "GWAS Principles", by rapunzel).

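One of the authors, Professor Jason H. Moore, works at the Geisel School of Medicine at Dartmouth. His research covers biostatistics, epidemiology, and genomics. He developed the SPARCoC software and also wrote the book Computational Methods for Genetics of Complex Traits (2010), which I will track down once I can afford it...

It really is expensive.

Okay, back to the chapter.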

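Abstract: This chapter reviews the basic concepts of GWAS, the technologies used to capture genetic variation, the missing heritability problem, efficient study design, ways to reduce the bias introduced into datasets, and how to make use of new resources such as electronic medical records.

Key words: Data imputation, Epistasis, Electronic medical records, Filtering, Gene–gene interactions, GWAS, Meta-analysis, Missing heritability, Replication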

I. Introduction

GWAS rests on the Common Disease-Common Variant (CD-CV) hypothesis: common diseases (such as type 2 diabetes, rheumatoid arthritis, or essential hypertension) are caused in part by genetic variations that are also common in the population.

The relationship between SNP effect size and disease heritability: if common variants have a small effect size but common diseases show a strong inheritance in families (high heritability), then almost by definition the disease must be influenced by multiple genetic factors.

The missing heritability problem: GWAS has had limited success in detecting genetic variants that account for a large portion of the heritability of any common disease trait. As an example, the authors cite a breast cancer study in which the two loci identified explained only 5.9% of the familial risk of breast cancer.

    * One cause is epistasis (epistatic interactions). Biological epistasis refers to the physical interactions between biomolecules that are influenced by multiple genetic variants. Statistical epistasis is the term for the nonadditive interactions between multiple genes, each of which affects disease susceptibility, and the environment.

    * Remedies: 1) Designing our studies to search for nonlinear interactions amongst SNPs. 2) Using methods such as meta-analysis and data imputation to increase our statistical power. 3) Establishing strict criteria for defining phenotypes.

II. Materials

This section introduces the two genotyping platforms, Illumina and Affymetrix, and the use of Electronic Medical Records. I skip it here.

III. Methods

Overview of the GWAS process

1 Basic concepts:

SNP: single base-pair changes in the DNA sequence, which have now become the modern unit of genetic variation

MAF: the frequency of the less common allele is referred to as the minor allele frequency

LD: linkage disequilibrium is a measure of correlation between SNP alleles at one site and the specific alleles carried at variant sites nearby. It is measured with D′ or r².

Haplotype: a particular combination of alleles along a chromosome

tag SNPs: SNPs in strong LD with the variants surrounding them; these are the ones that end up being selected
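To make MAF, D′ and r² concrete, here is a minimal sketch that computes them from a list of phased two-SNP haplotypes. The function names and the toy haplotype counts are my own illustration, not something taken from the chapter, and real tools work from genotype files rather than strings.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """MAF: frequency of the less common allele at a biallelic site."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0  # monomorphic site
    return min(counts.values()) / len(alleles)

def ld_stats(haplotypes):
    """Return (D, D_prime, r2) for two biallelic SNPs.

    Each haplotype is a 2-character string: first character is the
    allele at SNP 1, second character is the allele at SNP 2.
    """
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]  # reference alleles
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h == a + b for h in haplotypes) / n
    d = p_ab - p_a * p_b
    # D' rescales D by its maximum possible magnitude given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

# Hypothetical sample of 100 phased haplotypes
haps = ["AB"] * 50 + ["ab"] * 30 + ["Ab"] * 15 + ["aB"] * 5
print(minor_allele_frequency([h[0] for h in haps]))
print(ld_stats(haps))
```

A tag SNP is then simply a SNP whose r² with its neighbours is high enough that genotyping it captures most of the information about them.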

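2 Study design:

(1) Case–Control vs. Quantitative

Case–control studies usually have a binary outcome, such as case/control or affected/unaffected. If a SNP allele is more frequent in cases than in controls, that allele is associated with increased disease risk. Quantitative studies assess quantified or continuous traits that yield a numeric value (such as HDL or LDL levels), and ask whether the frequency of a SNP or allele is associated with the quantitative trait.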

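(2) Standardizing Phenotype Criteria

A standardized definition of the phenotype is very important, especially in multi-institution collaborations. Misclassifying a patient as a control instead of a case in a case–control study can sometimes do far more damage than recording a wrong value in a quantitative study.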

(3) Testing for an Association (key point)

    1) Preparation

        Choosing a suitable method: an association analysis can relate either alleles (allelic) or genotypes (genotypic) to the phenotype, and a dominant, recessive, or additive model has to be chosen to suit the situation.

        Adjusting the dataset: use regression methods to adjust for covariates and so guard against false-positive results.

        Population substructure: this is one of the important covariates, because ethnic-specific SNPs may show up to be associated with a trait due to population stratification. It can be analyzed with STRUCTURE or EIGENSTRAT.
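The dominant, recessive, and additive models mentioned above differ only in how a genotype is turned into a predictor. The sketch below shows one common coding, assuming genotypes are stored as the number of minor alleles (0, 1, or 2). The function name `encode_genotype` is hypothetical.

```python
def encode_genotype(minor_allele_count, model):
    """Map a minor-allele count (0, 1, 2) to the predictor used by a genetic model."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2 minor alleles")
    if model == "additive":
        # risk changes with each extra copy of the allele
        return minor_allele_count
    if model == "dominant":
        # one copy is enough to show the effect
        return int(minor_allele_count >= 1)
    if model == "recessive":
        # two copies are needed to show the effect
        return int(minor_allele_count == 2)
    raise ValueError(f"unknown model: {model}")

for model in ("additive", "dominant", "recessive"):
    print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
```

The encoded value is what then goes into the contingency table or regression as the genotype term.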

    2) Single-locus vs. multi-locus

For binary traits in case–control studies, a contingency table method or logistic regression is commonly used.

    * A contingency table summarizes the number of individuals within each genotypic group for a single biallelic SNP. It searches for a deviation from the null hypothesis that there is no association between the phenotype and genotype, e.g. with the chi-square test or Fisher's exact test, available in SAS, SPSS, Stata, or Microsoft Excel.

    * Logistic regression is an extension of linear regression where the phenotypic outcome studied is transformed using a logistic function. This method predicts the probability of an individual having a case status, given their genotype class. It is the more widely used of the two because it allows adjustment for covariates.
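As a rough illustration of the contingency table approach, the following sketch computes a Pearson chi-square statistic for a case/control by genotype table using only the standard library. The counts are made up, and the closed-form p-value helper only covers 1 and 2 degrees of freedom, which is all that a 2×2 allelic or 2×3 genotypic table needs.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count under the null of no association
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def chi_square_p_value(stat, dof):
    """Upper-tail p-value; closed forms exist for 1 and 2 degrees of freedom."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2))
    if dof == 2:
        return math.exp(-stat / 2)
    raise NotImplementedError("only dof 1 and 2 are handled in this sketch")

# Hypothetical genotype counts: rows = cases, controls; columns = AA, Aa, aa
table = [[30, 50, 20],
         [50, 40, 10]]
stat, dof = chi_square_statistic(table)
print(stat, dof, chi_square_p_value(stat, dof))
```

Tables with very small expected counts are where Fisher's exact test is usually preferred over the chi-square approximation.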

For quantitative traits, Analysis of Variance (ANOVA) is commonly used. It assumes that 1) the trait is normally distributed, 2) the variance of the trait is the same within each group, and 3) the groups are independent. For single-SNP analysis, ANOVA tests the null hypothesis that the trait mean does not differ between genotype groups.
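A one-way ANOVA for a single SNP can be sketched as follows, with the trait values grouped by genotype. The data are invented, and the sketch stops at the F statistic: turning it into a p-value needs the F distribution, for example from a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of trait-value groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # variation of individuals around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical HDL values grouped by genotype AA, Aa, aa
hdl_by_genotype = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova_f(hdl_by_genotype))
```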

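PLINK is a commonly used software package for GWAS analysis. It is powerful and easy to use, and can test for association using the allelic or inheritance models, or by using the Cochran-Armitage test (a contingency table method).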
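For readers who want to see what the Cochran-Armitage trend test does, here is a small standard-library sketch. It is an illustration of the textbook formula with additive weights (0, 1, 2), not PLINK's implementation, and the genotype counts are hypothetical.

```python
import math

def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Return (z, two-sided p) for a linear trend in case proportion across genotypes."""
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls
    col_totals = [r + s for r, s in zip(cases, controls)]
    # weighted difference between case and control counts
    t_stat = sum(w * (r * n_controls - s * n_cases)
                 for w, r, s in zip(weights, cases, controls))
    sum_w2 = sum(w * w * n for w, n in zip(weights, col_totals))
    sum_w = sum(w * n for w, n in zip(weights, col_totals))
    variance = n_cases * n_controls / n_total * (n_total * sum_w2 - sum_w ** 2)
    z = t_stat / math.sqrt(variance)  # fails if every individual has the same genotype
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts per genotype (0, 1, 2 copies of the minor allele)
print(cochran_armitage(cases=[10, 20, 30], controls=[30, 20, 10]))
```

Changing `weights` to (0, 1, 1) or (0, 0, 1) turns the same test into a dominant or recessive test.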


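Analyzing one SNP at a time in a linear modeling framework contributes to the missing heritability problem mentioned earlier, so multi-locus analysis is needed: more holistic approaches that recognize the complex landscape of the genotype–phenotype relationship and examine nonlinear interactions between genetic variants throughout the genome. The biggest challenge here is that processing 500,000 SNPs consumes a huge amount of computing resources (all pairwise combinations alone come to about 1.25 × 10^11 tests), so specific filtering methods are needed to ease the computational burden.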
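Why single-SNP tests can miss an interaction is easiest to see with an extreme toy model. In the sketch below, disease risk depends only on the combination of two loci (an XOR-style penetrance table that I made up for illustration), so each locus on its own shows no effect at all, while the pair is strongly predictive. The pair count at the end reproduces the scale of the search problem.

```python
from math import comb

def penetrance(g1, g2):
    """Hypothetical disease risk: high only when exactly one locus carries the variant."""
    return 0.9 if (g1 ^ g2) else 0.1

def marginal_risk(locus, value, carrier_freq=0.5):
    """Average risk for one locus after averaging over the other locus."""
    risk = 0.0
    for other in (0, 1):
        p_other = carrier_freq if other == 1 else 1 - carrier_freq
        g1, g2 = (value, other) if locus == 0 else (other, value)
        risk += p_other * penetrance(g1, g2)
    return risk

# Each locus alone looks uninformative: carriers and non-carriers have equal risk
print(marginal_risk(0, 0), marginal_risk(0, 1))
print(marginal_risk(1, 0), marginal_risk(1, 1))
# The joint genotype is highly informative
print(penetrance(0, 0), penetrance(0, 1), penetrance(1, 0), penetrance(1, 1))
# Number of SNP pairs an exhaustive two-locus scan would have to test
print(comb(500_000, 2))
```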

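A typical single-SNP GWAS analysis first filters on MAF and LD values (which still leaves about 300,000 SNPs), and then applies a significance threshold to pick out the main-effect markers (single SNPs strongly associated with the disease).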
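The two filtering steps can be sketched like this, assuming genotypes are coded as 0/1/2 allele counts. The SNP names, genotypes, p-values, and cutoffs are all hypothetical, and LD-based pruning is left out to keep the example short.

```python
def maf_from_genotypes(genotypes):
    """MAF from per-individual allele counts (0, 1, or 2 copies of one allele)."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)

def filter_by_maf(genotype_table, min_maf):
    """Keep only SNPs whose minor allele frequency reaches min_maf."""
    return {snp: g for snp, g in genotype_table.items()
            if maf_from_genotypes(g) >= min_maf}

def main_effect_markers(p_values, threshold):
    """SNPs whose single-SNP association p-value falls below the threshold."""
    return sorted(snp for snp, p in p_values.items() if p < threshold)

# Hypothetical toy data
genotype_table = {
    "rs_a": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],  # common variant
    "rs_b": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic, carries no information
    "rs_c": [2, 2, 2, 2, 2, 2, 2, 2, 2, 1],  # rare minor allele
}
kept = filter_by_maf(genotype_table, min_maf=0.1)
print(sorted(kept))
print(main_effect_markers({"rs_a": 1e-9, "rs_x": 0.2}, threshold=5e-8))
```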

Another filtering approach checks whether markers interact within a given pathway or protein family: the dataset can also be filtered so that only those multi-marker interactions will be examined that fit within a certain biological context such as a biological pathway, protein family, and group of genes or proteins involved in a certain molecular function.

The Biofilter algorithm combines biomedical knowledge from multiple public repositories with statistical methods such as logistic regression or the multifactor dimensionality reduction (MDR) method to analyze SNP–SNP combinations.

    3) Post-analysis correction

The p-value is defined as the probability of observing a test statistic that is equal to or greater than the observed test statistic, if the null hypothesis is true. (See also: the problems with p-values.)

Multiple-testing correction methods commonly used in GWAS include:

    * The Bonferroni correction

    * Adjusting the False Discovery Rate (FDR)

    * Using permutation testing to adjust the significance threshold, with PLINK, PRESTO, or PERMORY
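The three corrections listed above can be sketched in a few lines each. This is a simplified illustration with invented data: the FDR step uses the Benjamini-Hochberg procedure (my choice, the chapter summary only says "FDR"), and the permutation function handles a single SNP with a simple mean-difference statistic rather than the genome-wide procedures implemented in PLINK, PRESTO, or PERMORY.

```python
import random

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of the hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank  # largest rank that passes its own threshold
    return sorted(order[:cutoff_rank])

def permutation_p_value(genotypes, phenotypes, n_perm=1000, seed=0):
    """Empirical p-value for the case/control difference in mean allele count."""
    def statistic(labels):
        cases = [g for g, y in zip(genotypes, labels) if y == 1]
        controls = [g for g, y in zip(genotypes, labels) if y == 0]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))
    observed = statistic(phenotypes)
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        if statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

print(bonferroni_threshold(0.05, 500_000))
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]))
print(permutation_p_value([2, 2, 1, 2, 1, 0, 0, 1, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]))
```

Bonferroni is the most conservative of the three, which is why the conventional genome-wide thresholds are so small.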

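(4) Replication of results

The sole purpose of replication is to evaluate the initial positive findings of a GWAS and confirm their validity and credibility.

    1) Statistical Replication

Statistical replication requires the following:

    * A sufficiently large sample size. This is critical because of the winner's curse: the effect observed in the GWAS study population is overestimated, i.e. it is larger than the true effect in the population.

    * Replication must be carried out in an independent dataset drawn from the same population, and should use the same criteria to define the phenotype in question.

    * Because GWAS markers are chosen on the basis of LD patterns, the aim should be to replicate a genomic region, not necessarily the specific SNP found in the initial study.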
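The winner's curse is easy to reproduce in a simulation. The sketch below draws many hypothetical studies of the same SNP, each estimating the same true effect with noise, and compares the average estimate across all studies with the average among only those that crossed a significance cutoff. All parameter values are arbitrary choices for illustration.

```python
import random

def simulate_winners_curse(true_effect=0.1, std_error=0.05, z_crit=3.0,
                           n_studies=10_000, seed=42):
    """Return (mean of all estimates, mean of significant estimates, number significant)."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    # only studies whose z-score clears the cutoff get reported as discoveries
    significant = [b for b in estimates if b / std_error > z_crit]
    mean_all = sum(estimates) / len(estimates)
    mean_sig = sum(significant) / len(significant) if significant else None
    return mean_all, mean_sig, len(significant)

mean_all, mean_sig, n_sig = simulate_winners_curse()
print(mean_all, mean_sig, n_sig)
```

Because a study only "wins" when its estimate is large, the reported effect is biased upward, and a replication study sized from that inflated effect will be underpowered.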

    2) Meta-analysis

Meta-analysis is a statistical method for combining several different studies to provide one summary result. It aims to examine the effect of the same allele across all studies, on the condition that all studies test the same hypothesis. Heterogeneity can be measured with Cochran's Q or the I² statistic.
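As a minimal sketch, the function below pools per-study effect estimates with inverse-variance (fixed-effect) weights and reports Cochran's Q and I². The inverse-variance weighting is my assumption of a common choice, since the notes do not say which pooling scheme the chapter uses, and the study values are invented.

```python
import math

def fixed_effect_meta(betas, std_errors):
    """Return (pooled effect, pooled SE, Cochran's Q, I^2 in percent)."""
    weights = [1 / se ** 2 for se in std_errors]  # more precise studies count more
    total_w = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_w
    pooled_se = math.sqrt(1 / total_w)
    # Q: weighted squared deviations of each study from the pooled effect
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    dof = len(betas) - 1
    # I^2: share of the variation attributable to heterogeneity rather than chance
    i2 = max(0.0, (q - dof) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, q, i2

# Hypothetical per-study log odds ratios for the same allele
print(fixed_effect_meta([0.12, 0.20, 0.15], [0.05, 0.08, 0.04]))
```

A large Q relative to its degrees of freedom, or a high I², suggests the studies are not estimating the same effect and a simple fixed-effect summary may be misleading.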

    3) Data Imputation

The imputation procedure makes use of the known LD and haplotype patterns in reference panels to estimate genotypes for SNPs that were not directly genotyped within a study. Commonly used algorithms include BimBam, IMPUTE, MaCH, and Beagle, all of which are based on haplotype phasing algorithms that estimate the contiguous set of alleles that lie on a specific chromosome.
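The idea of borrowing information from a reference panel can be shown with a deliberately naive toy: find the reference haplotypes that best match the typed positions and copy their most common allele into the untyped position. This is only a conceptual sketch with made-up haplotypes. It is not how BimBam, IMPUTE, MaCH, or Beagle work internally, since those use probabilistic models.

```python
from collections import Counter

def impute_haplotype(observed, reference_panel, missing="?"):
    """Fill untyped sites using the closest-matching reference haplotypes."""
    typed = [i for i, allele in enumerate(observed) if allele != missing]

    def mismatches(ref):
        return sum(ref[i] != observed[i] for i in typed)

    best = min(mismatches(ref) for ref in reference_panel)
    closest = [ref for ref in reference_panel if mismatches(ref) == best]
    filled = []
    for i, allele in enumerate(observed):
        if allele != missing:
            filled.append(allele)
        else:
            # most common allele at this site among the closest haplotypes
            filled.append(Counter(ref[i] for ref in closest).most_common(1)[0][0])
    return "".join(filled)

# Hypothetical reference panel of phased haplotypes over four SNPs
panel = ["ACGT", "ACGT", "TCAT", "TGAA"]
print(impute_haplotype("A?GT", panel))
```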

IV. Outlook

As the content of genotyping chips, cohort sizes, and biobanks grow even larger, the challenges of data manipulation, quality control, strong study design, and strict phenotypic definitions grow more complex. Hence, moving forward human geneticists will have to develop bioinformatics infrastructure and expertise to overcome such challenges. Most importantly, scientists will have to combine their bioinformatics efforts with genetics, biochemistry and cell biology to confirm the functional consequence and biological relevance of the genotype–phenotype associations that are identified.


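The chapter concisely lays out the basic concepts and principles of GWAS analysis in clinical medicine and the choice and use of association models. In particular, it points out the shortcomings of current GWAS and how to avoid errors in practice. I recommend reading this introduction first when learning GWAS, and then looking up unfamiliar terminology and the usage of common software according to your own level. You are also welcome to read another Jianshu article, "Basic GWAS analysis".

GWAS was proposed more than ten years ago. It has played an important role, but it has many problems (see Further reading) and plenty of room for improvement. As the authors say at the end, under Future Directions: 'Ultimately, the translation of GWAS findings into clinical practice will rely upon correct assumptions regarding the genetic architecture of complex traits especially in the context of gene–gene and gene–environment interactions.'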

References:

See the original chapter

Recommended: GWAS – Science topic

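Further reading:

The dilemma of GWAS and new thinking on genetic models

GWAS in the whirlpool: how much is genome-wide disease research worth?

How much further can GWAS go? Reflections after ten years

RVAS (rare variant association study) becomes the new research favorite, surpassing GWAS

How are sample size and the validity of results related in GWAS research?

What is genotype imputation in GWAS?

Using Plink for GWAS analysis of CNVs (part 1)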
