AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model

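About the author: Žiga Avsec, Ph.D.
His move from physics to computational genomics marks an important step in the convergence of artificial intelligence and genetic research. He went from Slovenia to Munich, where he explored the workings of DNA under the supervision of Julien Gagneur and contributed to tools such as Kipoi and BPNet, which have advanced our understanding of genomics.
At Google DeepMind, Žiga's work on Enformer and AlphaMissense is breaking new ground in identifying genetic variants and in the fight against genetic disease. His story offers a glimpse of the future of healthcare: AI-driven genomic discoveries that will transform personalized medicine and disease treatment.
A more detailed profile is available at: https://blog.superbio.ai/superbio-scientist-spotlight-%C5%BEiga-avsec-ph-d-2225dacc2b9b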
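1. Background
With the rise of large language models, the transformer has given us a better tool for deciphering the genomic code. Earlier approaches based on k-mers and short sequences are gradually being replaced, and long-sequence deep learning models can learn more genomic information from longer stretches of DNA: modeling enhancers that interact with promoters over long distances; judging whether a single-base mutation disrupts a key regulatory site; and observing the effect of a variant across all relevant layers to reconstruct the full causal chain of pathogenesis.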
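2. Abstract
- Goal
  Develop a deep learning model that predicts functional genomics measurements (for example gene expression and chromatin accessibility) from DNA sequence, in order to decode the gene regulatory code.
- Existing problem
  Current models face a key trade-off: they either process long input sequences at low prediction resolution, or predict at high resolution but handle only short sequence fragments. This limits both the number of functional modalities (data types) they can predict and their predictive performance.
- Proposed solution: AlphaGenome
  AlphaGenome resolves this "length vs. resolution" trade-off. It accepts DNA sequence inputs of up to 1 megabase (1 Mb), roughly 1/3000 of the human genome, which covers a much broader regulatory context (such as distal enhancers and topologically associating domain boundaries). On inputs of this length it predicts thousands of functional genomics tracks at single-base-pair resolution.
- Predictions cover a highly diverse set of functional modalities, including:
  1. Gene expression levels
  2. Transcription start sites
  3. Chromatin accessibility (e.g. ATAC-seq)
  4. Histone modifications (e.g. H3K27ac, H3K4me3)
  5. Transcription factor binding sites
  6. 3D chromatin conformation (chromatin contact maps, e.g. Hi-C)
  7. Splice site usage
  8. Splice junction coordinates and junction strength
- Model training and performance evaluation:
  Training data: the model is trained on human and mouse genomic data.
  Evaluation: the main focus is the model's ability to predict the effects of genetic variants (such as SNPs). This is the key test of whether the model has really learned the sequence-to-function relationship.
  Results: across 26 independent evaluations against the strongest available external models (e.g. Enformer, Basenji2), AlphaGenome matches or exceeds their performance on 24.
- Key applications and value:
  Multimodal variant effect scoring: AlphaGenome's core strength is that it can simultaneously predict the effect of a variant (such as a pathogenic SNP) on all of the thousands of functional tracks listed above.
  Revealing pathogenic mechanisms: taking clinically relevant variants near the TAL1 oncogene as an example, AlphaGenome accurately reproduces the mechanism by which the variant acts across several functional layers (altering a transcription factor binding site, changing chromatin state, and in turn gene expression). This gives an integrated view of the genetic basis of complex disease.
- Availability:
  Tool release: to encourage wider use, the authors provide tools that let users run AlphaGenome for genome track prediction and variant effect scoring.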
3. Model construction
3.1 Data preparation
- Genome data
Input sequences were extracted from the hg38 (human) and mm10 (mouse) reference genomes. For sequence intervals that extended beyond chromosomal boundaries, padding with ‘N’ characters was used to ensure consistent input length.
- Track details

| Category | Assay (source) | Human | Mouse |
|---|---|---|---|
| All tracks | | 5930 | 1128 |
| Gene expression | RNA-seq (ENCODE and GTEx) | 667 | 173 |
| Gene expression | CAGE (FANTOM5) | 546 | 188 |
| Gene expression | PRO-cap (ENCODE) | 12 | 0 |
| Detailed splicing patterns | Splice sites (ENCODE and GTEx, realigned using STAR) | 4 | 4 |
| Detailed splicing patterns | Splice site usage (computed by formula) | 734 | 180 |
| Detailed splicing patterns | Splice junctions (splicemap package) | 734 | 180 |
| Chromatin state | DNase (ENCODE) | 305 | 67 |
| Chromatin state | ATAC-seq (ENCODE) | 167 | 18 |
| Chromatin state | Histone modifications (ENCODE) | 1116 | 183 |
| Chromatin state | TF binding (ENCODE) | 1617 | 127 |
| Chromatin contact maps | Hi-C / micro-C (4D Nucleome) | 28 | 8 |
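
As a quick sanity check on the track table, the per-assay counts can be summed and compared with the stated totals. The snippet below is only a bookkeeping sketch: the dictionary name and helper function are made up for illustration, and the numbers are copied from the table above.

```python
# Per-assay track counts copied from the track table: (human, mouse).
TRACK_COUNTS = {
    "RNA-seq": (667, 173),
    "CAGE": (546, 188),
    "PRO-cap": (12, 0),
    "splice sites": (4, 4),
    "splice site usage": (734, 180),
    "splice junctions": (734, 180),
    "DNase": (305, 67),
    "ATAC-seq": (167, 18),
    "histone modifications": (1116, 183),
    "TF binding": (1617, 127),
    "contact maps": (28, 8),
}


def total_tracks(counts):
    """Return (human_total, mouse_total) summed over all assays."""
    human = sum(h for h, _ in counts.values())
    mouse = sum(m for _, m in counts.values())
    return human, mouse


print(len(TRACK_COUNTS), total_tracks(TRACK_COUNTS))  # 11 (5930, 1128)
```

The eleven rows add up to the 5930 human and 1128 mouse tracks given in the "All tracks" row.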
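3.2 Model construction

3.2.1 Model architecture (Fig. 1a)
Core design: U-Net-style hierarchical processing
①. Input processing:
- Sequence input: a 1 Mb DNA sequence (about 1,000,000 bp)
- Species identifier: distinguishes the human and mouse genomes
- Parallelization strategy: the 1 Mb sequence is split into 131 kb chunks that are distributed across multiple compute devices (GPU/TPU)
②. Three-stage processing pipeline:
| Stage | Function | Key techniques |
|---|---|---|
| Encoder | Downsamples and compresses the sequence: extracts local features (such as transcription factor binding sites) | Convolutional layers (capture motif features) + pooling (downsampling) |
| Transformer | Models long-range dependencies: enhancer-promoter interactions at a distance, chromatin domain structure | Attention with cross-device communication (covers the full 1 Mb context) |
| Decoder | Upsamples the sequence: reconstructs high-resolution output | Transposed convolution (upsampling) + skip connections (preserve detail) |
③. Task-specific output heads:
- Multi-task adaptation: heads attached to the end of the decoder produce predictions for 11 experimental data types
- Per-modality resolution: the output resolution is set independently for each data type (e.g. single base or 128 bp bins)
- Prediction scale: 5,930 human tracks or 1,128 mouse tracks are output simultaneously
Technical significance: the U-Net structure resolves the tension between long sequences and high resolution. The encoder extracts abstract features, the Transformer models global interactions, and the decoder restores positional detail.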
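
To make the chunking arithmetic concrete, here is a minimal sketch of how a 1 Mb window could be partitioned into 131 kb chunks. It assumes that "1 Mb" stands for exactly 2^20 bp and "131 kb" for exactly 2^17 bp; the text only gives the rounded values, so both constants are assumptions, and `chunk_intervals` is a hypothetical helper, not part of any AlphaGenome API.

```python
def chunk_intervals(seq_len, chunk_len):
    """Split [0, seq_len) into consecutive half-open intervals of chunk_len bp."""
    if seq_len % chunk_len != 0:
        raise ValueError("seq_len must be a multiple of chunk_len")
    return [(start, start + chunk_len) for start in range(0, seq_len, chunk_len)]


SEQ_LEN = 2 ** 20    # assumed exact length behind "1 Mb" (1,048,576 bp)
CHUNK_LEN = 2 ** 17  # assumed exact length behind "131 kb" (131,072 bp)

chunks = chunk_intervals(SEQ_LEN, CHUNK_LEN)
print(len(chunks), chunks[0], chunks[-1])  # 8 (0, 131072) (917504, 1048576)
```

Under these assumptions the window maps onto eight chunks, one per device, with only the Transformer stage needing to exchange information between them.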
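3.2.2 Training strategy (Fig. 1b-c)
Stage ①: teacher model training (Fig. 1b)
- Data preparation:
  Sampled regions: 1 Mb intervals are drawn from the cross-validation splits of the human and mouse genomes
  Data augmentation: random shifts (simulating changes in the position of regulatory elements) and reverse complementation (encouraging invariance to strand orientation)
- Training objective:
  Directly predict experimentally measured functional genomic signals (such as ChIP-seq peaks and RNA expression levels)
  Two kinds of teacher model are produced:
  Fold-specific: expert models trained on the data of a single cross-validation fold
  All-folds: a general model trained on all of the data
Stage ②: student model distillation (Fig. 1c)
- Distillation procedure:
  Frozen teacher: the parameters of the All-folds teacher model are fixed
  Student input: mutations are introduced into the original sequence (simulating natural variation)
  Learning objective: the student reproduces the teacher's predictions on the perturbed sequences
- Key advantages:
  Specialization for variant prediction: the student focuses on learning the mapping from sequence variation to functional change
  Lightweight model: a single efficient inference model is produced (avoiding the compute cost of ensembling several teachers)
Biological significance: the teacher-student framework distills "function prediction" ability into "variant effect prediction" ability, improving accuracy for clinical applications.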
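
The augmentations and the student-input perturbation described above can be illustrated on plain strings. This is a toy sketch only: the function names, the 'N' padding used when shifting, and the uniform choice of substitution are assumptions made for illustration and are not taken from the AlphaGenome training code.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def random_shift(seq, max_shift, rng):
    """Shift the sequence by up to max_shift bp, padding the gap with 'N'."""
    if max_shift >= len(seq):
        raise ValueError("max_shift must be smaller than the sequence length")
    shift = rng.randint(-max_shift, max_shift)  # inclusive on both ends
    if shift > 0:
        return "N" * shift + seq[: len(seq) - shift]
    if shift < 0:
        return seq[-shift:] + "N" * (-shift)
    return seq


def mutate(seq, n_mutations, rng):
    """Substitute n_mutations distinct positions with a different base."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_mutations):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


rng = random.Random(0)
seq = "ACGTACGTAC"
print(reverse_complement(seq))
print(random_shift(seq, 3, rng))
print(mutate(seq, 2, rng))
```

In a real pipeline a shifted window would be re-extracted from the reference genome rather than padded; padding is used here only to keep the example self-contained.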
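3.2.3 Performance evaluation (Fig. 1d-e)
①. Genome track prediction performance (Fig. 1d)
Evaluation metric:
Relative performance improvement in % (classification tasks need to be normalized)
Key results:
| Modality | Representative task | Improvement | Technical significance |
|---|---|---|---|
| Transcriptional regulation | RNA expression prediction | Substantial | Captures long-range enhancer interactions |
| Chromatin conformation | Hi-C contact map prediction | Largest | Models 3D structure at the 1 Mb scale |
| Epigenetics | H3K27ac histone modification prediction | Moderate | Identifies open chromatin regions |
| RNA processing | Polyadenylation (PA) site identification | Substantial | Precisely locates post-transcriptional regulatory sites |
Note: improvements on 128 bp-resolution tasks are generally smaller than on single-base tasks, because baseline models already perform well at that resolution.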
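②. Variant effect prediction performance (Fig. 1e)
- Evaluation scenarios:
  Functional variants: predict the effect of non-coding SNPs on molecular phenotypes
  Causal inference: assess the direction of effect at quantitative trait loci (ds/caQTL)
- Core results:
  Matches or exceeds the baseline on 24/26 tasks, spanning chromatin accessibility (ATAC), transcription factor binding (ChIP), gene expression (eQTL), and others
  Direction of effect: accuracy in judging "whether a variant changes the molecular phenotype" improves by 15-25%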
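  Supporting case: mechanism of clinically relevant variants near the TAL1 oncogene (joint multimodal prediction traces the chain: non-coding variant → new MYB binding motif → gain of active chromatin marks → increased TAL1 expression)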
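3.2.4 Summary of technical advances
| Dimension | Innovation | Core problem addressed |
|---|---|---|
| Architecture | U-Net + cross-device Transformer | Reconciling 1 Mb input with single-base resolution |
| Training strategy | Two-stage teacher-student distillation | Specialized optimization for variant effect prediction |
| Multimodal output | Parallel prediction of 11 data types / thousands of tracks | Systematic dissection of how variants cause disease |
| Engineering | Parallel computation over 131 kb chunks | Overcoming GPU memory limits to process megabase inputs |
| Evaluation | 26 rigorous tests (including reproduction of a clinical variant mechanism) | Showing that the model generalizes across basic research and clinical applications |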
3.3 AlphaGenome model architecture

Extended Data Figure 1 | AlphaGenome model architecture. (a) Overview schematic illustrating the flow of activations through the model. The architecture follows a U-Net-like structure with an Encoder, a central Transformer Tower, and a Decoder processing a 1Mb DNA input sequence. The Encoder uses convolutional blocks and max pooling to progressively downsample the sequence resolution (from 1 bp to 128 bp) while increasing feature channels. The Transformer Tower operates at 128 bp resolution, iteratively refining sequence representations and generating pairwise (2D) representations. The Decoder uses convolutional blocks and upsampling, incorporating skip connections (dashed lines) from corresponding Encoder stages, to restore sequence resolution up to 1 bp. An Output Embedder performs final processing before feeding representations to task-specific output heads. (b) Internal structure of key component blocks used repeatedly within the architecture overview shown in (a). Diagrams detail the layers within the convolutional blocks (Conv block, Upres block), the Transformer blocks, and the blocks responsible for generating and updating pairwise representations (Pair update block, Sequence to pair block). Tensor shapes are shown excluding the batch dimension. Abbreviations: r = log-resolution, c = channels.
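
The downsampling path in the caption (1 bp to 128 bp) can be tabulated with a few lines of arithmetic. This sketch assumes each encoder stage pools by a factor of 2 and that the input is exactly 2^20 bp; the caption states only the start and end resolutions, so the per-stage factor, the exact length, and the function name are assumptions.

```python
def encoder_schedule(seq_len, final_resolution, pool=2):
    """List (resolution_bp, n_positions) after each assumed pooling stage."""
    stages = [(1, seq_len)]
    resolution, length = 1, seq_len
    while resolution < final_resolution:
        resolution *= pool
        length //= pool
        stages.append((resolution, length))
    return stages


schedule = encoder_schedule(2 ** 20, 128)
for resolution, length in schedule:
    print(f"{resolution:>4} bp/position -> {length:>8} positions")
```

Under these assumptions seven halvings take the 1,048,576-position input down to 8,192 positions at 128 bp, which is the sequence length the Transformer tower would attend over; the decoder then walks the same list in reverse.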
4. Results
Here I go through the two Results sections that interest me most.
4.1 AlphaGenome enables state-of-the-art enhancer-gene linking
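AlphaGenome needs no task-specific training for enhancer-promoter (E-P) linking (that is, it is used "zero-shot"). Its Transformer module uses self-attention to pick up long-range regulatory dependencies in the sequence automatically. For example:
- Enhancer-specific transcription factor binding motifs (such as MYB and CTCF) are captured by the local convolutional layers;
- The Transformer links these local signals to distal promoters, forming hypotheses about functional connections.

Zero-shot performance rivals supervised models:
- For enhancers more than 10 kb from the TSS, AlphaGenome clearly outperforms Borzoi (relative auPRC improvement of 17-25%);
- Compared with the ENCODE-rE2G-extended model, which is trained specifically for E-P linking, the performance gap is under 1% auPRC.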
Figure 4 | AlphaGenome predicts the effect of variants on gene expression. (j) Enhancer-gene linking performance (ENCODE-rE2G CRISPRi dataset [17]). Zero-shot evaluation: Performance (auPRC) comparison stratified by enhancer-TSS distance for AlphaGenome (distilled) vs Borzoi vs TSS distance baseline. Supervised evaluation: AlphaGenome input gradient score integrated into ENCODE-rE2G-extended vs ENCODE-rE2G models.
Extended Data Figure 7 | AlphaGenome improves enhancer-gene linking using input gradients and shows enhanced sensitivity to distal enhancers. (b) Impact of incorporating AlphaGenome’s input gradient score as a feature in the ENCODE-rE2G extended logistic regression model, evaluated on the ENCODE-rE2G benchmark. ENCODE-rE2G is a logistic regression model trained to predict enhancer-gene interactions from features [2]. Precision-recall curves are shown, colored by the feature sets used for training the regression model (auPRC values indicated in the legend). Feature sets are:
- rE2G extended with AlphaGenome features: All ENCODE-rE2G extended model features plus a single AlphaGenome input x gradient score.
- AlphaGenome features only: The AlphaGenome input x gradient score alone.
- TSS distance with AlphaGenome features: AlphaGenome input x gradient score plus the distance to TSS feature.
- rE2G extended: All features from the ENCODE-rE2G extended model [2].
- TSS distance: Distance to TSS feature from [2].
- ABC features only: Subset of ‘rE2G extended’, with only features related to the Activity-By-Contact (ABC) model [2].

(c) Precision-recall curves for the ENCODE-rE2G benchmark, similar to panel (b), evaluating the ENCODE-rE2G extended regression model with different feature sets. Area under the precision-recall curve (auPRC) values for the different feature sets are indicated in the legend. In this configuration, ‘AlphaGenome features’ consist of a more comprehensive set of K562 cell line-specific variant effect scores. These include Allele-Specific Activity Scores (AAS) and variant effect scores calculated as the difference between alternate (ALT) and reference (REF) allele predictions (ALT-REF Diff scores). These scores were derived from AlphaGenome for the following genomic assays:
- RNA-seq of the target gene
- ChIP-TF EP300
- ChIP-Histone H3K27ac
- CAGE
- PRO-cap
- H1-ESC contact maps
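
The "input x gradient" score named in the caption above is a generic attribution technique: multiply the one-hot input by the gradient of a model output with respect to that input, then sum over the base axis to get one value per position. The sketch below shows the computation on a toy linear scorer whose gradient is known in closed form. The linear scorer, the random weights, and all function names are stand-ins chosen for illustration; they say nothing about how AlphaGenome itself computes or aggregates the score.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x


def input_x_gradient(x, grad):
    """Per-position attribution: sum over bases of input * gradient."""
    return (x * grad).sum(axis=1)


# Toy stand-in for a model: f(x) = sum(W * x), so df/dx is exactly W.
rng = np.random.default_rng(0)
seq = "ACGTAC"
W = rng.normal(size=(len(seq), 4))
x = one_hot(seq)

attribution = input_x_gradient(x, W)
print(attribution.shape)  # (6,)
```

For a one-hot input the result simply picks out the gradient entry of the observed base at each position. A region-level score (for example over a candidate enhancer) could then be obtained by summing the per-position values inside the region, which is again an assumption about aggregation rather than the published recipe.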
4.2 AlphaGenome improves on predicting variant effects on chromatin accessibility and transcription factor binding
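Two key problems are addressed:
- QTL effect prediction:
  Determine whether a non-coding variant (such as a SNP) affects chromatin accessibility (caQTL), DNase sensitivity (dsQTL), or transcription factor binding (bQTL)
  Quantify the effect size of the variant on these molecular phenotypes
- MPRA activity prediction:
  Predict the regulatory activity of short DNA sequences (reporter gene expression level)
  Work out how local sequence variation regulates gene expression through chromatin state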

Figure 5 | AlphaGenome accurately predicts variant effects on chromatin accessibility and SPI1 transcription factor binding. (a) Schematic of the center-mask variant scoring strategy. This approach, detailed in Methods, is used for accessibility (DNase-seq, ATAC-seq) and ChIP-seq predictions. (b) Performance comparison on QTL causality prediction. Average Precision (AP) for AlphaGenome, Borzoi, and ChromBPNet across QTL types (caQTL, dsQTL, bQTL) and ancestries. (c) Performance comparison on QTL effect size prediction. Pearson r is shown for AlphaGenome, Borzoi, and ChromBPNet across QTL types (caQTL, dsQTL, bQTL) and ancestries. (d) AlphaGenome’s predicted versus observed effect sizes for causal caQTLs (African ancestry). Scatterplot displays predictions using the DNase track for the GM12878 cell line. Signed Pearson r = 0.74; unsigned Pearson r = 0.45. Signed Pearson r correlation uses raw values; unsigned Pearson r uses absolute values. Red and blue circles highlight variants detailed in (e, f). (e) Example AlphaGenome predictions for selected caQTLs. Shown are ALT-REF differences in predicted DNase track (GM12878) around the variants highlighted in (d). (f) ISM-derived sequence logos for REF and ALT alleles of example caQTLs from (e). The examples suggest variant disruption or modulation of TF binding motifs. Putative binding factors and JASPAR [39] matrix IDs (MA0105.1, MA0105.3) are indicated on the right. (g) AlphaGenome’s predicted versus observed effect sizes for causal SPI1 bQTLs. Scatterplot displays predictions using the SPI1 ChIP-seq track for the GM12878 cell line. Signed Pearson r = 0.55; unsigned Pearson r = 0.12. Red and blue circles highlight variants detailed in (h, i). (h) Example AlphaGenome predictions for selected SPI1 bQTLs. Shown are ALT-REF differences in predicted SPI1 ChIP-TF track (GM12878) around the variants highlighted in (g). (i) ISM-derived sequence logos for REF and ALT alleles of example SPI1 bQTLs from (h). Examples indicate potential motif impacts such as creation or disruption of SPI1 or related motifs. Putative binding factors and JASPAR matrix IDs (MA0081.2, MA0080.5) are indicated on the right. (j) CAGI5 MPRA challenge performance (average across loci). Top: Average zero-shot Pearson r performance, using cell type-matched raw DNase model outputs. Middle: Average Pearson r from LASSO regression using cell type-matched or cell type-agnostic DNase outputs. Bottom: LASSO regression Pearson r performance using features from multiple modalities and the full set of cell types (DNase + RNA + ChIP-Histone output types for AlphaGenome and Borzoi; DNase + CAGE output types for Enformer).

Supplementary Figure 9 | Additional accessibility variant analysis. Extended evaluation of variant effect prediction on chromatin accessibility across diverse contexts. AP = average precision (auPRC). Signed Pearson R correlation uses raw values; unsigned Pearson R uses absolute values first. (a) Precision-Recall curves comparing AlphaGenome, Borzoi, and ChromBPNet performance on caQTL causality prediction in European ancestry. (b) Scatterplot comparing AlphaGenome’s predicted versus observed effect sizes (Coefficient) for causal caQTL variants in European ancestry. (c) Precision-Recall curves comparing AlphaGenome, Borzoi, and ChromBPNet performance on dsQTL causality prediction in Yoruba ancestry. (d) Scatterplot comparing AlphaGenome’s predicted versus observed effect sizes (Coefficient) for causal dsQTL variants in Yoruba ancestry. (e) Precision-Recall curves comparing model performance for caQTL causality prediction (African ancestry). (f) Effect size prediction for microglia causal caQTL variants. Scatterplot compares observed effects versus AlphaGenome’s predicted DNase effects in a closely-related available cell type (suppressor macrophage). (g) Effect size prediction for cardiac smooth muscle cell (SMC) causal caQTL variants. Scatterplot compares observed effects versus AlphaGenome’s predicted ATAC effects in a closely-related available cell type (left cardiac atrium ATAC). (h) Precision-Recall curves comparing model performance for SPI1 bQTL causality prediction.
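
The captions distinguish a signed Pearson r (computed on raw effect sizes) from an unsigned one (computed on absolute values). The following minimal sketch shows why the two can differ sharply; the four toy data points are invented purely for illustration and are not results from the paper.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(np.asarray(a, float), np.asarray(b, float))[0, 1])


def signed_and_unsigned_r(predicted, observed):
    """Signed r uses raw values; unsigned r uses absolute values."""
    signed = pearson_r(predicted, observed)
    unsigned = pearson_r(np.abs(predicted), np.abs(observed))
    return signed, unsigned


# Toy effect sizes: every sign is predicted correctly, but magnitudes are swapped.
predicted = [-2.0, -1.0, 1.0, 2.0]
observed = [-1.0, -2.0, 2.0, 1.0]

signed, unsigned = signed_and_unsigned_r(predicted, observed)
print(round(signed, 3), round(unsigned, 3))  # 0.8 -1.0
```

A high signed r with a low unsigned r, as in this toy case, means the direction of effect is captured while the ranking of effect magnitudes is not, which is one way to read the gap between the two values reported in the captions.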
More information about AlphaGenome is available from Google DeepMind: https://deepmind.google/discover/blog/alphagenome-ai-for-better-understanding-the-genome/
AlphaGenome GitHub repository:
https://github.com/google-deepmind/alphagenome
