Corresponding author: Li Ding
Director of Computational Biology, Oncology
Washington University School of Medicine, St. Louis, MO

1. Sequencing strategies
Shift from WES to WGS: WGS data are therefore considered to be the unbiased 'gold standard'.
1.1 Traditional sequencing analyses
In practice, detection of all germline and somatic aberrations is a formidable challenge owing to limitations in current analysis algorithms, as well as to the quantity and quality of sequence data.
1.2 Subclonal analyses
Cancer progression has long been known to be a fundamentally clonal process, and sequence coverage is now becoming sufficiently large to permit detection of the low-prevalence events that are routinely associated with tumour subclones. Multisite and/or multistage sequencing and tumour sectioning experiments have begun to identify founding clones and subclones that contribute to cancer progression.
1.3 Single-cell sequencing
Pioneering work on assessing CNAs in multiple tumour subpopulations was followed by single-cell sequencing using whole-genome amplification (WGA) of DNA extracted from nuclei that were sorted by flow cytometry.
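Several challenges remain, such as amplification bias in degenerate oligonucleotide-primed WGA and in multiple displacement amplification techniques. (In degenerate oligonucleotide-primed WGA, the primers carry a 6 bp random sequence at the 3' end that binds genomic DNA at random positions and thereby amplifies the whole genome. Multiple displacement amplification uses random primers and isothermal amplification to produce large, high-fidelity DNA fragments, but its main drawbacks are uneven genome coverage, amplification bias, chimeric sequences and nonspecific amplification.) These biases produce uneven coverage and therefore make it difficult to determine somatic changes, including SNVs, CNAs and structural aberrations. Because one of the two alleles is preferentially amplified, detection sensitivity is affected most by allelic dropout, with reported allelic dropout rates of 8-40%. Large CNAs can still be detected at low genome coverage (for example, 5-6%), whereas uneven coverage makes the analysis of smaller CNAs and structural variants extremely difficult.
2. Dissecting genomic changes in cancer
The table below lists computational tools for annotating and interpreting mutations in cancer genomes.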
| Program | Function | Synopsis | Refs |
|---|---|---|---|
| SNV and indel detection | |||
| Bassovac | SNV and indel detection | Bayesian approach with tumour or normal impurity and clonality | – |
| GATK | SNV and indel detection | Analysis framework using MapReduce | 23 |
| JointSNVMix | SNV detection | Binomial/multinomial probability with pre-filtering | 31 |
| MuTect | SNV and indel detection | Bayesian probability with pre- and post-filtering | 28 |
| Pindel | Indel detection | Pattern growth learning method | 38 |
| SNVMix | SNV detection | Binomial mixture model | 30 |
| SomaticSniper | SNV and indel detection | Bayesian probability with posterior filtering | 27 |
| Strelka | SNV and indel detection | Bayesian probability with posterior filtering | 29 |
| VarScan | SNV and indel detection | Fisher exact test, filtering and FDR correction | 24,25 |
| Copy-number aberration, structural variant and gene fusion detection | |||
| BreakDancer | Structural variant and indel detection | Kolmogorov–Smirnov test on discordant reads | 54 |
| BreakFusion | Gene fusion detection | Alignment-based pipeline for transcriptomic data | 68 |
| BreakTrans | Gene fusion mapping | Integration of fusion discovery and breakpoint tools | 73 |
| ChimeraScan | Chimeric transcription detection | Discordant read pairs with posterior filtering | 67 |
| CREST | Structural variant detection | Heuristics and binomial test on soft-clipped reads | 55 |
| deFuse | Gene fusion detection | Dynamic programming split and discordant reads | 65 |
| DELLY | Structural variant detection | Integrated method of discordant and split reads | 40 |
| GASV-Pro | Structural variant detection | Plane sweep for segment intersection | 57 |
| Genome STRiP | Structural variant detection | Depth and split or discordant reads on populations | 59 |
| Hydra | Structural variant detection | Discordant reads with assembly validation | 139 |
| LUMPY | Structural variant detection | Integrated method of discordant and split reads | 167 |
| TIGRA | Structural variant detection | De Bruijn graph-based assembly | 42 |
| Level I annotation and interpretation | |||
| ABSOLUTE | Purity, ploidy and clonality prediction | Optimization of logarithmic scores | 148 |
| ANNOVAR | Functional prediction | Annotation-based prediction | 74 |
| ASCAT | Purity, ploidy and clonality prediction | Goodness-of-fit ranking of candidate solutions | 168 |
| TUSON Explorer | Gene classification | Oncogene or tumour suppressor discovery using mutational signatures | 100 |
| CHASM | Functional prediction | Random forest classifier | 84,85 |
| MutationAssessor | Functional prediction | Conservation-based prediction (entropy score) | 83 |
| PolyPhen2 | Functional prediction | Probability model based on structure and alignment | 81,169 |
| SciClone | Tumour clonality prediction | Bayesian mixture model | – |
| SIFT | Functional prediction | Conservation-based prediction | 82 |
| SNPeff | Functional prediction | Annotation and coding effect prediction | 75 |
| THetA | Purity, ploidy and clonality prediction | Maximum likelihood of mixture composition | 151 |
| VEP | Functional prediction | Annotation-based prediction | 170 |
| Level II annotation and interpretation | |||
| Dendrix | Mutation analysis | De novo discovery of mutually exclusive mutations | 128 |
| HotNet | Network analysis | Diffusion model for significant networks | 119 |
| MEMo | Network analysis | Network modules with mutual exclusivity | 122 |
| MuSiC | Mutation analysis | Framework for significance analysis of mutations | 92 |
| Multi-Dendrix | Mutation analysis | De novo discovery of multiple sets of exclusive mutations | 129 |
| MutSigCV | Mutation analysis | Gene significance with variable background mutation rate | 93 |
| NBS | Network analysis | Clustering using non-negative matrix factorization | 121 |
| Oncodrive-CIS and OncodriveCLUST | Mutation analysis | Z-statistics for copy numbers of driver genes | 171,172 |
| PARADIGM | Gene expression analysis | Network analysis of gene expression | 126 |
| PathScan | Pathway analysis | Probability model for mutation-enriched pathways | 109 |
| TieDIE | Network analysis | Network diffusion model linking mutations to gene expression | 125 |
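Empirically, candidate events called by multiple independent algorithms are less likely to be false positives, whereas the opposite holds for events called by any single algorithm. Multicaller strategies are therefore becoming more common, although requiring agreement between callers also affects sensitivity. The number of possible tool combinations is very large, however, which makes them hard to evaluate exhaustively.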
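The multicaller idea above can be sketched as a simple vote across callers. This is a minimal illustration, not the procedure of any tool in the table: the caller names, the variant key `(chromosome, position, ref, alt)` and the `min_callers` threshold are all assumptions made for the example.

```python
from collections import Counter

def consensus_calls(calls_by_caller, min_callers=2):
    """Return the variants reported by at least `min_callers` callers."""
    votes = Counter()
    for variants in calls_by_caller.values():
        votes.update(set(variants))  # each caller votes once per variant
    return {variant for variant, n in votes.items() if n >= min_callers}

# Hypothetical calls keyed by (chromosome, position, ref, alt).
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "A")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
}
high_confidence = consensus_calls(calls, min_callers=2)
```

Raising `min_callers` trades sensitivity for specificity: with three callers required, only the chr1 variant survives in this toy input.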
2.1 SNV detection
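SNV detection algorithms: GATK, VarScan, SAMtools, SomaticSniper, MuTect, Strelka, JointSNVMix and SNVMix. The first three can handle both germline and somatic variants; the others call somatic mutations using tumour and matched normal genomic sequences.
Although heterozygous VAFs (variant allele fractions) are expected to be 50% in germline samples, this figure does not apply to somatic mutations in tumours, mainly because of normal tissue contamination and/or tumour heterogeneity. Algorithm development currently focuses on handling somatic mutations across a wide range of VAFs. One example is Bassovac, which accounts for bidirectional impurity and tumour subclonal structure (that is, heterogeneity) when calling variants.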
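A standard mixture calculation shows why somatic VAFs fall below 50%. This is a rough sketch, not Bassovac's model. It assumes the normal cells are diploid and that all tumour cells share the same total copy number at the locus; the function and parameter names are hypothetical.

```python
def expected_vaf(purity, mutant_copies=1, tumour_cn=2, normal_cn=2, cell_fraction=1.0):
    """Expected variant allele fraction of a somatic mutation in a mixed sample.

    purity        -- fraction of cells in the sample that are tumour cells
    mutant_copies -- copies of the mutant allele per mutated tumour cell
    cell_fraction -- fraction of tumour cells carrying the mutation (1.0 = clonal)
    """
    mutant_alleles = purity * cell_fraction * mutant_copies
    total_alleles = purity * tumour_cn + (1.0 - purity) * normal_cn
    return mutant_alleles / total_alleles

clonal_pure = expected_vaf(1.0)                     # heterozygous, no contamination
clonal_60 = expected_vaf(0.6)                       # 40% normal contamination
subclonal_60 = expected_vaf(0.6, cell_fraction=0.5) # present in half the tumour cells
```

Under these assumptions a clonal heterozygous mutation drops from 0.5 to 0.3 at 60% purity, and to 0.15 if it is present in only half of the tumour cells.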
2.2 Indel detection
Indel detection is still challenging, mainly because indels are less frequent than SNVs and are harder to map.
By default, most tools allow two mismatches and no gaps in 'seeded' regions (that is, in the first 28 bp of a read), so reads that contain indels often fail to align correctly. Paired-end mapping helps to find large indels flanked by the two read ends; gapped alignment, split-read analysis and de novo assembly are currently the common approaches to indel detection. VarScan (ref. 25) and GATK Unified Genotyper are based on heuristics for indel calling using raw statistics such as coverage, number of indel-supporting reads, read mapping qualities and mismatch counts.
Many existing tools work well for short indels (< 5-8 bp) but do not achieve high true-positive rates. In addition, they usually fail to detect medium-sized indels, including some known 'druggable' and/or prognostic events. Finally, detection in low-complexity regions (such as homopolymers) is especially challenging. SAMtools and Dindel can call short indels. Pindel uses a pattern growth approach borrowed from protein data analysis to detect indel breakpoints and has high precision, and DELLY (ref. 8) integrates discordant and split reads. Burrows-Wheeler Aligner (BWA)-MEM (ref. 41) allows better discovery of long indels and SVs, and local de novo assembly or multiple alignments can reduce the number of false-positive indels.
2.3 CNA and structural variant detection
Accurate inference of copy number from sequence data requires normalization procedures that consider certain biases inherent to short-read sequencing methods (such as GC content and library biases). Approaches have been implemented for both GC-based coverage normalization and mapping bias.
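One simple way to picture GC-based coverage normalization is to rescale each window's depth by the median depth of windows with similar GC content. The sketch below assumes fixed genomic windows, a hypothetical `bin_width` of 0.05 and made-up depths; it is not the normalization used by any specific tool mentioned here.

```python
from collections import defaultdict
from statistics import median

def gc_normalize(depths, gc_fractions, bin_width=0.05):
    """Rescale each window's depth by the median depth of its GC bin."""
    groups = defaultdict(list)
    for depth, gc in zip(depths, gc_fractions):
        groups[int(gc / bin_width)].append(depth)
    overall = median(depths)
    bin_medians = {b: median(values) for b, values in groups.items()}
    normalized = []
    for depth, gc in zip(depths, gc_fractions):
        bin_median = bin_medians[int(gc / bin_width)]
        normalized.append(depth * overall / bin_median if bin_median > 0 else 0.0)
    return normalized

# Hypothetical windows: low-GC windows are under-covered, and the last window carries a gain.
depths = [20, 20, 40, 40, 80]
gc = [0.32, 0.33, 0.52, 0.53, 0.54]
normalized = gc_normalize(depths, gc)
```

After normalization the GC-driven difference between the first four windows disappears, while the last window still stands out at twice the typical depth.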
Finding recurrent CNAs: Genomic Identification of Significant Targets in Cancer (GISTIC) and correlation matrix diagonal segmentation (CMDS) have been developed for the identification of recurrent CNAs.
Detecting multiple types of structural change (deletions, tandem or inverted duplications, inversions, insertions and translocations): BreakDancer, CREST (clipping reveals structure), VariationHunter, geometric analysis of structural variants (GASV)-Pro and Genome STRucture In Populations (Genome STRiP).
2.4 Gene fusion detection
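Gene fusion discovery from RNA-seq: TopHat-fusion, deFuse, MapSplice, ChimeraScan and BreakFusion.
Gene fusions can arise from simple translocations that involve only two distal loci, or from complex rearrangements that involve multiple distal loci. Comrad and nFuse both align raw WGS and RNA-seq reads and validate fusions and genomic breakpoints at the same time.
Comrad and nFuse can account for ambiguous read alignments and can therefore minimize errors caused by misalignment.
The authors recently developed BreakTrans, which jointly analyses WGS and RNA-seq data to test hypotheses generated by other tools (such as TopHat-fusion, MapSplice, BreakDancer and CREST) and to further characterize the mechanistic components of gene fusions.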
3. Driver mutations and pathways
3.1 Annotations and functional predictions
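Genes and transcripts: RefSeq, Ensembl and GENCODE
Regulatory elements: ENCODE, TransFac and RegulomeDB
Non-coding RNAs: NONCODE, BodyMap and miRBase
Protein annotation: Pfam and InterPro
Integrated annotation: ANNOVAR and SNPeff annotate variants in transcripts; SKIPPY predicts cryptic splicing effects; VEP, FunSeq and SNPnexus extend support to the annotation of non-coding elements and regulatory features; VAAST (Variant Annotation, Analysis and Search Tool) and GEMINI (genome mining) allow comprehensive analysis and integration of coding variants, non-coding variants, regulatory elements and phenotypes
Deleteriousness: PolyPhen, SIFT, MutationAssessor and Condel
Protein post-translational modification: ActiveDriver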
3.2 Significantly mutated genes
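One way to detect driver mutations is to separate them from the background mutation rate (BMR). The BMR is difficult to measure because many factors affect it, including gene-specific differences (gene length, expression level and replication timing), variation among samples and errors in upstream analyses. The BMR varies not only among patients with the same cancer type but also among cancer types, which may relate to environmental factors and viral signatures. Finally, incorrect or biased annotation of mutations can lead to false positives. Insufficient sequence coverage of genes exacerbates these problems. MuSiC and MutSig address these issues.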
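The basic logic of testing a gene against the BMR can be sketched with a one-sided binomial test. This deliberately ignores the covariates listed above and assumes a single uniform per-base rate, so it is a teaching sketch rather than what MuSiC or MutSig compute. The gene length, cohort size and rate are hypothetical.

```python
import math

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with terms evaluated in log space.

    Computed as 1 - P(X < k), which is adequate for a sketch but loses
    precision for extremely small tail probabilities.
    """
    if k <= 0:
        return 1.0
    below = 0.0
    for i in range(k):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        below += math.exp(log_pmf)
    return max(0.0, 1.0 - below)

# Hypothetical gene: 1,500 coding bases, 200 samples, uniform BMR of 1e-6 per base.
gene_length_bp = 1500
n_samples = 200
bmr = 1e-6
observed_mutations = 5

n_sites = gene_length_bp * n_samples          # opportunities for a mutation
expected = n_sites * bmr                      # 0.3 mutations expected by chance
p_value = binomial_tail(observed_mutations, n_sites, bmr)
```

With 0.3 mutations expected and 5 observed, the p-value is on the order of 1e-5. A real analysis would still need multiple-testing correction across genes and a BMR that varies by gene and sample.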
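Another way to distinguish driver from passenger mutations is to examine whether mutations cluster at specific residues of the protein sequence. The '20/20 rule' proposes that a gene should be classified as an oncogene if at least 20% of its missense mutations (or identical in-frame indels) fall on one specific residue. Conversely, a gene can be classified as a tumour suppressor if at least 20% of its mutations are inactivating (that is, nonsense, frameshift, splice-site or stop codon read-through mutations). This approach is now complemented by algorithms that use more rigorous statistical scores to evaluate patterns of mutational signal, as well as the clustering of mutations in the protein sequence or in the three-dimensional protein structure.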
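The '20/20 rule' as worded above can be turned into two per-gene scores. The sketch below follows the wording of these notes and makes two assumptions that should be checked against the original rule: the oncogene score uses missense mutations as its denominator, and the suppressor score uses all mutations. The mutation type labels and example genes are hypothetical.

```python
from collections import Counter

INACTIVATING = {"nonsense", "frameshift", "splice_site", "stop_readthrough"}

def twenty_twenty_scores(mutations, threshold=0.20):
    """Score one gene; `mutations` is a list of (mutation_type, residue_position)."""
    missense_sites = Counter(pos for kind, pos in mutations if kind == "missense")
    n_missense = sum(missense_sites.values())
    top_site = max(missense_sites.values(), default=0)
    # Fraction of missense mutations that hit the single most mutated residue.
    oncogene_score = top_site / n_missense if n_missense else 0.0
    # Fraction of all mutations that are inactivating.
    n_inactivating = sum(kind in INACTIVATING for kind, _ in mutations)
    suppressor_score = n_inactivating / len(mutations) if mutations else 0.0
    return {
        "oncogene_score": oncogene_score,
        "suppressor_score": suppressor_score,
        "oncogene_like": oncogene_score >= threshold,
        "suppressor_like": suppressor_score >= threshold,
    }

# Hypothetical gene A: a missense hotspot at residue 12.
gene_a = [("missense", 12)] * 8 + [("missense", 50), ("missense", 61)]
# Hypothetical gene B: scattered missense mutations plus truncating events.
gene_b = ([("nonsense", 10), ("nonsense", 45), ("nonsense", 80),
           ("frameshift", 23), ("frameshift", 130)]
          + [("missense", pos) for pos in range(200, 210)])

scores_a = twenty_twenty_scores(gene_a)
scores_b = twenty_twenty_scores(gene_b)
```

The function returns both scores rather than a single label, because a gene with few mutations can cross either threshold by chance, which is one reason the text says more rigorous statistical methods now complement the rule.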
3.3 Pathway and network analyses
Pathway and network analyses take two main forms: (1) analysing known pathways, which are represented as gene sets; and (2) analysing interaction networks to implicitly build pathways de novo.
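Approach 1: A direct way to evaluate combinations of mutated genes is to examine the overlap between the list of mutated genes and predefined gene sets with known biological function, such as those in KEGG, GO and MSigDB. For example, suppose we have a list of mutated genes (M) and want to know whether it contains genes that regulate the cell cycle. From the KEGG database we obtain a list of more than 20 cell cycle genes (L). Two statistical tests can be used to check whether M and L overlap significantly. First, if M is ranked (for example, by one of the mutation significance scores described above), gene set enrichment analysis (GSEA) can determine whether the genes in L lie near the top of the ranked list M. Second, if M is unranked, the hypergeometric test can be used to assess the overlap between M and L.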
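For the unranked case, the hypergeometric test asks how likely it is to see at least the observed overlap if M were drawn at random from all genes. A minimal sketch follows; the gene counts (20,000 genes in total, 25 cell cycle genes, 100 mutated genes, 4 shared) are hypothetical.

```python
from math import comb

def hypergeom_overlap_pvalue(population, set_size, sample_size, overlap):
    """P(X >= overlap) when `sample_size` genes are drawn without replacement
    from `population` genes, of which `set_size` belong to the gene set."""
    total = comb(population, sample_size)
    upper = min(set_size, sample_size)
    tail = sum(
        comb(set_size, k) * comb(population - set_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: |L| = 25 cell cycle genes, |M| = 100 mutated genes, 4 shared.
p_overlap = hypergeom_overlap_pvalue(population=20000, set_size=25,
                                     sample_size=100, overlap=4)
```

The choice of `population` (the background gene universe) strongly affects the result, so it should match the genes that could actually have been called as mutated.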
Approach 2: The analysis above has three limitations. First, human gene annotations and pathway databases remain incomplete, and there is extensive crosstalk between pathways, which implies that decisions regarding the genes that form the boundary of a pathway are arbitrary to some extent. Second, the crosstalk is represented in gene-set and pathway databases by the presence of multiple overlapping gene sets, which complicates the interpretation of reported enrichments. Finally, signalling and regulatory pathways have a rich topology of activating and inhibitory interactions, and this information is not represented in the list of genes or proteins that are members of the pathway, so enrichment analysis cannot capture activation and inhibition.
To overcome these limitations, the second approach to analysing combinations of mutations uses biological interaction networks in place of gene sets to identify the combinations that should be evaluated further. Most biological networks, however, have a heterogeneous topology characterized by the presence of hubs (highly connected nodes). HotNet is a method for finding subnetworks of a large interaction network that are mutated more often than expected by chance. It has been used to identify subnetworks in several cancer types analysed in TCGA, for example, mutations involving the Notch signalling pathway in ovarian cancer. Other tools include network-based stratification (NBS), MEMo and Tied Diffusion Through Interacting Events (TieDIE).
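The table describes HotNet and TieDIE as diffusion models. The general idea of network diffusion can be illustrated with a random walk with restart, in which 'heat' placed on mutated genes spreads to their neighbours. This is a generic sketch under assumed parameters (`restart`, `iterations`, a toy path-shaped network); it does not reproduce HotNet's actual diffusion kernel or its significance test.

```python
def diffuse_heat(edges, initial_heat, restart=0.4, iterations=100):
    """Spread heat over an undirected network by random walk with restart."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    for node in initial_heat:
        neighbours.setdefault(node, set())

    heat = {node: initial_heat.get(node, 0.0) for node in neighbours}
    for _ in range(iterations):
        # A fraction `restart` of the heat is reset to the starting distribution.
        updated = {node: restart * initial_heat.get(node, 0.0) for node in neighbours}
        for node, nbrs in neighbours.items():
            if not nbrs:
                updated[node] += (1.0 - restart) * heat[node]  # isolated node keeps its heat
                continue
            # The rest is split equally among neighbours, so hubs pass on
            # only a small share to each neighbour.
            share = (1.0 - restart) * heat[node] / len(nbrs)
            for nbr in nbrs:
                updated[nbr] += share
        heat = updated
    return heat

# Toy network: a chain of five hypothetical genes, with a mutation in gene A.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
heat = diffuse_heat(edges, {"A": 1.0})
```

Genes close to the mutated gene end up with more heat than distant ones, and total heat is conserved. Dividing by the number of neighbours is one simple way to limit the influence of hubs.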
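Approach 3: The third approach to analysing combinations of mutations is to identify mutually exclusive sets of mutations, which can reveal combinations of driver mutations. MEMo uses this concept to examine genes that are known to interact. Alternatively, one can try to discover mutually exclusive gene sets de novo, without restricting the gene sets in advance (Dendrix, Multi-Dendrix and RME).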
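Mutual exclusivity can be quantified by counting how many samples a gene set covers and how many mutations land in samples that are already covered. The sketch below uses coverage minus overlap as a simple illustrative score. The tools named above use their own scoring and search procedures, and the genes and samples here are hypothetical.

```python
def coverage_and_overlap(mutated_samples, gene_set):
    """Return (coverage, overlap) for a gene set.

    mutated_samples -- dict mapping gene -> set of sample IDs with a mutation
    coverage        -- samples with a mutation in at least one gene of the set
    overlap         -- extra mutations that fall in samples already covered
    """
    per_gene = [mutated_samples.get(gene, set()) for gene in gene_set]
    covered = set().union(*per_gene)
    coverage = len(covered)
    overlap = sum(len(samples) for samples in per_gene) - coverage
    return coverage, overlap

# Hypothetical mutation data for six samples.
mutated = {
    "X": {"s1", "s2", "s3"},
    "Y": {"s4", "s5"},
    "Z": {"s1", "s2", "s4"},
}
exclusive_pair = coverage_and_overlap(mutated, ["X", "Y"])  # no shared samples
overlapping_pair = coverage_and_overlap(mutated, ["X", "Z"])  # s1 and s2 shared
```

Here the pair X and Y covers five samples with no overlap, whereas X and Z covers four samples with two co-occurring mutations, so X and Y is the better candidate for a mutually exclusive driver set.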
4. Genome integrity and clonal architectures
4.1 Kataegis, chromothripsis and chromoplexy
One of the most striking findings in TCGA is the existence of genomes with extreme numbers and types of mutations.
Kataegis is the occurrence of an unusually large number of SNVs clustered in a single locus, and was first reported in breast tumours and later in other cancer types.
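Clustered mutations of this kind can be flagged by scanning sorted mutation positions for runs of closely spaced events. The thresholds below (`max_gap` of 1,000 bp and at least 6 mutations per run) and the positions are illustrative assumptions, not values taken from these notes.

```python
def clustered_runs(positions, max_gap=1000, min_mutations=6):
    """Find runs of mutations on one chromosome in which every consecutive
    pair is at most `max_gap` bp apart. Returns (start, end, count) tuples."""
    runs = []
    ordered = sorted(positions)
    if not ordered:
        return runs
    start = prev = ordered[0]
    count = 1
    for pos in ordered[1:]:
        if pos - prev <= max_gap:
            count += 1
        else:
            if count >= min_mutations:
                runs.append((start, prev, count))
            start, count = pos, 1
        prev = pos
    if count >= min_mutations:  # close the final run
        runs.append((start, prev, count))
    return runs

# Hypothetical positions: seven mutations within 2 kb, flanked by isolated ones.
positions = [10_000, 500_000, 500_200, 500_450, 500_900,
             501_300, 501_350, 501_900, 2_000_000]
candidate_loci = clustered_runs(positions)
```

A real analysis would run this per chromosome and per sample, and would usually also check the mutation types and strand of the clustered events.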
Chromothripsis is a phenomenon in which one or more loci undergo a catastrophic event of simultaneous breakage and aberrant repair at multiple breakpoints in a single cell division. It was originally reported in ~2–3% of all cancers but was shown to be particularly common in bone cancers (~25%), and was later found to be possibly associated with TP53 mutations. Chromoplexy is a similar event that was discovered in prostate cancer.
4.2 Defining clonal architecture in heterogeneous tumours
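All of the genomic alterations discussed above play a part in clonal evolution.
ABSOLUTE adds a best-fit CNA model and a karyotype likelihood model.
PyClone uses hierarchical Bayesian clustering to identify clones.
SciClone uses a Bayesian mixture model to examine multiple samples from a patient across time (using primary and relapse tumour samples) or space (using multiple biopsy samples).
The tumour heterogeneity analysis (THetA) algorithm accounts for the presence of CNAs, which confound the analysis of VAFs.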
5. Conclusion: basic and clinical applications
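In the short time since cancer genomics entered biomedicine, it has made many fundamental contributions:
First, cancer-associated genes and pathways have been identified;
Second, germline susceptibility has been established;
Third, technologies and algorithms have been continually refined;
Fourth, large data sets have been organized and documented;
Finally, knowledge has been catalogued in new databases.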
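Future challenges:
First, the 'data spectrum' and associated analysis tools are not yet complete (proteomic data, for example);
The second factor is the reality of cost.
The next chapter of cancer research will undoubtedly push further towards clinical application and draw large pharmaceutical companies further into the development of new therapeutics.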