Expanding the computational toolbox for mining cancer genomes

Corresponding author: Li Ding
Director of Computational Biology, Oncology
Washington University School of Medicine, St. Louis, MO

Sample procurement, sequencing and analysis roadmap.

1. Sequencing strategies

Shift from WES to WGS: WGS data are therefore considered to be the unbiased 'gold standard'.

  • 1.1 Traditional sequencing analyses
    In practice, detection of all germline and somatic aberrations is a formidable challenge owing to limitations in current analysis algorithms, as well as to the quantity and quality of sequence data.

  • 1.2 Subclonal analyses
    Cancer progression has long been known to be a fundamentally clonal process, and sequence coverage is now becoming sufficiently large to permit detection of the low-prevalence events that are routinely associated with tumour subclones. Multisite and/or multistage sequencing and tumour sectioning experiments have begun to identify founding clones and subclones that contribute to cancer progression.

  • 1.3 Single-cell sequencing
    Pioneering work on assessing CNAs in multiple tumour subpopulations was followed by single-cell sequencing using whole-genome amplification (WGA) of DNA extracted from nuclei that were sorted by flow cytometry.
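    Several challenges remain, such as the amplification bias of degenerate oligonucleotide-primed WGA and of multiple displacement amplification techniques. Degenerate oligonucleotide-primed WGA uses primers carrying a 6-bp random sequence at the 3' end, which bind genomic DNA at random and thereby amplify the whole genome. Multiple displacement amplification uses random primers and isothermal amplification to produce large, high-fidelity DNA fragments, but its main drawbacks are unbalanced genome coverage, amplification bias, chimeric sequences and nonspecific amplification.
    These biases lead to uneven coverage and therefore make it difficult to determine somatic changes, including SNVs, CNAs and structural aberrations. Detection sensitivity is affected most by allelic dropout, which results from preferential amplification of one of the two alleles; allelic dropout rates of 8-40% have been reported. Large CNAs can still be detected at low genome coverage (for example, 5-6%), whereas uneven coverage makes the analysis of smaller CNAs and structural variants extremely difficult.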

2. Dissecting genomic changes in cancer

The following table lists computational tools for annotating and interpreting mutations in tumour genomes.

Program | Function | Synopsis | Refs

SNV and indel detection
Bassovac | SNV and indel detection | Bayesian approach with tumour or normal impurity and clonality |
GATK | SNV and indel detection | Analysis framework using MapReduce | 23
JointSNVMix | SNV detection | Binomial/multinomial probability with pre-filtering | 31
MuTect | SNV and indel detection | Bayesian probability with pre- and post-filtering | 28
Pindel | Indel detection | Pattern growth learning method | 38
SNVMix | SNV detection | Binomial mixture model | 30
SomaticSniper | SNV and indel detection | Bayesian probability with posterior filtering | 27
Strelka | SNV and indel detection | Bayesian probability with posterior filtering | 29
VarScan | SNV and indel detection | Fisher exact test, filtering and FDR correction | 24,25

Copy-number aberration, structural variant and gene fusion detection
BreakDancer | Structural variant and indel detection | Kolmogorov–Smirnov test on discordant reads | 54
BreakFusion | Gene fusion detection | Alignment-based pipeline for transcriptomic data | 68
BreakTrans | Gene fusion mapping | Integration of fusion discovery and breakpoint tools | 73
ChimeraScan | Chimeric transcription detection | Discordant read pairs with posterior filtering | 67
CREST | Structural variant detection | Heuristics and binomial test on soft-clipped reads | 55
deFuse | Gene fusion detection | Dynamic programming split and discordant reads | 65
DELLY | Structural variant detection | Integrated method of discordant and split reads | 40
GASV-Pro | Structural variant detection | Plane sweep for segment intersection | 57
Genome STRiP | Structural variant detection | Depth and split or discordant reads on populations | 59
Hydra | Structural variant detection | Discordant reads with assembly validation | 139
LUMPY | Structural variant detection | Integrated method of discordant and split reads | 167
TIGRA | Structural variant detection | De Bruijn graph-based assembly | 42

Level I annotation and interpretation
ABSOLUTE | Purity, ploidy and clonality prediction | Optimization of logarithmic scores | 148
ANNOVAR | Functional prediction | Annotation-based prediction | 74
ASCAT | Purity, ploidy and clonality prediction | Goodness-of-fit ranking of candidate solutions | 168
TUSON Explorer | Gene classification | Oncogene or tumour suppressor discovery using mutational signatures | 100
CHASM | Functional prediction | Random forest classifier | 84,85
MutationAssessor | Functional prediction | Conservation-based prediction (entropy score) | 83
PolyPhen2 | Functional prediction | Probability model based on structure and alignment | 81,169
SciClone | Tumour clonality prediction | Bayesian mixture model |
SIFT | Functional prediction | Conservation-based prediction | 82
SNPeff | Functional prediction | Annotation and coding effect prediction | 75
THetA | Purity, ploidy and clonality prediction | Maximum likelihood of mixture composition | 151
VEP | Functional prediction | Annotation-based prediction | 170

Level II annotation and interpretation
Dendrix | Mutation analysis | De novo discovery of mutually exclusive mutations | 128
HotNet | Network analysis | Diffusion model for significant networks | 119
MEMo | Network analysis | Network modules with mutual exclusivity | 122
MuSiC | Mutation analysis | Framework for significance analysis of mutations | 92
Multi-Dendrix | Mutation analysis | De novo discovery of multiple sets of exclusive mutations | 129
MutSigCV | Mutation analysis | Gene significance with variable background mutation rate | 93
NBS | Network analysis | Clustering using non-negative matrix factorization | 121
Oncodrive-CIS and OncodriveCLUST | Mutation analysis | Z-statistics for copy numbers of driver genes | 171,172
PARADIGM | Gene expression analysis | Network analysis of gene expression | 126
PathScan | Pathway analysis | Probability model for mutation-enriched pathways | 109
TieDIE | Network analysis | Network diffusion model linking mutations to gene expression | 125

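As a rule of thumb, candidate events called by several independent algorithms are less likely to be false positives than events called by any single algorithm. Multicaller strategies are therefore becoming more common, although they also affect the sensitivity of the results. The number of possible tool combinations is very large, however, which makes such strategies hard to implement.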
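
A multicaller strategy can be sketched as a simple vote: keep a candidate only if enough callers report it. The snippet below is a minimal illustration, not the procedure of any tool in the table; the caller names, the (chromosome, position, ref, alt) tuple layout and the `min_callers` threshold are all assumptions made for the example.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=2):
    """Return the variants reported by at least `min_callers` callers.

    calls_by_caller maps a caller name to an iterable of variant keys,
    here hypothetical (chrom, pos, ref, alt) tuples.
    """
    votes = Counter()
    for variants in calls_by_caller.values():
        # Each caller contributes at most one vote per variant.
        votes.update(set(variants))
    return {variant for variant, n in votes.items() if n >= min_callers}


# Hypothetical call sets from three callers.
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
}
consensus = consensus_calls(calls, min_callers=2)
strict = consensus_calls(calls, min_callers=3)
```

Raising `min_callers` trades sensitivity for specificity: here the stricter setting keeps only the variant seen by all three callers.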

  • 2.1 SNV detection
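    SNV detection algorithms: GATK, VarScan, SAMtools, SomaticSniper, MuTect, Strelka, JointSNVMix and SNVMix. The first three can handle both germline and somatic variants, whereas the others call somatic mutations using tumour and matched normal genomic sequences.
    Although heterozygous VAFs (variant allele fractions) are expected to be 50% in germline samples, this figure does not hold for somatic mutations in tumours, mainly because of normal tissue contamination and/or tumour heterogeneity. Algorithm development currently focuses on handling somatic mutations across a wide range of VAFs. One example is Bassovac, which accounts for bidirectional impurity and tumour subclonal structure (that is, heterogeneity) when calling variants.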
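
    To see why somatic VAFs fall below the germline 50% expectation, here is a small worked calculation. It assumes a heterozygous mutation at a diploid, copy-neutral locus, so that the expected VAF is purity × cancer cell fraction / 2. This is a textbook simplification and not the model used by Bassovac or any other caller listed here; the function and parameter names are hypothetical.

```python
def expected_vaf(purity, cancer_cell_fraction=1.0):
    """Expected VAF of a heterozygous somatic SNV at a diploid,
    copy-neutral locus (simplifying assumption).

    purity: fraction of cells in the sample that are tumour cells.
    cancer_cell_fraction: fraction of tumour cells carrying the mutation.
    """
    for name, value in (("purity", purity),
                        ("cancer_cell_fraction", cancer_cell_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # One of two alleles is mutated in the fraction of cells carrying it.
    return purity * cancer_cell_fraction / 2.0


pure_clonal = expected_vaf(1.0)          # pure tumour, clonal mutation
impure_subclonal = expected_vaf(0.6, 0.5)  # 60% purity, half of tumour cells
```

    Under these assumptions a clonal mutation in a pure tumour sits at 0.5, while a mutation present in half of the tumour cells of a 60%-pure sample is expected at only 0.15.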
  • 2.2 Indel detection
    Indel detection is still challenging, mainly because indels occur at lower frequencies than SNVs and are harder to map.
    By default, most tools allow two mismatches and no gaps in 'seeded' regions (that is, in the first 28 bp of a read), so reads that contain indels often fail to align correctly. Paired-end mapping helps to find large indels flanked by the read ends, and gapped alignment, split-read analysis and de novo assembly are the common approaches to indel detection. VarScan (ref. 25) and GATK Unified Genotyper are based on heuristics for indel calling using raw statistics such as coverage, number of indel-supporting reads, read mapping qualities and mismatch counts.
    Many existing tools perform well on short indels (<5-8 bp) but lack a high positive rate. In addition, they usually fail to detect medium-sized indels, including some known 'druggable' and/or prognostic events. Finally, detection in low-complexity regions (such as homopolymers) is particularly challenging. SAMtools and Dindel can call short indels. Pindel and DELLY (ref. 8) use a pattern growth approach borrowed from protein data analysis to detect indel breakpoints, and Pindel has relatively high precision. Burrows-Wheeler Aligner (BWA)-MEM (ref. 41) allows better discovery of long indels and SVs, and local de novo assembly or multiple alignment can reduce the number of false-positive indels.
  • 2.3 CNA and structural variant detection
    Accurate inference of copy number from sequence data requires normalization procedures that consider certain biases inherent to short-read sequencing methods (such as GC content and library biases). Approaches have been implemented for both GC-based coverage normalization and mapping bias.
    Finding recurrent CNAs: Genomic identification of significant targets in cancer (GISTIC) and correlation matrix diagonal segmentation (CMDS) have been developed for the identification of recurrent CNAs.
    Detecting multiple types of structural change (deletions, tandem or inverted duplications, inversions, insertions and translocations): BreakDancer, CREST (clipping reveals structure), VariationHunter, geometric analysis of structural variants (GASV)-Pro and Genome STRucture In Populations (Genome STRiP).
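
    The GC-based coverage normalization mentioned in this subsection can be illustrated with a very simple scheme: group genomic windows by GC fraction and divide each window's read depth by the median depth of its GC group. This is only a sketch of the idea, assuming per-window depths and GC fractions are already available; the function name, the bin width and the median-ratio approach are assumptions, not the method of any specific tool named here.

```python
from collections import defaultdict
from statistics import median


def gc_normalize(depths, gc_fractions, bin_width=0.05):
    """Divide each window's depth by the median depth of windows
    falling in the same GC-content bin."""
    if len(depths) != len(gc_fractions):
        raise ValueError("depths and gc_fractions must have equal length")

    def gc_bin(gc):
        return int(gc / bin_width)

    # Collect depths per GC bin, then take the median of each bin.
    by_bin = defaultdict(list)
    for depth, gc in zip(depths, gc_fractions):
        by_bin[gc_bin(gc)].append(depth)
    bin_median = {b: median(values) for b, values in by_bin.items()}

    normalized = []
    for depth, gc in zip(depths, gc_fractions):
        m = bin_median[gc_bin(gc)]
        normalized.append(depth / m if m > 0 else float("nan"))
    return normalized


# Hypothetical windows: high-GC windows are sequenced about twice as
# deeply, and the last window additionally carries a copy-number gain.
depths = [30, 30, 30, 60, 60, 120]
gc = [0.31, 0.32, 0.33, 0.52, 0.53, 0.54]
ratios = gc_normalize(depths, gc)
```

    After normalization the first five windows have a ratio of 1.0 and only the last window stands out at 2.0, so the GC-driven depth difference no longer looks like a copy-number change.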
  • 2.4 Gene fusion detection
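    Discovering gene fusions from RNA-seq: TopHat-fusion, deFuse, MapSplice, ChimeraScan and BreakFusion.
    Gene fusions can arise from simple translocations that involve only two distal loci, or from complex rearrangements that involve several distal loci. Comrad and nFuse both align raw WGS and RNA-seq reads and validate the fusion and the genomic breakpoints simultaneously.
    Comrad and nFuse can account for ambiguous read alignments and can therefore minimize errors caused by misalignment.
    The authors recently developed BreakTrans, which jointly analyses WGS and RNA-seq data to test hypotheses produced by other tools (such as TopHat-fusion, MapSplice, BreakDancer and CREST) and to further characterize the mechanistic components of gene fusions.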

3. Driver mutations and pathways

  • 3.1 Annotations and functional predictions
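    RefSeq genes and transcripts: Ensembl and GENCODE
    Regulatory elements: ENCODE, TransFac and RegulomeDB
    Non-coding RNAs: NONCODE, BodyMap and miRBase
    Protein annotation: Pfam and InterPro
    Integrated annotation: ANNOVAR and SNPeff annotate transcript variants; SKIPPY predicts cryptic splicing effects; VEP, FunSeq and SNPnexus extend support to the annotation of non-coding elements and regulatory features; VAAST (Variant Annotation, Analysis and Search Tool) and GEMINI (Genome Mining) allow comprehensive analysis and integration of coding variants, non-coding variants, regulatory elements and phenotypes
    Deleteriousness: PolyPhen, SIFT, MutationAssessor and Condel
    Protein post-translational modifications: ActiveDriver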
  • 3.2 Significantly mutated genes
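    One way to detect driver mutations is to distinguish them from the background mutation rate (BMR). The BMR is difficult to measure because many factors affect it, including differences in gene length, expression level and replication timing, variation among samples and errors in upstream analyses. The BMR varies not only among patients with the same cancer type but also among cancer types, possibly in association with environmental factors and viral signatures. Finally, incorrect or biased annotation of mutations can lead to false positives. Insufficient sequence coverage of genes exacerbates these problems. MuSiC and MutSig address these issues.
    Another way to distinguish driver from passenger mutations is to check whether mutations cluster at specific residues of the protein sequence. The '20/20 rule' proposes that a gene should be classified as an oncogene if at least 20% of its missense mutations (or identical in-frame indels) fall on one specific residue. Conversely, a gene can be classified as a tumour suppressor if at least 20% of its mutations are inactivating (that is, nonsense, frameshift, splice-site or stop-codon read-through mutations). This approach is now complemented by algorithms that use more rigorous statistical scores to assess patterns of mutational signal, as well as the clustering of mutations in the protein sequence or in the three-dimensional protein structure.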
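
    The '20/20 rule' as summarized above can be written down directly. The sketch below follows the wording of these notes rather than any published implementation: the mutation-type labels, the tuple layout and the `min_recurrence` guard (which stops a single mutation from counting as a recurrent residue) are assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical mutation-type labels used only in this example.
RECURRENT_TYPES = {"missense", "inframe_indel"}
INACTIVATING_TYPES = {"nonsense", "frameshift", "splice_site", "stop_loss"}


def classify_20_20(mutations, threshold=0.20, min_recurrence=2):
    """Apply the 20/20 rule to one gene.

    mutations: list of (mutation_type, residue_position) tuples.
    Returns (oncogene_like, tumour_suppressor_like) as booleans.
    """
    if not mutations:
        return (False, False)

    # Oncogene criterion: share of missense / in-frame mutations that
    # fall on the single most frequently hit residue.
    recurrent_positions = [pos for kind, pos in mutations
                           if kind in RECURRENT_TYPES]
    oncogene_like = False
    if recurrent_positions:
        _, top_count = Counter(recurrent_positions).most_common(1)[0]
        share = top_count / len(recurrent_positions)
        oncogene_like = top_count >= min_recurrence and share >= threshold

    # Tumour suppressor criterion: share of all mutations that inactivate.
    inactivating = sum(1 for kind, _ in mutations
                       if kind in INACTIVATING_TYPES)
    suppressor_like = inactivating / len(mutations) >= threshold

    return (oncogene_like, suppressor_like)


# Hypothetical gene with a hotspot at residue 12.
hotspot_gene = [("missense", 12)] * 8 + [("missense", 61), ("nonsense", 100)]
# Hypothetical gene with scattered missense and many truncating mutations.
truncated_gene = ([("missense", p) for p in (5, 40, 77, 130, 210, 305)]
                  + [("nonsense", 20), ("frameshift", 90),
                     ("frameshift", 150), ("splice_site", 260)])

hotspot_result = classify_20_20(hotspot_gene)
truncated_result = classify_20_20(truncated_gene)
```

    The function returns both flags instead of a single label because a gene can satisfy both criteria; how to resolve that case is left to the caller.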
  • 3.3 Pathway and network analyses
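    Pathway and network analyses take two forms: (1) analysis of known pathways, which are represented as gene sets; and (2) analysis of interaction networks to implicitly build pathways de novo.
    Method 1: a direct way to evaluate combinations of mutated genes is to examine the overlap between the list of mutated genes and predefined gene sets with known biological functions, such as those in KEGG, GO and MSigDB. For example, suppose we have a list of mutated genes (M) and want to know whether it contains genes that regulate the cell cycle. From the KEGG database we obtain a list of more than 20 cell-cycle genes (L). Two statistical tests can be used to check whether M and L overlap significantly. First, if M is ranked (for example, by one of the mutation significance scores described above), gene set enrichment analysis (GSEA) can determine whether the genes in L lie near the top of the ranked list M. Second, if M is unranked, the overlap between M and L can be assessed with the hypergeometric test.
    Method 2: the analysis above has several drawbacks.
    1. Human gene annotations and pathway databases remain incomplete, and there is extensive crosstalk between pathways, which implies that decisions regarding the genes that form the boundary of a pathway are arbitrary to some extent.
    2. The crosstalk is represented in gene-set and pathway databases by the presence of multiple overlapping gene sets, thus complicating the interpretation of reported enrichments.
    3. Finally, signalling and regulatory pathways have a rich topology of activating and inhibitory interactions, and this information is not represented in the list of genes or proteins that are members of the pathway, so enrichment analysis cannot capture activation and inhibition.
    To overcome these limitations, a second approach analyses combinations of mutations using biological interaction networks: interaction networks have been used in place of gene sets to identify combinations of mutations that should be evaluated further. However, most biological networks have a non-uniform topology that is characterized by the presence of hubs. HotNet is a method for finding subnetworks of a large interaction network that are mutated in more samples than expected by chance. HotNet has been used to identify subnetworks in several cancer types analysed in TCGA, for example, mutations in the Notch signalling pathway in ovarian cancer. Other tools include network-based stratification (NBS), MEMo and Tied Diffusion Through Interacting Events (TieDIE).
    Method 3: a third approach is to identify mutually exclusive sets of mutations, which can reveal combinations of driver mutations. MEMo applies this concept to genes that are known to interact; alternatively, one can try to discover mutually exclusive gene sets de novo, without restricting the gene sets in advance (Dendrix, Multi-Dendrix and RME).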
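
    The hypergeometric test for the overlap between an unranked mutated-gene list M and a gene set L, as described in this subsection, can be sketched with the standard library alone. The gene identifiers and the choice of background are hypothetical; in practice the background would be all genes that could have been called mutated.

```python
from math import comb


def hypergeom_tail(n_background, n_set, n_mutated, n_overlap):
    """P(X >= n_overlap) when n_mutated genes are drawn without
    replacement from n_background genes, n_set of which are in the set."""
    max_overlap = min(n_set, n_mutated)
    total = comb(n_background, n_mutated)
    favourable = sum(
        comb(n_set, k) * comb(n_background - n_set, n_mutated - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return favourable / total


def overlap_pvalue(mutated_genes, gene_set, background_genes):
    """One-sided enrichment p-value for the overlap of M and L."""
    background = set(background_genes)
    mutated = set(mutated_genes) & background
    members = set(gene_set) & background
    return hypergeom_tail(len(background), len(members),
                          len(mutated), len(mutated & members))


# Hypothetical toy example: 10 background genes, 4 in the gene set L,
# 3 mutated genes in M, 2 of which belong to L.
background = [f"g{i}" for i in range(10)]
gene_set_l = ["g0", "g1", "g2", "g3"]
mutated_m = ["g0", "g1", "g9"]
p_value = overlap_pvalue(mutated_m, gene_set_l, background)
```

    For the toy input, 40 of the 120 possible draws of three genes contain at least two members of L, so the p-value is 1/3. When many gene sets are tested, the resulting p-values would still need multiple-testing correction.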

4. Genome integrity and clonal architectures

  • 4.1 Kataegis, chromothripsis and chromoplexy
    One of the most striking findings in TCGA is the existence of genomes with extreme numbers and types of mutations.
    Kataegis is the occurrence of an unusually large number of SNVs clustered in a single locus, and was first reported in breast tumours and other cancer types.
    Chromothripsis, in which one or more loci undergo a catastrophic event of simultaneous breakage and aberrant repair at multiple breakpoints in a single cell division, was originally reported in ~2–3% of all cancers but was shown to be particularly common in bone cancers (~25%); it was later found to be possibly associated with TP53 mutations. Chromoplexy is a similar event that was discovered in prostate cancer.
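
    Localized mutation clusters of the kataegis type are often screened for by looking at the distance between neighbouring mutations. The sketch below flags runs of mutations on one chromosome whose consecutive gaps are all small. The thresholds (at least 6 mutations, gaps of at most 1,000 bp) are assumptions chosen for illustration and do not come from these notes, and requiring every gap to be small is a simplification of the average-distance criteria used in practice.

```python
def find_mutation_clusters(positions, max_gap=1000, min_mutations=6):
    """Find runs of mutations on a single chromosome in which every
    consecutive pair lies within `max_gap` bp.

    Returns a list of (start, end, n_mutations) tuples.
    """
    ordered = sorted(positions)
    clusters = []
    run_start = 0
    for i in range(1, len(ordered) + 1):
        # A run ends at the last mutation or at a gap wider than max_gap.
        run_ends = (i == len(ordered)
                    or ordered[i] - ordered[i - 1] > max_gap)
        if run_ends:
            run_length = i - run_start
            if run_length >= min_mutations:
                clusters.append((ordered[run_start], ordered[i - 1],
                                 run_length))
            run_start = i
    return clusters


# Hypothetical mutation positions on one chromosome.
positions = [100, 5_000_000, 5_000_200, 5_000_450, 5_000_600,
             5_000_900, 5_001_500, 5_001_700, 9_000_000]
clusters = find_mutation_clusters(positions)
```

    In this toy input the seven mutations between positions 5,000,000 and 5,001,700 form one cluster, while the two isolated mutations are ignored.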
  • 4.2 Defining clonal architecture in heterogeneous tumours
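    All of the genomic alterations discussed above play a role in clonal evolution.
    ABSOLUTE adds a best-fit CNA model and a karyotype likelihood model.
    PyClone uses hierarchical Bayesian clustering to identify clones.
    SciClone uses a Bayesian mixture model to examine multiple samples from a patient, either over time (using primary and relapse tumour samples) or over space (using multiple biopsy samples).
    The Tumor Heterogeneity Analysis (THetA) algorithm accounts for the presence of CNAs, which confound the analysis of VAFs.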

5. Conclusion: basic and clinical applications

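In the short time since cancer genomics entered biomedicine, it has made many fundamental contributions:
First, cancer-associated genes and pathways have been identified;
Second, germline susceptibility has been established;
Third, technologies and algorithms have been continually improved;
Fourth, large data sets have been organized and documented;
Finally, knowledge has been catalogued in new databases.
Future challenges:
The 'data spectrum' and associated analysis tools are not yet complete, for example for proteomic data;
The second factor is the reality of cost;
The next chapter of cancer research will undoubtedly push further towards clinical application and draw large pharmaceutical companies more deeply into the development of new therapeutics.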
