Dragon star Day 1 20190729
關(guān)于測序技術(shù)、NGS數(shù)據(jù)格式、變異檢測
中英混雜3000字超長補(bǔ)丁,一只酸菜魚瘋狂填補(bǔ)知識盲區(qū)的筆記
- Genomic technologies in disease studies
- NGS data formats and variant calling
Part Ⅰ Genomic technologies in disease studies
1 Types of genetic variation
SNVs (Single Nucleotide Variants) - 單核苷酸變異
Indel (Insertion or deletion < 50 bp)
-
SV (Structural Variants) - 結(jié)構(gòu)變異
can be balanced or unbalanced
Balanced events do not involve gain or loss of genetic materials
-
Inversions and translocations
- Complex SVs (several types together)
Unbalanced events:
-
deletions/insertions/duplications (Deletions and duplications are two subtypes of CNVs (Copy Number Variants)
- Chromosomal aneuploidies (such as trisomy 21 21三體) are extreme cases of unbalanced SV.
-
Allele frequency and effect size - 等位基因頻率與效應(yīng)量/效應(yīng)值
Manolio et al, Nature, 2009Genome-wide association studies (GWAS) are effective in detecting common alleles that contribute to the inherited component of common multifactorial diseases. Typically, the alleles identified by this approach have modest effect sizes that cannot fully account for disease susceptibility. This discrepancy may exist because it is hard to identify rare alleles with a low to modest penetrance using GWAS. Penetrance is a measure of the proportion of individuals in a population carrying a particular allele that expresses the related phenotype. In contrast to multifactorial diseases, Mendelian diseases have a high penetrance and very rare allele frequency.
McCarthy, Mark I., et al. "Genome-wide association studies for complex traits: consensus, uncertainty and challenges." Nature reviews genetics 9.5 (2008): 356.
在統(tǒng)計(jì)學(xué)中,效應(yīng)值(Effect size)是量化現(xiàn)象強(qiáng)度的數(shù)值。[1]效應(yīng)值實(shí)際的統(tǒng)計(jì)量包括了二個(gè)變數(shù)間的相關(guān)程度、回歸模型中的回歸系數(shù)、不同處理間平均值的差異……等等。無論哪種效應(yīng)值,其絕對值越大表示效應(yīng)越強(qiáng),也就是現(xiàn)象越明顯。效應(yīng)值與特效檢驗(yàn)的概念是互補(bǔ)的。在估算統(tǒng)計(jì)檢定力、需要的樣本數(shù)與進(jìn)行元分析時(shí),效應(yīng)值經(jīng)常扮演重要角色。
https://zh.wikipedia.org/wiki/效應(yīng)值
Effect size is a simple way of quantifying the difference between two groups that has many advantages over the use of tests of statistical significance alone. Effect size emphasises the size of the difference rather than confounding this with sample size.*
Coe R. It's the effect size, stupid: What effect size is and why it is important[J]. 2002.
2 Revolution: single-molecule long-read sequencing
單分子長讀長測序
2.1 Oxford Nanopore Sequencing
測量電流
The DNA/RNA is sequenced when it is going though a protein pore.
- DNA (or RNA)
- Motor protein
- Adapter sequence
The nucleotides in the DNA/RNA block the ionic current and induce changes of current, which can be measured.
2.2 PacBio Single-molecule real-time (SMRT) sequencing
測量熒光
SMRTbell library: 環(huán)狀
Adapter: 環(huán)狀
-
Types of SMRT sequencing reads
-
Continuous Long Reads (CLR)
Long inserts so that the polymerase can synthesize along a single strand
-
Circular Consensus Sequencing (CCS, HiFi)
Short inserts, so polymerase can continue around the entire SMRTbell multiple times and generate multiple sub-reads from the same single molecule
-
存在由于技術(shù)造成的 error insertion
Insertions tend to be more than deletions and substitutions.
-
2.3 Linked-read Sequencing
依靠barcode
By adding a unique barcode to every short read generated from an individual molecule, the short reads are linked together.
10X Genomics 公司的 linked reads 技術(shù)本質(zhì)上是將 barcode 序列引入長序列片段,透過將長片段分配到不同的油滴微粒中,利用 GemCode 平臺(tái)對長片段序列進(jìn)行擴(kuò)增引入 barcode 序列以及定序接頭引物,然後將序列打斷成適合定序大小的片段進(jìn)行定序,通過 barcode 序列資訊追蹤來自每個(gè)大片段DNA模板的多個(gè) Reads,從而獲得大片段的遺傳資訊。通過 linked reads 結(jié)合常規(guī)二代定序組裝得到的Scaffold,可建構(gòu)準(zhǔn)確度更長的Scaffold。
What are molecular barcodes?
The concept of molecular barcodes is that each original DNA fragment, within the same sample, is attached to a unique sequence barcode.https://pdfs.semanticscholar.org/310b/3bac42989485c98406848217418ff22c22e7.pdf
Linked read use molecular barcoding to preserve long-range information
Short read pairs (2 x150 bp) are generated using barcode-containing primers.
Short reads that contains the same barcode and within a certain distance can be linked together to “reconstruct” the original long DNA fragment.
2.4 Single-molecule optical mapping (Bionano Genomics)
依靠限制性內(nèi)切酶位點(diǎn),測量光譜
Optical mapping
By mapping the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of resulting DNA fragments collectively serves as a unique "fingerprint" or "barcode" for that sequence.
Part Ⅱ NGS data formats and variant calling
1 Basic Concepts in NGS
-
Insert – the DNA portion that is used for sequencing *note: 和突變中的insertion不同
Insert size is the length of the DNA (or RNA) that you want to sequence and that is "inserted" between the adapters (so adapters excluded).
Read – the part of the insert that is sequenced 讀長
Single End – a sequencing procedure by which the insert is sequenced from one end only 單端
Paired End – a sequencing procedure by which the insert is sequenced from both ends 雙端
2 Data formats
2.1 The Rawset of Raw Data
Typically: images 根據(jù)不同堿基發(fā)射出不同顏色的熒光

base calling: call nucleotides at each base of each read.
2.2 NGS data formats
FASTQ: The raw sequence data format
Millions of short reads from unknown genetic locations.

-
Basic Qualities: a "Phred" scale
BQ = -10log10(ε) where ε is the probability of an error.
Phred: Phil's revised editor by Phil Green
Base quality conversion
Nowadays, we settled on using quality scores on the original Sanger format,(Phred+33).
需要知道測序年份/公司,來確定Phred
https://en.wikipedia.org/wiki/FASTQ_format -
BED: Genomic region format
-
The first 3 required BED fields are:
chrom: "chr1"
chromStart: "0"
chromEnd: "3"
只包含這三種信息的BED: Minimal BED format. So-called BED3 format.
-
再加上:
- labels: "foo", "bar", "biz"...
- scores: "3.1"
- strands: "+", "-"
This is so-called BED6 format.
-
GFF: annotates one line per feature
-
BAM/SAM: Genome alignment format
- SAM: Sequence Alignment/Map format (tab-delimited text file).
- BAM: The binary equivalent of a SAM file, which stores the same data in a compressed binary representation
https://www.samformat.info/sam-format-flag -
CRAM
和 BAM/SAM 類似的另一種格式,無損壓縮,適用于大型人群基因組/外顯子組測序項(xiàng)目,如 UK Biobank.
- CRAM was designed to be an efficient reference-based alternative to SAM/BAM file formats
- Better lossless compression than BAM, but also allow for controlled loss of BAM data
-
VCF: Variant Call Format
A gold standard for describing variants.
One locus per line, and it may contain more than one mutations, but most lines contain one variant only.
-
主要分為:
-
the header line
包括:CHROM POS ID REF ALT QUAL FILTER INFO
存在genotype時(shí),還有:FORMAT
-
the INFO line
用分號分隔的keys, 用等號定義可選的值
-
the genotypes
定義數(shù)據(jù)類型及順序
-
-
gVCF (Genomic VCF):
- the basic format specification is the same as for a regular VCF, but gVCF contains extra information.
- gVCF儲(chǔ)存了變異及非變異位點(diǎn)的測序信息,適用于臨床分析
2.3 Formats use different coordinate systems, which adds confusion
不同文件格式對染色體坐標(biāo)的起始位置定義不同,造成了一定的困擾
BED: 0-based, half-open
GFF: 1-based, fully closed
SAM: 1-based, fully closed
BAM: 0-based, half-open
VCF: 1-based, fully closed
…
3 Visualization of genomic data: IGV
強(qiáng)大的IGV
IGV: a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.
4 Coverage
4.1 What is coverage?
-
The depth of sequencing coverage can be defined theoretically as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length.
測序深度可理解為:(read長度/read數(shù)量)/單倍體基因組長度
The breadth of coverage is the percentage of target bases that have been sequenced for a given number of times.
-
The accuracy of variant calling is affected by sequence quality, uniformity of coverage and the threshold of false-discovery rate that is used.
What is variant calling?
Variant calling is the process by which we identify variants from sequence data (Figure 11).
- Carry out whole genome or whole exome sequencing to create FASTQ files.
- Align the sequences to a reference genome, creating BAM or CRAM files.
- Identify where the aligned reads differ from the reference genome and write to a VCF file.
Identify where the aligned reads differ from the reference genome and write to a VCF file.
Figure 11 A CRAM file aligned to a reference genomic region as visualised in Ensembl. Differences are highlighted in red in the reads, and will be called as variants.
reads中標(biāo)紅的堿基為variants.
4.2 Coverage: how many reads we need to cover the genome?
-
Depth of coverage model
鳥槍法測序中,genome size G, read size S, N reads
某個(gè)read出現(xiàn)在某個(gè)間隙(interval)的概率為:p=L/G
D: 出現(xiàn)在與read同等長度的間隙的read的數(shù)量
D ~ Binomial(N, L/G),D與N, p 符合二項(xiàng)分布
即:b(x; n, P) = nCx* Px * (1 - P) n-x**
? b(D; N, L/G) = NCD* (L/G)D * (1 - L/G) N-D**
Let L = S, D is also the number of reads that cover the last position of the interval → D is depth of coverage.
因?yàn)?S << G, N的值特別大,depth可近似理解為泊松分布:
D ~ Poisson(λ)
λ= SN/G (average depth of coverage) -
Fraction of genome that are covered
根據(jù)上述公式可以計(jì)算:
Given λ=40, the fraction of genome that are covered more than 30x (D>30) is: 0.938.
4.3 Overdispersion
-
Main cause of overdispersion
GC bias and other technical factors lead to systematic bias in coverage, resulting in overdispersion.
-
How to model overdispersion
Ideal situation (Poisson distrition):
Gamma-Poisson is equivalent to Negative Binomial, which is a commonly used model for dealing with overdispersion in count data:
過離散模型也適用于其他因素,如 biological noise
4.4 Question on coverage
Why do we need average 30-50x in a typical WGS experiment, and 100x in WES?
WGS is less biased compared to WES. We do not need as much depth to call a variant confidently. Check the following publications for more detailed information.
" Exome-seq achieves 95% SNP detection sensitivity at a mean on-target depth of 40 reads, whereas WGS only requires a mean of 14 reads. "
Article Variant detection sensitivity and biases in whole genome and...
"WGS is less biased compared to WES."的原因:WES的樣品首先需要經(jīng)過PCR
Why 30X WGS beats 100X WES for variant coverage
https://www.variantyx.com/variantyx-posts/why-30x-wgs-beats-100x-wes-for-variant-coverage/
5 General strategy for variant calling
5.1 Possible Genotypes
-
當(dāng)只有參考基因組時(shí),各種情形的概率均為1
P(reads|A/A, read mapped) = P(C observed|A/A, read mapped) = 1.0
P(reads|A/C, read mapped) = P(C observed|A/C, read mapped) = 1.0
P(reads|C/C, read mapped) = P(C observed|C/C, read mapped) = 1.0
-
假定error rate為0.01
-
當(dāng)?shù)?條read的該位點(diǎn)為C時(shí)
P(reads|A/A, read mapped) = 0.01
P(reads|A/C, read mapped) = 0.5
P(reads|C/C, read mapped) = 1 - 0.01 = 0.99
-
當(dāng)?shù)?條read的該位點(diǎn)為C時(shí)
P(reads|A/A, read mapped) = 0.012 = 0.0001
P(reads|A/C, read mapped) = 0.52 = 0.25
P(reads|C/C, read mapped) = 0.992 = 0.9801
-
當(dāng)?shù)?條read的該位點(diǎn)為C時(shí)
P(reads|A/A, read mapped) = 0.013 = 0.000001
P(reads|A/C, read mapped) = 0.53 = 0.125
P(reads|C/C, read mapped) = 0.993 = 0.970299
-
當(dāng)?shù)?條read的該位點(diǎn)為A時(shí)
P(reads|A/A, read mapped) = 0.013 * 0.99 = 0.00000099
P(reads|A/C, read mapped) = 0.54 = 0.0625
P(reads|C/C, read mapped) = 0.993 * 0.01 = 0.00970299
-
當(dāng)?shù)?條read的該位點(diǎn)為A
P(reads|A/A, read mapped) = 0.013 * 0.992 = 0.00000098
P(reads|A/C, read mapped) = 0.55 = 0.03125
P(reads|C/C, read mapped) = 0.993 * 0.012 = 0.0000970299 ≈ 0.000097
總結(jié)出貝葉斯公式:

Combine these likelihoods with a prior incorporating information from other individuals and flanking sites to assign a genotype.
5.2 From Sequence to Genotype: Individual Based Prior
Individual Based Prior: Evry site has 1/1000 probability of varying.
Ingredients That Go Into Prior
Most sites don’t vary
P(non-reference base) ~ 0.001
When a site does vary, it is usually heterozygous
P(non-reference heterozygote) ~ 0.001 * 2/3
P(non-reference homozygote) ~ 0.001 * 1/3Mutation model
Transitions account for most variants (C?T or A?G)
Transversions account for minority of variantshttps://pdfs.semanticscholar.org/0156/04c3ba76cf247a8010f74ec1386e58ceb530.pdf
因此,各位點(diǎn)的先驗(yàn)概率為:
Prior(A/A) = 0.001 * 1/3 ≈ 0.00033 0.00034?
Prior(A/C) = 0.001 * 2/3 ≈ 0.00067 0.00066?
Prior(C/C) = 1 - 0.001 = 0.999
琢磨了兩個(gè)小時(shí)并把google搜索翻到了第十頁,依然沒看懂這里的后驗(yàn)概率是怎么算的,如果有統(tǒng)計(jì)大神看到這里,求賜教??

5.3 From Sequence to Genotype: Population Based Prior
Population Based Prior: Use frequency information from examining others at the same site. In the example above, we estimated P(A) = 0.20
因此,各位點(diǎn)的先驗(yàn)概率為:
Prior(A/A) = 0.2 * 0.2 = 0.04
Prior(A/C) = 1 - 0.04 - 0.64 = 0.32
Prior(C/C) = 0.8 * 0.8 = 0.64
同樣的迷惑:

5.4 Sequence Based Genotype Calls
- Individual Based Prior
- Assumes all sites have an equal probability of showing polymorphism
- Specifically, assumption is that about 1/1000 bases differ from reference
- If reads where error free and sampling Poisson …
- … 14x coverage would allow for 99.8% genotype accuracy
- … 30x coverage of the genome needed to allow for errors and clustering
- Population Based Prior
- Uses frequency information obtained from examining other individuals
- Calling very rare polymorphisms still requires 20-30x coverage of the genome
- Calling common polymorphisms requires much less data
https://pdfs.semanticscholar.org/0156/04c3ba76cf247a8010f74ec1386e58ceb530.pdf
最后,向大家隆重推薦生信技能樹的一系列干貨!
- 生信技能樹全球公益巡講:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
- B站公益74小時(shí)生信工程師教學(xué)視頻合輯:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
- 招學(xué)徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw







