用R語(yǔ)言對(duì)vcf文件進(jìn)行數(shù)據(jù)挖掘.10 利用vcf信息判斷物種染色體倍數(shù)

目錄

  1. 前言
  2. 方法簡(jiǎn)介
  3. 從vcf文件里提取有用信息
  4. tidy vcfR
  5. vcf可視化1
  6. vcf可視化2
  7. 測(cè)序深度覆蓋度
  8. 窗口縮放
  9. 如何單獨(dú)分離染色體
  10. 利用vcf信息判斷物種染色體倍數(shù)
  11. CNV分析

vcf數(shù)據(jù)包含了所有的等位對(duì)立基因的信息,這樣就可以幫助我們判斷染色體的倍數(shù)。比方說有一個(gè)位點(diǎn)的堿基是A/T,測(cè)序覆蓋率為20, 如果這個(gè)物種是二倍體,那么A,T的出現(xiàn)概率就是(50%),會(huì)各自出現(xiàn)10次,如果是3倍體,那么A會(huì)出現(xiàn)13次,T會(huì)出現(xiàn)7次,當(dāng)然也有可能相反。當(dāng)把所有的點(diǎn)位集合在一起的時(shí)候,我們就可以判斷這個(gè)物種的倍數(shù)體了。

數(shù)據(jù)導(dǎo)入

用包里的自帶數(shù)據(jù),有疑問的小盆友可以查閱之前的文章,這里就不做贅述了。

# Load libraries
library(vcfR)
library(pinfsc50)

# Determine file locations
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz",
                        package = "pinfsc50")

# Read data into memory
vcf <- read.vcfR(vcf_file, verbose = FALSE)
vcf
## ***** Object of Class vcfR *****
## 18 samples
## 1 CHROMs
## 22,031 variants
## Object size: 20.9 Mb
## 7.929 percent missing data
## *****        *****         *****

等位對(duì)立基因平衡

高通量數(shù)據(jù)測(cè)序可以保證每一個(gè)位點(diǎn)都經(jīng)過很多次的讀取,這樣就相當(dāng)于每一個(gè)等位基因都被測(cè)序過了差不多相等的次數(shù)。假設(shè)我們對(duì)一個(gè)二倍雜合體進(jìn)行了覆蓋率為30的測(cè)序,那么每一條染色體都被測(cè)了15次。當(dāng)然真實(shí)情況不可能正好是這個(gè)數(shù)字,畢竟測(cè)序的時(shí)候會(huì)發(fā)生一定概率的錯(cuò)誤。
假設(shè)我們用覆蓋率為30給一個(gè)三倍雜合體進(jìn)行測(cè)序,某基因位點(diǎn)為A/A/T,那么,A和T出現(xiàn)的期待值將是20和10。當(dāng)某個(gè)基因位點(diǎn)的組合是A/G/C時(shí),那么A,G,C就會(huì)各自出現(xiàn)10次。

knitr::kable(vcf@gt[c(1:2,11,30),1:4])
FORMAT  BL2009P4_us23   DDR7602 IN2009T1_us22
GT:AD:DP:GQ:PL  1|1:0,7:7:21:283,21,0   1|1:0,6:6:18:243,18,0   1|1:0,8:8:24:324,24,0
GT:AD:DP:GQ:PL  0|0:12,0:12:36:0,36,427 0|0:20,0:20:60:0,60,819 0|0:16,0:16:48:0,48,650
GT:AD:DP:GQ:PL  0|1:17,12:29:99:453,0,667   0|0:32,0:32:96:0,96,1433    0|0:40,0:40:99:0,120,1765
GT:AD:DP:GQ:PL  1|0:7,4:11:99:130,0,260 0|0:16,0:16:48:0,48,677 0|0:26,0:26:78:0,78,1073

FORAMT里的AD表示對(duì)立基因的各自出現(xiàn)的次數(shù)。所以我們可以提取AD數(shù)據(jù)。

ad <- extract.gt(vcf, element = 'AD')

一般的SNP Caller都會(huì)默認(rèn)雙倍體檢驗(yàn),也就是出現(xiàn)兩種對(duì)立基因型。所以可以計(jì)算每種基因的出現(xiàn)概率。

allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)

然后用直方圖可視化一下。

hist(ad2[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#1f78b4", xaxt="n")
hist(ad1[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#a6cee3", add = TRUE)
axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,"1/4","1/3","1/2","1/3","3/4",1))

可以發(fā)現(xiàn),大多數(shù)都是純合,所以需要去掉純合的部分。

gt <- extract.gt(vcf, element = 'GT')
hets <- is_het(gt)

is.na( ad[ !hets ] ) <- TRUE

allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)

ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)

hist(ad2[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#1f78b4", xaxt="n")
hist(ad1[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#a6cee3", add = TRUE)
axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,"1/4","1/3","1/2","1/3","3/4",1))

我們發(fā)現(xiàn)峰值出現(xiàn)在了1/2,說明這個(gè)物種時(shí)二倍體,和預(yù)期的一樣。
然而這里有一個(gè)小小的問題,F(xiàn)equency幾乎從0到1橫跨整個(gè)橫坐標(biāo),這個(gè)明顯不合理,需要進(jìn)行改善。

根據(jù)等位對(duì)立測(cè)序深度進(jìn)行改善

我們可以通過等位對(duì)立深度(AD)的信息來改善剛才提到的問題。

ad <- extract.gt(vcf, element = 'AD')
#ad[1:3,1:4]

allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)

# Subset to a vector for manipulation.
tmp <- allele1[,"P17777us22"]
#sum(tmp == 0, na.rm = TRUE)
#tmp <- tmp[tmp > 0]
tmp <- tmp[tmp <= 100]

hist(tmp, breaks=seq(0,100,by=1), col="#808080", main = "P17777us22")

sums <- apply(allele1, MARGIN=2, quantile, probs=c(0.15, 0.95), na.rm=TRUE)
sums[,"P17777us22"]
## 15% 95% 
##  19  75
abline(v=sums[,"P17777us22"], col=2, lwd=2)

我們可以看到80%的數(shù)據(jù)分布在了19和75之間。然后再靠近40和60點(diǎn)的地方出現(xiàn)了兩個(gè)峰,這分別代表雜合峰和純合峰。然后整個(gè)數(shù)據(jù)還拖著一個(gè)尾巴,最長(zhǎng)的地方超過了100,這表示部分區(qū)域包含了著非常高的拷貝數(shù)(CNVs)。此處的目的是為了可視化倍數(shù)體,所以選擇100以下15%~95%的數(shù)據(jù)。

tmp <- allele2[,"P17777us22"]
tmp <- tmp[tmp>0]
tmp <- tmp[tmp<=100]

hist(tmp, breaks=seq(1,100,by=1), col="#808080", main="P17777us22")

sums[,"P17777us22"]
## 15% 95% 
##  19  75
abline(v=sums[,"P17777us22"], col=2, lwd=2)

回想一下之前文章里介紹過的用箱圖做可視化的內(nèi)容,我們也可以通過同樣的方法來確認(rèn)過濾數(shù)據(jù)的效果。

#vcf <- extract.indels(vcf)
#gq <- extract.gt(vcf, element = 'GQ', as.numeric = TRUE)
#vcf@gt[,-1][ gq < 99 ] <- NA

ad <- extract.gt(vcf, element = 'AD')
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
boxplot(allele1, las=3)
#hist(allele1[,"P17777us22"], ylim=c(0,2000),
#     breaks = seq(0,max(allele1[,"P17777us22"], na.rm = TRUE),by=5),
#     xlim=c(0,100))

# Subset to a vector for manipulation.
#tmp <- allele1[,"P17777us22"]
#tmp <- tmp[tmp <= 100]
#hist(tmp, breaks=seq(0,100,by=1), col="#808080")
sums <- apply(allele1, MARGIN=2, quantile, probs=c(0.15, 0.95), na.rm=TRUE)
# Allele 1
dp2 <- sweep(allele1, MARGIN=2, FUN = "-", sums[1,])
#allele1[dp2 < 0] <- NA
vcf@gt[,-1][ dp2 < 0 & !is.na(vcf@gt[,-1]) ] <- NA
dp2 <- sweep(allele1, MARGIN=2, FUN = "-", sums[2,])
#allele1[dp2 > 0] <- NA
vcf@gt[,-1][dp2 > 0] <- NA
# Allele 2
dp2 <- sweep(allele2, MARGIN=2, FUN = "-", sums[1,])
vcf@gt[,-1][ dp2 < 0 & !is.na(vcf@gt[,-1]) ] <- NA
dp2 <- sweep(allele2, MARGIN=2, FUN = "-", sums[2,])
vcf@gt[,-1][dp2 > 0] <- NA

# Hard filter
#dp[dp < 4] <- NA
#vcf@gt[,-1][allele1 < 8] <- NA
#

#adp <- adp[adp<=100]
#adp <- adp[adp>=sums[,"P17777us22"][1]]
#adp <- adp[adp<=sums[,"P17777us22"][2]]

#hist(adp, breaks=seq(0, max(adp, na.rm = TRUE )+1, by=1), col="#C0C0C0", xlim = c(0,100))
#axis(side=1,at=1:4*10)
#abline(v=sums[,"P17777us22"], col=2, lwd = 2)

#adp <- allele2[,"P17777us22"]
#adp <- adp[adp>0]
#adp <- adp[adp<=100]
#hist(adp, breaks=seq(0, max(adp, na.rm = TRUE), by=1), col="#C0C0C0")
#par(mfrow=c(1,1))
#abline(v=sums[,"P17777us22"], col=2, lwd = 2)

看一下過濾后的結(jié)果。

ad <- extract.gt(vcf, element = 'AD')
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
boxplot(allele1, las=3)

果然好看很多。
最后再回到一開始,看倍數(shù)體的可視化效果。

gt <- extract.gt(vcf, element = 'GT')
hets <- is_het(gt)
is.na( ad[ !hets ] ) <- TRUE

allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)

ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)

hist(ad2[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#1f78b4", xaxt="n", main="P17777us22")
hist(ad1[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#a6cee3", add = TRUE)
axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,"1/4","1/3","1/2","2/3","3/4",1))

結(jié)果明顯干凈易懂好多。
有同學(xué)會(huì)問,那么不是二倍體的話會(huì)出現(xiàn)什么樣的結(jié)果呢。數(shù)據(jù)包的樣本里正好有一個(gè)三倍體。



可以看到兩個(gè)峰出現(xiàn)在了1/3,2/3處。結(jié)果和實(shí)際完美匹配。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容