本文共分3部分。
首先介紹一下copyKAT原理和使用方法。
然后提出一個(gè)關(guān)于細(xì)胞過(guò)濾的問(wèn)題。
最后提出一個(gè)解決問(wèn)題的方法供參考。
1.copyKAT及使用簡(jiǎn)介
copyKAT是一種推斷腫瘤細(xì)胞CNV的方法,和inferCNV類似。文章于2021年1月發(fā)在nature biotechnology,鏈接見(jiàn)文末。也是大名鼎鼎的期刊。其原理同樣是通過(guò)單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)來(lái)推斷細(xì)胞的染色體倍數(shù),進(jìn)而推斷是正常細(xì)胞(diploid)還是腫瘤細(xì)胞(aneuploid)。
1.1原理及工作流程圖

(a)copyKAT首先對(duì)基因表達(dá)量做標(biāo)準(zhǔn)化并穩(wěn)定其方差。
(b)它可以自動(dòng)尋找diploid cells作為正常細(xì)胞。
(c)對(duì)每個(gè)非正常細(xì)胞,利用MCMC尋找其CNV的斷點(diǎn)(breakpoints)并得到segments。
(d)正常細(xì)胞和腫瘤細(xì)胞由于其基因表達(dá)量分布的不同可以被分開。
(e)腫瘤細(xì)胞通常還可以繼續(xù)聚類得到其亞群。
1.2正常運(yùn)行示例
我們首先導(dǎo)入示例數(shù)據(jù)運(yùn)行一下,用文章里的一個(gè)TNBC1數(shù)據(jù)(下載鏈接:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4476486)。運(yùn)行非常簡(jiǎn)單,缺點(diǎn)是慢,且windows只能用n.cores=1來(lái)運(yùn)行,建議使用服務(wù)器來(lái)做。我用16G內(nèi)存電腦跑了一次5000個(gè)細(xì)胞花費(fèi)了8h。4000個(gè)細(xì)胞5h。
exp.rawdata <- read.table("./copykat/data/GSM4476486_filtered_UMIcount_TNBC1.txt", header=T, sep='\t', check.names = F)
copykat.test <- copykat(rawmat=exp.rawdata,
id.type="S",
cell.line="no",
ngene.chr=5,
win.size=25,
KS.cut=0.15,
sam.name="TNBC1",
distance="euclidean",
n.cores=4)
pred.test <- data.frame(copykat.test$prediction)
CNA.test <- data.frame(copykat.test$CNAmat)
看一下結(jié)果,主要有兩個(gè),
1)預(yù)測(cè)的結(jié)果正常細(xì)胞(diploid)還是腫瘤細(xì)胞(aneuploid),
2) 每個(gè)CNV segment在每個(gè)細(xì)胞的表達(dá)量。
這里看出和inferCNV的不同來(lái)了,是基于genomic coordinate而不是gene level的表達(dá)量。。
head(pred.test)
cell.names copykat.pred
#AAACCTGCACCTTGTC AAACCTGCACCTTGTC aneuploid
#AAACGGGAGTCCTCCT AAACGGGAGTCCTCCT diploid
#AAACGGGTCCAGAGGA AAACGGGTCCAGAGGA aneuploid
#AAAGATGCAGTTTACG AAAGATGCAGTTTACG aneuploid
#AAAGCAACAGGAATGC AAAGCAACAGGAATGC aneuploid
#AAAGCAATCGGAATCT AAAGCAATCGGAATCT aneuploid
head(CNA.test[,1:5])
chrom chrompos abspos AAACCTGCACCTTGTC AAACGGGAGTCCTCCT
#1 1 1042457 1042457 -0.03206638 0.03170166
#2 1 1265484 1265484 -0.03206638 0.03170166
#3 1 1519859 1519859 -0.03206638 0.03170166
#4 1 1826619 1826619 -0.03206638 0.03170166
#5 1 2058465 2058465 -0.03206638 0.03170166
#6 1 2280372 2280372 -0.03206638 0.03170166
畫個(gè)熱圖看看。很明顯正常細(xì)胞和腫瘤細(xì)胞分開了。
my_palette <- colorRampPalette(rev(RColorBrewer::brewer.pal(n = 3, name = "RdBu")))(n = 999)
chr <- as.numeric(CNA.test$chrom) %% 2+1
rbPal1 <- colorRampPalette(c('black','grey'))
CHR <- rbPal1(2)[as.numeric(chr)]
chr1 <- cbind(CHR,CHR)
rbPal5 <- colorRampPalette(RColorBrewer::brewer.pal(n = 8, name = "Dark2")[2:1])
com.preN <- pred.test$copykat.pred
pred <- rbPal5(2)[as.numeric(factor(com.preN))]
cells <- rbind(pred,pred)
col_breaks = c(seq(-1,-0.4,length=50),seq(-0.4,-0.2,length=150),seq(-0.2,0.2,length=600),seq(0.2,0.4,length=150),seq(0.4, 1,length=50))
heatmap.3(t(CNA.test[,4:ncol(CNA.test)]),dendrogram="r", distfun = function(x) parallelDist::parDist(x,threads =4, method = "euclidean"),
hclustfun = function(x) hclust(x, method="ward.D2"),
ColSideColors=chr1,RowSideColors=cells,Colv=NA, Rowv=TRUE,
notecol="black",col=my_palette,breaks=col_breaks, key=TRUE,
keysize=1, density.info="none", trace="none",
cexRow=0.1,cexCol=0.1,cex.main=1,cex.lab=0.1,
symm=F,symkey=F,symbreaks=T,cex=1, cex.main=4, margins=c(10,10))
legend("topright", paste("pred.",names(table(com.preN)),sep=""), pch=15,col=RColorBrewer::brewer.pal(n = 8, name = "Dark2")[2:1], cex=0.6, bty="n")

再對(duì)腫瘤細(xì)胞再聚類并畫熱圖,又能分成兩群。
tumor.cells <- pred.test$cell.names[which(pred.test$copykat.pred=="aneuploid")]
tumor.mat <- CNA.test[, which(colnames(CNA.test) %in% tumor.cells)]
hcc <- hclust(parallelDist::parDist(t(tumor.mat),threads =4, method = "euclidean"), method = "ward.D2")
hc.umap <- cutree(hcc,2)
rbPal6 <- colorRampPalette(RColorBrewer::brewer.pal(n = 8, name = "Dark2")[3:4])
subpop <- rbPal6(2)[as.numeric(factor(hc.umap))]
cells <- rbind(subpop,subpop)
heatmap.3(t(tumor.mat),dendrogram="r", distfun = function(x) parallelDist::parDist(x,threads =4, method = "euclidean"),
hclustfun = function(x) hclust(x, method="ward.D2"),
ColSideColors=chr1,RowSideColors=cells,Colv=NA, Rowv=TRUE,
notecol="black",col=my_palette,breaks=col_breaks, key=TRUE,
keysize=1, density.info="none", trace="none",
cexRow=0.1,cexCol=0.1,cex.main=1,cex.lab=0.1,
symm=F,symkey=F,symbreaks=T,cex=1, cex.main=4, margins=c(10,10))
legend("topright", c("c1","c2"), pch=15,col=RColorBrewer::brewer.pal(n = 8, name = "Dark2")[3:4], cex=0.9, bty='n')

最后把CNV的結(jié)果投射到單細(xì)胞聚類結(jié)果上看一看是否合理,Seurat標(biāo)準(zhǔn)流程走一遍,聚類結(jié)果和copyKAT分群結(jié)果投射到TSNE上。
standard10X = function(dat,nPCs=50,res=1.0,verbose=FALSE){
srat = CreateSeuratObject(dat)
srat = NormalizeData(srat,verbose=verbose)
srat = ScaleData(srat,verbose=verbose)
srat = FindVariableFeatures(srat,verbose=verbose)
srat = RunPCA(srat,verbose=verbose)
srat = RunTSNE(srat,dims=seq(nPCs),verbose=verbose)
srat = FindNeighbors(srat,dims=seq(nPCs),verbose=verbose)
srat = FindClusters(srat,res=res,verbose=verbose)
return(srat)
}
TNBC1 <- standard10X(exp.rawdata, nPCs=30, res=0.6)
TNBC1@meta.data$copykat.pred <- pred.test$copykat.pred
TNBC1@meta.data$copykat.tumor.pred <- rep("normal", nrow(TNBC1@meta.data))
TNBC1@meta.data$copykat.tumor.pred[rownames(TNBC1@meta.data) %in% names(hc.umap[hc.umap==1])] <- "tumor cluster 1"
TNBC1@meta.data$copykat.tumor.pred[rownames(TNBC1@meta.data) %in% names(hc.umap[hc.umap==2])] <- "tumor cluster 2"
p1 <- DimPlot(TNBC1, label = T)
p2 <- DimPlot(TNBC1, group.by = "copykat.pred")
p3 <- DimPlot(TNBC1, group.by = "copykat.tumor.pred")
p1 + p2 + p3

從免疫細(xì)胞和腫瘤細(xì)胞的標(biāo)記基因表達(dá)來(lái)看,copyKAT可以正確找出正常細(xì)胞和腫瘤細(xì)胞。
FeaturePlot(TNBC1,features=c("PTPRC", "EPCAM"), order = T)

最后作者提到一個(gè)需要注意的點(diǎn),不是所有的腫瘤都存在CNV。兒童腫瘤和血液腫瘤中基本沒(méi)有copy number event,所以是不適合用這些方法(copyKAT或inferCNV)來(lái)尋找腫瘤細(xì)胞的。
2. 錯(cuò)誤示例
此處不放具體數(shù)據(jù)。遇到這個(gè)問(wèn)題的人一看就懂,遇不到的就更不用看了。說(shuō)明你的數(shù)據(jù)很魯棒,棒的根本過(guò)濾不掉啊。
簡(jiǎn)單說(shuō)就是我從seurat結(jié)果拿來(lái)已經(jīng)過(guò)濾好的100個(gè)細(xì)胞,我希望這100個(gè)細(xì)胞都被推斷出腫瘤細(xì)胞和非腫瘤細(xì)胞,
但是這個(gè)包會(huì)把我按照自己的標(biāo)準(zhǔn)選好的細(xì)胞再過(guò)濾一遍,導(dǎo)致不到100個(gè)細(xì)胞,然后最后就沒(méi)辦法和我們?cè)瓉?lái)seurat畫出來(lái)的tsne聚類圖很好的overlap。
如下,我輸入了4422個(gè)細(xì)胞,希望不要過(guò)濾全部推斷CNV.

但是copyKAT之后只剩下4364個(gè)細(xì)胞被用來(lái)做后續(xù)推斷,這不是我想要的結(jié)果。我試圖改參數(shù),發(fā)現(xiàn)所有參數(shù)改到最低0,也無(wú)法保證所有細(xì)胞被保留。于是嘗試去看了源代碼。。。發(fā)現(xiàn)代碼中是有幾處過(guò)濾的。見(jiàn)下一部分。


3.解決方案
我找到了源代碼,將過(guò)濾細(xì)胞的代碼行注釋掉或刪除即可完全保留細(xì)胞。
這里說(shuō)明下(注釋掉或刪除)。建議注釋,這是一個(gè)好習(xí)慣,可以保留代碼修改記錄,方便以后修改和理解。
有人可能還不知道如何查看源碼,對(duì)著要看的函數(shù)按下F2即可。
過(guò)濾的地方有3處。如下,找到這三處注釋掉,運(yùn)行下函數(shù)就可以保留全部細(xì)胞了,不信你試試?!供大家參考。不足之處,歡迎討論學(xué)習(xí)!
代碼塊1
# genes.raw <- apply(rawmat, 2, function(x) (sum(x > 0)))
# if (sum(genes.raw > 200) == 0)
# stop("none cells have more than 200 genes")
# if (sum(genes.raw < 100) > 1) {
# rawmat <- rawmat[, -which(genes.raw < 200)]
# print(paste("filtered out ", sum(genes.raw <= 200),
# " cells with less than 200 genes; remaining ", ncol(rawmat),
# " cells", sep = ""))
# }
代碼塊1
# if (length(toRev) > 0) {
# anno.mat <- anno.mat[-toRev, ]
# }
# ToRemov2 <- NULL
# for (i in 8:ncol(anno.mat)) {
# cell <- cbind(anno.mat$chromosome_name, anno.mat[, i])
# cell <- cell[cell[, 2] != 0, ]
# if (length(as.numeric(cell)) < 5) {
# rm <- colnames(anno.mat)[i]
# ToRemov2 <- c(ToRemov2, rm)
# }
# else if (length(rle(cell[, 1])$length) < 23 | min(rle(cell[,
# 1])$length) < ngene.chr) {
# rm <- colnames(anno.mat)[i]
# ToRemov2 <- c(ToRemov2, rm)
# }
# i <- i + 1
# }
# if (length(ToRemov2) == (ncol(anno.mat) - 7))
# stop("all cells are filtered")
# if (length(ToRemov2) > 0) {
# anno.mat <- anno.mat[, -which(colnames(anno.mat) %in%
# ToRemov2)]
# }
代碼塊2
# ToRemov3 <- NULL
# for (i in 8:ncol(anno.mat2)) {
# cell <- cbind(anno.mat2$chromosome_name, anno.mat2[,
# i])
# cell <- cell[cell[, 2] != 0, ]
# if (length(as.numeric(cell)) < 5) {
# rm <- colnames(anno.mat2)[i]
# ToRemov3 <- c(ToRemov3, rm)
# }
# else if (length(rle(cell[, 1])$length) < 23 | min(rle(cell[,
# 1])$length) < ngene.chr) {
# rm <- colnames(anno.mat2)[i]
# ToRemov3 <- c(ToRemov3, rm)
# }
# i <- i + 1
# }
# if (length(ToRemov3) == ncol(norm.mat.relat))
# stop("all cells are filtered")
# if (length(ToRemov3) > 0) {
# norm.mat.relat <- norm.mat.relat[, -which(colnames(norm.mat.relat) %in%
# ToRemov3)]
# }
References:
https://www.nature.com/articles/s41587-020-00795-2#data-availability
https://github.com/navinlabcode/copykat
copyKAT推斷單細(xì)胞轉(zhuǎn)錄組腫瘤細(xì)胞CNV