在本教程中,我們將學(xué)習(xí)使用Signac包對(duì)多樣本的scATAC-seq數(shù)據(jù)進(jìn)行整合分析。這里,我們對(duì)來(lái)自10x Genomics和sci-ATAC-seq技術(shù)測(cè)序的成年小鼠大腦的多個(gè)單細(xì)胞ATAC-seq數(shù)據(jù)集進(jìn)行了整合分析。
其中,10x Genomics平臺(tái)產(chǎn)生的原始數(shù)據(jù)可從官網(wǎng)下載:
- The Raw data
- The Metadata
- The fragments file
- The fragments file index
sci-ATAC-seq技術(shù)產(chǎn)生的數(shù)據(jù)集由Cusanovich和Hill等人生成。原始數(shù)據(jù)可從作者的網(wǎng)站下載:
我們將演示使用Seurat v3中的數(shù)據(jù)整合方法(dataset integration and label transfer)對(duì)多個(gè)scATAC-seq數(shù)據(jù)集進(jìn)行整合分析,以及使用harmony包進(jìn)行數(shù)據(jù)整合。
安裝并加載所需的R包
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("rtracklayer")
library(devtools)
install_github("immunogenomics/harmony")
library(Signac)
library(Seurat)
library(patchwork)
set.seed(1234)
加載數(shù)據(jù)集并構(gòu)建Seurat對(duì)象
# this object was created following the mouse brain vignette
# 加載10x Genomics的小鼠大腦的scATAC-seq數(shù)據(jù)
tenx <- readRDS(file = "/home/dongwei/scATAC-seq/brain/10x/adult_mouse_brain.rds")
tenx$tech <- '10x'
tenx$celltype <- Idents(tenx)
# 加載sci-ATAC-seq的小鼠大腦數(shù)據(jù)
sci.metadata <- read.table(
file = "/home/dongwei/scATAC-seq/brain/sci/cell_metadata.txt",
header = TRUE,
row.names = 1,
sep = "\t"
)
# subset to include only the brain data
sci.metadata <- sci.metadata[sci.metadata$tissue == 'PreFrontalCortex', ]
sci.counts <- readRDS(file = "/home/dongwei/scATAC-seq/brain/sci/atac_matrix.binary.qc_filtered.rds")
sci.counts <- sci.counts[, rownames(x = sci.metadata)]
數(shù)據(jù)預(yù)處理
在上述兩個(gè)數(shù)據(jù)集中,sci-ATAC-seq數(shù)據(jù)是比對(duì)到小鼠mm9參考基因組的,而10x的數(shù)據(jù)是比對(duì)到小鼠mm10參考基因組的,因此這兩個(gè)數(shù)據(jù)集中peaks的基因組坐標(biāo)信息是不同的。我們可以使用rtracklayer包將mm9參考基因組的坐標(biāo)信息轉(zhuǎn)換到mm10中,并使用mm10的坐標(biāo)更換sci-ATAC-seq數(shù)據(jù)中peaks的坐標(biāo),其中l(wèi)iftover轉(zhuǎn)換的chain文件可以從UCSC官網(wǎng)進(jìn)行下載。
# 將peaks坐標(biāo)信息轉(zhuǎn)換成GRanges格式
sci_peaks_mm9 <- StringToGRanges(regions = rownames(sci.counts), sep = c("_", "_"))
# 導(dǎo)入mm9ToMm10.over.chain文件
mm9_mm10 <- rtracklayer::import.chain("/home/dongwei/data/liftover/mm9ToMm10.over.chain")
# 使用rtracklayer包中的liftOver函數(shù)轉(zhuǎn)換坐標(biāo)信息
sci_peaks_mm10 <- rtracklayer::liftOver(x = sci_peaks_mm9, chain = mm9_mm10)
names(sci_peaks_mm10) <- rownames(sci.counts)
# discard any peaks that were mapped to >1 region in mm10
correspondence <- S4Vectors::elementNROWS(sci_peaks_mm10)
sci_peaks_mm10 <- sci_peaks_mm10[correspondence == 1]
sci_peaks_mm10 <- unlist(sci_peaks_mm10)
sci.counts <- sci.counts[names(sci_peaks_mm10), ]
# rename peaks with mm10 coordinates
rownames(sci.counts) <- GRangesToString(grange = sci_peaks_mm10)
# create Seurat object and perform some basic QC filtering
# 構(gòu)建Seurat對(duì)象
sci <- CreateSeuratObject(
counts = sci.counts,
meta.data = sci.metadata,
assay = 'peaks',
project = 'sci'
)
# 數(shù)據(jù)過(guò)濾
sci.ds <- sci[, sci$nFeature_peaks > 2000 & sci$nCount_peaks > 5000 & !(sci$cell_label %in% c("Collisions", "Unknown"))]
sci$tech <- 'sciATAC'
# 使用RunTFIDF函數(shù)進(jìn)行數(shù)據(jù)歸一化
sci <- RunTFIDF(sci)
# 使用FindTopFeatures函數(shù)提取高變異的peaks
sci <- FindTopFeatures(sci, min.cutoff = 50)
# 使用RunSVD函數(shù)進(jìn)行線性降維
sci <- RunSVD(sci, n = 30, reduction.name = 'lsi', reduction.key = 'LSI_')
# 使用RunUMAP函數(shù)進(jìn)行非線性降維
sci <- RunUMAP(sci, reduction = 'lsi', dims = 2:30)
現(xiàn)在,我們構(gòu)建好了兩個(gè)scATAC-seq對(duì)象,并且它們都含有基于相同的mm10參考基因組坐標(biāo)系統(tǒng)得到的peaks信息。但是,由于這兩個(gè)實(shí)驗(yàn)都單獨(dú)進(jìn)行了peak calling,因此這兩個(gè)數(shù)據(jù)集中得到的peaks坐標(biāo)不太可能完全重疊。為了在我們要整合的數(shù)據(jù)集中具有共同的特征,我們可以基于10x Genomics數(shù)據(jù)集對(duì)sci-ATAC-seq中peaks的reads進(jìn)行計(jì)數(shù),并使用這些計(jì)數(shù)創(chuàng)建一個(gè)新的assay。
# find peaks that intersect in both datasets
# 使用GetIntersectingFeatures函數(shù)提取兩個(gè)數(shù)據(jù)集中重疊的peak區(qū)域
intersecting.regions <- GetIntersectingFeatures(
object.1 = sci,
object.2 = tenx,
sep.1 = c("-", "-"),
sep.2 = c(":", "-")
)
# choose a subset of intersecting peaks
peaks.use <- sample(intersecting.regions[[1]], size = 10000, replace = FALSE)
# count fragments per cell overlapping the set of peaks in the 10x data
# 使用FeatureMatrix函數(shù)對(duì)peaks中的reads進(jìn)行計(jì)數(shù)
sci_peaks_tenx <- FeatureMatrix(
fragments = GetFragments(object = tenx, assay = 'peaks'),
features = StringToGRanges(peaks.use),
cells = colnames(tenx)
)
# create a new assay and add it to the 10x dataset
# 使用CreateAssayObject函數(shù)新建一個(gè)assay對(duì)象
tenx[['sciPeaks']] <- CreateAssayObject(counts = sci_peaks_tenx, min.cells = 1)
# 數(shù)據(jù)歸一化
tenx <- RunTFIDF(object = tenx, assay = 'sciPeaks')
多樣本scATAC-seq數(shù)據(jù)集的整合
在進(jìn)行數(shù)據(jù)整合之前,我們最好先檢查下是否存在數(shù)據(jù)集特異的差異,并將其刪除。如果沒(méi)有,我們可以簡(jiǎn)單地將多個(gè)對(duì)象進(jìn)行合并而不執(zhí)行整合。在本示例中,由于使用不同的測(cè)序技術(shù),兩個(gè)數(shù)據(jù)集之間存在很大的差異。我們可以使用Seurat v3中的數(shù)據(jù)整合方法來(lái)消除這種影響。
# 先簡(jiǎn)單的將兩個(gè)數(shù)據(jù)集進(jìn)行合并看一下聚類的效果
# Look at the data without integration first
# 使用MergeWithRegions函數(shù)將兩個(gè)數(shù)據(jù)對(duì)象進(jìn)行合并
unintegrated <- MergeWithRegions(
object.1 = sci,
object.2 = tenx,
assay.1 = 'peaks',
assay.2 = 'sciPeaks',
sep.1 = c("-", "-"),
sep.2 = c("-", "-")
)
# 對(duì)合并后數(shù)據(jù)進(jìn)行歸一化,特征選擇和降維可視化
unintegrated <- RunTFIDF(unintegrated)
unintegrated <- FindTopFeatures(unintegrated, min.cutoff = 50)
unintegrated <- RunSVD(unintegrated, n = 30, reduction.name = 'lsi', reduction.key = 'LSI_')
unintegrated <- RunUMAP(unintegrated, reduction = 'lsi', dims = 2:30)
p1 <- DimPlot(unintegrated, group.by = 'tech', pt.size = 0.1) + ggplot2::ggtitle("Unintegrated")
# 使用Seurat v3的數(shù)據(jù)整合方法進(jìn)行數(shù)據(jù)集的整合
# find integration anchors between 10x and sci-ATAC
# 使用FindIntegrationAnchors函數(shù)識(shí)別共享的整合anchors
anchors <- FindIntegrationAnchors(
object.list = list(tenx, sci),
anchor.features = peaks.use,
assay = c('sciPeaks', 'peaks'),
k.filter = NA
)
# integrate data and create a new merged object
# 使用IntegrateData函數(shù)根據(jù)識(shí)別的anchors進(jìn)行數(shù)據(jù)整合
integrated <- IntegrateData(
anchorset = anchors,
weight.reduction = sci[['lsi']],
dims = 2:30,
preserve.order = TRUE
)
# we now have a "corrected" TF-IDF matrix, and can run LSI again on this corrected matrix
# 對(duì)整合后的數(shù)據(jù)進(jìn)行降維可視化
integrated <- RunSVD(
object = integrated,
n = 30,
reduction.name = 'integratedLSI'
)
integrated <- RunUMAP(
object = integrated,
dims = 2:30,
reduction = 'integratedLSI'
)
p2 <- DimPlot(integrated, group.by = 'tech', pt.size = 0.1) + ggplot2::ggtitle("Integrated")
p1 + p2
Label transfer
我們還可以使用Seurat v3中的Label transfer方法進(jìn)行數(shù)據(jù)集的整合,它將數(shù)據(jù)從一個(gè)query數(shù)據(jù)集映射到另一個(gè)reference數(shù)據(jù)集中。在這里,我們通過(guò)將細(xì)胞類型標(biāo)簽從10x Genomics scATAC-seq數(shù)據(jù)映射到到sci-ATAC-seq數(shù)據(jù)中。
# 使用FindTransferAnchors函數(shù)識(shí)別整合的anchors
transfer.anchors <- FindTransferAnchors(
reference = tenx,
query = sci,
reference.assay = 'sciPeaks',
query.assay = 'peaks',
reduction = 'cca',
features = peaks.use,
k.filter = NA
)
# 使用TransferData函數(shù)基于識(shí)別好的anchors進(jìn)行數(shù)據(jù)映射整合
predicted.id <- TransferData(
anchorset = transfer.anchors,
refdata = tenx$celltype,
weight.reduction = sci[['lsi']],
dims = 2:30
)
sci <- AddMetaData(
object = sci,
metadata = predicted.id
)
sci$predicted.id <- factor(sci$predicted.id, levels = levels(tenx$celltype)) # to make the colors match
# 數(shù)據(jù)可視化
p3 <- DimPlot(tenx, group.by = 'celltype', label = TRUE) + NoLegend() + ggplot2::ggtitle("Celltype labels 10x scATAC-seq")
p4 <- DimPlot(sci, group.by = 'predicted.id', label = TRUE) + NoLegend() + ggplot2::ggtitle("Predicted labels sci-ATAC-seq")
p3 + p4
Integration with Harmony使用Harmony包進(jìn)行數(shù)據(jù)整合
Harmony需要一個(gè)對(duì)象作為輸入,因此這里我們使用MergeWithRegions函數(shù)以coordinate-aware的方式將sci-ATAC-seq數(shù)據(jù)集和10x的scATAC-seq數(shù)據(jù)集進(jìn)行合并,然后對(duì)合并后的對(duì)象進(jìn)行LSI線性降維。數(shù)據(jù)降維后,我們可以使用RunHarmony函數(shù)調(diào)用Harmony的方法進(jìn)行數(shù)據(jù)的整合,并提供用作分組變量的技術(shù)來(lái)消除sci-ATAC-seq和10x Genomics scATAC-seq數(shù)據(jù)集之間的批次差異。這會(huì)產(chǎn)生一組“校正”的LSI嵌入,可以進(jìn)一步用于UMAP或tSNE降維,并進(jìn)行細(xì)胞聚類分群。
# 加載harmony包
library(harmony)
# 使用RunHarmony函數(shù)進(jìn)行數(shù)據(jù)整合
hm.integrated <- RunHarmony(
object = unintegrated,
group.by.vars = 'tech',
reduction = 'lsi',
assay.use = 'peaks',
project.dim = FALSE
)
# re-compute the UMAP using corrected LSI embeddings
# 數(shù)據(jù)降維可視化
hm.integrated <- RunUMAP(hm.integrated, dims = 2:30, reduction = 'harmony')
p5 <- DimPlot(hm.integrated, group.by = 'tech', pt.size = 0.1) + ggplot2::ggtitle("Harmony integration")
p1 + p5
參考來(lái)源:https://satijalab.org/signac/articles/integration.html