學(xué)習(xí)資料來源:
- scanpy主頁:https://scanpy.readthedocs.io/en/stable/
- 官網(wǎng):https://scanpy-tutorials.readthedocs.io/en/latest/integrating-data-using-ingest.html【注意教程有兩個版本,這里是latest版本的學(xué)習(xí)筆記】
此教程描述了一種簡單的基于pca的方法來整合數(shù)據(jù),我們稱之為ingest,并將其與BBKNN [Polanski19]進(jìn)行了比較。BBKNN與Scanpy工作流整合得很好,可以通過BBKNN功能訪問。
ingest函數(shù)假設(shè)有一個帶注釋的參考數(shù)據(jù)集,該數(shù)據(jù)集捕獲了感興趣的生物變異。合理的做法是在參考數(shù)據(jù)上擬合一個模型,并使用它來投射新的數(shù)據(jù)。
目前,這個模型是一個結(jié)合了鄰居查找搜索樹的PCA,我們將使用UMAP來實(shí)現(xiàn)這個功能。類似的基于pca的整合之前已經(jīng)被使用過,例如,[Weinreb18]。ingest的優(yōu)點(diǎn):
- ingest簡單,流程清晰且運(yùn)行快
- 與BBKNN一樣,ingest不改變數(shù)據(jù)matrix本身
- 不同于BBKNN的是,ingest解決了標(biāo)簽映射問題(如scmap),并保持了可能具有特定cluster或軌跡等所需屬性的嵌入
我們將這種非對稱數(shù)據(jù)集整合稱為將注釋從帶注釋的參考adata_ref攝取到?jīng)]有注釋的數(shù)據(jù)中。它不同于學(xué)習(xí)以對稱方式整合數(shù)據(jù)集的joint,如BBKNN, Scanorma, Conos, CCA(如Seurat)或有條件的VAE(如scVI, trVAE),但可與scran中的初始MNN實(shí)現(xiàn)相比較。
01 PBMCs數(shù)據(jù)
用到兩個數(shù)據(jù)集:一個帶注釋的參考數(shù)據(jù)集adata_ref和一個需要被注釋和嵌入數(shù)據(jù)的數(shù)據(jù)集。
import scanpy as sc
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as pl
# verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.verbosity = 1
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3), facecolor='white')
outdir = '/Pub/Users/zhangjuan/project/scanpy/Intergrating/'
# this is an earlier version of the dataset from the pbmc3k tutorial
adata_ref = sc.datasets.pbmc3k_processed()
adata = sc.datasets.pbmc68k_reduced()
環(huán)境中軟件版本如下:
-----
anndata 0.8.0
scanpy 1.9.1
-----
PIL 9.2.0
beta_ufunc NA
binom_ufunc NA
colorama 0.4.5
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
h5py 3.7.0
hypergeom_ufunc NA
igraph 0.9.11
joblib 1.1.0
kiwisolver 1.4.4
leidenalg 0.8.10
llvmlite 0.38.1
louvain 0.7.1
matplotlib 3.4.3
mpl_toolkits NA
natsort 8.1.0
nbinom_ufunc NA
numba 0.55.2
numexpr 2.8.3
numpy 1.22.4
packaging 21.3
pandas 1.4.3
pkg_resources NA
pyparsing 3.0.9
pytz 2022.1
scipy 1.8.1
seaborn 0.11.2
session_info 1.0.0
setuptools 63.2.0
six 1.16.0
sklearn 1.1.1
statsmodels 0.13.2
texttable 1.6.4
threadpoolctl 3.1.0
-----
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0]
Linux-5.10.0-9-amd64-x86_64-with-glibc2.31
-----
Session information updated at 2022-08-03 23:55
參考數(shù)據(jù)以及查詢數(shù)據(jù)的情況:
adata_ref
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'n_cells'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
varm: 'PCs'
obsp: 'distances', 'connectivities'
# 注釋:
adata_ref.obs
n_genes percent_mito n_counts louvain
index
AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
... ... ... ... ...
TTTCGAACTCTCAT-1 1155 0.021104 3459.0 CD14+ Monocytes
TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells
TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells
[2638 rows x 4 columns]
adata
AnnData object with n_obs × n_vars = 700 × 765
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
使用sc.tl函數(shù),需要在相同的變量(即需要提共有基因數(shù)據(jù))上定義數(shù)據(jù)集
var_names = adata_ref.var_names.intersection(adata.var_names)
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]
只有208個共同基因!
在參考數(shù)據(jù)上訓(xùn)練的模型和圖形(這里是PCA,鄰居,UMAP)將解釋其中觀察到的生物變異。
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
該流形看起來與聚類教程中的流形基本相同。
sc.pl.umap(adata_ref, color='louvain')
pl.savefig(outdir + "01-adata_ref-UMAP.png")
結(jié)果圖:

使用ingest整合pbmc
基于選擇的特征,我們使用adata_ref數(shù)據(jù)集合的labels and embeddings映射到adata。在這里,我們使用adata_ref.obsm['X_pca']來map 細(xì)胞亞群標(biāo)簽 and the UMAP 坐標(biāo)。
sc.tl.ingest函數(shù):
- obs:Labels’ keys in
adata_ref.obswhich need to be mapped toadata.obs(inferred for observation ofadata)
sc.tl.ingest(adata, adata_ref, obs='louvain')
# fix colors
adata.uns['louvain_colors'] = adata_ref.uns['louvain_colors']
sc.pl.umap(adata, color=['louvain', 'bulk_labels'], wspace=0.5)
## 修改圖片邊距,之前圖片保存字體顯示不全
pl.margins(0, 0)
pl.savefig(outdir + "02-adata-UMAP.png", bbox_inches='tight', dpi=300, pad_inches=0.1)
結(jié)果圖:這個圖片跟官方的結(jié)果是個鏡像,180度反的,左邊louvain注釋是通過ingest整合的結(jié)果,右邊bulk_labels是數(shù)據(jù)原有的注釋標(biāo)簽:

通過比較louvain注釋結(jié)果與bulk_labels注釋結(jié)果,我們發(fā)現(xiàn)基本上都能對上,除了DC細(xì)胞。但是DC細(xì)胞可能在adata中就存在注釋問題。
adata_concat = adata_ref.concatenate(adata, batch_categories=['ref', 'new'])
adata_concat.obs.louvain = adata_concat.obs.louvain.astype('category')
# fix category ordering
adata_concat.obs.louvain.cat.reorder_categories(adata_ref.obs.louvain.cat.categories, inplace=True)
# fix category colors
adata_concat.uns['louvain_colors'] = adata_ref.uns['louvain_colors']
sc.pl.umap(adata_concat, color=['batch', 'louvain'])
pl.margins(0, 0)
pl.savefig(outdir + "03-adata_ref-UMAP-new.png", bbox_inches='tight', dpi=300, pad_inches=0.1)
矯正之后的結(jié)果:雖然在單核細(xì)胞和樹突狀細(xì)胞簇中似乎存在一些批量效應(yīng),但是新的數(shù)據(jù)在其他方面被映射得相對均勻。
巨核細(xì)胞megakaryoctes只存在于 adata _ ref 中,adata中沒有cell映射到它們
如果交換adata _ ref 和adata,巨核細(xì)胞megakaryoctes不再作為一個單獨(dú)的cluster出現(xiàn)。這是一個極端的情況,因?yàn)閰⒖紨?shù)據(jù)非常小; 但是人們應(yīng)該始終質(zhì)疑參考數(shù)據(jù)是否包含足夠的生物變異來有意義地容納查詢數(shù)據(jù)。

使用BBKNN整合數(shù)據(jù)
adata_concat是adata_ref(2638個細(xì)胞)與adata(700個細(xì)胞)兩個矩陣合并后的數(shù)據(jù)
參數(shù)join默認(rèn)采用inner方式即交集合并數(shù)據(jù) : str (default: 'inner'),Use intersection ('inner') or union ('outer') of variables
合并后,adata_concat包含:3338個細(xì)胞和208個共同基因
adata_concat
AnnData object with n_obs × n_vars = 3338 × 208
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'bulk_labels', 'S_score', 'G2M_score', 'phase', 'batch'
var: 'n_counts-new', 'means-new', 'dispersions-new', 'dispersions_norm-new', 'highly_variable-new', 'n_cells-ref'
uns: 'louvain_colors', 'batch_colors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
整合數(shù)據(jù):
sc.tl.pca(adata_concat)
# running bbknn 1.3.6
sc.external.pp.bbknn(adata_concat, batch_key='batch')
sc.tl.umap(adata_concat)
sc.pl.umap(adata_concat, color=['batch', 'louvain'])
pl.savefig(outdir + "04-adata_concat-UMAP-BBKNN.png",bbox_inches='tight', dpi=300, pad_inches=0.1)
結(jié)果圖:左圖為adata_ref數(shù)據(jù)與adata數(shù)據(jù)整合在一起的情況,右邊為注釋后的結(jié)果

02 Pancreas
本數(shù)據(jù)集在the scGen paper [Lotfollahi19],here中使用過,在here被核實(shí)過,可以在here (the BBKNN paper)這里進(jìn)行下載。
它包含了來自4個不同研究(Segerstolpe16, Baron16, Wang16, Muraro16)的人類胰腺的數(shù)據(jù),它已經(jīng)在單細(xì)胞數(shù)據(jù)集整合的開創(chuàng)性論文中被使用(Butler18,Haghverdi18) ,并且從那時起被多次使用。
將數(shù)據(jù)下載下來并存放在data目錄下:
# note that this collection of batches is already intersected on the genes
adata_all = sc.read('data/pancreas.h5ad', backup_url='https://www.dropbox.com/s/qj1jlm9w10wmt0u/pancreas.h5ad?dl=1')
adata_all.shape
(14693, 2448)
檢查在這些研究中觀察到的細(xì)胞類型:
counts = adata_all.obs.celltype.value_counts()
counts
alpha 4214
beta 3354
ductal 1804
acinar 1368
not applicable 1154
delta 917
gamma 571
endothelial 289
activated_stellate 284
dropped 178
quiescent_stellate 173
mesenchymal 80
macrophage 55
PSC 54
unclassified endocrine 41
co-expression 39
mast 32
epsilon 28
mesenchyme 27
schwann 13
t_cell 7
MHC class II 5
unclear 4
unclassified 2
Name: celltype, dtype: int64
為了簡化可視化,讓我們刪除5個細(xì)胞數(shù)量比較少的cluster:
# get the minority classes
minority_classes = counts.index[-5:].tolist()
# actually subset
adata_all = adata_all[~adata_all.obs.celltype.isin(minority_classes)]
# reorder according to abundance
adata_all.obs.celltype.cat.reorder_categories(counts.index[:-5].tolist(), inplace=True)
查看批次效應(yīng):
sc.pp.pca(adata_all)
sc.pp.neighbors(adata_all)
sc.tl.umap(adata_all)
sc.pl.umap(adata_all, color=['batch', 'celltype'], palette=sc.pl.palettes.vega_20_scanpy)
pl.savefig(outdir + "05-adata_all-UMAP.png",bbox_inches='tight', dpi=300, pad_inches=0.1)
可以看到很明顯的批次效應(yīng):這里跟官網(wǎng)的圖有點(diǎn)不一樣,聚類形狀差不多,但是就是有些cluster旋轉(zhuǎn)了某個角度

使用BBKNN整合數(shù)據(jù)
上面數(shù)據(jù)中的批次可以很好地使用 BBKNN [Polanski19]來解決。
sc.external.pp.bbknn(adata_all, batch_key='batch')
sc.tl.umap(adata_all)
sc.pl.umap(adata_all, color=['batch', 'celltype'])
pl.savefig(outdir + "06-adata_all-UMAP_batchAdjust.png",bbox_inches='tight', dpi=300, pad_inches=0.1)
結(jié)果如下:

使用ingest整合數(shù)據(jù)
選擇一個參考批處理來訓(xùn)練模型和建立鄰域圖(這里是PCA) ,并分離出所有其他批處理。
和以前一樣,對參考批次進(jìn)行訓(xùn)練的模型將解釋在其中觀察到的生物學(xué)變異。
這里數(shù)據(jù)有4個batch,選擇batch 0 作為adata_ref,其余為adata
adata_ref = adata_all[adata_all.obs.batch == '0']
# Compute the PCA, neighbors and UMAP on the reference data
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
# The reference batch contains 12 of the 19 cell types across all batches.
sc.pl.umap(adata_ref, color='celltype')
pl.savefig(outdir + "07-adata_all-UMAP_ingest.png",bbox_inches='tight', dpi=300, pad_inches=0.1)
結(jié)果圖如下:參考batch中包含所有19中細(xì)胞類型中的12種細(xì)胞

迭代地將label(例如‘ 細(xì)胞類型’)和embeddings(例如‘ X _ pca’和‘ X _ umap’)從參考數(shù)據(jù)映射到需要注釋的數(shù)據(jù)中:
adatas = [adata_all[adata_all.obs.batch == i].copy() for i in ['1', '2', '3']]
# a bit more logging
sc.settings.verbosity = 2
for iadata, adata in enumerate(adatas):
print(f'... integrating batch {iadata+1}')
# save the original cell type
adata.obs['celltype_orig'] = adata.obs.celltype
sc.tl.ingest(adata, adata_ref, obs='celltype')
... integrating batch 1
running ingest
finished (0:00:15)
... integrating batch 2
running ingest
finished (0:00:06)
... integrating batch 3
running ingest
finished (0:00:02)
現(xiàn)在,每個查詢的batch都帶有已經(jīng)與 adata _ ref 進(jìn)行上下文關(guān)聯(lián)的注釋
使用concatenate,我們可以一起可視化看看結(jié)果:
adata_concat = adata_ref.concatenate(adatas)
adata_concat.obs.celltype = adata_concat.obs.celltype.astype('category')
# fix category ordering
adata_concat.obs.celltype.cat.reorder_categories(adata_ref.obs.celltype.cat.categories, inplace=True)
# fix category coloring
adata_concat.uns['celltype_colors'] = adata_ref.uns['celltype_colors']
sc.pl.umap(adata_concat, color=['batch', 'celltype'])
pl.savefig(outdir + "08-adata_concat-UMAP_ingest.png",bbox_inches='tight', dpi=300, pad_inches=0.1)
結(jié)果圖:與BBKNN的結(jié)果相比,ingest 以一種更明顯的方式維持cluster。如果已經(jīng)觀察到想要的連續(xù)結(jié)構(gòu)(例如在造血數(shù)據(jù)集中),ingest 可以輕松地維持這個結(jié)構(gòu)。

評估一致性
提取數(shù)據(jù)中剛剛被當(dāng)做查詢的部分,即batch1,2,3,包含6113個細(xì)胞
## 評估一致性
adata_query = adata_concat[adata_concat.obs.batch.isin(['1', '2', '3'])]
adata_query
View of AnnData object with n_obs × n_vars = 6113 × 2448
# The following plot is a bit hard to read, hence, move on to confusion matrices below
sc.pl.umap(adata_query, color=['batch', 'celltype', 'celltype_orig'], wspace=0.4)
pl.savefig(outdir + "09-adata_query-consistency.png",bbox_inches='tight', dpi=300, pad_inches=0.1)
結(jié)果圖:整合后數(shù)據(jù)的batch,整合后數(shù)據(jù)的注釋,數(shù)據(jù)原有的注釋label

不同batches中保守的細(xì)胞類型
我們首先來查看在reference batch與query batch中保守的細(xì)胞類型:
obs_query = adata_query.obs
# intersected categories
conserved_categories = obs_query.celltype.cat.categories.intersection(obs_query.celltype_orig.cat.categories)
# intersect categories
obs_query_conserved = obs_query.loc[obs_query.celltype.isin(conserved_categories) & obs_query.celltype_orig.isin(conserved_categories)]
# remove unused categoriyes
obs_query_conserved.celltype.cat.remove_unused_categories(inplace=True)
# remove unused categoriyes
obs_query_conserved.celltype_orig.cat.remove_unused_categories(inplace=True)
# fix category ordering
obs_query_conserved.celltype_orig.cat.reorder_categories(obs_query_conserved.celltype.cat.categories, inplace=True)
pd.crosstab(obs_query_conserved.celltype, obs_query_conserved.celltype_orig)
總的來說,保守的細(xì)胞類型按預(yù)期那樣映射。意外的是原注釋中一些腺泡細(xì)胞表現(xiàn)為腺泡細(xì)胞。然而,已經(jīng)觀察到的參考數(shù)據(jù)顯示腺泡和導(dǎo)管細(xì)胞聚集,這解釋了差異,并表明初始注釋中可能存在不一致。

所有的細(xì)胞類型
現(xiàn)在,我們來看看所有的細(xì)胞類型:
pd.crosstab(adata_query.obs.celltype, adata_query.obs.celltype_orig)
結(jié)果:
我們觀察到PSC(胰腺星狀細(xì)胞)細(xì)胞實(shí)際上只是不一致地注釋和正確地映射在“激活星狀細(xì)胞”上;
此外,很高興看到“間充質(zhì)”和“間充質(zhì)”細(xì)胞都屬于同一類別。然而,這個類別仍然是'activated_stellate',而且很可能是錯誤的

各批次的可視化分布
通常,批量與人們想要比較的實(shí)驗(yàn)相對應(yīng)。Scanpy為此提供了方便的可視化可能性
Density plot
sc.tl.embedding_density(adata_concat, groupby='batch')
sc.pl.embedding_density(adata_concat, groupby='batch')
pl.savefig(outdir + "10-adata_concat-density_plot.png",bbox_inches='tight', dpi=300, pad_inches=0.1)
結(jié)果圖:

嵌入群子集的部分可視化
for batch in ['1', '2', '3']:
sc.pl.umap(adata_concat, color='batch', groups=[batch])
out = outdir + "11-adata_concat-Partial_visualizaton"+batch+".png"
pl.savefig(out, bbox_inches='tight', dpi=300, pad_inches=0.1)

下回見~