10X單細(xì)胞空間聯(lián)合分析之再次解讀cell2location

到了年根了,其實(shí)想總結(jié)一些之前的東西,其中cell2location很想再次分享一下,好好研讀一下。

Cell2location 是一種有principled Bayesian model,可以解析空間轉(zhuǎn)錄組數(shù)據(jù)中的 fine-grained細(xì)胞類型,并創(chuàng)建不同組織的綜合細(xì)胞圖。

Given cell type annotation for each cell, the corresponding reference cell type signatures (g_{f,g}), which represent the average mRNA count of each gene (g) in each cell type (f), can be estimated from sc/snRNA-seq data using 2 provided methods (see below). Cell2location needs untransformed unnormalised spatial mRNA counts as input. You also need to provide cell2location with the expected average cell abundance per location which is used as a prior to guide estimation of absolute cell abundance. This value depends on the tissue and can be estimated by counting nuclei for a few locations in the paired histology image but can be approximate (see paper methods for more guidance).

provide 2 methods for estimating reference cell type signatures from scRNA-seq data:

  • 一種基于負(fù)二項(xiàng)式回歸的統(tǒng)計(jì)方法。 通常建議使用 NB 回歸,它允許跨技術(shù)和批次穩(wěn)健地組合數(shù)據(jù),從而提高空間映射精度。
  • 單個(gè)基因的每個(gè)cluster平均 mRNA 計(jì)數(shù)的硬編碼計(jì)算 (cell2location.cluster_averages.compute_cluster_averages)。 當(dāng)批次效應(yīng)較小時(shí),這種更快的硬編碼計(jì)算每個(gè)集群平均值的方法提供了類似的高準(zhǔn)確度。

代碼部分也很值得多多學(xué)習(xí)

Loading packages

import sys

#if branch is stable, will install via pypi, else will install from source
branch = "github"
user = "BayraktarLab"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and branch == "stable":
    !pip install --quiet cell2location
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/BayraktarLab/cell2location#egg=cell2location[tutorials]

import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

import cell2location
import scvi

from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text
import seaborn as sns

First, let’s define where we save the results of our analysis:

results_folder = './results/lymph_nodes_analysis/'

# create paths and names to results folders for reference regression and cell2location models
ref_run_name = f'{results_folder}/reference_signatures'
run_name = f'{results_folder}/cell2location_map'
  • 這里的用法我很喜歡, f'{results_folder}/reference_signatures'

Loading Visium and scRNA-seq reference data

首先從 10X Space Ranger 輸出中讀取空間 Visium 數(shù)據(jù)。 可以使用 scanpy 方便地下載和導(dǎo)入此數(shù)據(jù)集。

adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]

# rename genes to ENSEMBL
adata_vis.var['SYMBOL'] = adata_vis.var_names
adata_vis.var_names = adata_vis.var['gene_ids']
adata_vis.var_names.name = None
  • 這里對(duì)scanpy讀入的對(duì)象的信息添加很值得學(xué)習(xí)。
注意! 線粒體編碼的基因(基因名稱以前綴 mt- 或 MT- 開頭)與空間映射無(wú)關(guān),因?yàn)樗鼈兊谋磉_(dá)代表了單細(xì)胞和細(xì)胞核數(shù)據(jù)中的技術(shù)產(chǎn)物,而不是線粒體的生物學(xué)豐度。 然而,這些基因在每個(gè)位置構(gòu)成了 15-40% 的 mRNA。 因此,為了避免映射偽影,我們強(qiáng)烈建議去除線粒體基因。
# find mitochondria-encoded (MT) genes
adata_vis.var['MT_gene'] = [gene.startswith('MT-') for gene in adata_vis.var['SYMBOL']]

# remove MT genes for spatial mapping (keeping their counts in the object)
adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()
adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]
在單細(xì)胞參考中包含了跨越淋巴結(jié)、脾臟和扁桃體的 scRNA-seq 數(shù)據(jù)集,以確保我們捕獲了空間轉(zhuǎn)錄組數(shù)據(jù)集中可能存在的免疫細(xì)胞狀態(tài)的全部多樣性。
# Download data if not already here
import os
if not os.path.exists('./data/sc.h5ad'):
    !cd ./data/ && wget https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad

# Read data
adata_ref = sc.read(f'./data/sc.h5ad')

# Use ENSEMBL as gene IDs to make sure IDs are unique and correctly matched
adata_ref.var['SYMBOL'] = adata_ref.var.index
adata_ref.var.index = adata_ref.var['GeneID-2'].copy()
adata_ref.var_names = adata_ref.var['GeneID-2'].copy()
adata_ref.var.index.name = None
adata_ref.raw.var['SYMBOL'] = adata_ref.raw.var.index
adata_ref.raw.var.index = adata_ref.raw.var['GeneID-2'].copy()
adata_ref.raw.var.index.name = None
# before we estimate the reference cell type signature we recommend to perform very permissive genes selection
# in this 2D histogram orange rectangle lays over excluded genes.
# In this case, the downloaded dataset was already filtered using this method,
# hence no density under the orange rectangle
from cell2location.utils.filtering import filter_genes
selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)

# filter the object
adata_ref = adata_ref[:, selected].copy()
圖片.png

Estimation of reference cell type signatures (NB regression)

The signatures are estimated from scRNA-seq data, accounting for batch effect, using a Negative binomial regression model.(這里需要考慮批次效應(yīng))
# prepare anndata for the regression model
scvi.data.setup_anndata(adata=adata_ref,
                        # 10X reaction / sample / batch
                        batch_key='Sample',
                        # cell type, covariate used for constructing signatures
                        labels_key='Subset',
                        # multiplicative technical effects (platform, 3' vs 5', donor effect)
                        categorical_covariate_keys=['Method']
                       )
scvi.data.view_anndata_setup(adata_ref)
圖片.png
# create and train the regression model
from cell2location.models import RegressionModel
mod = RegressionModel(adata_ref)

# Use all data for training (validation not implemented yet, train_size=1)
mod.train(max_epochs=250, batch_size=2500, train_size=1, lr=0.002, use_gpu=True)

# plot ELBO loss history during training, removing first 20 epochs from the plot
mod.plot_history(20)
圖片.png
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_ref = mod.export_posterior(
    adata_ref, sample_kwargs={'num_samples': 1000, 'batch_size': 2500, 'use_gpu': True}
)

# Save model
mod.save(f"{ref_run_name}", overwrite=True)

# Save anndata object with results
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref.write(adata_file)

Examine QC plots

  • 1、重建準(zhǔn)確性以評(píng)估推理是否存在任何問(wèn)題。
  • 2、由于批次效應(yīng),估計(jì)的表達(dá)特征不同于每個(gè)cluster中的平均表達(dá)。 對(duì)于不受批效應(yīng)影響的 scRNA-seq 數(shù)據(jù)集(該數(shù)據(jù)集有),可以使用聚類平均表達(dá)而不是用模型估計(jì)特征。 當(dāng)此圖與對(duì)角線圖非常不同時(shí)(例如,Y 軸上的值非常低,到處都是密度),則表明特征估計(jì)存在問(wèn)題。
mod.plot_QC()
圖片.png

The model and output h5ad can be loaded later like this:

mod = cell2location.models.RegressionModel.load(f"{ref_run_name}", adata_ref)
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref = sc.read_h5ad(adata_file)
# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
    inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
else:
    inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']
inf_aver.iloc[0:5, 0:5]
圖片.png

Cell2location: spatial mapping

# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, intersect].copy()
inf_aver = inf_aver.loc[intersect, :].copy()

# prepare anndata for cell2location model
scvi.data.setup_anndata(adata=adata_vis, batch_key="sample")
scvi.data.view_anndata_setup(adata_vis)
圖片.png
Note! While you can often use the default value of detection_alpha hyperparameter, it is useful to adapt the expected cell abundance N_cells_per_location to every tissue. This value can be estimated from paired histology images and as described in the note above. Change the value presented in this tutorial (N_cells_per_location=30) to the value observed in your your tissue.
# create and train the model
mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver,
    # the expected average cell abundance: tissue-dependent
    # hyper-prior which can be estimated from paired histology:
    N_cells_per_location=30,
    # hyperparameter controlling normalisation of
    # within-experiment variation in RNA detection (using default here):
    detection_alpha=200
)

mod.train(max_epochs=30000,
          # train using full data (batch_size=None)
          batch_size=None,
          # use all data points in training because
          # we need to estimate cell abundance at all locations
          train_size=1,
          use_gpu=True)

# plot ELBO loss history during training, removing first 100 epochs from the plot
mod.plot_history(1000)
plt.legend(labels=['full data training']);
圖片.png
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
    adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'use_gpu': True}
)

# Save model
mod.save(f"{run_name}", overwrite=True)

# mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)

# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)

Visualising cell abundance in spatial coordinates

# add 5% quantile, representing confident cell abundance, 'at least this amount is present',
# to adata.obs with nice names for plotting
adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']

# select one slide
from cell2location.utils import select_slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')

# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor':  'black',
                     'figure.figsize': [4.5, 5]}):

    sc.pl.spatial(slide, cmap='magma',
                  # show first 8 cell types
                  color=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC',
                         'B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo'],
                  ncols=4, size=1.3,
                  img_key='hires',
                  # limit color scale at 99.2% quantile of cell abundance
                  vmin=0, vmax='p99.2'
                 )
圖片.png
# Now we use cell2location plotter that allows showing multiple cell types in one panel
from cell2location.plt import plot_spatial

# select up to 6 clusters
clust_labels = ['T_CD4+_naive', 'B_naive', 'FDC']
clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels

slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')

with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(
        adata=slide,
        # labels to show on a plot
        color=clust_col, labels=clust_labels,
        show_img=True,
        # 'fast' (white background) or 'dark_background'
        style='fast',
        # limit color scale at 99.2% quantile of cell abundance
        max_color_quantile=0.992,
        # size of locations (adjust depending on figure size)
        circle_diameter=6,
        colorbar_position='right'
    )
圖片.png

Downstream analysis

Identifying discrete tissue regions by Leiden clustering

通過(guò)使用由 cell2location 估計(jì)的細(xì)胞豐度對(duì)位置進(jìn)行聚類來(lái)識(shí)別細(xì)胞組成不同的組織區(qū)域。

我們通過(guò)使用每種細(xì)胞類型的估計(jì)細(xì)胞豐度對(duì) Visium 點(diǎn)進(jìn)行聚類來(lái)找到組織區(qū)域。我們構(gòu)建了一個(gè) K-nearest neigbour (KNN) 圖,表示估計(jì)細(xì)胞豐度中位置的相似性,然后應(yīng)用 Leiden 聚類。 KNN 鄰居的數(shù)量應(yīng)適應(yīng)數(shù)據(jù)集的大小和解剖學(xué)定義區(qū)域的大?。春qR區(qū)域相當(dāng)小,因此可能被大型 n_neighbors 掩蓋)。這可以針對(duì)范圍 KNN 鄰居和 Leiden 聚類分辨率完成,直到獲得與組織解剖結(jié)構(gòu)匹配的聚類。

聚類是在所有 Visium 部分/批次中聯(lián)合完成的,因此區(qū)域身份是直接可比的。當(dāng)多個(gè)批次之間存在很強(qiáng)的技術(shù)影響時(shí)(這里不是這種情況),原則上可以使用 sc.external.pp.bbknn 來(lái)解釋 KNN 構(gòu)建過(guò)程中的這些影響。

The resulting clusters are saved in adata_vis.obs['region_cluster'].
# compute KNN using the cell2location output stored in adata.obsm
sc.pp.neighbors(adata_vis, use_rep='q05_cell_abundance_w_sf',
                n_neighbors = 15)

# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1.1)

# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"].astype("category")
# compute UMAP using KNN graph based on the cell2location output
sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)

# show regions in UMAP coordinates
with mpl.rc_context({'axes.facecolor':  'white',
                     'figure.figsize': [8, 8]}):
    sc.pl.umap(adata_vis, color=['region_cluster'], size=30,
               color_map = 'RdPu', ncols = 2, legend_loc='on data',
               legend_fontsize=20)
    sc.pl.umap(adata_vis, color=['sample'], size=30,
               color_map = 'RdPu', ncols = 2,
               legend_fontsize=20)

# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor':  'black',
                     'figure.figsize': [4.5, 5]}):
    sc.pl.spatial(adata_vis, color=['region_cluster'],
                  size=1.3, img_key='hires', alpha=0.5)
圖片.png

Identifying cellular compartments / tissue zones using matrix factorisation (NMF)(這部分應(yīng)該是新的內(nèi)容)

在這里,我們使用 cell2location 映射結(jié)果來(lái)識(shí)別細(xì)胞類型的空間共現(xiàn),以便更好地了解組織組織并預(yù)測(cè)細(xì)胞相互作用。 我們對(duì)來(lái)自 cell2location 的細(xì)胞類型豐度估計(jì)進(jìn)行了非負(fù)矩陣分解(NMF)。 與將 NMF 應(yīng)用于傳統(tǒng) scRNA-seq 的既定好處類似,附加 NMF 分解產(chǎn)生了一組空間細(xì)胞類型豐度曲線,將其分組為捕獲共定位細(xì)胞類型的組件。 這種基于 NMF 的分解自然地解釋了這樣一個(gè)事實(shí),即多種細(xì)胞類型和微環(huán)境可以在相同的 Visium 位置共存,同時(shí)跨組織區(qū)域(例如單個(gè)生發(fā)中心)共享信息。

提示 在實(shí)踐中,最好針對(duì)一系列因子 (R={5, .., 30}) 訓(xùn)練 NMF,并選擇 (R) 作為捕獲精細(xì)組織區(qū)域和拆分已知區(qū)室之間的平衡。 如果您想找到幾個(gè)最明顯的細(xì)胞隔室,請(qǐng)使用少量因子。 如果您想找到非常強(qiáng)的協(xié)同定位信號(hào)并假設(shè)大多數(shù)細(xì)胞類型不協(xié)同定位,請(qǐng)使用很多因子(> 30 - 此處使用)。

Below we show how to perform this analysis. To aid this analysis, we wrapped the analysis shown the notebook on advanced downstream analysis into a pipeline that automates training of the NMF model with varying number of factors:
from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
    adata_vis,
    model_name='CoLocatedGroupsSklearnNMF',
    train_args={
      'n_fact': np.arange(11, 13), # IMPORTANT: use a wider range of the number of factors (5-30)
      'sample_name_col': 'sample', # columns in adata_vis.obs that identifies sample
      'n_restarts': 3 # number of training restarts
    },
    export_args={'path': f'{run_name}/CoLocatedComb/'}
)
For every factor number, the model produces the following list of folder outputs:
  • cell_type_fractions_heatmap/: a dot plot of the estimated NMF weights of cell types (rows) across NMF components (columns)
  • cell_type_fractions_mean/: the data used for dot plot
  • factor_markers/: tables listing top 10 cell types most speficic to each NMF factor
  • models/: saved NMF models
  • predictive_accuracy/: 2D histogram plot showing how well NMF explains cell2location output
  • spatial/: NMF weights across locatinos in spatial coordinates
  • location_factors_mean/: the data used for the plot in spatial coordiantes
  • stability_plots/: stability of NMF weights between training restarts

檢查的關(guān)鍵輸出是 cell_type_fractions_heatmap/ 中的文件,它們顯示了與細(xì)胞隔室相對(duì)應(yīng)的 NMF 組件(列)中細(xì)胞類型(行)的估計(jì) NMF 權(quán)重的點(diǎn)圖。 顯示的是相對(duì)權(quán)重,對(duì)每種細(xì)胞類型的組件進(jìn)行了標(biāo)準(zhǔn)化。

# Here we plot the NMF weights (Same as saved to `cell_type_fractions_heatmap`)
res_dict['n_fact12']['mod'].plot_cell_type_loadings()
圖片.png

Advanced use

Estimate cell-type specific expression of every gene in the spatial data

For this, we adapt the approach of estimating conditional expected expression proposed by RCTD (Cable et al) method. With cell2location, we can look at the posterior distribution rather than just point estimates of cell type specific expression (see mod.samples.keys() and next section on using full distribution).

# Compute expected expression per cell type
expected_dict = mod.module.model.compute_expected_per_cell_type(
    mod.samples["post_sample_q05"], mod.adata
)

# Add to anndata layers
for i, n in enumerate(mod.factor_names_):
    adata_vis.layers[n] = expected_dict['mu'][i]

# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file
# Look at cell type specific expression in spatial coordinates,
# Here we highlight CD3D, pan T-cell marker expressed by
# 2 subtypes of T cells in distinct locations but not expressed by co-located B cells
ctypes = ['T_CD4+_TfH_GC', 'T_CD4+_naive', 'B_GC_LZ']
genes = ['CD3D', 'CR2']

with mpl.rc_context({'axes.facecolor':  'black'}):
    # select one slide
    slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')

    from tutorial_utils import plot_genes_per_cell_type
    plot_genes_per_cell_type(slide, genes, ctypes);
圖片.png
就是想回顧一下,當(dāng)然了,軟件也更新了很多內(nèi)容

生活很好,有你更好

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。
禁止轉(zhuǎn)載,如需轉(zhuǎn)載請(qǐng)通過(guò)簡(jiǎn)信或評(píng)論聯(lián)系作者。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容