python版的singler單細胞注釋工具

網(wǎng)上可以搜到大量的R語言singleR的代碼和教程,但python版的就比較少啦,恭喜你找到了我。

1.文件讀取

輸入的數(shù)據(jù)是10X標準的三個文件

import singlecellexperiment as sce
import scanpy as sc
import os
print(os.listdir("01_data"))
['barcodes.tsv', 'genes.tsv', 'matrix.mtx']

用read_10x_mtx讀取

adata = sc.read_10x_mtx("01_data/")
print(adata.shape)
(2700, 32738)

2. 質(zhì)控

sc.pp.filter_cells(adata,min_genes=200)
sc.pp.filter_genes(adata,min_cells=3)
adata.var['mt']=adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata,qc_vars=['mt'],log1p=False,percent_top=None,inplace=True)
sc.pl.violin(adata,["n_genes_by_counts", "total_counts", "pct_counts_mt"],jitter=0.4, multi_panel=True)

adata=adata[adata.obs.n_genes_by_counts>200]
adata=adata[adata.obs.n_genes_by_counts<2500]
adata=adata[adata.obs.pct_counts_mt<20]

print(adata.shape)
(2693, 13714)

3.降維聚類分群

sc.pp.normalize_total(adata,target_sum=1e4)
sc.pp.log1p(adata)
adata.raw=adata

sc.pp.highly_variable_genes(adata,n_top_genes=2000)
sc.pp.scale(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata,n_pcs=15)
sc.tl.leiden(adata,flavor="igraph",n_iterations=2,resolution=0.5)
sc.tl.umap(adata)
sc.pl.umap(adata,color='leiden')

4.singler自動注釋

singler的資料實在太少,文檔也很簡潔,我學習到這個地方時,請教了包的作者兩個問題:

1.如何按照cluster完成注釋?

作者回答可以用scranpy的aggregate_across_cells函數(shù)按簇整合;

Q: In the R package singleR, I am able to utilize the cluster parameter; however, it appears that this parameter does not exist in the Python version of singler.Did I miss anything?

A: scranpy has an aggregate_across_cells() function that you can use to get the aggregated matrix that can be used in classify_single_reference(). That should be the same as what SingleR::SingleR() does under the hood.

I suppose we could add this argument, but to be honest, the only reason that cluster= still exists in SingleR() is for back-compatibility purposes. It's easy enough to do the aggregation outside of the function and I don't want to add more responsibilities to the singler package.

2.應該選擇raw count還是lognormalized data 還是scaled data?

作者回答都可以

Q: Thank you. I've been learning singler recently. According to the quick start guide on the pip website,the test_data parameter seems to require the original count data:

data = sce.read_tenx_h5("pbmc4k-tenx.h5", realize_assays=True)
mat = data.assay("counts")

However, the R version of SingleR typically uses log-normalized data. The documentation also mentions,”or if you are coming from scverse ecosystem, i.e. AnnData, simply read the object as SingleCellExperiment and extract the matrix and the features.“,but data processed with Scanpy could be extracted as scaled data. Could you provide advice on which matrix I should use, or if either would be suitable?

A: For the test dataset, it doesn't matter. Only the ranks of the values are used by SingleR itself, so it will give the same results for any monotonic transformation within each cell.

IIRC the only place where the log/normalization-status makes a difference is in SingleR::plotMarkerHeatmap() (R package only, not in the Python package yet) which computes log-fold changes in the test dataset to prioritize the markers to be visualized in the heatmap. This is for diagnostic purposes only.

Of course, the reference dataset should always be some kind of log-normalized value, as log-fold changes are computed via the difference of means, e.g., with getClassicMarkers().

其實使用哪個數(shù)據(jù)還是會產(chǎn)生一些差別的,我們就沿用log-normalized數(shù)據(jù)吧(當然其他的也可以)

mat = adata.raw.X.T # 矩陣
features = list(adata.raw.var.index) #矩陣的行名-基因
import scranpy
m2 = scranpy.aggregate_across_cells(mat,adata.obs['leiden']) #按照聚類結(jié)果整合單細胞矩陣
m2
SummarizedExperiment(number_of_rows=13714, number_of_columns=8, assays=['sums', 'detected'], row_data=BiocFrame(data={}, number_of_rows=13714, column_names=[]), column_data=BiocFrame(data={'factor_1': StringList(data=['0', '2', '3', '4', '1', '5', '6', '7']), 'counts': array([452, 350, 226, 252, 713, 226, 450,  24], dtype=int32)}, number_of_rows=8, column_names=['factor_1', 'counts']), column_names=['0', '2', '3', '4', '1', '5', '6', '7'])

查看都有哪些可選的注釋

import celldex
refs = celldex.list_references() #這句也有可能因為網(wǎng)絡問題而報錯,不過可以不運行,只是為了知道下面可以寫什么注釋和什么版本。
print(refs[["name", "version"]])
                        name     version
0                       dice  2024-02-26
1           blueprint_encode  2024-02-26
2                     immgen  2024-02-26
3               mouse_rnaseq  2024-02-26
4                       hpca  2024-02-26
5  novershtern_hematopoietic  2024-02-26
6              monaco_immune  2024-02-26

celldex的參考數(shù)據(jù)是會下載的,經(jīng)常有網(wǎng)絡問題下載困難,導致運行失敗,可以存本地文件,只有第一次運行時會下載,但要注意換了參考數(shù)據(jù)則fr和fetch_reference里兩處要修改

import os
import pickle

fr = "ref_blueprint_encode_data.pkl" 
if not os.path.exists(fr):
    ref_data = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)
    with open(fr, 'wb') as file:
        pickle.dump(ref_data, file)
else:
    with open(fr, 'rb') as file:
        ref_data = pickle.load(file)

完成注釋

import singler
results = singler.annotate_single(
    test_data = m2,
    test_features = features,
    ref_data = ref_data,
    ref_labels = "label.main"
)

將注釋結(jié)果添加到anndata對象里,并畫圖

dd = dict(zip(list(m2.column_data.row_names), results['best']))
dd
{'0': 'CD8+ T-cells',
 '2': 'B-cells',
 '3': 'Monocytes',
 '4': 'NK cells',
 '1': 'CD4+ T-cells',
 '5': 'CD8+ T-cells',
 '6': 'Monocytes',
 '7': 'Monocytes'}
adata.obs['singler']=adata.obs['leiden'].map(dd)

sc.pl.umap(adata,color = 'singler')

自動注釋不一定是完全準確的,你換一個參考數(shù)據(jù)也會發(fā)現(xiàn)結(jié)果會變。發(fā)現(xiàn)有問題就要結(jié)合背景知識(比如marker基因)去檢查一下。

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

相關閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容