You can find plenty of R code and tutorials for SingleR online, but material for the Python version is much scarcer. Congratulations on finding this one.
1. Reading the files
The input is the standard three-file 10X output:
import singlecellexperiment as sce
import scanpy as sc
import os
print(os.listdir("01_data"))
['barcodes.tsv', 'genes.tsv', 'matrix.mtx']
Read it with read_10x_mtx:
adata = sc.read_10x_mtx("01_data/")
print(adata.shape)
(2700, 32738)
2. Quality control
sc.pp.filter_cells(adata,min_genes=200)
sc.pp.filter_genes(adata,min_cells=3)
adata.var['mt']=adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata,qc_vars=['mt'],log1p=False,percent_top=None,inplace=True)
sc.pl.violin(adata,["n_genes_by_counts", "total_counts", "pct_counts_mt"],jitter=0.4, multi_panel=True)
adata = adata[
    (adata.obs.n_genes_by_counts > 200)
    & (adata.obs.n_genes_by_counts < 2500)
    & (adata.obs.pct_counts_mt < 20)
].copy()  # combine the masks and copy to avoid chained views
print(adata.shape)

(2693, 13714)
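The thresholds above are just boolean masks over columns of adata.obs. The same filtering logic can be sketched in plain NumPy (the per-cell values here are made up for illustration, not taken from the pbmc3k data):

```python
import numpy as np

# hypothetical QC metrics for 5 cells
n_genes_by_counts = np.array([150, 300, 2600, 1200, 800])
pct_counts_mt = np.array([2.0, 5.0, 3.0, 25.0, 10.0])

# combine the three criteria into one mask, mirroring the thresholds above
keep = (n_genes_by_counts > 200) & (n_genes_by_counts < 2500) & (pct_counts_mt < 20)
print(keep)        # [False  True False False  True]
print(keep.sum())  # 2 cells survive QC
```

Cell 1 fails the lower gene-count bound, cell 3 the upper bound, and cell 4 the mitochondrial cutoff.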
3. Dimensionality reduction and clustering
sc.pp.normalize_total(adata,target_sum=1e4)
sc.pp.log1p(adata)
adata.raw=adata
sc.pp.highly_variable_genes(adata,n_top_genes=2000)
sc.pp.scale(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata,n_pcs=15)
sc.tl.leiden(adata,flavor="igraph",n_iterations=2,resolution=0.5)
sc.tl.umap(adata)
sc.pl.umap(adata,color='leiden')
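Conceptually, the PCA step here is a truncated SVD of the centered, scaled matrix; the cell embeddings are the left singular vectors weighted by the singular values. A minimal NumPy sketch of that idea on toy data (not the scanpy implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # toy: 100 cells x 6 genes, already scaled
X = X - X.mean(axis=0)          # center each gene

# principal components via SVD: cell embeddings are U * S
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U[:, :2] * S[:2]          # top-2 PCs, analogous to adata.obsm['X_pca'][:, :2]
print(pcs.shape)                # (100, 2)
```

Projecting X onto the top right singular vectors (X @ Vt[:2].T) gives the same embedding.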

4. Automatic annotation with singler
Material on singler is scarce and its documentation is terse. When I reached this point, I asked the package author two questions:
1. How do I annotate per cluster?
The author replied that you can aggregate by cluster with scranpy's aggregate_across_cells function:
Q: In the R package singleR, I am able to utilize the cluster parameter; however, it appears that this parameter does not exist in the Python version of singler. Did I miss anything?
A: scranpy has an aggregate_across_cells() function that you can use to get the aggregated matrix that can be used in classify_single_reference(). That should be the same as what SingleR::SingleR() does under the hood.
I suppose we could add this argument, but to be honest, the only reason that cluster= still exists in SingleR() is for back-compatibility purposes. It's easy enough to do the aggregation outside of the function and I don't want to add more responsibilities to the singler package.
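The aggregation the author describes amounts to summing each gene's counts over the cells of every cluster. As a sanity check of the idea (a toy sketch, not the scranpy implementation):

```python
import numpy as np

# toy genes-by-cells matrix and one cluster label per cell (made-up data)
mat = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])            # 2 genes x 4 cells
clusters = np.array(["0", "1", "0", "1"])

# sum the columns belonging to each cluster -> genes x clusters matrix
labels = sorted(set(clusters))
agg = np.column_stack([mat[:, clusters == c].sum(axis=1) for c in labels])
print(labels)  # ['0', '1']
print(agg)     # [[ 4  6]
               #  [12 14]]
```

Each column of agg now plays the role of one "pseudo-cell" per cluster, which is what gets classified.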
2. Should I use raw counts, log-normalized data, or scaled data?
The author replied that any of them works:
Q: Thank you. I've been learning singler recently. According to the quick start guide on PyPI, the test_data parameter seems to require the original count data: data = sce.read_tenx_h5("pbmc4k-tenx.h5", realize_assays=True); mat = data.assay("counts"). However, the R version of SingleR typically uses log-normalized data. The documentation also mentions, "or if you are coming from scverse ecosystem, i.e. AnnData, simply read the object as SingleCellExperiment and extract the matrix and the features.", but data processed with Scanpy could be extracted as scaled data. Could you provide advice on which matrix I should use, or if either would be suitable?
A: For the test dataset, it doesn't matter. Only the ranks of the values are used by SingleR itself, so it will give the same results for any monotonic transformation within each cell.
IIRC the only place where the log/normalization-status makes a difference is in SingleR::plotMarkerHeatmap() (R package only, not in the Python package yet), which computes log-fold changes in the test dataset to prioritize the markers to be visualized in the heatmap. This is for diagnostic purposes only. Of course, the reference dataset should always be some kind of log-normalized value, as log-fold changes are computed via the difference of means, e.g., with getClassicMarkers().
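The author's point about monotonic transformations is easy to verify: within a cell, ranking the raw counts and ranking their log1p gives exactly the same order. A toy check (unrelated to singler's internals):

```python
import numpy as np

counts = np.array([0.0, 7.0, 3.0, 12.0, 1.0])  # expression of 5 genes in one cell
logged = np.log1p(counts)                      # a monotonic transformation

# argsort twice turns values into ranks
ranks_raw = counts.argsort().argsort()
ranks_log = logged.argsort().argsort()
print(ranks_raw)                        # [0 3 2 4 1]
print((ranks_raw == ranks_log).all())   # True: the ranks are unchanged
```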
In practice, the choice of matrix does make some difference. We will stick with the log-normalized data here (though the others would also work).
mat = adata.raw.X.T # the expression matrix, genes x cells
features = list(adata.raw.var.index) # row names of the matrix, i.e. the genes
import scranpy
m2 = scranpy.aggregate_across_cells(mat,adata.obs['leiden']) # aggregate the single-cell matrix by cluster
m2
SummarizedExperiment(number_of_rows=13714, number_of_columns=8, assays=['sums', 'detected'], row_data=BiocFrame(data={}, number_of_rows=13714, column_names=[]), column_data=BiocFrame(data={'factor_1': StringList(data=['0', '2', '3', '4', '1', '5', '6', '7']), 'counts': array([452, 350, 226, 252, 713, 226, 450, 24], dtype=int32)}, number_of_rows=8, column_names=['factor_1', 'counts']), column_names=['0', '2', '3', '4', '1', '5', '6', '7'])
Check which reference datasets are available:
import celldex
refs = celldex.list_references() # may also fail due to network problems; optional, it only lists the reference names and versions usable below
print(refs[["name", "version"]])
name version
0 dice 2024-02-26
1 blueprint_encode 2024-02-26
2 immgen 2024-02-26
3 mouse_rnaseq 2024-02-26
4 hpca 2024-02-26
5 novershtern_hematopoietic 2024-02-26
6 monaco_immune 2024-02-26
The celldex reference data has to be downloaded, and network problems frequently make the download fail. You can cache it in a local file so it is downloaded only on the first run; note that if you switch references, both fr and the fetch_reference call must be updated.
import os
import pickle
fr = "ref_blueprint_encode_data.pkl"
if not os.path.exists(fr):
    ref_data = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)
    with open(fr, 'wb') as file:
        pickle.dump(ref_data, file)
else:
    with open(fr, 'rb') as file:
        ref_data = pickle.load(file)
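This download-once-then-cache pattern is generic. Here is the same idea wrapped in a small helper; the fetch function below is a stand-in for illustration, not the real celldex call:

```python
import os
import pickle

def cached(path, fetch):
    """Load a pickled object from path, or call fetch(), save it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    obj = fetch()
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return obj

# usage with a stand-in fetch; swap in celldex.fetch_reference(...) for real use
ref = cached("toy_ref.pkl", lambda: {"name": "blueprint_encode", "version": "2024-02-26"})
print(ref["name"])  # blueprint_encode
```

On the second call with the same path, fetch is never invoked; the object comes from disk.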
Run the annotation:
import singler
results = singler.annotate_single(
test_data = m2,
test_features = features,
ref_data = ref_data,
ref_labels = "label.main"
)
Add the annotation results to the AnnData object and plot:
dd = dict(zip(list(m2.column_data.row_names), results['best']))
dd
{'0': 'CD8+ T-cells',
'2': 'B-cells',
'3': 'Monocytes',
'4': 'NK cells',
'1': 'CD4+ T-cells',
'5': 'CD8+ T-cells',
'6': 'Monocytes',
'7': 'Monocytes'}
adata.obs['singler']=adata.obs['leiden'].map(dd)
sc.pl.umap(adata,color = 'singler')

Automatic annotation is not always accurate, and switching the reference dataset will also change the results. When something looks wrong, check it against background knowledge (e.g. marker genes).