pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data
R環(huán)境中操作流程見:http://www.itdecent.cn/p/b0fd795ad05c
分析步驟:
0. Preliminary work
1. Inference of co-expression modules
2. Prune modules for targets with cis regulatory footprints (aka RcisTarget)
3. Cellular regulon enrichment matrix (aka AUCell)
0. 前期參數(shù)準(zhǔn)備
import os
import glob
import pickle
import pandas as pd
import numpy as np
from dask.diagnostics import ProgressBar
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell
# 這個包我在導(dǎo)入的時候出現(xiàn)了如下報錯:(Segmentation fault)
# 因為該函數(shù)只涉及到出圖,因此沒有導(dǎo)入的必要(后期結(jié)果直接導(dǎo)入到R中進(jìn)行出圖)
# import seaborn as sns
DATA_FOLDER="~/tmp"
RESOURCES_FOLDER="~/resources"
DATABASE_FOLDER = "~/databases/"
SCHEDULER="123.122.8.24:8786"
DATABASES_GLOB = os.path.join(DATABASE_FOLDER, "mm9-*.feather")
MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.mgi-m0.001-o0.0.tbl")
MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, 'mm_tfs.txt')
SC_EXP_FNAME = os.path.join(RESOURCES_FOLDER, "GSE60361_C1-3005-Expression.txt")
REGULONS_FNAME = os.path.join(DATA_FOLDER, "regulons.p")
MOTIFS_FNAME = os.path.join(DATA_FOLDER, "motifs.csv")
# 讀入表達(dá)矩陣,表達(dá)矩陣的格式:橫坐標(biāo)是基因,縱坐標(biāo)是細(xì)胞
ex_matrix = pd.read_csv(SC_EXP_FNAME, sep='\t', header=0, index_col=0).T
ex_matrix.shape
# 導(dǎo)入轉(zhuǎn)錄因子
tf_names = load_tf_names(MM_TFS_FNAME)
# 導(dǎo)入數(shù)據(jù)庫
db_fnames = glob.glob(DATABASES_GLOB)
def name(fname):
return os.path.basename(fname).split(".")[0]
dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
dbs
1. Inference of co-expression modules
# 兩條命令解決
adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True) #耗時
modules = list(modules_from_adjacencies(adjacencies, ex_matrix))
2. Prune modules for targets with cis regulatory footprints (aka RcisTarget)
# Calculate a list of enriched motifs and the corresponding target genes for all modules.
with ProgressBar():
df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME)
# Create regulons from this table of enriched motifs.
regulons = df2regulons(df)
# Save the enriched motifs and the discovered regulons to disk.
df.to_csv(MOTIFS_FNAME)
with open(REGULONS_FNAME, "wb") as f:
pickle.dump(regulons, f)
3. Cellular regulon enrichment matrix (aka AUCell)
auc_mtx = aucell(ex_matrix, regulons, num_workers=4)
# 這一步出圖
# sns.clustermap(auc_mtx, figsize=(8,8))
將結(jié)果導(dǎo)入到R
首先將pySCENIC中得到的結(jié)果保存成.loom格式
# 還是在剛才的環(huán)境中(Python)
from pyscenic.export import export2loom
export2loom(ex_mtx = ex_matrix, auc_mtx = auc_mtx, regulons = regulons, out_fname = "your path way/xxx.loom")
# 這里會有提示(見圖1),按照提示改一下
export2loom(ex_mtx = ex_matrix, auc_mtx = auc_mtx, regulons = [r.rename(r.name.replace('(+)',' ('+str(len(r))+'g)')) for r in regulons], out_fname = "your path way/xxx.loom")
# 這一句話運行完畢后,會在指定目錄下生成xxx.loom文件,這就是導(dǎo)入R所需要的文件

然后進(jìn)入R環(huán)境
# 保存xxx.loom文件的路徑
pyScenicDir <- "pySCENIC_example/output"
library(SCENIC)
library(SCopeLoomR)
pyScenicLoomFile <- file.path(pyScenicDir, "SCENIC.loom")
loom <- open_loom(pyScenicLoomFile, mode="r")
# Read information from loom file:
regulons_incidMat <- get_regulons(loom)
regulons <- regulonsToGeneLists(regulons_incidMat)
regulonsAUC <- get_regulonsAuc(loom)
regulonsAucThresholds <- get_regulonThresholds(loom)
embeddings <- get_embeddings(loom)
exprMat <- get_dgem(loom)
cellInfo <- get_cellAnnotation(loom)
clusterings <- get_clusterings_withName(loom)
close_loom(loom)
以上過程完成之后,pySCENIC中的結(jié)果就成功導(dǎo)入到R環(huán)境中了,接下來就可以在R中出圖(可參考:SCENIC—Single Cell rEgulatory Network Inference and Clustering)
所用數(shù)據(jù)集
# Transcription factors:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/allTFs_hg38.txt
# Motif to TF annotation database:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/motifs.tbl
# Ranking databases:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/genome-ranking.feather
# Finally, get a small sample expression matrix (loom format):
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/expr_mat.loom
# Alternatively, in tsv format:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/expr_mat.tsv
參考鏈接:
Importing pySCENIC results
pyscenic
數(shù)據(jù)集
補充
當(dāng)我用真實數(shù)據(jù)跑這一套流程的時候,在“將pySCENIC中得到的結(jié)果保存成.loom格式”中出現(xiàn)了如下報錯:

網(wǎng)上搜到比較靠譜的解釋是HDF5文件類型大小有限制(HDF5 has a header limit of 64kb for all metadata of the columns. This include name, types, etc. When you go about roughly 2000 columns, you will run out of space to store all the metadata.)
因為不是很懂怎么解決這個問題,再加上我只需要regulonAUC這個對象(對應(yīng)著pyscenic當(dāng)中的auc_mtx),所以我后面就沒有用到上述提到的方法導(dǎo)入R
# 接上面的 Cellular regulon enrichment matrix (aka AUCell)
AUC_FNAME=os.path.join(DATA_FOLDER, "auc.tsv")
regulons = [r.rename(r.name.replace('(+)',' ('+str(len(r))+'g)')) for r in regulons]
auc_mtx = aucell(ex_matrix, regulons, num_workers=4)
auc_mtx.to_csv(AUC_FNAME)
# 進(jìn)入R環(huán)境
# cellInfo就是SeuratObject@metadata
library(SCENIC)
library(SCopeLoomR)
library(AUCell)
pyScenicDir=DATA_FOLDER
regulonAUC <- importAUCfromText(file.path(pyScenicDir, "auc..tsv"))
regulonAUC <- regulonAUC[onlyNonDuplicatedExtended(rownames(regulonAUC)),]
regulonActivity_byCellType <- sapply(split(rownames(cellInfo), cellInfo$CellType),
function(cells) rowMeans(getAUC(regulonAUC)[,cells]))
regulonActivity_byCellType_Scaled <- t(scale(t(regulonActivity_byCellType), center = T, scale=T))
pheatmap::pheatmap(regulonActivity_byCellType_Scaled, #fontsize_row=3,
color=colorRampPalette(c("blue","white","red"))(100), breaks=seq(-3, 3, length.out = 100),
treeheight_row=10, treeheight_col=10, border_color=NA)

還有一種方法,因為這個包里面限速的步驟是GENIE3(R里面),也可以單獨把這一步在Python中運行(GRNBoost,其他步驟可在R中運行),生成的結(jié)果保存到R中的/int目錄下。具體操作:
https://github.com/aertslab/SCENIC/issues/40
https://arboreto.readthedocs.io/en/latest/examples.html