10X單細胞 && 10XATAC聯(lián)合分析之scJoint

開工第一彈,我們來看看最新的10X單細胞聯(lián)合ATAC的分析方法,文章在scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning,2022年1月發(fā)表于nature biotechnology,IF54分,相當高了~~~~我們來看一下,其實這里要解決的就是多組學的聯(lián)合分析問題,下面列舉了一些我之前分享的方法,供大家借鑒和參考。

10X單細胞轉錄組整合、轉錄組 && ATAC整合分析之VIPCCA

10X單細胞(10X空間轉錄組)多組學(RNA + ATAC)推斷Velovyto(MultiVelo

10X單細胞 & 10XATAC 聯(lián)合分析表征細胞調控網(wǎng)絡(MIRA)

10X單細胞 & 10XATAC聯(lián)合之調控網(wǎng)絡分析(IReNA)

10X單細胞(10X空間轉錄組)數(shù)據(jù)分析遷移之scGCN

10X單細胞核轉錄組 + 10X單細胞ATAC的聯(lián)合分析(WNN)

10X單細胞轉錄組與10X單細胞ATAC聯(lián)合分析之Seurat

當然了,10X單細胞 && 10XATAC有兩種數(shù)據(jù)類型,一種是多組學,同時測一個細胞內的轉錄組和ATAC,另外一種是把樣本分成了2份,一份測轉錄組,一份測ATAC,大家要根據(jù)自己的情況來甄別一下。

首先我們來分分類,看看這些10X單細胞 && 10XATAC的聯(lián)合方法的區(qū)別,以及他們的優(yōu)缺點

  • manifold alignment(Manifold alignment methods have demonstrated promising results in integrating several modalities from the same tissue. However, requiring distributions to match globally is often too restrictive when different modalities are derived from different tissues and cell types
  • matrix factorization (Liger, coupled-NMF)(matrix factorization and correlation-based methods designed for unpaired data require a separate feature selection step before integration for dimension reduction, and the method’s performance is sensitive to which genes are selected
  • using correlations to identify nearby cells across modalities(Conos, Seurat)
  • neural-network approaches(Most existing neural-network methods for multiomics integration are based on autoencoders, which, with a few exceptions,
    require paired data. In general, unsupervised training of several autoencoders simultaneously can be very challenging without pairing information across different modalities, with finding a common embedding manifold becomes more difficult as the complexity of the data increases)

因此,當前的方法在處理表征多組學圖譜數(shù)據(jù)的復雜性和規(guī)模方面的能力有限。

scJoint的改進之處

  • 一個新的損失函數(shù),明確地將降維作為轉移學習中特征工程過程的一部分,允許在整個訓練過程中修改低維特征,并且無需選擇高度可變的基因。
  • 當細胞類型不完全重疊時,增加了模態(tài)對齊的靈活性的相似性損失
  • weight sharing across encoders for different modalities resulting in stable training
圖片.png

方法核心

scJoint 的核心是對標記(scRNA-seq)和未標記(scATAC-seq)數(shù)據(jù)進行協(xié)同訓練的半監(jiān)督方法,解決了通過共同的低維空間對齊這兩種不同數(shù)據(jù)模式的主要挑戰(zhàn)。 scJoint 包括三個主要步驟。步驟 1 分別通過新的基于神經(jīng)網(wǎng)絡的降維 (NNDR) 損失和余弦相似度損失在公共嵌入空間中執(zhí)行聯(lián)合降維和模態(tài)對齊。 NNDR 損失在類似于 PCA 的vein中提取具有最大可變性的正交特征,而余弦相似性損失鼓勵神經(jīng)網(wǎng)絡找到嵌入空間的投影,以便可以對齊兩種模式的大部分。 scRNA-seq 的嵌入進一步由細胞類型分類損失引導,形成半監(jiān)督部分。在步驟 2 中,將 scATAC-seq 數(shù)據(jù)中的每個細胞視為query,通過測量它們在公共嵌入空間中的距離來識別 scRNA-seq 細胞之間的 k 最近鄰(KNN),并從 scRNA 中轉移細胞類型標簽-seq 通過“多數(shù)票”通過 scATAC-seq。在第 3 步中,通過在度量學習損失中使用轉移標簽來進一步改進兩種模式之間的混合。使用標準工具(包括 tSNE 和 UMAP)從最終嵌入層獲得數(shù)據(jù)集的聯(lián)合可視化。 scJoint 需要簡單的數(shù)據(jù)預處理,輸入維度等于給定數(shù)據(jù)集中經(jīng)過適當過濾后的基因數(shù)。 scATAC-seq 數(shù)據(jù)中的染色質可及性首先轉換為基因活性評分,從而允許使用單個編碼器,RNA 和 ATAC 的權重共享。

方法比較,我們簡單看一下

scJoint shows accurate and robust performance on atlas data
圖片.png
Refining scATAC-seq annotations in heterogeneous atlas data
圖片.png
Integration of multimodal data across biological conditions
圖片.png
scJoint shows versatile performance on paired data
圖片.png
當然了,文章寫的方法,自然是作者的方法效果最好

核心算法,這個需要大家自己來解讀了

圖片.png

示例代碼

安裝
git clone https://github.com/SydneyBioX/scJoint.git
scJoint是python版本,如果分析結果保存成了rds(R版本),數(shù)據(jù)需要轉換一下
library(SingleCellExperiment)
library(DropletUtils)
library(scater)
library(ggplot2)
sce_10xPBMC_atac <- readRDS("data_10x/sce_10xPBMC_atac.rds")
sce_10xPBMC_rna <- readRDS("data_10x/sce_10xPBMC_rna.rds")
# Only keep common genes between two dataset
common_genes <- intersect(rownames(sce_10xPBMC_atac),
                          rownames(sce_10xPBMC_rna))
length(common_genes)

# Extract the logcounts data from sce object
exprs_atac <- logcounts(sce_10xPBMC_atac[common_genes, ])
exprs_rna <- logcounts(sce_10xPBMC_rna[common_genes, ])



source("data_to_h5.R")
write_h5_scJoint(exprs_list = list(rna = exprs_rna,
                                   atac = exprs_atac), 
                 h5file_list = c("data_10x/exprs_10xPBMC_rna.h5", 
                                 "data_10x/exprs_10xPBMC_atac.h5"))
write_csv_scJoint(cellType_list =  list(rna = sce_10xPBMC_rna$cellTypes),
                  csv_list = c("data_10x/cellType_10xPBMC_rna.csv"))

data_to_h5.R這個軟件有提供,大家可以下載。

Analysis of PBMC data from 10x Genomics using scJoint

import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
random.seed(1)

rna_h5_files = ["data_10x/exprs_10xPBMC_rna.h5"] 
rna_label_files = ["data_10x/cellType_10xPBMC_rna.csv"] # csv file

atac_h5_files = ["data_10x/exprs_10xPBMC_atac.h5"]
atac_label_files = []

process_db.data_parsing(rna_h5_files, atac_h5_files)
rna_label = pd.read_csv(rna_label_files[0], index_col = 0)
rna_label
print(rna_label.value_counts(sort = False))
process_db.label_parsing(rna_label_files, atac_label_files)

Running scJoint

The scRNA-seq and scATAC-seq have 15463 genes in common. And we have 11 cell types in total in the scRNA_seq adta, so we will set the number of class as 11 in the config.py file. The paths also needed to be edited accordingly. Here are the settings for integration of scRNA-seq and scATAC-seq from 10x Genomics we used:

self.number_of_class = 11 # Number of cell types in scRNA-seq data
self.input_size = 15463 # Number of common genes and proteins between scRNA-seq data and scATAC-seq
self.rna_paths = ['data_10x/exprs_10xPBMC_rna.npz'] # RNA gene expression from scRNA-seq data
self.rna_labels = ['data_10x/cellType_10xPBMC_rna.txt'] # scRNA-seq data cell type labels (coverted to numeric)
self.atac_paths = ['data_10x/exprs_10xPBMC_atac.npz'] # ATAC gene activity matrix from scATAC-seq data
self.atac_labels = []
self.rna_protein_paths = []
self.atac_protein_paths = []

Training config

self.batch_size = 256
self.lr_stage1 = 0.01
self.lr_stage3 = 0.01
self.lr_decay_epoch = 20
self.epochs_stage1 = 20
self.epochs_stage3 = 20
self.p = 0.8
self.embedding_size = 64
self.momentum = 0.9
self.center_weight = 1
self.with_crossentorpy = True
self.seed = 1
self.checkpoint = ''
After modifying config.py, we are ready to run scJoint in terminal with
python3 main.py
This takes ~4 minutes using 1 thread for PyTorch.

Visualisation

rna_embeddings = np.loadtxt('./output/exprs_10xPBMC_rna_embeddings.txt')
atac_embeddings = np.loadtxt('./output/exprs_10xPBMC_atac_embeddings.txt')
print(rna_embeddings.shape)
print(atac_embeddings.shape)
embeddings =  np.concatenate((rna_embeddings, atac_embeddings))
print(embeddings.shape)
tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)
tsne_results.shape
df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
rna_labels = np.loadtxt('./data_10x/cellType_10xPBMC_rna.txt')
atac_predictions = np.loadtxt('./output/exprs_10xPBMC_atac_knn_predictions.txt')
labels =  np.concatenate((rna_labels, atac_predictions))
label_to_idx = pd.read_csv('./data_10x/label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
    label_dic = np.append(label_dic, label_to_idx[0][i][:-2])

data_label = np.array(["scRNA-seq", "scATAC-seq"])
df['data'] = np.repeat(data_label, [rna_embeddings.shape[0], atac_embeddings.shape[0]], axis=0)
df['predicted'] = label_dic[labels.astype(int)]

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "data",
    palette = sns.color_palette("tab10", 2),
    data = df,
    legend = "full",
    alpha = 0.3
)

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "predicted",
    data = df,
    legend = "full",
    alpha = 0.3
)
圖片.png
Mouse Primary Motor Cortex Data Integration using scJoint
import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
import os
import re
random.seed(1)
output_dir = "data_MOp/output/"
embeddings_file_names = [fn for fn in os.listdir(output_dir)
                         if any(fn.endswith(ext) for ext in ['_embeddings.txt'])]
embeddings_file_names.sort()
embeddings_file_names
embeddings = np.loadtxt(output_dir + embeddings_file_names[0])
batch = np.repeat(re.sub('_embeddings.txt', '', embeddings_file_names[0]), embeddings.shape[0])
for fn in embeddings_file_names[1:]:
    e = np.loadtxt(output_dir + fn)
    embeddings = np.append(embeddings, e, axis=0)
    batch = np.append(batch, np.repeat(re.sub('_embeddings.txt', '', fn), e.shape[0]))
print(embeddings.shape)
print(batch.shape)

tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)

df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
df['data'] = batch

prediction_file_names = [fn for fn in os.listdir(output_dir)
                         if any(fn.endswith(ext) for ext in ['_knn_predictions.txt'])]
#print(prediction_file_names)
atac_prediction = np.loadtxt(output_dir + prediction_file_names[0])
methy_prediction = np.loadtxt(output_dir + prediction_file_names[1])
#print(atac_prediction.shape)
#print(methy_prediction.shape)

# Reading the original labels
data_dir = "data_MOp/"
label_file_names = [fn for fn in os.listdir(data_dir)
                         if any(fn.endswith(ext) for ext in ['cellTypes.txt'])]
label_file_names.sort()
#print(label_file_names)

atac_original = np.loadtxt(data_dir + label_file_names[0])
methy_original = np.loadtxt(data_dir + label_file_names[8])
rna_original = []

for fn in label_file_names[1:8]:
    p = np.loadtxt(data_dir + fn)
    rna_original = np.append(rna_original, p)
    
#print(rna_original.shape)
prediction = np.append(atac_prediction, rna_original)
prediction = np.append(prediction, methy_original)

# matching the numeric labels to cell type annotation
label_to_idx = pd.read_csv(data_dir + 'label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
    label_dic = np.append(label_dic, label_to_idx[0][i][:-2])
    
df['predicted'] = label_dic[prediction.astype(int)]

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "data",
    palette = sns.color_palette("tab10", 9),
    data = df.sample(frac=1),
    legend = "full",
    alpha = 0.3
)
圖片.png
plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "predicted",
    data = df,
    legend = "full",
    alpha = 0.3
)

圖片.png

生活很好,有你更好

?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
【社區(qū)內容提示】社區(qū)部分內容疑似由AI輔助生成,瀏覽時請結合常識與多方信息審慎甄別。
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發(fā)布,文章內容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。
禁止轉載,如需轉載請通過簡信或評論聯(lián)系作者。

相關閱讀更多精彩內容

友情鏈接更多精彩內容