代謝組學(xué)數(shù)據(jù)-統(tǒng)計(jì)分析及生物靶標(biāo)挖掘

@我的博客:有味

導(dǎo)讀

寫(xiě)這篇文檔的初衷就是一直想把代謝組學(xué)數(shù)據(jù)寫(xiě)成一個(gè)小流程,之前關(guān)于非靶向代謝組學(xué)寫(xiě)了一個(gè)分析總結(jié)——非靶向 | 靶向代謝組學(xué)數(shù)據(jù)分析總結(jié)-綱要。那么今天我將把代謝組數(shù)據(jù)的具體分析思路做個(gè)整理。思路參考的是中科院大連化物所許國(guó)旺老師不久前發(fā)表的一篇文章。附上文章截圖如下:

文章截圖

代謝組學(xué)研究樣本設(shè)計(jì),特別是臨床樣本,一般在有條件的情況下(樣本量大),一般采用三步走的方法:

實(shí)驗(yàn)群體概況

分析流程介紹

樣本收集

  • 大多數(shù)用到的是血清/血漿樣本(血清和血漿的最大區(qū)別就是血清在凝血的過(guò)程中纖維蛋白原轉(zhuǎn)變成纖維蛋白塊,也就是說(shuō)血清中沒(méi)有纖維蛋白原)。一般在次日凌晨6:00~8:00(經(jīng)過(guò)一晚上的禁食,主要是為了避免飲食對(duì)生物流體樣本檢測(cè)結(jié)果的干擾)。

  • 除了血清等生物流體樣本,如果是做組織病變的,還可以采集組織的病變的部分和鄰近的正常組織部分。值得注意的是,在取樣使用時(shí)一定要在冰上解凍,或放在-4℃冰箱

上機(jī)測(cè)定

  • discovery和test實(shí)驗(yàn)集的樣本采用偽靶向(pseudotargeted method,目前一種較為新穎的技術(shù),也有一些相關(guān)的文章報(bào)道)的方法進(jìn)行測(cè)定,而在validation群體中進(jìn)行靶向驗(yàn)證在discovery和test集合中找到的代謝物biomarker。

數(shù)據(jù)分析

  • 這篇文章的多元統(tǒng)計(jì)分析是通過(guò)SMICA-P軟件來(lái)實(shí)現(xiàn)的,包括PCA分析(無(wú)監(jiān)督)用來(lái)評(píng)估整體代謝物在組間的變化及監(jiān)測(cè)整個(gè)實(shí)驗(yàn)的穩(wěn)定性。pls-da分析(有監(jiān)督)用于組間最大化,以及鑒定重要的代謝物(根據(jù)投影變量的重要性,variable important in the projection,VIP),并設(shè)置排列檢驗(yàn),以防止模型的過(guò)擬合。

  • 單變量分析:變量如果經(jīng)過(guò)適當(dāng)轉(zhuǎn)換后符合正態(tài)性就可以使用常規(guī)的參數(shù)檢驗(yàn)如 student-t test。如果不轉(zhuǎn)換可以使用非參數(shù)檢驗(yàn)如wilcoxon test(兩組),Kruskal wallis test(適用于三組及以上分組的情況)。當(dāng)然多重檢驗(yàn)還需要進(jìn)行P值的校正(這里一般采用p.adjust()函數(shù),BH方法)。

  • 回歸模型,疾病表型的響應(yīng)變量一般為二分類(lèi)變量,即患病 Vs. 健康,或者根據(jù)疾病嚴(yán)重程度也可以分為多組,視具體情況而定。所以這里一般采用二元邏輯回歸或者隨機(jī)森林模型尋找潛在的代謝物靶標(biāo)。

總結(jié)

  • discovery集合用于發(fā)現(xiàn)組間差異顯著的代謝物;

  • test集合永和驗(yàn)證(或者說(shuō)看能重復(fù)多少)discovery集合中的差異代謝物,這里有兩個(gè)要求那就是:a)在discovery里是差異顯著的在test中也是差異顯著,顯著性水平設(shè)置要一樣;b)差異的方向在兩個(gè)群體中要一致。選擇出在兩個(gè)群體中重復(fù)到的代謝物,進(jìn)一步通過(guò)binary-logistic模型選擇代謝物來(lái)構(gòu)建最佳模型。

  • validation集合中用于驗(yàn)證上一步得到的結(jié)果

附加代碼在最后

# Description: this script was arranged for metabolomics data analysis
# under the guidelines of article "A Large-Scale, Multicenter Serum Metabolite Biomarker Identification
# Study for the Early Detection of Hepatocellular Carcinoma"
# the mock data was from the R package statTarget
rm(list = ls())
setwd("D:\\R\\R-exercise\\metabo_data")
library(statTarget)
library(impute)
datapath <- system.file('extdata',package = 'statTarget')
file <- paste(datapath,'data_example.csv',sep = '/')
example1 <- read.csv(file)
## the second example data
example2 <- read.csv(choose.files(),check.names = F,header = T,row.names = 1)
# to impute the value NA by KNN method in the package impute
group_label <- as.factor(example2[,1])
example2[,1] <- NULL
example3 <- as.data.frame(t(example2))
example4 <- impute.knn(as.matrix(example3),k=9)
example5 <- as.data.frame(t(example4$data))

## PCAj無(wú)監(jiān)督分析用于評(píng)估整體代謝物在組間的變化
library("FactoMineR")
library("factoextra")
# 數(shù)據(jù)間進(jìn)行標(biāo)度化處理
Z_score <- function(x){
  x <- (x - mean(x))/sd(x)
}
example5 <- apply(example5, 1, Z_score)
example5 <- as.data.frame(t(example5))
data.pca <- PCA(example5,graph = FALSE)
Group <- group_label
data <- cbind(example5, Group)
colnames(data)[ncol(data)] <- "class"
# Scree plot to determine the number of principal components
tiff("meta_scree.tiff",width=2000,height=2000,res=300)
fviz_eig(data.pca, addlabels = TRUE, ylim = c(0, 100))
dev.off()
# individual plot
tiff("meta_PCA.tiff",width=2000,height=2000,res=300)
ind.p <- fviz_pca_ind(data.pca, geom = "point", col.ind = data$class)
ggpubr::ggpar(ind.p,
              title = "Principal Component Analysis",
              subtitle = "metabolomic sets",
              #caption = "Source: factoextra",
              xlab = "PC1(53.2%)", ylab = "PC2(7.3%)",
              legend.title = "Group", legend.position = "top",
              ggtheme = theme_gray(), palette = "npg"
)
dev.off()

## pls-da分析
library(ropls)
example5_oplsda <- opls(example5,Group,predI = 1,orthoI = NA)
# calculate the vip value
VipVn <- getVipVn(opls(example5,Group,predI = 1,orthoI = NA,plot = FALSE))
# extract the variable name which the VIP value > 1
VipVn_great_than_1 <- names(VipVn[VipVn > 1])
# univariate statistics to extract variables differred between groups
pvaVn <- apply(example5,2,function(feaVn) wilcox.test(feaVn ~ Group,paired=FALSE)$p.value)
pvaVn_adj <- p.adjust(pvaVn,method = "BH",n=length(pvaVn))
# extract the adjusted p-value less than 0.05
pvaVn_adj_less_0.05 <- names(pvaVn_adj[pvaVn_adj < 0.05])

# use intersect function to obtain the variables both VIP value great than 1
# and p value less than 0.05
final_variable <- intersect(VipVn_great_than_1,pvaVn_adj_less_0.05)

# 熱圖展示13個(gè)final_variable的豐度
heatmap_df <- example4$data
heatmap_df <- as.data.frame(t(heatmap_df))
heatmap_df <- heatmap_df[,final_variable]
heatmap_df_log10 <- log10(heatmap_df)
heatmap_df_log10 <- as.matrix(t(heatmap_df_log10))
library(pheatmap)
annotation_col <- data.frame(Group=factor(rep(c("N","T"),c(25,61))))
rownames(annotation_col) <- colnames(heatmap_df_log10)
ann_colors <- list(Group=c(N="#4DAF4A",T="#984EA3"))
#plot.new()
tiff("metabo_biomarker_pheatmap.tiff",width = 3000,height = 700,res = 300)
pheatmap(heatmap_df_log10,treeheight_col = 20,annotation_colors = ann_colors,cluster_cols = FALSE,
         cluster_rows = TRUE,show_colnames = FALSE,annotation_col=annotation_col,
         gaps_col = c(25,25),border_color = "NA",cellwidth = 4.7,cellheight = 10)
dev.off()


## 建立模型
## first transform the Group 'N' and 'T' into '0' and '1'
example6 <- example5[,final_variable]
outcome = Group
outcome <- sub("N",0,outcome) # 將分類(lèi)變量轉(zhuǎn)為0,1變量
outcome <- sub("T",1,outcome)
# merge the data without classification with outcome
data_logi <- cbind(example6,outcome)
# 按照一定比例選擇,訓(xùn)練集70%樣本,驗(yàn)證集30%的樣本
index <- sample(x=2,size = nrow(data_logi),replace = TRUE,prob = c(0.7,0.3))
traindata <- data_logi[index==1,]
testdata <- data_logi[index==2,]
# 先對(duì)訓(xùn)練集數(shù)據(jù)構(gòu)建logit模型
logit_lm <- glm(outcome ~.,family = binomial,data=traindata)
# summary(logit_lm)
logit_step <- step(logit_lm,direction = "backward")
# summary(logit_step)
# 模型的顯著性檢驗(yàn),不止變量要顯著,模型也要顯著
anova(object = logit_step,test = "Chisq")
# 模型對(duì)樣本外數(shù)據(jù)(驗(yàn)證集)的預(yù)測(cè)
prob <- predict(object = logit_step,newdata=testdata,type="response")
prediction <- ifelse(prob >=0.5, 1,0)
prediction <- factor(prediction,levels = c(1,0),ordered = TRUE)
f <- table(testdata$outcome,prediction)
f
agreement <- as.vector(prediction) == testdata$outcome
table(agreement)
prop.table(table(agreement))

# ROC curve and AUC value calculation
library(pROC)
roc(testdata$outcome,prob)
roc1 <- roc(testdata$outcome,
            prob,
            percent=TRUE,
            partial.auc.correct=TRUE,
            ci=TRUE, boot.n=100, ci.alpha=0.9, stratified=FALSE,
            plot=F, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE)
tiff("example_variable_pROC.tiff",width=2000,height=2000,res=300)
roc1 <- roc(testdata$outcome,prob,
            ci=TRUE, boot.n=100, ci.alpha=0.9, stratified=FALSE,
            plot=TRUE, percent=roc1$percent,col="#F781BF",cex.lab=1.2, cex.axis=1.2,cex.main=1.5)
sens.ci <- ci.se(roc1, specificities=seq(0, 100, 5))
plot(sens.ci, type="shape", col="#FF69B4")
plot(sens.ci, type="bars")
plot(roc1,col="black",add=T)
legend("bottomright",cex=1.2,c(paste("AUC=",round(roc1$ci[2],2),"%"),
                               paste("95% CI:",round(roc1$ci[1],2),"%-",round(roc1$ci[3],2),"%")))
dev.off()

## 使用隨機(jī)森林進(jìn)行潛在biomarker的選擇
library(randomForest)

# 首先確定下后面要用到的參數(shù)mtry和ntree
n <- length(names(traindata))
set.seed(100)
names(traindata)[1:ncol(traindata)-1] <- paste("M",1:13,sep = "_")
# 選擇mtry參數(shù)
for(i in 1:(n-1)){
  mtry_fit <- randomForest(outcome ~ ., data = traindata, mtry = i)
  err <- mean(mtry_fit$err.rate)
  print(err)  # 當(dāng)mtry=2的時(shí)候誤差最小
}

# 選擇ntree參數(shù)
set.seed(100)
ntree_fit <- randomForest(outcome ~., data = traindata,mtry=2,ntree=1000)
plot(ntree_fit) # 在400左右模型內(nèi)誤差基本穩(wěn)定,默認(rèn)值為500,可以選默認(rèn)值

# 變量重要性選擇
set.seed(100)
#' rfcv是隨機(jī)森林交叉驗(yàn)證函數(shù):Random Forest Cross Validation
result <- rfcv(traindata[,-ncol(traindata)],traindata$outcome,cv.fold = 10)
result$error.cv #' 查看錯(cuò)誤率表,21時(shí)錯(cuò)誤率最低,為最佳模型
#' 繪制驗(yàn)證結(jié)果
with(result,plot(n.var,error.cv,log="x",type = "o",lwd=2)) # 結(jié)果是選擇13個(gè)變量誤差最小

# 構(gòu)建模型
set.seed(100)
rf <- randomForest(outcome ~ ., data = traindata, mtry=2,ntree=400,importance=TRUE)
rf

# 驗(yàn)證并預(yù)測(cè)
names(testdata)[1:ncol(testdata)-1] <- paste("M",1:13,sep = "_")
pred1 <- predict(rf, testdata)
# 查看準(zhǔn)確性
test_matrix <- table(pred1, testdata$outcome)
sum(diag(test_matrix))/sum(test_matrix)
# 0.8148

參考

[1] A Large-Scale, Multicenter Serum Metabolite Biomarker Identification Study for the Early Detection of Hepatocellular Carcinoma

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容