lasso回歸
1.準備輸入數(shù)據(jù)
load("TCGA-KIRC_sur_model.Rdata")
ls()
#> [1] "exprSet" "meta"
exprSet[1:4,1:4]
#> TCGA-A3-3307-01A-01T-0860-13 TCGA-A3-3308-01A-02R-1324-13
#> hsa-let-7a-1 11.94363 13.16162
#> hsa-let-7a-2 12.97324 14.17303
#> hsa-let-7a-3 12.04630 13.18481
#> hsa-let-7b 13.76790 14.51511
#> TCGA-A3-3311-01A-02R-1324-13 TCGA-A3-3313-01A-02R-1324-13
#> hsa-let-7a-1 12.26391 12.38326
#> hsa-let-7a-2 13.26650 13.39119
#> hsa-let-7a-3 12.28186 12.35551
#> hsa-let-7b 14.09461 12.33623
meta[1:4,1:4]
#> ID event death last_followup
#> 1 TCGA-A3-3307 0 0 1436
#> 2 TCGA-A3-3308 0 0 16
#> 3 TCGA-A3-3311 1 1191 0
#> 4 TCGA-A3-3313 1 735 0
2.構建lasso回歸模型
輸入數(shù)據(jù)是表達矩陣(僅含tumor樣本)和每個病人對應的生死(順序必須一致)。
x=t(exprSet)#轉(zhuǎn)置后的表達矩陣
y=meta$event#事件
library(glmnet)
model_lasso <- glmnet(x, y,nlambda=10, alpha=1)#構建模型,nlambda=10計算十個λ
print(model_lasso)
#>
#> Call: glmnet(x = x, y = y, alpha = 1, nlambda = 10)
#>
#> Df %Dev Lambda
#> 1 0 0.00 0.129900
#> 2 11 11.73 0.077850
#> 3 21 20.92 0.046670
#> 4 54 30.69 0.027980
#> 5 95 44.71 0.016770
#> 6 160 57.60 0.010050
#> 7 228 68.75 0.006028
#> 8 295 77.36 0.003613
#> 9 347 83.51 0.002166
#> 10 380 88.43 0.001299
這里是舉例子,所以只計算了10個λ值,解釋一下輸出結(jié)果三列的意思。 - Df 是自由度 - 列%Dev代表了由模型解釋的殘差的比例,對于線性模型來說就是模型擬合的R^2(R-squred)。它在0和1之間,越接近1說明模型的表現(xiàn)越好,如果是0,說明模型的預測結(jié)果還不如直接把因變量的均值作為預測值來的有效。 - Lambda 是構建模型的重要參數(shù)。
解釋的殘差百分比越高越好,但是構建模型使用的基因的數(shù)量也不能太多,需要取一個折中值。
2.1挑選合適的λ值
計算1000個,畫圖,篩選表現(xiàn)最好的λ值
cv_fit <- cv.glmnet(x=x, y=y, nlambda = 1000,alpha = 1)
plot(cv_fit)
兩條虛線分別指示了兩個特殊的λ值,一個是lambda.min,一個是lambda.1se,這兩個值之間的lambda都認為是合適的。lambda.1se構建的模型最簡單,即使用的基因數(shù)量少,而lambda.min則準確率更高一點,使用的基因數(shù)量更多一點。
2.2 用這兩個λ值重新建模
model_lasso_min <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.min)
model_lasso_1se <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.1se)
這兩個值體現(xiàn)在參數(shù)lambda上。有了模型,可以將篩選的基因挑出來了。所有基因存放于模型的子集beta中,用到的基因有一個s0值,沒用的基因只記錄了“.”,所以可以用下面代碼挑出用到的基因。
head(model_lasso_min$beta,20)
#> 20 x 1 sparse Matrix of class "dgCMatrix"
#> s0
#> hsa-let-7a-1 .
#> hsa-let-7a-2 .
#> hsa-let-7a-3 .
#> hsa-let-7b .
#> hsa-let-7c .
#> hsa-let-7d .
#> hsa-let-7e .
#> hsa-let-7f-1 .
#> hsa-let-7f-2 .
#> hsa-let-7g .
#> hsa-let-7i .
#> hsa-mir-1-2 .
#> hsa-mir-100 .
#> hsa-mir-101-1 -0.02460853
#> hsa-mir-101-2 .
#> hsa-mir-103-1 .
#> hsa-mir-103-2 .
#> hsa-mir-105-1 .
#> hsa-mir-105-2 .
#> hsa-mir-106a .
choose_gene_min=rownames(model_lasso_min$beta)[as.numeric(model_lasso_min$beta)!=0]
choose_gene_1se=rownames(model_lasso_1se$beta)[as.numeric(model_lasso_1se$beta)!=0]
length(choose_gene_min)
#> [1] 69
length(choose_gene_1se)
#> [1] 15
3.模型預測和評估
3.1自己預測自己
newx參數(shù)是預測對象。輸出結(jié)果lasso.prob是一個矩陣,第一列是min的預測結(jié)果,第二列是1se的預測結(jié)果,預測結(jié)果是概率,或者說百分比,不是絕對的0和1。
將每個樣本的生死和預測結(jié)果放在一起,直接cbind即可。
lasso.prob <- predict(cv_fit, newx=x , s=c(cv_fit$lambda.min,cv_fit$lambda.1se) )#newx=x測試集的表達矩陣
re=cbind(y ,lasso.prob)
head(re)
#> y 1 2
#> TCGA-A3-3307-01A-01T-0860-13 0 0.1304463 0.2161883
#> TCGA-A3-3308-01A-02R-1324-13 0 0.3652738 0.3646634
#> TCGA-A3-3311 -01A-02R-1324-13 1 0.3015306 0.2927687
#> TCGA-A3-3313-01A-02R-1324-13 1 0.4953936 0.3473468
#> TCGA-A3-3316-01A-01T-0860-13 0 0.3381294 0.3110597
#> TCGA-A3-3317-01A-01T-0860-13 0 0.3472768 0.3380213
3.2 箱線圖
對預測結(jié)果進行可視化。以實際的生死作為分組,畫箱線圖整體上查看預測結(jié)果。
re=as.data.frame(re)
colnames(re)=c('event','prob_min','prob_1se')
re$event=as.factor(re$event)
library(ggpubr)
p1 = ggboxplot(re, x = "event", y = "prob_min",
color = "event", palette = "jco",
add = "jitter")+
scale_y_continuous(limits = c(0,1)) +
stat_compare_means()
p2 = ggboxplot(re, x = "event", y = "prob_1se",
color = "event", palette = "jco",
add = "jitter")+
scale_y_continuous(limits = c(0,1)) +
stat_compare_means()
library(patchwork)
p1+p2
#> Warning: Removed 16 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 16 rows containing non-finite values (stat_compare_means).
#> Warning: Removed 16 rows containing missing values (geom_point).
可以看到,真實結(jié)果是生(0)的樣本,預測的值就小一點(靠近0),真實結(jié)果是死(1)的樣本,預測的值就大一點(靠近1),整體上趨勢是對的,但不是完全準確,模型是可用的。
對比兩個λ值構建的模型,差別不大,min的預測值準確一點。
3.3 ROC曲線
計算AUC取值范圍在0.5-1之間,越接近于1越好??梢愿鶕?jù)預測結(jié)果繪制ROC曲線。
library(ROCR)
library(caret)
# 自己預測自己
#min
pred_min <- prediction(re[,2], re[,1])
auc_min = performance(pred_min,"auc")@y.values[[1]]
perf_min <- performance(pred_min,"tpr","fpr")
plot(perf_min,colorize=FALSE, col="blue")
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
text(0.8,0.2, labels = paste0("AUC = ",round(auc_min,3)))
#1se
pred_1se <- prediction(re[,3], re[,1])
auc_1se = performance(pred_1se,"auc")@y.values[[1]]
perf_1se <- performance(pred_1se,"tpr","fpr")
plot(perf_1se,colorize=FALSE, col="red")
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
text(0.8,0.2, labels = paste0("AUC = ",round(auc_1se,3)))
- 強迫癥選項,想把兩個模型畫一起。
plot(perf_min,colorize=FALSE, col="blue")
plot(perf_1se,colorize=FALSE, col="red",add = T)
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
text(0.8,0.3, labels = paste0("AUC = ",round(auc_min,3)),col = "blue")
text(0.8,0.2, labels = paste0("AUC = ",round(auc_1se,3)),col = "red")
-還可以用ggplot2畫的更好看一點
tpr_min = performance(pred_min,"tpr")@y.values[[1]]
tpr_1se = performance(pred_1se,"tpr")@y.values[[1]]
dat = data.frame(tpr_min = perf_min@y.values[[1]],
fpr_min = perf_min@x.values[[1]],
tpr_1se = perf_1se@y.values[[1]],
fpr_1se = perf_1se@x.values[[1]])
library(ggplot2)
ggplot() +
geom_line(data = dat,aes(x = fpr_min, y = tpr_min),color = "blue") +
geom_line(aes(x=c(0,1),y=c(0,1)),color = "grey")+
theme_bw()+
annotate("text",x = .75, y = .25,
label = paste("AUC of min = ",round(auc_min,2)),color = "blue")+
scale_x_continuous(name = "fpr")+
scale_y_continuous(name = "tpr")
ggplot() +
geom_line(data = dat,aes(x = fpr_min, y = tpr_min),color = "blue") +
geom_line(data = dat,aes(x = fpr_1se, y = tpr_1se),color = "red")+
geom_line(aes(x=c(0,1),y=c(0,1)),color = "grey")+
theme_bw()+
annotate("text",x = .75, y = .25,
label = paste("AUC of min = ",round(auc_min,2)),color = "blue")+
annotate("text",x = .75, y = .15,label = paste("AUC of 1se = ",round(auc_1se,2)),color = "red")+
scale_x_continuous(name = "fpr")+
scale_y_continuous(name = "tpr")
5.切割數(shù)據(jù)構建模型并預測
5.1 切割數(shù)據(jù)
用R包caret切割數(shù)據(jù),生成的結(jié)果是一組代表列數(shù)的數(shù)字,用這些數(shù)字來給表達矩陣和meta取子集即可。
library(caret)
set.seed(12345679)
sam<- createDataPartition(meta$event, p = .5,list = FALSE)
head(sam)
#> Resample1
#> [1,] 5
#> [2,] 9
#> [3,] 13
#> [4,] 17
#> [5,] 19
#> [6,] 22
可查看兩組一些臨床參數(shù)切割比例
train <- exprSet[,sam]
test <- exprSet[,-sam]
train_meta <- meta[sam,]
test_meta <- meta[-sam,]
prop.table(table(train_meta$stage))
#>
#> i ii iii iv
#> 0.4636015 0.1072797 0.2796935 0.1494253
prop.table(table(test_meta$stage))
#>
#> i ii iii iv
#> 0.5249042 0.1111111 0.1954023 0.1685824
prop.table(table(test_meta$race))
#>
#> asian black or african american white
#> 0.01171875 0.08593750 0.90234375
prop.table(table(train_meta$race))
#>
#> asian black or african american white
#> 0.01937984 0.13953488 0.84108527
5.2 切割后的train數(shù)據(jù)集建模
和上面的建模方法一樣。
#計算lambda
x = t(train)
y = train_meta$event
cv_fit <- cv.glmnet(x=x, y=y, nlambda = 1000,alpha = 1)
plot(cv_fit)
#構建模型
model_lasso_min <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.min)
model_lasso_1se <- glmnet(x=x, y=y, alpha = 1, lambda=cv_fit$lambda.1se)
#挑出基因
head(model_lasso_min$beta)
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#> s0
#> hsa-let-7a-1 .
#> hsa-let-7a-2 .
#> hsa-let-7a-3 .
#> hsa-let-7b .
#> hsa-let-7c .
#> hsa-let-7d .
choose_gene_min=rownames(model_lasso_min$beta)[as.numeric(model_lasso_min$beta)!=0]
choose_gene_1se=rownames(model_lasso_1se$beta)[as.numeric(model_lasso_1se$beta)!=0]
length(choose_gene_min)
#> [1] 42
length(choose_gene_1se)
#> [1] 6
4.模型預測
用訓練集構建模型,預測測試集的生死,注意newx參數(shù)變了。
lasso.prob <- predict(cv_fit, newx=t(test), s=c(cv_fit$lambda.min,cv_fit$lambda.1se) )
re=cbind(event = test_meta$event ,lasso.prob)
head(re)
#> event 1 2
#> TCGA-A3-3307-01A-01T-0860-13 0 0.2346060 0.2849366
#> TCGA-A3-3308-01A-02R-1324-13 0 0.3545987 0.3752418
#> TCGA-A3-3311-01A-02R-1324-13 1 0.3812493 0.3368730
#> TCGA-A3-3313-01A-02R-1324-13 1 0.4420153 0.3805503
#> TCGA-A3-3317-01A-01T-0860-13 0 0.3536417 0.3175332
#> TCGA-A3-3319-01A-02R-1324-13 0 0.7300191 0.4180086
再次進行可視化
re=as.data.frame(re)
colnames(re)=c('event','prob_min','prob_1se')
re$event=as.factor(re$event)
library(ggpubr)
p1 = ggboxplot(re, x = "event", y = "prob_min",
color = "event", palette = "jco",
add = "jitter")+
scale_y_continuous(limits = c(0,1)) +
stat_compare_means()
p2 = ggboxplot(re, x = "event", y = "prob_1se",
color = "event", palette = "jco",
add = "jitter")+
scale_y_continuous(limits = c(0,1)) +
stat_compare_means()
library(patchwork)
p1+p2
#> Warning: Removed 4 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 4 rows containing non-finite values (stat_compare_means).
#> Warning: Removed 4 rows containing missing values (geom_point).
再畫ROC曲線
library(ROCR)
library(caret)
# 訓練集模型預測測試集
#min
pred_min <- prediction(re[,2], re[,1])
auc_min = performance(pred_min,"auc")@y.values[[1]]
perf_min <- performance(pred_min,"tpr","fpr")
#1se
pred_1se <- prediction(re[,3], re[,1])
auc_1se = performance(pred_1se,"auc")@y.values[[1]]
perf_1se <- performance(pred_1se,"tpr","fpr")
tpr_min = performance(pred_min,"tpr")@y.values[[1]]
tpr_1se = performance(pred_1se,"tpr")@y.values[[1]]
dat = data.frame(tpr_min = perf_min@y.values[[1]],
fpr_min = perf_min@x.values[[1]],
tpr_1se = perf_1se@y.values[[1]],
fpr_1se = perf_1se@x.values[[1]])
ggplot() +
geom_line(data = dat,aes(x = fpr_min, y = tpr_min),color = "blue") +
geom_line(data = dat,aes(x = fpr_1se, y = tpr_1se),color = "red")+
geom_line(aes(x=c(0,1),y=c(0,1)),color = "grey")+
theme_bw()+
annotate("text",x = .75, y = .25,
label = paste("AUC of min = ",round(auc_min,2)),color = "blue")+
annotate("text",x = .75, y = .15,label = paste("AUC of 1se = ",round(auc_1se,2)),color = "red")+
scale_x_continuous(name = "fpr")+
scale_y_continuous(name = "tpr")
AUC值比不拆分時降低。
*生信技能樹課程筆記