apply和function
差異基因的檢驗估計會用到function和apply。不過差異基因表達的function和apply估計也不會有太大的難度,所以我這里就直接放上《R語言實戰(zhàn)》的例子。詳細地題目應對我還是放到后面具體題目(當然也是因為我R語言太菜了……)
function
function可以去看下《R語言實戰(zhàn)》第二版的P436,我這里截個圖

function的基本語法為
functionname <- function(parameters){
statements
return(value)
}
functionname就是你的函數(shù)名字,你也可以寫成wodehanshu這種名字。
parameters里面就是設定你要用到的參數(shù)
statement就是你執(zhí)行的一系列操作,一般來說你這一系列操作會產生一個值。然后用return返回
例子的話,上面已經舉了。
apply
apply的使用我還是放一個《R語言實戰(zhàn)》的例子,在P95和P96


差異基因表達
差異基因估計是用t-test做檢驗,然后用p.adjust做矯正(雖然照理說不能用t-test做)。我這里就不詳細地講了。舉幾個例子吧
題目1
我們第4次作業(yè)的第四題
Type 1 diabetes is a multigenic disease caused by T-cell mediated destruction of the insulin producing β-cells. Although conventional (targeted) approaches of identifying causative genes have advanced our knowledge of this disease, many questions remain unanswered.
Here we have a gene data from NOD mouse after(case) and before(control) treatment. The data can be found in “Data.txt”.Use the information mentioned above to answer the following questions:
1.use paired t-test to find genes which have significant expression (p<0.05) between case and control sample. Give the number of differential expressed genes and give the names of top 10 significantly differential expression genes. hint: “apply(data,1,function(x){…})” can apply function to every row in data more quickly than “for{}”, “names()” or “rownames()” can be used to extract names of differentially expressed genes
這里是輸入了基因表達的數(shù)據。分別是10個control和10個case,然后做配對的t-test。我們可以手動地對每個基因(即每一列)對t-test。但這樣太麻煩,所以我們就可以用apply。apply可以對每一列做批量化的操作。
# 讀取和整理數(shù)據
# paste0這個操作就是把“gene”這個字符串和對應的列名粘起來,這樣使得列名比較直觀一點
test4 <- read.table("rawdata/test4.txt",header = T)
rownames(test4) <- paste0("gene",rownames(test4))
# 看下數(shù)據(我后來發(fā)現(xiàn)沒有gene1了。。。。。但我最后的答案跟作業(yè)答案一樣,應該是原始文件的鍋)
> head(test4)
control1 control2 control3 control4 control5 control6 control7 control8 control9 control10 case1 case2
gene2 6.917468 6.308350 5.318841 5.886811 5.082975 5.629453 4.919035 3.134226 4.242564 5.783208 3.574525 5.371273
gene3 7.862730 7.065809 6.783732 6.275773 3.063104 5.131017 3.708938 2.766133 6.755942 3.954392 7.163509 5.600665
gene4 7.425996 7.707939 7.550885 5.708736 5.900326 6.888329 5.987753 5.948979 7.281995 6.758037 5.367602 7.898202
gene5 6.544029 8.364054 7.239746 9.037322 8.783692 8.863040 10.602350 9.046190 9.007177 7.866627 8.079780 8.128662
gene6 4.986893 7.248867 7.098780 6.787237 6.252609 6.428099 7.758556 6.726613 7.123883 6.584567 6.642005 7.376994
gene7 6.995863 6.448682 6.136518 7.098878 4.984325 5.896835 6.488555 5.299349 5.524948 6.763232 7.004367 6.478264
case3 case4 case5 case6 case7 case8 case9 case10
gene2 7.217279 6.786191 5.523913 5.355584 6.693710 4.364467 6.270432 6.993901
gene3 6.079167 7.125393 5.169236 4.524597 4.715383 2.946439 4.200206 5.798403
gene4 6.321147 7.810430 6.231793 6.367243 7.579535 7.207376 7.548390 6.522594
gene5 8.716447 9.940214 9.224614 9.139514 9.429897 9.373779 9.287744 9.275545
gene6 7.080039 7.100746 6.453575 6.618557 7.614747 7.141351 6.076726 7.369737
gene7 6.273696 6.413472 4.983931 5.629765 5.213316 5.931045 5.776625 6.522169
# 構建函數(shù)(我個人可能比較推薦先構建函數(shù),然后把函數(shù)放到apply中,這樣可能比較直觀一點)
# 我輸入數(shù)據為x,然后我提取1:10個數(shù)為data1,11:20個數(shù)為data2。然后做t.test
myFun <- function(x){
data1 <- x[1:10]
data2 <- x[11:20]
t_result <- t.test(data1,data2,paired = T)
p_value <- t_result$p.value
return(p_value)
}
# 利用apply,一行行地把數(shù)據放入myFun里面,一行行地做t.test
# 輸出應該是個向量
test4_result <- apply(test4_paired_result, 1, result)
# 也可以直接一行代碼
apply(test4, 1, function(x) {t.test(x[1:10],x[11:20],paired = T)$p.value})
# 數(shù)下差異基因的數(shù)量
> sum(test4_result < 0.05)
[1] 2296
# 輸出前10個
## 先對得到的p-value進行排序
test4_result[order(test4_result)]
## 然后得到前10
> test4_result[order(test4_result)][1:10]
gene28801 gene27868 gene27438 gene21642 gene24019 gene12323 gene12962 gene28939 gene2387
4.277344e-06 6.093302e-05 8.229606e-05 9.889894e-05 1.124717e-04 1.261658e-04 1.298200e-04 1.462493e-04 1.522920e-04
gene18712
1.601297e-04
2.Adjust the p-values in question a) with bonferroni and FDR method to find differentially expressed genes in stringent way( list the differentially expressed gene names and the adjusted p-value)
矯正的話,直接用p.adjust就可以了。這里再用sum或者mean數(shù)一數(shù)差異基因的個數(shù)
test4_adjust_bonf <- p.adjust(test4_result, method = "bonferroni")
mean(test4_adjust_bonf < 0.05)
test4_adjust_fdr <- p.adjust(test4_result, method = "fdr")
mean(test4_adjust_fdr < 0.05)
題目2
把題目中的paired數(shù)據變成非配對的數(shù)據。那么我們就要多一步方差齊性檢驗了。
# 在函數(shù)里面做判斷
myFun <- function(x){
data1 <- x[1:10]
data2 <- x[11:20]
if (var.test(data1,data2)$p.value > 0.05){
t_result <- t.test(data1,data2,var.equal = T)
p_value <- t_result$p.value
return(p_value)
} else {
t_result <- t.test(data1,data2,var.equal = F)
p_value <- t_result$p.value
return(p_value)
}
}
test4_result <- apply(test4, 1, myFun)
題目3
來自17年考試的第5題
Nonalcoholic steatohepatitis or NASH is a common liver disease, and was found to be linked to obesity and diabetes, suggesting an important role of adipose tissue in the pathogenesis of NASH. Therefore, the mouse model was used to investigate the interaction between adipose tissue and liver. Wildtype male C57Bl/6 mice were fed low fat food (LFD) or high fat food (HFD) for 21 weeks. The detailed data can be found in GDS4013.txt. Hint:provide the R code and result calculated with R. (本題共20分)
1.Read the data. Check whether there are NAs in the data? If NAs exist, you should deaf with them before next steps. Hint: delete the rows with NAs in data. (5分)
# 先讀取數(shù)據
> test <- read.table("rawdata/GDS4013.txt",header = T)
> head(test)
LFD LFD.1 LFD.2 LFD.3 LFD.4 LFD.5 LFD.6 LFD.7 LFD.8 LFD.9 HFD HFD.1 HFD.2
COPG1 7.756187 7.513726 7.927167 7.742323 7.974731 7.632936 7.419638 8.022296 7.649602 7.694316 7.717317 7.721638 7.873688
ATP6V0D1 8.968808 9.004967 9.025234 9.139161 9.142125 8.998360 9.097204 9.181852 9.127648 9.149562 9.029930 8.948781 9.095943
GOLGA7 9.698317 9.798994 9.765514 9.581075 9.776361 9.829636 9.670068 9.675262 9.812124 9.637824 9.625283 9.822128 9.700576
PSPH 6.897664 NA 6.744140 7.225862 7.343744 6.877918 6.422594 7.188237 6.819006 6.524627 6.514469 6.949181 7.044664
TRAPPC4 5.406903 5.473714 5.749002 5.416623 5.351018 5.431658 5.606044 5.401849 5.374199 5.331412 5.439976 5.487543 5.290085
DPM2 6.532734 6.215751 6.520986 6.143827 6.597586 6.731978 6.553594 6.659309 6.288114 6.561733 6.819006 6.616581 6.349064
HFD.3 HFD.4 HFD.5 HFD.6 HFD.7
COPG1 7.637193 7.570261 7.549871 7.687248 7.603990
ATP6V0D1 9.092459 8.981442 9.014060 9.122049 9.000419
GOLGA7 9.739647 9.679455 9.790654 9.817816 9.761968
PSPH 6.708345 6.413345 6.617198 6.843793 6.467956
TRAPPC4 5.485993 5.376865 5.522681 5.430362 5.490806
DPM2 6.554808 6.416926 6.562119 6.524627 6.771978
# 利用table看看缺失值的多少
> table(is.na(test))
FALSE TRUE
481694 4
# 利用na.omit
test_new <- na.omit(test)
> table(is.na(test_new))
FALSE
481626
2.Find differentially expressed genes (DEGs) (up-regulated, HF>LF) between LFD and HFD samples, list the DEGs with adjusted p-values less than 0.05. For expressions of each gene, check the homogeneity of the variances in two groups at first. Hint: FDR should be used to adjust p-values.
# 因為不是配對檢驗,所以我們首先要做的是檢驗方差齊性
## 所以我們需要先構建一個函數(shù)來檢驗方差的齊性
myFun_var <- function(data){
LFD <- data[1:10]
HFD <- data[11:18]
var_result <- var.test(LFD,HFD)
p_value <- var_result$p.value
}
## 應用apply來得到p-value
var_pvalue <- apply(test_new, 1, myFun_var)
## 然后往表達矩陣上多加一列p-value,得到有附加信息的表達矩陣
> test_new_var <- cbind(test_new,var_pvalue)
> head(test_new_var)
LFD LFD.1 LFD.2 LFD.3 LFD.4 LFD.5 LFD.6 LFD.7 LFD.8 LFD.9 HFD HFD.1
COPG1 7.756187 7.513726 7.927167 7.742323 7.974731 7.632936 7.419638 8.022296 7.649602 7.694316 7.717317 7.721638
ATP6V0D1 8.968808 9.004967 9.025234 9.139161 9.142125 8.998360 9.097204 9.181852 9.127648 9.149562 9.029930 8.948781
GOLGA7 9.698317 9.798994 9.765514 9.581075 9.776361 9.829636 9.670068 9.675262 9.812124 9.637824 9.625283 9.822128
TRAPPC4 5.406903 5.473714 5.749002 5.416623 5.351018 5.431658 5.606044 5.401849 5.374199 5.331412 5.439976 5.487543
DPM2 6.532734 6.215751 6.520986 6.143827 6.597586 6.731978 6.553594 6.659309 6.288114 6.561733 6.819006 6.616581
PSMB5 10.217143 9.992716 9.974211 10.201814 10.244172 10.023191 10.072604 10.147738 10.267069 10.024750 10.142353 10.119065
HFD.2 HFD.3 HFD.4 HFD.5 HFD.6 HFD.7 var_pvalue
COPG1 7.873688 7.637193 7.570261 7.549871 7.687248 7.603990 0.1119611
ATP6V0D1 9.095943 9.092459 8.981442 9.014060 9.122049 9.000419 0.5807521
GOLGA7 9.700576 9.739647 9.679455 9.790654 9.817816 9.761968 0.6524620
TRAPPC4 5.290085 5.485993 5.376865 5.522681 5.430362 5.490806 0.1770907
DPM2 6.349064 6.554808 6.416926 6.562119 6.524627 6.771978 0.6063077
PSMB5 10.086752 10.240642 10.086752 10.246507 10.291952 10.129569 0.3887486
# 得到了方差檢驗的p-value,我們就可以來把這個條件加入到t.test里面,然后做t-test了
## 首先也要構建t-test的函數(shù)
myFun_t_test <- function(data){
LFD <- data[1:10] # 1:10列為LFD
HFD <- data[11:18] # 11:18列為HFD
var_result <- data[19] # 19列為方差齊性的p-value
t_result <- t.test(LFD,HFD,alternative = "less",var.equal = var_result > 0.05) ## 這里var_result是p-value,p-value如果大于0.05了(方差齊性),就會輸出TRUE。
p_value <- t_result$p.value
}
## 利用apply,一行行地輸入函數(shù)里面,做t-test
t_test_pvalue <- apply(test_new_var, 1, myFun_t_test)
# 得到了p-value之后,就用fdr方法做矯正
> t_test_padjust <- p.adjust(t_test_pvalue,method = "fdr")
> t_test_padjust[t_test_padjust < 0.05]
SEMA5B BC048403
0.01698630 0.04004457
大家可能會奇怪這里為什么是less了,一般我們不都是two.side么
選擇單雙邊實際上是跟你的假設有關系的,我們這里是假設up-regulated, HF>LF。只是單邊而已。那為啥又是less,而不是greater呢。因為t.test(a,b,alternative)里面的alternative是針對前一個比后一個,即如果是greater的話,就是a比b大。我們這里是LFD放在了前面,假設是HF>LF的話,那么我們寫的時候應該是LFD less than HFD。
舉個例子的話
# 模擬兩個值
> a <- rnorm(20)
> b <- rnorm(20,mean = 1)
# 可以看到下面的統(tǒng)計檢驗,如果是greater的話,即a>b,是不顯著的。而less是顯著的
> t.test(a,b,alternative = "greater")
Welch Two Sample t-test
data: a and b
t = -1.5134, df = 30.883, p-value = 0.9298
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-0.7961539 Inf
sample estimates:
mean of x mean of y
0.4103696 0.7858315
> t.test(a,b,alternative = "less")
Welch Two Sample t-test
data: a and b
t = -1.5134, df = 30.883, p-value = 0.07017
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 0.04523018
sample estimates:
mean of x mean of y
0.4103696 0.7858315
然后先方差齊性檢驗,后做t-tes的代碼,除了上面的,也可以用題目2的代碼
# 用了if else
myFun <- function(x){
data1 <- x[1:10]
data2 <- x[11:18]
if (var.test(data1,data2)$p.value > 0.05){
t_result <- t.test(data1,data2,var.equal = T,alternative = "less")
p_value <- t_result$p.value
return(p_value)
} else {
t_result <- t.test(data1,data2,var.equal = F,alternative = "less")
p_value <- t_result$p.value
return(p_value)
}
}
t_test_pvalue <- apply(test_new_var, 1, myFun)
t_test_padjust <- p.adjust(t_test_pvalue,method = "fdr")
t_test_padjust[t_test_padjust < 0.05]
也可以用答案里面簡略的
# 方差齊性檢驗
p.vartest <- apply(GDS4013_new, 1, function(x)var.test(x[1:10],x[11:18])$p.value)
# 利用cbind合并p-value和表達矩陣
GDS4013_new_data <- cbind(GDS4013_new, p.vartest)
# t.test檢驗
p.value <- apply(GDS4013_new_data, 1, function(x)t.test(x[1:10],x[11:18],alternative = 'less', var.equal = (x[19] >= 0.05))$p.value)
p.fdr <- p.adjust(p.value, "fdr")
> sum(p.fdr<0.05)
[1] 2
> sig.p.fdr<-p.fdr[p.fdr<0.05]
> names(sort(sig.p.fdr))
[1] "SEMA5B" "BC048403"
富集分析
富集分析的原理我應該會在考完試有空的時候再重新理一下,我這里就直接放上用fihser exact test(fisher 精確檢驗)做富集分析的代碼。
稍微提下富集的原理。
一般差異表達的基因會有上千個,一個個手動看肯定不現(xiàn)實。所以我們就需要從宏觀上來看這些基因,用到的一個方法就是富集分析。一般用的比較多的是GO富集分析,就是比如我們得到1000個差異基因,然后我們感興趣的GO通路在差異基因中有50個基因。總的基因有2w個,在這2w個基因中,我們感興趣的GO有200個。我們想知道感興趣GO相比于總體來說,在差異基因集中是不是富集的。
fisher exact test輸入的格式是列聯(lián)表的格式,就像下面
| 差異基因 | 無差異基因 | |
|---|---|---|
| 在通路中 | 50 | 200-50=150 |
| 不在通路中 | 1000-50=950 | 19000-150=18850 |
變成代碼就是
> test <- matrix(c(50,950,150,18850),2)
> test
[,1] [,2]
[1,] 50 150
[2,] 950 18850
> fisher.test(test)
Fisher's Exact Test for Count Data
data: test
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
4.670869 9.230328
sample estimates:
odds ratio
6.61189
看下我們某年的一個GO富集題目
Usually you are interested in the function indicated by differentially expressed genes, for which GO enrichment is a widely used method. In order to find whether the differentially expression genes (downregulated, p<=0.05) are enriched in “l(fā)eukocyte activation during immune response” (GO term), please show a conclusive result using fisher exact test. Known genes annotated with this GO are listed in “GO_2_2.txt”
全部基因總共是14082個基因,前面差異表達做出來的差異基因是617個。然后GO_2_2.txt里面給出了leukocyte activation during immune response這個GO的基因,基因數(shù)目為193。我們就需要找差異基因中有多少這個GO通路的基因,答案里用到的intersect,其實就是取交集。我舉個例子
> total <- c(1:10)
> part <- c(1,4,5)
> intersect(part,total)
[1] 1 4 5
# 還可以知道有多少
> length(intersect(part,total))
[1] 3
然后答案里的intersect出來是10個。這樣列聯(lián)表的數(shù)據就全了。
| 差異基因 | 無差異基因 | |
|---|---|---|
| 在通路中 | 10 | 193-10=183 |
| 不在通路中 | 617-10=607 | 14082-607-183-10=13282 |
然后就可以做fisher了
> test <- matrix(c(10,607,183,13282),nrow = 2)
> test
[,1] [,2]
[1,] 10 183
[2,] 607 13282
> fisher.test(test)
Fisher's Exact Test for Count Data
data: test
p-value = 0.5923
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.5609475 2.2650703
sample estimates:
odds ratio
1.195686
然后有人提到答案里面其實是不準的,因為GO2_2的表中,有幾個基因是不在總的基因里面的。
畫出列聯(lián)表的步驟我覺得可以是這樣的
- 差異基因數(shù)目定下了,然后GO表和差異基因取交集,那么就得到了列聯(lián)表的左上。
- 然后差異基因減去左上,就得到了左下
- 然后GO表減去左上,就得到了右上
- 然后總的基因減去左上、左下、右上,就得到了右下。
我只是按照了答案的思路來,不保證對哈。大家自行斟酌,原理就是這么個原理。