R語言 | 差異表達(dá)基因分析(DEGs)| 原始數(shù)據(jù)處理&火山圖繪制

差異表達(dá)基因分析 即篩選處理組與對照組相比,呈現(xiàn)差異表達(dá)的基因,Up,No sig,Down. 今天使用的R包為:DESeq2[1] 這個包基于RNA Seq data-count data(也就是說這里要求輸入的數(shù)據(jù)矩陣必須為count,而不是已經(jīng)標(biāo)準(zhǔn)化后的TPM,F(xiàn)RAM),基于負(fù)二項廣義線性模型估算樣本間基因差異表達(dá)概率。 獲取示例數(shù)據(jù) 這里我沒有合適的數(shù)據(jù)進(jìn)行演示,于是搜索到pasilla包中有一個示例數(shù)據(jù)可用。那我們就從pasilla包中獲取今天的示例數(shù)據(jù)。 code1
  1>?if?(!requireNamespace("BiocManager",?quietly?=?TRUE)) 2??install.packages("BiocManager") 3>?BiocManager::install("pasilla") 4>?library(pasilla) 5>?pasCts<-system.file("extdata", 6+?????????????????????"pasilla_gene_counts.tsv", 7+?????????????????????package="pasilla", 8+?????????????????????mustWork?=?T) 9>?pasCts10[1]?"E:/R-4.1.0/library/pasilla/extdata/pasilla_gene_counts.tsv"11>?pasAnno<-system.file("extdata",12+??????????????????????"pasilla_sample_annotation.csv",13+??????????????????????package="pasilla",14+??????????????????????mustWork=T)15>?pasAnno16[1]?"E:/R-4.1.0/library/pasilla/extdata/pasilla_sample_annotation.csv"17>?df<-read.csv(pasCts,sep="\t",row.names?=?"gene_id")18>?cts<-as.matrix(df)19   >?coldata<-read.csv(pasAnno,row.names?=?1) 
這樣就獲得了今天我們需要的兩個示例數(shù)據(jù)集,即表達(dá)矩陣cts,樣本分組數(shù)據(jù)集coldata。如下: code2
  1>?head(coldata)
2???????????condition????????type
3treated1?????treated?single-read
4treated2?????treated??paired-end
5treated3?????treated??paired-end
6untreated1?untreated?single-read
7untreated2?untreated?single-read
8untreated3?untreated??paired-end
9>?head(cts)
10????????????treated1?treated2?treated3?untreated1?untreated2?untreated3?untreated4
11FBgn0000003????????0????????0????????1??????????0??????????0??????????0??????????0
12FBgn0000008??????140???????88???????70?????????92????????161?????????76?????????70
13FBgn0000014????????4????????0????????0??????????5??????????1??????????0??????????0
14FBgn0000015????????1????????0????????0??????????0??????????2??????????1??????????2
15FBgn0000017?????6205?????3072?????3334???????4664???????8714???????3564???????3150
16FBgn0000018??????722??????299??????308????????583????????761????????245????????310
實際上仔細(xì)看看code1中row10和row16即可知道我們需要的數(shù)據(jù)集所在位置。 示例數(shù)據(jù)獲取完畢,下面利用DESeq2進(jìn)行差異表達(dá)分析: code3
 1dds?<-?DESeqDataSetFromMatrix(countData?=?cts,?
2colData?=?coldata,?design?=?~?condition)
3dds?<-?DESeqDataSetFromMatrix(countData?=?cts,?
4colData?=?coldata,?design?=?~?condition)
5dds?<-?DESeq(dds)????
畫火山圖 code4
 1sum(res$padj?<?0.05,?na.rm?=?TRUE)????#統(tǒng)計padj<0.05顯著差異的基因
2plotMA(res)????#畫火山圖
3plotMA(res,?alpha?=?0.05,?colSig?=?'red',?colLine?=?'skyblue')????#稍微設(shè)置一下參數(shù)
通常,畫出火山圖不是目的,目的是需要知道差異表達(dá)的基因到底是哪些? 于是,需要統(tǒng)計差異表達(dá)基因的信息。 code5
 1filter_up?<-?subset(res,?pvalue?<?0.05?&?log2FoldChange?>?1)?#過濾上調(diào)基因
2filter_down?<-?subset(res,?pvalue?<?0.05?&?log2FoldChange?<?-1)?#過濾下調(diào)基因
3print(paste('差異上調(diào)基因數(shù)量:?',?nrow(filter_up)))??#打印上調(diào)基因數(shù)量
4print(paste('差異下調(diào)基因數(shù)量:?',?nrow(filter_down)))??#打印下調(diào)基因數(shù)量
統(tǒng)計完成,我們當(dāng)然還需要對統(tǒng)計結(jié)果進(jìn)行保存。 code6
 1write.table(res,?file?=?"example_differential_gene.txt")?
2write.table(filter_up,?file="example_filter_up_gene.txt",?quote?=?F)??
3write.table(filter_down,?file="example_filter_down_gene.txt",?quote?=?F)
到這里,基本的分析就算是完成了。但是還可以繼續(xù):
  1###讀取剛才保存的差異表達(dá)基因分析數(shù)據(jù)
2df?=?read.table("example_differential_gene.txt",header?=T,stringsAsFactor?=?F)
3###查看前6行
4head(df)
5????????????????baseMean?log2FoldChange?????lfcSE?????????stat????pvalue??????padj
6FBgn0000003????0.1715687????1.026045410?3.8055034??0.269621465?0.7874515????????NA
7FBgn0000008???95.1440790????0.002151424?0.2238838??0.009609555?0.9923328?0.9969271
8FBgn0000014????1.0565722???-0.496735569?2.1602643?-0.229942039?0.8181368????????NA
9FBgn0000015????0.8467233???-1.882761702?2.1064322?-0.893815463?0.3714206????????NA
10FBgn0000017?4352.5928988???-0.240025230?0.1260243?-1.904594503?0.0568328?0.2823611
11FBgn0000018??418.6149305???-0.104799112?0.1482803?-0.706763605?0.4797134?0.8239073
12###可以看到我們一會兒重新繪圖需要的padj列有NA值,故需要刪掉包含NA的行
13df?=?na.omit(df)
14##再次查看,可以看到包含NA的行已被刪除
15head(df)
16???????????????baseMean?log2FoldChange?????lfcSE?????????stat?????pvalue??????padj
17FBgn0000008????95.14408????0.002151424?0.2238838??0.009609555?0.99233280?0.9969271
18FBgn0000017??4352.59290???-0.240025230?0.1260243?-1.904594503?0.05683280?0.2823611
19FBgn0000018???418.61493???-0.104799112?0.1482803?-0.706763605?0.47971340?0.8239073
20FBgn0000032???989.73003???-0.091905049?0.1476974?-0.622252169?0.53377607?0.8497739
21FBgn0000037????14.09481????0.463068060?0.4914026??0.942339492?0.34601886?0.7409094
22FBgn0000042?82207.72464????0.314524848?0.1405415??2.237949545?0.02522435?0.1645315
23###接下來需要設(shè)置限定值
24df$group?=?ifelse(df$log2FoldChange>=1&df$padj<=0.05,"Up",
25??????????????????ifelse(df$log2FoldChange<=-1&df$padj<=0.05,"Down","Not?sig"))
26table(df$group)
27???Down?Not?sig??????Up?
28????105????8103?????115?
29###可以看到,這里有105個下調(diào)gene,115個上調(diào)基因。但是還是太多了,畢竟后面的分析中我希望得到限制條件更嚴(yán)格的結(jié)果。那就重新設(shè)定限定值
30df$group?=?ifelse(df$log2FoldChange>=2&df$padj<=0.01,"Up",
31??????????????????ifelse(df$log2FoldChange<=-2&df$padj<=0.01,"Down","Not?sig"))
32table(df$group)
33???Down?Not?sig??????Up?
34????25????8275?????23
35#好,差不多了下面開始用ggplot2繪圖?
36install.packages("ggplot2")
37library(ggplot2)
38ggplot(df,aes(x=log2FoldChange,y?=?-log10(padj)))+
39??geom_point(aes(color=group))+
40??scale_color_manual(values?=?c("red","grey","blue"),limit?=?c('Up','Not?sig',"Down"))+
41??theme_bw(base_size?=?20)+
42??ggtitle("今日之森Volcano?Plot")+
43??theme(plot.title?=?element_text(size=30,hjust?=?0.5))+
44??coord_cartesian(xlim?=?c(-5,5),ylim?=?c(0,75))

[1]Anders S, Huber W. Differential expression analysis for sequence count data.?Genome Biol. 2010;11(10):R106. doi:10.1186/gb-2010-11-10-r106

本文使用 文章同步助手 同步

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容