gfold 分析

之前分析遇到2個(gè)問題:
1.樣本只有一個(gè),沒有replicates,是不是就沒有辦法進(jìn)行分析?
2.一個(gè)樣本有3個(gè)重復(fù),另外一個(gè)只有1個(gè),怎么進(jìn)行分析?

Jimmy老師給出了提示,使用gfold進(jìn)行分析是可行的,并且還有安裝說明

http://www.biotrainee.com/thread-752-1-1.html
參考文獻(xiàn)是這篇-gfold
https://pubmed.ncbi.nlm.nih.gov/22923299/

Gfold 軟件安裝和使用方法

安裝軟件 gfold

cd /public/vip/biosoft/
wget http://mirrors.ocf.berkeley.edu/gnu/gsl/gsl-latest.tar.gz
tar -zxvf gsl-latest.tar.gz
cd gsl-2.7/
./configure --prefix=/public/vip/biosoft/gsl-2.7
make
make install
cd ../
wget https://zhanglab.#edu.cn/softwares/GFOLD/gfold.V1.1.4.tar.gz
tar -zxf gfold.V1.1.4.tar.gz
cd gfold.V1.1.4/
make
export CXXFLAGS="-g -O3 -I//public/vip/biosoft/gsl-2.7/include -L//public/vip/biosoft/gsl-2.7/lib"
export LD_LIBRARY_PATH="/public/vip/biosoft/gsl-2.7/lib:"$LD_LIBRARY_PATH
source ~/.bashrc
g++ -O3 -Wall -g main.cc -o gfold -lgsl -lgslcblas -I/public/vip/biosoft/gsl-2.7/include -L/public/vip/biosoft/gsl-2.7/lib

把環(huán)境變量增加到系統(tǒng)中去

vi ~/.bashrc

export CXXFLAGS="-g -O3 -I//public/vip/biosoft/gsl-2.7/include -L//public/vip/biosoft/gsl-2.7/lib"

export LD_LIBRARY_PATH="/public/vip/biosoft/gsl-2.7/lib:"$LD_LIBRARY_PATH

source ~/.bashrc

可以直接運(yùn)行了

/public/vip/biosoft/gfold.V1.1.4/gfold -h
軟件使用說明

Example 3: Identify differentially expressed genes without replicates

gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 4: Identify differentially expressed genes with replicates

gfold diff -s1 sample1,sample2,sample3 -s2 sample4,sample5,sample6 -suf .read_cnt -o 123VS456.diff


Example 5: Identify differentially expressed genes with replicates only in one condition

Only the first group contains replicates. In this case, the variance estimated based on the first group will be used as the variance of the second group

gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff


gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff
對(duì)輸入文件的需求:

All fields in a output file are separated by TABs.

The output file contains 5 columns :共有5列,順序如下

1.GeneSymbol

2.GeneName

3.Read Count

4.Gene exon length

5.RPKM

輸出文件是這樣的

生成文件:

1 GeneSymbol

2 GeneName

3 GFOLD:
GFOLD value for every gene. The GFOLD value could be considered as a reliable log2 fold change. It is positive/negative if the gene is up/down regulated. The main usefulness of GFOLD is to provide a biological meanlingful ranking of the genes. The GFOLD value is zero if the gene doesn't show differential expression.
If the log2 fold change is treated as a random variable, a positive GFOLD value x means that the probability of the log2 fold change (2nd/1st) being larger than x is (1 - the paramete specified by -sc); A negative GFOLD value the parameter specified by -sc). If this file is sorted by this column in descending order then genes ranked at the top are differentially up-regulated and genes ranked at the bottom are differentially down-regulated. Note that a gene with GFOLD value 0 should never be considered differentially expressed. However, it doesn't mean that all genes with non-negative GFOLD value are differentially expressed. For taking top differentially expressed genes, the user is responsible for selecting the cutoff.
4 E-FDR:
Empirical FDR based on replicates. It is always 1 when no replicates are available

5 log2fdc:
log2 fold change. If no replicate is available, and -acc is T, log2 fold change is based on read counts and normalization constants. Otherwise, log2 fold change is based on the sampled expression level from the posterior distribution.

6: 1stRPKM:
The RPKM for the first condition. It is available only if gene length is available. If multiple replicates are available, the RPKM is calculated simply by summing over replicates.Because RPKM is acturally using sequencing depth as the normalization constant, log2 fold change based on RPKM could be different from the log2fdc field.

7:2ndRPKM:
The RPKM for the second condition. It is available only if gene length is available. Please refer to 1stRPKM for more information.

第一步:準(zhǔn)備在R語言中完成數(shù)據(jù)的預(yù)處理,并讀取數(shù)據(jù)
rm(list = ls())
chenshu <- read.csv(file = "chenshu.csv", header = T)
colnames(chenshu)[1] <- "GeneName"
colnames(chenshu)
chenshu_R2 <- chenshu[,grep("R2",colnames(chenshu))]
第二步:對(duì)文件進(jìn)行處理,包裝成一個(gè)函數(shù),讓其符合格式要求,5列內(nèi)容
gfold_sample <- function(IPT,groupname){
  tmp <- IPT[,grep(groupname,colnames(IPT))]
  tmp <- data.frame(IPT$GeneSymbol,IPT$GeneName,tmp,IPT$Exonic.gene.sizes)
  colnames(tmp) <- c("GeneSymbol", "GeneName", "Read Count", "Gene exon length", "RPKM") 
  suffix = ".read_cnt"
  vari = groupname
  write.table(tmp,file = paste0(vari,suffix), row.names = F, col.names = F, quote = F, sep = "\t")
  
}
#輸入已經(jīng)得到的矩陣,樣本名稱,運(yùn)行函數(shù),得到了對(duì)應(yīng)的文件
gfold_sample(IPT = chenshu,  "R2")
gfold_sample(IPT = chenshu,  "R3")
Example 3: Identify differentially expressed genes without replicates
gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff
第三步:數(shù)據(jù)在Linux中進(jìn)行運(yùn)行之 第一種情況是:D2和D4兩個(gè)單獨(dú)比較
$gfold diff -s1 ../Data_Count/D2.read_cnt  -s2 ../Data_Count/D4.read_cnt -o D2_D4.diff
awk '{if ($3>0 && $4=1) print $0}' D2_D4.diff OFS='\t' >D2_D4.up.txt

awk '{if ($3>1 && $4=1) print $0}' D2_D4.diff OFS='\t' >D2_D4_2_up.txt

awk '{if ($3<0 && $4=1) print $0}' D2_D4.diff OFS='\t' >D2_D4.down.txt

awk '{if ($3<-1 && $4=1) print $0}' D2_D4.diff OFS='\t' >D2_D4_2_down.txt
第三步:數(shù)據(jù)在Linux中進(jìn)行運(yùn)行之 第二種情況是:R2與(R3+R4)比較
$gfold diff -s1 ../Data_Count/R3.read_cnt,../Data_Count/R4.read_cnt -s2 ../Data_Count/R2.read_cnt -o R34_R2.diff

提取差異表達(dá)的基因

awk '{if ($3>0 && $4=1) print $0}' R34_R2.diff OFS='\t' >R34_R2.diff.up.txt

awk '{if ($3>1 && $4=1) print $0}' R34_R2.diff OFS='\t' >R34_R2.diff_2_up.txt

awk '{if ($3<0 && $4=1) print $0}' R34_R2.diff OFS='\t' >R34_R2.diff.down.txt

awk '{if ($3<-1 && $4=1) print $0}' R34_R2.diff OFS='\t' >R34_R2.diff_2_down.txt
$gfold diff -s1 ../Data_Count/C1.read_cnt,../Data_Count/C2.read_cnt,../Data_Count/C3.read_cnt -s2 ../Data_Count/D2.read_cnt -o C123_D2.diff
awk '{if ($3>0 && $4=1) print $0}' C123_D2.diff OFS='\t' >C123_D2.diff.up.txt

awk '{if ($3>1 && $4=1) print $0}' C123_D2.diff OFS='\t' >C123_D2.diff_2_up.txt

awk '{if ($3<0 && $4=1) print $0}' C123_D2.diff OFS='\t' >C123_D2.diff.down.txt

awk '{if ($3<-1 && $4=1) print $0}' C123_D2.diff OFS='\t' >C123_D2.diff_2_down.txt
$gfold diff -s1 ../Data_Count/C1.read_cnt,../Data_Count/C2.read_cnt,../Data_Count/C3.read_cnt -s2 ../Data_Count/R3.read_cnt,../Data_Count/R4.read_cnt -o C123_R34.diff
awk '{if ($3>0 && $4=1) print $0}' C123_R34.diff OFS='\t' >C123_R34.diff.up.txt

awk '{if ($3>1 && $4=1) print $0}' C123_R34.diff OFS='\t' >C123_R34.diff_2_up.txt

awk '{if ($3<0 && $4=1) print $0}' C123_R34.diff OFS='\t' >C123_R34.diff.down.txt

awk '{if ($3<-1 && $4=1) print $0}' C123_R34.diff OFS='\t' >C123_R34.diff_2_down.txt

第四步:剩下的所有的是在Rstudio中進(jìn)行差異表達(dá)分析

getwd()
chenshudat <- read.table("./different_expression/C123_D2/C123_D2.diff.txt")
原始數(shù)據(jù)
step1 對(duì)讀取的數(shù)據(jù)先進(jìn)行分組,分為3組-上調(diào),下調(diào),沒有變化
# 1 統(tǒng)一設(shè)置為Normal
chenshudat$regulated <- "normal"
# 2 將上調(diào)和下調(diào)的位置找出來 - 設(shè)置一個(gè)閾值
log_up <- c(chenshudat$V3>0)
log_down <- c(chenshudat$V3<0)
# 3 將上述找出的部分進(jìn)行替換到原來的數(shù)據(jù)那一列,將 normal 進(jìn)行了替換
chenshudat$regulated[log_up] <- "up"
chenshudat$regulated[log_down] <- "down"
table(chenshudat$regulated)

  down normal     up 
   192  60202    218 
# 4 進(jìn)一步對(duì)上述數(shù)據(jù)進(jìn)行命名,還是把gfold 文件輸出名稱給標(biāo)正確了
colnames(chenshudat) <- c("SYMBOL","ENSEMBL","GFOLD","E-FDR","log2fdc","1stRPKM","2ndRPKM","regulated")
整理后的數(shù)據(jù)
step2 繪制火山圖
#加載包,火山圖繪制有symbol這一列
library(ggplot2)
dat <- chenshudat
dat <- dat[order(dat$log2fdc, decreasing = T),]
dim(dat)
# 對(duì)這個(gè)數(shù)據(jù)進(jìn)行了去除NA值
dat <- dat[dat$SYMBOL!="NA",]
dim(dat)
#畫圖
p <- ggplot(data = dat, aes(x = GFOLD,
                            y = log2fdc)) +
  geom_point(alpha  = 0.4, size = 3.5,
             aes(color = regulated)) +
  ylab("log2fdc") +
  scale_colour_manual(values = c("blue","grey","red")) + theme_bw()
p

#得到了最基礎(chǔ)的火山圖后,對(duì)其進(jìn)行了標(biāo)注
## 首先是獲得需要標(biāo)注的基因名進(jìn)行集合-for_label
#### 第一種是指定基因的標(biāo)注
for_label <- dat%>% filter (symbol %in% c("Oca2","Edar"))

#### 第二種是自己選擇前5個(gè)表達(dá)高和前5個(gè)表達(dá)低的基因
x1 = dat %>% 
    filter(regulated=="up") %>% 
    head(7)
  x2 = dat %>% 
    filter(regulated=="down") %>% 
    tail(7)
  for_label = rbind(x1,x2)
## 第二步是將這個(gè)基因的symbol映射到原來的圖中,兩個(gè)函數(shù),第一個(gè)將突出的點(diǎn)進(jìn)行了標(biāo)注,第二個(gè)將名字放上去
volcano_plot <- p + 
  geom_point(size  = 3, shape=1, data = for_label) +
  ggrepel::geom_label_repel(
    aes(label = SYMBOL),
    data = for_label,
    color = "black"
  )
## 保存圖片
volcano_plot
save(volcano_plot,file = "volcano_plot_label.pdf")
dev.off()

KEGG 分析

第一步: 準(zhǔn)備數(shù)據(jù)工作
#首先是對(duì)數(shù)據(jù)的行名為ENSEMBL ID, 并對(duì)數(shù)據(jù)進(jìn)行了篩選,將上下調(diào)的基因都給選出來
rownames(dat)=dat$ENSEMBL
limma_sigGene <- dat[dat$regulated!="normal",2]
head(limma_sigGene,3)#這里挑選的基因是ENSEMBL ID
[1] "ENSG00000148346" "ENSG00000143546" "ENSG00000232810"
DEG <- limma_sigGene
# 提取所有基因名
gene_all <- rownames(dat)
# 從org.Hs.eg.db提取ENSG的ID 和GI號(hào)對(duì)應(yīng)關(guān)系:bitr in clusterProfiler
library(clusterProfiler)
allID <- bitr(gene_all, fromType = "ENSEMBL", toType = c( "ENTREZID" ), OrgDb = org.Hs.eg.db )
degID <- bitr(DEG, fromType = "ENSEMBL", toType = c( "ENTREZID" ), OrgDb = org.Hs.eg.db )
head(degID)

第二步:進(jìn)行富集通路分析
#函數(shù)是這樣的
enrichKEGG(
  gene,
  organism = "hsa",
  keyType = "kegg",
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  universe,
  minGSSize = 10,
  maxGSSize = 500,
  qvalueCutoff = 0.2,
  use_internal_data = FALSE
)
#對(duì)應(yīng)的參數(shù)進(jìn)行操作
enrich <- enrichKEGG(gene =degID[,2],organism='hsa',universe=allID[,2],pvalueCutoff=1,qvalueCutoff=1)
#對(duì)基因ratio的提取和比值進(jìn)行了運(yùn)算
GeneRatio <- as.numeric(lapply(strsplit(enrich$GeneRatio,split="/"),function(x) as.numeric(x[1])/as.numeric(x[2])))
#對(duì)BgRatio的提取和比值進(jìn)行了運(yùn)算                               
BgRatio <- as.numeric(lapply(strsplit(enrich$BgRatio,split="/"),function(x) as.numeric(x[1])/as.numeric(x[2])  ))
#計(jì)算enrich_factor                             
enrich_factor <- GeneRatio/BgRatio 
out <- data.frame(enrich$ID,enrich$Description,enrich$GeneRatio,enrich$BgRatio,round(enrich_factor,2),enrich$pvalue,enrich$qvalue,enrich$geneID)
colnames(out) <- c("ID","Description","GeneRatio","BgRatio","enrich_factor","pvalue","qvalue","geneID")
write.table(out,"trut_VS_untrt_enrich_KEGG.xls",row.names = F,sep="\t",quote = F)

out_sig0.05 <- out[out$qvalue<0.01,]

# barplot
bar <- barplot(enrich,showCategory=10,title="KEGG Pathway",colorBy="p.adjust")
bar
讓KEGG分析得到對(duì)應(yīng)的symbol名稱
mm <- DOSE::setReadable(enrich, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
S2E <- data.frame(mm@ gene2Symbol)
colnames(S2E) <- c("Symbol")
write.csv(S2E, file = "Symbol_to_ENTREZID.csv")#將生成的文件進(jìn)行了導(dǎo)出
進(jìn)行了轉(zhuǎn)換

應(yīng)該有更簡(jiǎn)單的方法對(duì)ENTREZ ID在KEGG通路中進(jìn)行對(duì)應(yīng),下次查到再補(bǔ)上

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容