參考：http://www.itdecent.cn/p/0a14d971bd0a

require software

geneDuplication analysis

基因的復(fù)制類(lèi)型分類(lèi)分析，使用的是DupGen-finder，可以把所有基因按照復(fù)制類(lèi)型分為5類(lèi)，
WGD：全基因組復(fù)制
TD：串聯(lián)重復(fù)（相鄰的兩個(gè)重復(fù)基因）
PD：近端重復(fù)（相隔10個(gè)以?xún)?nèi)基因的重復(fù)基因）
TRD：轉(zhuǎn)置重復(fù)（祖先和新基因座組成的重復(fù)基因）
DSD：分散重復(fù)（不相鄰也不共線性的重復(fù)基因）
SL：?jiǎn)慰截?/p>

require input file

研究物種的gff,pep

#分析模式1（自身比較）和模式2（和其他物種比較）
####分析模式1
cat Spd.bed |sed 's/^/Spd-/g'|awk '{print $1"\t"$4"\t"$2"\t"$3}' >Spd.gff
cat Ath.bed |sed 's/^/Ath-Chr/g'|awk '{print $1"\t"$4"\t"$2"\t"$3}' >Ath.gff
sed -i 's/Chr0/Chr/g' Spd.gff

cat Spd.gff Ath.gff >Spd_Ath.gff

makeblastdb -in Spd.pep -dbtype prot -title Spd -parse_seqids -out Spd
blastp -query Spd.pep -db Spd -evalue 1e-10 -max_target_seqs 5 -outfmt 6 -out Spd.blast
# Create a reference database
makeblastdb -in Ath.pep -dbtype prot -title Ath -parse_seqids -out Ath
# Align protein query sequences against the reference database
blastp -query Ath.pep -db Ath -evalue 1e-10 -max_target_seqs 5 -outfmt 6 -out Ath.blast
mkdir Spd_Ath
cat Spd.blast Ath.blast >Spd_Ath.blast

#-t 是實(shí)驗(yàn)組 -c是外源對(duì)照組
#一般模式
DupGen_finder.pl -i $PWD -t Spd -c Ath -o ${PWD}/Spd_Ath/results1
#嚴(yán)格模式
DupGen_finder-unique.pl -i $PWD -t Spd -c Ath -o ${PWD}/Spd_Ath/results2

輸入文件格式

要求的Ath.gff文件格式:

Ath-Chr1        AT1G01010.1     3630    5899
Ath-Chr1        AT1G01020.1     6787    9130
Ath-Chr1        AT1G01030.1     11648   13714

Ath.pep文件的格式：

>AT3G05780.1
MMPKRFNTSGFDTTLRLPSYYGFLHLTQSLTLNSRVFYGARHVTPPAIRIGSNPVQSLLL
FRAPTQLTGWNRSSRDLLGRRVSFSDRSDGVDLLSSSPILSTNPNLDDSLTVIALPLPHK
PLIPGFYMPIHVKDPKVLAALQESTRQQSPYVGAFLLKDCASTDSSSRSETEDNVVEKFK
VKGKPKKKRRKELLNRIHQVGTLAQISSIQGEQVILVGRRRLIIEEMVSEDPLTVRVDHL
>AT4G32100.1
MATNACKFLCLVLLFAFVTQGYGDDSYSLESLSVIQSKTGNMVENKPEWEVKVLNSSPCY
FTHTTLSCVRFKSVTPIDSKVLSKSGDTCLLGNGDSIHDISFKYVWDTSFDLKVVDGYIA
CS

輸出文件

在results1和results2中輸出結(jié)果文件基本一致，results2中的是嚴(yán)格模式，文件后面后綴是-unique
Spd.pairs.stats-unique #是各組復(fù)制信息的pairs數(shù)量。
Spd.wgd.genes-unique #WGD類(lèi)型的所有基因
Spd.wgd.pairs-unique #WGD類(lèi)型的pairs的詳細(xì)信息
Spd.singletons #singletons類(lèi)型的基因(即沒(méi)有同源基因的基因)
Spd.tandem.pairs-unique # TD類(lèi)型
Spd.proximal.pairs-unique #PD類(lèi)型pairs
Spd.transposed.pairs-unique #TRD類(lèi)型pairs
Spd.dispersed.pairs-unique #DSD類(lèi)型pairs
Spd.collinearity #Spd物種內(nèi)的共線性文件
Spd_Ath.collinearity #Spd和Ath物種的共線性文件

注意：一定要分清楚gene和gene pairs的數(shù)量并不是一一對(duì)應(yīng)的關(guān)系，會(huì)有一對(duì)多。

后續(xù)分析

GO,KEGG (使用的是5個(gè)genes-unique文件)
Ka/Ks （使用的是5個(gè)pairs-unique文件和2個(gè)collinearity文件）

獲取目標(biāo)基因后，可以和擴(kuò)張的基因取并集，然后富集分析這部分基因的功能。

expansion.genes是擴(kuò)張基因列表

cd results2
grep -f expansion.genes Spd.wgd.genes-unique|cut -f1 >expansion_wgd.csv
grep -f expansion.genes Spd.td.genes-unique|cut -f1 >expansion_td.csv
grep -f expansion.genes Spd.trd.genes-unique|cut -f1 >expansion_trd.csv
grep -f expansion.genes Spd.pd.genes-unique|cut -f1 >expansion_pd.csv
grep -f expansion.genes Spd.dsd.genes-unique|cut -f1 >expansion_dsd.csv

expansion_wgd.genes.csv即為擴(kuò)張的wgd的基因。
注意：gff文件，protein文件和expansion文件，各處使用的geneid一定要一致
可以按照這5種分類(lèi)

GO、KEGG富集分析

下面代碼可以直接在R-studio運(yùn)行，

要求先安裝ggplot2和clusterProfiler

修改工作路徑，工作路徑下要有以下文件：
GO.info #GO的數(shù)據(jù)庫(kù)
KEGG.info #KEGG的數(shù)據(jù)庫(kù)
expansion_wgd.csv
expansion_td.csv
expansion_dsd.csv
expansion_pd.csv
expansion_trd.csv
5個(gè)csv文件時(shí)擴(kuò)張基因和對(duì)應(yīng)復(fù)制方式的共有的基因的列表

#使用新的GO和KEGG進(jìn)行富集分析
setwd("e:/Family_expansion/GO/results2")
library(clusterProfiler)
library(ggplot2)

##獲取spo基因的富集分析結(jié)果，需要輸入基因列表和genetype(用于輸出文件名的前綴)
  
get_go_kegg <- function(filename,genetype,GO_list="GO.info",KEGG_list="KEGG.info"){
    #定義需要分析的基因列表
    gene <- read.csv(filename,header = FALSE)
    genetype <- genetype
    
    #修改GO和KEGG文件名
    golist <- read.table(file=GO_list,sep = "\t",header = FALSE)
    kegglist <- read.table(file=KEGG_list,sep = "\t",header = FALSE)
    
    #下面內(nèi)容基本不需要修改
    spo2gene <- data.frame(golist$V1,golist$V3) #GO,Geneid
    spo2name <- data.frame(golist$V1,golist$V2) #GO,GO描述
    gene1 <- t(gene$V1)  #獲取需要分析的基因列表
    #GO富集分析
    spo_GO <- enricher(gene1,TERM2GENE = spo2gene,TERM2NAME = spo2name)
    p1 <- dotplot(spo_GO)
    p1
    ggsave(file=paste(genetype,".GO.pdf",sep=""))
    write.csv(spo_GO,file=paste(genetype,"_GO.csv",sep = ""))
    
    #kegg富集
    kegg2gene <- data.frame(kegglist$V1,kegglist$V3) #keggid,Geneid
    kegg2name <- data.frame(kegglist$V1,kegglist$V2) #keggid,kegg描述
    gene1 <- t(gene$V1)  #獲取需要分析的基因列表
    spo_KEGG <- enricher(gene1,TERM2GENE = kegg2gene,TERM2NAME = kegg2name)
    p2 <- dotplot(spo_KEGG)
    p2
    ggsave(file=paste(genetype,".KEGG.pdf",sep=""))
    write.csv(spo_KEGG,file=paste(genetype,"_kegg.csv",sep=""))
    
    cowplot::plot_grid(p1, p2, nrow=2, labels=c("A", "B"))  #ggplot2的函數(shù)，nrow指定2行，如果是ncol則是2列。
    ggsave(file=paste(genetype,"_go_kegg.pdf",sep=""))
  
}
#運(yùn)行富集分析，第1個(gè)參數(shù)是基因列表csv格式，第2個(gè)是前綴類(lèi)型，第3個(gè)是GO,第4個(gè)是KEGG，參數(shù)3和4要求是tab分割符
get_go_kegg("expansion_trd.csv","trd")
get_go_kegg("expansion_td.csv","td")
get_go_kegg("expansion_dsd.csv","dsd")
get_go_kegg("expansion_wgd.csv","wgd")
get_go_kegg("expansion_pd.csv","pd")

#GO和KEGG文件的格式
head(GO.info)
#   GO:0005452       inorganic anion exchanger activity Sp09G018780.1 Molecular Function
#   GO:0006820                          anion transport Sp09G018780.1 Biological Process
#   GO:0016021           integral component of membrane Sp09G018780.1 Cellular Component
head(KEGG.info)
#   ko00010 Glycolysis / Gluconeogenesis Sp12G016220.1
#   ko00010 Glycolysis / Gluconeogenesis Sp10G003910.1
#   ko00010 Glycolysis / Gluconeogenesis Sp16G007040.1


#合并多組GO分析，生成氣泡圖
#在進(jìn)行氣泡圖分析之前，可以手動(dòng)修改上一步輸出的富集分析的csv文件中的GO/KEGG term的詞條，把長(zhǎng)詞條變成短詞條，以方便在氣泡圖上展示。
#從csv文件讀取富集的信息到數(shù)據(jù)框（count_num可以指定獲取的GO/KEGG term的數(shù)量，默認(rèn)是15;enrich_type指定富集類(lèi)型GO/KEGG）
getdata <-  function(Prefix,enrich_type="GO",count_num=15){
  trd_GO <- read.csv(paste(Prefix,"_",enrich_type,".csv",sep = ""),header = T)
  pic_trd_GO <- head(trd_GO,count_num)
  pic_trd_GO$adjust <- -log10(pic_trd_GO$p.adjust)
  pic_trd_GO$type <- Prefix
  return(pic_trd_GO)
}
#讀取GO
trd_GO <- getdata("trd")
wgd_GO <- getdata("wgd")
dsd_GO <- getdata("dsd")
pd_GO <- getdata("pd")
td_GO <- getdata("td")
#合并5組GO數(shù)據(jù)
data_GO_all <- rbind(trd_GO,wgd_GO,dsd_GO,pd_GO,td_GO)
#氣泡圖可視化富集結(jié)果
ggplot(data_GO_all,aes(type,Description)) +
  geom_point(aes(fill=adjust,size=Count),alpha=0.9,pch=21) +  #fill對(duì)應(yīng)點(diǎn)的填充色，colour對(duì)應(yīng)點(diǎn)的邊框色
  scale_fill_gradient(low='SpringGreen',high='DeepPink') + #設(shè)定顏色的變化范圍
  scale_size_area() + #設(shè)定點(diǎn)的大小比例和圖例上顯示的間隔
  labs(y='GO term',x='type',fill='-log10(P.adjust)',size='Metabolites number')+
  theme_bw()
ggsave("geneduplication.GO.pdf",dpi=300)
write.csv(data_GO_all,file="data_GO_all.csv")

#讀取KEGG
trd_KEGG <- getdata("trd",enrich_type = "KEGG")
wgd_KEGG <- getdata("wgd",enrich_type = "KEGG")
dsd_KEGG <- getdata("dsd",enrich_type = "KEGG")
pd_KEGG <- getdata("pd",enrich_type = "KEGG")
td_KEGG <- getdata("td",enrich_type = "KEGG")
#合并5組GO數(shù)據(jù)
data_kegg_all <- rbind(trd_KEGG,wgd_KEGG,dsd_KEGG,pd_KEGG,td_KEGG)
#氣泡圖可視化富集結(jié)果
ggplot(data_kegg_all,aes(type,Description)) +
  geom_point(aes(fill=adjust,size=Count),alpha=0.9,pch=21) +  #fill對(duì)應(yīng)點(diǎn)的填充色，colour對(duì)應(yīng)點(diǎn)的邊框色
  scale_fill_gradient(low='SpringGreen',high='DeepPink') + #設(shè)定顏色的變化范圍
  scale_size_area() + #設(shè)定點(diǎn)的大小比例和圖例上顯示的間隔
  labs(y='KEGG term',x='type',fill='-log10(P.adjust)',size='Metabolites number')+
  theme_bw()
ggsave("geneduplication.KEGG.pdf",dpi=300)
write.csv(data_kegg_all,file="data_KEGG_all.csv")

分別分析Ka/Ks 參考

需要安裝的軟件 KaKs,mafft,ParaAT.pl (推薦使用muscle比對(duì)，速度更快)

當(dāng)基因比較多超過(guò)1萬(wàn)時(shí)，比較耗時(shí)，請(qǐng)?jiān)黾觕pu數(shù)量，對(duì)應(yīng)的參數(shù)時(shí)proc腳本中設(shè)置的是16.
KaKs.sh腳本如下,工作目錄results2


#定義物種的縮寫(xiě)
species=$1

cat $species.dispersed.pairs-unique|cut -f 1,3 |sed '1d' >dsd.homolog
cat $species.tandem.pairs-unique | cut -f 1,3 |sed '1d' >td.homolog
cat $species.transposed.pairs-unique | cut -f 1,3 |sed '1d' >trd.homolog
cat $species.wgd.pairs-unique | cut -f 1,3 |sed '1d' >wgd.homolog
cat $species.proximal.pairs-unique | cut -f 1,3 |sed '1d' >pd.homolog
echo "16" >proc
function getKaKs(){
    ParaAT.pl -h $1.homolog -n $2.cds -a $2.pep -p proc -m mafft -f axt -g -k -o $1.result_dir
    #把KaKs輸出到文件KaKs.txt
    cat $1.result_dir/*.axt.kaks | cut -f 1,3,4,5 | grep -v 'Sequence' |sort|uniq >$1.KaKs.txt
    cat $1.KaKs.txt| tr "\t" "," |sed '1i Sequence,Ka,Ks,Ka/Ks' >$1.KaKs.csv
        #添加新的列，Item用于后續(xù)合并后可視化分組(注意：awk里的$1是全局的環(huán)境的$1，awk調(diào)用外部變量，使用多重引號(hào))
    cat $1.KaKs.csv|awk -F , '{print $0",""'"$1"'"}' | sed '1d' >$1.KaKs.Item.csv
}
list=(wgd td pd trd dsd)
for ele in ${list[@]}
do
    getKaKs $ele $species 
done
#合并所有的KaKs輸出到一個(gè)csv文件
cat td.KaKs.Item.csv wgd.KaKs.Item.csv pd.KaKs.Item.csv trd.KaKs.Item.csv dsd.KaKs.Item.csv |sed '1i Sequence,Ka,Ks,Ka.Ks,Item' >gene_dup_KaKs.csv
##計(jì)算4DTv
# 將多行axt文件轉(zhuǎn)換成單行
function get4DTv(){
    cd $1.result_dir
    #使用axt2one-line.py合并axt為1行。https://github.com/scbgfengchao/4DTv/blob/master/axt2one-line.py
    for i in `ls *.axt`;do axt2one-line.py $i ${i}.one-line;done    #使用calculate_4DTV_correction.pl腳本計(jì)算4dtv值。腳本地址：https://github.com/JinfengChen/Scripts/blob/master/FFgenome/03.evolution/distance_kaks_4dtv/bin/calculate_4DTV_correction.pl
    ls *.axt.one-line|while read id;do calculate_4DTV_correction.pl $id >${id%%one-line}4dtv;done
    #合并所有同源基因?qū)Φ?dtv
    for i in `ls *.4dtv`;do awk 'NR>1{print $1"\t"$3}' $i >>$1.4DTv.txt;done
    sort $1.4DTv.txt|uniq > ../$1.4DTv.txt
    cat ../$1.4DTv.txt| tr "\t" "," |sed '1i Gene,4DTv' > ../$1.4DTv.csv
    cd ../
    #添加新的列，Item用于后續(xù)合并后可視化分組
    cat $1.4DTv.csv|awk -F , '{print $0",""'"$1"'"}' | sed '1d' >$1.4DTv.Item.csv
}
list=(wgd td pd trd dsd)
for ele in ${list[@]}
do
    get4DTv $ele;
done
cat td.4DTv.Item.csv wgd.4DTv.Item.csv pd.4DTv.Item.csv trd.4DTv.Item.csv dsd.4DTv.Item.csv |sed '1i Gene,4DTv,Item' >gene_dup_4DTv.csv

運(yùn)行bash KaKs.sh Spd即可得到最終的總KaKs和4DTv, gene_dup_KaKs.csv和gene_dup_4DTv.csv,還有各組分組的.KaKs.csv和.4DTv.csv。

可視化Ka/Ks的分布,

kaks.R腳本代碼如下，直接在上述結(jié)果目錄運(yùn)行Rscript kaks.R 即可得到ka/ks的分布，輸出4個(gè)pdf,3個(gè)是單獨(dú)的，1個(gè)是組圖。

#setwd("e:/****/GeneDuplication")
library("ggplot2")
data_kaks <- read.csv("gene_dup_KaKs.csv")
#刪除包含NA的行
data_kaks <- na.omit(data_kaks)
p0 <- ggplot(data=data_kaks, aes(x=Item, y=Ka.Ks, fill=Item)) +
  geom_violin(alpha=0.8,width=1)+ guides(fill=F)+xlab(' ')+ylab('Ka/Ks')+
  geom_boxplot(alpha=0.5,width = 0.3)+
  coord_flip()+ #顛倒xy軸
  ylim(0,2) #設(shè)置y軸最大值
p0
ggsave('distribute_of_KaKs.pdf',dpi=300)

p1 <- ggplot(data=data_kaks, aes(x=Ka,group=Item)) + 
  geom_density(alpha=0.4,aes(color = Item))+ xlab('Ka value')+ylab('Density')+
 labs(title = "Distribution of Ka distances", size = 1.5)+guides()
p1
ggsave('Distribute_of_Ka.pdf',dpi=300)

p2 <- ggplot(data=data_kaks, aes(x=Ks,group=Item)) + 
  geom_density(alpha=0.4,aes(color = Item))+xlab('Ks value')+ylab('Density')+
  labs(title = "Distribution of Ks distances", size = 1.5)
p2
ggsave('Distribute_of_Ks.pdf',dpi=300)

#可視化4DTv
data_4DTv <- read.csv("gene_dup_4DTv.csv")
#刪除包含NA的行
data_4DTv <- na.omit(data_4DTv)
p3 <- ggplot(data=data_4DTv, aes(x=X4DTv,group=Item)) + 
  geom_density(alpha=0.4,aes(color = Item))+xlab('4DTv value')+ylab('Density')
  #labs(title = "Distribution of 4DTv distances", size = 1.5)
p3
ggsave("Distribution_of_4DTv.pdf",dpi=300)
#拼圖，直接出組合圖
p4.1 <- cowplot::plot_grid(p1, p2, ncol=2, labels=c("a", "b"))  #ggplot2的函數(shù)，nrow指定2行，如果是ncol則是2列。
p4.2<-cowplot::plot_grid(p0,p3,ncol = 2,labels = c("c","d"))
p4<-cowplot::plot_grid(p4.1,p4.2,nrow = 2)
p4
ggsave("Distribution_of_Ka_Ks_Kaks_4DTv.pdf",dpi = 300)

2022/06/07更新

注意：

即使使用和其他物種比較，即模式2，獲取的仍然是本身物種的復(fù)制信息。所以最終獲取到的仍然是自身的分化時(shí)間。但是，外群不一樣，分化時(shí)間會(huì)有變化?？梢愿鶕?jù)物種的進(jìn)化樹(shù)來(lái)選擇合適的外群。距離研究物種比較近的物種作為外群，可能會(huì)鑒定到比較少的復(fù)制基因?qū)ΑＲ驗(yàn)橥馊耗Ｊ降慕Y(jié)果是物種和外群分開(kāi)后的復(fù)制信息。
例如：rice作為研究物種，A. thaliana作為外群。這時(shí)候計(jì)算出的結(jié)果就是rice和Ath分開(kāi)后，rice經(jīng)歷的復(fù)制。分開(kāi)前共同經(jīng)歷的復(fù)制沒(méi)有被計(jì)算。
選擇物種進(jìn)化比較近的物種作為外群，這樣可能更方便講故事。
外群的選擇，可以根據(jù)研究分析目的來(lái)選擇不同的外群。同時(shí)還可以比較同一個(gè)物種和不同外群分開(kāi)后的復(fù)制差異。https://github.com/qiao-xin/DupGen_finder/issues/5
作者提供了下游分析的pipline

Calculating Ks values for WGD-derived gene pairs pipeline 計(jì)算KaKs的值的管道
Inferring the Ks peaks corresponding to WGD events of different ages pipeline計(jì)算WGD峰值的管道
compute Pearson's coorelation coefficient (r) between expression profiles of duplicate gene pairs derived from different modes of gene duplication 計(jì)算不同復(fù)制模式的基因?qū)捅磉_(dá)量之間的皮爾遜相關(guān)性
detect gene conversion events檢測(cè)目標(biāo)物種和外群之間的 WGD 衍生基因?qū)χ械幕蜣D(zhuǎn)換事件

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

基因組基因復(fù)制的分類(lèi)鑒定：DupGen-finder

基因組基因復(fù)制的分類(lèi)鑒定：DupGen-finder

require software

geneDuplication analysis

require input file

輸入文件格式

輸出文件

后續(xù)分析

獲取目標(biāo)基因后，可以和擴(kuò)張的基因取并集，然后富集分析這部分基因的功能。

GO、KEGG富集分析

要求先安裝ggplot2和clusterProfiler

分別分析Ka/Ks 參考

需要安裝的軟件 KaKs,mafft,ParaAT.pl (推薦使用muscle比對(duì)，速度更快)

可視化Ka/Ks的分布,

2022/06/07更新

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

基因組基因復(fù)制的分類(lèi)鑒定：DupGen-finder

require software

geneDuplication analysis

require input file

輸入文件格式

輸出文件

后續(xù)分析

獲取目標(biāo)基因后，可以和擴(kuò)張的基因取并集，然后富集分析這部分基因的功能。

GO、KEGG富集分析

要求先安裝ggplot2和clusterProfiler

分別分析Ka/Ks 參考

需要安裝的軟件 KaKs,mafft,ParaAT.pl (推薦使用muscle比對(duì)，速度更快)

可視化Ka/Ks的分布,

2022/06/07更新

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

獲取目標(biāo)基因后，可以和擴(kuò)張的基因取并集，然后富集分析這部分基因的功能。

GO、KEGG富集分析

需要安裝的軟件 KaKs,mafft,ParaAT.pl (推薦使用muscle比對(duì)，速度更快)