嚴格意義上說遺傳進化中的群體結(jié)構(gòu)應(yīng)該包括PCA、群體進化樹和群體STRUCRURE,傳統(tǒng)的PCA和群體進化樹只能直觀的了解群體中個體的分類關(guān)系,但無法解釋群體間具體的分群數(shù)目和基因交流情況。這就需要群體STRUCRURE作為參考,群體結(jié)構(gòu)分析能夠提供個體的血統(tǒng)來源及其組成信息,是一種重要的遺傳關(guān)系分析手段。目前經(jīng)典的群體結(jié)構(gòu)分析軟件有structure和admixture,而我們常常將群體結(jié)構(gòu)圖稱為STRUCRURE圖。
structure是與PCA、進化樹相似的方法,就是利用分子標(biāo)記的基因型信息對一組樣本進行分類,分子標(biāo)記可以是SNP、indel、SSR。相比于PCA,進化樹,群體結(jié)構(gòu)分析可明確各個群之間是否存在交流及交流程度
群體遺傳結(jié)構(gòu) (structure) 指遺傳變異在物種或群體中的一種非隨機分布。按照地理分布或其他標(biāo)準可將一個群體分為若干亞群,處于同一亞群內(nèi)的不同個體親緣關(guān)系較高,而亞群與亞群之間則親緣關(guān)系較遠。
群體結(jié)構(gòu),又稱群體分層,指所研究的群體中存在基因頻率不同的亞群。
群體結(jié)構(gòu)分析有助于理解進化過程,并且可以通過基因型和表型的關(guān)聯(lián)研究確定個體所屬的亞群。

1、群體結(jié)構(gòu)影響因素
哈迪溫伯格平衡:如果一個群體符合特定條件,這個群體中各個等位基因、基因型以及表型的比例可以從一代傳到下一代維持不變,即世代保持不變。
特定條件指:
- 1、種群無限大,群體有一個數(shù)目巨大的個體組成;
- 2、群體中個體間可以隨機交配,任何個體都有同等的機會與異性進行交配,不受任何遺傳學(xué)或行為學(xué)的限制;
- 3、群體中的基因庫中,沒有新的突變產(chǎn)生;
- 4、群體中的個體沒有遷出,也沒有遷入,即沒有個體遷徙;
- 5、所有基因型的個體都有同等的機會存活和繁殖后代,沒有任何形式的自然選擇。
哈迪溫伯格平衡兩大特點:
- 1、等位基因的頻率在世代遺傳中穩(wěn)定不變;
- 2、無論群體起始的等位基因頻率如何,只要經(jīng)過一代的隨機交配,群體的基因頻率和等位基因頻率就可以達到平衡狀態(tài)。
影響群體平衡(哈迪溫伯格平衡)的主要因素有突變、選擇、遷徙、遺傳漂變、交配系統(tǒng)等。

2 群體結(jié)構(gòu)原理及應(yīng)用
基本原理:
將群體分成 K 個服從 Hardy-Weinberger(HWE) 平衡的亞群,每個材料歸到各個亞群,計算每個材料其基因組變異源于第 K 個亞群的可能性(Pritchard et al.2000),主要用 Q 矩陣表示,Q 值越大,表明某個材料來自某個亞群的可能性越大。無論是structure和admixture 都是這個原理
研究意義:
1、了解研究材料的分群情況,推斷祖先群體,評判最優(yōu)分群;
2、輔助控制關(guān)聯(lián)分析結(jié)果的假陽性。多個亞群混合成一個較大群體后,由于各個亞群基因頻率不同,群體 LD 發(fā)生變化,導(dǎo)致GWAS 分析結(jié)果出現(xiàn)假陽性;
3、雜交事件/個體祖先來源。
3 群體結(jié)構(gòu)格式轉(zhuǎn)換
structure 和 admixture 兩款群體結(jié)構(gòu)分析的軟件都有其特定的格式要求,為了下游分析,需要將 VCF 格式的基因型文件轉(zhuǎn)換為兩款軟件指定的格式。
1 、添加 ID 信息
perl add_vcf_ID.pl Test.vcf Test_addID.vcf
2、根據(jù) LD( 連鎖不平衡), 針對 SNP 位點進行過濾,保留 unlinked SNP 位點
plink \
--vcf Test_addID.vcf \
--indep-pairwise 50 10 0.2 \
--out Test_addID.maf0.05.int0.8 \
--maf 0.05 \
--geno 0.2 \
--allow-extra-chr
3 、提取Test_addID.maf0.05.int0.8.prune.in 中的標(biāo)記
plink \
--vcf Test_addID.vcf \
--extract Test_addID.maf0.05.int0.8.prune.in \
--out Test_addID.maf0.05.int0.8.prune.in \
--recode vcf-iid \
--allow-extra-chr
4、將篩選的位點轉(zhuǎn)換為 structure 和 和 admixture 格式
structure 格式:
plink \
--vcf Test_addID.maf0.05.int0.8.prune.in.vcf \
--recode structure \
--out Test_addID.maf0.05.int0.8.prune.in \
--allow-extra-chr
admixture 格式:
plink \
--vcf Test_addID.maf0.05.int0.8.prune.in.vcf \
--recode 12 \
--out Test_addID.maf0.05.int0.8.prune.in \
--allow-extra-chr
Test_addID.maf0.05.int0.8.prune.in.recode.strct_in 即是structure 要求的格式文件
Test_addID.maf0.05.int0.8.prune.in.ped 即是admixture 要求的格式文件
4 格式介紹
structure 格式:

第一行代表 SNP 的 ID 信息;
第二行代表相鄰標(biāo)記間的距離;
第一列代表樣品信息;
第二列代表樣品其它信息。
備注:不同軟件轉(zhuǎn)換格式可能不一樣。
admixture格式:

第一列:Family ID
第二列:Individual ID
第三列:Paternal ID
第四列:Maternal ID
第五列:Sex (1=male; 2=female; other=unknown)
第六列:Phenotype(-9 表示缺失)
圖片中 1 2 代表基因型,其中每個標(biāo)記占兩列(二等位),純合為 1 1 或者 2 2,雜合為 1 2。
5 structure 分析
網(wǎng)址:http://web.stanford.edu/group/pritchardlab/software.html
structure軟件兩個配置文件:
- mainparams_structure.cfg
- extraparams_structure.cfg
主配置文件:mainparams_structure.cfg
Basic Program Parameters
#define MAXPOPS 2 // (int) number of populations assumed ### k value
#define BURNIN 5000 // (int) length of burnin period ### at least > 5000, general 30000, maybe
#define NUMREPS 50000 // (int) number of MCMC reps after burnin ### at least > 5000, usually 10000, maybe
Input/Output files
#define INFILE /home/dengdj/bin/Bio_soft/Structure/structure2.3.1_console/testdata1 // (str) name of input data file
#define OUTFILE results //(str) name of output data file
Data file format
#define NUMINDS 200 // (int) number of diploid individuals in data file
#define NUMLOCI 5 // (int) number of loci in data file
#define PLOIDY 2 // (int) ploidy of data
#define MISSING 0 // (int) value given to missing genotype data
#define ONEROWPERIND 1 // (B) store data for individuals in a single line
#define LABEL 1 // (B) Input file contains individual labels
#define POPDATA 1 // (B) Input file contains a population identifier
#define POPFLAG 0 // (B) Input file contains a flag which says
whether to use popinfo when USEPOPINFO==1
#define LOCDATA 0 // (B) Input file contains a location identifier
#define PHENOTYPE 0 // (B) Input file contains phenotype information
#define EXTRACOLS 0 // (int) Number of additional columns of data
before the genotype data start.
#define MARKERNAMES 1 // (B) data file contains row of marker names
#define RECESSIVEALLELES 0 // (B) data file contains dominant markers (eg AFLPs)
// and a row to indicate which alleles are recessive
#define MAPDISTANCES 1 // (B) data file contains row of map distances
// between loci
Advanced data file options
#define PHASED 0 // (B) Data are in correct phase (relevant for linkage model only)
#define PHASEINFO 0 // (B) the data for each individual contains a line
indicating phase (linkage model)
#define MARKOVPHASE 0 // (B) the phase info follows a Markov model.
#define NOTAMBIGUOUS -999 // (int) for use in some analyses of polyploid data
一般情況下,主配置文件 mainparams_structure.cfg 紅色標(biāo)準的需要檢查外,其它默認即可。
extraparams_structure.cfg 可以直接使用官方提供的,內(nèi)容也可以設(shè)置為空。
structure 主要參數(shù)如下:
-m mainparams 主配置文件
-e extraparams 其它配置文件
-k 推斷分群數(shù)目
-i 輸入文件 輸入 structure 格式文件
-o 輸出文件
-L 位點數(shù),分析的位點數(shù)目
-N 樣本數(shù)
-D 隨機種子
命令投遞:

輸出結(jié)果:

6 admixture 分析
網(wǎng)址:http://dalexander.github.io/admixture/index.html
admixtrue 是一款運算速度非??斓娜后w結(jié)構(gòu)分析軟件,操作簡單。需要提供的文件是plink中的ped格式。
命令行(批量輸出):
for i in $(seq 1 10);
do echo "admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped $i -j5 > test_k_$i.log";
done
輸出結(jié)果:
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 1 -j5 > test_k_1.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 2 -j5 > test_k_2.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 3 -j5 > test_k_3.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 4 -j5 > test_k_4.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 5 -j5 > test_k_5.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 6 -j5 > test_k_6.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 7 -j5 > test_k_7.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 8 -j5 > test_k_8.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 9 -j5 > test_k_9.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 10 -j5 > test_k_10.log
-C:收斂閾值,設(shè)置 0.01
--cv: 不需要提供參數(shù)或者數(shù)據(jù)(默認交叉驗證 5 次)
Test_addID.maf0.05.int0.8.prune.in.ped:即admixtrue分析需要的基因型數(shù)據(jù)
-j5: 表示分析采用5個線程
test_k_*.log:輸出的日志文件,里面記錄了交叉驗證錯誤率,用來推斷最優(yōu)分群。
7 結(jié)果分析
7.1 structure 結(jié)果分析
一般情況下,structure分析完畢之后,最優(yōu)分群主要是基于deltak進行判斷。而deltak值的獲取,每個k值至少重復(fù)3次。
目前在線工具structureHarvester 可以基于structure分析的原始結(jié)果進行最優(yōu)分群數(shù)目的推斷。
登錄網(wǎng)站,上傳*.zip格式的壓縮文件(即將原始結(jié)果zip格式壓縮),然后點擊Harvest,執(zhí)行一段時間,即可返回結(jié)果。
群體結(jié)構(gòu)圖形繪制
R 語言包安裝:
install.packages("devtools")
devtools::install_github('royfrancis/pophelper')
備注:最新版本2.3.1
library(pophelper)
files <- list.files(path = ".",full.names = T,pattern = "_f$")
slist <- readQ(files = files, indlabfromfile = T,filetype = "structure")
xlist <- alignK(slist)
xlist2 <- mergeQ(xlist)
plotQ(xlist2, imgoutput = "join",imgtype = "pdf",width = 20,height = 1.2,
showindlab = T,useindlab = T,indlabhjust = 0.5,indlabangle = 0,
indlabsize = 1.2,sortind = "all",sharedindlab = F,
panelspacer = 0.02, indlabspacer = 0,indlabcol = "black",
showtitle = T,titlelab = "",titlecol = "black",
titlehjust = 0.5,showsp = T,
sppos = "left",splab = paste0("K=",1:11),splabangle = 180,dpi = 600,
outputfilename = paste0("structure_barplot"),units = "cm",
exportplot = T,exportpath = getwd())
##draw by group
pop <- read.table("group.txt",header = T,sep = "\t",stringsAsFactors =
F,row.names = 1)
plotQ(xlist2,imgoutput = "join",imgtype = "png",height = 1.2,width = 20,
showindlab = T,useindlab = T,sharedindlab = T,
showsp = T,sppos = "left",splab = paste0("K=",1:11),splabangle = 180,
showlegend = F,ordergrp = T,grplab = pop,
subsetgrp = LETTERS[1:9],dpi = 600,
outputfilename = paste0("structure_barplot_pop"),units = "cm",
exportpath = getwd())
輸出最優(yōu)分群:
write.table(x = xlist2[[11]],file = "Q_score.bestk.txt",append = F,quote = F,
row.names = T,col.names = T,sep = "\t")

pophelper 進行最優(yōu)分群推斷-evanno 法
library(pophelper)
files <- list.files(path = ".",full.names = T,pattern = "_f$")
slist <- readQ(files = files, indlabfromfile = T,filetype = "structure")
tabulateQ(slist, writetable = F, sorttable = TRUE)

summariseQ(tabulateQ(slist, writetable = F, sorttable = TRUE))

evannoMethodStructure(summariseQ(tabulateQ(slist, writetable = F, sorttable =
TRUE)),writetable = F,exportplot = F)

7.2 admixture 結(jié)果分析
admixture分析基于交叉驗證錯誤率的結(jié)果進行評斷最優(yōu)分群。當(dāng)然,目前很多文章也會利用admixture軟件重復(fù)運行k值(>=3),然后基于evanno方法進行最優(yōu)k值判斷。
如果k值僅重復(fù)運行一次,則需要CV法判斷最優(yōu)分群。CV值記錄在運行的結(jié)果日志文件中,可以通過捕獲的方法批量查詢:
grep -h 'CV' *.log

圖形繪制:
x <- 1:10
y<-c(0.52940,0.51765,0.51313,0.52238,0.53772,0.56136,0.59252,0.61948,0.662
29,0.71867)
par(mgp = c(5,1,0),mar = c(8,8,4,4))
plot(x,y,type="b",xlab="K value",ylab="Cross-Validation (CV) errors",
lab = c(10,5,1),
col=c(rep("black",2),"red",rep("black",7)),
bg = c(rep("black",2),"red",rep("black",7)),
bty = "l",las = 1,cex.lab = 2,cex.axis = 1.5,cex = 2,
pch = 21,lwd = 2,font = 2,font.lab = 2,font.axis = 2)
box(lwd = 3,bty = "l")

群體結(jié)構(gòu)圖繪制
sfile = paste0("Test_addID.maf0.05.int0.8.prune.in.",1:10,".Q")
slist <- readQ(sfile,indlabfromfile = T)
sample <- read.table("sampleID.txt",header = F,sep = "\t",stringsAsFactors = F)[,1]
for(i in 1:length(slist)){
rownames(slist[[i]]) <-sample
}
plotQ(slist, imgoutput = "join",imgtype = "png",width = 20,height = 1.2,
showindlab = T,useindlab = T,indlabhjust = 0.5,indlabangle = 0,
indlabsize = 1.2,sortind = "all",sharedindlab = F,
panelspacer = 0.02, indlabspacer = 0,indlabcol = "black",
showtitle = T,titlelab = "",titlecol = "black",
titlehjust = 0.5,showsp = T,
sppos = "left",splab = paste0("K=",1:10),splabangle = 180,dpi = 600,
outputfilename = paste0("admixture_barplot"),units = "cm",
exportplot = T,exportpath = getwd())
##draw by group
pop <- read.table("group.txt",header = T,sep = "\t",stringsAsFactors = F,row.names = 1)
plotQ(slist,imgoutput = "join",imgtype = "png",height = 1.2,width = 20,
showindlab = T,useindlab = T,sharedindlab = T,
showsp = T,sppos = "left",splab = paste0("K=",1:10),splabangle = 180,
showlegend = F,ordergrp = T,grplab = pop,
subsetgrp = LETTERS[1:9],dpi = 600,
outputfilename = paste0("admixture_barplot_pop"),units = "cm",
exportpath = getwd())

pophelper 參數(shù)說明

補充- 高級分析
clist<-list("shiny"=c("#1D72F5","#DF0101","#77CE61","#FF9326","#A945FF","#0089B2","#FDF060","#FFA6B2",
"#BFF217","#60D5FD","#CC1577","#F2B950","#7FB21D","#EC496F","#326397","#B26314","#027368","#A4A4A
4","#610B5E"),
"strong"=c("#11A4C8","#63C2C5","#1D4F9F","#0C516D","#2A2771","#396D35","#80C342","#725DA8","#B620
25","#ED2224","#ED1943","#ED3995","#7E277C","#F7EC16","#F8941E","#8C2A1C","#808080"),
"oceanfive"=c("#00A0B0", "#6A4A3C", "#CC333F", "#EB6841", "#EDC951"),
"keeled"=c("#48B098", "#91CB62", "#FFEE3B", "#FB9013", "#FF3C28"),
"vintage"=c("#400F13", "#027368", "#A3BF3F", "#F2B950", "#D93A2B"),
"muted"=c("#46BDDD","#82DDCE","#F5F06A","#F5CC6A","#F57E6A"),
"teal"=c("#CFF09E","#A8DBA8","#79BD9A","#3B8686","#0B486B"),
"merry"=c("#5BC0EB","#FDE74C","#9BC53D","#E55934","#FA7921"),
"funky"=c("#A6CEE3", "#3F8EAA", "#79C360", "#E52829", "#FDB762","#ED8F47","#9471B4"),
"retro"=c("#01948E","#A9C4E2","#E23560","#01A7B3","#FDA963","#323665","#EC687D"),
"cb_paired"=c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#C
AB2D6","#6A3D9A","#FFFF99","#B15928"),
"cb_set3"=c("#8DD3C7","#FFFFB3","#BEBADA","#FB8072","#80B1D3","#FDB462","#B3DE69","#FCCDE5","#D9
D9D9","#BC80BD","#CCEBC5","#FFED6F"),
"morris"=c("#4D94CC","#34648A","#8B658A","#9ACD32","#CC95CC","#9ACD32","#8B3A39","#CD6601","#CC
5C5B","#8A4500"),
"wong"=c("#000000","#E69F00","#56B4E9","#009E73","#F0E442","#006699","#D55E00","#CC79A7"),
"krzywinski"=c("#006E82","#8214A0","#005AC8","#00A0FA","#FA78FA","#14D2DC","#AA0A3C","#FA7850","#
0AB45A","#F0F032","#A0FA82","#FAE6BE"))
# add length of palettes
lengths <- sapply(clist,length)
names(clist) <- paste0(names(clist),"_",lengths)
par(mar=c(0.2,6,0.2,0))
par(mfrow=c(length(clist),1))
for(i in 1:length(clist)){
barplot(rep(1,max(lengths)),
col=c(clist[[i]],rep("white",max(lengths)-length(clist[[i]]))),
axes=F,border=F)
text(x=-0.1,y=0.5,adj=1,label=names(clist)[i],xpd=T,cex=1.2)
}

p1 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
imgoutput = "join",basesize = 11,
clustercol=clist$shiny_19,splab=paste0("K=",2:4),sppos = "left")
p2 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
imgoutput = "join",basesize = 11,
clustercol=clist$oceanfive_5,splab=paste0("K=",2:4),sppos = "left")
p3 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
imgoutput = "join",basesize = 11,
clustercol=clist$wong_8,splab=paste0("K=",2:4),sppos = "left")
p4 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
imgoutput = "join",basesize = 11,
clustercol=clist$vintage_5,splab=paste0("K=",2:4),sppos = "left")
cowplot::plot_grid(p1$plot[[1]],p2$plot[[1]],p3$plot[[1]],p4$plot[[1]],
align = "hv",labels = LETTERS[1:4])

8 系統(tǒng)樹- 群體結(jié)構(gòu)組合圖
當(dāng)前很多文章,都是將系統(tǒng)發(fā)育樹的結(jié)果和群體結(jié)構(gòu)的結(jié)果一起

R 操作代碼
library(ggtree)
library(ggplot2)
library(reshape2)
library(ape)
library(RColorBrewer)
library(aplot)
Set1 <- brewer.pal(n = 9,name = "Set1")
Set3 <- brewer.pal(n = 12,name = 'Set3')
Set2 <- brewer.pal(n = 8,name = "Set2")
Dark2 <- brewer.pal(n = 8,name = "Dark2")
Paired <- brewer.pal(n = 12,name = "Paired")
Set <- unique(c(Set1,Set3,Set2,Dark2,Paired))
# 刪除黃色
yellow <- c("#FFFF99","#FFED6F","#FFFFB3","#FFFF33")
Set <- Set[!Set %in% yellow]
##tree
tree <- read.tree(file = "Test.nwk")
##root
tree <- root(tree,outgroup = paste0("R",11:20))
##group
info <- read.table("group.txt",header = T,stringsAsFactors = F)
groupInfo <- split(x = as.character(info[,1]),f = as.character(info[,2]))
tree <- groupOTU(.data = tree,.node = groupInfo)
##admixture
data <- read.table("Q_score.bestk.txt",header = T)
df <- data.frame(tree$tip.label, data[,1:ncol(data)])
df <- melt(df, id = "tree.tip.label")
names(df) <- c("Sample","Cluster","Qvalue")
##draw
p <- ggtree(tree, branch.length = "none",mapping = aes(color = group)) +
theme(legend.position='none')
p <- p + scale_x_reverse() +
scale_y_reverse() +
coord_flip()
p <- p + scale_color_manual(values = Set,labels = names(groupInfo))
b <- ggplot(df, aes(x = Sample, y = Qvalue, fill = Cluster,color = Cluster)) +
geom_bar(stat='identity', colour="grey80") +
scale_y_continuous(expand = c(0,0)) +
theme(legend.position="none",
axis.text.x = element_text(size = 6,angle = 90, vjust = 0.5,
face = "bold", colour = "black", hjust=0.95),
axis.text.y = element_text(face = "bold", size = 6, colour = "black"),
axis.title = element_blank()) +
scale_fill_manual(values = Set) +
scale_color_manual(values = Set)
b %>% insert_top(p,height = 2.5)

https://blog.csdn.net/weixin_30273763/article/details/99769659
http://www.itdecent.cn/p/3b621b2d6c5f
http://www.itdecent.cn/p/d46f27665074
https://www.bilibili.com/read/cv6093335/
http://www.itdecent.cn/p/968f34e820ca
https://blog.csdn.net/rainforestist/article/details/114403010