寫在前面
gwas整套流程按順序如下:
- 1.測序數(shù)據(jù)質(zhì)控(fastp)
- 2.比對&標(biāo)記重復(fù)(bwa,samtools)
- 3.變異檢測(gatk)
- 4.關(guān)聯(lián)分析
前面一篇文章已經(jīng)講了以上的前3個(gè)內(nèi)容,這次說最后關(guān)聯(lián)分析的內(nèi)容。
4.關(guān)聯(lián)分析
在進(jìn)行關(guān)聯(lián)分析之前,需要對我們前面獲得的變異位點(diǎn)信息文件.vcf進(jìn)行數(shù)據(jù)質(zhì)控過濾,并對測序個(gè)體進(jìn)行群體結(jié)構(gòu)和連鎖不平衡分析(部分結(jié)果作為協(xié)變量輸入)。
1.數(shù)據(jù)質(zhì)控過濾
數(shù)據(jù)質(zhì)控分為個(gè)體質(zhì)控和位點(diǎn)質(zhì)控,即有些測序個(gè)體是否存在大量位點(diǎn)缺失的情況,有些變異位點(diǎn)缺少部分個(gè)體的變異信息。這些個(gè)體和變異位點(diǎn)可能需要去除。
- ①個(gè)體質(zhì)控(一般不做)
事實(shí)上我們一般不進(jìn)行個(gè)體過濾,只有在進(jìn)行位點(diǎn)過濾后變異位點(diǎn)較少的情況下,才會(huì)考慮:是否因?yàn)槟承﹤€(gè)體因?yàn)橛行y序深度過低導(dǎo)致大量位點(diǎn)缺失,進(jìn)而在位點(diǎn)過濾時(shí)去掉這些位點(diǎn)而導(dǎo)致過少的變異位點(diǎn)。這些個(gè)體可以舍去。
#1.生成缺失率文件plink.imiss (按樣本統(tǒng)計(jì)缺失率), plink.lmiss (按位點(diǎn)統(tǒng)計(jì)缺失率)
plink --vcf all.filtered.snp.vcf.gz \
--allow-extra-chr \
--missing \
--out snp
#2.個(gè)體缺失率查看,這里可以看到變異位點(diǎn)缺失最多的樣本也才5.7%的缺失率
#如果有20%的缺失率,那這個(gè)樣本測得可能有問題,可以舍去,后面遇到再說
(base) [wangtao01@sonmi:~/gwas/tana/gatk_joint/merged_vcf]$ sort -k6,6gr snp.imiss | head
SRR17015761 SRR17015761 Y 634107 11094943 0.05715
SRR17015762 SRR17015762 Y 565257 11094943 0.05095
SRR17015732 SRR17015732 Y 542475 11094943 0.04889
SRR17015737 SRR17015737 Y 433534 11094943 0.03907
SRR17015741 SRR17015741 Y 400748 11094943 0.03612
SRR17015765 SRR17015765 Y 384860 11094943 0.03469
SRR17015775 SRR17015775 Y 329450 11094943 0.02969
SRR17015773 SRR17015773 Y 324598 11094943 0.02926
SRR17015759 SRR17015759 Y 324127 11094943 0.02921
SRR17015760 SRR17015760 Y 320872 11094943 0.02892
- ②位點(diǎn)質(zhì)控
一般用進(jìn)行缺失率(missing genotypes)和最小等位基因頻率(minor allele frequency,MAF)進(jìn)行質(zhì)控。后續(xù)進(jìn)行g(shù)was分析用plink質(zhì)控,bcftools工具可用非gwas分析的數(shù)據(jù)質(zhì)控比較方便。
plink --vcf merge.filtered.gatk.vcf.gz \
--allow-extra-chr \ #允許不規(guī)范的染色體名稱
--set-missing-var-ids @:# \ #生成變異位點(diǎn)ID,需加
--biallelic-only strict \ #只保留二等位位點(diǎn)
--geno 0.1 \ #去掉缺失率 >10% 的位點(diǎn)
--mind 0.1 \ #去掉缺失率 >10% 的樣本,一般不設(shè)置
--maf 0.05 \
--make-bed \ #輸出bed格式文件,作為gwas主流輸入文件
--recode vcf-iid \ #同時(shí)輸出vcf文件備份
--out gwas.qc
#過濾后的文件
-rw-r--r-- 1 wangtao01 wangtao01 22268443 Dec 18 16:26 gwas.qc.bed
-rw-r--r-- 1 wangtao01 wangtao01 81443920 Dec 18 16:26 gwas.qc.bim
-rw-r--r-- 1 wangtao01 wangtao01 1320 Dec 18 16:26 gwas.qc.fam
-rw-r--r-- 1 wangtao01 wangtao01 1328 Dec 18 16:26 gwas.qc.log
-rw-r--r-- 1 wangtao01 wangtao01 960 Dec 18 16:26 gwas.qc.nosex
-rw-r--r-- 1 wangtao01 wangtao01 455555358 Dec 18 16:26 gwas.qc.vcf
#輸出結(jié)果日志:80%被過濾掉了
# 10841114 variants loaded
# 425867 variants removed due to missing genotype data (--geno)
# 8188403 variants removed due to minor allele threshold (--maf)
# 2226844 variants pass filters
$cat gwas.qc.log
PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
Options in effect:
--allow-extra-chr
--biallelic-only strict
--geno 0.1
--maf 0.05
--make-bed
--out gwas.qc
--recode vcf-iid
--set-missing-var-ids @:#
--vcf all.filtered.snp.vcf.gz
Hostname: sonmi
Working directory: /share/home/wangtao01/gwas/tana/gatk_joint/merged_vcf
Start time: Thu Dec 18 16:25:05 2025
Random number seed: 1766046305
515089 MB RAM detected; reserving 257544 MB for main workspace.
--vcf: gwas.qc-temporary.bed + gwas.qc-temporary.bim + gwas.qc-temporary.fam
written.
(253829 variants skipped.)
10841114 variants loaded from .bim file.
10841114 missing IDs set.
40 people (0 males, 0 females, 40 ambiguous) loaded from .fam.
Ambiguous sex IDs written to gwas.qc.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 40 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.975406.
425867 variants removed due to missing genotype data (--geno).
8188403 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
2226844 variants and 40 people pass filters and QC.
Note: No phenotypes present.
--make-bed to gwas.qc.bed + gwas.qc.bim + gwas.qc.fam ... done.
--recode vcf-iid to gwas.qc.vcf ... done.
- ③準(zhǔn)備表型數(shù)據(jù)
我這里做的性別關(guān)聯(lián)分析,樣本的表型就是性別信息,是雌性還是雄性,這里把雄性數(shù)據(jù)編碼為1,雌性數(shù)據(jù)編碼為2。我這里準(zhǔn)備一個(gè)包含3列(FID IID SEX)的表型數(shù)據(jù)文件sex.pheno.txt:
cat sex.pheno.txt
SRR17015732 SRR17015732 1
SRR17015733 SRR17015733 1
SRR17015734 SRR17015734 2
SRR17015735 SRR17015735 2
SRR17015736 SRR17015736 1
SRR17015737 SRR17015737 2
SRR17015738 SRR17015738 2
SRR17015739 SRR17015739 2
SRR17015740 SRR17015740 1
SRR17015741 SRR17015741 1
SRR17015742 SRR17015742 2
SRR17015743 SRR17015743 1
SRR17015744 SRR17015744 1
SRR17015745 SRR17015745 1
SRR17015746 SRR17015746 2
SRR17015747 SRR17015747 2
SRR17015748 SRR17015748 2
SRR17015749 SRR17015749 1
SRR17015750 SRR17015750 1
SRR17015751 SRR17015751 1
SRR17015752 SRR17015752 1
SRR17015753 SRR17015753 1
SRR17015754 SRR17015754 2
SRR17015755 SRR17015755 2
SRR17015756 SRR17015756 2
SRR17015757 SRR17015757 2
SRR17015758 SRR17015758 2
SRR17015759 SRR17015759 1
SRR17015760 SRR17015760 1
SRR17015761 SRR17015761 2
SRR17015762 SRR17015762 1
SRR17015763 SRR17015763 2
SRR17015765 SRR17015765 1
SRR17015767 SRR17015767 2
SRR17015769 SRR17015769 2
SRR17015771 SRR17015771 1
SRR17015773 SRR17015773 1
SRR17015775 SRR17015775 2
SRR17015777 SRR17015777 1
SRR17015779 SRR17015779 2
2.群體結(jié)構(gòu)和連鎖不平衡分析
全基因組關(guān)聯(lián)分析gwas本質(zhì)原理是,利用標(biāo)記(變異位點(diǎn))與目標(biāo)位點(diǎn)之間存在的連瑣不平衡,定位到與目標(biāo)性狀關(guān)聯(lián)的連瑣不平衡區(qū)域。當(dāng)然,我們用于定位性狀的各個(gè)位點(diǎn)顯著性p值的是基于線性回歸模型計(jì)算的,群體結(jié)構(gòu)和個(gè)體間的親緣關(guān)系會(huì)影響該值,因此使用關(guān)聯(lián)分析的模型需要輸入群體結(jié)構(gòu)和親緣關(guān)系數(shù)據(jù)對模型進(jìn)行一定的矯正。
- ①輸入數(shù)據(jù)調(diào)整
盡管我么已經(jīng)準(zhǔn)備質(zhì)控過的變異文件,vcf,1.但是由于我們的染色體名稱不規(guī)范(規(guī)范染色體名稱是1,2,3,4,5......,即純數(shù)字),會(huì)導(dǎo)致下游一些分析關(guān)鍵報(bào)錯(cuò),因此,需要將染色體名稱進(jìn)行調(diào)整;其次,2.下游structure分析需要排除連瑣不平衡位點(diǎn),因此也需要將這部分變異位點(diǎn)進(jìn)行去除。
#1.調(diào)整染色體名稱
#染色體名稱在3個(gè)BED文件中的bim中修改
awk 'NR==FNR{m[$1]=$2; next}
{if($1 in m){$1=m[$1]; $2=$1":"$4} print}' \
chr_map.txt gwas.qc.bim > gwas.qc.chrnum.bim
ln -s gwas.qc.bed gwas.qc.chrnum.bed
ln -s gwas.qc.fam gwas.qc.chrnum.fam
#查看替換前后結(jié)果
(base) [wangtao01@sonmi:~/gwas/tana/gwas_workdir]$ head gwas.qc.bim gwas.qc.chrnum.bim
==> gwas.qc.bim <==
chr1A chr1A:52447 0 52447 G A
chr1A chr1A:52891 0 52891 A G
chr1A chr1A:53186 0 53186 G T
chr1A chr1A:53391 0 53391 C T
chr1A chr1A:53434 0 53434 G A
chr1A chr1A:53517 0 53517 G A
chr1A chr1A:53930 0 53930 A G
chr1A chr1A:53950 0 53950 C T
chr1A chr1A:54060 0 54060 T G
chr1A chr1A:54198 0 54198 A C
==> gwas.qc.chrnum.bim <==
1 1:52447 0 52447 G A
1 1:52891 0 52891 A G
1 1:53186 0 53186 G T
1 1:53391 0 53391 C T
1 1:53434 0 53434 G A
1 1:53517 0 53517 G A
1 1:53930 0 53930 A G
1 1:53950 0 53950 C T
1 1:54060 0 54060 T G
1 1:54198 0 54198 A C
#2.去除連瑣不平衡LD位點(diǎn)
##第一步:計(jì)算 LD 剪枝位點(diǎn)列表(生成 prune.in / prune.out)
#注意??!--indep-pairwise 接受三個(gè)參數(shù),
#第一個(gè)是窗口大?。梢允?位點(diǎn)數(shù)variant count 或者帶 'kb' 指定為 kilobase),默認(rèn)沒有 'kb' 時(shí),window size 是 SNP 個(gè)數(shù) 而不是物理距離
#第二個(gè)是 step size(位點(diǎn)數(shù)量)
#第三個(gè)是閾值( r2)
plink --bfile gwas.qc.chrnum \
--indep-pairwise 50 10 0.2 \
--out gwas.qc.chrnum.ld
##第二步:按 prune.in 提取,生成 LD-pruned 的新數(shù)據(jù)集
plink --bfile gwas.qc.chrnum \
--extract gwas.qc.chrnum.ld.prune.in \
--make-bed \
--out gwas.qc.chrnum.LD
#得到可作為群體結(jié)構(gòu)structure分析的編譯位點(diǎn)文件gwas.qc.chrnum.LD
(base) [wangtao01@sonmi:~/gwas/tana/gwas_workdir]$ ll -tr
total 643036
-rw-r--r-- 1 wangtao01 wangtao01 1041 Dec 18 17:07 sex.pheno.txt
-rw-r--r-- 1 wangtao01 wangtao01 81443920 Dec 18 17:09 gwas.qc.bim
-rw-r--r-- 1 wangtao01 wangtao01 22268443 Dec 18 17:09 gwas.qc.bed
-rw-r--r-- 1 wangtao01 wangtao01 455555358 Dec 18 17:09 gwas.qc.vcf
-rw-r--r-- 1 wangtao01 wangtao01 960 Dec 18 17:09 gwas.qc.nosex
-rw-r--r-- 1 wangtao01 wangtao01 1328 Dec 18 17:09 gwas.qc.log
-rw-r--r-- 1 wangtao01 wangtao01 1320 Dec 18 17:09 gwas.qc.fam
-rw-r--r-- 1 wangtao01 wangtao01 222 Dec 19 17:46 chr_map.txt
-rw-r--r-- 1 wangtao01 wangtao01 63629168 Dec 19 18:17 gwas.qc.chrnum.bim
lrwxrwxrwx 1 wangtao01 wangtao01 11 Dec 19 18:22 gwas.qc.chrnum.bed -> gwas.qc.bed
lrwxrwxrwx 1 wangtao01 wangtao01 11 Dec 19 18:22 gwas.qc.chrnum.fam -> gwas.qc.fam
-rw-r--r-- 1 wangtao01 wangtao01 960 Dec 19 18:41 gwas.qc.chrnum.ld.nosex
-rw-r--r-- 1 wangtao01 wangtao01 3033528 Dec 19 18:41 gwas.qc.chrnum.ld.prune.in
-rw-r--r-- 1 wangtao01 wangtao01 22100524 Dec 19 18:41 gwas.qc.chrnum.ld.prune.out
-rw-r--r-- 1 wangtao01 wangtao01 2426 Dec 19 18:41 gwas.qc.chrnum.ld.log
-rw-r--r-- 1 wangtao01 wangtao01 960 Dec 19 18:48 gwas.qc.chrnum.LD.nosex
-rw-r--r-- 1 wangtao01 wangtao01 1320 Dec 19 18:48 gwas.qc.chrnum.LD.fam
-rw-r--r-- 1 wangtao01 wangtao01 2698783 Dec 19 18:48 gwas.qc.chrnum.LD.bed
-rw-r--r-- 1 wangtao01 wangtao01 7686324 Dec 19 18:48 gwas.qc.chrnum.LD.bim
-rw-r--r-- 1 wangtao01 wangtao01 980 Dec 19 18:48 gwas.qc.chrnum.LD.log
- ②親緣關(guān)系kinship計(jì)算
使用GCTA軟件基于IBS的方式計(jì)算親緣關(guān)系,并行格式轉(zhuǎn)換成常規(guī)的對角矩陣文件
##1.使用GCTA軟件計(jì)算親緣關(guān)系
#注意加上-autosome-num 參數(shù),GCTA默認(rèn)按照人的染色體名稱,超過或者不匹配染色體名稱則會(huì)報(bào)錯(cuò)
gcta --bfile gwas.qc.chrnum \
--make-grm-gz \
--make-grm-alg 0 \
--autosome-num 24 \
--out gcta_kinship
#輸出如下兩個(gè)文件,.grm.gz即grm陣列(genetic relationship matrix)文件,需要轉(zhuǎn)成對角矩陣用于下游分析
-rw-r--r-- 1 wangtao01 wangtao01 9512 Dec 21 00:58 gcta_kinship.grm.gz
-rw-r--r-- 1 wangtao01 wangtao01 960 Dec 21 00:58 gcta_kinship.grm.id
##2.替換樣品名稱,轉(zhuǎn)成對角矩陣
#用如下腳本grm_to_matrix.py進(jìn)行替換,注意更改輸入文件前綴:
——————————————————————————————————————————————————————————————-————————————————————
python3 - <<'PY'
import gzip
import numpy as np
prefix = "gcta_kinship" # ? 如果你的前綴不一樣,就改這里
# 1. 讀取樣本ID(行列名)
ids = []
with open(prefix + ".grm.id") as f:
for line in f:
fid, iid = line.rstrip("\n").split()[:2]
ids.append(iid)
n = len(ids)
K = np.eye(n, dtype=np.float32)
# 2. 讀取 pairwise GRM
with gzip.open(prefix + ".grm.gz", "rt") as f:
for line in f:
i, j, _N, v = line.split()
i = int(i) - 1
j = int(j) - 1
v = float(v)
K[i, j] = v
K[j, i] = v
# 3. 輸出方陣
out = prefix + ".matrix.tsv"
with open(out, "w") as w:
w.write("\t" + "\t".join(ids) + "\n")
for i, name in enumerate(ids):
w.write(name + "\t" + "\t".join(f"{x:.6g}" for x in K[i]) + "\n")
print("Done! Output:", out)
PY
——————————————————————————————————————————————————————————————————————————————————
#格式轉(zhuǎn)換
python3 grm_to_matrix.py
#最后拿到親緣關(guān)系G矩陣
-rw-r--r-- 1 wangtao01 wangtao01 18181 Dec 21 01:26 gcta_kinship.matrix.tsv

- ③主成分分析PCA
可使用plink直接分析,也可以用gcta生成親緣關(guān)系再計(jì)算PCA
##1.plink
#--pca設(shè)置幾個(gè)主成分,可設(shè)置樣本數(shù)相同的數(shù)量
plink --bfile gwas.qc.chrnum \
--pca 10 \
--out plink_PCA \
--allow-extra-chr --set-missing-var-ids @:#
#輸出主成分特征值和特征向量兩個(gè)主要文件,eigenvec即包括10列主成分值
-rw-r--r-- 1 wangtao01 wangtao01 78 Dec 21 02:53 plink_PCA.eigenval
-rw-r--r-- 1 wangtao01 wangtao01 4999 Dec 21 02:53 plink_PCA.eigenvec
##2.gcta
#先進(jìn)行親緣關(guān)系計(jì)算。其實(shí)plink也做了,兩步一起做了
#注意--make-grm參數(shù)而非--make-grm-gz
gcta --bfile gwas.qc.chrnum \
--make-grm \
--make-grm-alg 0 \
--out gcta_PCA
#輸出
-rw-r--r-- 1 wangtao01 wangtao01 3280 Dec 21 03:03 gcta_PCA.grm.bin
-rw-r--r-- 1 wangtao01 wangtao01 960 Dec 21 03:03 gcta_PCA.grm.id
-rw-r--r-- 1 wangtao01 wangtao01 3280 Dec 21 03:03 gcta_PCA.grm.N.bin
#再進(jìn)行PCA分析
gcta --grm gcta_PCA \
--pca 10 \
--out gcta_PCA
#同樣輸出主成分特征值和特征向量兩個(gè)主要文件
-rw-r--r-- 1 wangtao01 wangtao01 339 Dec 21 03:10 gcta_PCA.eigenval
-rw-r--r-- 1 wangtao01 wangtao01 5026 Dec 21 03:10 gcta_PCA.eigenvec

在文獻(xiàn)里找一個(gè)好看點(diǎn)的圖,截給大模型,讓其寫一個(gè)用于可視化的R腳本pca_plot_plink.R,畫出來效果不錯(cuò),挺還原的。另外我這里畫圖的時(shí)候,我重新創(chuàng)建了一個(gè)R環(huán)境r_gwas,專門用于畫圖,防止污染我常用的base主環(huán)境。
############################################################
## PLINK PCA visualization (PC1 vs PC2)
## Environment: conda env "r_gwas"
############################################################
suppressPackageStartupMessages({
library(data.table)
library(dplyr)
library(ggplot2)
})
## ---------- 1. read eigenvec ----------
eigvec <- fread("plink_PCA.eigenvec", header = FALSE)
# FID IID PC1 PC2 ...
setnames(eigvec, c("FID", "IID", paste0("PC", 1:(ncol(eigvec) - 2))))
## ---------- 2. read eigenval & explained variance ----------
eigval <- scan("plink_PCA.eigenval", quiet = TRUE)
var_exp <- eigval / sum(eigval)
pc1_lab <- sprintf("PC1 (%.1f%%)", 100 * var_exp[1])
pc2_lab <- sprintf("PC2 (%.1f%%)", 100 * var_exp[2])
## ---------- 3. read phenotype (sex) ----------
pheno <- fread("sex.pheno.txt", header = FALSE)
setnames(pheno, c("FID", "IID", "sex"))
## ?? 按你的實(shí)際含義改這里
## 例如:1 = Female, 2 = Male(如果相反就調(diào)換)
pheno[, sex := factor(sex, levels = c(1, 2),
labels = c("Sex1", "Sex2"))]
## ---------- 4. merge ----------
df <- eigvec %>%
inner_join(pheno, by = c("FID", "IID"))
## ---------- 5. PCA plot ----------
p <- ggplot(df, aes(x = PC1, y = PC2, color = sex)) +
geom_hline(yintercept = 0, linetype = "dashed",
linewidth = 0.4, color = "grey50") +
geom_vline(xintercept = 0, linetype = "dashed",
linewidth = 0.4, color = "grey50") +
geom_point(size = 2.6, alpha = 0.9) +
stat_ellipse(
aes(fill = sex),
geom = "polygon",
type = "norm",
alpha = 0.18,
color = NA
) +
labs(
title = "PLINK PCA (PC1 vs PC2)",
x = pc1_lab,
y = pc2_lab,
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
panel.grid.minor = element_blank()
)
## ---------- 6. save ----------
ggsave("PCA_PC1_PC2_sex.png", p, width = 7.2, height = 5.6, dpi = 300)
ggsave("PCA_PC1_PC2_sex.pdf", p, width = 7.2, height = 5.6)
message("Done: PCA_PC1_PC2_sex.png / PCA_PC1_PC2_sex.pdf")

- ④群體結(jié)構(gòu)structure分析
群體結(jié)構(gòu)分析可以推斷群體的分層關(guān)系,需要使用LD過濾后的數(shù)據(jù)gwas.qc.chrnum.LD進(jìn)行分析。
用如下for循環(huán)生成一個(gè)K2到K10的admixture腳本admixture.jobs.sh,后面用parallel并行運(yùn)行。
#!/usr/bin/env bash
set -euo pipefail
PREFIX="gwas.qc.chrnum.LD" # LD過濾后的bed前綴
K_MIN=2
K_MAX=10
# 輸出命令列表
OUT_SH="admixture.jobs.sh"
: > ${OUT_SH}
for K in $(seq ${K_MIN} ${K_MAX}); do
echo "admixture --cv=10 -j4 ${PREFIX}.bed ${K} 1> logs/admixture.K${K}.log 2>&1" >> ${OUT_SH}
done
echo "[OK] wrote ${OUT_SH}"
echo "Run: mkdir -p logs && parallel -j ${K_MAX} < ${OUT_SH}"
用parallel并行運(yùn)行,得到Q矩陣,以及判斷最合適(最?。㎏所用到的cv error值
mkdir -p logs
parallel -j 3 < admixture.jobs.sh
#輸出Q矩陣,Q值代表來源某一亞群的可能性,所有亞群加起來等于1
-rw-r--r-- 1 wangtao01 wangtao01 720 Dec 21 14:15 gwas.qc.chrnum.LD.2.Q
-rw-r--r-- 1 wangtao01 wangtao01 4857804 Dec 21 14:15 gwas.qc.chrnum.LD.2.P
-rw-r--r-- 1 wangtao01 wangtao01 1440 Dec 21 14:16 gwas.qc.chrnum.LD.4.Q
-rw-r--r-- 1 wangtao01 wangtao01 9715608 Dec 21 14:16 gwas.qc.chrnum.LD.4.P
-rw-r--r-- 1 wangtao01 wangtao01 1080 Dec 21 14:17 gwas.qc.chrnum.LD.3.Q
-rw-r--r-- 1 wangtao01 wangtao01 7286706 Dec 21 14:17 gwas.qc.chrnum.LD.3.P
-rw-r--r-- 1 wangtao01 wangtao01 1800 Dec 21 14:18 gwas.qc.chrnum.LD.5.Q
-rw-r--r-- 1 wangtao01 wangtao01 12144510 Dec 21 14:18 gwas.qc.chrnum.LD.5.P
-rw-r--r-- 1 wangtao01 wangtao01 2160 Dec 21 14:19 gwas.qc.chrnum.LD.6.Q
-rw-r--r-- 1 wangtao01 wangtao01 14573412 Dec 21 14:19 gwas.qc.chrnum.LD.6.P
-rw-r--r-- 1 wangtao01 wangtao01 2880 Dec 21 14:21 gwas.qc.chrnum.LD.8.Q
-rw-r--r-- 1 wangtao01 wangtao01 19431216 Dec 21 14:21 gwas.qc.chrnum.LD.8.P
drwxr-xr-x 2 wangtao01 wangtao01 4096 Dec 21 14:21 logs
-rw-r--r-- 1 wangtao01 wangtao01 2520 Dec 21 14:25 gwas.qc.chrnum.LD.7.Q
-rw-r--r-- 1 wangtao01 wangtao01 17002314 Dec 21 14:25 gwas.qc.chrnum.LD.7.P
-rw-r--r-- 1 wangtao01 wangtao01 3240 Dec 21 14:26 gwas.qc.chrnum.LD.9.Q
-rw-r--r-- 1 wangtao01 wangtao01 21860118 Dec 21 14:26 gwas.qc.chrnum.LD.9.P
-rw-r--r-- 1 wangtao01 wangtao01 3600 Dec 21 14:37 gwas.qc.chrnum.LD.10.Q
-rw-r--r-- 1 wangtao01 wangtao01 24289020 Dec 21 14:37 gwas.qc.chrnum.LD.10.P
head gwas.qc.chrnum.LD.4.Q
0.515412 0.330016 0.000010 0.154562
0.000010 0.634053 0.006361 0.359577
0.000010 0.855060 0.000010 0.144920
0.000010 0.436062 0.000010 0.563918
0.000010 0.000010 0.000010 0.999970
0.000010 0.822138 0.002708 0.175144
0.000010 0.879779 0.000010 0.120201
0.000010 0.808053 0.191927 0.000010
0.000010 0.862587 0.137393 0.000010
0.000010 0.816857 0.029244 0.153889
#提取log文件里的cv error值,最小的那個(gè)對應(yīng)著最合適的k值,這里結(jié)果K為2的時(shí)候最小
grep -h "CV error" logs/admixture.K*.log
CV error (K=10): 1.19629
CV error (K=2): 0.56840
CV error (K=3): 0.65797
CV error (K=4): 0.69925
CV error (K=5): 0.78295
CV error (K=6): 0.85496
CV error (K=7): 0.92420
CV error (K=8): 1.02147
CV error (K=9): 1.10976
同樣切到r_gwas環(huán)境,用genk里的一個(gè)R腳本進(jìn)行畫圖,這里調(diào)用的是pophelper包。
at draw_admixtureG.R
#!/usr/bin/env Rscript
# parse parameter ---------------------------------------------------------
library(argparser, quietly=TRUE)
# Create a parser
p <- arg_parser("draw sturcture figure for admixture result")
# Add command line arguments
p <- add_argument(p, "Qlist", help="input: Q matirx list file", type="character")
p <- add_argument(p, "sample", help="input: samplefile, ie .fam", type="character")
p <- add_argument(p, "sample_order", help="input: sample order file with group, define order in output figure ", type="character")
p <- add_argument(p, "output", help="output prefix", type="character")
# Parse the command line arguments
argv <- parse_args(p)
sfiles <- argv$Qlist
samp <- argv$sample
ord <- argv$sample_order
outpre <- argv$output
#sfiles <- "./demo.Q.list"
#samp <- "../00.prepare/demo.LD.fam"
#ord <- "../data/sample.pop"
#outpre <- "test"
library(pophelper)
library(ggplot2)
Qfiles <- read.table(sfiles, header = F)
qlist <- readQ(Qfiles$V1, filetype = "basic", indlabfromfile=F)
label <- read.table( samp, header = F,stringsAsFactors=F )
ordinfo <- read.table(ord, header = F, stringsAsFactors=F )
rownames(ordinfo) <- ordinfo$V1
for( i in 1:length(qlist) ){
rownames(qlist[[i]]) <- label$V1
qlist[[i]] <- qlist[[i]][as.character(ordinfo[,1]),]
}
# head(qlist[[3]])
xlist <- alignK(qlist)
## 每個(gè)K分別排序繪圖
p1_sort <- plotQ(
xlist,
sortind = "all", ## ind每個(gè)K分別排序
imgoutput = "join", ## 一張圖還是多張
returnplot=T,
exportplot=F,
useindlab = T , ## 顯示ind名稱
sharedindlab= F , ## ind出現(xiàn)一次
showindlab=T ,
splab=paste0("K=",sapply(xlist,ncol))
)
ggsave(filename = paste(outpre, "sorted.pdf",sep=".") ,
p1_sort$plot[[1]],
width = 20,
height = 10
)
## 按指定順序繪圖
p1_order <- plotQ(
xlist,
sortind = NA, ## ind排序
imgoutput = "join", ## 一張圖還是多張
returnplot=T,
exportplot=F,
useindlab = T , ## 顯示ind名稱
sharedindlab= T , ## ind出現(xiàn)一次
showindlab=T,
splab=paste0("K=",sapply(xlist,ncol)), ## 只顯示K值
grplab = ordinfo[,2,drop=FALSE], ## 添加grplab
grplabsize=4,linesize=0.8,pointsize=4,
ordergrp = F, # 不安裝grp排序
)
ggsave(filename = paste(outpre, "ordered.pdf",sep=".") ,
p1_order$plot[[1]],
width = 20,
height = 10
)
## 按指順序
p1_group <- plotQ(
xlist,
sortind = NA, ## ind排序
imgoutput = "join", ## 一張圖還是多張
returnplot=T,
exportplot=F,
useindlab = T , ## 顯示ind名稱
sharedindlab= T , ## ind出現(xiàn)一次
showindlab=T,
splab=paste0("K=",sapply(xlist,ncol)), ## 只顯示K值
grplab = ordinfo[,2,drop=FALSE],
grplabsize=4,linesize=0.8,pointsize=4,
ordergrp = T)
ggsave(filename = paste(outpre, "orderedgroup.pdf",sep=".") ,
p1_group$plot[[1]],
width = 20,
height = 10
)
這個(gè)腳本對你數(shù)據(jù)的輸入要求(需要準(zhǔn)備 3 個(gè)輸入文件)
它的命令行參數(shù)是:
1.Qlist:一個(gè)文件,里面每行一個(gè) .Q 文件路徑
2.sample:fam 文件(gwas.qc.chrnum.LD.fam)
3.sample_order:一個(gè)兩列文件:樣本ID+分組名
4.output:輸出前綴
#生成一個(gè)Q 文件路徑列表
ls gwas.qc.chrnum.LD.{2,3,4,5}.Q > gwas.Q.list
#fam文件已存在
#手寫或者用awk命令準(zhǔn)備一個(gè)sample_order文件
awk 'BEGIN{OFS="\t"}
{g=($3==1?"Male":($3==2?"Female":"Unknown")); print $1,g}' sex.pheno.txt \
| sort -k2,2 -k1,1 > sample.pop.order
##運(yùn)行
Rscript 2.structure_plot_admixture.R \
gwas.Q.list \
gwas.qc.chrnum.LD.fam \
sample.pop.order \
structure
#輸出3個(gè)pdf文件
-rw-r--r-- 1 wangtao01 wangtao01 9780 Dec 22 00:15 structure.sorted.pdf
-rw-r--r-- 1 wangtao01 wangtao01 9477 Dec 22 00:15 structure.ordered.pdf
-rw-r--r-- 1 wangtao01 wangtao01 9477 Dec 22 00:15 structure.orderedgroup.pdf

- ⑤連瑣不平衡衰減分析LD decay
LD 衰減分析可以查看整個(gè)群體及亞群的變異位點(diǎn)間的連鎖水平。
軟件:PopLDdecay ;conda安裝的找不到畫圖腳本,所以去GitHub重新下載安裝了。
官網(wǎng):https://github.com/BGI-shenzhen/PopLDdecay
分析還是分兩步:1.PopLDdecay程序計(jì)算群體內(nèi)兩兩位點(diǎn)間的r^2值;2.Plot_OnePop.pl畫圖。如果有多個(gè)群體就分別進(jìn)行r^2值計(jì)算,再用Plot_MultiPop.pl一起可視化
#1.計(jì)算位點(diǎn)間的r^2值
#分亞群的先生成一個(gè)亞群的樣本列表文件,然后加參數(shù) -SubPop
~/software/PopLDdecay/bin/PopLDdecay \
-InVCF gwas.qc.vcf \
-SubPop female.list \
-MaxDist 500 \
-OutStat pop.Female.LDdecay
~/software/PopLDdecay/bin/PopLDdecay \
-InVCF gwas.qc.vcf \
-SubPop male.list \
-MaxDist 500 \
-OutStat pop.Male.LDdecay
#輸出.stat.gz結(jié)果
-rw-r--r-- 1 wangtao01 wangtao01 5633612 Dec 23 03:15 pop.Female.LDdecay.stat.gz
-rw-r--r-- 1 wangtao01 wangtao01 5639255 Dec 23 03:15 pop.Male.LDdecay.stat.gz
gzip -dc pop.Female.LDdecay.stat.gz |head -n 10
#Dist Mean_r^2 Mean_D' Sum_r^2 Sum_D' NumberPairs
1 0.7804 NA 47608.7121 NA 61006
2 0.7489 NA 26949.2520 NA 35983
3 0.7270 NA 23488.2912 NA 32310
4 0.7199 NA 22281.5461 NA 30949
5 0.7109 NA 21016.1581 NA 29561
6 0.7025 NA 21004.5115 NA 29901
7 0.6942 NA 19522.0154 NA 28123
8 0.6894 NA 19483.0819 NA 28259
9 0.6822 NA 19381.8086 NA 28411
#2. 單個(gè)群體或者亞群的畫圖
~/software/PopLDdecay/bin/Plot_OnePop.pl \
-inFile pop.Female.LDdecay.stat.gz \
-output pop.Female.LDdecay \
-keepR #保留畫圖的R腳本,后續(xù)圖丑可以進(jìn)行調(diào)整
#3. 多個(gè)群體或者亞群的畫圖
#先準(zhǔn)備一個(gè)所以亞群LD分析結(jié)果文件列表Pop.ResultPath.list
ls pop.*.stat.gz |awk -F"." '{ print $0"\t"$2 }' > Pop.ResultPath.list
cat Pop.ResultPath.list
pop.Female.LDdecay.stat.gz Female
pop.Male.LDdecay.stat.gz Male
#畫圖
~/software/PopLDdecay/bin/Plot_MultiPop.pl \
-inList Pop.ResultPath.list \
-output pop.all.LDdecay \
-keepR


- ⑥連瑣不平衡區(qū)塊LD block分析
全基因組水平的LD block分析用plink計(jì)算,計(jì)算較慢,同時(shí)也可設(shè)置--chr --from-bp --to-bp三個(gè)參數(shù)來限定染色體特定區(qū)域的LD block分析;染色體局部區(qū)域的連鎖情況使用LDBlockShow展示出圖,-SeleVar參數(shù)選擇d'或r^2值作為LD值。
##1 plink 計(jì)算全基因組LD block,另外--chr --from-bp --to-bp 3個(gè)參數(shù)用來計(jì)算染色體局部連瑣情況
plink --vcf gwas.qc.vcf \
--out plink_ldblock \
--chr chr1A --from-bp 1 --to-bp 500000 \
--blocks no-pheno-req \
--allow-extra-chr --set-missing-var-ids @:#
#輸出
-rw-r--r-- 1 wangtao01 wangtao01 960 Dec 24 15:29 plink_ldblock.nosex
-rw-r--r-- 1 wangtao01 wangtao01 4725 Dec 24 15:29 plink_ldblock.blocks.det
-rw-r--r-- 1 wangtao01 wangtao01 2419 Dec 24 15:29 plink_ldblock.blocks
-rw-r--r-- 1 wangtao01 wangtao01 1180 Dec 24 15:29 plink_ldblock.log
(main) [wangtao01@sonmi:~/gwas/tana/gwas_workdir]$ head plink_ldblock.blocks.det
CHR BP1 BP2 KB NSNPS SNPS
chr1A 53391 55361 1.971 11 chr1A:53391|chr1A:53434|chr1A:53517|chr1A:53930|chr1A:53950|chr1A:54060|chr1A:54198|chr1A:54559|chr1A:54913|chr1A:54941|chr1A:55361
chr1A 58187 59474 1.288 6 chr1A:58187|chr1A:58477|chr1A:58478|chr1A:59167|chr1A:59283|chr1A:59474
chr1A 59600 59720 0.121 2 chr1A:59600|chr1A:59720
chr1A 60573 60605 0.033 2 chr1A:60573|chr1A:60605
chr1A 71835 72613 0.779 4 chr1A:71835|chr1A:72181|chr1A:72233|chr1A:72613
chr1A 73471 77176 3.706 3 chr1A:73471|chr1A:77097|chr1A:77176
chr1A 83372 90234 6.863 9 chr1A:83372|chr1A:84389|chr1A:84390|chr1A:86049|chr1A:86260|chr1A:87214|chr1A:87230|chr1A:88741|chr1A:90234
chr1A 107186 108656 1.471 2 chr1A:107186|chr1A:108656
chr1A 111987 112416 0.43 3 chr1A:111987|chr1A:112165|chr1A:112416
##2 LDBlockShow 用來進(jìn)行染色體局部LDblock的可視化,-SeleVar 用來指定LD值計(jì)算方法(1: D' 2: r^2)
#另外默認(rèn)輸出svg圖片,-OutPdf -OutPng參數(shù)可用來轉(zhuǎn)化,但默認(rèn)會(huì)調(diào)用ImageMagick軟件,其依賴GLIBCXX較新版本,本服務(wù)器不支持,可用其他軟件如rsvg-convert替代。
#謹(jǐn)慎解決ImageMagick軟件報(bào)錯(cuò)問題,我吃了大虧,解決依賴沖突的時(shí)候把我base環(huán)境污染了,conda都用不了,得重新裝miniconda/miniforge。
LDBlockShow -InVCF gwas.qc.vcf \
-OutPut LDBlockShow \
-Region chr1A:300000-500000 \
-SeleVar 2 \
-OutPdf -OutPng
#rsvg-convert軟件轉(zhuǎn)換圖片
rsvg-convert -o LDBlockShow.png LDBlockShow.svg
rsvg-convert -f pdf -o LDBlockShow.pdf LDBlockShow.svg
