GWAS全流程#02 | 數(shù)據(jù)準(zhǔn)備

寫在前面

gwas整套流程按順序如下:

  • 1.測序數(shù)據(jù)質(zhì)控(fastp)
  • 2.比對&標(biāo)記重復(fù)(bwa,samtools)
  • 3.變異檢測(gatk)
  • 4.關(guān)聯(lián)分析

前面一篇文章已經(jīng)講了以上的前3個(gè)內(nèi)容,這次說最后關(guān)聯(lián)分析的內(nèi)容。

4.關(guān)聯(lián)分析

在進(jìn)行關(guān)聯(lián)分析之前,需要對我們前面獲得的變異位點(diǎn)信息文件.vcf進(jìn)行數(shù)據(jù)質(zhì)控過濾,并對測序個(gè)體進(jìn)行群體結(jié)構(gòu)和連鎖不平衡分析(部分結(jié)果作為協(xié)變量輸入)

1.數(shù)據(jù)質(zhì)控過濾

數(shù)據(jù)質(zhì)控分為個(gè)體質(zhì)控和位點(diǎn)質(zhì)控,即有些測序個(gè)體是否存在大量位點(diǎn)缺失的情況,有些變異位點(diǎn)缺少部分個(gè)體的變異信息。這些個(gè)體和變異位點(diǎn)可能需要去除。

  • ①個(gè)體質(zhì)控(一般不做)
    事實(shí)上我們一般不進(jìn)行個(gè)體過濾,只有在進(jìn)行位點(diǎn)過濾后變異位點(diǎn)較少的情況下,才會(huì)考慮:是否因?yàn)槟承﹤€(gè)體因?yàn)橛行y序深度過低導(dǎo)致大量位點(diǎn)缺失,進(jìn)而在位點(diǎn)過濾時(shí)去掉這些位點(diǎn)而導(dǎo)致過少的變異位點(diǎn)。這些個(gè)體可以舍去。
#1.生成缺失率文件plink.imiss (按樣本統(tǒng)計(jì)缺失率), plink.lmiss (按位點(diǎn)統(tǒng)計(jì)缺失率)
plink --vcf all.filtered.snp.vcf.gz \
     --allow-extra-chr \
     --missing \
     --out snp
#2.個(gè)體缺失率查看,這里可以看到變異位點(diǎn)缺失最多的樣本也才5.7%的缺失率
#如果有20%的缺失率,那這個(gè)樣本測得可能有問題,可以舍去,后面遇到再說
(base) [wangtao01@sonmi:~/gwas/tana/gatk_joint/merged_vcf]$ sort -k6,6gr snp.imiss | head
 SRR17015761   SRR17015761          Y   634107 11094943  0.05715
 SRR17015762   SRR17015762          Y   565257 11094943  0.05095
 SRR17015732   SRR17015732          Y   542475 11094943  0.04889
 SRR17015737   SRR17015737          Y   433534 11094943  0.03907
 SRR17015741   SRR17015741          Y   400748 11094943  0.03612
 SRR17015765   SRR17015765          Y   384860 11094943  0.03469
 SRR17015775   SRR17015775          Y   329450 11094943  0.02969
 SRR17015773   SRR17015773          Y   324598 11094943  0.02926
 SRR17015759   SRR17015759          Y   324127 11094943  0.02921
 SRR17015760   SRR17015760          Y   320872 11094943  0.02892
  • ②位點(diǎn)質(zhì)控
    一般用進(jìn)行缺失率(missing genotypes)和最小等位基因頻率(minor allele frequency,MAF)進(jìn)行質(zhì)控。后續(xù)進(jìn)行g(shù)was分析用plink質(zhì)控,bcftools工具可用非gwas分析的數(shù)據(jù)質(zhì)控比較方便。
plink --vcf merge.filtered.gatk.vcf.gz \
      --allow-extra-chr \    #允許不規(guī)范的染色體名稱
      --set-missing-var-ids @:#   \   #生成變異位點(diǎn)ID,需加
      --biallelic-only strict \    #只保留二等位位點(diǎn)
      --geno 0.1 \  #去掉缺失率 >10% 的位點(diǎn)
      --mind 0.1 \   #去掉缺失率 >10% 的樣本,一般不設(shè)置
      --maf 0.05 \
      --make-bed \   #輸出bed格式文件,作為gwas主流輸入文件
      --recode vcf-iid  \   #同時(shí)輸出vcf文件備份
      --out gwas.qc

#過濾后的文件
-rw-r--r-- 1 wangtao01 wangtao01   22268443 Dec 18 16:26 gwas.qc.bed
-rw-r--r-- 1 wangtao01 wangtao01   81443920 Dec 18 16:26 gwas.qc.bim
-rw-r--r-- 1 wangtao01 wangtao01       1320 Dec 18 16:26 gwas.qc.fam
-rw-r--r-- 1 wangtao01 wangtao01       1328 Dec 18 16:26 gwas.qc.log
-rw-r--r-- 1 wangtao01 wangtao01        960 Dec 18 16:26 gwas.qc.nosex
-rw-r--r-- 1 wangtao01 wangtao01  455555358 Dec 18 16:26 gwas.qc.vcf

#輸出結(jié)果日志:80%被過濾掉了
# 10841114 variants loaded
# 425867 variants removed due to missing genotype data (--geno)
# 8188403 variants removed due to minor allele threshold (--maf)
# 2226844 variants pass filters
$cat gwas.qc.log
PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
Options in effect:
  --allow-extra-chr
  --biallelic-only strict
  --geno 0.1
  --maf 0.05
  --make-bed
  --out gwas.qc
  --recode vcf-iid
  --set-missing-var-ids @:#
  --vcf all.filtered.snp.vcf.gz

Hostname: sonmi
Working directory: /share/home/wangtao01/gwas/tana/gatk_joint/merged_vcf
Start time: Thu Dec 18 16:25:05 2025

Random number seed: 1766046305
515089 MB RAM detected; reserving 257544 MB for main workspace.
--vcf: gwas.qc-temporary.bed + gwas.qc-temporary.bim + gwas.qc-temporary.fam
written.
(253829 variants skipped.)
10841114 variants loaded from .bim file.
10841114 missing IDs set.
40 people (0 males, 0 females, 40 ambiguous) loaded from .fam.
Ambiguous sex IDs written to gwas.qc.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 40 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.975406.
425867 variants removed due to missing genotype data (--geno).
8188403 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
2226844 variants and 40 people pass filters and QC.
Note: No phenotypes present.
--make-bed to gwas.qc.bed + gwas.qc.bim + gwas.qc.fam ... done.
--recode vcf-iid to gwas.qc.vcf ... done.
  • ③準(zhǔn)備表型數(shù)據(jù)
    我這里做的性別關(guān)聯(lián)分析,樣本的表型就是性別信息,是雌性還是雄性,這里把雄性數(shù)據(jù)編碼為1,雌性數(shù)據(jù)編碼為2。我這里準(zhǔn)備一個(gè)包含3列(FID IID SEX)的表型數(shù)據(jù)文件sex.pheno.txt
cat sex.pheno.txt
SRR17015732     SRR17015732     1
SRR17015733     SRR17015733     1
SRR17015734     SRR17015734     2
SRR17015735     SRR17015735     2
SRR17015736     SRR17015736     1
SRR17015737     SRR17015737     2
SRR17015738     SRR17015738     2
SRR17015739     SRR17015739     2
SRR17015740     SRR17015740     1
SRR17015741     SRR17015741     1
SRR17015742     SRR17015742     2
SRR17015743     SRR17015743     1
SRR17015744     SRR17015744     1
SRR17015745     SRR17015745     1
SRR17015746     SRR17015746     2
SRR17015747     SRR17015747     2
SRR17015748     SRR17015748     2
SRR17015749     SRR17015749     1
SRR17015750     SRR17015750     1
SRR17015751     SRR17015751     1
SRR17015752     SRR17015752     1
SRR17015753     SRR17015753     1
SRR17015754     SRR17015754     2
SRR17015755     SRR17015755     2
SRR17015756     SRR17015756     2
SRR17015757     SRR17015757     2
SRR17015758     SRR17015758     2
SRR17015759     SRR17015759     1
SRR17015760     SRR17015760     1
SRR17015761     SRR17015761     2
SRR17015762     SRR17015762     1
SRR17015763     SRR17015763     2
SRR17015765     SRR17015765     1
SRR17015767     SRR17015767     2
SRR17015769     SRR17015769     2
SRR17015771     SRR17015771     1
SRR17015773     SRR17015773     1
SRR17015775     SRR17015775     2
SRR17015777     SRR17015777     1
SRR17015779     SRR17015779     2

2.群體結(jié)構(gòu)和連鎖不平衡分析

全基因組關(guān)聯(lián)分析gwas本質(zhì)原理是,利用標(biāo)記(變異位點(diǎn))與目標(biāo)位點(diǎn)之間存在的連瑣不平衡,定位到與目標(biāo)性狀關(guān)聯(lián)的連瑣不平衡區(qū)域。當(dāng)然,我們用于定位性狀的各個(gè)位點(diǎn)顯著性p值的是基于線性回歸模型計(jì)算的,群體結(jié)構(gòu)和個(gè)體間的親緣關(guān)系會(huì)影響該值,因此使用關(guān)聯(lián)分析的模型需要輸入群體結(jié)構(gòu)和親緣關(guān)系數(shù)據(jù)對模型進(jìn)行一定的矯正。

  • ①輸入數(shù)據(jù)調(diào)整
    盡管我么已經(jīng)準(zhǔn)備質(zhì)控過的變異文件,vcf,1.但是由于我們的染色體名稱不規(guī)范(規(guī)范染色體名稱是1,2,3,4,5......,即純數(shù)字),會(huì)導(dǎo)致下游一些分析關(guān)鍵報(bào)錯(cuò),因此,需要將染色體名稱進(jìn)行調(diào)整;其次,2.下游structure分析需要排除連瑣不平衡位點(diǎn),因此也需要將這部分變異位點(diǎn)進(jìn)行去除。
#1.調(diào)整染色體名稱
#染色體名稱在3個(gè)BED文件中的bim中修改
awk 'NR==FNR{m[$1]=$2; next}
     {if($1 in m){$1=m[$1]; $2=$1":"$4} print}' \
    chr_map.txt gwas.qc.bim > gwas.qc.chrnum.bim

ln -s  gwas.qc.bed  gwas.qc.chrnum.bed
ln -s  gwas.qc.fam  gwas.qc.chrnum.fam

#查看替換前后結(jié)果
(base) [wangtao01@sonmi:~/gwas/tana/gwas_workdir]$ head gwas.qc.bim gwas.qc.chrnum.bim
==> gwas.qc.bim <==
chr1A   chr1A:52447     0       52447   G       A
chr1A   chr1A:52891     0       52891   A       G
chr1A   chr1A:53186     0       53186   G       T
chr1A   chr1A:53391     0       53391   C       T
chr1A   chr1A:53434     0       53434   G       A
chr1A   chr1A:53517     0       53517   G       A
chr1A   chr1A:53930     0       53930   A       G
chr1A   chr1A:53950     0       53950   C       T
chr1A   chr1A:54060     0       54060   T       G
chr1A   chr1A:54198     0       54198   A       C

==> gwas.qc.chrnum.bim <==
1 1:52447 0 52447 G A
1 1:52891 0 52891 A G
1 1:53186 0 53186 G T
1 1:53391 0 53391 C T
1 1:53434 0 53434 G A
1 1:53517 0 53517 G A
1 1:53930 0 53930 A G
1 1:53950 0 53950 C T
1 1:54060 0 54060 T G
1 1:54198 0 54198 A C

#2.去除連瑣不平衡LD位點(diǎn)
##第一步:計(jì)算 LD 剪枝位點(diǎn)列表(生成 prune.in / prune.out)
#注意??!--indep-pairwise 接受三個(gè)參數(shù),
#第一個(gè)是窗口大?。梢允?位點(diǎn)數(shù)variant count 或者帶 'kb' 指定為 kilobase),默認(rèn)沒有 'kb' 時(shí),window size 是 SNP 個(gè)數(shù) 而不是物理距離
#第二個(gè)是 step size(位點(diǎn)數(shù)量)
#第三個(gè)是閾值( r2)
plink --bfile gwas.qc.chrnum \
      --indep-pairwise 50 10 0.2 \
      --out gwas.qc.chrnum.ld
##第二步:按 prune.in 提取,生成 LD-pruned 的新數(shù)據(jù)集
plink --bfile gwas.qc.chrnum \
      --extract gwas.qc.chrnum.ld.prune.in \
      --make-bed \
      --out gwas.qc.chrnum.LD

#得到可作為群體結(jié)構(gòu)structure分析的編譯位點(diǎn)文件gwas.qc.chrnum.LD
(base) [wangtao01@sonmi:~/gwas/tana/gwas_workdir]$ ll -tr
total 643036
-rw-r--r-- 1 wangtao01 wangtao01      1041 Dec 18 17:07 sex.pheno.txt
-rw-r--r-- 1 wangtao01 wangtao01  81443920 Dec 18 17:09 gwas.qc.bim
-rw-r--r-- 1 wangtao01 wangtao01  22268443 Dec 18 17:09 gwas.qc.bed
-rw-r--r-- 1 wangtao01 wangtao01 455555358 Dec 18 17:09 gwas.qc.vcf
-rw-r--r-- 1 wangtao01 wangtao01       960 Dec 18 17:09 gwas.qc.nosex
-rw-r--r-- 1 wangtao01 wangtao01      1328 Dec 18 17:09 gwas.qc.log
-rw-r--r-- 1 wangtao01 wangtao01      1320 Dec 18 17:09 gwas.qc.fam
-rw-r--r-- 1 wangtao01 wangtao01       222 Dec 19 17:46 chr_map.txt
-rw-r--r-- 1 wangtao01 wangtao01  63629168 Dec 19 18:17 gwas.qc.chrnum.bim
lrwxrwxrwx 1 wangtao01 wangtao01        11 Dec 19 18:22 gwas.qc.chrnum.bed -> gwas.qc.bed
lrwxrwxrwx 1 wangtao01 wangtao01        11 Dec 19 18:22 gwas.qc.chrnum.fam -> gwas.qc.fam
-rw-r--r-- 1 wangtao01 wangtao01       960 Dec 19 18:41 gwas.qc.chrnum.ld.nosex
-rw-r--r-- 1 wangtao01 wangtao01   3033528 Dec 19 18:41 gwas.qc.chrnum.ld.prune.in
-rw-r--r-- 1 wangtao01 wangtao01  22100524 Dec 19 18:41 gwas.qc.chrnum.ld.prune.out
-rw-r--r-- 1 wangtao01 wangtao01      2426 Dec 19 18:41 gwas.qc.chrnum.ld.log
-rw-r--r-- 1 wangtao01 wangtao01       960 Dec 19 18:48 gwas.qc.chrnum.LD.nosex
-rw-r--r-- 1 wangtao01 wangtao01      1320 Dec 19 18:48 gwas.qc.chrnum.LD.fam
-rw-r--r-- 1 wangtao01 wangtao01   2698783 Dec 19 18:48 gwas.qc.chrnum.LD.bed
-rw-r--r-- 1 wangtao01 wangtao01   7686324 Dec 19 18:48 gwas.qc.chrnum.LD.bim
-rw-r--r-- 1 wangtao01 wangtao01       980 Dec 19 18:48 gwas.qc.chrnum.LD.log
  • ②親緣關(guān)系kinship計(jì)算
    使用GCTA軟件基于IBS的方式計(jì)算親緣關(guān)系,并行格式轉(zhuǎn)換成常規(guī)的對角矩陣文件
##1.使用GCTA軟件計(jì)算親緣關(guān)系
#注意加上-autosome-num 參數(shù),GCTA默認(rèn)按照人的染色體名稱,超過或者不匹配染色體名稱則會(huì)報(bào)錯(cuò)
gcta --bfile gwas.qc.chrnum \
     --make-grm-gz \
     --make-grm-alg 0 \
     --autosome-num 24 \
     --out gcta_kinship
#輸出如下兩個(gè)文件,.grm.gz即grm陣列(genetic relationship matrix)文件,需要轉(zhuǎn)成對角矩陣用于下游分析
-rw-r--r-- 1 wangtao01 wangtao01      9512 Dec 21 00:58 gcta_kinship.grm.gz
-rw-r--r-- 1 wangtao01 wangtao01       960 Dec 21 00:58 gcta_kinship.grm.id

##2.替換樣品名稱,轉(zhuǎn)成對角矩陣
#用如下腳本grm_to_matrix.py進(jìn)行替換,注意更改輸入文件前綴:
——————————————————————————————————————————————————————————————-————————————————————
python3 - <<'PY'
import gzip
import numpy as np

prefix = "gcta_kinship"   # ? 如果你的前綴不一樣,就改這里

# 1. 讀取樣本ID(行列名)
ids = []
with open(prefix + ".grm.id") as f:
    for line in f:
        fid, iid = line.rstrip("\n").split()[:2]
        ids.append(iid)

n = len(ids)
K = np.eye(n, dtype=np.float32)

# 2. 讀取 pairwise GRM
with gzip.open(prefix + ".grm.gz", "rt") as f:
    for line in f:
        i, j, _N, v = line.split()
        i = int(i) - 1
        j = int(j) - 1
        v = float(v)
        K[i, j] = v
        K[j, i] = v

# 3. 輸出方陣
out = prefix + ".matrix.tsv"
with open(out, "w") as w:
    w.write("\t" + "\t".join(ids) + "\n")
    for i, name in enumerate(ids):
        w.write(name + "\t" + "\t".join(f"{x:.6g}" for x in K[i]) + "\n")

print("Done! Output:", out)
PY
——————————————————————————————————————————————————————————————————————————————————
#格式轉(zhuǎn)換
python3  grm_to_matrix.py  
#最后拿到親緣關(guān)系G矩陣
-rw-r--r-- 1 wangtao01 wangtao01     18181 Dec 21 01:26 gcta_kinship.matrix.tsv
gcta_kinship.matrix.tsv
  • ③主成分分析PCA
    可使用plink直接分析,也可以用gcta生成親緣關(guān)系再計(jì)算PCA
##1.plink
#--pca設(shè)置幾個(gè)主成分,可設(shè)置樣本數(shù)相同的數(shù)量
plink --bfile gwas.qc.chrnum \
     --pca 10 \
     --out plink_PCA  \
     --allow-extra-chr --set-missing-var-ids @:#
#輸出主成分特征值和特征向量兩個(gè)主要文件,eigenvec即包括10列主成分值
-rw-r--r-- 1 wangtao01 wangtao01        78 Dec 21 02:53 plink_PCA.eigenval
-rw-r--r-- 1 wangtao01 wangtao01      4999 Dec 21 02:53 plink_PCA.eigenvec

##2.gcta
#先進(jìn)行親緣關(guān)系計(jì)算。其實(shí)plink也做了,兩步一起做了
#注意--make-grm參數(shù)而非--make-grm-gz
gcta --bfile gwas.qc.chrnum \
     --make-grm \
     --make-grm-alg 0 \
     --out gcta_PCA
#輸出
-rw-r--r-- 1 wangtao01 wangtao01      3280 Dec 21 03:03 gcta_PCA.grm.bin
-rw-r--r-- 1 wangtao01 wangtao01       960 Dec 21 03:03 gcta_PCA.grm.id
-rw-r--r-- 1 wangtao01 wangtao01      3280 Dec 21 03:03 gcta_PCA.grm.N.bin

#再進(jìn)行PCA分析
gcta --grm gcta_PCA \
     --pca 10 \
     --out gcta_PCA
#同樣輸出主成分特征值和特征向量兩個(gè)主要文件
-rw-r--r-- 1 wangtao01 wangtao01       339 Dec 21 03:10 gcta_PCA.eigenval
-rw-r--r-- 1 wangtao01 wangtao01      5026 Dec 21 03:10 gcta_PCA.eigenvec

特征向量文件每一列代表每種pca值,特征值文件每種PCA變異量

在文獻(xiàn)里找一個(gè)好看點(diǎn)的圖,截給大模型,讓其寫一個(gè)用于可視化的R腳本pca_plot_plink.R,畫出來效果不錯(cuò),挺還原的。另外我這里畫圖的時(shí)候,我重新創(chuàng)建了一個(gè)R環(huán)境r_gwas,專門用于畫圖,防止污染我常用的base主環(huán)境。

############################################################
## PLINK PCA visualization (PC1 vs PC2)
## Environment: conda env "r_gwas"
############################################################

suppressPackageStartupMessages({
  library(data.table)
  library(dplyr)
  library(ggplot2)
})

## ---------- 1. read eigenvec ----------
eigvec <- fread("plink_PCA.eigenvec", header = FALSE)
# FID IID PC1 PC2 ...
setnames(eigvec, c("FID", "IID", paste0("PC", 1:(ncol(eigvec) - 2))))

## ---------- 2. read eigenval & explained variance ----------
eigval <- scan("plink_PCA.eigenval", quiet = TRUE)
var_exp <- eigval / sum(eigval)

pc1_lab <- sprintf("PC1 (%.1f%%)", 100 * var_exp[1])
pc2_lab <- sprintf("PC2 (%.1f%%)", 100 * var_exp[2])

## ---------- 3. read phenotype (sex) ----------
pheno <- fread("sex.pheno.txt", header = FALSE)
setnames(pheno, c("FID", "IID", "sex"))

## ?? 按你的實(shí)際含義改這里
## 例如:1 = Female, 2 = Male(如果相反就調(diào)換)
pheno[, sex := factor(sex, levels = c(1, 2),
                      labels = c("Sex1", "Sex2"))]

## ---------- 4. merge ----------
df <- eigvec %>%
  inner_join(pheno, by = c("FID", "IID"))

## ---------- 5. PCA plot ----------
p <- ggplot(df, aes(x = PC1, y = PC2, color = sex)) +
  geom_hline(yintercept = 0, linetype = "dashed",
             linewidth = 0.4, color = "grey50") +
  geom_vline(xintercept = 0, linetype = "dashed",
             linewidth = 0.4, color = "grey50") +
  geom_point(size = 2.6, alpha = 0.9) +
  stat_ellipse(
    aes(fill = sex),
    geom = "polygon",
    type = "norm",
    alpha = 0.18,
    color = NA
  ) +
  labs(
    title = "PLINK PCA (PC1 vs PC2)",
    x = pc1_lab,
    y = pc2_lab,
    color = "Sex",
    fill  = "Sex"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid.minor = element_blank()
  )

## ---------- 6. save ----------
ggsave("PCA_PC1_PC2_sex.png", p, width = 7.2, height = 5.6, dpi = 300)
ggsave("PCA_PC1_PC2_sex.pdf", p, width = 7.2, height = 5.6)

message("Done: PCA_PC1_PC2_sex.png / PCA_PC1_PC2_sex.pdf")
  • ④群體結(jié)構(gòu)structure分析
    群體結(jié)構(gòu)分析可以推斷群體的分層關(guān)系,需要使用LD過濾后的數(shù)據(jù)gwas.qc.chrnum.LD進(jìn)行分析。
    用如下for循環(huán)生成一個(gè)K2到K10的admixture腳本admixture.jobs.sh,后面用parallel并行運(yùn)行。
#!/usr/bin/env bash
set -euo pipefail

PREFIX="gwas.qc.chrnum.LD"     # LD過濾后的bed前綴
K_MIN=2
K_MAX=10

# 輸出命令列表
OUT_SH="admixture.jobs.sh"
: > ${OUT_SH}

for K in $(seq ${K_MIN} ${K_MAX}); do
  echo "admixture --cv=10 -j4 ${PREFIX}.bed ${K} 1> logs/admixture.K${K}.log 2>&1" >> ${OUT_SH}
done

echo "[OK] wrote ${OUT_SH}"
echo "Run: mkdir -p logs && parallel -j ${K_MAX} < ${OUT_SH}"

parallel并行運(yùn)行,得到Q矩陣,以及判斷最合適(最?。㎏所用到的cv error值

mkdir -p logs
parallel -j 3 < admixture.jobs.sh

#輸出Q矩陣,Q值代表來源某一亞群的可能性,所有亞群加起來等于1
-rw-r--r-- 1 wangtao01 wangtao01       720 Dec 21 14:15 gwas.qc.chrnum.LD.2.Q
-rw-r--r-- 1 wangtao01 wangtao01   4857804 Dec 21 14:15 gwas.qc.chrnum.LD.2.P
-rw-r--r-- 1 wangtao01 wangtao01      1440 Dec 21 14:16 gwas.qc.chrnum.LD.4.Q
-rw-r--r-- 1 wangtao01 wangtao01   9715608 Dec 21 14:16 gwas.qc.chrnum.LD.4.P
-rw-r--r-- 1 wangtao01 wangtao01      1080 Dec 21 14:17 gwas.qc.chrnum.LD.3.Q
-rw-r--r-- 1 wangtao01 wangtao01   7286706 Dec 21 14:17 gwas.qc.chrnum.LD.3.P
-rw-r--r-- 1 wangtao01 wangtao01      1800 Dec 21 14:18 gwas.qc.chrnum.LD.5.Q
-rw-r--r-- 1 wangtao01 wangtao01  12144510 Dec 21 14:18 gwas.qc.chrnum.LD.5.P
-rw-r--r-- 1 wangtao01 wangtao01      2160 Dec 21 14:19 gwas.qc.chrnum.LD.6.Q
-rw-r--r-- 1 wangtao01 wangtao01  14573412 Dec 21 14:19 gwas.qc.chrnum.LD.6.P
-rw-r--r-- 1 wangtao01 wangtao01      2880 Dec 21 14:21 gwas.qc.chrnum.LD.8.Q
-rw-r--r-- 1 wangtao01 wangtao01  19431216 Dec 21 14:21 gwas.qc.chrnum.LD.8.P
drwxr-xr-x 2 wangtao01 wangtao01      4096 Dec 21 14:21 logs
-rw-r--r-- 1 wangtao01 wangtao01      2520 Dec 21 14:25 gwas.qc.chrnum.LD.7.Q
-rw-r--r-- 1 wangtao01 wangtao01  17002314 Dec 21 14:25 gwas.qc.chrnum.LD.7.P
-rw-r--r-- 1 wangtao01 wangtao01      3240 Dec 21 14:26 gwas.qc.chrnum.LD.9.Q
-rw-r--r-- 1 wangtao01 wangtao01  21860118 Dec 21 14:26 gwas.qc.chrnum.LD.9.P
-rw-r--r-- 1 wangtao01 wangtao01      3600 Dec 21 14:37 gwas.qc.chrnum.LD.10.Q
-rw-r--r-- 1 wangtao01 wangtao01  24289020 Dec 21 14:37 gwas.qc.chrnum.LD.10.P

head gwas.qc.chrnum.LD.4.Q
0.515412 0.330016 0.000010 0.154562
0.000010 0.634053 0.006361 0.359577
0.000010 0.855060 0.000010 0.144920
0.000010 0.436062 0.000010 0.563918
0.000010 0.000010 0.000010 0.999970
0.000010 0.822138 0.002708 0.175144
0.000010 0.879779 0.000010 0.120201
0.000010 0.808053 0.191927 0.000010
0.000010 0.862587 0.137393 0.000010
0.000010 0.816857 0.029244 0.153889

#提取log文件里的cv error值,最小的那個(gè)對應(yīng)著最合適的k值,這里結(jié)果K為2的時(shí)候最小
grep -h "CV error" logs/admixture.K*.log
CV error (K=10): 1.19629
CV error (K=2): 0.56840
CV error (K=3): 0.65797
CV error (K=4): 0.69925
CV error (K=5): 0.78295
CV error (K=6): 0.85496
CV error (K=7): 0.92420
CV error (K=8): 1.02147
CV error (K=9): 1.10976

同樣切到r_gwas環(huán)境,用genk里的一個(gè)R腳本進(jìn)行畫圖,這里調(diào)用的是pophelper包。

at draw_admixtureG.R
#!/usr/bin/env Rscript
# parse parameter ---------------------------------------------------------
library(argparser, quietly=TRUE)
# Create a parser
p <- arg_parser("draw sturcture figure for admixture result")

# Add command line arguments
p <- add_argument(p, "Qlist", help="input: Q matirx list file", type="character")
p <- add_argument(p, "sample", help="input: samplefile, ie  .fam", type="character")
p <- add_argument(p, "sample_order", help="input: sample order file with group, define order in output figure ", type="character")
p <- add_argument(p, "output", help="output prefix", type="character")

# Parse the command line arguments
argv <- parse_args(p)

sfiles <- argv$Qlist
samp <- argv$sample
ord  <- argv$sample_order
outpre <- argv$output


#sfiles <- "./demo.Q.list"
#samp <- "../00.prepare/demo.LD.fam"
#ord <- "../data/sample.pop"
#outpre <- "test"


library(pophelper)
library(ggplot2)


Qfiles <- read.table(sfiles, header = F)
qlist <- readQ(Qfiles$V1, filetype = "basic", indlabfromfile=F)

label <- read.table( samp, header = F,stringsAsFactors=F )

ordinfo <- read.table(ord, header = F, stringsAsFactors=F )
rownames(ordinfo) <- ordinfo$V1

for( i in 1:length(qlist) ){
    rownames(qlist[[i]]) <- label$V1
    qlist[[i]] <- qlist[[i]][as.character(ordinfo[,1]),]
}

# head(qlist[[3]])

xlist <- alignK(qlist)

## 每個(gè)K分別排序繪圖
p1_sort  <- plotQ(
      xlist,
      sortind = "all",  ## ind每個(gè)K分別排序 
      imgoutput = "join",  ## 一張圖還是多張
      returnplot=T,  
      exportplot=F,  
      useindlab = T ,    ## 顯示ind名稱
      sharedindlab= F ,  ## ind出現(xiàn)一次
      showindlab=T ,
      splab=paste0("K=",sapply(xlist,ncol))
      )

ggsave(filename = paste(outpre, "sorted.pdf",sep=".") , 
       p1_sort$plot[[1]],  
       width = 20,
       height = 10  
       )

## 按指定順序繪圖
p1_order  <- plotQ(
  xlist, 
  sortind = NA,  ## ind排序 
  imgoutput = "join",  ## 一張圖還是多張
  returnplot=T,  
  exportplot=F,
  useindlab = T ,    ## 顯示ind名稱
  sharedindlab= T ,   ## ind出現(xiàn)一次
  showindlab=T,
  splab=paste0("K=",sapply(xlist,ncol)), ## 只顯示K值
  grplab = ordinfo[,2,drop=FALSE], ## 添加grplab
  grplabsize=4,linesize=0.8,pointsize=4,
  ordergrp = F,  # 不安裝grp排序
  )

ggsave(filename = paste(outpre, "ordered.pdf",sep=".") , 
       p1_order$plot[[1]],  
       width = 20,
       height = 10  
)



## 按指順序
p1_group  <- plotQ(
  xlist, 
  sortind = NA,  ## ind排序 
  imgoutput = "join",  ## 一張圖還是多張
  returnplot=T,  
  exportplot=F,
  useindlab = T ,    ## 顯示ind名稱
  sharedindlab= T ,   ## ind出現(xiàn)一次
  showindlab=T,
  splab=paste0("K=",sapply(xlist,ncol)), ## 只顯示K值
  grplab = ordinfo[,2,drop=FALSE],
  grplabsize=4,linesize=0.8,pointsize=4,
  ordergrp = T)

ggsave(filename = paste(outpre, "orderedgroup.pdf",sep=".") , 
       p1_group$plot[[1]],  
       width = 20,
       height = 10  
)

這個(gè)腳本對你數(shù)據(jù)的輸入要求(需要準(zhǔn)備 3 個(gè)輸入文件)
它的命令行參數(shù)是:
1.Qlist:一個(gè)文件,里面每行一個(gè) .Q 文件路徑
2.sample:fam 文件(gwas.qc.chrnum.LD.fam)
3.sample_order:一個(gè)兩列文件:樣本ID+分組名
4.output:輸出前綴

#生成一個(gè)Q 文件路徑列表
ls gwas.qc.chrnum.LD.{2,3,4,5}.Q > gwas.Q.list

#fam文件已存在

#手寫或者用awk命令準(zhǔn)備一個(gè)sample_order文件
awk 'BEGIN{OFS="\t"}
     {g=($3==1?"Male":($3==2?"Female":"Unknown")); print $1,g}' sex.pheno.txt \
| sort -k2,2 -k1,1 > sample.pop.order


##運(yùn)行
Rscript 2.structure_plot_admixture.R \
  gwas.Q.list \
  gwas.qc.chrnum.LD.fam \
  sample.pop.order \
  structure
#輸出3個(gè)pdf文件
-rw-r--r--  1 wangtao01 wangtao01       9780 Dec 22 00:15 structure.sorted.pdf
-rw-r--r--  1 wangtao01 wangtao01       9477 Dec 22 00:15 structure.ordered.pdf
-rw-r--r--  1 wangtao01 wangtao01       9477 Dec 22 00:15 structure.orderedgroup.pdf
  • ⑤連瑣不平衡衰減分析LD decay
    LD 衰減分析可以查看整個(gè)群體及亞群的變異位點(diǎn)間的連鎖水平。
    軟件:PopLDdecay ;conda安裝的找不到畫圖腳本,所以去GitHub重新下載安裝了。
    官網(wǎng):https://github.com/BGI-shenzhen/PopLDdecay
    分析還是分兩步:1.PopLDdecay程序計(jì)算群體內(nèi)兩兩位點(diǎn)間的r^2值;2. Plot_OnePop.pl畫圖。如果有多個(gè)群體就分別進(jìn)行r^2值計(jì)算,再用Plot_MultiPop.pl一起可視化
#1.計(jì)算位點(diǎn)間的r^2值
#分亞群的先生成一個(gè)亞群的樣本列表文件,然后加參數(shù) -SubPop
~/software/PopLDdecay/bin/PopLDdecay \
  -InVCF  gwas.qc.vcf \
  -SubPop female.list \
  -MaxDist 500 \
  -OutStat pop.Female.LDdecay   

~/software/PopLDdecay/bin/PopLDdecay \
  -InVCF  gwas.qc.vcf \
  -SubPop male.list \
  -MaxDist 500 \
  -OutStat pop.Male.LDdecay   

#輸出.stat.gz結(jié)果
-rw-r--r--  1 wangtao01 wangtao01    5633612 Dec 23 03:15 pop.Female.LDdecay.stat.gz
-rw-r--r--  1 wangtao01 wangtao01    5639255 Dec 23 03:15 pop.Male.LDdecay.stat.gz

gzip -dc pop.Female.LDdecay.stat.gz |head -n 10
#Dist   Mean_r^2        Mean_D' Sum_r^2 Sum_D'  NumberPairs
1       0.7804  NA      47608.7121      NA      61006
2       0.7489  NA      26949.2520      NA      35983
3       0.7270  NA      23488.2912      NA      32310
4       0.7199  NA      22281.5461      NA      30949
5       0.7109  NA      21016.1581      NA      29561
6       0.7025  NA      21004.5115      NA      29901
7       0.6942  NA      19522.0154      NA      28123
8       0.6894  NA      19483.0819      NA      28259
9       0.6822  NA      19381.8086      NA      28411


#2. 單個(gè)群體或者亞群的畫圖
~/software/PopLDdecay/bin/Plot_OnePop.pl \
  -inFile pop.Female.LDdecay.stat.gz  \
  -output pop.Female.LDdecay \
  -keepR #保留畫圖的R腳本,后續(xù)圖丑可以進(jìn)行調(diào)整

#3. 多個(gè)群體或者亞群的畫圖
#先準(zhǔn)備一個(gè)所以亞群LD分析結(jié)果文件列表Pop.ResultPath.list 
ls pop.*.stat.gz |awk -F"." '{ print $0"\t"$2 }' > Pop.ResultPath.list 

cat Pop.ResultPath.list
pop.Female.LDdecay.stat.gz      Female
pop.Male.LDdecay.stat.gz        Male

#畫圖
~/software/PopLDdecay/bin/Plot_MultiPop.pl \
  -inList  Pop.ResultPath.list  \
  -output pop.all.LDdecay \
  -keepR

  • ⑥連瑣不平衡區(qū)塊LD block分析
    全基因組水平的LD block分析用plink計(jì)算,計(jì)算較慢,同時(shí)也可設(shè)置--chr --from-bp --to-bp三個(gè)參數(shù)來限定染色體特定區(qū)域的LD block分析;染色體局部區(qū)域的連鎖情況使用LDBlockShow展示出圖,-SeleVar參數(shù)選擇d'或r^2值作為LD值。
##1 plink 計(jì)算全基因組LD block,另外--chr --from-bp   --to-bp 3個(gè)參數(shù)用來計(jì)算染色體局部連瑣情況
plink --vcf gwas.qc.vcf \
        --out plink_ldblock \
        --chr chr1A --from-bp 1  --to-bp 500000 \ 
        --blocks no-pheno-req \
        --allow-extra-chr  --set-missing-var-ids @:#
#輸出
-rw-r--r--  1 wangtao01 wangtao01        960 Dec 24 15:29 plink_ldblock.nosex
-rw-r--r--  1 wangtao01 wangtao01       4725 Dec 24 15:29 plink_ldblock.blocks.det
-rw-r--r--  1 wangtao01 wangtao01       2419 Dec 24 15:29 plink_ldblock.blocks
-rw-r--r--  1 wangtao01 wangtao01       1180 Dec 24 15:29 plink_ldblock.log

(main) [wangtao01@sonmi:~/gwas/tana/gwas_workdir]$ head plink_ldblock.blocks.det
 CHR          BP1          BP2           KB  NSNPS SNPS
chr1A        53391        55361        1.971     11 chr1A:53391|chr1A:53434|chr1A:53517|chr1A:53930|chr1A:53950|chr1A:54060|chr1A:54198|chr1A:54559|chr1A:54913|chr1A:54941|chr1A:55361
chr1A        58187        59474        1.288      6 chr1A:58187|chr1A:58477|chr1A:58478|chr1A:59167|chr1A:59283|chr1A:59474
chr1A        59600        59720        0.121      2 chr1A:59600|chr1A:59720
chr1A        60573        60605        0.033      2 chr1A:60573|chr1A:60605
chr1A        71835        72613        0.779      4 chr1A:71835|chr1A:72181|chr1A:72233|chr1A:72613
chr1A        73471        77176        3.706      3 chr1A:73471|chr1A:77097|chr1A:77176
chr1A        83372        90234        6.863      9 chr1A:83372|chr1A:84389|chr1A:84390|chr1A:86049|chr1A:86260|chr1A:87214|chr1A:87230|chr1A:88741|chr1A:90234
chr1A       107186       108656        1.471      2 chr1A:107186|chr1A:108656
chr1A       111987       112416         0.43      3 chr1A:111987|chr1A:112165|chr1A:112416

##2 LDBlockShow 用來進(jìn)行染色體局部LDblock的可視化,-SeleVar 用來指定LD值計(jì)算方法(1: D'   2: r^2)
#另外默認(rèn)輸出svg圖片,-OutPdf  -OutPng參數(shù)可用來轉(zhuǎn)化,但默認(rèn)會(huì)調(diào)用ImageMagick軟件,其依賴GLIBCXX較新版本,本服務(wù)器不支持,可用其他軟件如rsvg-convert替代。
#謹(jǐn)慎解決ImageMagick軟件報(bào)錯(cuò)問題,我吃了大虧,解決依賴沖突的時(shí)候把我base環(huán)境污染了,conda都用不了,得重新裝miniconda/miniforge。
LDBlockShow  -InVCF gwas.qc.vcf \
   -OutPut  LDBlockShow \
   -Region chr1A:300000-500000 \
   -SeleVar 2 \
   -OutPdf   -OutPng
#rsvg-convert軟件轉(zhuǎn)換圖片
rsvg-convert -o LDBlockShow.png LDBlockShow.svg
rsvg-convert -f pdf -o LDBlockShow.pdf LDBlockShow.svg   
     
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺,僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容