Multi-omics joint analysis: Matrix eQTL

I came across the Matrix eQTL package; see the paper "Matrix eQTL: ultra fast eQTL analysis via large matrix operations" (https://doi.org/10.1093/bioinformatics/bts163).

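eQTL (expression quantitative trait locus) analysis measures the transcript-SNP relationship, that is, it tests whether a SNP is associated with the expression of a gene. Because the number of tests is enormous, many studies restrict themselves to smaller datasets. The authors therefore developed Matrix eQTL to handle large datasets; it supports additive linear and ANOVA models with covariates, and it can test cis- and trans-eQTLs separately.
Matrix eQTL is reported to be faster than tools such as FastMap, Merlin and Plink. The run times quoted are FastMap 18.4 min, Merlin 12.3 min, Plink 9.0 min, Matrix eQTL 5.7 min and snpMatrix 3.3 min. It also takes a p-value threshold, and only associations whose p-value falls below that threshold are saved to the output.
It uses a linear regression model in which g is the gene expression and s is the SNP genotype.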
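As far as I can tell from the paper, the model referred to here is simple linear regression of the form g = α + βs + ε. The base-R sketch below is not the package's code. It uses made-up data (all names and values are assumptions for illustration) to show that the t-statistic of the slope from `lm()` equals a simple function of the sample correlation between g and s, which is the kind of identity that allows many tests to be expressed through matrix operations.

```r
# Hypothetical toy data for a single gene-SNP pair (not from the package)
set.seed(1)
n <- 50
s <- sample(0:2, n, replace = TRUE)   # genotype coded 0, 1, 2
g <- 0.5 * s + rnorm(n)               # expression with a modest SNP effect

# t-statistic of the slope from an ordinary linear regression g ~ s
fit  <- summary(lm(g ~ s))
t_lm <- fit$coefficients["s", "t value"]

# The same statistic computed from the sample correlation alone
r     <- cor(g, s)
t_cor <- r * sqrt((n - 2) / (1 - r^2))

print(c(t_lm = t_lm, t_cor = t_cor))
```

Both numbers agree up to floating-point error, so for the simple model a correlation is all that is needed per gene-SNP pair.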

Documentation: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html
Sample data: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/R.html
The analysis is straightforward. First set the paths and names of the input files:

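install.packages("MatrixEQTL")
# Load the package so that modelLINEAR, SlicedData and Matrix_eQTL_engine are available
library(MatrixEQTL)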

# Set the data directory; the sample data ship in the package's installation directory.
base.dir = find.package("MatrixEQTL")

# Choose the model to fit
useModel = modelLINEAR; # modelANOVA or modelLINEAR or modelLINEAR_CROSS
# Genotype (SNP) file name
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
# Gene expression file name
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
# Covariates file name
# Set to character() for no covariates
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

output_file_name = tempfile();
  • Three models are available
    (1) modelLINEAR
Model: useModel = modelLINEAR
Equation: expression = α + ∑k βk·covariatek + γ·genotype_additive
Testing for significance of: γ
Test statistic: t-statistic

(2) modelANOVA

Model: useModel = modelANOVA
Equation: expression = α + ∑k βk·covariatek + γ1·genotype_additive + γ2·genotype_dominant
Testing for significance of: (γ1,γ2) pair
Test statistic: F-statistic

(3) modelLINEAR_CROSS

Model: useModel = modelLINEAR_CROSS
Equation:
    expression = α + ∑k βk·covariatek + γ·genotype_additive + δ·genotype_additive·covariateK
Testing for significance of: δ
Test statistic: t-statistic
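To make the three models concrete, here is a rough base-R sketch that fits each of them to one gene-SNP pair with `lm()`. It only mirrors the equations above and is not how the package computes its statistics. The data are simulated, and coding the dominant term as a heterozygote indicator is my assumption; any coding that separates the three genotype classes yields the same F-statistic.

```r
# Hypothetical toy data for one gene-SNP pair (not from the package)
set.seed(2)
n <- 120
genotype   <- sample(0:2, n, replace = TRUE)  # additive coding: 0, 1, 2
covariate  <- rnorm(n)                        # a single covariate
expression <- 1 + 0.3 * covariate + 0.4 * genotype + rnorm(n)

# (1) modelLINEAR: t-test on the additive genotype coefficient (gamma)
fit_linear <- lm(expression ~ covariate + genotype)
t_linear   <- summary(fit_linear)$coefficients["genotype", "t value"]

# (2) modelANOVA: joint F-test on the additive and dominant terms
dominant  <- as.numeric(genotype == 1)  # assumed coding: heterozygote indicator
fit_null  <- lm(expression ~ covariate)
fit_anova <- lm(expression ~ covariate + genotype + dominant)
f_anova   <- anova(fit_null, fit_anova)$F[2]

# (3) modelLINEAR_CROSS: t-test on the genotype-by-covariate interaction (delta)
gxc       <- genotype * covariate
fit_cross <- lm(expression ~ covariate + genotype + gxc)
t_cross   <- summary(fit_cross)$coefficients["gxc", "t value"]

print(c(t_linear = t_linear, F_anova = f_anova, t_cross = t_cross))
```

With several covariates, the interaction in model (3) is taken with the last one (covariateK in the equation), so the order of rows in the covariates file matters for that model.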
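  • Note that a p-value threshold must be set here. In general, the larger the dataset, the smaller the threshold should be. As mentioned earlier, only associations that pass this threshold are reported, so a threshold that is too large slows the analysis down and produces a very large output. All results are written to output_file_name.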
pvOutputThreshold = 1e-2
# Error covariance matrix; numeric() means the identity matrix (the default, rarely changed)
errorCovariance = numeric()
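To get a feel for the threshold, a back-of-the-envelope calculation helps. The sketch below uses hypothetical dataset sizes (not the sample data) to estimate how many rows a given threshold would write if no SNP had any effect, and what a Bonferroni-style cutoff would be; Bonferroni is just one common choice here, not something the package prescribes.

```r
# Hypothetical dataset size, chosen only for illustration
n_snps  <- 5e5
n_genes <- 2e4
n_tests <- n_snps * n_genes   # every SNP is tested against every gene

# Under the null, p-values are uniform, so the expected number of
# saved rows is simply (number of tests) x (threshold)
expected_null_rows <- function(threshold) n_tests * threshold

print(expected_null_rows(1e-2))   # far too many rows to store
print(expected_null_rows(1e-8))   # a manageable number

# A Bonferroni-style cutoff at a family-wise level of 0.05
bonferroni_threshold <- 0.05 / n_tests
print(bonferroni_threshold)
```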
# Create a new SlicedData object to hold the matrix data, then describe the input file format
snps = SlicedData$new();
# Fields are separated by TAB
snps$fileDelimiter = "\t";      # the TAB character
# Missing values are coded as NA
snps$fileOmitCharacters = "NA"; # denote missing values;

snps$fileSkipRows = 1;          # one row of column labels
snps$fileSkipColumns = 1;       # one column of row labels
snps$fileSliceSize = 2000;      # read file in pieces of 2,000 rows
snps$LoadFile( SNP_file_name );

## Load gene expression data

gene = SlicedData$new();
gene$fileDelimiter = "\t";      # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1;          # one row of column labels
gene$fileSkipColumns = 1;       # one column of row labels
gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);

## Load covariates

cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t";      # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1;          # one row of column labels
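cvrt$fileSkipColumns = 1;       # one column of row labels
# Load the covariates only when a file name was given; otherwise cvrt stays empty
if(length(covariates_file_name) > 0) {
  cvrt$LoadFile(covariates_file_name);
}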

Take a look at the file formats: the SNP file codes genotypes as 0, 1, 2, the gene file holds expression values, and cvrt holds the covariates:


[Images: previews of the SNP, gene expression and covariates files]
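For readers without the screenshots, the layout implied by the settings above (`fileSkipRows = 1`, `fileSkipColumns = 1`, TAB delimiter, `NA` for missing values) can be reproduced with a tiny made-up file. The identifiers and values below are invented; the sketch only writes and re-reads the file with base R and does not call the package.

```r
# A hypothetical two-SNP, three-sample genotype table
snp_toy <- data.frame(
  snpid  = c("Snp_01", "Snp_02"),  # one column of row labels
  Sam_01 = c(2, 0),
  Sam_02 = c(NA, 1),               # a missing genotype
  Sam_03 = c(1, 1)
)

# Write it as a TAB-delimited text file with one header row
toy_file <- tempfile(fileext = ".txt")
write.table(snp_toy, toy_file, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "NA")

# Show the raw text, then read it back as a labelled numeric table
cat(readLines(toy_file), sep = "\n")
toy_back <- read.table(toy_file, header = TRUE, sep = "\t", row.names = 1)
print(toy_back)

unlink(toy_file)
```

The expression and covariates files follow the same shape: one row per variable, one column per sample.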

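Once the files are loaded, run the eQTL analysis with the main function Matrix_eQTL_engine: snps takes the genotype data, gene the expression data, and cvrt the covariates. Every SNP row is then paired with every gene row and tested by linear regression.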

me = Matrix_eQTL_engine(
    snps = snps,
    gene = gene,
    cvrt = cvrt,
    output_file_name = output_file_name,
    pvOutputThreshold = pvOutputThreshold,
    useModel = useModel, 
    errorCovariance = errorCovariance, 
    verbose = TRUE,
    pvalue.hist = TRUE,
    min.pv.by.genesnp = FALSE,
    noFDRsaveMemory = FALSE);

After the run, the returned me object is a list.

[Image: structure of the me list]

Each eQTL row in the output file contains: SNP name, transcript name, estimate of the effect size, t- or F-statistic, p-value, and FDR.
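The FDR column deserves a word, because only the associations below the threshold are saved while the correction has to account for all tests performed. My understanding is that a Benjamini-Hochberg adjustment over the total number of tests is used; treat that as an assumption. The base-R sketch below shows the idea on simulated p-values with `p.adjust()` and its `n` argument.

```r
# Simulated p-values: mostly null, plus a few strong signals (made-up data)
set.seed(3)
n_tests <- 10000
p_all   <- c(runif(n_tests - 20), runif(20, 0, 1e-6))

# Keep only the associations that pass the output threshold
threshold <- 1e-2
p_kept    <- sort(p_all[p_all < threshold])

# Benjamini-Hochberg adjustment that still counts every test performed,
# not just the ones that were kept
fdr_kept <- p.adjust(p_kept, method = "BH", n = n_tests)

print(head(data.frame(pvalue = p_kept, FDR = fdr_kept)))
```

Adjusting only over the saved rows (leaving out `n = n_tests`) would understate the FDR, which is why the total number of tests has to be carried along.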

Matrix eQTL can distinguish cis (local) from trans (distant) eQTLs, mainly through the Matrix_eQTL_main function. It takes the following additional parameters:

*   `pvOutputThreshold.cis` – p-value threshold for cis-eQTLs.
*   `output_file_name.cis` – detected cis-eQTLs are saved in this file.
*   `cisDist` – maximum distance at which gene-SNP pair is considered local.
*   `snpspos` – data frame with information about SNP locations, must have 3 columns - SNP name, chromosome, and position. See [sample SNP location file](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/Sample_Data/snpsloc.txt).
*   `genepos` – data frame with information about gene locations, must have 4 columns - the name, chromosome, and positions of the left and right ends. See [sample gene location file](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/Sample_Data/geneloc.txt).
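The role of `cisDist`, `snpspos` and `genepos` can be sketched in a few lines of base R. The positions and column names below are invented, and the exact boundary handling inside the package may differ; the point is only that a pair counts as local when the SNP lies on the gene's chromosome and within `cisDist` of either end of the gene.

```r
# Hypothetical location tables in the 3-column and 4-column layouts described above
snpspos <- data.frame(
  snp = c("Snp_01", "Snp_02", "Snp_03"),
  chr = c("chr1", "chr1", "chr2"),
  pos = c(721289, 5000000, 800000),
  stringsAsFactors = FALSE
)
genepos <- data.frame(
  geneid = c("Gene_01", "Gene_02"),
  chr    = c("chr1", "chr2"),
  left   = c(1000000, 700000),
  right  = c(1010000, 720000),
  stringsAsFactors = FALSE
)
cisDist <- 1e6

# Enumerate every SNP-gene pair
idx <- expand.grid(i = seq_len(nrow(snpspos)), j = seq_len(nrow(genepos)))
s <- snpspos[idx$i, ]
g <- genepos[idx$j, ]

# Local (cis) if on the same chromosome and within cisDist of the gene body
is_cis <- s$chr == g$chr &
  s$pos >= g$left - cisDist &
  s$pos <= g$right + cisDist

pairs <- data.frame(snp = s$snp, gene = g$geneid,
                    type = ifelse(is_cis, "cis", "trans"),
                    stringsAsFactors = FALSE)
print(pairs)
```

Because the two groups contain very different numbers of tests, separate p-value thresholds for cis and trans pairs are the natural consequence.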

Here is the full code:

# source("Matrix_eQTL_R/Matrix_eQTL_engine.r");
library(MatrixEQTL)

## Location of the package with the data files.
base.dir = find.package('MatrixEQTL');
# base.dir = '.';

## Settings

# Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS
useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS

# Genotype file name
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");

# Gene expression file name
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");

# Covariates file name
# Set to character() for no covariates
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

# Output file name
output_file_name_cis = tempfile();
output_file_name_tra = tempfile();

# Only associations significant at this level will be saved
pvOutputThreshold_cis = 2e-2;
pvOutputThreshold_tra = 1e-2;

# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric();
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");

# Distance for local gene-SNP pairs
cisDist = 1e6;

## Load genotype data

snps = SlicedData$new();
snps$fileDelimiter = "\t";      # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1;          # one row of column labels
snps$fileSkipColumns = 1;       # one column of row labels
snps$fileSliceSize = 2000;      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);

## Load gene expression data

gene = SlicedData$new();
gene$fileDelimiter = "\t";      # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1;          # one row of column labels
gene$fileSkipColumns = 1;       # one column of row labels
gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);

## Load covariates

cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t";      # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1;          # one row of column labels
cvrt$fileSkipColumns = 1;       # one column of row labels
if(length(covariates_file_name) > 0) {
  cvrt$LoadFile(covariates_file_name);
}

## Run the analysis
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);

me = Matrix_eQTL_main(
  snps = snps,
  gene = gene,
  cvrt = cvrt,
  output_file_name      = output_file_name_tra,
  pvOutputThreshold     = pvOutputThreshold_tra,
  useModel = useModel,
  errorCovariance = errorCovariance,
  verbose = TRUE,
  output_file_name.cis  = output_file_name_cis,
  pvOutputThreshold.cis = pvOutputThreshold_cis,
  snpspos = snpspos,
  genepos = genepos,
  cisDist = cisDist,
  pvalue.hist = "qqplot",
  min.pv.by.genesnp = FALSE,
  noFDRsaveMemory = FALSE);

unlink(output_file_name_tra);
unlink(output_file_name_cis);

## Results:

cat('Analysis done in: ', me$time.in.sec, ' seconds', '\n');
cat('Detected local eQTLs:', '\n');
show(me$cis$eqtls)
cat('Detected distant eQTLs:', '\n');
show(me$trans$eqtls)

## Plot the Q-Q plot of local and distant p-values

plot(me)

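So, to analyze your own data you need to prepare five files:

*   genotype
*   expression
*   covariates
*   gene location
*   SNP location

In the first three, the sample columns must correspond to one another and appear in the same order.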
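Checking the sample order is easy to forget, so here is a small base-R sketch of one way to do it before writing the files. The matrices and names are invented for illustration; the approach is simply to keep the samples shared by all three tables and put the columns in one common order.

```r
# Hypothetical matrices whose sample columns are in different orders
samples <- c("Sam_01", "Sam_02", "Sam_03")
geno  <- matrix(c(0, 1, 2, 2, 1, 0), nrow = 2, byrow = TRUE,
                dimnames = list(c("Snp_01", "Snp_02"), samples))
expr  <- matrix(c(5.1, 4.8, 6.0, 2.2, 2.9, 3.1), nrow = 2, byrow = TRUE,
                dimnames = list(c("Gene_01", "Gene_02"), rev(samples)))
covar <- matrix(c(1, 0, 1), nrow = 1,
                dimnames = list("gender", samples[c(2, 1, 3)]))

# Samples present in all three tables, in the order of the genotype table
shared <- Reduce(intersect,
                 list(colnames(geno), colnames(expr), colnames(covar)))

# Subset and reorder every table to that common sample order
geno  <- geno[,  shared, drop = FALSE]
expr  <- expr[,  shared, drop = FALSE]
covar <- covar[, shared, drop = FALSE]

# Fail early if the columns still disagree
stopifnot(identical(colnames(geno), colnames(expr)),
          identical(colnames(expr), colnames(covar)))
print(shared)
```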
The authors also provide code that generates simulated data:

# Create an artificial dataset and plot the histogram and Q-Q plot of all p-values
library('MatrixEQTL')

# Number of samples
n = 100;

# Number of variables
ngs = 2000;

# Common signal in all variables (population stratification)
pop = 0.2 * rnorm(n);

# data matrices
snps.mat = matrix(rnorm(n*ngs), ncol = ngs) + pop;
gene.mat = matrix(rnorm(n*ngs), ncol = ngs) + pop + snps.mat*((1:ngs)/ngs)^9/2;

# data objects for Matrix eQTL engine
snps1 = SlicedData$new( t( snps.mat ) );
gene1 = SlicedData$new( t( gene.mat ) );
cvrt1 = SlicedData$new( );
rm(snps.mat, gene.mat)

# Slice data in blocks of 500 variables
snps1$ResliceCombined(500);
gene1$ResliceCombined(500);

# name of temporary output file
filename = tempfile();

# Perform analysis recording information for 
# a histogram
meh = Matrix_eQTL_engine(
  snps = snps1,
  gene = gene1,
  cvrt = cvrt1,
  output_file_name = filename, 
  pvOutputThreshold = 1e-100, 
  useModel = modelLINEAR, 
  errorCovariance = numeric(), 
  verbose = TRUE,
  pvalue.hist = 100);
unlink( filename );
# png(filename = "histogram.png", width = 650, height = 650)
plot(meh, col="grey")
# dev.off();

# Perform the same analysis recording information for 
# a Q-Q plot
meq = Matrix_eQTL_engine(
  snps = snps1, 
  gene = gene1, 
  cvrt = cvrt1, 
  output_file_name = filename,
  pvOutputThreshold = 1e-6, 
  useModel = modelLINEAR, 
  errorCovariance = numeric(), 
  verbose = TRUE,
  pvalue.hist = "qqplot");
unlink( filename );
# png(filename = "QQplot.png", width = 650, height = 650)
plot(meq, pch = 16, cex = 0.7)
# dev.off();
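For anyone wondering what the Q-Q plot compares, the following base-R sketch builds the two axes by hand from simulated null p-values. It is not the package's plotting code, and the data are made up; it only shows that, when nothing is associated, observed and expected -log10 p-values line up along the diagonal.

```r
# Simulated p-values from a scenario with no true associations
set.seed(4)
p <- runif(5000)

# Observed quantiles: sorted p-values on the -log10 scale
observed <- -log10(sort(p))
# Expected quantiles of a uniform distribution, same scale
expected <- -log10(ppoints(length(p)))

# Draw the plot only in an interactive session
if (interactive()) {
  plot(expected, observed, pch = 16, cex = 0.7,
       xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
  abline(0, 1, col = "grey")
}

print(summary(observed - expected))
```

Points rising above the diagonal at the right-hand end indicate more small p-values than chance alone would give.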

