蛋白組學分析包——DEqMS學習

在初次接觸蛋白組學的數(shù)據(jù)之時,外觀上,其數(shù)據(jù)格式與我們常見的基因表達測序矩陣文件沒有什么的不同。事實上公司采用的差異蛋白分析方式也是最基礎(chǔ)的T-test,但由于蛋白和核苷酸相差甚遠且上機方法也大有不同,故懷著好奇上網(wǎng)沖浪。目前主要的蛋白組學分析R工具有如下三款:

①limma;②DEqMS;③DEP

本次主要對DEqMS展開學習:

tutorial官網(wǎng):http://www.bioconductor.org/packages/release/bioc/vignettes/DEqMS/inst/doc/DEqMS-package-vignette.html#download-and-read-the-input-protein-table

1 Overview

DEqMS builds on top of Limma, a widely-used R package for microarray data analysis (Smyth G. et al 2004), and improves it with proteomics data specific properties, accounting for variance dependence on the number of quantified peptides or PSMs for statistical testing of differential protein expression.

DEqMS是基于limma包建立的蛋白組分析R語言工具,教程摘要中PSMs的定義為:peptide spectrum matches——即“肽匹配圖譜”[1]

PSM理論解釋:為鑒定肽段匹配到數(shù)據(jù)庫內(nèi)的蛋白質(zhì)的理論酶切肽段圖譜數(shù)(或通過算法對二者相似度評分后,分值最高的理論肽段即作為鑒定結(jié)果),或顯示蛋白質(zhì)的已識別肽段序列數(shù)(包括多次被識別的序列)。故該方法在蛋白組檢測中既可以定性,又可以定量,但定性依賴于數(shù)據(jù)庫的數(shù)據(jù)構(gòu)成[2]。

2 prepare the input protein data

2.1下載文件——以TMT數(shù)據(jù)為例

url <- "https://ftp.ebi.ac.uk/pride-archive/2016/06/PXD004163/Yan_miR_Protein_table.flatprottable.txt"
download.file(url, destfile = "./miR_Proteintable.txt",method = "auto")
df.prot = read.table("miR_Proteintable.txt",stringsAsFactors = FALSE, header = TRUE, quote = "", comment.char = "",sep = "\t") 
df.prot

2.2數(shù)據(jù)過濾

# filter at 1% protein FDR and extract TMT quantifications
TMT_columns = seq(15,33,2)#選擇15-33列中、間隔1列的數(shù)據(jù),也就是所有的COUNT數(shù)據(jù)
dat = df.prot[df.prot$miR.FASP_q.value<0.01,TMT_columns]#篩選q值小于0.01的數(shù)據(jù)
rownames(dat) = df.prot[df.prot$miR.FASP_q.value<0.01,]$Protein.accession
# The protein dataframe is a typical protein expression matrix structure
# Samples are in columns, proteins are in rows
# use unique protein IDs for rownames
# to view the whole data frame, use the command View(dat)
View(dat)
dat

3.Data processing and grouping

3.1數(shù)據(jù)處理

dat.log = log2(dat)#數(shù)據(jù)取對數(shù)
dat.log = na.omit(dat.log)#remove rows with NAs
View(dat.log)
boxplot(dat.log,las=2,main="TMT10plex data PXD004163")#Use boxplot to check if the samples have medians centered. if not, do median centering.
# Here the data is already median centered, we skip the following step.
#由于數(shù)據(jù)表達量足夠一致,所以可以不進行標準化
# dat.log = equalMedianNormalization(dat.log)
boxplot
dat.log

3.2數(shù)據(jù)分組

# if there is only one factor, such as treatment. You can define a vector with
# the treatment group in the same order as samples in the protein table.
cond = as.factor(c("ctrl","miR191","miR372","miR519","ctrl", "miR372",
"miR519","ctrl","miR191","miR372"))

# The function model.matrix is used to generate the design matrix
design = model.matrix(~0+cond) # 0 means no intercept for the linear model
#這里是設(shè)計一個對比矩陣,其做法原理可以詳見:https://treeh.cn/?id=21

colnames(design) = gsub("cond","",colnames(design))#去除表格中的cond字符
View(design)
design
# you can define one or multiple contrasts here
#設(shè)計哪些組間需要對比
x <- c("miR372-ctrl","miR519-ctrl","miR191-ctrl", "miR372-miR519",
"miR372-miR191","miR519-miR191")

contrast =  makeContrasts(contrasts=x,levels=design)
#按照x的對比方式,對design的樣本分組對樣本進行比較
View(contrast)
contrast
fit1 <- lmFit(dat.log, design)#線性擬合模型構(gòu)建
fit2 <- contrasts.fit(fit1,contrasts = contrast)#Compute Contrasts from Linear Model Fit
#Given a linear model fit to microarray data, compute estimated coefficients and standard errors for a given set of contrasts.
fit3 <- eBayes(fit2)#Empirical Bayes Statistics for Differential Expression
#Given a linear model fit from lmFit, compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors towards a global value.
#具體介紹參考自help(contrasts.fit)、help(eBayes)
fit對比

4.DEqMS analysis

以上的研究是基于limma進行的,以下的教程將以DEqMS包構(gòu)建的方法,使用實驗內(nèi)和實驗間用于量化的最小PSM數(shù)量來模擬方差和PSM數(shù)量之間的關(guān)系。

# assign a extra variable `count` to fit3 object, telling how many PSMs are quantifed for each protein
library(matrixStats)
count_columns = seq(16,34,2)#選擇16-34列中、間隔1列的數(shù)據(jù),也就是所有的PSMs數(shù)據(jù)
psm.count.table = data.frame(count = rowMins(as.matrix(df.prot[,count_columns])), row.names =  df.prot$Protein.accession)
#rowMins: Calculates the minimum for each row (column) of a matrix-like object
fit3$count = psm.count.table[rownames(fit3$coefficients),"count"]#數(shù)據(jù)導入fit3中
fit4 = spectraCounteBayes(fit3)
#Peptide/Spectra Count Based Empirical Bayes Statistics for Differential Expression. This function is to calculate peptide/PSM count adjusted t-statistics, p-values.
head(psm.count.table)
fit3
fit4
psm.count.table

fit4新增內(nèi)容解釋:
Outputs of spectraCounteBayes:
object is augmented form of “fit” object from eBayes in Limma, with the additions being:
sca.t - Spectra Count Adjusted posterior t-value
sca.p - Spectra Count Adjusted posterior p-value
sca.dfprior - DEqMS estimated prior degrees of freedom
sca.priorvar- DEqMS estimated prior variance
sca.postvar - DEqMS estimated posterior variance
model - fitted model

# n=30 limits the boxplot to show only proteins quantified by <= 30 PSMs.
VarianceBoxplot(fit4,n=30,main="TMT10plex dataset PXD004163",xlab="PSM count")
#橫坐標為不同PSM count值,縱坐標為log值,但是是什么數(shù)據(jù)的Log值并不清楚
#查看VarianceBoxplot源碼,x軸y軸輸入如下
#x <- fit$count
#y <- fit$sigma^2,這個sigma值不知道是什么,應該是涉及上機檢測的數(shù)據(jù)?在公司給的數(shù)據(jù)里面好像有見到類似值?等待一個解答***
VarianceBoxplot
VarianceScatterplot(fit4,main="TMT10plex dataset PXD004163")
VarianceScatterplot

5.extract the result as a data frame and save

#if you are not sure which coef_col refers to the specific contrast,type
#查看我們設(shè)定的對照組
head(fit4$coefficients)
對照組查看
#提取第一個對比組
DEqMS.results = outputResult(fit4,coef_col = 1)#miR372-ctrl
#提取第二個對比組
#DEqMS.results = outputResult(fit4,coef_col = 2)#miR519-ctrl
#a quick look on the DEqMS results table
head(DEqMS.results)
# Save it into a tabular text file
write.table(DEqMS.results,"DEqMS.results.miR372-ctrl.txt",sep = "\t", row.names = F,quote=F)
差異蛋白結(jié)果

表格列名的解釋(關(guān)鍵可看logFC、adjp、sca.adj.p值):

Explaination of the columns in DEqMS.results:

  1. logFC - log2 fold change between two groups, Here it’s log2(miR372/ctrl).
  2. AveExpr - the mean of the log2 ratios/intensities across all samples. Since input matrix is log2 ratio values, it is the mean log2 ratios of all samples.
  3. t - Limma output t-statistics
  4. P.Value- Limma p-values
  5. adj.P.Val - BH method adjusted Limma p-values
  6. B - Limma B valuescount - PSM/peptide
  7. count values you assigned
  8. sca.t - DEqMS t-statistics
  9. sca.P.Value - DEqMS p-values
  10. sca.adj.pval - BH method adjusted DEqMS p-values

6.Other analysis

官網(wǎng)還提供了熱圖可視化差異蛋白的方法(ggplot2),以及分析label-free數(shù)據(jù)的教程(和TMT)教程類似。還對t檢驗/limma/Anova差異分析的結(jié)果與DEqMS的結(jié)果進行了對照研究,可見DEqMS的包是較為可靠的。

image

7.總結(jié)

DEqMS是一個較為簡單易學的差異分析包,難度大的地方還是在對于其實驗的一些概念,以及一些數(shù)學方面的原理。當然作為臨床與生信人,這都不是我們需要去深刻探究的問題,至少先學會應用于基礎(chǔ)概念。

目前公司上機處理后會直接提供整理好的表達矩陣文件,不會提供PSM矩陣,但和公司溝通后還是拿得到PSM矩陣進行分析的。

本文為個人學習筆記,如有引起任何侵權(quán)問題,請及時與我聯(lián)系,謝謝。

參考文獻:

[1]Zhu Y, Orre LM, Zhou Tran Y, Mermelekas G, Johansson HJ, Malyutina A, Anders S, Lehti? J. DEqMS: A Method for Accurate Variance Estimation in Differential Protein Expression Analysis. Mol Cell Proteomics. 2020 Jun;19(6):1047-1057. doi: 10.1074/mcp.TIR119.001646. Epub 2020 Mar 23. PMID: 32205417; PMCID: PMC7261819.

[2]徐洪凱,閆克強,何燕斌,聞博,楊煥明,劉斯奇.宏蛋白質(zhì)組學信息分析的基本策略及其挑戰(zhàn)[J].生物化學與生物物理進展,2018,45(01):23-35.DOI:10.16476/j.pibb.2017.0187.

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容