GSVA的用途
一般我們做GSEA都是先進(jìn)行差異基因分析,然后取差異倍數(shù)排序結(jié)果進(jìn)行GSEA。但如果你沒有條件進(jìn)行差異基因分析也可以進(jìn)行GSEA,這就是ssGSEA(single sample GSEA, 單樣品GSEA)對(duì)每個(gè)樣品進(jìn)行GSEA,R包GSVA(Gene Set Variation Analysis)可以用來進(jìn)行ssGSEA,GSVA可以將表達(dá)矩陣轉(zhuǎn)換成通路富集分?jǐn)?shù)(ES)矩陣
Cite:GSVA + limma進(jìn)行差異通路分析
也就是說GSVA最終的結(jié)果得到的是一個(gè)GSVA矩陣,這個(gè)矩陣衡量的是pathway的"表達(dá)矩陣",GSVA所分析的是pathway(功能)上的表達(dá)情況
GSVA計(jì)算原理
可參考文獻(xiàn):《GSVA: gene set variation analysis for microarray and RNA-Seq data》
第一步:
對(duì)于RNA-seq的原始count數(shù)據(jù)我們需要利用泊松核函數(shù)計(jì)算Fr:

其中:
- xij 是第 j 個(gè)sample中第 i 個(gè)基因的原始count數(shù)
- n為總的sample數(shù)
- k與sample數(shù)有關(guān)系,k=1...n
- y=1...xij
- xik 為第 k 個(gè)sample中第 i 個(gè)基因的count值
- r = 0.5
對(duì)于芯片數(shù)據(jù),首先得對(duì)芯片的表達(dá)矩陣取一個(gè)log2,然后利用高斯核函數(shù)計(jì)算Fhi:

其中:
- xij 是第 j 個(gè)sample中第 i 個(gè)基因的log2表達(dá)值
- n為總的sample數(shù)
- k與sample數(shù)有關(guān)系,k=1...n
- t 為形式變量
- xik 為第 k 個(gè)sample中第 i 個(gè)基因的log2表達(dá)值
- hi = si / 4,si 為第 i 個(gè)基因在所有sample中表達(dá)量的標(biāo)準(zhǔn)差;hi可以控制分辨率
第二步:
我們把計(jì)算出來的 Fr 和 Fhi 稱為zij,那么每一個(gè)sample每一個(gè)基因都會(huì)有對(duì)應(yīng)的 Fr 和 Fhi 的值,接下來對(duì)于每一個(gè)sample由zij的大小進(jìn)行排序

即對(duì)于每一個(gè)sample,zij大的排前面,小的排后面,排完序后將zij改寫為z(i)j,z(i)j定義為排過序的zij;并定義:rij = | p/2 - z(i)j |,其中 p 為表達(dá)矩陣中所有的基因數(shù)
第三步:
這一步是利用 Kolmogorov-Smirnov 計(jì)算富集分?jǐn)?shù),首先計(jì)算vjk

其中:
- τ 取 1 為權(quán)重參數(shù)
- rij 為第二步所定義,rij = | p/2 - z(i)j |
- p 為表達(dá)矩陣中所有的基因數(shù)
- ? 代表的是排序后從i = 1...? 的基因
- γk 代表第 k 個(gè)gene set ,類似于GSEA,每個(gè)gene set 對(duì)應(yīng)一個(gè)功能(pathway)
- I 其實(shí)是判別函數(shù),當(dāng)g(i)∈γk,值為1;當(dāng)g(i)不屬于γk,值為0
那么富集分?jǐn)?shù)怎么計(jì)算呢?類似于GSEA:
對(duì)于某一個(gè)sample j,定義富集分?jǐn)?shù)為vjk(?)所構(gòu)成的集合中(? = 1...p)的最大值減去vjk(?)所構(gòu)成的集合中(? = 1...p)的最小值,用它們的差值作為富集分?jǐn)?shù)。
其數(shù)學(xué)表達(dá)式為:

(加 0 是為了做矯正)
GSVA的簡(jiǎn)單運(yùn)用
這一塊主要看官網(wǎng),官方鏈接:GSVA官網(wǎng)
其主函數(shù)為:
gsva(expr, gset.idx.list, annotation,
method=c("gsva", "ssgsea", "zscore", "plage"),
kcdf=c("Gaussian", "Poisson", "none"),
abs.ranking=FALSE,
min.sz=1,
max.sz=Inf,
parallel.sz=0,
parallel.type="SOCK",
mx.diff=TRUE,
tau=switch(method, gsva=1, ssgsea=0.25, NA),
ssgsea.norm=TRUE,
verbose=TRUE)
參數(shù)說明:
expr:Gene expression data which can be given either as an ExpressionSet object or as a matrix of expression values where rows correspond to genes and columns correspond to samples.
gset.idx.list:Gene sets provided either as a list object or as a GeneSetCollection object.
annotation:In the case of calling gsva() with expression data in a matrix and gene sets as a GeneSetCollection object, the annotation argument can be used to supply the name of the Bioconductor package that contains annotations for the class of gene identifiers occurring in the row names of the expression data matrix. By default gsva() will try to match the identifiers in expr to the identifiers in gset.idx.list just as they are, unless the annotation argument is set.
method: Method to employ in the estimation of gene-set enrichment scores per sample. By default this is set to gsva (H?nzelmann et al, 2013) and other options are ssgsea (Barbie et al, 2009), zscore (Lee et al, 2008) or plage (Tomfohr et al, 2005). The latter two standardize first expression profiles into z-scores over the samples and, in the case of zscore, it combines them together as their sum divided by the square-root of the size of the gene set, while in the case of plage they are used to calculate the singular value decomposition (SVD) over the genes in the gene set and use the coefficients of the first right-singular vector as pathway activity profile.
kcdf:Character string denoting the kernel to use during the non-parametric estimation of the cumulative distribution function of expression levels across samples when method="gsva". By default, kcdf="Gaussian" which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to kcdf="Poisson".
abs.ranking:Flag used only when mx.diff=TRUE. When abs.ranking=FALSE (default) a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When abs.ranking=TRUE the original Kuiper statistic that sums the largest positive and negative random walk deviations, is used. In this latter case, gene sets with genes enriched on either extreme (high or low) will be regarded as 'highly' activated.
min.sz:Minimum size of the resulting gene sets.
max.sz: Maximum size of the resulting gene sets.
parallel.sz:Number of processors to use when doing the calculations in parallel. This requires to previously load either the parallel or the snow library. If parallel is loaded and this argument is left with its default value (parallel.sz=0) then it will use all available core processors unless we set this argument with a smaller number. If snow is loaded then we must set this argument to a positive integer number that specifies the number of processors to employ in the parallel calculation.
parallel.type: Type of cluster architecture when using snow.
mx.diff :Offers two approaches to calculate the enrichment statistic (ES) from the KS random walk statistic. mx.diff=FALSE: ES is calculated as the maximum distance of the random walk from 0. mx.diff=TRUE (default): ES is calculated as the magnitude difference between the largest positive and negative random walk deviations.
tau:Exponent defining the weight of the tail in the random walk performed by both the gsva (H?nzelmann et al., 2013) and the ssgsea (Barbie et al., 2009) methods. By default, this tau=1 when method="gsva" and tau=0.25 when method="ssgsea" just as specified by Barbie et al. (2009) where this parameter is called alpha.
ssgsea.norm:Logical, set to TRUE (default) with method="ssgsea" runs the SSGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper. When ssgsea.norm=FALSE this last normalization step is skipped.
verbose:Gives information about each calculation step. Default: FALSE.
其中重點(diǎn)說一下參數(shù)gset.idx.list,這個(gè)其實(shí)就是功能注釋文件gmt,這個(gè)可以去GSEA官網(wǎng)下載:msigdb
這個(gè)數(shù)據(jù)庫的注釋是基于人的,如果是其他物種需要做同源轉(zhuǎn)換
一共有這么幾類:

而kcdf則是你要進(jìn)行轉(zhuǎn)換的核函數(shù)