前言

我們?cè)谑褂肈ESeq2做下游分析的時(shí)候，往往會(huì)利用到DESeq2的標(biāo)準(zhǔn)化函數(shù)：rlog和vst函數(shù)進(jìn)行標(biāo)準(zhǔn)化，而這兩種標(biāo)準(zhǔn)化方式是有一定區(qū)別的：

rlog要求

rlog更適用于樣本量小于30的數(shù)據(jù)，相應(yīng)的vst負(fù)責(zé)的是樣本量大于30的數(shù)據(jù)

負(fù)二項(xiàng)分布模型

對(duì)于計(jì)數(shù)型響應(yīng)變量，類別（因子類）決策變量，往往利用Poisson回歸來擬合模型，但是Poisson回歸對(duì)離散值的要求很大，因此大多少情況下采用的是負(fù)二項(xiàng)回歸來擬合模型，而經(jīng)過負(fù)二項(xiàng)分布擬合的data的系數(shù)通常采用的是wald test來檢驗(yàn)其顯著性

那么判斷組別間的差異，可以在兩個(gè)組之間建立線性模型（負(fù)二項(xiàng)回歸模型）來判斷：

以下3個(gè)回歸系數(shù)均是與control組進(jìn)行比較，那么系數(shù)為正數(shù)，則代表相對(duì)于control組是上調(diào)的；為負(fù)值，則代表相對(duì)于control組是下調(diào)的

rlog核心函數(shù)

我們先看看整體的rlog函數(shù)布局：

  rld <- rlogData(object, intercept, betaPriorVar)
  se <- SummarizedExperiment(
           assays = rld,
           colData = colData(object),
           rowRanges = rowRanges(object),
           metadata = metadata(object))
  dt <- DESeqTransform(se)
  attr(dt,"betaPriorVar") <- attr(rld, "betaPriorVar")
  if (!is.null(attr(rld,"intercept"))) {
    mcols(dt)$rlogIntercept <- attr(rld,"intercept")
  }
  dt

通過源代碼我們知道，我們要獲得Normalized的data，其核心代碼為：rld <- rlogData(object, intercept, betaPriorVar)，而rld返回的即為我們的Normalized table

因此我們有必要查看一下rlogData的代碼：

  lambda <- 1/rep(betaPriorVar,ncol(modelMatrix))
  # except for intercept which we set to wide prior
  if ("Intercept" %in% modelMatrixNames) {
    lambda[which(modelMatrixNames == "Intercept")] <- 1e-6
  }
  
  fit <- fitNbinomGLMs(object=objectNZ, modelMatrix=modelMatrix,
                       lambda=lambda, renameCols=FALSE,
                       alpha_hat=mcols(objectNZ)$dispFit,
                       betaTol=1e-4, useOptim=FALSE,
                       useQR=TRUE)
  normalizedDataNZ <- t(modelMatrix %*% t(fit$betaMatrix))

  normalizedData <- buildMatrixWithZeroRows(normalizedDataNZ, mcols(object)$allZero)

  # add back in the intercept, if finite
  if (!missing(intercept)) {
    normalizedData <- normalizedData + ifelse(infiniteIntercept, 0, intercept)
  }
  rownames(normalizedData) <- rownames(object)
  colnames(normalizedData) <- colnames(object)
  attr(normalizedData,"betaPriorVar") <- betaPriorVar
  if ("Intercept" %in% modelMatrixNames) {
    fittedInterceptNZ <- fit$betaMatrix[,which(modelMatrixNames == "Intercept"),drop=FALSE]
    fittedIntercept <- buildMatrixWithNARows(fittedInterceptNZ, mcols(object)$allZero)
    fittedIntercept[is.na(fittedIntercept)] <- -Inf
    attr(normalizedData,"intercept") <- fittedIntercept
  }
 
  #最后返回normalizedData
  normalizedData

由rlog的源代碼我們知道，輸入進(jìn)去的dds對(duì)象需要經(jīng)過負(fù)二項(xiàng)分布的擬合（fitNbinomGLMs），這一步主要是利用負(fù)二項(xiàng)回歸來估計(jì)回歸系數(shù)的值，而最終返回負(fù)二項(xiàng)分布回歸的系數(shù)值

那么，經(jīng)過擬合后返回fit對(duì)象，其中 fit$betaMatrix 指代的是負(fù)二項(xiàng)回歸的回歸系數(shù)，由于fitNbinomGLMs()里面部分是由C++寫的，所以如果你要跑源代碼，需要把對(duì)應(yīng)腳本給加載下：

library(Rcpp)

Rcpp::sourceCpp('DESeq2.cpp')

我們不妨模擬運(yùn)行下 rlog 里面的核心代碼（運(yùn)行前加載好內(nèi)部函數(shù)）：

dds <- DESeqDataSetFromMatrix(data, DataFrame(condition), 
                              design=~condition)
dds <- DESeq(dds)
object = dds

samplesVector <- as.character(seq_len(ncol(object)))

samplesVector <- factor(samplesVector,levels=unique(samplesVector))

samples <- factor(c("null_level",as.character(samplesVector)),
                  levels=c("null_level",levels(samplesVector)))
modelMatrix <- stats::model.matrix.default(~samples)[-1,]
modelMatrixNames <- colnames(modelMatrix)
modelMatrixNames[modelMatrixNames == "(Intercept)"] <- "Intercept"  


# only continue on the rows with non-zero row sums
objectNZ <- object[!mcols(object)$allZero,]
stopifnot(all(!is.na(mcols(objectNZ)$dispFit)))

logCounts <- log2(counts(objectNZ,normalized=TRUE) + 0.5)
logFoldChangeMatrix <- logCounts - log2(mcols(objectNZ)$baseMean + 0.5)
logFoldChangeVector <- as.numeric(logFoldChangeMatrix)
varlogk <- 1/mcols(objectNZ)$baseMean + mcols(objectNZ)$dispFit
weights <- 1/varlogk   
betaPriorVar <- matchWeightedUpperQuantileForVariance(logFoldChangeVector, rep(weights,ncol(objectNZ)))

stopifnot(length(betaPriorVar)==1)

lambda <- 1/rep(betaPriorVar,ncol(modelMatrix))
# except for intercept which we set to wide prior


fit <- fitNbinomGLMs(object=objectNZ, modelMatrix=modelMatrix,
                     lambda=lambda, renameCols=FALSE,
                     alpha_hat=mcols(objectNZ)$dispFit,
                     betaTol=1e-4, useOptim=FALSE,
                     useQR=TRUE)
normalizedDataNZ <- t(modelMatrix %*% t(fit$betaMatrix))

normalizedData <- buildMatrixWithZeroRows(normalizedDataNZ, mcols(object)$allZero)
# add back in the intercept, if finite
rownames(normalizedData) <- rownames(object)
colnames(normalizedData) <- colnames(object)
attr(normalizedData,"betaPriorVar") <- betaPriorVar

normalizedData

尤其注意的是

 normalizedDataNZ <- t(modelMatrix %*% t(fit$betaMatrix))

這句命令，我們不妨打開 t(fit$betaMatrix) 看看：

t(fit$betaMatrix)

不難發(fā)現(xiàn)，軟件按照每一個(gè)基因，一共有sample1，2，3，4，5，6 這6個(gè)組別分別做負(fù)二項(xiàng)回歸，得到不同的回歸系數(shù)（推測這6組都是和control=0，對(duì)照組處理進(jìn)行兩兩比較），其中intercept是截距項(xiàng)

而modelMatrix如下圖所示：

modelMatrix

上圖的列表示不同的sample，行表示不同的處理
那么：

normalizedDataNZ <- t(modelMatrix %*% t(fit$betaMatrix))

t(normalizedDataNZ)

上面圖示矩陣的列代表每一個(gè)基因，行代表不同的處理

根據(jù)矩陣的乘法原理，我們舉個(gè)例子，比方說normalizedDataNZ第一行第一列的數(shù)據(jù)10.052690，它是fit$betaMatrix的intercept（第一行第一列的值）加上其第二行第一列的值，也就是每一個(gè)負(fù)二項(xiàng)回歸的截距項(xiàng)加上對(duì)應(yīng)的比較組的斜率（該例子的回歸系數(shù)解釋為sample1與control=0做負(fù)二項(xiàng)回歸的回歸系數(shù)）

小tip

當(dāng)然在fitNbinomGLMs的源代碼中，

 # here we transform the betaMatrix and betaSE to a log2 scale
  betaMatrix <- log2(exp(1))*betaRes$beta_mat

##計(jì)算betaRes的函數(shù):
  betaRes <- fitBetaWrapper(ySEXP = counts(object), xSEXP = modelMatrix,
                            nfSEXP = normalizationFactors,
                            alpha_hatSEXP = alpha_hat,
                            beta_matSEXP = beta_mat,
                            lambdaSEXP = lambdaNatLogScale,
                            weightsSEXP = weights,
                            useWeightsSEXP = useWeights,
                            tolSEXP = betaTol, maxitSEXP = maxit,
                            useQRSEXP=useQR, minmuSEXP=minmu)

  # Note on deviance: the 'deviance' calculated in fitBeta() (C++)
  # is not returned in mcols(object)$deviance. instead, we calculate
  # the log likelihood below and use -2 * logLike.
  # (reason is that we have other ways of estimating beta:
  # above intercept code, and below optim code)

我們可以看到為了得到betaMatrix，軟件對(duì)擬合出來的負(fù)二項(xiàng)回歸系數(shù)（betaRes$beta_mat）做了一個(gè)乘 log2(exp(1)) 的處理，而fitBetaWrapper函數(shù)的實(shí)質(zhì)就是利用負(fù)二項(xiàng)回歸去做擬合（這一部分是C++寫的），從而得到負(fù)二項(xiàng)回歸的回歸系數(shù)
C++寫的這部分主要是先對(duì)row count文件做一個(gè)log標(biāo)準(zhǔn)化處理，log標(biāo)準(zhǔn)化原理具體可參考學(xué)習(xí)：StatQuest-Normalization，或者參考Introduction to DGE - ARCHIVED:
然后在對(duì)每一個(gè)基因?qū)?yīng)于每一個(gè)sample去擬合負(fù)二項(xiàng)回歸系數(shù)

總結(jié)

引入負(fù)二項(xiàng)回歸來標(biāo)準(zhǔn)化，可以利用測序數(shù)據(jù)有生物學(xué)重復(fù)的優(yōu)勢，從一定程度上消除離群值得影響

實(shí)質(zhì)上我們通過rlog標(biāo)準(zhǔn)化的raw count數(shù)據(jù)，實(shí)質(zhì)上是先將row count文件做一個(gè)log標(biāo)準(zhǔn)化處理，然后對(duì)每一個(gè)基因做一個(gè)負(fù)二項(xiàng)回歸，其響應(yīng)變量為基因的表達(dá)量，決策變量為不同的分組信息。對(duì)于分組信息，相當(dāng)于是sample1，2，3，4，5，6 這6個(gè)組別分別做負(fù)二項(xiàng)回歸，得到不同的回歸系數(shù)（推測這6組都是和control=0進(jìn)行比較），可參照t(fit$betaMatrix)表格來看

在同一套坐標(biāo)體系下：
為什么要強(qiáng)調(diào)在同一套坐標(biāo)體系下，是因?yàn)樵谶@樣的條件下，各個(gè)基因的表達(dá)量以及擬合的負(fù)二項(xiàng)回歸方程可以相互比較

回歸系數(shù)的大小可以理解為一組分組的表達(dá)量相對(duì)于另一組分組的增長的幅度，比方說A與B兩組進(jìn)行比較，其負(fù)二項(xiàng)回歸系數(shù)為5，那么從某種程度上理解可以理解為B組比A組高表達(dá)5個(gè)單位；
如果有截距項(xiàng)，截距項(xiàng)為8，A組的負(fù)二項(xiàng)回歸系數(shù)（β1）為5，，B組的負(fù)二項(xiàng)回歸系數(shù)（β2）為3，那么A組的基因表達(dá)量為8+5，B組的基因表達(dá)量為8+3，因此回歸系數(shù)加上截距項(xiàng)即代表基因表達(dá)量的高低。
由于我們的決策變量是因子型變量，比方說sample 1和control=0進(jìn)行建模，那么sample 1的因子為1，control=0的因子為0，假設(shè)負(fù)二項(xiàng)回歸的回歸方程為 y = βx + intercept，那么對(duì)于control組，其基因表達(dá)量為intercept；對(duì)于sample 1組，其基因表達(dá)量為β + intercept，這也就印證了上一段敘述中我們對(duì)基因表達(dá)量的討論

因此，在與相同的control組比較的條件下，回歸系數(shù)與截距項(xiàng)的和其實(shí)就是基因表達(dá)量的高低

上圖黑線表示的是回歸線，這幅圖即表示類別（因子型）變量負(fù)二項(xiàng)回歸的擬合

因此，對(duì)于DESeq2這樣的尋找差異基因的軟件，將其轉(zhuǎn)換為回歸系數(shù)是不影響結(jié)果的

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

DESeq2中rlog標(biāo)準(zhǔn)化那些事

DESeq2中rlog標(biāo)準(zhǔn)化那些事

前言

負(fù)二項(xiàng)分布模型

rlog核心函數(shù)

小tip

總結(jié)

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

DESeq2中rlog標(biāo)準(zhǔn)化那些事

前言

負(fù)二項(xiàng)分布模型

rlog核心函數(shù)

小tip

總結(jié)

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av