Normalization and PCA in Single-Cell Analysis: Notes on the Harvard Bioinformatics Course (Part 1)

Adapted from the Harvard bioinformatics course "Introduction to Single-cell RNA-seq"
Link: https://hbctraining.github.io/scRNA-seq/lessons/05_normalization_and_PCA.html

Main content of this section:

Count Normalization and Principal Component Analysis

After obtaining our high-quality single cells, the next step in the single-cell RNA-seq (scRNA-seq) analysis workflow is to perform clustering. The goal of clustering is to separate different cell types into unique clusters of cells. To perform clustering, we first determine the genes whose expression differs most between cells (the highly variable genes, HVGs). Then we use these genes to determine which correlated gene sets are responsible for the largest differences in expression between cells.

1. Count normalization

The first step is count normalization, which is essential for making accurate comparisons of gene expression between cells (or samples). The count of mapped reads for each gene is proportional to the expression of RNA (“interesting”) in addition to many other factors (“uninteresting”). Normalization is the process of scaling raw count values to account for the “uninteresting” factors. In this way the expression levels are more comparable between and/or within cells.

The main factors that normalization needs to account for are:

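  • Sequencing depth: Accounting for sequencing depth is necessary for comparing gene expression between cells. In the example below, every gene appears to be expressed twice as highly in cell 2, but this is only because cell 2 has twice the sequencing depth.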


    image.png

Each cell in scRNA-seq will have a differing number of reads associated with it. So to accurately compare expression between cells, it is necessary to normalize for sequencing depth.
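As a minimal sketch of depth normalization (not the course's own code), the Python snippet below divides each cell's counts by that cell's total and rescales every cell to a common size. The count values and the scale factor of 10,000 are illustrative assumptions, and the log transform is a common follow-up step rather than something stated in the text.

```python
import numpy as np

# Hypothetical raw counts: rows are genes, columns are cells.
# Cell 2 was sequenced twice as deeply as cell 1, so every count is doubled.
counts = np.array([
    [10, 20],
    [30, 60],
    [5, 10],
    [55, 110],
], dtype=float)


def depth_normalize(counts, scale_factor=10_000):
    """Scale each cell (column) to the same total, then log-transform."""
    depth = counts.sum(axis=0)              # total counts per cell
    scaled = counts / depth * scale_factor  # counts per `scale_factor` reads
    return np.log1p(scaled)                 # log(1 + x) dampens large values


normalized = depth_normalize(counts)
print(normalized)
```

After normalization the two columns are identical, so the apparent doubling in cell 2 disappears.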

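  • Gene length: Accounting for gene length is necessary for comparing expression between different genes within the same cell. In theory, the longer a gene is, the more reads map to it. As the figure below shows, a longer gene with low expression can yield about as many reads as a shorter gene with higher expression.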
    image.png

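If the protocol sequences only the 5' or 3' end of transcripts, gene length does not need to be taken into account; with full-length sequencing it does.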
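For full-length protocols, one common way to account for gene length is to divide counts by gene length before scaling for depth, as in TPM (transcripts per million). The sketch below uses hypothetical counts and gene lengths for a single cell; it illustrates the idea and is not taken from the course.

```python
import numpy as np

# Hypothetical single cell: read counts and gene lengths in kilobases.
counts = np.array([100.0, 100.0, 50.0])   # genes X, Y, Z
lengths_kb = np.array([10.0, 2.0, 1.0])   # X is long, Y and Z are short


def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then depth-normalize."""
    rate = counts / lengths_kb       # reads per kilobase of gene
    return rate / rate.sum() * 1e6   # rescale so the values sum to one million


values = tpm(counts, lengths_kb)
print(values)
```

Genes X and Y have the same raw count, but the shorter gene Y ends up with the higher length-normalized value.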

2. Principal Component Analysis (PCA)

Example:

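If you have quantified the expression of four genes in two samples (or cells), you can plot the expression values of those genes with one sample on the x-axis and the other on the y-axis, as shown below: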

image.png

You could draw a line through the data in the direction representing the most variation, which is on the diagonal in this example. The maximum variation in the dataset is between the genes that make up the two endpoints of this line. We also see that the genes vary somewhat above and below the line. We could draw a second line through the midpoint of the first, perpendicular to it, representing the second most variation in the data.

image.png

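The genes near the ends of each line are the ones with the highest variation; mathematically, these genes have the greatest influence on the direction of the line.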
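To make the "line of most variation" concrete, here is a minimal sketch that finds it with a singular value decomposition in NumPy. The four points reuse the read counts from the worked scoring example in this section (genes A to D in Sample1 and Sample2). Treating those counts as the plotted points, and using SVD on the centered data, are assumptions of this sketch rather than the course's stated method.

```python
import numpy as np

genes = ["A", "B", "C", "D"]
# Each row is one gene plotted as (Sample1 count, Sample2 count).
points = np.array([[4, 5], [1, 4], [8, 8], [5, 7]], dtype=float)

centered = points - points.mean(axis=0)  # move the cloud's center to the origin

# Rows of vt are unit vectors: vt[0] is the direction of most variation,
# vt[1] is perpendicular to it.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

pc1_direction = vt[0]
position_on_pc1 = centered @ pc1_direction  # where each gene falls along the line

order = np.argsort(position_on_pc1)
end_genes = {genes[order[0]], genes[order[-1]]}  # genes at the two ends
print("PC1 direction:", pc1_direction)
print("Genes at the ends of the PC1 line:", sorted(end_genes))
```

With these numbers the two end genes come out as B and C, on opposite sides of the center. The overall sign of the direction vector is arbitrary, so only the relative signs are meaningful.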


image.png

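For example, a small change in the value of Gene C would greatly change the direction of the longer line, whereas a small change in Gene A or Gene D would have almost no effect on it.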


image.png

We could also rotate the entire plot and view the lines representing the variation as left-to-right and up-and-down. We see most of the variation in the data is left-to-right (longer line) and the second most variation in the data is up-and-down (shorter line). You can now think of these lines as the axes that represent the variation. These axes are essentially the “Principal Components”, with PC1 representing the most variation in the data and PC2 representing the second most variation in the data.
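The rotation described above amounts to re-expressing each point in the coordinate system of the two lines. A minimal sketch follows, again assuming the four gene points from this section's worked example and an SVD-based PCA:

```python
import numpy as np

# Genes A-D as (Sample1 count, Sample2 count) points; assumed from the worked example.
points = np.array([[4, 5], [1, 4], [8, 8], [5, 7]], dtype=float)
centered = points - points.mean(axis=0)

_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Rotating the plot = projecting every point onto the PC axes.
# Column 0 is the left-to-right axis (PC1), column 1 the up-and-down axis (PC2).
rotated = centered @ vt.T

spread = rotated.var(axis=0, ddof=1)  # variance along each new axis
explained = spread / spread.sum()     # fraction of the total variation per PC
print("Variance along PC1 and PC2:", spread)
print("Fraction of variation:", explained)
```

The rotation leaves the distances between points unchanged; it only changes the axes they are measured against, so that most of the spread lies along the first axis.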


image.png

If we had three samples/cells, then we would have an extra direction in which we could have variation (3D). Therefore, if we have N samples/cells we would have N-directions of variation or principal components (PC)! Once these PCs have been calculated, the PC that deals with the largest variation in the dataset is designated PC1, and the next one is designated PC2 and so on.
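As a rough illustration of the three-sample case, the sketch below runs the same SVD-based approach on a made-up table of six genes in three samples. The counts are hypothetical and serve only to show that three directions come out, ordered from most to least variation.

```python
import numpy as np

# Hypothetical counts for 6 genes (rows) in 3 samples/cells (columns).
counts = np.array([
    [4, 5, 3],
    [1, 4, 2],
    [8, 8, 9],
    [5, 7, 4],
    [2, 1, 6],
    [7, 3, 5],
], dtype=float)

centered = counts - counts.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Squared singular values are proportional to the variation each PC captures.
# NumPy returns them sorted from largest to smallest, so vt[0] is PC1, vt[1] is PC2, ...
variation = singular_values**2 / (counts.shape[0] - 1)
for i, v in enumerate(variation, start=1):
    print(f"PC{i}: {v / variation.sum():.1%} of the variation")
```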

Once the PCs have been determined, each one needs to be scored. Scores are calculated for all sample-PC pairs in the following steps:

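1) First, each gene is assigned an "influence" score based on how much it influences each PC. Genes that have no influence on a given PC get scores near zero, while genes with more influence get larger scores. Genes at the ends of a PC line have the greatest influence, so they receive the largest scores, but with opposite signs at the two ends.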


image.png

2) Once the influence scores have been determined, the score for each sample is calculated using the following formula:

Sample1 PC1 score = (read count * influence) + ... for all genes

For our 2-sample example, the following is how the scores would be calculated:

## Sample1
PC1 score = (4 * -2) + (1 * -10) + (8 * 8) + (5 * 1) = 51
PC2 score = (4 * 0.5) + (1 * 1) + (8 * -5) + (5 * 6) = -7

## Sample2
PC1 score = (5 * -2) + (4 * -10) + (8 * 8) + (7 * 1) = 21
PC2 score = (5 * 0.5) + (4 * 1) + (8 * -5) + (7 * 6) = 8.5

3) Once these scores have been calculated for every PC of each sample, they can be plotted on a simple scatter plot. Below is an example plot:

image.png
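The score formula in step 2 is a dot product between a sample's read counts and the per-gene influence scores. The sketch below reproduces the worked numbers above. The variable names are ours, and the four values are assumed to belong to the same four genes in the order in which they appear.

```python
import numpy as np

# Read counts per gene, in the order used in the worked example.
sample1 = np.array([4, 1, 8, 5], dtype=float)
sample2 = np.array([5, 4, 8, 7], dtype=float)

# Influence scores of the same four genes on each PC.
influence_pc1 = np.array([-2, -10, 8, 1], dtype=float)
influence_pc2 = np.array([0.5, 1, -5, 6], dtype=float)


def pc_score(read_counts, influence):
    """Sum of (read count * influence) over all genes."""
    return float(read_counts @ influence)


scores = {
    "Sample1": (pc_score(sample1, influence_pc1), pc_score(sample1, influence_pc2)),
    "Sample2": (pc_score(sample2, influence_pc1), pc_score(sample2, influence_pc2)),
}
for name, (pc1, pc2) in scores.items():
    print(f"{name}: PC1 = {pc1}, PC2 = {pc2}")
```

Each sample then becomes a single (PC1, PC2) point on the scatter plot.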

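For datasets with a large number of samples or cells, the PC1 and PC2 scores of each sample/cell are usually plotted. Because these PCs explain the most variation in the dataset, more similar samples/cells will cluster together on PC1 and PC2. See the example below: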

image.png

Single-cell data analysis workflow diagram