Continuing from the previous study notes, this lecture covers quality control for single-cell sequencing. The video is long, so this post is long too. I took most of the notes in English and recorded most of the video content (in some long-winded places I wrote a simplified summary instead), because this part is very important.
Video: https://www.youtube.com/watch?v=rOm6UIPhHnc&list=PLjiXAZO27elC_xnk7gVNM85I2IQl5BEJN&index=2
Course website: https://www.csc.fi/web/training/-/scrnaseq
A hands-on quality control exercise, with downloadable practice data and the full code, is available here: https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-qc/Quality_control.md
The course website only provides the YouTube links and the slides, and the slides have no narration, so they are hard to follow on their own. I therefore typed up the spoken content (not all of it) for later reference.
Without further ado, let's begin:
Lecture 2: Quality control for single-cell sequencing

Now I will talk about one of the most important steps in single-cell analysis. Every time you do single-cell RNA-seq, whether it is SMART-seq2 or 10x, you will have failed libraries, some half-dead cells, and doublets. You really need to look at your data and filter it before you start clustering. So I think it is really important not to rush through quality control.

I will first go through the different steps of a single-cell experiment, talk about the issues that arise at each step, and then say a bit about the filtering itself.

But before we start, I want to talk about transcriptional bursting, because it is really important for understanding single-cell data. In a single cell, a gene is not constantly "on". Expression happens in bursts: the transcriptional machinery binds to a gene and starts producing mRNA, circles around the gene locus producing a lot of mRNA, and then falls off. (Upper left, panel A) Here is an illustration from an experiment: in black you see a gene that is "on", mRNA starts being produced, and later it is degraded. Once the mRNA is produced, protein production starts as well. So when you measure protein abundance in a single cell, it varies over time. (Right, panel A) But the mRNA signal is much, much spikier: in this experiment, each blue line is a gene being turned on (the blue peaks in the lower right corner).
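
To make the bursting picture concrete, here is a minimal simulation sketch of a gene that randomly switches between "off" and "on", produces mRNA only while "on", and loses mRNA through degradation. It is not from the lecture: the function name and every rate below are made-up illustration values, chosen only to produce a spiky trace.

```python
import random

def simulate_bursting(steps=2000, p_on=0.01, p_off=0.1,
                      burst_rate=5, p_decay=0.05, seed=0):
    """Discrete-time sketch of a gene that switches between OFF and ON.

    All parameters are illustrative assumptions:
    p_on / p_off  - per-step probability of switching ON / OFF
    burst_rate    - mRNA molecules produced per step while ON
    p_decay       - per-step probability that each mRNA is degraded
    """
    rng = random.Random(seed)
    gene_on = False
    mrna = 0
    trace = []
    for _ in range(steps):
        # Switch state at random.
        if gene_on:
            if rng.random() < p_off:
                gene_on = False
        elif rng.random() < p_on:
            gene_on = True
        # Produce mRNA only while the gene is ON.
        if gene_on:
            mrna += burst_rate
        # Each existing molecule is degraded independently.
        mrna -= sum(1 for _ in range(mrna) if rng.random() < p_decay)
        trace.append(mrna)
    return trace

trace = simulate_bursting()
zero_fraction = sum(1 for m in trace if m == 0) / len(trace)
print(f"fraction of time points with zero mRNA: {zero_fraction:.2f}")
print(f"peak mRNA count: {max(trace)}")
```

Sampling one time point of such a trace is roughly what a single-cell measurement does, which is one reason the same gene can read as zero in one cell and as abundant in its neighbor.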

Say you have a population of cells of the same cell type, shown here with only 4 genes. The cells vary in size. Some genes are absent and some are quite abundant in individual cells. When you do bulk RNA-seq, you get the average of these. With single cells, of course, you get some bias. During reverse transcription you also lose a lot of transcripts: you will detect only 10% to 40% of the transcripts, depending on the technique you use. Then you amplify, which may introduce further bias, and the selection is not completely random: some transcripts are reverse transcribed more easily than others. So there are many sources of dropout, or missing genes, which give you all those zeros in your data frame at the end.
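
The effect of incomplete capture on zeros can be sketched with a simple model in which every transcript is kept independently with a fixed probability. This is an assumption made for illustration (real capture is not uniform across transcripts, as noted above), and the true counts are hypothetical. Under this model a gene with n copies is missed entirely with probability (1 - efficiency)^n.

```python
import random

def capture(true_counts, efficiency, rng):
    """Keep each transcript independently with probability `efficiency`."""
    return [sum(1 for _ in range(n) if rng.random() < efficiency)
            for n in true_counts]

def expected_dropout(n_transcripts, efficiency):
    """Probability that none of a gene's transcripts is captured."""
    return (1 - efficiency) ** n_transcripts

rng = random.Random(1)
true_counts = [1, 5, 20, 200]  # hypothetical copies of 4 genes in one cell
n_cells = 1000

for efficiency in (0.1, 0.4):  # the 10% to 40% range mentioned in the notes
    zeros = [0] * len(true_counts)
    for _ in range(n_cells):
        for g, count in enumerate(capture(true_counts, efficiency, rng)):
            if count == 0:
                zeros[g] += 1
    simulated = [z / n_cells for z in zeros]
    expected = [round(expected_dropout(n, efficiency), 3) for n in true_counts]
    print(efficiency, "simulated:", simulated, "expected:", expected)
```

Lowly expressed genes drop out most often even though every simulated cell contains them, so a zero in the matrix does not by itself mean the gene was off.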

And then of course you end up with these datasets: if you are used to working with bulk RNA-seq, you will find that single-cell RNA-seq data looks quite crappy. We also have a lot of batch effects in single-cell RNA-seq data.

Here I try to visualize the path from raw data to the gene expression matrix. You do quality control before you go on to normalization and removing batch effects; then you do clustering, visualization, and so on. QC is always the first step once you have a gene expression matrix.

The first step is cell dissociation and single-cell capture.

I would say this is the most critical step in single-cell RNA-seq. It is the biggest contributor to batch effects, and it is really important to have whole, healthy cells. If the cells are hard to dissociate, you can do laser capture. If the conditions are too harsh, you damage the cells, and RNA leaking from the damaged cells gives you a background signal that always biases your data. If you take samples from the same tissue for both bulk RNA-seq and single-cell sequencing, the cell-type proportions you get may differ, because single-cell sequencing is "friendlier" to cells that are in good condition.

This is a study showing the induction of some genes during dissociation: they basically stain cells before and after dissociation. You can see that a lot of genes are upregulated, and they show that you can get artificial clusters of cells that are a dissociation-induced effect.

As I said, you might get ambient RNA. This work is on immune cells, with three different time points, and here are the different cell types. As you can see, all the samples from time point three have a background of neutrophil genes. We think that at day 3 we probably have more neutrophils, but also that a lot of them are broken, so we get a lot of RNA from them.

For single-cell capture, we can do FACS sorting or droplet-based capture; either way you will always get doublets, and you will get empty wells. (In other words, doublets and empty wells can occur with any method.)

Here we have human cells and mouse cells, and some cells have reads from both species. The estimate is that the number of doublets increases linearly with the number of cells you load. (So if you load only about 870 cells, roughly 0.4% of them will be doublets; if you load more than 10,000 cells, you will have about 8% doublets.)
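
The linear relationship can be written down as a rough back-of-the-envelope sketch. The function below is a hypothetical helper that assumes the doublet rate is exactly proportional to the number of loaded cells and calibrates the slope on the single reference point quoted above (about 870 cells, about 0.4%). The other loading numbers are arbitrary examples; 17,400 is included only because that is where this straight line reaches 8%.

```python
def doublet_rate(cells_loaded, ref_cells=870, ref_rate=0.004):
    """Linear doublet-rate estimate through the origin.

    Calibrated on the one reference point quoted in the notes
    (about 870 loaded cells -> about 0.4% doublets). Treating the
    relationship as exactly proportional is an assumption.
    """
    return ref_rate * cells_loaded / ref_cells

for loaded in (870, 5000, 10000, 17400):
    rate = doublet_rate(loaded)
    print(f"{loaded:>6} cells loaded -> {rate:.1%} doublets "
          f"(~{loaded * rate:.0f} cells)")
```

Because the rate itself grows with loading, the absolute number of doublets grows roughly with the square of the number of cells loaded, which is worth keeping in mind when planning an experiment.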

Another problem is that if you have damaged cells, a lot of the doublets you get are actually cell debris from one cell sticking to another cell. This is a dataset I worked with, where I did clustering, found cells carrying the signature of two clusters, and called them doublets. As you can see, it had a very bad experimental design: some samples have only around a thousand cells, and some up here have 17,000 cells. That is not advisable, because normalization and everything else will differ between the datasets; even though this one was sequenced much deeper, it does not reach the same level of saturation. But we also see that the more cells, the more doublets.

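In the left-hand figure above, the cells bridging the two clusters (in blue) appear to carry the signatures of both clusters. In a differentiation experiment, the cluster on the left may be differentiating into the cluster on the right, and then you cannot tell whether these cells are differentiation intermediates or doublets.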

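So, how do we find doublets? This is the hardest thing. (At this point someone asked whether fixing the cells before sequencing improves quality. The speaker said that no paper seems to have shown that this works, but that it may be a good idea if your tissue is very hard to dissociate.)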

Here are some methods for detecting doublets. This is very tricky when you are looking at cellular differentiation. Hopefully, when cells go from one state to another, some genes come up in the middle of the differentiation, so you can distinguish true differentiation intermediates from doublets that are just a mix of the two signatures. But if that signal is weak, you do not really know which it is, and what you are looking for can be hard to find. So I would plan for fewer doublets if you are looking into differentiation.

Once we have sorted the cells, we need to lyse them. (Different cell types may need different lysis methods; for plant cells, for example, you also need to break the cell wall.) What I have also seen is that, depending on the cell, you may or may not get nuclear lysis. That gives a clearly different transcriptional landscape, because nuclear transcripts are detected or not depending on whether the nucleus is lysed.

And then reverse transcription, of course, is a major limiting step.

Likewise, amplification also introduces bias.

Sequencers are not perfect either, and they also introduce bias as they run, for example through air bubbles or contamination.

You cannot use spike-ins with the 10x method. As you said, most of the droplets are empty, so if you put spike-ins into everything, you mainly sequence empty droplets containing spike-ins, and that costs a lot before you actually manage to find the cells. So there are two kits in use, with artificial sequences that we add to the cells; you should add them to the lysis buffer so that they go through all the same processing steps as the cell's RNA. The idea is this: these two cells look as if they have identical transcriptional landscapes, but if you add spike-ins, you can actually see that one of them has twice as much RNA as the other.

Spike-ins are not perfect either. There is a difference in amplification bias, because the ERCC transcripts are shorter and amplify better. But we can still roughly estimate the number of RNA molecules from the spike-ins. We can also use them to model technical noise and dropouts, so for each experiment you can see how efficient the reverse transcription was in every single well.
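
The spike-in reasoning amounts to a simple scaling, sketched below under the assumption that endogenous RNA is recovered as efficiently as the spike-ins (which, as just noted, is only approximately true). The function names and all numbers are hypothetical; the "recovery fraction" lumps together reverse transcription, amplification, and sequencing depth.

```python
def recovery_fraction(spike_in_observed, spike_in_added):
    """Fraction of added spike-in molecules that were detected."""
    if spike_in_added <= 0:
        raise ValueError("spike_in_added must be positive")
    return spike_in_observed / spike_in_added

def estimate_cell_molecules(endogenous_observed, spike_in_observed, spike_in_added):
    """Scale observed endogenous molecules by the spike-in recovery fraction.

    Assumes endogenous RNA is recovered as efficiently as the spike-ins.
    """
    fraction = recovery_fraction(spike_in_observed, spike_in_added)
    if fraction == 0:
        return None  # no spike-in detected: the library cannot be calibrated
    return endogenous_observed / fraction

# Hypothetical numbers: both cells show the same endogenous counts, but the
# second library recovered half as much spike-in, implying twice the input RNA.
cell_a = estimate_cell_molecules(20_000, spike_in_observed=2_000, spike_in_added=10_000)
cell_b = estimate_cell_molecules(20_000, spike_in_observed=1_000, spike_in_added=10_000)
print(cell_a, cell_b, cell_b / cell_a)
```

Returning `None` when no spike-in is detected mirrors the practical rule that such a library cannot be calibrated at all.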

Different papers may use different criteria for filtering cells and genes.

In general, if you sequence with SMART-seq2, you should check the four aspects above to assess your sequencing quality.

When it comes to spike-ins: if the spike-ins are not amplified and not detected in a library prep, you know you cannot use that library, and you can throw it away. The relative proportion of spike-in to endogenous RNA tells you something about how much RNA the cell contains, so you can use it for filtering. (If your cells are large, you can add more spike-in. How much to add depends on your cell type, so it is best to check beforehand how much others have used.)


Bigger cells have a higher ratio of RNA to spike-in. They also have a higher number of detected genes.

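Many people also filter out mitochondrial genes when filtering genes. If a cell is damaged, its RNA leaks out while the mitochondria stay in place, so you get a lot of mitochondrial reads. In general, a very high proportion of mitochondrial reads indicates that something is wrong with the cell.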
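
The mitochondrial read proportion is easy to compute per cell. The sketch below assumes human gene symbols, where mitochondrial genes start with "MT-"; the counts and the 20% cutoff are hypothetical illustration values, not recommendations from the lecture.

```python
def mito_fraction(gene_counts, mito_prefix="MT-"):
    """Fraction of a cell's counts that come from mitochondrial genes.

    `gene_counts` maps gene symbol -> count. The "MT-" prefix matches human
    gene symbols; other species or annotations need a different prefix.
    """
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    mito = sum(c for gene, c in gene_counts.items() if gene.startswith(mito_prefix))
    return mito / total

# Hypothetical cells.
cells = {
    "healthy": {"ACTB": 500, "GAPDH": 400, "MT-CO1": 60, "MT-ND1": 40},
    "damaged": {"ACTB": 50, "GAPDH": 30, "MT-CO1": 250, "MT-ND1": 170},
}
MAX_MITO = 0.2  # illustrative cutoff, not a recommendation from the lecture

for name, counts in cells.items():
    frac = mito_fraction(counts)
    print(name, round(frac, 2), "keep" if frac <= MAX_MITO else "filter")
```

A sensible cutoff depends on the tissue and cell type, so it is better read off the observed distribution than copied from an example.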

Ribosomal RNA is also related to data quality, but note that its proportion differs between cell types, so take care when filtering on it.

If you use SMART-seq2, you can also look at 3' bias. If your RNA is degraded, it will look like the figure on the right: the reads pile up at the 3' end of genes. Such cells must be filtered out.

There is a lot to watch for in QC, but the three items marked in bold above deserve particular attention.

So how do you filter? I say: look at the distributions.
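
One common way to turn "look at the distribution" into numbers is to flag values that lie many median absolute deviations (MADs) from the median. This rule is a swapped-in heuristic for illustration, not something the speaker prescribed, and the gene counts and the 3-MAD cutoff are hypothetical.

```python
import statistics

def mad_outliers(values, n_mads=3.0):
    """Flag values more than `n_mads` median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(v - med) / mad > n_mads for v in values]

# Hypothetical number of detected genes per cell.
genes_per_cell = [3200, 3100, 2900, 3300, 3050, 400, 3150, 9000]
flags = mad_outliers(genes_per_cell)
for n_genes, is_outlier in zip(genes_per_cell, flags):
    print(n_genes, "outlier" if is_outlier else "ok")
```

Any such automatic rule still needs a visual check of the histogram, since a genuinely distinct cell type (very small or very large cells) can sit in the tails.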

You can also take all the quality control metrics you have, put them into one big matrix, run PCA, and look for outliers in the PCA. That is actually quite a good way to find outliers. But the outliers you get in the PCA might also be a small cluster of real cells.
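
The PCA-on-QC-metrics idea can be sketched without any external library by standardizing the metrics and finding the first principal component with power iteration. This is a minimal illustration under stated assumptions: the metric columns (detected genes, total counts, percent mitochondrial) and all values are hypothetical, and only the first component is computed.

```python
import math
import random

def standardize(rows):
    """Center every column and scale it to unit standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = []
    for j in range(d):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        sds.append(math.sqrt(var) if var > 0 else 1.0)
    return [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]

def first_pc_scores(rows, iterations=500, seed=0):
    """Return (PC1 axis, per-row PC1 scores) using power iteration."""
    z = standardize(rows)
    n, d = len(z), len(z[0])
    # Covariance of standardized columns, i.e. the correlation matrix.
    cov = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    axis = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        axis = [x / norm for x in nxt]
    scores = [sum(r[j] * axis[j] for j in range(d)) for r in z]
    return axis, scores

# Hypothetical QC matrix: [detected genes, total counts, percent mitochondrial].
qc = [
    [3000, 10000, 5.0], [3200, 11000, 4.0], [2900, 9500, 6.0],
    [3100, 10500, 5.5], [3050, 9800, 4.5], [2950, 10200, 5.2],
    [3150, 10800, 4.8], [500, 1200, 60.0],  # damaged-looking cell
]
axis, scores = first_pc_scores(qc)
worst = max(range(len(scores)), key=lambda i: abs(scores[i]))
print("most extreme cell on PC1:", worst, qc[worst])
```

In practice a PCA routine from a numerical library would replace the hand-written power iteration; the point here is only that a cell that is unusual on several metrics at once stands out along the first component.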

In Drop-seq or 10x data, a lot of people filter on the number of molecules, cutting at both ends.
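
Cutting at both ends is a two-sided threshold on the per-cell molecule (UMI) count, as in the sketch below. The barcodes and both cutoffs are hypothetical; very low counts are commonly interpreted as empty droplets or debris and very high counts as possible doublets, but that interpretation has to be checked against the cell types you expect.

```python
def filter_by_counts(umi_per_cell, low, high):
    """Keep cells whose total molecule (UMI) count lies within [low, high]."""
    return {cell: n for cell, n in umi_per_cell.items() if low <= n <= high}

# Hypothetical per-cell UMI totals and illustrative cutoffs.
umi_per_cell = {"AAAC": 150, "AAAG": 4200, "AACT": 5100, "AAGT": 38000, "ACGT": 3900}
kept = filter_by_counts(umi_per_cell, low=500, high=20000)
print(sorted(kept))
print("removed:", sorted(set(umi_per_cell) - set(kept)))
```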

As I said, filtering is not easy. You have to know your data: how heterogeneous it is, and what kinds of cell types and cell sizes you expect in your dataset, when you apply your filtering.

Another thing is quality control of genes.

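Here is a plot of each gene's contribution to the total counts, and you can clearly see a case with too much spike-in (ERCC). (A situation like the one above may reflect poor sequencing quality caused by ruptured cell membranes. In addition, some genes account for 20% or more of the total reads; it is best to remove them before clustering.)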
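
The per-gene contribution behind such a plot can be computed directly. The sketch below uses hypothetical counts, assumes spike-in identifiers start with "ERCC-", and uses the 20% figure mentioned above as the cutoff for a dominating gene.

```python
def gene_fractions(counts_per_gene):
    """Each gene's share of the total counts."""
    total = sum(counts_per_gene.values())
    if total == 0:
        return {gene: 0.0 for gene in counts_per_gene}
    return {gene: c / total for gene, c in counts_per_gene.items()}

# Hypothetical counts summed over all cells.
counts = {"ACTB": 300, "MALAT1": 2500, "GAPDH": 200,
          "ERCC-00002": 1500, "ERCC-00074": 500}

fractions = gene_fractions(counts)
dominant = sorted(g for g, f in fractions.items() if f >= 0.2)
ercc_share = sum(f for g, f in fractions.items() if g.startswith("ERCC-"))
print("genes with >= 20% of counts:", dominant)
print(f"spike-in share of all counts: {ercc_share:.0%}")
```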

As we said, batch effects will happen. I think it is quite important to take all your quality control metrics and plot them per batch.
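
Before plotting, the same per-batch comparison can be made as a small summary table. The sketch below groups hypothetical cells by a `batch` label and reports the median of each QC metric; the field names and values are assumptions for illustration.

```python
import statistics
from collections import defaultdict

def summarize_by_batch(cells, metric):
    """Median of one QC metric per batch."""
    by_batch = defaultdict(list)
    for cell in cells:
        by_batch[cell["batch"]].append(cell[metric])
    return {batch: statistics.median(vals) for batch, vals in by_batch.items()}

# Hypothetical per-cell QC records.
cells = [
    {"batch": "plate1", "n_genes": 3100, "pct_mito": 4.0},
    {"batch": "plate1", "n_genes": 2900, "pct_mito": 5.0},
    {"batch": "plate1", "n_genes": 3300, "pct_mito": 6.0},
    {"batch": "plate2", "n_genes": 1800, "pct_mito": 12.0},
    {"batch": "plate2", "n_genes": 2000, "pct_mito": 15.0},
    {"batch": "plate2", "n_genes": 1700, "pct_mito": 11.0},
]

for metric in ("n_genes", "pct_mito"):
    print(metric, summarize_by_batch(cells, metric))
```

A batch whose medians sit far from the others on several metrics is a candidate for a technical rather than a biological difference.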

You can, for example, look at the uniquely mapped reads in your different samples, which also helps you understand what the batch effects are.

I think PCA is really important, because it gives you a really good understanding of the main sources of variability in your data.

You can also use PCA to check for batch effects. You can look at how well your different batches correlate with the different principal components to start understanding how much batch effect you have and what it is. You can also go into the loadings, of PC1 for instance, and see which genes drive the separation of these two datasets.


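The speaker mentioned that after finishing the whole analysis she usually goes back to the QC step to check again, because you cannot know whether your clusters are truly two distinct cell populations or an artifact of inadequate QC.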