scater包簡介
scater是一個(gè)優(yōu)秀的單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析工具包,它可以對單細(xì)胞數(shù)據(jù)進(jìn)行常規(guī)的質(zhì)量控制,數(shù)據(jù)的標(biāo)準(zhǔn)化與歸一化,以及數(shù)據(jù)的降維與可視化分析。它主要基于SingleCellExperiment類(來自SingleCellExperiment包)來進(jìn)行操作處理,因此可以與其他許多Bioconductor包(如scran,batchelor和iSEE等)相互操作。
scater包主要含有以下特性:
- Use of the
SingleCellExperimentclass as adata containerfor interoperability with a wide range of other Bioconductor packages; - Functions to import
kallistoandSalmonresults; - Simple calculation of many
quality control metricsfrom the expression data; - Many tools for
visualising scRNA-seq data, especially diagnostic plots for quality control; - Subsetting and many other methods for
filtering out problematic cells and features; - Methods for
identifying important experimental variablesandnormalising dataahead of downstream statistical analysis and modeling.
scater包的工作流程為:
構(gòu)建SingleCellExperiment對象
使用SingleCellExperiment函數(shù)導(dǎo)入單細(xì)胞轉(zhuǎn)錄組的基因表達(dá)矩陣構(gòu)建一個(gè)SingleCellExperiment對象,該表達(dá)矩陣是一個(gè)行為基因,列為細(xì)胞的大型數(shù)據(jù)框。
SingleCellExperiment對象內(nèi)容
SingleCellExperiment對象常見操作
# 導(dǎo)入scater包
library(scater)
# 加載示例數(shù)據(jù)
data("sc_example_counts")
data("sc_example_cell_info")
# 查看基因表達(dá)矩陣
head(sc_example_counts)
Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006 Cell_007
Gene_0001 0 123 2 0 0 0 0
Gene_0002 575 65 3 1561 2311 160 2
Gene_0003 0 0 0 0 1213 0 0
Gene_0004 0 1 0 0 0 99 476
Gene_0005 0 0 11 0 0 0 0
Gene_0006 0 0 0 0 0 0 673
Cell_008 Cell_009 Cell_010 Cell_011 Cell_012 Cell_013 Cell_014
Gene_0001 21 2 0 2624 1 1015 0
Gene_0002 1 0 0 2 0 2710 0
Gene_0003 1 0 0 2 178 0 0
Gene_0004 0 1 66 0 1 0 1
Gene_0005 0 1 0 0 2 2 0
Gene_0006 0 3094 0 0 270 2 0
Cell_015 Cell_016 Cell_017 Cell_018 Cell_019 Cell_020 Cell_021
Gene_0001 0 1 34 1 0 6 0
Gene_0002 4 0 908 673 174 622 2085
Gene_0003 0 0 0 0 1 0 3320
Gene_0004 0 906 655 1020 1 0 0
Gene_0005 0 0 0 2 0 0 3
Gene_0006 1176 0 3 0 0 0 1
# 查看樣本信息
head(sc_example_cell_info)
Cell Mutation_Status Cell_Cycle Treatment
Cell_001 Cell_001 positive S treat1
Cell_002 Cell_002 positive G0 treat1
Cell_003 Cell_003 negative G1 treat1
Cell_004 Cell_004 negative S treat1
Cell_005 Cell_005 negative G1 treat2
Cell_006 Cell_006 negative G0 treat1
# 使用SingleCellExperiment函數(shù)構(gòu)建SingleCellExperiment對象
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)
# 查看SingleCellExperiment對象
example_sce
class: SingleCellExperiment
dim: 2000 40
metadata(0):
assays(1): counts
rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
rowData names(0):
colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
colData names(4): Cell Mutation_Status Cell_Cycle Treatment
reducedDimNames(0):
spikeNames(0):
View(example_sce)
我們通常使用原始的count矩陣存儲到SingleCellExperiment對象的“counts” Assay中,同時(shí)也可以使用counts函數(shù)提取SingleCellExperiment對象中的count表達(dá)矩陣。
str(counts(example_sce))
int [1:2000, 1:40] 0 575 0 0 0 0 0 0 416 12 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:2000] "Gene_0001" "Gene_0002" "Gene_0003" "Gene_0004" ...
..$ : chr [1:40] "Cell_001" "Cell_002" "Cell_003" "Cell_004" ...
head(counts(example_sce))
對于樣本行和列的meta信息,我們也提供了一些常用函數(shù)來進(jìn)行操作處理,如isSpike, sizeFactors, 和reducedDim等函數(shù)。
# 添加一個(gè)新列的meta信息whee
example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE)
# 使用colData函數(shù)查看列的meta信息
colData(example_sce)
DataFrame with 40 rows and 5 columns
Cell Mutation_Status Cell_Cycle Treatment whee
<character> <character> <character> <character> <character>
Cell_001 Cell_001 positive S treat1 N
Cell_002 Cell_002 positive G0 treat1 T
Cell_003 Cell_003 negative G1 treat1 Y
Cell_004 Cell_004 negative S treat1 T
Cell_005 Cell_005 negative G1 treat2 C
... ... ... ... ... ...
Cell_036 Cell_036 negative G0 treat1 Q
Cell_037 Cell_037 negative G0 treat1 X
Cell_038 Cell_038 negative G0 treat2 W
Cell_039 Cell_039 negative G1 treat1 B
Cell_040 Cell_040 negative G0 treat2 Z
# 添加一個(gè)新行的meta信息
rowData(example_sce)$stuff <- runif(nrow(example_sce))
# 使用rowData函數(shù)查看行的meta信息
rowData(example_sce)
DataFrame with 2000 rows and 1 column
stuff
<numeric>
Gene_0001 0.146899100858718
Gene_0002 0.547358682611957
Gene_0003 0.381470382912084
Gene_0004 0.0698823253624141
Gene_0005 0.577666614903137
... ...
Gene_1996 0.810028552776203
Gene_1997 0.92471176572144
Gene_1998 0.73105761455372
Gene_1999 0.496801204746589
Gene_2000 0.135669085429981
# 根據(jù)基因的表達(dá)過濾掉那些在所有細(xì)胞中表達(dá)量之和為0的基因
keep_feature <- rowSums(counts(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]
對于原始的count表達(dá)矩陣,我們也提供了一些函數(shù)對其進(jìn)行數(shù)據(jù)的歸一化和標(biāo)準(zhǔn)化處理。如使用calculateCPM函數(shù)計(jì)算表達(dá)量的CPM(counts-per-million)值,其結(jié)果將會存儲在SingleCellExperiment對象的“cpm” Assay中,可以通過cpm函數(shù)進(jìn)行訪問
cpm(example_sce) <- calculateCPM(example_sce)
head(cpm(example_sce))
Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
Gene_0001 0.00 749.529259 6.561271 0.000 0.000 0.0000
Gene_0002 1344.85 396.092698 9.841906 5558.424 2826.476 923.5422
Gene_0003 0.00 0.000000 0.000000 0.000 1483.563 0.0000
Gene_0004 0.00 6.093734 0.000000 0.000 0.000 571.4418
Gene_0005 0.00 0.000000 36.086989 0.000 0.000 0.0000
Gene_0006 0.00 0.000000 0.000000 0.000 0.000 0.0000
Cell_007 Cell_008 Cell_009 Cell_010 Cell_011 Cell_012
Gene_0001 0.000000 69.780424 2.698593 0.0000 9959.085768 5.882457
Gene_0002 6.938109 3.322877 0.000000 0.0000 7.590767 0.000000
Gene_0003 0.000000 3.322877 0.000000 0.0000 7.590767 1047.077301
Gene_0004 1651.269847 0.000000 1.349296 168.5733 0.000000 5.882457
Gene_0005 0.000000 0.000000 1.349296 0.0000 0.000000 11.764913
Gene_0006 2334.673545 0.000000 4174.723091 0.0000 0.000000 1588.263322
Cell_013 Cell_014 Cell_015 Cell_016 Cell_017 Cell_018
Gene_0001 2013.948828 0.000000 0.00000 2.64154 118.89733 2.069181
Gene_0002 5377.144161 0.000000 11.56233 0.00000 3175.25816 1392.558811
Gene_0003 0.000000 0.000000 0.00000 0.00000 0.00000 0.000000
Gene_0004 0.000000 3.334801 0.00000 2393.23554 2290.52213 2110.564617
Gene_0005 3.968372 0.000000 0.00000 0.00000 0.00000 4.138362
Gene_0006 3.968372 0.000000 3399.32534 0.00000 10.49094 0.000000
同樣的,我們也可以使用normalize函數(shù)進(jìn)行數(shù)據(jù)的歸一化處理,它將對原始的count矩陣進(jìn)行一個(gè)log2的轉(zhuǎn)換處理。This is done by dividing each count by its size factor (or scaled library size, if no size factors are defined), adding a pseudo-count and log-transforming. 歸一化后的結(jié)果存儲在"logcounts" Assay中,可以通過logcounts函數(shù)進(jìn)行訪問。
# 使用normalize函數(shù)進(jìn)行數(shù)據(jù)歸一化
example_sce <- normalize(example_sce)
# 查看assay的信息
assayNames(example_sce)
[1] "counts" "cpm" "logcounts"
head(logcounts(example_sce))
我們可以使用calcAverage函數(shù)計(jì)算基因的平均表達(dá)量
head(calcAverage(example_sce))
Gene_0001 Gene_0002 Gene_0003 Gene_0004 Gene_0005 Gene_0006
305.551749 325.719897 183.090462 162.143201 1.231123 187.167913
使用其他的方法導(dǎo)入基因表達(dá)矩陣
- 對于CSV格式存儲的基因表達(dá)矩陣,我們可以通過
read.table()函數(shù)或data.table包中的fread()函數(shù)進(jìn)行讀取。 - 對于一些大型的數(shù)據(jù)集,在讀取的過程中會產(chǎn)生大量的緩存,需要較大的內(nèi)存,因此我們可以通過Matrix包中的
readSparseCounts()函數(shù)讀取大型數(shù)據(jù)集,并將其存儲為一個(gè)稀疏矩陣,可以有效減小系統(tǒng)的讀取內(nèi)存。 - 對于來自10x Genomics產(chǎn)生的表達(dá)矩陣,我們可以通過DropletUtils包中的
read10xCounts()函數(shù)進(jìn)行讀取,讀取后它會自動生成一個(gè)SingleCellExperiment對象 - 對于kallisto和Salmon等比對軟件產(chǎn)生的基因表達(dá)矩陣,我們可以通過tximeta包中的
readSalmonResults()和readKallistoResults()函數(shù)進(jìn)行讀取。
參考來源:
http://www.bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/overview.html