immunarch — Fast and Seamless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires in R
前情提要:
10× Genomics單細(xì)胞免疫組庫VDJ分析必知必會
免疫組庫數(shù)據(jù)分析||immunarch教程:克隆型分析
免疫組庫數(shù)據(jù)分析||immunarch教程:探索性數(shù)據(jù)分析
免疫組庫數(shù)據(jù)分析||immunarch教程:載入10X數(shù)據(jù)
免疫組庫數(shù)據(jù)分析||immunarch教程:快速開始
今天,我們繼續(xù)我們的免疫組庫數(shù)據(jù)分析的Demos,這一次我們來談?wù)凣eneUsage分析。像我這樣剛?cè)腴T免疫組庫的人首先會問什么是GeneUsage?
我不由自主地拿出了周光炎老師的《免疫學(xué)原理》:

免疫組庫的多樣性是由VDJ基因重排帶來的,這里的VDJ是區(qū)域名稱,每個區(qū)域又有基因,雖然這些基因的名稱帶著明顯的區(qū)域的印記。geneUsage指的就是這些基因的出現(xiàn)頻率(次數(shù))。在文章中它出現(xiàn)的形式是這樣的:

Analysis of TRA variable (TRAV) and TRB variable (TRBV) gene segment usage among analyzed samples. a?d Analysis of (a, b) TRAV and (c, d) TRBV gene segment usage, including (a, c) heatmap representation of TRAV and TRBV gene usage across samples and (~) frequency of observed TRAV and TRBV gene usages among the samples
下面開始immunarch的表演。
ibrary(immunarch); data(immdata) # Load the package and the test dataset
?geneUsage
immunarch提供了一個基因片段數(shù)據(jù)表,包含幾個物種的已知基因片段數(shù)量,按照IMGT的命名法。調(diào)用gene_stats()函數(shù)可以得到基因的當(dāng)前統(tǒng)計(jì)信息:
gene_stats()
alias species ighd ighj ighv igij igkj igkv iglj iglv traj trav trbd trbj trbv trdd
1 bt BosTaurus 21 4 25 0 1 6 5 26 46 0 0 0 0 5
2 cd CamelusDromedarius 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 clf CanisLupusFamiliaris 0 0 0 0 0 0 0 0 0 0 2 8 19 0
4 dr DanioRerio 7 7 0 3 0 0 0 0 0 0 0 0 0 0
5 hs HomoSapiens 30 13 248 0 5 64 7 69 57 60 3 14 64 3
6 macmul MacacaMulatta 24 7 19 0 4 83 5 0 0 0 2 15 58 0
7 mmc MusMusculusCastaneus 0 0 0 0 0 4 0 0 0 0 0 0 0 0
8 mmd MusMusculusDomesticus 0 0 0 0 0 2 0 0 0 0 0 0 0 0
9 musmus MusMusculus 32 8 225 0 8 109 3 5 42 145 2 14 23 2
10 oa OrnithorhynchusAnatinus 3 10 0 0 0 0 0 0 0 0 0 0 0 0
11 oc OryctolagusCuniculus 10 11 39 0 8 26 2 20 0 0 0 0 0 0
12 om OncorhynchusMykiss 9 7 6 0 0 0 0 0 0 0 1 9 0 0
13 rn RattusNorvegicus 30 4 113 0 6 132 2 8 0 0 0 0 0 0
14 smth MusMusculusMolossinus 0 0 0 0 0 1 0 0 0 0 0 0 0 0
15 smth MusMusculusMusculus 0 0 0 0 0 1 0 0 0 0 0 0 0 0
16 smth MusSpretus 0 0 0 0 0 2 0 2 0 0 0 0 0 0
17 ss SusScrofa 5 5 15 0 8 19 4 14 0 0 0 0 0 0
trdj trdv trgj trgv
1 3 0 6 15
2 0 7 2 2
3 0 0 7 8
4 0 0 0 0
5 4 6 4 10
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 3 7 0 11
10 0 0 0 0
11 0 0 0 0
12 0 0 0 0
13 0 0 0 0
14 0 0 0 0
15 0 0 0 0
16 0 0 0 0
17 0 0 0 0
我們來驗(yàn)證一下,打開IMGT 官網(wǎng): http://www.imgt.org/找到http://www.imgt.org/IMGTrepertoire/LocusGenes/#H

點(diǎn)進(jìn)去我們就看到:

HomoSapiens 的 trdd 確實(shí)是有3個。
為了計(jì)算基因的分布,immunarch包含了geneUsage函數(shù)。它接收到一個或多個免疫庫(我們讀進(jìn)來的數(shù)據(jù)對象)作為輸入,以及你想要得到統(tǒng)計(jì)數(shù)據(jù)的基因和物種。例如,如果你計(jì)劃使用智人的TRBV基因,你需要使用hs.trbv,函數(shù)中的trbv字符串,其中hs來自別名列,trbv是基因名稱。如果你計(jì)劃使用Mus musmus_ighj基因,你需要使用musmus_ighj。
# Next four function calls are equal. "hs" is from the "alias" column.
imm_gu <- geneUsage(immdata$data, "hs.trbv")
imm_gu
# A tibble: 48 x 13
Names `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191` `A4-i192` MS1 MS2 MS3 MS4 MS5 MS6
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 TRBV10-1 24 28 NA 16 29 6 19 4 43 19 13 21
2 TRBV10-2 42 60 8 29 28 16 25 35 63 39 22 43
3 TRBV10-3 230 282 135 108 376 215 195 142 304 262 179 442
4 TRBV11-1 21 14 26 17 17 16 14 10 16 32 14 20
5 TRBV11-2 183 172 125 161 95 113 94 105 174 174 122 94
6 TRBV11-3 8 11 5 24 2 7 4 13 13 13 3 32
7 TRBV12-4 603 459 313 433 333 557 406 606 452 646 537 951
8 TRBV12-5 37 54 8 38 18 17 7 17 25 32 16 24
9 TRBV13 44 53 45 78 29 43 39 28 31 33 28 47
10 TRBV14 65 91 48 73 40 30 23 94 21 84 26 7
# ... with 38 more rows
imm_gu 不就是一個豐度表嗎?行是基因,列是樣本,然后我們可以走我們的那一套了,pca啦,聚類啦,巴拉巴拉。
immunarch 提供了方便的可視化函數(shù),與ggplot2無縫銜接。
p1 <- vis(imm_gu[c(1, 2)])
p2 <- vis(imm_gu[c(1, 2)]) + coord_polar()
p3<- vis(imm_gu[c(1, 2)]) + coord_flip() + theme_bw()
library(patchwork)
p1 + p2 +p3

當(dāng)然這個函數(shù)的細(xì)節(jié)還是有一些需要注意的,就是你到底要計(jì)算什么?;蚍植伎梢酝ㄟ^單個克隆型的計(jì)數(shù)來計(jì)算。quant = "count")或不使用它們(quant = NA)。要計(jì)算等位基因水平或家族水平分布,請更改.type參數(shù)。Parameter .norm控制immunarch是否將數(shù)據(jù)歸一化,以確保所有頻率的和是否等于1。
為了明確其計(jì)算的細(xì)節(jié),我們用debug(geneUsage)來進(jìn)入函數(shù)內(nèi)部一探究竟。當(dāng)然debug(geneUsage)是一個動態(tài)過程,我們僅演示主要結(jié)果。
debug(geneUsage)
imm_gu <- geneUsage(immdata$data, "hs.trbv")
Browse[3]> .gene
[1] "TRBV10-1" "TRBV10-2" "TRBV10-3" "TRBV11-1" "TRBV11-2" "TRBV11-3" "TRBV12-3" "TRBV12-4" "TRBV12-5"
[10] "TRBV13" "TRBV14" "TRBV15" "TRBV16" "TRBV18" "TRBV19" "TRBV2" "TRBV20-1" "TRBV24-1"
[19] "TRBV25-1" "TRBV27" "TRBV28" "TRBV29-1" "TRBV3-1" "TRBV30" "TRBV4-1" "TRBV4-2" "TRBV4-3"
[28] "TRBV5-1" "TRBV5-4" "TRBV5-5" "TRBV5-6" "TRBV5-8" "TRBV6-1" "TRBV6-2" "TRBV6-3" "TRBV6-4"
[37] "TRBV6-5" "TRBV6-6" "TRBV6-8" "TRBV6-9" "TRBV7-2" "TRBV7-3" "TRBV7-4" "TRBV7-6" "TRBV7-7"
[46] "TRBV7-8" "TRBV7-9" "TRBV9"
Browse[3]>
debug: .data[[gene_col]] <- return_segments(.data[[gene_col]])
Browse[3]>
debug: if (has_class(.data, "data.table")) {
.data <- .data %>% lazy_dt()
}
Browse[3]>
debug: dataset <- .data %>% select(Gene = gene_col, IMMCOL$count) %>%
group_by(Gene) %>% rename(Quant = IMMCOL$count)
Browse[3]>
debug: if (is.na(.quant)) {
dataset <- dataset %>% summarise(count = n())
} else {
dataset <- dataset %>% summarise(count = sum(Quant))
}
Browse[3]>
debug: dataset <- dataset %>% summarise(count = n())
Browse[3]>
debug: dataset <- dataset %>% collect(n = Inf)
Browse[3]> dataset
# A tibble: 48 x 2
Gene count
<chr> <int>
1 TRBV10-1 28
2 TRBV10-2 60
3 TRBV10-3 282
4 TRBV11-1 14
5 TRBV11-2 172
6 TRBV11-3 11
7 TRBV12-4 459
8 TRBV12-5 54
9 TRBV13 53
10 TRBV14 91
# ... with 38 more rows
Browse[3]> .data
# A tibble: 6,553 x 15
Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 111 0.0131 TGCAGT~ CSASRG~ TRBV2~ TRBD1 TRBJ2~ 11 12 17 25 -1 0 7
2 93 0.0109 TGTGCC~ CASSVA~ TRBV9 TRBD1 TRBJ2~ 15 21 23 24 -1 5 0
3 66 0.00776 TGTGCC~ CASSRM~ TRBV13 TRBD1 TRBJ2~ 11 18 24 32 -1 6 7
4 59 0.00694 TGTGCC~ CASSPT~ TRBV6~ TRBD2 TRBJ2~ 10 14 19 33 -1 3 13
5 57 0.00671 TGCGCC~ CASSLD~ TRBV5~ TRBD2 TRBJ1~ 15 17 20 26 -1 1 5
6 47 0.00553 TGTGCC~ CASRGL~ TRBV6~ TRBD2 TRBJ2~ 10 11 16 20 -1 0 3
7 46 0.00541 TGCAGC~ CSVTGV~ TRBV2~ TRBD1 TRBJ2~ 8 9 13 20 -1 0 6
8 30 0.00353 TGTGCC~ CASSYL~ TRBV6~ TRBD2 TRBJ1~ 15 17 19 20 -1 1 0
9 29 0.00341 TGTGCC~ CASSLA~ TRBV5~ TRBD1 TRBJ1~ 15 21 26 32 -1 5 5
10 29 0.00341 TGTGCC~ CASSYI~ TRBV6~ TRBD1 TRBJ1~ 14 17 20 25 -1 2 4
# ... with 6,543 more rows, and 1 more variable: Sequence <lgl>
Browse[3]> gene_col
[1] "V.name"
Browse[3]> IMMCOL$count
[1] "Clones"
Browse[3]> dataset
# A tibble: 48 x 2
Gene count
<chr> <int>
1 TRBV10-1 28
2 TRBV10-2 60
3 TRBV10-3 282
4 TRBV11-1 14
5 TRBV11-2 172
6 TRBV11-3 11
7 TRBV12-4 459
8 TRBV12-5 54
9 TRBV13 53
10 TRBV14 91
# ... with 38 more rows
Browse[3]> .data$V.name[1]
[1] "TRBV20-1"
Browse[3]> dataset %>% filter(Gene == 'TRBV20-1')
# A tibble: 1 x 2
Gene count
<chr> <int>
1 TRBV20-1 570
Browse[3]> .data %>% filter(V.name == 'TRBV20-1')
# A tibble: 570 x 15
Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 111 0.0131 TGCAGT~ CSASRG~ TRBV2~ TRBD1 TRBJ2~ 11 12 17 25 -1 0 7
2 11 0.00129 TGCAGT~ CSATGL~ TRBV2~ TRBD2 TRBJ2~ 9 12 18 23 -1 2 4
3 7 0.000824 TGCAGT~ CSARPG~ TRBV2~ TRBD1 TRBJ1~ 11 14 19 24 -1 2 4
4 6 0.000706 TGCAGT~ CSARGG~ TRBV2~ TRBD2 TRBJ2~ 12 15 23 24 -1 2 0
5 5 0.000588 TGCAGT~ CSARDG~ TRBV2~ TRBD2 TRBJ1~ 13 16 20 21 -1 2 0
6 5 0.000588 TGCAGT~ CSGASG~ TRBV2~ TRBD1 TRBJ2~ 6 7 10 23 -1 0 12
7 4 0.000471 TGCAGT~ CSVGGG~ TRBV2~ TRBD2 TRBJ2~ 6 14 20 22 -1 7 1
8 4 0.000471 TGCAGT~ CSARDR~ TRBV2~ TRBD1 TRBJ1~ 13 14 18 19 -1 0 0
9 4 0.000471 TGCAGT~ CSASQF~ TRBV2~ TRBD1 TRBJ2~ 10 12 14 17 -1 1 2
10 4 0.000471 TGCAGT~ CSAGDR~ TRBV2~ TRBD1 TRBJ1~ 7 12 17 21 -1 4 3
# ... with 560 more rows, and 1 more variable: Sequence <lgl>
Browse[3]> .data %>% filter(V.name == 'TRBV20-1') %>% dim()
[1] 570 15
Browse[3]> .data %>% filter(V.name == 'TRBV10-1') %>% dim()
[1] 28 15
Browse[3]> .data %>% filter(V.name == 'TRBV10-1') %>% DT::datatable()
undebug(geneUsage)

大家看到這個28 的來源了嗎? 你可以從不同的角度來觀察基因使用的直方圖:
imm_gu <- geneUsage(immdata$data[c(1, 2)], "hs.trbv", .norm = T)
vis(imm_gu)

imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T)
vis(imm_gu, .grid = T)

如果樣本有分組信息,還可以做組間的差異統(tǒng)計(jì):
imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T
vis(imm_gu, .by = "Status", .meta = immdata$meta)

vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box")

由于一些克隆型的基因序列不明確,geneUsage有以下選項(xiàng)來處理不明確的數(shù)據(jù),也許這些不明確的基因才是潛藏在數(shù)據(jù)中的瑰寶啊。
- .ambig = "inc" - includes all possible combinations of ambiguous gene alignments from the data. NOTE: ImmunoSEQ formats use non-standart gene segment names, so it is preferable to use this argument value with ImmunoSEQ formats. This argument is ON by default to ease the gene manipulation. Feel free to change it to "exc" in case of other data formats. It is ON by default, we recommend it to leave it that way.
- .ambig = "exc" - filters out all clonotypes with ambiguous gene alignments.
- .ambig = "wei" - introduces weighted approach (divides by n (1/n) the frequency for each entry of the corresponding gene if there are n genes for a clonotype).
- .ambig = "maj" - chooses only the first gene segment.
作為福利,我們在文章中經(jīng)??吹竭@樣的V-J基因的圈圖,其實(shí)也是一種桑基圖啦:

咋做?
做一個不成熟的?;鶊D:
library(ggforce)
immdata$data$`A2-i129` %>%
gather_set_data(c(6,7,5)) %>%
ggplot(aes(x, id = id, split = y, value = 1)) +
geom_parallel_sets(aes(fill = J.name), show.legend = FALSE, alpha = 0.3) +
geom_parallel_sets_axes(axis.width = 0.1, color = "lightgrey", fill = "white") +
geom_parallel_sets_labels(angle = 0) +
theme_no_axes()
VDJ之間的流向:

然后是一個不成熟的圈圖:
library(sankeywheel)
sankeywheel(
from = immdata$data$`A2-i129`$V.name, to = immdata$data$`A2-i129`$J.name,
weight = immdata$data$`A2-i129`$Clones,#type = "sankey",
title = " circus plots",
subtitle = "j_gene & v_gene",
seriesName = "", width = "100%", height = "600px"
)

關(guān)鍵還是要理解這里面的數(shù)據(jù)結(jié)構(gòu)和算法及其背后的生物學(xué)意義啊。咦,我們?yōu)槭裁匆鰃eneUsage?
因?yàn)間eneUsage也是一種異質(zhì)性啊,主要是可以找到哪些geneUsage較高或者受到抑制,這樣我們可以對癥下藥啊。一如文章所言:
CDR3 sequence length distribution, amino acid conservation and gene usage variability for ATL patients resembled those of peripheral blood cells from ACs and healthy donors. Thus, determining monoclonal architecture and clonal diversity by RNA sequencing might be useful for prognostic purposes and for personalizing ATL diagnosis and assessment of treatments.
[1] Farmanbar, A., Kneller, R. & Firouzi, S. RNA sequencing identifies clonal structure of T-cell repertoires in patients with adult T-cell leukemia/lymphoma. npj Genom. Med. 4, 10 (2019). https://doi.org/10.1038/s41525-019-0084-9