免疫組庫數(shù)據(jù)分析||immunarch教程:GeneUsage分析

immunarch — Fast and Seamless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires in R

前情提要:
10× Genomics單細(xì)胞免疫組庫VDJ分析必知必會
免疫組庫數(shù)據(jù)分析||immunarch教程:克隆型分析
免疫組庫數(shù)據(jù)分析||immunarch教程:探索性數(shù)據(jù)分析
免疫組庫數(shù)據(jù)分析||immunarch教程:載入10X數(shù)據(jù)
免疫組庫數(shù)據(jù)分析||immunarch教程:快速開始

今天,我們繼續(xù)我們的免疫組庫數(shù)據(jù)分析的Demos,這一次我們來談?wù)凣eneUsage分析。像我這樣剛?cè)腴T免疫組庫的人首先會問什么是GeneUsage?

我不由自主地拿出了周光炎老師的《免疫學(xué)原理》:

免疫組庫的多樣性是由VDJ基因重排帶來的,這里的VDJ是區(qū)域名稱,每個區(qū)域又有基因,雖然這些基因的名稱帶著明顯的區(qū)域的印記。geneUsage指的就是這些基因的出現(xiàn)頻率(次數(shù))。在文章中它出現(xiàn)的形式是這樣的:

Analysis of TRA variable (TRAV) and TRB variable (TRBV) gene segment usage among analyzed samples. a?d Analysis of (a, b) TRAV and (c, d) TRBV gene segment usage, including (a, c) heatmap representation of TRAV and TRBV gene usage across samples and (~) frequency of observed TRAV and TRBV gene usages among the samples^{[1]}

下面開始immunarch的表演。

ibrary(immunarch); data(immdata)       # Load the package and the test dataset
?geneUsage

immunarch提供了一個基因片段數(shù)據(jù)表,包含幾個物種的已知基因片段數(shù)量,按照IMGT的命名法。調(diào)用gene_stats()函數(shù)可以得到基因的當(dāng)前統(tǒng)計(jì)信息:

gene_stats()
    alias                 species ighd ighj ighv igij igkj igkv iglj iglv traj trav trbd trbj trbv trdd
1      bt               BosTaurus   21    4   25    0    1    6    5   26   46    0    0    0    0    5
2      cd      CamelusDromedarius    0    0    0    0    0    0    0    0    0    0    0    0    0    0
3     clf    CanisLupusFamiliaris    0    0    0    0    0    0    0    0    0    0    2    8   19    0
4      dr              DanioRerio    7    7    0    3    0    0    0    0    0    0    0    0    0    0
5      hs             HomoSapiens   30   13  248    0    5   64    7   69   57   60    3   14   64    3
6  macmul           MacacaMulatta   24    7   19    0    4   83    5    0    0    0    2   15   58    0
7     mmc    MusMusculusCastaneus    0    0    0    0    0    4    0    0    0    0    0    0    0    0
8     mmd   MusMusculusDomesticus    0    0    0    0    0    2    0    0    0    0    0    0    0    0
9  musmus             MusMusculus   32    8  225    0    8  109    3    5   42  145    2   14   23    2
10     oa OrnithorhynchusAnatinus    3   10    0    0    0    0    0    0    0    0    0    0    0    0
11     oc    OryctolagusCuniculus   10   11   39    0    8   26    2   20    0    0    0    0    0    0
12     om      OncorhynchusMykiss    9    7    6    0    0    0    0    0    0    0    1    9    0    0
13     rn        RattusNorvegicus   30    4  113    0    6  132    2    8    0    0    0    0    0    0
14   smth   MusMusculusMolossinus    0    0    0    0    0    1    0    0    0    0    0    0    0    0
15   smth     MusMusculusMusculus    0    0    0    0    0    1    0    0    0    0    0    0    0    0
16   smth              MusSpretus    0    0    0    0    0    2    0    2    0    0    0    0    0    0
17     ss               SusScrofa    5    5   15    0    8   19    4   14    0    0    0    0    0    0
   trdj trdv trgj trgv
1     3    0    6   15
2     0    7    2    2
3     0    0    7    8
4     0    0    0    0
5     4    6    4   10
6     0    0    0    0
7     0    0    0    0
8     0    0    0    0
9     3    7    0   11
10    0    0    0    0
11    0    0    0    0
12    0    0    0    0
13    0    0    0    0
14    0    0    0    0
15    0    0    0    0
16    0    0    0    0
17    0    0    0    0

我們來驗(yàn)證一下,打開IMGT 官網(wǎng): http://www.imgt.org/找到http://www.imgt.org/IMGTrepertoire/LocusGenes/#H

點(diǎn)進(jìn)去我們就看到:

HomoSapiens 的 trdd 確實(shí)是有3個。

為了計(jì)算基因的分布,immunarch包含了geneUsage函數(shù)。它接收到一個或多個免疫庫(我們讀進(jìn)來的數(shù)據(jù)對象)作為輸入,以及你想要得到統(tǒng)計(jì)數(shù)據(jù)的基因和物種。例如,如果你計(jì)劃使用智人的TRBV基因,你需要使用hs.trbv,函數(shù)中的trbv字符串,其中hs來自別名列,trbv是基因名稱。如果你計(jì)劃使用Mus musmus_ighj基因,你需要使用musmus_ighj。

# Next four function calls are equal. "hs" is from the "alias" column.
imm_gu <- geneUsage(immdata$data, "hs.trbv")
imm_gu

# A tibble: 48 x 13
   Names    `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191` `A4-i192`   MS1   MS2   MS3   MS4   MS5   MS6
   <chr>        <int>     <int>     <int>     <int>     <int>     <int> <int> <int> <int> <int> <int> <int>
 1 TRBV10-1        24        28        NA        16        29         6    19     4    43    19    13    21
 2 TRBV10-2        42        60         8        29        28        16    25    35    63    39    22    43
 3 TRBV10-3       230       282       135       108       376       215   195   142   304   262   179   442
 4 TRBV11-1        21        14        26        17        17        16    14    10    16    32    14    20
 5 TRBV11-2       183       172       125       161        95       113    94   105   174   174   122    94
 6 TRBV11-3         8        11         5        24         2         7     4    13    13    13     3    32
 7 TRBV12-4       603       459       313       433       333       557   406   606   452   646   537   951
 8 TRBV12-5        37        54         8        38        18        17     7    17    25    32    16    24
 9 TRBV13          44        53        45        78        29        43    39    28    31    33    28    47
10 TRBV14          65        91        48        73        40        30    23    94    21    84    26     7
# ... with 38 more rows

imm_gu 不就是一個豐度表嗎?行是基因,列是樣本,然后我們可以走我們的那一套了,pca啦,聚類啦,巴拉巴拉。

immunarch 提供了方便的可視化函數(shù),與ggplot2無縫銜接。

p1 <- vis(imm_gu[c(1, 2)])
 p2 <- vis(imm_gu[c(1, 2)]) + coord_polar()
p3<- vis(imm_gu[c(1, 2)]) + coord_flip() + theme_bw()

library(patchwork)
p1 + p2 +p3 

當(dāng)然這個函數(shù)的細(xì)節(jié)還是有一些需要注意的,就是你到底要計(jì)算什么?;蚍植伎梢酝ㄟ^單個克隆型的計(jì)數(shù)來計(jì)算。quant = "count")或不使用它們(quant = NA)。要計(jì)算等位基因水平或家族水平分布,請更改.type參數(shù)。Parameter .norm控制immunarch是否將數(shù)據(jù)歸一化,以確保所有頻率的和是否等于1。

為了明確其計(jì)算的細(xì)節(jié),我們用debug(geneUsage)來進(jìn)入函數(shù)內(nèi)部一探究竟。當(dāng)然debug(geneUsage)是一個動態(tài)過程,我們僅演示主要結(jié)果。

debug(geneUsage)
imm_gu <- geneUsage(immdata$data, "hs.trbv")


Browse[3]> .gene
 [1] "TRBV10-1" "TRBV10-2" "TRBV10-3" "TRBV11-1" "TRBV11-2" "TRBV11-3" "TRBV12-3" "TRBV12-4" "TRBV12-5"
[10] "TRBV13"   "TRBV14"   "TRBV15"   "TRBV16"   "TRBV18"   "TRBV19"   "TRBV2"    "TRBV20-1" "TRBV24-1"
[19] "TRBV25-1" "TRBV27"   "TRBV28"   "TRBV29-1" "TRBV3-1"  "TRBV30"   "TRBV4-1"  "TRBV4-2"  "TRBV4-3" 
[28] "TRBV5-1"  "TRBV5-4"  "TRBV5-5"  "TRBV5-6"  "TRBV5-8"  "TRBV6-1"  "TRBV6-2"  "TRBV6-3"  "TRBV6-4" 
[37] "TRBV6-5"  "TRBV6-6"  "TRBV6-8"  "TRBV6-9"  "TRBV7-2"  "TRBV7-3"  "TRBV7-4"  "TRBV7-6"  "TRBV7-7" 
[46] "TRBV7-8"  "TRBV7-9"  "TRBV9"   
Browse[3]> 
debug: .data[[gene_col]] <- return_segments(.data[[gene_col]])
Browse[3]> 
debug: if (has_class(.data, "data.table")) {
    .data <- .data %>% lazy_dt()
}
Browse[3]> 
debug: dataset <- .data %>% select(Gene = gene_col, IMMCOL$count) %>% 
    group_by(Gene) %>% rename(Quant = IMMCOL$count)
Browse[3]> 
debug: if (is.na(.quant)) {
    dataset <- dataset %>% summarise(count = n())
} else {
    dataset <- dataset %>% summarise(count = sum(Quant))
}
Browse[3]> 
debug: dataset <- dataset %>% summarise(count = n())
Browse[3]> 
debug: dataset <- dataset %>% collect(n = Inf)
Browse[3]> dataset
# A tibble: 48 x 2
   Gene     count
   <chr>    <int>
 1 TRBV10-1    28
 2 TRBV10-2    60
 3 TRBV10-3   282
 4 TRBV11-1    14
 5 TRBV11-2   172
 6 TRBV11-3    11
 7 TRBV12-4   459
 8 TRBV12-5    54
 9 TRBV13      53
10 TRBV14      91
# ... with 38 more rows
Browse[3]> .data
# A tibble: 6,553 x 15
   Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins
    <dbl>      <dbl> <chr>   <chr>   <chr>  <chr>  <chr>  <int>   <int> <int>   <int>  <dbl>  <dbl>  <dbl>
 1    111    0.0131  TGCAGT~ CSASRG~ TRBV2~ TRBD1  TRBJ2~    11      12    17      25     -1      0      7
 2     93    0.0109  TGTGCC~ CASSVA~ TRBV9  TRBD1  TRBJ2~    15      21    23      24     -1      5      0
 3     66    0.00776 TGTGCC~ CASSRM~ TRBV13 TRBD1  TRBJ2~    11      18    24      32     -1      6      7
 4     59    0.00694 TGTGCC~ CASSPT~ TRBV6~ TRBD2  TRBJ2~    10      14    19      33     -1      3     13
 5     57    0.00671 TGCGCC~ CASSLD~ TRBV5~ TRBD2  TRBJ1~    15      17    20      26     -1      1      5
 6     47    0.00553 TGTGCC~ CASRGL~ TRBV6~ TRBD2  TRBJ2~    10      11    16      20     -1      0      3
 7     46    0.00541 TGCAGC~ CSVTGV~ TRBV2~ TRBD1  TRBJ2~     8       9    13      20     -1      0      6
 8     30    0.00353 TGTGCC~ CASSYL~ TRBV6~ TRBD2  TRBJ1~    15      17    19      20     -1      1      0
 9     29    0.00341 TGTGCC~ CASSLA~ TRBV5~ TRBD1  TRBJ1~    15      21    26      32     -1      5      5
10     29    0.00341 TGTGCC~ CASSYI~ TRBV6~ TRBD1  TRBJ1~    14      17    20      25     -1      2      4
# ... with 6,543 more rows, and 1 more variable: Sequence <lgl>
Browse[3]> gene_col
[1] "V.name"
Browse[3]> IMMCOL$count
[1] "Clones"
Browse[3]> dataset
# A tibble: 48 x 2
   Gene     count
   <chr>    <int>
 1 TRBV10-1    28
 2 TRBV10-2    60
 3 TRBV10-3   282
 4 TRBV11-1    14
 5 TRBV11-2   172
 6 TRBV11-3    11
 7 TRBV12-4   459
 8 TRBV12-5    54
 9 TRBV13      53
10 TRBV14      91
# ... with 38 more rows
Browse[3]> .data$V.name[1]
[1] "TRBV20-1"
Browse[3]> dataset %>% filter(Gene == 'TRBV20-1')
# A tibble: 1 x 2
  Gene     count
  <chr>    <int>
1 TRBV20-1   570
Browse[3]> .data  %>% filter(V.name == 'TRBV20-1')
# A tibble: 570 x 15
   Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins
    <dbl>      <dbl> <chr>   <chr>   <chr>  <chr>  <chr>  <int>   <int> <int>   <int>  <dbl>  <dbl>  <dbl>
 1    111   0.0131   TGCAGT~ CSASRG~ TRBV2~ TRBD1  TRBJ2~    11      12    17      25     -1      0      7
 2     11   0.00129  TGCAGT~ CSATGL~ TRBV2~ TRBD2  TRBJ2~     9      12    18      23     -1      2      4
 3      7   0.000824 TGCAGT~ CSARPG~ TRBV2~ TRBD1  TRBJ1~    11      14    19      24     -1      2      4
 4      6   0.000706 TGCAGT~ CSARGG~ TRBV2~ TRBD2  TRBJ2~    12      15    23      24     -1      2      0
 5      5   0.000588 TGCAGT~ CSARDG~ TRBV2~ TRBD2  TRBJ1~    13      16    20      21     -1      2      0
 6      5   0.000588 TGCAGT~ CSGASG~ TRBV2~ TRBD1  TRBJ2~     6       7    10      23     -1      0     12
 7      4   0.000471 TGCAGT~ CSVGGG~ TRBV2~ TRBD2  TRBJ2~     6      14    20      22     -1      7      1
 8      4   0.000471 TGCAGT~ CSARDR~ TRBV2~ TRBD1  TRBJ1~    13      14    18      19     -1      0      0
 9      4   0.000471 TGCAGT~ CSASQF~ TRBV2~ TRBD1  TRBJ2~    10      12    14      17     -1      1      2
10      4   0.000471 TGCAGT~ CSAGDR~ TRBV2~ TRBD1  TRBJ1~     7      12    17      21     -1      4      3
# ... with 560 more rows, and 1 more variable: Sequence <lgl>
Browse[3]> .data  %>% filter(V.name == 'TRBV20-1') %>% dim()
[1] 570  15
Browse[3]> .data  %>% filter(V.name == 'TRBV10-1') %>% dim()
[1] 28 15
Browse[3]> .data  %>% filter(V.name == 'TRBV10-1') %>% DT::datatable()



undebug(geneUsage)

大家看到這個28 的來源了嗎? 你可以從不同的角度來觀察基因使用的直方圖:

imm_gu <- geneUsage(immdata$data[c(1, 2)], "hs.trbv", .norm = T)
vis(imm_gu)
imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T)
vis(imm_gu, .grid = T)

如果樣本有分組信息,還可以做組間的差異統(tǒng)計(jì):

imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T
vis(imm_gu, .by = "Status", .meta = immdata$meta)
vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box")

由于一些克隆型的基因序列不明確,geneUsage有以下選項(xiàng)來處理不明確的數(shù)據(jù),也許這些不明確的基因才是潛藏在數(shù)據(jù)中的瑰寶啊。

  • .ambig = "inc" - includes all possible combinations of ambiguous gene alignments from the data. NOTE: ImmunoSEQ formats use non-standart gene segment names, so it is preferable to use this argument value with ImmunoSEQ formats. This argument is ON by default to ease the gene manipulation. Feel free to change it to "exc" in case of other data formats. It is ON by default, we recommend it to leave it that way.
  • .ambig = "exc" - filters out all clonotypes with ambiguous gene alignments.
  • .ambig = "wei" - introduces weighted approach (divides by n (1/n) the frequency for each entry of the corresponding gene if there are n genes for a clonotype).
  • .ambig = "maj" - chooses only the first gene segment.

作為福利,我們在文章中經(jīng)??吹竭@樣的V-J基因的圈圖,其實(shí)也是一種桑基圖啦:

咋做?

做一個不成熟的?;鶊D:

library(ggforce)
immdata$data$`A2-i129` %>%
    gather_set_data(c(6,7,5)) %>%
    ggplot(aes(x, id = id, split = y, value = 1))  +
    geom_parallel_sets(aes(fill = J.name), show.legend = FALSE, alpha = 0.3) +
    geom_parallel_sets_axes(axis.width = 0.1, color = "lightgrey", fill = "white") +
    geom_parallel_sets_labels(angle = 0) +
    theme_no_axes()

VDJ之間的流向:

然后是一個不成熟的圈圖:

library(sankeywheel)
sankeywheel(
    from = immdata$data$`A2-i129`$V.name, to = immdata$data$`A2-i129`$J.name,
    weight = immdata$data$`A2-i129`$Clones,#type = "sankey",
    title = " circus plots",
    subtitle = "j_gene &  v_gene",
    seriesName = "", width = "100%", height = "600px"
)

關(guān)鍵還是要理解這里面的數(shù)據(jù)結(jié)構(gòu)和算法及其背后的生物學(xué)意義啊。咦,我們?yōu)槭裁匆鰃eneUsage?

因?yàn)間eneUsage也是一種異質(zhì)性啊,主要是可以找到哪些geneUsage較高或者受到抑制,這樣我們可以對癥下藥啊。一如文章所言:

CDR3 sequence length distribution, amino acid conservation and gene usage variability for ATL patients resembled those of peripheral blood cells from ACs and healthy donors. Thus, determining monoclonal architecture and clonal diversity by RNA sequencing might be useful for prognostic purposes and for personalizing ATL diagnosis and assessment of treatments.


[1] Farmanbar, A., Kneller, R. & Firouzi, S. RNA sequencing identifies clonal structure of T-cell repertoires in patients with adult T-cell leukemia/lymphoma. npj Genom. Med. 4, 10 (2019). https://doi.org/10.1038/s41525-019-0084-9

https://immunarch.com/articles/web_only/v5_gene_usage.html

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容