R 數(shù)據(jù)可視化 —— circlize 基因組初始化

前言

圓形可視化廣泛應(yīng)用于基因組及其相關(guān)的組學(xué)領(lǐng)域中,能夠有效地展示高維基因組學(xué)數(shù)據(jù)。

在基因組數(shù)據(jù)中,通常是根據(jù)染色體進(jìn)行分類,x 軸對(duì)應(yīng)于基因組上的位置,也可以是其他類型的基因組數(shù)據(jù)

circlize 提供了一些專門的基因組繪圖函數(shù),讓基因組分析更加簡單方便,如:

  • circos.genomicTrack(): 添加軌跡和圖形
  • circos.genomicPoints(): 添加點(diǎn)
  • circos.genomicLines(): 添加線條或線段
  • circos.genomicRect(): 添加矩形
  • circos.genomicText(): 添加文本
  • circos.genomicLink(): 添加連接

這樣函數(shù)與基礎(chǔ)的繪制函數(shù)是類似的,只是接受的輸入數(shù)據(jù)格式不同,都是基于基礎(chǔ)的 circlize 繪圖函數(shù)實(shí)現(xiàn)的(如 circos.track(), circos.points() 等)。

輸入數(shù)據(jù)

基因組數(shù)據(jù)通常使用的是 BED 格式的文件,即前三列標(biāo)識(shí)某一基因組區(qū)域:染色體、起始位置、終止位置。

circlize 提供了一個(gè)簡單函數(shù) generateRandomBed() 來創(chuàng)建隨機(jī)的基因組數(shù)據(jù)。從人類基因組中均勻的生成基因組區(qū)域,區(qū)域的數(shù)量與染色體的大小成正比。

nrnc 參數(shù)用于控制需要?jiǎng)?chuàng)建的行列的數(shù)量,可能最后生成的行數(shù)不一定與 nr 相同,fun 參數(shù)用于接受自定義的值生成函數(shù)。

> bed <- generateRandomBed()
> head(bed)
   chr   start     end      value1
1 chr1  261327  520533  0.07617057
2 chr1  596180  606938  0.81289852
3 chr1  769058 1176608  0.61876561
4 chr1 1179719 1671784  0.21739949
5 chr1 1860787 2066114 -0.01665364
6 chr1 2183578 2277911  0.01477448
> # 設(shè)置行列數(shù)量
> bed <- generateRandomBed(nr = 200, nc = 4)
> nrow(bed)
[1] 205
> # 自定義值生成函數(shù)
> bed <- generateRandomBed(
+   nc = 2, fun = function(k) sample(letters, k, replace = TRUE)
+   )
> head(bed)
   chr   start     end value1 value2
1 chr1  154420  432520      o      q
2 chr1  621080  658294      w      g
3 chr1  923320  962390      b      t
4 chr1  964699 1202322      y      v
5 chr1 1336707 1405512      r      g
6 chr1 1455202 1534223      i      a

初始化

1. 染色體條帶初始化

cytoband 類型的數(shù)據(jù)是理想的輸入格式,其包含染色體的長度以及染色體條帶信息,能夠有效標(biāo)識(shí)染色體的位置

1.1 基本使用

如果是繪制人類基因組數(shù)據(jù),可以直接使用 circos.initializeWithIdeogram() 函數(shù),例如

circos.initializeWithIdeogram()
text(0, 0, "default", cex = 1)
circos.clear()

染色體名稱顯示的是純數(shù)字,但是其內(nèi)部的索引名稱還是帶有 chr 的字符串

> circos.info()
All your sectors:
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10" "chr11" "chr12" "chr13" "chr14"
[15] "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY" 

All your tracks:
[1] 1 2

Your current sector.index is chrY
Your current track.index is 2

circos.initializeWithIdeogram() 默認(rèn)使用的是 hg19 版的 cytoband 數(shù)據(jù),可以使用 species 參數(shù)來指定為 hg18 或其他物種

circos.initializeWithIdeogram(species = "hg18")
circos.initializeWithIdeogram(species = "mm10")

會(huì)自動(dòng)從網(wǎng)上下載對(duì)應(yīng)的數(shù)據(jù),如果提供的物種不存在,還是會(huì)從 UCSC 數(shù)據(jù)庫中下載染色體信息 chromInfo 文件,有染色體長度,但是不包含條帶信息

如果網(wǎng)絡(luò)受限或者 UCSC 中沒有對(duì)應(yīng)的數(shù)據(jù),則可以自己手動(dòng)創(chuàng)建數(shù)據(jù)框,或傳入存儲(chǔ)在本地的文件

cytoband.file <- system.file(
  package = "circlize", "extdata", "cytoBand.txt"
)
circos.initializeWithIdeogram(cytoband.file)

cytoband.df <- read.table(
  cytoband.file, colClasses = c(
    "character", "numeric", "numeric", "character", "character"), 
  sep = "\t"
)
circos.initializeWithIdeogram(cytoband.df)

注意,如果是從文件中讀取,需要指定每列的數(shù)據(jù)類型,并指定位置列為 numeric 類型,因?yàn)?read.table 會(huì)將數(shù)字作為 integer 類型,而基因組長度是較大的數(shù),會(huì)導(dǎo)致整數(shù)溢出

circos.intializeWithIdeogram() 默認(rèn)會(huì)讀取 cytoband 數(shù)據(jù)中的所有信息,使用 chromosome.index 參數(shù)可以選擇要顯示的染色體

circos.initializeWithIdeogram(
  chromosome.index = paste0("chr", c(3,5,2,8))
)
text(0, 0, "subset of chromosomes", cex = 1)
circos.clear()

如果沒有相應(yīng)的物種,且使用了 chromInfo 文件時(shí),會(huì)包含很多的短的 contigs,使用 chromosome.index 可以刪除一些不需要的 contigs

1.2 預(yù)定義軌跡

使用 circos.initializeWithIdeogram() 初始化圓形圖,會(huì)創(chuàng)建兩個(gè)軌跡,一個(gè)用于包含軸和染色體名稱,另一個(gè)軌跡用于繪制 ideogram,使用 plotType 可以控制需要繪制的軌跡

par(mfcol = c(1, 2))
circos.initializeWithIdeogram(plotType = c("axis", "labels"))
text(0, 0, "plotType = c('axis', 'labels')", cex = 1)
circos.clear()

circos.initializeWithIdeogram(plotType = NULL)
text(0, 0, "plotType = NULL", cex = 1)
circos.clear()

1.3 其他設(shè)置

與常規(guī)的圓形圖類似,可以使用 circos.par() 函數(shù)來控制圓形布局

par(mfcol = c(1, 2))
circos.par("start.degree" = 90)
circos.initializeWithIdeogram()
circos.clear()
text(0, 0, "'start.degree' = 90", cex = 1)

circos.par("gap.degree" = rep(c(2, 4), 12))
circos.initializeWithIdeogram()
circos.clear()
text(0, 0, "'gap.degree' = rep(c(2, 4), 12)", cex = 1)

2. 自定義染色體軌跡

circos.initializeWithIdeogram() 函數(shù)默認(rèn)會(huì)初始化圓形布局,并添加兩個(gè)軌跡,通過設(shè)置 plotType = NULL,只創(chuàng)建布局而不添加軌跡,我們就可以添加自定義圖形樣式

例如,我們?yōu)槿旧w設(shè)置不同的顏色,并將染色體名稱放置在單元格內(nèi)部

circos.initializeWithIdeogram(plotType = NULL)
circos.track(
  ylim = c(0, 1), 
  panel.fun = function(x, y) {
    chr = CELL_META$sector.index
    xlim = CELL_META$xlim
    ylim = CELL_META$ylim
    circos.rect(xlim[1], 0, xlim[2], 1, col = rand_color(1))
    circos.text(
      mean(xlim), mean(ylim), chr, cex = 0.7,
      col = "white", facing = "inside",
      niceFacing = TRUE
    )
  }, 
  track.height = 0.15, bg.border = NA
)
circos.clear()

3. 常規(guī)基因組類別初始化

染色體只是一種特殊的基因組分類,使用 circos.genomicInitialize() 可以初始化任意基因組分類的圓形布局,circos.initializeWithIdeogram() 函數(shù)也是基于 circos.genomicInitialize() 實(shí)現(xiàn)的。

circos.genomicInitialize() 函數(shù)也是接受數(shù)據(jù)框型的輸入數(shù)據(jù),數(shù)據(jù)必須至少包含三列,第一列代表基因組分類,后兩列為每種分類在基因組中的順序

例如,基因的位置信息

df <- data.frame(
  name  = c("TP53",  "TP63",    "TP73"),
  start = c(7565097, 189349205, 3569084),
  end   = c(7590856, 189615068, 3652765))
circos.genomicInitialize(df)

并不是說一個(gè)基因只能記錄為一行,也可以是多行,比如基因的轉(zhuǎn)錄本信息。我們讀取包中自帶的示例數(shù)據(jù),包含 TP53TP63TP73 三個(gè)基因的信息

> tp_family <- readRDS(system.file(
+   package = "circlize", "extdata", "tp_family_df.rds")
+   )
> head(tp_family)
  gene   start     end        transcript exon
1 TP53 7565097 7565332 ENST00000413465.2    7
2 TP53 7577499 7577608 ENST00000413465.2    6
3 TP53 7578177 7578289 ENST00000413465.2    5
4 TP53 7578371 7578554 ENST00000413465.2    4
5 TP53 7579312 7579590 ENST00000413465.2    3
6 TP53 7579700 7579721 ENST00000413465.2    2

繪制填充色來標(biāo)識(shí)三個(gè)基因

circos.genomicInitialize(tp_family)
circos.track(
  ylim = c(0, 1), 
  bg.col = c("#1f78b480", "#33a02c80", "#e31a1c80"), 
  bg.border = NA, track.height = 0.05
)

繪制基因的轉(zhuǎn)錄本

n <- max(tapply(
  tp_family$transcript, tp_family$gene, 
  function(x) length(unique(x)))
)
circos.genomicTrack(
  tp_family, ylim = c(0.5, n + 0.5), 
  panel.fun = function(region, value, ...) {
    all_tx = unique(value$transcript)
    for(i in seq_along(all_tx)) {
      l = value$transcript == all_tx[i]
      # 對(duì)于每個(gè)轉(zhuǎn)錄本
      current_tx_start = min(region[l, 1])
      current_tx_end = max(region[l, 2])
      circos.lines(
        c(current_tx_start, current_tx_end),
        c(n - i + 1, n - i + 1), col = "#CCCCCC"
      )
      circos.genomicRect(
        region[l, , drop = FALSE],
        ytop = n - i + 1 + 0.4,
        ybottom = n - i + 1 - 0.4,
        col = "orange",
        border = NA
      )
    }
  }, 
  bg.border = NA, track.height = 0.4
)
circos.clear()

4. 放大基因組

基因組區(qū)域的放大方式類似于前面介紹的,也是將需要放大的區(qū)域的數(shù)據(jù)提取出來,并設(shè)置不同的分類,然后添加到輸入數(shù)據(jù)中

extend_chromosomes <- function(bed, chromosome, prefix = "zoom_") {
  zoom_bed = bed[bed[[1]] %in% chromosome, , drop = FALSE]
  zoom_bed[[1]] = paste0(prefix, zoom_bed[[1]])
  rbind(bed, zoom_bed)
}

我們使用 read.cytoband() 函數(shù)從 UCSC 中下載并讀取 cytoband 數(shù)據(jù)

cytoband <- read.cytoband()
cytoband_df <- cytoband$df
chromosome <- cytoband$chromosome

xrange <- c(cytoband$chr.len, cytoband$chr.len[c("chr1", "chr2")])
normal_chr_index <- 1:24
zoomed_chr_index <- 25:26

# 設(shè)置寬度
sector.width <- c(
  xrange[normal_chr_index] / sum(xrange[normal_chr_index]), 
  xrange[zoomed_chr_index] / sum(xrange[zoomed_chr_index])
) 

繪制圖形

circos.par(start.degree = 90)
circos.initializeWithIdeogram(
  extend_chromosomes(cytoband_df, c("chr1", "chr2")), 
  sector.width = sector.width)

添加一個(gè)新的軌跡

bed <- generateRandomBed(500)
circos.genomicTrack(
  extend_chromosomes(bed, c("chr1", "chr2")),
  panel.fun = function(region, value, ...) {
    circos.genomicPoints(
      region, value, pch = 16, 
      cex = 0.3, col = 'blue')
    }
)

添加連接

circos.link(
  "chr1", get.cell.meta.data("cell.xlim", sector.index = "chr1"),
  "zoom_chr1", get.cell.meta.data("cell.xlim", sector.index = "zoom_chr1"),
  col = "#33a02c20", border = NA)
circos.clear()

5. 合并兩個(gè)基因組

在某些情況下,可能想要將多個(gè)基因組繪制在同一個(gè)圓形圖中,例如,我們可以將人類與小鼠的基因組組合起來

首先,獲取兩個(gè)基因組的 cytoband 數(shù)據(jù)

human_cytoband <- read.cytoband(species = "hg19")$df
mouse_cytoband <- read.cytoband(species = "mm10")$df

注意,要區(qū)分兩個(gè)基因組的染色體名稱,給它們加上一個(gè)前綴

human_cytoband[ ,1] <- paste0("human_", human_cytoband[, 1])
mouse_cytoband[ ,1] <- paste0("mouse_", mouse_cytoband[, 1])

然后,合并兩個(gè)數(shù)據(jù)

cytoband <- rbind(human_cytoband, mouse_cytoband)

> head(cytoband)
          V1       V2       V3     V4     V5
1 human_chr1        0  2300000 p36.33   gneg
2 human_chr1  2300000  5400000 p36.32 gpos25
3 human_chr1  5400000  7200000 p36.31   gneg
4 human_chr1  7200000  9200000 p36.23 gpos25
5 human_chr1  9200000 12700000 p36.22   gneg
6 human_chr1 12700000 16200000 p36.21 gpos50

通過設(shè)置 chromosome.index 參數(shù)的值,讓兩個(gè)基因組的 1 號(hào)染色體放置在相鄰的位置

chromosome.index <- c(
  paste0("human_chr", c(1:22, "X", "Y")), 
  rev(paste0("mouse_chr", c(1:19, "X", "Y")))
)
circos.initializeWithIdeogram(
  cytoband, chromosome.index = chromosome.index
)
circos.clear()

染色體數(shù)據(jù)太多了,使得圖像比較臃腫,為了讓圖形更加美觀,我們關(guān)閉染色體名稱和軸刻度的顯示,只簡單的顯示染色體號(hào),并添加染色體分組顏色和間距

# 兩組之間 5 度間距
circos.par(
  gap.after = c(rep(1, 23), 5, rep(1, 20), 5)
)
circos.initializeWithIdeogram(
  cytoband, plotType = NULL, 
  # 染色體順序
  chromosome.index = chromosome.index
)
# 添加染色體號(hào)
circos.track(
  ylim = c(0, 1), track.height = mm_h(1), 
  panel.fun = function(x, y) {
    circos.text(
      CELL_META$xcenter,
      CELL_META$ylim[2] + mm_y(2),
      gsub(".*chr", "", CELL_META$sector.index),
      cex = 0.6,
      niceFacing = TRUE
    )
  }, 
  cell.padding = c(0, 0, 0, 0), bg.border = NA
)

highlight.chromosome(
  paste0("human_chr", c(1:22, "X", "Y")), 
  col = "#66c2a5", track.index = 1
)
highlight.chromosome(
  paste0("mouse_chr", c(1:19, "X", "Y")), 
  col = "#fc8d62", track.index = 1
)
# 繪制 ideogram
circos.genomicIdeogram(cytoband)
circos.clear()

同樣地,對(duì)于不包含條帶信息,只有染色體范圍的輸入數(shù)據(jù),我們也可以進(jìn)行組合。

首先,使用 read.chromInfo() 來獲取染色體范圍信息,然后合并兩份數(shù)據(jù)

human_chromInfo <- read.chromInfo(species = "hg19")$df
mouse_chromInfo <- read.chromInfo(species = "mm10")$df
human_chromInfo[ ,1] <- paste0("human_", human_chromInfo[, 1])
mouse_chromInfo[ ,1] <- paste0("mouse_", mouse_chromInfo[, 1])
chromInfo <- rbind(human_chromInfo, mouse_chromInfo)
# 控制染色體的順序
chromInfo[, 1] <- factor(chromInfo[ ,1], levels = chromosome.index)
> head(chromInfo)
         chr start       end
1 human_chr1     0 249250621
2 human_chr2     0 243199373
3 human_chr3     0 198022430
4 human_chr4     0 191154276
5 human_chr5     0 180915260
6 human_chr6     0 171115067

使用 genomicInitialize() 來初始化布局,并添加圖形

circos.par(gap.after = c(rep(1, 23), 5, rep(1, 20), 5))
circos.genomicInitialize(chromInfo, plotType = NULL)
circos.track(
  ylim = c(0, 1),
  panel.fun = function(x, y) {
    circos.text(
      CELL_META$xcenter,
      CELL_META$ylim[2] + mm_y(2),
      gsub(".*chr", "", CELL_META$sector.index),
      cex = 0.6,
      niceFacing = TRUE
    )
  },
  track.height = mm_h(1),
  cell.padding = c(0, 0, 0, 0),
  bg.border = NA
)
highlight.chromosome(
  paste0("human_chr", c(1:22, "X", "Y")), 
  col = "#66c2a5", track.index = 1
)
highlight.chromosome(
  paste0("mouse_chr", c(1:19, "X", "Y")), 
  col = "#fc8d62", track.index = 1
)
# 添加空白軌跡
circos.track(ylim = c(0, 1))
circos.clear()

我們可以為圖形添加更多的軌跡以及連接,讓圖形看起來更加的充實(shí)

# 設(shè)置間距
circos.par(gap.after = c(rep(1, 23), 5, rep(1, 20), 5))
# 初始化布局,不添加圖形
circos.genomicInitialize(chromInfo, plotType = NULL)
# 添加數(shù)字染色體號(hào)
circos.track(
  ylim = c(0, 1),
  panel.fun = function(x, y) {
    circos.text(
      CELL_META$xcenter,
      CELL_META$ylim[2] + mm_y(2),
      gsub(".*chr", "", CELL_META$sector.index),
      cex = 0.6,
      niceFacing = TRUE
    )
  },
  track.height = mm_h(1),
  cell.padding = c(0, 0, 0, 0),
  bg.border = NA
)
# 添加分組顏色軌跡
highlight.chromosome(
  paste0("human_chr", c(1:22, "X", "Y")),
  col = "#66c2a5", track.index = 1
)
highlight.chromosome(
  paste0("mouse_chr", c(1:19, "X", "Y")),
  col = "#fc8d62", track.index = 1
)

# 添加 ideogram
circos.genomicIdeogram(cytoband)

# 創(chuàng)建隨機(jī)數(shù)據(jù)
human_df <- generateRandomBed(200, species = "hg19")
mouse_df <- generateRandomBed(200, species = "mm10")
human_df[, 1] <- paste0("human_", human_df[, 1])
mouse_df[, 1] <- paste0("mouse_", mouse_df[, 1])
df <- rbind(human_df, mouse_df)
# 添加點(diǎn)圖
circos.genomicTrack(
  df,
  panel.fun = function(region, value, ...) {
    circos.genomicPoints(region, value, col = rand_color(1), cex = 0.5, ...)
  }
)

# 添加人類與小鼠基因組之間的連接
human_mid <- data.frame(
  chr = paste0("human_chr", 1:19),
  mid = round((human_chromInfo[1:19, 2] + human_chromInfo[1:19, 3]) / 2)
)
mouse_mid <- data.frame(
  chr = paste0("mouse_chr", 1:19),
  mid = round((mouse_chromInfo[1:19, 2] + mouse_chromInfo[1:19, 3]) / 2)
)
circos.genomicLink(human_mid, mouse_mid, col = rand_color(19))
circos.clear()
# 添加注釋
text(-0.9,-0.8, "Human\ngenome")
text(0.9, 0.8, "Mouse\ngenome")

完整代碼:https://github.com/dxsbiocc/learn/blob/main/R/plot/genomic_human_mouse.R

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容