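R Data Visualization: Set Visualization with UpSetR

Introduction

In the previous section we showed how to draw Venn diagrams to display the overlap between sets.

As the number of sets grows, however, a Venn diagram becomes increasingly cluttered, and its information is hard to read at a glance.

Today we look at how to plot the overlaps when there are many sets.

We will use the UpSetR package to draw an UpSet plot.

An UpSet plot consists of three sub-plots:

  1. A bar chart of intersection sizes (top)
  2. A horizontal bar chart of set sizes (bottom left)
  3. A matrix of set overlaps (bottom right). Each column of the matrix is one combination of sets and lines up with a bar of the top chart; each row is one set and lines up with a bar of the bottom-left chart

A single plot of this kind shows the overlaps among many sets, and the intersections are easy to read off.

So how do we draw one?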

Basics

1. Installation and loading

install.packages("UpSetR")

library(UpSetR)

We use the example data that ships with the package.

movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), 
    header = TRUE, sep = ";")

2. Data

Before plotting, we need to know the expected input format.

UpSetR provides two conversion functions, fromList and fromExpression, for getting data into that format.

  • fromList takes a list in which each named element is one set and converts it to a data frame, for example
listInput <- list(
        one = c(1, 2, 3, 5, 7, 8, 11, 12, 13), 
        two = c(1, 2, 4, 5, 10), 
        three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
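  • fromExpression takes a named vector that acts as an expression. Each name is a set, or a combination of sets joined with &, and each value is the number of elements that belong to exactly that combination, for example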
expressionInput <- c(
        one = 2, two = 1, three = 2, 
        `one&two` = 1, `one&three` = 4, 
        `two&three` = 1, `one&two&three` = 2)

With the data above we can draw the plot like this:

upset(fromList(listInput), order.by = "freq")
# upset(fromExpression(expressionInput), order.by = "freq")
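To see how the two inputs relate, the sketch below rebuilds a 0/1 membership table from listInput by hand in base R and counts the elements in each exclusive combination. It does not call fromList; the table layout (one row per element, one 0/1 column per set) is an assumption about the format upset() works with, and the variable names are illustrative.

```r
# Same example sets as in the text
listInput <- list(
  one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
  two = c(1, 2, 4, 5, 10),
  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

# One row per element, one 0/1 column per set
elements <- sort(unique(unlist(listInput)))
membership <- sapply(listInput, function(s) as.integer(elements %in% s))
rownames(membership) <- elements

# Label each element with the exact combination of sets it belongs to
combo <- apply(membership, 1, function(r) {
  paste(colnames(membership)[r == 1], collapse = "&")
})

# Number of elements per exclusive combination
exclusive_counts <- table(combo)
print(exclusive_counts)
```

The counts agree with the values in expressionInput (for example 4 for one&three), which is why both inputs describe the same three sets.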

3. Plotting a subset of the sets

Here we set nsets = 6 to restrict the plot to the six largest sets.

upset(movies, nsets = 6, 
      number.angles = 30, 
      point.size = 3.5, 
      line.size = 2, 
      mainbar.y.label = "Genre Intersections", 
      sets.x.label = "Movies Per Genre", 
      text.scale = c(1.3, 1.3, 1, 1, 2, 0.75))

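Other parameters adjust the look of the plot: number.angles sets the angle of the numbers above the bars; point.size and line.size set the size of the points and lines in the matrix; mainbar.y.label and sets.x.label set the axis labels of the intersection-size bar chart and the set-size bar chart; text.scale holds six values that set the size of every text label in the plot.

text.scale takes its six values in this order:

  • axis title of the intersection-size bar chart
  • tick labels of the intersection-size bar chart
  • axis title of the set-size bar chart
  • tick labels of the set-size bar chart
  • set names
  • numbers above the bars, showing the intersection sizes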
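Because text.scale is purely positional, it is easy to mix up the six slots. One possible way to keep them straight, sketched below, is to write the values as a named vector and strip the names before passing it on. The element names are hypothetical labels chosen for this example, not names that UpSetR recognizes, and the plot is only drawn if UpSetR is installed.

```r
# Hypothetical labels; only the positional order matters to upset()
text_sizes <- c(
  intersection_title = 1.3,
  intersection_ticks = 1.3,
  set_size_title     = 1,
  set_size_ticks     = 1,
  set_names          = 2,
  bar_numbers        = 0.75)

# Drop the names so a plain numeric vector is passed on
text_scale <- unname(text_sizes)

if (requireNamespace("UpSetR", quietly = TRUE)) {
  library(UpSetR)
  movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
                     header = TRUE, sep = ";")
  upset(movies, nsets = 6, text.scale = text_scale)
}
```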

We can also name the sets to display explicitly:

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45)
      )

mb.ratio controls the share of the height given to the upper bar chart and to the lower matrix.
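To check beforehand which sets count as the largest, the set sizes can be computed directly from a 0/1 membership table. The sketch below uses a small toy table with hypothetical genre columns rather than the movies data, so that it runs without UpSetR.

```r
# Toy 0/1 membership table; the column names are hypothetical genres
toy <- data.frame(
  Action = c(1, 0, 1, 1, 0, 1),
  Comedy = c(0, 1, 1, 0, 0, 0),
  Drama  = c(1, 1, 1, 1, 1, 0),
  War    = c(0, 0, 0, 1, 0, 0))

# Set size = number of rows flagged 1 in that column
set_sizes <- sort(colSums(toy), decreasing = TRUE)

# Names of the two largest sets, usable as the `sets` argument of upset()
largest <- names(set_sizes)[1:2]
print(largest)
```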

4. Sorting

The order.by parameter sorts the intersections.

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45),
      order.by = "freq",
      decreasing = TRUE
      )

With order.by = "freq", setting decreasing = TRUE sorts the intersections from the largest to the smallest.

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45),
      order.by = "degree",
      decreasing = FALSE
      )

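With order.by = "degree", setting decreasing = FALSE sorts the intersections from the lowest degree (fewest sets involved) to the highest.

Both sort keys can also be given together: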

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45),
      order.by = c("degree", "freq"),
      decreasing = c(TRUE, FALSE)
      )
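The two sort keys are easier to reason about on a small table. The sketch below takes the exclusive intersections from expressionInput, derives the degree of each one, and sorts by both keys with base R's order(). It illustrates what "degree" and "freq" mean; it is not UpSetR's internal code, and the package may resolve ties or combine the keys differently.

```r
# Exclusive intersections of the three example sets (sizes from expressionInput)
intersections <- data.frame(
  combo = c("one", "two", "three", "one&two",
            "one&three", "two&three", "one&two&three"),
  freq  = c(2, 1, 2, 1, 4, 1, 2),
  stringsAsFactors = FALSE)

# Degree = number of sets taking part in the intersection
intersections$degree <- lengths(strsplit(intersections$combo, "&", fixed = TRUE))

# Highest degree first; within the same degree, smallest intersection first
by_degree_freq <- intersections[order(-intersections$degree,
                                      intersections$freq), ]
print(by_degree_freq)
```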

To keep the sets in the order given in the sets parameter, set keep.order = TRUE.

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45),
      order.by = c("degree", "freq"),
      decreasing = c(TRUE, FALSE),
      keep.order = TRUE
      )

To also show combinations whose intersection is empty, set the empty.intersections parameter.

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      empty.intersections = "on"
      )
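Turning on empty intersections can widen the plot considerably, because the number of possible combinations grows exponentially with the number of sets. A quick base R count for the seven sets used above:

```r
sets <- c("Action", "Comedy", "Drama",
          "Mystery", "Thriller", "Romance", "War")

# Every non-empty subset of the sets is one possible intersection column
n_possible <- 2^length(sets) - 1
print(n_possible)

# The pairwise combinations alone, written in the "A&B" style
pairs <- combn(sets, 2, paste, collapse = "&")
print(length(pairs))
```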

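Queries

Queries are passed through the queries parameter, which takes a nested list with one inner list per query. Each query has four fields:

  • query: the query function to run
  • params: a list of parameters for the query
  • color: the color used in the plot for elements that satisfy the query
  • active: if TRUE, the intersection bar is overlaid with a colored bar; if FALSE, a jittered point is placed on the bar instead

For example:

1. Built-in intersection query

The built-in intersection query, intersects, finds a specific intersection and colors it in the plot.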

upset(movies, queries = list(
  list(
    query = intersects, 
    params = list("Drama", "Comedy", "Action"), 
    color = "orange", 
    active = T), 
  list(
    query = intersects, 
    params = list("Drama"), 
    color = "red", 
    active = F), 
  list(
    query = intersects,
    params = list("Action", "Drama"), 
    active = T)
  )
  )
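Since each query is an ordinary R list, the nested structure can also be built programmatically. The helper below, make_query, is hypothetical (it is not part of UpSetR) and simply assembles the four fields described above; the plotting part runs only if UpSetR is installed.

```r
# Hypothetical helper: wraps the fields of one query into a list
make_query <- function(query_fun, sets, color = NULL, active = TRUE) {
  q <- list(query = query_fun, params = as.list(sets), active = active)
  if (!is.null(color)) q$color <- color  # leave color unset when not given
  q
}

if (requireNamespace("UpSetR", quietly = TRUE)) {
  library(UpSetR)
  movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
                     header = TRUE, sep = ";")
  # The same three queries as in the example above
  queries <- list(
    make_query(intersects, c("Drama", "Comedy", "Action"), "orange"),
    make_query(intersects, "Drama", "red", active = FALSE),
    make_query(intersects, c("Action", "Drama")))
  upset(movies, queries = queries)
}
```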

2. Built-in element query

The built-in elements query shows how selected elements are distributed across the intersections.

upset(movies, 
      queries = list(
        list(
          query = elements, 
          params = list("AvgRating",  3.5, 4.1), 
          color = "blue", 
          active = T), 
        list(
          query = elements, 
          params = list("ReleaseDate", 1980, 1990, 2000), 
          color = "red", 
          active = F)
        )
      )

3. Using an expression

A filter expression passed to the expression parameter restricts the query results to a subset.

upset(movies, 
      queries = list(
        list(
          query = intersects, 
          params = list("Action", "Drama"), 
          active = T), 
        list(
          query = elements, 
          params = list("ReleaseDate", 1980, 1990, 2000), 
          color = "red", 
          active = F)), 
      expression = "AvgRating > 3 & Watches > 100"
      )
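The filter string reads like an ordinary R condition on the attribute columns. As a rough illustration of what such a condition keeps, the sketch below applies the same condition with base R's subset() to a toy data frame; the data are made up, and this is not how UpSetR evaluates the expression internally.

```r
# Toy data with the two attribute columns used in the expression
toy <- data.frame(
  Name      = c("a", "b", "c", "d"),
  AvgRating = c(2.5, 3.4, 4.1, 3.9),
  Watches   = c(500, 80, 250, 101),
  stringsAsFactors = FALSE)

# Keep rows that satisfy both conditions
kept <- subset(toy, AvgRating > 3 & Watches > 100)
print(kept$Name)
```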

4. Custom queries

A query function is applied to every row of the data. We can define one like this:

Myfunc <- function(row, release, rating) {
  data <- (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

It selects the films whose release year is in release and whose average rating is above the given value.

Run the query:

upset(movies, 
      queries = list(
        list(
          query = Myfunc, 
          params = list(c(1970, 1980, 1990, 1999, 2000), 2.5), 
          color = "blue", 
          active = T)
        )
      )
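A row-wise predicate like this can be tried out on a few rows before handing it to upset(). The sketch below repeats the same logic under a hypothetical name, released_and_rated, and applies it to a toy numeric matrix; how UpSetR itself passes each row to the function is not shown here.

```r
# Same logic as the custom query in the text, under a hypothetical name
released_and_rated <- function(row, release, rating) {
  (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

# Toy numeric matrix: one row per film
films <- rbind(
  c(ReleaseDate = 1980, AvgRating = 3.1),
  c(ReleaseDate = 1985, AvgRating = 4.0),
  c(ReleaseDate = 1999, AvgRating = 2.0))

# Apply the predicate row by row
hits <- apply(films, 1, released_and_rated,
              release = c(1970, 1980, 1990, 1999, 2000), rating = 2.5)
print(unname(hits))
```

Only the first film passes: the second was not released in a listed year, and the third is rated too low.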

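5. Adding a query legend

The query.legend parameter sets the position of the query legend: "top" or "bottom".

Inside a query, query.name sets the name shown in the legend; if it is not set, a name is generated automatically.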

upset(movies, 
      query.legend = "top", 
      queries = list(
        list(
          query = intersects, 
          params = list("Drama", "Comedy", "Action"), 
          color = "orange", active = T, 
          query.name = "Funny action"), 
        list(
          query = intersects, 
          params = list("Drama"), 
          color = "red", active = F), 
        list(
          query = intersects, 
          params = list("Action", "Drama"), 
          active = T, 
          query.name = "Emotional action")
        )
      )

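Attribute plots

The attribute.plots parameter draws attribute plots. It takes a list with three fields:

  • gridrows: the space given to the attribute plots. The UpSet plot itself occupies a 100 x 100 grid; with gridrows = 50 the whole figure becomes 150 x 100
  • plots: a list of plots, each with four parameters:
    • plot: a function that returns a ggplot object
    • x: the x-axis variable of the plot
    • y: the y-axis variable of the plot
    • queries: whether to overlay the plot with the data from the existing queries
  • ncols: the number of columns in which the plots are arranged

1. Built-in plotting functions

We use the package's built-in histogram function to draw histograms.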

upset(movies, 
      main.bar.color = "black", 
      queries = list(
        list(
          query = intersects, 
          params = list("Drama"), 
          active = T)
        ), 
      attribute.plots = list(
        gridrows = 50, 
        plots = list(
          list(
            plot = histogram, 
            x = "ReleaseDate", 
            queries = F), 
          list(
            plot = histogram,
            x = "AvgRating", 
            queries = T)
          ), 
        ncols = 2
        )
      )

The scatter_plot function draws scatter plots.

upset(movies, 
      main.bar.color = "black", 
      queries = list(
        list(
          query = intersects, 
          params = list("Drama"), 
          color = "red", 
          active = F), 
        list(
          query = intersects, 
          params = list("Drama", "Comedy", "Action"), 
          color = "orange", 
          active = T)
        ), 
      attribute.plots = list(
        gridrows = 45, 
        plots = list(
          list(
            plot = scatter_plot, 
            x = "ReleaseDate", 
            y = "AvgRating", 
            queries = T), 
          list(plot = scatter_plot, 
               x = "AvgRating", 
               y = "Watches", 
               queries = F)
          ), 
        ncols = 2), 
      query.legend = "bottom"
      )

2. Custom plotting functions

We first define two ggplot2-based functions, one for a scatter plot and one for a density plot.

library(ggplot2)  # needed for ggplot(), aes_string() and the geoms below

my_scatter <- function(data, x, y) {
  p <- ggplot(data, aes_string(x, y, colour = "color")) +
    geom_point() +
    scale_colour_identity() +
    theme(
      plot.margin = unit(c(0, 0, 0, 0), "cm")
    )
  p
}

my_density <- function(data, x, y) {
  data$decades <- data[, y] %/% 10 * 10
  data <- data[which(data$decades >= 1970), ]
  p <- ggplot(data, aes_string(x)) +
    geom_density(aes(fill = factor(decades)), alpha = 0.3) +
    theme(
      plot.margin = unit(c(0, 0, 0, 0), "cm"), 
      legend.key.size = unit(0.4, "cm")
    )
  p
}

Then we use them as attribute plots:

upset(movies, 
      main.bar.color = "black", 
      queries = list(
        list(
          query = intersects, 
          params = list("Drama"), 
          color = "red", active = F), 
        list(
          query = intersects, 
          params = list("Action", "Drama"), 
          active = T),
        list(
          query = intersects, 
          params = list("Drama", "Comedy", "Action"), 
          color = "orange", active = T)
        ), 
      attribute.plots = list(
        gridrows = 45, 
        plots = list(
          list(
            plot = my_scatter, 
            x = "ReleaseDate", 
            y = "AvgRating", 
            queries = T),
          list(
            plot = my_density,
            x = "AvgRating",
            y = "ReleaseDate",
            queries = F)
          ),
        ncols = 2)
      )

3. Box plots

To draw box plots, use the boxplot.summary parameter. At most two variables can be plotted at once.

upset(movies, boxplot.summary = c("AvgRating", "ReleaseDate"))

A custom plotting function can of course achieve the same.

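Set metadata

The set.metadata parameter attaches metadata to the sets. It takes a list with three fields:

  • data: a data frame whose first column holds the set names and whose remaining columns hold attributes of the sets
  • ncols: the number of columns
  • plots: again a list, each element of which has four fields: column, type, assign and colors
    • column: the name of the column in data to plot
    • type: the kind of plot to draw. For a numeric column it can be hist or heat; for a Boolean column, bool draws a Boolean heat map; for a categorical (string) column it can be heat or text; to draw inside the matrix itself, use matrix_rows
    • assign: the number of grid columns allotted to this metadata plot. For example, with two metadata plots assigned 20 and 10 columns, the UpSet figure becomes 100 x 130
    • colors: the colors of the metadata plot. For a bar plot, a single color is applied to the whole plot; for heat or bool, a vector of colors can be given; for a factor column there is no colors parameter and a gradient is drawn; for text, one color can be set per unique string, and colors are assigned automatically if none are given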

1. Bar plot

We attach a metadata attribute to each set: a randomly generated average Rotten Tomatoes score for each genre.

sets <- names(movies[3:19])
avgRottenTomatoesScore <- round(runif(17, min = 0, max = 90))
metadata <- as.data.frame(cbind(sets, avgRottenTomatoesScore))
names(metadata) <- c("sets", "avgRottenTomatoesScore")

To draw the bar plot, the corresponding column must be numeric.

> str(metadata)
'data.frame':   17 obs. of  2 variables:
 $ sets                  : Factor w/ 17 levels "Action","Adventure",..: 1 2 3 4 5 6 7 8 12 9 ...
 $ avgRottenTomatoesScore: Factor w/ 12 levels "13","16","21",..: 6 10 12 5 1 1 3 2 11 11 ...

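The score column is a factor here (in R 4.0 and later it is character, because stringsAsFactors defaults to FALSE), so it has to be converted first.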

metadata$avgRottenTomatoesScore <- as.numeric(as.character(metadata$avgRottenTomatoesScore))
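The conversion is needed because cbind() first merges both columns into one character matrix. A possible alternative, sketched below, is to build the table with data.frame() so that each column keeps its own type from the start. The three set names are placeholders standing in for names(movies[3:19]), and whether upset() treats a character sets column exactly like a factor one is an assumption not verified here.

```r
# Placeholder set names standing in for names(movies[3:19])
sets <- c("Action", "Comedy", "Drama")

set.seed(1)  # make the random scores reproducible
avgRottenTomatoesScore <- round(runif(length(sets), min = 0, max = 90))

# data.frame() keeps each column's type; no factor-to-numeric conversion needed
metadata <- data.frame(sets = sets,
                       avgRottenTomatoesScore = avgRottenTomatoesScore,
                       stringsAsFactors = FALSE)
str(metadata)
```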

Now we can draw the metadata plot.

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "hist", 
            column = "avgRottenTomatoesScore", 
            assign = 20)
          )
        )
      )

2. Heat map

Next we extend the metadata with a city attribute for each set, making sure the column is character rather than factor.

Cities <- sample(c("Boston", "NYC", "LA"), 17, replace = T)
metadata <- cbind(metadata, Cities)
metadata$Cities <- as.character(metadata$Cities)

We draw two heat maps, one with colors specified and one without.

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "heat",
            column = "Cities", 
            assign = 10, 
            colors = c(
              Boston = "green", 
              NYC = "navy",
              LA = "purple")
            ), 
          list(
            type = "heat", 
            column = "avgRottenTomatoesScore", 
            assign = 10)
          )
        )
      )

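As the plot shows, the heat map without specified colors uses a gray gradient.

Boolean heat map

We add an accepted column to the metadata, with values 0 and 1.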

accepted <- round(runif(17, min = 0, max = 1))
metadata <- cbind(metadata, accepted)

The setup is similar to the one above.

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "bool", 
            column = "accepted", 
            assign = 5, 
            colors = c("#FF3333", "#006400")
            )
          )
        )
      )

If we change bool to heat:

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "heat", 
            column = "accepted", 
            assign = 5, 
            colors = c("#FF3333", "#006400")
            )
          )
        )
      )

the 0/1 values are treated as numeric and drawn with a color gradient.

3. Text

For the city metadata, text is probably a better fit than a heat map.

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "text", 
            column = "Cities", 
            assign = 10, 
            colors = c(
              Boston = "green", 
              NYC = "navy",        
              LA = "purple")
            )
          )
        )
      )

4. Applying metadata to the matrix

Sometimes we want the metadata to show up directly in the UpSet plot. Setting type = "matrix_rows" colors the rows of the matrix, here with a different color for each city.

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "hist", 
            column = "avgRottenTomatoesScore", 
            assign = 20), 
          list(
            type = "matrix_rows", 
            column = "Cities", 
            colors = c(
              Boston = "green", 
              NYC = "navy", 
              LA = "purple"),
            alpha = 0.5)
          )
        )
      )

Putting it all together

Finally, we combine all of these plots into one.

upset(movies, 
      # queries
      queries = list(
        list(
          query = intersects, 
          params = list("Drama"), 
          color = "red", 
          active = F), 
        list(
          query = intersects, 
          params = list("Action", "Drama"), 
          active = T), 
        list(
          query = intersects,
          params = list("Drama", "Comedy", "Action"), 
          color = "orange", 
          active = T)), 
      # set metadata plots
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "hist", 
            column = "avgRottenTomatoesScore", 
            assign = 20), 
          list(
            type = "bool", 
            column = "accepted",
            assign = 5, 
            colors = c("#FF3333", "#006400")), 
          list(
            type = "text", 
            column = "Cities",
            assign = 5, 
            colors = c(
              Boston = "green", 
              NYC = "navy", 
              LA = "purple")), 
          list(
            type = "matrix_rows", 
            column = "Cities", 
            colors = c(
              Boston = "green", 
              NYC = "navy", 
              LA = "purple"), 
            alpha = 0.5)
          )
        ), 
      # attribute plots
      attribute.plots = list(
        gridrows = 45, 
        plots = list(
          list(
            plot = my_scatter, 
            x = "ReleaseDate", 
            y = "AvgRating", 
            queries = T), 
          list(plot = my_density, 
               x = "AvgRating", 
               y = "ReleaseDate", 
               queries = F)), 
        ncols = 2), 
      query.legend = "bottom"
      )

Code:
https://github.com/dxsbiocc/learn/blob/main/R/plot/upset_plot.R
