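Preface
In the previous section, we showed how to draw Venn diagrams to display the overlap between sets.
However, as the number of sets grows, the relationships in a Venn diagram become increasingly tangled, and it is hard to read the information at a glance.
Today we look at how to plot these relationships when the number of sets is large.
We will use the UpSetR package to draw an UpSet plot.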

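An UpSet plot consists of three subplots:
- a bar chart of intersection sizes (top)
- a bar chart of set sizes (bottom left)
- a matrix of the overlaps between sets (bottom right). Each column of the matrix is one combination of sets and lines up with a bar on the x axis of the intersection bar chart; each row is one set and lines up with a bar on the y axis of the set size chart
A single plot of this kind shows the overlaps among many sets, and the intersections are easy to read off.
So how do we draw one?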
Basics
1. Installing and loading
install.packages("UpSetR")
library(UpSetR)
We use the example data that ships with the package:
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
header = TRUE, sep = ";")
2. Data
Before plotting, we need to know the expected format of the input data.
UpSetR provides two conversion functions, fromList and fromExpression, for formatting data.
- fromList takes a list (each element represents one set) and converts it into a data frame, for example:
listInput <- list(
one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
two = c(1, 2, 4, 5, 10),
three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
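- fromExpression takes a named vector in which each name is a set or a combination of sets joined by &, and each value is the number of elements that belong to exactly that combination, for example: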
expressionInput <- c(
one = 2, two = 1, three = 2,
`one&two` = 1, `one&three` = 4,
`two&three` = 1, `one&two&three` = 2)
From the data above, we can draw the following plot:
upset(fromList(listInput), order.by = "freq")
# upset(fromExpression(expressionInput), order.by = "freq")
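To see how the two input formats relate, the base R sketch below counts, for the listInput above, how many elements fall into exactly each combination of sets. The helper names (all_elements, membership, combo, exclusive_counts) are illustrative and not part of UpSetR; the sketch only checks that the counts match the values written in expressionInput.

```r
listInput <- list(
  one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
  two = c(1, 2, 4, 5, 10),
  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

expressionInput <- c(
  one = 2, two = 1, three = 2,
  `one&two` = 1, `one&three` = 4,
  `two&three` = 1, `one&two&three` = 2)

# 0/1 membership matrix: one row per element, one column per set
all_elements <- sort(unique(unlist(listInput)))
membership <- sapply(listInput, function(s) as.integer(all_elements %in% s))

# Label every element with the exact combination of sets it belongs to
combo <- apply(membership, 1, function(r) {
  paste(colnames(membership)[r == 1], collapse = "&")
})

# Count the elements per combination (a named integer vector)
exclusive_counts <- vapply(split(combo, combo), length, integer(1))

# Compare with the hand-written expression input
print(all(exclusive_counts[names(expressionInput)] == expressionInput))  # TRUE
```

This is why both calls to upset() above describe the same plot: the expression input is the list input summarised per exclusive combination.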

3. Plotting a subset of sets
Here we set nsets = 6 to restrict the plot to the 6 largest sets.
upset(movies, nsets = 6,
number.angles = 30,
point.size = 3.5,
line.size = 2,
mainbar.y.label = "Genre Intersections",
sets.x.label = "Movies Per Genre",
text.scale = c(1.3, 1.3, 1, 1, 2, 0.75))

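We can also pass parameters to adjust the appearance of the plot. For example, number.angles sets the angle of the numbers above the bars; point.size and line.size set the size of the points and lines in the matrix; mainbar.y.label and sets.x.label set the axis labels of the intersection bar chart and the set size bar chart; text.scale takes 6 values that scale all text labels in the plot.
The values of text.scale are given in this order:
- axis title of the intersection size bar chart
- tick labels of the intersection size bar chart
- axis title of the set size bar chart
- tick labels of the set size bar chart
- set names
- numbers above the bars showing intersection sizes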
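Because text.scale is positional, it is easy to mix up the six values. One possible way to keep them readable is to store them in a named vector and strip the names when calling upset(). The names below are my own labels, not something UpSetR reads, and the plotting call is guarded so the snippet still runs when the package is not installed.

```r
# Labels are for the reader only; upset() uses the values by position
text_sizes <- c(
  intersection_size_title = 1.3,
  intersection_size_ticks = 1.3,
  set_size_title          = 1,
  set_size_ticks          = 1,
  set_names               = 2,
  bar_numbers             = 0.75
)

if (requireNamespace("UpSetR", quietly = TRUE)) {
  library(UpSetR)
  movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
                     header = TRUE, sep = ";")
  upset(movies, nsets = 6, text.scale = unname(text_sizes))
}
```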
We can also specify exactly which sets to display:
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45)
)

mb.ratio controls the proportion of the figure given to the upper bar chart and the lower matrix.
4. Sorting
We can set the order.by parameter to sort the intersections.
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = "freq",
decreasing = TRUE
)

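With order.by = "freq", decreasing = TRUE sorts the intersections from largest to smallest. The package documentation describes this as the default direction for freq, so setting it here mainly makes the intent explicit.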
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = "degree",
decreasing = FALSE
)

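With order.by = "degree", decreasing = FALSE sorts the intersections from the fewest sets involved to the most, which the package documentation lists as the default direction for degree.
Both keys can also be specified at the same time: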
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = c("degree", "freq"),
decreasing = c(TRUE, FALSE)
)
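As a rough illustration of what the two sort keys mean, the base R sketch below orders the seven combinations of the small three-set example by degree (number of sets in the combination, decreasing) and then by freq (number of elements, increasing). It only mirrors the idea behind order.by = c("degree", "freq") with decreasing = c(TRUE, FALSE); it is not UpSetR's internal implementation, and the variable names are illustrative.

```r
# Exclusive intersection sizes of the toy example
freq <- c(
  one = 2, two = 1, three = 2,
  `one&two` = 1, `one&three` = 4,
  `two&three` = 1, `one&two&three` = 2)

# Degree = how many sets take part in each combination
degree <- lengths(strsplit(names(freq), "&", fixed = TRUE))

intersections <- data.frame(
  name = names(freq),
  freq = unname(freq),
  degree = degree,
  stringsAsFactors = FALSE
)

# Degree descending first, then frequency ascending within each degree
ordered <- intersections[order(-intersections$degree, intersections$freq), ]
print(ordered$name)
```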

To keep the sets in the order given in the sets parameter, set keep.order = TRUE:
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = c("degree", "freq"),
decreasing = c(TRUE, FALSE),
keep.order = TRUE
)

To also show combinations whose intersection is empty, set the empty.intersections parameter:
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
empty.intersections = "on"
)

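Queries
Queries are run through the queries parameter, which takes a nested list describing one or more queries. Each query has four fields:
- query: the query function to run
- params: the list of parameters passed to the query
- color: the color used in the plot for elements that match the query
- active: if TRUE, the intersection bar is overlaid with a bar in the query color; if FALSE, jittered points are drawn on the bar instead
For example:
1. Built-in intersection query
We use the built-in intersects query to find a specific intersection and color it.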
upset(movies, queries = list(
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T),
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T)
)
)
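The sketch below shows, on a made-up data frame, which rows an intersection query for Action and Drama is meant to pick out, under the assumption that the query targets the bar for elements in exactly those sets and no other displayed set. The toy data and variable names are hypothetical, and the code is plain base R rather than anything from UpSetR.

```r
# Hypothetical membership table: one row per movie, one 0/1 column per genre
toy <- data.frame(
  Action = c(1, 1, 0, 1, 0),
  Drama  = c(1, 1, 1, 0, 1),
  Comedy = c(0, 1, 0, 0, 1)
)

set_cols <- c("Action", "Drama", "Comedy")
target   <- c("Action", "Drama")

# A row matches when it is in every target set and in none of the others
in_target <- rowSums(toy[, target, drop = FALSE]) == length(target)
in_others <- rowSums(toy[, setdiff(set_cols, target), drop = FALSE]) > 0
exclusive <- unname(in_target & !in_others)

print(which(exclusive))
```

Only the first row qualifies: the second row is also a Comedy, so it belongs to the three-way intersection instead.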

2. Built-in element query
We use elements to run an element query, which shows how the matching elements are distributed across the intersections.
upset(movies,
queries = list(
list(
query = elements,
params = list("AvgRating", 3.5, 4.1),
color = "blue",
active = T),
list(
query = elements,
params = list("ReleaseDate", 1980, 1990, 2000),
color = "red",
active = F)
)
)

3. Using an expression
We can pass a filter expression through the expression parameter to keep only a subset of the query results.
upset(movies,
queries = list(
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = elements,
params = list("ReleaseDate", 1980, 1990, 2000),
color = "red",
active = F)),
expression = "AvgRating > 3 & Watches > 100"
)
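To get a feel for which rows such a filter keeps, the same condition can be evaluated against a small data frame in base R. The toy data is made up, and evaluating the string with eval(parse()) is only my way of demonstrating the condition; how UpSetR evaluates the expression internally is not shown here.

```r
# Hypothetical movies with the two attributes used in the filter
toy <- data.frame(
  AvgRating = c(2.5, 3.5, 4.2),
  Watches   = c(500, 50, 300)
)

filter_text <- "AvgRating > 3 & Watches > 100"

# Evaluate the condition with the data frame columns in scope
keep <- eval(parse(text = filter_text), envir = toy)

print(toy[keep, ])
```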
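4. Custom queries
A query function is applied to every row of the data. We can define one as follows:
Myfunc <- function(row, release, rating) {
# TRUE when the release year is one of `release` and the rating exceeds `rating`
(row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}
It selects movies whose release date is in release and whose average rating is above a given value.
Run the query: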
upset(movies,
queries = list(
list(
query = Myfunc,
params = list(c(1970, 1980, 1990, 1999, 2000), 2.5),
color = "blue",
active = T)
)
)
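Before handing a custom query to upset(), it can help to try it on a few rows by hand. The sketch below uses apply() over a made-up data frame as a stand-in for the row-by-row call; that UpSetR invokes the function in exactly this way is an assumption, and the toy data is hypothetical.

```r
# Same logic as the custom query above
Myfunc <- function(row, release, rating) {
  (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

# Hypothetical rows to exercise the function
toy <- data.frame(
  ReleaseDate = c(1970, 1985, 1999, 2000),
  AvgRating   = c(3.1, 4.0, 2.0, 2.6)
)

# Apply the query to each row, passing the extra parameters by name
keep <- apply(toy, 1, Myfunc,
              release = c(1970, 1980, 1990, 1999, 2000),
              rating = 2.5)

print(unname(keep))
```

The first and last rows pass both conditions; the second fails on the release year and the third on the rating.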

5. Adding a query legend
The query.legend parameter sets the position of the query legend, either top or bottom.
Inside a query, use query.name to name it; if no name is set, one is generated automatically.
upset(movies,
query.legend = "top",
queries = list(
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange", active = T,
query.name = "Funny action"),
list(
query = intersects,
params = list("Drama"),
color = "red", active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T,
query.name = "Emotional action")
)
)

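Attribute plots
The attribute.plots parameter draws attribute plots alongside the UpSet plot. It has 3 fields:
- gridrows: the amount of space given to the attribute plots. The UpSet plot is laid out on a 100 x 100 grid by default; setting gridrows to 50 makes the whole figure 150 x 100
- plots: a list of plots, each with 4 parameters:
  - plot: a function that returns a ggplot object
  - x: the variable on the x axis
  - y: the variable on the y axis
  - queries: whether to overlay the plot data with the results of the existing queries
- ncols: the number of columns used to arrange the attribute plots
1. Built-in plot functions
We use the histogram function that comes with the package to draw histograms.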
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
active = T)
),
attribute.plots = list(
gridrows = 50,
plots = list(
list(
plot = histogram,
x = "ReleaseDate",
queries = F),
list(
plot = histogram,
x = "AvgRating",
queries = T)
),
ncols = 2
)
)

Use the scatter_plot function to draw scatter plots:
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T)
),
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = scatter_plot,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(plot = scatter_plot,
x = "AvgRating",
y = "Watches",
queries = F)
),
ncols = 2),
query.legend = "bottom"
)

2. Custom plot functions
We first define two ggplot2-based functions, one for a scatter plot and one for a density plot.
library(ggplot2)
my_scatter <- function(data, x, y) {
p <- ggplot(data, aes_string(x, y, colour = "color")) +
geom_point() +
scale_colour_identity() +
theme(
plot.margin = unit(c(0, 0, 0, 0), "cm")
)
p
}
my_density <- function(data, x, y) {
data$decades <- data[, y] %/% 10 * 10
data <- data[which(data$decades >= 1970), ]
p <- ggplot(data, aes_string(x)) +
geom_density(aes(fill = factor(decades)), alpha = 0.3) +
theme(
plot.margin = unit(c(0, 0, 0, 0), "cm"),
legend.key.size = unit(0.4, "cm")
)
p
}
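aes_string() has been soft-deprecated since ggplot2 3.0.0. As a possible alternative, the sketch below rewrites the scatter function with the .data pronoun. The name my_scatter_tidy and the toy data are my own; the function assumes, like my_scatter above, that the data it receives has a color column. Whether the axis labels look the same inside an UpSet figure has not been checked here.

```r
# Scatter plot built with the .data pronoun instead of aes_string().
# ggplot2:: prefixes keep the definition independent of library() calls.
my_scatter_tidy <- function(data, x, y) {
  ggplot2::ggplot(
    data,
    ggplot2::aes(x = .data[[x]], y = .data[[y]], colour = .data[["color"]])
  ) +
    ggplot2::geom_point() +
    ggplot2::scale_colour_identity() +
    ggplot2::theme(plot.margin = ggplot2::unit(c(0, 0, 0, 0), "cm"))
}

# Hypothetical data with a pre-assigned color per row
toy <- data.frame(
  ReleaseDate = c(1975, 1988, 1999),
  AvgRating   = c(3.2, 4.1, 2.7),
  color       = c("gray23", "red", "orange"),
  stringsAsFactors = FALSE
)

# Only build the plot when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- my_scatter_tidy(toy, "ReleaseDate", "AvgRating")
  print(class(p))
}
```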
Then use them in the attribute plots:
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red", active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange", active = T)
),
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = my_scatter,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(
plot = my_density,
x = "AvgRating",
y = "ReleaseDate",
queries = F)
),
ncols = 2)
)

3. Box plots
To draw box plots, use the boxplot.summary parameter. At most two variables can be plotted at once.
upset(movies, boxplot.summary = c("AvgRating", "ReleaseDate"))

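The same result can of course be achieved with a custom plot function.
Set metadata
The set.metadata parameter attaches metadata to the sets. It has 3 fields:
- data: a data frame whose first column holds the set names and whose remaining columns hold attributes of each set
- ncols: the number of columns
- plots: a list in which each element has 4 fields, column, type, assign and colors:
  - column: the name of the column in data to plot
  - type: the type of plot. A numeric column can use hist or heat; a boolean column can use bool; a categorical (string) column can use heat or text; use matrix_rows to draw the attribute in the matrix
  - assign: the number of columns allotted to this metadata plot. If two metadata plots are assigned 20 and 10, the UpSet plot becomes 100 x 130
  - colors: the colors of the metadata plot. For a bar chart, the color applies to the whole plot; for heat or bool, a vector of colors can be given; for a factor, no colors are given and a gradient is drawn; for text, a color can be given for each unique string, and colors are assigned automatically otherwise
1. Bar chart
We attach a metadata attribute to each set by assigning it a random average Rotten Tomatoes score.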
sets <- names(movies[3:19])
avgRottenTomatoesScore <- round(runif(17, min = 0, max = 90))
metadata <- as.data.frame(cbind(sets, avgRottenTomatoesScore))
names(metadata) <- c("sets", "avgRottenTomatoesScore")
To draw a bar chart, the corresponding column must be numeric.
> str(metadata)
'data.frame': 17 obs. of 2 variables:
$ sets : Factor w/ 17 levels "Action","Adventure",..: 1 2 3 4 5 6 7 8 12 9 ...
$ avgRottenTomatoesScore: Factor w/ 12 levels "13","16","21",..: 6 10 12 5 1 1 3 2 11 11 ...
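The score column is a factor, so it has to be converted first. (In R 4.0 and later, as.data.frame on the result of cbind yields character columns instead of factors; the conversion below works in both cases.)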
metadata$avgRottenTomatoesScore <- as.numeric(as.character(metadata$avgRottenTomatoesScore))
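The detour through as.character() matters because as.numeric() on a factor returns the internal level codes, not the printed values. The short sketch below shows the difference on made-up scores, and then one possible way to avoid the conversion altogether by building the metadata with data.frame() instead of cbind(). The names scores and meta, and the three genres, are illustrative only.

```r
# A factor whose labels look like numbers
scores <- factor(c("45", "13", "80"))

print(as.numeric(scores))                # level codes: 2 1 3
print(as.numeric(as.character(scores)))  # the actual scores: 45 13 80

# Building the data frame directly keeps the score column numeric,
# because nothing forces it through a character matrix first
set.seed(1)  # only so the random scores are reproducible
meta <- data.frame(
  sets = c("Action", "Comedy", "Drama"),
  avgRottenTomatoesScore = round(runif(3, min = 0, max = 90)),
  stringsAsFactors = FALSE
)
print(is.numeric(meta$avgRottenTomatoesScore))  # TRUE
```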
Now we can draw the metadata plot:
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20)
)
)
)

2. Heatmap
Next we extend the metadata with a city attribute for each set, making sure the column is a character vector rather than a factor.
Cities <- sample(c("Boston", "NYC", "LA"), 17, replace = T)
metadata <- cbind(metadata, Cities)
metadata$Cities <- as.character(metadata$Cities)
We draw two heatmaps, one with colors specified and one without:
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "heat",
column = "Cities",
assign = 10,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")
),
list(
type = "heat",
column = "avgRottenTomatoesScore",
assign = 10)
)
)
)

The heatmap without specified colors is drawn with a gray gradient.
Boolean heatmap
We add an accepted column to the metadata, with values 0 and 1.
accepted <- round(runif(17, min = 0, max = 1))
metadata <- cbind(metadata, accepted)
The setup is similar to the one above:
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "bool",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")
)
)
)
)

If we replace bool with heat:
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "heat",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")
)
)
)
)

The 0/1 data is then treated as numeric and drawn with a color gradient.
3. Text
For the city metadata, text may be a better fit than a heatmap:
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "text",
column = "Cities",
assign = 10,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")
)
)
)
)

4. Applying metadata in the matrix
Sometimes we want the metadata to show up directly in the UpSet plot. Setting type = "matrix_rows" colors the rows of the matrix by city.
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20),
list(
type = "matrix_rows",
column = "Cities",
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple"),
alpha = 0.5)
)
)
)

Putting it all together
Finally, we combine these plots into one figure.
upset(movies,
# queries
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T)),
# set metadata plots
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20),
list(
type = "bool",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")),
list(
type = "text",
column = "Cities",
assign = 5,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")),
list(
type = "matrix_rows",
column = "Cities",
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple"),
alpha = 0.5)
)
),
# attribute plots
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = my_scatter,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(plot = my_density,
x = "AvgRating",
y = "ReleaseDate",
queries = F)),
ncols = 2),
query.legend = "bottom"
)

Code:
https://github.com/dxsbiocc/learn/blob/main/R/plot/upset_plot.R
Parameter details
