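How to flexibly and quickly visualize confidence intervals, standard errors, and the means and medians of variables
This section walks through practical uses of the stat_summary function. It comes from the WeChat public account R语言数据分析指南.
Original article: https://mp.weixin.qq.com/s/v8Vdo8BtKoQdiGrR-GOFYA
Loading the required R packages
# gapminder and Hmisc are CRAN packages, so install.packages() is sufficient
install.packages(c("gapminder", "Hmisc"))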
library(tidyverse)
library(gapminder)
library(Hmisc)
Start by building a bar chart of a summary statistic from the diamonds dataset:
diamonds %>%
  group_by(cut) %>%
  summarise(mean = mean(price)) %>%
  ggplot(aes(x = cut, y = mean)) +
  geom_col()

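This approach works, but it is not the most efficient one. First, if ggplot2 can do the calculation directly, there is no need to summarise the data beforehand. Second, the calculation can become fairly involved, especially when the goal is to visualize confidence intervals.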
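To make the second point concrete, here is a rough sketch of what computing a 95% confidence interval by hand might look like before plotting. The object name price_ci and the column names (mean_price, se, lower, upper) are illustrative choices, and the interval assumes the usual t-based formula.

```r
library(dplyr)
library(ggplot2)

# Summarise by hand: mean, standard error, and a t-based 95% interval per cut
price_ci <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n          = n(),
    mean_price = mean(price),
    se         = sd(price) / sqrt(n),
    lower      = mean_price - qt(0.975, df = n - 1) * se,
    upper      = mean_price + qt(0.975, df = n - 1) * se
  )

# Only now can the summary be plotted
ggplot(price_ci, aes(x = cut, y = mean_price)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)
```

Every extra statistic means another column in the summary step, which is the bookkeeping stat_summary takes over.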
What stat_summary() does
Fortunately, the ggplot2 developers have already thought about how to visualize summary statistics directly. The answer is the stat_summary function. We will use the gapminder dataset, which contains life expectancy data for different countries and regions.
library(tidyverse)
library(gapminder)
gapminder
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ... with 1,694 more rows
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_col()

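As the plot shows, life expectancy has risen over recent decades. However, the bars show neither the mean nor the median life expectancy across countries. Because geom_col stacks every row, each bar is the sum of the life expectancies of all countries for that year.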
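A quick way to check this reading of the plot is to compare the bar heights with yearly sums computed by hand. This is only a sketch; the names yearly_total, p_col and bar_top are illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Sum of lifeExp over all countries, per year
yearly_total <- gapminder %>%
  group_by(year) %>%
  summarise(total = sum(lifeExp))

# Top of each stacked bar as computed by geom_col()
p_col <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_col()
bar_top <- layer_data(p_col) %>%
  group_by(x) %>%
  summarise(top = max(ymax))

# The stacked bar heights should match the yearly sums
all.equal(bar_top$top, yearly_total$total)
```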
geom_bar can, however, compute the mean life expectancy across countries. All we need to do is name the function to apply to the y variable (fun) and additionally set the argument stat = "summary".
https://stackoverflow.com/questions/30183199/ggplot2-plot-mean-with-geom-bar
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_bar(fun = "mean", stat = "summary")

But because this plot is built with geom_bar, the summary can only be drawn as bars, not as points or lines. This is where stat_summary shows its strength: it lets us display any summary statistic with different geoms, whether points, lines, or areas.
The simplest call passes two arguments to stat_summary. With fun = mean we ask for the mean of lifeExp (ggplot2 versions before 3.3.0 called this argument fun.y, which is now deprecated), and with geom = "bar" we ask for the means to be drawn as bars.
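The call described above would look roughly like the following. It uses fun, the argument name in ggplot2 3.3.0 and later; the object name p_mean_bar is only there for illustration.

```r
library(ggplot2)
library(gapminder)

# Mean life expectancy per year, computed and drawn as bars by stat_summary()
p_mean_bar <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun = "mean", geom = "bar")
p_mean_bar
```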
We can also ask stat_summary for a line chart instead of bars, and add a point for each yearly mean to make the plot easier to read:
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun = "mean", geom = "line")

This example shows that several stat_summary layers can be combined. Compared with the bar version, the only change is the geom: we now use points and a line.
We can also change the statistic being displayed. Life expectancy varies widely between countries, so we may prefer the median to the mean:
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = "median", geom = "bar")

stat_summary can also draw an area instead of a line:
gapminder %>%
  mutate(year = as.integer(year)) %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = "mean", geom = "area",
               fill = "#EB5286",
               alpha = .5) +
  stat_summary(fun = "mean", geom = "point",
               color = "#6F213F")

In the same way, we can show the highest and lowest life expectancy across countries:
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.min = min,
               fun.max = max)
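The three functions passed above (fun, fun.min, fun.max) can also be bundled into a single function supplied through fun.data, which must return a data frame with the columns y, ymin and ymax. The helper below, median_iqr, is a hypothetical name introduced purely for illustration; it shows the median with the interquartile range instead of the mean with the minimum and maximum.

```r
library(ggplot2)
library(gapminder)

# Hypothetical helper: median with the 25th and 75th percentiles
median_iqr <- function(x) {
  data.frame(
    y    = median(x),
    ymin = quantile(x, 0.25, names = FALSE),
    ymax = quantile(x, 0.75, names = FALSE)
  )
}

ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = median_iqr, geom = "pointrange")
```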

Classic error bars can show the maximum and minimum as well:
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(geom = "errorbar",
               width = 1,
               fun.min = min,
               fun.max = max)

Showing the standard deviation
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x),
               fun.min = function(x) mean(x) - sd(x))
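As a possible shortcut, ggplot2's mean_sdl helper (a wrapper around an Hmisc function, so Hmisc must be installed) should give the same range. Its mult argument defaults to 2 standard deviations, so it is set to 1 here to match the code above.

```r
library(ggplot2)
library(gapminder)

# mean_sdl() returns mean +/- mult * sd; mult = 1 matches the manual version
ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_sdl",
               fun.args = list(mult = 1),
               geom = "pointrange")
```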

Showing the standard error
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)),
               fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)))
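ggplot2 also ships a mean_se helper that computes the mean plus and minus one standard error, so the two anonymous functions above can likely be replaced with a single fun.data argument. A minimal sketch:

```r
library(ggplot2)
library(gapminder)

# mean_se() returns a data frame with y, ymin and ymax (mean +/- 1 standard error)
ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = mean_se, geom = "pointrange")
```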

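Showing the classic 95% confidence interval. ggplot2 provides two helpers for this, mean_cl_normal and mean_cl_boot; both are thin wrappers around functions in the Hmisc package, which is why Hmisc must be installed.
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_cl_normal")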
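To see what mean_cl_normal computes for a single group, the sketch below rebuilds the interval by hand for the 2007 data, assuming the standard t-based formula (mean plus or minus the t quantile times the standard error). The names x, n, half_width and by_hand are illustrative.

```r
library(ggplot2)
library(gapminder)

x <- gapminder$lifeExp[gapminder$year == 2007]
n <- length(x)

# t-based 95% interval computed by hand
half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
by_hand <- c(y = mean(x),
             ymin = mean(x) - half_width,
             ymax = mean(x) + half_width)

by_hand
mean_cl_normal(x)  # should agree with by_hand (requires Hmisc)
```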

We can also draw the interval as error bars:
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "errorbar",
               width = .4) +
  stat_summary(fun = "mean", geom = "point")

Choosing the confidence level
Conveniently, mean_cl_normal takes a conf.int argument that sets the confidence level; it is passed through fun.args:
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_cl_normal",
               fun.args = list(conf.int = .99))
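The same fun.args mechanism should work for mean_cl_boot, the bootstrap counterpart mentioned earlier. The sketch below assumes the B argument of the underlying Hmisc function for the number of resamples; the value 2000 and the seed are arbitrary choices.

```r
library(ggplot2)
library(gapminder)

# Bootstrap resampling is random; fix the seed so the plot is reproducible
set.seed(42)

gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_cl_boot",
               fun.args = list(conf.int = .99, B = 2000))
```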

Combining statistics with multiple geoms
A bar chart with 95% confidence intervals:
gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  stat_summary(fun = "mean", geom = "bar", alpha = .7) +
  stat_summary(fun = "mean", geom = "point",
               size = 1) +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "errorbar",
               width = .2)

Use position = position_dodge() to show multiple bars side by side
colors <- c("#E41A1C", "#1E90FF", "#FF8C00", "#4DAF4A", "#984EA3",
            "#40E0D0", "#FFC0CB", "#00BFFF", "#FFDEAD", "#90EE90",
            "#EE82EE", "#00FFFF", "#F0A3FF", "#0075DC",
            "#993F00", "#4C005C", "#2BCE48", "#FFCC99",
            "#808080", "#94FFB5", "#8F7C00", "#9DCC00",
            "#C20088", "#003380", "#FFA405", "#FFA8BB",
            "#426600", "#FF0010", "#5EF1F2", "#00998F",
            "#740AFF", "#990000", "#FFFF00")
gapminder %>%
  mutate(
    year = as.factor(year)
  ) %>%
  ggplot(aes(x = continent, y = lifeExp, fill = year)) +
  stat_summary(fun = "mean", geom = "bar",
               alpha = .7, position = position_dodge(0.95)) +
  stat_summary(fun = "mean", geom = "point",
               position = position_dodge(0.95),
               size = 1) +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "errorbar",
               position = position_dodge(0.95),
               width = .2) +
  scale_fill_manual(values = colors) +
  theme_minimal() +
  scale_y_continuous(expand = c(0, 0))

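Although we have discussed the limitations of geom_*() and shown what stat_*() can do, both have their place. It is not an either-or choice. In fact, they need each other: just as stat_summary() has a geom argument, geom_*() functions have a stat argument. At a higher level, stat_*() and geom_*() are both convenient shortcuts for layer(), the function that builds ggplot layers.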
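The relationship can be illustrated by building the same mean bars three ways. This is a sketch under assumptions: the params list passed to layer() mirrors what stat_summary() is understood to forward internally, and the names base, via_geom, via_stat and via_layer are illustrative.

```r
library(ggplot2)
library(gapminder)

base <- ggplot(gapminder, aes(x = year, y = lifeExp))

# 1. Start from the geom and pick the stat
via_geom <- base + geom_bar(stat = "summary", fun = "mean")

# 2. Start from the stat and pick the geom
via_stat <- base + stat_summary(geom = "bar", fun = "mean")

# 3. Call layer() directly and name both
via_layer <- base + layer(
  geom = "bar", stat = "summary", position = "identity",
  params = list(fun = "mean", fun.args = list(),
                na.rm = FALSE, orientation = NA)
)

# All three layers should compute the same yearly means
all.equal(layer_data(via_geom)$y, layer_data(via_stat)$y)
all.equal(layer_data(via_stat)$y, layer_data(via_layer)$y)
```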
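Hadley's own words explain this false dichotomy:
Unfortunately, due to an early design mistake I called these either stat_() or geom_(). A better decision would have been to call them layer_() functions: that's a more accurate description because every layer involves a stat and a geom.
Reference: https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html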