Quick summary statistics with the stat_summary function in ggplot2

How to flexibly and quickly visualize confidence intervals, standard errors, and the means and medians of variables

This section walks through the stat_summary function in detail. If you like it, follow my WeChat official account R语言数据分析指南, where I keep sharing more resources.

Original article: https://mp.weixin.qq.com/s/v8Vdo8BtKoQdiGrR-GOFYA

Load the required R packages

# gapminder and Hmisc are CRAN packages, so install.packages() is enough
install.packages(c("gapminder", "Hmisc"))
library(tidyverse)
library(gapminder)
library(Hmisc)

Create a bar chart of summary statistics from the diamonds dataset:

diamonds %>% 
  group_by(cut) %>% 
  summarise(mean = mean(price)) %>% 
  ggplot(aes(x = cut, y = mean)) + 
  geom_col()

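This approach works, but it is not the most efficient one. First, if ggplot2 can do the calculation directly, there is no need to summarize the data beforehand. Second, the calculation can become fairly involved, especially when I want to visualize confidence intervals.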
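
To make the second point concrete, here is a minimal sketch of the manual route once a 95% confidence interval is added to the diamonds bar chart. The column names n, se, lower, and upper are my own choices, and the interval assumes a t-based (normal-theory) formula, which is only one of several reasonable options.

```r
library(dplyr)
library(ggplot2)

# Summarise by hand: mean, standard error, and a t-based 95% interval
price_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    mean = mean(price),
    n = n(),
    se = sd(price) / sqrt(n),
    .groups = "drop"
  ) %>%
  mutate(
    lower = mean - qt(0.975, df = n - 1) * se,
    upper = mean + qt(0.975, df = n - 1) * se
  )

# Two geoms are needed: one for the bars, one for the interval
p <- ggplot(price_summary, aes(x = cut, y = mean)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)
p
```

Every extra statistic means another column in the summary step, which is the bookkeeping that stat_summary is meant to remove.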

What stat_summary() does

Fortunately, the ggplot2 developers have already thought about how to visualize statistics in depth. The solution is the stat_summary function. We will use the gapminder dataset, which contains life expectancy data for people in different countries and regions.

library(tidyverse)
library(gapminder)

gapminder
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ... with 1,694 more rows
gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) + 
  geom_col()

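As the plot shows, life expectancy has risen over recent decades. However, the bar chart does not show the mean or median life expectancy across countries. Instead, geom_col stacks every country's value for a given year, so each bar is the sum over all countries.

geom_bar can, however, compute the mean life expectancy across countries. All we need to do is name the function to apply to the y variable (the fun argument) and additionally set stat = "summary".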
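
A quick way to check what the bars actually represent is to compute the per-year total and mean directly. This is only a sketch; the column names total, mean, and n_countries are illustrative.

```r
library(dplyr)
library(gapminder)

yearly <- gapminder %>%
  group_by(year) %>%
  summarise(
    total = sum(lifeExp),    # height of the stacked geom_col() bar
    mean = mean(lifeExp),    # what we would like to plot instead
    n_countries = n()
  )
yearly
```

The total column is the mean multiplied by the number of countries, which is why the y-axis of the geom_col plot runs into the thousands.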
https://stackoverflow.com/questions/30183199/ggplot2-plot-mean-with-geom-bar

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) + 
  geom_bar(fun = "mean", stat = "summary")

But we cannot show these values as points or lines, because the layer was created with geom_bar. This is where stat_summary shows its strength: it lets us display any kind of summary statistic with different geoms, whether points, lines, or areas. Read on.

Here we pass two arguments to stat_summary. With fun = mean (named fun.y before ggplot2 3.3.0) we tell stat_summary to compute the mean of the variable lifeExp. With geom = "bar" we tell it to draw those means as bars.

We can also tell stat_summary to draw a line chart instead of bars, and add a point for each yearly mean to make the plot easier to read.
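
The code for this step is not shown in the text. A minimal sketch of what it presumably looks like, using the current fun argument, would be:

```r
library(ggplot2)
library(gapminder)

# One bar per year, with the bar height equal to the mean of lifeExp
p_bar <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun = "mean", geom = "bar")
p_bar
```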

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun = "mean", geom = "line") 

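This example shows that several stat_summary layers can be combined. Compared with the previous example, the only change is the geom: we now use points and lines.

We can also change which statistic is displayed. Life expectancy can differ widely between countries, so we may prefer to show the median rather than the mean.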

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = "median", geom = "bar")

stat_summary can also draw an area instead of lines and points.

gapminder %>% 
  mutate(year = as.integer(year)) %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = "mean", geom = "area",
               fill = "#EB5286",
               alpha = .5) +
  stat_summary(fun = "mean", geom = "point",
               color = "#6F213F") 

In the same way, we can show the highest and lowest life expectancy across countries.

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.min = min,
               fun.max = max)

We can also use classic error bars to show the maximum and minimum.

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(geom = "errorbar",
               width = 1,
               fun.min = min,
               fun.max = max)

Showing the standard deviation

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x),
               fun.min = function(x) mean(x) - sd(x))

Showing the standard error

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)),
               fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)))
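
The two hand-written ranges above can also be expressed with ready-made helpers passed to fun.data. The sketch below assumes Hmisc is installed, since mean_sdl wraps an Hmisc function, and sets mult = 1 because mean_sdl defaults to two standard deviations.

```r
library(ggplot2)
library(gapminder)
library(Hmisc)  # mean_sdl() relies on Hmisc being installed

# Mean +/- 1 standard deviation (mult defaults to 2 in mean_sdl())
p_sd <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1))

# Mean +/- 1 standard error; mean_se() ships with ggplot2 itself
p_se <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_se")

p_sd
p_se
```

Since the default geom of stat_summary is "pointrange", no geom argument is needed here.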

Showing the classic 95% confidence interval. ggplot2 offers two helper functions for this, mean_cl_normal and mean_cl_boot, both of which rely on the Hmisc package.

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_cl_normal")
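
To see what a fun.data function hands back to stat_summary, it can be called on a plain vector. The sketch below compares mean_cl_normal with a t-based interval computed by hand; the variable names life_2007, half_width, and by_hand are illustrative.

```r
library(ggplot2)
library(gapminder)
library(Hmisc)

life_2007 <- gapminder$lifeExp[gapminder$year == 2007]

# A one-row data frame with the columns y, ymin, and ymax
ci <- mean_cl_normal(life_2007)
ci

# The same interval by hand, using a t quantile with n - 1 degrees of freedom
n <- length(life_2007)
half_width <- qt(0.975, df = n - 1) * sd(life_2007) / sqrt(n)
by_hand <- c(mean(life_2007) - half_width, mean(life_2007) + half_width)
by_hand
```

Any function that returns a data frame with these three columns can be passed to fun.data in the same way.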

We can also draw the interval as error bars.

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "errorbar",
               width = .4) +
  stat_summary(fun = "mean", geom = "point")

Confidence intervals of any width

Conveniently, the mean_cl_normal function has an argument, conf.int, for changing the width of the confidence interval.

gapminder %>% 
  ggplot(aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_cl_normal",
               fun.args = list(conf.int = .99))
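
The same fun.args mechanism works for mean_cl_boot, which was mentioned earlier but not demonstrated. This is a sketch under two assumptions of mine: 2000 bootstrap resamples (the B argument of the underlying Hmisc function) and a fixed seed so the plot is reproducible.

```r
library(ggplot2)
library(gapminder)
library(Hmisc)

set.seed(1)  # bootstrap resampling is random, so fix the seed

# 99% bootstrap interval of the mean; B is the number of resamples
p_boot <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  stat_summary(fun.data = "mean_cl_boot",
               fun.args = list(conf.int = .99, B = 2000))
p_boot
```

Unlike mean_cl_normal, the bootstrap version does not rest on a normality assumption, at the cost of slightly different limits on each run when no seed is set.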

Combining statistics with multiple geoms

Create a bar chart with 95% confidence intervals

gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(aes(x = continent, y = lifeExp)) +
  stat_summary(fun = "mean", geom = "bar", alpha = .7) +
  stat_summary(fun = "mean", geom = "point", 
               size = 1) +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "errorbar",
               width = .2) 

Use position = position_dodge() to show multiple bars side by side

colors <-c("#E41A1C","#1E90FF","#FF8C00","#4DAF4A","#984EA3",
           "#40E0D0","#FFC0CB","#00BFFF","#FFDEAD","#90EE90",
           "#EE82EE","#00FFFF","#F0A3FF", "#0075DC", 
           "#993F00","#4C005C","#2BCE48","#FFCC99",
           "#808080","#94FFB5","#8F7C00","#9DCC00",
           "#C20088","#003380","#FFA405","#FFA8BB",
           "#426600","#FF0010","#5EF1F2","#00998F",
           "#740AFF","#990000","#FFFF00")
gapminder %>% 
  mutate(
    year = as.factor(year)
  ) %>%
  ggplot(aes(x = continent, y = lifeExp, fill = year)) +
  stat_summary(fun = "mean", geom = "bar", 
               alpha = .7, position = position_dodge(0.95)) +
  stat_summary(fun = "mean", geom = "point", 
               position = position_dodge(0.95),
               size = 1) +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "errorbar",
               position = position_dodge(0.95),
               width = .2) +
  scale_fill_manual(values = colors)+
  theme_minimal()+
  scale_y_continuous(expand=c(0,0))

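Although we have discussed the limitations of geom_*() and shown the power of stat_*(), both have their place. This is not an either/or question. In fact they need each other: just as stat_summary() has a geom argument, the geom_*() functions have a stat argument. At a higher level, stat_*() and geom_*() are simply convenient wrappers around layer(), the function that builds a ggplot layer.

Hadley's own words explain this false dichotomy:

Unfortunately, due to an early design mistake I called these either stat_() or geom_(). A better decision would have been to call them layer_() functions: that's a more accurate description because every layer involves a stat and a geom.

Reference: https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html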
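
The equivalence can be checked with a small sketch that builds the same mean line three ways and compares the computed data with layer_data(). The object names p_stat, p_geom, and p_layer are illustrative, and the direct layer() call is my own illustration of the idea rather than something you would normally write.

```r
library(ggplot2)
library(gapminder)

base <- ggplot(gapminder, aes(x = year, y = lifeExp))

# 1. Start from the stat and choose a geom
p_stat <- base + stat_summary(fun = "mean", geom = "line")

# 2. Start from the geom and choose a stat
p_geom <- base + geom_line(stat = "summary", fun = "mean")

# 3. Call layer() directly, naming both the stat and the geom
p_layer <- base + layer(geom = "line", stat = "summary",
                        position = "identity",
                        params = list(fun = "mean"))

# Compare the x and y values each layer computes
all.equal(layer_data(p_stat)[c("x", "y")], layer_data(p_geom)[c("x", "y")])
all.equal(layer_data(p_stat)[c("x", "y")], layer_data(p_layer)[c("x", "y")])
```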
