可視化是一個重要工具,但是我們需要把數(shù)據(jù)整理成正確的形式來進行可視化。 通常,需要創(chuàng)建一些新的變量或摘要,或者重命名變量或?qū)τ^察值進行重新排序,以使數(shù)據(jù)可視化起來更容易一些。 接下來我們主要會使用到dplyr包,本節(jié)內(nèi)容涉及到summarise()函數(shù)。
summarise()提取數(shù)據(jù)總結(jié)
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 x 1
#> delay
#> <dbl>
#> 1 12.6
summarise()并不是非常有用,除非將其與group_by()配對。 這會將分析單位從完整的數(shù)據(jù)集更改為各個組。 例如,如果將完全相同的代碼應用于按日期分組的數(shù)據(jù)框,則將獲得每個日期的平均延遲:
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day delay
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.5
#> 2 2013 1 2 13.9
#> 3 2013 1 3 11.0
#> 4 2013 1 4 8.95
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> # … with 359 more rows
group_by()和summarise()是使用dplyr時最常用的工具之一:分組摘要。 通過管道運算符可以讓結(jié)果更進一步。
多種操作與管道結(jié)合起來
如果想探索每個位置的距離和平均延遲之間的關(guān)系,可以通過如下代碼:
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
delay <- filter(delay, count > 20, dest != "HNL")
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

管道
%>%可以解決上面代碼中產(chǎn)生很多不必要的中間數(shù)據(jù)框的問題:
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
#> `summarise()` ungrouping output (override with `.groups` argument)
缺失值:
所有聚合函數(shù)都有一個na.rm參數(shù),該參數(shù)會在計算之前刪除丟失的值:
flights %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day mean
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.5
#> 2 2013 1 2 13.9
#> 3 2013 1 3 11.0
#> 4 2013 1 4 8.95
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> # … with 359 more rows
在這種情況下,缺失值表示已取消的航班,我們還可以通過先刪除已取消的航班來解決該問題。 保存此數(shù)據(jù)集,以便在接下來的幾個示例中可以重復使用。
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day mean
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.4
#> 2 2013 1 2 13.7
#> 3 2013 1 3 10.9
#> 4 2013 1 4 8.97
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> # … with 359 more rows
計數(shù):
每當您進行任何匯總時,最好都包含一個計數(shù)(n())或一個非缺失值的計數(shù)(sum(!is.na(x)))。 這樣一來,就可以檢查自己是否基于少量數(shù)據(jù)得出結(jié)論。 例如,讓我們看一下平均延誤最高的飛機(由其機尾號確定):
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)

如果我們繪制航班數(shù)與平均延誤的散點圖,則可以得到更多的見解:
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
#> `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)

查看此類圖時,通??梢詫⒂^察次數(shù)最少的組過濾掉,這樣您就可以在最小的組中看到更多的模式而更少的極端變化。
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)

在這里,我使用來自Lahman軟件包的數(shù)據(jù)來計算每個職業(yè)棒球大聯(lián)盟球員的擊球平均值(命中次數(shù)/嘗試次數(shù))。
# Convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)
batters <- batting %>%
group_by(playerID) %>%
summarise(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
batters %>%
filter(ab > 100) %>%
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point() +
geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

有用的總結(jié)函數(shù)
位置測量:我們使用了均值(x),但中位數(shù)(x)也很有用。 平均值是總和除以長度; 中位數(shù)是x的50%在其上方而x的50%在其下方的值。
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 5
#> # Groups: year, month [12]
#> year month day avg_delay1 avg_delay2
#> <int> <int> <int> <dbl> <dbl>
#> 1 2013 1 1 12.7 32.5
#> 2 2013 1 2 12.7 32.0
#> 3 2013 1 3 5.73 27.7
#> 4 2013 1 4 -1.93 28.3
#> 5 2013 1 5 -1.53 22.6
#> 6 2013 1 6 4.24 24.4
#> # … with 359 more rows
傳播程度:sd(x),IQR(x),mad(x)。 均方根偏差或標準偏差sd(x)是擴展的標準度量。 四分位間距IQR(x)和中位數(shù)絕對偏差mad(x)是魯棒的等效項,如果您有異常值,則可能會更有用。
# Why is distance to some destinations more variable than to others?
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 104 x 2
#> dest distance_sd
#> <chr> <dbl>
#> 1 EGE 10.5
#> 2 SAN 10.4
#> 3 SFO 10.2
#> 4 HNL 10.0
#> 5 SEA 9.98
#> 6 LAS 9.91
#> # … with 98 more rows
等級的度量:min(x),分位數(shù)(x,0.25),max(x)。 分位數(shù)是中位數(shù)的概括。 例如,分位數(shù)(x,0.25)將找到一個x值,該值大于該值的25%,小于其余的75%。
# When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 5
#> # Groups: year, month [12]
#> year month day first last
#> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 2356
#> 2 2013 1 2 42 2354
#> 3 2013 1 3 32 2349
#> 4 2013 1 4 25 2358
#> 5 2013 1 5 14 2357
#> 6 2013 1 6 16 2355
#> # … with 359 more rows
位置度量:first(x),nth(x,2),last(x)。 它們的工作方式類似于x [1],x [2]和x [length(x)],但是如果該位置不存在,則讓您設置默認值(即,您嘗試從該組中獲取第三個元素 只有兩個元素)。 例如,我們可以找到每天的首班車和末班車:
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 5
#> # Groups: year, month [12]
#> year month day first_dep last_dep
#> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 2356
#> 2 2013 1 2 42 2354
#> 3 2013 1 3 32 2349
#> 4 2013 1 4 25 2358
#> 5 2013 1 5 14 2357
#> 6 2013 1 6 16 2355
#> # … with 359 more rows
這些功能是對等級過濾的補充。 過濾為您提供了所有變量,每個觀察值都位于單獨的行中:
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
#> # A tibble: 770 x 20
#> # Groups: year, month, day [365]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 2356 2359 -3 425 437
#> 3 2013 1 2 42 2359 43 518 442
#> 4 2013 1 2 2354 2359 -5 413 437
#> 5 2013 1 3 32 2359 33 504 442
#> 6 2013 1 3 2349 2359 -10 434 445
#> # … with 764 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int>
計數(shù):您已經(jīng)看到n(),它不帶任何參數(shù),并返回當前組的大小。 要計算非缺失值的數(shù)量,請使用sum(!is.na(x))。 要計算不同(唯一)值的數(shù)量,請使用n_distinct(x)。
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 104 x 2
#> dest carriers
#> <chr> <int>
#> 1 ATL 7
#> 2 BOS 7
#> 3 CLT 7
#> 4 ORD 7
#> 5 TPA 7
#> 6 AUS 6
#> # … with 98 more rows
計數(shù)非常有用,如果您只想計數(shù),dplyr會提供一個簡單的幫助器:
not_cancelled %>%
count(dest)
#> # A tibble: 104 x 2
#> dest n
#> <chr> <int>
#> 1 ABQ 254
#> 2 ACK 264
#> 3 ALB 418
#> 4 ANC 8
#> 5 ATL 16837
#> 6 AUS 2411
#> # … with 98 more rows
您可以選擇提供一個權(quán)重變量。 例如,您可以使用它來“計算”(總和)飛機飛行的總英里數(shù):
not_cancelled %>%
count(tailnum, wt = distance)
#> # A tibble: 4,037 x 2
#> tailnum n
#> <chr> <dbl>
#> 1 D942DN 3418
#> 2 N0EGMQ 239143
#> 3 N10156 109664
#> 4 N102UW 25722
#> 5 N103US 24619
#> 6 N104UW 24616
#> # … with 4,031 more rows
邏輯值的計數(shù)和比例:sum(x> 10),mean(y == 0)。 與數(shù)值函數(shù)一起使用時,TRUE轉(zhuǎn)換為1,F(xiàn)ALSE轉(zhuǎn)換為0。這使sum()和mean()非常有用:sum(x)給出x中TRUE的數(shù)量,而mean(x)給出比例。
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day n_early
#> <int> <int> <int> <int>
#> 1 2013 1 1 0
#> 2 2013 1 2 3
#> 3 2013 1 3 4
#> 4 2013 1 4 3
#> 5 2013 1 5 3
#> 6 2013 1 6 2
#> # … with 359 more rows
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_prop = mean(arr_delay > 60))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day hour_prop
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 0.0722
#> 2 2013 1 2 0.0851
#> 3 2013 1 3 0.0567
#> 4 2013 1 4 0.0396
#> 5 2013 1 5 0.0349
#> 6 2013 1 6 0.0470
#> # … with 359 more rows
Grouping by multiple variables
當您按多個變量分組時,每個摘要都會剝離分組的一個級別。 這樣可以輕松地逐步匯總數(shù)據(jù)集:
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day flights
#> <int> <int> <int> <int>
#> 1 2013 1 1 842
#> 2 2013 1 2 943
#> 3 2013 1 3 914
#> 4 2013 1 4 915
#> 5 2013 1 5 720
#> 6 2013 1 6 832
#> # … with 359 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
#> `summarise()` regrouping output by 'year' (override with `.groups` argument)
#> # A tibble: 12 x 3
#> # Groups: year [1]
#> year month flights
#> <int> <int> <int>
#> 1 2013 1 27004
#> 2 2013 2 24951
#> 3 2013 3 28834
#> 4 2013 4 28330
#> 5 2013 5 28796
#> 6 2013 6 28243
#> # … with 6 more rows
(per_year <- summarise(per_month, flights = sum(flights)))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 1 x 2
#> year flights
#> <int> <int>
#> 1 2013 336776
逐步匯總時要小心:總計和計數(shù)是可以的,但是您需要考慮加權(quán)均值和方差,并且不可能對中位數(shù)等基于排名的統(tǒng)計數(shù)據(jù)進行精確的加權(quán)。 換句話說,按組和的和是總和,但是按組中和的中值不是整體中值。
Ungrouping
如果需要刪除分組并返回對未分組數(shù)據(jù)的操作,請使用ungroup()。
daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flights
#> # A tibble: 1 x 1
#> flights
#> <int>
#> 1 336776
Grouped mutates (and filters)
與summarise()結(jié)合使用時,分組最有用,但是您也可以使用mutate()和filter()進行便捷的操作:
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
#> # A tibble: 3,306 x 7
#> # Groups: year, month, day [365]
#> year month day dep_delay arr_delay distance air_time
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2013 1 1 853 851 184 41
#> 2 2013 1 1 290 338 1134 213
#> 3 2013 1 1 260 263 266 46
#> 4 2013 1 1 157 174 213 60
#> 5 2013 1 1 216 222 708 121
#> 6 2013 1 1 255 250 589 115
#> # … with 3,300 more rows
查找大于閾值的所有組:
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
popular_dests
#> # A tibble: 332,577 x 19
#> # Groups: dest [77]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 332,571 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
標準化以計算每組指標:
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
#> # A tibble: 131,106 x 6
#> # Groups: dest [77]
#> year month day dest arr_delay prop_delay
#> <int> <int> <int> <chr> <dbl> <dbl>
#> 1 2013 1 1 IAH 11 0.000111
#> 2 2013 1 1 IAH 20 0.000201
#> 3 2013 1 1 MIA 33 0.000235
#> 4 2013 1 1 ORD 12 0.0000424
#> 5 2013 1 1 FLL 19 0.0000938
#> 6 2013 1 1 ORD 8 0.0000283
#> # … with 131,100 more rows
標準化以計算每組指標:
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
#> # A tibble: 131,106 x 6
#> # Groups: dest [77]
#> year month day dest arr_delay prop_delay
#> <int> <int> <int> <chr> <dbl> <dbl>
#> 1 2013 1 1 IAH 11 0.000111
#> 2 2013 1 1 IAH 20 0.000201
#> 3 2013 1 1 MIA 33 0.000235
#> 4 2013 1 1 ORD 12 0.0000424
#> 5 2013 1 1 FLL 19 0.0000938
#> 6 2013 1 1 ORD 8 0.0000283
#> # … with 131,100 more rows