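Grouped statistics means taking one or more sets of observations (for example, scores in each subject) and, for each level of a grouping variable (for example, each class), computing summary statistics of the distribution (mean, sum, ...). I know of three ways to do this: first, the aggregate() function from the base stats package; second, the melt() and dcast() pair from the reshape2 package; third, group_by() and summarise() from the dplyr package. Because dplyr table operations were already covered in an earlier note (dplyr表格操作, on 簡書, jianshu.com), this section focuses on the first two approaches.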
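1. aggregate()
Form 1: aggregate(observations, grouping, summary function)
- The observations are a data frame (or an object that can be coerced to one, such as a matrix) and may have several columns;
- The grouping is a list and may hold several grouping variables, but each element must have the same length as the number of rows in the observations;
- The summary function is something like mean or sum.
The example below makes this quick to grasp.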
head(state.x77)
# Population Income Illiteracy Life Exp Murder HS Grad Frost Area
# Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
# Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
# Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
# Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
# California 21198 5114 1.1 71.71 10.3 62.6 20 156361
# Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
dim(state.x77)
#[1] 50 8
str(state.region)
#Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
table(state.region)
# state.region
# Northeast South North Central West
# 9 16 12 13
aggregate(state.x77, list(Region = state.region), mean)
# Region Population Income Illiteracy Life Exp Murder HS Grad Frost Area
# 1 Northeast 5495.111 4570.222 1.000000 71.26444 4.722222 53.96667 132.7778 18141.00
# 2 South 4208.125 4011.938 1.737500 69.70625 10.581250 44.34375 64.6250 54605.12
# 3 North Central 4803.000 4611.083 0.700000 71.76667 5.275000 54.51667 138.8333 62652.00
# 4 West 2915.308 4702.615 1.023077 71.23462 7.215385 62.00000 102.1538 134463.00
# two grouping variables
aggregate(state.x77[,1:4], # observations from the selected columns
list(Region = state.region,
Cold = state.x77[,"Frost"] > 130), # the two grouping variables
mean)
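The two-group call above can be checked by hand. The sketch below is only an illustration using datasets that ship with R; the names obs, groups, res, west_cold and manual are made up for this example.

```r
# Form 1 with two grouping variables, built only from datasets shipped with R.
obs <- as.data.frame(state.x77[, 1:4])          # observations: 50 rows, 4 columns
groups <- list(Region = state.region,
               Cold   = state.x77[, "Frost"] > 130)

# Every grouping vector must line up with the rows of the observations.
stopifnot(all(lengths(groups) == nrow(obs)))

res <- aggregate(obs, groups, mean)
print(res)

# Cross-check one cell by hand: mean Income of the cold Western states.
west_cold <- state.region == "West" & groups$Cold
manual <- mean(obs$Income[west_cold])
print(manual)
```

Region/Cold combinations that contain no states do not appear as rows in the result.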
Form 2: aggregate(observations ~ grouping, dataset, summary function)
- This form requires a data frame that contains both the observations and the grouping variables.
# value ~ group
aggregate(weight ~ feed, data = chickwts, mean)
# value ~ group1 + group2
aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
# cbind(value1, value2) ~ group
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
# cbind(value1, value2) ~ group1 + group2
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)
## Dot notation: . stands for all columns not otherwise named in the formula
aggregate(. ~ Species, data = iris, mean)
aggregate(len ~ ., data = ToothGrowth, mean)
## Often followed by xtabs():
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
# supp dose len
# 1 OJ 0.5 13.23
# 2 VC 0.5 7.98
# 3 OJ 1.0 22.70
# 4 VC 1.0 16.77
# 5 OJ 2.0 26.06
# 6 VC 2.0 26.14
xtabs(len ~ ., data = ag)
# dose
# supp 0.5 1 2
# OJ 13.23 22.70 26.06
# VC 7.98 16.77 26.14
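One point worth keeping in mind when choosing between the two forms is how missing values are treated. The formula interface uses na.action = na.omit by default, so a row with an NA in any variable of the formula is dropped before grouping, whereas Form 1 with na.rm = TRUE removes NAs column by column. A small sketch on airquality (the names by_formula and by_list are illustrative):

```r
# Formula interface: the default na.action = na.omit drops every row that has
# an NA in Ozone or Temp before the groups are formed.
by_formula <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

# Form 1 with na.rm = TRUE passed on to mean(): NAs are removed per column.
by_list <- aggregate(airquality[c("Ozone", "Temp")],
                     list(Month = airquality$Month),
                     mean, na.rm = TRUE)

# Temp itself has no missing values, but the formula version averages it only
# over the rows where Ozone was observed, so the two Temp columns can differ.
print(data.frame(Month        = by_formula$Month,
                 Temp_formula = by_formula$Temp,
                 Temp_list    = by_list$Temp))
```

If per-column NA removal is wanted with the formula interface, passing na.action = na.pass together with na.rm = TRUE is one option.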
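2. The reshape2 package
Approach: first melt the data into long format, then use dcast to compute the grouped statistics. As before, the grouping variables and the observations must be in the same data frame.
- melt (also useful for preparing data for ggplot2 plots)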
library(reshape2)
library(dplyr)
airquality %>% head
# Ozone Solar.R Wind Temp Month Day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
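# the id argument specifies which columns are category, grouping or ID columns (by name or by position)
# every column not listed in id is treated as an observation (measured) column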
melt(airquality, id=c("Month", "Day")) %>% head
# Month Day variable value
# 1 5 1 Ozone 41
# 2 5 2 Ozone 36
# 3 5 3 Ozone 12
# 4 5 4 Ozone 18
# 5 5 5 Ozone NA
# 6 5 6 Ozone 28
melt(ChickWeight, id=2:4) %>% head
# variable.name sets the name of the variable column (default: variable)
# value.name sets the name of the value column (default: value)
melt(airquality, id=c("Month", "Day"),
variable.name = "AA",
value.name = "aa") %>% head
# Month Day AA aa
# 1 5 1 Ozone 41
# 2 5 2 Ozone 36
# 3 5 3 Ozone 12
# 4 5 4 Ozone 18
# 5 5 5 Ozone NA
# 6 5 6 Ozone 28
melt() also works on lists and produces the format I previously had to assemble by hand, which is very convenient.
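As a minimal sketch of melting a list, assuming reshape2 is installed (the list scores and its values are made up for illustration):

```r
library(reshape2)

# A small named list whose elements have different lengths (made-up values).
scores <- list(classA = c(90, 85, 77), classB = c(60, 72))

# melt() stacks the elements into one long data frame: the numbers go into
# a `value` column and the element names into a column called `L1`.
long <- melt(scores)
print(long)
```

The resulting long table can then be passed to aggregate() or dcast() like any other data frame.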
- dcast for grouped statistics: dcast(dataset, grouping ~ variable, summary function)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
head(aqm)
# group ~ variable column
dcast(aqm, Month ~ variable, mean)
# multiple grouping variables
dcast(aqm, Day + Month ~ variable, mean)
# margins argument: whether to also compute the overall (marginal) statistics
dcast(aqm, Month ~ variable, mean, margins = TRUE)
# cast back to the original wide layout (cells removed by na.rm come back as NA)
dcast(aqm, Day + Month ~ variable) %>% head
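To tie the two approaches together, the monthly means from dcast can be compared with aggregate() Form 1. This is only a sketch, assuming reshape2 is installed; the names by_dcast, by_aggregate and vars are illustrative.

```r
library(reshape2)

aqm <- melt(airquality, id = c("Month", "Day"), na.rm = TRUE)
by_dcast <- dcast(aqm, Month ~ variable, mean)

# Same summary with aggregate() Form 1, removing NAs column by column,
# which matches melting with na.rm = TRUE.
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
by_aggregate <- aggregate(airquality[vars],
                          list(Month = airquality$Month),
                          mean, na.rm = TRUE)

# Compare the numeric parts of the two results.
print(all.equal(as.matrix(by_dcast[, vars]),
                as.matrix(by_aggregate[, vars]),
                check.attributes = FALSE))
```

The comparison uses Form 1 rather than the formula interface because the formula interface drops whole rows with any NA by default.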
# with the form dcast(dataset, group ~ group) and no summary function, dcast falls back to length (counts) when a cell holds several values
chick_m <- melt(ChickWeight, id=2:4, na.rm=TRUE)
head(chick_m)
dcast(chick_m, Time ~ variable, mean) # average effect of time
dcast(chick_m, Diet ~ variable, mean) # average effect of diet
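dcast(chick_m, Diet ~ Chick) # several rows per cell, so dcast defaults to length: counts per Diet/Chick (two groups)
dcast(chick_m, Time + Diet ~ Chick) # three groups: each combination occurs at most once, so cells hold the weight values; pass length to get counts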
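Rather than relying on the fallback, the counting can be made explicit by passing length as the summary function. A small sketch, assuming reshape2 is installed (the names counts and tab are illustrative), with a cross-check against table():

```r
library(reshape2)

chick_m <- melt(ChickWeight, id = 2:4, na.rm = TRUE)

# Explicit length: number of weight records in each Diet/Chick cell.
# Combinations that never occur are filled with 0.
counts <- dcast(chick_m, Diet ~ Chick, length)

# Cross-check against a plain contingency table of the original data.
tab <- table(ChickWeight$Diet, ChickWeight$Chick)
chicks <- levels(ChickWeight$Chick)
print(all.equal(as.matrix(counts[, chicks]),
                unclass(tab)[, chicks],
                check.attributes = FALSE))
```

Stating the function explicitly also avoids the "Aggregation function missing" message.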
The long-format table produced by melt is also the format that ggplot2 expects for plotting.
