2019-02-03

R4DS

wj

2019,2,2

library(tidyverse)
library(nycflights13)
library(forcats)
library(lubridate)
library(modelr)

Preface

(1) Import into R (readr)

Importing means reading data stored in a file, a database, or a web API and loading it into a data frame in R.

(2) Tidy (tidyr)

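Tidying means storing the data in a consistent form, so that the way it is stored matches the semantics of the dataset. In brief, when data is tidy, each column is a variable and each row is an observation.

(3) Transform (dplyr)

Transformation includes narrowing in on observations of interest (such as all people living in one city, or all data from the last year), creating new variables from existing ones, and calculating summary statistics (such as means or counts). Tidying and transforming together are called data wrangling.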

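(4) Visualise

Visualisation is fundamentally a human activity. It raises new questions about the data.

(5) Model

Once you have made your questions sufficiently precise, you can use a model to answer them. Models are fundamentally mathematical or computational tools. Every model makes assumptions, and a model cannot question its own assumptions.

(6) Communicate

(7) Program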

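Hypothesis confirmation

Data analysis can be divided into two camps: hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis). The focus of this book is unabashedly on hypothesis generation, or data exploration. It is common to think of modelling as a tool for hypothesis confirmation and of visualisation as a tool for hypothesis generation. That is a false dichotomy: models are often used for exploration, and with a little care visualisation can be used for confirmation. The key difference is how often you look at each observation: if you look only once, it is confirmation; if you look more than once, it is exploration.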

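Part I: Explore

Chapter 1: Data visualisation with ggplot2

Plotting functions:

geom_point(): scatterplot
facet_wrap(): facets by a single variable
facet_grid(): facets by a combination of two variables
geom_smooth(): smoothed line fitted to the data (not a plain line chart)
geom_bar(): bar chart
geom_boxplot(): boxplot
coord_flip(): swaps the x and y axes
coord_polar(): uses polar coordinates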

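Aesthetics and related arguments:

color = class: colours the points by class in the mpg dataset
alpha = class: maps point transparency to class
shape: shape of the points
linetype: type of line drawn
se = FALSE: hides the confidence band around the smoothed line (an argument to geom_smooth(), not an aesthetic)
..prop..: plots proportions rather than counts
fill: fill colour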

The mpg data frame

mpg

## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows

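displ: engine size, in litres.

hwy: fuel efficiency on the highway, in miles per gallon.

Creating a ggplot

To plot mpg, run the following code, which puts displ on the x-axis and hwy on the y-axis.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))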


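Graphing template

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Aesthetic mappings

An aesthetic is a visual property of the objects in your plot, including the size, shape, and colour of the points.

(1) Map colour to class

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))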
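As a contrast to mapping colour to a variable, the sketch below also sets one fixed colour for all points. It assumes the ggplot2 package is installed; the object names p_mapped and p_fixed are made up for this illustration.

```r
library(ggplot2)

# Mapping: colour varies with the class variable, so it goes inside aes().
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Setting: every point gets the same colour, so it goes outside aes().
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```

Calling print(p_mapped) or print(p_fixed) draws the corresponding plot. Only the mapped version gets a legend, because only there does colour carry information about the data.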


Facets

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)


ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


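Stacked fill colours

When fill is mapped to a second variable, geom_bar() stacks the bars automatically. The stacking is done by the position adjustment given in the position argument. If you do not want a stacked bar chart, you can use one of three other options:

position = "identity" places each object exactly where it falls, so the bars overlap:

ggplot(
  data = diamonds,
  mapping = aes(x = cut, fill = clarity)
) +
  geom_bar(alpha = 1/5, position = "identity")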


ggplot(
  data = diamonds,
  mapping = aes(x = cut, color = clarity)
) +
  geom_bar(fill = NA, position = "identity")


position = "fill" works like stacking, but makes every set of stacked bars the same height, which makes it easy to compare proportions across groups:

ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    position = "fill"
  )


position = "dodge" places the bars of each group side by side:

ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    position = "dodge"
  )


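Some points overlap and hide one another. Adding jitter, a small amount of random noise on each point, spreads them out and avoids the gridded arrangement.

ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = "jitter"
  )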


Polar coordinates: coord_polar()

bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()


bar + coord_polar()


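The layered grammar of graphics

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

You can describe any plot precisely as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.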

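Chapter 2: Workflow: basics

Functions

function_name(arg1 = val1, arg2 = val2, ...)

seq(): generates regular sequences of numbers
x %in% y: selects every row where x is one of the values in y
is.na(): tests whether a value is missing
year:day: selects all columns from year to day (inclusive)
-(year:day): selects all columns except those from year to day (year and day themselves are dropped too)
starts_with("abc"): matches names that begin with "abc"
ends_with("xyz"): matches names that end with "xyz"
contains("ijk"): matches names that contain "ijk"
matches("(.)\\1"): selects variables that match a regular expression; this one matches any variable whose name contains a repeated character
num_range("x", 1:3): matches x1, x2 and x3
na.rm = TRUE: removes missing values before the computation, for example sum(AB, na.rm = TRUE)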
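The first three helpers in the list above are base R functions, so their behaviour can be checked on small vectors. This is a minimal sketch; the vectors month and delay are invented for the illustration and are not taken from flights.

```r
# seq() builds a regular sequence of numbers.
evens <- seq(2, 10, by = 2)            # 2 4 6 8 10

# %in% tests, element by element, whether each value on the left
# occurs anywhere in the vector on the right.
month <- c(1, 11, 12, 6)
is_nov_dec <- month %in% c(11, 12)     # FALSE TRUE TRUE FALSE

# is.na() flags the missing values.
delay <- c(5, NA, -2)
is_missing <- is.na(delay)             # FALSE TRUE FALSE
```

Inside filter(), a logical vector such as is_nov_dec is what decides which rows are kept.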

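Common summary functions

mean(x): the mean of x
median(x): the median of x
sd(x): the standard deviation (the root mean squared deviation), the standard measure of spread; it is not the standard error
IQR(x): interquartile range
mad(x): median absolute deviation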
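To make the three measures of spread concrete, the sketch below computes each one by hand and with the built-in function. The vector x is made up for the example; the constant 1.4826 is the default scaling factor of mad() in base R.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Standard deviation: square root of the sum of squared deviations
# from the mean, divided by n - 1.
sd_manual <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
sd_builtin <- sd(x)                       # about 2.14

# Interquartile range: 75th percentile minus 25th percentile.
iqr_manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))
iqr_builtin <- IQR(x)                     # 1.5

# Median absolute deviation, scaled by the default constant 1.4826.
mad_manual <- 1.4826 * median(abs(x - median(x)))
mad_builtin <- mad(x)
```

IQR() and mad() are less sensitive to outliers than sd(), which is why they are often preferred for skewed data.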

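Measures of rank

min(x): the smallest value
quantile(x, 0.25): a value of x that is greater than 25% of the values and less than the remaining 75%
max(x): the largest value

Measures of position

first(x): same as x[1]
nth(x, 2): same as x[2]
last(x): same as x[length(x)]

Counts

sum(!is.na(x)): counts the non-missing values
n_distinct(x): counts the distinct (unique) values
wt = distance: in count(), weights the count by distance, so the result is a sum of distance rather than a number of rows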
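The rank and count helpers can be tried on small vectors. The sketch below sticks to base R, so length(unique(...)) stands in for dplyr's n_distinct(); the vectors x and y are invented for the example.

```r
x <- c(10, 20, 30, 40, 50)

# 25% of the values lie below the 0.25 quantile.
q25 <- unname(quantile(x, 0.25))          # 20

# Position-based access, the base R forms of first() and last().
first_x <- x[1]                           # 10
last_x <- x[length(x)]                    # 50

y <- c(3, NA, 3, 7)

# Number of non-missing values.
n_non_missing <- sum(!is.na(y))           # 3

# Number of distinct non-missing values.
n_unique <- length(unique(y[!is.na(y)]))  # 2
```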

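Keyboard shortcuts

Alt + -: inserts the assignment operator <-
Tab: after selecting the function you want, press Tab again and RStudio adds the parentheses ()
Ctrl + Shift + N: creates a new script
Ctrl + Enter: runs the current expression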

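Variable types

int: integers
dbl: doubles, or real numbers
chr: character vectors, or strings
dttm: date-times (a date plus a time)
lgl: logical values
fctr: factors, which R uses to represent categorical variables with a fixed set of possible values
date: dates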

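Operators

Arithmetic operators are vectorised and follow the so-called "recycling rules": a shorter argument is repeated to match the length of the longer one.

%/%: integer division
%%: remainder
log(), log2(), log10(): logarithms
lead() and lag(): return the leading and lagging values of a sequence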
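Integer division and remainder are handy for splitting a clock time stored as a number such as 517 (5:17) into hours and minutes. This is a small sketch in base R; the dep_time values are examples in the same HHMM format that flights uses.

```r
dep_time <- c(517, 533, 1004, 2359)

# %/% keeps the whole-number part of the division, %% keeps the remainder.
hour <- dep_time %/% 100                        # 5 5 10 23
minute <- dep_time %% 100                       # 17 33 4 59

# Recycling: the single value 60 is reused for every element of hour.
minutes_since_midnight <- hour * 60 + minute    # 317 333 604 1439

# Logarithms are vectorised in the same way.
log2_values <- log2(c(1, 2, 8))                 # 0 1 3
```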

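dplyr basics

The five core dplyr functions

filter(): picks observations by their values (selects rows; takes logical expressions as arguments)
arrange(): reorders the rows
mutate(): creates new variables as functions of existing variables
summarize(): collapses many values down to a single summary
select(): picks columns

cumsum(): cumulative sum
cumprod(): cumulative product
cummin(): cumulative minimum
cummax(): cumulative maximum
cummean(): cumulative mean (provided by dplyr)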
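A quick way to see what the cumulative functions do is to run them on a short vector. The sketch uses base R only, so the cumulative mean is computed by hand as a stand-in for dplyr's cummean(); the vector x is made up.

```r
x <- c(3, 1, 4, 1, 5)

running_sum <- cumsum(x)                  # 3 4 8 9 14
running_prod <- cumprod(x)                # 3 3 12 12 60
running_min <- cummin(x)                  # 3 1 1 1 1
running_max <- cummax(x)                  # 3 3 4 4 5

# Cumulative mean: running total divided by the number of values so far.
running_mean <- cumsum(x) / seq_along(x)
```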

Filtering rows with filter()

filter(flights, month == 1, day == 1)

## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

%in%

(nov_dec <- filter(flights, month %in% c(11, 12)))

## # A tibble: 55,403 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    11     1        5           2359         6      352
##  2  2013    11     1       35           2250       105      123
##  3  2013    11     1      455            500        -5      641
##  4  2013    11     1      539            545        -6      856
##  5  2013    11     1      542            545        -3      831
##  6  2013    11     1      549            600       -11      912
##  7  2013    11     1      550            600       -10      705
##  8  2013    11     1      554            600        -6      659
##  9  2013    11     1      554            600        -6      826
## 10  2013    11     1      554            600        -6      749
## # ... with 55,393 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

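filter() only keeps rows where the condition is TRUE; it drops rows where the condition is FALSE or NA. To keep the missing values, ask for them explicitly.

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)

## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3

filter(df, is.na(x) | x > 1)

## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3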
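The reason behind this behaviour is that a comparison with NA is itself NA. The sketch below shows this in base R, using which() as a stand-in for filter() because which() also keeps only the TRUE positions; the result names are made up.

```r
x <- c(1, NA, 3)

# Comparing NA with anything gives NA, not FALSE.
cmp <- x > 1                                     # FALSE NA TRUE

# which() keeps only the TRUE positions, as filter() does.
kept_default <- x[which(x > 1)]                  # 3

# Asking for the missing values explicitly keeps them.
kept_with_na <- x[which(is.na(x) | x > 1)]       # NA 3
```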

Arranging rows with arrange()

arrange(flights, year, month, day)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Note: arrange() sorts in ascending order by default.

Use desc() to sort by a column in descending order:

arrange(flights, desc(arr_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     7    22     2257            759       898      121
##  9  2013    12     5      756           1700       896     1058
## 10  2013     5     3     1133           2055       878     1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Missing values are always sorted at the end, in ascending and in descending order alike:

df <- tibble(x = c(5, 2, NA))
arrange(df, x)

## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3    NA

arrange(df, desc(x))

## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA

Selecting columns with select()

select(flights, year, month, day)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

select(flights, year:day)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

Select all columns except those from year to day (year and day themselves are dropped too):

select(flights, -(year:day))

## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515         2      830            819        11
##  2      533            529         4      850            830        20
##  3      542            540         2      923            850        33
##  4      544            545        -1     1004           1022       -18
##  5      554            600        -6      812            837       -25
##  6      554            558        -4      740            728        12
##  7      555            600        -5      913            854        19
##  8      557            600        -3      709            723       -14
##  9      557            600        -3      838            846        -8
## 10      558            600        -2      753            745         8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Move a few variables to the start of the data frame:

select(flights, time_hour, air_time, everything())

## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

Adding new variables with mutate()

flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)

mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

## # A tibble: 336,776 x 9
##     year month   day dep_delay arr_delay distance air_time  gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227     9  370.
##  2  2013     1     1         4        20     1416      227    16  374.
##  3  2013     1     1         2        33     1089      160    31  408.
##  4  2013     1     1        -1       -18     1576      183   -17  517.
##  5  2013     1     1        -6       -25      762      116   -19  394.
##  6  2013     1     1        -4        12      719      150    16  288.
##  7  2013     1     1        -5        19     1065      158    24  404.
##  8  2013     1     1        -3       -14      229       53   -11  259.
##  9  2013     1     1        -3        -8      944      140    -5  405.
## 10  2013     1     1        -2         8      733      138    10  319.
## # ... with 336,766 more rows

Columns can be used as soon as they are created, even within the same mutate() call:

mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

## # A tibble: 336,776 x 10
##     year month   day dep_delay arr_delay distance air_time  gain hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227     9 3.78
##  2  2013     1     1         4        20     1416      227    16 3.78
##  3  2013     1     1         2        33     1089      160    31 2.67
##  4  2013     1     1        -1       -18     1576      183   -17 3.05
##  5  2013     1     1        -6       -25      762      116   -19 1.93
##  6  2013     1     1        -4        12      719      150    16 2.5
##  7  2013     1     1        -5        19     1065      158    24 2.63
##  8  2013     1     1        -3       -14      229       53   -11 0.883
##  9  2013     1     1        -3        -8      944      140    -5 2.33
## 10  2013     1     1        -2         8      733      138    10 2.3
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

If you only want to keep the new variables, use transmute():

transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

## # A tibble: 336,776 x 3
##     gain hours gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1     9 3.78           2.38
##  2    16 3.78           4.23
##  3    31 2.67          11.6
##  4   -17 3.05          -5.57
##  5   -19 1.93          -9.83
##  6    16 2.5            6.4
##  7    24 2.63           9.11
##  8   -11 0.883        -12.5
##  9    -5 2.33          -2.14
## 10    10 2.3            4.35
## # ... with 336,766 more rows

lead() and lag() return the leading and lagging values of a sequence:

(x <- 1:10)

##  [1]  1  2  3  4  5  6  7  8  9 10

lag(x)

##  [1] NA  1  2  3  4  5  6  7  8  9

lead(x)

##  [1]  2  3  4  5  6  7  8  9 10 NA
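A typical use of lag() is to compare each value with the one before it. The sketch below assumes the dplyr package is installed and calls dplyr::lag() explicitly, because base R's stats package also has a function named lag(); the vector x is invented.

```r
library(dplyr)

x <- c(1, 4, 9, 16, 16)

# Difference between each value and the previous one.
running_diff <- x - dplyr::lag(x)      # NA 3 5 7 0

# Flag the positions where the value changed.
changed <- x != dplyr::lag(x)          # NA TRUE TRUE TRUE FALSE
```

The first element is NA in both results because the first value has no predecessor.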

Grouped summaries with summarize()

summarize() collapses a data frame to a single row:

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

group_by() changes the unit of analysis from the complete dataset to individual groups. After that, dplyr functions used on the grouped data frame are applied automatically to each group.

by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5
##  2  2013     1     2 13.9
##  3  2013     1     3 11.0
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows
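The same pattern is easier to follow on a table small enough to check by hand. This sketch assumes dplyr is installed; the toy tibble and its values are invented and only borrow the column name dep_delay from flights.

```r
library(dplyr)

# Five made-up flights spread over two days, one with a missing delay.
toy <- tibble(
  day = c(1, 1, 2, 2, 2),
  dep_delay = c(10, NA, 0, 6, 3)
)

by_day <- group_by(toy, day)

# One output row per group: the mean delay and the number of rows.
daily <- summarize(
  by_day,
  delay = mean(dep_delay, na.rm = TRUE),
  n = n()
)
```

Here daily has two rows: day 1 with a mean delay of 10 over 2 flights, and day 2 with a mean delay of 3 over 3 flights.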

The pipe, %>%

%>% is best read as "then".

x %>% f(y) turns into f(x, y)

x %>% f(y) %>% g(z) turns into g(f(x, y), z)
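The two rewriting rules can be checked directly by comparing a nested call with its piped form. The sketch assumes dplyr is installed, since dplyr makes %>% available; the vector x is invented.

```r
library(dplyr)

x <- c(1.234, 5.678)

# Nested form: read from the inside out.
nested <- sum(round(x, 1))

# Piped form: take x, then round to one decimal place, then sum.
piped <- x %>% round(1) %>% sum()

same_result <- identical(nested, piped)   # TRUE
```

The piped form lists the steps in the order in which they happen, which is the main reason to prefer it for longer chains.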

Missing values

flights %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # ... with 355 more rows

This produces a lot of missing values, because aggregation functions obey the usual rule for missing values: if there is any missing value in the input, the output is missing too. The na.rm argument removes the missing values before the computation:

flights %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay, na.rm = TRUE))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5
##  2  2013     1     2 13.9
##  3  2013     1     3 11.0
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows
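The rule is easiest to see on a three-element vector. A minimal sketch in base R, with made-up delay values:

```r
dep_delay <- c(2, NA, 4)

# One missing input makes the whole summary missing.
mean_with_na <- mean(dep_delay)                          # NA

# na.rm = TRUE drops the missing value before averaging.
mean_na_rm <- mean(dep_delay, na.rm = TRUE)              # 3

# Removing the missing values by hand gives the same result.
mean_filtered <- mean(dep_delay[!is.na(dep_delay)])      # 3
```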

Here the missing values represent cancelled flights, so we can also deal with them by removing the cancelled flights first.

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))  # the key step

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.4
##  2  2013     1     2 13.7
##  3  2013     1     3 10.9
##  4  2013     1     4  8.97
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.56
##  9  2013     1     9  2.30
## 10  2013     1    10  2.84
## # ... with 355 more rows

Measures of location

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    avg_delay1 = mean(arr_delay),
    # arr_delay[arr_delay > 0] keeps only the elements of arr_delay that are greater than 0
    avg_delay2 = mean(arr_delay[arr_delay > 0])
  )

## # A tibble: 365 x 5
## # Groups:   year, month [?]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # ... with 355 more rows
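The subsetting used for avg_delay2 is ordinary logical indexing, so it can be tried outside summarize(). A small base R sketch with invented arrival delays:

```r
arr_delay <- c(11, 20, -18, -25, 12)

# Mean over all flights: early arrivals (negative delays) pull it down.
avg_all <- mean(arr_delay)                        # 0

# Mean over late flights only: keep the elements greater than 0.
avg_positive <- mean(arr_delay[arr_delay > 0])    # 43 / 3, about 14.3
```

The two numbers answer different questions: the first is the typical delay overall, the second is how late a flight is given that it is late at all.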

not_cancelled %>%
  count(tailnum, wt = distance)

## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows

Counts and proportions of logical values

sum(x > 10)

When a logical vector is used with a numeric function, TRUE is converted to 1 and FALSE to 0, so sum(x > 10) counts the values greater than 10.

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # ... with 355 more rows
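Because TRUE counts as 1 and FALSE as 0, sum() gives a count and mean() gives a proportion. A minimal base R sketch with made-up departure times in HHMM format:

```r
dep_time <- c(455, 517, 2359, 430, 1200)

# Logical vector: TRUE where the flight left before 5:00.
is_early <- dep_time < 500          # TRUE FALSE FALSE TRUE FALSE

# sum() counts the TRUE values.
n_early <- sum(is_early)            # 2

# mean() gives the share of TRUE values.
prop_early <- mean(is_early)        # 0.4
```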

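Chapter 5: Exploratory data analysis

This chapter shows how to use visualisation and transformation to explore data in a systematic way, a task that statisticians call exploratory data analysis (EDA). EDA is an iterative cycle.

The tools of EDA: visualisation, transformation, and modelling.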
