
R4DS

2019,2,2

library(tidyverse)
library(nycflights13)
library(forcats)
library(lubridate)
library(modelr)

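Preface

(1) Import into R (readr)

In practice this means reading data stored in a file, a database, or a web API and loading it into an R data frame.

(2) Tidy (tidyr and tibble)

Tidying means storing the data in a consistent form that matches the semantics of the dataset it belongs to. In brief, when data is tidy, each column is a variable and each row is an observation.

(3) Transform (dplyr)

Transformation includes narrowing in on observations of interest (such as all people living in one city, or all data from the last year), creating new variables from existing ones, and calculating summary statistics (such as means or counts). Tidying and transforming together are called data wrangling.

(4) Visualize

Visualization is fundamentally a human activity; it prompts new questions about the data.

(5) Model

Once a question has been made sufficiently precise, you can use a model to answer it. A model is fundamentally a mathematical or computational tool. Every model makes assumptions, and a model cannot question its own assumptions.

(6) Communicate

(7) Program

Hypothesis confirmation

Data analysis can be divided into two camps: hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis). The focus of the book is unabashedly on hypothesis generation, that is, data exploration. People often think of modeling as a tool for hypothesis confirmation and of visualization as a tool for hypothesis generation. This simple dichotomy is false: models are often used for exploration, and with a little care visualization can be used for confirmation. The key difference is how often you look at each observation: if only once, it is confirmation; if more than once, it is exploration.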

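Part I: Explore

Chapter 1: Data visualization with ggplot2

Plotting functions:

geom_point(): scatterplot

facet_wrap(): facet by a single variable

facet_grid(): facet by a combination of variables

geom_smooth(): smooth line fitted to the data

geom_bar(): bar chart

geom_boxplot(): boxplot

coord_flip(): swap the x and y axes

coord_polar(): use polar coordinates

Aesthetics:

color = class: color the points by class (in the mpg dataset)

alpha = class: map transparency to class

shape: shape of the points

linetype: line type

se = FALSE: draw the smooth line without the confidence band around it

..prop..: proportion

fill: fill color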
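As a minimal sketch of how these aesthetics combine, the example below maps `class` to color, sets one fixed transparency for all points, and draws smooth lines without the confidence band. Mapping `drv` to `linetype` and the alpha value of 0.6 are illustrative assumptions, not choices made in the notes.

```r
library(ggplot2)

# Aesthetics inside aes() are mapped from data; arguments outside aes()
# are fixed values applied to the whole layer.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class), alpha = 0.6) +
  # One smooth line per drive type (assumed grouping), no confidence band.
  geom_smooth(mapping = aes(linetype = drv), se = FALSE)

print(p)
```

Placing `alpha` outside `aes()` keeps it constant; placing it inside `aes()` would map it to a variable instead.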

The mpg data frame

mpg

## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows

displ: engine displacement, in liters.

hwy: fuel efficiency on the highway, in miles per gallon.

Creating a ggplot

To plot mpg, run the following code, which puts displ on the x-axis and hwy on the y-axis.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

Plot template

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Aesthetic mappings

An aesthetic is a visual property of the objects in a plot, such as the size, shape, or color of the points.

(1) Map color to class

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Facets

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

`geom_smooth()` using method = 'loess' and formula 'y ~ x'

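Stacked colors

The stacking is performed automatically by the position adjustment specified by the position argument. If you do not want a stacked bar chart, you can use one of the following three options:

position = "identity" places each object exactly where it falls in the context of the graph, so the bars overlap. To see the overlap, make the bars slightly transparent or remove the fill:

ggplot(
  data = diamonds,
  mapping = aes(x = cut, fill = clarity)
) +
  geom_bar(alpha = 1/5, position = "identity")

ggplot(
  data = diamonds,
  mapping = aes(x = cut, color = clarity)
) +
  geom_bar(fill = NA, position = "identity")

position = "fill" works like stacking, but makes each set of stacked bars the same height, which makes it easy to compare proportions across groups:

ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    position = "fill"
  )

position = "dodge" places the bars of each group side by side:

ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    position = "dodge"
  )

Some points overlap because the values fall on a grid. position = "jitter" adds a small amount of random noise to each point, which spreads the points out and breaks up the grid pattern.

ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = "jitter"
  )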
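A brief aside on jittering: ggplot2 also provides `geom_jitter()`, a shorthand for `geom_point(position = "jitter")`. The sketch below builds both forms; the `width` and `height` values in the last plot are arbitrary illustrative choices, not recommendations.

```r
library(ggplot2)

# Explicit position adjustment on geom_point().
p1 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

# Shorthand that applies the same adjustment.
p2 <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))

# The amount of noise can be set per axis (illustrative values).
p3 <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy), width = 0.05, height = 0.5)

print(p3)
```

Because the noise is random, the picture differs slightly on every run; jittering trades a little accuracy for a clearer view of where the data is dense.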

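Polar coordinates: coord_polar()

bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()

bar + coord_polar()

The layered grammar of graphics

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

Any plot can be described precisely as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.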
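To make the seven components concrete, here is a small sketch that fills in every slot of the layered template with the diamonds data. The particular choices (`"dodge"`, `coord_flip()`, faceting by `color`) are assumptions made for illustration only.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +                  # dataset
  geom_bar(                                     # geom
    mapping = aes(x = cut, fill = clarity),     # mappings
    stat = "count",                             # stat (the default for geom_bar)
    position = "dodge"                          # position adjustment
  ) +
  coord_flip() +                                # coordinate system
  facet_wrap(~ color, nrow = 2)                 # faceting scheme

print(p)
```

In everyday use most of these slots are left at their defaults; spelling them out is mainly useful for seeing what ggplot2 is doing on your behalf.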

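Chapter 2: Workflow: basics

Functions

function_name(arg1 = val1, arg2 = val2, ...)

seq(): generates regular sequences of numbers

x %in% y: selects every row where x is one of the values in y

is.na(): tests whether a value is missing

year:day: selects all columns from year to day (inclusive)

-(year:day): selects all columns except those from year to day (year and day themselves are also excluded)

starts_with("abc"): matches names that begin with "abc"

ends_with("xyz"): matches names that end with "xyz"

contains("ijk"): matches names that contain "ijk"

matches("(.)\\1"): selects variables whose names match a regular expression. This one matches any name that contains a repeated character.

num_range("x", 1:3): matches x1, x2, and x3

na.rm = TRUE: removes missing values before the computation, for example sum(AB, na.rm = TRUE)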
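The helpers listed above can be tried on a tiny example. This is a sketch only: the one-row tibble `df` and its column names are made up for illustration (they mimic a few flights columns plus hypothetical `x1` to `x3`).

```r
library(dplyr)

s        <- seq(1, 10, by = 3)               # regular sequence: 1 4 7 10
in_set   <- c(1, 5, 12) %in% c(11, 12)       # FALSE FALSE TRUE
missing  <- is.na(c(1, NA, 3))               # FALSE TRUE FALSE
total    <- sum(c(1, NA, 3), na.rm = TRUE)   # NA dropped before summing: 4

# Hypothetical one-row tibble used only to show column selection.
df <- tibble(
  year = 2013, month = 1, day = 1,
  dep_delay = 2, arr_delay = 11,
  x1 = 0, x2 = 0, x3 = 0
)

range_cols    <- names(select(df, year:day))             # year, month, day
dropped_cols  <- names(select(df, -(year:day)))          # everything else
delay_cols    <- names(select(df, ends_with("delay")))   # dep_delay, arr_delay
numbered_cols <- names(select(df, num_range("x", 1:3)))  # x1, x2, x3
repeat_cols   <- names(select(df, matches("(.)\\1")))    # names with a doubled character
```

Here `repeat_cols` contains only `arr_delay`, because it is the only name with the same character twice in a row ("rr").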

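Common summary functions

mean(x): the mean of x

median(x): the median of x

sd(x): the standard deviation, the standard measure of spread

IQR(x): the interquartile range

mad(x): the median absolute deviation

Measures of rank

min(x)

quantile(x, 0.25): the value of x that is greater than 25% of the values and less than the remaining 75%

max(x)

first(x), equivalent to x[1]

nth(x, 2), equivalent to x[2]

last(x), equivalent to x[length(x)]

sum(!is.na(x)): counts the non-missing values

n_distinct(x): counts the distinct (unique) values

wt = distance: in count(), weights each row by distance, so the result is the sum of distance per group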
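A quick way to get a feel for these summary functions is to run them on a short vector. The values below and the small `trips` tibble are invented for illustration.

```r
library(dplyr)

x <- c(2, 4, 4, 5, 7, 9)

avg      <- mean(x)            # 31 / 6
mid      <- median(x)          # 4.5
spread   <- sd(x)              # standard deviation
iqr      <- IQR(x)             # 6.5 - 4 = 2.5
robust   <- mad(x)             # median absolute deviation (scaled by 1.4826)
q25      <- quantile(x, 0.25)  # 4
n_unique <- n_distinct(x)      # 5 distinct values

# Position helpers from dplyr.
pos <- c(first(x), nth(x, 2), last(x))  # 2 4 9

# count() with wt sums the weight column per group instead of counting rows.
trips <- tibble(
  tailnum  = c("A", "A", "B"),   # hypothetical tail numbers
  distance = c(100, 200, 50)
)
weighted <- count(trips, tailnum, wt = distance)  # A: 300, B: 50
```

`IQR()` and `mad()` are less sensitive to outliers than `sd()`, which is why they are useful alongside it.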

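Keyboard shortcuts

Alt + -: inserts the assignment operator <-

Tab: after choosing a function from the completion list, press Tab again and RStudio adds the parentheses () automatically

Ctrl + Shift + N: create a new script

Ctrl + Enter: run the current line or selection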

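Variable types

int: integer

dbl: double-precision floating-point number, or real number

chr: character vector, or string

dttm: date-time (a date plus a time)

lgl: logical

fctr: factor, which R uses to represent categorical variables with a fixed set of possible values

date: date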
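As a small illustration of these abbreviations, the sketch below builds a tibble with one column of each type and prints it; the column names and values are made up. Recent versions of tibble print factors as `<fct>` rather than `<fctr>`.

```r
library(tibble)

df <- tibble(
  n     = 1:3,                                    # int
  x     = c(1.5, 2.5, 3.5),                       # dbl
  name  = c("a", "b", "c"),                       # chr
  flag  = c(TRUE, FALSE, NA),                     # lgl
  grade = factor(c("low", "high", "low")),        # factor
  day   = as.Date("2019-02-02") + 0:2,            # date
  when  = as.POSIXct("2019-02-02 12:00:00", tz = "UTC") + 3600 * 0:2  # dttm
)

print(df)  # the row under the column names shows the type abbreviations

# The underlying R class of each column.
types <- vapply(df, function(col) class(col)[1], character(1))
```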

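Operators

Arithmetic operators are vectorized and follow the so-called "recycling rules"

%/%: integer division

%%: remainder

log(), log2(), log10(): logarithms

lead() and lag() return the leading and lagging values of a sequence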
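The integer-division and remainder operators are handy for splitting clock times stored as numbers such as 517 (meaning 5:17). The sketch below uses a few made-up departure times and also shows recycling and the logarithm functions.

```r
# Hypothetical departure times in HHMM form.
dep_time <- c(517, 1004, 2359)

hour   <- dep_time %/% 100   # integer division: 5 10 23
minute <- dep_time %% 100    # remainder:        17 4 59

# Minutes since midnight, computed element by element (vectorized).
since_midnight <- hour * 60 + minute   # 317 604 1439

# Recycling: the shorter vector is repeated to match the longer one.
recycled <- c(1, 2, 3, 4) * c(10, 100)   # 10 200 30 400

logs <- c(log2(8), log10(1000))          # 3 3
```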

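dplyr basics

The five core dplyr functions

filter(): pick observations by their values (selects rows; takes logical expressions as arguments)

arrange(): reorder the rows

mutate(): create new variables as functions of existing variables

summarize(): collapse many values down to a single summary

select(): pick columns

cumsum(): cumulative sum

cumprod(): cumulative product

cummin(): cumulative minimum

cummax(): cumulative maximum

cummean(): cumulative mean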
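The cumulative functions are easiest to understand on a short vector. In this sketch the input values are arbitrary; `cumsum()`, `cumprod()`, `cummin()`, and `cummax()` come from base R, while `cummean()` is provided by dplyr.

```r
library(dplyr)

x <- c(3, 1, 4, 1, 5)

running_sum  <- cumsum(x)    # 3 4 8 9 14
running_prod <- cumprod(x)   # 3 3 12 12 60
running_min  <- cummin(x)    # 3 1 1 1 1
running_max  <- cummax(x)    # 3 3 4 4 5
running_mean <- cummean(x)   # running_sum divided by 1, 2, 3, ...
```

Each result has the same length as the input, which is why these functions fit naturally inside `mutate()`.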

filter(): filter rows

filter(flights, month == 1, day == 1)

## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

%in%

(nov_dec <- filter(flights, month %in% c(11, 12)))

## # A tibble: 55,403 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    11     1        5           2359         6      352
##  2  2013    11     1       35           2250       105      123
##  3  2013    11     1      455            500        -5      641
##  4  2013    11     1      539            545        -6      856
##  5  2013    11     1      542            545        -3      831
##  6  2013    11     1      549            600       -11      912
##  7  2013    11     1      550            600       -10      705
##  8  2013    11     1      554            600        -6      659
##  9  2013    11     1      554            600        -6      826
## 10  2013    11     1      554            600        -6      749
## # ... with 55,393 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

filter() only keeps rows where the condition is TRUE; it excludes rows where the condition is FALSE or NA.

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)

## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3

filter(df, is.na(x) | x > 1)

## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3

arrange(): arrange rows

arrange(flights, year, month, day)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Note: arrange() sorts in ascending order by default.

Use desc() to sort by a column in descending order:

arrange(flights, desc(arr_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     7    22     2257            759       898      121
##  9  2013    12     5      756           1700       896     1058
## 10  2013     5     3     1133           2055       878     1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Missing values are always sorted to the end, in both ascending and descending order.

df <- tibble(x = c(5, 2, NA))
arrange(df, x)

## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3    NA

arrange(df, desc(x))

## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA

Selecting columns with select()

select(flights, year, month, day)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

select(flights, year:day)

## # A tibble: 336,776 x 3
## # ... with 336,766 more rows

Select all columns except those from year to day (year and day themselves are also excluded):

select(flights, -(year:day))

## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515         2      830            819        11
##  2      533            529         4      850            830        20
##  3      542            540         2      923            850        33
##  4      544            545        -1     1004           1022       -18
##  5      554            600        -6      812            837       -25
##  6      554            558        -4      740            728        12
##  7      555            600        -5      913            854        19
##  8      557            600        -3      709            723       -14
##  9      557            600        -3      838            846        -8
## 10      558            600        -2      753            745         8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Move a few variables to the start of the data frame:

select(flights, time_hour, air_time, everything())

## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

Adding new variables with mutate()

flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

## # A tibble: 336,776 x 9
##     year month   day dep_delay arr_delay distance air_time  gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227     9  370.
##  2  2013     1     1         4        20     1416      227    16  374.
##  3  2013     1     1         2        33     1089      160    31  408.
##  4  2013     1     1        -1       -18     1576      183   -17  517.
##  5  2013     1     1        -6       -25      762      116   -19  394.
##  6  2013     1     1        -4        12      719      150    16  288.
##  7  2013     1     1        -5        19     1065      158    24  404.
##  8  2013     1     1        -3       -14      229       53   -11  259.
##  9  2013     1     1        -3        -8      944      140    -5  405.
## 10  2013     1     1        -2         8      733      138    10  319.
## # ... with 336,766 more rows

Once created, a new column can be used immediately:

mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

## # A tibble: 336,776 x 10
##     year month   day dep_delay arr_delay distance air_time  gain hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227     9 3.78
##  2  2013     1     1         4        20     1416      227    16 3.78
##  3  2013     1     1         2        33     1089      160    31 2.67
##  4  2013     1     1        -1       -18     1576      183   -17 3.05
##  5  2013     1     1        -6       -25      762      116   -19 1.93
##  6  2013     1     1        -4        12      719      150    16 2.5
##  7  2013     1     1        -5        19     1065      158    24 2.63
##  8  2013     1     1        -3       -14      229       53   -11 0.883
##  9  2013     1     1        -3        -8      944      140    -5 2.33
## 10  2013     1     1        -2         8      733      138    10 2.3
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

If you only want to keep the new variables, use transmute():

transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

## # A tibble: 336,776 x 3
##     gain hours gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1     9 3.78           2.38
##  2    16 3.78           4.23
##  3    31 2.67          11.6
##  4   -17 3.05          -5.57
##  5   -19 1.93          -9.83
##  6    16 2.5            6.4
##  7    24 2.63           9.11
##  8   -11 0.883        -12.5
##  9    -5 2.33          -2.14
## 10    10 2.3            4.35
## # ... with 336,766 more rows

lead() and lag() return the leading and lagging values of a sequence:

(x <- 1:10)

##  [1]  1  2  3  4  5  6  7  8  9 10

lag(x)

##  [1] NA  1  2  3  4  5  6  7  8  9

lead(x)

##  [1]  2  3  4  5  6  7  8  9 10 NA

Grouped summaries with summarize(). On its own, summarize() collapses a data frame to a single row:

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

group_by() changes the unit of analysis from the complete dataset to individual groups. When you then use dplyr functions on the grouped data frame, they are automatically applied to each group.

by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5
##  2  2013     1     2 13.9
##  3  2013     1     3 11.0
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

The pipe %>%

The pipe %>% is best read as "then".

x %>% f(y) turns into f(x, y)

x %>% f(y) %>% g(z) turns into g(f(x, y), z)
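The two rewriting rules can be checked directly. In this sketch, `round()` plays the role of f and `paste()` the role of g; the input numbers and the "units" label are arbitrary.

```r
library(dplyr)  # provides %>% (re-exported from magrittr)

x <- c(1.234, 5.678)

# x %>% f(y) is the same as f(x, y).
piped_once  <- x %>% round(1)
nested_once <- round(x, 1)

# x %>% f(y) %>% g(z) is the same as g(f(x, y), z).
piped_twice  <- x %>% round(1) %>% paste("units")
nested_twice <- paste(round(x, 1), "units")
```

The piped form reads left to right in the order the steps happen, which is the main reason to prefer it for longer chains.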

Missing values

flights %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # ... with 355 more rows

This produces a lot of missing values because aggregation functions obey the usual rule for missing values: if there is any missing value in the input, the output is missing too.
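The rule is easy to see on a three-element vector. This is a minimal base R sketch with made-up values.

```r
delays <- c(2, NA, 4)

with_na    <- mean(delays)                 # NA: one unknown input makes the result unknown
without_na <- mean(delays, na.rm = TRUE)   # 3: the NA is dropped first

# The same contagion applies to arithmetic and comparisons.
na_plus    <- NA + 1    # NA
na_compare <- NA > 1    # NA
```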

flights %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay, na.rm = TRUE))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5
##  2  2013     1     2 13.9
##  3  2013     1     3 11.0
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

Here the missing values represent cancelled flights, so we can also solve the problem by removing the cancelled flights first.

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))  # the key step

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.4
##  2  2013     1     2 13.7
##  3  2013     1     3 10.9
##  4  2013     1     4  8.97
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.56
##  9  2013     1     9  2.30
## 10  2013     1    10  2.84
## # ... with 355 more rows

Measures of location

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    avg_delay1 = mean(arr_delay),
    # arr_delay[arr_delay > 0] keeps only the elements of arr_delay greater than 0
    avg_delay2 = mean(arr_delay[arr_delay > 0])
  )

## # A tibble: 365 x 5
## # Groups:   year, month [?]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # ... with 355 more rows

not_cancelled %>%
  count(tailnum, wt = distance)

## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows

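Counts and proportions of logical values

sum(x > 10)

When a logical vector is used in a numeric function, TRUE is converted to 1 and FALSE to 0, so sum(x > 10) counts the values of x greater than 10 and mean(x > 10) gives their proportion.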
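A tiny worked example, using an invented vector, shows the count and the proportion side by side.

```r
x <- c(3, 12, 25, 7, 10)

is_big   <- x > 10        # FALSE TRUE TRUE FALSE FALSE
n_big    <- sum(is_big)   # TRUE counts as 1: 2 values exceed 10
prop_big <- mean(is_big)  # 2 of 5 values: 0.4
```

Note that 10 itself is not counted, because the comparison is strict.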

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # ... with 355 more rows

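Chapter 5: Exploratory data analysis

This chapter covers how to use visualization and transformation to explore data systematically, a task that statisticians call exploratory data analysis (EDA). EDA is an iterative cycle.

EDA tools: visualization, transformation, and modeling.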
