主要用到dplyr和tidyr
數(shù)據(jù)處理和分析以及數(shù)據(jù)的理解占據(jù)一個數(shù)據(jù)科學(xué)項目大量的時間。這一部分工作非常繁瑣,但是這一個步驟又是非常有必要以及重要的。
恐怖大師--斯蒂芬·金寫的一本寫作指南《寫作這回事》里面提到了作家的一個工具箱,好的作家需要有一個工具箱,里面放著自己非常順手的工具。
好的數(shù)據(jù)科學(xué)家和好的作家一樣,需要有自己的熟練的工具,市面上的任何工具都可以,只要你能很好的使用它。吹毛求疵的人希望制造自己的工具,當(dāng)然,有能力有資源這樣做是無傷大雅。
但是,一般而言,過于吹毛求疵只會得不償失。
我這里講一下我用的某些工具。
tidyr. packages
數(shù)據(jù)重塑
1. 將寬數(shù)據(jù)轉(zhuǎn)換成為短數(shù)據(jù)
> table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
> gather(table4a, `1999`, `2000`, key = "year", value = "cases")
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
2. 將長數(shù)據(jù)轉(zhuǎn)換成為寬數(shù)據(jù)
table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
> spread(table2, type, count)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
3. 分割某一列
> table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
> separate(table3, rate, into = c("cases", "pop"))
# A tibble: 6 x 4
country year cases pop
* <chr> <int> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
4. 分割某一列然后變成多行
> table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
> separate_rows(table3, rate)
# A tibble: 12 x 3
country year rate
<chr> <int> <chr>
1 Afghanistan 1999 745
2 Afghanistan 1999 19987071
3 Afghanistan 2000 2666
4 Afghanistan 2000 20595360
5 Brazil 1999 37737
6 Brazil 1999 172006362
7 Brazil 2000 80488
8 Brazil 2000 174504898
9 China 1999 212258
10 China 1999 1272915272
11 China 2000 213766
12 China 2000 1280428583
5. 合并兩列
> table5
# A tibble: 6 x 4
country century year rate
* <chr> <chr> <chr> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
> unite(table5, century, year, col = "year", sep = "")
# A tibble: 6 x 3
country year rate
<chr> <chr> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
處理缺失值
drop_nafillreplace_na
擴(kuò)充數(shù)據(jù)
> head(mtcars)
mpg cyl disp hp drat wt
Mazda RX4 21.0 6 160 110 3.90 2.620
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
Datsun 710 22.8 4 108 93 3.85 2.320
Hornet 4 Drive 21.4 6 258 110 3.08 3.215
Hornet Sportabout 18.7 8 360 175 3.15 3.440
Valiant 18.1 6 225 105 2.76 3.460
qsec vs am gear carb
Mazda RX4 16.46 0 1 4 4
Mazda RX4 Wag 17.02 0 1 4 4
Datsun 710 18.61 1 1 4 1
Hornet 4 Drive 19.44 1 0 3 1
Hornet Sportabout 17.02 0 0 3 2
Valiant 20.22 1 0 3 1
> complete(mtcars, cyl, gear, carb)
# A tibble: 74 x 11
cyl gear carb mpg disp hp drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 3 1 21.5 120. 97 3.7
2 4 3 2 NA NA NA NA
3 4 3 3 NA NA NA NA
4 4 3 4 NA NA NA NA
5 4 3 6 NA NA NA NA
6 4 3 8 NA NA NA NA
7 4 4 1 22.8 108 93 3.85
8 4 4 1 32.4 78.7 66 4.08
9 4 4 1 33.9 71.1 65 4.22
10 4 4 1 27.3 79 66 4.08
# ... with 64 more rows, and 4 more variables:
# wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>
只擴(kuò)充選中好的列
> head(mtcars)
mpg cyl disp hp drat wt
Mazda RX4 21.0 6 160 110 3.90 2.620
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
Datsun 710 22.8 4 108 93 3.85 2.320
Hornet 4 Drive 21.4 6 258 110 3.08 3.215
Hornet Sportabout 18.7 8 360 175 3.15 3.440
Valiant 18.1 6 225 105 2.76 3.460
qsec vs am gear carb
Mazda RX4 16.46 0 1 4 4
Mazda RX4 Wag 17.02 0 1 4 4
Datsun 710 18.61 1 1 4 1
Hornet 4 Drive 19.44 1 0 3 1
Hornet Sportabout 17.02 0 0 3 2
Valiant 20.22 1 0 3 1
> expand(mtcars, cyl, gear, carb)
# A tibble: 54 x 3
cyl gear carb
<dbl> <dbl> <dbl>
1 4 3 1
2 4 3 2
3 4 3 3
4 4 3 4
5 4 3 6
6 4 3 8
7 4 4 1
8 4 4 2
9 4 4 3
10 4 4 4
# ... with 44 more rows
dplyr.packages
1. group by
head(iris)
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
Petal.Width Species
1 0.2 setosa
2 0.2 setosa
3 0.2 setosa
4 0.2 setosa
5 0.2 setosa
6 0.4 setosa
> group_by(iris, Species)
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length
<dbl> <dbl> <dbl>
1 5.1 3.5 1.4
2 4.9 3 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5 3.6 1.4
6 5.4 3.9 1.7
7 4.6 3.4 1.4
8 5 3.4 1.5
9 4.4 2.9 1.4
10 4.9 3.1 1.5
# ... with 140 more rows, and 2 more variables:
# Petal.Width <dbl>, Species <fct>
配合summarise進(jìn)行一系列聚合操作,比如group by 取平均值.
group_by(iris, Species) %>% summarise(MEAN=mean(Sepal.Length))
# A tibble: 3 x 2
Species MEAN
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
2.篩選數(shù)據(jù)的行
1.filter
> starwars
# A tibble: 87 x 13
name height mass hair_color skin_color
<chr> <int> <dbl> <chr> <chr>
1 Luke… 172 77 blond fair
2 C-3PO 167 75 NA gold
3 R2-D2 96 32 NA white, bl…
4 Dart… 202 136 none white
5 Leia… 150 49 brown light
6 Owen… 178 120 brown, gr… light
7 Beru… 165 75 brown light
8 R5-D4 97 32 NA white, red
9 Bigg… 183 84 black light
10 Obi-… 182 77 auburn, w… fair
# ... with 77 more rows, and 8 more variables:
# eye_color <chr>, birth_year <dbl>,
# gender <chr>, homeworld <chr>,
# species <chr>, films <list>,
# vehicles <list>, starships <list>
> filter(starwars, mass > 1000)
# A tibble: 1 x 13
name height mass hair_color skin_color
<chr> <int> <dbl> <chr> <chr>
1 Jabb… 175 1358 NA green-tan…
# ... with 8 more variables: eye_color <chr>,
# birth_year <dbl>, gender <chr>,
# homeworld <chr>, species <chr>,
# films <list>, vehicles <list>,
# starships <list>
- 去重
distinct(.data, ..., .keep_all = FALSE)可以選擇根據(jù)某一行去重 - 抽樣
sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame()) - 通過具體的行數(shù)進(jìn)行選擇變量
slice(.data, ...)
排序
arrange(mtcars, mpg)
降序排序
arrange(mtcars, desc(mpg))
添加一行
add_row(faithful, eruptions = 1, waiting = 1)
對列進(jìn)行操作
選擇某些列
select()生成一個新列
mutate(mtcars, gpm = 1/mpg)生成一個新列,去除老的列
transmute(mtcars, gpm = 1/mpg)
4.對所有的行進(jìn)行轉(zhuǎn)換mutate_if(iris, is.numeric, funs(log(.)))
連接
left_joinright_joininner_joinfull_join
合并數(shù)據(jù)
bind_rows
集合運(yùn)算
- 交集
intersect - 補(bǔ)集
setdiff - 并集
union
還有很多細(xì)節(jié),太繁瑣沒有寫下來。rstudio官網(wǎng)里面有很多學(xué)習(xí)資料。鏈接如下
https://www.rstudio.com/resources/cheatsheets/