數(shù)據(jù)清理以及轉(zhuǎn)換

主要用到dplyrtidyr

數(shù)據(jù)處理和分析以及數(shù)據(jù)的理解占據(jù)一個數(shù)據(jù)科學(xué)項目大量的時間。這一部分工作非常繁瑣,但是這一個步驟又是非常有必要以及重要的。

恐怖大師--斯蒂芬·金寫的一本寫作指南《寫作這回事》里面提到了作家的一個工具箱,好的作家需要有一個工具箱,里面放著自己非常順手的工具。

好的數(shù)據(jù)科學(xué)家和好的作家一樣,需要有自己的熟練的工具,市面上的任何工具都可以,只要你能很好的使用它。吹毛求疵的人希望制造自己的工具,當(dāng)然,有能力有資源這樣做是無傷大雅。

但是,一般而言,過于吹毛求疵只會得不償失。

我這里講一下我用的某些工具。

tidyr. packages

數(shù)據(jù)重塑

1. 將寬數(shù)據(jù)轉(zhuǎn)換成為短數(shù)據(jù)

> table4a
# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
> gather(table4a, `1999`, `2000`, key = "year", value = "cases")
# A tibble: 6 x 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Brazil      1999   37737
3 China       1999  212258
4 Afghanistan 2000    2666
5 Brazil      2000   80488
6 China       2000  213766

2. 將長數(shù)據(jù)轉(zhuǎn)換成為寬數(shù)據(jù)

table2
# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
> spread(table2, type, count)
# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

3. 分割某一列

> table3
# A tibble: 6 x 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
> separate(table3, rate, into = c("cases", "pop"))
# A tibble: 6 x 4
  country      year cases  pop       
* <chr>       <int> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

4. 分割某一列然后變成多行

> table3
# A tibble: 6 x 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
> separate_rows(table3, rate)
# A tibble: 12 x 3
   country      year rate      
   <chr>       <int> <chr>     
 1 Afghanistan  1999 745       
 2 Afghanistan  1999 19987071  
 3 Afghanistan  2000 2666      
 4 Afghanistan  2000 20595360  
 5 Brazil       1999 37737     
 6 Brazil       1999 172006362 
 7 Brazil       2000 80488     
 8 Brazil       2000 174504898 
 9 China        1999 212258    
10 China        1999 1272915272
11 China        2000 213766    
12 China        2000 1280428583

5. 合并兩列

> table5
# A tibble: 6 x 4
  country     century year  rate             
* <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583
> unite(table5, century, year, col = "year", sep = "")
# A tibble: 6 x 3
  country     year  rate             
  <chr>       <chr> <chr>            
1 Afghanistan 1999  745/19987071     
2 Afghanistan 2000  2666/20595360    
3 Brazil      1999  37737/172006362  
4 Brazil      2000  80488/174504898  
5 China       1999  212258/1272915272
6 China       2000  213766/1280428583

處理缺失值

  1. drop_na
  2. fill
  3. replace_na

擴(kuò)充數(shù)據(jù)

> head(mtcars)
                   mpg cyl disp  hp drat    wt
Mazda RX4         21.0   6  160 110 3.90 2.620
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875
Datsun 710        22.8   4  108  93 3.85 2.320
Hornet 4 Drive    21.4   6  258 110 3.08 3.215
Hornet Sportabout 18.7   8  360 175 3.15 3.440
Valiant           18.1   6  225 105 2.76 3.460
                   qsec vs am gear carb
Mazda RX4         16.46  0  1    4    4
Mazda RX4 Wag     17.02  0  1    4    4
Datsun 710        18.61  1  1    4    1
Hornet 4 Drive    19.44  1  0    3    1
Hornet Sportabout 17.02  0  0    3    2
Valiant           20.22  1  0    3    1
> complete(mtcars, cyl, gear, carb)
# A tibble: 74 x 11
     cyl  gear  carb   mpg  disp    hp  drat
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     4     3     1  21.5 120.     97  3.7 
 2     4     3     2  NA    NA      NA NA   
 3     4     3     3  NA    NA      NA NA   
 4     4     3     4  NA    NA      NA NA   
 5     4     3     6  NA    NA      NA NA   
 6     4     3     8  NA    NA      NA NA   
 7     4     4     1  22.8 108      93  3.85
 8     4     4     1  32.4  78.7    66  4.08
 9     4     4     1  33.9  71.1    65  4.22
10     4     4     1  27.3  79      66  4.08
# ... with 64 more rows, and 4 more variables:
#   wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>

只擴(kuò)充選中好的列

> head(mtcars)
                   mpg cyl disp  hp drat    wt
Mazda RX4         21.0   6  160 110 3.90 2.620
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875
Datsun 710        22.8   4  108  93 3.85 2.320
Hornet 4 Drive    21.4   6  258 110 3.08 3.215
Hornet Sportabout 18.7   8  360 175 3.15 3.440
Valiant           18.1   6  225 105 2.76 3.460
                   qsec vs am gear carb
Mazda RX4         16.46  0  1    4    4
Mazda RX4 Wag     17.02  0  1    4    4
Datsun 710        18.61  1  1    4    1
Hornet 4 Drive    19.44  1  0    3    1
Hornet Sportabout 17.02  0  0    3    2
Valiant           20.22  1  0    3    1
> expand(mtcars, cyl, gear, carb)
# A tibble: 54 x 3
     cyl  gear  carb
   <dbl> <dbl> <dbl>
 1     4     3     1
 2     4     3     2
 3     4     3     3
 4     4     3     4
 5     4     3     6
 6     4     3     8
 7     4     4     1
 8     4     4     2
 9     4     4     3
10     4     4     4
# ... with 44 more rows

dplyr.packages

1. group by

 head(iris)
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
6          5.4         3.9          1.7
  Petal.Width Species
1         0.2  setosa
2         0.2  setosa
3         0.2  setosa
4         0.2  setosa
5         0.2  setosa
6         0.4  setosa
> group_by(iris, Species)
# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length
          <dbl>       <dbl>        <dbl>
 1          5.1         3.5          1.4
 2          4.9         3            1.4
 3          4.7         3.2          1.3
 4          4.6         3.1          1.5
 5          5           3.6          1.4
 6          5.4         3.9          1.7
 7          4.6         3.4          1.4
 8          5           3.4          1.5
 9          4.4         2.9          1.4
10          4.9         3.1          1.5
# ... with 140 more rows, and 2 more variables:
#   Petal.Width <dbl>, Species <fct>

配合summarise進(jìn)行一系列聚合操作,比如group by 取平均值.

group_by(iris, Species) %>% summarise(MEAN=mean(Sepal.Length))
# A tibble: 3 x 2
  Species     MEAN
  <fct>      <dbl>
1 setosa      5.01
2 versicolor  5.94
3 virginica   6.59

2.篩選數(shù)據(jù)的行

1.filter

> starwars
# A tibble: 87 x 13
   name  height  mass hair_color skin_color
   <chr>  <int> <dbl> <chr>      <chr>     
 1 Luke…    172    77 blond      fair      
 2 C-3PO    167    75 NA         gold      
 3 R2-D2     96    32 NA         white, bl…
 4 Dart…    202   136 none       white     
 5 Leia…    150    49 brown      light     
 6 Owen…    178   120 brown, gr… light     
 7 Beru…    165    75 brown      light     
 8 R5-D4     97    32 NA         white, red
 9 Bigg…    183    84 black      light     
10 Obi-…    182    77 auburn, w… fair      
# ... with 77 more rows, and 8 more variables:
#   eye_color <chr>, birth_year <dbl>,
#   gender <chr>, homeworld <chr>,
#   species <chr>, films <list>,
#   vehicles <list>, starships <list>
> filter(starwars, mass > 1000)
# A tibble: 1 x 13
  name  height  mass hair_color skin_color
  <chr>  <int> <dbl> <chr>      <chr>     
1 Jabb…    175  1358 NA         green-tan…
# ... with 8 more variables: eye_color <chr>,
#   birth_year <dbl>, gender <chr>,
#   homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>,
#   starships <list>
  1. 去重distinct(.data, ..., .keep_all = FALSE)可以選擇根據(jù)某一行去重
  2. 抽樣sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame())
  3. 通過具體的行數(shù)進(jìn)行選擇變量slice(.data, ...)

排序

arrange(mtcars, mpg)

降序排序

arrange(mtcars, desc(mpg))

添加一行

add_row(faithful, eruptions = 1, waiting = 1)

對列進(jìn)行操作

  1. 選擇某些列select()

  2. 生成一個新列mutate(mtcars, gpm = 1/mpg)

  3. 生成一個新列,去除老的列transmute(mtcars, gpm = 1/mpg)

4.對所有的行進(jìn)行轉(zhuǎn)換mutate_if(iris, is.numeric, funs(log(.)))

連接

  1. left_join

  2. right_join

  3. inner_join

  4. full_join

合并數(shù)據(jù)

bind_rows

集合運(yùn)算

  1. 交集intersect
  2. 補(bǔ)集 setdiff
  3. 并集 union

還有很多細(xì)節(jié),太繁瑣沒有寫下來。rstudio官網(wǎng)里面有很多學(xué)習(xí)資料。鏈接如下
https://www.rstudio.com/resources/cheatsheets/

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容