1. Intro

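1.1. Asking effective questions

A good question has three parts: the required packages, the data, and the code.
Update your packages first, for example with tidyverse_update().
dput(mtcars) prints R code that recreates the data, so that others can reproduce your problem.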
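A minimal sketch of the dput() round trip, using only base R. The object names small, code_text and rebuilt are illustrative and do not come from the notes.

```r
# A small slice of a built-in data set stands in for "your data"
small <- head(mtcars[, c("mpg", "cyl")], 3)

# dput() prints R code; capture it as text, as you would paste it into a question
code_text <- paste(capture.output(dput(small)), collapse = "\n")

# Anyone can rebuild the same object by evaluating that text
rebuilt <- eval(parse(text = code_text))
```

Pasting the dput() output into a question therefore gives readers the data without any file attachment.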

2. Data visualization

2.1. intro

Template

[figure]
Example

[figure]

3. workflow

3.1. Basics

seq(1, 10, length.out = 5)
For variable names, use underscores to separate words (snake_case).

[figure]

4. dplyr

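4.1. filter: filtering rows by condition

filter() keeps the rows that satisfy the given conditions. Only rows where the condition is TRUE are included; rows where it is FALSE or NA are dropped.
To keep the NA rows as well, ask for them explicitly: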
filter(df, is.na(x) | x > 1)
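A small sketch of how filter() treats NA. The tibble df and its values are invented for illustration.

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3))

# x > 1 is FALSE for 1 and NA for the missing value, so both rows are dropped
only_true <- filter(df, x > 1)

# Adding is.na(x) to the condition keeps the NA row as well
with_na <- filter(df, is.na(x) | x > 1)
```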

4.2. arrange: sorting rows

code

arrange(flights, desc(dep_delay))

NA values are always sorted to the end.
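A quick check of the NA behaviour on an invented three-row tibble: the missing value ends up last in both sort directions.

```r
library(dplyr)

df <- tibble(x = c(5, NA, 2))

ascending <- arrange(df, x)$x         # 2, 5, NA
descending <- arrange(df, desc(x))$x  # 5, 2, NA
```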

4.3. select: selecting columns

code

# Select all columns from year through day
select(flights, year:day)
# Drop all columns from year through day
select(flights, -(year:day))

Helper functions
starts_with("abc") matches names that begin with “abc”.
ends_with("xyz") matches names that end with “xyz”.
contains("ijk") matches names that contain “ijk”.
matches("(.)\\1") selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range("x", 1:3) matches x1, x2 and x3.
Renaming
rename(flights, tail_num = tailnum)
everything() is useful for reordering columns, for example to move a few columns to the far left
select(flights, time_hour, air_time, everything())
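The helpers can be tried without the flights data. The sketch below uses an invented one-row tibble whose column names were chosen only to exercise each helper.

```r
library(dplyr)

df <- tibble(x1 = 1, x2 = 2, x3 = 3, abc_total = 4, total_xyz = 5)

begins <- names(select(df, starts_with("abc")))          # "abc_total"
ends <- names(select(df, ends_with("xyz")))              # "total_xyz"
ranged <- names(select(df, num_range("x", 1:2)))         # "x1" "x2"
moved <- names(select(df, total_xyz, everything()))      # total_xyz first, rest unchanged
renamed <- names(rename(df, first = x1))                 # new_name = old_name
```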

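4.4. mutate: adding new variables

Variables can be referred to as soon as they are created, within the same mutate() call.

[figure]

To keep only the new variables, use transmute().

[figure]

Some useful operators and functions

%/% (integer division) and %% (remainder)
lead(), lag(); these can be combined with group_by(). x - lag(x) computes running differences, and x != lag(x) finds where a value changes.

[figure]
cumsum(), cumprod(), cummin(), cummax(), cummean(). For rolling-window calculations, see the RcppRoll package.
min_rank()

[figure]
row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()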
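A compact sketch that exercises several of these operators in one mutate() call. The tibble and its values are invented; dep_time is assumed to be a clock time written as HHMM, and the new column names are illustrative.

```r
library(dplyr)

df <- tibble(dep_time = c(517, 1342, 2359), x = c(10, 10, 40))

out <- mutate(df,
  hour = dep_time %/% 100,                # integer division: 5, 13, 23
  minute = dep_time %% 100,               # remainder: 17, 42, 59
  since_midnight = hour * 60 + minute,    # refers to columns created just above
  x_diff = x - lag(x),                    # NA, 0, 30
  x_changed = x != lag(x),                # NA, FALSE, TRUE
  x_running = cumsum(x),                  # 10, 20, 60
  x_rank = min_rank(x)                    # ties share the smallest rank: 1, 1, 3
)

# transmute() keeps only the newly created columns
only_new <- transmute(df, hour = dep_time %/% 100, minute = dep_time %% 100)
```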

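4.5. summarise: grouped summaries

Use it together with group_by().

[figure]
About na.rm: with na.rm = TRUE, mean() removes NA values before computing. This avoids results that come out as NA.
In grouped summaries it is a good idea to include a count, n(), or a count of non-missing values, sum(!is.na(x)).

[figure]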
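A minimal sketch of a grouped summary with and without na.rm, plus the two kinds of count. The tibble flights_small and its values are invented for illustration.

```r
library(dplyr)

flights_small <- tibble(
  day = c(1, 1, 2, 2),
  dep_delay = c(10, NA, 30, 50)
)

by_day <- flights_small %>%
  group_by(day) %>%
  summarise(
    delay_raw = mean(dep_delay),               # NA for day 1, which has a missing value
    delay = mean(dep_delay, na.rm = TRUE),     # missing values removed first
    n = n(),                                   # rows in the group
    n_non_missing = sum(!is.na(dep_delay))     # rows that contributed to the mean
  )
```

Keeping both counts next to the mean shows how many observations each average actually rests on.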

Average daily delay per plane, together with the count.

[figure]
It’s often useful to filter out the groups with the smallest numbers of observations (small sample sizes), so you can see more of the pattern and less of the extreme variation in the smallest groups.

[figure]
In RStudio, Cmd+Shift+P resends the previously run code chunk to the console. This makes it easy to change the value of n several times and watch how the result changes.
Useful functions:

median(x)
sd(x), interquartile range IQR(x), median absolute deviation mad(x). The latter two are robust to outliers, which also makes them useful for spotting outliers.
min(x), quantile(x, 0.25) (a value of x that is greater than 25% of the values), max(x).
first(x), nth(x, 2), last(x). Here x is a vector, and first(x) is equivalent to x[1].
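The summary functions listed above can be compared on a small invented vector with one obvious outlier. Note how sd() is inflated by the outlier while IQR() and mad() are not.

```r
library(dplyr)

x <- c(1, 2, 3, 4, 100)   # 100 is a deliberate outlier

centre <- median(x)               # 3
spread_sd <- sd(x)                # large, dominated by the outlier
spread_iqr <- IQR(x)              # 2: distance between the 25% and 75% quantiles
spread_mad <- mad(x)              # about 1.48
q25 <- quantile(x, 0.25)          # 2

# Positional helpers from dplyr
first_value <- first(x)           # same as x[1]
second_value <- nth(x, 2)
last_value <- last(x)
```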

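Using with filter
Group by date and rank the rows by dep_time to get r, then use range() to pick each day's largest and smallest (r is a rank for every row, not one value per day)

[figure]
n() returns the size of the group, sum(!is.na(x)) the number of non-NA values, and n_distinct(x) the number of unique values.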
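A sketch of the ranking step described above, assuming it looks roughly like the grouped mutate() plus filter() shown here. The tibble flights_small and its values are invented.

```r
library(dplyr)

flights_small <- tibble(
  day = c(1, 1, 1, 2, 2, 2),
  dep_time = c(600, 900, 1200, 700, 800, 2300)
)

extremes <- flights_small %>%
  group_by(day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%  # rank within each day, latest departure = 1
  filter(r %in% range(r)) %>%               # keep the smallest and largest rank per day
  ungroup()

# n_distinct() counts unique values
days <- n_distinct(flights_small$day)
```

Each day keeps exactly its earliest and latest departure, because range(r) is evaluated separately for every group.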
About count()
You can specify a weight variable. Here this amounts to computing the total distance flown by each plane (tailnum is the plane's tail number).

[figure]
ungroup() removes the grouping.
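A minimal sketch of count() with and without a weight. The tail numbers and distances are invented.

```r
library(dplyr)

flights_small <- tibble(
  tailnum = c("N1", "N1", "N2"),
  distance = c(100, 250, 400)
)

# Without wt: number of rows (flights) per plane
n_flights <- count(flights_small, tailnum)

# With wt: sum of the weight variable per plane, here total distance
total_distance <- count(flights_small, tailnum, wt = distance)
```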

4.6. Grouped mutate and filter

Find the worst members of each group

[figure]
Find all groups larger than a threshold

[figure]
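Both patterns can be sketched with a grouped filter(). The data, the column names dest and arr_delay, and the threshold of 2 are assumptions chosen for illustration.

```r
library(dplyr)

flights_small <- tibble(
  dest = c("A", "A", "A", "B", "B", "C"),
  arr_delay = c(5, 40, 20, 15, 60, 90)
)

# Worst member of each group: the row with the largest delay per destination
worst <- flights_small %>%
  group_by(dest) %>%
  filter(min_rank(desc(arr_delay)) <= 1) %>%
  ungroup()

# Groups at or above a size threshold: n() is evaluated once per group
popular <- flights_small %>%
  group_by(dest) %>%
  filter(n() >= 2) %>%
  ungroup()
```

The second pattern is also a way to drop groups with very few observations before summarising.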

5. workflow

Cmd+Shift+N opens a new, empty editor
Cmd+Enter runs the current line of code
Cmd+Shift+S runs the whole script
Cmd+Shift+F10 restarts the R session in RStudio

6. Tibbles

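as_tibble() converts a regular data frame to a tibble.
tibble() creates a tibble.
code

# df is assumed to be a small tibble with a numeric column x,
# e.g. df <- tibble(x = runif(5), y = rnorm(5))
# Extract by name
df$x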
#> [1] 0.434 0.395 0.548 0.762 0.254
df[["x"]]
#> [1] 0.434 0.395 0.548 0.762 0.254

# Extract by position
df[[1]]
#> [1] 0.434 0.395 0.548 0.762 0.254
df %>% .$x
#> [1] 0.434 0.395 0.548 0.762 0.254
df %>% .[["x"]]
#> [1] 0.434 0.395 0.548 0.762 0.254
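A short sketch of creating tibbles and extracting a column, with fixed values instead of random ones so the result is predictable. The names cars_tbl, by_name and by_position are illustrative.

```r
library(tibble)

# as_tibble() converts an existing data frame
cars_tbl <- as_tibble(mtcars)

# tibble() builds one from vectors; later columns can use earlier ones
df <- tibble(x = 1:5, y = x^2)

# $ and [[ ]] both return the column as a plain vector
by_name <- df$x
by_position <- df[[1]]
```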

7. Data Import

7.1. Reading data

Functions

read_csv() reads comma delimited files, read_csv2() reads semicolon separated files (common in countries where , is used as the decimal place), read_tsv() reads tab delimited files, and read_delim() reads in files with any delimiter.
read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions(). read_table() reads a common variation of fixed width files where columns are separated by white space.
read_log() reads Apache style log files. (But also check out webreadr, which is built on top of read_log() and provides many more helpful tools.)

Arguments
read_csv(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = show_progress())
The most commonly used arguments are file, col_names (which can also be given explicitly as a vector of names), na, comment, and skip.
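A sketch of the skip, na and col_names arguments on inline text. The CSV content is invented, and wrapping the text in I() to mark it as literal data assumes a recent version of readr.

```r
library(readr)

# One line of metadata to skip, and "." used as the missing-value marker
with_meta <- read_csv(
  I("exported by a hypothetical tool\nx,y\n1,2\n3,.\n"),
  skip = 1,
  na = "."
)

# No header row in the data, so the column names are supplied explicitly
no_header <- read_csv(I("1,2\n3,4\n"), col_names = c("a", "b"))
```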

7.2. Writing data

Functions

write_csv, write_tsv
write_rds, read_rds
write_excel_csv
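A round-trip sketch comparing the CSV and RDS writers. The data frame is invented and the files go to temporary paths. CSV stores plain text, so a factor comes back as a character column, while RDS keeps the R object as it was.

```r
library(readr)

df <- data.frame(id = 1:2, grp = factor(c("a", "b")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV: column types are re-guessed on reading
write_csv(df, csv_path)
from_csv <- read_csv(csv_path)

# RDS: R's binary format, types are preserved
write_rds(df, rds_path)
from_rds <- read_rds(rds_path)
```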

Other packages

haven reads SPSS, Stata, and SAS files.
readxl reads excel files (both .xls and .xlsx).
DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.

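8. Relational data with dplyr

8.1. Mutating joins

8.1.1. Inner join

[figure]

Matches only rows (observations) whose keys are equal, i.e. it keeps the observations that appear in both tables. It is not used often, because it is easy to lose observations.

8.1.2. Outer joins

Left join: keeps all observations in x
Right join: keeps all observations in y
Full join: keeps all observations in x and y
Figure

[figure]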
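The four mutating joins can be compared on two invented tables that share keys 1 and 2 but differ on 3 and 4.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

inner <- inner_join(x, y, by = "key")   # keys present in both: 1, 2
left <- left_join(x, y, by = "key")     # every key of x: 1, 2, 3
right <- right_join(x, y, by = "key")   # every key of y: 1, 2, 4
full <- full_join(x, y, by = "key")     # every key of either: 1, 2, 3, 4
```

Rows without a match on the other side get NA in the columns that come from that side.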

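8.1.3. The by argument

When two tables are joined on a single variable, that variable must have the same name in both tables: by = "key"
by = c("a" = "b") matches variable a in table x to variable b in table y. The output uses the variable name from x.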
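A minimal sketch of joining on differently named key columns. The tables and their values are invented.

```r
library(dplyr)

x <- tibble(a = c(1, 2), val_x = c("x1", "x2"))
y <- tibble(b = c(1, 2), val_y = c("y1", "y2"))

# Match x$a against y$b; the key column keeps its name from x
joined <- left_join(x, y, by = c("a" = "b"))
```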

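8.2. Filtering joins

semi_join(x, y) keeps all observations in x that have a match in y
anti_join(x, y) drops all observations in x that have a match in y
Semi-join: connects the two tables like a mutating join, but instead of adding new columns it keeps only the rows of x that have a match in y

[figure]

What matters is that a match exists; which observation is matched is irrelevant. As a result, a semi-join never duplicates rows the way a mutating join can.

[figure]
Anti-join: the inverse of a semi-join. It is useful for diagnosing mismatches in a join.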
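A sketch contrasting the filtering joins with a mutating join. The tables are invented, and key 1 appears twice in y on purpose.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 1, 4), val_y = c("y1", "y2", "y3"))

# A mutating join repeats the x row for key 1, once per match in y
n_inner <- nrow(inner_join(x, y, by = "key"))

# A semi-join keeps the x row once, and adds no columns from y
kept <- semi_join(x, y, by = "key")

# An anti-join returns the x rows with no match in y
unmatched <- anti_join(x, y, by = "key")
```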

9. Strings with stringr

9.1. Basics

Double quotes are recommended for creating strings
str_length(): the length of a string
str_c(): combines two or more strings

[figure]

Collapsing a character vector into a single string, with a separator

[figure]

str_sub(x, start, end)

[figure]

The assignment form can be used to modify strings

[figure]
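A sketch of the basic stringr functions listed above. The example strings are illustrative, and the use of the sep and collapse arguments of str_c() is an assumption about what the original screenshots showed.

```r
library(stringr)

lengths <- str_length(c("a", "R for data science", NA))   # 1, 18, NA

joined <- str_c("x", "y", sep = ", ")                     # "x, y"
collapsed <- str_c(c("x", "y", "z"), collapse = ", ")     # "x, y, z"

x <- c("Apple", "Banana", "Pear")
head3 <- str_sub(x, 1, 3)      # "App" "Ban" "Pea"
tail3 <- str_sub(x, -3, -1)    # negative positions count from the end

# Assignment form: replace the first character with its lower-case version
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
```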

9.2. Regular expressions

9.2.1. Basics

str_view() and str_view_all() take a character vector and a regular expression.
Exact matching

[figure]
Matching any character

[figure]
Escaping requires \\

[figure]
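A sketch of the three cases above. It uses str_detect() and str_extract() rather than str_view(), because str_view() produces rendered output that is awkward to check in a script. The example strings are illustrative.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Exact match: which strings contain "an"?
has_an <- str_detect(x, "an")

# "." matches any single character, so ".a." is an "a" with a character on each side
around_a <- str_extract(x, ".a.")

# To match a literal dot, the regex needs \. and the R string needs \\.
literal_dot <- str_detect(c("abc", "a.c", "bef"), "a\\.c")
```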

9.2.2. Character classes

\d matches any digit
\s matches any whitespace character (space, tab, newline)
[abc] matches a, b, or c
[^abc] matches any character except a, b, or c
Example

[figure]
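A sketch of the four character classes on invented strings. Inside an R string the backslash itself must be escaped, so \d is written "\\d".

```r
library(stringr)

first_digit <- str_extract("room 42", "\\d")          # first digit only
has_space <- str_detect(c("a b", "ab"), "\\s")

# Every character that is a, b, or c, then every character that is not
in_class <- str_extract_all("cabbage", "[abc]")[[1]]
not_in_class <- str_extract_all("cabbage", "[^abc]")[[1]]
```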

9.2.3. Repetition

?: 0 or 1
+: 1 or more
*: 0 or more
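The three quantifiers can be compared on spelling variants of one word. The strings are invented; each pattern differs only in how many times the letter u may appear.

```r
library(stringr)

words <- c("color", "colour", "colouur")

zero_or_one <- str_detect(words, "colou?r")    # u is optional
one_or_more <- str_detect(words, "colou+r")    # at least one u
zero_or_more <- str_detect(words, "colou*r")   # any number of u, including none
```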


Notes on “R for Data Science”
