Tidyverse is a collection of excellent R packages. Its eight core packages are ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr and forcats.
What each package does:

readr: reading data
tibble: an enhanced data frame
tidyr: converting between long and wide tables, tidying and cleaning data
dplyr: data manipulation and wrangling
purrr: functional programming over lists and vectors
stringr: working with string data
forcats: working with factor data
ggplot2: data visualization

For most data analysis tasks the operations are fairly fixed, and so are the commands and R packages that carry them out. The figure below gives a rough summary.

[Figure: image.png]

Installing, loading and getting an overview of tidyverse

install.packages("tidyverse")
library(tidyverse)

tidyverse_conflicts()     # conflicts between tidyverse and other packages
tidyverse_deps()          # list all tidyverse dependencies
tidyverse_logo()          # print the tidyverse logo
tidyverse_packages()      # list all tidyverse packages
tidyverse_update()        # update tidyverse packages

Loading and inspecting the data

library(datasets)
install.packages("gapminder")
library(gapminder)
attach(iris)

head(iris)
str(iris)
glimpse(iris)
typeof(iris)
dim(iris)

The readr package

The main functions in readr are:
read_csv,
read_delim,
read_table,
write_delim,
write_csv,
write_excel_csv.
read_table always treats runs of whitespace as the separator and the separator cannot be changed, whereas read_delim lets you specify the delimiter.
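
To make the point about delimiters concrete, here is a minimal sketch that writes a small table with write_delim() and reads it back with read_delim(). The data frame, the temporary file and the ";" delimiter are made up for illustration, and show_col_types assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical toy data, written to a temporary file with ";" as the delimiter
tmp <- tempfile(fileext = ".txt")
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
write_delim(df, tmp, delim = ";")

# read_delim() lets us state the delimiter explicitly
back <- read_delim(tmp, delim = ";", show_col_types = FALSE)
back
```

read_csv() would be the shorter choice for comma-separated files; read_delim() is the general form.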

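The pipe operator: %>%

It passes the object on the left of %>% to the function on the right as its first argument. This removes the need to name intermediate objects, so fewer temporary variables accumulate in the workspace.
f(x) becomes x %>% f, and something like h(g(f(x))) becomes x %>% f %>% g %>% h.

x %>% f   is equivalent to   f(x)
x %>% f(y)   is equivalent to   f(x, y)
x %>% f %>% g %>% h   is equivalent to   h(g(f(x)))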
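
The equivalences above can be checked on a small vector. This is only a sketch; the vector x and the choice of sqrt(), mean() and round() are arbitrary.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested form: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read from left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```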

The argument placeholder

x %>% f(y, .)   is equivalent to   f(y, x)
x %>% f(y, z = .)   is equivalent to   f(y, z = x)

Re-using the placeholder for attributes
The placeholder can be used several times in the right-hand side expression. However, when the placeholder appears only inside nested expressions, magrittr still applies the first-argument rule, because in most cases this gives cleaner code.

x %>% f(y = nrow(.), z = ncol(.))     is equivalent to     f(x, y = nrow(x), z = ncol(x))

This behavior can be overruled by enclosing the right-hand side in braces:

x %>% {f(y = nrow(.), z = ncol(.))}     is equivalent to     f(y = nrow(x), z = ncol(x))
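
A small sketch of the three placeholder rules, using a made-up matrix m and base functions rep() and paste(); nothing here is specific to the original examples.

```r
library(magrittr)

m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

# Placeholder as a direct argument: 3 is NOT also inserted as the first argument
reps <- 3 %>% rep("a", .)                      # same as rep("a", 3)

# Placeholder only inside nested calls: m is still passed as the first argument
first_arg <- m %>% paste(nrow(.), ncol(.))     # same as paste(m, nrow(m), ncol(m))

# Braces switch the first-argument rule off
braced <- m %>% { paste(nrow(.), ncol(.)) }    # same as paste(nrow(m), ncol(m))

length(first_arg)
braced
```

Here first_arg has one element per cell of m, while braced is the single string built from the two dimensions.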

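Exposing variables in a pipe
Many functions accept a data argument, for example lm and aggregate, which is very useful in a pipeline that processes data. Other functions have no data argument, and for those it is useful to expose the variables in the data. This is done with the %$% operator: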

library(tidyverse)
library(magrittr)
iris %>%
  subset(Sepal.Length> mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

data.frame(z= rnorm(100)) %$% ts.plot(z)

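Compound assignment pipe operations
There is also a pipe operator that serves as shorthand when the left-hand side is being "overwritten":

iris$Sepal.Length<- 
  iris$Sepal.Length %>%
  sqrt()

To avoid repeating the left-hand side immediately after the assignment operator, use the %<>% operator:

iris$Sepal.Length %<>% sqrt

This operator works exactly like %>%, except that the pipeline assigns the result instead of returning it. It must be the first pipe operator in a longer chain.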
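
To try the compound assignment pipe without altering iris, the following sketch applies it to a throwaway vector (v is a made-up name).

```r
library(magrittr)

v <- c(4, 9, 16)

# Equivalent to: v <- v %>% sqrt()
v %<>% sqrt()
v
```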

Besides the handy %>%, magrittr provides three other useful operators: %$%, %<>% and %T>%.

%>%        forward-pipe operator.
%T>%     tee operator.
%<>%     compound assignment pipe-operator. (Experienced users advise against it, and that advice is worth following.)
%$%        exposition pipe-operator.
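
The tee operator is the only one in this list without an example so far. As a rough sketch: %T>% runs the right-hand side for its side effect (here print()) and then passes the original left-hand value on, so the chain can continue.

```r
library(magrittr)

# print() is called for its side effect; the vector itself flows on to sum()
res <- c(1, 2, 3) %T>% print() %>% sum()
res
```

This is mostly useful for plotting or logging in the middle of a pipeline.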

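tidyr: the successor to reshape2, with a more focused purpose

tidyr makes data tidy.
Tidy data follows three principles:

1 Each variable forms a column
2 Each observation forms a row
3 Each value sits in its own cell

Properties of tidy data: every column is a variable; every row is an observation
The four commonly used tidyr functions

gather() turns "wide" data into long data
spread() turns "long" data into wide data
separate() splits a single column into multiple columns
unite() combines multiple columns into a single column

We use gather() to collect data that was originally spread across three columns and put it into two columns: key and value.
"Gather takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables." This is how the tidyverse defines gather.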

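# key_value is not defined in the original; a hypothetical 3 x 3 example is assumed here
key_value <- data.frame(row = 1:3, key1 = 1:3, key2 = 4:6, key3 = 7:9)
kv_gathered <- key_value %>% 
  gather(key, # name of the new column holding the 3 former column names
         value, # name of the new column holding the 9 values
         key1:key3, # the range of columns to gather
         na.rm = TRUE # drop rows whose value is missing
  )
kv_gathered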

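The four main arguments of gather
data: the data set
key: name of the new column that stores the former column names
value: name of the new column that stores the former values
...: the columns to gather; prefix a column with - to exclude it

gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
     The first argument, data, is the original data and must be a data frame;
     next comes a key-value pair whose names you choose yourself; they become the headers (the two variable names) of the reshaped table;
     the fourth argument selects the columns to gather; if it is omitted, all columns are gathered;
     optional arguments can follow, such as na.rm: with na.rm = TRUE, rows whose value is missing (NA) in the original table are dropped from the new table.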

gather() example
First build a data frame stu:

stu<-data.frame(grade=c("A","B","C","D","E"), female=c(5, 4, 1, 2, 3), male=c(1, 2, 3, 4, 5))  # number of students by grade and gender
gather(stu,gender,count, -grade)
The first argument is the original data stu, the second and third are the key-value pair (gender, count), and the fourth uses a minus sign to exclude the grade column, so only the remaining two columns are gathered.
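
Since tidyr 1.0.0 the package documentation marks gather() as superseded by pivot_longer(). The sketch below reshapes the same stu table both ways, assuming tidyr 1.0.0 or later is installed. The two results hold the same rows but in a different order.

```r
library(tidyr)

stu <- data.frame(grade = c("A", "B", "C", "D", "E"),
                  female = c(5, 4, 1, 2, 3),
                  male = c(1, 2, 3, 4, 5))

# Older interface, as used in this tutorial
long_old <- gather(stu, gender, count, -grade)

# Newer interface: names_to / values_to play the role of key / value
long_new <- pivot_longer(stu, cols = -grade,
                         names_to = "gender", values_to = "count")

long_new
```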

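separate() splits data: it pulls apart a column that actually holds two variables (in the gather example above, the column names themselves were a variable, one value per column name).
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
The first argument is the data frame to work on;
the second is the column to split;
the third gives the names of the resulting columns (necessarily more than one) as a character vector;
the fourth is the separator, given as a regular expression, or as a number giving the character position at which to split.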

stu2<-data.frame(grade=c("A","B","C","D","E"), 
                 female_1=c(5, 4, 1, 2, 3), male_1=c(1, 2, 3, 4, 5),
                 female_2=c(4, 5, 1, 2, 3), male_2=c(0, 2, 3, 4, 6))
stu2
stu2_new<-gather(stu2,gender_class,count,-grade)
stu2_new

Now we split the gender_class column into two:

separate(stu2_new,gender_class,c("gender","class"))
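
The call above relies on the default separator. As a hedged variation, the sketch below passes sep explicitly and sets convert = TRUE so that the class column comes back as an integer rather than as text. The two-row data frame is made up for illustration.

```r
library(tidyr)

# Hypothetical miniature version of stu2_new
df <- data.frame(gender_class = c("female_1", "male_2"),
                 count = c(5, 6))

out <- separate(df, gender_class,
                into = c("gender", "class"),
                sep = "_",        # split at the underscore
                convert = TRUE)   # let tidyr guess column types after splitting
str(out)
```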

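spread() widens a table: it splits the values of one column (the keys of key-value pairs) into multiple columns.

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

key is the name of the column whose values become the new column names; value is the name of the column whose values fill those new columns.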

name<-rep(c("Sally","Jeff","Roger","Karen","Brain"),c(2,2,2,2,2))
test<-rep(c("midterm","final"),5)
class1<-c("A","C",NA,NA,NA,NA,NA,NA,"B","B")
class2<-c(NA,NA,"D","E","C","A",NA,NA,NA,NA)
class3<-c("B","C",NA,NA,NA,NA,"C","C",NA,NA)
class4<-c(NA,NA,"A","C",NA,NA,"A","A",NA,NA)
class5<-c(NA,NA,NA,NA,"B","A",NA,NA,"A","C")
stu3<-data.frame(name,test,class1,class2,class3,class4,class5)
stu3
gather(stu3,class,grade, class1:class5, na.rm=TRUE)

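Use spread() to split the test column into two columns, midterm and final, whose values are the grades for the two courses each student took.
To repeat: the second argument is the name of the column to split, and the third is the name of the column in the original table that supplies the values of the new columns.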

stu3_new<-gather(stu3, class, grade, class1:class5, na.rm = TRUE)
spread(stu3_new,test,grade)
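
spread() likewise has a newer counterpart, pivot_wider(), in tidyr 1.0.0 and later. The following sketch compares the two on a made-up four-row table (scores is a hypothetical name). Note that the column and row order of the two results may differ.

```r
library(tidyr)

scores <- data.frame(name  = c("Sally", "Sally", "Jeff", "Jeff"),
                     test  = c("midterm", "final", "midterm", "final"),
                     grade = c("A", "C", "D", "E"))

# Older interface, as used in this tutorial
wide_old <- spread(scores, test, grade)

# Newer interface: names_from / values_from play the role of key / value
wide_new <- pivot_wider(scores, names_from = test, values_from = grade)

wide_new
```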

One last touch: the class column now looks somewhat redundant, and plain numbers would be more concise. Use parse_number() from readr to extract the number (this also uses mutate() from dplyr).

library(readr)
library(dplyr)
mutate(spread(stu3_new,test,grade),class=parse_number(class))

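unite: combine multiple columns into one

unite(data, col, ..., sep = "_", remove = TRUE)
data: the data frame
col: name of the new combined column
...: the columns to combine
sep: separator placed between the combined values, an underscore by default
remove: whether to drop the columns that were combined

First make up a data frame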

set.seed(1)
date <- as.Date('2016-11-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)  # a plain data frame is enough here

Combine the date, hour, min and second columns into a new column, datetime.
The date-time format in R is "Year-Month-Day Hour:Min:Second"

dataNew <- data %>%
  unite(datehour, date, hour, sep = ' ') %>%
  unite(datetime, datehour, min, second, sep = ':')
dataNew
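
The remove argument described earlier is not exercised in the example above. A minimal sketch with made-up year, month and day columns shows that remove = FALSE keeps the source columns next to the new one.

```r
library(tidyr)

# Hypothetical date parts
df <- data.frame(year = c(2016, 2016), month = c(11, 12), day = c(1, 15))

# remove = FALSE keeps year, month and day alongside the new "date" column
u <- unite(df, "date", year, month, day, sep = "-", remove = FALSE)
u
```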

dplyr

Main functions:

1. Select rows of a table: filter
2. Sort rows: arrange
3. Change columns of a table: mutate, transmute
    mutate keeps the existing columns alongside the new ones
    transmute keeps only the new columns and drops the rest
4. Select columns of a table: select, rename
    select keeps only the columns you specify
    rename changes column names and keeps all other columns
5. group_by and summarize split the data into groups and analyze each group

Filtering: filter() takes a subset of the data, extracting the rows that satisfy given logical conditions.

For example, iris %>% filter(Sepal.Length > 6)

iris %>% filter(Species == "virginica") # rows that match the condition
iris %>% filter(Species == 'virginica', Sepal.Length > 6)  # separate multiple conditions with commas

Select the rows where Sepal.Length > 6.7 and Species is either "versicolor" or "virginica"

iris %>% filter(
  Sepal.Length > 6.7, 
  Species %in% c("versicolor", "virginica" )
)
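
The %in% form above is shorthand for an explicit "or". As a quick check, sketched on the built-in iris data, both spellings should return the same rows.

```r
library(dplyr)

# Comma-separated conditions are combined with "and"
a <- iris %>%
  filter(Sepal.Length > 6.7,
         Species %in% c("versicolor", "virginica"))

# The same condition spelled out with & and |
b <- iris %>%
  filter(Sepal.Length > 6.7 &
           (Species == "versicolor" | Species == "virginica"))

identical(a, b)
```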

Sorting: arrange() orders the observations, ascending by default.

iris %>% arrange(Sepal.Length)  # ascending
iris %>% arrange(desc(Sepal.Length))  # descending
arrange(iris, -Sepal.Length)  # sort by Sepal.Length, descending

Adding variables: mutate() updates an existing column of a data frame or adds a new one.

iris %>% mutate(Sepal.Length = Sepal.Length*10) # convert the column to millimetres
iris %>% mutate(SLMn = Sepal.Length * 10) # create a new column
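
The difference between mutate() and transmute() noted in the function list is easiest to see from the column counts. A small sketch on iris, where SL_mm is a made-up column name:

```r
library(dplyr)

# mutate() keeps the five original columns and adds SL_mm
kept <- iris %>% mutate(SL_mm = Sepal.Length * 10)

# transmute() returns only the column that was created
only_new <- iris %>% transmute(SL_mm = Sepal.Length * 10)

ncol(kept)
ncol(only_new)
```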

select: choose specific columns

iris %>% select(1:3) # columns 1 through 3
iris %>% select(1,3) # columns 1 and 3
 
iris %>% select(Sepal.Length, Petal.Length)
iris %>% select(Sepal.Length:Petal.Length)
iris %>% select(starts_with("Petal"))  # Select column whose name starts with "Petal"
iris %>% select(ends_with("Width"))  # Select column whose name ends with "Width"
iris %>% select(contains("etal"))  # Select columns whose names contains "etal"
iris %>% select(matches(".t."))  # Select columns whose name matches a regular expression
iris %>% select(one_of(c("Sepal.Length", "Petal.Length")))  # selects variables provided in a character vector.

iris %>% select_if(is.numeric)  # select the numeric columns

# Drop columns by name
iris %>% select(-Sepal.Length, -Petal.Length)  # remove the Sepal.Length and Petal.Length columns
iris %>% select(-(Sepal.Length:Petal.Length))  # remove all columns from Sepal.Length to Petal.Length
iris %>% select(-starts_with("Petal"))  # remove all columns whose name starts with "Petal"

# Drop columns by position
iris %>% select(-1)  # drop column 1
iris %>% select(-(1:3))   # drop columns 1 to 3
iris %>% select(-1, -3)   # drop columns 1 and 3

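rename(): rename columns

iris %>% 
  rename(
    sepal_length = Sepal.Length,
    sepal_width = Sepal.Width
  )
# Rename Sepal.Length to sepal_length and Sepal.Width to sepal_width using base R.
# Work on a copy so that the original iris columns stay available for later examples.
iris_renamed <- iris
# Rename the column whose name is "Sepal.Length"
names(iris_renamed)[names(iris_renamed) == "Sepal.Length"] <- "sepal_length"
names(iris_renamed)[names(iris_renamed) == "Sepal.Width"] <- "sepal_width"
names(iris_renamed)  # names() or colnames() returns the column names
# Rename by column position
names(iris_renamed)[1] <- "sepal_length"
names(iris_renamed)[2] <- "sepal_width"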

Chaining the functions together:

iris %>%
  filter(Species == "virginica") %>%     # species names in iris are lower case
  mutate(SLMm = Sepal.Length * 10) %>%   # sepal length in millimetres
  arrange(desc(SLMm))

summarize() collapses many values into a single summary data point.

iris %>% summarize(medianSL = median(Sepal.Length))

iris %>% 
  filter(Species == "virginica") %>%
  summarize(medianSL=median(Sepal.Length))
# Several summaries can be computed at once, separated by commas
iris %>% 
  filter(Species == "virginica") %>% 
  summarize(medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length))

group_by() lets us summarize the data by specified groups rather than over the whole data frame

iris %>% 
  group_by(Species) %>% 
  summarize(medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length))

iris %>% 
  filter(Sepal.Length>6) %>% 
  group_by(Species) %>% 
  summarize(medianPL = median(Petal.Length), 
            maxPL = max(Petal.Length))
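
A common companion to these summaries is the group size. The sketch below adds n() to a grouped summary and drops the grouping afterwards. The .groups argument assumes dplyr 1.0.0 or later, and the column names n and meanSL are made up.

```r
library(dplyr)

by_species <- iris %>%
  group_by(Species) %>%
  summarize(n = n(),                       # number of rows in each group
            meanSL = mean(Sepal.Length),   # mean sepal length per group
            .groups = "drop")              # return an ungrouped result

by_species
```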

ggplot2

# Scatter plot
# A scatter plot helps us understand the relationship between two variables; geom_point() draws one:
iris_small<- iris %>% 
  filter(Sepal.Length > 5)
ggplot(iris_small, aes(x=Petal.Length, y= Petal.Width)) + 
  geom_point()  

# Color
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species)) + 
  geom_point()

# Size
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species,
                       size = Sepal.Length)) + 
  geom_point()

# Faceting
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width)) + 
  geom_point() + 
  facet_wrap(~Species)


# Line chart
by_year <- gapminder %>% 
  group_by(year) %>% 
  summarize(medianGdpPerCap = median(gdpPercap))

ggplot(by_year, aes(x = year,
                    y = medianGdpPerCap)) +
  geom_line() + 
  expand_limits(y=0)

# Bar chart
by_species <- iris %>%  
  filter(Sepal.Length > 6) %>% 
  group_by(Species) %>% 
  summarize(medianPL=median(Petal.Length))

ggplot(by_species, aes(x = Species, y=medianPL)) + 
  geom_col()

# Histogram
ggplot(iris_small, aes(x = Petal.Length)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Box plot
ggplot(iris_small, aes(x=Species, y=Sepal.Length)) + 
  geom_boxplot()
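
The plots above use default labels. As one possible finishing step, a plot can be stored in a variable and then given a title, axis labels and a theme. The wording of the labels and the choice of theme_minimal() are illustrative only.

```r
library(ggplot2)

# Build the plot as an object so that layers and labels can be added step by step
p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Sepal length by species",
       x = "Species",
       y = "Sepal length (cm)") +
  theme_minimal()

p
```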

Other packages for data import and modeling

DBI: connecting to databases
haven: reading SPSS, SAS and Stata data
httr: working with web APIs
jsonlite: reading JSON data
readxl: reading Excel files
rvest: web scraping
xml2: reading XML data
modelr: modeling within pipelines
broom: tidying the output of statistical models
