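dplyr is one of the key topics covered on Day 7 of the R section of 生信技能樹's popular introductory bioinformatics course. To deepen my understanding, I work through some practice exercises on a dataset here.
Function overview
tidyverse is a collection of R packages designed for data science. It includes ggplot2, dplyr, tidyr, stringr, magrittr, tibble, and other popular packages. We start with dplyr and the pipe operator.
Prepare and inspect the test data
> # Load the packages and inspect the data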
> library(tidyverse)
> set.seed(123)
> diamonds <- diamonds[sample(nrow(diamonds), 10),]
> head(diamonds)
carat cut color clarity depth table price x y z
3 0.31 Ideal D VS1 61.6 55 713 4.30 4.33 2.66
10 1.10 Very Good I SI1 61.2 61 4640 6.61 6.66 4.01
2 0.70 Ideal G VS1 60.8 56 3300 5.73 5.80 3.51
8 0.70 Good H SI1 64.2 58 1771 5.59 5.62 3.60
6 0.83 Good E SI1 63.7 59 3250 5.95 5.89 3.77
9 0.40 Ideal E VS1 61.6 56 1053 4.73 4.78 2.93
> diamonds <- as.data.frame(diamonds)
> attributes(diamonds) # inspect the attributes of the data
$names
[1] "carat" "cut" "color" "clarity" "depth" "table" "price" "x" "y"
[10] "z"
$row.names
[1] 3 10 2 8 6 9 1 7 5 4
$class
[1] "data.frame"
> # The data has 10 columns and 10 rows, and its class is data.frame
> unique(diamonds$cut)
[1] Ideal Very Good Good
Levels: Fair < Good < Very Good < Premium < Ideal
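As a complement to attributes(), the dimensions and column types can also be read off directly. The following is only a minimal sketch: `toy` is a hypothetical data frame (not part of the original exercise) that reuses three rows from the output above.

```r
# Hypothetical toy data reusing three rows from the printed sample
toy <- data.frame(
  carat = c(0.31, 1.10, 0.70),
  cut   = factor(c("Ideal", "Very Good", "Ideal"),
                 levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                 ordered = TRUE),
  price = c(713L, 4640L, 3300L)
)

dims  <- dim(toy)                                   # rows, then columns
types <- sapply(toy, function(col) class(col)[1])   # first class of each column
str(toy)                                            # compact structural overview
```

Here `cut` is built as an ordered factor to mirror the `Levels: Fair < Good < ...` line printed by unique().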
1 Select the carat, cut, and price columns by name
> select(diamonds,carat,cut,price)
carat cut price
3 0.31 Ideal 713
10 1.10 Very Good 4640
2 0.70 Ideal 3300
8 0.70 Good 1771
6 0.83 Good 3250
9 0.40 Ideal 1053
1 0.73 Ideal 2397
7 0.51 Very Good 1668
5 0.31 Ideal 987
4 0.31 Ideal 707
> p <- select(diamonds,carat,cut,price)
> # Next, use this data for a very basic visualization:
> ggplot(p,aes(carat,price))+
+ geom_point(aes(color=cut),size=2)

[Figure: scatter plot of price against carat, points colored by cut]
> # select can pick every column between two columns
> select(diamonds,carat:price)
carat cut color clarity depth table price
3 0.31 Ideal D VS1 61.6 55 713
10 1.10 Very Good I SI1 61.2 61 4640
2 0.70 Ideal G VS1 60.8 56 3300
8 0.70 Good H SI1 64.2 58 1771
6 0.83 Good E SI1 63.7 59 3250
9 0.40 Ideal E VS1 61.6 56 1053
1 0.73 Ideal I VS1 60.7 56 2397
7 0.51 Very Good D VS2 62.5 58 1668
5 0.31 Ideal E IF 60.9 55 987
4 0.31 Ideal H VVS1 62.2 56 707
> # select can also pick every column that is not between two columns
> select(diamonds,-(carat:price))
x y z
3 4.30 4.33 2.66
10 6.61 6.66 4.01
2 5.73 5.80 3.51
8 5.59 5.62 3.60
6 5.95 5.89 3.77
9 4.73 4.78 2.93
1 5.85 5.81 3.54
7 5.12 5.18 3.22
5 4.39 4.41 2.68
4 4.34 4.37 2.71
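The three forms of select() used above can be compared side by side, together with selection by column type. This is a sketch under assumptions: `toy` is a hypothetical data frame with a subset of the diamonds column names, and `where()` inside select() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data with a subset of the diamonds column names
toy <- data.frame(
  carat = c(0.31, 1.10, 0.70),
  cut   = c("Ideal", "Very Good", "Ideal"),
  price = c(713L, 4640L, 3300L),
  x     = c(4.30, 6.61, 5.73),
  y     = c(4.33, 6.66, 5.80),
  z     = c(2.66, 4.01, 3.51),
  stringsAsFactors = FALSE
)

by_name   <- select(toy, carat, cut, price)   # columns listed by name
by_range  <- select(toy, carat:price)         # contiguous range of columns
by_negate <- select(toy, -(carat:price))      # everything outside that range
by_type   <- select(toy, where(is.numeric))   # columns chosen by a predicate
```

In `toy` the first two calls return the same columns only because carat, cut, and price happen to be adjacent; in the full diamonds data the range also pulls in color, clarity, depth, and table.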
2 filter (keep rows where carat >= 0.5 and price >= 3000)
> filter(diamonds,carat >=0.5,price >=3000)
carat cut color clarity depth table price x y z
1 1.10 Very Good I SI1 61.2 61 4640 6.61 6.66 4.01
2 0.70 Ideal G VS1 60.8 56 3300 5.73 5.80 3.51
3 0.83 Good E SI1 63.7 59 3250 5.95 5.89 3.77
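Conditions separated by commas in filter() are combined with a logical AND. The sketch below checks this on a hypothetical `toy` data frame (the fourth row is invented so that the OR case differs) and shows `|` for comparison.

```r
library(dplyr)

# Hypothetical toy data; the last row passes the price test only
toy <- data.frame(
  carat = c(0.31, 1.10, 0.70, 0.40),
  price = c(713L, 4640L, 3300L, 3500L)
)

both_comma <- filter(toy, carat >= 0.5, price >= 3000)   # comma means AND
both_amp   <- filter(toy, carat >= 0.5 & price >= 3000)  # explicit AND
either     <- filter(toy, carat >= 0.5 | price >= 3000)  # OR keeps more rows
```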
3 Sort the rows by price (ascending by default)
Hint: arrange (changes the row order)
> arrange(diamonds,price)
carat cut color clarity depth table price x y z
1 0.31 Ideal H VVS1 62.2 56 707 4.34 4.37 2.71
2 0.31 Ideal D VS1 61.6 55 713 4.30 4.33 2.66
3 0.31 Ideal E IF 60.9 55 987 4.39 4.41 2.68
4 0.40 Ideal E VS1 61.6 56 1053 4.73 4.78 2.93
5 0.51 Very Good D VS2 62.5 58 1668 5.12 5.18 3.22
6 0.70 Good H SI1 64.2 58 1771 5.59 5.62 3.60
7 0.73 Ideal I VS1 60.7 56 2397 5.85 5.81 3.54
8 0.83 Good E SI1 63.7 59 3250 5.95 5.89 3.77
9 0.70 Ideal G VS1 60.8 56 3300 5.73 5.80 3.51
10 1.10 Very Good I SI1 61.2 61 4640 6.61 6.66 4.01
> # desc() sorts a column in descending order:
> arrange(diamonds,desc(price))
carat cut color clarity depth table price x y z
1 1.10 Very Good I SI1 61.2 61 4640 6.61 6.66 4.01
2 0.70 Ideal G VS1 60.8 56 3300 5.73 5.80 3.51
3 0.83 Good E SI1 63.7 59 3250 5.95 5.89 3.77
4 0.73 Ideal I VS1 60.7 56 2397 5.85 5.81 3.54
5 0.70 Good H SI1 64.2 58 1771 5.59 5.62 3.60
6 0.51 Very Good D VS2 62.5 58 1668 5.12 5.18 3.22
7 0.40 Ideal E VS1 61.6 56 1053 4.73 4.78 2.93
8 0.31 Ideal E IF 60.9 55 987 4.39 4.41 2.68
9 0.31 Ideal D VS1 61.6 55 713 4.30 4.33 2.66
10 0.31 Ideal H VVS1 62.2 56 707 4.34 4.37 2.71
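arrange() also accepts several sort keys, which is useful when the first key has ties (carat 0.31 and 0.70 both repeat in the sample). A minimal sketch on a hypothetical `toy` data frame: sort by carat ascending, and break ties by price descending.

```r
library(dplyr)

# Hypothetical toy data with tied carat values
toy <- data.frame(
  carat = c(0.70, 0.31, 0.70, 0.31),
  price = c(1771L, 713L, 3300L, 987L)
)

# First key ascending, second key descending within ties
sorted <- arrange(toy, carat, desc(price))
```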
4 Rename the price column to prices
> # Hint: rename (changes column names); the new name comes first, the original name second
> rename(diamonds,prices=price)
carat cut color clarity depth table prices x y z
3 0.31 Ideal D VS1 61.6 55 713 4.30 4.33 2.66
10 1.10 Very Good I SI1 61.2 61 4640 6.61 6.66 4.01
2 0.70 Ideal G VS1 60.8 56 3300 5.73 5.80 3.51
8 0.70 Good H SI1 64.2 58 1771 5.59 5.62 3.60
6 0.83 Good E SI1 63.7 59 3250 5.95 5.89 3.77
9 0.40 Ideal E VS1 61.6 56 1053 4.73 4.78 2.93
1 0.73 Ideal I VS1 60.7 56 2397 5.85 5.81 3.54
7 0.51 Very Good D VS2 62.5 58 1668 5.12 5.18 3.22
5 0.31 Ideal E IF 60.9 55 987 4.39 4.41 2.68
4 0.31 Ideal H VVS1 62.2 56 707 4.34 4.37 2.71
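Like the other dplyr verbs, rename() returns a modified copy and leaves its input untouched, so the result has to be assigned to keep it. A small sketch with a hypothetical `toy` data frame:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(carat = c(0.31, 1.10), price = c(713L, 4640L))

# new_name = old_name; the result must be assigned to be kept
renamed <- rename(toy, prices = price)

names(toy)      # the original still has a column called price
names(renamed)  # the copy has prices instead
```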
5 Add two columns, group ="A" and Length=10
mutate (adds new columns)
> mutate(diamonds,group ="A",Length=10)
carat cut color clarity depth table price x y z group Length
1 0.31 Ideal D VS1 61.6 55 713 4.30 4.33 2.66 A 10
2 1.10 Very Good I SI1 61.2 61 4640 6.61 6.66 4.01 A 10
3 0.70 Ideal G VS1 60.8 56 3300 5.73 5.80 3.51 A 10
4 0.70 Good H SI1 64.2 58 1771 5.59 5.62 3.60 A 10
5 0.83 Good E SI1 63.7 59 3250 5.95 5.89 3.77 A 10
6 0.40 Ideal E VS1 61.6 56 1053 4.73 4.78 2.93 A 10
7 0.73 Ideal I VS1 60.7 56 2397 5.85 5.81 3.54 A 10
8 0.51 Very Good D VS2 62.5 58 1668 5.12 5.18 3.22 A 10
9 0.31 Ideal E IF 60.9 55 987 4.39 4.41 2.68 A 10
10 0.31 Ideal H VVS1 62.2 56 707 4.34 4.37 2.71 A 10
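Besides constants, which are recycled to every row, mutate() can compute a column from existing ones and reuse a column created earlier in the same call. A sketch on hypothetical data; the column names `price_per_carat` and `expensive` and the 2500 threshold are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(carat = c(0.5, 1.0, 2.0), price = c(1000, 3000, 10000))

with_ratio <- mutate(
  toy,
  group           = "A",                     # constant, recycled to all rows
  price_per_carat = price / carat,           # computed from existing columns
  expensive       = price_per_carat > 2500   # uses the column defined just above
)
```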
6 Use summarize to compute the mean of price and the standard deviation of carat
> # summarize collapses a data frame into a single row
>
> summarize(diamonds,mean(price),
+ sd(carat))
mean(price) sd(carat)
1 2048.6 0.2664999
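Without names, the summary columns are labeled by the expressions themselves (`mean(price)`, `sd(carat)`), which is awkward to refer to later. A sketch of the same call with explicit names on a hypothetical `toy` data frame; `mean_price`, `sd_carat`, and `n` are illustrative names.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(carat = c(0.5, 1.0, 1.5), price = c(1000, 2000, 6000))

stats <- summarize(
  toy,
  mean_price = mean(price),  # named summary columns
  sd_carat   = sd(carat),
  n          = n()           # number of rows summarized
)
```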
7 Use group_by() to compute a summary statistic for each cut group
> # group_by changes the unit of analysis from the whole dataset to individual groups
> diamonds %>% group_by(cut) %>%
+ summarize(m = mean(price,na.rm=T))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
cut m
<ord> <dbl>
1 Good 2510.
2 Very Good 3154
3 Ideal 1526.
> # na.rm=T (short for TRUE) drops missing values before the mean is computed
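A grouped summary can return several statistics at once, and the `.groups` argument mentioned in the message above can be set explicitly. This sketch assumes dplyr 1.0.0 or later (where `.groups` exists) and uses a hypothetical `toy` data frame with one missing price.

```r
library(dplyr)

# Hypothetical toy data with one missing price
toy <- data.frame(
  cut   = c("Good", "Good", "Ideal", "Ideal", "Ideal"),
  price = c(1000, 3000, NA, 500, 1500),
  stringsAsFactors = FALSE
)

by_cut <- toy %>%
  group_by(cut) %>%
  summarize(
    n          = n(),                       # rows per group, NA included
    mean_price = mean(price, na.rm = TRUE), # NA dropped before averaging
    .groups    = "drop"                     # return an ungrouped result
  )
```

Without `na.rm = TRUE` the mean for the Ideal group would be NA, because one of its prices is missing.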
8 Rewrite the code below using the %>% (pipe) operator
> p1 <- filter(diamonds,carat >=0.5,price >=3000)
> p1
carat cut color clarity depth table price x y z
1 1.10 Very Good I SI1 61.2 61 4640 6.61 6.66 4.01
2 0.70 Ideal G VS1 60.8 56 3300 5.73 5.80 3.51
3 0.83 Good E SI1 63.7 59 3250 5.95 5.89 3.77
> p2 <- group_by(p1,cut)
> p2
# A tibble: 3 x 10
# Groups: cut [3]
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.1 Very Good I SI1 61.2 61 4640 6.61 6.66 4.01
2 0.7 Ideal G VS1 60.8 56 3300 5.73 5.8 3.51
3 0.83 Good E SI1 63.7 59 3250 5.95 5.89 3.77
> p3 <- filter(p2,cut=='Premium')
> p3
# A tibble: 0 x 10
# Groups: cut [0]
# ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>, depth <dbl>,
# table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
> ggplot(p3,aes(carat,price))+
+ geom_point(aes(color=cut),size=2)
Using the pipe operator
> diamonds %>%
+ filter(carat >=0.5,price >=3000) %>%
+ group_by(cut) %>%
+ filter(cut=='Premium') %>%
+ ggplot(aes(carat,price))+
+ geom_point(aes(color=cut),size=2)
>
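> # Both versions give the same result; %>% removes the intermediate variables and makes the code easier to read
> # diamonds %>%
> # filter(.,carat >=0.5,price >=3000)
> # The pipe passes the value on the left of %>% to the . on the right; when . is the first argument it is usually omitted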
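The equivalence described in these comments can be checked directly. In this sketch `toy` is a hypothetical data frame; the three variants compute the same row count with intermediate variables, with an implicit first argument, and with the `.` placeholder written out.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  carat = c(0.31, 1.10, 0.70),
  price = c(713L, 4640L, 3300L)
)

# Step by step, with an intermediate variable
kept    <- filter(toy, carat >= 0.5, price >= 3000)
n_steps <- nrow(kept)

# Piped: the left-hand side fills the first argument implicitly
n_piped <- toy %>%
  filter(carat >= 0.5, price >= 3000) %>%
  nrow()

# Piped: the . placeholder written out explicitly
n_dot <- toy %>%
  filter(., carat >= 0.5, price >= 3000) %>%
  nrow(.)
```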
9 Use count() to count how many times each cut value occurs
> diamonds %>% count(cut)
cut n
1 Good 2
2 Very Good 2
3 Ideal 6
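count() can be read as shorthand for group_by() followed by summarize(n = n()). The sketch below compares the two on a hypothetical `toy` data frame and also shows `sort = TRUE`, which puts the most frequent value first.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  cut = c("Ideal", "Good", "Ideal", "Very Good", "Ideal", "Good"),
  stringsAsFactors = FALSE
)

counted <- toy %>% count(cut)                 # one row per distinct value
manual  <- toy %>%
  group_by(cut) %>%
  summarize(n = n(), .groups = "drop")        # the long form of the same count
sorted  <- toy %>% count(cut, sort = TRUE)    # largest group first
```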
10 Check whether there is a 1-carat diamond priced at 5000
> filter(diamonds,carat == 1,price == 5000)
[1] carat cut color clarity depth table price x y z
<0 rows> (or 0-length row.names)
> # Zero rows are returned, so no such diamond exists in this 10-row sample
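One caveat with `==` on a numeric column such as carat: floating-point values that were computed rather than typed may not compare exactly equal. dplyr provides near() for a tolerance-based comparison. The sketch below uses a hypothetical `toy` data frame to show the difference and to turn the "does such a row exist" question into a single TRUE/FALSE.

```r
library(dplyr)

# Hypothetical toy data; 0.1 + 0.2 is not stored as exactly 0.3
toy <- data.frame(
  carat = c(0.1 + 0.2, 1.0),
  price = c(5000L, 5000L)
)

exact    <- filter(toy, carat == 0.3)        # exact comparison misses the row
tolerant <- filter(toy, near(carat, 0.3))    # comparison within a small tolerance

# Existence check as a single logical value
has_match <- nrow(filter(toy, near(carat, 1), price == 5000)) > 0
```

For values read from a file, as in diamonds, `carat == 1` usually behaves as expected; near() matters mainly after arithmetic on the column.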