Hassan Kibirige開(kāi)發(fā)的plydata包借鑒了R dplyr包的語(yǔ)法以及實(shí)現(xiàn)了管道函數(shù)%>%,并且與該作者另一個(gè)包plotnine(類(lèi)ggplot2語(yǔ)法繪圖工具包),本文在于對(duì)plydata包語(yǔ)法的入門(mén)以及與dplyr包的對(duì)比,讓對(duì)R熟悉的同學(xué)(比如我)更好的在python進(jìn)行數(shù)據(jù)處理工作。
首先導(dǎo)入包以及數(shù)據(jù)集,數(shù)據(jù)集使用經(jīng)典的nycflights13中的flights,這個(gè)數(shù)據(jù)集在python和R中均有包直接調(diào)用
# in R
library(dplyr)
library(nycflights13)
> head(flights)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0
6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58
# ... with 1 more variable: time_hour <dttm>
# in Python
import pandas as pd
import numpy as np
from plydata import *
from nycflights13 import flights
flights.head(6)
Out[]:
year month day dep_time ... distance hour minute time_hour
0 2013 1 1 517.0 ... 1400 5 15 2013-01-01T10:00:00Z
1 2013 1 1 533.0 ... 1416 5 29 2013-01-01T10:00:00Z
2 2013 1 1 542.0 ... 1089 5 40 2013-01-01T10:00:00Z
3 2013 1 1 544.0 ... 1576 5 45 2013-01-01T10:00:00Z
4 2013 1 1 554.0 ... 762 6 0 2013-01-01T11:00:00Z
5 2013 1 1 554.0 ... 719 5 58 2013-01-01T10:00:00Z
[6 rows x 19 columns]
Single Table Verbs
dplyr與plydata的數(shù)據(jù)處理邏輯為提供一個(gè)函數(shù),僅僅執(zhí)行一種Verb數(shù)據(jù)處理(data manipulation)。所有的Verbs可以歸于以下三類(lèi):
- 行
-
filter()根據(jù)條件篩選行 注:python中為query() -
slice()根據(jù)位置選擇行 -
arrange()排列行
-
- 列
-
select()選擇制定列 -
rename()修改列名 -
mutate()創(chuàng)建新列 -
relocate()修改列順序
-
- 行聚合
-
summarise()根據(jù)運(yùn)算方式聚合為一行
-
Pipe - 管道操作
所有的dplyr與plydata函數(shù)均取data frame為第一個(gè)參數(shù),所以與之每個(gè)函數(shù)中放入data frame,管道操作提供了一個(gè)數(shù)據(jù)處理的流水線,通過(guò)%>%和>>符號(hào)實(shí)現(xiàn)。
filter()與query()
- 選取數(shù)據(jù)集中飛行距離小于50英里的航班
# in R
> flights %>% filter(distance < 50)
# A tibble: 1 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 7 27 NA 106 NA NA 245 NA US 1632 NA EWR LGA NA 17 1 6
# ... with 1 more variable: time_hour <dttm>
# in Python
flights >> query('distance < 50')
Out[]:
year month day ... hour minute time_hour
275945 2013 7 27 ... 1 6 2013-07-27T05:00:00Z
[1 rows x 19 columns]
由結(jié)果可知,2013年唯一一個(gè)飛行距離低于50英里的航班是從紐約NWR紐瓦克機(jī)場(chǎng)飛到紐約LGA拉瓜地機(jī)場(chǎng)
- 選取所有1月1日從LGA機(jī)場(chǎng)起飛至DTW機(jī)場(chǎng)的航班
# in R
> flights %>% filter(month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW')
# A tibble: 12 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 602 605 -3 821 805 16 MQ 4401 N730MQ LGA DTW 105 502 6 5
2 2013 1 1 659 705 -6 907 913 -6 DL 831 N998DL LGA DTW 105 502 7 5
3 2013 1 1 857 905 -8 1107 1120 -13 DL 181 N321NB LGA DTW 110 502 9 5
4 2013 1 1 858 900 -2 1102 1110 -8 MQ 4478 N737MQ LGA DTW 103 502 9 0
5 2013 1 1 1048 1050 -2 1302 1250 12 MQ 4589 N537MQ LGA DTW 112 502 10 50
6 2013 1 1 1113 1115 -2 1318 1315 3 DL 1031 N320NB LGA DTW 104 502 11 15
7 2013 1 1 1231 1238 -7 1449 1446 3 DL 1131 N920DL LGA DTW 108 502 12 38
8 2013 1 1 1305 1315 -10 1523 1520 3 MQ 4564 N725MQ LGA DTW 102 502 13 15
9 2013 1 1 1440 1440 0 1658 1643 15 DL 1231 N926DL LGA DTW 94 502 14 40
10 2013 1 1 1640 1640 0 1859 1850 9 MQ 4540 N723MQ LGA DTW 106 502 16 40
11 2013 1 1 1739 1745 -6 1956 1953 3 DL 2331 N965DL LGA DTW 93 502 17 45
12 2013 1 1 2053 2055 -2 2254 2250 4 MQ 4573 N730MQ LGA DTW 102 502 20 55
# ... with 1 more variable: time_hour <dttm>
# in Python
flights >> query("month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW'")
Out[490]:
year month day dep_time ... distance hour minute time_hour
21 2013 1 1 602.0 ... 502 6 5 2013-01-01T11:00:00Z
66 2013 1 1 659.0 ... 502 7 5 2013-01-01T12:00:00Z
163 2013 1 1 857.0 ... 502 9 5 2013-01-01T14:00:00Z
166 2013 1 1 858.0 ... 502 9 0 2013-01-01T14:00:00Z
251 2013 1 1 1048.0 ... 502 10 50 2013-01-01T15:00:00Z
267 2013 1 1 1113.0 ... 502 11 15 2013-01-01T16:00:00Z
324 2013 1 1 1231.0 ... 502 12 38 2013-01-01T17:00:00Z
355 2013 1 1 1305.0 ... 502 13 15 2013-01-01T18:00:00Z
419 2013 1 1 1440.0 ... 502 14 40 2013-01-01T19:00:00Z
558 2013 1 1 1640.0 ... 502 16 40 2013-01-01T21:00:00Z
612 2013 1 1 1739.0 ... 502 17 45 2013-01-01T22:00:00Z
782 2013 1 1 2053.0 ... 502 20 55 2013-01-02T01:00:00Z
[12 rows x 19 columns]
...
[44 rows x 19 columns]
python的plydata包目前無(wú)法使用
,作為多個(gè)條件中的分割,但是可以使用&and代替
arrange()
arrange()可以通過(guò)一個(gè)或多個(gè)字段進(jìn)行排序
- 對(duì)所有1月1日從LGA機(jī)場(chǎng)起飛至DTW機(jī)場(chǎng)的航班,按照延誤時(shí)間排序,延誤時(shí)間相同的話按照起飛時(shí)間逆序排序
# in R
> flights %>% filter(month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW') %>% arrange(arr_delay, dep_time)
# A tibble: 12 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 857 905 -8 1107 1120 -13 DL 181 N321NB LGA DTW 110 502 9 5
2 2013 1 1 858 900 -2 1102 1110 -8 MQ 4478 N737MQ LGA DTW 103 502 9 0
3 2013 1 1 659 705 -6 907 913 -6 DL 831 N998DL LGA DTW 105 502 7 5
4 2013 1 1 1113 1115 -2 1318 1315 3 DL 1031 N320NB LGA DTW 104 502 11 15
5 2013 1 1 1231 1238 -7 1449 1446 3 DL 1131 N920DL LGA DTW 108 502 12 38
6 2013 1 1 1305 1315 -10 1523 1520 3 MQ 4564 N725MQ LGA DTW 102 502 13 15
7 2013 1 1 1739 1745 -6 1956 1953 3 DL 2331 N965DL LGA DTW 93 502 17 45
8 2013 1 1 2053 2055 -2 2254 2250 4 MQ 4573 N730MQ LGA DTW 102 502 20 55
9 2013 1 1 1640 1640 0 1859 1850 9 MQ 4540 N723MQ LGA DTW 106 502 16 40
10 2013 1 1 1048 1050 -2 1302 1250 12 MQ 4589 N537MQ LGA DTW 112 502 10 50
11 2013 1 1 1440 1440 0 1658 1643 15 DL 1231 N926DL LGA DTW 94 502 14 40
12 2013 1 1 602 605 -3 821 805 16 MQ 4401 N730MQ LGA DTW 105 502 6 5
# ... with 1 more variable: time_hour <dttm>
# in Python
flights >> query("month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW'") >> arrange("arr_delay", "-dep_time")
Out[]:
year month day dep_time ... distance hour minute time_hour
163 2013 1 1 857.0 ... 502 9 5 2013-01-01T14:00:00Z
166 2013 1 1 858.0 ... 502 9 0 2013-01-01T14:00:00Z
66 2013 1 1 659.0 ... 502 7 5 2013-01-01T12:00:00Z
612 2013 1 1 1739.0 ... 502 17 45 2013-01-01T22:00:00Z
355 2013 1 1 1305.0 ... 502 13 15 2013-01-01T18:00:00Z
324 2013 1 1 1231.0 ... 502 12 38 2013-01-01T17:00:00Z
267 2013 1 1 1113.0 ... 502 11 15 2013-01-01T16:00:00Z
782 2013 1 1 2053.0 ... 502 20 55 2013-01-02T01:00:00Z
558 2013 1 1 1640.0 ... 502 16 40 2013-01-01T21:00:00Z
251 2013 1 1 1048.0 ... 502 10 50 2013-01-01T15:00:00Z
419 2013 1 1 1440.0 ... 502 14 40 2013-01-01T19:00:00Z
21 2013 1 1 602.0 ... 502 6 5 2013-01-01T11:00:00Z
[12 rows x 19 columns]
slice()與slice_rows()
slice()與slice_rows()可以通過(guò)行號(hào)來(lái)選取行
# in R
> flights %>% slice(1:5)
# A tibble: 5 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0
# ... with 1 more variable: time_hour <dttm>
# in Python
> flights %>% slice(1:5)
# A tibble: 5 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0
# ... with 1 more variable: time_hour <dttm>
python中的計(jì)數(shù)項(xiàng)是從0開(kāi)始,R中是從1開(kāi)始
與dplyr中slice_head()和slice_tail()對(duì)應(yīng)的函數(shù)是head()和tai()
select()
select()提供選取某列的功能,基本等同于sql中的select
# in R
> flights %>% slice(1:5) %>% select(origin, dest, distance)
# A tibble: 5 x 3
origin dest distance
<chr> <chr> <dbl>
1 EWR IAH 1400
2 LGA IAH 1416
3 JFK MIA 1089
4 JFK BQN 1576
5 LGA ATL 762
# in Python
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance')
Out[509]:
origin dest distance
0 EWR IAH 1400
1 LGA IAH 1416
2 JFK MIA 1089
3 JFK BQN 1576
4 LGA ATL 762
添加end_with/endswith可以截取以特定字符結(jié)尾的列
# in R
> flights %>% slice(1:5) %>% select(ends_with('time'))
# A tibble: 5 x 5
dep_time sched_dep_time arr_time sched_arr_time air_time
<int> <int> <int> <int> <dbl>
1 517 515 830 819 227
2 533 529 850 830 227
3 542 540 923 850 160
4 544 545 1004 1022 183
5 554 600 812 837 116
# in Python
flights >> slice_rows(0, 5) >> select(endswith = 'time')
Out[511]:
dep_time sched_dep_time arr_time sched_arr_time air_time
0 517.0 515 830.0 819 227.0
1 533.0 529 850.0 830 227.0
2 542.0 540 923.0 850 160.0
3 544.0 545 1004.0 1022 183.0
4 554.0 600 812.0 837 116.0
mutate()
mutate()可以通過(guò)運(yùn)算增加新列
# in R
> flights %>% slice(1:5) %>% select(origin, dest, distance) %>% mutate(distance_in_km = distance/0.6213712 )
# A tibble: 5 x 4
origin dest distance distance_in_km
<chr> <chr> <dbl> <dbl>
1 EWR IAH 1400 2253.
2 LGA IAH 1416 2279.
3 JFK MIA 1089 1753.
4 JFK BQN 1576 2536.
5 LGA ATL 762 1226.
# in Python
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance') >> mutate(distance_in_km = 'distance/0.6213712')
Out[]:
origin dest distance distance_in_km
0 EWR IAH 1400 2253.081572
1 LGA IAH 1416 2278.831076
2 JFK MIA 1089 1752.575594
3 JFK BQN 1576 2536.326112
4 LGA ATL 762 1226.320113
plydata包同樣支持使用自定義函數(shù)、以及Numpy、Pandas函數(shù)來(lái)創(chuàng)建新列
miles_to_km = lambda x: np.ceil(x/0.6213712)
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance') >> mutate(distance_in_km = 'miles_to_km(distance)')
Out[519]:
origin dest distance distance_in_km
0 EWR IAH 1400 2254.0
1 LGA IAH 1416 2279.0
2 JFK MIA 1089 1753.0
3 JFK BQN 1576 2537.0
4 LGA ATL 762 1227.0
summarise() 和 group_by()
summarise()和group_by()可以實(shí)現(xiàn)分組聚合
- 選取1月1日的各機(jī)場(chǎng)起飛的航班數(shù)
# in R
> flights %>% filter(month == 1 & day == 1) %>% group_by(origin) %>% summarise(count = n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
origin count
<chr> <int>
1 EWR 305
2 JFK 297
3 LGA 240
# in Python
flights >> query("month == 1 & day == 1") >> group_by('origin') >> summarise(count = 'n()')
Out[]:
origin count
0 EWR 305
1 LGA 240
2 JFK 297
n() 返回值為該類(lèi)的長(zhǎng)度
總結(jié)
plydata的完成度目前來(lái)說(shuō)已經(jīng)很高了,大部分dplyr的功能都被完美移植了過(guò)來(lái),并且與另外一個(gè)R移植包plotnine(ggplot2移植)很高的聯(lián)動(dòng),應(yīng)該會(huì)是筆者日后Python常用的包之一。
在這里感謝作者Hassan Kibirige,下篇文章會(huì)講解他的另外一個(gè)包,也就是剛才提到大名鼎鼎的ggplot2在python中的移植版 - plotnine。
reference
Hassan Kibirige github主頁(yè) - https://github.com/has2k1
plydata github主頁(yè) - https://github.com/has2k1/plydata
plydata完整文檔 - https://plydata.readthedocs.io/en/latest/index.html
dplyr github主頁(yè) - https://github.com/tidyverse/dplyr