plydata包 - Python中的dplyr Pipe(管道)數(shù)據(jù)處理方式

Hassan Kibirige開(kāi)發(fā)的plydata包借鑒了R dplyr包的語(yǔ)法以及實(shí)現(xiàn)了管道函數(shù)%>%,并且與該作者另一個(gè)包plotnine(類(lèi)ggplot2語(yǔ)法繪圖工具包),本文在于對(duì)plydata包語(yǔ)法的入門(mén)以及與dplyr包的對(duì)比,讓對(duì)R熟悉的同學(xué)(比如我)更好的在python進(jìn)行數(shù)據(jù)處理工作。

首先導(dǎo)入包以及數(shù)據(jù)集,數(shù)據(jù)集使用經(jīng)典的nycflights13中的flights,這個(gè)數(shù)據(jù)集在python和R中均有包直接調(diào)用

# in R
library(dplyr)
library(nycflights13)

> head(flights)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150      719     5     58
# ... with 1 more variable: time_hour <dttm>
# in Python
import pandas as pd
import numpy as np
from plydata import *
from nycflights13 import flights

flights.head(6)
Out[]: 
   year  month  day  dep_time  ...  distance  hour  minute             time_hour
0  2013      1    1     517.0  ...      1400     5      15  2013-01-01T10:00:00Z
1  2013      1    1     533.0  ...      1416     5      29  2013-01-01T10:00:00Z
2  2013      1    1     542.0  ...      1089     5      40  2013-01-01T10:00:00Z
3  2013      1    1     544.0  ...      1576     5      45  2013-01-01T10:00:00Z
4  2013      1    1     554.0  ...       762     6       0  2013-01-01T11:00:00Z
5  2013      1    1     554.0  ...       719     5      58  2013-01-01T10:00:00Z

[6 rows x 19 columns]

Single Table Verbs

dplyr與plydata的數(shù)據(jù)處理邏輯為提供一個(gè)函數(shù),僅僅執(zhí)行一種Verb數(shù)據(jù)處理(data manipulation)。所有的Verbs可以歸于以下三類(lèi):

    • filter() 根據(jù)條件篩選行 注:python中為query()
    • slice() 根據(jù)位置選擇行
    • arrange() 排列行
    • select() 選擇制定列
    • rename() 修改列名
    • mutate() 創(chuàng)建新列
    • relocate() 修改列順序
  • 行聚合
    • summarise() 根據(jù)運(yùn)算方式聚合為一行

Pipe - 管道操作

所有的dplyr與plydata函數(shù)均取data frame為第一個(gè)參數(shù),所以與之每個(gè)函數(shù)中放入data frame,管道操作提供了一個(gè)數(shù)據(jù)處理的流水線,通過(guò)%>%>>符號(hào)實(shí)現(xiàn)。

filter()與query()

  • 選取數(shù)據(jù)集中飛行距離小于50英里的航班
# in R
> flights %>% filter(distance < 50)
# A tibble: 1 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
1  2013     7    27       NA            106        NA       NA            245        NA US        1632 NA      EWR    LGA         NA       17     1      6
# ... with 1 more variable: time_hour <dttm>
# in Python
flights >> query('distance < 50')
Out[]: 
        year  month  day  ...  hour  minute             time_hour
275945  2013      7   27  ...     1       6  2013-07-27T05:00:00Z

[1 rows x 19 columns]

由結(jié)果可知,2013年唯一一個(gè)飛行距離低于50英里的航班是從紐約NWR紐瓦克機(jī)場(chǎng)飛到紐約LGA拉瓜地機(jī)場(chǎng)

  • 選取所有1月1日從LGA機(jī)場(chǎng)起飛至DTW機(jī)場(chǎng)的航班
# in R
> flights %>% filter(month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW')
# A tibble: 12 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
 1  2013     1     1      602            605        -3      821            805        16 MQ        4401 N730MQ  LGA    DTW        105      502     6      5
 2  2013     1     1      659            705        -6      907            913        -6 DL         831 N998DL  LGA    DTW        105      502     7      5
 3  2013     1     1      857            905        -8     1107           1120       -13 DL         181 N321NB  LGA    DTW        110      502     9      5
 4  2013     1     1      858            900        -2     1102           1110        -8 MQ        4478 N737MQ  LGA    DTW        103      502     9      0
 5  2013     1     1     1048           1050        -2     1302           1250        12 MQ        4589 N537MQ  LGA    DTW        112      502    10     50
 6  2013     1     1     1113           1115        -2     1318           1315         3 DL        1031 N320NB  LGA    DTW        104      502    11     15
 7  2013     1     1     1231           1238        -7     1449           1446         3 DL        1131 N920DL  LGA    DTW        108      502    12     38
 8  2013     1     1     1305           1315       -10     1523           1520         3 MQ        4564 N725MQ  LGA    DTW        102      502    13     15
 9  2013     1     1     1440           1440         0     1658           1643        15 DL        1231 N926DL  LGA    DTW         94      502    14     40
10  2013     1     1     1640           1640         0     1859           1850         9 MQ        4540 N723MQ  LGA    DTW        106      502    16     40
11  2013     1     1     1739           1745        -6     1956           1953         3 DL        2331 N965DL  LGA    DTW         93      502    17     45
12  2013     1     1     2053           2055        -2     2254           2250         4 MQ        4573 N730MQ  LGA    DTW        102      502    20     55
# ... with 1 more variable: time_hour <dttm>
# in Python
flights >> query("month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW'")
Out[490]: 
     year  month  day  dep_time  ...  distance  hour  minute             time_hour
21   2013      1    1     602.0  ...       502     6       5  2013-01-01T11:00:00Z
66   2013      1    1     659.0  ...       502     7       5  2013-01-01T12:00:00Z
163  2013      1    1     857.0  ...       502     9       5  2013-01-01T14:00:00Z
166  2013      1    1     858.0  ...       502     9       0  2013-01-01T14:00:00Z
251  2013      1    1    1048.0  ...       502    10      50  2013-01-01T15:00:00Z
267  2013      1    1    1113.0  ...       502    11      15  2013-01-01T16:00:00Z
324  2013      1    1    1231.0  ...       502    12      38  2013-01-01T17:00:00Z
355  2013      1    1    1305.0  ...       502    13      15  2013-01-01T18:00:00Z
419  2013      1    1    1440.0  ...       502    14      40  2013-01-01T19:00:00Z
558  2013      1    1    1640.0  ...       502    16      40  2013-01-01T21:00:00Z
612  2013      1    1    1739.0  ...       502    17      45  2013-01-01T22:00:00Z
782  2013      1    1    2053.0  ...       502    20      55  2013-01-02T01:00:00Z

[12 rows x 19 columns]
...
[44 rows x 19 columns]

python的plydata包目前無(wú)法使用,作為多個(gè)條件中的分割,但是可以使用& and代替

arrange()

arrange()可以通過(guò)一個(gè)或多個(gè)字段進(jìn)行排序

  • 對(duì)所有1月1日從LGA機(jī)場(chǎng)起飛至DTW機(jī)場(chǎng)的航班,按照延誤時(shí)間排序,延誤時(shí)間相同的話按照起飛時(shí)間逆序排序
# in R
> flights %>% filter(month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW') %>% arrange(arr_delay, dep_time) 
# A tibble: 12 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
 1  2013     1     1      857            905        -8     1107           1120       -13 DL         181 N321NB  LGA    DTW        110      502     9      5
 2  2013     1     1      858            900        -2     1102           1110        -8 MQ        4478 N737MQ  LGA    DTW        103      502     9      0
 3  2013     1     1      659            705        -6      907            913        -6 DL         831 N998DL  LGA    DTW        105      502     7      5
 4  2013     1     1     1113           1115        -2     1318           1315         3 DL        1031 N320NB  LGA    DTW        104      502    11     15
 5  2013     1     1     1231           1238        -7     1449           1446         3 DL        1131 N920DL  LGA    DTW        108      502    12     38
 6  2013     1     1     1305           1315       -10     1523           1520         3 MQ        4564 N725MQ  LGA    DTW        102      502    13     15
 7  2013     1     1     1739           1745        -6     1956           1953         3 DL        2331 N965DL  LGA    DTW         93      502    17     45
 8  2013     1     1     2053           2055        -2     2254           2250         4 MQ        4573 N730MQ  LGA    DTW        102      502    20     55
 9  2013     1     1     1640           1640         0     1859           1850         9 MQ        4540 N723MQ  LGA    DTW        106      502    16     40
10  2013     1     1     1048           1050        -2     1302           1250        12 MQ        4589 N537MQ  LGA    DTW        112      502    10     50
11  2013     1     1     1440           1440         0     1658           1643        15 DL        1231 N926DL  LGA    DTW         94      502    14     40
12  2013     1     1      602            605        -3      821            805        16 MQ        4401 N730MQ  LGA    DTW        105      502     6      5
# ... with 1 more variable: time_hour <dttm>
# in Python
flights >> query("month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW'") >> arrange("arr_delay", "-dep_time") 
Out[]: 
     year  month  day  dep_time  ...  distance  hour  minute             time_hour
163  2013      1    1     857.0  ...       502     9       5  2013-01-01T14:00:00Z
166  2013      1    1     858.0  ...       502     9       0  2013-01-01T14:00:00Z
66   2013      1    1     659.0  ...       502     7       5  2013-01-01T12:00:00Z
612  2013      1    1    1739.0  ...       502    17      45  2013-01-01T22:00:00Z
355  2013      1    1    1305.0  ...       502    13      15  2013-01-01T18:00:00Z
324  2013      1    1    1231.0  ...       502    12      38  2013-01-01T17:00:00Z
267  2013      1    1    1113.0  ...       502    11      15  2013-01-01T16:00:00Z
782  2013      1    1    2053.0  ...       502    20      55  2013-01-02T01:00:00Z
558  2013      1    1    1640.0  ...       502    16      40  2013-01-01T21:00:00Z
251  2013      1    1    1048.0  ...       502    10      50  2013-01-01T15:00:00Z
419  2013      1    1    1440.0  ...       502    14      40  2013-01-01T19:00:00Z
21   2013      1    1     602.0  ...       502     6       5  2013-01-01T11:00:00Z

[12 rows x 19 columns]

slice()與slice_rows()

slice()與slice_rows()可以通過(guò)行號(hào)來(lái)選取行

# in R
> flights %>% slice(1:5)
# A tibble: 5 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
# ... with 1 more variable: time_hour <dttm>
# in Python
> flights %>% slice(1:5)
# A tibble: 5 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
# ... with 1 more variable: time_hour <dttm>

python中的計(jì)數(shù)項(xiàng)是從0開(kāi)始,R中是從1開(kāi)始
與dplyr中slice_head()和slice_tail()對(duì)應(yīng)的函數(shù)是head()和tai()

select()

select()提供選取某列的功能,基本等同于sql中的select

# in R
> flights %>% slice(1:5) %>% select(origin, dest, distance)
# A tibble: 5 x 3
  origin dest  distance
  <chr>  <chr>    <dbl>
1 EWR    IAH       1400
2 LGA    IAH       1416
3 JFK    MIA       1089
4 JFK    BQN       1576
5 LGA    ATL        762
# in Python
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance')
Out[509]: 
  origin dest  distance
0    EWR  IAH      1400
1    LGA  IAH      1416
2    JFK  MIA      1089
3    JFK  BQN      1576
4    LGA  ATL       762

添加end_with/endswith可以截取以特定字符結(jié)尾的列

# in R
> flights %>% slice(1:5) %>% select(ends_with('time'))
# A tibble: 5 x 5
  dep_time sched_dep_time arr_time sched_arr_time air_time
     <int>          <int>    <int>          <int>    <dbl>
1      517            515      830            819      227
2      533            529      850            830      227
3      542            540      923            850      160
4      544            545     1004           1022      183
5      554            600      812            837      116
# in Python
flights >> slice_rows(0, 5) >> select(endswith = 'time')
Out[511]: 
   dep_time  sched_dep_time  arr_time  sched_arr_time  air_time
0     517.0             515     830.0             819     227.0
1     533.0             529     850.0             830     227.0
2     542.0             540     923.0             850     160.0
3     544.0             545    1004.0            1022     183.0
4     554.0             600     812.0             837     116.0

mutate()

mutate()可以通過(guò)運(yùn)算增加新列

# in R
> flights %>% slice(1:5) %>% select(origin, dest, distance) %>% mutate(distance_in_km = distance/0.6213712 )
# A tibble: 5 x 4
  origin dest  distance distance_in_km
  <chr>  <chr>    <dbl>          <dbl>
1 EWR    IAH       1400          2253.
2 LGA    IAH       1416          2279.
3 JFK    MIA       1089          1753.
4 JFK    BQN       1576          2536.
5 LGA    ATL        762          1226.
# in Python
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance') >> mutate(distance_in_km = 'distance/0.6213712')
Out[]: 
  origin dest  distance  distance_in_km
0    EWR  IAH      1400     2253.081572
1    LGA  IAH      1416     2278.831076
2    JFK  MIA      1089     1752.575594
3    JFK  BQN      1576     2536.326112
4    LGA  ATL       762     1226.320113

plydata包同樣支持使用自定義函數(shù)、以及Numpy、Pandas函數(shù)來(lái)創(chuàng)建新列

miles_to_km = lambda x: np.ceil(x/0.6213712)
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance') >> mutate(distance_in_km = 'miles_to_km(distance)')
Out[519]: 
  origin dest  distance  distance_in_km
0    EWR  IAH      1400          2254.0
1    LGA  IAH      1416          2279.0
2    JFK  MIA      1089          1753.0
3    JFK  BQN      1576          2537.0
4    LGA  ATL       762          1227.0

summarise() 和 group_by()

summarise()和group_by()可以實(shí)現(xiàn)分組聚合

  • 選取1月1日的各機(jī)場(chǎng)起飛的航班數(shù)
# in R
> flights %>% filter(month == 1 & day == 1) %>% group_by(origin) %>% summarise(count = n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
  origin count
  <chr>  <int>
1 EWR      305
2 JFK      297
3 LGA      240
# in Python
flights >> query("month == 1 & day == 1") >> group_by('origin') >> summarise(count = 'n()')
Out[]: 
  origin  count
0    EWR    305
1    LGA    240
2    JFK    297

n() 返回值為該類(lèi)的長(zhǎng)度

總結(jié)

plydata的完成度目前來(lái)說(shuō)已經(jīng)很高了,大部分dplyr的功能都被完美移植了過(guò)來(lái),并且與另外一個(gè)R移植包plotnine(ggplot2移植)很高的聯(lián)動(dòng),應(yīng)該會(huì)是筆者日后Python常用的包之一。
在這里感謝作者Hassan Kibirige,下篇文章會(huì)講解他的另外一個(gè)包,也就是剛才提到大名鼎鼎的ggplot2在python中的移植版 - plotnine。

reference
Hassan Kibirige github主頁(yè) - https://github.com/has2k1
plydata github主頁(yè) - https://github.com/has2k1/plydata
plydata完整文檔 - https://plydata.readthedocs.io/en/latest/index.html
dplyr github主頁(yè) - https://github.com/tidyverse/dplyr

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容