Pandas透視表和交叉表

透視表和交叉表


實(shí)現(xiàn)數(shù)據(jù)分析指標(biāo)計(jì)算的常用操作:

交叉表 -> 透視表 -> 分組聚合 -> 自定義函數(shù)

  • 交叉表就是聚合函數(shù)是len個數(shù)的透視表
  • 透視表是由聚合函數(shù)是mean的分組旋轉(zhuǎn)而成
  • 分組聚合就是自定義函數(shù)的一種特定操作

越往底層書寫越難,應(yīng)用范圍越廣。越往上層書寫越簡單,應(yīng)用范圍越窄


透視表(pivot table)是各種電子表格程序和數(shù)據(jù)分析軟件中一種高級數(shù)據(jù)匯總表格形式

它根據(jù)一個或多個鍵對數(shù)據(jù)進(jìn)行聚合,并根據(jù)行和列上的分組鍵將數(shù)據(jù)分配到各個矩形區(qū)域中


會不會操作透視表,是衡量一個人能否做數(shù)據(jù)分析項(xiàng)目的基準(zhǔn)

  • 入門:用Pandas原生的pivot_table方法生成透視表
  • 進(jìn)階:使用groupby和unstack配合手動構(gòu)造透視表

常用的crosstab交叉表函數(shù)結(jié)構(gòu)

# 常見參數(shù):行索引,列索引,分項(xiàng)小計(jì)
pd.crosstab(tips.time, [tips.smoker, tips.day], margins=True)

常用的Pivot_table 透視表函數(shù)結(jié)構(gòu)

# 常見參數(shù):需要計(jì)算的列,行索引,列索引,分項(xiàng)小計(jì)(默認(rèn)False),自定義計(jì)算函數(shù)(默認(rèn)是mean),缺失值填充
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True, aggfunc=len, fill_value=0)

# 交叉表只要把 aggfunc參數(shù)值改為 len 即可

底層:使用分組聚合和軸向旋轉(zhuǎn)實(shí)現(xiàn)透視表

# 分組:行索引,列索引;均值聚合;行索引轉(zhuǎn)列索引;填充缺失值為0
tips.groupby(['time', 'day', 'smoker'])['size', 'tip_pct'].mean().unstack().fillna(0)

# 交叉表只要把聚合函數(shù)由.mean()改為.size()即可

In [1]:

import numpy as np
import pandas as pd

文件tips.csv

total_bill,tip,smoker,day,time,size
16.99,1.01,No,Sun,Dinner,2
10.34,1.66,No,Sun,Dinner,3
21.01,3.5,No,Sun,Dinner,3
23.68,3.31,No,Sun,Dinner,2
24.59,3.61,No,Sun,Dinner,4
25.29,4.71,No,Sun,Dinner,4
8.77,2.0,No,Sun,Dinner,2
26.88,3.12,No,Sun,Dinner,4
15.04,1.96,No,Sun,Dinner,2
14.78,3.23,No,Sun,Dinner,2
10.27,1.71,No,Sun,Dinner,2
35.26,5.0,No,Sun,Dinner,4
15.42,1.57,No,Sun,Dinner,2
18.43,3.0,No,Sun,Dinner,4
14.83,3.02,No,Sun,Dinner,2
21.58,3.92,No,Sun,Dinner,2
10.33,1.67,No,Sun,Dinner,3
16.29,3.71,No,Sun,Dinner,3
16.97,3.5,No,Sun,Dinner,3
20.65,3.35,No,Sat,Dinner,3
17.92,4.08,No,Sat,Dinner,2
20.29,2.75,No,Sat,Dinner,2
15.77,2.23,No,Sat,Dinner,2
39.42,7.58,No,Sat,Dinner,4
19.82,3.18,No,Sat,Dinner,2
17.81,2.34,No,Sat,Dinner,4
13.37,2.0,No,Sat,Dinner,2
12.69,2.0,No,Sat,Dinner,2
21.7,4.3,No,Sat,Dinner,2
19.65,3.0,No,Sat,Dinner,2
9.55,1.45,No,Sat,Dinner,2
18.35,2.5,No,Sat,Dinner,4
15.06,3.0,No,Sat,Dinner,2
20.69,2.45,No,Sat,Dinner,4
17.78,3.27,No,Sat,Dinner,2
24.06,3.6,No,Sat,Dinner,3
16.31,2.0,No,Sat,Dinner,3
16.93,3.07,No,Sat,Dinner,3
18.69,2.31,No,Sat,Dinner,3
31.27,5.0,No,Sat,Dinner,3
16.04,2.24,No,Sat,Dinner,3
17.46,2.54,No,Sun,Dinner,2
13.94,3.06,No,Sun,Dinner,2
9.68,1.32,No,Sun,Dinner,2
30.4,5.6,No,Sun,Dinner,4
18.29,3.0,No,Sun,Dinner,2
22.23,5.0,No,Sun,Dinner,2
32.4,6.0,No,Sun,Dinner,4
28.55,2.05,No,Sun,Dinner,3
18.04,3.0,No,Sun,Dinner,2
12.54,2.5,No,Sun,Dinner,2
10.29,2.6,No,Sun,Dinner,2
34.81,5.2,No,Sun,Dinner,4
9.94,1.56,No,Sun,Dinner,2
25.56,4.34,No,Sun,Dinner,4
19.49,3.51,No,Sun,Dinner,2
38.01,3.0,Yes,Sat,Dinner,4
26.41,1.5,No,Sat,Dinner,2
11.24,1.76,Yes,Sat,Dinner,2
48.27,6.73,No,Sat,Dinner,4
20.29,3.21,Yes,Sat,Dinner,2
13.81,2.0,Yes,Sat,Dinner,2
11.02,1.98,Yes,Sat,Dinner,2
18.29,3.76,Yes,Sat,Dinner,4
17.59,2.64,No,Sat,Dinner,3
20.08,3.15,No,Sat,Dinner,3
16.45,2.47,No,Sat,Dinner,2
3.07,1.0,Yes,Sat,Dinner,1
20.23,2.01,No,Sat,Dinner,2
15.01,2.09,Yes,Sat,Dinner,2
12.02,1.97,No,Sat,Dinner,2
17.07,3.0,No,Sat,Dinner,3
26.86,3.14,Yes,Sat,Dinner,2
25.28,5.0,Yes,Sat,Dinner,2
14.73,2.2,No,Sat,Dinner,2
10.51,1.25,No,Sat,Dinner,2
17.92,3.08,Yes,Sat,Dinner,2
27.2,4.0,No,Thur,Lunch,4
22.76,3.0,No,Thur,Lunch,2
17.29,2.71,No,Thur,Lunch,2
19.44,3.0,Yes,Thur,Lunch,2
16.66,3.4,No,Thur,Lunch,2
10.07,1.83,No,Thur,Lunch,1
32.68,5.0,Yes,Thur,Lunch,2
15.98,2.03,No,Thur,Lunch,2
34.83,5.17,No,Thur,Lunch,4
13.03,2.0,No,Thur,Lunch,2
18.28,4.0,No,Thur,Lunch,2
24.71,5.85,No,Thur,Lunch,2
21.16,3.0,No,Thur,Lunch,2
28.97,3.0,Yes,Fri,Dinner,2
22.49,3.5,No,Fri,Dinner,2
5.75,1.0,Yes,Fri,Dinner,2
16.32,4.3,Yes,Fri,Dinner,2
22.75,3.25,No,Fri,Dinner,2
40.17,4.73,Yes,Fri,Dinner,4
27.28,4.0,Yes,Fri,Dinner,2
12.03,1.5,Yes,Fri,Dinner,2
21.01,3.0,Yes,Fri,Dinner,2
12.46,1.5,No,Fri,Dinner,2
11.35,2.5,Yes,Fri,Dinner,2
15.38,3.0,Yes,Fri,Dinner,2
44.3,2.5,Yes,Sat,Dinner,3
22.42,3.48,Yes,Sat,Dinner,2
20.92,4.08,No,Sat,Dinner,2
15.36,1.64,Yes,Sat,Dinner,2
20.49,4.06,Yes,Sat,Dinner,2
25.21,4.29,Yes,Sat,Dinner,2
18.24,3.76,No,Sat,Dinner,2
14.31,4.0,Yes,Sat,Dinner,2
14.0,3.0,No,Sat,Dinner,2
7.25,1.0,No,Sat,Dinner,1
38.07,4.0,No,Sun,Dinner,3
23.95,2.55,No,Sun,Dinner,2
25.71,4.0,No,Sun,Dinner,3
17.31,3.5,No,Sun,Dinner,2
29.93,5.07,No,Sun,Dinner,4
10.65,1.5,No,Thur,Lunch,2
12.43,1.8,No,Thur,Lunch,2
24.08,2.92,No,Thur,Lunch,4
11.69,2.31,No,Thur,Lunch,2
13.42,1.68,No,Thur,Lunch,2
14.26,2.5,No,Thur,Lunch,2
15.95,2.0,No,Thur,Lunch,2
12.48,2.52,No,Thur,Lunch,2
29.8,4.2,No,Thur,Lunch,6
8.52,1.48,No,Thur,Lunch,2
14.52,2.0,No,Thur,Lunch,2
11.38,2.0,No,Thur,Lunch,2
22.82,2.18,No,Thur,Lunch,3
19.08,1.5,No,Thur,Lunch,2
20.27,2.83,No,Thur,Lunch,2
11.17,1.5,No,Thur,Lunch,2
12.26,2.0,No,Thur,Lunch,2
18.26,3.25,No,Thur,Lunch,2
8.51,1.25,No,Thur,Lunch,2
10.33,2.0,No,Thur,Lunch,2
14.15,2.0,No,Thur,Lunch,2
16.0,2.0,Yes,Thur,Lunch,2
13.16,2.75,No,Thur,Lunch,2
17.47,3.5,No,Thur,Lunch,2
34.3,6.7,No,Thur,Lunch,6
41.19,5.0,No,Thur,Lunch,5
27.05,5.0,No,Thur,Lunch,6
16.43,2.3,No,Thur,Lunch,2
8.35,1.5,No,Thur,Lunch,2
18.64,1.36,No,Thur,Lunch,3
11.87,1.63,No,Thur,Lunch,2
9.78,1.73,No,Thur,Lunch,2
7.51,2.0,No,Thur,Lunch,2
14.07,2.5,No,Sun,Dinner,2
13.13,2.0,No,Sun,Dinner,2
17.26,2.74,No,Sun,Dinner,3
24.55,2.0,No,Sun,Dinner,4
19.77,2.0,No,Sun,Dinner,4
29.85,5.14,No,Sun,Dinner,5
48.17,5.0,No,Sun,Dinner,6
25.0,3.75,No,Sun,Dinner,4
13.39,2.61,No,Sun,Dinner,2
16.49,2.0,No,Sun,Dinner,4
21.5,3.5,No,Sun,Dinner,4
12.66,2.5,No,Sun,Dinner,2
16.21,2.0,No,Sun,Dinner,3
13.81,2.0,No,Sun,Dinner,2
17.51,3.0,Yes,Sun,Dinner,2
24.52,3.48,No,Sun,Dinner,3
20.76,2.24,No,Sun,Dinner,2
31.71,4.5,No,Sun,Dinner,4
10.59,1.61,Yes,Sat,Dinner,2
10.63,2.0,Yes,Sat,Dinner,2
50.81,10.0,Yes,Sat,Dinner,3
15.81,3.16,Yes,Sat,Dinner,2
7.25,5.15,Yes,Sun,Dinner,2
31.85,3.18,Yes,Sun,Dinner,2
16.82,4.0,Yes,Sun,Dinner,2
32.9,3.11,Yes,Sun,Dinner,2
17.89,2.0,Yes,Sun,Dinner,2
14.48,2.0,Yes,Sun,Dinner,2
9.6,4.0,Yes,Sun,Dinner,2
34.63,3.55,Yes,Sun,Dinner,2
34.65,3.68,Yes,Sun,Dinner,4
23.33,5.65,Yes,Sun,Dinner,2
45.35,3.5,Yes,Sun,Dinner,3
23.17,6.5,Yes,Sun,Dinner,4
40.55,3.0,Yes,Sun,Dinner,2
20.69,5.0,No,Sun,Dinner,5
20.9,3.5,Yes,Sun,Dinner,3
30.46,2.0,Yes,Sun,Dinner,5
18.15,3.5,Yes,Sun,Dinner,3
23.1,4.0,Yes,Sun,Dinner,3
15.69,1.5,Yes,Sun,Dinner,2
19.81,4.19,Yes,Thur,Lunch,2
28.44,2.56,Yes,Thur,Lunch,2
15.48,2.02,Yes,Thur,Lunch,2
16.58,4.0,Yes,Thur,Lunch,2
7.56,1.44,No,Thur,Lunch,2
10.34,2.0,Yes,Thur,Lunch,2
43.11,5.0,Yes,Thur,Lunch,4
13.0,2.0,Yes,Thur,Lunch,2
13.51,2.0,Yes,Thur,Lunch,2
18.71,4.0,Yes,Thur,Lunch,3
12.74,2.01,Yes,Thur,Lunch,2
13.0,2.0,Yes,Thur,Lunch,2
16.4,2.5,Yes,Thur,Lunch,2
20.53,4.0,Yes,Thur,Lunch,4
16.47,3.23,Yes,Thur,Lunch,3
26.59,3.41,Yes,Sat,Dinner,3
38.73,3.0,Yes,Sat,Dinner,4
24.27,2.03,Yes,Sat,Dinner,2
12.76,2.23,Yes,Sat,Dinner,2
30.06,2.0,Yes,Sat,Dinner,3
25.89,5.16,Yes,Sat,Dinner,4
48.33,9.0,No,Sat,Dinner,4
13.27,2.5,Yes,Sat,Dinner,2
28.17,6.5,Yes,Sat,Dinner,3
12.9,1.1,Yes,Sat,Dinner,2
28.15,3.0,Yes,Sat,Dinner,5
11.59,1.5,Yes,Sat,Dinner,2
7.74,1.44,Yes,Sat,Dinner,2
30.14,3.09,Yes,Sat,Dinner,4
12.16,2.2,Yes,Fri,Lunch,2
13.42,3.48,Yes,Fri,Lunch,2
8.58,1.92,Yes,Fri,Lunch,1
15.98,3.0,No,Fri,Lunch,3
13.42,1.58,Yes,Fri,Lunch,2
16.27,2.5,Yes,Fri,Lunch,2
10.09,2.0,Yes,Fri,Lunch,2
20.45,3.0,No,Sat,Dinner,4
13.28,2.72,No,Sat,Dinner,2
22.12,2.88,Yes,Sat,Dinner,2
24.01,2.0,Yes,Sat,Dinner,4
15.69,3.0,Yes,Sat,Dinner,3
11.61,3.39,No,Sat,Dinner,2
10.77,1.47,No,Sat,Dinner,2
15.53,3.0,Yes,Sat,Dinner,2
10.07,1.25,No,Sat,Dinner,2
12.6,1.0,Yes,Sat,Dinner,2
32.83,1.17,Yes,Sat,Dinner,2
35.83,4.67,No,Sat,Dinner,3
29.03,5.92,No,Sat,Dinner,3
27.18,2.0,Yes,Sat,Dinner,2
22.67,2.0,Yes,Sat,Dinner,2
17.82,1.75,No,Sat,Dinner,2
18.78,3.0,No,Thur,Dinner,2

In [3]:

# 小費(fèi)數(shù)據(jù)集
tips = pd.read_csv('examples/tips.csv')

# 生成一個新指標(biāo),小費(fèi)占消費(fèi)總額的百分比
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.head()

Out[3]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

In [4]:

# 抽取一個小表計(jì)算:
t2 = tips[['day', 'time', 'tip']]
t2.head()

Out[4]:

day time tip
0 Sun Dinner 1.01
1 Sun Dinner 1.66
2 Sun Dinner 3.50
3 Sun Dinner 3.31
4 Sun Dinner 3.61

例子:每周各天(day)的午餐晚餐(time)小費(fèi)平均值(pivot_table的默認(rèn)聚合類型)

使用Pandas自帶透視表函數(shù) pivot_table 實(shí)現(xiàn)

In [10]:

# 透視表
t2.pivot_table('tip', index='day', columns='time')

# 聚合運(yùn)算列,行索引列,列索引列,缺失值填充
t2.pivot_table('tip', index='day', columns='time', fill_value=0)  # 填充缺失值為0

Out[10]:

time Dinner Lunch
day
Fri 2.940000 2.382857
Sat 2.993103 0.000000
Sun 3.255132 0.000000
Thur 3.000000 2.767705

使用原生使用分組聚合(groupby)和重塑(unstack)功能實(shí)現(xiàn)

In [15]:

t2.groupby(['day', 'time'])['tip'].mean().unstack().fillna(0)

Out[15]:

time Dinner Lunch
day
Fri 2.940000 2.382857
Sat 2.993103 0.000000
Sun 3.255132 0.000000
Thur 3.000000 2.767705

根據(jù)day和smoker計(jì)算分組平均數(shù),并將day和smoker放到行索引上

In [16]:

tips.head()

Out[16]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

In [17]:

# 透視表函數(shù)實(shí)現(xiàn)
tips.pivot_table(index=['day', 'smoker'])

Out[17]:

size tip tip_pct total_bill
day smoker
Fri No 2.250000 2.812500 0.151650 18.420000
Yes 2.066667 2.714000 0.174783 16.813333
Sat No 2.555556 3.102889 0.158048 19.661778
Yes 2.476190 2.875476 0.147906 21.276667
Sun No 2.929825 3.167895 0.160113 20.506667
Yes 2.578947 3.516842 0.187250 24.120000
Thur No 2.488889 2.673778 0.160298 17.113111
Yes 2.352941 3.030000 0.163863 19.190588

In [20]:

# 分組聚合實(shí)現(xiàn)
tips.groupby(['day', 'smoker']).mean()
tips.groupby(['day', 'smoker']).mean().sort_index(axis=1)  # 按列索引排序

Out[20]:

size tip tip_pct total_bill
day smoker
Fri No 2.250000 2.812500 0.151650 18.420000
Yes 2.066667 2.714000 0.174783 16.813333
Sat No 2.555556 3.102889 0.158048 19.661778
Yes 2.476190 2.875476 0.147906 21.276667
Sun No 2.929825 3.167895 0.160113 20.506667
Yes 2.578947 3.516842 0.187250 24.120000
Thur No 2.488889 2.673778 0.160298 17.113111
Yes 2.352941 3.030000 0.163863 19.190588

如果只想聚合tip_pct和size,而且想根據(jù)time進(jìn)行分組。再將smoker放到列索引上,把day放到行索引上

In [22]:

tips.head()

Out[22]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

In [24]:

# 透視表函數(shù)實(shí)現(xiàn)
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')

Out[24]:


image.png

In [27]:

# 分組聚合實(shí)現(xiàn)
# 使用groupby和unstack實(shí)現(xiàn)
# 更底層、更基礎(chǔ)的實(shí)現(xiàn)方式,透視表構(gòu)造的原理和過程一步一步的展現(xiàn)

# groupby參數(shù)是行索引,后面的參數(shù)是列索引
tips.groupby(['time', 'day', 'smoker'])['size', 'tip_pct'].mean().unstack()

Out[27]:


image.png

pivot_table 其他參數(shù)

傳入margins=True添加分項(xiàng)小計(jì)
這將會添加標(biāo)簽為All的行和列,其值對應(yīng)于單個等級中所有數(shù)據(jù)的分組統(tǒng)計(jì)
All值為平均數(shù):不單獨(dú)考慮煙民與非煙民(All列),不單獨(dú)考慮行分組兩個級別中的任何單項(xiàng)(All行)

In [29]:

t2.pivot_table('tip', index='day', columns='time')

Out[29]:

time Dinner Lunch
day
Fri 2.940000 2.382857
Sat 2.993103 NaN
Sun 3.255132 NaN
Thur 3.000000 2.767705

In [30]:

t2.pivot_table('tip', index='day', columns='time', margins=True)  # 分項(xiàng)小計(jì),平均值

Out[30]:

time Dinner Lunch All
day
Fri 2.940000 2.382857 2.734737
Sat 2.993103 NaN 2.993103
Sun 3.255132 NaN 3.255132
Thur 3.000000 2.767705 2.771452
All 3.102670 2.728088 2.998279

In [31]:

(2.940000 + 2.382857) / 2  # 對不上

Out[31]:

2.6614285

All統(tǒng)計(jì)求的是所有行或所有列的平均值,不是透視表那幾行幾列的平均值

In [32]:

(1+2+3+4+5)/5

Out[32]:

3.0

In [33]:

(1+2+3)/3

Out[33]:

2.0

In [34]:

(4+5)/2

Out[34]:

4.5

In [35]:

(2 + 4.5) / 2

Out[35]:

3.25

求所有dinner下tip的平均值

In [36]:

(2.940000 + 2.993103 + 3.255132 + 3.000000) / 4  # 對不上All

Out[36]:

3.04705875

In [37]:

t2.head()

Out[37]:

day time tip
0 Sun Dinner 1.01
1 Sun Dinner 1.66
2 Sun Dinner 3.50
3 Sun Dinner 3.31
4 Sun Dinner 3.61

In [42]:

# 求所有dinner下tip的平均值,和分項(xiàng)小計(jì)相等
t2[t2['time'] == 'Dinner']['tip'].mean()

Out[42]:

3.102670454545454

In [43]:

# 上例透視表,添加分項(xiàng)小計(jì)
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True)

Out[43]:

image.png

透視表默認(rèn)聚合函數(shù)為mean平均值

如想使用非默認(rèn)mean的其他的聚合函數(shù),傳給aggfunc即可(傳入函數(shù)名稱或函數(shù)字符串)

如使用count或len可得到有關(guān)分組大小的交叉表(計(jì)數(shù)或頻率)

傳入值類型,一般為 函數(shù)名字符串,函數(shù)名,numpy函數(shù)名:

len
'count'
np.max

In [48]:

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True, aggfunc=len)

Out[48]:

fill_value參數(shù)填充缺失值(NA)

In [50]:

f = tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True, aggfunc=len, fill_value=0)
f

Out[50]:

image.png

In [52]:

# astype修改數(shù)據(jù)類型:np.int np.float
f.astype(np.int)  # 改為整型

Out[52]:

image.png

交叉表:crosstab

交叉表(cross-tabulation,簡稱crosstab)是一種用于計(jì)算分組頻率的特殊透視表

透視表由是分組后聚合函數(shù)為平均值的表旋轉(zhuǎn)而成
交叉表就是聚合函數(shù)為個數(shù)的透視表

In [53]:

data = pd.DataFrame(
    {
        'Sample': np.arange(1,11),
        'Nationality': ["USA", "Japan", "USA", "Japan", "Japan", "Japan", "USA", "USA", "Japan", "USA"],
        'Handedness': ["Right-handed", "Left-handed", "Right-handed", "Right-handed", "Left-handed", "Right-handed", "Right-handed", "Left-handed", "Right-handed", "Right-handed"],
    },
    columns=['Sample', 'Nationality', 'Handedness']
)
data

Out[53]:

Sample Nationality Handedness
0 1 USA Right-handed
1 2 Japan Left-handed
2 3 USA Right-handed
3 4 Japan Right-handed
4 5 Japan Left-handed
5 6 Japan Right-handed
6 7 USA Right-handed
7 8 USA Left-handed
8 9 Japan Right-handed
9 10 USA Right-handed

In [56]:

# 抽取單列
data['Nationality']
data.loc[:, 'Nationality']
data.Nationality

Out[56]:

0      USA
1    Japan
2      USA
3    Japan
4    Japan
5    Japan
6      USA
7      USA
8    Japan
9      USA
Name: Nationality, dtype: object

In [57]:

# 交叉表
pd.crosstab(data.Nationality, data.Handedness)  # 1列行索引,1列列索引

Out[57]:

Handedness Left-handed Right-handed
Nationality
Japan 2 3
USA 1 4

In [58]:

# 添加分項(xiàng)小計(jì)
pd.crosstab(data.Nationality, data.Handedness, margins=True)

Out[58]:

Handedness Left-handed Right-handed All
Nationality
Japan 2 3 5
USA 1 4 5
All 3 7 10

In [64]:

# 使用透視表實(shí)現(xiàn)交叉表效果
data.pivot_table('Sample', index='Nationality', columns='Handedness', aggfunc=len, margins=True)  # 寫法1
pd.pivot_table(data, index='Nationality', columns='Handedness', aggfunc=len, margins=True)['Sample']  # 寫法2

Out[64]:

Handedness Left-handed Right-handed All
Nationality
Japan 2 3 5
USA 1 4 5
All 3 7 10

In [67]:

# 底層實(shí)現(xiàn):分組聚合和軸旋轉(zhuǎn)實(shí)現(xiàn)交叉表效果
y = data.groupby(['Nationality', 'Handedness']).size().unstack()
y

Out[67]:

Handedness Left-handed Right-handed
Nationality
Japan 2 3
USA 1 4

In [69]:

# 增加分項(xiàng)小計(jì)行
y.sum()

y.loc['All'] = y.sum()
y

Out[69]:

Handedness Left-handed Right-handed
Nationality
Japan 2 3
USA 1 4
All 3 7

In [71]:

# 增加分項(xiàng)小計(jì)列
y.sum(axis=1)

y['All'] = y.sum(axis=1)
y

Out[71]:

Handedness Left-handed Right-handed All
Nationality
Japan 2 3 5
USA 1 4 5
All 3 7 10

例子:小費(fèi)數(shù)據(jù) 交叉表學(xué)習(xí)

In [72]:

tips.head()

Out[72]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

統(tǒng)計(jì)顧客在 每種用餐時間、每個星期下 的 吸煙數(shù)量情況

行索引:time,day

列索引:smoker

使用交叉表方法實(shí)現(xiàn)

In [75]:

pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)

Out[75]:

image.png

In [77]:

pd.crosstab(tips.time, [tips.day, tips.smoker], margins=True)

Out[77]:

image.png

使用透視表方法實(shí)現(xiàn)

In [79]:

tips.head()

Out[79]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

In [80]:

pd.pivot_table(tips, index=['time', 'day'], columns='smoker', aggfunc=len, margins=True, fill_value=0)['size']

Out[80]:

image.png

使用分組聚合軸旋轉(zhuǎn)實(shí)現(xiàn)

In [84]:

tips.groupby(['time', 'day', 'smoker']).size().unstack().fillna(0).astype(np.int)

Out[84]:


image.png
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容