參考:https://datawhalechina.github.io/joyful-pandas/build/html/%E7%9B%AE%E5%BD%95/ch5.html#id2
一、長(zhǎng)寬表的變形(元素和列索引的轉(zhuǎn)換)
長(zhǎng)表:把性別Gender存在列名中(long狀態(tài))
pd.DataFrame({'Gender':['F','F','M','M'], 'Height':[163, 160, 175, 180]})
| Gender | Height | |
|---|---|---|
| 0 | F | 163 |
| 1 | F | 160 |
| 2 | M | 175 |
| 3 | M | 180 |
寬表:如果把性別作為列名,列中的元素是某一其他的相關(guān)特征數(shù)值,那么這個(gè)表是關(guān)于性別的寬表(wide狀態(tài))
pd.DataFrame({'Height: F':[163, 160], 'Height: M':[175, 180]})
| Height: F | Height: M | |
|---|---|---|
| 0 | 163 | 175 |
| 1 | 160 | 180 |
變形函數(shù):
| 分類(lèi) | 函數(shù) | 例子 | 備注 |
|---|---|---|---|
| 長(zhǎng)表轉(zhuǎn)寬表 | pivot | df.pivot(index='Name', columns='Subject', values='Grade') | 三個(gè)要素,分別是變形后的行索引、需要轉(zhuǎn)到列索引的列,以及這些列和行索引對(duì)應(yīng)的數(shù)值,它們分別對(duì)應(yīng)了pivot方法中的index, columns, values參數(shù)。<br />且行列組合需要唯一
|
| pivot_multi = df.pivot(index = ['Class', 'Name'],columns = ['Subject','Examination'],values = ['Grade','rank']) | 參數(shù)可以傳入列表,形成多級(jí)索引;仍然需要組合的唯一性 | ||
| pivot_table | df.pivot_table(index = 'Name', columns = 'Subject', values = 'Grade', aggfunc = 'mean') |
aggfunc 參數(shù)可以使用的聚合函數(shù),不需要組合的唯一性,因?yàn)槿绻嗤瑫?huì)被聚合起來(lái) |
|
| df.pivot_table(index = 'Name', columns = 'Subject', values = 'Grade', aggfunc='mean', margins=True) |
pivot_table具有邊際匯總的功能 | ||
| 寬表轉(zhuǎn)長(zhǎng)表 | melt | df_melted = df.melt(id_vars = ['Class', 'Name'], value_vars = ['Chinese', 'Math'], var_name = 'Subject', value_name = 'Grade') |
和pivot是互逆操作;只能合并一個(gè)元素 |
| wide_to_long | pd.wide_to_long(df, stubnames=['Chinese', 'Math'], i = ['Class', 'Name'], j='Examination', sep='_', suffix='.+') |
![]() ../_images/ch5_wtl.png
|
【練一練】
在上面的邊際匯總例子中,行或列的匯總為新表中行元素或者列元素的平均值,而總體的匯總為新表中四個(gè)元素的平均值。這種關(guān)系一定成立嗎?若不成立,請(qǐng)給出一個(gè)例子來(lái)說(shuō)明。
下面的代碼會(huì)報(bào)錯(cuò),size不能作為匯總函數(shù)
df.groupby(['Subject','Name']).size()
df.pivot_table(index = 'Name',
columns = 'Subject',
values = 'Grade',
aggfunc='size',
margins=True)
二、索引的變形(行列索引之間的交換)
| 函數(shù) | 例子 | 備注 |
|---|---|---|
| unstack | df.unstack() | 行索引轉(zhuǎn)為列索引 |
| df.unstack([0,2]) | 主要參數(shù)是移動(dòng)的層號(hào),默認(rèn)轉(zhuǎn)化最內(nèi)層,移動(dòng)到列索引的最內(nèi)層,同時(shí)支持同時(shí)轉(zhuǎn)化多個(gè)層;必須保證 被轉(zhuǎn)為列索引的行索引層 和 **被保留的行索引層 **構(gòu)成的組合是唯一的,否則會(huì)報(bào)錯(cuò) | |
| stack | df.stack([1, 2]) | 把列索引的層壓入行索引 |
三、其他變形函數(shù)
| 函數(shù) | 例子 | 備注 |
|---|---|---|
| crosstab | pd.crosstab(index = df.School, columns = df.Transfer) | 不是一個(gè)值得推薦使用的函數(shù);在默認(rèn)狀態(tài)下,可以統(tǒng)計(jì)元素組合出現(xiàn)的頻數(shù),即count操作 |
| pd.crosstab(index = df.School, columns = df.Transfer, values = [0]*df.shape[0], aggfunc = 'count') | 也可以做其他的聚合操作,values傳入要進(jìn)行agg計(jì)算的類(lèi)似序列的數(shù)據(jù) | |
| explode | df_ex = pd.DataFrame({'A': [[1, 2], 'my_str', {1, 2}, pd.Series([3, 4])], 'B': 1}) df_ex.explode('A') |
對(duì)某一列的元素進(jìn)行縱向的展開(kāi),被展開(kāi)的單元格必須存儲(chǔ)list, tuple, Series, np.ndarray中的一種類(lèi)型。 |
| get_dummies | pd.get_dummies(df.Grade).head() | 把類(lèi)別特征轉(zhuǎn)為指示變量;傳入series或dataframe |
練一練
前面提到了 crosstab 的性能劣于 pivot_table ,請(qǐng)選用多個(gè)聚合方法進(jìn)行驗(yàn)證。
1分別對(duì)learn_pandas表進(jìn)行pivot_table和crosstab的轉(zhuǎn)換,計(jì)算count和mean:
import datetime
beg_time = datetime.datetime.now()
df.pivot_table(index = 'School',
columns = 'Transfer',
values = 'Name',
aggfunc = 'count')
end_time = datetime.datetime.now()
print ("df.pivot_table:count", end_time - beg_time)
beg_time = datetime.datetime.now()
pd.crosstab(index = df.School, columns = df.Transfer, values = df.Height, aggfunc = 'count')
end_time = datetime.datetime.now()
print ("pd.crosstab:count", end_time - beg_time)
beg_time = datetime.datetime.now()
df.pivot_table(index = 'School',
columns = 'Transfer',
values = 'Name',
aggfunc = 'sum')
end_time = datetime.datetime.now()
print ("df.pivot_table:sum", end_time - beg_time)
beg_time = datetime.datetime.now()
pd.crosstab(index = df.School, columns = df.Transfer, values = df.Height, aggfunc = 'sum')
end_time = datetime.datetime.now()
print ("pd.crosstab:sum", end_time - beg_time)
結(jié)果:似乎沒(méi)有很慢?
'''
df.pivot_table:count 0:00:00.011967
pd.crosstab:count 0:00:00.009974
df.pivot_table:sum 0:00:00.013121
pd.crosstab:sum 0:00:00.010976
'''
2.是不是數(shù)據(jù)量不夠大,換一個(gè)大點(diǎn)的表試試:(31569條數(shù)據(jù)house_info.xls)
import datetime
print("index數(shù)量",df2.groupby(['city','rooms']).ngroups)
beg_time = datetime.datetime.now()
df2 = pd.read_excel("../data/house_info.xls")
df2.pivot_table(index = 'city',
columns = 'rooms',
values = 'area',
aggfunc = 'count')
end_time = datetime.datetime.now()
print ("df.pivot_table:count", end_time - beg_time)
beg_time = datetime.datetime.now()
pd.crosstab(index = df2.city, columns = df2.rooms, values = df2.area, aggfunc = 'count')
end_time = datetime.datetime.now()
print ("pd.crosstab:count", end_time - beg_time)
beg_time = datetime.datetime.now()
df2 = pd.read_excel("../data/house_info.xls")
df2['test'] = df2.index.values
df2.pivot_table(index = 'city',
columns = 'rooms',
values = 'test',
aggfunc = 'mean')
end_time = datetime.datetime.now()
print ("df.pivot_table:mean", end_time - beg_time)
beg_time = datetime.datetime.now()
pd.crosstab(index = df2.city, columns = df2.rooms, values = df2.test, aggfunc = 'mean')
end_time = datetime.datetime.now()
print ("pd.crosstab:mean", end_time - beg_time)
結(jié)果:crosstab還是更快一點(diǎn)
index數(shù)量 4336
df.pivot_table:count 0:00:00.885630
pd.crosstab:count 0:00:00.017953
df.pivot_table:mean 0:00:00.568480
pd.crosstab:mean 0:00:00.013963
3.換成3萬(wàn)條純數(shù)字的,index數(shù)量為669:
import datetime
from string import ascii_uppercase
data = np.random.randint(0, 26, size=(3000, 10))
cols = list(ascii_uppercase[:10])
np.random.seed(2)
df3 = pd.DataFrame(data, columns=cols)
print("index數(shù)量", df3.groupby(['A', 'B'])['C'].ngroups)
beg_time = datetime.datetime.now()
df3.groupby(['A', 'B'])['C'].count().unstack(fill_value=0)
end_time = datetime.datetime.now()
print ("groupby:count", end_time - beg_time)
beg_time = datetime.datetime.now()
df3.pivot_table(values='C', index='A', columns='B', aggfunc='count', fill_value=0)
end_time = datetime.datetime.now()
print ("pivot_table:count", end_time - beg_time)
beg_time = datetime.datetime.now()
pd.crosstab(index=df3.A, columns=df3.B)
end_time = datetime.datetime.now()
print ("crosstab:count", end_time - beg_time)
結(jié)果:crosstab確實(shí)是最慢的
index數(shù)量 669
groupby:count 0:00:00.002988
pivot_table:count 0:00:00.011968
crosstab:count 0:00:00.015994
4.換成3萬(wàn)條純字母的,index數(shù)量為676:
import datetime
from string import ascii_uppercase
data = np.random.choice(list(ascii_uppercase[:]), size=(30000, 10))
cols = list(ascii_uppercase[:10])
np.random.seed(2)
df3 = pd.DataFrame(data, columns=cols)
print("index數(shù)量",df3.groupby(['A', 'B'])['C'].ngroups)
beg_time = datetime.datetime.now()
df3.groupby(['A', 'B'])['C'].count().unstack(fill_value=0)
end_time = datetime.datetime.now()
print ("groupby:count", end_time - beg_time)
beg_time = datetime.datetime.now()
df3.pivot_table(values='C', index='A', columns='B', aggfunc='count', fill_value=0)
end_time = datetime.datetime.now()
print ("pivot_table:count", end_time - beg_time)
beg_time = datetime.datetime.now()
pd.crosstab(index=df3.A, columns=df3.B)
end_time = datetime.datetime.now()
print ("crosstab:count", end_time - beg_time)
結(jié)果:兩者差不多
index數(shù)量 676
#crosstab慢于pivot_table
groupby:count 0:00:00.006970
pivot_table:count 0:00:00.013941
crosstab:count 0:00:00.016956
#測(cè)試多次的其他輸出:crosstab快于pivot_table
index數(shù)量 676
groupby:count 0:00:00.016970
pivot_table:count 0:00:00.030903
crosstab:count 0:00:00.024946
結(jié)論:在index數(shù)量類(lèi)似的情況下,crosstab處理字符串的速度和pivot_table差不多(甚至有的時(shí)候要快),而crosstab處理數(shù)字的速度始終比pivot_table慢,隨著index數(shù)量的增長(zhǎng),差距越大
四、練習(xí)
Ex1:美國(guó)非法藥物數(shù)據(jù)集
現(xiàn)有一份關(guān)于美國(guó)非法藥物的數(shù)據(jù)集,其中 SubstanceName, DrugReports 分別指藥物名稱(chēng)和報(bào)告數(shù)量:
In [63]: df = pd.read_csv('data/drugs.csv').sort_values([
....: 'State','COUNTY','SubstanceName'],ignore_index=True)
....:
In [64]: df.head(3)
Out[64]:
YYYY State COUNTY SubstanceName DrugReports
0 2011 KY ADAIR Buprenorphine 3
1 2012 KY ADAIR Buprenorphine 5
2 2013 KY ADAIR Buprenorphine 4
- 將數(shù)據(jù)轉(zhuǎn)為如下的形式:

df2 = df.pivot_table(index=['State','COUNTY','SubstanceName'],columns=['YYYY'], values='DrugReports').reset_index().rename_axis(columns={'YYYY':''})
----------------------------
answer:
res = df.pivot(index=['State','COUNTY','SubstanceName'], columns='YYYY', values='DrugReports').reset_index().rename_axis(columns={'YYYY':''})
- 將第1問(wèn)中的結(jié)果恢復(fù)為原表。
注意最后要排序
df2.melt(id_vars=['State','COUNTY','SubstanceName'],
value_vars=[2010,2011,2012,2013,2014,2015,2016,2017],
var_name='YYYY',
value_name='DrugReports')
answer:
res_melted = res.melt(id_vars = ['State','COUNTY','SubstanceName'],
value_vars = res.columns[-8:],
var_name = 'YYYY',
value_name = 'DrugReports').dropna(
subset=['DrugReports'])
res_melted = res_melted[df.columns].sort_values(['State','COUNTY','SubstanceName'],ignore_index=True).astype({'YYYY':'int64', 'DrugReports':'int64'})
res_melted.equals(df)
- 按
State分別統(tǒng)計(jì)每年的報(bào)告數(shù)量總和,其中State, YYYY分別為列索引和行索引,要求分別使用pivot_table函數(shù)與groupby+unstack兩種不同的策略實(shí)現(xiàn),并體會(huì)它們之間的聯(lián)系。
df.pivot_table(values=['DrugReports'],index=['YYYY'],columns=['State'],aggfunc='sum')
df.groupby(['State','YYYY'])['DrugReports'].agg('sum').unstack([0])
answer:
res = df.groupby(['State', 'YYYY'])['DrugReports'].sum().to_frame().unstack(0).droplevel(0,axis=1)
Ex2:特殊的wide_to_long方法
從功能上看, melt 方法應(yīng)當(dāng)屬于 wide_to_long 的一種特殊情況,即 stubnames 只有一類(lèi)。請(qǐng)使用 wide_to_long 生成 melt 一節(jié)中的 df_melted 。(提示:對(duì)列名增加適當(dāng)?shù)那熬Y)
df = df.rename(columns={'Chinese':'Grade_Chinese','Math':'Grade_Math'})
pd.wide_to_long(df,
stubnames=['Grade'],
i=['Class','Name'],
j="Subject",
sep='_',
suffix='.+').reset_index().sort_values('Subject')
