數(shù)據(jù)分析和建模方面的大量編程工作都是用在數(shù)據(jù)準備上的:加載、清理、轉(zhuǎn)換、重塑。是因為,多數(shù)時候存放在文件或數(shù)據(jù)庫中的數(shù)據(jù)不能滿足你的數(shù)據(jù)處理應(yīng)用的要求。
7.1 Combining and Merging Datasets
Data contained in pandas objects can be combined in a number of built-in ways:
pandas.merge connects rows in different DataFrames based on one or more keys. This implements database join operations, familiar from SQL and other relational databases.
pandas.concat stacks together multiple objects along an axis.
The instance method combine_first splices together overlapping data, filling missing values in one object with values from another.
7.1.1 Database-Style DataFrame Merges
Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases. The merge function in pandas is the main entry point for applying these algorithms to your data.
In [38]: import pandas as pd
In [39]: from pandas import DataFrame
In [40]: df1=DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)})
In [41]: df2=DataFrame({'key':['b','b','d'],'data2':range(3)})
In [42]: df1
Out[42]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
In [43]: df2
Out[43]:
   data2 key
0      0   b
1      1   b
2      2   d
This is a many-to-one merge. Calling merge on these objects, we obtain:
In [44]: pd.merge(df1,df2)
Out[44]:
   data1 key  data2
0      0   b      0
1      0   b      1
2      1   b      0
3      1   b      1
4      6   b      0
5      6   b      1
Note that the code above did not specify which column to join on. If not specified, merge uses the overlapping column names as the keys. It is good practice to specify the key explicitly, though:
In [7]: pd.merge(df1,df2,on="key")
Out[7]:
   data1 key  data2
0      0   b      0
1      0   b      1
2      1   b      0
3      1   b      1
4      6   b      0
5      6   b      1
If the column names are different in each object, you can specify them separately:
In [8]: df3=DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)})
In [9]: df4=DataFrame({'rkey':['a','b','d'],'data2':range(3)})
In [10]: pd.merge(df3,df4,left_on='lkey',right_on='rkey')
Out[10]:
  lkey  data1  data2 rkey
0    b      0      1    b
1    b      1      1    b
2    b      6      1    b
3    a      2      0    a
4    a      4      0    a
5    a      5      0    a
在上述的結(jié)果中,C和d以及與之相關(guān)的數(shù)據(jù)不見了,默認下,merge做的是"inner"的連接;
結(jié)果中的鍵是交集,其他方式還有“l(fā)eft”,“right”以及“outer”。外連接求取的是鍵的并集,組合了左連接和右連接的效果:
Many-to-many merges are also straightforward and require no extra work, as shown here:
In [17]: df1=DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
In [19]: df2=DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})
In [20]: df1
Out[20]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b
In [21]: df2
Out[21]:
   data2 key
0      0   a
1      1   b
2      2   a
3      3   b
4      4   d
In [22]: pd.merge(df1,df2,on='key',how='left')
Out[22]:
    data1 key  data2
0       0   b    1.0
1       0   b    3.0
2       1   b    1.0
3       1   b    3.0
4       2   a    0.0
5       2   a    2.0
6       3   c    NaN
7       4   a    0.0
8       4   a    2.0
9       5   b    1.0
10      5   b    3.0
多對多連接產(chǎn)生的行就是笛卡體積,由于左邊的DataFrame有3個“b”行,右邊的有2個,最終結(jié)果就是6個“b”行。連接方式只影響出現(xiàn)結(jié)果中的鍵:
In [25]: pd.merge(df1,df2,how='inner')
Out[25]:
   data1 key  data2
0      0   b      1
1      0   b      3
2      1   b      1
3      1   b      3
4      5   b      1
5      5   b      3
6      2   a      0
7      2   a      2
8      4   a      0
9      4   a      2
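The Cartesian-product behavior described above can be verified by counting rows; a sketch that rebuilds df1 and df2 from this example:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

inner = pd.merge(df1, df2, on='key', how='inner')
# 3 'b' rows on the left x 2 on the right = 6; 2 'a' rows x 2 = 4
print((inner['key'] == 'b').sum(), (inner['key'] == 'a').sum())  # 6 4
```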
要根據(jù)多個鍵盤進行合并,傳入一個由列名組成的列表即可:
In [26]: left=DataFrame({'key1':['foo','foo','bar'],'key2':['one','two','one'],'lval':[1,2,3]})
In [27]: right=DataFrame({'key1':['foo','foo','bar','bar'],'key2':['one','one','one','two'],'rval':[4,5,5,7]})
In [28]: pd.merge(left,right,on=['key1','key2'],how='outer')
Out[28]:
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   5.0
4  bar  two   NaN   7.0
最后的結(jié)果會出現(xiàn)哪些鍵取決于所選的合并方式。
A last issue to consider in merge operations is the treatment of overlapping column names. merge has a useful suffixes option for specifying strings to append to the overlapping column names in the left and right DataFrame objects:
In [9]: pd.merge(left,right,on="key1")
Out[9]:
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     5
5  bar    one     3    two     7
In [10]: pd.merge(left,right,on="key1",suffixes=('_left','_right'))
Out[10]:
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     5
5  bar       one     3        two     7

7.1.2 Merging on Index
In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:
In [11]: left1=DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})
In [12]: right1=DataFrame({'group_val':[3.5,7]},index=['a','b'])
In [13]: left1
Out[13]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
In [14]: right1
Out[14]:
   group_val
a        3.5
b        7.0
In [15]: pd.merge(left1,right1,left_on='key',right_index=True)
Out[15]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
In [17]: pd.merge(left1,right1,left_on='key',right_index=True,how='outer')
Out[17]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
With hierarchically indexed data, things are a bit more complicated:
In [33]: lefth=DataFrame({'key1':['Ohio','Ohio','Ohio','Nevada','Nevada'],'key2':[2000,2001,2002,2001,2002],'data':np.arange(5.)})
In [37]: righth=DataFrame(np.arange(12).reshape((6,2)),index=[['Nevada','Nevada','Ohio','Ohio','Ohio','Ohio'],[2001,2000,2000,2000,2001,2002]],columns=['event1','event2'])
In [38]: lefth
Out[38]:
   data    key1  key2
0   0.0    Ohio  2000
1   1.0    Ohio  2001
2   2.0    Ohio  2002
3   3.0  Nevada  2001
4   4.0  Nevada  2002
In [39]: righth
Out[39]:
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
In this case, you have to indicate the multiple columns to merge on as a list:
In [40]: pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True)
Out[40]:
   data    key1  key2  event1  event2
0   0.0    Ohio  2000       4       5
0   0.0    Ohio  2000       6       7
1   1.0    Ohio  2001       8       9
2   2.0    Ohio  2002      10      11
3   3.0  Nevada  2001       0       1
In [41]: pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
Out[41]:
   data    key1  key2  event1  event2
0   0.0    Ohio  2000     4.0     5.0
0   0.0    Ohio  2000     6.0     7.0
1   1.0    Ohio  2001     8.0     9.0
2   2.0    Ohio  2002    10.0    11.0
3   3.0  Nevada  2001     0.0     1.0
4   4.0  Nevada  2002     NaN     NaN
4   NaN  Nevada  2000     2.0     3.0
Using the indexes of both sides of the merge is also possible:
In [49]: left2=DataFrame([[1.,2.],[3.,4.],[5.,6.]],index=['a','b','e'],columns=['Ohio','Nevada'])
In [50]: right2=DataFrame([[7.,8.],[9.,10.],[11.,12.],[13.,14.]],index=['b','d','d','e'],columns=['Missouri','Alabama'])
In [51]: left2
Out[51]:
   Ohio  Nevada
a   1.0     2.0
b   3.0     4.0
e   5.0     6.0
In [52]: right2
Out[52]:
   Missouri  Alabama
b       7.0      8.0
d       9.0     10.0
d      11.0     12.0
e      13.0     14.0
In [53]: pd.merge(left2,right2,how='outer',left_index=True,right_index=True)
Out[53]:
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   3.0     4.0       7.0      8.0
d   NaN     NaN       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
DataFrame also has a join instance method, which performs a merge by index more conveniently. You can also use it to combine many DataFrame objects having the same or similar indexes, regardless of whether their columns overlap.
In [54]: left2.join(right2,how='outer')
Out[54]:
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   3.0     4.0       7.0      8.0
d   NaN     NaN       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
DataFrame's join method performs a left join on the join keys by default. It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame:
In [55]: left1.join(right1,on='key')
Out[55]:
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
Lastly, for simple index-on-index merges, you can pass a list of DataFrames to join:
In [56]: another=DataFrame([[7.,8.],[9.,10.],[11.,12.],[16.,17.]],index=['a','c','e','f'],columns=['New York','Oregon'])
In [58]: left2.join([right2,another])
Out[58]:
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
b   3.0     4.0       7.0      8.0       NaN     NaN
e   5.0     6.0      13.0     14.0      11.0    12.0
In [59]: left2.join([right2,another],how='outer')
Out[59]:
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a   1.0     2.0       NaN      NaN       7.0     8.0
b   3.0     4.0       7.0      8.0       NaN     NaN
c   NaN     NaN       NaN      NaN       9.0    10.0
d   NaN     NaN       9.0     10.0       NaN     NaN
d   NaN     NaN      11.0     12.0       NaN     NaN
e   5.0     6.0      13.0     14.0      11.0    12.0
f   NaN     NaN       NaN      NaN      16.0    17.0
7.1.3 Concatenating Along an Axis
Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking.
NumPy has a concatenate function for combining raw NumPy arrays:
In [11]: arr=np.arange(12).reshape((3,4))
In [12]: arr
Out[12]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [13]: np.concatenate([arr,arr],axis=1)
Out[13]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
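A quick sketch of how the axis argument changes the result shape:

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 stacks the arrays vertically, axis=1 side by side
print(np.concatenate([arr, arr], axis=0).shape)  # (6, 4)
print(np.concatenate([arr, arr], axis=1).shape)  # (3, 8)
```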
In the context of pandas objects such as Series and DataFrame, having labeled axes lets you generalize array concatenation further. pandas' concat function provides a consistent way to do this.
Suppose we have three Series with no index overlap:
In [16]: from pandas import DataFrame,Series
In [17]: s1=Series([0,1],index=['a','b'])
In [18]: s2=Series([2,3,4],index=['c','d','e'])
In [19]: s3=Series([5,6],index=['f','g'])
In [21]: s1
Out[21]:
a    0
b    1
dtype: int64
In [22]: s2
Out[22]:
c    2
d    3
e    4
dtype: int64
In [23]: s3
Out[23]:
f    5
g    6
dtype: int64
In [24]: pd.concat([s1,s2,s3])
Out[24]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
In [25]: pd.concat([s1,s2,s3],axis=0)
Out[25]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
In [26]: pd.concat([s1,s2,s3],axis=1)
Out[26]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
By default, concat works along axis=0 (passing axis=0 explicitly gives the same result), producing a new Series. If you pass axis=1 instead, the result will be a DataFrame; additionally passing join='inner' yields the intersection of the indexes:
In [27]: [s1*5,s3]
Out[27]:
[a    0
b    5
dtype: int64, f    5
g    6
dtype: int64]
In [28]: s1*5
Out[28]:
a    0
b    5
dtype: int64
In [29]: s4=pd.concat([s1*5,s3])
In [30]: pd.concat([s1,s4],axis=1)
Out[30]:
     0  1
a  0.0  0
b  1.0  5
f  NaN  5
g  NaN  6
In [31]: pd.concat([s1,s4],axis=1,join='inner')
Out[31]:
   0  1
a  0  0
b  1  5
You can specify the indexes to be used on the other axes with join_axes:
In [32]: pd.concat([s1,s4],axis=1,join_axes=[['a','c','b','e']])
Out[32]:
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  5.0
e  NaN  NaN
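Note that join_axes has since been removed from pandas (in 1.0); in current versions the same result is obtained by reindexing after the concat. A sketch:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1 * 5, s3])

# Modern replacement for join_axes=[['a', 'c', 'b', 'e']]
result = pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
print(result.index.tolist())  # ['a', 'c', 'b', 'e']
```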
One problem with the code above is that the concatenated pieces are not identifiable in the result. The keys argument lets you create a hierarchical index on the concatenation axis.
In [33]: pd.concat([s1,s2,s3],axis=1)
Out[33]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
In [34]: pd.concat([s1,s2,s3],keys=['one','two'])
Out[34]:
one  a    0
     b    1
two  c    2
     d    3
     e    4
dtype: int64
In [35]: pd.concat([s1,s2,s3],keys=['one','two','three'])
Out[35]:
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
In [36]: result=pd.concat([s1,s2,s3],keys=['one','two','three'])
In [37]: result.unstack()
Out[37]:
         a    b    c    d    e    f    g
one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
three  NaN  NaN  NaN  NaN  NaN  5.0  6.0
If we combine the Series along axis=1, the keys instead become the DataFrame's column headers:
In [38]: pd.concat([s1,s2,s3],axis=1,keys=['one','two','three'])
Out[38]:
   one  two  three
a  0.0  NaN    NaN
b  1.0  NaN    NaN
c  NaN  2.0    NaN
d  NaN  3.0    NaN
e  NaN  4.0    NaN
f  NaN  NaN    5.0
g  NaN  NaN    6.0
In [39]: pd.concat([s1,s2,s3],axis=1)
Out[39]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
The same logic applies to DataFrame objects:
In [40]: df1=DataFrame(np.arange(6).reshape(3,2),index=['a','b','c'],columns=['one','two'])
In [41]: df2=DataFrame(5+np.arange(4).reshape(2,2),index=['a','c'],columns=['three','four'])
In [43]: df1
Out[43]:
   one  two
a    0    1
b    2    3
c    4    5
In [44]: df2
Out[44]:
   three  four
a      5     6
c      7     8
In [45]: pd.concat([df1,df2],axis=1,keys=['level1','level2'])
Out[45]:
  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
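As an aside beyond the text above, concat also takes a names argument to label the levels of the hierarchical index created by keys; a sketch reusing df1 and df2:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

# names= labels the two column levels created by keys=
combined = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
                     names=['upper', 'lower'])
print(list(combined.columns.names))  # ['upper', 'lower']
```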

7.1.4 Combining Data with Overlap
There is another data combination situation that can't be expressed as either a merge or a concatenation operation: two datasets whose indexes overlap in full or in part.
As a motivating example, consider NumPy's where function, which expresses a vectorized if-else:
In [13]: a=Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
In [14]: b=Series(np.arange(len(a),dtype=np.float64),index=['f','e','d','c','b','a'])
In [15]: b[-1]=np.nan
In [16]: a
Out[16]:
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
In [17]: b
Out[17]:
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64
In [31]: (pd.isnull(a),b,a)
Out[31]:
(f     True
e    False
d     True
c    False
b    False
a     True
dtype: bool, f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64, f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64)
In [33]: np.where(pd.isnull(a),b,a)
Out[33]: array([ 0. , 2.5, 2. , 3.5, 4.5, nan])
Series has a combine_first method, which performs the equivalent of this operation along with data alignment:
In [34]: b[:-2].combine_first(a[2:])
Out[34]:
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
With DataFrames, combine_first naturally does the same thing column by column, so you can think of it as "patching" missing data in the calling object with data from the object you pass:
In [36]: df1=DataFrame({'a':[1.,np.nan,5.,np.nan],'b':[np.nan,2.,np.nan,6.],'c':range(2,18,4)})
In [37]: df2=DataFrame({'a':[5.,4.,np.nan,3.,7.],'b':[np.nan,3.,4.,6.,8.]})
In [38]: df1
Out[38]:
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14
In [39]: df2
Out[39]:
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0
In [40]: df1.combine_first(df2)
Out[40]:
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
7.2 Reshaping and Pivoting
There are a number of basic operations for rearranging tabular data, referred to as reshape or pivot operations.
7.2.1 Reshaping with Hierarchical Indexing
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
stack: "rotates" or pivots the columns in the data into the rows
unstack: pivots the rows into the columns
Here is an example in which both the row and column indexes are strings:
In [41]: data=DataFrame(np.arange(6).reshape((2,3)),index=pd.Index(['Ohio','Colorado'],name='state'),columns=pd.Index(['one','two','three'],name='number'))
In [42]: data
Out[42]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
In [43]: result=data.stack()
In [44]: result
Out[44]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
For a Series with a hierarchical index, you can rearrange the data back into a DataFrame with unstack:
In [53]: result.unstack()
Out[53]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
By default, the innermost level is unstacked. You can unstack a different level by passing a level number or name:
In [54]: result.unstack(0)
Out[54]:
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5
In [55]: result.unstack('state')
Out[55]:
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5
Unstacking might introduce missing data if not all of the values in the level are found in each of the subgroups:
In [56]: s1=Series([0,1,2,3],index=['a','b','c','d'])
In [57]: s2=Series([4,5,6],index=['c','d','e'])
In [58]: s1
Out[58]:
a    0
b    1
c    2
d    3
dtype: int64
In [59]: s2
Out[59]:
c    4
d    5
e    6
dtype: int64
In [60]: data2=pd.concat([s1,s2],keys=['one','two'])
In [61]: data2
Out[61]:
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
In [62]: data2.unstack()
Out[62]:
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
Stacking filters out missing data by default, so the operation is easily invertible:
In [63]: data2.unstack().stack()
Out[63]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
In [64]: data2.unstack().stack(dropna=False)
Out[64]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
When you unstack a DataFrame, the level unstacked becomes the lowest level in the result:
In [65]: df=DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))
In [66]: df
Out[66]:
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10
In [68]: df.unstack('state')
Out[68]:
side    left          right         
state   Ohio Colorado  Ohio Colorado
number                              
one        0        3     5        8
two        1        4     6        9
three      2        5     7       10
In [69]: df.unstack('state').stack('side')
Out[69]:
state         Ohio  Colorado
number side                 
one    left      0         3
       right     5         8
two    left      1         4
       right     6         9
three  left      2         5
       right     7        10
7.2.2 Pivoting "Long" to "Wide" Format
Multiple time series are often stored in databases and CSV files in so-called "long" or "stacked" format:
In [6]: ldata = DataFrame({'date':['03-31','03-31','03-31','06-30','06-30','06-30'],
   ...: 'item':['real','infl','unemp','real','infl','unemp'],'value':['2710.','000.','5.8','2778.','2.34','5.1']})
   ...: print('ldata is \n', ldata)
   ...:
ldata is
    date   item  value
0  03-31   real  2710.
1  03-31   infl   000.
2  03-31  unemp    5.8
3  06-30   real  2778.
4  06-30   infl   2.34
5  06-30  unemp    5.1
關(guān)系型數(shù)據(jù)庫(MYSQL)中的數(shù)據(jù)經(jīng)常都是這樣存儲的,因為固定的架構(gòu)(固定的架構(gòu)指的是列名和數(shù)據(jù)類型)的優(yōu)點在于: 隨著表中數(shù)據(jù)的添加或者刪除,item列中的值種類能夠增加減少。
DataFrame 中的pivot,容易操作長格式的數(shù)據(jù)。
In [8]: pivoted=ldata.pivot('date','item','value')
In [9]: pivoted.head()
Out[9]:
item   infl   real unemp
date                    
03-31  000.  2710.   5.8
06-30  2.34  2778.   5.1
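Note that in recent pandas (2.0 and later), pivot takes keyword-only arguments, so the positional call above must name its parameters. A sketch, using numeric values in place of the strings above:

```python
import pandas as pd

ldata = pd.DataFrame({'date': ['03-31', '03-31', '03-31',
                               '06-30', '06-30', '06-30'],
                      'item': ['real', 'infl', 'unemp'] * 2,
                      'value': [2710.0, 0.0, 5.8, 2778.0, 2.34, 5.1]})

# Current signature: pivot(index=..., columns=..., values=...)
pivoted = ldata.pivot(index='date', columns='item', values='value')
print(pivoted.shape)  # (2, 3)
```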
Now suppose we had two value columns that we wanted to reshape simultaneously. We add a value2 column containing six random values:
In [10]: ldata['value2']=np.random.randn(len(ldata))
In [11]: len(ldata)
Out[11]: 6
In [12]: np.random.randn(len(ldata))
Out[12]:
array([-0.18423106,  1.01403302,  0.37678059,  1.66064609,  0.47334253,
       -0.57611932])
In [13]: ldata['value2']
Out[13]:
0    0.396190
1   -0.064478
2    1.783752
3    0.008722
4   -0.316903
5   -0.329522
Name: value2, dtype: float64
In [14]: ldata[:6]
Out[14]:
    date   item  value    value2
0  03-31   real  2710.  0.396190
1  03-31   infl   000. -0.064478
2  03-31  unemp    5.8  1.783752
3  06-30   real  2778.  0.008722
4  06-30   infl   2.34 -0.316903
5  06-30  unemp    5.1 -0.329522
忽略最后一個參數(shù),得到的DataFrame就會帶有層次化的列:
In [15]: pivoted=ldata.pivot('date','item')
In [16]: pivoted
Out[16]:
      value              value2                    
item   infl   real unemp      infl      real     unemp
date                                                  
03-31  000.  2710.   5.8 -0.064478  0.396190  1.783752
06-30  2.34  2778.   5.1 -0.316903  0.008722 -0.329522
In [17]: pivoted[:6]
Out[17]:
      value              value2                    
item   infl   real unemp      infl      real     unemp
date                                                  
03-31  000.  2710.   5.8 -0.064478  0.396190  1.783752
06-30  2.34  2778.   5.1 -0.316903  0.008722 -0.329522
In [18]: pivoted['value'][:5]
Out[18]:
item   infl   real unemp
date                    
03-31  000.  2710.   5.8
06-30  2.34  2778.   5.1
Note that pivot is just a shortcut: it creates a hierarchical index with set_index and then reshapes with unstack:
In [19]: unstacked=ldata.set_index(['date','item']).unstack('item')
In [20]: unstacked[:6]
Out[20]:
      value              value2                    
item   infl   real unemp      infl      real     unemp
date                                                  
03-31  000.  2710.   5.8 -0.064478  0.396190  1.783752
06-30  2.34  2778.   5.1 -0.316903  0.008722 -0.329522