數(shù)據(jù)結構
在Pandas中有兩類非常重要的數(shù)據(jù)結構,即序列Series和數(shù)據(jù)框DataFrame。Series類似于NumPy中的一維ndarray數(shù)組,除了具備其一維數(shù)組可用的函數(shù)之外還具備索引的自動對齊功能。DataFrame類似于NumPy中的二維ndarray數(shù)組,同樣在這個基礎上新增了一些功能。
Series的創(chuàng)建
序列可以通過一維數(shù)組創(chuàng)建
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: arr = [n for n in range(10)]
In [4]: s1 = pd.Series(arr)
In [5]: print(s1)
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
可以通過字典的方式創(chuàng)建
In [6]: dic = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
In [7]: dic = pd.Series(dic)
In [8]: print(dic)
a 1
b 2
c 3
d 4
e 5
dtype: int64
還可以利用DataFrame中的元素來創(chuàng)建序列
DataFrame的創(chuàng)建
DataFrame的創(chuàng)建有以下幾種方式:
通過二維數(shù)組創(chuàng)建
In [9]: arr1 = np.arange(12).reshape(3,4)
In [10]: print(arr1)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
In [11]: df1 = pd.DataFrame(arr1)
In [12]: print(df1)
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [13]: print(type(df1))
<class 'pandas.core.frame.DataFrame'>
通過字典的方式創(chuàng)建
In [20]: dic1 = {'a':[1,2,3,4], 'b':[5,6,7,8], 'c':[9,10,11,12], 'd':[13,14,15,16]}
In [21]: df1 = pd.DataFrame(dic1)
In [22]: print(type(df1))
<class 'pandas.core.frame.DataFrame'>
In [23]: print(df1)
a b c d
0 1 5 9 13
1 2 6 10 14
2 3 7 11 15
3 4 8 12 16
還可以通過重構DataFrame對象或者選取子集來創(chuàng)建新的DataFrame
In [25]: df2 = df1[['a', 'b']]
In [26]: print(type(df2))
<class 'pandas.core.frame.DataFrame'>
In [27]: print(df2)
a b
0 1 5
1 2 6
2 3 7
3 4 8
數(shù)據(jù)索引
不難發(fā)現(xiàn),Series和DataFrame對象的左邊(或上面)總是有一組元素以外的數(shù)據(jù)。其實這些就是索引,它的作用不光是通過標簽獲取對應位置的元素,還可以使Series或DataFrame的計算和操作實現(xiàn)自動對齊。
定義完Series或DataFrame后可以通過index屬性查看索引的起止和步長信息
In [36]: s2 = pd.Series(np.arange(100, 110))
In [37]: print(s2)
0 100
1 101
2 102
3 103
4 104
5 105
6 106
7 107
8 108
9 109
dtype: int32
In [38]: print(s2.index)
RangeIndex(start=0, stop=10, step=1)
序列默認使用從0開始以步長1自增的索引,我們還可以自己更改索引值
In [40]: s2.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
In [41]: print(s2)
a 100
b 101
c 102
d 103
e 104
f 105
g 106
h 107
i 108
j 109
dtype: int32
這樣以來Series就有了NumPy中一維數(shù)組的使用方式以及類似于Python中字典的取值方式。需要注意的是更改為非數(shù)值索引后用冒號切片時值的區(qū)間是前閉后閉的。
In [43]: s2[['a', 'b', 'c']]
Out[43]:
a 100
b 101
c 102
dtype: int32
In [44]: s2['b':'f']
Out[44]:
b 101
c 102
d 103
e 104
f 105
dtype: int32
自動對齊
如果我們想將兩個序列的值按照一定位置進行兩兩運算,那么索引這時候就派上大用場了。Series對象在進行算數(shù)運算時會將元素按照索引順序進行運算
In [81]: s3 = pd.Series(np.arange(10,15), index = ['a', 'b', 'c', 'd', 'e'])
In [82]: s4 = pd.Series(np.arange(0,500,step=100), index = ['d', 'c', 'a', 'e', 'f'])
In [83]: print(s3, s4)
a 10
b 11
c 12
d 13
e 14
dtype: int32
d 0
c 100
a 200
e 300
f 400
dtype: int32
In [84]: print(s3 + s4)
a 210
b NaN
c 112
d 13
e 314
f NaN
dtype: int32
由于部分索引只存在于一個序列中,所以在進行運算時會把該索引位置的值置為NaN。對于DataFrame,不僅僅是行索引的對齊,同時還會對列索引進行對齊,下面會進行說明。
pandas查詢數(shù)據(jù)
DataFrame對象有類似表格的輸出格式,我們可以通過行或列進行內(nèi)容選取。下面錄入一些學生的基本信息
In [88]: stu_dic = {'Age':[14,13,13,14,14,12,12,15,13,12,11,14,12,15,16,12,15,11,15],
...: 'Height':[69,56.5,65.3,62.8,63.5,57.3,59.8,62.5,62.5,59,51.3,64.3,56.3,66.5,72,64.
...: 8,67,57.5,66.5],
...: 'Name':['Alfred','Alice','Barbara','Carol','Henry','James','Jane','Janet','Jeffrey
...: ','John','Joyce','Judy','Louise','Marry','Philip','Robert','Ronald','Thomas','Will
...: am'],
...: 'Gender':['M','F','F','F','M','M','F','F','M','M','F','F','F','F','M','M','M','M',
...: 'M'],
...: 'Weight':[112.5,84,98,102.5,102.5,83,84.5,112.5,84,99.5,50.5,90,77,112,150,128,133
...: ,85,112]}
In [89]: student = pd.DataFrame(stu_dic)
In [90]: print(student)
Age Gender Height Name Weight
0 14 M 69.0 Alfred 112.5
1 13 F 56.5 Alice 84.0
2 13 F 65.3 Barbara 98.0
3 14 F 62.8 Carol 102.5
4 14 M 63.5 Henry 102.5
5 12 M 57.3 James 83.0
6 12 F 59.8 Jane 84.5
7 15 F 62.5 Janet 112.5
8 13 M 62.5 Jeffrey 84.0
9 12 M 59.0 John 99.5
10 11 F 51.3 Joyce 50.5
11 14 F 64.3 Judy 90.0
12 12 F 56.3 Louise 77.0
13 15 F 66.5 Marry 112.0
14 16 M 72.0 Philip 150.0
15 12 M 64.8 Robert 128.0
16 15 M 67.0 Ronald 133.0
17 11 M 57.5 Thomas 85.0
18 15 M 66.5 Willam 112.0
head和tail函數(shù)可以用來查詢前幾行或后幾行數(shù)據(jù)(默認查詢5行)
In [92]: student.head()
Out[92]:
Age Gender Height Name Weight
0 14 M 69.0 Alfred 112.5
1 13 F 56.5 Alice 84.0
2 13 F 65.3 Barbara 98.0
3 14 F 62.8 Carol 102.5
4 14 M 63.5 Henry 102.5
In [93]: student.tail(3)
Out[93]:
Age Gender Height Name Weight
16 15 M 67.0 Ronald 133.0
17 11 M 57.5 Thomas 85.0
18 15 M 66.5 Willam 112.0
查詢指定列直接用索引
In [97]: student[['Name', 'Age', 'Gender']].head()
Out[97]:
Name Age Gender
0 Alfred 14 M
1 Alice 13 F
2 Barbara 13 F
3 Carol 14 F
4 Henry 14 M
查詢指定行的數(shù)據(jù)用loc位置函數(shù),該函數(shù)可以按索引選取數(shù)據(jù)。類似的iloc函數(shù)可以根據(jù)位置選取數(shù)據(jù)
In [96]: student.loc[[1,3,5,7]]
Out[96]:
Age Gender Height Name Weight
1 13 F 56.5 Alice 84.0
3 14 F 62.8 Carol 102.5
5 12 M 57.3 James 83.0
7 15 F 62.5 Janet 112.5
查詢指定行的指定列信息
In [120]: student.loc[1:5, ['Name', 'Age', 'Gender']]
Out[120]:
Name Age Gender
1 Alice 13 F
2 Barbara 13 F
3 Carol 14 F
4 Henry 14 M
5 James 12 M
DataFrame用布爾列表作為索引可以進行條件搜索,同理可以在索引處使用邏輯運算。查詢所有12歲以上的女生信息
In [127]: student.head()[[True,False,True,False,True]]
Out[127]:
Age Gender Height Name Weight
0 14 M 69.0 Alfred 112.5
2 13 F 65.3 Barbara 98.0
4 14 M 63.5 Henry 102.5
In [128]: student[(student['Gender']=='F') & (student['Age']>12)]
Out[128]:
Age Gender Height Name Weight
1 13 F 56.5 Alice 84.0
2 13 F 65.3 Barbara 98.0
3 14 F 62.8 Carol 102.5
7 15 F 62.5 Janet 112.5
11 14 F 64.3 Judy 90.0
13 15 F 66.5 Marry 112.0
利用DataFrame進行統(tǒng)計分析
pandas模塊為我們提供了非常多用于統(tǒng)計分析的函數(shù),如求和、均值、最小值、最大值等
In [166]: data = pd.Series(np.random.randn(10))
In [167]: print(data)
0 0.139683
1 -0.223019
2 2.123692
3 0.122273
4 -1.409432
5 1.422986
6 -2.147855
7 -1.347533
8 0.363565
9 -0.014752
dtype: float64
In [168]: print(data.count()) #統(tǒng)計總個數(shù)
10
In [169]: print(data.max()) #最大值
2.12369188859
In [170]: print(data.min()) #最小值
-2.14785503764
In [171]: print(data.idxmax()) #最大值的索引
2
In [172]: print(data.idxmin()) #最小值的索引
6
In [173]: print(data.quantile(0.25)) #25%分位數(shù)
-1.0664041305639353
In [174]: print(data.sum()) #求和
-0.970391278249
In [175]: print(data.mean()) #平均值
-0.0970391278249
In [176]: print(data.median()) #中位數(shù)
0.05376066122711701
In [177]: print(data.mode()) #眾數(shù)
0 -2.147855
1 -1.409432
2 -1.347533
3 -0.223019
4 -0.014752
5 0.122273
6 0.139683
7 0.363565
8 1.422986
9 2.123692
dtype: float64
In [178]: print(data.var()) #方差
1.67477886567
In [179]: print(data.std()) #標準差
1.29413247609
In [180]: print(data.mad()) #平均絕對差
0.947936352321
In [181]: print(data.skew()) #偏度
0.111671093612
In [182]: print(data.kurt()) #峰度
-0.229936797723
In [183]: print(data.describe()) #一次性輸出多個描述性統(tǒng)計指標
count 10.000000
mean -0.097039
std 1.294132
min -2.147855
25% -1.066404
50% 0.053761
75% 0.307594
max 2.123692
dtype: float64
In [184]: print(data.value_counts()) # 查看Series對象的唯一zhi和計數(shù)
有時我們需要將一個函數(shù)應用到DataFrame中的每一列當中,這時可以借助apply函數(shù)
In [198]: print(d1)
[ 1.35779726 -0.80483372 -2.12362025 -0.33350244 -0.88671935 0.33419793
0.53678382 -0.74383037 -0.32020388 -0.91619886]
In [199]: d2 = np.random.f(2,4,size=10)
In [200]: print(d2)
[ 1.9737911 2.58048514 0.77602181 3.20355989 1.34335165 0.64734569
0.52191629 0.56962343 5.23736555 0.14573449]
In [201]: d3 = np.random.randint(50,100,size=10)
In [202]: print(d3)
[76 65 56 88 88 70 67 66 73 70]
In [203]: df = pd.DataFrame(np.array([d1,d2,d3]).T, columns=['x1', 'x2', 'x3'])
In [204]: print(df.head())
x1 x2 x3
0 1.357797 1.973791 76.0
1 -0.804834 2.580485 65.0
2 -2.123620 0.776022 56.0
3 -0.333502 3.203560 88.0
4 -0.886719 1.343352 88.0
In [209]: print(df.apply(lambda x: x.describe()))
x1 x2 x3
count 10.000000 10.000000 10.000000
mean -0.390013 1.699920 71.900000
std 0.958737 1.591444 10.016098
min -2.123620 0.145734 56.000000
25% -0.866248 0.589054 66.250000
50% -0.538666 1.059687 70.000000
75% 0.170597 2.428812 75.250000
max 1.357797 5.237366 88.000000
在統(tǒng)計離散數(shù)據(jù)時就不能使用這種統(tǒng)計方式了,我們需要統(tǒng)計離散變量的觀測數(shù)量、取值種類數(shù)、眾數(shù)及其個數(shù)。這時只需要使用describe方法就可以實現(xiàn)
In [211]: print(student['Gender'].describe())
count 19
unique 2
top M
freq 10
Name: Gender, dtype: object
除了以上簡單描述性統(tǒng)計之外,還提供了連續(xù)變量的相關系數(shù)(corr)和協(xié)方差矩陣(cov)的求解。關于相關系數(shù)的計算可以調用pearson方法或kendell方法或spearman方法,默認用pearson方法
In [214]: print(df.corr())
x1 x2 x3
x1 1.000000 0.115625 0.321113
x2 0.115625 1.000000 0.344972
x3 0.321113 0.344972 1.000000
如果只想管制某一個變量與其余變量的相關系數(shù)的話,可以使用corrwith,如下方只關心x1與其余變量的相關系數(shù)
In [216]: print(df.corrwith(df['x1']))
x1 1.000000
x2 0.115625
x3 0.321113
dtype: float64
利用Pandas實現(xiàn)SQL操作
在SQL中常見的操作主要是增、刪、改、查,在Pandas中同樣可以實現(xiàn)這幾項操作。新增數(shù)據(jù)操作對新數(shù)據(jù)的格式要求十分少,即使列值對不上也會自行新增列,列名順序不一致也會自動進行對齊
In [226]: dic = {'Name':['LiuShunxiang','Zhangshan'],'Gender':['M','F'],'Age':[27,23],'Heig
...: ht':[165.7,167.2],'Weight':[61,63]}
In [227]: student1 = pd.DataFrame(dic)
In [228]: student2 = pd.concat([student.head(), student1])
In [229]: print(student2)
Age Gender Height Name Weight
0 14 M 69.0 Alfred 112.5
1 13 F 56.5 Alice 84.0
2 13 F 65.3 Barbara 98.0
3 14 F 62.8 Carol 102.5
4 14 M 63.5 Henry 102.5
0 27 M 165.7 LiuShunxiang 61.0
1 23 F 167.2 Zhangshan 63.0
增
新增列后,默認將新列上的值都賦為NaN
In [230]: print(pd.DataFrame(student2, columns
=['Age','Height','Name','Gender','Weight','Score']))
Age Height Name Gender Weight Score
0 14 69.0 Alfred M 112.5 NaN
1 13 56.5 Alice F 84.0 NaN
2 13 65.3 Barbara F 98.0 NaN
3 14 62.8 Carol F 102.5 NaN
4 14 63.5 Henry M 102.5 NaN
0 27 165.7 LiuShunxiang M 61.0 NaN
1 23 167.2 Zhangshan F 63.0 NaN
刪
刪除DataFrame(可以看成是刪除表)使用Python的del關鍵字。刪除整列或整行都使用drop函數(shù),使用drop函數(shù)時需要指定軸axis,默認為0即代表著行(條件刪除可以使用索引條件過濾然后復制給當前DataFrame)
In [260]: del student1
In [261]: print(student2.drop([1,3,5]))
Age Gender Height Name Weight
0 14 M 69.0 Alfred 112.5
2 13 F 65.3 Barbara 98.0
4 14 M 63.5 Henry 102.5
6 23 F 167.2 Zhangshan 63.0
In [262]: print(student2.drop(['Height', 'Weight'], axis=1).head())
Age Gender Name
0 14 M Alfred
1 13 F Alice
2 13 F Barbara
3 14 F Carol
4 14 M Henry
改
可以使用布爾索引結合賦值的方法
In [274]: student2.loc[student2['Name']=='Alfred', 'Height'] = 173
In [275]: student2.loc[student2['Name']=='Alfred']
Out[275]:
Age Gender Height Name Weight
0 14 M 173.0 Alfred 112.5
查
上面已經(jīng)講過,不做贅述
聚合
Pandas中可以通過groupby函數(shù)實現(xiàn)數(shù)據(jù)的聚合操作,通過值將觀測行分組
In [277]: print(student.groupby('Gender').mean()) # 統(tǒng)計男女的各項平均值
Age Height Weight
Gender
F 13.222222 60.588889 90.111111
M 13.400000 63.910000 108.950000
如果不想將某個值做統(tǒng)計,可以先用drop刪除列然后做groupby操作。還可以用agg函數(shù)對分組計算多個統(tǒng)計量
In [279]: print(student.drop('Age',axis=1).groupby('Gender').agg([np.mean, np.median]))
Height Weight
mean median mean median
Gender
F 60.588889 62.50 90.111111 90.00
M 63.910000 64.15 108.950000 107.25
排序
排序在日常的統(tǒng)計分析中比較常見,我們可以使用sort_index和sort_values函數(shù)對Series或DataFrame對象實現(xiàn)按值排序和按索引排序,ascending=True/False可以指定使用升序還是降序
In [293]: print(student.head())
Age Gender Height Name Weight
0 14 M 69.0 Alfred 112.5
1 13 F 56.5 Alice 84.0
2 13 F 65.3 Barbara 98.0
3 14 F 62.8 Carol 102.5
4 14 M 63.5 Henry 102.5
In [294]: print(student.head().sort_values('Height'))
Age Gender Height Name Weight
1 13 F 56.5 Alice 84.0
3 14 F 62.8 Carol 102.5
4 14 M 63.5 Henry 102.5
2 13 F 65.3 Barbara 98.0
0 14 M 69.0 Alfred 112.5
In [295]: print(student.head().sort_values(by='Height').sort_index())
Age Gender Height Name Weight
0 14 M 69.0 Alfred 112.5
1 13 F 56.5 Alice 84.0
2 13 F 65.3 Barbara 98.0
3 14 F 62.8 Carol 102.5
4 14 M 63.5 Henry 102.5
多表連接
Pandas同樣于MySQL,分成左右內(nèi)外連接,用merge函數(shù)實現(xiàn)
In [296]: dic2 = {'Name':['Alfred','Alice','Barbara','Carol','Henry','Jeffrey','Judy','Phil
...: ip','Robert','Willam'], 'Score':[88,76,89,67,79,90,92,86,73,77]}
In [297]: score = pd.DataFrame(dic2)
In [298]: stu_score1 = pd.merge(student, score, on='Name') # on指定連接依據(jù)
In [299]: print(stu_score1)
Age Gender Height Name Weight Score
0 14 M 69.0 Alfred 112.5 88
1 13 F 56.5 Alice 84.0 76
2 13 F 65.3 Barbara 98.0 89
3 14 F 62.8 Carol 102.5 67
4 14 M 63.5 Henry 102.5 79
5 13 M 62.5 Jeffrey 84.0 90
6 14 F 64.3 Judy 90.0 92
7 16 M 72.0 Philip 150.0 86
8 12 M 64.8 Robert 128.0 73
9 15 M 66.5 Willam 112.0 77
merge函數(shù)默認使用的是內(nèi)連接,返回兩張表中共同存在的行。你還可以通過how參數(shù)設置left、right、outer連接方式,左連接保留前一張表的所有行,右鏈接保留后一張表的所有行,不存在的值自動設為NaN。
In [306]: score = pd.merge(student,score,how='left').head(10)
In [307]: score
Out[307]:
Age Gender Height Name Weight Score
0 14 M 69.0 Alfred 112.5 88.0
1 13 F 56.5 Alice 84.0 76.0
2 13 F 65.3 Barbara 98.0 89.0
3 14 F 62.8 Carol 102.5 67.0
4 14 M 63.5 Henry 102.5 79.0
5 12 M 57.3 James 83.0 NaN
6 12 F 59.8 Jane 84.5 NaN
7 15 F 62.5 Janet 112.5 NaN
8 13 M 62.5 Jeffrey 84.0 90.0
9 12 M 59.0 John 99.5 NaN
利用Pandas進行缺失值的處理
工作場景中我們在整理數(shù)據(jù)時需要面對缺失值處理的場景。常見的缺失值處理的方法分為三種:刪除法、填補法和插值法,這里就介紹一些簡單的刪除法和填補法。在刪除法中,dropna函數(shù)直接刪除含有NaN的行(可用axis=1刪除列,thresh=n刪除小于n個非空值的行,how=‘a(chǎn)ll’ 刪除所有值都為NaN的那些行)
In [325]: df = pd.DataFrame([[1,1,2],[3,5,np.nan],[13,21,34],[55,np.nan,10],[np.nan,np.nan,
...: np.nan],[np.nan,1,2]],columns=('x1','x2','x3'))
In [326]: print(df)
x1 x2 x3
0 1.0 1.0 2.0
1 3.0 5.0 NaN
2 13.0 21.0 34.0
3 55.0 NaN 10.0
4 NaN NaN NaN
5 NaN 1.0 2.0
In [327]: print(df.dropna())
x1 x2 x3
0 1.0 1.0 2.0
2 13.0 21.0 34.0
In [328]: print(df.dropna(how='all'))
x1 x2 x3
0 1.0 1.0 2.0
1 3.0 5.0 NaN
2 13.0 21.0 34.0
3 55.0 NaN 10.0
5 NaN 1.0 2.0
填補缺失值可以使用fillna函數(shù),它可以用值替換NaN,也可以用 method='ffill'或'bfill' 來指定用前向或后項填充。但是在大部分情況下我們會依據(jù)情況使用眾數(shù)、中位數(shù)、平均數(shù)等中心度量值來填充空白數(shù)據(jù)。
In [331]: print(df.fillna(0))
x1 x2 x3
0 1.0 1.0 2.0
1 3.0 5.0 0.0
2 13.0 21.0 34.0
3 55.0 0.0 10.0
4 0.0 0.0 0.0
5 0.0 1.0 2.0
In [332]: print(df.fillna(method='ffill'))
x1 x2 x3
0 1.0 1.0 2.0
1 3.0 5.0 2.0
2 13.0 21.0 34.0
3 55.0 21.0 10.0
4 55.0 21.0 10.0
5 55.0 1.0 2.0
In [333]: print(df.fillna(method='bfill'))
x1 x2 x3
0 1.0 1.0 2.0
1 3.0 5.0 34.0
2 13.0 21.0 34.0
3 55.0 1.0 10.0
4 NaN 1.0 2.0
5 NaN 1.0 2.0
In [334]: print(df.fillna({'x1':1, 'x2':2, 'x3':3}))
x1 x2 x3
0 1.0 1.0 2.0
1 3.0 5.0 3.0
2 13.0 21.0 34.0
3 55.0 2.0 10.0
4 1.0 2.0 3.0
5 1.0 1.0 2.0
isnull(和notnull)函數(shù)可以檢查DataFrame對象中的空(非空)值,返回一個布爾數(shù)組
In [358]: print(df.isnull())
x1 x2 x3
0 False False False
1 False False True
2 False False False
3 False True False
4 True True True
5 True False False
重命名DataFrame對象的列名用columns屬性,批量更改類名用rename
In [361]: df
Out[361]:
x1 x2 x3
0 1.0 1.0 2.0
1 3.0 5.0 NaN
2 13.0 21.0 34.0
3 55.0 NaN 10.0
4 NaN NaN NaN
5 NaN 1.0 2.0
In [362]: df.columns = ['a', 'b', 'c']
In [363]: df
Out[363]:
a b c
0 1.0 1.0 2.0
1 3.0 5.0 NaN
2 13.0 21.0 34.0
3 55.0 NaN 10.0
4 NaN NaN NaN
5 NaN 1.0 2.0
In [364]: df.rename(columns={'a':'aa'})
Out[364]:
aa b c
0 1.0 1.0 2.0
1 3.0 5.0 NaN
2 13.0 21.0 34.0
3 55.0 NaN 10.0
4 NaN NaN NaN
5 NaN 1.0 2.0
在Series對象中替換值使用replace函數(shù)
In [359]: s
Out[359]:
a 0
b 1
c 2
d 3
dtype: int32
In [360]: print(s.replace(1, 11))
a 0
b 11
c 2
d 3
dtype: int32
利用Pandas實現(xiàn)Excel的數(shù)據(jù)透視表
Pandas提供了實現(xiàn)數(shù)據(jù)透視表功能的pivot_table函數(shù),通過該函數(shù)可以直觀查看數(shù)據(jù)的聚合(計數(shù)、求和、均值、標準差)情況。該函數(shù)可傳遞的參數(shù)有以下這些pivot_table(data,values=None, # values指定需要聚合的字段
index=None, # 指定某些原始變量作為行索引
columns=None, # 指定哪些離散的分組變量
aggfunc='mean', # 指定相應的聚合函數(shù)
fill_value=None, # 使用一個常數(shù)替代缺失值,默認不替換
margins=False, # 是否進行性行或列的匯總,默認不匯總
dropna=True, # 默認所有觀測為缺失的值
margins_name='All') # 默認行匯總或咧匯總的名稱為‘All’
下面舉個例子,依據(jù)Gender、Age對Height、Weight進行展示
In [336]: table = pd.pivot_table(student, values=['Height', 'Weight'], columns=['Gender','A
...: ge'])
In [337]: print(table)
Gender Age
Height F 11 51.300000
12 58.050000
13 60.900000
14 63.550000
15 64.500000
M 11 57.500000
12 60.366667
13 62.500000
14 66.250000
15 66.750000
16 72.000000
Weight F 11 50.500000
12 80.750000
13 91.000000
14 96.250000
15 112.250000
M 11 85.000000
12 103.500000
13 84.000000
14 107.500000
15 122.500000
16 150.000000
dtype: float64
如果想把上述格式變換成更像Excel的堆疊格式可以使用unstack函數(shù)即可。
In [338]: table = pd.pivot_table(student, values=['Height', 'Weight'], columns=['Gender','A
...: ge']).unstack()
In [339]: print(table)
Age 11 12 13 14 15 16
Gender
Height F 51.3 58.050000 60.9 63.55 64.50 NaN
M 57.5 60.366667 62.5 66.25 66.75 72.0
Weight F 50.5 80.750000 91.0 96.25 112.25 NaN
M 85.0 103.500000 84.0 107.50 122.50 150.0
多層索引的使用
在Pandas中我們可以在一個軸上設置多個索引,類似于Excel的如下形式

設置的原理十分易懂
In [340]: s = pd.Series(np.arange(1,10),index=[["a","a","a","b","b","c","c","d","d"],[1,2,3
...: ,1,2,3,1,2,3]])
In [341]: print(s)
a 1 1
2 2
3 3
b 1 4
2 5
c 3 6
1 7
d 2 8
3 9
dtype: int32
選取元素時需要用兩個索引指定一個下標
In [342]: print(s['a'])
1 1
2 2
3 3
dtype: int32
In [343]: print(s['a', 2])
2
DataFrame的層次化索引也類似
In [344]: data = pd.DataFrame(np.random.randint(0,150,size=(8,12)),
...: columns = pd.MultiIndex.from_product([['模擬考','正式考'],
...: ['數(shù)學','語文','英語','物理','
...: 化學','生物']]),
...: index = pd.MultiIndex.from_product([['期中','期末'],
...: ['雷軍','李斌'],
...: ['測試一','測試二']]))
...:
In [345]: print(data)
模擬考 正式考
數(shù)學 語文 英語 物理 化學 生物 數(shù)學 語文 英語 物理 化學 生
物
期中 雷軍 測試一 42 26 120 94 25 141 45 50 94 44 88 74
測試二 76 105 138 87 125 55 60 15 37 54 79 56
李斌 測試一 6 53 97 74 46 65 5 130 67 129 3 109
測試二 108 22 89 13 53 47 12 81 19 53 93 104
期末 雷軍 測試一 41 136 14 4 33 53 142 43 28 68 88 7
測試二 47 84 127 28 17 135 116 31 118 49 27 97
李斌 測試一 30 43 48 24 15 3 26 71 82 104 125 79
測試二 50 114 43 123 111 106 5 91 34 76 52 82
數(shù)據(jù)IO
Pandas支持多種格式數(shù)據(jù)的讀寫操作,它們的讀函數(shù)分別為read_csv、read_table(從限定分隔符的文件中讀內(nèi)容)、read_excel、read_sql、read_json、read_html、read_clipboard(從粘貼板讀內(nèi)容),這些函數(shù)所用來操作的文件和數(shù)據(jù)格式一目了然。文件的寫操作有函數(shù)to_csv、to_excel、to_sql、to_json這幾個。
In [371]: df.to_csv('E:\\a.csv')
In [372]: file_str = pd.read_csv('E:\\a.csv')
In [373]: print(file_str)
Unnamed: 0 a b c
0 0 1.0 1.0 2.0
1 1 3.0 5.0 NaN
2 2 13.0 21.0 34.0
3 3 55.0 NaN 10.0
4 4 NaN NaN NaN
5 5 NaN 1.0 2.0