Pandas一篇入門

數(shù)據(jù)結構

在Pandas中有兩類非常重要的數(shù)據(jù)結構,即序列Series和數(shù)據(jù)框DataFrame。Series類似于NumPy中的一維ndarray數(shù)組,除了具備其一維數(shù)組可用的函數(shù)之外還具備索引的自動對齊功能。DataFrame類似于NumPy中的二維ndarray數(shù)組,同樣在這個基礎上新增了一些功能。

Series的創(chuàng)建

序列可以通過一維數(shù)組創(chuàng)建

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: arr = [n for n in range(10)]
In [4]: s1 = pd.Series(arr)
In [5]: print(s1)
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

可以通過字典的方式創(chuàng)建

In [6]: dic = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
In [7]: dic = pd.Series(dic)
In [8]: print(dic)
a    1
b    2
c    3
d    4
e    5
dtype: int64

還可以利用DataFrame中的元素來創(chuàng)建序列

DataFrame的創(chuàng)建

DataFrame的創(chuàng)建有以下幾種方式:
通過二維數(shù)組創(chuàng)建

In [9]: arr1 = np.arange(12).reshape(3,4)
In [10]: print(arr1)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
In [11]: df1 = pd.DataFrame(arr1)
In [12]: print(df1)
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
In [13]: print(type(df1))
<class 'pandas.core.frame.DataFrame'>

通過字典的方式創(chuàng)建

In [20]: dic1 = {'a':[1,2,3,4], 'b':[5,6,7,8], 'c':[9,10,11,12], 'd':[13,14,15,16]}
In [21]: df1 = pd.DataFrame(dic1)
In [22]: print(type(df1))
<class 'pandas.core.frame.DataFrame'>
In [23]: print(df1)
   a  b   c   d
0  1  5   9  13
1  2  6  10  14
2  3  7  11  15
3  4  8  12  16

還可以通過重構DataFrame對象或者選取子集來創(chuàng)建新的DataFrame

In [25]: df2 = df1[['a', 'b']]
In [26]: print(type(df2))
<class 'pandas.core.frame.DataFrame'>
In [27]: print(df2)
   a  b
0  1  5
1  2  6
2  3  7
3  4  8

數(shù)據(jù)索引

不難發(fā)現(xiàn),Series和DataFrame對象的左邊(或上面)總是有一組元素以外的數(shù)據(jù)。其實這些就是索引,它的作用不光是通過標簽獲取對應位置的元素,還可以使Series或DataFrame的計算和操作實現(xiàn)自動對齊。
定義完Series或DataFrame后可以通過index屬性查看索引的起止和步長信息

In [36]: s2 = pd.Series(np.arange(100, 110))
In [37]: print(s2)
0    100
1    101
2    102
3    103
4    104
5    105
6    106
7    107
8    108
9    109
dtype: int32
In [38]: print(s2.index)
RangeIndex(start=0, stop=10, step=1)

序列默認使用從0開始以步長1自增的索引,我們還可以自己更改索引值

In [40]: s2.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
In [41]: print(s2)
a    100
b    101
c    102
d    103
e    104
f    105
g    106
h    107
i    108
j    109
dtype: int32

這樣以來Series就有了NumPy中一維數(shù)組的使用方式以及類似于Python中字典的取值方式。需要注意的是更改為非數(shù)值索引后用冒號切片時值的區(qū)間是前閉后閉的。

In [43]: s2[['a', 'b', 'c']]
Out[43]:
a    100
b    101
c    102
dtype: int32
In [44]: s2['b':'f']
Out[44]:
b    101
c    102
d    103
e    104
f    105
dtype: int32
自動對齊

如果我們想將兩個序列的值按照一定位置進行兩兩運算,那么索引這時候就派上大用場了。Series對象在進行算數(shù)運算時會將元素按照索引順序進行運算

In [81]: s3 = pd.Series(np.arange(10,15), index = ['a', 'b', 'c', 'd', 'e'])
In [82]: s4 = pd.Series(np.arange(0,500,step=100), index = ['d', 'c', 'a', 'e', 'f'])
In [83]: print(s3, s4)
a    10
b    11
c    12
d    13
e    14
dtype: int32 
d      0
c    100
a    200
e    300
f    400
dtype: int32
In [84]: print(s3 + s4)
a    210
b    NaN
c    112
d     13
e    314
f     NaN
dtype: int32

由于部分索引只存在于一個序列中,所以在進行運算時會把該索引位置的值置為NaN。對于DataFrame,不僅僅是行索引的對齊,同時還會對列索引進行對齊,下面會進行說明。

pandas查詢數(shù)據(jù)

DataFrame對象有類似表格的輸出格式,我們可以通過行或列進行內(nèi)容選取。下面錄入一些學生的基本信息

In [88]: stu_dic = {'Age':[14,13,13,14,14,12,12,15,13,12,11,14,12,15,16,12,15,11,15],
    ...: 'Height':[69,56.5,65.3,62.8,63.5,57.3,59.8,62.5,62.5,59,51.3,64.3,56.3,66.5,72,64.
    ...: 8,67,57.5,66.5],
    ...: 'Name':['Alfred','Alice','Barbara','Carol','Henry','James','Jane','Janet','Jeffrey
    ...: ','John','Joyce','Judy','Louise','Marry','Philip','Robert','Ronald','Thomas','Will
    ...: am'],
    ...: 'Gender':['M','F','F','F','M','M','F','F','M','M','F','F','F','F','M','M','M','M',
    ...: 'M'],
    ...: 'Weight':[112.5,84,98,102.5,102.5,83,84.5,112.5,84,99.5,50.5,90,77,112,150,128,133
    ...: ,85,112]}
In [89]: student = pd.DataFrame(stu_dic)
In [90]: print(student)
    Age Gender  Height     Name  Weight
0    14      M    69.0   Alfred   112.5
1    13      F    56.5    Alice    84.0
2    13      F    65.3  Barbara    98.0
3    14      F    62.8    Carol   102.5
4    14      M    63.5    Henry   102.5
5    12      M    57.3    James    83.0
6    12      F    59.8     Jane    84.5
7    15      F    62.5    Janet   112.5
8    13      M    62.5  Jeffrey    84.0
9    12      M    59.0     John    99.5
10   11      F    51.3    Joyce    50.5
11   14      F    64.3     Judy    90.0
12   12      F    56.3   Louise    77.0
13   15      F    66.5    Marry   112.0
14   16      M    72.0   Philip   150.0
15   12      M    64.8   Robert   128.0
16   15      M    67.0   Ronald   133.0
17   11      M    57.5   Thomas    85.0
18   15      M    66.5   Willam   112.0

head和tail函數(shù)可以用來查詢前幾行或后幾行數(shù)據(jù)(默認查詢5行)

In [92]: student.head()
Out[92]:
   Age Gender  Height     Name  Weight
0   14      M    69.0   Alfred   112.5
1   13      F    56.5    Alice    84.0
2   13      F    65.3  Barbara    98.0
3   14      F    62.8    Carol   102.5
4   14      M    63.5    Henry   102.5
In [93]: student.tail(3)
Out[93]:
    Age Gender  Height    Name  Weight
16   15      M    67.0  Ronald   133.0
17   11      M    57.5  Thomas    85.0
18   15      M    66.5  Willam   112.0

查詢指定列直接用索引

In [97]: student[['Name', 'Age', 'Gender']].head()
Out[97]:
      Name  Age Gender
0   Alfred   14      M
1    Alice   13      F
2  Barbara   13      F
3    Carol   14      F
4    Henry   14      M

查詢指定行的數(shù)據(jù)用loc位置函數(shù),該函數(shù)可以按索引選取數(shù)據(jù)。類似的iloc函數(shù)可以根據(jù)位置選取數(shù)據(jù)

In [96]: student.loc[[1,3,5,7]]
Out[96]:
   Age Gender  Height   Name  Weight
1   13      F    56.5  Alice    84.0
3   14      F    62.8  Carol   102.5
5   12      M    57.3  James    83.0
7   15      F    62.5  Janet   112.5

查詢指定行的指定列信息

In [120]: student.loc[1:5, ['Name', 'Age', 'Gender']]
Out[120]:
      Name  Age Gender
1    Alice   13      F
2  Barbara   13      F
3    Carol   14      F
4    Henry   14      M
5    James   12      M

DataFrame用布爾列表作為索引可以進行條件搜索,同理可以在索引處使用邏輯運算。查詢所有12歲以上的女生信息

In [127]: student.head()[[True,False,True,False,True]]
Out[127]:
   Age Gender  Height     Name  Weight
0   14      M    69.0   Alfred   112.5
2   13      F    65.3  Barbara    98.0
4   14      M    63.5    Henry   102.5
In [128]: student[(student['Gender']=='F') & (student['Age']>12)]
Out[128]:
    Age Gender  Height     Name  Weight
1    13      F    56.5    Alice    84.0
2    13      F    65.3  Barbara    98.0
3    14      F    62.8    Carol   102.5
7    15      F    62.5    Janet   112.5
11   14      F    64.3     Judy    90.0
13   15      F    66.5    Marry   112.0

利用DataFrame進行統(tǒng)計分析

pandas模塊為我們提供了非常多用于統(tǒng)計分析的函數(shù),如求和、均值、最小值、最大值等

In [166]: data = pd.Series(np.random.randn(10))
In [167]: print(data)
0    0.139683
1   -0.223019
2    2.123692
3    0.122273
4   -1.409432
5    1.422986
6   -2.147855
7   -1.347533
8    0.363565
9   -0.014752
dtype: float64
In [168]: print(data.count()) #統(tǒng)計總個數(shù)
10
In [169]: print(data.max()) #最大值
2.12369188859
In [170]: print(data.min()) #最小值
-2.14785503764
In [171]: print(data.idxmax()) #最大值的索引
2
In [172]: print(data.idxmin()) #最小值的索引
6
In [173]: print(data.quantile(0.25)) #25%分位數(shù)
-1.0664041305639353
In [174]: print(data.sum()) #求和
-0.970391278249
In [175]: print(data.mean()) #平均值
-0.0970391278249
In [176]: print(data.median()) #中位數(shù)
0.05376066122711701
In [177]: print(data.mode()) #眾數(shù)
0   -2.147855
1   -1.409432
2   -1.347533
3   -0.223019
4   -0.014752
5    0.122273
6    0.139683
7    0.363565
8    1.422986
9    2.123692
dtype: float64
In [178]: print(data.var()) #方差
1.67477886567
In [179]: print(data.std()) #標準差
1.29413247609
In [180]: print(data.mad()) #平均絕對差
0.947936352321
In [181]: print(data.skew()) #偏度
0.111671093612
In [182]: print(data.kurt()) #峰度
-0.229936797723
In [183]: print(data.describe()) #一次性輸出多個描述性統(tǒng)計指標
count    10.000000
mean     -0.097039
std       1.294132
min      -2.147855
25%      -1.066404
50%       0.053761
75%       0.307594
max       2.123692
dtype: float64
In [184]: print(data.value_counts())  # 查看Series對象的唯一zhi和計數(shù)

有時我們需要將一個函數(shù)應用到DataFrame中的每一列當中,這時可以借助apply函數(shù)

In [198]: print(d1)
[ 1.35779726 -0.80483372 -2.12362025 -0.33350244 -0.88671935  0.33419793
  0.53678382 -0.74383037 -0.32020388 -0.91619886]
In [199]: d2 = np.random.f(2,4,size=10)
In [200]: print(d2)
[ 1.9737911   2.58048514  0.77602181  3.20355989  1.34335165  0.64734569
  0.52191629  0.56962343  5.23736555  0.14573449]
In [201]: d3 = np.random.randint(50,100,size=10)
In [202]: print(d3)
[76 65 56 88 88 70 67 66 73 70]
In [203]: df = pd.DataFrame(np.array([d1,d2,d3]).T, columns=['x1', 'x2', 'x3'])
In [204]: print(df.head())
         x1        x2    x3
0  1.357797  1.973791  76.0
1 -0.804834  2.580485  65.0
2 -2.123620  0.776022  56.0
3 -0.333502  3.203560  88.0
4 -0.886719  1.343352  88.0
In [209]: print(df.apply(lambda x: x.describe()))
              x1         x2         x3
count  10.000000  10.000000  10.000000
mean   -0.390013   1.699920  71.900000
std     0.958737   1.591444  10.016098
min    -2.123620   0.145734  56.000000
25%    -0.866248   0.589054  66.250000
50%    -0.538666   1.059687  70.000000
75%     0.170597   2.428812  75.250000
max     1.357797   5.237366  88.000000

在統(tǒng)計離散數(shù)據(jù)時就不能使用這種統(tǒng)計方式了,我們需要統(tǒng)計離散變量的觀測數(shù)量、取值種類數(shù)、眾數(shù)及其個數(shù)。這時只需要使用describe方法就可以實現(xiàn)

In [211]: print(student['Gender'].describe())
count     19
unique     2
top        M
freq      10
Name: Gender, dtype: object

除了以上簡單描述性統(tǒng)計之外,還提供了連續(xù)變量的相關系數(shù)(corr)和協(xié)方差矩陣(cov)的求解。關于相關系數(shù)的計算可以調用pearson方法或kendell方法或spearman方法,默認用pearson方法

In [214]: print(df.corr())
          x1        x2        x3
x1  1.000000  0.115625  0.321113
x2  0.115625  1.000000  0.344972
x3  0.321113  0.344972  1.000000

如果只想管制某一個變量與其余變量的相關系數(shù)的話,可以使用corrwith,如下方只關心x1與其余變量的相關系數(shù)

In [216]: print(df.corrwith(df['x1']))
x1    1.000000
x2    0.115625
x3    0.321113
dtype: float64

利用Pandas實現(xiàn)SQL操作

在SQL中常見的操作主要是增、刪、改、查,在Pandas中同樣可以實現(xiàn)這幾項操作。新增數(shù)據(jù)操作對新數(shù)據(jù)的格式要求十分少,即使列值對不上也會自行新增列,列名順序不一致也會自動進行對齊

In [226]: dic = {'Name':['LiuShunxiang','Zhangshan'],'Gender':['M','F'],'Age':[27,23],'Heig
     ...: ht':[165.7,167.2],'Weight':[61,63]}
In [227]: student1 = pd.DataFrame(dic)
In [228]: student2 = pd.concat([student.head(), student1])
In [229]: print(student2)
   Age Gender  Height          Name  Weight
0   14      M    69.0        Alfred   112.5
1   13      F    56.5         Alice    84.0
2   13      F    65.3       Barbara    98.0
3   14      F    62.8         Carol   102.5
4   14      M    63.5         Henry   102.5
0   27      M   165.7  LiuShunxiang    61.0
1   23      F   167.2     Zhangshan    63.0

新增列后,默認將新列上的值都賦為NaN

In [230]: print(pd.DataFrame(student2, columns
=['Age','Height','Name','Gender','Weight','Score']))
   Age  Height          Name Gender  Weight  Score
0   14    69.0        Alfred      M   112.5    NaN
1   13    56.5         Alice      F    84.0    NaN
2   13    65.3       Barbara      F    98.0    NaN
3   14    62.8         Carol      F   102.5    NaN
4   14    63.5         Henry      M   102.5    NaN
0   27   165.7  LiuShunxiang      M    61.0    NaN
1   23   167.2     Zhangshan      F    63.0    NaN

刪除DataFrame(可以看成是刪除表)使用Python的del關鍵字。刪除整列或整行都使用drop函數(shù),使用drop函數(shù)時需要指定軸axis,默認為0即代表著行(條件刪除可以使用索引條件過濾然后復制給當前DataFrame)

In [260]: del student1
In [261]: print(student2.drop([1,3,5]))
   Age Gender  Height       Name  Weight
0   14      M    69.0     Alfred   112.5
2   13      F    65.3    Barbara    98.0
4   14      M    63.5      Henry   102.5
6   23      F   167.2  Zhangshan    63.0
In [262]: print(student2.drop(['Height', 'Weight'], axis=1).head())
   Age Gender     Name
0   14      M   Alfred
1   13      F    Alice
2   13      F  Barbara
3   14      F    Carol
4   14      M    Henry

可以使用布爾索引結合賦值的方法

In [274]: student2.loc[student2['Name']=='Alfred', 'Height'] = 173
In [275]: student2.loc[student2['Name']=='Alfred']
Out[275]:
   Age Gender  Height    Name  Weight
0   14      M   173.0  Alfred   112.5

上面已經(jīng)講過,不做贅述

聚合

Pandas中可以通過groupby函數(shù)實現(xiàn)數(shù)據(jù)的聚合操作,通過值將觀測行分組

In [277]: print(student.groupby('Gender').mean())  # 統(tǒng)計男女的各項平均值
              Age     Height      Weight
Gender
F       13.222222  60.588889   90.111111
M       13.400000  63.910000  108.950000

如果不想將某個值做統(tǒng)計,可以先用drop刪除列然后做groupby操作。還可以用agg函數(shù)對分組計算多個統(tǒng)計量

In [279]: print(student.drop('Age',axis=1).groupby('Gender').agg([np.mean, np.median]))
           Height             Weight
             mean median        mean  median
Gender
F       60.588889  62.50   90.111111   90.00
M       63.910000  64.15  108.950000  107.25
排序

排序在日常的統(tǒng)計分析中比較常見,我們可以使用sort_index和sort_values函數(shù)對Series或DataFrame對象實現(xiàn)按值排序和按索引排序,ascending=True/False可以指定使用升序還是降序

In [293]: print(student.head())
   Age Gender  Height     Name  Weight
0   14      M    69.0   Alfred   112.5
1   13      F    56.5    Alice    84.0
2   13      F    65.3  Barbara    98.0
3   14      F    62.8    Carol   102.5
4   14      M    63.5    Henry   102.5
In [294]: print(student.head().sort_values('Height'))
   Age Gender  Height     Name  Weight
1   13      F    56.5    Alice    84.0
3   14      F    62.8    Carol   102.5
4   14      M    63.5    Henry   102.5
2   13      F    65.3  Barbara    98.0
0   14      M    69.0   Alfred   112.5
In [295]: print(student.head().sort_values(by='Height').sort_index())
   Age Gender  Height     Name  Weight
0   14      M    69.0   Alfred   112.5
1   13      F    56.5    Alice    84.0
2   13      F    65.3  Barbara    98.0
3   14      F    62.8    Carol   102.5
4   14      M    63.5    Henry   102.5
多表連接

Pandas同樣于MySQL,分成左右內(nèi)外連接,用merge函數(shù)實現(xiàn)

In [296]: dic2 = {'Name':['Alfred','Alice','Barbara','Carol','Henry','Jeffrey','Judy','Phil
     ...: ip','Robert','Willam'], 'Score':[88,76,89,67,79,90,92,86,73,77]}
In [297]: score = pd.DataFrame(dic2)
In [298]: stu_score1 = pd.merge(student, score, on='Name')  # on指定連接依據(jù)
In [299]: print(stu_score1)
   Age Gender  Height     Name  Weight  Score
0   14      M    69.0   Alfred   112.5     88
1   13      F    56.5    Alice    84.0     76
2   13      F    65.3  Barbara    98.0     89
3   14      F    62.8    Carol   102.5     67
4   14      M    63.5    Henry   102.5     79
5   13      M    62.5  Jeffrey    84.0     90
6   14      F    64.3     Judy    90.0     92
7   16      M    72.0   Philip   150.0     86
8   12      M    64.8   Robert   128.0     73
9   15      M    66.5   Willam   112.0     77

merge函數(shù)默認使用的是內(nèi)連接,返回兩張表中共同存在的行。你還可以通過how參數(shù)設置left、right、outer連接方式,左連接保留前一張表的所有行,右鏈接保留后一張表的所有行,不存在的值自動設為NaN。

In [306]: score = pd.merge(student,score,how='left').head(10)
In [307]: score
Out[307]:
    Age Gender  Height     Name  Weight  Score
0    14      M    69.0   Alfred   112.5   88.0
1    13      F    56.5    Alice    84.0   76.0
2    13      F    65.3  Barbara    98.0   89.0
3    14      F    62.8    Carol   102.5   67.0
4    14      M    63.5    Henry   102.5   79.0
5    12      M    57.3    James    83.0    NaN
6    12      F    59.8     Jane    84.5    NaN
7    15      F    62.5    Janet   112.5    NaN
8    13      M    62.5  Jeffrey    84.0   90.0
9    12      M    59.0     John    99.5    NaN

利用Pandas進行缺失值的處理

工作場景中我們在整理數(shù)據(jù)時需要面對缺失值處理的場景。常見的缺失值處理的方法分為三種:刪除法、填補法和插值法,這里就介紹一些簡單的刪除法和填補法。在刪除法中,dropna函數(shù)直接刪除含有NaN的行(可用axis=1刪除列,thresh=n刪除小于n個非空值的行,how=‘a(chǎn)ll’ 刪除所有值都為NaN的那些行)

In [325]: df = pd.DataFrame([[1,1,2],[3,5,np.nan],[13,21,34],[55,np.nan,10],[np.nan,np.nan,
     ...: np.nan],[np.nan,1,2]],columns=('x1','x2','x3'))
In [326]: print(df)
     x1    x2    x3
0   1.0   1.0   2.0
1   3.0   5.0   NaN
2  13.0  21.0  34.0
3  55.0   NaN  10.0
4   NaN   NaN   NaN
5   NaN   1.0   2.0
In [327]: print(df.dropna()) 
     x1    x2    x3
0   1.0   1.0   2.0
2  13.0  21.0  34.0
In [328]: print(df.dropna(how='all'))
     x1    x2    x3
0   1.0   1.0   2.0
1   3.0   5.0   NaN
2  13.0  21.0  34.0
3  55.0   NaN  10.0
5   NaN   1.0   2.0

填補缺失值可以使用fillna函數(shù),它可以用值替換NaN,也可以用 method='ffill'或'bfill' 來指定用前向或后項填充。但是在大部分情況下我們會依據(jù)情況使用眾數(shù)、中位數(shù)、平均數(shù)等中心度量值來填充空白數(shù)據(jù)。

In [331]: print(df.fillna(0))
     x1    x2    x3
0   1.0   1.0   2.0
1   3.0   5.0   0.0
2  13.0  21.0  34.0
3  55.0   0.0  10.0
4   0.0   0.0   0.0
5   0.0   1.0   2.0
In [332]: print(df.fillna(method='ffill'))
     x1    x2    x3
0   1.0   1.0   2.0
1   3.0   5.0   2.0
2  13.0  21.0  34.0
3  55.0  21.0  10.0
4  55.0  21.0  10.0
5  55.0   1.0   2.0
In [333]: print(df.fillna(method='bfill'))
     x1    x2    x3
0   1.0   1.0   2.0
1   3.0   5.0  34.0
2  13.0  21.0  34.0
3  55.0   1.0  10.0
4   NaN   1.0   2.0
5   NaN   1.0   2.0
In [334]: print(df.fillna({'x1':1, 'x2':2, 'x3':3}))
     x1    x2    x3
0   1.0   1.0   2.0
1   3.0   5.0   3.0
2  13.0  21.0  34.0
3  55.0   2.0  10.0
4   1.0   2.0   3.0
5   1.0   1.0   2.0

isnull(和notnull)函數(shù)可以檢查DataFrame對象中的空(非空)值,返回一個布爾數(shù)組

In [358]: print(df.isnull())
      x1     x2     x3
0  False  False  False
1  False  False   True
2  False  False  False
3  False   True  False
4   True   True   True
5   True  False  False

重命名DataFrame對象的列名用columns屬性,批量更改類名用rename

In [361]: df
Out[361]:
     x1    x2    x3
0   1.0   1.0   2.0
1   3.0   5.0   NaN
2  13.0  21.0  34.0
3  55.0   NaN  10.0
4   NaN   NaN   NaN
5   NaN   1.0   2.0
In [362]: df.columns = ['a', 'b', 'c']
In [363]: df
Out[363]:
      a     b     c
0   1.0   1.0   2.0
1   3.0   5.0   NaN
2  13.0  21.0  34.0
3  55.0   NaN  10.0
4   NaN   NaN   NaN
5   NaN   1.0   2.0
In [364]: df.rename(columns={'a':'aa'})
Out[364]:
     aa     b     c
0   1.0   1.0   2.0
1   3.0   5.0   NaN
2  13.0  21.0  34.0
3  55.0   NaN  10.0
4   NaN   NaN   NaN
5   NaN   1.0   2.0

在Series對象中替換值使用replace函數(shù)

In [359]: s
Out[359]:
a    0
b    1
c    2
d    3
dtype: int32
In [360]: print(s.replace(1, 11))
a     0
b    11
c     2
d     3
dtype: int32

利用Pandas實現(xiàn)Excel的數(shù)據(jù)透視表

Pandas提供了實現(xiàn)數(shù)據(jù)透視表功能的pivot_table函數(shù),通過該函數(shù)可以直觀查看數(shù)據(jù)的聚合(計數(shù)、求和、均值、標準差)情況。該函數(shù)可傳遞的參數(shù)有以下這些pivot_table(data,values=None, # values指定需要聚合的字段
index=None, # 指定某些原始變量作為行索引
columns=None, # 指定哪些離散的分組變量
aggfunc='mean', # 指定相應的聚合函數(shù)
fill_value=None, # 使用一個常數(shù)替代缺失值,默認不替換
margins=False, # 是否進行性行或列的匯總,默認不匯總
dropna=True, # 默認所有觀測為缺失的值
margins_name='All') # 默認行匯總或咧匯總的名稱為‘All’
下面舉個例子,依據(jù)Gender、Age對Height、Weight進行展示

In [336]: table = pd.pivot_table(student, values=['Height', 'Weight'], columns=['Gender','A
     ...: ge'])

In [337]: print(table)
        Gender  Age
Height  F       11      51.300000
                12      58.050000
                13      60.900000
                14      63.550000
                15      64.500000
        M       11      57.500000
                12      60.366667
                13      62.500000
                14      66.250000
                15      66.750000
                16      72.000000
Weight  F       11      50.500000
                12      80.750000
                13      91.000000
                14      96.250000
                15     112.250000
        M       11      85.000000
                12     103.500000
                13      84.000000
                14     107.500000
                15     122.500000
                16     150.000000
dtype: float64

如果想把上述格式變換成更像Excel的堆疊格式可以使用unstack函數(shù)即可。

In [338]: table = pd.pivot_table(student, values=['Height', 'Weight'], columns=['Gender','A
     ...: ge']).unstack()
In [339]: print(table)
Age              11          12    13      14      15     16
       Gender
Height F       51.3   58.050000  60.9   63.55   64.50    NaN
       M       57.5   60.366667  62.5   66.25   66.75   72.0
Weight F       50.5   80.750000  91.0   96.25  112.25    NaN
       M       85.0  103.500000  84.0  107.50  122.50  150.0

多層索引的使用

在Pandas中我們可以在一個軸上設置多個索引,類似于Excel的如下形式



設置的原理十分易懂

In [340]: s = pd.Series(np.arange(1,10),index=[["a","a","a","b","b","c","c","d","d"],[1,2,3
     ...: ,1,2,3,1,2,3]])
In [341]: print(s)
a  1    1
   2    2
   3    3
b  1    4
   2    5
c  3    6
   1    7
d  2    8
   3    9
dtype: int32

選取元素時需要用兩個索引指定一個下標

In [342]: print(s['a'])
1    1
2    2
3    3
dtype: int32
In [343]: print(s['a', 2])
2

DataFrame的層次化索引也類似

In [344]: data = pd.DataFrame(np.random.randint(0,150,size=(8,12)),
     ...:                columns = pd.MultiIndex.from_product([['模擬考','正式考'],
     ...:                                                    ['數(shù)學','語文','英語','物理','
     ...: 化學','生物']]),
     ...:                index = pd.MultiIndex.from_product([['期中','期末'],
     ...:                                                    ['雷軍','李斌'],
     ...:                                                   ['測試一','測試二']]))
     ...:
In [345]: print(data)
           模擬考                           正式考
            數(shù)學   語文   英語   物理   化學   生物   數(shù)學   語文   英語   物理   化學   生
物
期中 雷軍 測試一   42   26  120   94   25  141   45   50   94   44   88   74
      測試二   76  105  138   87  125   55   60   15   37   54   79   56
   李斌 測試一    6   53   97   74   46   65    5  130   67  129    3  109
      測試二  108   22   89   13   53   47   12   81   19   53   93  104
期末 雷軍 測試一   41  136   14    4   33   53  142   43   28   68   88    7
      測試二   47   84  127   28   17  135  116   31  118   49   27   97
   李斌 測試一   30   43   48   24   15    3   26   71   82  104  125   79
      測試二   50  114   43  123  111  106    5   91   34   76   52   82

數(shù)據(jù)IO

Pandas支持多種格式數(shù)據(jù)的讀寫操作,它們的讀函數(shù)分別為read_csv、read_table(從限定分隔符的文件中讀內(nèi)容)、read_excel、read_sql、read_json、read_html、read_clipboard(從粘貼板讀內(nèi)容),這些函數(shù)所用來操作的文件和數(shù)據(jù)格式一目了然。文件的寫操作有函數(shù)to_csv、to_excel、to_sql、to_json這幾個。

In [371]: df.to_csv('E:\\a.csv')
In [372]: file_str = pd.read_csv('E:\\a.csv')
In [373]: print(file_str)
   Unnamed: 0     a     b     c
0           0   1.0   1.0   2.0
1           1   3.0   5.0   NaN
2           2  13.0  21.0  34.0
3           3  55.0   NaN  10.0
4           4   NaN   NaN   NaN
5           5   NaN   1.0   2.0
?著作權歸作者所有,轉載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

相關閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容