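6. Function application
To apply your own functions, or functions from another library, to pandas objects, use one of the methods below.
Which one is appropriate depends on whether the function operates on an entire DataFrame or Series, on rows or columns, or on individual elements:
- Table-wise function application: pipe()
- Row- or column-wise function application: apply()
- Aggregation and transformation: agg() and transform()
- Element-wise function application: applymap()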
6.1 Table-wise function application
A DataFrame or Series can simply be passed into a function, but when functions need to be called in a chain, consider using the pipe() method.
First, some setup:
In [142]: def extract_city_name(df):
.....:     """
.....:     Chicago, IL -> Chicago for city_name column
.....:     """
.....:     df["city_name"] = df["city_and_code"].str.split(",").str.get(0)
.....:     return df
.....:
In [143]: def add_country_name(df, country_name=None):
.....:     """
.....:     Chicago -> Chicago-US for city_name column
.....:     """
.....:     col = "city_name"
.....:     df["city_and_country"] = df[col] + country_name
.....:     return df
.....:
In [144]: df_p = pd.DataFrame({"city_and_code": ["Chicago, IL"]})
Both extract_city_name and add_country_name take a DataFrame and return a DataFrame.
Now compare the following two calls:
In [145]: add_country_name(extract_city_name(df_p), country_name="US")
Out[145]:
city_and_code city_name city_and_country
0 Chicago, IL Chicago ChicagoUS
This is equivalent to
In [146]: df_p.pipe(extract_city_name).pipe(add_country_name, country_name="US")
Out[146]:
city_and_code city_name city_and_country
0 Chicago, IL Chicago ChicagoUS
pandas encourages the second style, known as method chaining. pipe() makes it easy to use your own functions, or functions from another library, in a method chain alongside pandas methods.
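One thing to keep in mind is that the two helper functions above assign new columns on the frame they receive, so df_p itself is modified by the chain. If that is not wanted, a possible variation is to build each step on DataFrame.assign(), which returns a new frame. The sketch below is only an illustration; the names extract_city_name_copy and add_country_name_copy are hypothetical and are not part of the original example.

```python
import pandas as pd


def extract_city_name_copy(df):
    # Hypothetical non-mutating variant: assign() returns a new DataFrame.
    return df.assign(city_name=df["city_and_code"].str.split(",").str.get(0))


def add_country_name_copy(df, country_name=""):
    # Hypothetical non-mutating variant of add_country_name.
    return df.assign(city_and_country=df["city_name"] + country_name)


df_p = pd.DataFrame({"city_and_code": ["Chicago, IL"]})

result = (
    df_p.pipe(extract_city_name_copy)
    .pipe(add_country_name_copy, country_name="US")
)

print(result)
print(list(df_p.columns))  # the input frame still has only its original column
```

Because every step returns a fresh object, the chain can be re-run on df_p without picking up columns left over from a previous run.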
In the example above, extract_city_name and add_country_name both expect a DataFrame as their first positional argument. What if the function expects the DataFrame as its second argument, or at some other position?
In that case, pass pipe() a (callable, data_keyword) tuple, where data_keyword is a string naming one of the callable's parameters. pipe() passes the DataFrame to the function as the value of that parameter.
For example:
>>> df = pd.DataFrame(np.random.rand(6, 4) * np.random.randint(1, 10), columns=list('ABCD'))
>>> df
A B C D
0 2.895628 1.021764 3.549697 3.946251
1 3.032729 5.527509 4.111962 4.246071
2 0.587101 4.009382 3.330098 0.671954
3 5.891730 2.829773 3.349024 5.687257
4 2.103148 2.658920 4.398308 2.653573
5 3.576252 2.512895 4.871405 1.283442
>>> def func(a, data, b, c):
...:     return data[a] + b - c
>>> df.query('-2 < B < 3').pipe((func, 'data'), a='A', b=1, c=1)
0 2.895628
3 5.891730
4 2.103148
5 3.576252
Name: A, dtype: float64
Here the DataFrame is passed as func's second parameter, data, and the remaining arguments are given by keyword.
The same approach works for fitting a regression with statsmodels. Its formula API takes the formula as the first argument and data, a DataFrame, as the second.
In [147]: import statsmodels.formula.api as sm
In [148]: bb = pd.read_csv("data/baseball.csv", index_col="id")
In [149]: (
.....: bb.query("h > 0")
.....: .assign(ln_h=lambda df: np.log(df.h))
.....: .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
.....: .fit()
.....: .summary()
.....: )
.....:
Out[149]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: hr R-squared: 0.685
Model: OLS Adj. R-squared: 0.665
Method: Least Squares F-statistic: 34.28
Date: Wed, 20 Jan 2021 Prob (F-statistic): 3.48e-15
Time: 11:49:07 Log-Likelihood: -205.92
No. Observations: 68 AIC: 421.8
Df Residuals: 63 BIC: 432.9
Df Model: 4
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept -8484.7720 4664.146 -1.819 0.074 -1.78e+04 835.780
C(lg)[T.NL] -2.2736 1.325 -1.716 0.091 -4.922 0.375
ln_h -1.3542 0.875 -1.547 0.127 -3.103 0.395
year 4.2277 2.324 1.819 0.074 -0.417 8.872
g 0.1841 0.029 6.258 0.000 0.125 0.243
==============================================================================
Omnibus: 10.875 Durbin-Watson: 1.999
Prob(Omnibus): 0.004 Jarque-Bera (JB): 17.298
Skew: 0.537 Prob(JB): 0.000175
Kurtosis: 5.225 Cond. No. 1.49e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
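Note: positional arguments given after the tuple are forwarded to the function in order. That is why the formula above can be passed without its parameter name: it comes before data in the signature. Any parameter that comes after the data parameter must be passed by keyword, as b and c were in the func example.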
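To make the positional-versus-keyword point concrete, here is a small sketch that reuses the shape of func from the earlier example on a made-up two-row frame (the data is an assumption for illustration only). Since pipe() supplies the DataFrame as a keyword argument, a positional value that lands in the data slot collides with it.

```python
import pandas as pd


def func(a, data, b, c):
    return data[a] + b - c


df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# `a` comes before `data` in the signature, so it can be positional.
out = df.pipe((func, "data"), "A", b=1, c=1)

# Passing b and c positionally as well puts a value in the `data` slot,
# which clashes with the data=df keyword that pipe() adds.
try:
    df.pipe((func, "data"), "A", 1, 1)
except TypeError as exc:
    error = str(exc)

print(out)
print(error)
```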
pipe() is inspired by Unix pipes and, more recently, by dplyr and magrittr.
The code above should look a lot like the dplyr operations introduced earlier; it is worth comparing the two.
6.2 Row- or column-wise function application
The apply() method applies an arbitrary function along an axis of a DataFrame (by default, to each column). Like the descriptive statistics methods, it takes an optional axis argument that selects the axis.
In [150]: df.apply(np.mean)
Out[150]:
one 0.811094
two 1.360588
three 0.187958
dtype: float64
In [151]: df.apply(np.mean, axis=1)
Out[151]:
a 1.583749
b 0.734929
c 1.133683
d -0.166914
dtype: float64
In [152]: df.apply(lambda x: x.max() - x.min())
Out[152]:
one 1.051928
two 1.632779
three 1.840607
dtype: float64
In [153]: df.apply(np.cumsum)
Out[153]:
one two three
a 1.394981 1.772517 NaN
b 1.738035 3.684640 -0.050390
c 2.433281 5.163008 1.177045
d NaN 5.442353 0.563873
In [154]: df.apply(np.exp)
Out[154]:
one two three
a 4.034899 5.885648 NaN
b 1.409244 6.767440 0.950858
c 2.004201 4.385785 3.412466
d NaN 1.322262 0.541630
apply() also accepts the name of a method as a string:
In [155]: df.apply("mean")
Out[155]:
one 0.811094
two 1.360588
three 0.187958
dtype: float64
In [156]: df.apply("mean", axis=1)
Out[156]:
a 1.583749
b 0.734929
c 1.133683
d -0.166914
dtype: float64
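By default, the return type of the function passed to apply() determines the type of the result.
If the function returns a Series, the result is a DataFrame; when applying row-wise (axis=1), its columns match the index of the returned Series. If the function returns any other type, the result is a Series.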
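As a rough illustration of this inference, the sketch below applies one function that returns a scalar and one that returns a Series. The frame and the "min"/"max" labels are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]})

# A scalar per column gives a Series indexed by the column names.
ranges = df.apply(lambda col: col.max() - col.min())

# A Series per row gives a DataFrame; the labels of the returned Series
# ("min" and "max") become the column names of the result.
stats = df.apply(
    lambda row: pd.Series({"min": row.min(), "max": row.max()}),
    axis=1,
)

print(type(ranges).__name__)
print(stats)
```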
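The result_type argument overrides this default behavior. It only takes effect when axis=1, that is, when the function is applied to each row. It accepts three values:
- expand: list-like results are turned into columns
- reduce: return a Series if possible rather than expanding list-like results; this is the opposite of expand
- broadcast: results are broadcast to the original shape of the DataFrame, and the original index and columns are retained
These values determine whether the return value is expanded into a DataFrame.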
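The difference is easiest to see on a function that returns a list for each row. The following is a minimal sketch on a made-up 2x2 frame; it covers the default behavior, expand, and broadcast, and leaves reduce out.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Default: each returned list is kept as a single object, giving a Series of lists.
as_lists = df.apply(lambda row: [row["A"], row["B"] * 2], axis=1)

# expand: the list elements are spread across columns (labeled 0, 1, ...).
expanded = df.apply(
    lambda row: [row["A"], row["B"] * 2], axis=1, result_type="expand"
)

# broadcast: the result is fitted back into the original shape,
# keeping the original index and column names.
broadcast = df.apply(lambda row: [0, 1], axis=1, result_type="broadcast")

print(as_lists)
print(expanded)
print(broadcast)
```

Note that broadcast requires the returned list to have as many elements as the frame has columns.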
With a little ingenuity, apply() can answer many questions about a data set. For example, suppose we want the date on which the maximum value occurred in each column:
In [157]: tsdf = pd.DataFrame(
.....: np.random.randn(1000, 3),
.....: columns=["A", "B", "C"],
.....: index=pd.date_range("1/1/2000", periods=1000),
.....: )
.....:
In [158]: tsdf.apply(lambda x: x.idxmax())
Out[158]:
A 2000-08-06
B 2001-01-18
C 2001-07-18
dtype: datetime64[ns]
You can also pass additional positional and keyword arguments to apply(). For example, given the following function:
def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide
you can apply it like this:
df.apply(subtract_and_divide, args=(5,), divide=3)
Series methods can also be applied to each row or column:
In [159]: tsdf
Out[159]:
A B C
2000-01-01 -0.158131 -0.232466 0.321604
2000-01-02 -1.810340 -3.105758 0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -0.653602 0.178875 1.008298
2000-01-09 1.007996 0.462824 0.254472
2000-01-10 0.307473 0.600337 1.643950
In [160]: tsdf.apply(pd.Series.interpolate)
Out[160]:
A B C
2000-01-01 -0.158131 -0.232466 0.321604
2000-01-02 -1.810340 -3.105758 0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 -1.098598 -0.889659 0.092225
2000-01-05 -0.987349 -0.622526 0.321243
2000-01-06 -0.876100 -0.355392 0.550262
2000-01-07 -0.764851 -0.088259 0.779280
2000-01-08 -0.653602 0.178875 1.008298
2000-01-09 1.007996 0.462824 0.254472
2000-01-10 0.307473 0.600337 1.643950
apply() has a raw argument that defaults to False, meaning that each row or column is converted to a Series before the function is applied.
When set to True, the function receives ndarray objects instead, which can noticeably improve performance when the index is not needed.
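The sketch below shows what the applied function receives in each mode. The helper names peak_to_peak and record_type are hypothetical, and the frame is made up. No timings are shown, since any speed-up depends on the data and on the function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["A", "B", "C"])


def peak_to_peak(values):
    # Works for both a Series and an ndarray, because it uses no index.
    return values.max() - values.min()


with_series = df.apply(peak_to_peak)            # function receives a Series
with_arrays = df.apply(peak_to_peak, raw=True)  # function receives an ndarray

seen_types = []


def record_type(values):
    # Record the type of object that apply() hands to the function.
    seen_types.append(type(values).__name__)
    return values.sum()


df.apply(record_type, raw=True)

print(with_arrays)
print(set(seen_types))
```

A function that relies on labels, such as one calling idxmax(), would not work with raw=True because an ndarray carries no index.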
6.3 Aggregation API
The aggregation API expresses possibly multiple aggregation operations in a single concise call, through DataFrame.aggregate() or its alias DataFrame.agg().
We use data similar to the example above:
In [161]: tsdf = pd.DataFrame(
.....: np.random.randn(10, 3),
.....: columns=["A", "B", "C"],
.....: index=pd.date_range("1/1/2000", periods=10),
.....: )
.....:
In [162]: tsdf.iloc[3:7] = np.nan
In [163]: tsdf
Out[163]:
A B C
2000-01-01 1.257606 1.004194 0.167574
2000-01-02 -0.749892 0.288112 -0.757304
2000-01-03 -0.207550 -0.298599 0.116018
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.814347 -0.257623 0.869226
2000-01-09 -0.250663 -1.206601 0.896839
2000-01-10 2.169758 -1.333363 0.283157
With a single function, the result is the same as with apply(). You can also pass the method name as a string. Either way, the result is a Series:
In [164]: tsdf.agg(np.sum)
Out[164]:
A 3.033606
B -1.803879
C 1.575510
dtype: float64
In [165]: tsdf.agg("sum")
Out[165]:
A 3.033606
B -1.803879
C 1.575510
dtype: float64
In [166]: tsdf.sum()
Out[166]:
A 3.033606
B -1.803879
C 1.575510
dtype: float64
A single aggregation on a Series returns a scalar:
In [167]: tsdf["A"].agg("sum")
Out[167]: 3.033606102414146
6.3.1 Aggregating with multiple functions
You can pass several aggregation functions as a list. Each function's result becomes one row of the resulting DataFrame, labeled with the function's name.
In [168]: tsdf.agg(["sum"])
Out[168]:
A B C
sum 3.033606 -1.803879 1.57551
Multiple functions yield multiple rows:
In [169]: tsdf.agg(["sum", "mean"])
Out[169]:
A B C
sum 3.033606 -1.803879 1.575510
mean 0.505601 -0.300647 0.262585
On a Series, multiple functions return a Series indexed by the function names:
In [170]: tsdf["A"].agg(["sum", "mean"])
Out[170]:
sum 3.033606
mean 0.505601
Name: A, dtype: float64
Passing a lambda function yields a row named <lambda>:
In [171]: tsdf["A"].agg(["sum", lambda x: x.mean()])
Out[171]:
sum 3.033606
<lambda> 0.505601
Name: A, dtype: float64
Passing a named function yields a row labeled with that function's name:
In [172]: def mymean(x):
.....:     return x.mean()
.....:
In [173]: tsdf["A"].agg(["sum", mymean])
Out[173]:
sum 3.033606
mymean 0.505601
Name: A, dtype: float64
6.3.2 Aggregating with a dict
Pass a dict mapping column names to functions to choose which function is applied to each column:
In [174]: tsdf.agg({"A": "mean", "B": "sum"})
Out[174]:
A 0.505601
B -1.803879
dtype: float64
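Note: the order of the output is only as reliable as the order of the dict. Plain dicts preserve insertion order on Python 3.7 and later; on older versions, use an OrderedDict if the output order needs to match the input order:
from collections import OrderedDict
If a value in the dict is a list, the result is a DataFrame. Its index consists of the unique function names, and entries for columns to which a function was not applied are NaN: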
In [175]: tsdf.agg({"A": ["mean", "min"], "B": "sum"})
Out[175]:
A B
mean 0.505601 NaN
min -0.749892 NaN
sum NaN -1.803879
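6.3.3 Mixed dtypes
When a DataFrame contains mixed dtypes that cannot all be aggregated, .agg() only takes the valid aggregations. This is the behavior of the pandas version that produced the output below; other versions may behave differently.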
In [176]: mdf = pd.DataFrame(
.....: {
.....: "A": [1, 2, 3],
.....: "B": [1.0, 2.0, 3.0],
.....: "C": ["foo", "bar", "baz"],
.....: "D": pd.date_range("20130101", periods=3),
.....: }
.....: )
.....:
In [177]: mdf.dtypes
Out[177]:
A int64
B float64
C object
D datetime64[ns]
dtype: object
In [178]: mdf.agg(["min", "sum"])
Out[178]:
A B C D
min 1 1.0 bar 2013-01-01
sum 6 6.0 foobarbaz NaT
The NaT in the output is not a typo. Like NaN, it marks a missing value; it stands for Not a Time.
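As a quick check that NaT behaves like other missing values, the following sketch builds a made-up datetime Series with one missing entry and tests it with the usual isna() tools.

```python
import pandas as pd

# None is converted to NaT when the values are parsed as datetimes.
dates = pd.Series(pd.to_datetime(["2013-01-01", None]))

missing = dates.isna()   # NaT is reported as missing, just like NaN
earliest = dates.min()   # reductions skip NaT by default

print(missing)
print(earliest)
```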
6.3.4 Custom describe
With .agg() it is easy to build a custom summary similar to the built-in describe():
In [179]: from functools import partial
In [180]: q_25 = partial(pd.Series.quantile, q=0.25)
In [181]: q_25.__name__ = "25%"
In [182]: q_75 = partial(pd.Series.quantile, q=0.75)
In [183]: q_75.__name__ = "75%"
In [184]: tsdf.agg(["count", "mean", "std", "min", q_25, "median", q_75, "max"])
Out[184]:
A B C
count 6.000000 6.000000 6.000000
mean 0.505601 -0.300647 0.262585
std 1.103362 0.887508 0.606860
min -0.749892 -1.333363 -0.757304
25% -0.239885 -0.979600 0.128907
median 0.303398 -0.278111 0.225365
75% 1.146791 0.151678 0.722709
max 2.169758 1.004194 0.896839
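6.4 Transform API
The transform() method returns an object that is indexed the same as the original (and has the same size).
This API lets you supply multiple operations at once rather than one at a time. Its API is very similar to that of .agg().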
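To contrast the two APIs before going through the details, here is a minimal sketch on a made-up frame: agg() collapses each column to one value, while transform() keeps the original index and shape.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

reduced = df.agg("sum")            # one value per column
same_shape = df.transform("abs")   # same index and columns as df

print(reduced)
print(same_shape)
```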
For example, take the following data:
In [185]: tsdf = pd.DataFrame(
.....: np.random.randn(10, 3),
.....: columns=["A", "B", "C"],
.....: index=pd.date_range("1/1/2000", periods=10),
.....: )
.....:
In [186]: tsdf.iloc[3:7] = np.nan
In [187]: tsdf
Out[187]:
A B C
2000-01-01 -0.428759 -0.864890 -0.675341
2000-01-02 -0.168731 1.338144 -1.279321
2000-01-03 -1.621034 0.438107 0.903794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 -1.240447 -0.201052
2000-01-09 -0.157795 0.791197 -1.144209
2000-01-10 -0.030876 0.371900 0.061932
.transform() accepts a NumPy function, a method name as a string, or a user-defined function:
In [188]: tsdf.transform(np.abs)
Out[188]:
A B C
2000-01-01 0.428759 0.864890 0.675341
2000-01-02 0.168731 1.338144 1.279321
2000-01-03 1.621034 0.438107 0.903794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 1.240447 0.201052
2000-01-09 0.157795 0.791197 1.144209
2000-01-10 0.030876 0.371900 0.061932
In [189]: tsdf.transform("abs")
Out[189]:
A B C
2000-01-01 0.428759 0.864890 0.675341
2000-01-02 0.168731 1.338144 1.279321
2000-01-03 1.621034 0.438107 0.903794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 1.240447 0.201052
2000-01-09 0.157795 0.791197 1.144209
2000-01-10 0.030876 0.371900 0.061932
In [190]: tsdf.transform(lambda x: x.abs())
Out[190]:
A B C
2000-01-01 0.428759 0.864890 0.675341
2000-01-02 0.168731 1.338144 1.279321
2000-01-03 1.621034 0.438107 0.903794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 1.240447 0.201052
2000-01-09 0.157795 0.791197 1.144209
2000-01-10 0.030876 0.371900 0.061932
Here transform() received a single function, which is equivalent to applying the ufunc directly:
In [191]: np.abs(tsdf)
Out[191]:
A B C
2000-01-01 0.428759 0.864890 0.675341
2000-01-02 0.168731 1.338144 1.279321
2000-01-03 1.621034 0.438107 0.903794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 1.240447 0.201052
2000-01-09 0.157795 0.791197 1.144209
2000-01-10 0.030876 0.371900 0.061932
When .transform() is applied to a Series, it returns a Series:
In [192]: tsdf["A"].transform(np.abs)
Out[192]:
2000-01-01 0.428759
2000-01-02 0.168731
2000-01-03 1.621034
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 0.254374
2000-01-09 0.157795
2000-01-10 0.030876
Freq: D, Name: A, dtype: float64
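6.4.1 Transforming with multiple functions
Passing multiple functions yields a DataFrame with multi-level column names. The first level holds the original column names; the second level holds the names of the transforming functions.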
In [193]: tsdf.transform([np.abs, lambda x: x + 1])
Out[193]:
A B C
absolute <lambda> absolute <lambda> absolute <lambda>
2000-01-01 0.428759 0.571241 0.864890 0.135110 0.675341 0.324659
2000-01-02 0.168731 0.831269 1.338144 2.338144 1.279321 -0.279321
2000-01-03 1.621034 -0.621034 0.438107 1.438107 0.903794 1.903794
2000-01-04 NaN NaN NaN NaN NaN NaN
2000-01-05 NaN NaN NaN NaN NaN NaN
2000-01-06 NaN NaN NaN NaN NaN NaN
2000-01-07 NaN NaN NaN NaN NaN NaN
2000-01-08 0.254374 1.254374 1.240447 -0.240447 0.201052 0.798948
2000-01-09 0.157795 0.842205 0.791197 1.791197 1.144209 -0.144209
2000-01-10 0.030876 0.969124 0.371900 1.371900 0.061932 1.061932
Passing multiple functions to a Series yields a DataFrame whose column names are the names of the transforming functions:
In [194]: tsdf["A"].transform([np.abs, lambda x: x + 1])
Out[194]:
absolute <lambda>
2000-01-01 0.428759 0.571241
2000-01-02 0.168731 0.831269
2000-01-03 1.621034 -0.621034
2000-01-04 NaN NaN
2000-01-05 NaN NaN
2000-01-06 NaN NaN
2000-01-07 NaN NaN
2000-01-08 0.254374 1.254374
2000-01-09 0.157795 0.842205
2000-01-10 0.030876 0.969124
6.4.2 Transforming with a dict
Passing a dict of functions allows a selective transform per column:
In [195]: tsdf.transform({"A": np.abs, "B": lambda x: x + 1})
Out[195]:
A B
2000-01-01 0.428759 0.135110
2000-01-02 0.168731 2.338144
2000-01-03 1.621034 1.438107
2000-01-04 NaN NaN
2000-01-05 NaN NaN
2000-01-06 NaN NaN
2000-01-07 NaN NaN
2000-01-08 0.254374 -0.240447
2000-01-09 0.157795 1.791197
2000-01-10 0.030876 1.371900
Passing a dict that contains lists generates a DataFrame with multi-level column names holding these selective transforms.
In [196]: tsdf.transform({"A": np.abs, "B": [lambda x: x + 1, "sqrt"]})
Out[196]:
A B
A <lambda> sqrt
2000-01-01 0.428759 0.135110 NaN
2000-01-02 0.168731 2.338144 1.156782
2000-01-03 1.621034 1.438107 0.661897
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 -0.240447 NaN
2000-01-09 0.157795 1.791197 0.889493
2000-01-10 0.030876 1.371900 0.609836
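6.5 Element-wise function application
Not every function can be vectorized, that is, accept a NumPy array and return another array or value.
For such cases, the applymap() method on DataFrame and the map() method on Series accept any Python function that takes a single value and returns a single value. For example: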
In [197]: df4
Out[197]:
one two three
a 1.394981 1.772517 NaN
b 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
d NaN 0.279344 -0.613172
In [198]: def f(x):
.....:     return len(str(x))
.....:
In [199]: df4["one"].map(f)
Out[199]:
a 18
b 19
c 18
d 3
Name: one, dtype: int64
In [200]: df4.applymap(f)
Out[200]:
one two three
a 18 17 3
b 19 18 20
c 18 18 16
d 3 19 19
Series.map() has one more feature: it can link or map values through a second Series. This is closely related to the join operations covered later.
In [201]: s = pd.Series(
.....: ["six", "seven", "six", "seven", "six"], index=["a", "b", "c", "d", "e"]
.....: )
.....:
In [202]: t = pd.Series({"six": 6.0, "seven": 7.0})
In [203]: s
Out[203]:
a six
b seven
c six
d seven
e six
dtype: object
In [204]: s.map(t)
Out[204]:
a 6.0
b 7.0
c 6.0
d 7.0
e 6.0
dtype: float64
Each value in s is replaced by the value that the mapping in t defines for it.
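A plain dict can serve as the mapping as well. The sketch below, on made-up data, also shows what happens to a value that the mapping does not cover: it becomes NaN rather than raising an error.

```python
import pandas as pd

s = pd.Series(["six", "seven", "eight"], index=["a", "b", "c"])

# "eight" has no entry in the mapping, so its result is NaN.
mapped = s.map({"six": 6.0, "seven": 7.0})

print(mapped)
```

Checking the result with isna() afterwards is a simple way to spot values that the mapping missed.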