Python Data Processing (8): Applying Functions

6. Function Application

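To apply your own functions, or functions from third-party libraries, to pandas objects, use the methods below.

Which method to use depends on whether the function operates on an entire DataFrame or Series, on rows or columns, or on individual elements.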

  1. Table-wise function application: pipe()
  2. Row- or column-wise function application: apply()
  3. Aggregation API: agg() and transform()
  4. Element-wise function application: applymap()

6.1 Table-wise Function Application

Although a DataFrame or Series can be passed directly into a function, consider the pipe() method when functions need to be called in a chain.

First, some setup:

In [142]: def extract_city_name(df):
   .....:     """
   .....:     Chicago, IL -> Chicago for city_name column
   .....:     """
   .....:     df["city_name"] = df["city_and_code"].str.split(",").str.get(0)
   .....:     return df
   .....: 

In [143]: def add_country_name(df, country_name=None):
   .....:     """
   .....:     Chicago -> ChicagoUS for city_and_country column
   .....:     """
   .....:     col = "city_name"
   .....:     df["city_and_country"] = df[col] + country_name
   .....:     return df
   .....: 

In [144]: df_p = pd.DataFrame({"city_and_code": ["Chicago, IL"]})

Both extract_city_name and add_country_name take a DataFrame and return a DataFrame.

Now compare the following:

In [145]: add_country_name(extract_city_name(df_p), country_name="US")
Out[145]: 
  city_and_code city_name city_and_country
0   Chicago, IL   Chicago        ChicagoUS

This is equivalent to:

In [146]: df_p.pipe(extract_city_name).pipe(add_country_name, country_name="US")
Out[146]: 
  city_and_code city_name city_and_country
0   Chicago, IL   Chicago        ChicagoUS

pandas encourages the second style, known as method chaining. pipe makes it easy to use your own functions, or functions from another library, in a method chain alongside pandas' own methods.

In the example above, extract_city_name and add_country_name each take a DataFrame as their first argument. What if the function expects the DataFrame as its second argument, or in some other position?

In that case, pass pipe a tuple of (callable, data_keyword). data_keyword is a string naming one of the callable's parameters, and pipe passes the DataFrame to the callable as the value of that parameter.

For example:

>>> df = pd.DataFrame(np.random.rand(6, 4) * np.random.randint(1, 10), columns=list('ABCD'))

>>> df
          A         B         C         D
0  2.895628  1.021764  3.549697  3.946251
1  3.032729  5.527509  4.111962  4.246071
2  0.587101  4.009382  3.330098  0.671954
3  5.891730  2.829773  3.349024  5.687257
4  2.103148  2.658920  4.398308  2.653573
5  3.576252  2.512895  4.871405  1.283442

>>> def func(a, data, b, c):
...:     return data[a] + b - c

>>> df.query('-2 < B < 3').pipe((func, 'data'), a='A', b=1, c=1)
0    2.895628
3    5.891730
4    2.103148
5    3.576252
Name: A, dtype: float64

Here the DataFrame is passed to func as its second parameter, data, and the other arguments are supplied by keyword.

The same approach works for fitting a regression with statsmodels. Its formula API takes the formula as the first argument and data, a DataFrame, as the second.

In [147]: import statsmodels.formula.api as sm

In [148]: bb = pd.read_csv("data/baseball.csv", index_col="id")

In [149]: (
   .....:     bb.query("h > 0")
   .....:     .assign(ln_h=lambda df: np.log(df.h))
   .....:     .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
   .....:     .fit()
   .....:     .summary()
   .....: )
   .....: 
Out[149]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Wed, 20 Jan 2021   Prob (F-statistic):           3.48e-15
Time:                        11:49:07   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

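Note: when the function has only two parameters, the remaining argument can be passed positionally after the tuple, without its name, as with the formula above. With more than two parameters, pass the others by keyword.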
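The note above can be illustrated with a minimal sketch. The functions scale and shift_and_scale are hypothetical helpers written for this illustration, not part of pandas; the sketch assumes that pipe forwards extra positional arguments in order and supplies the DataFrame by keyword.

```python
import pandas as pd

def scale(factor, data):
    # Hypothetical two-parameter function: the DataFrame is the second parameter.
    return data * factor

def shift_and_scale(offset, data, factor):
    # Hypothetical three-parameter function: the DataFrame sits in the middle.
    return (data + offset) * factor

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# Two parameters: the non-data argument can be passed positionally.
doubled = df.pipe((scale, "data"), 2)

# Three parameters: keywords keep the arguments away from the data slot.
shifted = df.pipe((shift_and_scale, "data"), offset=1, factor=10)

# Passing both extra arguments positionally would fill the "data" position
# with 10 and then collide with the data keyword, raising TypeError.
try:
    df.pipe((shift_and_scale, "data"), 1, 10)
    collided = False
except TypeError:
    collided = True
```

Passing every non-data argument by keyword sidesteps the question entirely, which is probably the safer habit.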

The pipe() method is inspired by Unix pipes and, more recently, by dplyr and magrittr.

The code above looks a lot like the dplyr operations covered earlier in this series, so it is worth comparing the two.

6.2 Row- or Column-wise Function Application

apply() applies an arbitrary function along an axis of a DataFrame (by default, to each column). Like the descriptive statistics methods, it takes an optional axis argument:

In [150]: df.apply(np.mean)
Out[150]: 
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [151]: df.apply(np.mean, axis=1)
Out[151]: 
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

In [152]: df.apply(lambda x: x.max() - x.min())
Out[152]: 
one      1.051928
two      1.632779
three    1.840607
dtype: float64

In [153]: df.apply(np.cumsum)
Out[153]: 
        one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873

In [154]: df.apply(np.exp)
Out[154]: 
        one       two     three
a  4.034899  5.885648       NaN
b  1.409244  6.767440  0.950858
c  2.004201  4.385785  3.412466
d       NaN  1.322262  0.541630

apply() also accepts a method name as a string:

In [155]: df.apply("mean")
Out[155]: 
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [156]: df.apply("mean", axis=1)
Out[156]: 
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

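By default, the return type of the function passed to apply() determines the type of the output:

  • If the function returns a Series, the output is a DataFrame. With axis=1 the labels of the returned Series become the output's columns; with the default axis=0 they become its row index.

  • If the function returns any other type, the output is a Series.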
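The two cases above can be checked with a small sketch. The frame and the lambdas are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]})

# The function returns a Series for each row, so the result is a DataFrame
# whose columns are the labels of the returned Series.
row_summary = df.apply(
    lambda row: pd.Series({"min": row.min(), "max": row.max()}), axis=1
)

# The function returns a scalar for each column, so the result is a Series
# indexed by the original column names.
spread = df.apply(lambda col: col.max() - col.min())
```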

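The result_type parameter overrides this default behavior. It takes effect only when axis=1, that is, when the function is applied to each row, and accepts three values:

  • expand: list-like results are turned into columns
  • reduce: return a Series if possible rather than expanding list-like results; this is the opposite of expand
  • broadcast: results are broadcast to the original shape of the DataFrame, and the original index and columns are retained

These values determine whether list-like return values are expanded into a DataFrame.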
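A minimal sketch of how result_type changes the output for a function that returns a list per row. The frame and the helper pair are made up for illustration, and the reduce option is left out for brevity.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

def pair(row):
    # Hypothetical helper: returns a two-element list for each row.
    return [row["A"], row["B"] * 2]

# Default: each returned list stays a list, so the result is a Series of lists.
as_lists = df.apply(pair, axis=1)

# expand: each list element becomes its own column (labelled 0, 1, ...).
expanded = df.apply(pair, axis=1, result_type="expand")

# broadcast: the result keeps the original shape, index, and column names.
broadcast = df.apply(pair, axis=1, result_type="broadcast")
```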

With a little ingenuity, apply() can answer many questions about a data set. For example, suppose we want the date on which each column reaches its maximum:

In [157]: tsdf = pd.DataFrame(
   .....:     np.random.randn(1000, 3),
   .....:     columns=["A", "B", "C"],
   .....:     index=pd.date_range("1/1/2000", periods=1000),
   .....: )
   .....: 

In [158]: tsdf.apply(lambda x: x.idxmax())
Out[158]: 
A   2000-08-06
B   2001-01-18
C   2001-07-18
dtype: datetime64[ns]

You can also pass additional positional and keyword arguments to apply(). For instance, consider the following function:

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

You can then use it like this:

df.apply(subtract_and_divide, args=(5,), divide=3)
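As a runnable sketch of the call above, using a small made-up frame: args supplies the extra positional argument (sub), and divide is forwarded as a keyword.

```python
import pandas as pd

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

df = pd.DataFrame({"one": [5.0, 8.0], "two": [11.0, 14.0]})

# Each column is passed as x; 5 is bound to sub and 3 to divide.
result = df.apply(subtract_and_divide, args=(5,), divide=3)

# The same computation written directly on the frame, for comparison.
direct = (df - 5) / 3
```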

You can also apply Series methods to each row or column:

In [159]: tsdf
Out[159]: 
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

In [160]: tsdf.apply(pd.Series.interpolate)
Out[160]: 
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 -1.098598 -0.889659  0.092225
2000-01-05 -0.987349 -0.622526  0.321243
2000-01-06 -0.876100 -0.355392  0.550262
2000-01-07 -0.764851 -0.088259  0.779280
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

apply() has a raw parameter, False by default, which means each row or column is converted to a Series before the function is applied.

When set to True, the function receives ndarray objects instead, which can improve performance noticeably when the index is not needed.
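The difference is easy to observe by recording the type that the function receives. This is a small sketch with made-up data; value_range and the two sets are illustrative names only.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

series_seen, array_seen = set(), set()

def value_range(values, seen):
    # Record the type name of what apply() hands us, then reduce.
    seen.add(type(values).__name__)
    return values.max() - values.min()

# raw=False (default): each column arrives as a Series.
with_series = df.apply(lambda v: value_range(v, series_seen))

# raw=True: each column arrives as a NumPy ndarray, without index labels.
with_arrays = df.apply(lambda v: value_range(v, array_seen), raw=True)
```

Both calls compute the same ranges; only the object passed to the function differs, so raw=True suits functions that use nothing but the values.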

6.3 Aggregation API

The aggregation API expresses possibly multiple aggregation operations in a single concise call, through DataFrame.aggregate() or its alias DataFrame.agg().

We use sample data similar to the above:

In [161]: tsdf = pd.DataFrame(
   .....:     np.random.randn(10, 3),
   .....:     columns=["A", "B", "C"],
   .....:     index=pd.date_range("1/1/2000", periods=10),
   .....: )
   .....: 

In [162]: tsdf.iloc[3:7] = np.nan

In [163]: tsdf
Out[163]: 
                   A         B         C
2000-01-01  1.257606  1.004194  0.167574
2000-01-02 -0.749892  0.288112 -0.757304
2000-01-03 -0.207550 -0.298599  0.116018
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.814347 -0.257623  0.869226
2000-01-09 -0.250663 -1.206601  0.896839
2000-01-10  2.169758 -1.333363  0.283157

Using a single function is equivalent to apply(). You can also pass a method name as a string. Both return a Series:

In [164]: tsdf.agg(np.sum)
Out[164]: 
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

In [165]: tsdf.agg("sum")
Out[165]: 
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

In [166]: tsdf.sum()
Out[166]: 
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

A single aggregation on a Series returns a scalar value:

In [167]: tsdf["A"].agg("sum")
Out[167]: 3.033606102414146

6.3.1 Aggregating with multiple functions

You can also pass multiple aggregation functions as a list. The result of each function is a row in the resulting DataFrame, labelled with the function's name.

In [168]: tsdf.agg(["sum"])
Out[168]: 
            A         B        C
sum  3.033606 -1.803879  1.57551

Multiple functions yield multiple rows:

In [169]: tsdf.agg(["sum", "mean"])
Out[169]: 
             A         B         C
sum   3.033606 -1.803879  1.575510
mean  0.505601 -0.300647  0.262585

On a Series, multiple functions return a Series indexed by the function names.

In [170]: tsdf["A"].agg(["sum", "mean"])
Out[170]: 
sum     3.033606
mean    0.505601
Name: A, dtype: float64

Passing a lambda function yields a row named <lambda>:

In [171]: tsdf["A"].agg(["sum", lambda x: x.mean()])
Out[171]: 
sum         3.033606
<lambda>    0.505601
Name: A, dtype: float64

Passing a named function uses that function's name as the label:

In [172]: def mymean(x):
   .....:     return x.mean()
   .....: 

In [173]: tsdf["A"].agg(["sum", mymean])
Out[173]: 
sum       3.033606
mymean    0.505601
Name: A, dtype: float64

6.3.2 Aggregating with a dict

Passing a dict that maps column names to functions lets you choose which function is applied to which column:

In [174]: tsdf.agg({"A": "mean", "B": "sum"})
Out[174]: 
A    0.505601
B   -1.803879
dtype: float64

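Note: on older Python versions, where dicts were unordered, the output order was not guaranteed, and an OrderedDict could be used to make the output order match the input order. Since Python 3.7, plain dicts preserve insertion order, so this matters only on older versions.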

from collections import OrderedDict

If a dict value is a list, the output is a DataFrame. Its index is the set of unique function names, and entries for columns to which a function was not applied are NaN.

In [175]: tsdf.agg({"A": ["mean", "min"], "B": "sum"})
Out[175]: 
             A         B
mean  0.505601       NaN
min  -0.749892       NaN
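sum        NaN -1.803879

6.3.3 Mixed dtypes

When a DataFrame contains mixed dtypes that cannot all be aggregated, .agg() performs only the aggregations that are valid. The output below reflects the pandas version used in this post; newer releases may raise an error instead.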

In [176]: mdf = pd.DataFrame(
   .....:     {
   .....:         "A": [1, 2, 3],
   .....:         "B": [1.0, 2.0, 3.0],
   .....:         "C": ["foo", "bar", "baz"],
   .....:         "D": pd.date_range("20130101", periods=3),
   .....:     }
   .....: )
   .....: 

In [177]: mdf.dtypes
Out[177]: 
A             int64
B           float64
C            object
D    datetime64[ns]
dtype: object

In [178]: mdf.agg(["min", "sum"])
Out[178]: 
     A    B          C          D
min  1  1.0        bar 2013-01-01
sum  6  6.0  foobarbaz        NaT

NaT in the output is not a typo. Like NaN, it marks a missing value; it stands for Not a Time.
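A short sketch of how NaT behaves as a missing value, using a made-up datetime Series. The fill date is arbitrary.

```python
import pandas as pd

# A datetime Series with one missing entry; None becomes NaT.
dates = pd.Series(pd.to_datetime(["2013-01-01", None, "2013-01-03"]))

# NaT is detected by the same missing-value helpers as NaN.
missing_mask = dates.isna()
is_missing = pd.isna(pd.NaT)

# Like NaN, NaT does not compare equal to itself.
nat_equals_itself = pd.NaT == pd.NaT

# It can be filled like any other missing value.
filled = dates.fillna(pd.Timestamp("2013-01-02"))
```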

6.3.4 Custom describe

With .agg() it is easy to build a custom summary similar to the built-in describe function:

In [179]: from functools import partial

In [180]: q_25 = partial(pd.Series.quantile, q=0.25)

In [181]: q_25.__name__ = "25%"

In [182]: q_75 = partial(pd.Series.quantile, q=0.75)

In [183]: q_75.__name__ = "75%"

In [184]: tsdf.agg(["count", "mean", "std", "min", q_25, "median", q_75, "max"])
Out[184]: 
               A         B         C
count   6.000000  6.000000  6.000000
mean    0.505601 -0.300647  0.262585
std     1.103362  0.887508  0.606860
min    -0.749892 -1.333363 -0.757304
25%    -0.239885 -0.979600  0.128907
median  0.303398 -0.278111  0.225365
75%     1.146791  0.151678  0.722709
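max     2.169758  1.004194  0.896839

6.4 Transform API

The transform() method returns an object with the same index (and the same size) as the original.

This API lets you provide multiple operations at once rather than one by one. Its API is very similar to the .agg API.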
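The contrast with aggregation can be sketched on a small made-up frame: transform keeps one output value per input value, agg collapses each column, and, in the pandas versions this sketch assumes, transform rejects a function that reduces.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 60.0]}, index=list("xyz")
)

# transform: one output value per input value, same index as the input.
centered = df.transform(lambda col: col - col.mean())

# agg: each column collapses to a single value.
means = df.agg("mean")

# A reducing function does not preserve the index, so transform refuses it.
try:
    df.transform("mean")
    rejected = False
except ValueError:
    rejected = True
```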

For example, given the following data:

In [185]: tsdf = pd.DataFrame(
   .....:     np.random.randn(10, 3),
   .....:     columns=["A", "B", "C"],
   .....:     index=pd.date_range("1/1/2000", periods=10),
   .....: )
   .....: 

In [186]: tsdf.iloc[3:7] = np.nan

In [187]: tsdf
Out[187]: 
                   A         B         C
2000-01-01 -0.428759 -0.864890 -0.675341
2000-01-02 -0.168731  1.338144 -1.279321
2000-01-03 -1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -1.240447 -0.201052
2000-01-09 -0.157795  0.791197 -1.144209
2000-01-10 -0.030876  0.371900  0.061932

.transform() accepts NumPy functions, function names as strings, and user-defined functions:

In [188]: tsdf.transform(np.abs)
Out[188]: 
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [189]: tsdf.transform("abs")
Out[189]: 
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [190]: tsdf.transform(lambda x: x.abs())
Out[190]: 
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

Here transform() received a single function, so the result is the same as:

In [191]: np.abs(tsdf)
Out[191]: 
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

When .transform() is applied to a Series, a single function returns a Series:

In [192]: tsdf["A"].transform(np.abs)
Out[192]: 
2000-01-01    0.428759
2000-01-02    0.168731
2000-01-03    1.621034
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.254374
2000-01-09    0.157795
2000-01-10    0.030876
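Freq: D, Name: A, dtype: float64

6.4.1 Transforming with multiple functions

Passing multiple functions yields a DataFrame with multi-level column labels. The first level holds the original column names, the second the names of the transforming functions.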

In [193]: tsdf.transform([np.abs, lambda x: x + 1])
Out[193]: 
                   A                   B                   C          
            absolute  <lambda>  absolute  <lambda>  absolute  <lambda>
2000-01-01  0.428759  0.571241  0.864890  0.135110  0.675341  0.324659
2000-01-02  0.168731  0.831269  1.338144  2.338144  1.279321 -0.279321
2000-01-03  1.621034 -0.621034  0.438107  1.438107  0.903794  1.903794
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.254374  1.254374  1.240447 -0.240447  0.201052  0.798948
2000-01-09  0.157795  0.842205  0.791197  1.791197  1.144209 -0.144209
2000-01-10  0.030876  0.969124  0.371900  1.371900  0.061932  1.061932
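Once the columns are multi-level, individual results can be picked out by label. The sketch below uses two named helper functions (absolute and plus_one, both hypothetical) so that the second-level labels are predictable.

```python
import pandas as pd

def absolute(x):
    # Works for both a scalar and a Series.
    return abs(x)

def plus_one(x):
    return x + 1

df = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})
result = df.transform([absolute, plus_one])

# Select one result by (original column, function name).
a_plus_one = result[("A", "plus_one")]

# Select every result computed for one original column.
b_block = result["B"]

# Take the cross-section for one function across all original columns.
all_abs = result.xs("absolute", axis=1, level=1)
```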

Passing multiple functions to a Series yields a DataFrame whose column names are the names of the transforming functions:

In [194]: tsdf["A"].transform([np.abs, lambda x: x + 1])
Out[194]: 
            absolute  <lambda>
2000-01-01  0.428759  0.571241
2000-01-02  0.168731  0.831269
2000-01-03  1.621034 -0.621034
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374  1.254374
2000-01-09  0.157795  0.842205
2000-01-10  0.030876  0.969124

6.4.2 Transforming with a dict

Passing a dict of functions allows selective transformation per column:

In [195]: tsdf.transform({"A": np.abs, "B": lambda x: x + 1})
Out[195]: 
                   A         B
2000-01-01  0.428759  0.135110
2000-01-02  0.168731  2.338144
2000-01-03  1.621034  1.438107
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374 -0.240447
2000-01-09  0.157795  1.791197
2000-01-10  0.030876  1.371900

Passing a dict whose values include lists yields a DataFrame with multi-level column labels for those selective transformations:

In [196]: tsdf.transform({"A": np.abs, "B": [lambda x: x + 1, "sqrt"]})
Out[196]: 
                   A         B          
                   A  <lambda>      sqrt
2000-01-01  0.428759  0.135110       NaN
2000-01-02  0.168731  2.338144  1.156782
2000-01-03  1.621034  1.438107  0.661897
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -0.240447       NaN
2000-01-09  0.157795  1.791197  0.889493
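2000-01-10  0.030876  1.371900  0.609836

6.5 Element-wise Function Application

Not all functions can be vectorized, that is, accept NumPy arrays and return another array or value.

For this reason, applymap() on DataFrame and map() on Series accept any Python function that takes a single value and returns a single value. (In recent pandas releases, applymap() is deprecated in favor of DataFrame.map().) For example: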

In [197]: df4
Out[197]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [198]: def f(x):
   .....:     return len(str(x))
   .....: 

In [199]: df4["one"].map(f)
Out[199]: 
a    18
b    19
c    18
d     3
Name: one, dtype: int64

In [200]: df4.applymap(f)
Out[200]: 
   one  two  three
a   18   17      3
b   19   18     20
c   18   18     16
d    3   19     19
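Because newer pandas releases rename DataFrame.applymap to DataFrame.map, a version-tolerant call can be sketched as follows. The hasattr check is an assumption about how to detect the newer method, and the frame and text_length are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.5, 22.25], "two": [333.0, 4.125]})

def text_length(x):
    # Length of the value's string form, applied to one element at a time.
    return len(str(x))

# Assumption: DataFrame.map exists only in newer pandas releases (2.1+),
# where applymap is deprecated; older releases fall back to applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
lengths = elementwise(text_length)
```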

Series.map() has an additional feature: it can link or map values through a second Series. This is closely related to the merging and joining operations covered later.

In [201]: s = pd.Series(
   .....:     ["six", "seven", "six", "seven", "six"], index=["a", "b", "c", "d", "e"]
   .....: )
   .....: 

In [202]: t = pd.Series({"six": 6.0, "seven": 7.0})

In [203]: s
Out[203]: 
a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [204]: s.map(t)
Out[204]: 
a    6.0
b    7.0
c    6.0
d    7.0
e    6.0
dtype: float64

Each value in s is replaced by the value that the mapping in t defines for it.
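Series.map() also accepts a plain dict as the mapping. The sketch below, with made-up values, shows what happens to a value that the mapping does not cover.

```python
import pandas as pd

s = pd.Series(["six", "seven", "eight"], index=["a", "b", "c"])

# Values found in the dict are replaced; "eight" has no entry and becomes NaN.
mapped = s.map({"six": 6.0, "seven": 7.0})

# The gaps can then be handled like any other missing values.
filled = mapped.fillna(-1.0)
```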
