Pandas - 10.2 Transform and Filter

transform

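Unlike aggregation, which reduces each group to a single value, a transformation returns as many values as it receives. Standardization is a typical example: the number of rows stays the same, and each value is standardized within its own group.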

import pandas as pd
df = pd.read_csv('data/gapminder.tsv', sep='\t')

def my_zscore(x):
    return ((x - x.mean())/x.std())

transform_z = df.groupby('year').lifeExp.transform(my_zscore)
print(transform_z.shape) # (1704,)
print(df.shape) # (1704, 6)
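
To make the shape difference concrete, here is a minimal sketch on a small made-up frame. The toy data and its values are assumptions for illustration and are not taken from gapminder. Aggregating gives one value per group, while transform gives one value per input row.

```python
import pandas as pd

# Hypothetical toy data: two years, three observations each
toy = pd.DataFrame({
    'year': [1952, 1952, 1952, 2007, 2007, 2007],
    'lifeExp': [30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})

def my_zscore(x):
    return (x - x.mean()) / x.std()

# Aggregation: one value per group
agg_mean = toy.groupby('year').lifeExp.mean()
# Transformation: one value per input row, aligned with the original index
toy_z = toy.groupby('year').lifeExp.transform(my_zscore)

print(agg_mean.shape)            # (2,)
print(toy_z.shape)               # (6,)
print(toy_z.round(6).tolist())   # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

Because the result keeps the original index, it can be assigned straight back to the frame as a new column.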

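Compare grouped and ungrouped standardization. The two grouped results are close to each other; they differ slightly because the pandas std() method defaults to ddof=1 while scipy's zscore defaults to ddof=0. The ungrouped result is very different.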

from scipy.stats import zscore

sp_z_grouped = df.groupby('year').lifeExp.transform(zscore)
sp_z_nogroup = zscore(df.lifeExp)

print(transform_z.head())
'''
0   -1.656854
1   -1.731249
2   -1.786543
3   -1.848157
4   -1.894173
Name: lifeExp, dtype: float64
'''

print(sp_z_grouped.head())
'''
0   -1.662719
1   -1.737377
2   -1.792867
3   -1.854699
4   -1.900878
Name: lifeExp, dtype: float64
'''

print(sp_z_nogroup[:5])
# [-2.37533395 -2.25677417 -2.1278375  -1.97117751 -1.81103275]
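
The small gap between the two grouped results can be reproduced without scipy. The sketch below assumes a made-up toy frame and two hypothetical helper functions, zscore_sample and zscore_population, which differ only in the ddof passed to std(). With n observations per group, the two z-scores should differ by a factor of sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two years, three observations each
toy = pd.DataFrame({
    'year': [1952, 1952, 1952, 2007, 2007, 2007],
    'lifeExp': [30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})

def zscore_sample(x):
    # pandas default: sample standard deviation (ddof=1)
    return (x - x.mean()) / x.std()

def zscore_population(x):
    # population standard deviation (ddof=0), the scipy.stats.zscore default
    return (x - x.mean()) / x.std(ddof=0)

grouped = toy.groupby('year').lifeExp
z_sample = grouped.transform(zscore_sample)
z_population = grouped.transform(zscore_population)

# Each group has n = 3 rows, so the expected factor is sqrt(3 / 2)
ratio = z_population.iloc[0] / z_sample.iloc[0]
print(np.isclose(ratio, np.sqrt(3 / 2)))  # True
```

Passing ddof=1 to scipy's zscore, or ddof=0 to std() in my_zscore, should bring the two grouped results into agreement.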

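Take missing-value imputation as an example: fill each missing value with the mean of its group rather than the mean of the whole dataset. If men and women spend differently, for instance, computing the mean separately for each sex gives a more reasonable replacement value.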

import seaborn as sns
import numpy as np

np.random.seed(42)
# Draw 10 samples
tips_10 = sns.load_dataset('tips').sample(10)
# Randomly set 'total_bill' to missing for four of the samples
tips_10.loc[np.random.permutation(tips_10.index)[:4], 'total_bill'] = np.nan
print(tips_10)
'''
     total_bill   tip     sex smoker   day    time  size
24        19.82  3.18    Male     No   Sat  Dinner     2
6          8.77  2.00    Male     No   Sun  Dinner     2
153         NaN  2.00    Male     No   Sun  Dinner     4
211         NaN  5.16    Male    Yes   Sat  Dinner     4
198         NaN  2.00  Female    Yes  Thur   Lunch     2
176         NaN  2.00    Male    Yes   Sun  Dinner     2
192       28.44  2.56    Male    Yes  Thur   Lunch     2
124       12.48  2.52  Female     No  Thur   Lunch     2
9         14.78  3.23    Male     No   Sun  Dinner     2
101       15.38  3.00  Female    Yes   Fri  Dinner     2
'''
# Count non-missing values by sex: total_bill is missing in 3 Male rows and 1 Female row
count_sex = tips_10.groupby('sex').count()
print(count_sex)
'''
        total_bill  tip  smoker  day  time  size
sex                                             
Male             4    7       7    7     7     7
Female           2    3       3    3     3     3
'''
# Fill the missing values of the given vector with its mean
def fill_na_mean(x):
    avg = x.mean()
    return (x.fillna(avg))

total_bill_group_mean = tips_10.groupby('sex').total_bill.transform(fill_na_mean)
tips_10['fill_total_bill'] = total_bill_group_mean
print(tips_10)
'''
     total_bill   tip     sex smoker   day    time  size  fill_total_bill
24        19.82  3.18    Male     No   Sat  Dinner     2          19.8200
6          8.77  2.00    Male     No   Sun  Dinner     2           8.7700
153         NaN  2.00    Male     No   Sun  Dinner     4          17.9525
211         NaN  5.16    Male    Yes   Sat  Dinner     4          17.9525
198         NaN  2.00  Female    Yes  Thur   Lunch     2          13.9300
176         NaN  2.00    Male    Yes   Sun  Dinner     2          17.9525
192       28.44  2.56    Male    Yes  Thur   Lunch     2          28.4400
124       12.48  2.52  Female     No  Thur   Lunch     2          12.4800
9         14.78  3.23    Male     No   Sun  Dinner     2          14.7800
101       15.38  3.00  Female    Yes   Fri  Dinner     2          15.3800
'''
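
The same imputation can also be written without a custom function. The sketch below is one possible alternative, not the author's method: it uses the built-in 'mean' aggregation name with transform to broadcast each group's mean back onto its rows, then passes that Series to fillna. The bills frame is made-up data standing in for tips_10.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for tips_10
bills = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'total_bill': [10.0, 20.0, np.nan, 12.0, np.nan, 16.0],
})

# transform('mean') returns each row's group mean (NaN values are skipped)
group_mean = bills.groupby('sex')['total_bill'].transform('mean')
# fillna with a Series fills each missing value from the matching index label
bills['fill_total_bill'] = bills['total_bill'].fillna(group_mean)

print(bills['fill_total_bill'].tolist())
# [10.0, 20.0, 15.0, 12.0, 14.0, 16.0]
```

The missing Male value becomes 15.0 (the mean of 10 and 20) and the missing Female value becomes 14.0 (the mean of 12 and 16).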

filter

import pandas as pd
import seaborn as sns

tips = sns.load_dataset('tips')
print(tips.shape) # (244, 7)

print(tips['size'].value_counts())
'''
2    156
3     38
4     37
5      5
6      4
1      4
Name: size, dtype: int64
'''

The output shows that party sizes of 1, 5, and 6 are uncommon. To drop these rows, keep only the groups that contain at least 30 observations.

tips_filtered = tips.groupby('size').filter(lambda x: x['size'].count() >= 30)
print(tips_filtered.shape) # (231, 7)
print(tips_filtered['size'].value_counts())
'''
(231, 7)
2    156
3     38
4     37
Name: size, dtype: int64
'''
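
Group filtering ties back to transform: broadcasting each group's row count onto its rows gives a boolean mask that selects the same rows. The sketch below uses made-up data and a hypothetical threshold, min_rows, of 2 instead of 30 so that the toy frame stays small.

```python
import pandas as pd

# Hypothetical data: party sizes with 3, 2, and 1 rows respectively
parties = pd.DataFrame({
    'size': [2, 2, 2, 3, 3, 6],
    'tip': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

min_rows = 2  # hypothetical threshold; the text above uses 30

# Approach from the text: keep whole groups that pass the test
via_filter = parties.groupby('size').filter(lambda x: len(x) >= min_rows)

# Alternative: broadcast the group row count to every row, then mask
group_rows = parties.groupby('size')['tip'].transform('size')
via_mask = parties[group_rows >= min_rows]

print(via_filter.equals(via_mask))  # True
print(via_mask['size'].tolist())    # [2, 2, 2, 3, 3]
```

The mask version avoids calling a Python lambda once per group, which may matter when there are many groups.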