Pandas 1.x Cookbook (Second Edition), Chapter 5: Exploratory Data Analysis

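Chapter 1: Pandas Foundations
Chapter 2: Essential DataFrame Operations
Chapter 3: Creating and Persisting DataFrames
Chapter 4: Beginning Data Analysis
Chapter 5: Exploratory Data Analysis
Chapter 6: Selecting Subsets of Data
Chapter 7: Filtering Rows
Chapter 8: Index Alignment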


5.1 Summary statistics

Summary statistics include the mean, quantiles, and standard deviation. The .describe method computes these statistics for the numeric columns of a DataFrame. Start by loading the dataset:

>>> import pandas as pd
>>> import numpy as np
>>> fueleco = pd.read_csv("data/vehicles.csv.zip")
>>> fueleco
       barrels08  barrelsA08  ...  phevHwy  phevComb
0      15.695714         0.0  ...        0         0
1      29.964545         0.0  ...        0         0
2      12.207778         0.0  ...        0         0
3      29.964545         0.0  ...        0         0
4      17.347895         0.0  ...        0         0
...          ...         ...  ...      ...       ...
39096  14.982273         0.0  ...        0         0
39097  14.330870         0.0  ...        0         0
39098  15.695714         0.0  ...        0         0
39099  15.695714         0.0  ...        0         0
39100  18.311667         0.0  ...        0         0

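Call the individual methods to compute the mean, standard deviation, and quantiles (on pandas 2.0 and later, pass numeric_only=True to these methods, because the DataFrame also contains object columns):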

>>> fueleco.mean()  
barrels08         17.442712
barrelsA08         0.219276
charge120          0.000000
charge240          0.029630
city08            18.077799
                   ...     
youSaveSpend   -3459.572645
charge240b         0.005869
phevCity           0.094703
phevHwy            0.094269
phevComb           0.094141
Length: 60, dtype: float64
>>> fueleco.std()  
barrels08          4.580230
barrelsA08         1.143837
charge120          0.000000
charge240          0.487408
city08             6.970672
                   ...     
youSaveSpend    3010.284617
charge240b         0.165399
phevCity           2.279478
phevHwy            2.191115
phevComb           2.226500
Length: 60, dtype: float64
>>> fueleco.quantile(
...     [0, 0.25, 0.5, 0.75, 1]
... )  
      barrels08  barrelsA08  ...  phevHwy  phevComb
0.00   0.060000    0.000000  ...      0.0       0.0
0.25  14.330870    0.000000  ...      0.0       0.0
0.50  17.347895    0.000000  ...      0.0       0.0
0.75  20.115000    0.000000  ...      0.0       0.0
1.00  47.087143   18.311667  ...     81.0      88.0

Call the .describe method:

>>> fueleco.describe()  
         barrels08   barrelsA08  ...      phevHwy     phevComb
count  39101.00...  39101.00...  ...  39101.00...  39101.00...
mean     17.442712     0.219276  ...     0.094269     0.094141
std       4.580230     1.143837  ...     2.191115     2.226500
min       0.060000     0.000000  ...     0.000000     0.000000
25%      14.330870     0.000000  ...     0.000000     0.000000
50%      17.347895     0.000000  ...     0.000000     0.000000
75%      20.115000     0.000000  ...     0.000000     0.000000
max      47.087143    18.311667  ...    81.000000    88.000000
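
To make the link between .describe and the individual methods concrete, here is a minimal sketch on a small made-up DataFrame. The frame `toy` and its columns `mpg` and `cyl` are illustrative assumptions and are not part of the vehicles dataset.

```python
import pandas as pd

# Hypothetical toy data; not taken from vehicles.csv.
toy = pd.DataFrame({"mpg": [10.0, 20.0, 30.0, 40.0], "cyl": [4, 4, 6, 8]})

# One call that bundles count, mean, std, min, quartiles, and max.
summary = toy.describe()

# The same numbers, computed one method at a time.
means = toy.mean()
stds = toy.std()  # sample standard deviation (ddof=1)
quartiles = toy.quantile([0.25, 0.5, 0.75])

print(summary.loc[["mean", "std", "25%", "50%", "75%"]])
print(means, stds, quartiles, sep="\n")
```

Each row of `summary` should match the corresponding standalone result, which is why .describe is usually the quicker first look.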

View summary statistics for the object columns:

>>> fueleco.describe(include=object)  
              drive eng_dscr  ...   modifiedOn startStop
count         37912    23431  ...        39101      7405
unique            7      545  ...           68         2
top     Front-Wh...    (FFS)  ...  Tue Jan ...         N
freq          13653     8827  ...        29438      5176

There's more...

Transposing the result of .describe makes more of the output visible:

>>> fueleco.describe().T
                count         mean  ...       75%          max
barrels08     39101.0    17.442712  ...    20.115    47.087143
barrelsA08    39101.0     0.219276  ...     0.000    18.311667
charge120     39101.0     0.000000  ...     0.000     0.000000
charge240     39101.0     0.029630  ...     0.000    12.000000
city08        39101.0    18.077799  ...    20.000   150.000000
...               ...          ...  ...       ...          ...
youSaveSpend  39101.0 -3459.572645  ... -1500.000  5250.000000
charge240b    39101.0     0.005869  ...     0.000     7.000000
phevCity      39101.0     0.094703  ...     0.000    97.000000
phevHwy       39101.0     0.094269  ...     0.000    81.000000
phevComb      39101.0     0.094141  ...     0.000    88.000000

5.2 Column types

Inspect the .dtypes attribute:

>>> fueleco.dtypes
barrels08     float64
barrelsA08    float64
charge120     float64
charge240     float64
city08          int64
               ...    
modifiedOn     object
startStop      object
phevCity        int64
phevHwy         int64
phevComb        int64
Length: 83, dtype: object

Count the columns of each data type:

>>> fueleco.dtypes.value_counts()
float64    32
int64      27
object     23
bool        1
dtype: int64

There's more...

Converting columns to smaller data types can save memory. Start by looking at the int64 columns:

>>> fueleco.select_dtypes("int64").describe().T
                count         mean  ...     75%     max
city08        39101.0    18.077799  ...    20.0   150.0
cityA08       39101.0     0.569883  ...     0.0   145.0
co2           39101.0    72.538989  ...    -1.0   847.0
co2A          39101.0     5.543950  ...    -1.0   713.0
comb08        39101.0    20.323828  ...    23.0   136.0
...               ...          ...  ...     ...     ...
year          39101.0  2000.635406  ...  2010.0  2018.0
youSaveSpend  39101.0 -3459.572645  ... -1500.0  5250.0
phevCity      39101.0     0.094703  ...     0.0    97.0
phevHwy       39101.0     0.094269  ...     0.0    81.0
phevComb      39101.0     0.094141  ...     0.0    88.0

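The values in the city08 and comb08 columns never exceed 150. NumPy's iinfo function shows the range that each integer type can hold. Since 150 is above the int8 maximum of 127, int16 is the smallest type that fits. Converting from int64 to int16 cuts memory usage to 25% of the original: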

>>> np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)
>>> np.iinfo(np.int16)
iinfo(min=-32768, max=32767, dtype=int16)
>>> fueleco[["city08", "comb08"]].info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int64
 1   comb08  39101 non-null  int64
dtypes: int64(2)
memory usage: 611.1 KB
>>> (
...     fueleco[["city08", "comb08"]]
...     .assign(
...         city08=fueleco.city08.astype(np.int16),
...         comb08=fueleco.comb08.astype(np.int16),
...     )
...     .info(memory_usage="deep")
... )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   city08  39101 non-null  int16
 1   comb08  39101 non-null  int16
dtypes: int16(2)
memory usage: 152.9 KB
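
The choice of int16 can also be made programmatically. The following is a sketch only: `smallest_int_dtype` is a hypothetical helper written for this illustration, and the Series `city` holds made-up values that mimic the range of city08. pandas also offers pd.to_numeric with downcast="integer", shown for comparison.

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(ser):
    """Return the narrowest signed integer dtype that holds every value in ser.

    Hypothetical helper for illustration; it assumes ser has no missing values.
    """
    lo, hi = ser.min(), ser.max()
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dtype)
    raise ValueError("values do not fit in a signed 64-bit integer")


# Made-up values spanning the same range as city08 (6 to 150).
city = pd.Series([6, 17, 20, 150], dtype="int64")

chosen = smallest_int_dtype(city)
small = city.astype(chosen)

# Bytes used by the values only (index excluded), before and after.
ratio = small.memory_usage(index=False) / city.memory_usage(index=False)

# Built-in alternative: let pandas pick the smallest signed integer type.
downcast = pd.to_numeric(city, downcast="integer")

print(chosen, ratio, downcast.dtype)
```

Going from 8 bytes to 2 bytes per value matches the 25% figure reported by .info above.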

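The finfo function shows the range of the floating-point types.

When a column has low cardinality, the category type uses less memory. Pass memory_usage='deep' to compare the memory footprint of the object and category types: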

>>> fueleco.make.nunique()
134
>>> fueleco.model.nunique()
3816
>>> fueleco[["make"]].info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   make    39101 non-null  object
dtypes: object(1)
memory usage: 2.4 MB
>>> (
...     fueleco[["make"]]
...     .assign(make=fueleco.make.astype("category"))
...     .info(memory_usage="deep")
... )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   make    39101 non-null  category
dtypes: category(1)
memory usage: 90.4 KB
>>> fueleco[["model"]].info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   model   39101 non-null  object
dtypes: object(1)
memory usage: 2.5 MB
>>> (
...     fueleco[["model"]]
...     .assign(model=fueleco.model.astype("category"))
...     .info(memory_usage="deep")
... )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39101 entries, 0 to 39100
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   model   39101 non-null  category
dtypes: category(1)
memory usage: 496.7 KB
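
The saving comes from how a categorical column is stored: each distinct string is kept once, and every row holds only a small integer code. Here is a minimal sketch with made-up data; the Series `makes` is an assumption for illustration, not the real make column.

```python
import pandas as pd

# Made-up low-cardinality column: 3 distinct strings repeated over 6000 rows.
makes = pd.Series(["Ford", "BMW", "Ford", "Tesla", "Ford", "BMW"] * 1000, dtype=object)
as_cat = makes.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = makes.memory_usage(deep=True, index=False)
category_bytes = as_cat.memory_usage(deep=True, index=False)

print(object_bytes, category_bytes)
print(as_cat.cat.categories)   # the distinct values, stored once
print(as_cat.cat.codes.dtype)  # the per-row integer codes
```

The more distinct values a column has, the larger the category table becomes, which is consistent with model (3816 values) saving less than make (134 values) above.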

5.3 Categorical data

Data can be broadly classified as dates, continuous values, and categorical values.

Select the columns whose data type is object:

>>> fueleco.select_dtypes(object).columns
Index(['drive', 'eng_dscr', 'fuelType', 'fuelType1', 'make', 'model',
       'mpgData', 'trany', 'VClass', 'guzzler', 'trans_dscr', 'tCharger',
       'sCharger', 'atvType', 'fuelType2', 'rangeA', 'evMotor', 'mfrCode',
       'c240Dscr', 'c240bDscr', 'createdOn', 'modifiedOn', 'startStop'],
      dtype='object')

Use the .nunique method to determine the cardinality:

>>> fueleco.drive.nunique()
7

Use the .sample method to look at a few values:

>>> fueleco.drive.sample(5, random_state=42)
4217     4-Wheel ...
1736     4-Wheel ...
36029    Rear-Whe...
37631    Front-Wh...
1668     Rear-Whe...
Name: drive, dtype: object

Determine the number and percentage of missing values:

>>> fueleco.drive.isna().sum()
1189
>>> fueleco.drive.isna().mean() * 100
3.0408429451932175

Use .value_counts to count the occurrences of each value:

>>> fueleco.drive.value_counts()
Front-Wheel Drive             13653
Rear-Wheel Drive              13284
4-Wheel or All-Wheel Drive     6648
All-Wheel Drive                2401
4-Wheel Drive                  1221
2-Wheel Drive                   507
Part-time 4-Wheel Drive         198
Name: drive, dtype: int64

If there are too many distinct values, show the top 6 and collapse the rest into "Other":

>>> top_n = fueleco.make.value_counts().index[:6]
>>> (
...     fueleco.assign(
...         make=fueleco.make.where(
...             fueleco.make.isin(top_n), "Other"
...         )
...     ).make.value_counts()
... )
Other        23211
Chevrolet     3900
Ford          3208
Dodge         2557
GMC           2442
Toyota        1976
BMW           1807
Name: make, dtype: int64

Plot the counts with pandas:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> top_n = fueleco.make.value_counts().index[:6]
>>> (
...     fueleco.assign(  
...         make=fueleco.make.where(
...             fueleco.make.isin(top_n), "Other"
...         )
...     )
...     .make.value_counts()
...     .plot.bar(ax=ax)
... )
>>> fig.savefig("c5-catpan.png", dpi=300)

Plot the counts with seaborn:

>>> import seaborn as sns
>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> top_n = fueleco.make.value_counts().index[:6]
>>> sns.countplot(
...     y="make",  
...     data=(
...         fueleco.assign(
...             make=fueleco.make.where(
...                 fueleco.make.isin(top_n), "Other"
...             )
...         )
...     ),
... )
>>> fig.savefig("c5-catsns.png", dpi=300) 

How it works...

Look at the rows where the drive column is missing:

>>> fueleco[fueleco.drive.isna()]
       barrels08  barrelsA08  ...  phevHwy  phevComb
7138    0.240000         0.0  ...        0         0
8144    0.312000         0.0  ...        0         0
8147    0.270000         0.0  ...        0         0
18215  15.695714         0.0  ...        0         0
18216  14.982273         0.0  ...        0         0
...          ...         ...  ...      ...       ...
23023   0.240000         0.0  ...        0         0
23024   0.546000         0.0  ...        0         0
23026   0.426000         0.0  ...        0         0
23031   0.426000         0.0  ...        0         0
23034   0.204000         0.0  ...        0         0

By default, value_counts does not count missing values. Set dropna=False to include them:

>>> fueleco.drive.value_counts(dropna=False)
Front-Wheel Drive             13653
Rear-Wheel Drive              13284
4-Wheel or All-Wheel Drive     6648
All-Wheel Drive                2401
4-Wheel Drive                  1221
NaN                            1189
2-Wheel Drive                   507
Part-time 4-Wheel Drive         198
Name: drive, dtype: int64

There's more...

The rangeA column has the object type, but inspecting it with .value_counts shows that it is really a numeric column. Because some entries contain / or -, pandas interpreted the column as strings.

>>> fueleco.rangeA.value_counts()
290        74
270        56
280        53
310        41
277        38
           ..
328         1
250/370     1
362/537     1
310/370     1
340-350     1
Name: rangeA, Length: 216, dtype: int64

Use the .str.extract method with a regular expression to find the offending characters:

>>> (
...     fueleco.rangeA.str.extract(r"([^0-9.])")
...     .dropna()
...     .apply(lambda row: "".join(row), axis=1)
...     .value_counts()
... )
/    280
-     71
Name: rangeA, dtype: int64

The column actually holds two types, strings and floats. The floats are the missing values (NaN):

>>> set(fueleco.rangeA.apply(type))
{<class 'str'>, <class 'float'>}

Count the missing values:

>>> fueleco.rangeA.isna().sum()
37616

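Replace the missing values with the string "0", replace - with /, split each string on /, and take the mean of the resulting values in each row: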

>>> (
...     fueleco.rangeA.fillna("0")
...     .str.replace("-", "/")
...     .str.split("/", expand=True)
...     .astype(float)
...     .mean(axis=1)
... )
0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
39096    0.0
39097    0.0
39098    0.0
39099    0.0
39100    0.0
Length: 39101, dtype: float64
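
Because the first and last rows of the real output are all 0.0, the effect of the chain is hard to see. The sketch below runs the same steps on four made-up strings (the Series `rng` is an assumption for illustration) so that each case is visible.

```python
import pandas as pd

# Made-up values covering each case: plain number, "/" range, "-" range, missing.
rng = pd.Series(["290", "250/370", "340-350", None])

parsed = (
    rng.fillna("0")                        # missing -> "0"
    .str.replace("-", "/", regex=False)    # "340-350" -> "340/350"
    .str.split("/", expand=True)           # one column per part; absent parts are missing
    .astype(float)
    .mean(axis=1)                          # row mean skips the missing parts
)

print(parsed.tolist())
```

A single number is left unchanged, a range is replaced by its midpoint, and a missing entry becomes 0.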

Another way to handle a numeric column is to bin it with the cut or qcut function. cut creates bins of equal width:

>>> (
...     fueleco.rangeA.fillna("0")
...     .str.replace("-", "/")
...     .str.split("/", expand=True)
...     .astype(float)
...     .mean(axis=1)
...     .pipe(lambda ser_: pd.cut(ser_, 10))
...     .value_counts()
... )
(-0.45, 44.95]     37688
(269.7, 314.65]      559
(314.65, 359.6]      352
(359.6, 404.55]      205
(224.75, 269.7]      181
(404.55, 449.5]       82
(89.9, 134.85]        12
(179.8, 224.75]        9
(44.95, 89.9]          8
(134.85, 179.8]        5
dtype: int64

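The qcut function bins by quantile, so each bin holds roughly the same number of rows. It fails on rangeA because most of the values are 0 and the quantile edges are therefore not unique, but it works on city08: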

>>> (
...     fueleco.rangeA.fillna("0")
...     .str.replace("-", "/")
...     .str.split("/", expand=True)
...     .astype(float)
...     .mean(axis=1)
...     .pipe(lambda ser_: pd.qcut(ser_, 10))
...     .value_counts()
... )
Traceback (most recent call last):
  ...
ValueError: Bin edges must be unique: array([  0. ,   0. ,   0. ,   0. ,   0. ,   0. ,   0. ,   0. ,   0. ,
         0. , 449.5]).
>>> (
...     fueleco.city08.pipe(
...         lambda ser: pd.qcut(ser, q=10)
...     ).value_counts()
... )
(5.999, 13.0]    5939
(19.0, 21.0]     4477
(14.0, 15.0]     4381
(17.0, 18.0]     3912
(16.0, 17.0]     3881
(15.0, 16.0]     3855
(21.0, 24.0]     3676
(24.0, 150.0]    3235
(13.0, 14.0]     2898
(18.0, 19.0]     2847
Name: city08, dtype: int64
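
The difference between the two functions, and the reason for the ValueError, can be reproduced on a tiny example. This is a sketch with made-up values; the `duplicates="drop"` argument of pd.qcut is one possible way around the error and is not something the original recipe uses.

```python
import pandas as pd

# Made-up data dominated by zeros, like the cleaned rangeA column.
values = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 100, 200])

# cut: two bins of equal width, so the counts are very uneven.
equal_width = pd.cut(values, 2).value_counts()

# qcut: the 0%, 25%, 50%, and 75% quantiles are all 0, so the edges collide.
try:
    pd.qcut(values, 4)
    qcut_failed = False
except ValueError:
    qcut_failed = True

# Dropping the duplicate edges avoids the error but leaves fewer bins.
dropped = pd.qcut(values, 4, duplicates="drop")

print(equal_width)
print(qcut_failed)
print(dropped.cat.categories)
```

With duplicate edges dropped, only a single bin remains here, which shows that quantile binning is not very informative for a column this skewed.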

5.4 Continuous data

Select the numeric columns:

>>> fueleco.select_dtypes("number")
       barrels08  barrelsA08  ...  phevHwy  phevComb
0      15.695714         0.0  ...        0         0
1      29.964545         0.0  ...        0         0
2      12.207778         0.0  ...        0         0
3      29.964545         0.0  ...        0         0
4      17.347895         0.0  ...        0         0
...          ...         ...  ...      ...       ...
39096  14.982273         0.0  ...        0         0
39097  14.330870         0.0  ...        0         0
39098  15.695714         0.0  ...        0         0
39099  15.695714         0.0  ...        0         0
39100  18.311667         0.0  ...        0         0

Use .sample to look at a few values:

>>> fueleco.city08.sample(5, random_state=42)
4217     11
1736     21
36029    16
37631    16
1668     17
Name: city08, dtype: int64

Check the number and percentage of missing values:

>>> fueleco.city08.isna().sum()
0
>>> fueleco.city08.isna().mean() * 100
0.0

Get the summary statistics:

>>> fueleco.city08.describe()
count    39101.000000
mean        18.077799
std          6.970672
min          6.000000
25%         15.000000
50%         17.000000
75%         20.000000
max        150.000000
Name: city08, dtype: float64

Plot a histogram with pandas:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> fueleco.city08.hist(ax=ax)
>>> fig.savefig(
...     "c5-conthistpan.png", dpi=300
... )

The distribution in this plot looks heavily skewed, so try increasing the number of bins:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> fueleco.city08.hist(ax=ax, bins=30)
>>> fig.savefig(
...     "c5-conthistpanbins.png", dpi=300
... )

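Use seaborn to create a distribution plot that includes a histogram, a kernel density estimate (KDE), and a rug plot (distplot has been deprecated since seaborn 0.11 in favor of histplot and displot):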

>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> sns.distplot(fueleco.city08, rug=True, ax=ax)
>>> fig.savefig(
...     "c5-conthistsns.png", dpi=300
... )

There's more...

seaborn has other plots for summarizing a distribution:

>>> fig, axs = plt.subplots(nrows=3, figsize=(10, 8))
>>> sns.boxplot(x=fueleco.city08, ax=axs[0])
>>> sns.violinplot(x=fueleco.city08, ax=axs[1])
>>> sns.boxenplot(x=fueleco.city08, ax=axs[2])
>>> fig.savefig("c5-contothersns.png", dpi=300)
Figure: a box plot, a violin plot, and a boxen plot of city08

To check whether the data is normally distributed, use the Kolmogorov-Smirnov test. The test returns a p-value; if p < 0.05, the hypothesis that the data is normally distributed is rejected:

>>> from scipy import stats
>>> stats.kstest(fueleco.city08, cdf="norm")
KstestResult(statistic=0.9999999990134123, pvalue=0.0)
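
One detail worth knowing: with cdf="norm" and no further arguments, kstest compares the values against the standard normal distribution (mean 0, standard deviation 1). Raw values on a miles-per-gallon scale are therefore rejected on location and scale alone, which is why the statistic above is so close to 1. The sketch below uses synthetic data (the mean of 18 and standard deviation of 7 are made-up parameters) to show the effect of standardizing first. Estimating the mean and standard deviation from the same sample makes the resulting p-value only approximate, so treat it as a rough check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, genuinely normal sample; the parameters are illustrative only.
sample = rng.normal(loc=18, scale=7, size=500)

# Raw values versus N(0, 1): rejected even though the sample is normal.
raw = stats.kstest(sample, "norm")

# Standardize so that only the shape of the distribution is tested.
z = (sample - sample.mean()) / sample.std(ddof=1)
standardized = stats.kstest(z, "norm")

print(raw.pvalue, standardized.pvalue)
```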

A probability plot also shows whether the data is normal. If the points follow the red line, the data is approximately normal:

>>> from scipy import stats
>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> stats.probplot(fueleco.city08, plot=ax)
>>> fig.savefig("c5-conprob.png", dpi=300)

5.5 Comparing continuous values across categories

Compute the mean and standard deviation of the city08 column for four makes: Ford, Honda, Tesla, and BMW:

>>> mask = fueleco.make.isin(
...     ["Ford", "Honda", "Tesla", "BMW"]
... )
>>> fueleco[mask].groupby("make").city08.agg(
...     ["mean", "std"]
... )
            mean       std
make
BMW    17.817377  7.372907
Ford   16.853803  6.701029
Honda  24.372973  9.154064
Tesla  92.826087  5.538970

Plot the data with seaborn:

>>> g = sns.catplot(
...     x="make", y="city08", data=fueleco[mask], kind="box"
... )
>>> g.ax.figure.savefig("c5-catbox.png", dpi=300)

There's more...

A box plot does not show how many rows each make contains:

>>> mask = fueleco.make.isin(
...     ["Ford", "Honda", "Tesla", "BMW"]
... )
>>> (fueleco[mask].groupby("make").city08.count())
make
BMW      1807
Ford     3208
Honda     925
Tesla      46
Name: city08, dtype: int64
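
The count can also be requested alongside the mean and standard deviation in a single .agg call, which keeps the sample size next to the statistics it qualifies. A minimal sketch on made-up rows (the frame `cars` is an assumption for illustration):

```python
import pandas as pd

# Made-up rows; the values only loosely resemble the real city08 column.
cars = pd.DataFrame(
    {
        "make": ["Ford", "Ford", "Ford", "Tesla", "Tesla", "BMW"],
        "city08": [15, 17, 19, 90, 96, 18],
    }
)

# One aggregation that returns the statistics and the group size together.
summary = cars.groupby("make").city08.agg(["mean", "std", "count"])

print(summary)
```

A group with a single row has an undefined sample standard deviation (NaN), which is another reason to keep an eye on the counts.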

One option is to draw a swarm plot on top of the box plot:

>>> g = sns.catplot(
...     x="make", y="city08", data=fueleco[mask], kind="box"
... )
>>> sns.swarmplot(
...     x="make",
...     y="city08", 
...     data=fueleco[mask],
...     color="k",
...     size=1,
...     ax=g.ax,
... )
>>> g.ax.figure.savefig(
...     "c5-catbox2.png", dpi=300
... )

catplot can add further dimensions, such as the year:

>>> g = sns.catplot(
...     x="make",
...     y="city08",
...     data=fueleco[mask],
...     kind="box",
...     col="year",
...     col_order=[2012, 2014, 2016, 2018],
...     col_wrap=2,
... )
>>> g.axes[0].figure.savefig(
...     "c5-catboxcol.png", dpi=300
... )

Alternatively, use the hue parameter to combine the four panels into a single plot:

>>> g = sns.catplot(
...     x="make",
...     y="city08", 
...     data=fueleco[mask],
...     kind="box",
...     hue="year",
...     hue_order=[2012, 2014, 2016, 2018],
... )
>>> g.ax.figure.savefig(
...     "c5-catboxhue.png", dpi=300
... )

In Jupyter, the result of the groupby call can be styled:

>>> mask = fueleco.make.isin(
...     ["Ford", "Honda", "Tesla", "BMW"]
... )
>>> (
...     fueleco[mask]
...     .groupby("make")
...     .city08.agg(["mean", "std"])
...     .style.background_gradient(cmap="RdBu", axis=0)
... )

5.6 Comparing two continuous columns

Compute the covariance between two columns:

>>> fueleco.city08.cov(fueleco.highway08)
46.33326023673625
>>> fueleco.city08.cov(fueleco.comb08)
47.41994667819079
>>> fueleco.city08.cov(fueleco.cylinders)
-5.931560263764761

Compute the Pearson correlation coefficient between two columns:

>>> fueleco.city08.corr(fueleco.highway08)
0.932494506228495
>>> fueleco.city08.corr(fueleco.cylinders)
-0.701654842382788
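
Covariance depends on the units of both columns, so its magnitude is hard to interpret. The Pearson coefficient is the covariance divided by the product of the two standard deviations, which rescales it to the range from -1 to 1. The sketch below checks that relationship on made-up numbers (the Series `city` and `highway` are assumptions for illustration):

```python
import pandas as pd

# Made-up paired measurements.
city = pd.Series([10.0, 15.0, 20.0, 30.0])
highway = pd.Series([14.0, 22.0, 27.0, 41.0])

# r = cov(x, y) / (std(x) * std(y)), using sample statistics throughout.
manual = city.cov(highway) / (city.std() * highway.std())
builtin = city.corr(highway)

print(manual, builtin)
```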

Display the correlations as a heatmap:

>>> import seaborn as sns
>>> fig, ax = plt.subplots(figsize=(8, 8))
>>> corr = fueleco[
...     ["city08", "highway08", "cylinders"]
... ].corr()
>>> mask = np.zeros_like(corr, dtype=bool)
>>> mask[np.triu_indices_from(mask)] = True
>>> sns.heatmap(
...     corr,
...     mask=mask,
...     fmt=".2f",
...     annot=True,
...     ax=ax,
...     cmap="RdBu",
...     vmin=-1,
...     vmax=1,
...     square=True,
... )
>>> fig.savefig(
...     "c5-heatmap.png", dpi=300, bbox_inches="tight"
... )

Use scatter plots to show the relationships:

>>> fig, ax = plt.subplots(figsize=(8, 8))
>>> fueleco.plot.scatter(
...     x="city08", y="highway08", alpha=0.1, ax=ax
... )
>>> fig.savefig(
...     "c5-scatpan.png", dpi=300, bbox_inches="tight"
... )
>>> fig, ax = plt.subplots(figsize=(8, 8))
>>> fueleco.plot.scatter(
...     x="city08", y="cylinders", alpha=0.1, ax=ax
... )
>>> fig.savefig(
...     "c5-scatpan-cyl.png", dpi=300, bbox_inches="tight"
... )

Some of the vehicles are electric and have no cylinders, so fill the missing cylinder values with 0:

>>> fueleco.cylinders.isna().sum()
145
>>> fig, ax = plt.subplots(figsize=(8, 8))
>>> (
...     fueleco.assign(
...         cylinders=fueleco.cylinders.fillna(0)
...     ).plot.scatter(
...         x="city08", y="cylinders", alpha=0.1, ax=ax
...     )
... )
>>> fig.savefig(
...     "c5-scatpan-cyl0.png", dpi=300, bbox_inches="tight"
... )

Use seaborn to add a regression line:

>>> res = sns.lmplot(
...     x="city08", y="highway08", data=fueleco
... )
>>> res.fig.savefig(
...     "c5-lmplot.png", dpi=300, bbox_inches="tight"
... )

With relplot, the points can vary in color and size:

>>> res = sns.relplot(
...     x="city08",
...     y="highway08",
...     data=fueleco.assign(
...         cylinders=fueleco.cylinders.fillna(0)
...     ),
...     hue="year",
...     size="barrels08",
...     alpha=0.5,
...     height=8,
... )
>>> res.fig.savefig(
...     "c5-relplot2.png", dpi=300, bbox_inches="tight"
... )

A categorical dimension can be added as well:

>>> res = sns.relplot(
...     x="city08",
...     y="highway08",
...     data=fueleco.assign(
...         cylinders=fueleco.cylinders.fillna(0)
...     ),
...     hue="year",
...     size="barrels08",
...     alpha=0.5,
...     height=8,
...     col="make",
...     col_order=["Ford", "Tesla"],
... )
>>> res.fig.savefig(
...     "c5-relplot3.png", dpi=300, bbox_inches="tight"
... )

If the relationship between two columns is not linear, use the Spearman correlation coefficient instead:

>>> fueleco.city08.corr(
...     fueleco.barrels08, method="spearman"
... )
-0.9743658646193255
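
The Spearman coefficient is the Pearson coefficient computed on the ranks of the values, so it measures whether a relationship is monotonic rather than whether it is linear. A small sketch with made-up data, where y is the cube of x:

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

pearson = x.corr(y)                      # penalized by the curvature
spearman = x.corr(y, method="spearman")  # only the ordering matters
rank_based = x.rank().corr(y.rank())     # Pearson on ranks, the same idea by hand

print(pearson, spearman, rank_based)
```

Because the ordering of x and y is identical, the rank-based value reaches 1 while the Pearson value stays below it.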

5.7 Comparing categorical values

Reduce the cardinality by mapping the VClass column to a new SClass column that has only six values:

>>> def generalize(ser, match_name, default):
...     seen = None
...     for match, name in match_name:
...         mask = ser.str.contains(match)
...         if seen is None:
...             seen = mask
...         else:
...             seen |= mask
...         ser = ser.where(~mask, name)
...     ser = ser.where(seen, default)
...     return ser
>>> makes = ["Ford", "Tesla", "BMW", "Toyota"]
>>> data = fueleco[fueleco.make.isin(makes)].assign(
...     SClass=lambda df_: generalize(
...         df_.VClass,
...         [
...             ("Seaters", "Car"),
...             ("Car", "Car"),
...             ("Utility", "SUV"),
...             ("Truck", "Truck"),
...             ("Van", "Van"),
...             ("van", "Van"),
...             ("Wagon", "Wagon"),
...         ],
...         "other",
...     )
... )

Count the vehicles in each class for each make:

>>> data.groupby(["make", "SClass"]).size().unstack()
SClass     Car    SUV  ...  Wagon  other
make                   ...              
BMW     1557.0  158.0  ...   92.0    NaN
Ford    1075.0  372.0  ...  155.0  234.0
Tesla     36.0   10.0  ...    NaN    NaN
Toyota   773.0  376.0  ...  132.0  123.0

Use crosstab to get the same counts:

>>> pd.crosstab(data.make, data.SClass)
SClass   Car  SUV  ...  Wagon  other
make               ...
BMW     1557  158  ...     92      0
Ford    1075  372  ...    155    234
Tesla     36   10  ...      0      0
Toyota   773  376  ...    132    123
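
The two results differ in one respect: .unstack leaves NaN for combinations that never occur, which also turns the counts into floats, while crosstab fills them with 0. Passing fill_value=0 to .unstack closes the gap. A minimal sketch on made-up rows (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

# Made-up rows; Tesla has no SUV entry, so that cell would otherwise be NaN.
toy = pd.DataFrame(
    {
        "make": ["BMW", "BMW", "Ford", "Tesla"],
        "SClass": ["Car", "SUV", "Car", "Car"],
    }
)

via_groupby = toy.groupby(["make", "SClass"]).size().unstack(fill_value=0)
via_crosstab = pd.crosstab(toy.make, toy.SClass)

print(via_groupby)
print(via_crosstab)
```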

Add more dimensions:

>>> pd.crosstab(
...     [data.year, data.make], [data.SClass, data.VClass]
... )
SClass               Car             ...                       other
VClass      Compact Cars Large Cars  ... Special Purpose Vehicle 4WD
year make                            ...
1984 BMW               6          0  ...            0
     Ford             33          3  ...           21
     Toyota           13          0  ...            3
1985 BMW               7          0  ...            0
     Ford             31          2  ...            9
...                  ...        ...  ...          ...
2017 Tesla             0          8  ...            0
     Toyota            3          0  ...            0
2018 BMW              37         12  ...            0
     Ford              0          0  ...            0
     Toyota            4          0  ...            0

Use Cramér's V to measure the association between the two categorical columns:

>>> import scipy.stats as ss
>>> import numpy as np
>>> def cramers_v(x, y):
...     confusion_matrix = pd.crosstab(x, y)
...     chi2 = ss.chi2_contingency(confusion_matrix)[0]
...     n = confusion_matrix.sum().sum()
...     phi2 = chi2 / n
...     r, k = confusion_matrix.shape
...     phi2corr = max(
...         0, phi2 - ((k - 1) * (r - 1)) / (n - 1)
...     )
...     rcorr = r - ((r - 1) ** 2) / (n - 1)
...     kcorr = k - ((k - 1) ** 2) / (n - 1)
...     return np.sqrt(
...         phi2corr / min((kcorr - 1), (rcorr - 1))
...     )
>>> cramers_v(data.make, data.SClass)
0.2859720982171866

The .corr method also accepts a callable, so the same value can be computed as follows:

>>> data.make.corr(data.SClass, cramers_v)
0.2859720982171866

Visualize the counts with a bar plot:

>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> (
...     data.pipe(
...         lambda df_: pd.crosstab(df_.make, df_.SClass)
...     ).plot.bar(ax=ax)
... )
>>> fig.savefig("c5-bar.png", dpi=300, bbox_inches="tight")

Do the same with seaborn:

>>> res = sns.catplot(
...     kind="count", x="make", hue="SClass", data=data
... )
>>> res.fig.savefig(
...     "c5-barsns.png", dpi=300, bbox_inches="tight"
... )

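Show the data as a stacked bar chart, dividing each make's counts by its row total so that the bars show proportions: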

>>> fig, ax = plt.subplots(figsize=(10, 8))
>>> (
...     data.pipe(
...         lambda df_: pd.crosstab(df_.make, df_.SClass)
...     )
...     .pipe(lambda df_: df_.div(df_.sum(axis=1), axis=0))
...     .plot.bar(stacked=True, ax=ax)
... )
>>> fig.savefig(
...     "c5-barstacked.png", dpi=300, bbox_inches="tight"
... )
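
The row-wise division can also be delegated to crosstab through its normalize parameter. The sketch below compares the two approaches on made-up rows (the frame `toy` is an assumption for illustration); it is an alternative spelling, not the method used in the recipe above.

```python
import pandas as pd

# Made-up rows for three makes.
toy = pd.DataFrame(
    {
        "make": ["BMW", "BMW", "BMW", "Ford", "Ford", "Tesla"],
        "SClass": ["Car", "Car", "SUV", "Car", "Truck", "Car"],
    }
)

# Manual version: counts divided by each row's total.
counts = pd.crosstab(toy.make, toy.SClass)
manual = counts.div(counts.sum(axis=1), axis=0)

# Built-in version: normalize="index" makes every row sum to 1.
builtin = pd.crosstab(toy.make, toy.SClass, normalize="index")

print(builtin)
```

Either table can be passed to .plot.bar(stacked=True) to produce bars of equal height.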

5.8 Using the pandas profiling library

Install the profiling library with pip install pandas-profiling. Use ProfileReport to create an HTML report:

>>> import pandas_profiling as pp
>>> pp.ProfileReport(fueleco)

The report can be saved to a file:

>>> report = pp.ProfileReport(fueleco)
>>> report.to_file("fuel.html")
