Python for Data Analysis (11): Advanced pandas, Categorical Data

This article covers an advanced pandas topic: categorical data and the category dtype.


Categorical Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

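Background and Motivation

A column often contains repeated values drawn from a small set of distinct values.

unique() and value_counts() extract the distinct values from an array and compute their frequencies, respectively.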

values = pd.Series(["apple","orange","apple","apple"] * 2)
values
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object
pd.unique(values)   # list the distinct values
array(['apple', 'orange'], dtype=object)
values.value_counts()  # count occurrences of each value (top-level pd.value_counts is deprecated in recent pandas)
apple     6
orange    2
dtype: int64

Dimension Tables

A dimension table holds the distinct values, and the primary observations are stored as integer keys that reference it.

values = pd.Series([0,1,0,0] * 2)

dim = pd.Series(["apple","orange"])
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64
dim

0     apple
1    orange
dtype: object

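The take method: categorical (dictionary-encoded) representation

Storing the data as integers is called the categorical or dictionary-encoded representation. The array of distinct values is called the categories, dictionary, or levels of the data.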

dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
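
The integer keys and the dimension table above were written by hand. As a minimal sketch of how the same encoding can be derived from raw strings, the example below uses pd.factorize and then restores the original values with take. The variable names (raw, codes, restored) are illustrative and not from the original.

```python
import pandas as pd

raw = pd.Series(["apple", "orange", "apple", "apple"] * 2)

# factorize returns integer codes plus the distinct values,
# numbered in order of first appearance
codes, uniques = pd.factorize(raw)
dim = pd.Series(uniques)  # the dimension table

# take looks up each code in the dimension table, restoring the strings
restored = dim.take(codes).reset_index(drop=True)

print(codes.tolist())  # [0, 1, 0, 0, 0, 1, 0, 0]
print(dim.tolist())    # ['apple', 'orange']
```

Resetting the index is only needed here because take keeps the dimension table's labels (0 and 1) as the result's index.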

Using the Categorical Type

fruits = ["apple","orange","apple","apple"] * 2
N = len(fruits)
df = pd.DataFrame({"fruit":fruits,  # values for each column
                  "basket_id":np.arange(N),
                  "count":np.random.randint(3,15,size=N),
                  "weight":np.random.uniform(0,4,size=N)},
                 columns=["basket_id","fruit","count","weight"])  # four columns, in this order

df

df["fruit"]

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: object

Creating a Categorical Instance

fruit_cat = df["fruit"].astype("category")  # convert the column with astype
fruit_cat   # a Series whose values are a pd.Categorical instance

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]
c = fruit_cat.values
c
[apple, orange, apple, apple, apple, orange, apple, apple]
Categories (2, object): [apple, orange]

Two attributes: categories and codes

print(c.categories)
print("-----")
print(c.codes)
Index(['apple', 'orange'], dtype='object')
-----
[0 1 0 0 0 1 0 0]
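
To make the relationship between the two attributes concrete, here is a small sketch that builds a code-to-category mapping with enumerate and decodes the codes by hand. The names mapping and decoded are illustrative only.

```python
import pandas as pd

c = pd.Categorical(["apple", "orange", "apple", "apple"])

# Each code is the position of the value within c.categories
mapping = dict(enumerate(c.categories))

# Decoding the codes through the mapping recovers the original values
decoded = [mapping[code] for code in c.codes.tolist()]

print(mapping)  # {0: 'apple', 1: 'orange'}
print(decoded)  # ['apple', 'orange', 'apple', 'apple']
```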
# Convert a DataFrame column to the category dtype
df["fruit"] = df["fruit"].astype("category")
df.fruit
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

Creating pd.Categorical from other sequences

my_categories = pd.Categorical(['foo','bar','baz','foo','bar'])
my_categories
[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]
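
Note that the categories above come out sorted (bar, baz, foo), not in order of appearance. As a hedged sketch, passing the categories argument lets you fix the order yourself, which changes the integer codes but not the values:

```python
import pandas as pd

data = ["foo", "bar", "baz", "foo", "bar"]

# Default: categories are the sorted distinct values
default_cat = pd.Categorical(data)

# Explicit: categories keep the order given here
custom_cat = pd.Categorical(data, categories=["foo", "bar", "baz"])

print(default_cat.codes.tolist())  # [2, 0, 1, 2, 0]
print(custom_cat.codes.tolist())   # [0, 1, 2, 0, 1]
```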

When the encoded data is already available: from_codes

categories = ["foo","bar","baz"]
codes = [0,1,0,0,1,0,1,0]
my_code = pd.Categorical.from_codes(codes,categories)
my_code
[foo, bar, foo, foo, bar, foo, bar, foo]
Categories (3, object): [foo, bar, baz]
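
Two details of from_codes are worth a quick illustration: a code of -1 stands for a missing value, and codes outside the valid range are rejected. The sketch below uses made-up codes; the rejected flag is only there to show the error path.

```python
import pandas as pd

# -1 is the sentinel for a missing value
cat = pd.Categorical.from_codes([0, -1, 1, 0], categories=["foo", "bar"])
print(cat.isna().tolist())  # [False, True, False, False]

# Codes must lie between -1 and len(categories) - 1
try:
    pd.Categorical.from_codes([0, 5], categories=["foo", "bar"])
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```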

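Specifying an explicit category order: ordered=True

Unless told otherwise, a categorical conversion assumes the categories have no ordering. You can specify an ordering explicitly.

# The order of the categories list defines the ordering
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)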
ordered_cat
[foo, bar, foo, foo, bar, foo, bar, foo]
Categories (3, object): [foo < bar < baz]

Making an unordered instance ordered with as_ordered

# as_ordered returns an ordered copy; the existing category order is kept as-is
my_categories.as_ordered()

[foo, bar, baz, foo, bar]
Categories (3, object): [bar < baz < foo]
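
What an ordering buys you in practice is easiest to see with comparisons and sorting. The following is a minimal sketch with invented size labels (not from the original): min, max, comparisons, and sort_values all follow the category order, not alphabetical order.

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(
        ["medium", "small", "large", "small"],
        categories=["small", "medium", "large"],  # defines small < medium < large
        ordered=True,
    )
)

print(sizes.min(), sizes.max())      # small large
print((sizes > "small").tolist())    # [True, False, True, False]
print(sizes.sort_values().tolist())  # ['small', 'small', 'medium', 'large']
```

On an unordered categorical, min, max, and comparisons such as > raise a TypeError, which is why the ordered flag matters.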

Computations with Categoricals

np.random.seed(12345)  # set the random seed for reproducibility
draws = np.random.randn(1000)
draws[:5]
array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

The qcut() function: quartiles

# Bin the data into quartiles
bins = pd.qcut(draws,4)
bins
[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

Naming the quartiles with labels

bins = pd.qcut(draws,4,labels=["Q1","Q2","Q3","Q4"])
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

Summary statistics with groupby

bins = pd.Series(bins, name="quartile")
results = (pd.Series(draws)
          .groupby(bins)
          .agg(["count","min","max"]).reset_index()
          )
results
results["quartile"]  # the original categorical information is preserved
0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]
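
Since the summary table itself is not reproduced here, the sketch below reruns the same idea on freshly generated data so the result can be checked. It assumes a different random generator and seed (np.random.default_rng(0)) than the article, so the min and max values will differ from the original run. It also passes observed=True explicitly, because the default for categorical group keys has changed across pandas versions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, not the article's
draws = pd.Series(rng.standard_normal(1000))

# qcut on a Series returns a categorical Series of quartile labels
quartile = pd.qcut(draws, 4, labels=["Q1", "Q2", "Q3", "Q4"]).rename("quartile")

summary = draws.groupby(quartile, observed=True).agg(["count", "min", "max"])
print(summary)
```

Each quartile should hold 250 of the 1000 draws, and every bin's maximum should sit below the next bin's minimum.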

Better Performance with Categoricals

If you do a lot of analysis on a particular dataset, converting to categorical data can yield substantial performance gains and lower memory use.

N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(["foo","bar","baz","qux"] * (N // 4))
labels
0          foo
1          bar
2          baz
3          qux
4          foo
          ... 
9999995    qux
9999996    foo
9999997    bar
9999998    baz
9999999    qux
Length: 10000000, dtype: object

Converting to categorical

# Convert to the category dtype
categories = labels.astype("category")
categories
0          foo
1          bar
2          baz
3          qux
4          foo
          ... 
9999995    qux
9999996    foo
9999997    bar
9999998    baz
9999999    qux
Length: 10000000, dtype: category
Categories (4, object): [bar, baz, foo, qux]

Memory comparison

labels.memory_usage()
80000128
categories.memory_usage()

10000320

The one-time cost of conversion

%time _ = labels.astype("category")

CPU times: user 374 ms, sys: 34.8 ms, total: 409 ms
Wall time: 434 ms
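
The memory figures above come from memory_usage() with its defaults, which for object columns counts only the 8-byte pointers and not the strings themselves. As a smaller-scale sketch (N reduced to 100,000 and dtype=object forced, both assumptions made here so the example runs quickly and behaves the same across pandas versions), deep=True gives a fuller comparison:

```python
import numpy as np
import pandas as pd

N = 100_000  # smaller than the article's 10,000,000
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N // 4), dtype=object)
cats = labels.astype("category")

# deep=True also counts the memory held by the Python string objects
obj_bytes = labels.memory_usage(deep=True)
cat_bytes = cats.memory_usage(deep=True)

print(cats.cat.codes.dtype)  # int8: one byte per row is enough for 4 categories
print(obj_bytes, cat_bytes)
```

The exact byte counts depend on the pandas and Python versions, so only the direction of the difference should be relied on.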

Categorical Methods

s = pd.Series(["a","b","c","d"] * 2)
cat_s = s.astype("category")
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

The cat accessor

The special accessor attribute cat provides access to the categorical attributes and methods, for example:

  • codes
  • categories
  • set_categories
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

When the actual set of categories extends beyond those observed in the data

actual_categories = ["a","b","c","d","e"]
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64
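
set_categories works in both directions, which is easy to overlook. As a brief sketch: widening the categories adds zero-count entries, while narrowing them turns any value outside the new set into a missing value. The names expanded and narrowed are illustrative.

```python
import pandas as pd

cat_s = pd.Series(["a", "b", "c", "d"] * 2, dtype="category")

# Widening: "e" becomes a valid category with a count of zero
expanded = cat_s.cat.set_categories(["a", "b", "c", "d", "e"])
print(expanded.value_counts()["e"])  # 0

# Narrowing: values not in the new categories become NaN
narrowed = cat_s.cat.set_categories(["a", "b"])
print(narrowed.isna().sum())         # 4
```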

Removing categories that do not appear in the data

cat_s3 = cat_s[cat_s.isin(["a","b"])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]
# c and d no longer appear in the data, so drop them from the categories
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

Creating dummy variables: get_dummies()

In machine learning and statistics, you often need to convert categorical data into dummy variables, also known as one-hot encoding.

cat_s = pd.Series(["a","b","c","d"] * 2, dtype="category")
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]
pd.get_dummies(cat_s)
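
As a sketch of what the result looks like: get_dummies produces one column per category, and each row has exactly one column switched on. The dtype=int argument is an addition made here so the output is 0/1 on all pandas versions (newer versions default to booleans).

```python
import pandas as pd

cat_s = pd.Series(["a", "b", "c", "d"] * 2, dtype="category")

# One indicator column per category; dtype=int forces 0/1 instead of True/False
dummies = pd.get_dummies(cat_s, dtype=int)

print(list(dummies.columns))     # ['a', 'b', 'c', 'd']
print(dummies.iloc[0].tolist())  # [1, 0, 0, 0]
```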
