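Scikit-Learn provides several transformers that can be used to one-hot encode categorical variables: LabelEncoder, OneHotEncoder, and LabelBinarizer. Possibly because of version differences, the results I got in practice differed slightly from those in Hands-On Machine Learning with Scikit-Learn and TensorFlow, so this post briefly sorts out the three.
My sklearn version is 0.20.0 and Python is 3.7.0 on Windows x64. The housing data is used for testing throughout.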
>>> import sklearn
>>> sklearn.__version__
'0.20.0'
First, load the data source and extract the field to be processed.
>>> import pandas as pd
>>> housing = pd.read_csv("housing/housing.csv")
>>> attrib_cat = housing["ocean_proximity"]
>>> attrib_cat.describe()
count 20640
unique 5
top <1H OCEAN
freq 9136
Name: ocean_proximity, dtype: object
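The field to be processed is stored in the attrib_cat variable. It is a pandas Series containing 5 categories and 20640 samples in total. Next, each of the three transformers above is used in turn to convert it to a one-hot encoding: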
LabelEncoder
>>> from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer
>>> LabelEncoder?
Encode labels with value between 0 and n_classes-1.
...
It can also be used to transform non-numerical labels (as long as they are
hashable and comparable) to numerical labels.
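LabelEncoder encodes labels as values between 0 and n_classes-1. It can also convert non-numerical labels to numerical labels, as long as they are hashable and comparable.
The sklearn API is neatly designed, and every transformer is used in much the same way:
>>> le = LabelEncoder()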
>>> letrans = le.fit_transform(attrib_cat)
>>> letrans
array([3, 3, 3, ..., 1, 1, 1])
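To see which category each integer stands for, the fitted encoder can be inspected. The following is a minimal sketch on a hand-made sample (the `sample` list is an assumption for illustration, not the housing data); it relies on LabelEncoder sorting the classes, which is why 'NEAR BAY' maps to 3 and 'INLAND' to 1, as in the output above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample that contains the five ocean_proximity categories.
sample = ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR BAY", "ISLAND", "NEAR OCEAN"]

le = LabelEncoder()
codes = le.fit_transform(sample)

# classes_ holds the sorted categories; the position in it is the code.
print(le.classes_)
print(codes)  # [3 1 0 3 2 4]

# inverse_transform maps the integer codes back to the original labels.
restored = le.inverse_transform(codes)
print(restored)
```

Because the codes are just positions in a sorted list, their numeric order carries no meaning, which is the reason a one-hot step usually follows.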
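The categorical variable has been converted to an array of integers. Getting a one-hot encoding additionally requires OneHotEncoder(); the first method described in the book follows the same flow.
>>> oh = OneHotEncoder()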
>>> oh.fit_transform(letrans)
ValueError: Expected 2D array, got 1D array instead:
array=[3 3 3 ... 1 1 1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
OneHotEncoder expects a 2-D array. The error message is clear and the book mentions this as well; it is shown here only as a reminder. OneHotEncoder itself is covered in more detail below.
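The two reshape calls suggested by the error message differ only in which axis becomes the samples. A small NumPy sketch, using a made-up four-element array in place of the real encoder output:

```python
import numpy as np

codes = np.array([3, 3, 1, 1])   # 1-D, shaped like the LabelEncoder output

column = codes.reshape(-1, 1)    # many samples, a single feature
row = codes.reshape(1, -1)       # a single sample, many features

print(codes.shape, column.shape, row.shape)
```

For one categorical column, the (-1, 1) form is the one that matches the data.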
>>> oh.fit_transform(letrans.reshape(-1,1))
<20640x5 sparse matrix of type '<class 'numpy.float64'>'
with 20640 stored elements in Compressed Sparse Row format>
The output is a SciPy sparse matrix, which reduces memory usage. Calling toarray() converts it to a NumPy array.
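As a rough illustration of the sparse-to-dense conversion, the sketch below encodes a small made-up array of integer codes (an assumption standing in for letrans) and then densifies it:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer codes, shaped like the LabelEncoder output.
codes = np.array([3, 1, 0, 3, 2, 4])

encoder = OneHotEncoder()
sparse_result = encoder.fit_transform(codes.reshape(-1, 1))

# Only the non-zero entries are stored: one per sample.
print(sparse.issparse(sparse_result), sparse_result.nnz)

# toarray() materialises the full samples-by-categories array.
dense_result = sparse_result.toarray()
print(dense_result.shape, dense_result.dtype)
```

With 20640 rows and 5 columns the dense array is still small, but the saving grows with the number of categories.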
OneHotEncoder
>>> OneHotEncoder?
Encode categorical integer features as a one-hot numeric array.
The input to this transformer should be an array-like of integers or
strings, denoting the values taken on by categorical (discrete) features.
The features are encoded using a one-hot (aka 'one-of-K' or 'dummy')
encoding scheme. This creates a binary column for each category and
returns a sparse matrix or dense array.
By default, the encoder derives the categories based on the unique values
in each feature. Alternatively, you can also specify the `categories`
manually.
The OneHotEncoder previously assumed that the input features take on
values in the range [0, max(values)). This behaviour is deprecated.
This encoding is needed for feeding categorical data to many scikit-learn
estimators, notably linear models and SVMs with the standard kernels.
Note: a one-hot encoding of y labels should use a LabelBinarizer
instead.
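OneHotEncoder encodes categorical integer features as a one-hot numeric array. Its input is an array-like of integers or strings that denote the values taken by categorical (discrete) features. These features are encoded with the one-hot scheme: a binary column is created for each category, and the result is returned as a sparse matrix or a dense array.
By default the transformer derives the categories from the unique values of each feature. They can also be set manually through the categories parameter.
Earlier versions of OneHotEncoder assumed that the input feature values lie in the range [0, max(values)). In the version I am using (0.20.0) this behaviour is deprecated.
Note: one-hot encoding of the target (the y labels) should be done with LabelBinarizer instead.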
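The difference between derived and manually specified categories shows up when a batch of data does not contain every category. A minimal sketch, assuming a hypothetical three-row input and the five ocean_proximity levels passed through the categories parameter:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The full set of levels, listed in sorted order (assumed known in advance).
levels = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

# Hypothetical batch in which only two of the five levels occur.
# dtype=object mirrors how pandas stores a column of strings.
X = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"]], dtype=object)

auto = OneHotEncoder().fit(X)                      # categories derived from X
fixed = OneHotEncoder(categories=[levels]).fit(X)  # categories set by hand

auto_width = auto.transform(X).shape[1]
fixed_width = fixed.transform(X).shape[1]
print(auto_width, fixed_width)

first_row = fixed.transform(X).toarray()[0]
print(first_row)
```

Fixing the categories keeps the number and order of output columns stable across different subsets of the data.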
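The previous example showed OneHotEncoder converting a numeric array, but according to the documentation it can also one-hot encode an array of strings directly. Let's try it.
>>> import numpy as np
>>> oh.fit_transform(np.array(attrib_cat).reshape(-1,1))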
<20640x5 sparse matrix of type '<class 'numpy.float64'>'
with 20640 stored elements in Compressed Sparse Row format>
It does indeed perform in one step what took two steps in the previous example.
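The one-step version can be checked on a small example. The sketch below uses a made-up list of labels (an assumption, not the housing data) and reads back the learned categories through the categories_ attribute:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels covering the five ocean_proximity categories.
sample = ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR BAY", "ISLAND", "NEAR OCEAN"]
X = np.array(sample, dtype=object).reshape(-1, 1)

enc = OneHotEncoder()
onehot = enc.fit_transform(X).toarray()

# categories_ has one array per input column; column order follows it.
print(enc.categories_[0])
print(onehot[0])  # "NEAR BAY" is the fourth category

# inverse_transform recovers the original strings from the one-hot rows.
restored = enc.inverse_transform(onehot).ravel().tolist()
print(restored)
```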
LabelBinarizer
>>> LabelBinarizer?
Binarize labels in a one-vs-all fashion
Several regression and binary classification algorithms are
available in scikit-learn. A simple way to extend these algorithms
to the multi-class classification case is to use the so-called
one-vs-all scheme.
At learning time, this simply consists in learning one regressor
or binary classifier per class. In doing so, one needs to convert
multi-class labels to binary labels (belong or does not belong
to the class). LabelBinarizer makes this process easy with the
transform method.
At prediction time, one assigns the class for which the corresponding
model gave the greatest confidence. LabelBinarizer makes this easy
with the inverse_transform method.
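As its name suggests, LabelBinarizer converts labels into a one-vs-all form.
To extend binary classification or regression algorithms to the multi-class case, the labels need to be converted into a one-vs-all form.
During training, this usually means learning one binary classifier or regressor per class. To do that, the multi-class labels must be converted into binary labels (belongs to the class or does not). LabelBinarizer makes this conversion simple.
At prediction time, LabelBinarizer also makes it easy to convert the predictions back into multi-class labels through inverse_transform.
>>> lbe = LabelBinarizer()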
>>> lbe.fit_transform(attrib_cat)
array([[0, 0, 0, 1, 0],
[0, 0, 0, 1, 0],
[0, 0, 0, 1, 0],
...,
[0, 1, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 1, 0, 0, 0]])
LabelBinarizer in effect combines the LabelEncoder and OneHotEncoder steps, and it is simpler to use than OneHotEncoder: it accepts a pandas Series directly and by default returns a dense NumPy array, whose dtype is int32 on my setup.
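The prediction-time direction described in the docstring can be sketched as follows. The `scores` array is invented for illustration and stands in for the per-class confidences of some one-vs-all model; for multi-class labels, inverse_transform picks the class with the highest score in each row.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical labels covering the five ocean_proximity categories.
sample = ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR BAY", "ISLAND", "NEAR OCEAN"]

lb = LabelBinarizer()
onehot = lb.fit_transform(sample)
print(onehot.shape)

# Made-up per-class confidences for two samples, one column per class
# in the order given by lb.classes_.
scores = np.array([
    [0.10, 0.70, 0.05, 0.10, 0.05],
    [0.20, 0.10, 0.10, 0.50, 0.10],
])

predicted = lb.inverse_transform(scores)
print(predicted)
```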
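Summary
After going through them, the differences between the three transformers are fairly clear:
- Description
  - LabelEncoder: converts a categorical variable into an array of integers.
  - OneHotEncoder: encodes categorical features (integers or strings) as a one-hot numeric array.
  - LabelBinarizer: binarizes labels into a one-vs-all form.
- Input to fit_transform
  - LabelEncoder: pd.Series/np.array; hashable and comparable non-numeric or numeric values; ndim=1.
  - OneHotEncoder: np.array; comparable non-numeric or numeric values; ndim=2.
  - LabelBinarizer: pd.Series/np.array; hashable and comparable non-numeric or numeric values; ndim=1.
- Default output of fit_transform (dtypes as observed on my setup)
  - LabelEncoder: np.array, dtype=int64.
  - OneHotEncoder: SciPy sparse matrix, dtype=float64.
  - LabelBinarizer: np.array, dtype=int32.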
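The comparison above can be reproduced with a short script. This is only a sketch on a made-up list of labels; the integer dtypes it prints may differ between platforms and library versions, so they are printed rather than assumed.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Hypothetical labels covering the five ocean_proximity categories.
labels = ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR BAY", "ISLAND", "NEAR OCEAN"]

le_out = LabelEncoder().fit_transform(labels)                  # 1-D input
oh_out = OneHotEncoder().fit_transform(
    np.array(labels, dtype=object).reshape(-1, 1))             # 2-D input
lb_out = LabelBinarizer().fit_transform(labels)                # 1-D input

for name, out in [("LabelEncoder", le_out),
                  ("OneHotEncoder", oh_out),
                  ("LabelBinarizer", lb_out)]:
    print(name, type(out).__name__, out.shape, out.dtype)

# Apart from sparse vs dense and the dtype, the two one-hot results agree.
same = np.array_equal(oh_out.toarray(), lb_out)
print(same)
```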