Sklearn Ensemble Algorithms in Practice

Preface

This post applies Sklearn's commonly used ensemble algorithms to classifying reviews of best-selling books on Dangdang.

For background on ensemble learning concepts, see the article A Summary of Bootstrapping, Bagging and Boosting.

It also helps to first read Naive Bayes Classification in Practice: this post analyzes the same Dangdang review data, and that article already explains the shared parts of the code in detail.

Main Text

RandomForest

Documentation: sklearn RandomForestClassifier

Code
import numpy as np
from numpy import array, reshape
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pickle
from sklearn.ensemble import RandomForestClassifier as RDF
np.set_printoptions(threshold=np.inf)


# 70/30 train/test split
def train(xFile, yFile):
    with open(xFile, "rb") as file_r:
        X = pickle.load(file_r)

    X = reshape(X, (212841, -1))  # reshape to (212841, 30*128)
    # Read the label data and encode it
    with open(yFile, "r") as yFile_r:
        labelLines = [_.strip("\n") for _ in yFile_r.readlines()]
    values = array(labelLines)
    labelEncoder = LabelEncoder()
    integerEncoded = labelEncoder.fit_transform(values)
    integerEncoded = integerEncoded.reshape(len(integerEncoded), 1)
    # print(integerEncoded)

    # Get the integer-encoded labels
    Y = integerEncoded.reshape(212841, )
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

    # Random forest classifier
    clf = RDF(criterion="gini")
    # criterion can be "gini" or "entropy": the former splits on Gini impurity, the latter on information gain. The default "gini" (i.e. the CART approach) is usually fine, unless you prefer ID3/C4.5-style feature selection.

    clf.fit(X_train, Y_train)

    # Evaluate on the test set
    predict = clf.predict(X_test)
    count = 0
    for p, t in zip(predict, Y_test):
        if p == t:
            count += 1
    print("RandomForest Accuracy is:", count/len(Y_test))


if __name__ == "__main__":
    xFile = "Res/char_embedded.pkl"
    yFile = "data/label.txt"
    print("Start Training.....")
    train(xFile, yFile)
    print("End.....")
Key parameter notes
  • criterion can be "gini" or "entropy": the former splits on Gini impurity, the latter on information gain. The default "gini" (i.e. the CART approach) is usually fine, unless you prefer ID3/C4.5-style feature selection. A quick comparison sketch follows this list.
  • All other parameters are left at their defaults; to be updated later =。=
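A minimal comparison sketch (my addition), assuming the X_train/X_test/Y_train/Y_test split produced by the script above: fit one forest per criterion and compare held-out accuracy with score().

from sklearn.ensemble import RandomForestClassifier

for criterion in ("gini", "entropy"):
    clf = RandomForestClassifier(criterion=criterion, n_jobs=-1, random_state=42)
    clf.fit(X_train, Y_train)
    # score() returns mean accuracy on the given test data
    print(criterion, "accuracy:", clf.score(X_test, Y_test))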
Results
Start Training.....
RandomForest Accuracy is: 0.9258453009255634
End.....

The final accuracy is roughly 92.6%.
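As an aside, the manual counting loop in the script can be replaced by scikit-learn's own metrics. A minimal sketch (my addition), reusing clf, X_test and Y_test from above:

from sklearn.metrics import accuracy_score, classification_report

predict = clf.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, predict))
# Per-class precision/recall/F1 is more informative than accuracy alone
print(classification_report(Y_test, predict))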

Gradient Boosting: GradientBoostingClassifier

Documentation: sklearn GradientBoostingClassifier

Boosting fits weak learners serially, each new learner correcting the ensemble built so far, and combines them into one strong learner. Unlike Bagging, whose base learners can be trained in parallel, this sequential dependency makes gradient boosting take far longer to train.
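A minimal sketch of that contrast (my addition): a bagging-style forest can build its trees on all CPU cores via n_jobs, while GradientBoostingClassifier must fit its trees one after another. If training time is a concern, HistGradientBoostingClassifier (histogram-based, stable since scikit-learn 1.0) is a much faster alternative to try.

from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier

rf = RandomForestClassifier(n_jobs=-1)   # bagging: trees can be fit in parallel
hgb = HistGradientBoostingClassifier()   # still sequential, but far faster than GBC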

Code

import numpy as np
from numpy import array, reshape
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pickle
from sklearn.ensemble import GradientBoostingClassifier as GBC

np.set_printoptions(threshold=np.inf)


# 70/30 train/test split
def train(xFile, yFile):
    with open(xFile, "rb") as file_r:
        X = pickle.load(file_r)

    X = reshape(X, (212841, -1))  # reshape to (212841, 30*128)
    # Read the label data and encode it
    with open(yFile, "r") as yFile_r:
        labelLines = [_.strip("\n") for _ in yFile_r.readlines()]
    values = array(labelLines)
    labelEncoder = LabelEncoder()
    integerEncoded = labelEncoder.fit_transform(values)
    integerEncoded = integerEncoded.reshape(len(integerEncoded), 1)
    # print(integerEncoded)

    # Get the integer-encoded labels
    Y = integerEncoded.reshape(212841, )
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

    # Gradient boosting classifier
    # Note: loss="deviance" means log loss and was the default in older
    # scikit-learn; since version 1.1 it is spelled loss="log_loss"
    clf = GBC(loss="deviance", subsample=0.8, criterion="friedman_mse")

    clf.fit(X_train, Y_train)

    # Evaluate on the test set
    predict = clf.predict(X_test)
    count = 0
    for p, t in zip(predict, Y_test):
        if p == t:
            count += 1
    print("GradientBoosting  Accuracy is:", count/len(Y_test))


if __name__ == "__main__":
    xFile = "Res/char_embedded.pkl"
    yFile = "data/label.txt"
    print("Start Training.....")
    train(xFile, yFile)
    print("End.....")
Key parameter notes
  • subsample trains each tree on a random sample of the training data; set it below 1.0, and find the best value by tuning (a tuning sketch follows the source snippet below)
  • Other parameters to be written up later (read: too lazy for now)
  • Default parameter settings from the source code (an older scikit-learn release):
    _SUPPORTED_LOSS = ('deviance', 'exponential')

    def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, criterion='friedman_mse', min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.,
                 max_depth=3, min_impurity_split=1e-7, init=None,
                 random_state=None, max_features=None, verbose=0,
                 max_leaf_nodes=None, warm_start=False,
                 presort='auto'):

        super(GradientBoostingClassifier, self).__init__(
            loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
            criterion=criterion, min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_depth=max_depth, init=init, subsample=subsample,
            max_features=max_features,
            random_state=random_state, verbose=verbose,
            max_leaf_nodes=max_leaf_nodes,
            min_impurity_split=min_impurity_split,
            warm_start=warm_start,
            presort=presort)
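The tuning sketch promised above (my addition; the grid values are only illustrative): GridSearchCV cross-validates subsample together with learning_rate and reports the best combination found.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "subsample": [0.6, 0.8, 1.0],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, Y_train)  # expensive: one fit per grid cell per CV fold
print("Best params:", search.best_params_, "CV accuracy:", search.best_score_)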
Results
Start Training.....
GradientBoosting  Accuracy is: 0.8833727467777551
End.....

The final accuracy is around 88.3%.
