統(tǒng)計(jì)學(xué)習(xí)方法第五章:決策樹(decision tree),CART算法,剪枝及python實(shí)現(xiàn)

統(tǒng)計(jì)學(xué)習(xí)方法第二章:感知機(jī)(perceptron)算法及python實(shí)現(xiàn)
統(tǒng)計(jì)學(xué)習(xí)方法第三章:k近鄰法(k-NN),kd樹及python實(shí)現(xiàn)
統(tǒng)計(jì)學(xué)習(xí)方法第四章:樸素貝葉斯法(naive Bayes),貝葉斯估計(jì)及python實(shí)現(xiàn)
統(tǒng)計(jì)學(xué)習(xí)方法第五章:決策樹(decision tree),CART算法,剪枝及python實(shí)現(xiàn)
統(tǒng)計(jì)學(xué)習(xí)方法第五章:決策樹(decision tree),ID3算法,C4.5算法及python實(shí)現(xiàn)

歡迎關(guān)注公眾號(hào):常失眠少年,大學(xué)生的修煉手冊(cè)!

完整代碼:
https://github.com/xjwhhh/LearningML/tree/master/StatisticalLearningMethod
歡迎follow和star

決策樹(decision tree)是一種基本的分類與回歸方法。決策樹模型呈樹狀結(jié)構(gòu),在分類問題中,表示基于特征對(duì)實(shí)例進(jìn)行分類的過程。

它可以認(rèn)為是if-then規(guī)則的集合,也可以認(rèn)為是定義在特征空間與類空間上的條件概率分布。

其主要優(yōu)點(diǎn)是模型具有可讀性,分類速度快。

學(xué)習(xí)時(shí),利用訓(xùn)練數(shù)據(jù),根據(jù)損失函數(shù)最小化的原則建立決策樹模型。預(yù)測(cè)時(shí),對(duì)新的數(shù)據(jù),利用決策樹模型進(jìn)行分類。

決策樹學(xué)習(xí)通常包括3個(gè)步驟:特征選擇,決策樹的生成和決策樹的修剪

主要的決策樹算法包括ID3,C4.5,CART。我主要實(shí)現(xiàn)的是CART算法,其余兩個(gè)算法也有所實(shí)現(xiàn),但正確率不高,先不予分享,以免誤人子弟。

The CART generation algorithm is as follows:

[Figure: the CART generation algorithm, from the book]
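
For reference, the key quantity in the algorithm above is the Gini index (standard definitions, as in the book): for a data set D with K classes, where p_k is the proportion of samples in class k,

    Gini(D) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2

and for a binary split of D by feature A into D_1 and D_2,

    Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)

The generation algorithm repeatedly chooses the (feature, split value) pair that minimizes Gini(D, A), splits, and recurses on each branch until a stopping condition is met.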

The CART pruning algorithm is as follows:

[Figure: the CART pruning algorithm, from the book]
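
The pruning algorithm is driven by the regularized loss (standard notation from the book): for a tree T with |T| leaf nodes,

    C_\alpha(T) = C(T) + \alpha |T|

where C(T) is the prediction error of T on the training data and \alpha \ge 0 trades training fit against tree size.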

Each pruning step removes the subtree below some internal node: all of that node's descendants are collapsed back into it, and the internal node becomes a leaf. Nothing outside this node changes, so when comparing loss functions we only need to compare the loss at this node before and after pruning.
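
Concretely, for an internal node t, its loss as a single leaf is C_\alpha(t) = C(t) + \alpha, while the loss of the subtree T_t rooted at t is C_\alpha(T_t) = C(T_t) + \alpha |T_t|. Collapsing t pays off exactly when C_\alpha(t) \le C_\alpha(T_t), i.e. when

    \alpha \ge \frac{C(t) - C(T_t)}{|T_t| - 1} = g(t)

so g(t) is the smallest \alpha at which pruning at t stops increasing the overall loss.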

問題1:為什么剪去g(t)最小的T?
個(gè)人認(rèn)為是為了獲得更多的區(qū)間和子樹,便于比較

問題2:為什么不直接減去C(t)+\alpha最小的T?
個(gè)人認(rèn)為是整體損失函數(shù)=內(nèi)部結(jié)點(diǎn)損失函數(shù)+其他結(jié)點(diǎn)損失函數(shù)和。如果直接剪去這個(gè)T,使得內(nèi)部結(jié)點(diǎn)損失函數(shù)最小,但其他結(jié)點(diǎn)損失函數(shù)和不一定,所以整體不一定。換句話說就是局部最優(yōu)與整體最優(yōu)的關(guān)系。而采用損失函數(shù)減小率來進(jìn)行調(diào)整,剪枝幅度更小,能產(chǎn)生更多樹,可能其中某個(gè)樹就是整體損失函數(shù)最小

我實(shí)現(xiàn)的CART算法并沒有剪枝,剪枝的代碼會(huì)在之后做分享:

import time
import logging
from collections import Counter

import cv2
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

total_class = 10


# 這里選用了一個(gè)比較小的數(shù)據(jù)集,因?yàn)檫^大的數(shù)據(jù)集會(huì)導(dǎo)致棧溢出


def log(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        logging.debug('start %s()' % func.__name__)
        ret = func(*args, **kwargs)

        end_time = time.time()
        logging.debug('end %s(), cost %s seconds' % (func.__name__, end_time - start_time))

        return ret

    return wrapper


# Binarize an image: with THRESH_BINARY_INV, pixels > 50 become 0 and
# pixels <= 50 become 1
def binaryzation(img):
    cv_img = img.astype(np.uint8)
    _, cv_img = cv2.threshold(cv_img, 50, 1, cv2.THRESH_BINARY_INV)
    return cv_img


@log
def binaryzation_features(trainset):
    features = []

    for img in trainset:
        img = np.reshape(img, (28, 28))
        cv_img = img.astype(np.uint8)

        img_b = binaryzation(cv_img)
        features.append(img_b)

    features = np.array(features)
    features = np.reshape(features, (-1, 784))

    return features


class TreeNode(object):
    """決策樹節(jié)點(diǎn)"""

    def __init__(self, **kwargs):
        '''
        attr_index: 屬性編號(hào)
        attr: 屬性值
        label: 類別(y)
        left_chuld: 左子結(jié)點(diǎn)
        right_child: 右子節(jié)點(diǎn)
        '''
        self.attr_index = kwargs.get('attr_index')
        self.attr = kwargs.get('attr')
        self.label = kwargs.get('label')
        self.left_child = kwargs.get('left_child')
        self.right_child = kwargs.get('right_child')


# 計(jì)算數(shù)據(jù)集的基尼指數(shù)
def gini_train_set(train_label):
    train_label_value = set(train_label)
    gini = 0.0
    for i in train_label_value:
        train_label_temp = train_label[train_label == i]
        pk = float(len(train_label_temp)) / len(train_label)
        gini += pk * (1 - pk)
    return gini
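

# Example (sanity check): gini_train_set(np.array([1, 1, 0, 0])) returns 0.5,
# since p_0 = p_1 = 0.5 and 0.5 * 0.5 + 0.5 * 0.5 = 0.5.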


# 計(jì)算一個(gè)特征不同切分點(diǎn)的基尼指數(shù),并返回最小的
def gini_feature(train_feature, train_label):
    train_feature_value = set(train_feature)
    min_gini = float('inf')
    return_feature_value = 0
    for i in train_feature_value:
        train_feature_class1 = train_feature[train_feature == i]
        label_class1 = train_label[train_feature == i]
        label_class2 = train_label[train_feature != i]
        D1 = float(len(train_feature_class1)) / len(train_feature)
        D2 = 1 - D1
        if len(label_class1) == 0:
            p1 = 0
        else:
            p1 = float(len(label_class1[label_class1 == label_class1[0]])) / len(label_class1)
        if len(label_class2) == 0:
            p2 = 0
        else:
            p2 = float(len(label_class2[label_class2 == label_class2[0]])) / len(label_class2)
        gini = D1 * 2 * p1 * (1 - p1) + D2 * 2 * p2 * (1 - p2)
        if min_gini > gini:
            min_gini = gini
            return_feature_value = i
    return min_gini, return_feature_value


def get_best_index(train_set, train_label, feature_indexes):
    '''
    :param train_set: the given data set
    :param train_label: the labels of the data set
    :param feature_indexes: indexes of the features still available for splitting
    :return: the index of the best splitting feature and the best split value
    Find the best split point and its splitting variable among the candidates.
    '''
    min_gini = float('inf')
    feature_index = 0
    return_feature_value = 0
    for i in range(len(train_set[0])):
        if i in feature_indexes:
            train_feature = train_set[:, i]
            gini, feature_value = gini_feature(train_feature, train_label)
            if gini < min_gini:
                min_gini = gini
                feature_index = i
                return_feature_value = feature_value
    return feature_index, return_feature_value


# 根據(jù)最有特征和最優(yōu)切分點(diǎn)劃分?jǐn)?shù)據(jù)集
def divide_train_set(train_set, train_label, feature_index, feature_value):
    left = []
    right = []
    left_label = []
    right_label = []
    for i in range(len(train_set)):
        line = train_set[i]
        if line[feature_index] == feature_value:
            left.append(line)
            left_label.append(train_label[i])
        else:
            right.append(line)
            right_label.append(train_label[i])
    return np.array(left), np.array(right), np.array(left_label), np.array(right_label)


@log
def build_tree(train_set, train_label, feature_indexes):
    # Stopping condition 1: all samples belong to a single class
    train_label_value = set(train_label)
    if len(train_label_value) == 1:
        return TreeNode(label=train_label[0])

    # Stopping condition 2: no features left to split on;
    # use the majority class as the leaf label
    if feature_indexes is None or len(feature_indexes) == 0:
        return TreeNode(label=Counter(train_label).most_common(1)[0][0])

    feature_index, feature_value = get_best_index(train_set, train_label, feature_indexes)

    left, right, left_label, right_label = divide_train_set(train_set, train_label, feature_index, feature_value)

    # If the best split puts every sample on one side, splitting no longer
    # helps; stop with the majority class
    if len(left_label) == 0 or len(right_label) == 0:
        return TreeNode(label=Counter(train_label).most_common(1)[0][0])

    # Give each branch its own copy of the remaining features; mutating the
    # shared list would wrongly hide features used in one branch from siblings
    remaining = [i for i in feature_indexes if i != feature_index]
    left_branch = build_tree(left, left_label, remaining)
    right_branch = build_tree(right, right_label, remaining)
    return TreeNode(left_child=left_branch,
                    right_child=right_branch,
                    attr_index=feature_index,
                    attr=feature_value)

# @log
# def prune(tree):


def predict_one(node, test):
    # Walk down the tree until a leaf (a node with a non-None label) is reached
    while node.label is None:
        if test[node.attr_index] == node.attr:
            node = node.left_child
        else:
            node = node.right_child
    return node.label


@log
def predict(tree, test_set):
    result = []
    for test in test_set:
        label = predict_one(tree, test)
        result.append(label)
    return result


if __name__ == '__main__':
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    raw_data = pd.read_csv('../data/train_binary1.csv', header=0)
    data = raw_data.values

    imgs = data[:, 1:]
    labels = data[:, 0]

    print(imgs.shape)

    # 圖片二值化
    # features = binaryzation_features(imgs)

    # 選取 2/3 數(shù)據(jù)作為訓(xùn)練集, 1/3 數(shù)據(jù)作為測(cè)試集
    train_features, test_features, train_labels, test_labels = train_test_split(imgs, labels, test_size=0.33,random_state=1)

    print(type(train_features))
    tree = build_tree(train_features, train_labels, [i for i in range(784)])
    test_predict = predict(tree, test_features)
    score = accuracy_score(test_labels, test_predict)

    print("The accuracy score is ", score)

運(yùn)行結(jié)果如下:

運(yùn)行結(jié)果

My ability is limited; if there are any mistakes, I hope you will point them out.

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

  • 1、模型原理 (一)原理 1、原理:引入信息熵(不確定程度)的概念,通過計(jì)算各屬性下的信息增益程度(信息增益越大,...
    Python_Franklin閱讀 12,610評(píng)論 0 17
  • 決策樹理論在決策樹理論中,有這樣一句話,“用較少的東西,照樣可以做很好的事情。越是小的決策樹,越優(yōu)于大的決策樹”。...
    制杖灶灶閱讀 6,056評(píng)論 0 25
  • 決策樹 決策樹模型與學(xué)習(xí) 特征選擇 決策樹的生成 決策樹的剪枝 CART 算法 決策樹模型實(shí)現(xiàn) 決策樹模型呈樹形結(jié)...
    千與千與閱讀 737評(píng)論 1 1
  • 生意人的賬簿記錄收入與支出,兩數(shù)相減,便是盈利,人生賬簿,記錄愛與被愛,兩數(shù)相加,就是成功。
    謝vV閱讀 236評(píng)論 0 1
  • 日有一新浪微博名為@潔潔良 的賬號(hào)公開發(fā)布辱華言論。 事件起源是在某活動(dòng)現(xiàn)場(chǎng),觀眾被批在離開后留下大量垃圾,@潔潔...
    汪不器閱讀 790評(píng)論 0 0

友情鏈接更多精彩內(nèi)容