數(shù)據(jù)集

數(shù)據(jù)總共有五個類別，分別是年齡,有工作,有自己的房子,信貸情況,類別（是否給貸款）。數(shù)據(jù)如下圖：

數(shù)據(jù)集

計算信息熵、條件熵和信息增益

獲取分類信息
feature是一個字符串，該函數(shù)獲取一個DataFrame中某一個feature下的所有類別、每個類別的數(shù)量、每個類別的索引。返回三個list對應(yīng)上述三個內(nèi)容。

def get_all_classes(dataFrame, feature):
    df_group = dataFrame.groupby(by=feature) # 按照feature分組
    classes = list(df_group.groups.keys())
    num_classes = []
    groups = []
    for i in classes:
        num_classes.append(len(df_group.get_group(name=i)))
        groups.append(df_group.get_group(name=i).index)
    return classes, num_classes, groups

計算信息熵
該函數(shù)計算某一個DataFrame中某個feature的信息熵。

def calculate_info_entropy(dataFrame, feature):
    total_num = len(dataFrame)
    classes, num_classes, groups = get_all_classes(dataFrame, feature)
    info_entropy = 0  # 信息熵初始化為0
    for i in range(len(classes)):
        p = num_classes[i] / total_num
        info_entropy += - p * math.log(p, 2)
    return info_entropy

計算條件熵
該函數(shù)計算feature2關(guān)于feature1的條件熵。

def calculate_conditional_entropy(dataFrame, feature1, feature2):
    total_num = len(dataFrame)
    classes, num_classes, groups = get_all_classes(dataFrame, feature1)
    conditional_entropy = 0
    for i in range(len(classes)): #遍歷關(guān)于feature1的每個分組
        p = num_classes[i] / total_num
        info_entropy = calculate_info_entropy(dataFrame.loc[groups[i], :], feature2) #計算每個分組的信息熵
        conditional_entropy += p * info_entropy # 將每個分組的信息熵乘以該分組的概率然后加到條件熵里
    return conditional_entropy

計算信息增益
該函數(shù)利用2和3兩個函數(shù)計算feature2關(guān)于feature1的信息增益。

def calculate_KL_divergence(dataFrame, feature1, feature2):
    info_entropy = calculate_info_entropy(dataFrame, feature2)
    conditional_entropy = calculate_conditional_entropy(dataFrame, feature1, feature2)
    return info_entropy - conditional_entropy

構(gòu)建決策樹

該函數(shù)用來返回一個字典，代表決策樹的結(jié)構(gòu)。

def create_Decision_tree(dataFrame):
    features = dataFrame.columns  # 獲取dataFrame的所有特征
    KL_divergences = []

    # 獲取dataFrame關(guān)于最后一列的信息熵，用于計算信息增益
    info_entropy = calculate_info_entropy(dataFrame, features[-1])
    for i in range(len(features) - 1):  # 計算各特征信息增益
        KL_divergence = calculate_KL_divergence(dataFrame, features[i], features[-1])

        if KL_divergence == info_entropy:  # 如果該特征的條件熵為0說明可以直接當(dāng)做葉子節(jié)點了
            classes, num_classes, groups = get_all_classes(dataFrame, features[i])
            decision_tree = {}
            for k in range(len(groups)):
                decision_tree[features[i] + ' ' + str(classes[k])] = dataFrame.loc[groups[k][0]][-1]
            return decision_tree
        else:
            KL_divergences.append(calculate_KL_divergence(dataFrame, features[i], features[-1]))
    most_gain = KL_divergences.index(max(KL_divergences))  # 獲取最大信息增益點的feature
    classes, num_classes, groups = get_all_classes(dataFrame, features[most_gain])  # 獲取分組信息
    decision_tree = {}
    for i in range(len(groups)):
        index = [j for j in range(len(features))]
        index.pop(most_gain)
        if calculate_info_entropy(dataFrame.iloc[groups[i], index], features[-1]) == 0:  # 該組條件熵為0就可以當(dāng)做葉子節(jié)點
            decision_tree[features[most_gain] + ' ' + str(classes[i])] = dataFrame.loc[groups[i][0]][-1]
        else:  # 條件熵不為0就遞歸調(diào)用該方法直到找到葉子節(jié)點
            decision_tree[features[most_gain] + ' ' + str(classes[i])] = create_Decision_tree(
                dataFrame.iloc[groups[i], index])
    return decision_tree

調(diào)用上述函數(shù)：

df = pd.read_csv('p3.csv', error_bad_lines=False)
print(create_Decision_tree(df))

輸出結(jié)果如下:

使用字典表示的決策樹

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

使用Python實現(xiàn)決策樹算法

使用Python實現(xiàn)決策樹算法

數(shù)據(jù)集

計算信息熵、條件熵和信息增益

構(gòu)建決策樹

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

使用Python實現(xiàn)決策樹算法

數(shù)據(jù)集

計算信息熵、條件熵和信息增益

構(gòu)建決策樹

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

計算信息熵、條件熵和信息增益