Machine Learning in Action reading notes - Naive Bayes

Core idea: ask the classifier for its best guess of the class, together with an estimate of the probability of that guess.

We call it naive because the whole formalization makes only the most basic, simplest assumptions.

Probability basics

$p(c_i|x,y) = \frac{p(x,y|c_i)p(c_i)}{p(x,y)}$

Here $p(c_i|x,y)$ means: given a data point described by x and y, what is the probability that it comes from class $c_i$?

If $p(c_1|x,y) > p(c_2|x,y)$, the point is assigned to class 1; if $p(c_1|x,y) < p(c_2|x,y)$, it is assigned to class 2.
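
As a minimal sketch of the rule above, the snippet below computes both posteriors for a single data point. The likelihoods and priors are made-up numbers chosen for illustration; they do not come from the book.

```python
# Hypothetical numbers for one data point (x, y); they are not from the book.
likelihood = {"c1": 0.20, "c2": 0.05}   # p(x, y | c_i)
prior = {"c1": 0.30, "c2": 0.70}        # p(c_i)

# p(x, y) is the sum over classes of p(x, y | c_i) * p(c_i)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Bayes' rule: p(c_i | x, y) = p(x, y | c_i) * p(c_i) / p(x, y)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

# Pick the class with the larger posterior
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

Since $p(x,y)$ is the same for every class, comparing the numerators $p(x,y|c_i)p(c_i)$ alone leads to the same decision.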

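Independence: suppose each feature needs N samples. With 10 features we would then need $N^{10}$ samples, but if the features are independent of one another the requirement drops from $N^{10}$ to $10N$. Independence here means statistical independence: the probability that a feature or word appears has nothing to do with which other words it appears next to. We know this assumption is not true, and that is exactly what "naive" refers to.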
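
To put numbers on this, here is a small arithmetic sketch. The value N = 1000 is an assumption picked only for illustration.

```python
# N = 1000 is an assumed per-feature sample requirement, used only for illustration.
N = 1000
num_features = 10

samples_joint = N ** num_features        # features modelled jointly: N^10
samples_independent = num_features * N   # features modelled independently: 10N

print(samples_joint)
print(samples_independent)
```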

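Naive Bayes assumptions:

Features are independent of one another (the word bacon is as likely to appear after unhealthy as after delicious)
Every feature is equally important (to judge whether a post is appropriate, every word has to be looked at)

These two assumptions usually do not hold, but naive Bayes makes them anyway.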

Text classification with Python

Training the algorithm

$p(c_i|\textbf{w})=\frac{p(\textbf{w}|c_i)p(c_i)}{p(\textbf{w})}$

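Here the $x, y$ of the earlier formula is replaced by $\textbf{w}$; the bold type marks it as a vector. In addition, because of the independence assumption of naive Bayes, $p(\textbf{w}|c_i)$ can be computed with the following formula, which simplifies the calculation.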
$p(\textbf{w}|c_i) = p(w_0,w_1,...,w_n|c_i) =\prod_{j}p(w_j|c_i)$
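
The factorization can be checked on a tiny example. The per-word probabilities below are hypothetical values, and every word in the document is assumed to be present in the table.

```python
import math

# Hypothetical per-word conditional probabilities p(w_j | c_i) for one class.
p_word_given_class = {"stupid": 0.15, "dog": 0.10, "my": 0.05}

# Words observed in the document to score (each is assumed to be in the table).
document = ["stupid", "dog"]

# Independence assumption: p(w | c_i) is the product of the per-word terms.
p_doc = math.prod(p_word_given_class[w] for w in document)

# The same quantity in log space: a sum of logs instead of a product.
log_p_doc = sum(math.log(p_word_given_class[w]) for w in document)

print(p_doc, math.exp(log_p_doc))
```

The log form is the one that matters in practice, because a sum of logs stays representable where a long product of small numbers would not.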

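Count the number of documents in each class
For every training document:
    For each class:
        If a token appears in the document, increment the count for that token
        Increment the count for all tokens
    For each class:
        For each token:
            Divide the token count by the total token count to get the conditional probability
    Return the conditional probabilities for each class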
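
The steps above can be sketched with plain counters. This is an illustrative version under stated assumptions, not the book's code: the function name `train_counts` and the toy data are hypothetical, each document is counted only toward its own class, and no smoothing is applied.

```python
from collections import Counter


def train_counts(documents, labels):
    """Estimate p(token | class) by plain counting, without smoothing."""
    docs_per_class = Counter(labels)             # number of documents in each class
    token_counts = {c: Counter() for c in docs_per_class}
    total_tokens = Counter()
    for tokens, label in zip(documents, labels):
        for token in tokens:
            token_counts[label][token] += 1      # count for this token
            total_tokens[label] += 1             # count for all tokens
    # Divide each token count by the class total to get conditional probabilities
    cond_prob = {
        c: {t: n / total_tokens[c] for t, n in token_counts[c].items()}
        for c in token_counts
    }
    return cond_prob, docs_per_class


# Toy data, made up for illustration.
docs = [["stupid", "dog"], ["my", "dog"], ["stupid", "stupid"]]
labels = [1, 0, 1]
cond_prob, docs_per_class = train_counts(docs, labels)
print(cond_prob[1]["stupid"])  # 3 of the 4 tokens in class 1, so 0.75
```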

Hands-on

Text classification

Classifying whether a comment is abusive

import numpy as np


def load_data_set():
    """
    Generate train data set and associated classify result
    :return: (train_data_set, classify_result)
    """
    posting_list = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    class_vec = [0, 1, 0, 1, 0, 1]  # 1 is an abuse, 0 is not
    return posting_list, class_vec


def create_vocab_list(data_set):
    """
    Get a set of words which appear in the train data set
    :param data_set: train_data_set
    :return: set of words
    """
    vocab_set = set()
    for doc in data_set:
        # union of 2 sets
        vocab_set = vocab_set | set(doc)
    return list(vocab_set)


def word2vec_set(vocab_list, input_sentence):
    """
    Convert a sentence to a vector using the set-of-words model: an entry is 1 if the word appears in the sentence, otherwise 0
    :param vocab_list: All words that appear in the training set
    :param input_sentence: Input sentence
    :return: The vector representative of the input sentence
    """
    ret_vector = [0] * len(vocab_list)
    for word in input_sentence:
        if word in vocab_list:
            ret_vector[vocab_list.index(word)] = 1
        else:
            print("the word {} is not in the vocabulary".format(word))
    return ret_vector


def word2vec_bag(vocab_list, input_sentence):
    """
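    Convert a sentence to a vector using the bag-of-words model, so that a word appearing more than once in a sentence is counted each time
    :param vocab_list: All words that appear in the training set
    :param input_sentence: Input sentence
    :return: The count vector representing the input sentence
    """
    ret_vector = [0] * len(vocab_list)
    for word in input_sentence:
        # skip words that are not in the vocabulary
        if word in vocab_list:
            ret_vector[vocab_list.index(word)] += 1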
    return ret_vector


def train_naive(train_matrix, train_category):
    """
    Get probabilities to calculate bayes classify result
    :param train_matrix: All sentence vector of train set
    :param train_category: The classify result of train set
    :return: p(w|c_0) p(w|c_1) p(c_1)
    """
    # number of comment
    doc_num = len(train_matrix)
    # number of word in the vocabulary
    word_num = len(train_matrix[0])

    # probability of an abusive comment, p(c_1)
    # Since this is a two-class problem, the probability of non-abusive is 1 - p_abuse
    p_abuse = sum(train_category) / float(doc_num)

    # Start the counts at 1 and the denominators at 2 (instead of np.zeros(word_num) and 0.0),
    # so that a word never seen in a class does not get a zero probability
    p0_num = np.ones(word_num)
    p1_num = np.ones(word_num)

    # p0_num / p0_denominator = p(w|c_0)
    p0_denominator = 2.0
    p1_denominator = 2.0

    for i in range(doc_num):
        # if this comment is abusive
        if train_category[i] == 1:
            p1_num += train_matrix[i]
            p1_denominator += sum(train_matrix[i])
        else:
            p0_num += train_matrix[i]
            p0_denominator += sum(train_matrix[i])
    # pi_condition holds log p(w_j|c_i) for every word; taking logs avoids underflow
    p1_condition = np.log(p1_num / p1_denominator)
    p0_condition = np.log(p0_num / p0_denominator)
    return p0_condition, p1_condition, p_abuse


def classify_naive(test_vector, p0_condition, p1_condition, p_1):
    # p0_condition and p1_condition already hold logs, so the product
    # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i) ... becomes a sum of logs
    # The asterisk is element-wise multiplication in numpy
    p1 = sum(test_vector * p1_condition) + np.log(p_1)
    p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
    if p1 > p0:
        return 1
    else:
        return 0


def test_naive():
    post_list, class_list = load_data_set()
    vocab = create_vocab_list(post_list)
    train_matrix = []
    for post in post_list:
        train_matrix.append(word2vec_set(vocab, post))
    p0_condition, p1_condition, p_abusive = train_naive(train_matrix, class_list)
    test_entry = ["love", "my", "dalmation"]
    test_vector = word2vec_set(vocab, test_entry)
    print("The vector of input sentence is: ", test_vector)
    print("Classify result is: ", classify_naive(test_vector, p0_condition, p1_condition, p_abusive))


post_list, classes = load_data_set()
print(post_list)
vocab = create_vocab_list(post_list)
print(word2vec_set(vocab, post_list[0]))
print(vocab)

train_matrix = []
for post in post_list:
    train_matrix.append(word2vec_set(vocab, post))
p_non_abusive_condition, p_abusive_condition, p_abusive = train_naive(train_matrix, classes)

print(p_abusive)
print(p_abusive_condition)

max_index = p_abusive_condition.argmax()
# the argmax of p_abusive_condition is 'stupid', i.e. 'stupid' is the word that contributes most to an abusive comment
print(vocab[max_index])
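
The initialisation in train_naive (counts starting at 1, denominators at 2) can be seen at work on a toy count vector. The sketch below uses made-up counts and plain Python lists instead of numpy, only to keep it self-contained.

```python
import math

# Toy word counts for one class over a 4-word vocabulary (made-up values).
counts = [3, 1, 0, 2]
total = sum(counts)

# Without smoothing: the unseen word gets probability 0, and log(0) is undefined.
raw = [c / total for c in counts]

# With the initialisation used in train_naive: every count starts at 1
# and the denominator starts at 2.
smoothed = [(c + 1) / (total + 2) for c in counts]
log_smoothed = [math.log(p) for p in smoothed]

print(raw)
print(smoothed)
print(log_smoothed)
```

Note that the smoothed values need not sum to 1 with this initialisation; textbook add-one (Laplace) smoothing adds the vocabulary size to the denominator instead.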

Filtering spam email

import re
import random
import numpy as np


def create_vocab_list(data_set):
    """
    Get a set of words which appear in the train data set
    :param data_set: train_data_set
    :return: set of words
    """
    vocab_set = set()
    for doc in data_set:
        # union of 2 sets
        vocab_set = vocab_set | set(doc)
    return list(vocab_set)


def word2vec_bag(vocab_list, input_sentence):
    """
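    Convert a sentence to a vector using the bag-of-words model, so that a word appearing more than once in a sentence is counted each time
    :param vocab_list: All words that appear in the training set
    :param input_sentence: Input sentence
    :return: The count vector representing the input sentence
    """
    ret_vector = [0] * len(vocab_list)
    for word in input_sentence:
        # skip words that are not in the vocabulary
        if word in vocab_list:
            ret_vector[vocab_list.index(word)] += 1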
    return ret_vector


def train_naive(train_matrix, train_category):
    """
    Get probabilities to calculate bayes classify result
    :param train_matrix: All sentence vector of train set
    :param train_category: The classify result of train set
    :return: p(w|c_0) p(w|c_1) p(c_1)
    """
    # number of comment
    doc_num = len(train_matrix)
    # number of word in the vocabulary
    word_num = len(train_matrix[0])

    # probability of class 1 (spam in this script), p(c_1)
    # Since this is a two-class problem, the probability of class 0 is 1 - p_abuse
    p_abuse = sum(train_category) / float(doc_num)

    # Start the counts at 1 and the denominators at 2 (instead of np.zeros(word_num) and 0.0),
    # so that a word never seen in a class does not get a zero probability
    p0_num = np.ones(word_num)
    p1_num = np.ones(word_num)

    # p0_num / p0_denominator = p(w|c_0)
    p0_denominator = 2.0
    p1_denominator = 2.0

    for i in range(doc_num):
        # if this comment is abusive
        if train_category[i] == 1:
            p1_num += train_matrix[i]
            p1_denominator += sum(train_matrix[i])
        else:
            p0_num += train_matrix[i]
            p0_denominator += sum(train_matrix[i])
    # pi_condition holds log p(w_j|c_i) for every word; taking logs avoids underflow
    p1_condition = np.log(p1_num / p1_denominator)
    p0_condition = np.log(p0_num / p0_denominator)
    return p0_condition, p1_condition, p_abuse


def classify_naive(test_vector, p0_condition, p1_condition, p_1):
    # p0_condition and p1_condition already hold logs, so the product
    # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i) ... becomes a sum of logs
    # The asterisk is element-wise multiplication in numpy
    p1 = sum(test_vector * p1_condition) + np.log(p_1)
    p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
    if p1 > p0:
        return 1
    else:
        return 0


def parse_text(input_sentence):
    token_list = re.split(r'\W+', input_sentence)
    return [token.lower() for token in token_list if len(token) > 2]


def spam_test():
    # Import and parse files
    doc_list = []
    class_list = []
    for i in range(1, 26):
        try:
            words = parse_text(open("email/spam/{}.txt".format(i)).read())
        except UnicodeDecodeError:
            # fall back to cp1252 for files that fail to decode with the default encoding
            words = parse_text(open("email/spam/{}.txt".format(i), encoding='cp1252').read())
        doc_list.append(words)
        class_list.append(1)

        try:
            words = parse_text(open("email/ham/{}.txt".format(i)).read())
        except UnicodeDecodeError:
            # fall back to cp1252 for files that fail to decode with the default encoding
            words = parse_text(open("email/ham/{}.txt".format(i), encoding='cp1252').read())
        doc_list.append(words)
        class_list.append(0)
    vocab = create_vocab_list(doc_list)

    # Generate Training Set and Test Set
    test_set = [int(num) for num in random.sample(range(50), 10)]
    training_set = list(set(range(50)) - set(test_set))

    training_matrix = []
    training_class = []
    for doc_index in training_set:
        training_matrix.append(word2vec_bag(vocab, doc_list[doc_index]))
        training_class.append(class_list[doc_index])
    p0_conditon, p1_conditon, p_spam = train_naive(np.array(training_matrix), np.array(training_class))

    # Test the classify result
    err_count = 0
    for doc_index in test_set:
        test_vector = word2vec_bag(vocab, doc_list[doc_index])
        classify_result = classify_naive(test_vector, p0_conditon, p1_conditon, p_spam)
        if classify_result != class_list[doc_index]:
            err_count += 1
    print("The error rate is {}".format(err_count / len(test_set)))


spam_test()
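
To see what the tokenizer in this script produces, here is a small usage sketch. It repeats parse_text so that it runs on its own; the sample sentence is chosen only for illustration.

```python
import re


def parse_text(input_sentence):
    # Split on runs of non-word characters, drop tokens of length <= 2, lowercase the rest
    token_list = re.split(r'\W+', input_sentence)
    return [token.lower() for token in token_list if len(token) > 2]


sample = "This book is the BEST book on Python or M.L. I have ever laid eyes upon."
tokens = parse_text(sample)
print(tokens)
```

Short fragments such as "is", "on", "M" and "L" are dropped by the length filter, along with the empty string that re.split leaves after the final period.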

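Summary

Naive Bayes and Bayes' rule provide a way to estimate unknown values from known ones;
The conditional independence assumption between features reduces the amount of data required; although this assumption is overly simple, naive Bayes is still an effective classifier
Implementing naive Bayes in code raises a number of practical issues, for example underflow, which is handled by taking the natural logarithm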
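
The underflow problem is easy to reproduce. The sketch below uses 400 hypothetical per-word probabilities of 0.01 each; the numbers are assumptions chosen so that the product falls below what a double can represent.

```python
import math

# 400 hypothetical per-word probabilities of 0.01 each.
probs = [0.01] * 400

# Multiplying them directly underflows to 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p

# Summing the natural logs keeps the value finite.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Comparing log_sum values between classes gives the same ranking as comparing the products would, because the logarithm is monotonically increasing.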
