Machine Learning in Action Reading Notes: Naive Bayes

Core idea: the classifier is asked for its best guess of the class, together with an estimate of the probability of that guess.

It is called "naive" because the whole formalization makes only the most primitive, simplest assumptions.

Probability basics

p(c_i|x,y) = \frac{p(x,y|c_i)p(c_i)}{p(x,y)}

Here p(c_i|x,y) means: given a data point described by x, y, what is the probability that it comes from class c_i?

If p(c_1|x,y) > p(c_2|x,y), the point is assigned to class 1; if p(c_1|x,y) < p(c_2|x,y), it is assigned to class 2.
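
As a small worked example of this decision rule, the sketch below applies Bayes' rule to two classes. The likelihoods and priors are made-up numbers chosen only for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(likelihoods, priors):
    """Return p(c_i | x, y) for every class via Bayes' rule."""
    # Numerator of Bayes' rule for each class: p(x, y | c_i) * p(c_i)
    joint = [l * p for l, p in zip(likelihoods, priors)]
    # Denominator p(x, y): the sum of the numerators over all classes
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Assumed values, for illustration only
likelihoods = [0.2, 0.6]   # p(x, y | c_1), p(x, y | c_2)
priors = [0.5, 0.5]        # p(c_1), p(c_2)

post = posterior(likelihoods, priors)
predicted = 1 if post[0] > post[1] else 2
print(post, predicted)
```

Because the denominator p(x,y) is the same for every class, comparing the numerators alone gives the same decision.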

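Independence: if each feature needs N samples, then 10 features need N^{10} samples. If the features are independent of one another, the number of samples needed drops from N^{10} to 10N. Independence here means statistical independence: the probability that a feature or word occurs has nothing to do with which other words are next to it. We know this assumption is not actually true, and that is exactly what "naive" refers to.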
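
The arithmetic behind this claim, with N = 1000 as an assumed value chosen only for illustration:

```python
N = 1000           # assumed number of samples needed per feature (illustrative)
num_features = 10

without_independence = N ** num_features   # N^10
with_independence = num_features * N       # 10N

print(without_independence)  # equals 10**30
print(with_independence)     # 10000
```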

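Naive Bayes assumptions

  • Features are independent of one another (the word bacon is as likely to appear after unhealthy as after delicious)
  • Every feature is equally important (to judge whether a post is appropriate, every word has to be looked at)

These two assumptions usually do not hold, but naive Bayes makes them anyway.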

Text classification with Python

Training the algorithm

p(c_i|\textbf{w})=\frac{p(\textbf{w}|c_i)p(c_i)}{p(\textbf{w})}

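Here the x, y of the earlier formula are replaced by \textbf{w}; the bold face indicates that it is a vector. In addition, because of the naive Bayes independence assumption, p(\textbf{w}|c_i) can be computed with the following formula, which simplifies the calculation.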
p(\textbf{w}|c_i) = p(w_0,w_1,...,w_n|c_i) =\prod_{j}p(w_j|c_i)
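
A minimal sketch of this factorization: under the independence assumption, p(\textbf{w}|c_i) for a document is the product of the per-word conditional probabilities of the words it contains. The probabilities below are invented for illustration, and `word_probs` and `likelihood` are hypothetical names.

```python
import math

# Assumed per-word conditional probabilities p(w_j | c_i) for one class
word_probs = {"stupid": 0.2, "dog": 0.1, "park": 0.05}


def likelihood(words, probs):
    """p(w | c_i) as the product of p(w_j | c_i) over the words present."""
    return math.prod(probs[w] for w in words)


doc = ["stupid", "dog"]
print(likelihood(doc, word_probs))  # roughly 0.2 * 0.1
```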

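Count the number of documents in each class
For each training document:
    For each class:
        If a token appears in the document, increment the count for that token
        Increment the count of all tokens
    For each class:
        For each token:
            Divide the token count by the total token count to get the conditional probability
    Return the conditional probabilities for each class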
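
The pseudocode above can be sketched directly with counters. This is an illustrative version without smoothing, assuming documents are lists of tokens and labels are class ids; `train_counts` is a hypothetical name, and the tiny data set is invented.

```python
from collections import Counter, defaultdict


def train_counts(docs, labels):
    """Estimate p(token | class) by plain counting (no smoothing)."""
    token_counts = defaultdict(Counter)  # class -> token -> count
    total_tokens = Counter()             # class -> total token count
    for doc, label in zip(docs, labels):
        for token in doc:
            token_counts[label][token] += 1
            total_tokens[label] += 1
    # Divide each token count by the class total to get p(token | class)
    return {
        label: {tok: n / total_tokens[label] for tok, n in counts.items()}
        for label, counts in token_counts.items()
    }


# Invented toy data: 1 marks an abusive document, 0 a normal one
docs = [["stupid", "dog"], ["cute", "dog"], ["stupid", "garbage"]]
labels = [1, 0, 1]
cond = train_counts(docs, labels)
print(cond[1]["stupid"])  # 2 of the 4 tokens in class 1
```

In this plain-counting version a token never seen in a class has no entry at all, which amounts to a probability of zero for that class.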

Hands-on

  • Text classification

    Classify whether a comment is abusive

    import numpy as np
    
    
    def load_data_set():
        """
        Generate train data set and associated classify result
        :return: (train_data_set, classify_result)
        """
        posting_list = [
            ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
            ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
            ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
            ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
            ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
            ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        class_vec = [0, 1, 0, 1, 0, 1]  # 1 is an abuse, 0 is not
        return posting_list, class_vec
    
    
    def create_vocab_list(data_set):
        """
        Get a set of words which appear in the train data set
        :param data_set: train_data_set
        :return: set of words
        """
        vocab_set = set()
        for doc in data_set:
            # union of 2 sets
            vocab_set = vocab_set | set(doc)
        return list(vocab_set)
    
    
    def word2vec_set(vocab_list, input_sentence):
        """
        Convert a sentence to a vector using the set-of-words model: an entry is 1 if the vocabulary word appears in the sentence, else 0
        :param vocab_list: All words that appear in the training set
        :param input_sentence: Input sentence
        :return: The vector representing the input sentence
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] = 1
            else:
                print("the word {} is not in the vocabulary".format(word))
        return ret_vector
    
    
    def word2vec_bag(vocab_list, input_sentence):
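        """
        Convert a sentence to a vector using the bag-of-words model, where a word that appears more than once in a sentence is counted each time
        :param vocab_list: All words that appear in the training set
        :param input_sentence: Input sentence
        :return: The count vector representing the input sentence
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            # skip words outside the vocabulary; index() would raise ValueError on them
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] += 1
        return ret_vector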
    
    
    def train_naive(train_matrix, train_category):
        """
        Get probabilities to calculate bayes classify result
        :param train_matrix: All sentence vector of train set
        :param train_category: The classify result of train set
        :return: p(w|c_0) p(w|c_1) p(c_1)
        """
        # number of comment
        doc_num = len(train_matrix)
        # number of word in the vocabulary
        word_num = len(train_matrix[0])
    
        # prior probability of an abusive comment, p(c_1)
        # since this is a two-class problem, p(c_0) = 1 - p_abuse
        p_abuse = sum(train_category) / float(doc_num)
    
        # Counts start at 1 and denominators at 2.0 rather than 0, so that a word
        # never seen in a class does not get probability 0 and zero out the product
        p0_num = np.ones(word_num)
        p1_num = np.ones(word_num)
    
        # p0_num / p0_denominator = p(w|c_0)
        p0_denominator = 2.0
        p1_denominator = 2.0
    
        for i in range(doc_num):
            # if this comment is abusive
            if train_category[i] == 1:
                p1_num += train_matrix[i]
                p1_denominator += sum(train_matrix[i])
            else:
                p0_num += train_matrix[i]
                p0_denominator += sum(train_matrix[i])
        # pi_condition holds log p(w_j|c_i) for every word; logs avoid underflow
        p1_condition = np.log(p1_num / p1_denominator)
        p0_condition = np.log(p0_num / p0_denominator)
        return p0_condition, p1_condition, p_abuse
    
    
    def classify_naive(test_vector, p0_condition, p1_condition, p_1):
        # The conditionals are already log probabilities, so the product
        # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i)... becomes a sum of logs
        # The asterisk is an element-wise multiply in numpy
        p1 = sum(test_vector * p1_condition) + np.log(p_1)
        p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
        if p1 > p0:
            return 1
        else:
            return 0
    
    
    def test_naive():
        post_list, class_list = load_data_set()
        vocab = create_vocab_list(post_list)
        train_matrix = []
        for post in post_list:
            train_matrix.append(word2vec_set(vocab, post))
        p0_condition, p1_condition, p_abusive = train_naive(train_matrix, class_list)
        test_entry = ["love", "my", "dalmation"]
        test_vector = word2vec_set(vocab, test_entry)
        print("The vector of input sentence is: ", test_vector)
        print("Classify result is: ", classify_naive(test_vector, p0_condition, p1_condition, p_abusive))
    
    
    post_list, classes = load_data_set()
    print(post_list)
    vocab = create_vocab_list(post_list)
    print(word2vec_set(vocab, post_list[0]))
    print(vocab)
    
    train_matrix = []
    for post in post_list:
        train_matrix.append(word2vec_set(vocab, post))
    p_non_abusive_condition, p_abusive_condition, p_abusive = train_naive(train_matrix, classes)
    
    print(p_abusive)
    print(p_abusive_condition)
    
    max_index = p_abusive_condition.argmax()
    # the largest entry of p_abusive_condition belongs to 'stupid', i.e. 'stupid' is the word most indicative of an abusive comment
    print(vocab[max_index])
    
    
  • Filtering spam email

    import re
    import random
    import numpy as np
    
    
    def create_vocab_list(data_set):
        """
        Get a set of words which appear in the train data set
        :param data_set: train_data_set
        :return: set of words
        """
        vocab_set = set()
        for doc in data_set:
            # union of 2 sets
            vocab_set = vocab_set | set(doc)
        return list(vocab_set)
    
    
    def word2vec_bag(vocab_list, input_sentence):
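        """
        Convert a sentence to a vector using the bag-of-words model, where a word that appears more than once in a sentence is counted each time
        :param vocab_list: All words that appear in the training set
        :param input_sentence: Input sentence
        :return: The count vector representing the input sentence
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            # skip words outside the vocabulary; index() would raise ValueError on them
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] += 1
        return ret_vector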
    
    
    def train_naive(train_matrix, train_category):
        """
        Get probabilities to calculate bayes classify result
        :param train_matrix: All sentence vector of train set
        :param train_category: The classify result of train set
        :return: p(w|c_0) p(w|c_1) p(c_1)
        """
        # number of comment
        doc_num = len(train_matrix)
        # number of word in the vocabulary
        word_num = len(train_matrix[0])
    
        # prior probability of class 1, p(c_1) (spam in this script)
        # since this is a two-class problem, p(c_0) = 1 - p_abuse
        p_abuse = sum(train_category) / float(doc_num)
    
        # Counts start at 1 and denominators at 2.0 rather than 0, so that a word
        # never seen in a class does not get probability 0 and zero out the product
        p0_num = np.ones(word_num)
        p1_num = np.ones(word_num)
    
        # p0_num / p0_denominator = p(w|c_0)
        p0_denominator = 2.0
        p1_denominator = 2.0
    
        for i in range(doc_num):
            # if this comment is abusive
            if train_category[i] == 1:
                p1_num += train_matrix[i]
                p1_denominator += sum(train_matrix[i])
            else:
                p0_num += train_matrix[i]
                p0_denominator += sum(train_matrix[i])
        # pi_condition holds log p(w_j|c_i) for every word; logs avoid underflow
        p1_condition = np.log(p1_num / p1_denominator)
        p0_condition = np.log(p0_num / p0_denominator)
        return p0_condition, p1_condition, p_abuse
    
    
    def classify_naive(test_vector, p0_condition, p1_condition, p_1):
        # The conditionals are already log probabilities, so the product
        # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i)... becomes a sum of logs
        # The asterisk is an element-wise multiply in numpy
        p1 = sum(test_vector * p1_condition) + np.log(p_1)
        p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
        if p1 > p0:
            return 1
        else:
            return 0
    
    
    def parse_text(input_sentence):
        token_list = re.split(r'\W+', input_sentence)
        return [token.lower() for token in token_list if len(token) > 2]
    
    
    def spam_test():
        # Import and parse files
        doc_list = []
        class_list = []
        for i in range(1, 26):
            try:
                words = parse_text(open("email/spam/{}.txt".format(i)).read())
            except UnicodeDecodeError:
                # the file cannot be decoded with the default encoding; retry as Windows-1252
                words = parse_text(open("email/spam/{}.txt".format(i), encoding='cp1252').read())
            doc_list.append(words)
            class_list.append(1)
    
            try:
                words = parse_text(open("email/ham/{}.txt".format(i)).read())
            except UnicodeDecodeError:
                # the file cannot be decoded with the default encoding; retry as Windows-1252
                words = parse_text(open("email/ham/{}.txt".format(i), encoding='cp1252').read())
            doc_list.append(words)
            class_list.append(0)
        vocab = create_vocab_list(doc_list)
    
        # Generate Training Set and Test Set
        test_set = [int(num) for num in random.sample(range(50), 10)]
        training_set = list(set(range(50)) - set(test_set))
    
        training_matrix = []
        training_class = []
        for doc_index in training_set:
            training_matrix.append(word2vec_bag(vocab, doc_list[doc_index]))
            training_class.append(class_list[doc_index])
        p0_conditon, p1_conditon, p_spam = train_naive(np.array(training_matrix), np.array(training_class))
    
        # Test the classify result
        err_count = 0
        for doc_index in test_set:
            test_vector = word2vec_bag(vocab, doc_list[doc_index])
            classify_result = classify_naive(test_vector, p0_conditon, p1_conditon, p_spam)
            if classify_result != class_list[doc_index]:
                err_count += 1
        print("The error rate is {}".format(err_count / len(test_set)))
    
    
    spam_test()
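
Both listings above start the word counts at one and the denominators at 2.0. A minimal sketch of what that initialization avoids, using invented counts for a three-word vocabulary: with zero-initialized counts, a word never seen in a class gets probability zero, and that single zero wipes out the whole product.

```python
import numpy as np

# Invented counts for a 3-word vocabulary in one class; the last word was never seen
counts = np.array([3.0, 1.0, 0.0])

# Zero-initialized estimate: the unseen word gets probability 0
p_raw = counts / counts.sum()

# Initialization used in train_naive: counts start at 1, denominator at 2.0
p_smooth = (counts + 1.0) / (counts.sum() + 2.0)

doc = np.array([1, 0, 1])  # the document contains word 0 and the unseen word 2

# Product of the probabilities of the words present in the document
print(np.prod(p_raw[doc == 1]))     # 0.0, the unseen word zeroes the product
print(np.prod(p_smooth[doc == 1]))  # stays strictly positive
```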
    

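Summary

  • Naive Bayes and Bayes' rule provide a way to estimate unknown values from known ones;
  • The assumption of conditional independence among features reduces the amount of data needed. Although this assumption is overly simple, naive Bayes is still an effective classifier
  • Implementing naive Bayes in code raises a number of practical issues, for example taking the natural logarithm to avoid underflow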
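
To illustrate the underflow issue mentioned in the last point, the sketch below multiplies many small probabilities directly and then sums their logarithms instead. The probability value and the word count are arbitrary, chosen only to trigger underflow.

```python
import math

p = 1e-5     # an arbitrary small per-word probability
n = 100      # number of words in a hypothetical document

# Direct product: 1e-500 is below the smallest positive float, so it underflows to 0.0
product = 1.0
for _ in range(n):
    product *= p

# Sum of logs: the same quantity in log space stays finite
log_sum = sum(math.log(p) for _ in range(n))

print(product)   # 0.0
print(log_sum)
```

Since the logarithm is monotonic, comparing log scores picks the same class as comparing the original products would.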