Machine Learning in Action Reading Notes: Naive Bayes
Core idea: the classifier is asked to give its best guess of the class, together with an estimate of the probability of that guess.
We call it naive because the whole formalization makes only the most basic, simplest assumptions.
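Probability basics
$$p(c_i \mid x, y) = \frac{p(x, y \mid c_i)\, p(c_i)}{p(x, y)}$$
This means: given a data point described by $x$ and $y$, what is the probability that it comes from class $c_i$?
If $p(c_1 \mid x, y) > p(c_2 \mid x, y)$, the point belongs to class 1; otherwise it belongs to class 2.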
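As a minimal sketch of this decision rule, the snippet below applies Bayes' rule to two classes and picks the larger posterior. The likelihoods and priors are made-up numbers used only for illustration, and `posterior` is a hypothetical helper, not something from the book.

```python
def posterior(likelihoods, priors):
    """Return p(c_i | x, y) for every class via Bayes' rule.

    likelihoods[i] is p(x, y | c_i) and priors[i] is p(c_i).
    """
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x, y), identical for every class
    return [j / evidence for j in joint]


# Hypothetical numbers, chosen only to illustrate the rule.
post = posterior(likelihoods=[0.2, 0.6], priors=[0.5, 0.5])
predicted = 1 if post[0] > post[1] else 2
print(post, predicted)
```

Because the evidence $p(x, y)$ is the same for every class, comparing the numerators alone would give the same decision.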
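Independence: suppose each feature needs $N$ samples. With 10 features we would then need $N^{10}$ samples. If the features are independent of each other, the number of samples needed drops from $N^{10}$ to $10 \times N$. Independence here means statistical independence: the probability that a feature or word appears has nothing to do with which other words it appears next to. We know this assumption is not actually correct, and that is exactly what "naive" refers to.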
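To get a feel for the difference, here is a small arithmetic sketch. It assumes the reading that without independence the requirement grows as $N$ to the power of the number of features, and with independence it grows as the number of features times $N$. `samples_needed` is a hypothetical helper and `N = 100` is an arbitrary choice.

```python
def samples_needed(n_per_feature, n_features, independent):
    """Rough sample requirement with or without the independence assumption."""
    if independent:
        # Each feature can be estimated on its own.
        return n_features * n_per_feature
    # Every combination of feature values has to be covered.
    return n_per_feature ** n_features


N = 100  # arbitrary illustrative value
print(samples_needed(N, 10, independent=False))
print(samples_needed(N, 10, independent=True))
```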
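Naive Bayes assumptions:
- Features are independent of each other (the word "bacon" is as likely to appear after "unhealthy" as after "delicious")
- Every feature is equally important (deciding whether a message is appropriate requires looking at every word)
Although these two assumptions usually do not hold, naive Bayes makes them anyway.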
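Text classification with Python
Training the algorithm
Replacing $x, y$ in Bayes' rule with $\mathbf{w}$ gives
$$p(c_i \mid \mathbf{w}) = \frac{p(\mathbf{w} \mid c_i)\, p(c_i)}{p(\mathbf{w})}$$
where the bold type indicates that $\mathbf{w}$ is a vector. In addition, because of the naive Bayes independence assumption, $p(\mathbf{w} \mid c_i)$ can be computed as
$$p(\mathbf{w} \mid c_i) = p(w_0 \mid c_i)\, p(w_1 \mid c_i)\, p(w_2 \mid c_i) \cdots p(w_N \mid c_i)$$
which simplifies the computation.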
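The factorization can be sketched in a few lines. The per-word probabilities below are hypothetical values, not estimates from any data set, and `likelihood` is an illustrative helper. As in the classifier later in these notes, only the words present in the document contribute a factor.

```python
from math import prod

# Hypothetical p(w_j | c_i) for one class, for illustration only.
cond_probs = {"stupid": 0.2, "dog": 0.1, "my": 0.05}


def likelihood(words_present, cond_probs):
    """p(w | c_i) as the product of per-word probabilities p(w_j | c_i)."""
    return prod(cond_probs[word] for word in words_present)


p = likelihood(["stupid", "dog"], cond_probs)
print(p)
```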
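    Count the number of documents in each class
    For every training document:
        For each class:
            If a token appears in the document, increment the count for that token
            Increment the count for all tokens
    For each class:
        For each token:
            Divide the token count by the total token count to get the conditional probability
    Return the conditional probabilities for each class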
Hands-on
- Text classification
Classify whether a comment is abusive

    import numpy as np


    def load_data_set():
        """
        Generate the training data set and the associated class labels
        :return: (train_data_set, classify_result)
        """
        posting_list = [
            ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
            ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
            ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
            ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
            ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
            ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        class_vec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
        return posting_list, class_vec


    def create_vocab_list(data_set):
        """
        Get the list of distinct words that appear in the training data set
        :param data_set: train_data_set
        :return: list of unique words
        """
        vocab_set = set()
        for doc in data_set:
            # union of 2 sets
            vocab_set = vocab_set | set(doc)
        return list(vocab_set)


    def word2vec_set(vocab_list, input_sentence):
        """
        Convert a sentence to a vector using the set-of-words model:
        each entry is 1 if the word appears in the sentence, else 0
        :param vocab_list: all words that appear in the training set
        :param input_sentence: input sentence (list of tokens)
        :return: the vector representation of the input sentence
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] = 1
            else:
                print("the word {} is not in the vocabulary".format(word))
        return ret_vector


    def word2vec_bag(vocab_list, input_sentence):
        """
        Convert a sentence to a vector using the bag-of-words model,
        so that a word appearing more than once is counted each time
        :param vocab_list: all words that appear in the training set
        :param input_sentence: input sentence (list of tokens)
        :return: the vector of word counts
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            # skip words outside the vocabulary (index() would raise ValueError)
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] += 1
        return ret_vector


    def train_naive(train_matrix, train_category):
        """
        Compute the probabilities needed for the Bayes classification
        :param train_matrix: all sentence vectors of the training set
        :param train_category: the class labels of the training set
        :return: log p(w|c_0), log p(w|c_1), p(c_1)
        """
        # number of comments
        doc_num = len(train_matrix)
        # number of words in the vocabulary
        word_num = len(train_matrix[0])
        # probability of abusive, p(c_1)
        # This is a 2-class problem, so p(c_0) is simply 1 - p_abuse
        p_abuse = sum(train_category) / float(doc_num)

        # p0_num / p0_denominator = p(w|c_0)
        # Counts start at 1 and denominators at 2 instead of 0, so that a word
        # never seen in a class does not zero out the whole product.
        p0_num = np.ones(word_num)
        p1_num = np.ones(word_num)
        p0_denominator = 2.0
        p1_denominator = 2.0

        for i in range(doc_num):
            # if this comment is abusive
            if train_category[i] == 1:
                p1_num += train_matrix[i]
                p1_denominator += sum(train_matrix[i])
            else:
                p0_num += train_matrix[i]
                p0_denominator += sum(train_matrix[i])

        # pi_condition is log p(w|c_i); logs avoid underflow later on
        p1_condition = np.log(p1_num / p1_denominator)
        p0_condition = np.log(p0_num / p0_denominator)
        return p0_condition, p1_condition, p_abuse


    def classify_naive(test_vector, p0_condition, p1_condition, p_1):
        # train_naive already applied np.log, so the product
        # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i)... becomes a sum.
        # The asterisk is element-wise multiplication in numpy.
        test_vector = np.asarray(test_vector)
        p1 = sum(test_vector * p1_condition) + np.log(p_1)
        p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
        return 1 if p1 > p0 else 0


    def test_naive():
        post_list, class_list = load_data_set()
        vocab = create_vocab_list(post_list)
        train_matrix = []
        for post in post_list:
            train_matrix.append(word2vec_set(vocab, post))
        p0_condition, p1_condition, p_abusive = train_naive(train_matrix, class_list)
        test_entry = ["love", "my", "dalmation"]
        test_vector = word2vec_set(vocab, test_entry)
        print("The vector of input sentence is: ", test_vector)
        print("Classify result is: ",
              classify_naive(test_vector, p0_condition, p1_condition, p_abusive))


    post_list, classes = load_data_set()
    print(post_list)
    vocab = create_vocab_list(post_list)
    print(word2vec_set(vocab, post_list[0]))
    print(vocab)
    train_matrix = []
    for post in post_list:
        train_matrix.append(word2vec_set(vocab, post))
    p_non_abusive_condition, p_abusive_condition, p_abusive = train_naive(train_matrix, classes)
    print(p_abusive)
    print(p_abusive_condition)
    max_index = p_abusive_condition.argmax()
    # The argmax of p_abusive_condition is 'stupid', which basically means
    # the word 'stupid' contributes a lot to an abusive comment
    print(vocab[max_index])

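The `train_naive` listing above starts its counts at `np.ones` and its denominators at `2.0` rather than at zero. A minimal sketch of why, using made-up word counts for a single class (the arrays below are hypothetical and not taken from the data set):

```python
import numpy as np

# Hypothetical word counts for one class; word 1 was never seen.
counts = np.array([3, 0, 1])

# Plain relative frequencies versus the +1 / +2.0 initialization used above.
unsmoothed = counts / counts.sum()
smoothed = (counts + 1) / (counts.sum() + 2.0)

# A document that contains word 0 and the unseen word 1.
doc = np.array([1, 1, 0])

# Multiply the probabilities of the words that are present.
p_unsmoothed = np.prod(unsmoothed[doc == 1])
p_smoothed = np.prod(smoothed[doc == 1])
print(p_unsmoothed, p_smoothed)
```

Without smoothing, one unseen word forces the whole product to zero no matter what the other words say. The `2.0` in the denominator follows the listing; adding the vocabulary size instead is the more common form of add-one (Laplace) smoothing.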
- Spam filtering

    import re
    import random

    import numpy as np


    def create_vocab_list(data_set):
        """
        Get the list of distinct words that appear in the training data set
        :param data_set: train_data_set
        :return: list of unique words
        """
        vocab_set = set()
        for doc in data_set:
            # union of 2 sets
            vocab_set = vocab_set | set(doc)
        return list(vocab_set)


    def word2vec_bag(vocab_list, input_sentence):
        """
        Convert a sentence to a vector using the bag-of-words model,
        so that a word appearing more than once is counted each time
        :param vocab_list: all words that appear in the training set
        :param input_sentence: input sentence (list of tokens)
        :return: the vector of word counts
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            # skip words outside the vocabulary (index() would raise ValueError)
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] += 1
        return ret_vector


    def train_naive(train_matrix, train_category):
        """
        Compute the probabilities needed for the Bayes classification
        :param train_matrix: all sentence vectors of the training set
        :param train_category: the class labels of the training set
        :return: log p(w|c_0), log p(w|c_1), p(c_1)
        """
        # number of documents
        doc_num = len(train_matrix)
        # number of words in the vocabulary
        word_num = len(train_matrix[0])
        # probability of class 1, p(c_1)
        # This is a 2-class problem, so p(c_0) is simply 1 - p_abuse
        p_abuse = sum(train_category) / float(doc_num)

        # p0_num / p0_denominator = p(w|c_0)
        # Counts start at 1 and denominators at 2 instead of 0, so that a word
        # never seen in a class does not zero out the whole product.
        p0_num = np.ones(word_num)
        p1_num = np.ones(word_num)
        p0_denominator = 2.0
        p1_denominator = 2.0

        for i in range(doc_num):
            # if this document belongs to class 1
            if train_category[i] == 1:
                p1_num += train_matrix[i]
                p1_denominator += sum(train_matrix[i])
            else:
                p0_num += train_matrix[i]
                p0_denominator += sum(train_matrix[i])

        # pi_condition is log p(w|c_i); logs avoid underflow later on
        p1_condition = np.log(p1_num / p1_denominator)
        p0_condition = np.log(p0_num / p0_denominator)
        return p0_condition, p1_condition, p_abuse


    def classify_naive(test_vector, p0_condition, p1_condition, p_1):
        # train_naive already applied np.log, so the product
        # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i)... becomes a sum.
        # The asterisk is element-wise multiplication in numpy.
        test_vector = np.asarray(test_vector)
        p1 = sum(test_vector * p1_condition) + np.log(p_1)
        p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
        return 1 if p1 > p0 else 0


    def parse_text(input_sentence):
        # split on runs of non-word characters, drop short tokens, lowercase
        token_list = re.split(r'\W+', input_sentence)
        return [token.lower() for token in token_list if len(token) > 2]


    def read_email(path):
        # fall back to Windows-1252 if the default encoding cannot decode the file
        try:
            with open(path) as f:
                return f.read()
        except UnicodeDecodeError:
            with open(path, encoding='cp1252') as f:
                return f.read()


    def spam_test():
        # Import and parse files
        doc_list = []
        class_list = []
        for i in range(1, 26):
            doc_list.append(parse_text(read_email("email/spam/{}.txt".format(i))))
            class_list.append(1)
            doc_list.append(parse_text(read_email("email/ham/{}.txt".format(i))))
            class_list.append(0)
        vocab = create_vocab_list(doc_list)

        # Generate training set and test set (10 of the 50 documents held out)
        test_set = random.sample(range(50), 10)
        training_set = sorted(set(range(50)) - set(test_set))

        training_matrix = []
        training_class = []
        for doc_index in training_set:
            training_matrix.append(word2vec_bag(vocab, doc_list[doc_index]))
            training_class.append(class_list[doc_index])
        p0_condition, p1_condition, p_spam = train_naive(
            np.array(training_matrix), np.array(training_class))

        # Test the classification result
        err_count = 0
        for doc_index in test_set:
            test_vector = word2vec_bag(vocab, doc_list[doc_index])
            classify_result = classify_naive(test_vector, p0_condition, p1_condition, p_spam)
            if classify_result != class_list[doc_index]:
                err_count += 1
        print("The error rate is {}".format(err_count / len(test_set)))


    spam_test()

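To make the preprocessing concrete, the sketch below runs the same tokenization rule as `parse_text` on a made-up sentence and compares a set-of-words vector with a bag-of-words vector. The sentence, the sorted vocabulary, and the helper names `tokenize`, `to_set_vector` and `to_bag_vector` are assumptions for illustration only.

```python
import re


def tokenize(text):
    """Split on non-word characters, drop tokens of 2 chars or fewer, lowercase."""
    return [t.lower() for t in re.split(r'\W+', text) if len(t) > 2]


def to_set_vector(vocab, tokens):
    """1 if the vocabulary word occurs at all, else 0."""
    return [1 if word in tokens else 0 for word in vocab]


def to_bag_vector(vocab, tokens):
    """Number of times each vocabulary word occurs."""
    return [tokens.count(word) for word in vocab]


tokens = tokenize("Buy now! Buy cheap pills at M.L. prices, now.")
vocab = sorted(set(tokens))

print(tokens)
print(vocab)
print(to_set_vector(vocab, tokens))
print(to_bag_vector(vocab, tokens))
```

The short tokens "at", "M" and "L" are discarded by the length filter, and the repeated words "buy" and "now" are what distinguish the bag vector from the set vector.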
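Summary
- Naive Bayes and Bayes' rule provide a way to estimate unknown values from known ones;
- The conditional independence assumption between features reduces the amount of data required. Although this assumption is overly simple, naive Bayes is still an effective classifier;
- Implementing naive Bayes in code raises a number of practical issues, for example taking natural logarithms to avoid underflow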
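The underflow problem mentioned in the last point can be reproduced in a few lines. The probabilities below are hypothetical; the point is only that multiplying many small numbers collapses to zero in floating point, while summing their logarithms stays finite.

```python
import math

# 100 hypothetical word probabilities, each very small.
probs = [1e-5] * 100

# The direct product is 1e-500, far below the smallest positive float.
product = 1.0
for p in probs:
    product *= p

# The sum of logs represents the same quantity without underflow.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Since the logarithm is monotonic, comparing log-probabilities between classes gives the same decision as comparing the probabilities themselves.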