Chapter 4, "Classifying with Probability Theory: Naive Bayes", works through three examples step by step:
- simple sentence classification (abusive or not)
- spam email filtering
- revealing regional attitudes from personal ads
Simple sentence classification
- Get the training set and the test set.
- Scan all samples in the training set to build the vocabulary list vocabList, then encode each sample as a feature vector over vocabList: the k-th entry is 1 if the k-th word of vocabList occurs in that sample, and 0 otherwise.
- For each test sample, use the naive Bayes formula (shown right after this list) to compute the posterior probability of each class, and predict the class with the larger posterior.
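For reference, a sketch of the formula involved (standard naive Bayes, with w the word vector of a sample and ci a class):

p(ci | w) = p(w | ci) * p(ci) / p(w)

The "naive" conditional-independence assumption factorizes the likelihood into per-word terms:

p(w | ci) = p(w1 | ci) * p(w2 | ci) * ... * p(wn | ci)

Since p(w) is the same for every class, the classifier only needs to compare the numerators.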
Two points that need attention and improvement
- Computing a posterior multiplies many per-word probabilities together, so a single zero probability (a word never seen in some class) makes the whole product 0 and the result meaningless. To avoid this, initialize every word count in the numerators to 1 and both denominators to 2 (a form of Laplace smoothing; see trainNB0 below).
- Most of these probabilities are small, and multiplying many small numbers causes underflow: in Python the product simply becomes 0. The fix is to take logarithms, turning the product of many small probabilities into a sum of their logs, based on log(a * b) = log(a) + log(b). A minimal sketch of the problem and the fix follows this list.
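A minimal sketch of the underflow problem and the log fix (the probability values here are made up purely for illustration):

import numpy as np

probs = np.full(1000, 1e-5)     # 1000 small per-word probabilities
print(np.prod(probs))           # 0.0 -- the raw product underflows
print(np.sum(np.log(probs)))    # about -11512.9 -- the log-sum stays finite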
import numpy as np
import math
def loadDataSet():
    # six toy posts from a message board, plus their labels
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmatian', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return postingList, classVec
# create a list of all the unique words in all of our documents
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        # union of the running set and this document's words
        vocabSet = vocabSet | set(document)
    return list(vocabSet)
# set-of-words model: mark each vocabulary word as present (1) or absent (0)
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word:', word, 'is not in my Vocabulary!')
    return returnVec
# bag-of-words variant: count occurrences instead of mere presence
# (not used by testingNB below, but kept as an alternative encoding)
def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
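A quick illustration of the difference between the two encodings, using a hypothetical three-word vocabulary:

vocab = ['dog', 'stupid', 'my']
doc = ['my', 'dog', 'my']
print(setOfWords2Vec(vocab, doc))  # [1, 0, 1] -- presence only
print(bagOfWords2Vec(vocab, doc))  # [1, 0, 2] -- 'my' counted twice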
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    # prior probability of the abusive class
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # initialize word counts to 1 and denominators to 2 (the smoothing
    # described above) so that no conditional probability is ever 0
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        # vector addition: accumulate per-word counts for each class
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take logs to avoid underflow, based on log(a * b) = log(a) + log(b)
    p0Vect = np.log(p0Num / p0Denom)
    p1Vect = np.log(p1Num / p1Denom)
    return p0Vect, p1Vect, pAbusive
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # vec2Classify is the word vector of the document to classify;
    # p1 and p0 are the log posteriors, up to the shared p(w) term
    p1 = sum(vec2Classify * p1Vec) + math.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + math.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
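Why sum(vec2Classify * p1Vec) computes the log-likelihood: multiplying by vec2Classify zeroes out the log-probabilities of absent words, so the sum works out to

log p(c1 | w) ∝ Σk wk * log p(wk | c1) + log p(c1)

i.e. log(p(w | c1) * p(c1)). The shared denominator p(w) is dropped, since it is identical for both classes and cannot change which side of the comparison wins.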
def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    # convert every training post into a word vector
    trainMat = []
    for postInDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postInDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClasses)
    testEntry = ['love', 'my', 'dalmatian']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
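To run the demo end to end (the output shown is what the training data above should produce):

if __name__ == '__main__':
    testingNB()
# expected output:
# ['love', 'my', 'dalmatian'] classified as: 0
# ['stupid', 'garbage'] classified as: 1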
Revealing regional attitudes from personal ads
To be updated.