Notes on Machine Learning in Action (Part 3): Naive Bayes

Chapter 4, "Classifying with probability theory: naive Bayes", works through three examples step by step:

  • simple sentence classification (abusive or not)
  • spam email filtering
  • revealing local attitudes from personal ads

Simple sentence classification

  1. Obtain a training set and a test set.
  2. Scan all samples in the training set to build the vocabulary list vocabList, then encode each sample as a vector over vocabList: an entry is 1 if the corresponding word appears in that sample, 0 otherwise.
  3. For each test sample, compute the posterior probability of each class with the naive Bayes formula and predict the class with the higher posterior.
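In symbols, the posterior in step 3 follows from Bayes' rule; the "naive" conditional-independence assumption turns the likelihood of a word vector w into a product over individual words:

```latex
P(c_i \mid \mathbf{w}) = \frac{P(\mathbf{w} \mid c_i)\,P(c_i)}{P(\mathbf{w})}
                 \propto P(c_i) \prod_{k=1}^{n} P(w_k \mid c_i)
```

Since P(w) is the same for every class, it can be dropped when comparing posteriors.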

Two issues to watch for and fix

  1. The posterior is a product of many probabilities, so a single zero factor makes the whole product zero and the comparison meaningless. To avoid this, apply Laplace smoothing: initialize every word's numerator count to 1 and each class's denominator to 2.
  2. Most of these probabilities are small, and multiplying many small numbers underflows: in Python the product silently becomes 0. Taking logarithms turns the product of many small probabilities into a sum of their logarithms, which stays representable.
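The underflow in point 2 is easy to reproduce; the numbers below are made up purely for illustration:

```python
import math

probs = [0.01] * 500  # 500 hypothetical per-word probabilities

# Direct multiplication: 0.01 ** 500 = 1e-1000, far below the smallest
# positive double (~5e-324), so the running product collapses to 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# The sum of logarithms stays comfortably representable.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # 500 * ln(0.01), about -2302.59
```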
import numpy as np
import math
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmatian', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]   # 1 is abusive, 0 is not
    return postingList, classVec
# create a list of all the unique words in all of our documents
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word:', word, 'is not in my Vocabulary!')
    return returnVec
# Bag-of-words variant: records how many times each word occurs,
# rather than just whether it occurs at all.
def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
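The difference between the set-of-words and bag-of-words encodings shows up as soon as a word repeats; the vocabulary and document below are made-up examples:

```python
vocab = ['dog', 'stupid', 'my']
doc = ['my', 'dog', 'ate', 'my', 'steak']

# Set-of-words: 1 if the word appears at all in the document.
set_vec = [1 if w in doc else 0 for w in vocab]
print(set_vec)  # [1, 0, 1]

# Bag-of-words: how many times the word appears ('my' occurs twice).
bag_vec = [doc.count(w) for w in vocab]
print(bag_vec)  # [1, 0, 2]
```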
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Laplace smoothing: start every word's numerator count at 1 and
    # each class's denominator at 2 so no conditional probability is 0.
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        # vector addition
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # to avoid underflow based on the equation log(a * b) = log(a) + log(b)
    p0Vect = np.log(p0Num / p0Denom)
    p1Vect = np.log(p1Num / p1Denom)
    return p0Vect, p1Vect, pAbusive
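As a sanity check on the smoothing in trainNB0, the p1Vect entry for 'stupid' can be reproduced by hand. With the set-of-words encoding, 'stupid' occurs in all 3 abusive posts, and those posts contribute 8 + 5 + 6 = 19 word occurrences in total:

```python
import math

count = 3    # set-of-words occurrences of 'stupid' in abusive posts
total = 19   # total set-of-words counts across the abusive posts

# Laplace-smoothed estimate, matching p1Num = ones(), p1Denom = 2.0
p_stupid = (count + 1) / (total + 2)
print(p_stupid)            # 4/21, about 0.1905
print(math.log(p_stupid))  # about -1.658
```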
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # vec2Classify is a vector to classify
    p1 = sum(vec2Classify * p1Vec) + math.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + math.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postInDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postInDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClasses)
    testEntry = ['love', 'my', 'dalmatian']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
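The decision rule in classifyNB can also be exercised in isolation on made-up log-probability vectors (the function is repeated here so the sketch is self-contained; the numbers are illustrative, not trained):

```python
import math
import numpy as np

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + math.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + math.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0

# Two-word toy vocabulary; each class strongly favors one word.
p0Vec = np.log(np.array([0.6, 0.1]))
p1Vec = np.log(np.array([0.1, 0.6]))

doc = np.array([0, 1])  # document contains only the second word
print(classifyNB(doc, p0Vec, p1Vec, 0.5))  # 1: word 2 points to class 1
```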

Revealing local attitudes from personal ads

To be updated.
