Classification Algorithms for Unstructured Text
Data such as height, weight, or votes on a bill shares one property: it can be laid out in a table. We call data like this "structured data".
Each record in the dataset (a row of the table) is described by a number of features (the columns of the table).

"Unstructured data" refers to things like email messages, tweets, blog posts, and news articles. At first glance, this kind of data cannot be laid out in a single table.
We can classify unstructured text with the naive Bayes algorithm.

We choose the maximum a posteriori (MAP) hypothesis:

hMAP = arg max over h ∈ H of P( D | h ) × P( h )

h ∈ H ranges over the candidate hypotheses (here, the classes);
P( D | h ) is the probability of the data D given h (for example, the probability that a particular word occurs in articles of a given class);
P( h ) is the prior probability of hypothesis h.
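As a toy numerical illustration (both the priors and the likelihoods below are invented for the example), picking hMAP means choosing the h that maximizes P( D | h ) × P( h ):

```python
# Two candidate hypotheses with made-up priors P(h)
# and made-up likelihoods P(D | h) for some observed data D.
priors = {'like': 0.5, 'dislike': 0.5}
likelihoods = {'like': 0.012, 'dislike': 0.003}

# hMAP = the h that maximizes P(D | h) * P(h)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(h_map)  # -> like
```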
The training phase
First, count how many distinct words occur across all the training texts; call this | Vocabulary | (the vocabulary size).
For each word wk we will compute P( wk | hi ). For each hypothesis hi (here, the two classes "like" and "dislike") the steps are:
1. Combine all the articles in that class into a single document;
2. Count the total number of word occurrences in it; call it n;
3. For each word wk in the vocabulary, count how many times it occurs in this class's articles; call it nk;
4. Finally, apply the formula below:

P( wk | hi ) = ( nk + 1 ) / ( n + | Vocabulary | )
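The four steps above can be sketched as follows. The word counts are made up for illustration, and the smoothed estimate follows the formula P( wk | hi ) = ( nk + 1 ) / ( n + | Vocabulary | ):

```python
# Made-up data for one class hi: the combined articles contain n words in
# total, with per-word counts nk; the vocabulary spans all classes.
vocabulary = ['stunned', 'hype', 'gravity', 'great', 'boring']
counts_in_class = {'stunned': 3, 'hype': 1, 'great': 6}  # absent words: nk = 0
n = sum(counts_in_class.values())                        # n = 10

prob = {}
for wk in vocabulary:
    nk = counts_in_class.get(wk, 0)
    # Laplace smoothing keeps unseen words from getting probability 0
    prob[wk] = float(nk + 1) / (n + len(vocabulary))

print(prob['stunned'])  # (3 + 1) / (10 + 5)
print(prob['boring'])   # (0 + 1) / (10 + 5)
```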
Classifying with naive Bayes

Let's judge whether the sentence below is positive or negative:
I am stunned by the hype over gravity.
We need to compute the two probabilities below and pick the larger one:
P( like ) × P( I | like ) × P( am | like ) × P( stunned | like ) × ...
P( dislike ) × P( I | dislike ) × P( am | dislike ) × P( stunned | dislike ) × ...
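With invented conditional probabilities (the real values come from the training phase), the comparison works like this:

```python
# Invented per-token probabilities for each class (real ones come from training).
p_tokens_like = [0.0009, 0.004, 0.0002, 0.001]      # P(I|like), P(am|like), ...
p_tokens_dislike = [0.0012, 0.005, 0.0001, 0.0004]  # P(I|dislike), ...

score_like = 0.5       # prior P(like)
for p in p_tokens_like:
    score_like *= p

score_dislike = 0.5    # prior P(dislike)
for p in p_tokens_dislike:
    score_dislike *= p

winner = 'like' if score_like > score_dislike else 'dislike'
print(winner)  # -> like
```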

A result such as 6.22E-22 is scientific notation; it is equivalent to 6.22 × 10^-22.
Multiplying many probabilities this small will eventually underflow the smallest floating-point number Python can represent, so we compute with logarithms instead: we add up the logarithm of each probability.
For example, for a text of 100 words in which each word has probability 0.0001:
import math

p = 0
for i in range(100):
    p += math.log(0.0001)
Tips:
- b^n = x is equivalent to log_b( x ) = n
- log10( a × b ) = log10( a ) + log10( b )
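A quick check, with arbitrary probabilities, that summing logarithms preserves the ordering of the products, so the winning class is the same either way:

```python
import math

p_like = [0.5, 0.01, 0.002, 0.03]      # arbitrary example probabilities
p_dislike = [0.5, 0.02, 0.001, 0.001]

product_like, product_dislike = 1.0, 1.0
for p in p_like:
    product_like *= p
for p in p_dislike:
    product_dislike *= p

log_like = sum(math.log(p) for p in p_like)
log_dislike = sum(math.log(p) for p in p_dislike)

# The comparison comes out the same in product space and in log space.
print(product_like > product_dislike)  # True
print(log_like > log_dislike)          # True
```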
Common words and stop words
"These words that make up the grammatical structure of language carry no meaning of their own; on the contrary, they create a great deal of noise." (H.P. Luhn)
In other words, removing these "noise" words should improve classification accuracy. We call them "stop words", and ready-made stop-word lists are available. The reasons for removing them are:
1. It reduces the amount of data to process;
2. Their presence can drag down classification accuracy.
While words like the and a really do carry little meaning, common words such as work, write, or school are still useful in some contexts, and putting them on a stop-word list can cause problems. Building a custom stop-word list therefore takes some extra thought.
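A minimal sketch of filtering tokens through a stop-word set; the stop-word list here is a tiny invented sample, not one of the real lists:

```python
# A tiny made-up stop-word set; real lists contain dozens or hundreds of words.
stopwords = {'the', 'a', 'i', 'am', 'by', 'over'}

line = "I am stunned by the hype over gravity."
tokens = []
for token in line.split():
    token = token.strip('\'".,?:-').lower()  # same cleanup the classifier uses
    if token != '' and token not in stopwords:
        tokens.append(token)

print(tokens)  # -> ['stunned', 'hype', 'gravity']
```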
The classifier's initialization code needs to do the following:
1. Read in the stop-word list;
2. Get the names of the subdirectories (the classes) of the training directory;
3. For each class, call the train method to count word occurrences;
4. Apply the formula below to every word in the vocabulary:

P( wk | hi ) = ( nk + 1 ) / ( n + | Vocabulary | )
Initializing the classifier
from __future__ import print_function
import os, codecs, math

class BayesText:

    def __init__(self, trainingdir, stopwordlist):
        """Naive Bayes classifier.
        trainingdir is the training directory; its subdirectories are the
        classes, and each subdirectory contains a number of text files.
        stopwordlist is a file of stop words, one per line.
        """
        self.vocabulary = {}
        self.prob = {}
        self.totals = {}
        self.stopwords = {}
        f = open(stopwordlist)
        for line in f:
            self.stopwords[line.strip()] = 1
        f.close()
        categories = os.listdir(trainingdir)
        # filter out entries that are not directories
        self.categories = [filename for filename in categories
                           if os.path.isdir(trainingdir + filename)]
        print("Counting ...")
        for category in self.categories:
            print(' ' + category)
            (self.prob[category],
             self.totals[category]) = self.train(trainingdir, category)
        # delete words that occur fewer than 3 times
        toDelete = []
        for word in self.vocabulary:
            if self.vocabulary[word] < 3:
                # we can't delete while iterating, so mark for deletion
                toDelete.append(word)
        # delete the marked words
        for word in toDelete:
            del self.vocabulary[word]
        # compute the probabilities
        vocabLength = len(self.vocabulary)
        print("Computing probabilities:")
        for category in self.categories:
            print(' ' + category)
            denominator = self.totals[category] + vocabLength
            for word in self.vocabulary:
                if word in self.prob[category]:
                    count = self.prob[category][word]
                else:
                    count = 0
                # Laplace smoothing: ( nk + 1 ) / ( n + | Vocabulary | )
                self.prob[category][word] = (float(count + 1)
                                             / denominator)
        print("DONE TRAINING\n\n")
    def train(self, trainingdir, category):
        """Count the occurrences of each word within one class."""
        currentdir = trainingdir + category
        files = os.listdir(currentdir)
        counts = {}
        total = 0
        for file in files:
            #print(currentdir + '/' + file)
            f = codecs.open(currentdir + '/' + file, 'r', 'iso8859-1')
            for line in f:
                tokens = line.split()
                for token in tokens:
                    # strip punctuation and lowercase the token
                    token = token.strip('\'".,?:-')
                    token = token.lower()
                    if token != '' and not token in self.stopwords:
                        self.vocabulary.setdefault(token, 0)
                        self.vocabulary[token] += 1
                        counts.setdefault(token, 0)
                        counts[token] += 1
                        total += 1
            f.close()
        return (counts, total)
The classifier
    def classify(self, filename):
        results = {}
        for category in self.categories:
            results[category] = 0
        f = codecs.open(filename, 'r', 'iso8859-1')
        for line in f:
            tokens = line.split()
            for token in tokens:
                #print(token)
                token = token.strip('\'".,?:-').lower()
                if token in self.vocabulary:
                    for category in self.categories:
                        if self.prob[category][token] == 0:
                            print("%s %s" % (category, token))
                        results[category] += math.log(
                            self.prob[category][token])
        f.close()
        results = list(results.items())
        results.sort(key=lambda pair: pair[1], reverse=True)
        # for debugging you can print the whole list here
        return results[0][0]
Classifying every document and computing accuracy
    def testCategory(self, directory, category):
        files = os.listdir(directory)
        total = 0
        correct = 0
        for file in files:
            total += 1
            result = self.classify(directory + file)
            if result == category:
                correct += 1
        return (correct, total)

    def test(self, testdir):
        """The test directory has the same layout as the training directory."""
        categories = os.listdir(testdir)
        # filter out entries that are not directories
        categories = [filename for filename in categories
                      if os.path.isdir(testdir + filename)]
        correct = 0
        total = 0
        for category in categories:
            print(".", end="")
            (catCorrect, catTotal) = self.testCategory(
                testdir + category + '/', category)
            correct += catCorrect
            total += catTotal
        print("\n\nAccuracy is %f%% (%i test instances)" %
              ((float(correct) / total) * 100, total))
Putting it all together (the complete file was assembled by the translator, @author: KingSley, 2018-11-28), the driver code below runs the classifier over the 20 Newsgroups data with three stop-word lists of different sizes:
baseDirectory = '/20news-bydate/'
trainingDir = baseDirectory + '20news-bydate-train/'
testDir = baseDirectory + '20news-bydate-test/'

print("Reg stoplist 0 ")
bT = BayesText(trainingDir, baseDirectory + "stopwords0.txt")
print("Running Test ...")
bT.test(testDir)

print("\n\nReg stoplist 25 ")
bT = BayesText(trainingDir, baseDirectory + "stopwords25.txt")
print("Running Test ...")
bT.test(testDir)

print("\n\nReg stoplist 174 ")
bT = BayesText(trainingDir, baseDirectory + "stopwords174.txt")
print("Running Test ...")
bT.test(testDir)
Original author: Ron Zacharski (CC BY-NC 3.0), https://github.com/egrcc/guidetodatamining
Original text: http://guidetodatamining.com/
Reference translation by @egrcc: https://github.com/egrcc/guidetodatamining

