First, an intuitive explanation of the topic model LDA: suppose we need to write an article about a newly launched car. We first settle on the article's rough topics, say the car's power, exterior, and interior. With the topics fixed, we start writing, and writing is essentially choosing suitable words within the chosen topics.
Power words: engine, turbocharger, power output, fuel consumption, torque, etc.;
Exterior words: xenon lights, sunroof, rear-view mirrors, front fascia, grille lights, etc.;
Interior words: dashboard, center console, steering wheel, seats, backrests, etc.
Finally, add proper grammar, and the article is done. This process, in which the article determines topics and the topics determine words, is the LDA process. Baidu Baike explains it as follows:
LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also described as a three-level Bayesian probabilistic model with word, topic, and document levels. "Generative" means each word of a document is assumed to be produced by the process "choose a topic with some probability, then choose a word from that topic with some probability." Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution.
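The generative story in that definition can be sketched in a few lines of Python. This is a toy illustration only, not part of the repo's code; the `theta` and `phi` values below are made up for a 2-topic, 4-word vocabulary:

```python
import numpy as np

def generate_document(theta, phi, doc_len, rng=None):
    """Generate one document under the LDA generative story: for each word,
    draw a topic k from theta, then draw a word from that topic's row phi[k]."""
    if rng is None:
        rng = np.random.default_rng(0)
    words = []
    for _ in range(doc_len):
        k = rng.choice(len(theta), p=theta)     # choose a topic with probability theta[k]
        w = rng.choice(phi.shape[1], p=phi[k])  # choose a word from topic k with probability phi[k, w]
        words.append(int(w))
    return words

# toy parameters: topic 0 favors words 0/1, topic 1 favors words 2/3
theta = np.array([0.7, 0.3])
phi = np.array([[0.5, 0.5, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5]])
doc = generate_document(theta, phi, doc_len=10)
```

Training LDA is the inverse problem: given only the documents, recover plausible `theta` and `phi`.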
This article will not go through heavy mathematics; readers who want to go deeper should search Baidu for "LDA数学八卦" ("LDA Math Gossip"), which I won't try to outdo. Nor will it dwell on technical terms such as Gibbs sampling, the Dirichlet distribution, conjugate distributions, or Markov chains, all of which are explained in detail there. Readers who are able to should also go straight to the homepage of LDA's author, David M. Blei.
This article also does not call the LDA implementations in machine-learning packages such as scikit-learn or gensim. In the spirit of "write the code, learn the knowledge," we implement the model ourselves. Code: python_lda. Besides implementing the LDA process with Gibbs sampling (LDA can also be solved with an EM algorithm; readers may want to think about the difference between the two), the code also improves on the model. The main idea of the improvement:
LDA is an unsupervised model, and the training set itself may contain a lot of noise, so the model's results may fall short of industrial requirements. For example, after one LDA pass, the word list of each topic (xxx.twords) is more or less contaminated with words from other topics or with noise words, which hurts the accuracy of subsequent inference.
Once the LDA pass finishes and the xxx.twords file is produced, we can apply "expert experience" and manually remove from each topic the words that do not belong to it. The result is a fairly clean, fairly ideal piece of "prior knowledge."
With this "prior knowledge" in hand, we can pass it into the next LDA run and, during model initialization, make the words in the prior fall into their designated topics with high probability. Then run the LDA process again with the same training set and the same parameters. After two or three such rounds, the results should show a clear improvement.
Although this improves the model to some degree, it has drawbacks: it greatly increases the manual cost, and when there are very many topics (thousands or tens of thousands), curating "prior knowledge" for each one becomes impractical.
This is a simple improvement idea; judging from its use in my day-to-day work, it raises accuracy by roughly 10%. The full implementation follows:
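In the code below, this prior is stored as a dict of the form {word_id: [k1, k2, ...]}, and words carrying a prior are assigned one of their allowed topics instead of being sampled normally. A minimal standalone sketch of that biasing step (hypothetical word and topic ids, for illustration only):

```python
import random
from collections import defaultdict

def assign_initial_topic(word_id, num_topics, prior_word):
    """Pick a topic for one word: if the word appears in the hand-curated
    prior, choose one of its allowed topics; otherwise choose uniformly."""
    if word_id in prior_word:
        return random.choice(prior_word[word_id])
    return random.randrange(num_topics)

# hypothetical prior: after manual cleaning, word 7 was judged to belong to topics 2 or 5
prior_word = defaultdict(list, {7: [2, 5]})
k_prior = assign_initial_topic(7, num_topics=10, prior_word=prior_word)  # always 2 or 5
k_free = assign_initial_topic(3, num_topics=10, prior_word=prior_word)   # any of the 10 topics
```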
# _*_ coding: utf-8 _*_
"""
python_lda.py by xianhu
"""
import os
import numpy
import logging
from collections import defaultdict

# global constants
MAX_ITER_NUM = 10000    # maximum number of iterations
VAR_NUM = 20            # window size for the variance check when the iteration count is "auto"
class BiDictionary(object):
    """
    A bidirectional dictionary: look up value by key, and key by value.
    """

    def __init__(self):
        """
        :key: initialize the bidirectional dictionary
        """
        self.dict = {}              # forward dictionary, keyed by self's keys
        self.dict_reversed = {}     # reverse dictionary, keyed by self's values
        return

    def __len__(self):
        """
        :key: number of items in the bidirectional dictionary
        """
        return len(self.dict)

    def __str__(self):
        """
        :key: convert the bidirectional dictionary to a string
        """
        str_list = ["%s\t%s" % (key, self.dict[key]) for key in self.dict]
        return "\n".join(str_list)

    def clear(self):
        """
        :key: empty the bidirectional dictionary
        """
        self.dict.clear()
        self.dict_reversed.clear()
        return

    def add_key_value(self, key, value):
        """
        :key: add one (key, value) pair
        """
        self.dict[key] = value
        self.dict_reversed[value] = key
        return

    def remove_key_value(self, key, value):
        """
        :key: remove one (key, value) pair
        """
        if key in self.dict:
            del self.dict[key]
            del self.dict_reversed[value]
        return

    def get_value(self, key, default=None):
        """
        :key: get value by key; return default if absent
        """
        return self.dict.get(key, default)

    def get_key(self, value, default=None):
        """
        :key: get key by value; return default if absent
        """
        return self.dict_reversed.get(value, default)

    def contains_key(self, key):
        """
        :key: check whether key exists
        """
        return key in self.dict

    def contains_value(self, value):
        """
        :key: check whether value exists
        """
        return value in self.dict_reversed

    def keys(self):
        """
        :key: all keys of the bidirectional dictionary
        """
        return self.dict.keys()

    def values(self):
        """
        :key: all values of the bidirectional dictionary
        """
        return self.dict_reversed.keys()

    def items(self):
        """
        :key: all items of the bidirectional dictionary
        """
        return self.dict.items()
class CorpusSet(object):
    """
    Corpus class, the base class of LdaBase.
    """

    def __init__(self):
        """
        :key: initialization
        """
        # word-related variables
        self.local_bi = BiDictionary()  # local bidirectional dictionary between id and word; key is id, value is word
        self.words_count = 0            # number of words in the dataset (before deduplication)
        self.V = 0                      # number of words in the dataset (after deduplication)
        # article-related variables
        self.artids_list = []           # ids of all articles, in reading order
        self.arts_Z = []                # word ids of all articles, dimensions M * art.length()
        self.M = 0                      # number of articles in the dataset
        # variables used during inference (may be empty)
        self.global_bi = None           # global bidirectional dictionary between id and word; key is id, value is word
        self.local_2_global = {}        # mapping from the local dictionary to the global dictionary
        return

    def init_corpus_with_file(self, file_name):
        """
        :key: initialize the corpus from a data file. Each line has the format: id[tab]word1 word2 word3......
        """
        with open(file_name, "r", encoding="utf-8") as file_iter:
            self.init_corpus_with_articles(file_iter)
        return

    def init_corpus_with_articles(self, article_list):
        """
        :key: initialize the corpus from a list of articles. Each article has the format: id[tab]word1 word2 word3......
        """
        # reset word data
        self.local_bi.clear()
        self.words_count = 0
        self.V = 0
        # reset article data
        self.artids_list.clear()
        self.arts_Z.clear()
        self.M = 0
        # reset the local-to-global mapping
        self.local_2_global.clear()
        # read the article data
        for line in article_list:
            frags = line.strip().split()
            if len(frags) < 2:
                continue
            # article id
            art_id = frags[0].strip()
            # word ids
            art_wordid_list = []
            for word in [w.strip() for w in frags[1:] if w.strip()]:
                local_id = self.local_bi.get_key(word) if self.local_bi.contains_value(word) else len(self.local_bi)
                # note: self.global_bi being None and being empty are different cases
                if self.global_bi is None:
                    # update id information
                    self.local_bi.add_key_value(local_id, word)
                    art_wordid_list.append(local_id)
                else:
                    if self.global_bi.contains_value(word):
                        # update id information
                        self.local_bi.add_key_value(local_id, word)
                        art_wordid_list.append(local_id)
                        # update local_2_global
                        self.local_2_global[local_id] = self.global_bi.get_key(word)
            # update class variables: the article must contain at least one word
            if len(art_wordid_list) > 0:
                self.words_count += len(art_wordid_list)
                self.artids_list.append(art_id)
                self.arts_Z.append(art_wordid_list)
        # initial word-related computation
        self.V = len(self.local_bi)
        logging.debug("words number: " + str(self.V) + ", " + str(self.words_count))
        # initial article-related computation
        self.M = len(self.artids_list)
        logging.debug("articles number: " + str(self.M))
        return

    def save_wordmap(self, file_name):
        """
        :key: save the word dictionary, i.e. the data in self.local_bi
        """
        with open(file_name, "w", encoding="utf-8") as f_save:
            f_save.write(str(self.local_bi))
        return

    def load_wordmap(self, file_name):
        """
        :key: load the word dictionary, i.e. the data of self.local_bi
        """
        self.local_bi.clear()
        with open(file_name, "r", encoding="utf-8") as f_load:
            for _id, _word in [line.strip().split() for line in f_load if line.strip()]:
                self.local_bi.add_key_value(int(_id), _word.strip())
        self.V = len(self.local_bi)
        return
class LdaBase(CorpusSet):
    """
    Base class of the LDA model. Index conventions:
    - articles are indexed by m in [0, self.M)
    - word ids are indexed by w in [0, self.V)
    - topics are indexed by k (or topic) in [0, self.K)
    - words within an article are indexed by n in [0, article.size())
    """

    def __init__(self):
        """
        :key: initialization
        """
        CorpusSet.__init__(self)
        # basic variables -- 1
        self.dir_path = ""          # directory for LDA's data, intermediate results, etc.
        self.model_name = ""        # name of the model being trained or inferred; also used to load training results
        self.current_iter = 0       # number of iterations already run; used to resume training
        self.iters_num = 0          # total number of Gibbs sampling iterations; an integer or "auto"
        self.topics_num = 0         # number of topics, i.e. self.K
        self.K = 0                  # number of topics, i.e. self.topics_num
        self.twords_num = 0         # number of words to output per topic after training or inference
        # basic variables -- 2
        self.alpha = numpy.zeros(self.K)    # hyperparameter alpha, K floats, default 50/K
        self.beta = numpy.zeros(self.V)     # hyperparameter beta, V floats, default 0.01
        # basic variables -- 3
        self.Z = []                 # topic of every word, i.e. Z(m, n), dimensions M * article.size()
        # count statistics (derivable from self.Z)
        self.nd = numpy.zeros((self.M, self.K))     # nd[m, k]: number of words in article m generated by topic k; M * K
        self.ndsum = numpy.zeros((self.M, 1))       # ndsum[m, 0]: total number of words in article m; M * 1
        self.nw = numpy.zeros((self.K, self.V))     # nw[k, w]: number of times word w is generated by topic k; K * V
        self.nwsum = numpy.zeros((self.K, 1))       # nwsum[k, 0]: total number of words generated by topic k; K * 1
        # multinomial distribution parameters
        self.theta = numpy.zeros((self.M, self.K))  # Doc-Topic multinomial parameters, M * K, influenced by alpha
        self.phi = numpy.zeros((self.K, self.V))    # Topic-Word multinomial parameters, K * V, influenced by beta
        # helper variables, kept to speed up the algorithm
        self.sum_alpha = 0.0        # sum of hyperparameter alpha
        self.sum_beta = 0.0         # sum of hyperparameter beta
        # prior knowledge, format {word_id: [k1, k2, ...], ...}
        self.prior_word = defaultdict(list)
        # the trained model needed during inference
        self.train_model = None
        return
    # -------------------------------------------- helper functions ----------------------------------------------------

    def init_statistics_document(self):
        """
        :key: initialize the article-related counts. Preconditions: self.M, self.K, self.Z
        """
        assert self.M > 0 and self.K > 0 and self.Z
        # reset the counts
        self.nd = numpy.zeros((self.M, self.K), dtype=int)
        self.ndsum = numpy.zeros((self.M, 1), dtype=int)
        # update self.nd[m, k] and self.ndsum[m, 0] from self.Z
        for m in range(self.M):
            for k in self.Z[m]:
                self.nd[m, k] += 1
            self.ndsum[m, 0] = len(self.Z[m])
        return

    def init_statistics_word(self):
        """
        :key: initialize the word-related counts. Preconditions: self.V, self.K, self.Z, self.arts_Z
        """
        assert self.V > 0 and self.K > 0 and self.Z and self.arts_Z
        # reset the counts
        self.nw = numpy.zeros((self.K, self.V), dtype=int)
        self.nwsum = numpy.zeros((self.K, 1), dtype=int)
        # update self.nw[k, w] and self.nwsum[k, 0] from self.Z
        for m in range(self.M):
            for k, w in zip(self.Z[m], self.arts_Z[m]):
                self.nw[k, w] += 1
                self.nwsum[k, 0] += 1
        return

    def init_statistics(self):
        """
        :key: initialize all counts; combines the two functions above.
        """
        self.init_statistics_document()
        self.init_statistics_word()
        return

    def sum_alpha_beta(self):
        """
        :key: compute the sums of alpha and beta
        """
        self.sum_alpha = self.alpha.sum()
        self.sum_beta = self.beta.sum()
        return

    def calculate_theta(self):
        """
        :key: compute the model's theta values (M*K), using alpha
        """
        assert self.sum_alpha > 0
        self.theta = (self.nd + self.alpha) / (self.ndsum + self.sum_alpha)
        return

    def calculate_phi(self):
        """
        :key: compute the model's phi values (K*V), using beta
        """
        assert self.sum_beta > 0
        self.phi = (self.nw + self.beta) / (self.nwsum + self.sum_beta)
        return
    # ------------------------------------------ computing perplexity --------------------------------------------------

    def calculate_perplexity(self):
        """
        :key: compute and return the perplexity value
        """
        # compute theta and phi
        self.calculate_theta()
        self.calculate_phi()
        # start computing
        perplexity = 0.0
        for m in range(self.M):
            for w in self.arts_Z[m]:
                perplexity += numpy.log(numpy.sum(self.theta[m] * self.phi[:, w]))
        return numpy.exp(-(perplexity / self.words_count))
    # --------------------------------------------- static functions ---------------------------------------------------

    @staticmethod
    def multinomial_sample(pro_list):
        """
        :key: static function; sampling from a multinomial distribution. Note that it modifies pro_list in place.
        :param pro_list: e.g. [0.2, 0.7, 0.4, 0.1]; index 1 is the most likely return value, but not guaranteed
        """
        # turn pro_list into a cumulative sum
        for k in range(1, len(pro_list)):
            pro_list[k] += pro_list[k-1]
        # find the index that the random number u falls into; that index is the sampled category (random.rand() returns [0, 1.0))
        u = numpy.random.rand() * pro_list[-1]
        return_index = len(pro_list) - 1
        for t in range(len(pro_list)):
            if pro_list[t] > u:
                return_index = t
                break
        return return_index
    # --------------------------------------------- Gibbs sampling -----------------------------------------------------

    def gibbs_sampling(self, is_calculate_preplexity):
        """
        :key: the Gibbs sampling process of the LDA model
        :param is_calculate_preplexity: whether to compute the perplexity value
        """
        # variables used for perplexity
        pp_list = []
        pp_var = numpy.inf
        # start iterating
        last_iter = self.current_iter + 1
        iters_num = self.iters_num if self.iters_num != "auto" else MAX_ITER_NUM
        for self.current_iter in range(last_iter, last_iter + iters_num):
            info = "......"
            # compute perplexity if requested
            if is_calculate_preplexity:
                pp = self.calculate_perplexity()
                pp_list.append(pp)
                # variance over the latest VAR_NUM values
                pp_var = numpy.var(pp_list[-VAR_NUM:]) if len(pp_list) >= VAR_NUM else numpy.inf
                info = (", perplexity: " + str(pp)) + ((", var: " + str(pp_var)) if len(pp_list) >= VAR_NUM else "")
            # debug output
            logging.debug("\titeration " + str(self.current_iter) + info)
            # decide whether to stop
            if self.iters_num == "auto" and pp_var < (VAR_NUM / 2):
                break
            # resample a suitable topic k for every word of every article
            for m in range(self.M):
                for n in range(len(self.Z[m])):
                    w = self.arts_Z[m][n]
                    k = self.Z[m][n]
                    # decrement the counts
                    self.nd[m, k] -= 1
                    self.ndsum[m, 0] -= 1
                    self.nw[k, w] -= 1
                    self.nwsum[k, 0] -= 1
                    if self.prior_word and (w in self.prior_word):
                        # the word carries prior knowledge; otherwise sample normally
                        k = numpy.random.choice(self.prior_word[w])
                    else:
                        # compute theta -- the following draws the topic (the new k) for word n of article m
                        theta_p = (self.nd[m] + self.alpha) / (self.ndsum[m, 0] + self.sum_alpha)
                        # compute phi -- distinguish the training model from the inference model (note self.beta[w_g])
                        if self.local_2_global and self.train_model:
                            w_g = self.local_2_global[w]
                            phi_p = (self.train_model.nw[:, w_g] + self.nw[:, w] + self.beta[w_g]) / \
                                    (self.train_model.nwsum[:, 0] + self.nwsum[:, 0] + self.sum_beta)
                        else:
                            phi_p = (self.nw[:, w] + self.beta[w]) / (self.nwsum[:, 0] + self.sum_beta)
                        # multi_p holds the (unnormalized) multinomial parameters
                        multi_p = theta_p * phi_p
                        # the Gibbs-sampled topic; topics with larger multinomial probability are more likely to be hit
                        k = LdaBase.multinomial_sample(multi_p)
                    # increment the counts
                    self.nd[m, k] += 1
                    self.ndsum[m, 0] += 1
                    self.nw[k, w] += 1
                    self.nwsum[k, 0] += 1
                    # update Z
                    self.Z[m][n] = k
        # sampling finished
        return
    # ------------------------------------- model saving and loading ---------------------------------------------------

    def save_parameter(self, file_name):
        """
        :key: save the model parameters: topics_num, M, V, K, words_count, alpha, beta
        """
        with open(file_name, "w", encoding="utf-8") as f_param:
            for item in ["topics_num", "M", "V", "K", "words_count"]:
                f_param.write("%s\t%s\n" % (item, str(self.__dict__[item])))
            f_param.write("alpha\t%s\n" % ",".join([str(item) for item in self.alpha]))
            f_param.write("beta\t%s\n" % ",".join([str(item) for item in self.beta]))
        return

    def load_parameter(self, file_name):
        """
        :key: load the model parameters; the counterpart of the function above
        """
        with open(file_name, "r", encoding="utf-8") as f_param:
            for line in f_param:
                key, value = line.strip().split()
                if key in ["topics_num", "M", "V", "K", "words_count"]:
                    self.__dict__[key] = int(value)
                elif key in ["alpha", "beta"]:
                    self.__dict__[key] = numpy.array([float(item) for item in value.split(",")])
        return
    def save_zvalue(self, file_name):
        """
        :key: save the article-related variables: arts_Z, Z, artids_list, etc.
        """
        with open(file_name, "w", encoding="utf-8") as f_zvalue:
            for m in range(self.M):
                out_line = [str(w) + ":" + str(k) for w, k in zip(self.arts_Z[m], self.Z[m])]
                f_zvalue.write(self.artids_list[m] + "\t" + " ".join(out_line) + "\n")
        return

    def load_zvalue(self, file_name):
        """
        :key: load the model's Z variables; the counterpart of the function above
        """
        self.arts_Z = []
        self.artids_list = []
        self.Z = []
        with open(file_name, "r", encoding="utf-8") as f_zvalue:
            for line in f_zvalue:
                frags = line.strip().split()
                art_id = frags[0].strip()
                w_k_list = [value.split(":") for value in frags[1:]]
                # add to the class
                self.artids_list.append(art_id)
                self.arts_Z.append([int(item[0]) for item in w_k_list])
                self.Z.append([int(item[1]) for item in w_k_list])
        return
    def save_twords(self, file_name):
        """
        :key: save the model's twords data; uses the phi values
        """
        self.calculate_phi()
        out_num = self.V if self.twords_num > self.V else self.twords_num
        with open(file_name, "w", encoding="utf-8") as f_twords:
            for k in range(self.K):
                words_list = sorted([(w, self.phi[k, w]) for w in range(self.V)], key=lambda x: x[1], reverse=True)
                f_twords.write("Topic %dth:\n" % k)
                f_twords.writelines(["\t%s %f\n" % (self.local_bi.get_value(w), p) for w, p in words_list[:out_num]])
        return

    def load_twords(self, file_name):
        """
        :key: load the model's twords data, i.e. the prior knowledge
        """
        self.prior_word.clear()
        topic = -1
        with open(file_name, "r", encoding="utf-8") as f_twords:
            for line in f_twords:
                if line.startswith("Topic"):
                    topic = int(line.strip()[6:-3])
                else:
                    word_id = self.local_bi.get_key(line.strip().split()[0].strip())
                    self.prior_word[word_id].append(topic)
        return
    def save_tag(self, file_name):
        """
        :key: save the model's final labels for the data; uses the theta values
        """
        self.calculate_theta()
        with open(file_name, "w", encoding="utf-8") as f_tag:
            for m in range(self.M):
                f_tag.write("%s\t%s\n" % (self.artids_list[m], " ".join([str(item) for item in self.theta[m]])))
        return

    def save_model(self):
        """
        :key: save the model data
        """
        name_prefix = "%s-%05d" % (self.model_name, self.current_iter)
        # save the training results
        self.save_parameter(os.path.join(self.dir_path, "%s.%s" % (name_prefix, "param")))
        self.save_wordmap(os.path.join(self.dir_path, "%s.%s" % (name_prefix, "wordmap")))
        self.save_zvalue(os.path.join(self.dir_path, "%s.%s" % (name_prefix, "zvalue")))
        # save the extra data
        self.save_twords(os.path.join(self.dir_path, "%s.%s" % (name_prefix, "twords")))
        self.save_tag(os.path.join(self.dir_path, "%s.%s" % (name_prefix, "tag")))
        return

    def load_model(self):
        """
        :key: load the model data
        """
        name_prefix = "%s-%05d" % (self.model_name, self.current_iter)
        # load the training results
        self.load_parameter(os.path.join(self.dir_path, "%s.%s" % (name_prefix, "param")))
        self.load_wordmap(os.path.join(self.dir_path, "%s.%s" % (name_prefix, "wordmap")))
        self.load_zvalue(os.path.join(self.dir_path, "%s.%s" % (name_prefix, "zvalue")))
        return
class LdaModel(LdaBase):
    """
    The LDA model itself, implementing training, resumed training, and inference.
    """

    def init_train_model(self, dir_path, model_name, current_iter, iters_num=None, topics_num=10, twords_num=200,
                         alpha=-1.0, beta=0.01, data_file="", prior_file=""):
        """
        :key: initialize the training model; current_iter decides whether to build a new model (0) or load an existing one
        :key: when building a new model, every parameter except prior_file is required, and current_iter must be 0
        :key: when loading an existing model, only dir_path, model_name, current_iter (nonzero), iters_num and twords_num are needed
        :param iters_num: an integer or "auto"
        """
        if current_iter == 0:
            logging.debug("init a new train model")
            # initialize the corpus
            self.init_corpus_with_file(data_file)
            # initialize some variables
            self.dir_path = dir_path
            self.model_name = model_name
            self.current_iter = current_iter
            self.iters_num = iters_num
            self.topics_num = topics_num
            self.K = topics_num
            self.twords_num = twords_num
            # initialize alpha and beta
            self.alpha = numpy.array([alpha if alpha > 0 else (50.0 / self.K) for k in range(self.K)])
            self.beta = numpy.array([beta if beta > 0 else 0.01 for w in range(self.V)])
            # initialize Z so that the counts can be computed
            self.Z = [[numpy.random.randint(self.K) for n in range(len(self.arts_Z[m]))] for m in range(self.M)]
        else:
            logging.debug("init an existed model")
            # initialize some variables
            self.dir_path = dir_path
            self.model_name = model_name
            self.current_iter = current_iter
            self.iters_num = iters_num
            self.twords_num = twords_num
            # load the existing model
            self.load_model()
        # initialize the counts
        self.init_statistics()
        # compute the sums of alpha and beta
        self.sum_alpha_beta()
        # initialize the prior knowledge
        if prior_file:
            self.load_twords(prior_file)
        # return the model
        return self
    def begin_gibbs_sampling_train(self, is_calculate_preplexity=True):
        """
        :key: train the model: run Gibbs sampling over the whole corpus and save the final sampling result
        """
        # Gibbs sampling
        logging.debug("sample iteration start, iters_num: " + str(self.iters_num))
        self.gibbs_sampling(is_calculate_preplexity)
        logging.debug("sample iteration finish")
        # save the model
        logging.debug("save model")
        self.save_model()
        return

    def init_inference_model(self, train_model):
        """
        :key: initialize the inference model
        """
        self.train_model = train_model
        # initialize variables: mainly self.topics_num and self.K
        self.topics_num = train_model.topics_num
        self.K = train_model.K
        # reuse train_model's self.alpha and self.beta directly
        self.alpha = train_model.alpha      # K floats; K is the same for training and inference, so it can be reused
        self.beta = train_model.beta        # V floats; the V used for phi during inference should be the global word count, so it can be reused
        self.sum_alpha_beta()               # compute the sums of alpha and beta
        # initialize the corpus's self.global_bi
        self.global_bi = train_model.local_bi
        return
    def inference_data(self, article_list, iters_num=100, repeat_num=3):
        """
        :key: infer topics for new data with the existing model
        :param article_list: each line has the format id[tab]word1 word2 word3......
        :param iters_num: number of iterations per run
        :param repeat_num: number of repeated runs
        """
        # initialize the corpus
        self.init_corpus_with_articles(article_list)
        # initialize the return value
        return_theta = numpy.zeros((self.M, self.K))
        # repeated sampling
        for i in range(repeat_num):
            logging.debug("inference repeat_num: " + str(i + 1))
            # initialize variables
            self.current_iter = 0
            self.iters_num = iters_num
            # initialize Z so that the counts can be computed
            self.Z = [[numpy.random.randint(self.K) for n in range(len(self.arts_Z[m]))] for m in range(self.M)]
            # initialize the counts
            self.init_statistics()
            # start inference
            self.gibbs_sampling(is_calculate_preplexity=False)
            # compute theta
            self.calculate_theta()
            return_theta += self.theta
        # average the results and return
        return return_theta / repeat_num
if __name__ == "__main__":
    """
    test code
    """
    logging.basicConfig(level=logging.DEBUG, format="%(asctime)s\t%(levelname)s\t%(message)s")
    # "train" or "inference"
    test_type = "train"
    # test_type = "inference"
    # test a new model
    if test_type == "train":
        model = LdaModel()
        # prior_file decides whether prior knowledge is used
        model.init_train_model("/root/py_dir/lda_python_native/", "model", current_iter=0, iters_num="auto", topics_num=10, data_file="/root/py_dir/lda_python_native/sba_bu.txt")
        # model.init_train_model("data/", "model", current_iter=0, iters_num="auto", topics_num=10, data_file="corpus.txt", prior_file="prior.twords")
        model.begin_gibbs_sampling_train()
    elif test_type == "inference":
        model = LdaModel()
        model.init_inference_model(LdaModel().init_train_model("data/", "model", current_iter=134))
        data = [
            "cn? ? 咪咕 漫畫 咪咕 漫畫 漫畫 更名 咪咕 漫畫 資源 偷星 國(guó)漫 全彩 日漫 實(shí)時(shí) 在線看 隨心所欲 登陸 漫畫 資源 黑白 全彩 航海王",
            "co? ? aircloud aircloud 硬件 設(shè)備 wifi 智能 手要 平板電腦 電腦 存儲(chǔ) aircloud 文件 遠(yuǎn)程 型號(hào) aircloud 硬件 設(shè)備 wifi"
        ]
        result = model.inference_data(data)
    # exit
    exit()