Model Creation
Gensim's Word2Vec model expects a list of tokenized sentences as input, essentially a two-dimensional array (a list of token lists). For now we will use plain Python lists, although they consume a large amount of RAM once the input dataset grows big. Gensim itself only requires an iterable of ordered sentences, so in engineering practice we can use a custom generator that keeps just one sentence in memory at a time (a sketch follows the example below).
# import word2vec
from gensim.models import word2vec
# configure logging so training progress is printed
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# a toy dataset
raw_sentences = ["the quick brown fox jumps over the lazy dogs",
                 "yoyoyo you go home now to sleep"]
# tokenize each sentence into a list of words
sentences = [s.split() for s in raw_sentences]
# build the model
model = word2vec.Word2Vec(sentences, min_count=1)
# compare the similarity of two words
model.similarity('dogs', 'you')
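As for the generator approach mentioned above, here is a minimal sketch; the class name FileSentences and the file corpus.txt are hypothetical, assuming one whitespace-tokenized sentence per line (a directory-based variant appears later in this article). Wrapping the generator in a class with __iter__ matters because Word2Vec needs to iterate over the data more than once:

class FileSentences(object):
    """Hypothetical single-file iterator: one tokenized sentence per line."""
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        # re-opens the file on every pass, keeping only one line in RAM at a time
        with open(self.fname) as fin:
            for line in fin:
                yield line.split()

model = word2vec.Word2Vec(FileSentences('corpus.txt'), min_count=1)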
Calling Word2Vec to create the model actually iterates over the data twice: the first pass collects word counts to build the internal vocabulary structure, and the second pass trains the neural network. The two steps can also be run separately, which gives us manual control for non-repeatable streams (e.g. streaming data from Kafka):
model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
model.train(other_sentences)  # can be a non-repeatable, 1-pass generator
Word2Vec Parameters
min_count
model = Word2Vec(sentences, min_count=10) # default value is 5
Different corpus sizes call for different word-frequency baselines. In a larger corpus, for example, we usually want to ignore words that appear only once or twice, and that is exactly what the min_count parameter controls. Generally speaking, reasonable values fall between 0 and 100.
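To see the filtering in action on the toy sentences above, a quick sketch (note that in Gensim versions predating the KeyedVectors refactor, membership is checked on the model itself rather than on model.wv):

model_all = word2vec.Word2Vec(sentences, min_count=1)  # keep every word
model_cut = word2vec.Word2Vec(sentences, min_count=2)  # drop words seen fewer than 2 times
print('dogs' in model_all.wv)  # True  -- "dogs" occurs once and is kept
print('dogs' in model_cut.wv)  # False -- filtered out by min_count=2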
size
The size parameter sets the dimensionality of the word vectors, i.e. the size of the neural network's hidden layer; the Word2Vec default is 100. Larger values require more training data but can also yield a more accurate model; reasonable settings range from tens to a few hundred.
model = Word2Vec(sentences, size=200) # default value is 100
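Since size controls the vector dimensionality rather than a layer count, the effect is easy to verify by inspecting any trained vector's shape; a sketch (Gensim 4.x renamed this parameter to vector_size):

model = Word2Vec(sentences, size=200)
print(model.wv['dogs'].shape)  # (200,) -- one 200-dimensional vector per word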
workers
The workers parameter sets the number of threads used for parallel training; it only takes effect when Cython is installed:
model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization
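To check whether the compiled routines are actually available, the Gensim versions contemporary with this article expose a FAST_VERSION flag in the word2vec module; a quick sanity check:

from gensim.models.word2vec import FAST_VERSION
# -1 means the optimized (Cython/C) code path is missing, so extra workers will not help
print(FAST_VERSION)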
External Corpora
In real training scenarios we usually work with a much larger corpus. Taking the official Word2Vec text8 corpus as an example, all we need to change is the corpus source fed into the model:
sentences = word2vec.Text8Corpus('text8')
model = word2vec.Word2Vec(sentences, size=200)
The sentences in this corpus are already tokenized, so they can be used directly. I hit an error the first time I used this class, so I am pasting the relevant Gensim source code here; it also serves as a reference for writing custom handling for other corpora later:
class Text8Corpus(object):
    """Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip ."""
    def __init__(self, fname, max_sentence_length=MAX_WORDS_IN_BATCH):
        self.fname = fname
        self.max_sentence_length = max_sentence_length

    def __iter__(self):
        # the entire corpus is one gigantic line -- there are no sentence marks at all
        # so just split the sequence of tokens arbitrarily: 1 sentence = 1000 tokens
        sentence, rest = [], b''
        with utils.smart_open(self.fname) as fin:
            while True:
                text = rest + fin.read(8192)  # avoid loading the entire file (=1 line) into RAM
                if text == rest:  # EOF
                    words = utils.to_unicode(text).split()
                    sentence.extend(words)  # return the last chunk of words, too (may be shorter/longer)
                    if sentence:
                        yield sentence
                    break
                last_token = text.rfind(b' ')  # last token may have been split in two... keep for next iteration
                words, rest = (utils.to_unicode(text[:last_token]).split(),
                               text[last_token:].strip()) if last_token >= 0 else ([], text)
                sentence.extend(words)
                while len(sentence) >= self.max_sentence_length:
                    yield sentence[:self.max_sentence_length]
                    sentence = sentence[self.max_sentence_length:]
我們?cè)谏衔闹幸蔡峒?,如果是?duì)于大量的輸入語(yǔ)料集或者需要整合磁盤上多個(gè)文件夾下的數(shù)據(jù),我們可以以迭代器的方式而不是一次性將全部?jī)?nèi)容讀取到內(nèi)存中來(lái)節(jié)省 RAM 空間:
import os
import gensim

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/some/directory')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
Saving and Loading Models
model.save('text8.model')
2015-02-24 11:19:26,059 : INFO : saving Word2Vec object under text8.model, separately None
2015-02-24 11:19:26,060 : INFO : not storing attribute syn0norm
2015-02-24 11:19:26,060 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy
2015-02-24 11:19:26,742 : INFO : storing numpy array 'syn1' to text8.model.syn1.npy

model1 = Word2Vec.load('text8.model')

model.save_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:19:52,341 : INFO : storing 71290x200 projection weights into text.model.bin

model1 = word2vec.Word2Vec.load_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:22:08,185 : INFO : loading projection weights from text.model.bin
2015-02-24 11:22:10,322 : INFO : loaded (71290, 200) matrix from text.model.bin
2015-02-24 11:22:10,322 : INFO : precomputing L2-norms of word weight vectors
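One point worth noting: a model restored with load() keeps its full training state and can be trained further, whereas a model loaded from the word2vec binary format via load_word2vec_format() lacks the hidden-layer weights and generally cannot resume training. A sketch, where more_sentences is a placeholder for additional tokenized data (newer Gensim versions also require total_examples and epochs arguments to train()):

model = word2vec.Word2Vec.load('text8.model')
model.train(more_sentences)  # more_sentences is hypothetical; continues from the saved state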
Model Prediction
Word2Vec's best-known capability is inferring related words in a semantic way:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]

model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

model.similarity('woman', 'man')
0.73723527

model.most_similar(['man'])
[(u'woman', 0.5686948895454407), (u'girl', 0.4957364797592163), (u'young', 0.4457539916038513), (u'luckiest', 0.4420626759529114), (u'serpent', 0.42716869711875916), (u'girls', 0.42680859565734863), (u'smokes', 0.4265017509460449), (u'creature', 0.4227582812309265), (u'robot', 0.417464017868042), (u'mortal', 0.41728296875953674)]
If we want the raw vector representation of a word, we can simply index the model:
model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
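The similarity() call shown earlier is just the cosine similarity between two of these raw vectors, which is easy to reproduce with NumPy; a sketch using the same indexing style as above:

import numpy as np

v1, v2 = model['woman'], model['man']
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)  # should match model.similarity('woman', 'man')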
Model Evaluation
Word2Vec training is unsupervised, so there are few objective evaluation criteria of the kind found in supervised learning; quality depends largely on the end application. Google has released roughly 20,000 syntactic and semantic test samples, each following the format "A is to B as C is to D", which Gensim can score directly:
model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)
It bears repeating that performing well on this test set does not mean Word2Vec will perform well in your real application; evaluation should always fit the use case.