FastText Model
This tutorial introduces Gensim's fastText model and demonstrates its use on the Lee Corpus.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Here, we'll learn to work with the fastText library for training word-embedding models, saving and loading them, and performing similarity operations and vector lookups analogous to Word2Vec.
When to use fastText?
The main principle behind fastText is that the morphological structure of a word carries important information about its meaning. Such structure is not taken into account by traditional word embedding models like Word2Vec, which train a unique embedding for every individual word. This is especially significant for morphologically rich languages (German, Turkish), in which a single word can have a large number of morphological forms, each of which might occur rarely, making it hard to train good word embeddings.
fastText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language-independence, subwords are taken to be the character ngrams of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams.
According to a detailed comparison of Word2Vec and fastText, fastText does significantly better on syntactic tasks than the original Word2Vec, especially when the size of the training corpus is small. Word2Vec slightly outperforms fastText on semantic tasks, though. The differences grow smaller as the size of the training corpus increases.
fastText can even obtain vectors for out-of-vocabulary (OOV) words, by summing up the vectors of their component char-ngrams (subwords), provided at least one of the char-ngrams was present in the training data.
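As an illustration, the char-ngram decomposition can be sketched in plain Python. This is an illustrative sketch, not Gensim's internal routine; the boundary markers `<` and `>` and the default lengths min_n=3, max_n=6 follow the fastText paper and the defaults listed in the hyperparameter table below.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Enumerate the character ngrams fastText extracts from a word.

    Illustrative sketch only: '<' and '>' mark the word boundaries.
    """
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("night"))  # 14 ngrams, from '<ni' up to 'night>'
```

The word's vector is then the sum of the vectors learned for each of these ngrams.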
Training models
For the following examples, we'll use the Lee Corpus for training our model.
from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath
# Set file names for train and test data
corpus_file = datapath('lee_background.cor')
model = FastText(vector_size=100)
# build the vocabulary
model.build_vocab(corpus_file=corpus_file)
# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
)
print(model)
Output:
<gensim.models.fasttext.FastText object at 0x7f9733391be0>
Training hyperparameters
Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec:
| Parameter | Description |
|---|---|
| model | Training architecture. Allowed values: cbow (default), skipgram |
| vector_size | Dimensionality of the vector embeddings to be learned (default: 100) |
| alpha | Initial learning rate (default: 0.025) |
| window | Context window size (default: 5) |
| min_count | Ignore words with fewer occurrences than this (default: 5) |
| loss | Training objective. Allowed values: ns (negative sampling, default), hs (hierarchical softmax), softmax |
| sample | Threshold for downsampling higher-frequency words (default: 0.001) |
| negative | Number of negative words to sample, only used when loss is set to ns (default: 5) |
| epochs | Number of epochs (default: 5) |
| sorted_vocab | Sort vocabulary by descending frequency (default: 1) |
| threads | Number of threads to use (default: 12) |
| min_n | Minimum length of char ngrams (default: 3) |
| max_n | Maximum length of char ngrams (default: 6) |
| bucket | Number of buckets used for hashing ngrams (default: 2000000) |
The parameters min_n and max_n control the lengths of the character ngrams that each word is broken down into while training and looking up embeddings. If max_n is set to 0, or to a value less than min_n, no character ngrams are used, and the model effectively reduces to Word2Vec.
To bound the memory requirements of the model being trained, a hashing function is used that maps ngrams to integers from 1 to K. For hashing these character sequences, the Fowler-Noll-Vo hashing function (FNV-1a variant) is employed.
Note: You can continue to train your model while using Gensim's native implementation of fastText.
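The hashing trick can be sketched as follows; this is an illustrative 32-bit FNV-1a implementation mapping a char ngram to a bucket index, not Gensim's internal routine.

```python
def fnv1a_bucket(ngram, num_buckets=2_000_000):
    """Map a char ngram to a bucket using the 32-bit FNV-1a hash (sketch)."""
    h = 0x811C9DC5                         # FNV offset basis
    for byte in ngram.encode("utf-8"):
        h ^= byte                          # FNV-1a: xor before multiply
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, modulo 2**32
    return h % num_buckets

# All ngrams share one fixed-size table, so memory stays bounded
# no matter how many distinct ngrams the corpus contains.
print(fnv1a_bucket("<nig"), fnv1a_bucket("ight"))
```

Because of the hashing, distinct ngrams may collide into the same bucket; the bucket parameter trades memory for collision rate.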
Saving/loading models
Models can be saved and loaded via the load and save methods, just like any other model in Gensim.
# Save a model trained via Gensim's fastText implementation to temp.
import tempfile
import os
with tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:
    model.save(tmp.name, separately=[])
# Load back the same model.
loaded_model = FastText.load(tmp.name)
print(loaded_model)
os.unlink(tmp.name) # demonstration complete, don't need the temp file anymore
Output:
<gensim.models.fasttext.FastText object at 0x7f972fe265b0>
save_word2vec_format is also available for fastText models, but it will cause all vectors for the ngrams to be lost. As a result, a model loaded in this way will behave as a regular word2vec model.
Word vector lookup
All the information necessary for looking up fastText words (including OOV words) is contained in its model.wv attribute.
If you don't need to continue training your model, you can export and save this .wv attribute and discard the model, to save space and RAM.
wv = model.wv
print(wv)
#
# FastText models support vector lookups for out-of-vocabulary words by summing up character ngrams belonging to the word.
#
print('night' in wv.key_to_index)
Output:
<gensim.models.fasttext.FastTextKeyedVectors object at 0x7f9733391280>
True
print('nights' in wv.key_to_index)
Output:
False
print(wv['night'])
Output:
array([ 0.12453239, -0.26018462, -0.04087191, 0.2563215 , 0.31401935,
0.16155584, 0.39527607, 0.27404118, -0.45236284, 0.06942682,
0.36584955, 0.51162827, -0.51161295, -0.192019 , -0.5068029 ,
-0.07426998, -0.6276584 , 0.22271585, 0.19990133, 0.2582401 ,
0.14329399, -0.01959469, -0.45576197, -0.06447829, 0.1493489 ,
0.17261286, -0.13472046, 0.26546794, -0.34596932, 0.5626187 ,
-0.7038802 , 0.15603925, -0.03104019, -0.06228801, -0.13480644,
-0.0684596 , 0.24728075, 0.55081636, 0.07330963, 0.32814154,
0.1574982 , 0.56742406, -0.31233737, 0.14195296, 0.0540203 ,
0.01718009, 0.05519052, -0.04002226, 0.16157456, -0.5134223 ,
-0.01033936, 0.05745083, -0.39208183, 0.52553374, -1.0542839 ,
0.2145304 , -0.15234643, -0.35197273, -0.6215585 , 0.01796502,
0.21242104, 0.30762967, 0.2787644 , -0.19908747, 0.7144409 ,
0.45586124, -0.21344525, 0.26920903, -0.651759 , -0.37096855,
-0.16243419, -0.3085725 , -0.70485127, -0.04926324, -0.80278563,
-0.24352737, 0.6427129 , -0.3530421 , -0.29960123, 0.01466726,
-0.18253349, -0.2489397 , 0.00648343, 0.18057272, -0.11812428,
-0.49044088, 0.1847386 , -0.27946883, 0.3941279 , -0.39211616,
0.26847798, 0.41468227, -0.3953728 , -0.25371104, 0.3390468 ,
-0.16447693, -0.18722224, 0.2782088 , -0.0696249 , 0.4313547 ],
dtype=float32)
print(wv['nights'])
Output:
array([ 0.10586783, -0.22489995, -0.03636307, 0.22263278, 0.27037606,
0.1394871 , 0.3411114 , 0.2369042 , -0.38989475, 0.05935 ,
0.31713557, 0.44301754, -0.44249156, -0.16652377, -0.4388366 ,
-0.06266895, -0.5436303 , 0.19294666, 0.17363031, 0.22459263,
0.12532061, -0.01866964, -0.3936521 , -0.05507145, 0.12905194,
0.14942174, -0.11657442, 0.22935589, -0.29934618, 0.4859668 ,
-0.6073519 , 0.13433163, -0.02491274, -0.05468523, -0.11884545,
-0.06117092, 0.21444008, 0.4775469 , 0.06227469, 0.28350767,
0.13580805, 0.48993143, -0.27067345, 0.1252003 , 0.04606731,
0.01598426, 0.04640368, -0.03456376, 0.14138013, -0.44429192,
-0.00865329, 0.05027836, -0.341311 , 0.45402458, -0.91097856,
0.1868968 , -0.13116683, -0.30361563, -0.5364188 , 0.01603454,
0.18146741, 0.26708448, 0.24074472, -0.17163375, 0.61906886,
0.39530373, -0.18259627, 0.23319626, -0.5634787 , -0.31959867,
-0.13945322, -0.269441 , -0.60941464, -0.0403638 , -0.69563633,
-0.2098089 , 0.5569868 , -0.30320194, -0.25840232, 0.01436759,
-0.15632603, -0.21624804, 0.00434287, 0.15566474, -0.10228094,
-0.4249678 , 0.16197811, -0.24147548, 0.34205705, -0.3391568 ,
0.23235887, 0.35860622, -0.34247142, -0.21777524, 0.29318404,
-0.1407287 , -0.16115218, 0.24247572, -0.06217333, 0.37221798],
dtype=float32)
Similarity operations
Similarity operations work the same way as word2vec. Out-of-vocabulary words can also be used, provided they have at least one character ngram present in the training data.
print("nights" in wv.key_to_index)
Output:
False
print("night" in wv.key_to_index)
Output:
True
print(wv.similarity("night", "nights"))
Output:
0.999992
Syntactically similar words generally have high similarity in fastText models, since a large number of their character ngrams will be the same. As a result, fastText generally does better at syntactic tasks than Word2Vec. A detailed comparison is provided here.
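The ngram overlap behind this can be made visible with a small sketch (an illustrative computation, assuming boundary markers and the default min_n=3, max_n=6):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Illustrative: the set of char ngrams for a word, with boundary markers."""
    w = f"<{word}>"
    return {w[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)}

# 'night' and 'nights' share most of their ngrams, hence the high similarity.
shared = char_ngrams("night") & char_ngrams("nights")
print(sorted(shared))
```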
Other similarity operations
The example training corpus is a toy corpus; results are not expected to be good, and are shown for proof-of-concept only.
print(wv.most_similar("nights"))
Output:
[('night', 0.9999929070472717),
('night.', 0.9999895095825195),
('flights', 0.999988853931427),
('rights', 0.9999886751174927),
('residents', 0.9999884366989136),
('overnight', 0.9999883770942688),
('commanders', 0.999988317489624),
('reached', 0.9999881386756897),
('commander', 0.9999880790710449),
('leading', 0.999987781047821)]
print(wv.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant']))
Output:
0.9999402
print(wv.doesnt_match("breakfast cereal dinner lunch".split()))
Output:
'lunch'
print(wv.most_similar(positive=['baghdad', 'england'], negative=['london']))
Output:
[('attempt', 0.999660074710846),
('biggest', 0.9996545314788818),
('again', 0.9996527433395386),
('against', 0.9996523857116699),
('doubles', 0.9996522068977356),
('Royal', 0.9996512532234192),
('Airlines', 0.9996494054794312),
('forced', 0.9996494054794312),
('arrest', 0.9996492266654968),
('follows', 0.999649167060852)]
print(wv.evaluate_word_analogies(datapath('questions-words.txt')))
Output:
(0.24489795918367346,
[{'correct': [], 'incorrect': [], 'section': 'capital-common-countries'},
{'correct': [], 'incorrect': [], 'section': 'capital-world'},
{'correct': [], 'incorrect': [], 'section': 'currency'},
{'correct': [], 'incorrect': [], 'section': 'city-in-state'},
{'correct': [],
'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')],
'section': 'family'},
{'correct': [], 'incorrect': [], 'section': 'gram1-adjective-to-adverb'},
{'correct': [], 'incorrect': [], 'section': 'gram2-opposite'},
{'correct': [('GOOD', 'BETTER', 'LOW', 'LOWER'),
('GREAT', 'GREATER', 'LOW', 'LOWER'),
('LONG', 'LONGER', 'LOW', 'LOWER')],
'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
('GOOD', 'BETTER', 'LONG', 'LONGER'),
('GREAT', 'GREATER', 'LONG', 'LONGER'),
('GREAT', 'GREATER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'GOOD', 'BETTER'),
('LOW', 'LOWER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'LONG', 'LONGER')],
'section': 'gram3-comparative'},
{'correct': [('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
('GOOD', 'BEST', 'LARGE', 'LARGEST'),
('GREAT', 'GREATEST', 'LARGE', 'LARGEST')],
'incorrect': [('BIG', 'BIGGEST', 'GOOD', 'BEST'),
('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
('GOOD', 'BEST', 'GREAT', 'GREATEST'),
('GOOD', 'BEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
('LARGE', 'LARGEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'GREAT', 'GREATEST')],
'section': 'gram4-superlative'},
{'correct': [('GO', 'GOING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
('LOOK', 'LOOKING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'SAY', 'SAYING'),
('PLAY', 'PLAYING', 'GO', 'GOING'),
('SAY', 'SAYING', 'GO', 'GOING')],
'incorrect': [('GO', 'GOING', 'LOOK', 'LOOKING'),
('GO', 'GOING', 'PLAY', 'PLAYING'),
('GO', 'GOING', 'RUN', 'RUNNING'),
('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'SAY', 'SAYING'),
('RUN', 'RUNNING', 'GO', 'GOING'),
('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'LOOK', 'LOOKING'),
('SAY', 'SAYING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'RUN', 'RUNNING')],
'section': 'gram5-present-participle'},
{'correct': [('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN')],
'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'),
('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI')],
'section': 'gram6-nationality-adjective'},
{'correct': [],
'incorrect': [('GOING', 'WENT', 'PAYING', 'PAID'),
('GOING', 'WENT', 'PLAYING', 'PLAYED'),
('GOING', 'WENT', 'SAYING', 'SAID'),
('GOING', 'WENT', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
('PAYING', 'PAID', 'SAYING', 'SAID'),
('PAYING', 'PAID', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
('PLAYING', 'PLAYED', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'TAKING', 'TOOK'),
('SAYING', 'SAID', 'GOING', 'WENT'),
('SAYING', 'SAID', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'GOING', 'WENT'),
('TAKING', 'TOOK', 'PAYING', 'PAID'),
('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'SAYING', 'SAID')],
'section': 'gram7-past-tense'},
{'correct': [('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'CAR', 'CARS'),
('MAN', 'MEN', 'CAR', 'CARS')],
'incorrect': [('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
('CAR', 'CARS', 'CHILD', 'CHILDREN'),
('CAR', 'CARS', 'MAN', 'MEN'),
('CHILD', 'CHILDREN', 'MAN', 'MEN'),
('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
('MAN', 'MEN', 'CHILD', 'CHILDREN')],
'section': 'gram8-plural'},
{'correct': [], 'incorrect': [], 'section': 'gram9-plural-verbs'},
{'correct': [('GOOD', 'BETTER', 'LOW', 'LOWER'),
('GREAT', 'GREATER', 'LOW', 'LOWER'),
('LONG', 'LONGER', 'LOW', 'LOWER'),
('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
('GOOD', 'BEST', 'LARGE', 'LARGEST'),
('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
('GO', 'GOING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
('LOOK', 'LOOKING', 'SAY', 'SAYING'),
('LOOK', 'LOOKING', 'GO', 'GOING'),
('PLAY', 'PLAYING', 'SAY', 'SAYING'),
('PLAY', 'PLAYING', 'GO', 'GOING'),
('SAY', 'SAYING', 'GO', 'GOING'),
('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
('CHILD', 'CHILDREN', 'CAR', 'CARS'),
('MAN', 'MEN', 'CAR', 'CARS')],
'incorrect': [('HE', 'SHE', 'HIS', 'HER'),
('HIS', 'HER', 'HE', 'SHE'),
('GOOD', 'BETTER', 'GREAT', 'GREATER'),
('GOOD', 'BETTER', 'LONG', 'LONGER'),
('GREAT', 'GREATER', 'LONG', 'LONGER'),
('GREAT', 'GREATER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'GOOD', 'BETTER'),
('LONG', 'LONGER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'GOOD', 'BETTER'),
('LOW', 'LOWER', 'GREAT', 'GREATER'),
('LOW', 'LOWER', 'LONG', 'LONGER'),
('BIG', 'BIGGEST', 'GOOD', 'BEST'),
('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
('GOOD', 'BEST', 'GREAT', 'GREATEST'),
('GOOD', 'BEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
('GREAT', 'GREATEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
('LARGE', 'LARGEST', 'GOOD', 'BEST'),
('LARGE', 'LARGEST', 'GREAT', 'GREATEST'),
('GO', 'GOING', 'LOOK', 'LOOKING'),
('GO', 'GOING', 'PLAY', 'PLAYING'),
('GO', 'GOING', 'RUN', 'RUNNING'),
('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'SAY', 'SAYING'),
('RUN', 'RUNNING', 'GO', 'GOING'),
('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'LOOK', 'LOOKING'),
('SAY', 'SAYING', 'PLAY', 'PLAYING'),
('SAY', 'SAYING', 'RUN', 'RUNNING'),
('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN'),
('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),
('GOING', 'WENT', 'PAYING', 'PAID'),
('GOING', 'WENT', 'PLAYING', 'PLAYED'),
('GOING', 'WENT', 'SAYING', 'SAID'),
('GOING', 'WENT', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
('PAYING', 'PAID', 'SAYING', 'SAID'),
('PAYING', 'PAID', 'TAKING', 'TOOK'),
('PAYING', 'PAID', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
('PLAYING', 'PLAYED', 'GOING', 'WENT'),
('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'TAKING', 'TOOK'),
('SAYING', 'SAID', 'GOING', 'WENT'),
('SAYING', 'SAID', 'PAYING', 'PAID'),
('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'GOING', 'WENT'),
('TAKING', 'TOOK', 'PAYING', 'PAID'),
('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
('TAKING', 'TOOK', 'SAYING', 'SAID'),
('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
('CAR', 'CARS', 'CHILD', 'CHILDREN'),
('CAR', 'CARS', 'MAN', 'MEN'),
('CHILD', 'CHILDREN', 'MAN', 'MEN'),
('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
('MAN', 'MEN', 'CHILD', 'CHILDREN')],
'section': 'Total accuracy'}])
Word Mover's Distance
You'll need the optional pyemd library for this section: pip install pyemd.
Let's start with two sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
Remove their stopwords:
from gensim.parsing.preprocessing import STOPWORDS
sentence_obama = [w for w in sentence_obama if w not in STOPWORDS]
sentence_president = [w for w in sentence_president if w not in STOPWORDS]
Compute the Word Mover's Distance between the two sentences:
distance = wv.wmdistance(sentence_obama, sentence_president)
print(f"Word Movers Distance is {distance} (lower means closer)")
Output:
'Word Movers Distance is 0.015923231075180694 (lower means closer)'