UniformML-paper 4: GloVe ("GloVe: Global Vectors for Word Representation")

Today we revisit a highly recommended classic word-vector training model: GloVe. (Most readers are probably more familiar with word2vec, which we will also revisit later; on large corpora, GloVe tends to outperform word2vec.)

1. Introduction to the GloVe model

GloVe is built on the word co-occurrence matrix. Because this matrix is very sparse, the paper trains mainly on its non-zero elements.

Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.

The training method rests on two main ideas:

1. Use the global statistics of the co-occurrence matrix.

2. Use local context-window information.

The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods.

These two approaches correspond to the ideas behind LSA and skip-gram, respectively. LSA (Latent Semantic Analysis) factorizes a word-document co-occurrence matrix with methods such as SVD to represent words as denser vectors, whereas skip-gram learns a language model (word prediction) over a fixed window within each sentence.

While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

1) Matrix factorization methods

LSA is an example of matrix factorization: it builds a term-document co-occurrence matrix.

In LSA, the matrices are of "term-document" type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus.

2) Shallow window-based methods

In the skip-gram and ivLBL models, the objective is to predict a word's context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context.

Given that GloVe exploits both of these properties, how does it actually do so?

First, a word-word co-occurrence matrix is built from the corpus. Let X_{ij} denote the number of times words i and j co-occur, and define

X_i=\sum_j X_{ij}

P_{ij}=\frac{X_{ij}}{X_i}
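To make the notation concrete, here is a minimal Python sketch (not the official GloVe implementation) of building X_{ij}, X_i, and P_{ij} from a toy corpus. The corpus and window size are placeholders; the 1/d weighting of a co-occurrence at distance d follows the decreasing weighting the paper describes, but everything else is illustrative:

from collections import defaultdict
# Toy corpus and window size; both are illustrative, not the paper's settings.
corpus = [["ice", "is", "a", "solid"], ["steam", "is", "a", "gas"]]
window_size = 5
# X[(i, j)] plays the role of X_ij: how often word j appears in the context of word i.
# Following the paper, a co-occurrence at distance d contributes 1/d.
X = defaultdict(float)
for sentence in corpus:
    for pos, word in enumerate(sentence):
        for d in range(1, window_size + 1):
            if pos + d < len(sentence):
                context = sentence[pos + d]
                X[(word, context)] += 1.0 / d  # symmetric window: count both directions
                X[(context, word)] += 1.0 / d
# X_i = sum_j X_ij and P_ij = X_ij / X_i, exactly as defined above.
X_i = defaultdict(float)
for (i, _), count in X.items():
    X_i[i] += count
def P(i, j):
    return X[(i, j)] / X_i[i] if X_i[i] > 0 else 0.0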

The authors observe an interesting phenomenon: the ratio P_{ik}/P_{jk} reflects how the probe word k relates to words i and j. Taking i = ice and j = steam, as shown in Table 1 of the paper:

1) If k = solid, which is related to ice but not to steam, the ratio P_{ik}/P_{jk} is large.

2) Conversely, if k = gas, which is related to steam but not to ice, the ratio P_{ik}/P_{jk} is small.

3) If k = water or k = fashion, which are related to both ice and steam, or to neither, the ratio is close to 1.
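For reference, the values in Table 1 of the paper are approximately as follows (the qualitative pattern, not the exact digits, is the point):

k:                     solid     gas       water    fashion
P(k | ice):            1.9e-4    6.6e-5    3.0e-3   1.7e-5
P(k | steam):          2.2e-5    7.8e-4    2.2e-3   1.8e-5
P(k|ice)/P(k|steam):   8.9       8.5e-2    1.36     0.96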

The idea of GloVe

The above argument suggests that the appropriate starting point for word vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves.

Expressing this relationship as a formula:

F(w_i, w_j, \widetilde w_k)=\frac{P_{ik}}{P_{jk}}    (1)

So how should this F be constructed?

Recall the classic example of word-vector arithmetic:

king - queen = man - woman

This suggests that the word-vector space should have a linear structure, so we can refine F accordingly:

Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences. With this aim, we can restrict our consideration to those functions F that depend only on the difference of the two target words,

F(w_i-w_j,\widetilde w_k)=\frac{P_{ik}}{P_{jk}}    (2)

F((w_i-w_j)^T\widetilde w_k)=\frac{P_{ik}}{P_{jk}}    (3)

To relate the two sides, F is required to be a homomorphism between (R, +) and (R_{>0}, \times), i.e., F(a-b)=F(a)/F(b), which gives

F((w_i-w_j)^T\widetilde w_k)=\frac{F(w_i^T\widetilde w_k)}{F(w_j^T\widetilde w_k)}    (4)

where each factor satisfies

P_{ik}=\frac{X_{ik}}{X_i}=F(w_i^T\widetilde w_k)    (5)

The solution of Eqn. (4) is F = exp, so taking the logarithm of both sides of Eqn. (5) gives

w_i^T\widetilde w_k=log(P_{ik})=log X_{ik}-log X_i    (6)

Here log(X_i) is independent of k, so it can be absorbed into a bias b_i; adding a bias \widetilde b_k for the context word restores the symmetry, giving

w_i^T \widetilde w_k + b_i+\widetilde b_k=log(X_{ik})    (7)

The model is then cast as a weighted least squares regression problem, giving the loss

J=\sum_{i,j=1}^{V} f(X_{ij})(w_i^T\widetilde w_j+b_i+\widetilde b_j-log X_{ij})^2    (8)

where f(X_{ij}) is the weighting function

f(x)=\begin{cases} (x/x_{max})^\alpha & \text{if } x<x_{max} \\ 1 & \text{otherwise} \end{cases}    (9)

with \alpha=3/4 and x_{max}=100.
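A minimal sketch of Eqns. (8) and (9) in code, assuming numpy vectors; the function names f and pair_loss are just illustrative:

import numpy as np
def f(x, x_max=100.0, alpha=0.75):
    # Eqn. (9): rare co-occurrences are down-weighted, frequent ones are capped at weight 1.
    return (x / x_max) ** alpha if x < x_max else 1.0
def pair_loss(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    # One term of the sum in Eqn. (8), for a single non-zero count X_ij.
    diff = w_i @ w_j_tilde + b_i + b_j_tilde - np.log(x_ij)
    return f(x_ij) * diff ** 2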

2. Relation of GloVe to other models

Recall that the skip-gram model maximizes the Q_{ij} below: a softmax giving the probability of context word j given word i.

Q_{ij}=\frac{exp(w_i^T\widetilde w_j)}{\sum_k exp(w_i^T\widetilde w_k)}    (10)
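For intuition, here is what Q_{ij} looks like in code, assuming W and W_tilde are V x d numpy matrices of word and context vectors (the names are illustrative). Note that the denominator sums over the entire vocabulary, which is exactly the normalization cost discussed further below:

import numpy as np
def Q(i, j, W, W_tilde):
    # Eqn. (10): softmax over the entire vocabulary of context words.
    scores = W_tilde @ W[i]          # w_i^T w~_k for every k: O(V * d) work per target word
    scores = scores - scores.max()   # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[j] / exp_scores.sum()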

Maximizing the likelihood \prod Q_{ij} is equivalent to minimizing the negative log-likelihood:

J=-\sum_{i\in corpus}\sum_{j\in context(i)} log(Q_{ij})    (11)

Grouping the terms that share the same word pair (i, j), the sum over corpus positions becomes a sum over pairs weighted by the co-occurrence counts:

J=-\sum_{i=1}^{V} \sum_{j=1}^{V} X_{ij}log(Q_{ij})    (12)

Using X_i=\sum_j X_{ij} as defined earlier, together with P_{ij}=X_{ij}/X_i, this can be rewritten as

J=-\sum_{i=1}^V X_i\sum_{j=1}^{V} P_{ij}log(Q_{ij})=\sum_{i=1}^{V}X_iH(P_i, Q_i)    (13)

where H(P_i, Q_i) is the cross entropy between the distributions P_i and Q_i.

As a weighted sum of cross-entropy error, this objective bears some formal resemblance to the weighted least squares objective of Eqn. (8)

At this point, the loss derived from the softmax and maximum likelihood already resembles Eqn. (8) in form. The problem is that the distribution Q is very expensive to compute, especially when the vocabulary is large. The authors therefore propose a different error measure that avoids this heavy computation: a simple squared error.

To begin, cross entropy error is just one among many possible distance measures between probability distributions, and it has the unfortunate property that distributions with long tails are often modeled poorly with too much weight given to the unlikely events. Furthermore, for the measure to be bounded it requires that the model distribution Q be properly normalized. This presents a computational bottleneck owing to the sum over the whole vocabulary in Eqn. (10), and it would be desirable to consider a different distance measure that did not require this property of Q. A natural choice would be a least squares objective in which normalization factors in Q and P are discarded,

J=\sum_{i,j}X_i(\hat P_{ij}-\hat Q_{ij})^2=\sum_{i,j}X_i(X_{ij}-exp(w_i^T\widetilde w_j))^2    (14)

Here \hat P_{ij}=X_{ij} and \hat Q_{ij}=exp(w_i^T\widetilde w_j) are unnormalized, so they are no longer probability distributions. This introduces another problem: X_{ij} can be very large, which makes optimization difficult, so the authors apply a log transform, giving

J=\sum_{i,j}X_i(log X_{ij}-w_i^T\widetilde w_j)^2    (15)

最后我們?cè)趯?duì)權(quán)重函數(shù)做個(gè)修改,

In fact, Mikolov et al. (2013a) observe that performance can be increased by filtering the data so as to reduce the effective value of the weighting factor for frequent words. With this in mind, we introduce a more general weighting function, which we are free to take to depend on the context word as well,

J=\sum_{i,j} f(X_{ij})(log X_{ij}-w_i^T\widetilde w_j)^2    (16)
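Putting the pieces together, a minimal training sketch for this objective (with the bias terms of Eqn. (8) added back) could look like the following. The paper itself trains with AdaGrad over shuffled non-zero entries of X; this sketch uses plain SGD and illustrative hyperparameters for brevity:

import numpy as np
def train_epoch(X, index, W, W_tilde, b, b_tilde, lr=0.05, x_max=100.0, alpha=0.75):
    # X: dict mapping (word_i, word_j) to the non-zero counts X_ij
    # index: dict mapping words to row indices of W and W_tilde
    total_loss = 0.0
    for (wi, wj), x_ij in X.items():
        i, j = index[wi], index[wj]
        weight = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0    # Eqn. (9)
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x_ij)  # inner term of Eqn. (8)
        total_loss += weight * diff ** 2
        grad = 2.0 * weight * diff                                   # common gradient factor
        W[i], W_tilde[j] = W[i] - lr * grad * W_tilde[j], W_tilde[j] - lr * grad * W[i]
        b[i] -= lr * grad
        b_tilde[j] -= lr * grad
    return total_loss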

3. Evaluation

The paper evaluates the word vectors on the following tasks:

1) Word analogy

The goal of this task is to answer questions such as "a is to b as c is to ___".

The word analogy task consists of questions like, "a is to b as c is to ?"

This task probes whether certain linear structures hold in the vector space; a sketch of how such a query is answered follows.
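As a concrete illustration (this is the standard vector-offset evaluation, not code from the paper), the answer to "a is to b as c is to ?" is taken to be the word whose vector is closest by cosine similarity to w_b - w_a + w_c, excluding the three query words:

import numpy as np
def analogy(a, b, c, vectors):
    # vectors: dict mapping words to trained word vectors (numpy arrays)
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -float("inf")
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # exclude the three query words themselves
        sim = float(target @ vec) / float(np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
# e.g. analogy("man", "king", "woman", vectors) should return "queen"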

2) Word similarity

3) Named entity recognition

In addition, the authors analyze the model further:

1) Effect of vector dimension and window size

The authors compare results for different vector dimensions and window sizes. Two kinds of context window are used: a symmetric window extending to both sides of the target word, and an asymmetric window extending only to the left.

A context window that extends to the left and right of a target word will be called symmetric, and one which extends only to the left will be called asymmetric.


[Figure: accuracy as a function of vector dimension and window size]

2) GloVe vs. word2vec


[Figure: GloVe vs. word2vec]

This figure is somewhat thought-provoking. First, a note on the axes: the x-axis has two readings, the number of training iterations for GloVe and the number of negative samples for word2vec. The number of negative samples should not be too large; around 10 is enough, and more than that hurts word2vec's performance. Interestingly, GloVe outperforms word2vec on the analogy task.

The original paper provides the source code:

We provide the source code for the model as well as trained word vectors at http://nlp.stanford.edu/projects/glove/.
