Sentence Embeddings for Sentence Similarity (1)

SIF

Smooth Inverse Frequency

A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE EMBEDDINGS

  1. Represent a sentence as the weighted average of its word vectors:
    for a word w with unigram probability p(w) and smoothing parameter a, the weight is
    a / (a + p(w))
  2. Then use PCA/SVD to find the first principal component and remove its projection from every sentence vector (common component removal); a minimal sketch follows below.
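A minimal sketch of both steps in numpy/scikit-learn, assuming hypothetical inputs `word_vecs` (word -> pre-trained vector) and `word_freq` (word -> unigram probability p(w)); a = 1e-3 is a commonly used smoothing value:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """sentences: list of token lists; word_vecs: dict word -> (d,) vector;
    word_freq: dict word -> unigram probability p(w)."""
    # Step 1: weighted average with weight a / (a + p(w))
    X = np.stack([
        np.mean([word_vecs[w] * a / (a + word_freq[w]) for w in s], axis=0)
        for s in sentences
    ])
    # Step 2: remove the projection onto the first principal component
    pc = TruncatedSVD(n_components=1).fit(X).components_  # (1, d), unit norm
    return X - X @ pc.T @ pc
```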

FastSent

Learning Distributed Representations of Sentences from Unlabelled Data

FastSent is a bag-of-words analogue of Skip-Thought: a sentence is represented as the sum of its word embeddings, and that sum is trained to predict the words of the adjacent sentences, which makes training far cheaper than with RNN encoders. The same authors' companion work, "Learning to understand phrases by embedding the dictionary", trains neural embedding models to map dictionary definitions (phrases) to the (lexical) embeddings of the words they define, and applies them to reverse dictionaries and general-knowledge crossword question answering, matching or beating commercial systems that rely on heavy task-specific engineering.
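As a rough illustration of the idea (not the authors' implementation), the loss for one training example could look like this, with hypothetical source/target embedding matrices U and V:

```python
import numpy as np

def fastsent_loss(sent_ids, context_ids, U, V):
    """sent_ids: word ids of the middle sentence; context_ids: word ids of the
    adjacent sentences; U, V: (vocab, dim) source / target embedding matrices."""
    s = U[sent_ids].sum(axis=0)           # bag-of-words sentence representation
    scores = V @ s                        # logits over the whole vocabulary
    # log-softmax, computed stably
    log_probs = scores - (scores.max() + np.log(np.exp(scores - scores.max()).sum()))
    return -log_probs[context_ids].sum()  # NLL of the neighbouring sentences' words
```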

Skip-Thought

Skip-Thought Vectors
Code: https://github.com/ryankiros/skip-thoughts

Learns sentence vectors without supervision.
Training data: the continuity of text from books (consecutive sentences in novels).
Hypothesis: sentences that share semantic and syntactic properties are mapped to similar vector representations.
OOV handling: vocabulary expansion, which extends coverage to a vocabulary on the order of a million words (a minimal sketch follows below).
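Concretely, the paper fits an un-regularized linear regression from a large word2vec space into the encoder's word-embedding space over the vocabulary the two share; a minimal sketch with hypothetical matrices V_w2v and V_rnn:

```python
import numpy as np

def fit_vocab_expansion(V_w2v, V_rnn):
    """V_w2v: (n_shared, d_w2v) and V_rnn: (n_shared, d_rnn) hold the vectors of
    the shared words. Solves V_w2v @ W ~= V_rnn by least squares."""
    W, *_ = np.linalg.lstsq(V_w2v, V_rnn, rcond=None)
    return W  # (d_w2v, d_rnn)

# Any word word2vec knows can now be mapped into the encoder's space:
# v_rnn_new = v_w2v_new @ W
```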

Model


Architecture: encoder-decoder.
skip-gram: use a word to predict its surrounding context;
skip-thought: encode a sentence to predict the sentences around it.
Training corpus: the BookCorpus dataset, a collection of novels spanning 16 genres.
Input: a sentence triple (s_{i-1}, s_i, s_{i+1}); the encoder reads the middle sentence s_i.
Encoder: feature extractor --> the skip-thought vector h_i
Decoders: one for s_{i-1}, one for s_{i+1} (a simplified sketch follows below)
Objective function: \sum_t \log P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) + \sum_t \log P(w_{i-1}^t \mid w_{i-1}^{<t}, h_i)
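A simplified PyTorch sketch of the encoder, the two decoders, and the objective. In the paper the decoders additionally condition on h_i at every time step; here h_i only initializes them, padding is ignored, and sequences are assumed to start with <bos> and end with <eos>:

```python
import torch
import torch.nn as nn

class SkipThought(nn.Module):
    """Minimal sketch: GRU encoder + two GRU decoders (previous / next sentence)."""
    def __init__(self, vocab_size, emb_dim=620, hid_dim=2400):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_next = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)  # shared output vocabulary

    def forward(self, s_i, s_prev, s_next):
        # h_i: the skip-thought vector of the middle sentence, shape (1, B, H)
        _, h_i = self.encoder(self.emb(s_i))
        # Teacher forcing: feed the gold words w^{<t}, conditioned on h_i
        out_prev, _ = self.dec_prev(self.emb(s_prev[:, :-1]), h_i)
        out_next, _ = self.dec_next(self.emb(s_next[:, :-1]), h_i)
        ce = nn.CrossEntropyLoss()
        # Sum of the two per-word negative log-likelihoods from the objective
        loss = ce(self.out(out_prev).transpose(1, 2), s_prev[:, 1:]) + \
               ce(self.out(out_next).transpose(1, 2), s_next[:, 1:])
        return h_i.squeeze(0), loss
```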

Downstream evaluation on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification, and 4 benchmark sentiment and subjectivity datasets.

InferSent

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Code: https://github.com/facebookresearch/InferSent

Learns sentence vectors with supervision.
Dataset: the Stanford Natural Language Inference corpus (SNLI), 570k human-written English sentence pairs, each labeled with one of 3 relations: entailment, contradiction, or neutral.
Word vectors: 300d GloVe vectors.
Hypothesis: SNLI is semantically rich enough to learn general-purpose sentence representations.


Model

3 matching methods between the two sentence vectors u and v (their outputs are concatenated into the classifier input; see the sketch after this list):

  1. concatenation of u and v;
  2. element-wise product u \cdot v;
  3. absolute element-wise difference |u - v|
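In the paper all three matching features are concatenated into a single classifier input, which in PyTorch is a one-liner:

```python
import torch

def match_features(u, v):
    # [u; v; |u - v|; u * v] concatenated along the feature dimension
    return torch.cat([u, v, (u - v).abs(), u * v], dim=1)
```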

7 encoder architectures (the strongest, BiLSTM with max pooling, is sketched after this list):

  1. LSTM
  2. GRU
  3. BiGRU-last: concatenation of the last hidden states of a forward and a backward GRU
  4. BiLSTM with mean pooling
  5. BiLSTM with max pooling
  6. Self-attentive network
  7. Hierarchical convolutional network
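The paper reports BiLSTM with max pooling as the best-performing encoder; a minimal PyTorch sketch (2048 hidden units per direction as in the paper, padding handling omitted):

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """BiLSTM over word vectors, element-wise max pooling across time."""
    def __init__(self, emb_dim=300, hid_dim=2048):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, word_vecs):      # (B, T, 300) pre-trained GloVe vectors
        h, _ = self.lstm(word_vecs)    # (B, T, 2 * hid_dim)
        u, _ = h.max(dim=1)            # max over time -> (B, 4096) sentence vector
        return u
```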

Universal Sentence Encoder

Universal Sentence Encoder
code: https://tfhub.dev/google/universal-sentence-encoder/2

Encodes sentences by training on multiple NLP tasks, yielding a general-purpose sentence representation.
Model
Training data: unsupervised data from web sources, augmented with supervised training on SNLI.
Transfer tasks: MR, CR, SUBJ, MPQA, TREC, SST, STS Benchmark, WEAT
Transfer input: concatenation of sentence-level and word-level embeddings

Architecture: 2 encoder variants:

  1. Transformer
    Higher accuracy, but greater model complexity and compute cost.
    Steps:
    a. Attention layers compute context-aware word representations, taking both word identity and word order into account
    b. The per-position representations are summed element-wise and normalized by sentence length into a fixed-length vector
    c. Input is a PTB-tokenized string; output is a 512d sentence embedding

  2. DAN (deep averaging network)
    Slightly lower accuracy, but much cheaper to compute.
    a. Input: word and bigram embeddings
    b. The embeddings are averaged
    c. A feed-forward DNN maps the average to the 512d sentence embedding
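With the module URL above, computing embeddings takes a few lines under TensorFlow 1.x and tensorflow_hub (hub.Module is the TF1-era API; newer TF2 code loads later versions of the model with hub.load):

```python
import tensorflow as tf          # TensorFlow 1.x
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["The quick brown fox jumps over the lazy dog.",
                    "Sentence embeddings map text to 512d vectors."])

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vecs = sess.run(embeddings)  # shape (2, 512)
```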

SentenceBERT

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks


Model
Training data: SNLI and Multi-Genre NLI (MultiNLI); both are sentence-pair datasets annotated with the 3 NLI labels.
Pooling strategies over the BERT output:

  1. CLS: the output vector of the [CLS] token
  2. MEAN: the mean of the output vectors (the default; see the sketch after this list)
  3. MAX: the element-wise max over the output vectors
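MEAN pooling should ignore padding positions; a small PyTorch sketch using the attention mask (tensor names are hypothetical):

```python
import torch

def mean_pooling(token_embeddings, attention_mask):
    # token_embeddings: (B, T, H) BERT outputs; attention_mask: (B, T) of 0/1
    mask = attention_mask.unsqueeze(-1).float()    # (B, T, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)       # (B, 1), avoid divide-by-zero
    return summed / counts
```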

Objective functions (sketches of 1 and 3 follow below):

  1. Classification: o = softmax(W_t [u; v; |u - v|])
  2. Regression: the cosine similarity between u and v, trained with a mean-squared-error loss
  3. Triplet: max(||s_a - s_p|| - ||s_a - s_n|| + \epsilon, 0)
    where:
    s_a: the anchor sentence embedding
    s_p: the positive sentence embedding
    s_n: the negative sentence embedding
    The loss pushes the anchor at least a margin \epsilon closer to the positive than to the negative.
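Minimal PyTorch sketches of the classification and triplet objectives, assuming W_t is a learned nn.Linear(3 * H, num_labels) and the margin \epsilon defaults to 1 as in the paper:

```python
import torch
import torch.nn.functional as F

def classification_loss(u, v, labels, W_t):
    """o = softmax(W_t [u; v; |u - v|]); W_t: nn.Linear(3 * H, num_labels)."""
    features = torch.cat([u, v, (u - v).abs()], dim=1)  # (B, 3H)
    return F.cross_entropy(W_t(features), labels)

def triplet_loss(s_a, s_p, s_n, eps=1.0):
    """max(||s_a - s_p|| - ||s_a - s_n|| + eps, 0) with Euclidean distance."""
    d_pos = (s_a - s_p).norm(p=2, dim=1)
    d_neg = (s_a - s_n).norm(p=2, dim=1)
    return torch.clamp(d_pos - d_neg + eps, min=0).mean()
```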