RETRO

@(NLP)[IR]

姚偉峰(Matrix Yao)

Info Card

Basic Idea

RETRO is a neural language model.
Compared with existing language models like GPT, it separates memorization from generalization: it memorizes world knowledge with Retrieval, while learning language structure with the Model.

General auto-regressive language model
L(X|\theta) \triangleq \sum_{i=1}^{n}l_{\theta}(x_i|(x_j)_{j<i})
RETRO's chunked retrieval enhanced model
L(X|\theta, \mathcal D) \triangleq \sum_{u=1}^{l} \sum_{i=1}^{m}l_{\theta}(x_{(u-1)m+i}|(x_j)_{j<(u-1)m+i}, (RET_{\mathcal D}(C_{u'}))_{u'<u})
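The chunked factorization above can be sketched in code. The helper below (a hypothetical name, not from the paper) shows, for a sequence split into chunks of m tokens, which token positions and which chunks' retrievals the token x_{(u-1)m+i} is allowed to condition on:

```python
# Minimal sketch of RETRO's chunked factorization (hypothetical helper,
# not the paper's code). Token x_{(u-1)m+i} may condition on all previous
# tokens and on the retrieved neighbors RET(C_{u'}) of strictly earlier
# chunks u' < u.

def visible_context(pos, m):
    """For 0-based token index `pos` and chunk size `m`, return
    (token positions it may see, chunk indices whose retrievals it may use)."""
    u = pos // m                            # 0-based chunk index of this token
    prev_tokens = list(range(pos))          # causal attention over tokens
    prev_chunk_retrievals = list(range(u))  # only chunks strictly before u
    return prev_tokens, prev_chunk_retrievals

m = 4                                        # chunk size
tokens, retrievals = visible_context(5, m)   # token 5 lives in chunk 1
```

Note that tokens in the first chunk see no retrieved neighbors at all, which matches the u' < u constraint in the sum.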

LM Before and After
  • Any benefits?
    Democratization → fast/cheap and good
    • Fewer parameters: 25x fewer parameters lead to much lower computation requirements for training and serving;
    • SOTA accuracy: shows better perplexity on LM and SOTA accuracy on downstream tasks, e.g., question answering;

The diagram below, from [1], is not the whole picture of RETRO; it shows only the retrieval part.


How Does it Work

Step-1: Retrieve Nearest Neighbors and Encode them

  • Points
    • retrieve the top-k nearest neighbors at chunk granularity, neither passage granularity as in Sentence-BERT nor token granularity as in ColBERT
    • each of the top-k token sequences = concat(neighbor chunk, continuation chunk)
    • each token sequence is encoded with a bi-directional transformer encoder, optionally cross-attending to the self-attended query as keys/values
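The retrieval step above can be sketched with NumPy. Everything here is illustrative (the chunk database, embeddings, and function names are mine, not the paper's): each database entry is a chunk embedding from a frozen encoder, and a hit returns the neighbor chunk concatenated with its continuation chunk.

```python
import numpy as np

# Hypothetical sketch of chunk-granularity nearest-neighbor retrieval.
# Real RETRO uses frozen BERT embeddings over a trillion-token database;
# here random vectors stand in for the frozen-encoder embeddings.

rng = np.random.default_rng(0)
db_chunks = [f"chunk{i}" for i in range(10)]            # stand-in chunk texts
db_emb = rng.normal(size=(10, 8))                       # stand-in embeddings
db_emb /= np.linalg.norm(db_emb, axis=1, keepdims=True) # unit-normalize

def retrieve(query_emb, k=2):
    """Top-k nearest neighbors by cosine similarity, at chunk granularity."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = db_emb @ q
    top = np.argsort(-sims)[:k]
    # each retrieved token sequence = neighbor chunk + its continuation chunk
    return [(db_chunks[i], db_chunks[i + 1] if i + 1 < len(db_chunks) else "")
            for i in top]

hits = retrieve(rng.normal(size=8), k=2)  # list of (neighbor, continuation)
```

In practice the paper uses an approximate-nearest-neighbor index (SCaNN) rather than the brute-force dot product shown here; the retrieved sequences then go through the bi-directional encoder.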

Step-2: Decode Causally

  • CCA(Chunked Cross Attention)


  • Points
    • both self-attention and CCA are causal, keeping the model auto-regressive
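The causality constraint behind CCA can be visualized as a chunk-level mask. This is a simplified sketch of my own (the paper's exact CCA shifts attention by one token; only the coarse chunk-level pattern is shown): the token at position pos, living in chunk u = pos // m, may only cross-attend to the retrieved neighbors of earlier chunks u' < u.

```python
import numpy as np

# Simplified sketch of the chunk-level causality mask behind chunked
# cross-attention (CCA). mask[pos, u'] is True iff the token at `pos`
# may cross-attend to the retrievals of chunk u'. This is what keeps
# retrieval-augmented decoding auto-regressive.

def cca_chunk_mask(n_tokens, m):
    n_chunks = n_tokens // m
    mask = np.zeros((n_tokens, n_chunks), dtype=bool)
    for pos in range(n_tokens):
        mask[pos, : pos // m] = True  # only strictly earlier chunks visible
    return mask

mask = cca_chunk_mask(n_tokens=8, m=4)
# tokens 0-3 (chunk 0) see no retrievals; tokens 4-7 see chunk 0's retrievals
```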

Results

Language Model

Pretty good bits-per-byte even at 23x+ smaller model size.


Downstream Task: QA

Not so impressive, considering the 7.5B model size. Accuracy is inferior to FiD; the authors attribute this to too small a share of the weights being allocated to the encoder in the current model.


Application on ODQA domain

Pipeline Comparison

  • dense retriever + neural ranker
    E.g.,

    • Single Retrieval Encoder: SentenceEmbedding Retriever + ColBERT Ranker

    • Dual Retrieval Encoder: DPR Retriever + ColBERT Ranker

  • RETRO

We can see that RETRO fits naturally into a dense retriever + neural ranker ODQA pipeline. It can be viewed as a single-encoder dense retriever + neural ranker, where the ranker is more compute-heavy than ColBERT, both because of model size and because the ranker's document encodings cannot be pre-computed.

To put RETRO on the map of ODQA paradigms:

References

  1. RETRO Is Blazingly Fast
  2. The Illustrated Retrieval Transformer