NEZHA

Abstract

Core innovations: 1. functional relative positional encoding 2. whole word masking strategy 3. mixed precision training 4. LAMB optimizer

1 Introduction

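Differences between Google's BERT, ERNIE, and BERT-wwm: Google's BERT masks individual Chinese characters or WordPiece tokens. ERNIE masks entities or phrases and also adds pre-training tasks such as Token-Document Relation Prediction and Sentence Reordering; the details require reading the ERNIE paper. BERT-wwm masks whole words.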

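Supplement: WordPiece and BPE (byte-pair encoding). In English, love, loving, and loves carry the same core meaning, but at word granularity they are three different words, so the vocabulary becomes very large, training slows down, and results suffer. WordPiece is a variant of BPE: at each step BPE adds the most frequent subword (the most frequent adjacent pair, merged) to the vocabulary, whereas WordPiece picks subwords based on probability. A BPE example follows.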

Encoding example:

# Given word sequence
["the</w>", "highest</w>", "mountain</w>"]

# Assume an already sorted subword vocabulary
["errrr</w>", "tain</w>", "moun", "est</w>", "high", "the</w>", "a</w>"]

# Result of the iteration
"the</w>" -> ["the</w>"]
"highest</w>" -> ["high", "est</w>"]
"mountain</w>" -> ["moun", "tain</w>"]

Decoding example:

# Encoded sequence
["the</w>", "high", "est</w>", "moun", "tain</w>"]

# Decoded sequence
"the</w> highest</w> mountain</w>"
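The encoding and decoding steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `bpe_encode` and `bpe_decode` and the `<unk>` fallback are my own assumptions, and the encoder simply tries the vocabulary entries in the order given.

```python
def bpe_encode(word, vocab):
    """Split `word` into subwords, trying vocabulary entries in the given order."""
    if not word:
        return []
    for sub in vocab:
        idx = word.find(sub)
        if idx != -1:
            # Keep the matched subword and encode what is left on each side.
            left = bpe_encode(word[:idx], vocab)
            right = bpe_encode(word[idx + len(sub):], vocab)
            return left + [sub] + right
    return ["<unk>"]  # hypothetical fallback when nothing in the vocabulary matches


def bpe_decode(tokens):
    """Concatenate subwords; the </w> marker shows where a word ends."""
    return "".join(tokens).replace("</w>", " ").strip()


vocab = ["errrr</w>", "tain</w>", "moun", "est</w>", "high", "the</w>", "a</w>"]
words = ["the</w>", "highest</w>", "mountain</w>"]
encoded = [tok for w in words for tok in bpe_encode(w, vocab)]
print(encoded)
print(bpe_decode(encoded))
```

Unlike the decoded sequence shown above, `bpe_decode` also turns the `</w>` markers into spaces to give plain text.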

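WordPiece algorithm:

1. Prepare a sufficiently large training corpus
2. Decide the target size of the subword vocabulary
3. Split words into character sequences
4. Train a language model on the data from step 3
5. From all possible subword units, pick the one that increases the likelihood of the training data the most once added to the language model, and add it as a new unit
6. Repeat step 5 until the vocabulary size set in step 2 is reached or the likelihood gain falls below a threshold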
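To make the difference between the two selection rules concrete, here is a small sketch. It assumes a unigram approximation that is not in the notes above: the likelihood gain of merging a pair is scored as count(ab) / (count(a) * count(b)). The toy corpus and all function names are illustrative.

```python
from collections import Counter


def pair_counts(corpus):
    """Count single symbols and adjacent pairs; corpus maps symbol tuples to word frequency."""
    pairs, units = Counter(), Counter()
    for symbols, freq in corpus.items():
        for s in symbols:
            units[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs, units


def best_bpe_pair(corpus):
    """BPE: merge the most frequent adjacent pair."""
    pairs, _ = pair_counts(corpus)
    return max(pairs, key=pairs.get)


def best_wordpiece_pair(corpus):
    """WordPiece (assumed unigram score): merge the pair with the largest likelihood gain."""
    pairs, units = pair_counts(corpus)
    return max(pairs, key=lambda p: pairs[p] / (units[p[0]] * units[p[1]]))


# Toy corpus: each word is already split into characters (step 3).
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
print(best_bpe_pair(corpus), best_wordpiece_pair(corpus))
```

On this corpus BPE prefers the frequent pair ("u", "g"), while the probability-based score prefers ("g", "s"), whose parts rarely occur apart from each other.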

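There are three main approaches to position embedding. The Transformer uses sinusoidal functions, and BERT uses parametric positional encodings. Transformer-XL and XLNet take a middle path that combines sinusoidal functions with trainable biases. NEZHA instead uses a predefined function inside the self-attention module, with no additional trainable parameters.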

2 Pre-training NEZHA Models

2.1 Preliminaries: BERT Model & Positional Encoding

Parametric position encoder: BERT handles sequences of at most 512 tokens, so its position embedding is a 512*768 lookup table that is updated during training.
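A rough sketch of such a lookup table with NumPy. The random initialization (standard deviation 0.02) and the function name are assumptions for illustration; in a real model the table is a trainable parameter updated by the optimizer.

```python
import numpy as np

max_len, hidden = 512, 768
rng = np.random.default_rng(0)
# One row per position; trainable in a real model, fixed random values here.
position_table = rng.normal(0.0, 0.02, size=(max_len, hidden))


def position_embeddings(seq_len):
    """Look up the embeddings for positions 0..seq_len-1."""
    if seq_len > max_len:
        raise ValueError(f"sequence length {seq_len} exceeds the table size {max_len}")
    return position_table[np.arange(seq_len)]


print(position_embeddings(128).shape)
```

The table size is also why such a model cannot directly handle inputs longer than 512 tokens: there is simply no row for position 512 and beyond.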

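Functional position encoder: the Transformer fixes the position embedding values with a function, given by:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

These two formulas make it easy for the model to pick up relative position information.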

image.png

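As the figure shows, the frequency and offset of the sine wave differ for each dimension, so words at different positions get different wave patterns, which is why the encoding can be said to carry relative position information.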
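A minimal NumPy sketch of the Transformer's sinusoidal encoding, assuming an even `d_model`. The check at the end illustrates the relative-position property: the dot product of two encodings depends only on the distance between the positions, not on where they are.

```python
import numpy as np


def sinusoidal_encoding(max_len, d_model):
    """Fixed position encoding: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(max_len)[:, None]          # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 0, 2, 4, ...
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe


pe = sinusoidal_encoding(64, 16)
# Both pairs are 3 positions apart, so their dot products match.
print(np.isclose(pe[5] @ pe[8], pe[20] @ pe[23]))
```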

Self-attention with relative position representations:

image.png

This paper adds trainable parameters inside self-attention and represents position information by the distance between tokens.

2.2 Functional Relative Positional Encoding

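NEZHA uses functional relative positional encoding: building on Self-attention with relative position representations above, it generates a from a function. The idea comes from the Transformer. As I understand it, NEZHA turns the Transformer's encoding scheme into a relative one and adds it as a bias inside the self-attention computation, rather than computing it at the embedding stage.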

image.png
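The following single-head NumPy sketch shows one way to read this. It assumes that a_ij is the sinusoidal function of the distance j - i, that the same a_ij is added to both the keys and the values, and that q, k, v are already projected; the function names are mine, and this is not the official implementation.

```python
import numpy as np


def relative_encoding(seq_len, d_z):
    """a[i, j] is a fixed sinusoidal function of the relative distance j - i."""
    rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    two_k = np.arange(0, d_z, 2)
    angle = rel[:, :, None] / np.power(10000.0, two_k / d_z)
    a = np.zeros((seq_len, seq_len, d_z))
    a[:, :, 0::2] = np.sin(angle)
    a[:, :, 1::2] = np.cos(angle)
    return a


def relative_self_attention(q, k, v):
    """Single-head attention with the position term added to keys and values."""
    n, d_z = q.shape
    a = relative_encoding(n, d_z)
    # e_ij = q_i . (k_j + a_ij) / sqrt(d_z)
    scores = (q @ k.T + np.einsum("id,ijd->ij", q, a)) / np.sqrt(d_z)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # z_i = sum_j alpha_ij (v_j + a_ij)
    return weights @ v + np.einsum("ij,ijd->id", weights, a)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
print(relative_self_attention(q, k, v).shape)
```

Since `relative_encoding` has no learned values, the position term adds no trainable parameters, which matches the point made in the introduction.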

2.3 Whole Word Masking

Same as BERT-wwm: whole word masking. jieba is used for word segmentation. 12% of the Chinese characters are masked, and 1.5% are randomly replaced.
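A simplified sketch of whole word masking. It assumes the sentence has already been segmented into words (for example with jieba) and treats the 12% and 1.5% figures as independent rates; the function name, the toy vocabulary, and the sampling details are illustrative, not the paper's exact procedure.

```python
import random


def whole_word_mask(words, vocab, mask_rate=0.12, replace_rate=0.015, seed=0):
    """Mask every character of the selected words; randomly replace a few others."""
    rng = random.Random(seed)
    spans, start = [], 0
    for w in words:  # character span covered by each word
        spans.append((start, start + len(w)))
        start += len(w)
    tokens = [ch for w in words for ch in w]
    budget = max(1, round(len(tokens) * mask_rate))
    masked = set()
    for idx in rng.sample(range(len(words)), len(words)):  # words in random order
        if len(masked) >= budget:
            break
        masked.update(range(*spans[idx]))  # the whole word is masked together
    output = []
    for pos, tok in enumerate(tokens):
        if pos in masked:
            output.append("[MASK]")
        elif rng.random() < replace_rate:
            output.append(rng.choice(vocab))
        else:
            output.append(tok)
    return output, sorted(masked), spans


words = ["使用", "语言", "模型", "来", "预测", "下一个", "词", "的", "概率"]
vocab = ["的", "一", "是", "在", "人"]
output, masked, spans = whole_word_mask(words, vocab)
print("".join(output))
```

The key difference from character-level masking is that a word is never left partly masked: either all of its characters become [MASK] or none do.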

2.4 Mixed Precision Training

Mixed precision training is an acceleration technique; I still need to read the related paper when I have time.
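As a placeholder until then, here is a tiny NumPy illustration of one ingredient commonly used in mixed precision training, loss scaling. This comes from general knowledge of the technique rather than from these notes, and the gradient value and scale factor are made up: very small gradients underflow to zero in FP16, so they are scaled up before the FP16 computation and scaled back down in FP32.

```python
import numpy as np

grad = 1e-8          # a small gradient value (illustrative)
loss_scale = 1024.0  # hypothetical scale factor

naive_fp16 = np.float16(grad)                # underflows to zero in half precision
scaled_fp16 = np.float16(grad * loss_scale)  # the scaled value survives in FP16
recovered = np.float32(scaled_fp16) / np.float32(loss_scale)  # unscale in FP32

print(naive_fp16, recovered)
```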

2.5 LAMB Optimizer

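LAMB is an optimizer designed for training with very large batches. It works with batch sizes above 30k, and as a result BERT training time drops from 3 days to 76 minutes. The rest of this section looks at optimizers and the differences between them.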

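SGD, BGD, and mini-batch SGD are the classic optimizers. SGD gives good results on many models and datasets but is slow, so a natural idea is momentum: combine the gradient at the current position with the gradients accumulated in the past. Adding this to SGD gives the most basic momentum-based optimization method. Note that momentum acts as a damper here: if the gradient direction swings back and forth, momentum reduces the effect of those swings on the overall direction of convergence, which speeds up convergence.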

image.png
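A minimal sketch of SGD with momentum on a one-dimensional quadratic. The learning rate, the momentum coefficient, and the test function are arbitrary choices for illustration.

```python
def sgd_momentum(grad_fn, x0, lr=0.1, momentum=0.9, steps=300):
    """Plain momentum update: velocity blends past updates with the current gradient."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad_fn(x)
        x = x + velocity
    return x


# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 5.
x_final = sgd_momentum(lambda x: 2.0 * x, x0=5.0)
print(x_final)
```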

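However, this algorithm leaves two problems unsolved:

1) The effect of the initial parameter values on convergence

2) More hyperparameters are introduced, and their settings strongly affect convergence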

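Adagrad was proposed to address the second point. The intuition: if the gradient keeps pointing in the same direction, the learning rate can be raised somewhat, and otherwise lowered. Adagrad follows a similar idea and rescales the gradient by an accumulated term (the accumulated squared gradients) to control the learning rate. This has an obvious problem: if the first few gradients point the same way, the accumulated value quickly becomes large, the effective learning rate shrinks, and the model stops moving as if it had already converged. Experiments confirm this. The method therefore requires a very small initial learning rate and works best on convex problems, which have a single optimum. RMSProp addresses this problem.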

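RMSProp uses an exponential moving average, which can be understood as a weighted average: during iteration, the older a gradient is, the less it influences the current accumulated value. The accumulated value therefore stays within a fairly stable range, without Adagrad's risk of stalling early. Many experiments also show that this method generally works very well when optimizing non-convex functions.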
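The contrast can be seen by feeding both methods the same constant gradient and comparing the effective step size lr / sqrt(accumulator). The hyperparameters and function names below are illustrative defaults, not values from the notes.

```python
import numpy as np


def adagrad_step_sizes(grads, lr=0.1, eps=1e-8):
    """Adagrad: the accumulator is a running sum of squared gradients."""
    acc, sizes = 0.0, []
    for g in grads:
        acc += g * g
        sizes.append(lr / (np.sqrt(acc) + eps))
    return sizes


def rmsprop_step_sizes(grads, lr=0.1, decay=0.9, eps=1e-8):
    """RMSProp: the accumulator is an exponential moving average of squared gradients."""
    avg, sizes = 0.0, []
    for g in grads:
        avg = decay * avg + (1 - decay) * g * g
        sizes.append(lr / (np.sqrt(avg) + eps))
    return sizes


grads = [1.0] * 200  # the same gradient at every step
adagrad_sizes = adagrad_step_sizes(grads)
rmsprop_sizes = rmsprop_step_sizes(grads)
print(adagrad_sizes[-1], rmsprop_sizes[-1])
```

With Adagrad the step size keeps shrinking as the sum grows, while with RMSProp it settles near the base learning rate.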

Adam came next.
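Adam combines the momentum idea with RMSProp-style scaling, and LAMB, the subject of this section, builds on Adam by rescaling each layer's update with a trust ratio. The sketch below reflects my understanding of the two updates for a single layer; the hyperparameter values, the state dictionary, and the function names are assumptions for illustration.

```python
import numpy as np


def new_state(shape):
    """Per-layer optimizer state: step count and the two moment estimates."""
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape)}


def adam_direction(g, state, beta1=0.9, beta2=0.999, eps=1e-6):
    """Adam: bias-corrected first moment divided by the root of the second moment."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return m_hat / (np.sqrt(v_hat) + eps)


def lamb_step(w, g, state, lr=0.01, weight_decay=0.01):
    """LAMB: scale the Adam update by the layer-wise trust ratio ||w|| / ||update||."""
    update = adam_direction(g, state) + weight_decay * w
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust * update


w = np.array([3.0, -4.0])  # weights of one "layer", norm 5
g = np.array([0.5, -2.0])  # its gradient
w_next = lamb_step(w, g, new_state(w.shape))
print(np.linalg.norm(w_next - w))
```

Because of the trust ratio, the size of each layer's step is tied to the norm of that layer's weights (here lr * 5) rather than to the raw gradient scale, which is a large part of why very large batches remain stable.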

3 Experiments

The experiments compare NEZHA against Google's BERT, BERT-wwm, and ERNIE. The Chinese evaluation datasets are:

• CMRC (Chinese Machine Reading Comprehension 2018) [16]: A machine reading comprehension task that returns an answer span in a given passage for a given question.
• XNLI (Cross-lingual Natural Language Inference) [17]: The Chinese portion of XNLI, which is a version of MultiNLI where the dev and test sets have been translated (by humans) into 15 languages. XNLI is a natural language inference task. The goal of this task is to predict if the second sentence is a contradiction, entailment or neutral to the first sentence.
• LCQMC (Large-scale Chinese Question Matching Corpus) [18]: A sentence pair matching task. Given a pair of sentences, the task is to determine if the two sentences are semantically equivalent or not.
• PD-NER (People's Daily Named Entity Recognition): A sequence labeling task that identifies the named entities from text. The corpus is from People's Daily, a Chinese News Media.
• ChnSenti (Chinese Sentiment Classification): A binary classification task which predicts if the sentiment of a given sentence is positive or negative.


Conclusion

Same as the abstract.
