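GPT-2.0, the language model that can spout nonsense with a perfectly straight face, has been all the rage these past two days. Citing concerns that the model is too powerful and could be misused by bad actors, OpenAI released only the 117M-parameter model, which is less than one tenth of the advertised 1.5 billion parameters. That in turn set off an "OpenAI vs. CloseAI" war of words, and onlookers never mind a bigger spectacle. The 2019 Spring Festival has barely ended and the AI world is already in an uproar, which suggests AI will keep charging ahead this year. I hope to be part of it and keep up with the pack.
Enough preamble. Having watched the show, I could not wait to download the released 117M GPT-2.0 model and run it. I also read the official paper, Language Models are Unsupervised Multitask Learners. The paper says relatively little about how the model was trained and focuses mostly on showing off experimental results. Here is a summary of the paper: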
1. Training data: To obtain a diverse, large, and high-quality training set, the authors only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be very expensive, so they scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. The result is over 8 million documents for a total of 40 GB of text, which serves as the training data.
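The link-selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' scraper: the `posts` structure, the `select_links` helper, and the check that skips links pointing back to Reddit are all assumptions of mine. Only the threshold of 3 karma comes from the paper.

```python
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold quoted in the paper summary above

def select_links(posts, min_karma=MIN_KARMA):
    """Return de-duplicated outbound URLs from posts with enough karma.

    `posts` is a hypothetical list of dicts with "url" and "karma" keys.
    """
    seen = set()
    selected = []
    for post in posts:
        if post["karma"] < min_karma:
            continue  # not enough human endorsement
        url = post["url"]
        # Assumption: a link back to Reddit itself is not "outbound".
        host = urlparse(url).netloc.lower()
        if host.endswith("reddit.com"):
            continue
        if url not in seen:  # keep the first occurrence only
            seen.add(url)
            selected.append(url)
    return selected

posts = [
    {"url": "https://example.com/a", "karma": 5},
    {"url": "https://example.com/b", "karma": 1},
    {"url": "https://example.com/a", "karma": 7},
    {"url": "https://www.reddit.com/r/x", "karma": 9},
]
print(select_links(posts))
```

The point of the karma threshold is that it reuses human judgment that already exists, instead of paying annotators to filter pages.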
2. Input representation: Rather than word-level or character-level input, the authors use Byte Pair Encoding (BPE). They prevent BPE from merging across character categories for any byte sequence, and add an exception for spaces which significantly improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens. Because this approach can assign a probability to any Unicode string, the language model can be applied to any dataset without dataset-specific pre-processing.
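To make the idea concrete, here is a toy byte-level BPE in Python. It is a minimal sketch and not the GPT-2 tokenizer: splitting on whitespace stands in for the character-category rule, the space exception is left out, and the function names `learn_bpe`, `merge_pair`, and `encode` are my own.

```python
from collections import Counter

def to_bytes(word):
    """Split a string into single-byte symbols via UTF-8."""
    return tuple(bytes([b]) for b in word.encode("utf-8"))

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(text, num_merges):
    """Learn up to `num_merges` merges; pairs never cross whitespace-split words."""
    words = [to_bytes(w) for w in text.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = [merge_pair(w, best) for w in words]
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = to_bytes(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = learn_bpe("low low low lower", num_merges=2)
print(encode("lowest", merges))
print(encode("naïve", merges))  # non-ASCII input still encodes, byte by byte
```

Because the base vocabulary is the 256 possible bytes, there is no out-of-vocabulary input: anything unseen simply falls back to shorter byte sequences.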
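3. Model: The paper is brief here. It first states that the model uses the Transformer, and then that it is based on the OpenAI GPT model with a few modifications. The figure below shows the OpenAI GPT model:
[Figure: the OpenAI GPT model]
The small modifications made on top of it are:
(1) Layer normalization is moved to the input of each sub-block;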
(2) An additional layer normalization is added after the final self-attention block;
(3) The weights of residual layers are scaled at initialization by a factor of 1/√N, where N is the number of residual layers;
(4) The vocabulary is expanded to 50,257;
(5) The context size is increased from 512 to 1024 tokens;
(6) The batch size is increased to 512.
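Modifications (1) and (3) are easier to see in code. The sketch below uses plain Python lists instead of a deep learning framework, and a single linear map stands in for an attention or feed-forward sub-block. Several things here are my assumptions and not details from the paper: counting two residual sub-blocks per Transformer layer when computing N, the base standard deviation of 0.02, and the omission of the learned gain and bias in layer normalization.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, w):
    """Multiply vector x by matrix w, indexed as w[input][output]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def post_ln_sublayer(x, w):
    """Original GPT ordering: LayerNorm(x + f(x))."""
    h = linear(x, w)
    return layer_norm([a + b for a, b in zip(x, h)])

def pre_ln_sublayer(x, w):
    """GPT-2 ordering: x + f(LayerNorm(x)), normalization at the sub-block input."""
    h = linear(layer_norm(x), w)
    return [a + b for a, b in zip(x, h)]

def residual_scale(n_layers, sublayers_per_layer=2):
    """1/sqrt(N); N = n_layers * sublayers_per_layer is an assumed way to count."""
    return 1.0 / math.sqrt(n_layers * sublayers_per_layer)

def init_residual_weight(d_model, n_layers, std=0.02, seed=0):
    """Gaussian init (std is an assumption) scaled down by 1/sqrt(N)."""
    rng = random.Random(seed)
    scale = residual_scale(n_layers)
    return [[rng.gauss(0.0, std) * scale for _ in range(d_model)]
            for _ in range(d_model)]

x = [1.0, 2.0, 3.0, 4.0]
w = init_residual_weight(d_model=4, n_layers=12)
print(pre_ln_sublayer(x, w))   # the residual stream x passes through unnormalized
print(post_ln_sublayer(x, w))  # the whole sum is renormalized
```

In the pre-LN form the residual path is an untouched identity, so contributions from every layer accumulate on it, and the 1/√N scaling keeps that accumulated sum from growing with depth.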
4. Experiments: This is the part the paper emphasizes most. It demonstrates the strength of the GPT-2.0 model on the following tasks:
(1) Zero-shot domain transfer

(2) Children's Book Test: examines the performance of LMs on different categories of words: named entities, nouns, verbs, and prepositions.
(3) LAMBADA: tests the ability of systems to model long-range dependencies in text.
(4) Winograd Schema Challenge: measures the capability of a system to perform commonsense reasoning by measuring its ability to resolve ambiguities in text.
(5) Reading Comprehension: CoQA tests reading comprehension capabilities and also the ability of models to answer questions that depend on conversation history.
(6) Summarization: tests GPT-2's ability to perform summarization on the CNN and Daily Mail dataset.
(7) Translation: tests the English-French translation task.
(8) Question Answering: evaluates how often the model generates the correct answer to factoid-style questions.
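All of these tasks are run zero-shot, so the task has to be expressed in the input text itself. As I read the paper, summarization is induced by appending "TL;DR:" after the article, and translation by conditioning on example pairs of the form "english sentence = french sentence" followed by a final "english sentence =". The sketch below only builds such prompts. The helper names, the newline separators, and the sample sentences are my assumptions, and no model is called.

```python
def summarization_prompt(article):
    """Append the TL;DR: cue so that a continuation reads as a summary."""
    return article.strip() + "\nTL;DR:"

def translation_prompt(example_pairs, source_sentence):
    """Show `english = french` pairs, then leave the last right-hand side open."""
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)

pairs = [("good morning", "bonjour"), ("thank you", "merci")]
print(summarization_prompt("Some long news article text."))
print(translation_prompt(pairs, "see you tomorrow"))
```

Whatever the language model generates after the prompt is then taken as the answer, with no task-specific fine-tuning.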
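Finally, I recommend a freshly published analysis by Dr. Zhang Junlin (張俊林): 效果驚人的GPT 2.0模型:它告訴了我們什么 ("The Stunningly Effective GPT 2.0 Model: What Does It Tell Us?")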
