Keywords: BERT, pre-trained model, fine-tuning
Summary
- Overview of the BERT source repository
- Introduction to the MRPC task
- Input layer and data format requirements
- BERT model layer and the Transformer structure
- Downstream fine-tuning network
- Transferring pre-trained model parameters
Overview of the BERT source repository
The code lives in the GitHub repository google-research/bert, which contains several Python scripts, including
- modeling.py: defines BERT's network structure, mainly the transformer, embedding, and pooling modules
- run_classifier.py: launches a text-classification task on top of the BERT network; if a pre-trained checkpoint is specified, training continues from its parameters, i.e. fine-tuning
- run_pretraining.py: BERT's pre-training stage, covering the NSP and MLM tasks
- create_pretraining_data.py: builds the pre-training data
- tokenization.py: sentence-processing utilities, including tokenization, punctuation handling, and text normalization
- run_squad.py: configures and launches a BERT-based question-answering task on the SQuAD dataset
- extract_features.py: computes sentence vectors with BERT
- optimization.py: defines the optimizer used for model training
The previous article covered the basic concepts of BERT, the Transformer, pre-trained models, and fine-tuning and how they relate. This article walks through the official BERT source code, starting from the easiest entry point: fine-tuning a pre-trained model on the MRPC task via the run_classifier.py script.
Introduction to the MRPC task
The goal of MRPC is, given two sentences, to decide whether they express the same meaning, i.e. binary classification over a sentence pair. A sample row looks like this
Quality #1 ID #2 ID #1 String #2 String
1 702876 702977 Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .
The first column is the label: 1 means the two sentences have the same meaning, 0 means they do not. The remaining columns are sentence 1's ID, sentence 2's ID, sentence 1's text, and sentence 2's text. The model therefore receives a pair of sentences as input, which is essentially the same input format BERT uses during pre-training: pre-training learns semantic knowledge from such pairs without supervision, while fine-tuning transfers the pre-trained parameters and learns the classification task on top of the same kind of input.
Download the pre-trained model uncased_L-2_H-128_A-2.zip (2 transformer layers, 128-dimensional hidden size, 2 attention heads, i.e. BERT-Tiny) and the corresponding MRPC data, then the following command runs training and evaluation end to end
python run_classifier.py --task_name=MRPC --do_train=true --do_eval=true --data_dir=./bert-master/GLUE_MRPC --vocab_file=./bert-master/bert_base_model/vocab.txt --bert_config_file=./bert-master/bert_base_model/bert_config.json --max_seq_length=128 --train_batch_size=32 --init_checkpoint=./bert-master/bert_base_model/bert_model.ckpt --learning_rate=2e-5 --num_train_epochs=3 --output_dir=/tmp/mrpc_output
After the run finishes, the dev-set results are as follows; dev accuracy reaches 0.710.
I0814 21:25:02.392750 140513357039424 run_classifier.py:993] ***** Eval results *****
INFO:tensorflow: eval_accuracy = 0.7107843
INFO:tensorflow: eval_loss = 0.56687844
INFO:tensorflow: global_step = 343
INFO:tensorflow: loss = 0.56687844
Input layer and data format requirements
Open run_classifier.py and start from the main entry point. The first step builds the data: an MRPC data-processing class, MrpcProcessor, is instantiated. It reads the train, dev, and test files and loads each row's label, sentence 1, and sentence 2 into in-memory Python collections.
# instantiate MrpcProcessor
processor = processors[task_name]()
Next a tokenizer is instantiated; during example construction below it provides whitespace tokenization, WordPiece tokenization, and token-to-id / id-to-token conversion.
# instantiate the tokenizer
tokenizer = tokenization.FullTokenizer(
    # do_lower_case is True here
    vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
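For intuition, the tokenizer can be exercised on its own; a small illustrative snippet (the sentence is taken from the logged example further below, and the round trip back to tokens is only for demonstration):
tokens = tokenizer.tokenize("The foodservice pie business does not fit our long-term growth strategy.")
# WordPiece splits out-of-vocabulary words, e.g. "foodservice" -> foods ##er ##vic ##e
input_ids = tokenizer.convert_tokens_to_ids(tokens)
tokens_back = tokenizer.convert_ids_to_tokens(input_ids)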
Then the sample data is read into Python memory; the call returns a list of InputExample objects.
train_examples = processor.get_train_examples(FLAGS.data_dir)
Each InputExample holds a globally unique example id, sentence 1, sentence 2, and the label.
class InputExample(object):
  def __init__(self, guid, text_a, text_b=None, label=None):
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label
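For MRPC, get_train_examples reads train.tsv and wraps every row into such an object; a sketch of the row-to-example conversion inside MrpcProcessor (column indices follow the sample row shown earlier):
examples = []
for (i, line) in enumerate(lines):      # each line is a tab-split row of the TSV file
  if i == 0:
    continue                            # skip the header row
  guid = "%s-%s" % (set_type, i)        # e.g. "train-1", "dev-1"
  text_a = tokenization.convert_to_unicode(line[3])   # #1 String
  text_b = tokenization.convert_to_unicode(line[4])   # #2 String
  label = tokenization.convert_to_unicode(line[0])    # Quality column, "0" or "1"
  examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))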
The next step converts the in-memory examples into TFRecord files on disk. When the dataset is large it cannot all be loaded into memory at once, and data IO during training would hurt throughput, so the data is converted to TFRecord first for efficiency.
train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
# convert the list of InputExample into TFRecord and write it to /tmp/mrpc_output/train.tf_record
file_based_convert_examples_to_features(
    train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
The in-memory examples are traversed one by one and each is turned into key:value features
feature = convert_single_example(ex_index, example, label_list,
                                 max_seq_length, tokenizer)

def create_int_feature(values):
  f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
  return f

# five features, i.e. five columns per example
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
# wrapped in a list because label_id is a scalar
features["label_ids"] = create_int_feature([feature.label_id])
features["is_real_example"] = create_int_feature(
    [int(feature.is_real_example)])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
# written to the TFRecord file one example at a time
writer.write(tf_example.SerializeToString())
The logic that converts a single InputExample into TFRecord format is as follows. First the sentences are truncated to the maximum length of 128
if tokens_b:
  # if tokens_a plus tokens_b exceed the limit, truncate them evenly down to
  # max_seq_length - 3; three slots are reserved for [CLS], [SEP], [SEP]
  _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
  # Account for [CLS] and [SEP] with "- 2"
  if len(tokens_a) > max_seq_length - 2:
    tokens_a = tokens_a[0:(max_seq_length - 2)]
When BERT's input is a single sentence there are two special tokens, [CLS] and [SEP]; when the input is a sentence pair there are three, [CLS], [SEP], [SEP], so the maximum token length is reduced by 2 or 3 accordingly. Tokens beyond the limit are dropped from the right, and for a sentence pair the longer sentence is truncated first, as the sketch below shows.
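The body of _truncate_seq_pair is essentially the following loop: it pops one token at a time from whichever sentence is currently longer until the pair fits within max_length (a sketch of the helper):
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
  # truncate the pair in place, always shortening the longer sentence first
  while True:
    total_length = len(tokens_a) + len(tokens_b)
    if total_length <= max_length:
      break
    if len(tokens_a) > len(tokens_b):
      tokens_a.pop()
    else:
      tokens_b.pop()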
Next the token ids and type_ids that BERT expects are constructed; the author's comment in the code reads
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
Single sentences and sentence pairs were covered above. The new field type_ids marks which sentence each token belongs to: every position of the first sentence gets 0 and every position of the second sentence gets 1.
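The assembly of tokens and segment_ids follows this convention directly; a sketch of the corresponding loop in convert_single_example:
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
  tokens.append(token)
  segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
  # the second sentence and its trailing [SEP] all get type id 1
  for token in tokens_b:
    tokens.append(token)
    segment_ids.append(1)
  tokens.append("[SEP]")
  segment_ids.append(1)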
With [CLS] and [SEP] spliced into the token sequence, every token is then converted into its numeric id
input_ids = tokenizer.convert_tokens_to_ids(tokens)
On top of this the input mask is built: every real token gets a 1, and the zero padding added below gets 0
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
# inputs longer than 128 were already truncated above, so this loop only adds padding
while len(input_ids) < max_seq_length:
  # all padding positions are 0, i.e. the [PAD] token
  input_ids.append(0)
  input_mask.append(0)
  segment_ids.append(0)
With the features built, the author logs five examples; one of them is shown below, with its tokens, input_ids, input_mask, and segment_ids.
INFO:tensorflow:guid: dev-1
INFO:tensorflow:tokens: [CLS] he said the foods ##er ##vic ##e pie business doesn ' t fit the company ' s long - term growth strategy . [SEP] " the foods ##er ##vic ##e pie business does not fit our long - term growth strategy . [SEP]
INFO:tensorflow:input_ids: 101 2002 2056 1996 9440 2121 7903 2063 11345 2449 2987 1005 1056 4906 1996 2194 1005 1055 2146 1011 2744 3930 5656 1012 102 1000 1996 9440 2121 7903 2063 11345 2449 2515 2025 4906 2256 2146 1011 2744 3930 5656 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 1 (id = 1)
BERT model layer and the Transformer structure
First the configurable parameters of the pre-trained BERT model are loaded as a dictionary.
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
{"hidden_size": 128, "hidden_act": "gelu", "initializer_range": 0.02, "vocab_size": 30522, "hidden_dropout_prob": 0.1, "num_attention_heads": 2, "type_vocab_size": 2, "max_position_embeddings": 512, "num_hidden_layers": 2, "intermediate_size": 512, "attention_probs_dropout_prob": 0.1}
They mainly include the hidden size (the embedding dimension of the vocabulary and of each token after multi-head self-attention), the hidden-layer activation function, the vocabulary size, the dropout rates, and so on.
The next step computes the total number of training steps and how many of them are allotted to learning-rate warm-up.
# total number of steps = 343
num_train_steps = int(
    # train_batch_size = 32
    # num_train_epochs = 3
    len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
# warmup_proportion = 0.1: warm the learning rate up from a small value before switching to the configured rate; 34 steps here
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
warmup_proportion=0.1 means that during the first 10% of the steps the learning rate grows from a very small value up to the configured learning rate. The goal is to take small steps early in training: the initial parameters may not match the new task, and too large a learning rate at that point could prevent convergence.
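A quick sanity check of the step counts (assuming the MRPC training set has 3,668 pairs, which is consistent with the 343 steps in the log above):
num_examples = 3668                                # assumed size of the MRPC training set
num_train_steps = int(num_examples / 32 * 3)       # int(343.875) = 343
num_warmup_steps = int(num_train_steps * 0.1)      # 34 warm-up steps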
The next step builds the model function
model_fn = model_fn_builder(
bert_config=bert_config,
num_labels=len(label_list),
init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate,
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
use_tpu=FLAGS.use_tpu,
use_one_hot_embeddings=FLAGS.use_tpu)
Stepping into model_fn, the author creates the BERT model there, feeds it the features produced from the TFRecord file, and runs one forward pass against the binary-classification labels to obtain the loss
# forward pass
(total_loss, per_example_loss, logits, probabilities) = create_model(
    bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
    num_labels, use_one_hot_embeddings)
Stepping into create_model, this function defines the BERT model
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
Stepping into BertModel, here is BERT's model structure. First the author builds an embedding lookup table over the whole vocabulary, with dimension 128, and looks up input_ids in it to obtain dense vectors.
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids.
# 給出輸入的embedding映射,以及l(fā)ookup表
(self.embedding_output, self.embedding_table) = embedding_lookup(
input_ids=input_ids,
vocab_size=config.vocab_size,
embedding_size=config.hidden_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)
The resulting embeddings are then post-processed: the segment-id and position-id embeddings are added to them, followed by layer normalization and dropout.
self.embedding_output = embedding_postprocessor(
input_tensor=self.embedding_output,
use_token_type=True,
token_type_ids=token_type_ids,
token_type_vocab_size=config.type_vocab_size,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=config.initializer_range,
max_position_embeddings=config.max_position_embeddings,
dropout_prob=config.hidden_dropout_prob)
When building the position encodings, the author simply slices out the first seq_length (here 128) rows of the full position-embedding table as the position encodings, because this example caps the input length at 128
position_embeddings = tf.slice(full_position_embeddings, [0, 0],
[seq_length, -1])
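Inside embedding_postprocessor the combination itself is just an element-wise sum followed by layer norm and dropout; a simplified sketch of that step (tensor names abbreviated):
output = word_embedding_output          # [batch, seq_length, 128] from embedding_lookup
output += token_type_embeddings         # lookup of segment_ids (0 or 1) into a (2, 128) table
output += position_embeddings           # the sliced [seq_length, 128] table, broadcast over the batch
output = layer_norm_and_dropout(output, config.hidden_dropout_prob)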
Once the three embeddings are summed, the input is ready and the model enters the encoder stage. The encoder is identical to the Transformer model and is not expanded here
self.all_encoder_layers = transformer_model(
    input_tensor=self.embedding_output,
    attention_mask=attention_mask,
    hidden_size=config.hidden_size,  # 128
    num_hidden_layers=config.num_hidden_layers,  # 2
    num_attention_heads=config.num_attention_heads,  # 2
    intermediate_size=config.intermediate_size,  # 512
    intermediate_act_fn=get_activation(config.hidden_act),  # gelu activation
    hidden_dropout_prob=config.hidden_dropout_prob,  # 0.1
    attention_probs_dropout_prob=config.attention_probs_dropout_prob,  # 0.1
    initializer_range=config.initializer_range,  # 0.02
    do_return_all_layers=True)
The output self.all_encoder_layers is a list holding the result of every transformer block; the author takes the last block's result as the output of the whole encoder
self.sequence_output = self.all_encoder_layers[-1]
Finally the author takes the output embedding at the first position, the [CLS] token, and passes it through one more fully connected layer
with tf.variable_scope("pooler"):
# 0:1第一個(gè)詞 => [None, 1, 128] => [None, 128]
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
# [None, 128]
# 取第一個(gè)詞做全連接
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size, # 128
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
The final self.pooled_output is a tensor of shape [None, 128].
Downstream fine-tuning network
The downstream task is MRPC classification: deciding whether two sentences mean the same thing. The author takes the Transformer-side output self.pooled_output, adds one fully connected layer that maps it onto the labels, and computes the loss
output_layer = model.get_pooled_output()
hidden_size = output_layer.shape[-1].value
# output_weights and output_bias are the only variables not restored from the checkpoint
output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
    "output_bias", [num_labels], initializer=tf.zeros_initializer())
with tf.variable_scope("loss"):
  if is_training:
    # I.e., 0.1 dropout
    output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
  # hand-written negative log-likelihood loss
  # projects the pooled output directly onto the label space
  logits = tf.matmul(output_layer, output_weights, transpose_b=True)
  logits = tf.nn.bias_add(logits, output_bias)
  probabilities = tf.nn.softmax(logits, axis=-1)
  log_probs = tf.nn.log_softmax(logits, axis=-1)
  one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
  per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
  loss = tf.reduce_mean(per_example_loss)
When the mode is training, the author also writes the training op by hand
if mode == tf.estimator.ModeKeys.TRAIN:
  # gradients are clipped once inside create_optimizer; this is the backward pass
  train_op = optimization.create_optimizer(
      total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
  output_spec = tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode,
      loss=total_loss,
      train_op=train_op,
      scaffold_fn=scaffold_fn)
This train_op contains the learning-rate warm-up logic
if num_warmup_steps:
  global_steps_int = tf.cast(global_step, tf.int32)
  warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)

  global_steps_float = tf.cast(global_steps_int, tf.float32)
  warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)

  # as global_steps_float grows, warmup_learning_rate grows linearly
  warmup_percent_done = global_steps_float / warmup_steps_float
  warmup_learning_rate = init_lr * warmup_percent_done

  is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
  # the effective rate is either learning_rate or warmup_learning_rate
  learning_rate = (
      (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
The author also implemented the optimizer himself, an Adam variant that covers weight decay and the iterative parameter-update logic.
optimizer = AdamWeightDecayOptimizer(
learning_rate=learning_rate,
weight_decay_rate=0.01,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
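The per-parameter update inside AdamWeightDecayOptimizer boils down to the standard Adam moment estimates plus a decoupled weight-decay term; roughly (a sketch, not the verbatim source):
next_m = beta_1 * m + (1.0 - beta_1) * grad           # first-moment estimate
next_v = beta_2 * v + (1.0 - beta_2) * grad * grad    # second-moment estimate
update = next_m / (tf.sqrt(next_v) + epsilon)
if use_weight_decay:                                  # skipped for LayerNorm and bias variables
  update += weight_decay_rate * param
param = param - learning_rate * update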
Next the estimator is constructed; note that if run_config points to a model_dir that already contains a trained model, EVAL mode reads that model directly for evaluation
estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=FLAGS.use_tpu,  # False here
    model_fn=model_fn,
    config=run_config,
    train_batch_size=FLAGS.train_batch_size,  # 32
    eval_batch_size=FLAGS.eval_batch_size,  # 8
    predict_batch_size=FLAGS.predict_batch_size)  # 8
Finally the TFRecord file on disk is read, an input function is built from it, and it is handed to the estimator for training
train_input_fn = file_based_input_fn_builder(
input_file=train_file,
seq_length=FLAGS.max_seq_length,
is_training=True,
drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
Transferring pre-trained model parameters
After building the model, the author uses tf.train.init_from_checkpoint to copy the pre-trained parameters from the checkpoint, unchanged, onto the identically named variables of the newly built BERT network
if init_checkpoint:
  (assignment_map, initialized_variable_names
  ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
  if use_tpu:

    def tpu_scaffold():
      tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
      return tf.train.Scaffold()

    scaffold_fn = tpu_scaffold
  else:
    # restore the values of identically named variables from the checkpoint into the new network
    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
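get_assignment_map_from_checkpoint only matches variable names: every checkpoint variable whose name also exists among the new graph's trainable variables is mapped onto itself. A sketch of its core logic:
import re
import tensorflow as tf

def sketch_assignment_map(tvars, init_checkpoint):
  # strip the ":0" suffix from each trainable variable's name
  name_to_variable = {}
  for var in tvars:
    name = re.match("^(.*):\\d+$", var.name).group(1)
    name_to_variable[name] = var
  assignment_map = {}
  initialized_variable_names = {}
  # map every checkpoint variable that has a same-named counterpart in the new graph onto itself
  for name, _ in tf.train.list_variables(init_checkpoint):
    if name in name_to_variable:
      assignment_map[name] = name
      initialized_variable_names[name + ":0"] = 1
  return assignment_map, initialized_variable_names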
The transferred parameters are listed below
| Variable name | Shape | Restored from ckpt |
|---|---|---|
| bert/embeddings/word_embeddings:0, | (30522,128), | INIT_FROM_CKPT |
| bert/embeddings/token_type_embeddings:0, | (2,128), | INIT_FROM_CKPT |
| bert/embeddings/position_embeddings:0, | (512,128), | INIT_FROM_CKPT |
| bert/embeddings/LayerNorm/beta:0, | (128), | INIT_FROM_CKPT |
| bert/embeddings/LayerNorm/gamma:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/self/query/kernel:0, | (128,128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/self/query/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/self/key/kernel:0, | (128,128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/self/key/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/self/value/kernel:0, | (128,128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/self/value/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/output/dense/kernel:0, | (128,128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/output/dense/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/output/LayerNorm/beta:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/attention/output/LayerNorm/gamma:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/intermediate/dense/kernel:0, | (128,512), | INIT_FROM_CKPT |
| bert/encoder/layer_0/intermediate/dense/bias:0, | (512), | INIT_FROM_CKPT |
| bert/encoder/layer_0/output/dense/kernel:0, | (512,128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/output/dense/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/output/LayerNorm/beta:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_0/output/LayerNorm/gamma:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/self/query/kernel:0, | (128,128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/self/query/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/self/key/kernel:0, | (128,128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/self/key/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/self/value/kernel:0, | (128,128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/self/value/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/output/dense/kernel:0, | (128,128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/output/dense/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/output/LayerNorm/beta:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/attention/output/LayerNorm/gamma:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/intermediate/dense/kernel:0, | (128,512), | INIT_FROM_CKPT |
| bert/encoder/layer_1/intermediate/dense/bias:0, | (512), | INIT_FROM_CKPT |
| bert/encoder/layer_1/output/dense/kernel:0, | (512,128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/output/dense/bias:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/output/LayerNorm/beta:0, | (128), | INIT_FROM_CKPT |
| bert/encoder/layer_1/output/LayerNorm/gamma:0, | (128), | INIT_FROM_CKPT |
| bert/pooler/dense/kernel:0, | (128,128), | INIT_FROM_CKPT |
| bert/pooler/dense/bias:0, | (128), | INIT_FROM_CKPT |
| output_weights:0, | (2,128), | |
| output_bias:0, | (2,) | |
Only the fully connected weights and bias of the fine-tuning head are freshly initialized by the new model; every other parameter is transferred from the pre-trained model, and all of them are then trained together. The pre-trained parameters include the word embeddings, position encodings, segment embeddings, the Q/K/V and feed-forward parameters of every Transformer layer, and the pooler's dense layer. There are 4,386,178 parameters in total, about 4.4 million, of which the word embeddings alone account for about 3.9 million, roughly 89%. So the better the word embeddings learned during pre-training, the easier fine-tuning becomes, because adjusting the word embeddings is by far the largest share of the fine-tuned parameters; the total can be checked as sketched below.
If init_checkpoint is None, no pre-trained model is used and training starts from the network's randomly initialized embeddings instead; the command is
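A back-of-the-envelope verification of that total from the bert_config values shown above (not code from the repository):
H, V, P, T, I, L = 128, 30522, 512, 2, 512, 2          # hidden, vocab, positions, type vocab, intermediate, layers
embeddings = V * H + T * H + P * H + 2 * H             # word + segment + position embeddings + LayerNorm
per_layer = (3 * (H * H + H)                           # Q, K, V projections
             + H * H + H + 2 * H                       # attention output dense + LayerNorm
             + H * I + I                               # intermediate dense
             + I * H + H + 2 * H)                      # output dense + LayerNorm
pooler = H * H + H
classifier = 2 * H + 2                                 # output_weights + output_bias
total = embeddings + L * per_layer + pooler + classifier
print(total)                                           # 4386178
print(V * H / total)                                   # ≈ 0.89, the word-embedding share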
python run_classifier.py --task_name=MRPC --do_train=true --do_eval=true --data_dir=./bert-master/GLUE_MRPC --vocab_file=./bert-master/bert_base_model/vocab.txt --bert_config_file=./bert-master/bert_base_model/bert_config.json --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3 --output_dir=/tmp/mrpc_output
The dev-set results are shown below; accuracy reaches only 0.683
I0814 21:21:21.268342 140291776911168 run_classifier.py:993] ***** Eval results *****
INFO:tensorflow: eval_accuracy = 0.6838235
INFO:tensorflow: eval_loss = 0.6237707
INFO:tensorflow: global_step = 343
INFO:tensorflow: loss = 0.6237707
That is roughly 3 points below the 0.71 reached with the pre-trained model, which illustrates the advantage of pre-training on large data followed by fine-tuning over modeling a single dataset directly from scratch.