Almost every NLP task has to decide how to build its vocabulary, and the right choice depends on the task at hand.
- Character-level vocabularies are used in some Chinese tasks such as NER and BERT, but in English tasks they can make learning dramatically harder.
- Word-level vocabularies can blow up in size, and individual words may occur too rarely to be trained adequately.
Neural Machine Translation of Rare Words with Subword Units introduced the notion of subwords, together with an algorithm (BPE) that builds a vocabulary from the training data. Rather than rehashing BPE here, the following walks through how the tensor2tensor framework builds a subword vocabulary.
Construction
tensor2tensor's text_encoder.SubwordTextEncoder class provides vocabulary construction as well as encoding and decoding against a known vocabulary; the construction part is the focus here. The class method build_from_generator builds a subword vocabulary of a requested size from a text generator, roughly as follows:
- Count all tokens; a token may be an English word or a Chinese word segment.
- build_to_target_size binary-searches between the initial min_val and max_val for a min_token_count such that the resulting vocabulary size approximates target_size; the core construction work is done by build_from_token_counts.
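The binary-search idea can be sketched as follows. This is a minimal illustration, not the tensor2tensor code: build_vocab is a hypothetical stand-in for the real construction call, and the 1% stopping tolerance is an illustrative choice.

```python
def binary_search_min_count(build_vocab, target_size, min_val, max_val):
    """Find a min_count whose vocabulary size is close to target_size.

    `build_vocab` is a hypothetical stand-in for build_from_token_counts:
    it takes a minimum frequency and returns the resulting vocabulary.
    A higher min_count means fewer subtokens survive, so the vocabulary
    size decreases monotonically as min_count grows.
    """
    mid = (min_val + max_val) // 2
    vocab = build_vocab(mid)
    # Stop when the search interval collapses or we are within ~1%.
    if min_val >= max_val or abs(len(vocab) - target_size) * 100 < target_size:
        return mid, vocab
    if len(vocab) > target_size:
        # Vocabulary too large: raise the frequency threshold.
        return binary_search_min_count(build_vocab, target_size, mid + 1, max_val)
    # Vocabulary too small: lower the threshold.
    return binary_search_min_count(build_vocab, target_size, min_val, mid - 1)
```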
The main idea of build_from_token_counts is to pick, from all character n-grams, a set of subtokens that are both long and frequent.
- Initialize the vocabulary with all single characters plus the reserved tokens; the reserved tokens normally include the padding symbol <pad> and the end-of-sentence symbol <EOS>, and the escape characters are _ESCAPE_CHARS = set(u"\\_u;0123456789").

```python
alphabet_tokens = chain(six.iterkeys(token_counts),
                        [native_to_unicode(t) for t in reserved_tokens])
self._init_alphabet_from_tokens(alphabet_tokens)

# Bootstrap the initial list of subtokens with the characters from the
# alphabet plus the escaping characters.
self._init_subtokens_from_list(list(self._alphabet),
                               reserved_tokens=reserved_tokens)
```

- For each iteration:
a. Encode every original token with the current subtoken vocabulary and count the candidate substrings. Here _escape_token appends the marker '_' to the end of the token and replaces characters absent from the alphabet (OOV) with their Unicode code points.

```python
subtoken_counts = collections.defaultdict(int)
for token, count in six.iteritems(token_counts):
  iter_start_time = time.time()
  escaped_token = _escape_token(token, self._alphabet)
  subtokens = self._escaped_token_to_subtoken_strings(escaped_token)
  start = 0
  for subtoken in subtokens:
    last_position = len(escaped_token) + 1
    if max_subtoken_length is not None:
      last_position = min(last_position, start + max_subtoken_length)
    for end in range(start + 1, last_position):
      new_subtoken = escaped_token[start:end]
      subtoken_counts[new_subtoken] += count
    start += len(subtoken)
```

The for loop over subtokens may be a little opaque; two examples:
```python
# Example 1
subtokens = ['l', 'o', 'w', 'e', 'r', '_']
new_subtokens = ['l', 'lo', 'low', 'lowe', 'lower', 'lower_',
                 'o', 'ow', 'owe', 'ower', 'ower_',
                 'w', 'we', 'wer', 'wer_',
                 'e', 'er', 'er_',
                 'r', 'r_',
                 '_']

# Example 2
subtokens = ['low', 'er_']
new_subtokens = ['l', 'lo', 'low', 'e', 'er', 'er_']
```

b. Bucket subtoken_counts by subtoken length.
```python
len_to_subtoken_strings = []
for subtoken_string, count in six.iteritems(subtoken_counts):
  lsub = len(subtoken_string)
  if count >= min_count:
    while len(len_to_subtoken_strings) <= lsub:
      len_to_subtoken_strings.append(set())
    len_to_subtoken_strings[lsub].add(subtoken_string)
```

c. Traverse all subtokens from longest to shortest, keep those whose count reaches the preset min_count, and subtract each kept subtoken's count from the counts of its prefixes.
```python
new_subtoken_strings = []
for lsub in range(len(len_to_subtoken_strings) - 1, 0, -1):
  subtoken_strings = len_to_subtoken_strings[lsub]
  for subtoken_string in subtoken_strings:
    count = subtoken_counts[subtoken_string]
    if count >= min_count:
      # Exclude alphabet tokens here, as they must be included later,
      # explicitly, regardless of count.
      if subtoken_string not in self._alphabet:
        new_subtoken_strings.append((count, subtoken_string))
      for l in range(1, lsub):
        subtoken_counts[subtoken_string[:l]] -= count
```

d. Update the vocabulary with the union of new_subtoken_strings and all single characters (the alphabet).
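Putting steps a–d together, a toy version of the whole iteration might look like the sketch below. All names are hypothetical, and the '_' escaping, reserved tokens, and max_subtoken_length cap are deliberately omitted.

```python
import collections

def greedy_segment(token, vocab):
    """Forward maximum matching; single characters guarantee a parse."""
    out, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in vocab:
                out.append(token[start:end])
                start = end
                break
    return out

def build_subwords(token_counts, min_count, num_iterations=4):
    """Toy subword builder: repeat steps a-d starting from single chars."""
    alphabet = {ch for tok in token_counts for ch in tok}
    vocab = set(alphabet)
    for _ in range(num_iterations):
        # (a) count candidate substrings starting at each segment boundary
        counts = collections.defaultdict(int)
        for token, count in token_counts.items():
            start = 0
            for sub in greedy_segment(token, vocab):
                for end in range(start + 1, len(token) + 1):
                    counts[token[start:end]] += count
                start += len(sub)
        # (b) bucket the frequent candidates by length
        by_len = collections.defaultdict(set)
        for s, c in counts.items():
            if c >= min_count:
                by_len[len(s)].add(s)
        # (c) keep longest first, discounting each kept candidate's prefixes
        new_vocab = set()
        for lsub in sorted(by_len, reverse=True):
            for s in by_len[lsub]:
                c = counts[s]
                if c >= min_count:
                    new_vocab.add(s)
                    for l in range(1, lsub):
                        counts[s[:l]] -= c
        # (d) single characters are always kept
        vocab = new_vocab | alphabet
    return vocab
```

On a corpus like {'low': 5, 'lower': 2, 'lowest': 3} with min_count=5, the loop settles on 'low' and 'lowe' plus the single characters: 'lower' and 'lowest' are too rare to survive on their own, but their shared prefixes are frequent enough.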
Encoding
Given a SubwordTextEncoder and a token string, the token is first normalized by _escape_token, then split into subtoken_strings by _escaped_token_to_subtoken_strings, and finally the strings are mapped to ids through the vocabulary. _escaped_token_to_subtoken_strings uses forward maximum matching; since the subtoken vocabulary contains every single character, a decomposition always exists.
```python
def _escaped_token_to_subtoken_strings(self, escaped_token):
  """Converts an escaped token string to a list of subtoken strings.

  Args:
    escaped_token: An escaped token as a unicode string.
  Returns:
    A list of subtokens as unicode strings.
  """
  # NOTE: This algorithm is greedy; it won't necessarily produce the "best"
  # list of subtokens.
  ret = []
  start = 0
  token_len = len(escaped_token)
  while start < token_len:
    for end in range(
        min(token_len, start + self._max_subtoken_len), start, -1):
      subtoken = escaped_token[start:end]
      if subtoken in self._subtoken_string_to_id:
        ret.append(subtoken)
        start = end
        break
    else:  # Did not break
      # If there is no possible encoding of the escaped token then one of the
      # characters in the token is not in the alphabet. This should be
      # impossible and would be indicative of a bug.
      assert False, "Token substring not found in subtoken vocabulary."
  return ret
```
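As the NOTE says, the greedy scan does not always find the best split. A standalone sketch of the same loop makes this concrete; greedy_subtokens is a hypothetical free function using a plain set where the real method consults _subtoken_string_to_id and _max_subtoken_len.

```python
def greedy_subtokens(token, vocab, max_len=None):
    """Forward maximum matching over a plain vocabulary set (a sketch of
    the method above, without the id lookup)."""
    max_len = max_len or max(len(s) for s in vocab)
    ret, start = [], 0
    while start < len(token):
        # Try the longest window first, shrinking until something matches.
        for end in range(min(len(token), start + max_len), start, -1):
            if token[start:end] in vocab:
                ret.append(token[start:end])
                start = end
                break
        else:  # no subtoken matched: a character is missing from the vocab
            raise ValueError("substring not found: %r" % token[start])
    return ret

vocab = {'abc', 'ab', 'cde', 'a', 'b', 'c', 'd', 'e'}
print(greedy_subtokens('abcde', vocab))  # ['abc', 'd', 'e']
```

Greedy matching grabs 'abc' first and ends with three pieces, even though the two-piece split ['ab', 'cde'] exists.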
Decoding
Compared with encoding, decoding is straightforward. Since the end of every token is marked with '_', decoding simply concatenates all output subtokens and splits on '_' to recover the tokens. The special OOV handling that _escape_token applied during encoding is reversed by _unescape_token.
```python
def _subtoken_ids_to_tokens(self, subtokens):
  """Converts a list of subtoken ids to a list of tokens.

  Args:
    subtokens: a list of integers in the range [0, vocab_size)
  Returns:
    a list of strings.
  """
  concatenated = "".join(
      [self._subtoken_id_to_subtoken_string(s) for s in subtokens])
  split = concatenated.split("_")
  ret = []
  for t in split:
    if t:
      unescaped = _unescape_token(t + "_")
      if unescaped:
        ret.append(unescaped)
  return ret
```
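The escape/unescape pair can be approximated with the round-trip sketch below. This is a simplification of the module-level _escape_token/_unescape_token in text_encoder, written here as standalone functions; note that the alphabet must contain the escape characters (_ESCAPE_CHARS) for the round trip to work, which the real code guarantees.

```python
import re

def escape_token(token, alphabet):
    """Simplified _escape_token: protect '\\' and '_', replace characters
    outside the alphabet by their Unicode code point, and append the
    end-of-token marker '_'."""
    token = token.replace('\\', '\\\\').replace('_', '\\u')
    return ''.join(c if c in alphabet and c != '\n' else '\\%d;' % ord(c)
                   for c in token) + '_'

def unescape_token(escaped):
    """Inverse transform, as used when decoding."""
    def match(m):
        if m.group(1) is None:
            return '_' if m.group(0) == '\\u' else '\\'
        return chr(int(m.group(1)))
    # Drop the trailing '_' marker, then undo the three escape forms.
    return re.sub(r'\\u|\\\\|\\([0-9]+);', match, escaped[:-1])

alphabet = set('abcdefghijklmnopqrstuvwxyz') | set('\\_u;0123456789')
esc = escape_token('naïve_x', alphabet)
print(esc)  # prints na\239;ve\ux_
assert unescape_token(esc) == 'naïve_x'
```

Here 'ï' is outside the alphabet, so it is encoded as its code point '\239;', while the literal '_' in the token becomes '\u' so it cannot be confused with the end-of-token marker.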