02 - Seq2Seq: Principles and Practice

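Contents

Principles

  • A brief history of machine translation
  • Basic Seq2Seq network architecture
  • Applications of Seq2Seq
  • Problems with Seq2Seq
  • The Attention mechanism

Practice

  • Task 1:
    • Data preprocessing
    • Encoder layer and word embeddings
    • Building the decoder
    • Model iteration
  • Task 2:
    • Data preprocessing
    • Using pre-trained word vectors
    • Decoding
    • Task summary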


Before studying Seq2Seq, let us first review the network architectures of the RNN (Figure 1) and the LSTM (Figure 2).

Figure 1: RNN network architecture
Figure 2: LSTM network architecture

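Principles

A brief history of machine translation

Figure 3: The earliest word-by-word translation

Word-by-word translation produces output that clearly does not match how people normally communicate: the wording is stiff or the meaning is wrong. This led to statistical machine translation, whose obvious weakness is that it does not take context into account.

Figure 4: Statistical machine translation

Today, machine translation is based on recurrent neural networks (RNNs) and word embeddings, as shown in Figure 5.

Figure 5: Machine translation based on deep learning

Once the input has been encoded into a form the computer can process, the result still has to be decoded back into text, as shown in Figure 6.

Figure 6: Machine translation based on deep learning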

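Suppose a user enters an English text sequence and wants the corresponding Spanish translation.

  • First, the input step receives the user's text sequence.
  • Next, the Encoder (for example an RNN) encodes the text sequence into a vector, say a 3-dimensional one.
  • Then the 3-dimensional vector is fed into the Decoder.
  • Finally, the Decoder outputs the decoded text.

Looking at the whole process, we take a text sequence from the user (Sequence), let the computer process it (To), which covers input and encoding, and finally obtain the corresponding text sequence (Sequence), which covers decoding and output. This is exactly the seq2seq pipeline.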

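Seq2Seq network architecture

The network consists of two parts, an Encoder and a Decoder, connected by an intermediate vector.

  • The Encoder is an RNN whose hidden layer contains a number of units, each of them an LSTM unit. The Encoder's output is the processed vector, which serves as the input to the Decoder.
  • The Decoder has a similar structure to the Encoder. The input to each unit is the output of the previous unit, so it produces one result per step.
  • One drawback when training this model is that corpus data is hard to obtain.

Take Figure 7 as an example. We have received an email that reads "Are you free tomorrow" and want to generate the reply "Yes, what's up?". Tips: START is the start token (some papers call it GO); END is the end token that tells the decoder to stop decoding (some papers call it EOS, for end of sentence). These tokens have to be added to the training data during preprocessing.

Figure 7: An example of the Seq2Seq network architecture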

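Applications of Seq2Seq

  • Machine translation

    Figure 8: Seq2Seq application: machine translation
  • Text summarization

    Figure 9: Seq2Seq application: text summarization
  • Emotional dialogue generation

    Figure 10: Seq2Seq application: emotional dialogue generation
  • Code completion

    Figure 11: Seq2Seq application: code completion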

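Problems with Seq2Seq

  • Information is lost through compression
    As Figure 12 shows, before training the text is embedded, that is, mapped to vectors, and then passed through LSTM units. However well the LSTM controls which information is kept, some information is always lost by the time everything has been compressed into the last node, and this affects the final prediction.

    Figure 12: Information loss in the LSTM
  • Length limitation
    If the input sequence is too long, the trained model does not perform very well. A length of about 10 to 20 is generally ideal, as shown in Figure 13.

    Figure 13: Seq2Seq is affected by text length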

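The Attention mechanism

To address the problems above, an Attention mechanism is added to the model. For the detailed principles, see the article "02 - Attention Mechanism: Attention Based on Recurrent Neural Networks (RNN)".

In computer vision, attention is described as focusing on a specific region of an image in "high resolution" while perceiving the surrounding regions in "low resolution". Extensive experiments have shown that applying attention to machine translation, summarization, reading comprehension, and similar problems brings significant improvements.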
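
As a rough illustration of the idea, the sketch below computes attention weights for a single decoder step over a few encoder states. It assumes plain dot-product scoring and made-up 2-dimensional vectors; the function name `attention_context` is hypothetical, and the article referenced above may use a different scoring function.

```python
import math

def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step using dot-product attention."""
    # Score each encoder state by its dot product with the decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, state)) for state in encoder_states]
    # Softmax turns the scores into weights that sum to 1 (subtract the max for stability)
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # The context vector is the weighted sum of the encoder states
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Illustrative values only
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], encoder_states)
print(weights)
print(context)
```

Instead of relying only on the last encoder state, the decoder receives a context vector that emphasises the encoder positions most relevant to the current step.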

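There is also a Bucket mechanism. Suppose we have many dialogue pairs whose lengths range from 0 to 100 characters; after training, the model's outputs will span the same 0 to 100 range. Normally every sentence would have to be padded to the same length, but that increases the amount of computation.
The Bucket mechanism first groups the sentences by length range, for example bucket1 [10, 10], bucket2 [10-30, 20-30], bucket3 [30-100, 30-100], and so on, and then runs the computation per group. So if the lengths in the training corpus vary widely, consider adding bucketing. (TensorFlow's legacy seq2seq translation example, for instance, trains with buckets.)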
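
The grouping step can be sketched in a few lines of plain Python. The bucket limits and the helper name `assign_bucket` below are assumptions chosen for illustration, not values taken from TensorFlow.

```python
def assign_bucket(source_len, target_len, buckets):
    """Return the index of the first bucket that fits both lengths, or None."""
    for index, (max_source, max_target) in enumerate(buckets):
        if source_len <= max_source and target_len <= max_target:
            return index
    return None

# Hypothetical (max source length, max target length) limits, smallest first
buckets = [(10, 10), (30, 30), (100, 100)]

pairs = [("hi there", "hello"), ("a" * 25, "b" * 12), ("c" * 40, "d" * 90)]
grouped = {}
for source, target in pairs:
    index = assign_bucket(len(source), len(target), buckets)
    grouped.setdefault(index, []).append((source, target))

for index in sorted(grouped):
    print(buckets[index], len(grouped[index]))
```

Each group then only needs padding up to its own bucket limit rather than to the longest sentence in the whole corpus.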

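Practice

Task 1:

Task 1 implements a basic Seq2Seq model: given a word (a sequence of letters), the model returns a "word" with the letters sorted.

A basic Seq2Seq model has three main parts: the Encoder, the Decoder, and the intermediate state vector that connects them.

For example, sorting the text in alphabetical order: hello --> ehllo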
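
The mapping the model has to learn can be written as a one-line rule. This is only a sketch of the relationship between source and target (the helper name `make_target` is hypothetical), not the script that produced the dataset files.

```python
def make_target(word):
    """Return the letters of `word` in alphabetical order."""
    return ''.join(sorted(word))

print(make_target('hello'))  # ehllo
print(make_target('bsaqq'))  # abqqs
```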

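Check the TensorFlow version (the code in this article relies on tf.contrib, which is only available in TensorFlow 1.x)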

from distutils.version import LooseVersion
import tensorflow as tf
from tensorflow.python.layers.core import Dense


# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.1'), 'Please use TensorFlow version 1.1 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

If some packages are missing, they can be downloaded from the following site, although the download may be slow: http://www.lfd.uci.edu/~gohlke/pythonlibs/#tensorflow

1. Loading the dataset

import numpy as np
import time
import tensorflow as tf

# Each line of the source file holds one word (a sequence of letters)
with open('data/letters_source.txt', 'r', encoding='utf-8') as f:
    source_data = f.read()

# Each line of the target file holds the same word with its letters sorted
with open('data/letters_target.txt', 'r', encoding='utf-8') as f:
    target_data = f.read()

1.1 Previewing the data

print(source_data.split('\n')[:10])
print(target_data.split('\n')[:10])

Output for source:
['bsaqq',
'npy',
'lbwuj',
'bqv',
'kial',
'tddam',
'edxpjpg',
'nspv',
'huloz',
'kmclq']

Output for target:
['abqqs',
'npy',
'bjluw',
'bqv',
'aikl',
'addmt',
'degjppx',
'npsv',
'hlouz',
'cklmq']

source is the input data that is fed to the model.
target is the expected output (the label) for each input, here the same letters in sorted order.

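2. Data preprocessing

Preprocessing here maps each character of the text to an integer id. The embedding layer later turns these ids into dense, low-dimensional vectors that the model can train on.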

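def extract_character_vocab(data):
    '''
    Build the lookup tables between characters and integer ids.
    '''
    # Special tokens: <GO> marks the start and <EOS> the end of a sequence, <UNK> stands for
    # characters that cannot be mapped (common in messy datasets), and <PAD> pads sequences
    # so that every sequence in a batch has the same length (like zero padding for an RNN).
    special_words = ['<PAD>', '<UNK>', '<GO>', '<EOS>']

    # Collect the distinct characters as a list for the embedding step later.
    # Note: the order of a set of strings may differ between interpreter runs, and so may the ids.
    set_words = list(set([character for line in data.split('\n') for character in line]))
    # The four special tokens take the first four ids
    int_to_vocab = {idx: word for idx, word in enumerate(special_words + set_words)}
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}

    return int_to_vocab, vocab_to_int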

2.1 Calling the function to preprocess the data

# Build the lookup tables
source_int_to_letter, source_letter_to_int = extract_character_vocab(source_data)
target_int_to_letter, target_letter_to_int = extract_character_vocab(target_data)

# Convert the letters to ids; every target sequence also gets <EOS> appended
source_int = [[source_letter_to_int.get(letter, source_letter_to_int['<UNK>']) 
               for letter in line] for line in source_data.split('\n')]
target_int = [[target_letter_to_int.get(letter, target_letter_to_int['<UNK>']) 
               for letter in line] + [target_letter_to_int['<EOS>']] for line in target_data.split('\n')] 

2.2 Inspecting the mapping

# Look at the converted result
print(source_int[:10])
print(target_int[:10])

Result 1 (source_int):
[[17, 9, 12, 11, 11], # bsaqq
[16, 29, 26],
[13, 17, 15, 25, 8],
[17, 11, 4],
[18, 10, 12, 13],
[23, 7, 7, 12, 24],
[27, 7, 6, 29, 8, 29, 5],
[16, 9, 29, 4],
[28, 25, 13, 21, 20],
[18, 24, 22, 13, 11]]
Result 2 (target_int):
[[12, 17, 11, 11, 9, 3], # abqqs; the trailing 3 is the id of the special token <EOS>
[16, 29, 26, 3],
[17, 8, 13, 25, 15, 3],
[17, 11, 4, 3],
[12, 10, 18, 13, 3],
[12, 7, 7, 24, 23, 3],
[7, 27, 5, 8, 29, 29, 6, 3],
[16, 29, 9, 4, 3],
[28, 13, 21, 25, 20, 3],
[22, 18, 13, 24, 11, 3]]
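
To make the mapping concrete, here is a small self-contained round trip from letters to ids and back. It is a sketch rather than the code used above: the helper names `build_vocab`, `encode`, and `decode` are hypothetical, and it uses `sorted()` so that the ids are reproducible, which means they will not match the ids printed above.

```python
def build_vocab(lines):
    """Map special tokens and characters to ids; sorted() keeps the ids reproducible."""
    special_words = ['<PAD>', '<UNK>', '<GO>', '<EOS>']
    chars = sorted(set(ch for line in lines for ch in line))
    int_to_vocab = dict(enumerate(special_words + chars))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int

def encode(word, vocab_to_int, add_eos=False):
    """Convert a word to ids, falling back to <UNK> for unseen characters."""
    ids = [vocab_to_int.get(ch, vocab_to_int['<UNK>']) for ch in word]
    return ids + [vocab_to_int['<EOS>']] if add_eos else ids

def decode(ids, int_to_vocab):
    """Convert ids back to text, skipping <PAD> and <EOS>."""
    return ''.join(int_to_vocab[i] for i in ids if int_to_vocab[i] not in ('<PAD>', '<EOS>'))

int_to_vocab, vocab_to_int = build_vocab(['abqqs', 'npy'])
ids = encode('abqqs', vocab_to_int, add_eos=True)
print(ids)                        # [4, 5, 8, 8, 9, 3]
print(decode(ids, int_to_vocab))  # abqqs
```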

3. Building the model

3.1 Input layer

def get_inputs():
    '''
    Create the input tensors of the model.
    '''
    # Placeholders with unspecified shape, so that they adapt to the training data
    inputs = tf.placeholder(tf.int32, [None, None], name='inputs')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    learning_rate = tf.placeholder(tf.float32, name='learning_rate')  # placeholder for the learning rate
    
    # Sequence lengths (target_sequence_length and source_sequence_length are supplied through feed_dict)
    target_sequence_length = tf.placeholder(tf.int32, (None,), name='target_sequence_length')
    # Longest target sequence in the batch, used later to bound decoding and to build the loss mask
    max_target_sequence_length = tf.reduce_max(target_sequence_length, name='max_target_len')
    source_sequence_length = tf.placeholder(tf.int32, (None,), name='source_sequence_length')
    
    return inputs, targets, learning_rate, target_sequence_length, max_target_sequence_length, source_sequence_length

3.2 The Encoder

The Encoder works in two steps:

  • First, embed the input;
  • Then pass the embedded vectors to the RNN for processing.

APIs used in this step:

For the embedding we use tf.contrib.layers.embed_sequence, which performs the embedding for each batch.

  • tf.contrib.layers.embed_sequence:

Performs embedding on sequence data. It takes a tensor of shape [batch_size, sequence_length] and returns a tensor of shape [batch_size, sequence_length, embed_dim].

features = [[1,2,3],[4,5,6]]

outputs = tf.contrib.layers.embed_sequence(features, vocab_size, embed_dim)

With embed_dim=4, the output has the following form (the values are illustrative):

[
[[0.1,0.2,0.3,0.1],[0.2,0.5,0.7,0.2],[0.1,0.6,0.1,0.2]],
[[0.6,0.2,0.8,0.2],[0.5,0.6,0.9,0.2],[0.3,0.9,0.2,0.2]]
]
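
Conceptually, an embedding lookup is just row indexing into a table. The following plain-Python sketch mimics the shapes only; `toy_embed_sequence` is a hypothetical stand-in, not the TensorFlow implementation, and its random table plays the role of the trainable embedding matrix.

```python
import random

def toy_embed_sequence(features, vocab_size, embed_dim, seed=0):
    """Look up one embed_dim-sized row per id; returns [batch][sequence][embed_dim]."""
    rng = random.Random(seed)
    # One randomly initialised row per vocabulary entry
    table = [[rng.uniform(-1.0, 1.0) for _ in range(embed_dim)] for _ in range(vocab_size)]
    return [[table[token_id] for token_id in sequence] for sequence in features]

features = [[1, 2, 3], [4, 5, 6]]
outputs = toy_embed_sequence(features, vocab_size=10, embed_dim=4)
print(len(outputs), len(outputs[0]), len(outputs[0][0]))  # 2 3 4
```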

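  • tf.contrib.rnn.MultiRNNCell:

Stacks RNN cells in sequence. It takes a list of RNN cells as its argument.

In the code below, rnn_size is the number of hidden units in one RNN cell and num_layers is the number of stacked RNN cells.

  • tf.nn.dynamic_rnn:

Builds the RNN and accepts input sequences of dynamic length. It returns the RNN outputs and the final state tensor.

The difference between dynamic_rnn and rnn is that dynamic_rnn can accept a different sequence_length for each batch.

For example, the first batch can be [batch_size, 10] and the second [batch_size, 20], whereas rnn only accepts a fixed sequence_length.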

def get_encoder_layer(input_data, rnn_size, num_layers,
                   source_sequence_length, source_vocab_size, 
                   encoding_embedding_size):

    '''
    Build the Encoder layer, which is essentially a simple RNN model.
    
    Parameters:
    - input_data: input tensor
    - rnn_size: number of hidden units in each RNN cell
    - num_layers: number of stacked RNN cells
    - source_sequence_length: sequence lengths of the source data
    - source_vocab_size: vocabulary size of the source data (number of distinct tokens)
    - encoding_embedding_size: embedding size, i.e. the dimension of the embedded vectors
    '''
    # Encoder embedding
    encoder_embed_input = tf.contrib.layers.embed_sequence(input_data, source_vocab_size, encoding_embedding_size)

    # RNN cell: build a basic LSTM cell with random uniform initialization
    def get_lstm_cell(rnn_size):
        lstm_cell = tf.contrib.rnn.LSTMCell(rnn_size, initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        return lstm_cell

    # Stack one LSTM cell per hidden layer to form a multi-layer RNN
    cell = tf.contrib.rnn.MultiRNNCell([get_lstm_cell(rnn_size) for _ in range(num_layers)])
    
    # Build the RNN on dynamic-length input; returns the RNN outputs and the final state tensor
    # Arguments: the stacked cell, the embedded input, and the sequence lengths
    encoder_output, encoder_state = tf.nn.dynamic_rnn(cell, encoder_embed_input, 
                                                      sequence_length=source_sequence_length, dtype=tf.float32)
    
    return encoder_output, encoder_state

3.3 The Decoder

Preprocessing the target data:
Preprocessing includes adding the special start and end tokens and making sure the data has consistent dimensions.

Figure 14: The data after preprocessing
def process_decoder_input(data, vocab_to_int, batch_size):
    '''
    Prepend <GO> to every target sequence and remove its last token.
    '''
    # Cut off the last token of every row
    ending = tf.strided_slice(data, [0, 0], [batch_size, -1], [1, 1])
    decoder_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return decoder_input
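
The effect of process_decoder_input is easier to see on plain Python lists. This sketch (the helper name `shift_for_decoder` is hypothetical) performs the same shift without TensorFlow, using the ids implied by the special-token order above: 0 for <PAD>, 2 for <GO>, 3 for <EOS>.

```python
def shift_for_decoder(batch, go_id):
    """Drop the last token of every row and prepend the <GO> id."""
    return [[go_id] + row[:-1] for row in batch]

# Two padded target rows: "abq" + <EOS>, and "np" + <EOS> + <PAD>
targets = [[12, 17, 11, 3], [16, 29, 3, 0]]
print(shift_for_decoder(targets, go_id=2))
# [[2, 12, 17, 11], [2, 16, 29, 3]]
```

The decoder input at step t is therefore the target token from step t-1, with <GO> standing in at the first step.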
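
3.4 Embedding the target data

Likewise, the target data has to be embedded before it can be passed to the RNN in the Decoder.

APIs used in this step:
  • tf.contrib.seq2seq.TrainingHelper:

The helper used on the Decoder side during training.

It does not feed the output of step t-1 in as the input of step t; instead it feeds the ground-truth values from target directly to the RNN.

Its main parameters are inputs and sequence_length. It returns a helper object that can be passed to BasicDecoder.

  • tf.contrib.seq2seq.GreedyEmbeddingHelper:

Unlike TrainingHelper, it embeds the output of step t-1 and feeds it to the RNN as the next input.

Figure 15 below shows the training process:

During training we do not use the predicted output of each step as the input of the next step; the next step receives the ground-truth value from the target data instead (a technique known as teacher forcing). This keeps an early mistake from propagating through the rest of the sequence and makes training more stable.

Figure 15: Training process on the Decoder side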
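
The difference between the two helpers can be illustrated with a toy decoder step. Everything in this sketch is hypothetical: `toy_step` is a lookup table standing in for a trained RNN step, and it deliberately makes one mistake (it predicts 'x' after 'b').

```python
def toy_step(prev_token):
    """Hypothetical stand-in for one decoder step; wrongly predicts 'x' after 'b'."""
    table = {'<GO>': 'a', 'a': 'b', 'b': 'x', 'c': '<EOS>', 'x': 'y', 'y': '<EOS>'}
    return table[prev_token]

def decode_teacher_forced(target):
    """Training style: the input at each step is the ground-truth previous token."""
    inputs = ['<GO>'] + target[:-1]
    return [toy_step(token) for token in inputs]

def decode_greedy(max_steps):
    """Inference style: the input at each step is the model's own previous prediction."""
    outputs, token = [], '<GO>'
    for _ in range(max_steps):
        token = toy_step(token)
        outputs.append(token)
        if token == '<EOS>':
            break
    return outputs

target = ['a', 'b', 'c', '<EOS>']
print(decode_teacher_forced(target))  # ['a', 'b', 'x', '<EOS>']
print(decode_greedy(max_steps=10))    # ['a', 'b', 'x', 'y', '<EOS>']
```

With ground-truth inputs the single mistake stays local, whereas in greedy decoding the wrong token becomes the next input and the error carries forward.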

def decoding_layer(target_letter_to_int, decoding_embedding_size, num_layers, rnn_size,
                   target_sequence_length, max_target_sequence_length, encoder_state, decoder_input):
    '''
    Build the Decoder layer.
    
    Parameters:
    - target_letter_to_int: lookup table of the target data
    - decoding_embedding_size: size of the embedding vectors
    - num_layers: number of stacked RNN cells
    - rnn_size: number of hidden units in each RNN cell
    - target_sequence_length: sequence lengths of the target data
    - max_target_sequence_length: maximum sequence length of the target data
    - encoder_state: state vector produced by the encoder
    - decoder_input: input to the decoder

    Note: batch_size is read from the global hyperparameters.
    '''
    # 1. Embedding
    target_vocab_size = len(target_letter_to_int)  # size of the target vocabulary
    decoder_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))  # embedding matrix
    decoder_embed_input = tf.nn.embedding_lookup(decoder_embeddings, decoder_input)  # embed the decoder input

    # 2. Build the RNN cells of the Decoder
    def get_decoder_cell(rnn_size):
        """Build a basic LSTM cell."""
        decoder_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                               initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        return decoder_cell
    cell = tf.contrib.rnn.MultiRNNCell([get_decoder_cell(rnn_size) for _ in range(num_layers)])  # stack the cells
     
    # 3. Fully connected output layer: projects each RNN output to one logit per vocabulary entry
    output_layer = Dense(target_vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))


    # 4. Training decoder: the LSTM cells take the ground-truth labels as input
    with tf.variable_scope("decode"):
        # Create the helper object
        training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=decoder_embed_input,
                                                            sequence_length=target_sequence_length,
                                                            time_major=False)
        # Build the basic decoder
        training_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                           training_helper,
                                                           encoder_state,
                                                           output_layer) 
        # Output of the decoder in training mode
        training_decoder_output, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                                       impute_finished=True,
                                                                       maximum_iterations=max_target_sequence_length)

    # 5. Predicting decoder: the LSTM cells take the output of the previous step as input
    # Parameters are shared with the training decoder
    with tf.variable_scope("decode", reuse=True):  # same scope as in step 4; reuse=True shares its parameters
        # Create a constant tensor and tile it to batch_size
        start_tokens = tf.tile(tf.constant([target_letter_to_int['<GO>']], dtype=tf.int32), [batch_size], 
                               name='start_tokens')
        predicting_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(decoder_embeddings,
                                                                start_tokens,
                                                                target_letter_to_int['<EOS>'])
        predicting_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                        predicting_helper,
                                                        encoder_state,
                                                        output_layer)
        predicting_decoder_output, _ = tf.contrib.seq2seq.dynamic_decode(predicting_decoder,
                                                            impute_finished=True,
                                                            maximum_iterations=max_target_sequence_length)
    
    return training_decoder_output, predicting_decoder_output

3.5 Building the seq2seq model

With the Encoder and the Decoder in place, we now connect the two parts to build the seq2seq model.

def seq2seq_model(input_data, targets, lr, target_sequence_length, 
                  max_target_sequence_length, source_sequence_length,
                  source_vocab_size, target_vocab_size,
                  encoding_embedding_size, decoding_embedding_size, 
                  rnn_size, num_layers):
    
    # Get the final state of the encoder
    _, encoder_state = get_encoder_layer(input_data, 
                                  rnn_size, 
                                  num_layers, 
                                  source_sequence_length,
                                  source_vocab_size, 
                                  encoding_embedding_size)
    
    
    # Preprocessed decoder input (batch_size is read from the global hyperparameters)
    decoder_input = process_decoder_input(targets, target_letter_to_int, batch_size)
    
    # Pass the state vector and the input to the decoder
    training_decoder_output, predicting_decoder_output = decoding_layer(target_letter_to_int, 
                                                                       decoding_embedding_size, 
                                                                       num_layers, 
                                                                       rnn_size,
                                                                       target_sequence_length,
                                                                       max_target_sequence_length,
                                                                       encoder_state, 
                                                                       decoder_input) 
    
    return training_decoder_output, predicting_decoder_output
    

Hyperparameter settings

# Hyperparameters
# Number of Epochs
epochs = 60
# Batch Size
batch_size = 128
# RNN Size
rnn_size = 50
# Number of Layers
num_layers = 2
# Embedding Size
encoding_embedding_size = 15
decoding_embedding_size = 15
# Learning Rate
learning_rate = 0.001

Building the graph

# Build the graph
train_graph = tf.Graph()

with train_graph.as_default():
    
    # Get the model inputs
    input_data, targets, lr, target_sequence_length, max_target_sequence_length, source_sequence_length = get_inputs()
    
    training_decoder_output, predicting_decoder_output = seq2seq_model(input_data, 
                                                                      targets, 
                                                                      lr, 
                                                                      target_sequence_length, 
                                                                      max_target_sequence_length, 
                                                                      source_sequence_length,
                                                                      len(source_letter_to_int),
                                                                      len(target_letter_to_int),
                                                                      encoding_embedding_size, 
                                                                      decoding_embedding_size, 
                                                                      rnn_size, 
                                                                      num_layers)    
    
    training_logits = tf.identity(training_decoder_output.rnn_output, 'logits')
    predicting_logits = tf.identity(predicting_decoder_output.sample_id, name='predictions')
    
    # Loss mask: 1 for positions inside each target's sequence length, 0 beyond it
    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient clipping: clamp every gradient value to [-5, 5] to guard against exploding gradients
        gradients = optimizer.compute_gradients(cost)  # compute the gradients
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]  # clip them to the range
        train_op = optimizer.apply_gradients(capped_gradients)
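
To show what the mask and the clipping are meant to do, here is a plain-Python sketch of a masked average cross-entropy and of value clipping. It illustrates the idea only; `masked_sequence_loss` and `clip_value` are hypothetical helpers, the probabilities are made up, and the exact reduction used by tf.contrib.seq2seq.sequence_loss should be checked against its documentation.

```python
import math

def masked_sequence_loss(probs, targets, lengths):
    """Average the per-token cross-entropy over positions inside each sequence length.

    probs[b][t] is the predicted probability distribution at step t of sequence b.
    """
    total, count = 0.0, 0.0
    for seq_probs, seq_targets, length in zip(probs, targets, lengths):
        for t, (dist, target_id) in enumerate(zip(seq_probs, seq_targets)):
            mask = 1.0 if t < length else 0.0  # plays the role of the sequence mask
            total += -math.log(dist[target_id]) * mask
            count += mask
    return total / count

def clip_value(gradient, low=-5.0, high=5.0):
    """Clamp a single gradient value to [low, high]."""
    return max(low, min(high, gradient))

# One sequence of three steps over a two-token vocabulary; the third step is padding
probs = [[[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]]
targets = [[0, 1, 1]]
print(masked_sequence_loss(probs, targets, lengths=[2]))  # padding step ignored
print(masked_sequence_loss(probs, targets, lengths=[3]))  # padding step counted
print(clip_value(12.3), clip_value(-0.4))                 # 5.0 -0.4
```

When the padded step is counted, the confident wrong prediction at that position inflates the loss, which is why padding positions are normally masked out.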

4. Batching

def pad_sentence_batch(sentence_batch, pad_int):
    '''
    Pad the sequences in a batch so that every row has the same sequence_length.
    
    Parameters:
    - sentence_batch: list of id sequences
    - pad_int: index of <PAD>
    '''
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [pad_int] * (max_sentence - len(sentence)) for sentence in sentence_batch]

def get_batches(targets, sources, batch_size, source_pad_int, target_pad_int):
    '''
    Generator that yields one padded batch at a time.
    '''
    for batch_i in range(0, len(sources)//batch_size):
        start_i = batch_i * batch_size
        sources_batch = sources[start_i:start_i + batch_size]  # slice out the current batch
        targets_batch = targets[start_i:start_i + batch_size]
        # Pad the sequences
        pad_sources_batch = np.array(pad_sentence_batch(sources_batch, source_pad_int))
        pad_targets_batch = np.array(pad_sentence_batch(targets_batch, target_pad_int))

        # Record the length of each row (measured after padding, so every row reports the batch maximum)
        pad_targets_lengths = []
        for target in pad_targets_batch:
            pad_targets_lengths.append(len(target))
        
        pad_source_lengths = []
        for source in pad_sources_batch:
            pad_source_lengths.append(len(source))
        
        yield pad_targets_batch, pad_sources_batch, pad_targets_lengths, pad_source_lengths
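
Because get_batches measures the lengths after padding, every reported length equals the batch maximum. If the real lengths are wanted, for example so that a loss mask can skip <PAD>, one option is to record them before padding. The helper `pad_batch` below is a hypothetical sketch of that variant, not part of the code above.

```python
def pad_batch(batch, pad_id):
    """Pad every sequence to the longest one; also return the lengths before padding."""
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    return padded, lengths

padded, lengths = pad_batch([[17, 9, 12], [16], [13, 17]], pad_id=0)
print(padded)   # [[17, 9, 12], [16, 0, 0], [13, 17, 0]]
print(lengths)  # [3, 1, 2]
```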

5. Training

# Split the dataset into train and validation
train_source = source_int[batch_size:]
train_target = target_int[batch_size:]
# Hold out one batch for validation
valid_source = source_int[:batch_size]
valid_target = target_int[:batch_size]
(valid_targets_batch, valid_sources_batch, valid_targets_lengths, valid_sources_lengths) = next(get_batches(valid_target, valid_source, batch_size,
                           source_letter_to_int['<PAD>'],
                           target_letter_to_int['<PAD>']))

display_step = 50 # print the loss every 50 batches

checkpoint = "trained_model.ckpt" 
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
        
    for epoch_i in range(1, epochs+1):
        for batch_i, (targets_batch, sources_batch, targets_lengths, sources_lengths) in enumerate(
                get_batches(train_target, train_source, batch_size,
                           source_letter_to_int['<PAD>'],
                           target_letter_to_int['<PAD>'])):
            
            _, loss = sess.run(
                [train_op, cost],
                {input_data: sources_batch,
                 targets: targets_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths})

            if batch_i % display_step == 0:
                
                # Compute the validation loss
                validation_loss = sess.run(
                [cost],
                {input_data: valid_sources_batch,
                 targets: valid_targets_batch,
                 lr: learning_rate,
                 target_sequence_length: valid_targets_lengths,
                 source_sequence_length: valid_sources_lengths})
                
                print('Epoch {:>3}/{} Batch {:>4}/{} - Training Loss: {:>6.3f}  - Validation loss: {:>6.3f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(train_source) // batch_size, 
                              loss, 
                              validation_loss[0]))

    
    
    # Save the model
    saver = tf.train.Saver()
    saver.save(sess, checkpoint)
    print('Model Trained and Saved')

Result:

Epoch 1/60 Batch 50/77 - Training Loss: 2.332 - Validation loss: 2.091
Epoch 2/60 Batch 50/77 - Training Loss: 1.803 - Validation loss: 1.593
Epoch 3/60 Batch 50/77 - Training Loss: 1.550 - Validation loss: 1.379
Epoch 4/60 Batch 50/77 - Training Loss: 1.343 - Validation loss: 1.184
Epoch 5/60 Batch 50/77 - Training Loss: 1.230 - Validation loss: 1.077
Epoch 6/60 Batch 50/77 - Training Loss: 1.096 - Validation loss: 0.956
Epoch 7/60 Batch 50/77 - Training Loss: 0.993 - Validation loss: 0.849
Epoch 8/60 Batch 50/77 - Training Loss: 0.893 - Validation loss: 0.763
Epoch 9/60 Batch 50/77 - Training Loss: 0.808 - Validation loss: 0.673
Epoch 10/60 Batch 50/77 - Training Loss: 0.728 - Validation loss: 0.600
Epoch 11/60 Batch 50/77 - Training Loss: 0.650 - Validation loss: 0.539
Epoch 12/60 Batch 50/77 - Training Loss: 0.594 - Validation loss: 0.494
Epoch 13/60 Batch 50/77 - Training Loss: 0.560 - Validation loss: 0.455
Epoch 14/60 Batch 50/77 - Training Loss: 0.502 - Validation loss: 0.411
Epoch 15/60 Batch 50/77 - Training Loss: 0.464 - Validation loss: 0.380
Epoch 16/60 Batch 50/77 - Training Loss: 0.428 - Validation loss: 0.352
Epoch 17/60 Batch 50/77 - Training Loss: 0.394 - Validation loss: 0.323
Epoch 18/60 Batch 50/77 - Training Loss: 0.364 - Validation loss: 0.297
Epoch 19/60 Batch 50/77 - Training Loss: 0.335 - Validation loss: 0.270
Epoch 20/60 Batch 50/77 - Training Loss: 0.305 - Validation loss: 0.243
Epoch 21/60 Batch 50/77 - Training Loss: 0.311 - Validation loss: 0.248
Epoch 22/60 Batch 50/77 - Training Loss: 0.253 - Validation loss: 0.203
Epoch 23/60 Batch 50/77 - Training Loss: 0.227 - Validation loss: 0.182
Epoch 24/60 Batch 50/77 - Training Loss: 0.204 - Validation loss: 0.165
Epoch 25/60 Batch 50/77 - Training Loss: 0.184 - Validation loss: 0.150
Epoch 26/60 Batch 50/77 - Training Loss: 0.166 - Validation loss: 0.136
Epoch 27/60 Batch 50/77 - Training Loss: 0.150 - Validation loss: 0.124
Epoch 28/60 Batch 50/77 - Training Loss: 0.135 - Validation loss: 0.113
Epoch 29/60 Batch 50/77 - Training Loss: 0.121 - Validation loss: 0.103
Epoch 30/60 Batch 50/77 - Training Loss: 0.109 - Validation loss: 0.094
Epoch 31/60 Batch 50/77 - Training Loss: 0.098 - Validation loss: 0.086
Epoch 32/60 Batch 50/77 - Training Loss: 0.088 - Validation loss: 0.079
Epoch 33/60 Batch 50/77 - Training Loss: 0.079 - Validation loss: 0.073
Epoch 34/60 Batch 50/77 - Training Loss: 0.071 - Validation loss: 0.067
Epoch 35/60 Batch 50/77 - Training Loss: 0.063 - Validation loss: 0.062
Epoch 36/60 Batch 50/77 - Training Loss: 0.057 - Validation loss: 0.057
Epoch 37/60 Batch 50/77 - Training Loss: 0.052 - Validation loss: 0.053
Epoch 38/60 Batch 50/77 - Training Loss: 0.047 - Validation loss: 0.049
Epoch 39/60 Batch 50/77 - Training Loss: 0.043 - Validation loss: 0.045
Epoch 40/60 Batch 50/77 - Training Loss: 0.039 - Validation loss: 0.042
Epoch 41/60 Batch 50/77 - Training Loss: 0.036 - Validation loss: 0.039
Epoch 42/60 Batch 50/77 - Training Loss: 0.033 - Validation loss: 0.037
Epoch 43/60 Batch 50/77 - Training Loss: 0.030 - Validation loss: 0.034
Epoch 44/60 Batch 50/77 - Training Loss: 0.028 - Validation loss: 0.032
Epoch 45/60 Batch 50/77 - Training Loss: 0.026 - Validation loss: 0.029
Epoch 46/60 Batch 50/77 - Training Loss: 0.024 - Validation loss: 0.028
Epoch 47/60 Batch 50/77 - Training Loss: 0.027 - Validation loss: 0.029
Epoch 48/60 Batch 50/77 - Training Loss: 0.030 - Validation loss: 0.030
Epoch 49/60 Batch 50/77 - Training Loss: 0.023 - Validation loss: 0.026
Epoch 50/60 Batch 50/77 - Training Loss: 0.021 - Validation loss: 0.024
Epoch 51/60 Batch 50/77 - Training Loss: 0.019 - Validation loss: 0.022
Epoch 52/60 Batch 50/77 - Training Loss: 0.017 - Validation loss: 0.021
Epoch 53/60 Batch 50/77 - Training Loss: 0.016 - Validation loss: 0.020
Epoch 54/60 Batch 50/77 - Training Loss: 0.015 - Validation loss: 0.019
Epoch 55/60 Batch 50/77 - Training Loss: 0.014 - Validation loss: 0.018
Epoch 56/60 Batch 50/77 - Training Loss: 0.013 - Validation loss: 0.018
Epoch 57/60 Batch 50/77 - Training Loss: 0.012 - Validation loss: 0.017
Epoch 58/60 Batch 50/77 - Training Loss: 0.011 - Validation loss: 0.016
Epoch 59/60 Batch 50/77 - Training Loss: 0.011 - Validation loss: 0.016
Epoch 60/60 Batch 50/77 - Training Loss: 0.010 - Validation loss: 0.015
Model Trained and Saved

6. Prediction

def source_to_seq(text):
    '''
    Convert the input word to a sequence of ids, padded with <PAD> to a fixed length.
    '''
    sequence_length = 7
    return [source_letter_to_int.get(word, source_letter_to_int['<UNK>']) for word in text] + [source_letter_to_int['<PAD>']]*(sequence_length-len(text))

# Enter a word
input_word = 'common'
text = source_to_seq(input_word)

checkpoint = "./trained_model.ckpt"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load the model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('inputs:0')
    # 'predictions:0' holds the predicted token ids (sample_id), not raw logits
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')
    target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
    
    # The graph was built for a fixed batch_size, so the single input is repeated to fill a batch
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      target_sequence_length: [len(text)]*batch_size, 
                                      source_sequence_length: [len(text)]*batch_size})[0] 


pad = source_letter_to_int["<PAD>"] 

print('Original input:', input_word)

print('\nSource')
print('  Word ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([source_int_to_letter[i] for i in text])))

print('\nTarget')
print('  Word ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([target_int_to_letter[i] for i in answer_logits if i != pad])))

Result:

INFO:tensorflow:Restoring parameters from ./trained_model.ckpt
Original input: common

Source
Word ids: [20, 28, 6, 6, 28, 5, 0]
Input Words: c o m m o n <PAD>

Target
Word ids: [20, 6, 6, 5, 28, 28, 3]
Response Words: c m m n o o <EOS>
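
The printed response still ends with <EOS>. If only the predicted letters are wanted, the id list can be cut at the first <EOS>. A small sketch, with a hypothetical helper `trim_at_eos` and the ids from the result above:

```python
def trim_at_eos(ids, eos_id, pad_id):
    """Keep the ids before the first <EOS> and drop any <PAD> ids."""
    result = []
    for token_id in ids:
        if token_id == eos_id:
            break
        if token_id != pad_id:
            result.append(token_id)
    return result

print(trim_at_eos([20, 6, 6, 5, 28, 28, 3], eos_id=3, pad_id=0))
# [20, 6, 6, 5, 28, 28]
```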

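Task 2: Text summarization exercise

Dataset: Amazon reviews (about 500,000)
The work proceeds in the following steps:

  • Data preprocessing
  • Building the Seq2Seq model
  • Training the network
  • Testing the results

Seq2Seq tutorial: https://github.com/j-min/tf_tutorial_plus/tree/master/RNN_seq2seq/contrib_seq2seq (a Seq2Seq tutorial written by an experienced developer abroad)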

1. Importing the required libraries

import pandas as pd
import numpy as np
import tensorflow as tf
import re
from nltk.corpus import stopwords
import time
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
print('TensorFlow Version: {}'.format(tf.__version__))

2. Loading the data

reviews = pd.read_csv("Reviews.csv")
print(reviews.shape)
print(reviews.head())

Result:
(568454, 10)

  | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text
0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d...
1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut...
2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe...
3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i...
4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid...

2.1 Checking for null values

# Check for any null values
reviews.isnull().sum()

2.2 Removing null values and unneeded features

# Remove null values and unneeded features
reviews = reviews.dropna()
# Pass axis=1 by keyword; newer pandas versions no longer accept it positionally
reviews = reviews.drop(['Id','ProductId','UserId','ProfileName','HelpfulnessNumerator','HelpfulnessDenominator',
                        'Score','Time'], axis=1)
reviews = reviews.reset_index(drop=True)

reviews.head()

2.3 Inspecting some of the data

# Inspecting some of the reviews
for i in range(5):
    print("Review #",i+1)
    print(reviews.Summary[i])
    print(reviews.Text[i])
    print()

3. Data preprocessing

Main tasks:

  • Convert everything to lower case
  • Expand contractions
  • Remove stop words (only from the review text)

3.1 Defining the contraction list

contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}
3.2數(shù)據(jù)清洗
def clean_text(text, remove_stopwords = True):
    '''Remove unwanted characters, stopwords, and format the text to create fewer nulls word embeddings'''
    
    # Convert words to lower case
    text = text.lower()
    
    # Replace contractions with their longer forms
    text = " ".join(contractions.get(word, word) for word in text.split())
    
    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text) 
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    
    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]
        text = " ".join(text)

    return text

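In the code above, stop words are removed from the review texts because they contribute little to training the model. They are kept in the summaries, however, so that the summaries read more like natural phrases.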
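
To make the cleaning steps concrete, here is a simplified, self-contained sketch of the same pipeline applied to one sentence. The tiny `CONTRACTIONS` dictionary and `STOPS` set are stand-ins assumed for illustration (the tutorial uses the full contraction list and NLTK's English stop words), and `clean_text_demo` is a hypothetical name. Unlike `clean_text`, it always collapses repeated spaces.

```python
import re

# Stand-ins for illustration only: the tutorial uses the full contraction
# dictionary and NLTK's English stop-word list instead.
CONTRACTIONS = {"it's": "it is", "i've": "i have"}
STOPS = {"i", "it", "is", "the", "have", "had"}

def clean_text_demo(text, remove_stopwords=True):
    """Lower-case, expand contractions, strip punctuation, optionally drop stop words."""
    words = [CONTRACTIONS.get(w, w) for w in text.lower().split()]
    text = " ".join(words)
    # Replace punctuation with spaces, mirroring the character class used in clean_text
    text = re.sub(r'[_"\-;%()|+&=*.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r"'", ' ', text)
    words = text.split()  # also collapses repeated spaces
    if remove_stopwords:
        words = [w for w in words if w not in STOPS]
    return " ".join(words)

review = "It's the best coffee I've had!"
as_text = clean_text_demo(review)                             # stop words removed
as_summary = clean_text_demo(review, remove_stopwords=False)  # stop words kept
print(as_text)
print(as_summary)
```

The first call shows how a review text shrinks to its content words, while the second keeps the full phrase, which is the behaviour wanted for summaries.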

# Clean the summaries and texts
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")

clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")

Inspect the cleaned summaries and texts to make sure they were cleaned properly.

for i in range(5):
    print("Clean Review #",i+1)
    print(clean_summaries[i])
    print(clean_texts[i])
    print()

Count the number of occurrences of each word in a set of texts.

def count_words(count_dict, text):
    '''Count the number of occurrences of each word in a set of text'''
    for sentence in text:
        for word in sentence.split():
            if word not in count_dict:
                count_dict[word] = 1
            else:
                count_dict[word] += 1

Find how often each word is used and the size of the vocabulary.

word_counts = {}

count_words(word_counts, clean_summaries)
count_words(word_counts, clean_texts)
            
print("Size of Vocabulary:", len(word_counts))

Result:
Size of Vocabulary: 132884
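
The counting step could also be written with `collections.Counter` from the standard library. The sketch below is an alternative under that assumption, not the tutorial's own code; `count_words_counter` and the toy sentences are made up for illustration.

```python
from collections import Counter

def count_words_counter(texts):
    """Return a Counter of whitespace-separated words over all sentences."""
    counts = Counter()
    for sentence in texts:
        counts.update(sentence.split())
    return counts

# Toy data standing in for clean_summaries and clean_texts
summaries = ["great coffee", "love it"]
texts = ["coffee tasted great", "love oatmeal cups"]

word_counts_demo = count_words_counter(summaries)
word_counts_demo.update(count_words_counter(texts))  # merge both sources
print(word_counts_demo.most_common(3))
print("Size of vocabulary:", len(word_counts_demo))
```

A `Counter` behaves like a dictionary, so it can be used wherever `word_counts` is read later on.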

4. Using pretrained word embeddings

Here we use pretrained word vectors, built by others, that currently perform well.

# Load the ConceptNet Numberbatch (CN) embeddings, which are similar to GloVe but arguably better
# (https://github.com/commonsense/conceptnet-numberbatch)
embeddings_index = {}
with open('numberbatch-en-17.04b.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

Total number of word vectors in the embedding file:
484557

4.1 Some words in our corpus have no vector in the pretrained embeddings, so we have to build embeddings for them ourselves.
# Find the number of words that are missing from CN and are used more than our threshold.
missing_words = 0
threshold = 20

for word, count in word_counts.items():
    if count > threshold:
        if word not in embeddings_index:
            missing_words += 1
            
missing_ratio = round(missing_words/len(word_counts),4)*100
            
print("Number of words missing from CN:", missing_words)
print("Percent of words that are missing from vocabulary: {}%".format(missing_ratio))

Result:
Number of words missing from CN: 3044
Percent of words that are missing from vocabulary: 2.29%

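With the threshold set to 20, any word that is missing from the pretrained vectors but appears more than 20 times needs an embedding that we create ourselves.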
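
The threshold rule can be tried on toy data. In the sketch below, `find_missing_words`, the word counts, and the two-dimensional vectors are all assumptions made for illustration.

```python
def find_missing_words(word_counts, embeddings_index, threshold):
    """Return words used more than `threshold` times that lack a pretrained vector."""
    return sorted(w for w, c in word_counts.items()
                  if c > threshold and w not in embeddings_index)

# Toy stand-ins for word_counts and embeddings_index
toy_counts = {"coffee": 120, "keurig": 35, "zzzgood": 2, "tea": 60}
toy_embeddings = {"coffee": [0.1, 0.2], "tea": [0.3, 0.4]}

missing = find_missing_words(toy_counts, toy_embeddings, threshold=20)
print(missing)
# Formatting with {:.1%} avoids the floating-point noise that round(x, 4) * 100 can show
print("Share of vocabulary: {:.1%}".format(len(missing) / len(toy_counts)))
```

Here `keurig` is frequent but has no pretrained vector, so it is the kind of word that later receives a random embedding, while the rare `zzzgood` is simply ignored.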

4.2 Dictionaries for converting words to integers
# Limit the vocab that we will use to words that appear ≥ threshold or are in CN

# Dictionary to convert words to integers, which simplifies word conversion during training and testing
vocab_to_int = {} 

value = 0
for word, count in word_counts.items():
    if count >= threshold or word in embeddings_index:
        vocab_to_int[word] = value
        value += 1

# Special tokens that will be added to our vocab
codes = ["<UNK>","<PAD>","<EOS>","<GO>"]   

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int)

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

usage_ratio = round(len(vocab_to_int) / len(word_counts),4)*100

print("Total number of unique words:", len(word_counts))
print("Number of words we will use:", len(vocab_to_int))
print("Percent of words we will use: {}%".format(usage_ratio))

Result:
Total number of unique words: 132884
Number of words we will use: 65469
Percent of words we will use: 49.27%
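
As a small illustration of the mapping, the sketch below builds both dictionaries from toy counts. `build_vocab` is a hypothetical helper and the data is made up; the selection rule and the four special tokens follow the code above.

```python
def build_vocab(word_counts, embeddings_index, threshold,
                codes=("<UNK>", "<PAD>", "<EOS>", "<GO>")):
    """Map kept words and special tokens to consecutive integer ids."""
    vocab_to_int = {}
    for word, count in word_counts.items():
        # Keep a word if it is frequent enough or has a pretrained vector
        if count >= threshold or word in embeddings_index:
            vocab_to_int[word] = len(vocab_to_int)
    for code in codes:
        vocab_to_int[code] = len(vocab_to_int)
    int_to_vocab = {i: w for w, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

toy_counts = {"coffee": 120, "keurig": 35, "zzzgood": 2, "tea": 1}
toy_embeddings = {"coffee": [0.1], "tea": [0.3]}
v2i, i2v = build_vocab(toy_counts, toy_embeddings, threshold=20)
print(v2i)
```

The rare word `tea` is kept because it has a pretrained vector, whereas `zzzgood` is dropped and would later be encoded as `<UNK>`.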

4.3 Setting the embedding dimension
# Need to use 300 for embedding dimensions to match CN's vectors.
embedding_dim = 300  # The pretrained vectors are 300-dimensional, so our own embeddings must use 300 dimensions too
nb_words = len(vocab_to_int)

# Create matrix with default values of zero
word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype=np.float32)
for word, i in vocab_to_int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # If word not in CN, create a random embedding for it
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding
        word_embedding_matrix[i] = new_embedding

# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))  # 65469
4.4 Converting the words in the texts to integers
def convert_to_ints(text, word_count, unk_count, eos=False):
    '''Convert the words in text to integers.
       If a word is not in vocab_to_int, use UNK's integer.
       Total the number of words and UNKs.
       Optionally (eos=True) add the EOS token to the end of each sentence.'''
    ints = []
    for sentence in text:
        sentence_ints = []
        for word in sentence.split():
            word_count += 1
            if word in vocab_to_int:
                sentence_ints.append(vocab_to_int[word])
            else:
                sentence_ints.append(vocab_to_int["<UNK>"])
                unk_count += 1
        if eos:
            sentence_ints.append(vocab_to_int["<EOS>"])
        ints.append(sentence_ints)
    return ints, word_count, unk_count
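
A compact variant of the same conversion is sketched below on toy data, to show what the id lists look like. `to_ints` and `toy_vocab` are hypothetical; unlike `convert_to_ints`, this version returns fresh counters instead of threading running totals through its arguments.

```python
def to_ints(sentences, vocab_to_int, eos=False):
    """Convert sentences to id lists and report how many words became <UNK>."""
    unk = vocab_to_int["<UNK>"]
    ids, words, unks = [], 0, 0
    for sentence in sentences:
        row = [vocab_to_int.get(w, unk) for w in sentence.split()]
        words += len(row)
        unks += row.count(unk)
        if eos:
            row.append(vocab_to_int["<EOS>"])  # mark the end of the sentence
        ids.append(row)
    return ids, words, unks

toy_vocab = {"great": 0, "coffee": 1, "<UNK>": 2, "<PAD>": 3, "<EOS>": 4, "<GO>": 5}
ids, words, unks = to_ints(["great coffee", "great keurig coffee"], toy_vocab, eos=True)
print(ids)
print("UNK share: {:.0%}".format(unks / words))
```

The unknown word `keurig` becomes the `<UNK>` id, and each row ends with the `<EOS>` id because `eos=True`.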
4.5 Applying convert_to_ints to clean_summaries and clean_texts
# Apply convert_to_ints to clean_summaries and clean_texts
word_count = 0
unk_count = 0

int_summaries, word_count, unk_count = convert_to_ints(clean_summaries, word_count, unk_count)
int_texts, word_count, unk_count = convert_to_ints(clean_texts, word_count, unk_count, eos=True)

unk_percent = round(unk_count/word_count,4)*100

print("Total number of words in headlines:", word_count)
print("Total number of UNKs in headlines:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

Result:
Total number of words in headlines: 25679946
Total number of UNKs in headlines: 170450
Percent of words that are UNK: 0.66%

4.6 Creating a DataFrame of sentence lengths from the texts
def create_lengths(text):  # Sentence lengths vary across the corpus and need padding, so first record each sentence's length
    '''Create a data frame of the sentence lengths from a text'''
    lengths = []
    for sentence in text:
        lengths.append(len(sentence))
    return pd.DataFrame(lengths, columns=['counts'])
lengths_summaries = create_lengths(int_summaries)
lengths_texts = create_lengths(int_texts)

print("Summaries:")
print(lengths_summaries.describe())
print()
print("Texts:")
print(lengths_texts.describe())

Result:
Summaries:
              counts
count  568412.000000
mean        4.181620
std         2.657872
min         0.000000
25%         2.000000
50%         4.000000
75%         5.000000
max        48.000000

Texts:
              counts
count  568412.000000
mean       41.996782
std        42.520854
min         1.000000
25%        18.000000
50%        29.000000
75%        50.000000
max      2085.000000

# Inspect the length of texts (90th, 95th and 99th percentiles)
print(np.percentile(lengths_texts.counts, 90))
print(np.percentile(lengths_texts.counts, 95))
print(np.percentile(lengths_texts.counts, 99))

84.0
115.0
207.0

# Inspect the length of summaries
print(np.percentile(lengths_summaries.counts, 90))
print(np.percentile(lengths_summaries.counts, 95))
print(np.percentile(lengths_summaries.counts, 99))

8.0
9.0
13.0
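
These percentiles are what motivate the length limits chosen in the next step: 84 is the 90th percentile of the text lengths and 13 is the 99th percentile of the summary lengths. The sketch below shows, on made-up lengths, how a percentile cap trims only the long tail. `percentile` is a hypothetical pure-Python helper that uses linear interpolation, which is the default rule of `np.percentile`.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0   # fractional index into the sorted data
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

# Toy sentence lengths with one extreme outlier
lengths = [3, 5, 8, 12, 20, 25, 31, 40, 60, 200]
cap = percentile(lengths, 90)
kept = [n for n in lengths if n <= cap]
print(cap, len(kept), "of", len(lengths))
```

Capping at a high percentile keeps most of the data while avoiding batches that are padded out to the length of a few very long reviews.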

4.7 Counting how many times UNK appears in a sentence
def unk_counter(sentence):
    '''Counts the number of times UNK appears in a sentence.'''
    unk_count = 0
    for word in sentence:
        if word == vocab_to_int["<UNK>"]:
            unk_count += 1
    return unk_count
4.8 Sorting the texts and setting the ranges
# Sort the summaries and texts by the length of the texts, shortest to longest
# Limit the length of summaries and texts based on the min and max ranges.
# Remove reviews that include too many UNKs

sorted_summaries = []
sorted_texts = []
max_text_length = 84
max_summary_length = 13
min_length = 2
unk_text_limit = 1
unk_summary_limit = 0

for length in range(min(lengths_texts.counts), max_text_length): 
    for count, words in enumerate(int_summaries):
        if (len(int_summaries[count]) >= min_length and
            len(int_summaries[count]) <= max_summary_length and
            len(int_texts[count]) >= min_length and
            unk_counter(int_summaries[count]) <= unk_summary_limit and
            unk_counter(int_texts[count]) <= unk_text_limit and
            length == len(int_texts[count])
           ):
            sorted_summaries.append(int_summaries[count])
            sorted_texts.append(int_texts[count])
        
# Compare lengths to ensure they match
print(len(sorted_summaries))
print(len(sorted_texts))
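
The nested loop above rescans every review once per candidate length. One possible alternative, sketched here under the assumption that only the final order matters, filters once and then applies a stable sort by text length. `filter_and_sort` is a hypothetical name and the toy ids are made up; the default limits repeat the values used above.

```python
def filter_and_sort(int_summaries, int_texts, unk_id, min_length=2,
                    max_text_length=84, max_summary_length=13,
                    unk_text_limit=1, unk_summary_limit=0):
    """Keep pairs that pass the length and UNK limits, ordered by text length."""
    kept = []
    for summary, text in zip(int_summaries, int_texts):
        if (min_length <= len(summary) <= max_summary_length
                # range(..., max_text_length) in the loop above excludes the upper bound
                and min_length <= len(text) < max_text_length
                and summary.count(unk_id) <= unk_summary_limit
                and text.count(unk_id) <= unk_text_limit):
            kept.append((summary, text))
    # list.sort is stable, so pairs with equal text length keep their original order
    kept.sort(key=lambda pair: len(pair[1]))
    return [s for s, _ in kept], [t for _, t in kept]

UNK = 0
toy_summaries = [[5, 6], [7, 8], [9], [4, 0], [3, 2]]
toy_texts = [[1, 2, 3], [1, 2], [1, 2], [1, 2, 3], [1, 0, 2, 3, 4]]
demo_summaries, demo_texts = filter_and_sort(toy_summaries, toy_texts, UNK,
                                             max_text_length=5)
print(demo_summaries)
print(demo_texts)
```

On this toy input the third pair is dropped for a too-short summary, the fourth for an UNK in the summary, and the fifth for a text that reaches the length limit.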

5. Building the Seq2Seq model

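This section uses a variant of the RNN, the bidirectional RNN. Its improvement is the assumption that the current output (the output at step t) depends not only on the preceding elements of the sequence but also on the following ones.

For example, predicting a missing word in a sentence requires the context on both sides. A bidirectional RNN is relatively simple: it consists of two RNNs stacked on top of each other, one reading the sequence forward and one reading it backward, and the output is determined by the hidden states of both.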


Bidirectional RNNs
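
To see why the output at one step can depend on later inputs, the toy sketch below runs a scalar tanh RNN in both directions. The weights `w_in` and `w_rec` are arbitrary values assumed for illustration, and both directions share them for brevity, whereas a real bidirectional layer learns separate weights for each direction.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Scalar tanh RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def run_bidirectional(inputs):
    """Pair each forward state with the backward state for the same time step."""
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])[::-1]  # read right-to-left, then realign
    return list(zip(forward, backward))

a = run_bidirectional([1.0, 2.0, 3.0])
b = run_bidirectional([1.0, 2.0, -3.0])   # only the last input differs

print(a[0][0] == b[0][0])   # forward state at t=0 cannot see the future
print(a[0][1] == b[0][1])   # backward state at t=0 depends on it
```

Changing only the last input leaves the forward state at the first step untouched but changes the backward state there, which is the extra context a bidirectional encoder provides.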
5.1 Input layer
5.1.1 Setting up the model inputs: creating placeholders for the inputs to the model
def model_inputs():
    '''Create placeholders for inputs to the model'''
    
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
    max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
    text_length = tf.placeholder(tf.int32, (None,), name='text_length')

    return input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length
5.1.2 Inserting <GO> to ease batching and training
def process_encoding_input(target_data, vocab_to_int, batch_size):
    '''Remove the last word id from each sequence in the batch and concat <GO> to the beginning of each sequence'''
    
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input
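
The effect of this slicing and concatenation is easy to check on plain lists. The sketch below is a list-based stand-in for the tensor operations; `add_go_drop_last` is a hypothetical name and the ids are made up.

```python
def add_go_drop_last(target_batch, go_id):
    """Prepend <GO> and drop the last id of every row (the decoder's training input)."""
    return [[go_id] + row[:-1] for row in target_batch]

GO, EOS, PAD = 5, 4, 3            # toy ids, assumed for illustration
targets = [[10, 11, EOS], [12, EOS, PAD]]
dec_input = add_go_drop_last(targets, GO)
print(dec_input)
```

Each row keeps its length: at step t the decoder is fed the target token from step t-1, starting from `<GO>`.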
5.2 Encoding layer
5.2.1 Creating the encoding layer
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    '''Create the encoding layer: a bidirectional RNN, i.e. one built from two RNNs'''
    
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                    input_keep_prob = keep_prob)

            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                    input_keep_prob = keep_prob)

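            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                    cell_bw, 
                                                                    rnn_inputs,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
            # Join the forward and backward outputs (this is a bidirectional RNN) and feed
            # them to the next layer; otherwise every layer would read the raw inputs and
            # only the last layer would contribute to the result
            rnn_inputs = tf.concat(enc_output, 2)

    enc_output = rnn_inputs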
    
    return enc_output, enc_state
5.2.2 Training decoding layer
def training_decoding_layer(dec_embed_input, summary_length, dec_cell, initial_state, output_layer, 
                            vocab_size, max_summary_length):
    '''Create the training logits.
       Logits are unnormalized scores, typically the input to the softmax layer, so they match the labels along the batch and time dimensions'''
    
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                        sequence_length=summary_length,
                                                        time_major=False)

    training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                       training_helper,
                                                       initial_state,
                                                       output_layer) 

    training_logits, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                           output_time_major=False,
                                                           impute_finished=True,
                                                           maximum_iterations=max_summary_length)
    return training_logits
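
Since logits are unnormalized scores, a softmax is what turns them into a probability distribution over the vocabulary. A minimal pure-Python sketch, with toy logits assumed for illustration:

```python
import math

def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                   # one score per vocabulary entry (toy values)
probs = softmax(logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_id)
```

The largest logit always receives the largest probability, so taking the argmax of the logits or of the probabilities selects the same word id.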
5.2.3 Inference decoding layer
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer,
                             max_summary_length, batch_size):
    '''Create the inference logits'''
    
    start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')
    
    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                                start_tokens,
                                                                end_token)
                
    inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                        inference_helper,
                                                        initial_state,
                                                        output_layer)
                
    inference_logits, _ = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                            output_time_major=False,
                                                            impute_finished=True,
                                                            maximum_iterations=max_summary_length)
    
    return inference_logits
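
At inference time there is no target sequence, so each predicted word is fed back in as the next input. The loop below sketches that greedy strategy in plain Python. `greedy_decode`, `toy_step`, and the score table are hypothetical stand-ins for the decoder cell and output layer, not the behaviour of any specific TensorFlow class.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Feed each predicted id back in until end_id appears or max_len is reached."""
    token, state, output = start_id, None, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == end_id:
            break
        output.append(token)
    return output

# Hypothetical stand-in for the decoder: a fixed table of next-token scores
GO, EOS = 0, 1
TABLE = {GO: [0.0, 0.1, 0.9, 0.0],   # after <GO>  the best next id is 2
         2:  [0.0, 0.2, 0.1, 0.7],   # after id 2  the best next id is 3
         3:  [0.0, 0.8, 0.1, 0.1]}   # after id 3  the best next id is <EOS>

def toy_step(token, state):
    return TABLE[token], state

decoded = greedy_decode(toy_step, GO, EOS, max_len=10)
print(decoded)
```

The `max_len` argument plays the role of `maximum_iterations` above: it bounds the output even if the end token is never produced.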
5.3 Decoding layer
def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length, 
                   max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
    '''Create the decoding cell and attention for the training and inference decoding layers'''
    
    for layer in range(num_layers):
        with tf.variable_scope('decoder_{}'.format(layer)):
            lstm = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, 
                                                     input_keep_prob = keep_prob)
    
    output_layer = Dense(vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    
    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                  enc_output,
                                                  text_length,
                                                  normalize=False,
                                                  name='BahdanauAttention')

    dec_cell = tf.contrib.seq2seq.DynamicAttentionWrapper(dec_cell,
                                                          attn_mech,
                                                          rnn_size)
            
    initial_state = tf.contrib.seq2seq.DynamicAttentionWrapperState(enc_state[0],
                                                                    _zero_state_tensors(rnn_size, 
                                                                                        batch_size, 
                                                                                        tf.float32)) 
    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input, 
                                                  summary_length, 
                                                  dec_cell, 
                                                  initial_state,
                                                  output_layer,
                                                  vocab_size, 
                                                  max_summary_length)
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings,  
                                                    vocab_to_int['<GO>'], 
                                                    vocab_to_int['<EOS>'],
                                                    dec_cell, 
                                                    initial_state, 
                                                    output_layer,
                                                    max_summary_length,
                                                    batch_size)

    return training_logits, inference_logits
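
The attention mechanism used here is of the Bahdanau (additive) kind: each encoder output is scored against the current decoder state, and the softmaxed scores weight a sum of the encoder outputs. The pure-Python sketch below illustrates the idea with tiny hand-picked matrices. `w_q`, `w_k`, and `v` are learned parameters in a real layer, and all values here are assumptions for illustration.

```python
import math

def matvec(m, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in m]

def additive_attention(query, keys, w_q, w_k, v):
    """Scores e_i = v . tanh(W_q q + W_k k_i), softmaxed into attention weights."""
    q_proj = matvec(w_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(w_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(a * b for a, b in zip(v, hidden)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder outputs
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(len(keys[0]))]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one vector per input step
decoder_state = [1.0, 0.0]
weights, context = additive_attention(decoder_state, encoder_outputs,
                                       identity, identity, v=[1.0, 0.0])
print(weights)
print(context)
```

With these toy parameters the encoder output most aligned with the decoder state gets the largest weight, so the decoder is no longer limited to the single final encoder state.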
5.4 Assembling the Seq2Seq model
def seq2seq_model(input_data, target_data, keep_prob, text_length, summary_length, max_summary_length, 
                  vocab_size, rnn_size, num_layers, vocab_to_int, batch_size):
    '''Use the previous functions to create the training and inference logits'''
    
    # Use Numberbatch's embeddings and the newly created ones as our embeddings
    embeddings = word_embedding_matrix  # the embedding matrix built in 4.3
    
    enc_embed_input = tf.nn.embedding_lookup(embeddings, input_data)
    enc_output, enc_state = encoding_layer(rnn_size, text_length, num_layers, enc_embed_input, keep_prob)
    
    dec_input = process_encoding_input(target_data, vocab_to_int, batch_size)
    dec_embed_input = tf.nn.embedding_lookup(embeddings, dec_input)
    
    training_logits, inference_logits  = decoding_layer(dec_embed_input, 
                                                        embeddings,
                                                        enc_output,
                                                        enc_state, 
                                                        vocab_size, 
                                                        text_length, 
                                                        summary_length, 
                                                        max_summary_length,
                                                        rnn_size, 
                                                        vocab_to_int, 
                                                        keep_prob, 
                                                        batch_size,
                                                        num_layers)
    
    return training_logits, inference_logits
5.5 Batching the sentences
5.5.1 Padding sentences so that they all have the same length
def pad_sentence_batch(sentence_batch):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length
    """
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]
5.5.2 Batching the summaries, the texts, and the lengths of their sentences
def get_batches(summaries, texts, batch_size):
    """Batch summaries, texts, and the lengths of their sentences together"""
    for batch_i in range(0, len(texts)//batch_size):
        start_i = batch_i * batch_size
        summaries_batch = summaries[start_i:start_i + batch_size]
        texts_batch = texts[start_i:start_i + batch_size]
        pad_summaries_batch = np.array(pad_sentence_batch(summaries_batch))
        pad_texts_batch = np.array(pad_sentence_batch(texts_batch))
        
        # Need the lengths for the _lengths parameters
        pad_summaries_lengths = []
        for summary in pad_summaries_batch:
            pad_summaries_lengths.append(len(summary))
        
        pad_texts_lengths = []
        for text in pad_texts_batch:
            pad_texts_lengths.append(len(text))
        
        yield pad_summaries_batch, pad_texts_batch, pad_summaries_lengths, pad_texts_lengths
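
Note that `get_batches` measures the lengths after padding, so every entry equals the padded length of its batch. If the unpadded lengths are wanted instead, for example so that a sequence mask excludes `<PAD>` positions, one possible variant records them before padding. This is a sketch rather than the tutorial's method, with hypothetical helper names and toy ids.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sentence to the length of the longest one in the batch."""
    longest = max(len(s) for s in batch)
    return [s + [pad_id] * (longest - len(s)) for s in batch]

def batch_with_lengths(batch, pad_id):
    """Return the padded batch together with each sentence's length before padding."""
    lengths = [len(s) for s in batch]
    return pad_batch(batch, pad_id), lengths

PAD = 3                                   # toy id for <PAD>
padded, lengths = batch_with_lengths([[7, 8], [9, 10, 11, 12], [13]], PAD)
print(padded)
print(lengths)
```

Because the data was sorted by text length beforehand, sentences within one batch have similar lengths and little padding is needed either way.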
5.5 Setting the hyperparameters
# Set the Hyperparameters
epochs = 100
batch_size = 64
rnn_size = 256
num_layers = 2
learning_rate = 0.005
keep_probability = 0.75
5.6 Building the computation graph that the model needs in TensorFlow
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training
with train_graph.as_default():
    
    # Load the model inputs    
    input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length = model_inputs()

    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                      targets, 
                                                      keep_prob,   
                                                      text_length,
                                                      summary_length,
                                                      max_summary_length,
                                                      len(vocab_to_int)+1,
                                                      rnn_size, 
                                                      num_layers, 
                                                      vocab_to_int,
                                                      batch_size)
    
    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits.rnn_output, 'logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')
    
    # Create the weights for sequence_loss
    masks = tf.sequence_mask(summary_length, max_summary_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer: use the lr placeholder so that the decayed rate fed at each step takes effect
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
print("Graph is built.")
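
Two ideas in this graph are worth spelling out: the mask keeps padded positions out of the loss, and clipping bounds each gradient component. The pure-Python sketch below mirrors both on toy numbers. The function names echo the TensorFlow ones only for readability, and the loss here is a plain masked average that leaves out any small stabilizing constants a library implementation may add.

```python
import math

def sequence_mask(lengths, max_len):
    """1.0 for real positions, 0.0 for padding."""
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]

def masked_cross_entropy(probs, targets, mask):
    """Average negative log-likelihood over the unmasked positions only."""
    total, count = 0.0, 0.0
    for p_row, t_row, m_row in zip(probs, targets, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += -math.log(p[t]) * m
            count += m
    return total / count

def clip_by_value(grads, low=-5.0, high=5.0):
    """Clamp every gradient component into [low, high]."""
    return [min(max(g, low), high) for g in grads]

# One sequence of true length 2 padded to 3 steps, vocabulary of size 2 (toy numbers)
probs = [[[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]]
targets = [[0, 1, 0]]
mask = sequence_mask([2], 3)
loss = masked_cross_entropy(probs, targets, mask)
clipped = clip_by_value([-12.0, 0.3, 7.5])
print(mask, round(loss, 4), clipped)
```

The third position is padding, so its prediction does not influence the loss no matter how good or bad it is.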

6. Training the network

6.1 Training on a subset of the data
# Subset the data for training
start = 200000
end = start + 50000
sorted_summaries_short = sorted_summaries[start:end]
sorted_texts_short = sorted_texts[start:end]
print("The shortest text length:", len(sorted_texts_short[0]))  # The shortest text length: 25
print("The longest text length:",len(sorted_texts_short[-1]))  # The longest text length: 31
6.2 Training the model
# Train the Model
learning_rate_decay = 0.95
min_learning_rate = 0.0005
display_step = 20 # Check training loss after every 20 batches
stop_early = 0 
stop = 3 # If the update loss does not decrease in 3 consecutive update checks, stop training
per_epoch = 3 # Make 3 update checks per epoch
update_check = (len(sorted_texts_short)//batch_size//per_epoch)-1

update_loss = 0 
batch_loss = 0
summary_update_loss = [] # Record the update losses for saving improvements in the model

checkpoint = "best_model.ckpt" 
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    
    # If we want to continue training a previous session
    #loader = tf.train.import_meta_graph("./" + checkpoint + '.meta')
    #loader.restore(sess, checkpoint)
    
    for epoch_i in range(1, epochs+1):
        update_loss = 0
        batch_loss = 0
        for batch_i, (summaries_batch, texts_batch, summaries_lengths, texts_lengths) in enumerate(
                get_batches(sorted_summaries_short, sorted_texts_short, batch_size)):
            start_time = time.time()
            _, loss = sess.run(
                [train_op, cost],
                {input_data: texts_batch,
                 targets: summaries_batch,
                 lr: learning_rate,
                 summary_length: summaries_lengths,
                 text_length: texts_lengths,
                 keep_prob: keep_probability})

            batch_loss += loss
            update_loss += loss
            end_time = time.time()
            batch_time = end_time - start_time

            if batch_i % display_step == 0 and batch_i > 0:
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(sorted_texts_short) // batch_size, 
                              batch_loss / display_step, 
                              batch_time*display_step))
                batch_loss = 0

            if batch_i % update_check == 0 and batch_i > 0:
                print("Average loss for this update:", round(update_loss/update_check,3))
                summary_update_loss.append(update_loss)
                
                # If the update loss is at a new minimum, save the model
                if update_loss <= min(summary_update_loss):
                    print('New Record!') 
                    stop_early = 0
                    saver = tf.train.Saver() 
                    saver.save(sess, checkpoint)

                else:
                    print("No Improvement.")
                    stop_early += 1
                    if stop_early == stop:
                        break
                update_loss = 0
            
                    
        # Reduce learning rate, but not below its minimum value
        learning_rate *= learning_rate_decay
        if learning_rate < min_learning_rate:
            learning_rate = min_learning_rate
        
        if stop_early == stop:
            print("Stopping Training.")
            break
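
The two control mechanisms in this loop, learning-rate decay with a floor and early stopping after three checks without improvement, can be simulated without TensorFlow. In the sketch below, `decayed_rates` and `should_stop` are hypothetical helpers and the loss values are made up; the constants repeat the ones used above.

```python
def decayed_rates(initial, decay, minimum, epochs):
    """Learning rate used in each epoch: multiply by `decay`, never below `minimum`."""
    rates, lr = [], initial
    for _ in range(epochs):
        rates.append(lr)
        lr = max(lr * decay, minimum)
    return rates

def should_stop(update_losses, patience=3):
    """True once `patience` consecutive checks fail to match or beat the best loss so far."""
    best, misses = float("inf"), 0
    for loss in update_losses:
        if loss <= best:
            best, misses = loss, 0
        else:
            misses += 1
            if misses == patience:
                return True
    return False

rates = decayed_rates(0.005, 0.95, 0.0005, epochs=100)
print(rates[1], rates[-1])
print(should_stop([3.0, 2.5, 2.6, 2.7, 2.55]))
```

With a decay of 0.95 per epoch, the rate reaches its floor of 0.0005 after 45 epochs and stays there for the rest of training.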

7. Testing the model

7.1 Preparing the text for the model
def text_to_seq(text):
    '''Prepare the text for the model'''
    text = clean_text(text)
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]
7.2 Feeding in a text and testing
# Create your own review or use one from the dataset
#input_sentence = "I have never eaten an apple before, but this red one was nice. \
                  #I think that I will try a green apple next time."
#text = text_to_seq(input_sentence)
random = np.random.randint(0,len(clean_texts))
input_sentence = clean_texts[random]
text = text_to_seq(clean_texts[random])

checkpoint = "./best_model.ckpt"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding from the summary
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

Result:
INFO:tensorflow:Restoring parameters from ./best_model.ckpt
Original Text: love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets

Text
Word Ids: [70595, 18808, 668, 45565, 51927, 51759, 32488, 13510, 32036, 59599, 11693, 444, 23335, 32036, 59599, 51927, 67316, 726, 24842, 50494, 48492, 1062, 44749, 38443, 42344, 67973, 14168, 7759, 5347, 29528, 58763, 18927, 17701, 20232, 47328]
Input Words: love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets

Summary
Word Ids: [70595, 28738]
Response Words: love it

Examples of reviews and summaries:
  • Review(1): The coffee tasted great and was at such a good price! I highly recommend this to everyone!
  • Summary(1): great coffee
  • Review(2): This is the worst cheese that I have ever bought! I will never buy it again and I hope you won't either!
  • Summary(2): omg gross gross
  • Review(3): love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets
  • Summary(3): love it