Building a Classical Chinese Poetry Generation Model with an RNN

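As we know, an RNN (recurrent neural network) predicts the next time step from its current state and the current input, while an LSTM (long short-term memory network) can also remember context that lies far from the current position.
Here we use such a predictive model to train a generator for classical Chinese poetry.
First we need a dataset of classical poems: the Complete Tang Poems, 34,646 poems in total. I uploaded the data file to my CSDN account; download it from there if you need it:
http://download.csdn.net/download/qq_34470213/10150761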

Training the model

1. Building the dictionary

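  • First we read the poetry collection, separate out each poem, and store the poems in a list; the length of the list then tells us how many poems there are.

Each poem has to be read from the file, which we do with the open function.

Every poem in the data file has the form (title:content), so we first call strip to remove surrounding whitespace and then split(":") to separate the title from the content. Only the content is needed here, so that is all we keep.

Note that some titles themselves contain a ":" character. For those lines split returns more than two parts, the two-variable unpacking raises an exception, and the line is skipped, because we cannot tell which part is the poem content.
What remains is the content of all the usable poems.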
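The parsing rule described above can be sketched in isolation. This is only an illustration: `parse_line` is a hypothetical helper that is not part of the original code, and the sample lines are made up. It rejects malformed lines explicitly instead of relying on an exception.

```python
def parse_line(line):
    """Return the poem content of a 'title:content' line, or None if malformed."""
    parts = line.strip().split(':')
    if len(parts) != 2:
        # zero or several ':' characters: title and content cannot be told apart
        return None
    # drop spaces inside the content, as the article's code does
    return parts[1].replace(' ', '')


print(parse_line('Title:abc def\n'))   # well-formed line
print(parse_line('Ti:tle:abcdef\n'))   # ':' inside the title, rejected
```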

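To mark where a poem begins and ends, we prepend the character "[" and append the character "]". During training these symbols serve as the start and end states.
All poem contents are appended to the list, whose length is the total number of Tang poems.

Code: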

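poetry_file = 'poetry.txt'  # data file, one "title:content" entry per line

poetrys = []
with open(poetry_file, "r", encoding='utf-8') as f:
    for line in f:
        try:
            # unpacking fails (and the line is skipped) unless there is exactly one ':'
            title, content = line.strip().split(':')
            content = content.replace(' ', '')
            # skip poems that contain annotations or other special symbols
            if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content:
                continue
            # skip poems that are too short or too long
            if len(content) < 5 or len(content) > 79:
                continue
            content = '[' + content + ']'
            poetrys.append(content)
        except Exception as e:
            pass

# sort the poems by length
poetrys = sorted(poetrys, key=lambda line: len(line))
print('Total number of Tang poems: ', len(poetrys))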
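  • Once we have all the poem contents, we can encode every character. This gives the encoded form of every poem, which is what we feed to the neural network for training.

To do that we count how many times each character occurs across all poems. collections.Counter walks over every element of a list and returns a dictionary-like object that maps each element to its number of occurrences.

We encode only the characters that matter for training. First we sort the entries by occurrence count from high to low with sorted, whose key argument defines the sort order: key=lambda x: x[1] sorts by the second item (the count) in ascending order, so we use key=lambda x: -x[1] to sort from largest to smallest.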
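A small sketch of the counting and sorting step, using an ASCII string in place of the poems (the sample text is an assumption for illustration only):

```python
import collections

# count how often each character occurs
counter = collections.Counter("abracadabra")

# key=lambda x: x[1] sorts by count, smallest first
ascending = sorted(counter.items(), key=lambda x: x[1])

# negating the count sorts by count, largest first
descending = sorted(counter.items(), key=lambda x: -x[1])

print(ascending)
print(descending)
```

Because sorted is stable, characters with the same count keep the order in which they were first seen.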

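We then pick the characters to encode and number them starting from 0. After sorting we have tuples of (character, count), and we only need the characters, which zip can separate out:
zip([1,2],[3,4],[5,6])
--> (1,3,5), (2,4,6)
zip(*[(1,2),(3,4),(5,6)])
--> (1,3,5), (2,4,6)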
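Applied to sorted (character, count) pairs, the same unzip idiom looks like this. The pairs below are made-up sample data, not values from the poetry file:

```python
# sample (character, count) pairs, already sorted by count
count_pairs = [('a', 5), ('b', 2), ('c', 1)]

# zip(*pairs) turns a list of pairs into one tuple per position
words, counts = zip(*count_pairs)

print(words)
print(counts)
```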

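We keep the most frequent characters as the encoding dictionary and pair each of them with a number from 0 to len - 1.
dict(d): creates a dictionary; d must be a sequence of (key, value) tuples.
The result is a dictionary that maps each character to an integer code starting from 0.

Every character of every poem is then encoded, that is, looked up in the dictionary to find its number:
dict.get(key, default=None)
key -- the key to look up in the dictionary.
default -- the value returned when the key is not present.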
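A minimal sketch of the lookup with a fallback code, on a three-character toy vocabulary (the vocabulary and the `unknown_id` name are assumptions for illustration):

```python
words = ('a', 'b', 'c')

# map each character to its index: {'a': 0, 'b': 1, 'c': 2}
word_num_map = dict(zip(words, range(len(words))))

# characters missing from the dictionary share one extra code
unknown_id = len(words)

encoded = [word_num_map.get(ch, unknown_id) for ch in "abxc"]
print(encoded)
```

Here 'x' is not in the dictionary, so it receives the fallback code rather than raising a KeyError.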

Code:

import collections

all_words = []
for poetry in poetrys:
    all_words += [word for word in poetry]
counter = collections.Counter(all_words)

count_pairs = sorted(counter.items(), key=lambda x: -x[1])
words, _ = zip(*count_pairs)
leng = int(len(words)*0.9)
words = words[:leng]+(' ',)

word_num_map = dict(zip(words, range(len(words))))
to_num = lambda word: word_num_map.get(word, len(words))

poetrys_vector = [list(map(to_num, poetry)) for poetry in poetrys]

  • Training data

Each training step takes 64 poems, that is, 64 entries from the list, and assigns them to the input data x and the target data y. y is the correct answer used for training. (Note that because the model predicts the next character, y is simply x shifted left by one character.)
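The relation between x and y can be sketched with plain lists. `make_batch` and the padding value 0 are assumptions for illustration; the article's own code does the same thing with NumPy arrays and pads with the code of the space character.

```python
def make_batch(poems, pad_id):
    """Pad encoded poems to equal length and build next-character targets."""
    length = max(len(p) for p in poems)
    # x: every poem padded on the right to the longest poem in the batch
    x = [p + [pad_id] * (length - len(p)) for p in poems]
    # y: x shifted left by one; the last position keeps its own value
    y = [row[1:] + row[-1:] for row in x]
    return x, y


x, y = make_batch([[6, 2, 4, 6, 9], [1, 4, 2]], pad_id=0)
print(x)
print(y)
```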
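We define an RNN model and feed the data into it. Training with an RNN roughly consists of these steps:
1. Define the model and its structure.
2. Initialize the current state to zeros.
3. Convert the input IDs into word vectors.
4. Feed the input data and the initial state through the model to get its outputs.
5. Add a fully connected layer on top of the outputs to get the final output.
Training is repeated many times to obtain the final state and the final loss. In this example we run 50 epochs, and each epoch trains on every batch. With 34,646 poems and a batch size of 64 there are 541 batches (34646 // 64 = 541). The full code below computes the actual number of batches, n_chunk, from the poems that remain after filtering.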

for epoch in range(50):
    for batche in range(541):
        train(epoch, batche)
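The batch arithmetic behind these loop bounds, worked out with the numbers quoted above (the variable names `leftover` and `total_steps` are only for illustration):

```python
num_poems = 34646
batch_size = 64
epochs = 50

# integer division drops the incomplete last batch
n_chunk = num_poems // batch_size

# poems that do not fit into a full batch
leftover = num_poems - n_chunk * batch_size

# total number of training steps over all epochs
total_steps = epochs * n_chunk

print(n_chunk, leftover, total_steps)
```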

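Because the final output is the next character, the output layer needs one unit per possible character code, so its size is that of the vocabulary (len(words) + 1 in the code, where the extra slot is for characters outside the dictionary).

To guard against interruptions, save the model regularly.

Generating poems:
We use the network trained above to generate new poems. The main steps are:
Read the corpus file, count how often each character occurs, and use the counts to select and encode characters. This gives a dictionary between characters and codes that is used to convert in both directions.
Build the RNN network that maps input to output. It is written exactly as for training.
Load the saved model and use the trained weights to predict new data.
Loop, converting between codes and characters, until a poem is complete.

Full training code: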

import collections
import numpy as np
from tensorflow.contrib.legacy_seq2seq.python.ops.seq2seq import sequence_loss_by_example
import tensorflow as tf
import os

MODEL_SAVE_PATH = "./save/"
MODEL_NAME = "poetry.module"

# -------------------------------Data preprocessing---------------------------#

poetry_file = 'poetry.txt'

# Poetry collection
poetrys = []
with open(poetry_file, "r", encoding='utf-8', ) as f:
    for line in f:
        try:
            title, content = line.strip().split(':')
            content = content.replace(' ', '')
            if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content:
                continue
            if len(content) < 5 or len(content) > 79:
                continue
            content = '[' + content + ']'
            poetrys.append(content)
        except Exception as e:
            pass

poetrys = sorted(poetrys, key=lambda line: len(line))
print('Total number of Tang poems: ', len(poetrys))

all_words = []
for poetry in poetrys:
    all_words += [word for word in poetry]
counter = collections.Counter(all_words)
print(counter)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
print(count_pairs)
words, _ = zip(*count_pairs)
print(words)
print(len(words))
leng = int(len(words)*0.9)

words = words[:leng]+(' ',)
print(words)

word_num_map = dict(zip(words, range(len(words))))

to_num = lambda word: word_num_map.get(word, len(words))
poetrys_vector = [list(map(to_num, poetry)) for poetry in poetrys]
# [[314, 3199, 367, 1556, 26, 179, 680, 0, 3199, 41, 506, 40, 151, 4, 98, 1],
# [339, 3, 133, 31, 302, 653, 512, 0, 37, 148, 294, 25, 54, 833, 3, 1, 965, 1315, 377, 1700, 562, 21, 37, 0, 2, 1253, 21, 36, 264, 877, 809, 1]
# ....]

# Train on 64 poems at a time
batch_size = 64
n_chunk = len(poetrys_vector) // batch_size
x_batches = []
y_batches = []

for i in range(n_chunk):
    start_index = i * batch_size
    end_index = start_index + batch_size

    batches = poetrys_vector[start_index:end_index]
    length = max(map(len, batches))
    xdata = np.full((batch_size, length), word_num_map[' '], np.int32)
    for row in range(batch_size):
        xdata[row, :len(batches[row])] = batches[row]
    ydata = np.copy(xdata)
    ydata[:, :-1] = xdata[:, 1:]
    """
    xdata             ydata
    [6,2,4,6,9]       [2,4,6,9,9]
    [1,4,2,8,5]       [4,2,8,5,5]
    """
    x_batches.append(xdata)
    y_batches.append(ydata)

# ---------------------------------------RNN--------------------------------------#

input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])


# Define the RNN
def neural_network(model='lstm', rnn_size=128, num_layers=2):
    if model == 'rnn':
        cell_fun = tf.nn.rnn_cell.BasicRNNCell
    elif model == 'gru':
        cell_fun = tf.nn.rnn_cell.GRUCell
    elif model == 'lstm':
        cell_fun = tf.nn.rnn_cell.BasicLSTMCell

    cell = cell_fun(rnn_size, state_is_tuple=True)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)

    initial_state = cell.zero_state(batch_size, tf.float32)

    with tf.variable_scope('rnnlm'):
        softmax_w = tf.get_variable("softmax_w", [rnn_size, len(words) + 1])
        softmax_b = tf.get_variable("softmax_b", [len(words) + 1])
        with tf.device("/cpu:0"):
            embedding = tf.get_variable("embedding", [len(words) + 1, rnn_size])
            inputs = tf.nn.embedding_lookup(embedding, input_data)

    outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, scope='rnnlm')
    output = tf.reshape(outputs, [-1, rnn_size])

    logits = tf.matmul(output, softmax_w) + softmax_b
    probs = tf.nn.softmax(logits)
    return logits, last_state, probs, cell, initial_state


# Training
def train_neural_network():
    logits, last_state, _, _, _ = neural_network()
    targets = tf.reshape(output_targets, [-1])
    loss = sequence_loss_by_example([logits], [targets], [tf.ones_like(targets, dtype=tf.float32)], len(words))
    cost = tf.reduce_mean(loss)
    learning_rate = tf.Variable(0.0, trainable=False)
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.apply_gradients(zip(grads, tvars))

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        # the saver must be created before saver.save is called below
        saver = tf.train.Saver()
        for epoch in range(50):
            sess.run(tf.assign(learning_rate, 0.002 * (0.97 ** epoch)))
            n = 0
            for batche in range(n_chunk):
                train_loss, _, _ = sess.run([cost, last_state, train_op],
                                            feed_dict={input_data: x_batches[n], output_targets: y_batches[n]})
                n += 1
                print(epoch, batche, train_loss)
            # save a checkpoint once at the end of every 7th epoch (0, 7, ..., 49)
            if epoch % 7 == 0:
                saver.save(sess, os.path.join(MODEL_SAVE_PATH, MODEL_NAME), global_step=epoch)

train_neural_network()
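The learning rate assigned at the start of each epoch in the training loop follows an exponential decay, 0.002 * 0.97 ** epoch. The short sketch below only tabulates that schedule; the helper name `learning_rate_at` is an assumption and does not appear in the training code.

```python
def learning_rate_at(epoch, base=0.002, decay=0.97):
    """Learning rate used in the given epoch: base * decay ** epoch."""
    return base * (decay ** epoch)


schedule = [learning_rate_at(epoch) for epoch in range(50)]

# print the rate at the epochs where a checkpoint is saved
for epoch in range(0, 50, 7):
    print(epoch, schedule[epoch])
```

By the last epoch the rate has dropped to a little under a quarter of its starting value.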

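After training finishes, the files that store the neural network model are written to the ./save/ directory.


Training took a little over ten hours on my laptop. If you would rather not train the model yourself, you can download my trained files and get the same results.
I put the final training results here: https://pan.baidu.com/s/1bIibbo password: ojs3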

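Generating verses with the model

To use the model we first have to load it.
A poem starts with the marker character "[", and we set the initial state to zero. Starting from there, we load the model and iterate to produce the whole poem. The end marker is "]"; once the model outputs it, the poem is complete, so we exit the loop and print the result.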
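At every step of this iteration the next character is drawn at random according to the probabilities the model outputs. The sketch below shows that sampling idea with the standard library only: build the cumulative sums of the weights, draw a uniform number, and locate it by binary search. `sample_index` is a hypothetical helper, and the weights are made-up values rather than model output.

```python
import bisect
import itertools
import random


def sample_index(weights):
    """Draw an index with probability proportional to its weight."""
    cumulative = list(itertools.accumulate(weights))
    # uniform draw in [0, total)
    threshold = random.random() * cumulative[-1]
    # first position whose cumulative weight reaches the threshold
    return bisect.bisect_left(cumulative, threshold)


random.seed(0)
draws = [sample_index([0.1, 0.0, 0.9]) for _ in range(1000)]
print(draws.count(0), draws.count(1), draws.count(2))
```

Sampling instead of always taking the most likely character is what lets the same model produce a different poem on every run.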

import collections
import numpy as np
import tensorflow as tf

#-------------------------------Data preprocessing---------------------------#

poetry_file ='poetry.txt'

# Poetry collection
poetrys = []
with open(poetry_file, "r", encoding='utf-8',) as f:
    for line in f:
        try:
            title, content = line.strip().split(':')
            content = content.replace(' ','')
            if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content:
                continue
            if len(content) < 5 or len(content) > 79:
                continue
            content = '[' + content + ']'
            poetrys.append(content)
        except Exception as e:
            pass

poetrys = sorted(poetrys,key=lambda line: len(line))
print('Total number of Tang poems: ', len(poetrys))

all_words = []
for poetry in poetrys:
    all_words += [word for word in poetry]
counter = collections.Counter(all_words)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
words, _ = zip(*count_pairs)

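# the vocabulary must be built exactly as in the training script,
# otherwise the restored weights will not match the variable shapes
leng = int(len(words)*0.9)
words = words[:leng] + (' ',)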
word_num_map = dict(zip(words, range(len(words))))
to_num = lambda word: word_num_map.get(word, len(words))
poetrys_vector = [ list(map(to_num, poetry)) for poetry in poetrys]
#[[314, 3199, 367, 1556, 26, 179, 680, 0, 3199, 41, 506, 40, 151, 4, 98, 1],
#[339, 3, 133, 31, 302, 653, 512, 0, 37, 148, 294, 25, 54, 833, 3, 1, 965, 1315, 377, 1700, 562, 21, 37, 0, 2, 1253, 21, 36, 264, 877, 809, 1]
#....]

batch_size = 1
n_chunk = len(poetrys_vector) // batch_size
x_batches = []
y_batches = []
for i in range(n_chunk):
    start_index = i * batch_size
    end_index = start_index + batch_size

    batches = poetrys_vector[start_index:end_index]
    length = max(map(len,batches))
    xdata = np.full((batch_size,length), word_num_map[' '], np.int32)
    for row in range(batch_size):
        xdata[row,:len(batches[row])] = batches[row]
    ydata = np.copy(xdata)
    ydata[:,:-1] = xdata[:,1:]
    """
    xdata             ydata
    [6,2,4,6,9]       [2,4,6,9,9]
    [1,4,2,8,5]       [4,2,8,5,5]
    """
    x_batches.append(xdata)
    y_batches.append(ydata)


#---------------------------------------RNN--------------------------------------#

input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])
# Define the RNN
def neural_network(model='lstm', rnn_size=128, num_layers=2):
    if model == 'rnn':
        cell_fun = tf.nn.rnn_cell.BasicRNNCell
    elif model == 'gru':
        cell_fun = tf.nn.rnn_cell.GRUCell
    elif model == 'lstm':
        cell_fun = tf.nn.rnn_cell.BasicLSTMCell

    cell = cell_fun(rnn_size, state_is_tuple=True)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)

    initial_state = cell.zero_state(batch_size, tf.float32)

    with tf.variable_scope('rnnlm'):
        softmax_w = tf.get_variable("softmax_w", [rnn_size, len(words)+1])
        softmax_b = tf.get_variable("softmax_b", [len(words)+1])
        with tf.device("/cpu:0"):
            embedding = tf.get_variable("embedding", [len(words)+1, rnn_size])
            inputs = tf.nn.embedding_lookup(embedding, input_data)

    outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, scope='rnnlm')
    output = tf.reshape(outputs,[-1, rnn_size])

    logits = tf.matmul(output, softmax_w) + softmax_b
    probs = tf.nn.softmax(logits)
    return logits, last_state, probs, cell, initial_state

#-------------------------------Generating poems---------------------------------#
# Use the trained model

def gen_poetry():
    def to_word(weights):
        t = np.cumsum(weights)
        s = np.sum(weights)
        sample = int(np.searchsorted(t, np.random.rand(1)*s))
        return words[sample]

    _, last_state, probs, cell, initial_state = neural_network()

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())

        saver = tf.train.Saver(tf.all_variables())
        saver.restore(sess, './save/poetry.module-49')

        state_ = sess.run(cell.zero_state(1, tf.float32))

        x = np.array([list(map(word_num_map.get, '['))])
        [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
        word = to_word(probs_)
        
        poem = ''
        word_biao = word
        while word != ']':
            poem += word_biao
            x = np.zeros((1,1))
            x[0,0] = word_num_map[word]
            [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
            word = to_word(probs_)
            word_biao =word
            if word_biao == '。':
                word_biao = '。\n'
            print(word_biao)
        
        return poem

print(gen_poetry())

Output:


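Writing acrostic poems

An acrostic poem differs from free composition in that the first character of every line is given, so each line has to start from the given character. We use a for loop to take the characters of the acrostic phrase one at a time and generate a line from each.
We first feed "[" to obtain the state state_, then feed the given character together with that state to predict the next character. In other words, the current input is the given character, the current state is the state state_ obtained from "[", and we compute the output and the next state.
The output becomes the next input and the new state becomes the current state, and we predict the following character.
This continues until the line reaches the required number of characters; a line that ends with "。" too early is generated again. Then we exit the loop and handle the next given character.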
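The control flow of this procedure can be sketched without TensorFlow. In the sketch, `make_line`, `max_tries`, and `toy_next_char` are all hypothetical: `toy_next_char` stands in for one prediction step of the trained model, and the retry limit is an addition that the article's code does not have.

```python
def make_line(head_char, line_type, next_char, max_tries=1000):
    """Build one line that starts with head_char and has 2 + 2 * line_type characters."""
    target_len = 2 + 2 * line_type  # two half-lines plus two punctuation marks
    for _ in range(max_tries):
        sentence = head_char
        while len(sentence) < target_len:
            ch = next_char(sentence[-1])
            sentence += ch
            if ch == '。':
                break  # the line ended; the check below decides whether to keep it
        if len(sentence) == target_len:
            return sentence
    return None  # no line of the right length within max_tries


def toy_next_char(prev):
    # hypothetical stand-in for the model: always continues with the same character
    return 'x'


poem = ''.join(make_line(ch, 5, toy_next_char) + '\n' for ch in 'ab')
print(poem)
```

With a real model, `next_char` would also carry the recurrent state from step to step, as the article's code does with state_.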

def gen_poetry_with_head_and_type(head, type):
    if type != 5 and type != 7:
        print('The second para has to be 5 or 7!')
        return

    def to_word(weights):
        t = np.cumsum(weights)
        s = np.sum(weights)
        sample = int(np.searchsorted(t, np.random.rand(1)*s))
        return words[sample]

    _, last_state, probs, cell, initial_state = neural_network()

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        saver = tf.train.Saver()
        saver.restore(sess, './save/poetry.module-35')
        poem = ''
        i = 0

        for the_word in head:
                flag = True
                while flag:
                    state_ = sess.run(cell.zero_state(1, tf.float32))
                    x = np.array([list(map(word_num_map.get, '['))])
                    [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})

                    sentence = the_word
                    x = np.zeros((1, 1))
                    x[0, 0] = word_num_map[sentence]
                    [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})

                    word = to_word(probs_)
                    sentence += word

                    while word!='。':
                        x = np.zeros((1, 1))
                        x[0, 0] = word_num_map[word]
                        [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
                        word = to_word(probs_)

                        sentence += word

                        if len(sentence) == 2 + 2 * type:
                            sentence += '\n'
                            poem += sentence
                            flag = False
                            break  # the line is complete, stop sampling

        return poem

print(gen_poetry_with_head_and_type("碧影江白", 7))

Output after processing:

