TensorFlow Quick Tutorial (12) - Having a Machine Write Shakespeare's Plays

High-Level Frameworks: TFLearn and Keras

In the previous lesson we looked at TensorFlow's high-level API wrappers, which let us build a DNN classifier for the MNIST handwritten-digit problem in just a few steps.

TensorFlow keeps pushing the Estimator API forward, but that is not the whole toolbox. Beyond TensorFlow's official APIs, we have other powerful tools such as TFLearn and Keras.

In this lesson we put the armory on display: we will see what powerful functionality is wrapped up for us by TFLearn, a high-level framework built specifically for TensorFlow, and by Keras, which runs across several backends including TensorFlow and CNTK.

Having a Machine Write Shakespeare's Plays

Earlier we briefly introduced RNNs, which are powerful at processing sequence data. An important advantage of RNNs over other networks is that, after learning from sequence data, they can generate new sequences on their own.
For example, train on the Three Hundred Tang Poems and an RNN can write poetry; train on the Linux kernel source and it can write C code (though it mostly won't compile).

Let's start with an example that writes Shakespearean drama automatically.
Before diving into the code, a few words of caution. Deep learning is demanding about data volume: generative training like this generally needs millions to tens of millions of training samples before the results get good. Train on just a few short stories and the generated fiction is bound to be incoherent; even a human can't learn to write poetry from only a handful of poems.
The other point is that as the training data grows, the demands on time and compute rise dramatically.
Take training on Shakespeare's plays: the corpus is not especially large, a bit over 160,000 lines, yet training on a CPU is not finished in an hour or two; think in units of days. For the image and video examples later in this series, CPU training measured in months would not be surprising.

So how complex is the code for this example that takes about a day to train? The core is only a dozen or so lines; with data handling and test code included, the total is roughly 50 lines.

from __future__ import absolute_import, division, print_function

import os
import pickle
from six.moves import urllib

import tflearn
from tflearn.data_utils import *

path = "shakespeare_input.txt"
char_idx_file = 'char_idx.pickle'

# Download the corpus on first run
if not os.path.isfile(path):
    urllib.request.urlretrieve("https://raw.githubusercontent.com/tflearn/tflearn.github.io/master/resources/shakespeare_input.txt", path)

maxlen = 25  # length of each training window, in characters

# Reuse the character-to-index mapping from a previous run if available
char_idx = None
if os.path.isfile(char_idx_file):
    print('Loading previous char_idx')
    char_idx = pickle.load(open(char_idx_file, 'rb'))

# Cut the text into semi-redundant sequences of maxlen characters (one-hot
# encoded); Y holds the character that follows each sequence
X, Y, char_idx = \
    textfile_to_semi_redundant_sequences(path, seq_maxlen=maxlen, redun_step=3,
                                         pre_defined_char_idx=char_idx)

pickle.dump(char_idx, open(char_idx_file,'wb'))

# Three stacked LSTM layers with dropout, ending in a softmax over all characters
g = tflearn.input_data([None, maxlen, len(char_idx)])
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512)
g = tflearn.dropout(g, 0.5)
g = tflearn.fully_connected(g, len(char_idx), activation='softmax')
g = tflearn.regression(g, optimizer='adam', loss='categorical_crossentropy',
                       learning_rate=0.001)

m = tflearn.SequenceGenerator(g, dictionary=char_idx,
                              seq_maxlen=maxlen,
                              clip_gradients=5.0,
                              checkpoint_path='model_shakespeare')

# Train for 50 epochs, sampling generated text after each epoch
for i in range(50):
    seed = random_sequence_from_textfile(path, maxlen)
    m.fit(X, Y, validation_set=0.1, batch_size=128,
          n_epoch=1, run_id='shakespeare')
    print("-- TESTING...")
    print("-- Test with temperature of 1.0 --")
    print(m.generate(600, temperature=1.0, seq_seed=seed))
    print("-- Test with temperature of 0.5 --")
    print(m.generate(600, temperature=0.5, seq_seed=seed))

The example above requires the TFLearn framework, which can be installed with:

pip install tflearn

TFLearn is a high-level API framework built specifically for TensorFlow.
The main benefit of the TFLearn API is readability. Take the core code we just saw:

g = tflearn.input_data([None, maxlen, len(char_idx)])
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512)
g = tflearn.dropout(g, 0.5)
g = tflearn.fully_connected(g, len(char_idx), activation='softmax')
g = tflearn.regression(g, optimizer='adam', loss='categorical_crossentropy',
                       learning_rate=0.001)

m = tflearn.SequenceGenerator(g, dictionary=char_idx,
                              seq_maxlen=maxlen,
                              clip_gradients=5.0,
                              checkpoint_path='model_shakespeare')

From the input data, the network runs through three LSTM layers interleaved with three dropout layers, and ends with a fully connected softmax layer.
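
If you are curious what textfile_to_semi_redundant_sequences produces, here is a minimal NumPy sketch of the same idea (the real helper has more options; this is an illustration, not its actual implementation):

import numpy as np

text = open("shakespeare_input.txt").read()
chars = sorted(set(text))
char_idx = {c: i for i, c in enumerate(chars)}

maxlen, step = 25, 3
# Slide a maxlen-character window over the text, advancing step characters
# each time; the windows are "semi-redundant" because they overlap
sequences = [text[i:i + maxlen] for i in range(0, len(text) - maxlen, step)]
next_chars = [text[i + maxlen] for i in range(0, len(text) - maxlen, step)]

# One-hot encode: X has shape (num_windows, maxlen, num_chars);
# Y marks the character that follows each window
X = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
Y = np.zeros((len(sequences), len(chars)), dtype=bool)
for i, seq in enumerate(sequences):
    for t, c in enumerate(seq):
        X[i, t, char_idx[c]] = 1
    Y[i, char_idx[next_chars[i]]] = 1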

Next, look at the structure of a network that predicts survival probability for Titanic passengers:

# Build neural network
net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)

# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
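
Once trained, the model can be queried for class probabilities directly. A hypothetical usage sketch (the feature values below are made up for illustration; real input must go through the same preprocessing as the training data):

# Hypothetical preprocessed passenger record with the 6 expected features
passenger = [[1, 0, 29.0, 0, 0, 211.34]]
pred = model.predict(passenger)
print("Survival probability: %.3f" % pred[0][1])  # index 1 = the "survived" class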

Now let's look at the play generated after the first training epoch. At this stage it is admittedly rough, but the basic shape is there:

THAISA:
Why, sir, say if becel; sunthy alot but of
coos rytermelt, buy -
bived with wond I saTt fas,'? You and grigper.

FIENDANS:
By my wordhand!

KING RECENTEN:
Wish sterest expeun The siops so his fuurs,
And emour so, ane stamn.
she wealiwe muke britgie; I dafs tpichicon, bist,
Turch ose be fast wirpest neerenler.

NONTo:
So befac, sels at, Blove and rackity;
The senent stran spard: and, this not you so the wount
hor hould batil's toor wate
What if a poostit's of bust contot;
Whit twetemes, Game ifon I am
Ures the fast to been'd matter:
To and lause. Tiess her jittarss,
Let concertaet ar: and not!
Not fearle her g

Here is the result after 10 epochs of training:

PEMBROKE:
There tell the elder pieres,
Would our pestilent shapeing sebaricity. So have partned in me, Project of Yorle
again, and then when you set man
make plash'd of her too sparent
upon this father be dangerous puny or house;
Born is now been left of himself,
This true compary nor no stretches, back that
Horses had hand or question!

POLIXENES:
I have unproach the strangest
padely carry neerful young Yir,
Or hope not fall-a a cause of banque.

JESSICA:
He that comes to find the just,
And eyes gold, substrovious;
Yea pity a god on a foul rioness, these tebles and purish new head meet again?

The result after 20 epochs:

y prison,
Fatal and ominous children and the foot, it will
hear with you: it is my pace comprite
To come my soldiers, if I were dread,
Of breath as what I charge with I well;
Her palace and every tailor, the house of wondrous sweet mark!

STANLEY:
Take that spirit, thou hast
'no whore he did eyes, and what men damned, and
I had evils; by lap, or so,
But wholow'st thy report subject,
Had my rabble against thee;
And no rassians which he secure
of genslications; when I have move undertake-inward, into Bertounce;
Upon a shift, meet as we are. He beggars thing
Have for it will, but joy with the minute cannot whom we prarem
-- Test with temperature of 0.5 --
y prison,
Fatal and ominous rein here,
The princess have all to prince and the marsh of his company
To prove brother in the world,
And we the forest prove more than the heavens on the false report of the fools,
Depose the body of my wits.

DUKE SENIOR:
The night better appelled my part.

ANGELO:
Care you that may you understand your grace
I may speak of a point, and seems as in the heart
Who be deeds to show and sale for so
unhouses me of her soul, and the heart of them from the corder black to stand about up.

CLAUDIO:
The place of the world shall be married his love.

Starting Smaller: Generating City Names

Your Shakespeare model is presumably still training, so while we wait, let's examine the generation process through a simpler example.
Again taking TFLearn's official examples, we read a list of major US city names and generate some new ones.

Take the cities starting with Z as a sample:

Zachary
Zafra
Zag
Zahl
Zaleski
Zalma
Zama
Zanesfield
Zanesville
Zap
Zapata
Zarah
Zavalla
Zearing
Zebina
Zebulon
Zeeland
Zeigler
Zela
Zelienople
Zell
Zellwood
Zemple
Zena
Zenda
Zenith
Zephyr
Zephyr Cove
Zephyrhills
Zia Pueblo
Zillah
Zilwaukee
Zim
Zimmerman
Zinc
Zion
Zionsville
Zita
Zoar
Zolfo Springs
Zona
Zumbro Falls
Zumbrota
Zuni
Zurich
Zwingle
Zwolle

There are 20,580 cities in total, so this trains much faster: on a pure CPU, one epoch takes about 5 to 6 minutes.

The code below is nearly identical to the Shakespeare example above:

from __future__ import absolute_import, division, print_function

import io
import os
from six import moves
import ssl

import tflearn
from tflearn.data_utils import *

path = "US_Cities.txt"
if not os.path.isfile(path):
    # Disable certificate verification for the download; note that
    # urlretrieve itself does not accept a context argument
    ssl._create_default_https_context = ssl._create_unverified_context
    moves.urllib.request.urlretrieve("https://raw.githubusercontent.com/tflearn/tflearn.github.io/master/resources/US_Cities.txt", path)

maxlen = 20

string_utf8 = io.open(path, "r", encoding='utf-8').read()
X, Y, char_idx = \
    string_to_semi_redundant_sequences(string_utf8, seq_maxlen=maxlen, redun_step=3)

g = tflearn.input_data(shape=[None, maxlen, len(char_idx)])
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512)
g = tflearn.dropout(g, 0.5)
g = tflearn.fully_connected(g, len(char_idx), activation='softmax')
g = tflearn.regression(g, optimizer='adam', loss='categorical_crossentropy',
                       learning_rate=0.001)

m = tflearn.SequenceGenerator(g, dictionary=char_idx,
                              seq_maxlen=maxlen,
                              clip_gradients=5.0,
                              checkpoint_path='model_us_cities')

for i in range(40):
    seed = random_sequence_from_string(string_utf8, maxlen)
    m.fit(X, Y, validation_set=0.1, batch_size=128,
          n_epoch=1, run_id='us_cities')
    print("-- TESTING...")
    print("-- Test with temperature of 1.2 --")
    print(m.generate(30, temperature=1.2, seq_seed=seed))
    print("-- Test with temperature of 1.0 --")
    print(m.generate(30, temperature=1.0, seq_seed=seed))
    print("-- Test with temperature of 0.5 --")
    print(m.generate(30, temperature=0.5, seq_seed=seed))

Here are the city names generated after the first epoch:

t and Shoot
Cuthbertd
Lettfrecv
El
Ceoneel Sutd
Sa

After the second epoch:

stle
Finchford
Finch Dasthond
madloogd
Wlaycoyarfw

After the third epoch:

averal
Cape Carteret
Acbiropa Heowar Sor Dittoy
Do

After the tenth epoch:

hoenchen
Schofield
Stcojos
Schabell
StcaKnerum Cri

After the twentieth epoch, it is starting to look promising:

Hill
Cherry Hills Village
Hillfood Pork
Hillbrook

After the thirtieth epoch, it has regressed a bit:

ckitat
Kline
Klondike
Klonsder
Klansburg
Dlandon
D

After the fortieth epoch:

Branch
Villages of Ocite
Sidaydaton
Sidway
Siddade

After the hundredth epoch:

Atlasburg
Atmautluak
Attion
Attul
Atta
Aque Creek

The blessing and the curse of tflearn.SequenceGenerator is that it encapsulates all the details, making it hard to see what happens behind the scenes.

The Temperature Parameter

In the results above we excerpted only one temperature, but the script actually prints samples at several temperatures. So what does this temperature mean?

Temperature controls how sharply the model's predicted probability distribution is peaked before sampling. A temperature above 1 flattens the distribution, so the generated text becomes more random and varied, possibly at the cost of coherence. A temperature of exactly 1 leaves the distribution unchanged. A temperature below 1 sharpens the distribution, so generation becomes more conservative and repeatable, to the point of possibly producing the same sentence every time.
For something as romantic as writing poetry, we usually favor a higher temperature; variation is what makes it fun, isn't it?
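
To see the effect concretely, here is a small sketch applying the same rescaling as the sample() function quoted at the end of this article to a toy three-character distribution:

import numpy as np

def apply_temperature(preds, temperature):
    # Rescale the log-probabilities by the temperature, then renormalize
    preds = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_preds = np.exp(preds)
    return exp_preds / np.sum(exp_preds)

p = [0.6, 0.3, 0.1]
print(apply_temperature(p, 1.2))  # flatter: approx. [0.56, 0.31, 0.13]
print(apply_temperature(p, 0.5))  # sharper: approx. [0.78, 0.20, 0.02]

At high temperature, rare characters get a real chance of being sampled; at low temperature, nearly all the probability mass concentrates on the most likely one.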

Below are the city names generated at three different temperatures:

-- Test with temperature of 1.2 --

Atlasburg
Atmautluak
Attion
Attul
Atta
Aque Creek
-- Test with temperature of 1.0 --

Atlasburg
Atmautluak
Attila
Attaville
Atteville
-- Test with temperature of 0.5 --

Atlasburg
Atmautluak
Attigua
Attinword
Attrove

A Cross-Backend High-Level API - Keras: Generating Nietzsche-Style Prose

Keras is an API that can run on top of several backends, including TensorFlow and Microsoft's CNTK.
It can be installed with:

pip install keras

Once TensorFlow is installed, Keras will pick it as the backend.
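
You can check which backend Keras picked:

from keras import backend as K
print(K.backend())  # e.g. 'tensorflow'

The backend can also be switched via the KERAS_BACKEND environment variable or the "backend" field in the ~/.keras/keras.json config file.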

Let's look at Keras's text-generation example as well. The official example generates Nietzsche-style sentences.

The core is just six statements:

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
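
The Sequential model stacks layers in order: a single 128-unit LSTM reads the window of one-hot encoded characters, a Dense layer projects its output to vocabulary size, and the softmax activation turns those scores into a probability distribution over the next character. compile() then ties the categorical cross-entropy loss to the RMSprop optimizer.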

The full code is below; give it a run. If Nietzsche doesn't interest you, swap in another text, but note, as the script's docstring says, that the corpus should stay above roughly 100k characters, ideally above 1M.

'''Example script to generate text from Nietzsche's writings.
At least 20 epochs are required before the generated text
starts sounding coherent.
It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.
If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''

from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io

path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])

How Text Generation Works - It Is Just Probability Prediction

TFLearn's wrapping is so thorough that the details are invisible, so let's look instead at the Keras code that implements the same functionality:

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()

As you can see, the essence is a call to model.predict that estimates which character is most likely to appear next given the current sequence:

preds = model.predict(x_pred, verbose=0)[0]

And the sample code reveals the principle behind the temperature value:

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
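
In formula form, the sample function does the following: if the model assigns probability p_i to character i, then sampling at temperature T draws from

    q_i = \frac{p_i^{1/T}}{\sum_j p_j^{1/T}}

since \exp(\log(p_i)/T) = p_i^{1/T}. As T approaches 0, q collapses onto the single most likely character; as T grows large, q approaches the uniform distribution, which is exactly why higher temperatures produce more varied text.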

Now that you understand these principles, grab different texts, try different temperature values, and enjoy your machine's creative journey!
