The code for this article is on GitHub.
1. RNN Text Classification Network Structure

2. RNNs in TensorFlow
TensorFlow offers both a static RNN and a dynamic RNN, and the two differ quite a bit. When working with RNNs in TensorFlow, keep the following points in mind:
- The static RNN generally requires padding all sentences to the same length, just as in TextCNN; the dynamic RNN is a bit more flexible: only the sentences within a single batch need to be the same length;
- The static RNN's inputs and outputs are lists of 2-D tensors; the dynamic RNN's inputs and outputs are 3-D tensors, one dimension fewer than in TextCNN;
- Building the static RNN's graph takes longer and uses more memory, but the graph keeps the intermediate state of every time step, which helps debugging; the dynamic RNN builds faster and uses less memory, but only the final state is kept.
This article uses the dynamic RNN for text classification; a minimal API comparison follows below.
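As a minimal sketch of the API difference (TensorFlow 1.x; the shapes and the toy sequence_length values here are made up for illustration): static_rnn wants a Python list of per-timestep [batch, dim] tensors, while dynamic_rnn takes a single [batch, time, dim] tensor and can honor per-example true lengths.

import tensorflow as tf

batch, steps, dim, hidden = 4, 10, 8, 16
x = tf.placeholder(tf.float32, [batch, steps, dim])

with tf.variable_scope('static'):
    # static_rnn: unstack the time axis into a list of `steps` tensors;
    # the graph is unrolled at build time
    inputs = tf.unstack(x, axis=1)
    outputs_list, state = tf.nn.static_rnn(tf.nn.rnn_cell.LSTMCell(hidden), inputs, dtype=tf.float32)

with tf.variable_scope('dynamic'):
    # dynamic_rnn: one 3-D tensor in, one 3-D tensor out
    outputs, state = tf.nn.dynamic_rnn(tf.nn.rnn_cell.LSTMCell(hidden), x,
                                       sequence_length=[10, 7, 5, 10], dtype=tf.float32)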
2.1 Data Preprocessing
First, strip punctuation from the text and tokenize it; then append each sentence's token list to the contents list and each label to the labels list.
import re
import codecs
import jieba

def read_file(filename):
    # keep runs of Chinese characters, letters, digits and a few symbols
    re_han = re.compile(u"([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)")
    contents, labels = [], []
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                line = line.rstrip()
                assert len(line.split('\t')) == 2
                label, content = line.split('\t')
                labels.append(label)
                blocks = re_han.split(content)
                word = []
                for blk in blocks:
                    if re_han.match(blk):
                        word.extend(jieba.lcut(blk))  # jieba word segmentation
                contents.append(word)
            except Exception:
                pass  # skip malformed lines
    return labels, contents
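A quick usage sketch (the file path is hypothetical; each corpus line is expected to be 'label<TAB>content', as the assert above checks):

labels, contents = read_file('./data/cnews.train.txt')  # hypothetical path
print(labels[0])        # e.g. '體育'
print(contents[0][:5])  # first five tokens of the first article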
Next, build the vocabulary and store the vocabulary words' vectors in a separate file. These words should carry some weight, so we rank by frequency and keep the top N; but before that, remove the stop words! After removing stop words, we take the top N words over the whole corpus (training, test, and validation sets combined) and treat them as the important ones. I kept the 9,999 most important words, saved in order. embeddings = np.zeros([10000, 100]) creates an embedding matrix of 10,000 words with dimension 100. The pretrained vectors of the 9,999 words are then copied, in order, into rows 1-9999 of this new matrix; row 0 is left as 100 zeros, reserved for padding.
import numpy as np
import codecs
from collections import Counter

def built_vocab_vector(filenames, voc_size=10000):
    '''
    Remove stop words, keep the top 9,999 words by frequency,
    and collect each word together with its pretrained vector.
    :param filenames: list of corpus files
    :param voc_size: vocabulary size (row 0 is reserved for <PAD>)
    :return:
    '''
    stopword = open('./data/stopwords.txt', 'r', encoding='utf-8')
    stop = [key.strip(' \n') for key in stopword]
    stopword.close()

    all_data = []
    j = 1
    embeddings = np.zeros([10000, 100])

    for filename in filenames:
        labels, content = read_file(filename)
        for eachline in content:
            line = []
            for i in range(len(eachline)):
                if str(eachline[i]) not in stop:  # drop stop words
                    line.append(eachline[i])
            all_data.extend(line)

    counter = Counter(all_data)
    count_pairs = counter.most_common(voc_size - 1)
    word, _ = list(zip(*count_pairs))

    f = codecs.open('./data/vector_word.txt', 'r', encoding='utf-8')
    vocab_word = open('./data/vocab_word.txt', 'w', encoding='utf-8')
    for ealine in f:
        item = ealine.split(' ')
        key = item[0]
        vec = np.array(item[1:], dtype='float32')
        if key in word:
            embeddings[j] = np.array(vec)
            vocab_word.write(key.strip('\r') + '\n')
            j += 1
    f.close()
    vocab_word.close()
    np.savez_compressed('./data/vector_word.npz', embeddings=embeddings)
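The saved matrix is what the model later reads as pm.pre_trianing (the misspelled attribute name matches the model code below). A minimal loading helper, assuming the path used above:

def get_word2vec(filename='./data/vector_word.npz'):
    data = np.load(filename)
    return data['embeddings']  # shape (10000, 100); row 0 is the all-zero <PAD> vector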
Then build the word dictionary, whose purpose is to let Chinese words be converted into sequences of integers.
def get_wordid(filename):
    key = open(filename, 'r', encoding='utf-8')
    wordid = {}
    wordid['<PAD>'] = 0  # id 0 is reserved for padding
    j = 1
    for w in key:
        w = w.strip('\n')
        w = w.strip('\r')
        wordid[w] = j
        j += 1
    key.close()
    return wordid
Next, turn the words in each sentence, as well as the labels, into integer sequences, converting the label values to one-hot form. read_category() builds the label dictionary, serving the same purpose as the word dictionary above.
def read_category():
    # the 10 news categories; labels stay in Chinese to match the dataset
    categories = ['體育', '財(cái)經(jīng)', '房產(chǎn)', '家居', '教育', '科技', '時(shí)尚', '時(shí)政', '游戲', '娛樂']
    cat_to_id = dict(zip(categories, range(len(categories))))
    return categories, cat_to_id
Next comes padding. Unlike the CNN pipeline, the standard approach here would be to find the longest sentence in each batch and pad per batch. But because individual articles are extremely long, running at full length is too heavy for my machine, so I capped the maximum length at 250 (long texts, I'm looking at you). So in practice this step pads all sentences to the same fixed length. The Chinese words are mapped to integers via the dictionary, and y_pad = kr.utils.to_categorical(label_id) converts the labels to one-hot form.
import tensorflow.contrib.keras as kr

def process(filename, word_to_id, cat_to_id, max_length=250):
    labels, contents = read_file(filename)
    data_id, label_id = [], []
    for i in range(len(contents)):
        data_id.append([word_to_id[x] for x in contents[i] if x in word_to_id])
        label_id.append(cat_to_id[labels[i]])
    # pad/truncate every sentence to max_length; the padding value is 0 (<PAD>)
    x_pad = kr.preprocessing.sequence.pad_sequences(data_id, max_length, padding='post', truncating='post')
    y_pad = kr.utils.to_categorical(label_id)
    return x_pad, y_pad
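A quick sanity check (hypothetical path; the validation set has 5,000 examples): x_pad is the padded id matrix and y_pad the one-hot labels over the 10 categories.

x_pad, y_pad = process('./data/cnews.val.txt', wordid, cat_to_id, max_length=250)
print(x_pad.shape)  # (5000, 250)
print(y_pad.shape)  # (5000, 10)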
Then generate the batches fed into the RNN model. np.random.permutation is used here to shuffle the indices.
def batch_iter(x, y, batch_size=64):
    data_len = len(x)
    x = np.array(x)
    num_batch = int((data_len - 1) / batch_size) + 1
    indices = np.random.permutation(np.arange(data_len))
    '''
    np.arange(4) = [0,1,2,3]
    np.random.permutation([1, 4, 9, 12, 15]) = [15, 1, 9, 4, 12]
    '''
    x_shuff = x[indices]
    y_shuff = y[indices]
    for i in range(num_batch):
        start_id = i * batch_size
        end_id = min((i + 1) * batch_size, data_len)
        yield x_shuff[start_id:end_id], y_shuff[start_id:end_id]
Finally, because of how the dynamic RNN works, we need to compute each sentence's true length and store the lengths in a list. Why the true length? Because it's useful!!!! Given the true sentence lengths, the dynamic RNN knows that everything beyond a sentence's true length is meaningless padding, and its outputs past that point are set to 0.
def sequence(x_batch):
    seq_len = []
    for line in x_batch:
        # nonzero ids map to 1 under np.sign, <PAD> (0) maps to 0,
        # so the row sum is the number of real tokens
        length = np.sum(np.sign(line))
        seq_len.append(length)
    return seq_len
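A tiny worked example of the sign trick:

row = np.array([[21, 7, 503, 0, 0]])  # 3 real tokens, 2 <PAD>s
print(sequence(row))                  # [3]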
2.2 The RNN Network
With the data preprocessed, we can now write the RNN network in TensorFlow. An RNN network starts from a cell; there are three common choices: BasicRNNCell, LSTMCell, and GRUCell.
Next, decide between a single layer and multiple layers, and between unidirectional and bidirectional; finally, static versus dynamic. This article uses a dynamic two-layer LSTM network, so the input is a 3-D tensor. dynamic_rnn returns two values, the outputs and the cell state; the outputs are also a 3-D tensor. One caveat with multi-layer RNNs: with a single layer, embedding_dim and hidden_dim can differ numerically, but with multiple layers the two must be equal, or an error is raised. The reason is that the code below reuses one cell object (and thus one set of weights) for every layer, so each layer must see the same input dimension; see the comments in rnn() below.
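If you do want embedding_dim and hidden_dim to differ, a common fix (a sketch, not the code used in this article; it belongs inside rnn(), where self.keep_prob is available) is to build a fresh cell per layer instead of reusing one cell object:

def lstm_cell():
    cell = tf.nn.rnn_cell.LSTMCell(pm.hidden_dim)
    return tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=self.keep_prob)

cells = [lstm_cell() for _ in range(pm.num_layers)]  # one independent cell per layer
Cell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=True)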
import tensorflow as tf

class RnnModel(object):

    def __init__(self):
        self.input_x = tf.placeholder(tf.int32, shape=[None, pm.seq_length], name='input_x')
        self.input_y = tf.placeholder(tf.float32, shape=[None, pm.num_classes], name='input_y')
        self.seq_length = tf.placeholder(tf.int32, shape=[None], name='sequen_length')
        self.keep_prob = tf.placeholder(tf.float32, name='keep_prob')
        self.global_step = tf.Variable(0, trainable=False, name='global_step')
        self.rnn()

    def rnn(self):
        with tf.device('/cpu:0'), tf.name_scope('embedding'):
            embedding = tf.get_variable('embedding', shape=[pm.vocab_size, pm.embedding_dim],
                                        initializer=tf.constant_initializer(pm.pre_trianing))
            self.embedding_input = tf.nn.embedding_lookup(embedding, self.input_x)

        with tf.name_scope('cell'):
            cell = tf.nn.rnn_cell.LSTMCell(pm.hidden_dim)
            cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=self.keep_prob)
            cells = [cell for _ in range(pm.num_layers)]  # the same cell object is reused for every layer
            Cell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=True)

        with tf.name_scope('rnn'):
            # layer-1 input: [batch_size, seq_length, embedding_dim]
            # layer-2 input: [batch_size, seq_length, hidden_dim]
            # the LSTM kernel sees [input; h]: width embedding_dim + hidden_dim at layer 1,
            # 2*hidden_dim at layer 2; reusing one cell forces embedding_dim == hidden_dim
            output, _ = tf.nn.dynamic_rnn(cell=Cell, inputs=self.embedding_input,
                                          sequence_length=self.seq_length, dtype=tf.float32)
            # output: [batch_size, seq_length, hidden_dim];
            # summing over the time axis gives [batch_size, hidden_dim]
            output = tf.reduce_sum(output, axis=1)

        with tf.name_scope('dropout'):
            self.out_drop = tf.nn.dropout(output, keep_prob=self.keep_prob)

        with tf.name_scope('output'):
            w = tf.Variable(tf.truncated_normal([pm.hidden_dim, pm.num_classes], stddev=0.1), name='w')
            b = tf.Variable(tf.constant(0.1, shape=[pm.num_classes]), name='b')
            self.logits = tf.matmul(self.out_drop, w) + b
            self.predict = tf.argmax(tf.nn.softmax(self.logits), 1, name='predict')

        with tf.name_scope('loss'):
            losses = tf.nn.softmax_cross_entropy_with_logits_v2(logits=self.logits, labels=self.input_y)
            self.loss = tf.reduce_mean(losses)

        with tf.name_scope('optimizer'):
            optimizer = tf.train.AdamOptimizer(pm.learning_rate)
            gradients, variables = zip(*optimizer.compute_gradients(self.loss))  # gradients and their variables
            # clip by global norm: if the global L2 norm of the gradients exceeds pm.clip,
            # every gradient is scaled by clip / global_norm
            gradients, _ = tf.clip_by_global_norm(gradients, pm.clip)
            self.optimizer = optimizer.apply_gradients(zip(gradients, variables), global_step=self.global_step)
            # global_step is incremented by 1 on every apply_gradients call

        with tf.name_scope('accuracy'):
            correct_prediction = tf.equal(self.predict, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy')
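train() and val() below also call model.feed_data and model.evaluate, which come from the full repository and are not shown above. A minimal sketch consistent with how they are called (two extra methods on RnnModel):

    def feed_data(self, x_batch, y_batch, seq_len, keep_prob):
        return {self.input_x: x_batch,
                self.input_y: y_batch,
                self.seq_length: seq_len,
                self.keep_prob: keep_prob}

    def evaluate(self, sess, x, y):
        # mean loss/accuracy over the whole set, batch by batch
        total_loss, total_acc, total_num = 0.0, 0.0, 0
        for x_batch, y_batch in batch_iter(x, y, batch_size=64):
            feed_dict = self.feed_data(x_batch, y_batch, sequence(x_batch), 1.0)
            loss, acc = sess.run([self.loss, self.accuracy], feed_dict=feed_dict)
            total_loss += loss * len(x_batch)
            total_acc += acc * len(x_batch)
            total_num += len(x_batch)
        return total_loss / total_num, total_acc / total_num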
2.3 Training the Model
With the model built, training can begin. Whenever global_step is a multiple of 100, the current training results are printed. This run trains for three epochs, saving the model after each epoch.
import os

def train():
    tensorboard_dir = './tensorboard/Text_Rnn'
    save_dir = './checkpoints/Text_Rnn'
    if not os.path.exists(tensorboard_dir):
        os.makedirs(tensorboard_dir)
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    save_path = os.path.join(save_dir, 'best_validation')

    tf.summary.scalar('loss', model.loss)
    tf.summary.scalar('accuracy', model.accuracy)
    merged_summary = tf.summary.merge_all()
    writer = tf.summary.FileWriter(tensorboard_dir)
    saver = tf.train.Saver()
    session = tf.Session()
    session.run(tf.global_variables_initializer())
    writer.add_graph(session.graph)

    x_train, y_train = process(pm.train_filename, wordid, cat_to_id, max_length=250)
    x_test, y_test = process(pm.test_filename, wordid, cat_to_id, max_length=250)
    for epoch in range(pm.num_epochs):
        print('Epoch:', epoch + 1)
        num_batchs = int((len(x_train) - 1) / pm.batch_size) + 1
        batch_train = batch_iter(x_train, y_train, batch_size=pm.batch_size)
        for x_batch, y_batch in batch_train:
            seq_len = sequence(x_batch)
            feed_dict = model.feed_data(x_batch, y_batch, seq_len, pm.keep_prob)
            _, global_step, _summary, train_loss, train_accuracy = session.run(
                [model.optimizer, model.global_step, merged_summary, model.loss, model.accuracy],
                feed_dict=feed_dict)
            if global_step % 100 == 0:
                test_loss, test_accuracy = model.evaluate(session, x_test, y_test)
                print('global_step:', global_step, 'train_loss:', train_loss, 'train_accuracy:', train_accuracy,
                      'test_loss:', test_loss, 'test_accuracy:', test_accuracy)
            if global_step % num_batchs == 0:
                print('Saving Model...')
                saver.save(session, save_path, global_step=global_step)
        pm.learning_rate *= pm.lr_decay  # decay the learning rate once per epoch
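train() relies on module-level wordid, cat_to_id, and model objects (pm is the hyperparameter module from the repository). A hedged sketch of the driver code, with the file names assumed:

if __name__ == '__main__':
    wordid = get_wordid('./data/vocab_word.txt')              # word -> id mapping
    categories, cat_to_id = read_category()                   # label -> id mapping
    pm.pre_trianing = get_word2vec('./data/vector_word.npz')  # pretrained embeddings, sketched in 2.1
    model = RnnModel()
    train()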
The training results are as follows:
Judging from each run's numbers, the results are quite good. Now use the last saved model to predict on the validation set, compute the accuracy, and print the first 10 results for inspection.
def val():
    pre_label = []
    label = []
    session = tf.Session()
    session.run(tf.global_variables_initializer())
    save_path = tf.train.latest_checkpoint('./checkpoints/Text_Rnn')
    saver = tf.train.Saver()
    saver.restore(sess=session, save_path=save_path)

    val_x, val_y = process(pm.val_filename, wordid, cat_to_id, max_length=250)
    batch_val = batch_iter(val_x, val_y, batch_size=64)
    for x_batch, y_batch in batch_val:
        seq_len = sequence(x_batch)
        pre_lab = session.run(model.predict, feed_dict={model.input_x: x_batch,
                                                        model.seq_length: seq_len,
                                                        model.keep_prob: 1.0})
        pre_label.extend(pre_lab)
        label.extend(y_batch)
    return pre_label, label
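To turn val()'s return values into an accuracy figure, note that pre_label holds predicted class ids while label holds one-hot rows; a small sketch:

pre_label, label = val()
label = np.argmax(np.array(label), axis=1)      # one-hot -> class ids
accuracy = np.mean(np.equal(pre_label, label))
print('val accuracy:', accuracy)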

On the 5,000-example validation set, prediction accuracy reached 96.7%, and the first 10 results also look quite good.
3. Summary
The data used in this article comes from https://github.com/cjymz886/text-cnn. The texts fall into 10 classes and come from news articles, so they are long. Before this experiment, being lazy, I directly reused the text preprocessing from the previous TextCNN post, i.e. picking a max_length=n and padding every sentence to max_length. Convergence was left in the dust by TextCNN. I then made some improvements and shortened the length; convergence was still slower than TextCNN. It seems that for long-text classification, a CNN is still the way to go!