The basic idea of sentiment analysis with word vectors and deep learning is:
1. Train word vectors.
2. Preprocess and segment the sentences so each sentence becomes a sequence of words; fix a maximum sequence length (truncate longer sentences, pad shorter ones); assign each word an index that maps to its word vector.
3. Define the network structure, e.g. one LSTM layer plus a fully connected layer, with dropout for better generalization, then start training.
4. Tune the parameters while watching loss and accuracy on the training and validation sets. When validation accuracy starts to drop non-incidentally (usually accompanied by rising validation loss), the model is starting to overfit: stop training, then retrain on all the data with this epoch/iteration count and these parameters to get the final model.
1. Word vectors
The corpus and steps for training word vectors were covered in earlier posts. The sentiment-analysis corpus can be folded in to train the word vectors together; the method and code are omitted here. Two points are worth noting: the word-vector corpus should ideally come from the same domain as the sentiment task, and the more of it the better; also, segmentation quality strongly affects word-vector accuracy, so extra preprocessing helps, e.g. removing stop words and adding domain terms as a custom dictionary.
2.文本變數(shù)字
The figure below illustrates the text-to-numbers process well. From the word vectors we get a vocabulary in which each word has an index (e.g. a word's index is its position in the vocabulary + 1), and that index maps to the word's vector, giving an embedding matrix like the one in the figure. Note that one special index, e.g. 0, must be reserved for out-of-vocabulary words. Segmenting a sentence yields a sequence of words: "I thought the movie was incredible and inspiring" becomes "I", "thought", "the", "movie", "was", "incredible", "and", "inspiring", and mapping each word to its index gives the vector [41 804 201534 1005 15 7446 5 13767]. Inputs must be normalized to one length (max_len), say 10; this sentence has only 8 words, so the remaining 2 slots are padded with 0. Looking [41 804 201534 1005 15 7446 5 13767 0 0] up in the embedding matrix then yields a tensor of shape [batch_size = 1, max_len = 10, word2vec_dimension = 50]. That is the input.
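The lookup above can be sketched in a few lines of NumPy. The toy vocabulary, the random embedding matrix, and the names `word2idx` / `embedding_matrix` are illustrative stand-ins, not the real vocabulary or indices from the figure:

```python
import numpy as np

# Toy vocabulary; index 0 is reserved for padding / out-of-vocabulary words.
word2idx = {"I": 1, "thought": 2, "the": 3, "movie": 4,
            "was": 5, "incredible": 6, "and": 7, "inspiring": 8}
dim = 50
embedding_matrix = np.random.rand(len(word2idx) + 1, dim)
embedding_matrix[0] = 0.0  # row 0: all-zero vector for padding/OOV

sentence = "I thought the movie was incredible and inspiring".split()
max_len = 10
indices = [word2idx.get(w, 0) for w in sentence]
indices += [0] * (max_len - len(indices))  # pad the 2 empty slots with 0

x = embedding_matrix[indices][np.newaxis, ...]
# x.shape == (1, 10, 50) == (batch_size, max_len, word2vec_dimension)
```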

3.網(wǎng)絡結構
Step 2 describes the input; the output is a one-hot vector. For 3 classes (positive, negative, neutral) the outputs are [1 0 0], [0 1 0] and [0 0 1], and the softmax output can then be read as per-class probabilities. For binary classification, a single 0/1 label also works: a sigmoid maps the output into the range 0 to 1, which can likewise serve as a probability. With inputs and outputs defined, we specify the model/network structure and let the model learn its parameters. The corpus here is small, so the model should be as simple as possible: one CNN layer (including a pooling layer, of course), or an RNN layer (LSTM, GRU, bidirectional LSTM), followed by one fully connected layer. Experiments showed the bidirectional LSTM works best, reaching over 95% accuracy on the test set. This matches intuition: a CNN only extracts local word windows without wider context, and a plain LSTM processes the sentence left to right, so it cannot use information to its right; a bi-LSTM adds a reverse pass to capture it.
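As a quick illustration of why the softmax output can be read as class probabilities, here is the function on toy logits (the numbers are made up, not from the trained model):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy logits for the 3 classes (positive, negative, neutral).
probs = softmax(np.array([2.0, 0.5, -1.0]))
# probs sums to 1; the largest entry marks the predicted class
```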
4. Training
Split the data into training and validation sets (0.2 validation ratio), train on the training set, and compute loss and accuracy on the validation set as well. Normally training loss keeps falling and accuracy keeps rising until convergence; the validation set starts out the same, but at some point its loss starts rising and its accuracy falling, which signals overfitting: early-stop at that point. Then retrain on all the data with the current parameters to obtain the final model.
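The stopping rule above can be sketched framework-agnostically: stop once the validation loss has not improved for `patience` consecutive epochs, and keep the epoch count of the best model for the final retraining run. This helper and its names are just an illustration, not part of the code below:

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return the epoch of the best validation loss, stopping once the
    loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # overfitting: validation loss keeps rising
    return best_epoch  # retrain on all data for this many epochs

# Validation loss per epoch: falls, bottoms out, then rises -> stop at epoch 3
train_with_early_stopping([0.9, 0.6, 0.45, 0.44, 0.5, 0.58])  # returns 3
```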
5. Python code
Keras training
# -*- coding: utf-8 -*-
import multiprocessing
import sys

import numpy as np
import yaml
from gensim.corpora.dictionary import Dictionary
from gensim.models import Word2Vec
from keras.layers import Bidirectional
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split

np.random.seed(35)  # for reproducibility
import jieba

sys.setrecursionlimit(1000000)

# set parameters:
vocab_dim = 256        # word-vector dimension
maxlen = 150           # maximum sequence length
batch_size = 32
n_epoch = 5
input_length = 150
validation_rate = 0.0  # 0.2 while tuning; 0.0 (with validation disabled) to retrain on all data
cpu_count = multiprocessing.cpu_count()
def read_txt(filename):
    """Read one sentence per line, stripping newlines and the header line."""
    res = []
    with open(filename) as f:
        for line in f:
            res.append(line.replace("\n", ""))
    del res[0]  # drop the header line
    return res

# Load the training files.
def loadfile():
    neg = read_txt('./bida_neg.txt')
    pos = read_txt('./bida_pos.txt')
    combined = np.concatenate((pos, neg))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))
    return combined, y

# Segment each sentence and remove newlines.
def tokenizer(text):
    '''Simple parser: strip line breaks from each document, then segment
    it into a word list with jieba.
    '''
    text = [jieba.lcut(document.replace('\n', '')) for document in text]
    return text
def create_dictionaries(model=None, combined=None):
    '''Does a number of jobs:
    1- Creates a word-to-index mapping
    2- Creates a word-to-vector mapping
    3- Transforms the training and testing data
    '''
    if (combined is not None) and (model is not None):
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
        # Index of every word kept in the word2vec vocabulary (i.e. with
        # frequency above min_count); index 0 is reserved for OOV words.
        w2indx = {v: k + 1 for k, v in gensim_dict.items()}
        # Word vector of every indexed word.
        w2vec = {word: model[word] for word in w2indx.keys()}

        def parse_dataset(combined):
            '''Words become integers.'''
            data = []
            for sentence in combined:
                new_txt = []
                for word in sentence:
                    try:
                        new_txt.append(w2indx[word])
                    except KeyError:
                        new_txt.append(0)  # out-of-vocabulary word
                data.append(new_txt)
            return data

        combined = parse_dataset(combined)
        # Pad/truncate each index sequence to maxlen.
        combined = sequence.pad_sequences(combined, maxlen=maxlen)
        return w2indx, w2vec, combined
    else:
        print('No data provided...')

def get_data(index_dict, word_vectors, combined, y):
    # +1 because index 0 is reserved for out-of-vocabulary words.
    n_symbols = len(index_dict) + 1
    # Row 0 (OOV) stays all zeros; from index 1 on, each row holds the
    # corresponding word vector.
    embedding_weights = np.zeros((n_symbols, vocab_dim))
    for word, index in index_dict.items():
        embedding_weights[index, :] = word_vectors[word]
    x_train, x_test, y_train, y_test = train_test_split(
        combined, y, test_size=validation_rate)
    return n_symbols, embedding_weights, x_train, y_train, x_test, y_test

def word2vec_train(model, combined):
    index_dict, word_vectors, combined = create_dictionaries(model=model, combined=combined)
    return index_dict, word_vectors, combined
##定義網(wǎng)絡結構
def train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test):
model = Sequential()
model.add(Embedding(output_dim=vocab_dim,
input_dim=n_symbols,
mask_zero=True,
weights=[embedding_weights],
input_length=input_length)) # Adding Input Length
model.add(Bidirectional(LSTM(32, activation='sigmoid',inner_activation='sigmoid')))
model.add(Dropout(0.4))
model.add(Dense(1))
model.add(Activation('sigmoid'))
print('Compiling the Model...')
model.compile(loss='binary_crossentropy',
optimizer='adam',metrics=['accuracy'])
print("Train...")
model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,verbose=1, validation_data=(x_test, y_test))
print("Evaluate...")
score = model.evaluate(x_test, y_test,
batch_size=batch_size)
yaml_string = model.to_yaml()
with open('lstm_data/lstm.yml', 'w') as outfile:
outfile.write( yaml.dump(yaml_string, default_flow_style=True) )
model.save_weights('lstm_data/lstm.h5')
print('Test score:', score)
# Train the model and save it.
def train():
    combined, y = loadfile()
    combined = tokenizer(combined)
    model = Word2Vec.load("../models/word2vec.model")
    index_dict, word_vectors, combined = create_dictionaries(model, combined)
    n_symbols, embedding_weights, x_train, y_train, x_test, y_test = get_data(
        index_dict, word_vectors, combined, y)
    train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test)

if __name__ == '__main__':
    train()
The code above is for binary classification, with the output mapped into the range 0~1. For multi-class classification, replace sigmoid with softmax as the final activation, change loss='binary_crossentropy' to loss='categorical_crossentropy', and convert the labels with y = to_categorical(y, num_classes=classes).
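The label conversion that to_categorical performs is easy to reproduce; here is a minimal NumPy equivalent (the helper name `to_one_hot` and the class numbering are just assumptions for illustration):

```python
import numpy as np

def to_one_hot(y, num_classes):
    # One row per label; put a 1 in the column given by the class index.
    out = np.zeros((len(y), num_classes), dtype=int)
    out[np.arange(len(y)), y] = 1
    return out

# e.g. 3-class labels (0 = positive, 1 = negative, 2 = neutral)
oh = to_one_hot([0, 2, 1], 3)
# oh == [[1, 0, 0],
#        [0, 0, 1],
#        [0, 1, 0]]
```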
Prediction
# -*- coding: utf-8 -*-
import numpy as np
import yaml
from gensim.corpora.dictionary import Dictionary
from gensim.models import Word2Vec
from keras.models import model_from_yaml
from keras.preprocessing import sequence
import jieba

# set parameters (must match training):
vocab_dim = 256
maxlen = 150
def init_dictionaries(w2v_model):
    gensim_dict = Dictionary()
    gensim_dict.doc2bow(w2v_model.wv.vocab.keys(), allow_update=True)
    w2indx = {v: k + 1 for k, v in gensim_dict.items()}
    w2vec = {word: w2v_model[word] for word in w2indx.keys()}
    return w2indx, w2vec

def process_words(w2indx, words):
    temp = []
    for word in words:
        try:
            temp.append(w2indx[word])
        except KeyError:
            temp.append(0)  # out-of-vocabulary word
    res = sequence.pad_sequences([temp], maxlen=maxlen)
    return res

def input_transform(string, w2index):
    words = jieba.lcut(string)
    return process_words(w2index, words)

def load_model():
    print('loading model......')
    with open('lstm_data/lstm.yml', 'r') as f:
        yaml_string = yaml.load(f)
    model = model_from_yaml(yaml_string)
    model.load_weights('lstm_data/lstm.h5')
    model.compile(loss='binary_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    w2v_model = Word2Vec.load('../models/word2vec.model')
    return model, w2v_model

def lstm_predict(string, model, w2index):
    data = input_transform(string, w2index)
    result = model.predict_classes(data)
    prob = model.predict_proba(data)
    print(string)
    print("prob:" + str(prob))
    if result[0][0] == 1:
        return 1   # positive
    else:
        return -1  # negative

if __name__ == '__main__':
    model, w2v_model = load_model()
    w2index, _ = init_dictionaries(w2v_model)
    lstm_predict("平安大跌", model, w2index)
TensorFlow training
# -*- coding: utf-8 -*-
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
import numpy as np
from random import randint
from sklearn.model_selection import train_test_split
import tensorflow as tf
import jieba

def read_txt(filename):
    res = []
    with open(filename) as f:
        for line in f:
            res.append(line.replace("\n", ""))
    del res[0]  # drop the header line
    return res

def loadfile():
    neg = read_txt("../data/bida_neg.txt")
    pos = read_txt('../data/bida_pos.txt')
    combined = np.concatenate((pos, neg))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))
    return combined, y
def create_dictionaries(model=None):
    if model is not None:
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
        w2index = {v: k + 1 for k, v in gensim_dict.items()}
        # Row 0 stays all zeros for out-of-vocabulary words.
        vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')
        for k, v in gensim_dict.items():
            vectors[k + 1] = model[v]
        return w2index, vectors
def get_train_batch(batch_size):
    labels = []
    arr = np.zeros([batch_size, max_seq_length])
    for i in range(batch_size):
        num = randint(0, len(X_train) - 1)
        labels.append(y_train[num])
        arr[i] = X_train[num]
    return arr, labels

def get_test_batch(batch_size):
    labels = []
    arr = np.zeros([batch_size, max_seq_length])
    for i in range(batch_size):
        num = randint(0, len(X_test) - 1)
        labels.append(y_test[num])
        arr[i] = X_test[num]
    return arr, labels

def get_all_batches(batch_size=32, mode="train"):
    X, y = None, None
    if mode == "train":
        X = X_train
        y = y_train
    elif mode == "test":
        X = X_test
        y = y_test
    batches = int(len(y) / batch_size)
    arrs = [X[i * batch_size:i * batch_size + batch_size] for i in range(batches)]
    labels = [y[i * batch_size:i * batch_size + batch_size] for i in range(batches)]
    if batches * batch_size < len(y):  # keep the final partial batch, if any
        arrs.append(X[batches * batch_size:len(y)])
        labels.append(y[batches * batch_size:len(y)])
    return arrs, labels

def parse_dataset(sentences, w2index, max_len):
    data = []
    for sentence in sentences:
        words = jieba.lcut(sentence.replace('\n', ''))
        new_txt = np.zeros((max_len), dtype='int32')
        index = 0
        for word in words:
            try:
                new_txt[index] = w2index[word]
            except KeyError:
                new_txt[index] = 0  # out-of-vocabulary word
            index += 1
            if index >= max_len:
                break
        data.append(new_txt)
    return data
batch_size = 32
lstm_units = 64
num_classes = 2
iterations = 50000
num_dimensions = 256
max_seq_len = 150
max_seq_length = 150
validation_rate = 0.2
random_state = 9876
output_keep_prob = 0.5
learning_rate = 0.001

combined, y = loadfile()
model = Word2Vec.load("../models/word2vec.model")
w2index, vectors = create_dictionaries(model)
X = parse_dataset(combined, w2index, max_seq_len)
y = [[1, 0] if yi == 1 else [0, 1] for yi in y]  # one-hot labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=validation_rate, random_state=random_state)

tf.reset_default_graph()
labels = tf.placeholder(tf.float32, [None, num_classes])
input_data = tf.placeholder(tf.int32, [None, max_seq_length])
# Look the index sequences up in the pretrained embedding matrix.
data = tf.nn.embedding_lookup(vectors, input_data)
#bidirectional lstm
lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
(output_fw, output_bw),_ = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw, cell_bw=lstm_bw,inputs = data, dtype=tf.float32)
outputs = tf.concat([output_fw, output_bw], axis=2)
# Fully connected layer.
weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
dtype=tf.float32)
bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
initializer=tf.zeros_initializer())
last = tf.transpose(outputs, [1,0,2])
last = tf.gather(last, int(last.get_shape()[0]) - 1)
logits = (tf.matmul(last, weight) + bias)
prediction = tf.nn.softmax(logits)
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())
cal_iter = 500
print("start training...")
for i in range(iterations):
    # Next batch of reviews
    next_batch, next_batch_labels = get_train_batch(batch_size)
    sess.run(optimizer, {input_data: next_batch, labels: next_batch_labels})
    # Save the network and report metrics every cal_iter iterations
    if i % cal_iter == 0:
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt")
        print("iteration: " + str(i))
        train_acc, train_loss = 0.0, 0.0
        test_acc, test_loss = 0.0, 0.0
        train_arrs, train_labels = get_all_batches(300)
        test_arrs, test_labels = get_all_batches(300, "test")
        for k in range(len(train_labels)):
            temp1, temp2 = sess.run([accuracy, loss],
                                    {input_data: train_arrs[k], labels: train_labels[k]})
            train_acc += temp1
            train_loss += temp2
        train_acc /= len(train_labels)
        train_loss /= len(train_labels)
        for k in range(len(test_labels)):
            temp1, temp2 = sess.run([accuracy, loss],
                                    {input_data: test_arrs[k], labels: test_labels[k]})
            test_acc += temp1
            test_loss += temp2
        test_acc /= len(test_labels)
        test_loss /= len(test_labels)
        print("train accuracy: " + str(train_acc) + ", train loss: " + str(train_loss))
        print("test accuracy: " + str(test_acc) + ", test loss: " + str(test_loss))
Prediction
import tensorflow as tf
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary
import numpy as np
import jieba
def create_dictionaries(model=None):
    if model is not None:
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
        w2index = {v: k + 1 for k, v in gensim_dict.items()}
        vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')
        for k, v in gensim_dict.items():
            vectors[k + 1] = model[v]
        return w2index, vectors

def parse_dataset(sentence, w2index, max_len):
    words = jieba.lcut(sentence.replace('\n', ''))
    new_txt = np.zeros((max_len), dtype='int32')
    index = 0
    for word in words:
        try:
            new_txt[index] = w2index[word]
        except KeyError:
            new_txt[index] = 0
        index += 1
        if index >= max_len:
            break
    return [new_txt]
batch_size = 32
lstm_units = 64
num_classes = 2
iterations = 100000
num_dimensions = 256
max_seq_len = 150
max_seq_length = 150
validation_rate = 0.2
random_state = 333
output_keep_prob = 0.5
model = Word2Vec.load("../models/word2vec.model")
w2index, vectors = create_dictionaries(model)
tf.reset_default_graph()
labels = tf.placeholder(tf.float32, [None, num_classes])
input_data = tf.placeholder(tf.int32, [None, max_seq_length])
data = tf.nn.embedding_lookup(vectors, input_data)

# bidirectional lstm (must match the training graph)
lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
(output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw=lstm_fw, cell_bw=lstm_bw, inputs=data, dtype=tf.float32)
outputs = tf.concat([output_fw, output_bw], axis=2)
# Fully connected layer.
weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
dtype=tf.float32)
bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
initializer=tf.zeros_initializer())
#last = tf.reshape(outputs, [-1, 2 * lstm_units])
last = tf.transpose(outputs, [1,0,2])
last = tf.gather(last, int(last.get_shape()[0]) - 1)
logits = (tf.matmul(last, weight) + bias)
prediction = tf.nn.softmax(logits)
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
sess = tf.InteractiveSession()
saver = tf.train.Saver()
#saver.restore(sess, 'models/pretrained_lstm.ckpt-27000.data-00000-of-00001')
saver.restore(sess, tf.train.latest_checkpoint('models'))
# Sample headlines to classify (kept in Chinese, as the model expects).
l = ["平安銀行大跌", "平安銀行暴跌", "平安銀行扭虧為盈", "小米將加深與TCL合作",
     "蘋果手機現在賣的不如以前了",
     "蘋果和三星的糟糕業績預示著全球商業領域將經歷更加嚴峻的考驗。",
     "這道菜不好吃"]
for s in l:
    print(s)
    X = parse_dataset(s, w2index, max_seq_len)
    predictedSentiment = sess.run(prediction, {input_data: X})[0]
    print(predictedSentiment[0], predictedSentiment[1])
References:
https://github.com/adeshpande3/LSTM-Sentiment-Analysis/blob/master/Oriole%20LSTM.ipynb
https://buptldy.github.io/2016/07/20/2016-07-20-sentiment%20analysis/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
https://arxiv.org/abs/1408.5882