Study Notes TF019: Sequence Classification, IMDB Movie Review Classification

Sequence classification predicts a class label for the entire input sequence. A typical example is sentiment analysis: predicting the attitude a user expresses toward a topic in written text. It can be used to predict election results or product and movie ratings.

We use the International Movie Database (IMDB) movie review dataset. The target value is binary: positive or negative. The language is full of negation, irony, and ambiguity, so it is not enough to check whether individual words appear. We build a recurrent network over word vectors, read each review word by word, and use the activation after the last word to train a classifier that predicts the sentiment of the whole review.

The IMDB movie review dataset comes from the Stanford University AI Lab: http://ai.stanford.edu/~amaas/data/sentiment/ . It is a compressed tar archive; positive and negative reviews are read as text files from two folders. We extract the plain text with a regular expression and convert all letters to lowercase.

Word-vector embeddings represent word semantics more richly than one-hot encoding. The vocabulary determines each word's index, which is used to look up the correct word vector. Sequences are padded to the same length so that multiple reviews can be sent to the network in batches.
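A minimal sketch of this lookup-and-pad step, using a hypothetical four-word vocabulary and random 2-D embeddings (all names and values here are illustrative, not from the original code):

import numpy as np

# Hypothetical vocabulary; index 0 doubles as the unknown-word slot.
vocabulary = {'<unk>': 0, 'great': 1, 'movie': 2, 'boring': 3}
embedding = np.random.rand(len(vocabulary), 2)  # one 2-D vector per word
max_length = 5                                  # pad every sequence to 5 steps

tokens = ['great', 'movie']
indices = [vocabulary.get(t, 0) for t in tokens]  # unknown words map to 0
data = np.zeros((max_length, embedding.shape[1]))
data[:len(tokens)] = embedding[indices]           # remaining rows stay zero
print(data.shape)  # (5, 2)

The Embedding class later in this note does the same thing with the Wikipedia vocabulary and embeddings.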

The sequence classification model takes two placeholders: first the input data, data (the sequences), and second the target value, target (the sentiment). It also receives a params object carrying configuration such as the optimizer.

We compute the sequence lengths of the current batch dynamically. The data arrives as a single tensor, with each sequence zero-padded to the length of the longest review. We reduce the word vectors by the maximum of their absolute values: a zero vector reduces to the scalar 0, while a real word vector reduces to a scalar greater than 0. tf.sign() then discretizes each value to 0 or 1, and summing these results along the time steps yields each sequence's length. The resulting tensor has as many entries as the batch, one scalar length per sequence.
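A minimal sketch of this length trick on toy values (the tensors here are invented for illustration):

import tensorflow as tf

# Two sequences of 2-D word vectors; the second is zero-padded after step 1.
batch = tf.constant([
    [[0.5, -1.0], [0.2, 0.3], [0.9, 0.1]],
    [[1.0, 0.4], [0.0, 0.0], [0.0, 0.0]],
])
used = tf.sign(tf.reduce_max(tf.abs(batch), axis=2))  # 1 per real step, 0 per pad
length = tf.cast(tf.reduce_sum(used, axis=1), tf.int32)

with tf.Session() as sess:
    print(sess.run(length))  # [3 1]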

We use the params object to define the cell type and the number of cells. The length property tells the RNN how many time steps of each batch sequence to process. We fetch the last activation of each sequence and feed it into a softmax layer. Because every review has a different length, the last relevant RNN output activation of each sequence in the batch sits at a different index. The index is built in the time-step dimension (the batch has shape sequences × time_steps × word_vectors). tf.gather() indexes along the first dimension, so we flatten the first two dimensions of the output activations (shape sequences × time_steps × word_vectors) into one, offset each sequence by its position in the flat tensor, and add length - 1 to select its last valid time step.
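A toy illustration of this flatten-and-gather indexing (values invented for the example):

import tensorflow as tf

# 2 sequences x 3 time steps x 2 output units.
output = tf.constant([
    [[1., 1.], [2., 2.], [3., 3.]],
    [[4., 4.], [5., 5.], [6., 6.]],
])
length = tf.constant([3, 1])  # valid time steps per sequence
batch_size = tf.shape(output)[0]
max_length = int(output.get_shape()[1])
output_size = int(output.get_shape()[2])
# Offset of each sequence in the flattened tensor plus its last valid step.
index = tf.range(0, batch_size) * max_length + (length - 1)
flat = tf.reshape(output, [-1, output_size])
relevant = tf.gather(flat, index)

with tf.Session() as sess:
    print(sess.run(relevant))  # [[3. 3.] [4. 4.]]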

Gradient clipping keeps gradient values within a reasonable range. Any cost function that is meaningful for classification can be used, since the model output is a probability distribution over all classes. Adding gradient clipping improves the learning results by limiting the maximum weight update. RNNs are hard to train: with an unlucky combination of hyperparameters, the weights easily diverge.

TensorFlow supports this through the optimizer instance's compute_gradients function, which lets us inspect and modify the gradients, and its apply_gradients function, which applies the weight changes. Gradient components smaller than -limit are set to -limit; components greater than limit are set to limit. A TensorFlow derivative can be None, meaning a variable has no relation to the cost function; mathematically it should be a zero vector, but None enables internal performance optimizations, so we only need to pass the None value back.
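A self-contained sketch of this clipping pattern with a toy variable and loss (the limit of 1.0 is arbitrary):

import tensorflow as tf

weight = tf.Variable([5.0])
loss = tf.reduce_sum(weight ** 2)  # gradient is 2 * weight = [10.0]
optimizer = tf.train.GradientDescentOptimizer(0.1)
grads_and_vars = optimizer.compute_gradients(loss)
limit = 1.0
clipped = [
    (tf.clip_by_value(g, -limit, limit), v) if g is not None else (None, v)
    for g, v in grads_and_vars]
train_op = optimizer.apply_gradients(clipped)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)
    print(sess.run(weight))  # [4.9]: the update was capped at 0.1 * 1.0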

Reviews are fed to the recurrent network one word at a time, so each time step is a batch of word vectors. The batched function looks up the word vectors and pads all sequences to the same length. To train the model, we define the hyperparameters, load the dataset and the word embeddings, and run the model on the preprocessed training batches. Successful training depends on the network structure, the hyperparameters, and the quality of the word vectors. Pretrained word vectors can be loaded from the skip-gram model of the word2vec project (https://code.google.com/archive/p/word2vec/ ) or from the Stanford NLP Group's GloVe model (https://nlp.stanford.edu/projects/glove ).

Kaggle runs an open learning competition (https://kaggle.com/c/word2vec-nlp-tutorial ) on this IMDB movie review data, so you can compare your prediction results with others.

import tarfile
import re

from helpers import download


class ImdbMovieReviews:

    DEFAULT_URL = \
    'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

    def __init__(self, cache_dir, url=None):
        self._cache_dir = cache_dir
        self._url = url or type(self).DEFAULT_URL

    def __iter__(self):
        filepath = download(self._url, self._cache_dir)
        with tarfile.open(filepath) as archive:
            for filename in archive.getnames():
                if filename.startswith('aclImdb/train/pos/'):
                    yield self._read(archive, filename), True
                elif filename.startswith('aclImdb/train/neg/'):
                    yield self._read(archive, filename), False

    def _read(self, archive, filename):
        with archive.extractfile(filename) as file_:
            data = file_.read().decode('utf-8')
            data = type(self).TOKEN_REGEX.findall(data)
            data = [x.lower() for x in data]
            return data

import bz2
import numpy as np


class Embedding:

    def __init__(self, vocabulary_path, embedding_path, length):
        self._embedding = np.load(embedding_path)
        with bz2.open(vocabulary_path, 'rt') as file_:
            self._vocabulary = {k.strip(): i for i, k in enumerate(file_)}
        self._length = length

    def __call__(self, sequence):
        data = np.zeros((self._length, self._embedding.shape[1]))
        indices = [self._vocabulary.get(x, 0) for x in sequence]
        embedded = self._embedding[indices]
        data[:len(sequence)] = embedded
        return data

    @property
    def dimensions(self):
        return self._embedding.shape[1]

import tensorflow as tf

from helpers import lazy_property


class SequenceClassificationModel:

    def __init__(self, data, target, params):
        self.data = data
        self.target = target
        self.params = params
        # Touch each lazy property once so the full graph is built here.
        self.prediction
        self.cost
        self.error
        self.optimize

    @lazy_property
    def length(self):
        # 1 for real time steps, 0 for zero padding; sum along time steps.
        used = tf.sign(tf.reduce_max(tf.abs(self.data), axis=2))
        length = tf.reduce_sum(used, axis=1)
        length = tf.cast(length, tf.int32)
        return length

    @lazy_property
    def prediction(self):
        # Recurrent network.
        output, _ = tf.nn.dynamic_rnn(
            self.params.rnn_cell(self.params.rnn_hidden),
            self.data,
            dtype=tf.float32,
            sequence_length=self.length,
        )
        last = self._last_relevant(output, self.length)
        # Softmax layer.
        num_classes = int(self.target.get_shape()[1])
        weight = tf.Variable(tf.truncated_normal(
            [self.params.rnn_hidden, num_classes], stddev=0.01))
        bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
        prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
        return prediction

    @lazy_property
    def cost(self):
        cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
        return cross_entropy

    @lazy_property
    def error(self):
        mistakes = tf.not_equal(
            tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
        return tf.reduce_mean(tf.cast(mistakes, tf.float32))

    @lazy_property
    def optimize(self):
        gradient = self.params.optimizer.compute_gradients(self.cost)
        try:
            limit = self.params.gradient_clipping
            gradient = [
                (tf.clip_by_value(g, -limit, limit), v)
                if g is not None else (None, v)
                for g, v in gradient]
        except AttributeError:
            print('No gradient clipping parameter specified.')
        optimize = self.params.optimizer.apply_gradients(gradient)
        return optimize

    @staticmethod
    def _last_relevant(output, length):
        batch_size = tf.shape(output)[0]
        max_length = int(output.get_shape()[1])
        output_size = int(output.get_shape()[2])
        index = tf.range(0, batch_size) * max_length + (length - 1)
        flat = tf.reshape(output, [-1, output_size])
        relevant = tf.gather(flat, index)
        return relevant

import tensorflow as tf

from helpers import AttrDict

from Embedding import Embedding
from ImdbMovieReviews import ImdbMovieReviews
from preprocess_batched import preprocess_batched
from SequenceClassificationModel import SequenceClassificationModel

IMDB_DOWNLOAD_DIR = './imdb'
WIKI_VOCAB_DIR = '../01_wikipedia/wikipedia'
WIKI_EMBED_DIR = '../01_wikipedia/wikipedia'


params = AttrDict(
    rnn_cell=tf.contrib.rnn.GRUCell,
    rnn_hidden=300,
    optimizer=tf.train.RMSPropOptimizer(0.002),
    batch_size=20,
)

reviews = ImdbMovieReviews(IMDB_DOWNLOAD_DIR)
length = max(len(x[0]) for x in reviews)

embedding = Embedding(
    WIKI_VOCAB_DIR + '/vocabulary.bz2',
    WIKI_EMBED_DIR + '/embeddings.npy', length)
batches = preprocess_batched(reviews, length, embedding, params.batch_size)

data = tf.placeholder(tf.float32, [None, length, embedding.dimensions])
target = tf.placeholder(tf.float32, [None, 2])
model = SequenceClassificationModel(data, target, params)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for index, batch in enumerate(batches):
    feed = {data: batch[0], target: batch[1]}
    error, _ = sess.run([model.error, model.optimize], feed)
    print('{}: {:3.1f}%'.format(index + 1, 100 * error))

References:
"TensorFlow for Machine Intelligence" (《面向機(jī)器智能的TensorFlow實(shí)踐》)

Feel free to add me on WeChat to chat: qingxingfengzi
My WeChat public account: qingxingfengzigz
My wife Zhang Xingqing's WeChat public account: qingqingfeifangz

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容