A DPCNN Implementation Based on fastNLP

The goal of this post is to study the paper: Deep Pyramid Convolutional Neural Networks for Text Categorization (ACL 2017).

This was also my first time using the fastNLP framework, so I'm writing up the problems I ran into with it along the way.
Later I may write a separate post analysing the model construction itself, covering DPCNN's shortcut mechanism and the Pyramid principle.

It mainly references the blog by 与阳光共进早餐.



數(shù)據(jù)預(yù)處理

首先數(shù)據(jù)集已經(jīng)為我們分好label和text,并且該數(shù)據(jù)集只包含了4種類型的新聞分別是:'Business', 'Sci/Tech','World', 'Sports‘,為了之后的方便我要將label和text合并,并且將label轉(zhuǎn)化為數(shù)字類型,值得注意的是fastNLP的label必須從0開始。

import pandas as pd

data_label = pd.read_table("train_labels.txt", sep="\t")
type_mapping = {
    'Business': 0,
    'Sci/Tech': 1,
    'World': 2,
    'Sports': 3
}
data_label.columns = ['label']
data_label['label'] = data_label['label'].map(type_mapping)
    label
0   0
1   0
2   0
3   0
4   0
...

Merge the two:
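The merge code itself isn't shown in the post; here is a minimal sketch of what it could look like, assuming the texts sit in a train_texts.txt read the same way as the labels (that file name is my assumption):

# hypothetical file name, mirroring how the labels were read above
data_text = pd.read_table("train_texts.txt", sep="\t")
data_text.columns = ['text']

# place text and label side by side in one DataFrame
data_train = pd.concat([data_text, data_label], axis=1)
data_train.head(14)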

        text                                           label
0   Wall St. Bears Claw Back Into the Black (Reute...   0
1   Carlyle Looks Toward Commercial Aerospace (Reu...   0
2   Oil and Economy Cloud Stocks' Outlook (Reuters...   0
3   Iraq Halts Oil Exports from Main Southern Pipe...   0
4   Oil prices soar to all-time record posing new...    0
5   Stocks End Up But Near Year Lows (Reuters) Re...    0
6   Money Funds Fell in Latest Week (AP) AP - Asse...   0
7   Fed minutes show dissent over inflation (USATO...   0
8   Safety Net (Forbes.com) Forbes.com - After ear...   0
9   Wall St. Bears Claw Back Into the Black NEW Y...    0
10  Oil and Economy Cloud Stocks' Outlook NEW YOR...    0
11  No Need for OPEC to Pump More-Iran Gov TEHRAN...    0
12  Non-OPEC Nations Should Up Output-Purnomo JAK...    0
13  Google IPO Auction Off to Rocky Start WASHING...    0

The test set is processed the same way; then both are saved locally:

data_train.to_csv("data_train.txt")
data_test.to_csv("data_test.txt")  # same treatment for the test set

Everything above is pandas-based preprocessing; next we use fastNLP's interfaces to prepare the data for the network we'll build.
What follows is the conventional word-embedding pipeline. The paper uses an unsupervised embedding method, but I didn't fully understand it, so I stick with standard embeddings: build a vocabulary, map each word to its index, then bring all sentences to a uniform length by appending 0s; the paper seems to call this step PAD. (Perhaps the word indices could be weighted with TF-IDF, or pretrained GloVe vectors used directly?)
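For intuition, here is a toy illustration of the indexing-plus-padding step (the vocabulary and indices are made up):

# hypothetical word-to-index mapping; 0 is reserved for padding
vocab_demo = {'<pad>': 0, 'oil': 12, 'prices': 57, 'soar': 301}
sentence = ['oil', 'prices', 'soar']
max_len = 6

indexed = [vocab_demo[w] for w in sentence]
padded = indexed + [0] * (max_len - len(indexed))
print(padded)  # [12, 57, 301, 0, 0, 0]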


讀取數(shù)據(jù)

from fastNLP import DataSet
from fastNLP import Instance
from fastNLP import Vocabulary
# 'a' is the leftmost index column written out by to_csv above;
# DataSet seems to require reading in every column
data_train = DataSet.read_csv("data_train.txt", headers=('a', 'text', 'label'))
data_test = DataSet.read_csv("data_test.txt", headers=('a', 'text', 'label'))

Lowercase the text and convert the labels to int:

data_train.apply(lambda x: int(x['label']),new_field_name='label')
data_train.apply(lambda x: x['text'].lower(), new_field_name='text')
data_test.apply(lambda x: int(x['label']),new_field_name='label')
data_test.apply(lambda x: x['text'].lower(), new_field_name='text')

Tokenization

def split_sent(instance):
    return instance['text'].split()
data_train.apply(split_sent,new_field_name='description_words')
data_test.apply(split_sent,new_field_name='description_words')

統(tǒng)計分詞后的長度,得到最大長度,以此來添加‘0’,并且作為后面網(wǎng)絡(luò)maxfeature的參數(shù)

data_train.apply(lambda x: len(x['description_words']),new_field_name='description_seq_len')
data_test.apply(lambda x: len(x['description_words']),new_field_name='description_seq_len')

max_seq_len_train = max(data_train[i]['description_seq_len'] for i in range(len(data_train)))
max_seq_len_test = max(data_test[i]['description_seq_len'] for i in range(len(data_test)))
max_sentence_length = max(max_seq_len_train, max_seq_len_test)
print('max_sentence_length:', max_sentence_length)

Build the vocabulary from the training set:

vocab = Vocabulary(min_freq=2)  # words seen fewer than 2 times fall back to <unk>
data_train.apply(lambda x: [vocab.add(word) for word in x['description_words']])
vocab.build_vocab()
# replace each word with its index; the test set reuses the training vocabulary
data_train.apply(lambda x: [vocab.to_index(word) for word in x['description_words']], new_field_name='description_words')
data_test.apply(lambda x: [vocab.to_index(word) for word in x['description_words']], new_field_name='description_words')

PADDING

def padding_words(data):
    # pad every sentence with 0 (the <pad> index) up to max_sentence_length
    for i in range(len(data)):
        if data[i]['description_seq_len'] <= max_sentence_length:
            padding = [0] * (max_sentence_length - data[i]['description_seq_len'])
            data[i]['description_words'] += padding
    return data

data_train = padding_words(data_train)
data_test = padding_words(data_test)
# refresh the length field after padding
data_train.apply(lambda x: len(x['description_words']), new_field_name='description_seq_len')
data_test.apply(lambda x: len(x['description_words']), new_field_name='description_seq_len')

In fastNLP you have to mark which fields of the dataset are input and which are target, via set_input and set_target. The purpose of the rename_field calls is explained further below.

data_train.rename_field("description_words","description_word_seq")
data_train.rename_field("label","label_seq")
data_test.rename_field("description_words","description_word_seq")
data_test.rename_field("label","label_seq")

data_train.set_input("description_word_seq")
data_test.set_input("description_word_seq")
data_train.set_target("label_seq")
data_test.set_target("label_seq")
print("dataset processed successfully!")

Building the network

DPCNN incorporates the shortcut (residual) mechanism from ResNet.
The reason the fields had to be renamed earlier is that fastNLP matches input field names against the parameters of the model's forward function, and forward must return a dict.
See the fastNLP Tutorials for details.

import torch
import torch.nn as nn

class ResnetBlock(nn.Module):
    def __init__(self, channel_size):
        super(ResnetBlock, self).__init__()

        self.channel_size = channel_size
        # halve the sequence length: pad the right edge so odd lengths
        # survive the stride-2 pooling
        self.maxpool = nn.Sequential(
            nn.ConstantPad1d(padding=(0, 1), value=0),
            nn.MaxPool1d(kernel_size=3, stride=2)
        )
        self.conv = nn.Sequential(
            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Conv1d(self.channel_size, self.channel_size, kernel_size=3, padding=1),

            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Conv1d(self.channel_size, self.channel_size, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # downsample first, then run the pre-activation conv block on the result
        x_shortcut = self.maxpool(x)
        x = self.conv(x_shortcut)
        # shortcut connection: add the pooled input back in
        x = x + x_shortcut
        return x
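
As a quick sanity check of the halving behaviour (a toy snippet of mine, not from the post):

# a ResnetBlock halves the sequence length: 9 -> 4 here
block = ResnetBlock(channel_size=250).eval()  # eval mode: BatchNorm uses running stats
x = torch.randn(2, 250, 9)
print(block(x).shape)  # torch.Size([2, 250, 4])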


class DPCNN(nn.Module):
    def __init__(self,max_features,word_embedding_dimension,max_sentence_length,num_classes):
        super(DPCNN, self).__init__()
        self.max_features = max_features
        self.embed_size = word_embedding_dimension
        self.maxlen = max_sentence_length
        self.num_classes=num_classes
        self.channel_size = 250

        self.embedding = nn.Embedding(self.max_features, self.embed_size)
        torch.nn.init.normal_(self.embedding.weight.data,mean=0,std=0.01)
        self.embedding.weight.requires_grad = True

        # region embedding
        self.region_embedding = nn.Sequential(
            nn.Conv1d(self.embed_size, self.channel_size, kernel_size=3, padding=1),
            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Dropout(0.2)
        )

        self.conv_block = nn.Sequential(
            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Conv1d(self.channel_size, self.channel_size, kernel_size=3, padding=1),
            nn.BatchNorm1d(num_features=self.channel_size),
            nn.ReLU(),
            nn.Conv1d(self.channel_size, self.channel_size, kernel_size=3, padding=1),
        )
        
        # stack downsampling blocks until the sequence length shrinks to 2;
        # this repeated halving is the "pyramid" in DPCNN
        self.seq_len = self.maxlen
        resnet_block_list = []
        while (self.seq_len > 2):
            resnet_block_list.append(ResnetBlock(self.channel_size))
            self.seq_len = self.seq_len // 2
        self.resnet_layer = nn.Sequential(*resnet_block_list)
        
        self.fc = nn.Sequential(
            nn.Linear(self.channel_size * self.seq_len, self.num_classes),
            nn.BatchNorm1d(self.num_classes),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(self.num_classes, self.num_classes)
        )
    def forward(self, description_word_seq):
        x = self.embedding(description_word_seq)  # (batch, seq_len, embed_size)
        x = x.permute(0, 2, 1)                    # Conv1d expects (batch, channels, seq_len)
        x = self.region_embedding(x)
        x = self.conv_block(x)
        x = self.resnet_layer(x)
        x = x.permute(0, 2, 1)
        x = x.contiguous().view(x.size(0), -1)    # flatten for the classifier
        output = self.fc(x)
        return {'output': output}
    def predict(self, description_word_seq):
        """
        :param description_word_seq: torch.LongTensor, [batch_size, seq_len]
        :return predict: dict of torch.LongTensor, [batch_size]
        """
        output = self(description_word_seq)
        _, predict = output['output'].max(dim=1)
        return {'predict': predict}

Some of the hyperparameters used:

word_embedding_dimension = 300
num_classes = 4
pickle_path = 'result/'
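The post doesn't show how model, loss and metric are built before the Trainer call below. Here is a minimal sketch under the fastNLP-0.4-era API; the pred/target key names are assumptions that must match the dicts returned by forward/predict and the renamed target field:

from fastNLP import Trainer
from fastNLP import CrossEntropyLoss, AccuracyMetric
from fastNLP import Adam

max_features = len(vocab)  # embedding table size comes from the vocabulary
model = DPCNN(max_features, word_embedding_dimension,
              max_sentence_length, num_classes)

# quick shape check with a dummy batch (made-up values)
dummy = torch.randint(0, max_features, (4, max_sentence_length), dtype=torch.long)
assert model(dummy)['output'].shape == (4, num_classes)

# 'output' comes from forward(), 'predict' from predict(), 'label_seq' is the renamed field
loss = CrossEntropyLoss(pred='output', target='label_seq')
metric = AccuracyMetric(pred='predict', target='label_seq')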

Training and testing the network follow the blog by 与阳光共进早餐.
One thing to watch out for: fastNLP seems to target PyTorch 0.4, while I'm on 1.0.1. With the original blogger's setting save_path=None, fastNLP loads a 0.4-style state dict for the model parameters, which doesn't match my PyTorch version:

trainer = Trainer(model=model, train_data=data_train, dev_data=data_test,
                  loss=loss, metrics=metric, save_path=None,
                  batch_size=64, n_epochs=5,
                  optimizer=Adam(lr=0.001, weight_decay=0.0001))
trainer.train()

So when calling this interface I set a value for save_path, so that the interface uses the state dict of the model saved after training:

trainer = Trainer(model=model, train_data=data_train, dev_data=data_test,
                  loss=loss, metrics=metric, save_path='CD',
                  batch_size=64, n_epochs=5,
                  optimizer=Adam(lr=0.001, weight_decay=0.0001))
trainer.train()
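The last lines of the log below ("new_model.pkl saved in result/" and the [tester] block) come from saving the model and running a standalone evaluation; the post doesn't show that code, but under the same fastNLP-0.4-era API it would look roughly like this:

from fastNLP import Tester
from fastNLP.io.model_io import ModelSaver

# save the trained weights
saver = ModelSaver(pickle_path + "new_model.pkl")
saver.save_pytorch(model)

# standalone evaluation on the test set; prints the [tester] lines
tester = Tester(data=data_test, model=model,
                metrics=AccuracyMetric(pred='predict', target='label_seq'))
tester.test()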

Final results:

Evaluation at Epoch 1/5. Step:1875/9375. AccuracyMetric: acc=0.881053

Evaluation at Epoch 2/5. Step:3750/9375. AccuracyMetric: acc=0.884474

Evaluation at Epoch 3/5. Step:5625/9375. AccuracyMetric: acc=0.889342

Evaluation at Epoch 4/5. Step:7500/9375. AccuracyMetric: acc=0.889737

Evaluation at Epoch 5/5. Step:9375/9375. AccuracyMetric: acc=0.895789


In Epoch:5/Step:9375, got best dev performance:AccuracyMetric: acc=0.895789
Reloaded the best model.
new_model.pkl saved in result/
[tester] 
AccuracyMetric: acc=0.895789
{'AccuracyMetric': {'acc': 0.895789}}