PyTorch Learning Notes: A Deeper Look at TorchText, Part 01

After getting a simple torchtext example working, I want to study torchtext further. I found two tutorials.

1. Introduction to practical-torchtext

A tutorial on using torchtext effectively, in two parts:

  • Text classification
  • Word-level language modeling

1.1 Goals

torchtext's documentation is still relatively incomplete, and using it effectively currently requires reading a fair amount of code. This set of tutorials aims to provide working examples of torchtext so that more users can get the most out of this fantastic library.

1.2 Setup

The current pip release of torchtext has some bugs that cause certain code to run incorrectly. These bugs are fixed only on the master branch of the torchtext GitHub repository, so the tutorial recommends installing torchtext from GitHub with the following command:

pip install --upgrade git+https://github.com/pytorch/text

2. Text Classification with torchtext

The first lesson is based on the same tutorial I have been working through repeatedly over the past two days, but the author enriches and explains it further, so I am running it here once more.

2.0 Overview

The workflow is the same as before: load the data, preprocess it into a Dataset, and feed it to the model.
The dataset is still the Kaggle toxic comment data used earlier.

import pandas as pd
import numpy as np
import torch
from torch.nn import init
from torchtext.data import Field

2.1 Declaring the Fields

The Field class determines how the data is preprocessed and converted into numeric form.
This is straightforward. Preprocessing the labels is even easier, because they are already binary-encoded; all we need to do is tell the Field class that the labels are already processed, which we do by passing use_vocab=False to the constructor.

tokenize = lambda x: x.split()
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
LABEL = Field(sequential=False, use_vocab=False)
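
A plain-Python sketch of the preprocessing pipeline these two Fields define (hypothetical code, not torchtext internals): TEXT lowercases and whitespace-tokenizes, while LABEL passes the already-binary value through unchanged.

```python
def text_preprocess(raw):
    # mirrors TEXT: sequential=True, lower=True, tokenize = split on whitespace
    return raw.lower().split()

def label_preprocess(raw):
    # mirrors LABEL: sequential=False, use_vocab=False --
    # the labels are already 0/1, so just coerce to int
    return int(raw)

print(text_preprocess("Thanks for UPLOADING"))  # ['thanks', 'for', 'uploading']
print(label_preprocess("1"))                    # 1
```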

2.2 Creating the Dataset

We use the TabularDataset class to read our data, since it is in csv format (as of now, TabularDataset handles csv, tsv, and json files).

For the train and validation data, we need to process the labels. The fields we pass in must be in the same order as the columns. For columns we do not use, we pass in a tuple whose second element is None.
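
How the column/field pairing works can be sketched in plain Python (a hypothetical simplification of what TabularDataset does, with stand-in strings for the Field objects):

```python
import csv, io

# fields are paired with csv columns strictly by position;
# a None field means that column is dropped
fields = [("id", None), ("comment_text", "TEXT"), ("toxic", "LABEL")]

def read_rows(csv_text, fields):
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip_header=True
    for row in reader:
        yield {name: val for (name, field), val in zip(fields, row)
               if field is not None}

sample = "id,comment_text,toxic\n42,you are great,0\n"
print(list(read_rows(sample, fields)))
# [{'comment_text': 'you are great', 'toxic': '0'}]
```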

%%time
from torchtext.data import TabularDataset

tv_datafields=[
    ('id',None),
    ('comment_text',TEXT),
    ("toxic", LABEL),
    ("severe_toxic", LABEL),
    ("threat", LABEL),
    ("obscene", LABEL), 
    ("insult", LABEL),
    ("identity_hate", LABEL)
]
trn,vld=TabularDataset.splits(
    path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code\data\torchtextdata',
    train='train.csv',
    validation='valid.csv',
    format='csv',
    skip_header=True,
    fields=tv_datafields
)
Wall time: 4.99 ms
%%time
tst_datafields=[
    ('id',None),
    ('comment_text',TEXT)
]
tst=TabularDataset(
    path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code\data\torchtextdata\test.csv',
    format='csv',
    skip_header=True,
    fields=tst_datafields
)
Wall time: 3.01 ms

2.3 Building the Vocabulary

For the TEXT field to convert words to integers, it needs to know the full vocabulary. To do this, we call TEXT.build_vocab, passing in the dataset to build the vocabulary from.

%%time
TEXT.build_vocab(trn)
TEXT.vocab.freqs.most_common(10)
print(TEXT.vocab.freqs.most_common(10))
[('the', 78), ('to', 41), ('you', 33), ('of', 30), ('and', 26), ('a', 26), ('is', 24), ('that', 22), ('i', 20), ('if', 19)]
Wall time: 3.99 ms
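
What build_vocab does can be sketched in a few lines of plain Python (a hypothetical minimal version, not torchtext's implementation): count every token, then hand out integer ids, reserving slots for the special <unk> and <pad> tokens as torchtext does by default.

```python
from collections import Counter

def build_vocab(tokenized_examples):
    # count token frequencies across all examples
    freqs = Counter(tok for ex in tokenized_examples for tok in ex)
    # id 0 -> <unk>, id 1 -> <pad>, then words by descending frequency
    itos = ["<unk>", "<pad>"] + [w for w, _ in freqs.most_common()]
    stoi = {w: i for i, w in enumerate(itos)}
    return freqs, stoi

freqs, stoi = build_vocab([["the", "cat"], ["the", "dog"]])
print(freqs.most_common(1))        # [('the', 2)]
print(stoi["<unk>"], stoi["the"])  # 0 2
```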

Each element of the dataset is an Example object holding the individual fields as attributes.

# inspect the field names on the first training example
print(trn[0].__dict__.keys())
# look at the text of one row; as the output shows, it has already been tokenized
print(trn[10].comment_text)
dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
['"', 'fair', 'use', 'rationale', 'for', 'image:wonju.jpg', 'thanks', 'for', 'uploading', 'image:wonju.jpg.', 'i', 'notice', 'the', 'image', 'page', 'specifies', 'that', 'the', 'image', 'is', 'being', 'used', 'under', 'fair', 'use', 'but', 'there', 'is', 'no', 'explanation', 'or', 'rationale', 'as', 'to', 'why', 'its', 'use', 'in', 'wikipedia', 'articles', 'constitutes', 'fair', 'use.', 'in', 'addition', 'to', 'the', 'boilerplate', 'fair', 'use', 'template,', 'you', 'must', 'also', 'write', 'out', 'on', 'the', 'image', 'description', 'page', 'a', 'specific', 'explanation', 'or', 'rationale', 'for', 'why', 'using', 'this', 'image', 'in', 'each', 'article', 'is', 'consistent', 'with', 'fair', 'use.', 'please', 'go', 'to', 'the', 'image', 'description', 'page', 'and', 'edit', 'it', 'to', 'include', 'a', 'fair', 'use', 'rationale.', 'if', 'you', 'have', 'uploaded', 'other', 'fair', 'use', 'media,', 'consider', 'checking', 'that', 'you', 'have', 'specified', 'the', 'fair', 'use', 'rationale', 'on', 'those', 'pages', 'too.', 'you', 'can', 'find', 'a', 'list', 'of', "'image'", 'pages', 'you', 'have', 'edited', 'by', 'clicking', 'on', 'the', '""my', 'contributions""', 'link', '(it', 'is', 'located', 'at', 'the', 'very', 'top', 'of', 'any', 'wikipedia', 'page', 'when', 'you', 'are', 'logged', 'in),', 'and', 'then', 'selecting', '""image""', 'from', 'the', 'dropdown', 'box.', 'note', 'that', 'any', 'fair', 'use', 'images', 'uploaded', 'after', '4', 'may,', '2006,', 'and', 'lacking', 'such', 'an', 'explanation', 'will', 'be', 'deleted', 'one', 'week', 'after', 'they', 'have', 'been', 'uploaded,', 'as', 'described', 'on', 'criteria', 'for', 'speedy', 'deletion.', 'if', 'you', 'have', 'any', 'questions', 'please', 'ask', 'them', 'at', 'the', 'media', 'copyright', 'questions', 'page.', 'thank', 'you.', '(talk', '?', 'contribs', '?', ')', 'unspecified', 'source', 'for', 'image:wonju.jpg', 'thanks', 'for', 'uploading', 'image:wonju.jpg.', 'i', 'noticed', 'that', 'the', "file's", 
'description', 'page', 'currently', "doesn't", 'specify', 'who', 'created', 'the', 'content,', 'so', 'the', 'copyright', 'status', 'is', 'unclear.', 'if', 'you', 'did', 'not', 'create', 'this', 'file', 'yourself,', 'then', 'you', 'will', 'need', 'to', 'specify', 'the', 'owner', 'of', 'the', 'copyright.', 'if', 'you', 'obtained', 'it', 'from', 'a', 'website,', 'then', 'a', 'link', 'to', 'the', 'website', 'from', 'which', 'it', 'was', 'taken,', 'together', 'with', 'a', 'restatement', 'of', 'that', "website's", 'terms', 'of', 'use', 'of', 'its', 'content,', 'is', 'usually', 'sufficient', 'information.', 'however,', 'if', 'the', 'copyright', 'holder', 'is', 'different', 'from', 'the', "website's", 'publisher,', 'then', 'their', 'copyright', 'should', 'also', 'be', 'acknowledged.', 'as', 'well', 'as', 'adding', 'the', 'source,', 'please', 'add', 'a', 'proper', 'copyright', 'licensing', 'tag', 'if', 'the', 'file', "doesn't", 'have', 'one', 'already.', 'if', 'you', 'created/took', 'the', 'picture,', 'audio,', 'or', 'video', 'then', 'the', 'tag', 'can', 'be', 'used', 'to', 'release', 'it', 'under', 'the', 'gfdl.', 'if', 'you', 'believe', 'the', 'media', 'meets', 'the', 'criteria', 'at', 'wikipedia:fair', 'use,', 'use', 'a', 'tag', 'such', 'as', 'or', 'one', 'of', 'the', 'other', 'tags', 'listed', 'at', 'wikipedia:image', 'copyright', 'tags#fair', 'use.', 'see', 'wikipedia:image', 'copyright', 'tags', 'for', 'the', 'full', 'list', 'of', 'copyright', 'tags', 'that', 'you', 'can', 'use.', 'if', 'you', 'have', 'uploaded', 'other', 'files,', 'consider', 'checking', 'that', 'you', 'have', 'specified', 'their', 'source', 'and', 'tagged', 'them,', 'too.', 'you', 'can', 'find', 'a', 'list', 'of', 'files', 'you', 'have', 'uploaded', 'by', 'following', '[', 'this', 'link].', 'unsourced', 'and', 'untagged', 'images', 'may', 'be', 'deleted', 'one', 'week', 'after', 'they', 'have', 'been', 'tagged,', 'as', 'described', 'on', 'criteria', 'for', 'speedy', 'deletion.', 'if', 'the', 
'image', 'is', 'copyrighted', 'under', 'a', 'non-free', 'license', '(per', 'wikipedia:fair', 'use)', 'then', 'the', 'image', 'will', 'be', 'deleted', '48', 'hours', 'after', '.', 'if', 'you', 'have', 'any', 'questions', 'please', 'ask', 'them', 'at', 'the', 'media', 'copyright', 'questions', 'page.', 'thank', 'you.', '(talk', '?', 'contribs', '?', ')', '"']

2.4 Building the Iterator

During training we use a special iterator called BucketIterator. When data is passed to a neural network, we want to pad the sequences to the same length so they can be processed in batches.
If sequence lengths vary widely, the padding wastes a great deal of memory and time. BucketIterator groups sequences of similar length into each batch to minimize padding.
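
The bucketing idea can be sketched in plain Python (a simplified, hypothetical version of what BucketIterator does): sort the examples by length so each batch contains similar-length sequences, then pad each batch only up to its own maximum rather than the global maximum.

```python
def bucket_batches(examples, batch_size, pad="<pad>"):
    ordered = sorted(examples, key=len)  # plays the role of sort_key
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        width = max(len(ex) for ex in batch)
        # pad every sequence only to this batch's own max length
        yield [ex + [pad] * (width - len(ex)) for ex in batch]

examples = [["a"], ["b", "c"], ["d", "e", "f", "g"], ["h", "i", "j"]]
for batch in bucket_batches(examples, batch_size=2):
    print([len(ex) for ex in batch])
# [2, 2]
# [4, 4]  -- the short batch is padded to 2, not to the global max of 4
```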

from torchtext.data import Iterator, BucketIterator
# sort_key tells the BucketIterator which attribute to sort and group on;
# here it is clearly comment_text
# repeat=False because we will wrap the iterator ourselves later
train_iter, val_iter = BucketIterator.splits(
    (trn, vld),
    batch_sizes=(64, 64),
    device=torch.device('cpu'),  # pass a torch.device instead of the deprecated -1
    sort_key=lambda x: len(x.comment_text),
    sort_within_batch=False,
    repeat=False
)

# Now let's look at what the BucketIterator produces.
batch = next(iter(train_iter)); batch

[torchtext.data.batch.Batch of size 25]
    [.comment_text]:[torch.LongTensor of size 494x25]
    [.toxic]:[torch.LongTensor of size 25]
    [.severe_toxic]:[torch.LongTensor of size 25]
    [.threat]:[torch.LongTensor of size 25]
    [.obscene]:[torch.LongTensor of size 25]
    [.insult]:[torch.LongTensor of size 25]
    [.identity_hate]:[torch.LongTensor of size 25]
batch.__dict__.keys()
dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
test_iter = Iterator(tst, batch_size=64, device=torch.device('cpu'), sort=False, sort_within_batch=False, repeat=False)

2.5 Wrapping the Iterator

Currently, the iterator returns a custom datatype called torchtext.data.Batch. This makes code reuse difficult (the code must change whenever the column names do) and makes torchtext hard to use with some other libraries for certain use cases (such as torchsample and fastai).
Here the tutorial writes a thin wrapper that makes batches easy to consume. Specifically, it converts each batch into a tuple (x, y), where x is the independent variable (the model input) and y is the dependent variable (the supervision targets).

class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
    
    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper
            
            if self.y_vars is not None: # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))

            yield (x, y)
    
    def __len__(self):
        return len(self.dl)
train_dl = BatchWrapper(train_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
valid_dl = BatchWrapper(val_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
test_dl = BatchWrapper(test_iter, "comment_text", None)

Let's verify. My understanding here: the __iter__ method is what yields the tensors, and that does appear to work.

next(iter(train_dl))
(tensor([[ 63,  66, 354,  ..., 334, 453, 778],
         [  4,  82,  63,  ...,  55, 523, 650],
         [664,   2,   4,  ..., 520,  30,  22],
         ...,
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1],
         [  1,   1,   1,  ...,   1,   1,   1]]),
 tensor([[0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [1., 1., 0., 1., 1., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.]]))

2.6 Training a Text Classifier

Still an LSTM, as before.

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

class LSTM(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300, num_linear=1):
        super().__init__()
        self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1)
        # register the extra linear layers in a ModuleList so that
        # their parameters are visible to the optimizer
        self.linear_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_linear - 1)]
        )
        self.predictor = nn.Linear(hidden_dim, 6)

    def forward(self, seq):
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]  # hidden state of the last time step
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature)

        return preds
em_sz = 100
nh = 500
nl = 3  # defined here but not passed to the model, so num_linear stays at 1
model = LSTM(nh, emb_dim=em_sz)
%%time

import tqdm
opt=optim.Adam(model.parameters(),lr=1e-2)
loss_func=nn.BCEWithLogitsLoss()
epochs=2
Wall time: 0 ns
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train()
    for x, y in tqdm.tqdm(train_dl):
        opt.zero_grad()
        preds = model(x)
        loss = loss_func(preds, y)  # BCEWithLogitsLoss expects (input, target)
        loss.backward()
        opt.step()

        running_loss += loss.item() * x.size(0)
    epoch_loss = running_loss / len(trn)

    val_loss = 0.0
    model.eval()  # evaluation mode
    for x, y in valid_dl:
        preds = model(x)
        loss = loss_func(preds, y)
        val_loss += loss.item() * x.size(0)

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))
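
Why BCEWithLogitsLoss rather than a sigmoid followed by BCELoss: it fuses the two into one numerically stable expression. A per-element sketch of the math (a hypothetical pure-Python reimplementation, not torch code):

```python
import math

def bce_with_logits(x, y):
    # stable fused form: max(x, 0) - x*y + log(1 + exp(-|x|))
    return max(x, 0) - x * y + math.log1p(math.exp(-abs(x)))

def bce_naive(x, y):
    p = 1 / (1 + math.exp(-x))  # sigmoid first...
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))  # ...then BCE

# the two agree for moderate logits, but only the fused form
# stays finite for extreme logit values
print(round(bce_with_logits(2.0, 1.0), 6))  # 0.126928
print(round(bce_naive(2.0, 1.0), 6))        # 0.126928
```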
test_preds = []
for x, y in tqdm.tqdm(test_dl):
    preds = model(x)
    preds = preds.data.numpy()
    # the model outputs logits, so pass them through a sigmoid
    preds = 1 / (1 + np.exp(-preds))
    test_preds.append(preds)
test_preds = np.hstack(test_preds)  # moved out of the loop so appending keeps working

print(test_preds)
100%|██████████| 1/1 [00:06<00:00,  6.28s/it]


Epoch: 1, Training Loss: 14.2130, Validation Loss: 4.4170


100%|██████████| 1/1 [00:04<00:00,  4.20s/it]


Epoch: 2, Training Loss: 10.5315, Validation Loss: 3.3947


100%|██████████| 1/1 [00:00<00:00,  2.87it/s]


[[0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99978834 0.99761593 0.5279695  0.9961003  0.9957486  0.3662841 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
 [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]]

2.7 Testing and Inspecting the Predictions

test_preds = []
for x, y in tqdm.tqdm(test_dl):
    preds = model(x)
    # if your data is on the GPU, move it back to the cpu first
    # preds = preds.data.cpu().numpy()
    preds = preds.data.numpy()
    # the actual outputs of the model are logits, so pass them through a sigmoid
    preds = 1 / (1 + np.exp(-preds))
    test_preds.append(preds)
test_preds = np.hstack(test_preds)
100%|██████████| 1/1 [00:00<00:00,  2.77it/s]
df = pd.read_csv("./data/torchtextdata/test.csv")
for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
    df[col] = test_preds[:, i]
df.head(3)