Introduction

Starting with the deeplearning.ai course, I am trying to pick up the NLP skills I have neglected for 3 years.
Coursera course link

Setting up a Jupyter + VS Code learning environment

Start with Why

Why use VS Code?

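I would love to be lazy and just say something like "those who use it know, those who don't are missing out", but to reassure anyone who still has doubts, I should at least explain in my own words why this tool is so good.

VS Code is the "Swiss Army knife" of programming. If there is something it seems unable to do, you probably just have not found the right extension yet.

So I am not going to write a detailed list of VS Code features. Anyone who has used it knows that once you start, you may not be able to do without it, not only for writing code but even for everyday writing.

The work I currently do in VS Code includes front-end and back-end development, debugging, note-taking, remote server login and operations, the command line, deploying code and blog posts, and so on. I expect it to become the entry point for even more work in the future, just as I now want to integrate Jupyter into it.

I used Jupyter Notebook a few years ago when I first started learning machine learning, but I gradually used it less (my Mac's limited compute made running machine learning too painful) and eventually forgot even how to set up the environment.

Recently I decided to pick my NLP skills back up, which led to this idea. A quick search showed that a solution does exist, so let's get started without further ado.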

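Step 1: Set up a local Jupyter server

Jupyter Notebook is a very good tool for learning AI programming.

First, Jupyter combined with Anaconda lets you develop in separate package environments. Machine learning work in particular often needs many libraries loaded together, and this saves the trouble of wrestling with different package versions and runtime environments.

Second, with Jupyter you can take notes while viewing program output, without switching windows. When you finish a study session, the notes can be published and shared right away, which reinforces your motivation to learn.

So I strongly recommend setting up a Jupyter Notebook environment on your own computer. There are plenty of tutorials online, so I will not repeat them here.

First, install a Python 3 environment yourself.

Download and install Anaconda (what on earth is this thing?)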


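I remember installing Anaconda from the command line in the past; now you simply download the installer and run it. Fine, so let me briefly explain why to use Anaconda and what to watch out for when using it.

Anaconda solves the problem of inconsistent runtime environments: you can configure a separate, isolated environment for each application. (That is it in one sentence.)

If that sentence still does not make sense, try cloning a few Python projects from GitHub and running them without Anaconda; you will soon see why it is needed.

After installation, start Jupyter Notebook by running

jupyter notebook

Once the command succeeds, the system automatically opens localhost:8888 in the browser. Since we are going to use Jupyter inside VS Code, you can simply close that page.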

Step 2: Configure VS Code

Install the Jupyter extension from the extension marketplace, then open the Command Palette (Shift+Command+P)

Run Jupyter: Create New Blank Jupyter Notebook


Then you can start using it. The new document shows the same information as the web interface: the connected local server, whether the Python 3 kernel is busy, and so on.
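To confirm that the notebook is really running on the interpreter from the intended Anaconda environment, a quick check like the following can be run in the first cell. This is only a minimal sketch using the standard library; the helper name describe_kernel is made up for illustration.

```python
import platform
import sys


def describe_kernel():
    """Return basic facts about the interpreter running this notebook cell."""
    return {
        "executable": sys.executable,          # path of the Python binary in use
        "version": platform.python_version(),  # version string of the interpreter
        "prefix": sys.prefix,                  # root folder of the active environment
    }


info = describe_kernel()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the printed prefix does not point into the Anaconda environment you expected, the notebook is probably attached to a different kernel.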


References

Working with Jupyter Notebooks in Visual Studio Code
Install and Use — Jupyter Documentation 4.1.1 alpha documentation

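NLP Basics 01 - Data Preprocessing

Preprocessing the data
Processing the dataset with NLTK

Importing packages

NLTK (http://www.nltk.org/) is a natural language toolkit. It provides easy-to-use interfaces to over 50 corpora and lexical resources (such as WordNet), along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

It is suitable for linguists, engineers, students, educators, researchers, and industry users. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open-source, community-driven project.

The book Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. The online version of the book has been updated for Python 3 and NLTK 3. (The original Python 2 version is still available at http://nltk.org/book_1ed.)

The task is sentiment analysis on tweets, that is, deciding whether each tweet is positive, negative, or neutral.
The NLTK package ships with a Twitter sample dataset that can be used directly.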

import nltk
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import random

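About the Twitter dataset

This NLTK dataset has already split the tweets into positive and negative, 5,000 of each. Although the dataset comes from real data, this balanced split is artificial.

Running nltk.download('twitter_samples') locally failed with: Errno 61 Connection refused

So the following has to be done from the command line (working with the command line in the same window is one of VS Code's advantages)

Download the zip from nltk/nltk_data: NLTK Data
After unzipping, rename the folder to nltk_data
If the next step raises an error, follow the message and move the folder into one of the directories the program searches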
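To decide where the unzipped folder should go, it can help to list likely target directories first. The sketch below is an assumption-laden illustration: it covers only two locations I am confident NLTK checks (the NLTK_DATA environment variable and ~/nltk_data), the helper name candidate_nltk_data_dirs is made up, and the authoritative search list is nltk.data.path.

```python
import os
from pathlib import Path


def candidate_nltk_data_dirs(env=None, home=None):
    """List directories where an nltk_data folder could be placed.

    Assumption: only NLTK_DATA and ~/nltk_data are covered here;
    nltk.data.path holds the full list that NLTK really searches.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    dirs = []
    # NLTK_DATA may hold several directories separated by os.pathsep
    for entry in env.get("NLTK_DATA", "").split(os.pathsep):
        if entry:
            dirs.append(Path(entry))
    dirs.append(home / "nltk_data")
    return dirs


for folder in candidate_nltk_data_dirs():
    corpus = folder / "corpora" / "twitter_samples"
    # the corpus may be stored as a folder or as a zip file
    found = corpus.exists() or corpus.with_suffix(".zip").exists()
    print(folder, "->", "twitter_samples found" if found else "twitter_samples not found")
```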

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Load the data with the strings() function

all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

Now let's first look at what the data looks like. This is a very important step before doing any real computation.

print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('\nThe type of all_negative_tweets is: ', type(all_negative_tweets))
print('\nThe type of a tweet entry is: ', type(all_negative_tweets[0]))

Number of positive tweets:  5000
Number of negative tweets:  5000

The type of all_positive_tweets is:  <class 'list'>

The type of all_negative_tweets is:  <class 'list'>

The type of a tweet entry is:  <class 'str'>

The results above show that the two JSON files have been loaded as lists, and each tweet is a string.

You can also use the pyplot library to draw a pie chart describing the data above (a bit of data visualization never hurts).

For pyplot usage, see Basic pie chart (Matplotlib 3.3.3 documentation)

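# custom figure size
fig = plt.figure(figsize=(5, 5))

# labels for the two classes
labels = 'Positive', 'Negative'

# size of each slice
sizes = [len(all_positive_tweets), len(all_negative_tweets)]

# draw the pie chart: slice sizes, labels, percentages with one decimal place,
# shadow, and a start angle of 90 degrees so the split is vertical
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)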

plt.axis('equal')

plt.show()

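Viewing the raw text data

To see what the real data looks like, the code below prints one positive and one negative tweet in different colors.

# positive tweet in green
# randint includes both endpoints, so the upper bound is the last valid index
print('\033[92m' + all_positive_tweets[random.randint(0, 4999)])

# negative tweet in red
print('\033[91m' + all_negative_tweets[random.randint(0, 4999)])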

@JayHorwell Hi Jay, if you haven't received it yet please email our events team at events@breastcancernow.org and they'll sort it :)
@pickledog47 @FoxyLustyGrover Its Kate, tho!!  :(  #sniff
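One detail worth spelling out when sampling by index: random.randint(a, b) includes both endpoints, so for a list of 5,000 tweets the largest safe upper bound is 4999. A minimal sketch that avoids hard-coding the size follows; the helper name pick_random and the sample strings are made up for illustration.

```python
import random


def pick_random(items, rng=random):
    """Return one random element using an index between 0 and len(items) - 1."""
    index = rng.randint(0, len(items) - 1)  # randint includes both endpoints
    return items[index]


sample_tweets = ["great day :)", "so tired :(", "love this! :D"]  # made-up stand-ins
print(pick_random(sample_tweets))
# random.choice(sample_tweets) is an equivalent shortcut
```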

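This shows that the data contains quite a few emoticons and URLs, which need to be taken into account in later processing.

Preprocessing the raw text

Data preprocessing is a key step in all machine learning. It includes data cleaning and formatting. For NLP, the main tasks are:

Tokenization
Handling case (lowercasing)
Removing stop words and punctuation
Stemming (specific to languages such as English)

# pick a fairly complex example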
tweet = all_positive_tweets[2277]

print(tweet)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

import re                                     # regular expression library
import string                                 # string operations

from nltk.corpus import stopwords             # NLTK stop word lists (apparently no Chinese list)
from nltk.stem import PorterStemmer           # stemming module
from nltk.tokenize import TweetTokenizer      # tokenizer designed for tweets

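Removing hyperlinks, Twitter marks, and formatting

Remove strings that are common on the Twitter platform; as on Weibo, there are many '@', '#', and URLs.
Use the re library for regular expression operations, with sub() replacing each match with an empty string.

For Python regular expressions, see: Python 正则表达式 | 菜鸟教程 (Runoob)
You can debug regular expressions directly with VS Code's find tool.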

print('\033[92m' + tweet)
print('\033[94m')

tweet2 = re.sub(r'^RT[\s]+', '', tweet)  # strip a leading "RT" plus whitespace, i.e. old-style retweets

tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)  # remove hyperlinks

tweet2 = re.sub(r'#', '', tweet2)  # remove only the '#' sign and keep the hashtag word

print(tweet2)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…
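One behavior of the hyperlink pattern is easy to miss: because it uses .* it deletes everything from the URL to the end of the line, which is harmless here only because the link is the last thing in the tweet. The sketch below compares it with a narrower pattern; the function name strip_tweet_markup, the greedy_url flag, the \S+ variant, and the sample text (including its URL) are all my own illustrations, not part of the course code.

```python
import re


def strip_tweet_markup(text, greedy_url=True):
    """Remove a leading RT marker, hyperlinks, and '#' signs from a tweet."""
    text = re.sub(r'^RT[\s]+', '', text)
    if greedy_url:
        # pattern used in this post: deletes from the URL to the end of the line
        text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)
    else:
        # hypothetical narrower variant: deletes only the URL itself
        text = re.sub(r'https?://\S+', '', text)
    return re.sub(r'#', '', text)


sample = "RT Great read https://t.co/abc123 highly recommended #nlp"
print(repr(strip_tweet_markup(sample)))                    # 'Great read '
print(repr(strip_tweet_markup(sample, greedy_url=False)))  # 'Great read  highly recommended nlp'
```

Which variant is preferable depends on the data: when links appear mid-tweet, the first pattern also discards the words that follow them.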

Let's first try tokenizing directly and see what the result looks like

print()
print('\033[92m' + tweet2)
print('\033[94m')

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

tweet_tokens = tokenizer.tokenize(tweet2)

print()
print('Tokenized string:')
print(tweet_tokens)

My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…

Tokenized string:
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']

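Removing stop words and punctuation

Stop words are common words that carry little meaning of their own. When I generated word clouds in the past, words such as “的” and “那么” always dominated, so it is best to remove them before processing.
English is somewhat different; see the output of the next step.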

stopwords_english = stopwords.words('english')

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Punctuation

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

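We can see that the stop word list above contains some words that may be important, for example "i", "not", "between", "won", and "against".

Depending on the goal of the analysis, the stop word list may need further tailoring. The nltk_data we downloaded earlier contains a stopwords folder, and the file named english in it is this stop word list.
In this exercise we use the whole list.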
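As an illustration of tailoring the list, the sketch below keeps negation words out of the stop list so that they survive filtering. It uses a short made-up excerpt instead of the full NLTK list, and the choice of words to keep is only a hypothetical example.

```python
# made-up excerpt standing in for stopwords.words('english')
stopwords_sample = ['i', 'me', 'my', 'am', 'not', 'no', 'on', 'a', 'off', 'very']

# hypothetical choice of words worth keeping for sentiment analysis
keep_for_sentiment = {'not', 'no'}

custom_stopwords = [w for w in stopwords_sample if w not in keep_for_sentiment]

tokens = ['i', 'am', 'not', 'happy']
print([t for t in tokens if t not in stopwords_sample])  # ['happy']
print([t for t in tokens if t not in custom_stopwords])  # ['not', 'happy']
```

With the full list the negation disappears and "not happy" collapses to "happy", which is exactly the kind of loss a sentiment task may want to avoid.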

Now remove the stop words and punctuation from the token list

print()
print('\033[92m')
print(tweet_tokens)
print('\033[94m')

tweets_clean = []

for word in tweet_tokens:
    if (word not in stopwords_english and
        word not in string.punctuation):
        tweets_clean.append(word)

print('removed stop words and punctuation:')
print(tweets_clean)

['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
removed stop words and punctuation:
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
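The emoticon ':)' and the ellipsis '…' survive this filter for a reason that is worth making explicit: string.punctuation is a plain string, so the in operator performs a substring test. The sketch below demonstrates this; the helper name is_punctuation_token is made up.

```python
import string


def is_punctuation_token(token):
    """Mimic `token in string.punctuation`, which is a substring test."""
    return token in string.punctuation


for token in ['!', ':)', '…', '()']:
    print(repr(token), is_punctuation_token(token))
# '!'  True   (a single ASCII punctuation character)
# ':)' False  (':' and ')' are not adjacent in string.punctuation)
# '…'  False  (not an ASCII character)
# '()' True   ('(' and ')' happen to be adjacent in string.punctuation)
```

For sentiment analysis this is convenient, since emoticons carry signal, but it also means that non-ASCII punctuation needs separate handling if it should be removed.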

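Stemming

This needs special consideration when processing English. For example:

learn
learning
learned
learnt

These words all share the stem learn, but the stem a stemmer extracts is not always a dictionary word. Take happy, for example:

happy
happiness
happier

We need to extract happi rather than happ, because happ is the stem of happen.

NLTK has several modules for stemming; here we use PorterStemmer.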

print()
print('\033[92m')
print(tweets_clean)
print('\033[94m')

stemmer = PorterStemmer()

tweets_stem = []

for word in tweets_clean:
    stem_word = stemmer.stem(word)
    tweets_stem.append(stem_word)

print('stemmed words:')
print(tweets_stem)

['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
stemmed words:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']

process_tweet()

The steps above can be wrapped in a file such as utils.py, for example in a process_tweet function that is used as follows.
The code for utils.py is given at the end.

from utils import process_tweet

tweet = all_positive_tweets[2277]

print()
print('\033[92m')
print(tweet)
print('\033[94m')

tweets_stem = process_tweet(tweet)

print('preprocessed tweet:')
print(tweets_stem)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
preprocessed tweet:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']

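Summary

Through this exercise we learned the general NLP preprocessing pipeline. In practice (especially when Chinese is involved) it is more complex and has to be adjusted continually to fit the data at hand.

Save the following as utils.py in the same directory as the .ipynb file; otherwise the last step will not run.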

import re
import string
import numpy as np


from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean


def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    
    return freqs
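To see what the dictionary produced by build_freqs looks like without loading NLTK, here is a small standalone sketch. It swaps in collections.Counter for the plain dict and a simple lowercase-and-split tokenizer for process_tweet; both substitutions, the name build_freqs_sketch, and the sample tweets are my own assumptions for illustration.

```python
from collections import Counter


def build_freqs_sketch(tweets, labels, tokenize=str.split):
    """Count (word, label) pairs; `tokenize` stands in for process_tweet."""
    freqs = Counter()
    for label, tweet in zip(labels, tweets):
        for word in tokenize(tweet.lower()):
            freqs[(word, label)] += 1
    return freqs


tweets = ["happy happy day", "sad day"]  # made-up examples
labels = [1, 0]                          # 1 = positive, 0 = negative
freqs = build_freqs_sketch(tweets, labels)
print(freqs[("happy", 1)])  # 2
print(freqs[("day", 0)])    # 1
print(freqs[("happy", 0)])  # 0, since Counter returns 0 for unseen pairs
```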

ChangeLog

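2021/1/28 17:12:10 Spent two hours tinkering; stopping here for now. Getting stuck on a programming problem feels a lot like getting stuck in a (rather punishing) game: at that point it is best to make yourself stop and go do something else...
2021/2/1 16:28:38 Spent two hours finishing the remaining content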
