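Contents
- Chapter 1: Text Preprocessing (Preprocess)
  - 1.1 The NLTK Natural Language Processing Library
    - 1.1.1 NLTK's Built-in Corpora
- Chapter 2: Tokenization (Tokenize)
  - 2.1 Tokenizing with NLTK
  - 2.2 Tokenizing with jieba
  - 2.3 Tokenizing with Regular Expressions
  - 2.4 Word-Form Processing
    - 2.4.1 Inflection: Stemming
    - 2.4.2 Derivation: Lemmatization
  - 2.5 Handling Stop Words
- Chapter 3: Natural Language Processing (Make Feature)
  - 3.1 Sentiment Analysis
    - 3.1.1 Sentiment Dictionary (keyword scoring)
    - 3.1.2 Machine Learning (naive Bayes)
  - 3.2 Text Similarity
    - 3.2.1 Text Feature Frequency
  - 3.3 Text Classification
    - 3.3.1 Term Frequency
    - 3.3.2 Inverse Document Frequency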

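The Basic Text-Processing Workflow

Step 1: Text preprocessing (Preprocess)
Step 2: Tokenization (Tokenize)
Step 3: Generate the corresponding feature vectors (Make Feature)
Step 4: Feed the features to a learner (Machine Learning)

[Figure: text-processing workflow]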

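Chapter 1: Text Preprocessing (Preprocess)

1.1 The NLTK Natural Language Processing Library

1.1.1 NLTK's Built-in Corpora

The following code loads the Brown University corpus (the Brown corpus) through nltk.corpus. The corpus has to be downloaded once with nltk.download('brown'):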

from nltk.corpus import brown
brown.categories()

Output:
['adventure', 'belles_lettres', 'editorial',
'fiction','government', 'hobbies','humor',
'learned', 'lore', 'mystery',
'news', 'religion', 'reviews',
'romance', 'science_fiction']

Show the number of sentences in the corpus:

len(brown.sents())

Output: 57340

Show the number of words in the corpus:

len(brown.words())

Output: 1161192

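Chapter 2: Tokenization (Tokenize)

Tokenization splits a complete passage into words. There are two broad approaches:

Heuristic (dictionary-based)
Machine learning / statistical methods: HMM, CRF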
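
As a rough illustration of the heuristic (dictionary-based) approach, here is a minimal sketch of forward maximum matching, one common dictionary heuristic. The function name forward_max_match, the max_len parameter, and the toy dictionary are assumptions made for this example; they are not part of NLTK or jieba.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"我", "来到", "北京", "清华", "清华大学", "大学"}
print(forward_max_match("我来到北京清华大学", toy_dictionary))
# ['我', '来到', '北京', '清华大学']
```

Because the match is greedy, the result depends entirely on the dictionary: a word that is missing from it falls apart into single characters, which is one reason statistical methods are used as well.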

2.1 Tokenizing with NLTK

import nltk
sentence = "hello world"
tokens = nltk.wordpunct_tokenize(sentence)
tokens

Output: ['hello', 'world']

2.2 Tokenizing with jieba

import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)  # full mode: emit every possible word
print("Full Mode:", "/".join(seg_list))
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)  # accurate mode
print("Default Mode:", "/".join(seg_list))
seg_list = jieba.cut("他来到网易杭研大厦")  # accurate mode is the default; new words are still recognized
print(",".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search-engine mode
print(",".join(seg_list))

Output:
Full Mode: 我/来到/北京/清华/清华大学/华大/大学
Default Mode: 我/来到/北京/清华大学
他,来到,网易,杭研,大厦
小明,硕士,毕业,于,中国,科学,学院,科学院,中国科学院,计算,计算所,，,后,在,日本,京都,大学,日本京都大

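2.3 Regular Expressions

Text on blogs and social networks is often messy, ungrammatical, and irregular: it is full of emoticons, @mentions, email addresses, URLs, and so on. Regular expressions are the usual way to recognize these tokens.

Tokenizing without regular expressions

Without regular expressions, emoticons, hashtags, and URLs are split into pieces: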

from nltk.tokenize import word_tokenize
tweet = "RT @angelababy: love you baby! :D http://ah.love #168cm"
print(word_tokenize(tweet))

Output: ['RT', '@', 'angelababy', ':', 'love', 'you', 'baby', '!', ':', 'D', 'http', ':', '//ah.love', '#', '168cm']

Tokenizing with regular expressions

Define the regular expressions:

emoticons_str: the pattern for emoticons
regex_str: the patterns for special strings

import re
emoticons_str = r"""
    (?:
        [:=;] # eyes
        [oO\-]? # nose
        [D\)\]\(\]/\\OpP] # mouth
    )"""
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words containing - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

Match and tokenize

# re.VERBOSE lets you write comments inside the pattern; the regex engine ignores them
# re.IGNORECASE makes matching case-insensitive
tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticons_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)
# Find every substring that matches one of the patterns
def tokenize(s):
    return tokens_re.findall(s)
# Optionally lowercase the tokens (emoticons are left untouched)
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticons_re.search(token) else token.lower() for token in tokens]
    return tokens
tweet = "RT @angelababy: love you baby! :D http://ah.love #168cm"
print(preprocess(tweet))

Output: ['RT', '@angelababy', ':', 'love', 'you', 'baby', '!', ':D', 'http://ah.love', '#168cm']

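2.4 Word-Form Processing

Inflection: walk => walking => walked (all verbs; the part of speech does not change)
- Stemming: strip affixes that do not affect the part of speech
  - walking, remove "ing" => walk
  - walked, remove "ed" => walk
Derivation: nation (n.) => national (adj.) => nationalize (v.)
- Lemmatization: map the different forms of a word to one canonical form
  - went => go
  - are => be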
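
To make the idea of stemming concrete, the following is a deliberately naive suffix stripper. It is only a sketch: the name naive_stem, the suffix list, and the minimum-length rule are assumptions for illustration, and real stemmers apply many more rules.

```python
def naive_stem(word, suffixes=("ing", "ed")):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    for suffix in suffixes:
        # The length check leaves short words such as "red" alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["walking", "walked", "walk", "red"]:
    print(w, "=>", naive_stem(w))
# walking => walk
# walked => walk
# walk => walk
# red => red
```

A rule this simple cannot map "went" to "go"; irregular forms need a vocabulary lookup, which is what lemmatization provides.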

2.4.1 Stemming

Stemming handles inflection:

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('multiply'))
print(porter_stemmer.stem('provision'))

Output: multipli, provis

from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
print(snowball_stemmer.stem("maximum"))
print(snowball_stemmer.stem("presumably"))

Output: maximum, presum

2.4.2 Lemmatization

Lemmatization handles derivation:

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize('knives')
# wordnet_lemmatizer.lemmatize('abaci')

Output: 'knife'

Note the following case:

# With no part of speech given, the word is treated as a noun; no such noun is found, so the word itself is returned
wordnet_lemmatizer.lemmatize('are')
wordnet_lemmatizer.lemmatize('is')

Output: 'is'

wordnet_lemmatizer.lemmatize('is', pos='v') # treat the word as a verb

Output: 'be'

NLTK can also tag each word with its part of speech:

import nltk
text = nltk.wordpunct_tokenize('what does the fox say')
nltk.pos_tag(text)

Output:
[('what', 'WDT'),
('does', 'VBZ'),
('the', 'DT'),
('fox', 'NNS'),
('say', 'VBP')]
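
The tags above follow the Penn Treebank convention, while the pos argument of the lemmatizer takes WordNet codes ('n', 'v', 'a', 'r'). One way to connect the two is a small mapping function. This is a coarse sketch; the helper name penn_to_wordnet_pos is an assumption for this example and is not an NLTK function.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # everything else is treated as a noun


# The tagged sentence from the output above.
tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
for word, tag in tagged:
    print(word, penn_to_wordnet_pos(tag))
```

The returned code can then be passed along with the word, for example wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag)), so that verbs are no longer lemmatized as nouns.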

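2.5 Handling Stop Words

Stop words are words that contribute little to the meaning of a text, such as "的", "地", "得", "它", "她" and "他" in Chinese, or "the" in English.

from nltk.corpus import stopwords
# First tokenize the text to get a word_list
# ...
# Then filter out the stop words (build the set once so each lookup is fast)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in word_list if word not in stop_words]

Summary of text preprocessing and tokenization

[Figure: text preprocessing pipeline]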

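Chapter 3: Generating Feature Vectors (Natural Language Processing)

To turn language that humans understand into something a computer can work with, words are converted into feature vectors.

[Figure: natural language processing]

Natural language processing typically covers the following three applications:

Sentiment analysis
Text similarity
Text classification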

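3.1 Sentiment Analysis

3.1.1 Sentiment Dictionary (keyword scoring)

like: 1 point
good: 2 points
bad: -2 points
terrible: -3 points
Scoring scheme: AFINN-111

sentiment_dictionary = {}
with open('data/AFINN-111.txt') as f:
    for line in f:
        word, score = line.split('\t')
        sentiment_dictionary[word] = int(score)
# The score table is now stored in a dict.
# Walk through the sentence and add up the score of each word.
# `words` is the tokenized sentence; a word missing from the dict scores 0.
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)
# The result is the sentiment score of the sentence.

Drawbacks: the approach is too simple. It has no way to handle new words or special vocabulary that the dictionary does not contain.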
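
The scoring step can be tried without the AFINN file. The sketch below uses the four example scores listed above as an in-memory dictionary; the function name sentiment_score and the whitespace tokenization are assumptions made for this example.

```python
# Example scores taken from the list above, not from the real AFINN-111 file.
sentiment_dictionary = {"like": 1, "good": 2, "bad": -2, "terrible": -3}


def sentiment_score(sentence, dictionary):
    """Sum the scores of all words in the sentence; unknown words count as 0."""
    words = sentence.lower().split()
    return sum(dictionary.get(word, 0) for word in words)


print(sentiment_score("I like this good book", sentiment_dictionary))    # 3
print(sentiment_score("what a terrible bad day", sentiment_dictionary))  # -5
print(sentiment_score("this is not good", sentiment_dictionary))         # 2
```

The last line shows the weakness of keyword scoring: "not good" still receives a positive score because each word is scored in isolation.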

3.1.2 Combining with Machine Learning

from nltk.classify import NaiveBayesClassifier
# Make up a small training set
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'
def preprocess(s):
    return {word: True for word in s.lower().split()}
# Put the training set into the standard (features, label) form
training_data = [[preprocess(s1), 'pos'],
[preprocess(s2), 'pos'],
[preprocess(s3), 'neg'],
[preprocess(s4), 'neg']]
# Train a naive Bayes classifier
model = NaiveBayesClassifier.train(training_data)
# Print the predicted label
print(model.classify(preprocess('this is a good book')))

3.2 Text Similarity

3.2.1 Text Feature Frequency

Step 1: Count word frequencies (one row per sentence, one column per word)

we	you	he	work	happy	are
1	0	3	0	1	1
1	0	2	0	1	1
0	1	0	1	0	0

Step 2: Compute the similarity of the vectors (cosine similarity)

$\cos (\theta)=\frac{A \cdot B}{\|A\|\|B\|}$
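
As a worked example of the formula, the sketch below computes the cosine similarity between the rows of the frequency table from Step 1. The function name cosine_similarity and the variable names row1 to row3 are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


# Rows of the table above; columns are we, you, he, work, happy, are.
row1 = [1, 0, 3, 0, 1, 1]
row2 = [1, 0, 2, 0, 1, 1]
row3 = [0, 1, 0, 1, 0, 0]

print(round(cosine_similarity(row1, row2), 4))  # 0.982
print(cosine_similarity(row1, row3))            # 0.0
```

The first two rows use nearly the same words and score close to 1, while the third row shares no words with the first and scores 0.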

# Frequency counting
import nltk
from nltk import FreqDist
corpus = 'this is my sentence ''this is my life ''this is the day'
tokens = nltk.word_tokenize(corpus)
print(tokens)

Output: ['this', 'is', 'my', 'sentence', 'this', 'is', 'my', 'life', 'this', 'is', 'the', 'day']

fdist = FreqDist(tokens)
print(fdist['is'])

Output: 3

# Take the 50 most common words
standard_freq_vector = fdist.most_common(50)
size = len(standard_freq_vector)
print(standard_freq_vector)

Output: [('this', 3), ('is', 3), ('my', 2), ('sentence', 1), ('life', 1), ('the', 1), ('day', 1)]

Record each word's position in this descending order of frequency:

def position_lookup(v):
    res={}
    counter = 0
    for word in v:
        res[word[0]] = counter
        counter += 1
    return res
standard_freq_dict = position_lookup(standard_freq_vector)
print(standard_freq_dict)

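Output: {'this': 0, 'is': 1, 'my': 2, 'sentence': 3, 'life': 4, 'the': 5, 'day': 6}

A new sentence is then turned into a frequency vector over these positions; words that are not in the table are skipped: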

sentence = "this is cool"
freq_vector = [0]*size
tokens = nltk.word_tokenize(sentence)
for word in tokens:
    try:
        freq_vector[standard_freq_dict[word]] += 1
    except KeyError:
        continue
print(freq_vector)

Output: [1, 1, 0, 0, 0, 0, 0]

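3.3 Text Classification

3.3.1 TF: Term Frequency, which measures how often a term appears in a document

$TF\left( t \right) =\frac{\text{number of times } t \text{ appears in the document}}{\text{total number of terms in the document}}$

3.3.2 IDF: Inverse Document Frequency, which measures how important a term is

$IDF\left( t \right) =\log \left( \frac{\text{total number of documents}}{\text{number of documents containing } t} \right)$

from nltk.text import TextCollection
# TextCollection works on tokenized texts, so split each sentence into words first
corpus = TextCollection(['this is sentence one'.split(),
                        'this is sentence two'.split(),
                        'this is sentence three'.split()])
# TF-IDF of the term 'this' in a new, tokenized sentence
print(corpus.tf_idf('this', 'this is sentence four'.split()))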
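
The two formulas can also be written out directly. The following is a minimal sketch using the natural logarithm; the function names tf, idf, and tf_idf and the handling of unseen terms are assumptions for this example, not the NLTK implementation.

```python
import math


def tf(term, document):
    """Occurrences of the term divided by the number of terms in the document."""
    return document.count(term) / len(document)


def idf(term, documents):
    """log(total number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term seen in no document gets weight 0
    return math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)


documents = [s.split() for s in ['this is sentence one',
                                 'this is sentence two',
                                 'this is sentence three']]

print(tf_idf('this', documents[0], documents))           # 0.0
print(round(tf_idf('one', documents[0], documents), 4))  # 0.2747
```

The term 'this' appears in every document, so its IDF is log(1) = 0 and it carries no weight, whereas 'one' appears in a single document and scores higher.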

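Machine Learning

Once the text has been converted into feature vectors, any machine learning or deep learning algorithm can be trained on them.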
