NLP Preprocessing Summary

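When working on NLP tasks (text classification, clustering, intelligent customer service, and so on), the first job is to preprocess the text data. Drawing on my own practical experience, I have summarized several preprocessing methods here.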

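Removing useless symbols

Text may contain runs of consecutive symbols (for example, exclamation marks such as "!!!") or some odd-looking words. We split the text on these symbols and then reassemble it.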

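import re

# Whitespace and common punctuation; a run of them counts as one separator.
SYMBOLS = re.compile(r'[\s;",.!?/\[\]]+')

def tokenizer(ori_list):
    new_list = []
    for q in ori_list:
        # Lowercase, split on separator runs, and drop the empty strings
        # that split() yields when the text starts or ends with a separator.
        words = [w for w in SYMBOLS.split(q.lower().strip()) if w]
        new_list.append(' '.join(words))
    return new_list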
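
One detail worth checking when splitting on a separator pattern: Python's `re.split` returns an empty string when the text starts or ends with a separator, so joining the pieces directly can leave a stray space. The sketch below uses the same character class as `tokenizer`; the helper name `split_tokens` and the sample sentence are made up for illustration.

```python
import re

# Same separator class as in tokenizer: whitespace and common punctuation.
SYMBOLS = re.compile(r'[\s;",.!?/\[\]]+')

def split_tokens(text):
    """Split on separator runs and drop empty pieces (hypothetical helper)."""
    return [w for w in SYMBOLS.split(text.lower().strip()) if w]

sample = "Wow!!! Is this [really] working?"
raw_pieces = SYMBOLS.split(sample.lower().strip())  # ends with '' because of the final '?'
tokens = split_tokens(sample)
print(raw_pieces)
print(tokens)
```

Filtering out the empty pieces before joining keeps the reassembled sentence free of leading or trailing spaces.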

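Stop-word filtering

Many open-source stop-word lists are available online, and you can also build a domain-specific stop-word list for your own business (or simply use the one that ships with NLTK).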

from nltk.corpus import stopwords  # needs the NLTK 'stopwords' corpus to be downloaded

def removeStopWord(ori_list):
    new_list = []
    # NLTK's stop words include "what" and similar words, but in QA problems
    # these are keywords, so we do not treat them as stop words.
    restored = {'what', 'when', 'which', 'how', 'who', 'where'}
    # A set gives fast membership tests; subtracting restores the question words.
    english_stop_words = set(stopwords.words('english')) - restored
    for q in ori_list:
        sentence = ' '.join(w for w in q.split() if w not in english_stop_words)
        new_list.append(sentence)
    return new_list
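
For a domain-specific list, the same filter works with any Python set and does not need NLTK. This is a minimal sketch, assuming a hand-written stop-word set: the words in `DOMAIN_STOP_WORDS`, the function name `remove_domain_stop_words`, and the sample sentence are illustrative and not taken from a real list.

```python
# Hypothetical domain stop words for a customer-service corpus (illustrative only).
DOMAIN_STOP_WORDS = {'please', 'thanks', 'hello', 'hi'}
# Question words are kept on purpose, as in removeStopWord.
KEEP = {'what', 'when', 'which', 'how', 'who', 'where'}

def remove_domain_stop_words(sentences, stop_words=DOMAIN_STOP_WORDS, keep=KEEP):
    """Drop stop words from each sentence, never dropping the words in `keep`."""
    effective = set(stop_words) - set(keep)
    return [' '.join(w for w in s.split() if w not in effective) for s in sentences]

cleaned = remove_domain_stop_words(['hello how do i reset my password please'])
print(cleaned)
```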

Removing low-frequency words

We remove low-frequency words by setting a threshold against the vocabulary, for example dropping words that occur fewer than 10 or 20 times.

def removeLowFrequence(ori_list, vocabulary, thres=10):
    # Filter by the vocabulary (a mapping from word to count): words that
    # occur fewer than thres times, or are missing from it, are dropped.
    new_list = []
    for q in ori_list:
        sentence = ' '.join(w for w in q.split() if vocabulary.get(w, 0) >= thres)
        new_list.append(sentence)
    return new_list
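
The function above takes a `vocabulary` argument that maps each word to its count, but the post does not show how to build it. One possible way, sketched here with `collections.Counter`: the helper names `build_vocabulary` and `remove_low_frequency` and the tiny corpus are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Count how often each whitespace-separated token occurs (hypothetical helper)."""
    counts = Counter()
    for s in sentences:
        counts.update(s.split())
    return counts

def remove_low_frequency(sentences, vocabulary, thres=10):
    """Keep only tokens whose count in `vocabulary` is at least `thres`."""
    return [' '.join(w for w in s.split() if vocabulary.get(w, 0) >= thres)
            for s in sentences]

corpus = ['the cat sat', 'the cat ran', 'the dog barked']
vocab = build_vocabulary(corpus)
filtered = remove_low_frequency(corpus, vocab, thres=2)
print(vocab['the'], vocab['cat'], vocab['dog'])
print(filtered)
```

In practice the counts would normally come from the training corpus, after symbol removal and stop-word filtering, so that the threshold applies to the cleaned tokens.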

Handling numbers

After tokenization, some tokens may be plain numbers such as 44 or 415. We treat all of these numbers as one and the same word, which we can define as "#number".

import re

def replaceDigits(ori_list, replace='#number'):
    # Replace every run of digits with `replace` (default: #number).
    # Note: digit runs inside longer tokens are replaced as well.
    DIGITS = re.compile(r'\d+')
    new_list = []
    for q in ori_list:
        q = DIGITS.sub(replace, q)
        new_list.append(q)
    return new_list
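
The pattern `\d+` also rewrites digit runs that sit inside longer tokens. If only tokens made up entirely of digits should become "#number", one option is to check each token with `fullmatch`. This is a sketch of that variant, not the approach used above; the function name `replace_number_tokens` and the sample sentence are hypothetical.

```python
import re

DIGITS = re.compile(r'\d+')

def replace_number_tokens(sentences, replace='#number'):
    """Replace only tokens that consist entirely of digits (hypothetical variant)."""
    return [' '.join(replace if DIGITS.fullmatch(w) else w for w in s.split())
            for s in sentences]

sample = ['room 415 costs 44 dollars on b2b sites']
everywhere = [DIGITS.sub('#number', s) for s in sample]  # behaviour of plain \d+ substitution
whole_tokens = replace_number_tokens(sample)             # leaves "b2b" untouched
print(everywhere)
print(whole_tokens)
```

Which behaviour is preferable depends on the data: product codes or mixed tokens such as "b2b" may carry meaning that the blanket substitution would erase.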

About me

dreampai (same name on my WeChat official account, Jianshu, and Zhihu), focused on NLP and finance.
