Text Cleaning + Python + Regular Expressions + Word-Frequency Counting

Text cleaning, exporting to a file

import re

# Clean English text
def clean_en_text(text):
    # Keep only English letters, digits, and spaces.
    # Inside [...], only a leading ^ negates; any later ^ is a literal caret.
    comp = re.compile('[^A-Za-z0-9 ]')
    return comp.sub(' ', text)

# Clean Chinese text
def clean_zh_text(text):
    # Keep only English letters, digits, and Chinese characters (U+4E00 to U+9FA5)
    comp = re.compile('[^A-Za-z0-9\u4e00-\u9fa5]')
    return comp.sub(' ', text)

def file_en_clean(r_file_ad, w_file_ad):
    # Read the input file and clean it line by line
    with open(r_file_ad, 'rt') as f:
        print('Reading file:', f.name)
        output = [clean_en_text(line) for line in f]
    # Write the cleaned lines to the output file
    with open(w_file_ad, 'w') as f:
        print('Writing file:', f.name)
        for o in output:
            f.write(o)
            f.write('\n')

if __name__ == '__main__':
    # This script and the two .txt files are in the same directory
    file_en_clean('./e2.txt', './new_test.txt')
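
A detail of the cleaning patterns that is easy to get wrong: inside a character class, `^` negates only when it is the first character, and any later `^` is a literal caret. The sketch below compares two patterns on a made-up sample string (the variable names and sample text are illustrative assumptions, not part of the script above).

```python
import re

# Extra carets inside the class: only the first ^ negates, the later ones
# are literal, so '^' itself becomes one of the characters that is kept.
with_extra_carets = re.compile('[^A-Z^a-z^0-9^ ]')
# A single leading caret: keeps letters, digits, and spaces only.
single_caret = re.compile('[^A-Za-z0-9 ]')
# Same idea for Chinese text: letters, digits, and U+4E00 to U+9FA5.
zh_pattern = re.compile('[^A-Za-z0-9\u4e00-\u9fa5]')

sample = 'x^2 + y^2 = 10!'
kept_carets = with_extra_carets.sub(' ', sample)
clean = single_caret.sub(' ', sample)
zh_clean = zh_pattern.sub(' ', '你好, world!')

print(repr(kept_carets))  # the carets survive
print(repr(clean))        # the carets are replaced by spaces
print(repr(zh_clean))
```

Because every removed character is replaced by a space rather than deleted, words on either side of punctuation stay separated, which matters if the text is later split on whitespace.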

Adding word-frequency counting, exporting to a file

import re

# Clean English text
def clean_en_text(text):
    # Keep only English letters, digits, and spaces.
    # Inside [...], only a leading ^ negates; any later ^ is a literal caret.
    comp = re.compile('[^A-Za-z0-9 ]')
    return comp.sub(' ', text)

# Clean Chinese text
def clean_zh_text(text):
    # Keep only English letters, digits, and Chinese characters (U+4E00 to U+9FA5)
    comp = re.compile('[^A-Za-z0-9\u4e00-\u9fa5]')
    return comp.sub(' ', text)

def dealed_list(filename):
    # Read a file and return its lines with unwanted characters removed
    with open(filename, 'rt') as f:
        print('Reading file:', f.name)
        output = [clean_en_text(line) for line in f]
    return output

def readlist(dealed_list):
    wordsL = []  # list that collects every word
    for line in dealed_list:
        # lowercase the line, then split it on whitespace
        wordsL.extend(line.lower().split())
    return wordsL

# Count the frequency of every word and store it in a dictionary,
# then sort the entries by count in descending order
def count(wordsL):
    wordsD = {}
    for x in wordsL:
        # skip the words that we don't need
        if Judge(x):
            continue
        # count: start at 1 on first sight, increment afterwards
        if x not in wordsD:
            wordsD[x] = 1
        else:
            wordsD[x] += 1
    # sort the (word, count) pairs by count, largest first
    wordsInorder = sorted(wordsD.items(), key=lambda x: x[1], reverse=True)
    return wordsInorder

# Judge whether the word should be removed, such as punctuation or a single letter.
# You can modify this function to remove more words, such as numbers.
def Judge(word):
    punctList = [' ', '\t', '\n', ',', '.', ':', '?']  # punctuation and whitespace
    letterList = ['a', 'b', 'c', 'd', 'm', 'n', 'x', 'p', 't']  # single letters
    return word in punctList or word in letterList

if __name__ == '__main__':
    for x in range(1, 6):
        filename = 'e' + str(x) + '.txt'
        # remove unwanted characters
        L = dealed_list(filename)
        wordsL = readlist(L)
        words = count(wordsL)
        # note: the ./results directory must already exist
        with open('./results/words_e' + str(x) + '.txt', 'w') as fw:
            for item in words:
                fw.write(item[0] + '\t' + str(item[1]) + '\n')
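
For comparison, the counting and sorting steps can also be written with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the approach the script above takes: the names `word_frequencies`, `SKIP`, and `sample_lines` are hypothetical, and the sample text is made up so the example runs without any input files.

```python
from collections import Counter
import re

# Hypothetical skip list for illustration; it mirrors the single letters
# filtered out by the script above.
SKIP = {'a', 'b', 'c', 'd', 'm', 'n', 'x', 'p', 't'}

def word_frequencies(lines):
    """Return (word, count) pairs sorted by count, largest first."""
    counter = Counter()
    for line in lines:
        # replace everything except letters, digits, and spaces with a space
        cleaned = re.sub('[^A-Za-z0-9 ]', ' ', line)
        # lowercase, split on whitespace, and drop the skipped words
        counter.update(w for w in cleaned.lower().split() if w not in SKIP)
    return counter.most_common()

sample_lines = ['The cat sat.', 'The dog sat, the end!']
freqs = word_frequencies(sample_lines)
for word, n in freqs:
    print(word + '\t' + str(n))
```

`Counter` starts every missing key at zero, so the first occurrence of a word is counted exactly once, and `most_common()` returns the pairs already sorted by count.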

Reference posts:
Word-frequency analysis of English papers with Python
Text preprocessing with Python regular expressions: removing special symbols
