2022-04-04 Counting the Chinese characters in Romance of the Three Kingdoms

Code source

From Song Tian's MOOC course at Beijing Institute of Technology: find the 15 characters who appear most often in Romance of the Three Kingdoms.

#CalThreeKingdomsV2.py
import jieba
excludes = {"將軍","卻說","荊州","二人","不可","不能","如此","商議","如何","軍士"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words  = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "云長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError if the word never appeared
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Approach

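Segment the text with the jieba library. Punctuation and whitespace mostly come out as tokens of their own, and the len(word) == 1 check skips single-character tokens, so they do not need to be stripped separately.
Merge the different names for one person into a single key.
Record how often each word appears in a dictionary of the form {word: count}.
Delete words that are obviously not names from the counts dictionary. (This takes a few rounds: run the code, add the offenders to the excludes set, run it again, and keep extending the set.)
Turn the counts dictionary into key-value pairs with .items(), then convert that into a list with list(). See below: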

>>> a={"2d":"哈",23:"s"}
>>> print(a)
{'2d': '哈', 23: 's'}
>>> c=a.items()
>>> print(c)
dict_items([('2d', '哈'), (23, 's')])
>>> list(c)
[('2d', '哈'), (23, 's')]
>>> print(c)
dict_items([('2d', '哈'), (23, 's')])
>>> print(list(c))
[('2d', '哈'), (23, 's')]
>>> 

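Sort with the list's .sort method. The sort key is an anonymous lambda function. I am still not entirely clear on this part, but it uses x[1], the second element (the count) of each (word, count) tuple, as the sort key. reverse=True orders from largest to smallest, and the sort happens in place on the items list.
A for loop runs 15 times and prints the top 15.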
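
The steps above can be sketched end to end on a tiny, already-segmented word list, so the example runs without jieba or the novel's text. The names `sample_words`, `aliases`, and `top_n` are made up for illustration and are not from the course code. The `key=lambda pair: pair[1]` argument tells `sort` to compare the (word, count) tuples by their second element, the count.

```python
# Hypothetical, pre-segmented input standing in for jieba.lcut(txt).
sample_words = ["孔明", "曰", "諸葛亮", "玄德", "劉備", "將軍", "孔明", "玄德曰", "曹操"]

# Map alternative names onto one canonical name (illustrative subset only).
aliases = {"諸葛亮": "孔明", "玄德": "劉備", "玄德曰": "劉備"}
excludes = {"將軍", "卻說"}

counts = {}
for word in sample_words:
    if len(word) == 1:          # skip single-character tokens
        continue
    name = aliases.get(word, word)
    counts[name] = counts.get(name, 0) + 1

for word in excludes:
    counts.pop(word, None)      # no error if the word never appeared


def top_n(counts, n):
    """Return the n most frequent (word, count) pairs, highest count first."""
    items = list(counts.items())
    items.sort(key=lambda pair: pair[1], reverse=True)
    return items[:n]


for word, count in top_n(counts, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Keeping the aliases in a dictionary rather than an if/elif chain means a new alias is one more entry instead of one more branch.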

Also

The instructor also covered ranking the words in Shakespeare's Hamlet by how often they appear.

#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")   # replace special characters in the text with spaces
    return txt

hamletTxt = getText()
words  = hamletTxt.split()
counts = {}
for word in words:          
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 
for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))
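
The same count-and-sort pattern is also available in the standard library as `collections.Counter`, whose `most_common(n)` method returns the n highest counts. The following is a minimal sketch on an inline sample string rather than `hamlet.txt`; `sample_text` and `word_counts` are names invented for the example, and `string.punctuation` covers ASCII punctuation only, which is slightly more than the hand-typed list above.

```python
from collections import Counter
import string

sample_text = "To be, or not to be: that is the question."


def word_counts(text):
    """Lowercase the text, blank out ASCII punctuation, and count the words."""
    text = text.lower()
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    return Counter(text.split())


counts = word_counts(sample_text)
for word, count in counts.most_common(2):
    print("{0:<10}{1:>5}".format(word, count))
```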

So

Can the same idea count the characters in Hamlet and in Romance of the Three Kingdoms, excluding punctuation and spaces?

def getText():
    txt = open("hamlet.txt", "r").read()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.strip(ch)   # intended to remove special characters
    return txt
l=getText()
count=len(l)
print("Total number of letters in the text: {}".format(count))

For Hamlet, the instructor handled it in much the same way, so I will leave that aside for now.
But...

#CalThreeKingdomsV2.py
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
for ch in '!"#$%&()*+,-。/:;<=>?@[\\]^_‘{|}~':
    txt = txt.strip(ch)   # intended to remove special characters
count=len(txt)
print(count)

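Output: 602415
I added a space to that long string of characters, expecting the count to drop, but it did not budge. Then I deleted a few characters, expecting it to rise, and it still did not budge. Fine, something is wrong.
Try again.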
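
The unchanged count is consistent with how `str.strip` behaves: it only trims matching characters from the two ends of a string and never touches the interior, so looping over punctuation with `strip` leaves almost all of the text intact. Below is a small sketch of the difference on a made-up `sample` string. `remove_chars` is a hypothetical helper built on `str.translate`, which is one way (not the course's way) to delete characters everywhere in a string.

```python
sample = "天下大勢,分久必合,合久必分。"

# strip() only looks at the ends, so the two interior commas survive.
stripped = sample
for ch in ",。":
    stripped = stripped.strip(ch)


def remove_chars(text, chars):
    """Delete every occurrence of each character in chars from text."""
    return text.translate(str.maketrans("", "", chars))


removed = remove_chars(sample, ",。")

print(len(sample), len(stripped), len(removed))
```

Only the trailing `。` is trimmed by `strip`, while `remove_chars` also drops the commas inside the sentence.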

#CalThreeKingdomsV2.py
import re
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
clear='[!"#$%&()*+,-。/:;<=>?@[\\] ^_‘{|}~]'
cleaned = re.sub(clear, "", txt)  # named cleaned to avoid shadowing the built-in str
count = len(cleaned)
print(count)

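Output: 555212
This at least looks plausible. It uses the re library (which I do not understand at all yet), and I also found that you have to be careful about how you call it: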

>>> str=re.sub('[a]',"d",txt)
>>> print(str)
dddd

Otherwise, it fails:

>>> str=re.sub([a],"d",txt)
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    str=re.sub([a],"d",txt)
  File "D:\download\python\lib\re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "D:\download\python\lib\re.py", line 294, in _compile
    return _cache[type(pattern), pattern, flags]
TypeError: unhashable type: 'list'

Double quotes work too:

>>> txt="adda"
>>> str=re.sub("[a]","d",txt)
>>> print(str)
dddd
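
One caveat about the character class in the `re.sub` version above: inside `[...]`, a hyphen between two characters denotes a range. If the pattern is exactly as written, `,-。` matches every code point from `,` through `。`, which includes ASCII digits and letters as well as punctuation. Below is a hedged sketch of one way to avoid that, building the class with `re.escape` so the hyphen is taken literally. The `PUNCT` set and the `count_chars` name are assumptions for illustration, and the set is not claimed to cover all punctuation in the novel.

```python
import re

# Illustrative set of characters to drop; extend it for the real text.
PUNCT = '!"#$%&()*+,-./:;<=>?@[\\]^_{|}~。,、:;?!“” \n'

# re.escape makes "-", "]", "\\" and "^" literal inside the class.
PATTERN = re.compile("[" + re.escape(PUNCT) + "]")


def count_chars(text):
    """Count the characters left after removing everything in PUNCT."""
    return len(PATTERN.sub("", text))


# The unescaped range removes letters and digits too.
risky = re.sub("[,-。]", "", "a1,b2。")

print(repr(risky), count_chars("a1,b2。"))
```

The pattern argument also has to be a string (or a compiled pattern), which is why passing a list raised the TypeError shown earlier.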