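Code source
From Professor Song Tian's MOOC at Beijing Institute of Technology: count the 15 characters who appear most often in Romance of the Three Kingdoms.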
#CalThreeKingdomsV2.py
import jieba

# Frequent words that are not character names
excludes = {"將軍","卻說","荊州","二人","不可","不能","如此","商議","如何","軍士"}
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single-character tokens
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "云長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    counts.pop(word, None)  # no KeyError if the word never occurred
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
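As a side note, the if/elif chain above could also be written as a lookup table. The following is only a sketch of that idea, not the course's code: the names ALIASES and count_names are made up for illustration, and a short hand-written token list stands in for the output of jieba.lcut so the snippet runs without the novel's text file.

```python
from collections import Counter

# Hypothetical alias table; the entries mirror the if/elif chain above.
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "云長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}

def count_names(words, aliases=ALIASES, excludes=()):
    """Count multi-character tokens, merging aliases and dropping excludes."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue  # single characters are never counted
        counts[aliases.get(word, word)] += 1
    for word in excludes:
        counts.pop(word, None)  # ignore excludes that never occurred
    return counts

# Hand-written tokens standing in for jieba.lcut(txt)
sample = ["玄德", "曰", "孔明", "諸葛亮", "將軍", "玄德曰", "曹操", "丞相", "孔明曰"]
result = count_names(sample, excludes={"將軍", "卻說"})
for word, count in result.most_common(3):
    print("{0:<10}{1:>5}".format(word, count))
```

Adding a new alias then means adding one dictionary entry instead of another elif branch.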
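Approach
Segment the text with the jieba library. There is no need to strip punctuation and spaces by hand: they mostly come out as single-character tokens, which the len(word) == 1 check skips.
Map the different names for the same person to one canonical name.
Record how often each word occurs in a dictionary of the form {word: count}.
Delete words that are obviously not names from the counts dictionary. This takes several runs of the code: set up an excludes set, run again, extend the set, and so on.
Get the key-value pairs of the counts dictionary with .items(), then convert them to a list with list(). See below: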
>>> a={"2d":"哈",23:"s"}
>>> print(a)
{'2d': '哈', 23: 's'}
>>> c=a.items()
>>> print(c)
dict_items([('2d', '哈'), (23, 's')])
>>> list(c)
[('2d', '哈'), (23, 's')]
>>> print(c)
dict_items([('2d', '哈'), (23, 's')])
>>> print(list(c))
[('2d', '哈'), (23, 's')]
>>>
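Sort with .sort(). The sort key is given as an anonymous lambda function. I'm still not entirely clear on this part, but it uses x[1], the second element of each (word, count) tuple, as the sort key, and reverse=True sorts from largest to smallest. The items list is sorted in place.
Run a for loop 15 times to print the top 15.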
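To make the lambda step less mysterious, here is a small sketch with made-up counts. The helper name by_count is hypothetical; it only shows that the lambda is shorthand for an ordinary function that returns the second element of each tuple.

```python
# Made-up counts, only for illustration.
counts = {"曹操": 3, "孔明": 5, "劉備": 4}

items = list(counts.items())      # [("曹操", 3), ("孔明", 5), ("劉備", 4)]

def by_count(pair):
    """Return the count, i.e. the second element of a (word, count) tuple."""
    return pair[1]

# key=lambda x: x[1] is shorthand for key=by_count.
items.sort(key=lambda x: x[1], reverse=True)           # sorts items in place
same = sorted(counts.items(), key=by_count, reverse=True)  # returns a new list

top_n = items[:2]   # slicing is safe even if there are fewer than 2 items
for word, count in top_n:
    print("{0:<10}{1:>5}".format(word, count))
```

Slicing with items[:n] instead of indexing items[i] in a range loop avoids an IndexError when the text yields fewer than n distinct words.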
Also
The teacher also showed how to rank the words in Shakespeare's Hamlet by frequency.
#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")  # replace special characters in the text with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
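For comparison, the same word count can be sketched with the standard library's collections.Counter and str.translate. This is an alternative I am assuming would fit here, not what the course shows. The function name top_words is made up, a one-line sample replaces hamlet.txt, and string.punctuation is used as the punctuation set, which also removes apostrophes.

```python
import string
from collections import Counter

def top_words(text, n=10):
    """Return the n most common lowercase words in text, ignoring ASCII punctuation."""
    # Map every ASCII punctuation character to a space in one pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)

# Short sample standing in for the contents of hamlet.txt
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```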
So
Can I count the characters in Hamlet and in Romance of the Three Kingdoms, excluding punctuation and spaces?
def getText():
    txt = open("hamlet.txt", "r").read()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.strip(ch)  # intended to remove the special characters from the text
    return txt

l=getText()
count=len(l)
print("Total number of letters in the text: {}".format(count))
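For Hamlet this is modeled on the teacher's code (although it calls strip where the teacher's version calls replace), so I'll leave it alone for now.
But...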
#CalThreeKingdomsV2.py
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
for ch in '!"#$%&()*+,-。/:;<=>?@[\\]^_‘{|}~':
    txt = txt.strip(ch)  # intended to remove the special characters from the text
count=len(txt)
print(count)
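Output: 602415
I added a space to that long string of characters, expecting the count to go down, but it did not budge. Then I deleted a few characters, expecting it to go up, and it still did not budge. OK, something is wrong. The cause is that strip(ch) only removes ch from the two ends of the string and never from the middle, so the loop leaves almost the whole text untouched.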
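A tiny experiment makes the difference between strip and replace visible. The sample string is made up; the point is only that strip trims the ends while replace removes every occurrence.

```python
text = "!hello, world!"

# strip(ch) only trims ch from the two ends of the string.
stripped = text
for ch in "!,":
    stripped = stripped.strip(ch)
print(stripped)        # hello, world

# replace(ch, "") removes every occurrence, wherever it is.
replaced = text
for ch in "!, ":
    replaced = replaced.replace(ch, "")
print(replaced)        # helloworld

print(len(text), len(stripped), len(replaced))   # 14 12 10
```

The comma and the space sit in the middle of the string, so strip never reaches them, which matches the unchanged count above.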
Try again.
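#CalThreeKingdomsV2.py
import re
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
# Character class listing the punctuation (and the space) to delete
clear='[!"#$%&()*+,-。/:;<=>?@[\\] ^_‘{|}~]'
cleaned=re.sub(clear,"",txt)  # delete every character that matches
count=len(cleaned)
print(count)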
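Output: 555212
This looks plausible, at least. It uses the re library, which I don't understand at all yet. One caveat: inside the square brackets, ,-。 is read as a range of characters from , to 。 rather than three separate characters, so the pattern also deletes anything in between, such as digits and Latin letters. Writing the hyphen as \- would avoid that. I also found that you have to be careful how you call re.sub: the pattern must be a quoted string.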
>>> str=re.sub('[a]',"d",txt)
>>> print(str)
dddd
Without the quotes, the first argument is a Python list instead of a pattern string, and the call fails:
>>> str=re.sub([a],"d",txt)
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    str=re.sub([a],"d",txt)
  File "D:\download\python\lib\re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "D:\download\python\lib\re.py", line 294, in _compile
    return _cache[type(pattern), pattern, flags]
TypeError: unhashable type: 'list'
Double quotes work too:
>>> txt="adda"
>>> str=re.sub("[a]","d",txt)
>>> print(str)
dddd
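Coming back to the character count, one thing worth checking in a pattern like the one used earlier is the hyphen inside the square brackets. The sketch below uses a made-up sample string to show that an unescaped hyphen between two characters defines a range, while an escaped one is a literal. The 555212 figure may therefore exclude more than punctuation.

```python
import re

text = "A1,曹。操-b"

# ',-。' inside [...] is a range from ',' (U+002C) to '。' (U+3002),
# which covers ASCII digits and letters as well.
loose = re.sub("[,-。]", "", text)

# Escaping the hyphen makes it a literal: only ',', '-' and '。' match.
strict = re.sub(r"[,\-。]", "", text)

print(loose)    # 曹操
print(strict)   # A1曹操b
```

The Chinese characters survive in both cases because their code points lie above U+3002, outside the accidental range.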