Counting word frequency on Linux

References:
https://blog.csdn.net/herecles/article/details/8152054
https://www.cnblogs.com/standby/p/8309994.html

The sample text is as follows:

cat words.txt
The Zen of Python, by Tim Peters
 
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

1. Counting word frequency with AWK

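 # Count every whitespace-separated field; "valid" counts fields that contain a word character.
 # Note: \w is a GNU awk (gawk) extension.
 awk '{ for (i = 1; i <= NF; i++) { if ($i ~ /\w/) valid++; count[$i]++ } }
      END { print "valid words:" valid "\n"; for (j in count) print j, count[j] }' words.txt
 # The if was added to pick out tokens with "word" characters, but the result is not ideal.
 # In END, a for loop prints the contents of the count hash.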

valid words:143    # Perl clearly reports a different figure for the same text (see section 2); I do not know what went wrong

-- 1
 19
 hard 1
 1unts.
one 2
only 1
is 10
it 1
If 2
 1nse.
special 1
aren't 1
are 1
ambiguity, 1
honking 1
Readability 1
way 2
of 3
In 1
 1w.
easy 1
one-- 1
than 8
Special 1
*right* 1
refuse 1
preferably 1
that 1
be 3
Errors 1
Sparse 1
Complex 1
explain, 2
 1ver.
 1tch.
 1rity.
bad 1
you're 1
Beautiful 1
There 1
 1sted.
do 2
Unless 1
by 1
cases 1
better 8
Now 1
Explicit 1
face 1
often 1
unless 1
not 1
more 1
a 2
 1ters
implementation 2
Tim 1
obvious 1
Although 3
let's 1
 1.
 1lently.
practicality 1
Namespaces 1
should 2
 1mplex.
those! 1
great 1
 2ea.
it's 1
Simple 1
 1les.
enough 1
idea 1
explicitly 1
 1lenced.
pass 1
Zen 1
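
Entries such as ` 1unts.` and ` 1ters` in the listing above are what a terminal typically shows when a key ends in a carriage return: `counts.\r 1` is drawn as `counts.`, then the cursor jumps back to the first column and ` 1` overwrites the first two characters. That suggests, although the source does not confirm it, that words.txt has Windows-style CRLF line endings. The sketch below works under that assumption: it strips the trailing carriage return and, as a further assumption about what should count as a word, lowercases each token and trims punctuation from both ends before counting. The function name `wordfreq_clean` is made up for illustration.

```shell
# Hypothetical helper: count words after normalising each token.
# Assumption: the input may use CRLF line endings.
wordfreq_clean() {
    awk '
    {
        sub(/\r$/, "")                      # drop a trailing carriage return, if any
        for (i = 1; i <= NF; i++) {
            w = tolower($i)                 # treat "Special" and "special" as one word
            sub(/^[^a-z0-9]+/, "", w)       # trim leading punctuation
            sub(/[^a-z0-9]+$/, "", w)       # trim trailing punctuation
            if (w != "") count[w]++         # tokens such as "--" become empty and are skipped
        }
    }
    END { for (w in count) print w, count[w] }
    ' "$@" | LC_ALL=C sort
}
```

It can be called as `wordfreq_clean words.txt` or fed from a pipe. Apostrophes inside a token are kept, so `aren't` stays a single word.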

2. Counting word frequency with Perl

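Perl seems to handle this task better, but one point took me a long time to work out. The command uses two Perl statements with different jobs, and they cannot both run inside the same loop. The first runs under -alne (the -n switch wraps it in the equivalent of while (<>)) and iterates over the words in words.txt; that loop has to finish first. The second does not need the per-line loop, since it only prints the count hash with foreach, so it has to be placed in an END block.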

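 # First -e: runs once per line, counting tokens and skipping any that contain a non-word character.
 # Second -e: the END block runs once after the last line and prints the sorted counts.
 perl -alne 'foreach (split) { $total++; next if /\W/; $valid++; $count{$_}++ }' \
      -e 'END { print "total:$total words,valid:$valid words\n";
                foreach $word (sort keys %count) { print " $word ==> $count{$word}\n" } }' words.txt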

total:144 words,valid:113 words

 Although ==> 3

 Beautiful ==> 1

 Complex ==> 1

 Errors ==> 1

 Explicit ==> 1

 Flat ==> 1

 If ==> 2

  There ==> 1

 Tim ==> 1

 Unless ==> 1

 Zen ==> 1

 a ==> 2

 and ==> 1

 are ==> 1

 at ==> 1

 bad ==> 1

 enough ==> 1

 explicitly ==> 1

 face ==> 1

 first ==> 1

 good ==> 1

 great ==> 1

 hard ==> 1

 honking ==> 1

 idea ==> 1

 implementation ==> 2

 is ==> 10

 it ==> 1

 may ==> 2

 more ==> 1

 never ==> 2

 not ==> 1

 obvious ==> 1

 of ==> 3

 often ==> 1

 one ==> 2

 only ==> 1

 pass ==> 1

 practicality ==> 1

 preferably ==> 1

 refuse ==> 1