Tokenizers

Japanese

  • Tokenize a single sentence
% echo "MeCabで形態(tài)素解析を行うとこうなる." | /Users/admin/Documents/mecab/bin/mecab -Owakati
  • Tokenize an entire file
% /Users/admin/Documents/mecab/bin/mecab INPUT -o OUTPUT -O wakati

MeCab parameter settings
MeCab installation
A great summary (in Japanese)
MeCab configuration files

Chinese

Run Tokenization.py to perform segmentation with Jieba.

Common segmentation methods (method, algorithm, related links):

  • Jieba — based on a prefix dictionary structure for efficient word-graph scanning: builds a directed acyclic graph (DAG) of all possible word combinations, uses dynamic programming to find the most probable combination based on word frequency, and handles unknown words with an HMM-based model and the Viterbi algorithm. Links: Github; Sun, J. "'Jieba' Chinese word segmentation tool." (2012).
  • THULAC (THU Lexical Analyzer for Chinese) — based on a structured perceptron. Links: Github; paper (2009); Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016.
  • StanfordSegmenter — based on conditional random fields (CRF). Links: Github; Tutorials; paper (2005); paper (2008).

Get the code from here.
