Japanese
- Tokenize a single sentence
% echo "MeCabで形態(tài)素解析を行うとこうなる." | /Users/admin/Documents/mecab/bin/mecab -Owakati
- Tokenize an entire file
% /Users/admin/Documents/mecab/bin/mecab INPUT -o OUTPUT -O wakati
- MeCab parameter configuration
- MeCab installation
- An excellent summary (in Japanese)
- MeCab configuration files
Chinese
Run Tokenization.py to perform segmentation with Jieba.
Common Chinese segmentation methods:
| Method | Algorithm | Links | Reference |
|---|---|---|---|
| Jieba | Builds a directed acyclic graph (DAG) of all possible word combinations via efficient word-graph scanning over a prefix dictionary, then uses dynamic programming to find the most probable segmentation based on word frequency. Unknown words are handled by an HMM-based model with the Viterbi algorithm. | Github | Sun, J. "'Jieba' Chinese word segmentation tool." (2012). |
| THULAC (THU Lexical Analyzer for Chinese) | Based on a structured perceptron | Github, paper (2009) | Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016. |
| Stanford Segmenter | Based on CRF | Github, tutorials, paper (2005), paper (2008) | |
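The DAG-plus-dynamic-programming core described in the Jieba row above can be sketched in a few lines. This is an illustrative toy, not Jieba's actual implementation: the dictionary and its frequencies are made up, and real Jieba adds a prefix trie and an HMM fallback for out-of-vocabulary words.

```python
import math

# Hypothetical word-frequency dictionary (not Jieba's real data).
FREQ = {"北": 10, "京": 8, "北京": 50, "大": 20, "學(xué)": 15, "大學(xué)": 60, "北京大學(xué)": 30}
TOTAL = sum(FREQ.values())

def segment(sentence: str) -> list[str]:
    """Most-probable segmentation via DP over a DAG of dictionary words."""
    n = len(sentence)
    # dag[i] = end positions j such that sentence[i:j] is in the dictionary;
    # fall back to a single character when no dictionary word starts at i.
    dag = {
        i: [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ] or [i + 1]
        for i in range(n)
    }
    # route[i] = (best log-probability from position i to the end, next cut point)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    # Walk the best route forward to emit the chosen words.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(segment("北京大學(xué)"))  # → ['北京大學(xué)']
```

Because probabilities of a path multiply, the DP sums log frequencies; longer dictionary words win whenever one word is more probable than the product of its parts.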
Get the code from here.