Chinese Word Segmentation for Natural Language Processing
Chinese word segmentation implemented with traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained language model methods (BERT, etc.).
Project repository: https://github.com/JackHCC/Chinese-Tokenization
Method Overview
- Traditional algorithms: Chinese word segmentation with N-gram, HMM, maximum entropy, CRF, etc.
- Neural network methods: CNN, Bi-LSTM, Transformer, etc.
- Pre-trained language model methods: BERT, etc.
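As an illustrative sketch of the traditional approach, a uni-gram model scores each candidate segmentation by the product of its word probabilities and picks the best split with dynamic programming. The toy vocabulary and probabilities below are made up for the example and are not taken from the repo:

```python
import math

# Toy unigram probabilities (illustrative only; a real model is
# estimated from word counts in the training corpus).
UNIGRAM = {"研究": 0.02, "生命": 0.01, "研究生": 0.005, "命": 0.002,
           "的": 0.05, "起源": 0.008}
OOV_LOGP = math.log(1e-8)  # fallback log-probability for unknown single characters

def unigram_segment(text, max_word_len=4):
    """Best segmentation under a unigram model via dynamic programming."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # (best log-prob, previous split point)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in UNIGRAM:
                logp = math.log(UNIGRAM[word])
            elif len(word) == 1:
                logp = OOV_LOGP  # back off for unseen single characters
            else:
                continue  # unseen multi-character strings are not words
            score = best[j][0] + logp
            if score > best[i][0]:
                best[i] = (score, j)
    # Backtrack from the end to recover the word sequence.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]

print(unigram_segment("研究生命的起源"))  # → ['研究', '生命', '的', '起源']
```

With these probabilities the split 研究/生命/的/起源 outscores 研究生/命/的/起源, which is exactly the kind of ambiguity a statistical model resolves.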
Dataset Overview
- PKU and MSR are the datasets from the SIGHAN 2005 Chinese word segmentation bakeoff, and they remain the standard academic benchmarks for evaluating segmentation tools.
Experimental Procedure
Experimental Results
PKU Dataset
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Uni-Gram | 0.8550 | 0.9342 | 0.8928 |
| Uni-Gram + rules | 0.9111 | 0.9496 | 0.9300 |
| HMM | 0.7936 | 0.8090 | 0.8012 |
| CRF | 0.9409 | 0.9396 | 0.9400 |
| Bi-LSTM | 0.9248 | 0.9236 | 0.9240 |
| Bi-LSTM+CRF | 0.9366 | 0.9354 | 0.9358 |
| BERT | 0.9712 | 0.9635 | 0.9673 |
| BERT-CRF | 0.9705 | 0.9619 | 0.9662 |
| jieba | 0.8559 | 0.7896 | 0.8214 |
| pkuseg | 0.9512 | 0.9224 | 0.9366 |
| THULAC | 0.9287 | 0.9295 | 0.9291 |
MSR Dataset
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Uni-Gram | 0.9119 | 0.9633 | 0.9369 |
| Uni-Gram + rules | 0.9129 | 0.9634 | 0.9375 |
| HMM | 0.7786 | 0.8189 | 0.7983 |
| CRF | 0.9675 | 0.9676 | 0.9675 |
| Bi-LSTM | 0.9624 | 0.9625 | 0.9624 |
| Bi-LSTM+CRF | 0.9631 | 0.9632 | 0.9632 |
| BERT | 0.9841 | 0.9817 | 0.9829 |
| BERT-CRF | 0.9805 | 0.9787 | 0.9796 |
| jieba | 0.8204 | 0.8145 | 0.8174 |
| pkuseg | 0.8701 | 0.8894 | 0.8796 |
| THULAC | 0.8428 | 0.8880 | 0.8648 |
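The precision, recall, and F1 in the tables are the standard word-level scores: predicted and gold segmentations are compared as character-span intervals, precision is the fraction of predicted words that exactly match a gold span, and recall is the fraction of gold words recovered. A hedged re-implementation sketch (not the repo's actual scoring script):

```python
def to_spans(words):
    """Turn a word list into the set of (start, end) character spans it induces."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(pred_words, gold_words):
    """Word-level precision, recall, and F1 via exact span matching."""
    pred, gold = to_spans(pred_words), to_spans(gold_words)
    correct = len(pred & gold)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: the prediction wrongly merges "研究生", so only 2 of 4 words match.
print(prf(["研究生", "命", "的", "起源"],
          ["研究", "生命", "的", "起源"]))  # → (0.5, 0.5, 0.5)
```

Corpus-level scores are obtained the same way by pooling span counts over all sentences before dividing.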