Introduction:
When searching Chinese text with Elasticsearch, the built-in standard analyzer splits the text into single characters, so common Chinese words, set phrases, and names are not handled gracefully. This is where open-source Chinese analyzers come in. The most common ones are:
- Standard (the default analyzer)
- IK Chinese analyzer
- Pinyin analyzer
- Smart Chinese analyzer
- HanLP Chinese analyzer
- AliNLP (DAMO Academy Chinese word segmentation)
Analyzer comparison

- standard: the default analyzer; splits text into single characters, so recall is high but precision is low
- ik_max_word (IK analyzer): high recall and precision with good performance; the most widely used Chinese analyzer in production
- ik_smart (IK analyzer): coarser segmentation, so recall and precision are lower, but query performance is higher
- Smart Chinese analyzer: good recall, precision, and performance
- HanLP Chinese analyzer: coarser segmentation, so recall and precision are lower, but query performance is higher
- Pinyin analyzer: segments by Hanyu Pinyin rather than by Chinese words, a different approach from the analyzers above; it gives high recall and precision for pinyin-based queries
Below, each analyzer is applied to the same Chinese phrase so you can compare the results and pick the right one for your own projects.
standard (default) analyzer
GET _analyze
{
  "text": "南京市長江大橋",
  "tokenizer": "standard"
}
# Response
{
  "tokens" : [
    {
      "token" : "南",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "京",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "市",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "長",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "江",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "大",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "橋",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}
The default analyzer splits Chinese text character by character and has no understanding of Chinese words, so it is rarely used for Chinese content in real projects.
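The per-character behavior is easy to reproduce outside Elasticsearch. A minimal Python sketch (a toy stand-in, not the actual Lucene implementation) that emits one token per character with the same offsets as the response above:

```python
def standard_like_tokens(text):
    """Toy stand-in for the standard tokenizer on CJK text:
    one token per character, with character offsets."""
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]

tokens = standard_like_tokens("南京市長江大橋")
# 7 single-character tokens: 南, 京, 市, 長, 江, 大, 橋
```

A query for a multi-character word like 南京 then has to match two independent single-character terms, which is why recall is high but precision suffers.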
IK Chinese analyzer:
Plugin download: https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.10.0
(Make sure to download the version that matches your Elasticsearch version.)
- Create an ik folder under the plugins directory of the Elasticsearch installation, then unzip the downloaded IK package into it
- Restart Elasticsearch for the plugin to take effect
The IK plugin provides two analyzers, ik_smart and ik_max_word, both usable at index and query time. Let's create an index with two fields:
- max_word_content, analyzed with ik_max_word
- smart_content, analyzed with ik_smart
and compare the results:
# Create the index
PUT /analyze_chinese
{
  "mappings": {
    "properties": {
      "max_word_content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      },
      "smart_content": {
        "type": "text",
        "analyzer": "ik_smart",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
# Add test data
POST analyze_chinese/_bulk
{"index":{"_id":1}}
{"max_word_content":"南京市長江大橋","smart_content":"我是南京市民"}
# Analyze with ik_max_word
POST _analyze
{
  "text": "南京市長江大橋",
  "analyzer": "ik_max_word"
}
# Response:
{
  "tokens" : [
    {
      "token" : "南京市",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "南京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "市長",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "長江大橋",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "長江",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "大橋",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}
# Analyze with ik_smart
POST _analyze
{
  "text": "南京市長江大橋",
  "analyzer": "ik_smart"
}
# Response:
{
  "tokens" : [
    {
      "token" : "南京市",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "長江大橋",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
As the output shows, ik_smart segments at a clearly coarser granularity, while ik_max_word segments at a finer one.
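The granularity difference can be sketched with a toy dictionary-based segmenter in Python. This only illustrates the general idea (the real IK plugin uses its own dictionaries and algorithms, and the word list below is assumed for the example): the fine-grained variant emits every dictionary word it finds at any position, while the coarse variant uses forward maximum matching and keeps only the longest non-overlapping words.

```python
# Toy dictionary (an assumption for illustration)
DICT = {"南京市", "南京", "市長", "長江大橋", "長江", "大橋"}

def fine_grained(text):
    """ik_max_word-like: emit every dictionary word found at any position."""
    out = []
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in DICT:
                out.append(text[i:j])
    return out

def coarse(text):
    """ik_smart-like: forward maximum matching, longest non-overlapping words."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in DICT:
                out.append(text[i:j])
                i = j
                break
        else:  # no dictionary word starts here; fall back to a single character
            out.append(text[i])
            i += 1
    return out

fine_grained("南京市長江大橋")  # 6 overlapping tokens, as in the ik_max_word response
coarse("南京市長江大橋")       # ["南京市", "長江大橋"], as in the ik_smart response
```

With this toy dictionary, the two functions reproduce exactly the two _analyze outputs above, which makes the recall/precision trade-off concrete: fine-grained indexing stores more (overlapping) terms, so more queries can match.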
Let's verify with a few DSL queries:
POST analyze_chinese/_search
{
  "query": {
    "match": {
      "smart_content": "南京市"
    }
  }
}
# Response
"hits" : {
  "total" : {
    "value" : 0,
    "relation" : "eq"
  },
  "max_score" : null,
  "hits" : [ ]
}
No documents match, because the tokens produced for "我是南京市民" do not include a "南京市" token.
What about searching for "南京"?
POST analyze_chinese/_search
{
  "query": {
    "match": {
      "smart_content": "南京"
    }
  }
}
# Response
"hits" : [
  {
    "_index" : "analyze_chinese",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.2876821,
    "_source" : {
      "max_word_content" : "南京市長江大橋",
      "smart_content" : "我是南京市民"
    }
  }
]
How about the max_word_content field, which was analyzed with ik_max_word?
POST analyze_chinese/_search
{
  "query": {
    "match": {
      "max_word_content": "南京"
    }
  }
}
# Response
"hits" : [
  {
    "_index" : "analyze_chinese",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.2876821,
    "_source" : {
      "max_word_content" : "南京市長江大橋",
      "smart_content" : "我是南京市民"
    }
  }
]
# Search for "南京市"
POST analyze_chinese/_search
{
  "query": {
    "match": {
      "max_word_content": "南京市"
    }
  }
}
# Response
"hits" : [
  {
    "_index" : "analyze_chinese",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.5753642,
    "_source" : {
      "max_word_content" : "南京市長江大橋",
      "smart_content" : "我是南京市民"
    }
  }
]
Both queries match, because ik_max_word's segmentation of "南京市長江大橋" includes both the "南京" and "南京市" tokens.
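Conceptually, a match query analyzes the query string with the field's analyzer and then looks for any term it shares with the document's indexed terms. A minimal Python sketch of that rule (the ik_smart token set for "我是南京市民" is an assumption for illustration; the ik_max_word set is taken from the _analyze response above):

```python
# Assumed ik_smart output for "我是南京市民" (illustration only; note "南京市" is absent)
smart_doc_terms = {"我", "是", "南京", "市民"}
# ik_max_word output for "南京市長江大橋", from the _analyze response above
max_word_doc_terms = {"南京市", "南京", "市長", "長江大橋", "長江", "大橋"}

def match_query(query_terms, doc_terms):
    """A match query hits when at least one analyzed query term
    appears among the document's indexed terms."""
    return bool(set(query_terms) & doc_terms)

match_query({"南京市"}, smart_doc_terms)     # False: no shared term, 0 hits
match_query({"南京"}, smart_doc_terms)       # True
match_query({"南京市"}, max_word_doc_terms)  # True
```

This is why the choice of index-time analyzer directly bounds what the same query can find.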
IK analyzer summary:
- ik_max_word segments at a fine granularity and covers a wider range of business scenarios
- ik_smart segments more coarsely and suits use cases with less demanding segmentation requirements
pinyin analyzer
First, download the pinyin analyzer plugin:
https://github.com/medcl/elasticsearch-analysis-pinyin
Build and package the plugin locally, upload the archive into the plugins directory of the Elasticsearch installation, unzip it, and restart Elasticsearch. After the restart, check that the plugin is installed:
[elasticsearch@stage-node1 elasticsearch-7.10.0]$ ./bin/elasticsearch-plugin list
ik
pinyin
The pinyin plugin is installed successfully. Now create an index that uses it:
PUT /analyze_chinese_pinyin/
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "pinyin_analyzer" : {
          "tokenizer" : "my_pinyin"
        }
      },
      "tokenizer" : {
        "my_pinyin" : {
          "type" : "pinyin",
          "keep_separate_first_letter" : false,
          "keep_full_pinyin" : true,
          "keep_original" : true,
          "limit_first_letter_length" : 16,
          "lowercase" : true,
          "remove_duplicated_term" : true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword",
        "fields": {
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin_analyzer"
          }
        }
      }
    }
  }
}
# Note: the name.pinyin multi-field in the mapping is required for
# the pinyin queries below to work
# Test the analyzer
GET /analyze_chinese_pinyin/_analyze
{
  "text": ["南京市長江大橋"],
  "analyzer": "pinyin_analyzer"
}
# Response:
{
  "tokens" : [
    {
      "token" : "nan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "南京市長江大橋",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "njscjdq",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "jing",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "shi",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "chang",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "jiang",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "da",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "qiao",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 6
    }
  ]
}
# Add test data
POST analyze_chinese_pinyin/_bulk
{"index":{"_id":1}}
{"name":"南京市長江大橋"}
# Search by the first-letter abbreviation njscjdq
POST analyze_chinese_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": "njscjdq"
    }
  }
}
# Response
"hits" : [
  {
    "_index" : "analyze_chinese_pinyin",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.6931471,
    "_source" : {
      "name" : "南京市長江大橋"
    }
  }
]
# Search for nan
POST analyze_chinese_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": "nan"
    }
  }
}
# Response
"hits" : [
  {
    "_index" : "analyze_chinese_pinyin",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.6931471,
    "_source" : {
      "name" : "南京市長江大橋"
    }
  }
]
Both queries match, because the pinyin_analyzer output for "南京市長江大橋" contains both the nan and njscjdq tokens.
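The njscjdq token is the first-letter abbreviation controlled by the limit_first_letter_length setting: the tokenizer concatenates the first letter of each syllable's pinyin and truncates the result to 16 characters. A quick Python sketch of that rule (the syllable list is taken from the _analyze response above; converting characters to pinyin is out of scope here):

```python
def first_letter_token(syllables, limit=16):
    """Concatenate the first letter of each pinyin syllable,
    truncated to `limit` characters (cf. limit_first_letter_length)."""
    return "".join(s[0] for s in syllables)[:limit]

syllables = ["nan", "jing", "shi", "chang", "jiang", "da", "qiao"]
first_letter_token(syllables)  # "njscjdq"
```

This is what makes abbreviation-style searches (typing njscjdq instead of the full pinyin) work.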
Smart Chinese Analysis
Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html
The Smart Chinese Analysis plugin integrates Lucene's Smart Chinese analysis module into Elasticsearch, providing an analyzer for Chinese or mixed Chinese-English text. The analyzer uses probabilistic knowledge to find the optimal word segmentation of Simplified Chinese text: the text is first broken into sentences, and each sentence is then segmented into words.
The plugin must be installed on every node, followed by a restart, and provides the smartcn analyzer and the smartcn_tokenizer tokenizer.
./bin/elasticsearch-plugin install analysis-smartcn
-> Installing analysis-smartcn
-> Downloading analysis-smartcn from elastic
[=================================================] 100%
-> Installed analysis-smartcn
Again, list the installed plugins:
[elasticsearch@stage-node1 elasticsearch-7.10.0]$ ./bin/elasticsearch-plugin list
analysis-smartcn
ik
pinyin
After a successful installation, restart Elasticsearch for the plugin to take effect.
POST _analyze
{
  "analyzer": "smartcn",
  "text": "南京市長江大橋"
}
# Response
{
  "tokens" : [
    {
      "token" : "南京市",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "長江",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "大橋",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    }
  ]
}
hanlp Chinese analyzer
Install the plugin:
./bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.10.0/elasticsearch-analysis-hanlp-7.10.0.zip
After installation, check the plugin list; as with the other plugins, restart Elasticsearch once the installation succeeds:
[elasticsearch@stage-node1 elasticsearch-7.10.0]$ ./bin/elasticsearch-plugin list
analysis-hanlp
analysis-smartcn
ik
pinyin
GET _analyze
{
  "text": "南京市長江大橋",
  "tokenizer": "hanlp"
}
# Response
{
  "tokens" : [
    {
      "token" : "南京市",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "ns",
      "position" : 0
    },
    {
      "token" : "長江大橋",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "nz",
      "position" : 1
    }
  ]
}