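Introduction:

When searching Chinese text with Elasticsearch, the built-in analyzer splits every Chinese character into its own token, so everyday words such as adjectives and common names are not handled gracefully. This is where an open-source analyzer comes in. Commonly used analyzers are: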

Standard (default analyzer)
IK Chinese analyzer
Pinyin analyzer
Smart Chinese analyzer
hanlp Chinese analyzer
AliNLP, the DAMO Academy Chinese analyzer

Analyzer comparison


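standard (default analyzer): splits text into single characters; high recall, lower precision
IK analyzer ik_max_word: fairly high recall and precision with good performance; the Chinese analyzer most widely used in production
IK analyzer ik_smart: coarse-grained segmentation; precision and recall are not high, but query performance is high
Smart Chinese analyzer: fairly high recall and precision
hanlp Chinese analyzer: coarse-grained segmentation; precision and recall are not high, but query performance is high
Pinyin analyzer: tokenizes the pinyin of Chinese characters, so it differs somewhat from the analyzers above; recall and precision are fairly high when querying by pinyin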

The sections below look at each analyzer in turn and compare how each one tokenizes the same Chinese phrase, as a reference for real-world use.

standard (default analyzer)

GET _analyze
{
  "text": "南京市長江大橋",
  "tokenizer": "standard"
}

# Response
{
  "tokens" : [
    {
      "token" : "南",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "京",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "市",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "長",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "江",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "大",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "橋",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}

The default analyzer cuts Chinese text into single characters and has no notion of Chinese words, so it is rarely used for Chinese in real projects.
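The per-character behavior above can be imitated in a few lines of Python. This is only a sketch of what the response shows for Chinese ideographs, not the standard tokenizer itself (which follows Unicode text segmentation rules and keeps, for example, Latin words whole); the function name `unigram_tokens` is made up for illustration.

```python
def unigram_tokens(text):
    """Emit one token per character, with offsets and positions
    shaped like the _analyze response above (illustrative only)."""
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]


tokens = unigram_tokens("南京市長江大橋")
print([t["token"] for t in tokens])
# ['南', '京', '市', '長', '江', '大', '橋']
```

Because every character becomes its own term, a query matches any document that shares a single character with it, which is why recall is high and precision is low.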

IK Chinese analyzer:

Plugin download: https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.10.0
(Download the release that matches the Elasticsearch version you are running.)

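Create an ik folder under the plugins directory of the Elasticsearch installation, then unzip the downloaded IK package into that folder.
Restart ES for the plugin to take effect.
IK provides two analyzers, ik_smart and ik_max_word; both can be used at index time and at search time. Create an index with two fields: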

max_word_content is analyzed with ik_max_word;
smart_content is analyzed with ik_smart.
Then compare the results:

# Create the index
PUT /analyze_chinese
{
  "mappings": {
    "properties": {
      "max_word_content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      },
      "smart_content": {
        "type": "text",
        "analyzer": "ik_smart",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

# Add test data
POST analyze_chinese/_bulk
{"index":{"_id":1}}
{"max_word_content":"南京市長江大橋","smart_content":"我是南京市民"}

# Analysis result of the ik_max_word analyzer
POST _analyze
{
  "text": "南京市長江大橋",
  "analyzer": "ik_max_word"
}
# Result:
{
  "tokens" : [
    {
      "token" : "南京市",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "南京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "市長",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "長江大橋",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "長江",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "大橋",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

#ik_smart
POST _analyze
{
  "text": "南京市長江大橋",
  "analyzer": "ik_smart"
}

# Result:
{
  "tokens" : [
    {
      "token" : "南京市",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "長江大橋",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}

The output above shows that ik_smart segments coarsely, while ik_max_word segments finely.
Verify this with DSL queries:

POST analyze_chinese/_search
{
  "query": {
    "match": {
      "smart_content": "南京市"
    }
  }
}

# Result
"hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }

No document matches, because the tokens produced for “我是南京市民” do not include “南京市”.
What about searching for “南京”?

POST analyze_chinese/_search
{
  "query": {
    "match": {
      "smart_content": "南京"
    }
  }
}

# Response
"hits" : [
      {
        "_index" : "analyze_chinese",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "max_word_content" : "南京市長江大橋",
          "smart_content" : "我是南京市民"
        }
      }
    ]

How does the max_word_content field, which is analyzed with ik_max_word, behave?

POST analyze_chinese/_search
{
  "query": {
    "match": {
      "max_word_content": "南京"
    }
  }
}

# Result
"hits" : [
      {
        "_index" : "analyze_chinese",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "max_word_content" : "南京市長江大橋",
          "smart_content" : "我是南京市民"
        }
      }
    ]

# Query with 南京市
POST analyze_chinese/_search
{
  "query": {
    "match": {
      "max_word_content": "南京市"
    }
  }
}
# Result
"hits" : [
      {
        "_index" : "analyze_chinese",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "max_word_content" : "南京市長江大橋",
          "smart_content" : "我是南京市民"
        }
      }
    ]

Because “南京市長江大橋” analyzed with ik_max_word yields both the “南京” and “南京市” tokens, both queries find the document.
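The hits and misses in the queries above come down to whether any query token also appears among the indexed tokens. The Python sketch below models a match query with its default OR semantics as a simple overlap check and ignores scoring. The ik_max_word token list is copied from the _analyze output earlier; the ik_smart tokens for “我是南京市民” and the tokens of the query strings are assumptions, since the article does not show them.

```python
def match(query_tokens, indexed_tokens):
    """Return the query tokens that also occur in the indexed tokens.
    A non-empty result means the document would be a hit (OR semantics)."""
    indexed = set(indexed_tokens)
    return [t for t in query_tokens if t in indexed]


# ik_max_word tokens for "南京市長江大橋", from the _analyze output above
max_word_index = ["南京市", "南京", "市長", "長江大橋", "長江", "大橋"]
# Assumed ik_smart tokens for "我是南京市民" (not shown in the article)
smart_index = ["我", "是", "南京", "市民"]

print(match(["南京市"], smart_index))             # [] -> no hit
print(match(["南京"], smart_index))               # ['南京'] -> hit
print(match(["南京市", "南京"], max_word_index))  # ['南京市', '南京'] -> hit
```

The last call assumes that ik_max_word turns the query “南京市” into at least the terms “南京市” and “南京”. That would be consistent with the score for this query (0.5753642) being twice the score of the single-term “南京” query (0.2876821).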

IK analyzer summary:

ik_max_word segments finely and covers a wider range of business scenarios
ik_smart segments coarsely and suits scenarios with modest segmentation requirements
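To make the fine versus coarse distinction concrete, here is a toy dictionary-based segmenter in Python. It is not IK's actual algorithm: the dictionary is a hypothetical six-word list taken from the tokens shown earlier, "fine" simply emits every dictionary word found at every offset, and "coarse" uses greedy forward maximum matching as a stand-in for IK's disambiguation step.

```python
# Hypothetical mini dictionary; IK ships with a far larger one
DICTIONARY = {"南京市", "南京", "市長", "長江大橋", "長江", "大橋"}
MAX_LEN = max(len(word) for word in DICTIONARY)


def fine_grained(text):
    """Emit every dictionary word at every start offset, longest first."""
    tokens = []
    for start in range(len(text)):
        for length in range(min(MAX_LEN, len(text) - start), 0, -1):
            word = text[start:start + length]
            if word in DICTIONARY:
                tokens.append(word)
    return tokens


def coarse_grained(text):
    """Greedy forward maximum matching: take the longest word, then skip past it."""
    tokens, start = [], 0
    while start < len(text):
        for length in range(min(MAX_LEN, len(text) - start), 0, -1):
            word = text[start:start + length]
            if word in DICTIONARY:
                break
        else:
            word = text[start]  # unknown character becomes its own token
        tokens.append(word)
        start += len(word)
    return tokens


print(fine_grained("南京市長江大橋"))
# ['南京市', '南京', '市長', '長江大橋', '長江', '大橋']
print(coarse_grained("南京市長江大橋"))
# ['南京市', '長江大橋']
```

On this sample the two functions happen to reproduce the ik_max_word and ik_smart token lists, which illustrates the trade-off: the fine-grained output indexes overlapping words such as “市長” that a user may never intend, while the coarse output cannot be found by shorter terms such as “長江”.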

pinyin analyzer

First, download the pinyin analyzer plugin:
https://github.com/medcl/elasticsearch-analysis-pinyin

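Build and package it locally, upload the package to the plugins directory under the ES installation and unzip it there, then restart ES. After the restart, check that the plugin is installed: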

[elasticsearch@stage-node1 elasticsearch-7.10.0]$ ./bin/elasticsearch-plugin list
ik
pinyin

The pinyin plugin is installed.

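# Create an index with a custom pinyin analyzer.
# The name.pinyin sub-field is the field queried further down.
PUT /analyze_chinese_pinyin/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    },
    "mappings" : {
        "properties" : {
            "name" : {
                "type" : "keyword",
                "fields" : {
                    "pinyin" : {
                        "type" : "text",
                        "analyzer" : "pinyin_analyzer"
                    }
                }
            }
        }
    }
}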

# Analyze the sample text with pinyin_analyzer
GET /analyze_chinese_pinyin/_analyze
{
  "text": ["南京市長江大橋"],
  "analyzer": "pinyin_analyzer"
}

# Response:
{
  "tokens" : [
    {
      "token" : "nan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "南京市長江大橋",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "njscjdq",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "jing",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "shi",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "chang",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "jiang",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "da",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "qiao",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 6
    }
  ]
}

# Add test data
POST analyze_chinese_pinyin/_bulk
{"index":{"_id":1}}
{"name":"南京市長江大橋"}

# Query by pinyin initials: njscjdq
POST analyze_chinese_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": "njscjdq"
    }
  }
}

# Response
"hits" : [
      {
        "_index" : "analyze_chinese_pinyin",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931471,
        "_source" : {
          "name" : "南京市長江大橋"
        }
      }
    ]

# Query with nan

POST analyze_chinese_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": "nan"
    }
  }
}

# Response
"hits" : [
      {
        "_index" : "analyze_chinese_pinyin",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931471,
        "_source" : {
          "name" : "南京市長江大橋"
        }
      }
    ]

After pinyin_analyzer processes “南京市長江大橋”, the tokens include both nan and njscjdq, so both queries match the document.
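The token list returned by pinyin_analyzer can be reasoned about with a small Python model. This is a sketch under stated assumptions, not the plugin's implementation: the `PINYIN` table is a hypothetical lookup covering only the sample text, the parameter names mirror the tokenizer options for readability, and `keep_first_letter` is assumed to be on because the index settings above do not disable it.

```python
# Hypothetical lookup table covering only the characters of the sample text
PINYIN = {"南": "nan", "京": "jing", "市": "shi", "長": "chang",
          "江": "jiang", "大": "da", "橋": "qiao"}


def pinyin_tokens(text, keep_full_pinyin=True, keep_first_letter=True,
                  keep_original=True, limit_first_letter_length=16):
    """Build the set of terms a pinyin-style tokenizer might emit for text."""
    syllables = [PINYIN[ch] for ch in text]
    tokens = []
    if keep_full_pinyin:
        tokens.extend(syllables)          # one term per character: nan, jing, ...
    if keep_original:
        tokens.append(text)               # the untouched original string
    if keep_first_letter:
        initials = "".join(s[0] for s in syllables)
        tokens.append(initials[:limit_first_letter_length])
    # Drop repeated terms, keeping the first occurrence
    return list(dict.fromkeys(tokens))


tokens = pinyin_tokens("南京市長江大橋")
print(tokens)
# ['nan', 'jing', 'shi', 'chang', 'jiang', 'da', 'qiao', '南京市長江大橋', 'njscjdq']
```

Both query strings used above, nan and njscjdq, appear in this list, which is why either one retrieves the document. The order differs from the real response because the sketch does not model token positions.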

Smart Chinese Analysis

Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html

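The Smart Chinese Analysis plugin integrates Lucene's Smart Chinese analysis module into Elasticsearch and provides an analyzer for Chinese or mixed Chinese-English text. The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, and each sentence is then segmented into words.
The plugin must be installed on every node, and a restart is required for it to take effect. It provides the smartcn analyzer, the smartcn_tokenizer tokenizer, and the smartcn_stop token filter.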

./bin/elasticsearch-plugin install analysis-smartcn
-> Installing analysis-smartcn
-> Downloading analysis-smartcn from elastic
[=================================================] 100%   
-> Installed analysis-smartcn

List the installed plugins again:

[elasticsearch@stage-node1 elasticsearch-7.10.0]$ ./bin/elasticsearch-plugin list
analysis-smartcn
ik
pinyin

After a successful installation, ES must be restarted for the plugin to take effect.

POST _analyze
{
  "analyzer": "smartcn",
  "text":"南京市長江大橋"
}

# Response
{
  "tokens" : [
    {
      "token" : "南京市",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "長江",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "大橋",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    }
  ]
}

hanlp Chinese analyzer

Install the plugin:

./bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.10.0/elasticsearch-analysis-hanlp-7.10.0.zip

Check the installed plugins afterwards; as before, ES must be restarted once the installation succeeds.

[elasticsearch@stage-node1 elasticsearch-7.10.0]$ ./bin/elasticsearch-plugin list
analysis-hanlp
analysis-smartcn
ik
pinyin

GET _analyze
{
  "text": "南京市長江大橋",
  "tokenizer": "hanlp"
}

# Response
{
  "tokens" : [
    {
      "token" : "南京市",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "ns",
      "position" : 0
    },
    {
      "token" : "長江大橋",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "nz",
      "position" : 1
    }
  ]
}
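With all five results in hand, the token lists can be put side by side. The Python sketch below copies the tokens from the _analyze responses in this article and asks which analyzers would let a single exact term find the sample text. It assumes the query term is looked up verbatim, as a term query would do, and the helper name `analyzers_matching` is made up for illustration.

```python
# Token lists copied from the _analyze responses shown in this article
RESULTS = {
    "standard": ["南", "京", "市", "長", "江", "大", "橋"],
    "ik_max_word": ["南京市", "南京", "市長", "長江大橋", "長江", "大橋"],
    "ik_smart": ["南京市", "長江大橋"],
    "smartcn": ["南京市", "長江", "大橋"],
    "hanlp": ["南京市", "長江大橋"],
}


def analyzers_matching(term):
    """Names of the analyzers whose token list contains term exactly."""
    return [name for name, tokens in RESULTS.items() if term in tokens]


for name, tokens in RESULTS.items():
    print(f"{name:12} {len(tokens)} tokens")

print(analyzers_matching("長江"))    # ['ik_max_word', 'smartcn']
print(analyzers_matching("南京市"))  # ['ik_max_word', 'ik_smart', 'smartcn', 'hanlp']
```

For this sample, ik_smart and hanlp produce the same two coarse tokens, smartcn splits “長江大橋” one step further, and ik_max_word is the only analyzer that indexes both the long and the short forms.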
