05-Elasticsearch Tokenization

Tokenization

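A tokenizer receives a stream of characters, breaks it into individual tokens (usually individual words), and outputs a stream of tokens.
For example, the whitespace tokenizer splits text whenever it encounters whitespace. It converts the text "Quick brown fox!" into [Quick, brown, fox!].

The tokenizer is also responsible for recording the order, or position, of each term (used for phrase and word proximity queries) and the start and end character offsets of the original word that the term represents (used for highlighting the matched text in search results).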
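To make the three pieces of bookkeeping concrete (token text, position, character offsets), here is a minimal Python sketch of a whitespace tokenizer. It is an illustration only, not Elasticsearch's implementation; the function name `whitespace_tokenize` and the dictionary layout are assumptions chosen to mirror the fields of an `_analyze` response.

```python
import re
from typing import Dict, List


def whitespace_tokenize(text: str) -> List[Dict[str, object]]:
    """Split text on runs of whitespace, recording position and offsets."""
    tokens = []
    # \S+ matches each maximal run of non-whitespace characters.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index one past the last character
            "position": position,           # order of the token in the stream
        })
    return tokens


tokens = whitespace_tokenize("Quick brown fox!")
for t in tokens:
    print(t)
```

Because the offsets point back into the original string, `text[start_offset:end_offset]` always recovers the exact characters a token came from, which is what highlighting relies on.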

Elasticsearch ships with many built-in tokenizers, which can be used to build custom analyzers.
Analysis reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/analysis.html

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Result:

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
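
The offsets in this response index into the original text, not into the lowercased tokens. As a rough sketch of how they could drive highlighting, the Python snippet below wraps a matched term in `<em>` tags. The `highlight` helper and the choice of tags are assumptions made for illustration; only the three sample tokens are copied from the response above.

```python
from typing import Dict, List

TEXT = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."

# A few entries copied from the _analyze response above.
SAMPLE_TOKENS = [
    {"token": "quick", "start_offset": 6, "end_offset": 11, "position": 2},
    {"token": "foxes", "start_offset": 18, "end_offset": 23, "position": 4},
    {"token": "dog's", "start_offset": 45, "end_offset": 50, "position": 9},
]


def highlight(text: str, tokens: List[Dict], wanted: str) -> str:
    """Wrap each occurrence of the wanted term in <em> tags using its offsets."""
    out, cursor = [], 0
    for tok in sorted(tokens, key=lambda t: t["start_offset"]):
        if tok["token"] != wanted:
            continue
        start, end = tok["start_offset"], tok["end_offset"]
        out.append(text[cursor:start])                  # text before the match
        out.append("<em>" + text[start:end] + "</em>")  # original casing kept
        cursor = end
    out.append(text[cursor:])                           # remainder of the text
    return "".join(out)


print(highlight(TEXT, SAMPLE_TOKENS, "quick"))
# The 2 <em>QUICK</em> Brown-Foxes jumped over the lazy dog's bone.
```

Note that the term is the lowercased `quick`, while the highlighted span keeps the original `QUICK`.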
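Installing the IK analyzer

Every language is analyzed with the Standard Analyzer by default, but that analyzer handles Chinese poorly. A Chinese analyzer therefore has to be installed.

On macOS, the .DS_Store file in the plugins folder causes problems when installing the analyzer, so start the container first and install the plugin from inside it: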

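docker exec -it elasticsearch /bin/bash  # enter the container
cd /usr/share/elasticsearch/bin
# the plugin version must match the Elasticsearch version (7.4.2 here)
./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
./elasticsearch-plugin list  # list all installed plugins and check that ik is present
# restart the container afterwards so that the plugin is loaded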

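On CentOS, install it as follows.
Download the release that matches your Elasticsearch version from https://github.com/medcl/elasticsearch-analysis-ik/releases.
When Elasticsearch was installed earlier, the container directory /usr/share/elasticsearch/plugins was mapped to /mydata/elasticsearch/plugins on the host, so the most convenient approach is to download elasticsearch-analysis-ik-7.4.2.zip and unzip it into an ik subdirectory of that folder. Restart the elasticsearch container once the installation is finished.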

cd /mydata/elasticsearch/plugins
mkdir ik
cd /mydata/elasticsearch/plugins/ik
# If wget is missing, install it first: yum -y install wget
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
unzip elasticsearch-analysis-ik-7.4.2.zip
# If you get "unzip: command not found", run: yum install -y unzip zip
chmod -R 777 /mydata/elasticsearch/plugins/ik
docker restart elasticsearch  # restart Elasticsearch
docker exec -it elasticsearch /bin/bash  # enter the container
ls /usr/share/elasticsearch/plugins  # check that the ik directory exists
cd /usr/share/elasticsearch/bin
./elasticsearch-plugin -h
./elasticsearch-plugin list  # list all installed plugins and check that ik is present

Alternatively, install it as follows.
First check the Elasticsearch version:

[root@hadoop-104 ~]# curl http://localhost:9200
{
  "name" : "0adeb7852e00",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "9gglpP0HTfyOTRAaSe2rIg",
  "version" : {
    "number" : "7.6.2",      # the version is 7.6.2
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "ef48eb35cf30adf4db14086e8aabd07ef6fb113f",
    "build_date" : "2020-03-26T06:34:37.794943Z",
    "build_snapshot" : false,
    "lucene_version" : "8.4.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
[root@hadoop-104 ~]# 
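
Since the plugin release has to match the server version, the download link can be derived from the `version.number` field of this response. The sketch below assumes that every IK release follows the same URL pattern as the 7.4.2 link used in this article; that pattern is an extrapolation, so check the releases page before relying on it. The helper name `ik_release_url` is hypothetical.

```python
import json

# Abbreviated copy of the root-endpoint response shown above.
ROOT_RESPONSE = '{"version": {"number": "7.6.2"}}'


def ik_release_url(es_version: str) -> str:
    """Build the IK download URL for a given Elasticsearch version.

    Assumption: all releases follow the URL pattern of the 7.4.2 link.
    """
    base = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"
    return f"{base}/v{es_version}/elasticsearch-analysis-ik-{es_version}.zip"


es_version = json.loads(ROOT_RESPONSE)["version"]["number"]
print(ik_release_url(es_version))
```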

Open a shell inside the Elasticsearch container: docker exec -it <container id> /bin/bash

[root@hadoop-104 ~]# docker exec -it elasticsearch /bin/bash
[root@0adeb7852e00 elasticsearch]# 
[root@0adeb7852e00 elasticsearch]# pwd
/usr/share/elasticsearch
# download IK 7.6.2 (it must match the Elasticsearch version shown above)
[root@0adeb7852e00 elasticsearch]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.2/elasticsearch-analysis-ik-7.6.2.zip
[root@0adeb7852e00 elasticsearch]# unzip elasticsearch-analysis-ik-7.6.2.zip -d ik
Archive:  elasticsearch-analysis-ik-7.6.2.zip
   creating: ik/config/
  inflating: ik/config/main.dic
  inflating: ik/config/quantifier.dic
  inflating: ik/config/extra_single_word_full.dic
  inflating: ik/config/IKAnalyzer.cfg.xml
  inflating: ik/config/surname.dic
  inflating: ik/config/suffix.dic
  inflating: ik/config/stopword.dic
  inflating: ik/config/extra_main.dic
  inflating: ik/config/extra_stopword.dic
  inflating: ik/config/preposition.dic
  inflating: ik/config/extra_single_word_low_freq.dic
  inflating: ik/config/extra_single_word.dic
  inflating: ik/elasticsearch-analysis-ik-7.6.2.jar
  inflating: ik/httpclient-4.5.2.jar
  inflating: ik/httpcore-4.4.4.jar
  inflating: ik/commons-logging-1.2.jar
  inflating: ik/commons-codec-1.9.jar
  inflating: ik/plugin-descriptor.properties
  inflating: ik/plugin-security.policy
[root@0adeb7852e00 elasticsearch]#
# move it into the plugins directory
[root@0adeb7852e00 elasticsearch]# mv ik plugins/
[root@0adeb7852e00 elasticsearch]# rm -rf elasticsearch-analysis-ik-7.6.2.zip
# then exit the container and run "docker restart elasticsearch" so that the plugin is loaded
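
Before restarting, it can be worth confirming that the unpacked plugin targets the running server. The Python sketch below reads the `elasticsearch.version` key that Elasticsearch plugin descriptors carry in `plugin-descriptor.properties` and compares it with the server version. The descriptor text here is a hypothetical excerpt written for the example, not the real file contents, and the helper names are assumptions.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def plugin_matches(descriptor_text: str, es_version: str) -> bool:
    """Return True if the descriptor targets exactly this Elasticsearch version."""
    return parse_properties(descriptor_text).get("elasticsearch.version") == es_version


# Hypothetical excerpt of plugins/ik/plugin-descriptor.properties.
SAMPLE_DESCRIPTOR = """
# plugin descriptor (sample values)
name=analysis-ik
version=7.6.2
elasticsearch.version=7.6.2
"""

print(plugin_matches(SAMPLE_DESCRIPTOR, "7.6.2"))
print(plugin_matches(SAMPLE_DESCRIPTOR, "7.4.2"))
```

In practice the text would come from reading the file under the plugins directory, for example `open("/mydata/elasticsearch/plugins/ik/plugin-descriptor.properties").read()` on the host when the volume mapping described earlier is in place.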
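Testing the analyzer

With the default analyzer (the my_index index must already exist; use GET _analyze instead to run the request without an index):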

GET my_index/_analyze
{
   "text":"我是中國人"
}

Result:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "國",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}

With the IK analyzer

GET my_index/_analyze
{
   "analyzer": "ik_smart", 
   "text":"我是中國人"
}

or

GET my_index/_analyze
{
   "analyzer": "ik_max_word", 
   "text":"我是中國人"
}

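Output (this is what ik_smart returns, the coarsest-grained split; ik_max_word performs the finest-grained split and also emits the shorter overlapping words it finds in its dictionary):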

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中國人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
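
The difference between the two results comes down to whether the analyzer has a dictionary of words to match against. The toy Python sketch below uses forward maximum matching, a simple dictionary-based technique. It is not IK's actual algorithm, and the three-word dictionary is made up for this example; it only shows why a dictionary lets "中國人" come out as one token while an empty dictionary falls back to single characters.

```python
from typing import Dict, List, Set


def forward_max_match(text: str, dictionary: Set[str]) -> List[Dict[str, object]]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append({
                    "token": piece,
                    "start_offset": i,
                    "end_offset": i + size,
                    "position": len(tokens),
                })
                i += size
                break
    return tokens


TEXT = "我是中國人"
TOY_DICTIONARY = {"中國", "國人", "中國人"}  # hypothetical dictionary entries

with_dict = [t["token"] for t in forward_max_match(TEXT, TOY_DICTIONARY)]
without_dict = [t["token"] for t in forward_max_match(TEXT, set())]
print(with_dict)
print(without_dict)
```

With the toy dictionary the offsets of the last token are 2 and 5, the same as in the IK response above.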