分詞
一個tokenizer(分詞器)接收一個字符流,將之分割為獨立的tokens(詞元,通常是獨立的單詞),然后輸出tokens流。
例如:whitespace tokenizer遇到空白字符時分割文本。它會將文本“Quick brown fox!”分割為[Quick,brown,fox!]。
該tokenizer(分詞器)還負(fù)責(zé)記錄各個terms(詞條)的順序或position位置(用于phrase短語和word proximity詞近鄰查詢),以及term(詞條)所代表的原始word(單詞)的start(起始)和end(結(jié)束)的character offsets(字符串偏移量)(用于高亮顯示搜索的內(nèi)容)。
elasticsearch提供了很多內(nèi)置的分詞器,可以用來構(gòu)建custom analyzers(自定義分詞器)。
關(guān)于分詞器: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/analysis.html
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
執(zhí)行結(jié)果:
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "2",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "brown",
"start_offset" : 12,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "foxes",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "jumped",
"start_offset" : 24,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "over",
"start_offset" : 31,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "the",
"start_offset" : 36,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "lazy",
"start_offset" : 40,
"end_offset" : 44,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "dog's",
"start_offset" : 45,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "bone",
"start_offset" : 51,
"end_offset" : 55,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
安裝ik分詞器
所有的語言分詞,默認(rèn)使用的都是“Standard Analyzer”,但是這些分詞器針對于中文的分詞,并不友好。為此需要安裝中文的分詞器。
Mac下因為文件夾下有.DS_Store文件導(dǎo)致安裝分詞器有點問題,可以先啟動容器后進(jìn)入容器內(nèi)部進(jìn)行安裝
docker exec -it elasticsearch /bin/bash #進(jìn)入容器
/usr/share/elasticsearch/bin
./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
elasticsearch-plugin list # 列出我們所有安裝的插件,看有沒有ik
CentOS環(huán)境按下面安裝
在 https://github.com/medcl/elasticsearch-analysis-ik/releases 找對應(yīng)es版本下載
在前面安裝的elasticsearch時,我們已經(jīng)將elasticsearch容器的“/usr/share/elasticsearch/plugins”目錄,映射到宿主機的 /mydata/elasticsearch/plugins 目錄下,所以比較方便的做法就是下載“/elasticsearch-analysis-ik-7.4.2.zip”文件,然后解壓到該文件夾下即可。安裝完畢后,需要重啟elasticsearch容器。
cd /mydata/elasticsearch/plugins
mkdir ik
cd /mydata/elasticsearch/plugins/ik
# 如果沒有wget 命令先安裝wget:yum -y install wget
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
unzip elasticsearch-analysis-ik-7.4.2.zip
# 如果報 unzip: command not found的錯誤就執(zhí)行下:yum install -y unzip zip
chmod -R 777 ik
docker restart elasticsearch #重啟elasticsearch
docker exec -it elasticsearch /bin/bash #進(jìn)入容器
cd /usr/share/elasticsearch/plugins #看有沒有ik目錄
cd /usr/share/elasticsearch/bin
elasticsearch-plugin -h
elasticsearch-plugin list # 列出我們所有安裝的插件,看有沒有ik
還可以采用如下的方式。
查看elasticsearch版本號:
[root@hadoop-104 ~]# curl http://localhost:9200
{
"name" : "0adeb7852e00",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "9gglpP0HTfyOTRAaSe2rIg",
"version" : {
"number" : "7.6.2", #版本號為7.6.2
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "ef48eb35cf30adf4db14086e8aabd07ef6fb113f",
"build_date" : "2020-03-26T06:34:37.794943Z",
"build_snapshot" : false,
"lucene_version" : "8.4.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
[root@hadoop-104 ~]#
進(jìn)入es容器內(nèi)部plugin目錄:docker exec -it 容器id /bin/bash
[root@hadoop-104 ~]# docker exec -it elasticsearch /bin/bash
[root@0adeb7852e00 elasticsearch]#
[root@0adeb7852e00 elasticsearch]# pwd
/usr/share/elasticsearch
#下載ik7.4.2
[root@0adeb7852e00 elasticsearch]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
[root@0adeb7852e00 elasticsearch]# unzip elasticsearch-analysis-ik-7.4.2.zip -d ink
Archive: elasticsearch-analysis-ik-7.4.2.zip
creating: ik/config/
inflating: ik/config/main.dic
inflating: ik/config/quantifier.dic
inflating: ik/config/extra_single_word_full.dic
inflating: ik/config/IKAnalyzer.cfg.xml
inflating: ik/config/surname.dic
inflating: ik/config/suffix.dic
inflating: ik/config/stopword.dic
inflating: ik/config/extra_main.dic
inflating: ik/config/extra_stopword.dic
inflating: ik/config/preposition.dic
inflating: ik/config/extra_single_word_low_freq.dic
inflating: ik/config/extra_single_word.dic
inflating: ik/elasticsearch-analysis-ik-7.6.2.jar
inflating: ik/httpclient-4.5.2.jar
inflating: ik/httpcore-4.4.4.jar
inflating: ik/commons-logging-1.2.jar
inflating: ik/commons-codec-1.9.jar
inflating: ik/plugin-descriptor.properties
inflating: ik/plugin-security.policy
[root@0adeb7852e00 elasticsearch]#
#移動到plugins目錄下
[root@0adeb7852e00 elasticsearch]# mv ik plugins/
[root@0adeb7852e00 elasticsearch]# rm -rf elasticsearch-analysis-ik-7.4.2.zip
測試分詞器
使用默認(rèn)分詞器
GET my_index/_analyze
{
"text":"我是中國人"
}
執(zhí)行結(jié)果:
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "中",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "國",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "人",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}
]
}
使用ik分詞器
GET my_index/_analyze
{
"analyzer": "ik_smart",
"text":"我是中國人"
}
或者
GET my_index/_analyze
{
"analyzer": "ik_max_word",
"text":"我是中國人"
}
輸出結(jié)果:
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中國人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
}
]
}