Preface
Elasticsearch achieves fast full-text search not only through its inverted index, but also through its analyzers.
Analyzers
- Elasticsearch ships with a number of built-in analyzers (analyzer). An analyzer is assembled from three kinds of components:
- character filter: pre-processes the raw text before tokenization, e.g. stripping HTML tags
- tokenizer: splits the text into individual tokens
- token filter: post-processes the tokens, e.g. lowercasing
- Order: character filter -> tokenizer -> token filter
- Cardinality: character filters (zero or more) + tokenizer (exactly one) + token filters (zero or more)
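The three stages can be exercised directly through the `_analyze` API. As a sketch (the sample text is illustrative), the request below runs the `html_strip` character filter, the `standard` tokenizer, and the `lowercase` token filter in exactly that order:

```json
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Quick BROWN Fox</p>"
}
```

The HTML tags are stripped first, the remaining text is split into Quick / BROWN / Fox, and the token filter then lowercases each token, so the result is quick, brown, fox.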
Built-in analyzers
- Elasticsearch ships with the following commonly used analyzers:
Standard Analyzer - the default; splits on word boundaries and lowercases
Simple Analyzer - splits on anything that is not a letter (symbols are dropped) and lowercases
Stop Analyzer - lowercases and removes stop words (the, a, is, ...)
Whitespace Analyzer - splits on whitespace; does not lowercase
Keyword Analyzer - no tokenization; emits the input as a single token
Pattern Analyzer - splits on a regular expression, \W+ (non-word characters) by default
Language Analyzers - analyzers for more than 30 common languages
Custom Analyzer - user-defined analyzers
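The differences are easy to see by running the same text through two of them (the sample text is illustrative):

```json
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The QUICK brown-fox"
}
```

The whitespace analyzer returns The, QUICK, brown-fox unchanged, while the same request with "analyzer": "standard" returns the, quick, brown, fox.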
- From these building blocks we can assemble simple custom analyzers, for example one that splits on commas:
{
"settings":{
"analysis":{
"analyzer":{
"comma":{
"type":"pattern",
"pattern":","
}
}
}
}
}
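With these settings applied to an index, the analyzer can be verified with `_analyze` (`my_index` is a placeholder for whatever index carries the settings):

```json
POST my_index/_analyze
{
  "analyzer": "comma",
  "text": "apple,banana,cherry"
}
```

which yields the tokens apple, banana and cherry. Note that the pattern analyzer also lowercases by default (its "lowercase" option defaults to true).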
- Or pick a tokenizer and some filters ourselves and assemble them into a new analyzer:
{
"settings": {
"analysis": {
"analyzer": {
"std_folded": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
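Again this can be checked with `_analyze` (`my_index` is a placeholder):

```json
POST my_index/_analyze
{
  "analyzer": "std_folded",
  "text": "Déjà Vu"
}
```

The standard tokenizer splits the text, lowercase maps it to déjà / vu, and asciifolding strips the diacritics, producing deja and vu.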
Custom analyzers
- Not every requirement can be met by assembling the built-in components; for more unusual needs the built-in tokenizers may fall short, and we can implement an analyzer of our own. As a running example, take contiguous-substring tokenization: given a string, the tokens must cover every run of 3 consecutive letters, every run of 4 letters, every run of 5 letters, and so on.
Well... this particular case can actually still be covered by a built-in tokenizer, the ngram tokenizer:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
  }
}
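For example, analyzing a single word produces every 4- to 10-character substring of it (`my_index` is a placeholder; note that newer Elasticsearch versions limit max_gram - min_gram to the index setting index.max_ngram_diff, which defaults to 1, so a spread like this one requires raising that setting):

```json
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "elastic"
}
```

This returns the ten substrings of length 4 to 7: elas, elast, elasti, elastic, last, lasti, lastic, asti, astic and stic.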
Custom plugin implementation
Here we take a whitespace tokenizer as the example.
pom file
<properties>
<elasticsearch.version>6.5.4</elasticsearch.version>
<lucene.version>7.5.0</lucene.version>
<maven.compiler.target>1.8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>${elasticsearch.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>false</filtering>
<excludes>
<exclude>*.properties</exclude>
</excludes>
</resource>
</resources>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.6</version>
<configuration>
<appendAssemblyId>false</appendAssemblyId>
<outputDirectory>${project.build.directory}/releases/</outputDirectory>
<descriptors>
<descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
</descriptors>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>${maven.compiler.target}</source>
<target>${maven.compiler.target}</target>
</configuration>
</plugin>
</plugins>
</build>
- Note that the pom points at plugin.xml and configures the static resource files.
plugin.xml (note the file location: src/main/assemblies/plugin.xml)
<?xml version="1.0"?>
<assembly>
<id>my-analysis</id>
<formats>
<format>zip</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<files>
<file>
<source>${project.basedir}/src/main/resources/my.properties</source>
<outputDirectory/>
<filtered>true</filtered>
</file>
</files>
<dependencySets>
<dependencySet>
<outputDirectory/>
<useProjectArtifact>true</useProjectArtifact>
<useTransitiveFiltering>true</useTransitiveFiltering>
<excludes>
<exclude>org.elasticsearch:elasticsearch</exclude>
</excludes>
</dependencySet>
</dependencySets>
</assembly>
- The assembly descriptor references my.properties
my.properties
description=${project.description}
version=${project.version}
name=${project.name}
classname=com.test.plugin.MyPlugin
java.version=${maven.compiler.target}
elasticsearch.version=${elasticsearch.version}
- The classname property names our plugin class
Code
- The analyzer
package com.test.index.analysis;
import org.apache.lucene.analysis.Analyzer;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
MyTokenizer myTokenizer = new MyTokenizer();
return new TokenStreamComponents(myTokenizer);
}
}
- The analyzer provider
package com.test.index.analysis;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyAnalyzerProvider extends AbstractIndexAnalyzerProvider<MyAnalyzer> {
private MyAnalyzer myAnalyzer;
public MyAnalyzerProvider(IndexSettings indexSettings,Environment environment, String name, Settings settings) {
super(indexSettings,name,settings);
myAnalyzer = new MyAnalyzer();
}
@Override
public MyAnalyzer get() {
return myAnalyzer;
}
}
- The tokenizer (the core logic)
package com.test.index.analysis;
import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyTokenizer extends Tokenizer {
    private final StringBuilder buffer = new StringBuilder();
    private int suffixOffset;
    /** start offset of the current token **/
    private int tokenStart = 0;
    /** end offset of the current token **/
    private int tokenEnd = 0;
    /** register the attributes: each emitted token carries its term text and its offsets **/
    private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        buffer.setLength(0); // reset the term buffer
        int ci;
        char ch;
        tokenStart = tokenEnd;
        // read one character
        ci = input.read();
        ch = (char) ci;
        while (true) {
            if (ci == -1) {
                // end of input
                suffixOffset = tokenEnd;
                if (buffer.length() == 0) {
                    // tokenization finished
                    return false;
                } else {
                    // emit the final token
                    termAttribute.setEmpty().append(buffer);
                    offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
                    return true;
                }
            } else if (ch == ' ') {
                // hit a space
                tokenEnd++;
                if (buffer.length() > 0) {
                    // emit the buffered token; the end offset excludes the space itself
                    termAttribute.setEmpty().append(buffer);
                    offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd - 1));
                    return true;
                } else {
                    // leading space: skip it and move the token start forward
                    tokenStart = tokenEnd;
                    ci = input.read();
                    ch = (char) ci;
                }
            } else {
                // regular character: append it and keep reading
                buffer.append(ch);
                tokenEnd++;
                ci = input.read();
                ch = (char) ci;
            }
        }
    }
    @Override
    public void end() throws IOException {
        super.end();
        int finalOffset = correctOffset(suffixOffset);
        offsetAttribute.setOffset(finalOffset, finalOffset);
    }
    @Override
    public void reset() throws IOException {
        super.reset();
        tokenStart = tokenEnd = 0;
        suffixOffset = 0;
        buffer.setLength(0);
    }
}
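Because the tokenizer logic is plain Java, it can be sanity-checked without an Elasticsearch cluster. The following is a standalone sketch of the same state machine (the class, method, and Token helper are our own for illustration, not part of the plugin), using only the JDK so it runs without Lucene on the classpath; it also shows the offset bookkeeping the real tokenizer must get right:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplitSketch {

    /** A minimal token: term text plus [start, end) character offsets. */
    public static class Token {
        public final String term;
        public final int start;
        public final int end;
        public Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Split the input on single spaces, tracking offsets like the tokenizer does. */
    public static List<Token> tokenize(Reader input) throws IOException {
        List<Token> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int pos = 0;        // absolute position of the next character to read
        int tokenStart = 0; // start offset of the token being buffered
        int ci;
        while ((ci = input.read()) != -1) {
            char ch = (char) ci;
            if (ch == ' ') {
                // a space ends the current token, if one is buffered
                if (buffer.length() > 0) {
                    tokens.add(new Token(buffer.toString(), tokenStart, pos));
                    buffer.setLength(0);
                }
                pos++;
                tokenStart = pos; // next token starts after the space
            } else {
                if (buffer.length() == 0) {
                    tokenStart = pos;
                }
                buffer.append(ch);
                pos++;
            }
        }
        // emit the trailing token at end of input
        if (buffer.length() > 0) {
            tokens.add(new Token(buffer.toString(), tokenStart, pos));
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        for (Token t : tokenize(new StringReader("hello  elastic search"))) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

Running it on "hello  elastic search" (with a double space) shows that consecutive spaces are skipped and offsets stay aligned with the original text.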
- The tokenizer factory
package com.test.index.analysis;
import org.apache.lucene.analysis.Tokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyTokenizerFactory extends AbstractTokenizerFactory {
public MyTokenizerFactory(IndexSettings indexSettings,Environment environment,String ignored, Settings settings) {
super(indexSettings,ignored,settings);
}
@Override
public Tokenizer create() {
return new MyTokenizer();
}
}
- The plugin class
package com.test.plugin;
import com.test.index.analysis.MyAnalyzerProvider;
import com.test.index.analysis.MyTokenizerFactory;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;
/**
* @author phil.zhang
* @date 2021/2/21
*/
public class MyPlugin extends Plugin implements AnalysisPlugin {
@Override
public Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> getTokenizers() {
Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
extra.put("my-word", MyTokenizerFactory::new);
return extra;
}
@Override
public Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> extra = new HashMap<>();
extra.put("my-word", MyAnalyzerProvider::new);
return extra;
}
}
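Once the plugin is installed, both components are registered under the name my-word and can be referenced like any built-in analyzer; for example in a 6.x mapping (the index, type, and field names are illustrative):

```json
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my-word"
        }
      }
    }
  }
}
```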
后續(xù)
到這里代碼就開發(fā)完成了,可以進行簡單的自測看下效果,然后就可以使用maven命令進行打包,之后就是分詞器插件的安裝流程,這里不再進一步說明
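As a rough sketch, the packaging and installation steps look like this (the zip name and path depend on your artifactId and version, so treat them as placeholders):

```shell
# build the plugin; per the pom above, the assembly plugin writes the zip to target/releases/
mvn clean package

# install it into Elasticsearch (run from the ES home directory)
bin/elasticsearch-plugin install file:///absolute/path/to/target/releases/my-analysis.zip

# restart the node, then verify with:
# POST _analyze  {"analyzer": "my-word", "text": "hello world"}
```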