Custom Tokenizers

Preface

Elasticsearch owes its fast full-text search not only to the idea behind its inverted index, but also to its analyzers.

Analyzers

  • Elasticsearch ships with a number of commonly used analyzers. An analyzer is built from three kinds of components:
    • character filter: pre-processes the text before it is tokenized, e.g. stripping HTML tags
    • tokenizer: splits the field value into tokens
    • token filter: post-processes the tokens, e.g. lowercasing
  • Order of the three: character filter -> tokenizer -> token filter
  • Number of each: 0 or more character filters + exactly one tokenizer + 0 or more token filters; a quick way to try such a combination is shown below
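As a minimal illustration of how the three kinds of components fit together (a sketch using the _analyze API; the sample text is arbitrary and no index is required):

POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<p>Hello World</p>"
}

The character filter strips the HTML tags, the standard tokenizer splits the remaining text into Hello and World, and the lowercase token filter turns them into hello and world.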

Built-in analyzers

  • The commonly used built-in analyzers are:
Standard Analyzer - the default; splits on word boundaries and lowercases
Simple Analyzer - splits on anything that is not a letter (symbols are dropped) and lowercases
Stop Analyzer - lowercases and removes stop words (the, a, is, ...)
Whitespace Analyzer - splits on whitespace, does not lowercase
Keyword Analyzer - no tokenization; the whole input is emitted as a single token
Pattern Analyzer - splits with a regular expression, \W+ (non-word characters) by default
Language Analyzers - analyzers for more than 30 common languages
Custom Analyzer - a user-defined analyzer
  • Building on these we can define some simple analyzers of our own, for example an analyzer that splits on commas (a quick test of it follows the settings):
{
 "settings":{
  "analysis":{
    "analyzer":{
      "comma":{
        "type":"pattern",
        "pattern":","
      }
    }
  }
 }
}
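Assuming these settings are applied when creating an index (the name my_index below is just a placeholder), the comma analyzer can be verified with the _analyze API:

GET my_index/_analyze
{
  "analyzer": "comma",
  "text": "apple,banana,cherry"
}

The expected tokens are apple, banana and cherry; note that the pattern analyzer also lowercases by default.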
  • Or we can pick a tokenizer and some token filters ourselves and assemble them into a new analyzer (again with a test request afterwards):
{
    "settings": {
        "analysis": {
            "analyzer": {
                "std_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    }
}
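Again assuming the settings above are applied to a placeholder index my_index, std_folded can be checked the same way:

GET my_index/_analyze
{
  "analyzer": "std_folded",
  "text": "Déjà Vu"
}

The standard tokenizer splits the text, lowercase lowercases it, and asciifolding removes the accents, so the expected tokens are deja and vu.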

Custom analyzers

  • Not every requirement can be met by assembling built-in components; for some special needs the built-in tokenizers fall short, and that is when writing a custom analyzer is worth a try. Take contiguous-substring tokenization as an example: given a string, the tokens should cover every contiguous run of 3 letters, 4 letters, 5 letters, and so on.
    Well... this particular case can actually still be handled by a built-in Elasticsearch tokenizer, as follows (a test request follows the settings):
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
    }
  }
}
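A quick check with the _analyze API (my_index is again a placeholder; also note that when max_gram - min_gram is greater than 1, newer Elasticsearch versions expect the index setting index.max_ngram_diff to be raised accordingly):

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "abcde"
}

With min_gram set to 3 this yields every contiguous substring of 3 or more letters: abc, abcd, abcde, bcd, bcde and cde.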

Custom plugin implementation

Here we use a whitespace tokenizer as the example.

The pom file
  <properties>
    <elasticsearch.version>6.5.4</elasticsearch.version>
    <lucene.version>7.5.0</lucene.version>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.elasticsearch</groupId>
      <artifactId>elasticsearch</artifactId>
      <version>${elasticsearch.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

  <build>
    <resources>
      <resource>
        <directory>src/main/resources</directory>
        <filtering>false</filtering>
        <excludes>
          <exclude>*.properties</exclude>
        </excludes>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.6</version>
        <configuration>
          <appendAssemblyId>false</appendAssemblyId>
          <outputDirectory>${project.build.directory}/releases/</outputDirectory>
          <descriptors>
            <descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
          </descriptors>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5.1</version>
        <configuration>
          <source>${maven.compiler.target}</source>
          <target>${maven.compiler.target}</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  • Note that plugin.xml is specified here and the static resource files are configured.
plugin.xml - note the file location (src/main/assemblies/plugin.xml, as referenced in the pom)
<?xml version="1.0"?>
<assembly>
  <id>my-analysis</id>
  <formats>
    <format>zip</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <files>
    <file>
      <source>${project.basedir}/src/main/resources/my.properties</source>
      <outputDirectory/>
      <filtered>true</filtered>
    </file>
  </files>
  <dependencySets>
    <dependencySet>
      <outputDirectory/>
      <useProjectArtifact>true</useProjectArtifact>
      <useTransitiveFiltering>true</useTransitiveFiltering>
      <excludes>
        <exclude>org.elasticsearch:elasticsearch</exclude>
      </excludes>
    </dependencySet>
  </dependencySets>
</assembly>
  • my.properties is referenced here (note that Elasticsearch expects the descriptor inside the final plugin zip to be named plugin-descriptor.properties)
my.properties
description=${project.description}
version=${project.version}
name=${project.name}
classname=com.test.plugin.MyPlugin
java.version=${maven.compiler.target}
elasticsearch.version=${elasticsearch.version}
  • The classname here is our plugin class
The code
  • The analyzer
package com.test.index.analysis;

import org.apache.lucene.analysis.Analyzer;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    MyTokenizer myTokenizer = new MyTokenizer();
    return new TokenStreamComponents(myTokenizer);
  }
}
  • The analyzer provider
package com.test.index.analysis;

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyAnalyzerProvider extends AbstractIndexAnalyzerProvider<MyAnalyzer> {
  private MyAnalyzer myAnalyzer;
  public MyAnalyzerProvider(IndexSettings indexSettings,Environment environment, String name, Settings settings) {
    super(indexSettings,name,settings);
    myAnalyzer = new MyAnalyzer();
  }
  @Override
  public MyAnalyzer get() {
    return myAnalyzer;
  }
}
  • The tokenizer - the core logic
package com.test.index.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyTokenizer extends Tokenizer {
  private final StringBuilder buffer = new StringBuilder();
  /** start offset of the current token **/
  private int tokenStart = 0;
  /** end offset of the current token **/
  private int tokenEnd = 0;
  /** register the attributes: every emitted token carries its term text and its offsets **/
  private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    buffer.setLength(0); // clear the term buffer
    int ci;
    char ch;
    tokenStart = tokenEnd;
    // read one character
    ci = input.read();
    ch = (char)ci;
    while (true) {
      if (ci == -1) {
        // end of input
        if (buffer.length() == 0) {
          // nothing buffered: tokenization is finished
          return false;
        }else {
          // emit the last buffered token
          termAttribute.setEmpty().append(buffer);
          offsetAttribute.setOffset(correctOffset(tokenStart),correctOffset(tokenEnd));
          return true;
        }
      }else if (ch == ' ') {
        // hit a space
        tokenEnd ++;
        if (buffer.length()>0) {
          // emit the buffered token
          termAttribute.setEmpty().append(buffer);
          offsetAttribute.setOffset(correctOffset(tokenStart),correctOffset(tokenEnd));
          return true;
        }else {
          // leading or consecutive space: skip it and keep reading
          ci = input.read();
          ch = (char) ci;
        }
      }else { // not a space: append the character and keep reading
        buffer.append(ch);
        tokenEnd++;
        ci = input.read();
        ch = (char) ci;
      }
    }
  }

  @Override
  public void end() throws IOException {
    super.end();
    // set the final offset to the position where reading stopped
    int finalOffset = correctOffset(tokenEnd);
    offsetAttribute.setOffset(finalOffset, finalOffset);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    tokenStart = tokenEnd = 0;
  }
}
  • The tokenizer factory
package com.test.index.analysis;

import org.apache.lucene.analysis.Tokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyTokenizerFactory extends AbstractTokenizerFactory {

  public MyTokenizerFactory(IndexSettings indexSettings,Environment environment,String ignored, Settings settings) {
    super(indexSettings,ignored,settings);
  }

  @Override
  public Tokenizer create() {
    return new MyTokenizer();
  }
}
  • The plugin class
package com.test.plugin;

import com.test.index.analysis.MyAnalyzerProvider;
import com.test.index.analysis.MyTokenizerFactory;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

/**
 * @author phil.zhang
 * @date 2021/2/21
 */
public class MyPlugin extends Plugin implements AnalysisPlugin {

  @Override
  public Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> getTokenizers() {
    Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
    extra.put("my-word", MyTokenizerFactory::new);
    return extra;
  }
  @Override
  public Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {

    Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> extra = new HashMap<>();
    extra.put("my-word", MyAnalyzerProvider::new);
    return extra;
  }
}
后續(xù)

At this point the code is complete. Run a quick self-test to check the results, package the project with the usual Maven command, and then go through the normal plugin installation procedure; that is not covered in detail here, but a rough sketch follows.
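As a rough sketch (the file path below is only a placeholder, not taken from this project): the zip produced under target/releases can be installed with bin/elasticsearch-plugin install file:///path/to/plugin.zip, and after restarting the node the analyzer registered as my-word can be exercised directly:

GET _analyze
{
  "analyzer": "my-word",
  "text": "hello elasticsearch plugin"
}

If everything is wired up correctly, the expected tokens are hello, elasticsearch and plugin.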
