1. Registering analyzers
Analyzers, tokenizers, and filters can be configured in elasticsearch.yml:
```yaml
index:
  analysis:
    analyzer:
      standard:
        type: standard
        stopwords: [stop1, stop2]
      myAnalyzer1:
        type: standard
        stopwords: [stop1, stop2, stop3]
        max_token_length: 500
      myAnalyzer2:
        tokenizer: standard
        filter: [standard, lowercase, stop]
    tokenizer:
      myTokenizer1:
        type: standard
        max_token_length: 900
      myTokenizer2:
        type: keyword
        buffer_size: 512
    filter:
      myTokenFilter1:
        type: stop
        stopwords: [stop1, stop2, stop3, stop4]
      myTokenFilter2:
        type: length
        min: 0
        max: 2000
```
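As a rough illustration of what a pipeline like `myAnalyzer2` above does (a standard-style tokenizer followed by the lowercase and stop token filters), here is a simplified Python sketch. This is not ES code, and the stop-word list is an assumption for the example:

```python
import re

# Assumed stop-word list for illustration only; ES uses its own defaults.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def my_analyzer2(text):
    # Rough stand-in for the standard tokenizer: split on word characters.
    tokens = re.findall(r"\w+", text)
    # lowercase filter: normalize case.
    tokens = [t.lower() for t in tokens]
    # stop filter: drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(my_analyzer2("The Quick Brown Fox and the lazy dog"))
# ['quick', 'brown', 'fox', 'lazy', 'dog']
```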
analyzer: ES ships with a number of built-in analyzers, and you can also assemble your own analyzer (a custom analyzer) from the built-in character filters, tokenizers, and token filters:
```yaml
index:
  analysis:
    analyzer:
      myAnalyzer:
        tokenizer: standard
        filter: [standard, lowercase, stop]
```
To use a third-party analyzer plugin, you must first register it in the elasticsearch.yml config file. Here is an example configuring IkAnalyzer:
```yaml
index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
```
Once an analyzer has been registered under a logical name in the config file, that name can be used to reference the analyzer in mapping definitions and in various APIs.
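For example, once the plugin analyzer above is registered under the logical name `ik`, a mapping can reference it by that name. The field name here is illustrative, and the exact mapping syntax depends on your ES version:

```json
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik"
      }
    }
  }
}
```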
2. Built-in analyzers, tokenizers, and filters in ES
Some of the analyzers built into ES:
| analyzer | logical name | description |
|---|---|---|
| standard analyzer | standard | standard tokenizer, standard filter, lower case filter, stop filter |
| simple analyzer | simple | lower case tokenizer |
| stop analyzer | stop | lower case tokenizer, stop filter |
| keyword analyzer | keyword | no tokenization; the whole input is emitted as a single token (not_analyzed) |
| pattern analyzer | pattern | tokenizes on a regular expression; by default splits on non-word characters |
| language analyzers | lang | analyzers tailored to specific languages |
| snowball analyzer | snowball | standard tokenizer, standard filter, lower case filter, stop filter, snowball filter |
| custom analyzer | custom | one tokenizer, zero or more token filters, zero or more char filters |
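To make the differences concrete, here are rough Python approximations of the simple and keyword analyzers. These are sketches of the observable behavior only, not ES internals:

```python
import re

def simple_analyzer(text):
    # lower case tokenizer: split on anything that is not a letter, then lowercase.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def keyword_analyzer(text):
    # no tokenization: the whole input becomes a single token.
    return [text]

print(simple_analyzer("3 Quick-Foxes"))   # ['quick', 'foxes']
print(keyword_analyzer("3 Quick-Foxes"))  # ['3 Quick-Foxes']
```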
tokenizer: the tokenizers built into ES:
| tokenizer | logical name | description |
|---|---|---|
| standard tokenizer | standard | |
| edge ngram tokenizer | edgeNGram | |
| keyword tokenizer | keyword | no tokenization |
| letter tokenizer | letter | splits on non-letter characters |
| lowercase tokenizer | lowercase | letter tokenizer, lower case filter |
| ngram tokenizer | nGram | |
| whitespace tokenizer | whitespace | splits on whitespace |
| pattern tokenizer | pattern | splits on a configurable regular expression |
| uax email url tokenizer | uax_url_email | keeps URLs and email addresses as single tokens |
| path hierarchy tokenizer | path_hierarchy | handles path-like strings such as /path/to/something |
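The path_hierarchy tokenizer's behavior can be approximated in a few lines of Python. This sketch assumes the default `/` delimiter and is not ES code:

```python
def path_hierarchy(text, delimiter="/"):
    # Emit every ancestor prefix of a delimiter-separated path.
    parts = text.split(delimiter)
    tokens = []
    for i in range(1, len(parts) + 1):
        tokens.append(delimiter.join(parts[:i]))
    # Drop the empty prefix produced by a leading delimiter.
    return [t for t in tokens if t]

print(path_hierarchy("/path/to/something"))
# ['/path', '/path/to', '/path/to/something']
```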
token filter: the token filters built into ES.
| token filter | logical name | description |
|---|---|---|
| standard filter | standard | |
| ascii folding filter | asciifolding | |
| length filter | length | removes tokens that are too long or too short |
| lowercase filter | lowercase | converts tokens to lower case |
| ngram filter | nGram | |
| edge ngram filter | edgeNGram | |
| porter stem filter | porterStem | Porter stemming algorithm |
| shingle filter | shingle | builds combinations of adjacent tokens (token n-grams) |
| stop filter | stop | removes stop words |
| word delimiter filter | word_delimiter | splits a token into sub-tokens |
| stemmer token filter | stemmer | |
| stemmer override filter | stemmer_override | |
| keyword marker filter | keyword_marker | |
| keyword repeat filter | keyword_repeat | |
| kstem filter | kstem | |
| snowball filter | snowball | |
| phonetic filter | phonetic | plugin |
| synonym filter | synonym | handles synonyms |
| compound word filter | dictionary_decompounder, hyphenation_decompounder | decomposes compound words |
| reverse filter | reverse | reverses each token |
| elision filter | elision | removes elisions |
| truncate filter | truncate | truncates tokens |
| unique filter | unique | |
| pattern capture filter | pattern_capture | |
| pattern replace filter | pattern_replace | replaces text via a regular expression |
| trim filter | trim | trims surrounding whitespace |
| limit token count filter | limit | limits the number of tokens |
| hunspell filter | hunspell | hunspell-dictionary-based stemming |
| common grams filter | common_grams | |
| normalization filter | arabic_normalization, persian_normalization | |
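As an example of what one of these filters does, here is a rough Python sketch of the edgeNGram token filter, which emits prefixes of each token and is often used for search-as-you-type. The parameter names mirror the filter's min_gram/max_gram settings; this is an approximation, not ES code:

```python
def edge_ngram(tokens, min_gram=2, max_gram=4):
    # For each token, emit its prefixes with length min_gram..max_gram.
    out = []
    for tok in tokens:
        for n in range(min_gram, min(max_gram, len(tok)) + 1):
            out.append(tok[:n])
    return out

print(edge_ngram(["quick"]))  # ['qu', 'qui', 'quic']
```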
character filter: the character filters built into ES:
| character filter | logical name | description |
|---|---|---|
| mapping char filter | mapping | replaces characters according to a configured mapping |
| html strip char filter | html_strip | strips HTML elements |
| pattern replace char filter | pattern_replace | rewrites characters via a regular expression |
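To illustrate, here are simplified Python sketches of the html_strip and mapping character filters. These are approximations only; the real filters are more robust (html_strip, for instance, also handles HTML entities):

```python
import re

def html_strip(text):
    # html_strip: remove HTML tags before tokenization.
    return re.sub(r"<[^>]+>", "", text)

def mapping_char_filter(text, mapping):
    # mapping: replace characters according to a configured table.
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    return text

print(html_strip("<p>some <b>bold</b> text</p>"))     # 'some bold text'
print(mapping_char_filter("1 + 2", {"+": " plus "}))  # '1  plus  2'
```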