
Elasticsearch 7 Study Notes (Part 1)
Elasticsearch 7 Study Notes (Part 2)
Elasticsearch 7 Study Notes (Part 3)
Elasticsearch 7 Study Notes (Hands-on)

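7. Internals

7.1 Structure of the inverted index and why it is immutable

The inverted index is the data structure that makes search efficient.

For each term, the inverted index holds:

the list of documents that contain the term
the number of documents that contain the term (the document frequency, from which IDF, the inverse document frequency, is derived)
how many times the term occurs in each document: TF (term frequency)
the position of the term within each document
the length of each document: length norm
the average length of the documents that contain the term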

word        doc1        doc2

dog          *           *
hello        *
you                      *
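To make the structure above concrete, here is a minimal sketch of an inverted index in plain Python. It is not how Lucene stores its index; the function name `build_inverted_index` and the two sample texts are assumptions chosen to reproduce the small table above.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

# Hypothetical sample texts that yield the table above.
docs = {"doc1": "hello dog", "doc2": "you dog"}
index = build_inverted_index(docs)

postings = index["dog"]
doc_freq = len(postings)                                # docs containing "dog"
term_freq = {d: len(p) for d, p in postings.items()}    # occurrences per doc
doc_length = {d: len(t.split()) for d, t in docs.items()}
```

The document list, document frequency, term frequency, positions, and document length from the list above all fall out of this one mapping.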

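Advantages of an immutable inverted index

No locking is needed, which improves concurrency and avoids lock contention
Because the data never changes, it can stay in the OS cache for as long as there is enough memory
The filter cache can stay resident in memory, again because the data never changes
The index can be compressed, which saves CPU and I/O overhead

Disadvantage of an immutable inverted index: it cannot be modified in place, so making a change searchable means rebuilding the whole index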

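7.2 How a document is written (buffer, segment, commit)

Basic flow

Data is written to the in-memory buffer
A commit point is written
The data in the buffer is written to a new index segment
The index segment, which first lands in the OS cache, is forced to disk with fsync
The new index segment is opened and made available to search
The buffer is cleared

Each commit point comes with a .del file that records which documents in which segments are marked as deleted.
A search queries all segments in turn, from oldest to newest. A document that has been updated, for example, is marked as deleted in the old segment, and its new version lives in a newer segment.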

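Optimized flow

In the basic flow, writing to disk is slow, so near-real-time (NRT) search is not possible. The main bottleneck is fsync, which performs the actual disk I/O and is expensive.

The write flow is therefore improved as follows:

(1) Data is written to the buffer
(2) At a fixed interval, the data in the buffer is written to a segment file, but only into the OS cache at first
(3) As soon as the segment is in the OS cache, it is opened for search, without running a commit right away

Writing the data into the OS cache and opening it for search is called a refresh. By default a refresh happens once per second.
In other words, every second the data in the buffer is written to a new index segment file, which initially lives in the OS cache.
This is why es is near real time: by default it takes about 1 second for written data to become searchable.

POST /index_demo/_refresh triggers a refresh manually. This is rarely necessary; it is usually best to let es handle it.

If freshness requirements are low, for example if it is fine for a document to become searchable only 30 seconds after it is written, the refresh interval can be raised: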

PUT /index_demo
{
  "settings": {
    "refresh_interval": "30s" 
  }
}

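Final optimized flow

(1) Data is written to the buffer and to the translog file
(2) Every second, the data in the buffer is written to a new segment file in the OS cache; the segment is then opened and available to search
(3) The buffer is cleared
(4) Steps 1 to 3 repeat: new segments keep being added and the buffer keeps being cleared, while the translog keeps growing
(5) When the translog grows large enough, a commit takes place

5-1. All data in the buffer is written to a new segment in the OS cache and opened for search
5-2. The buffer is cleared
5-3. A commit point listing all index segments is written to disk
5-4. All index segment files still held in the filesystem cache are forced to disk with fsync
5-5. The current translog is truncated and a new translog is created

Recovering data from the translog and the commit point

After a crash, the segments listed in the last commit point are read back from disk, and the operations recorded in the translog since that commit are replayed to restore the changes that had not yet been committed.

The fsync plus translog truncation is called a flush. By default a flush runs periodically (every 30 minutes in older versions) or whenever the translog becomes too large (index.translog.flush_threshold_size).

POST /index_demo/_flush triggers a flush manually. In general this should be left to run automatically.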
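The interplay of buffer, translog, refresh, and flush can be sketched as a toy model. This is a simplification under stated assumptions, not Elasticsearch's implementation: the class `ShardSketch` and its fields are hypothetical, segments are plain lists, and the translog is assumed to have been fsynced and to survive the crash.

```python
class ShardSketch:
    """Toy model of one shard's write path (hypothetical, for illustration)."""

    def __init__(self):
        self.buffer = []      # in-memory indexing buffer, not searchable
        self.translog = []    # operations since the last commit
        self.segments = []    # searchable segments (may only be in the OS cache)
        self.committed = 0    # number of segments covered by the last commit point

    def index(self, doc):
        # Every write goes to both the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Buffer -> new segment, opened for search; nothing is fsynced here.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Refresh, write a commit point (fsync of segments), truncate the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog.clear()

    def search(self):
        return [doc for segment in self.segments for doc in segment]

    def recover(self):
        # After a crash: keep the committed segments, replay the translog.
        self.segments = self.segments[:self.committed]
        self.buffer = list(self.translog)
        self.refresh()


shard = ShardSketch()
shard.index("doc-a")
visible_before_refresh = shard.search()   # still only in the buffer
shard.refresh()
visible_after_refresh = shard.search()
shard.flush()
shard.index("doc-b")
shard.refresh()
shard.recover()                           # simulate a crash and restart
visible_after_recovery = shard.search()
```

In this model "doc-b" was never committed, yet it is visible again after recovery because the translog still held it.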

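By default (index.translog.durability: request) the translog is fsynced at the end of every index, update, delete, or bulk request, and the request is only acknowledged once the fsync has succeeded on the primary shard and on the replica shards.

Forcing a translog fsync on every write can make some operations slow. If losing a few seconds of data is acceptable, the translog can instead be fsynced asynchronously, every sync_interval (5 seconds by default):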

PUT /index_demo/_settings
{
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
}

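Last write-path optimization: merging large numbers of segment files (segment merge, optimize)

With one segment file per second, the number of files grows quickly, and every search has to visit all segments, which is expensive

By default es merges segments in the background. During a merge, documents marked as deleted are physically removed

Steps of each merge

Select a few segments of similar size and merge them into one larger segment
Flush the new segment to disk
Write a new commit point that includes the new segment and excludes the old ones
Open the new segment for search
Delete the old segments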
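A small sketch of what a merge does with deleted documents, assuming segments are lists of (doc_id, source) pairs and deletions are tracked per segment the way the .del file does. The function `merge_segments` and the data layout are assumptions for illustration only.

```python
def merge_segments(segments, deleted):
    """Merge segments into one, physically dropping documents marked deleted.

    segments: list of segments, each a list of (doc_id, source) pairs
    deleted:  set of (segment_number, doc_id) pairs marked as deleted
    """
    merged = []
    for segment_number, segment in enumerate(segments):
        for doc_id, source in segment:
            if (segment_number, doc_id) not in deleted:
                merged.append((doc_id, source))
    return merged

segments = [
    [(1, "v1"), (2, "x")],   # older segment
    [(1, "v2")],             # newer segment holding the updated doc 1
]
deleted = {(0, 1)}           # the old version of doc 1 is marked deleted
merged = merge_segments(segments, deleted)
```

After the merge only the live versions remain, so the old segments and their deletion markers can be discarded.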

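POST /index_demo/_forcemerge?max_num_segments=1 triggers a merge manually (this endpoint replaced the older _optimize API). Avoid running it by hand where possible and let the automatic merging do its work.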

8. Getting started with the Java API

CRUD

Old client (the TransportClient methods below are deprecated and will be removed starting with es 8)

Maven dependency:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>transport</artifactId>
    <version>7.8.1</version>
</dependency>

Logging dependencies (optional):

<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>2.13.3</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.13.3</version>
</dependency>

Test code

public static void main(String[] args) throws Exception {

    // Build the client
    Settings settings = Settings.builder()
            .put("cluster.name", "docker-cluster")
            .build();
    TransportClient client = new PreBuiltTransportClient(settings)
            .addTransportAddress(new TransportAddress(InetAddress.getByName("192.168.111.40"), 9300));

    //addDoc(client);
    //getDoc(client);
    //updateDoc(client);
    delDoc(client);
    
    client.close();
}

/**
 * Index (add) a document
 */
public static void addDoc(TransportClient client) throws IOException {
    IndexResponse response = client.prepareIndex("employee", "_doc", "1")
            .setSource(XContentFactory.jsonBuilder()
                    .startObject()
                    .field("user", "tom")
                    .field("age", 18)
                    .field("position", "scientist")
                    .field("country", "China")
                    .field("join_data", "2020-01-01")
                    .field("salary", 10000)
                    .endObject())
            .get();
    System.out.println(response.getResult());
}

/**
 * Get a document by ID
 */
public static void getDoc(TransportClient client){
    GetResponse documentFields = client.prepareGet("employee", "_doc", "1").get();
    System.out.println(documentFields.getSourceAsString());
}

/**
 * Update a document
 */
public static void updateDoc(TransportClient client) throws IOException {
    UpdateResponse response = client.prepareUpdate("employee", "_doc", "1")
            .setDoc(XContentFactory.jsonBuilder()
                    .startObject()
                    .field("salary", 1000000)
                    .endObject())
            .get();
    System.out.println(response.getResult());
}

/**
 * Delete a document
 */
public static void delDoc(TransportClient client){
    DeleteResponse response = client.prepareDelete("employee", "_doc", "1").get();
    System.out.println(response);
}

/***
 * Search for employees whose position contains "scientist" and whose age is between 28 and 40
 */
public static void search(TransportClient client){
    SearchResponse response = client.prepareSearch("employee")
            .setQuery(QueryBuilders.boolQuery().must(QueryBuilders.matchQuery("position", "scientist"))
                    .filter(QueryBuilders.rangeQuery("age").gte(28).lte(40))).setFrom(0).setSize(2).get();
    System.out.println(response);
}

/***
 * Aggregation query (requires the mapping to be recreated first)
 */
public static void search2(TransportClient client){
    SearchResponse response = client.prepareSearch("employee")
            .addAggregation(AggregationBuilders.terms("group_by_country")
                    .field("country")
                    .subAggregation(AggregationBuilders.dateHistogram("group_by_join_date")
                            .field("join_data") // same field name as in the indexed document
                            .calendarInterval(DateHistogramInterval.YEAR)
                            .subAggregation(AggregationBuilders.avg("avg_salary").field("salary")))
            ).execute().actionGet();

    System.out.println(response);
}

Statement to recreate the mapping:

PUT /employee
{
  "mappings": {
    "properties": {
      "age": {
        "type": "long"
      },
      "country": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "fielddata": true
      },
      "join_data": {
        "type": "date"
      },
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "position": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "salary": {
        "type": "long"
      }
    }
  }
}

New client (Java High Level REST Client)

Maven dependency:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.8.1</version>
</dependency>

Test code

public static void main(String[] args) throws IOException {
    HttpHost[] httpHost = {HttpHost.create("192.168.111.40:9200")};
    RestHighLevelClient restHighLevelClient = new RestHighLevelClient(RestClient.builder(httpHost));
    // addDoc(restHighLevelClient);
    // getDoc(restHighLevelClient);
    // updateDoc(restHighLevelClient);
    delDoc(restHighLevelClient);

    restHighLevelClient.close();
}

/**
 * Index (add) a document
 */
public static void addDoc(RestHighLevelClient client) throws IOException {
    IndexRequest request = new IndexRequest("employee");
    request.id("1");
    request.source(XContentFactory.jsonBuilder()
            .startObject()
            .field("user", "tom")
            .field("age", 18)
            .field("position", "scientist")
            .field("country", "China")
            .field("join_data", "2020-01-01")
            .field("salary", 10000)
            .endObject());
    IndexResponse response = client.index(request, RequestOptions.DEFAULT);
    System.out.println(response.getResult());
}

/**
 * Get a document
 */
public static void getDoc(RestHighLevelClient client) throws IOException {
    // Look up by ID
    GetRequest request = new GetRequest("employee","1");
    GetResponse response = client.get(request, RequestOptions.DEFAULT);
    // For richer query conditions, use a SearchRequest
    /// SearchRequest searchRequest = new SearchRequest();
    /// client.search(searchRequest, RequestOptions.DEFAULT);

    System.out.println(response.getSourceAsString());
}

/**
 * Update a document
 */
public static void updateDoc(RestHighLevelClient client) throws IOException {
    UpdateRequest request = new UpdateRequest("employee", "1");
    request.doc(XContentFactory.jsonBuilder()
            .startObject()
            .field("salary", 1000000)
            .endObject());
    UpdateResponse response = client.update(request, RequestOptions.DEFAULT);
    System.out.println(response.getResult());
}

/**
 * Delete a document
 */
public static void delDoc(RestHighLevelClient client) throws IOException {
    DeleteRequest request = new DeleteRequest("employee", "1");
    DeleteResponse response = client.delete(request, RequestOptions.DEFAULT);
    System.out.println(response);
}  

 /**
 * Search for employees whose position contains "scientist" and whose age is between 28 and 40
 */
 public static void search(RestHighLevelClient client) throws IOException {
    SearchRequest request = new SearchRequest("employee");
    request.source(SearchSourceBuilder.searchSource()
            .query(QueryBuilders.boolQuery()
                    .must(QueryBuilders.matchQuery("position", "scientist"))
                    .filter(QueryBuilders.rangeQuery("age").gte(28).lte(40))
            ).from(0).size(2)
    );
    SearchResponse search = client.search(request, RequestOptions.DEFAULT);
    System.out.println(JSONObject.toJSONString(search.getHits()));
 }

9. Search techniques in depth

9.1 Searching with the term filter

Prepare the test data

POST /forum/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2020-09-09" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2020-09-10" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2020-09-09" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2020-09-10" }

Check the mapping

GET /forum/_mapping

Result (es 7 no longer shows a type level in the mapping):

{
  "forum": {
    "mappings": {
      "properties": {
        "articleID": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "hidden": {
          "type": "boolean"
        },
        "postDate": {
          "type": "date"
        },
        "userID": {
          "type": "long"
        }
      }
    }
  }
}

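A string field mapped dynamically as type text gets two fields by default: the field itself (for example articleID), which is analyzed, and a sub-field named field.keyword (articleID.keyword), which is not analyzed and only indexes values of up to 256 characters

Search posts by user ID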

GET /forum/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "userID" : 1
                }
            }
        }
    }
}

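term filter/query: the search text is not analyzed. It is looked up in the inverted index exactly as entered.
For example, a match query would analyze "hello world" into hello and world before searching, whereas a term query looks up the single term "hello world" as-is.

Search for posts that are not hidden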

GET /forum/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "hidden" : false
                }
            }
        }
    }
}

Search posts by post date

GET /forum/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "postDate" : "2020-09-09"
                }
            }
        }
    }
}

Search posts by article ID

GET /forum/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "articleID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}

The query above returns no results. Querying articleID.keyword instead does:

GET /forum/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "articleID.keyword" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}

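Why does the first query find nothing? As described above, es indexes a text field twice: once analyzed, and once unanalyzed (in the keyword sub-field).
A term query does not analyze the search text. Querying the field directly matches against the analyzed inverted index, where the full string does not exist, so nothing matches. That is why articleID.keyword has to be used.

articleID.keyword is a sub-field that recent es versions create automatically under dynamic mapping, and it is not analyzed. So when an articleID is indexed, it is indexed twice: once under the field itself, analyzed into terms that go into the inverted index,
and once under articleID.keyword, unanalyzed, stored in the inverted index as a single string as long as it is at most 256 characters.
To run a term filter against a text field, the built-in field.keyword sub-field can be used. The catch is the default 256-character limit: longer values are not indexed in the keyword sub-field and cannot be found.
It is therefore better to define the mapping explicitly. Older versions required not_analyzed for this; in recent es versions, setting the type to keyword is enough.

Check how the value is analyzed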

GET /forum/_analyze
{
  "field": "articleID",
  "text": "XHDK-A-1293-#fJ3"
}

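By default a text field is analyzed. When the inverted index is built, every articleID value is split into terms; the original articleID string is no longer present, only the individual terms produced by the analyzer.
A term query does not analyze the search text, but articleID was indexed as xhdk, a, 1293, fj3, so searching for the full ID finds nothing.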
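The mismatch can be reproduced with a rough stand-in for the analyzer. This is only a sketch: `simple_analyze` is a hypothetical helper that splits on non-alphanumeric characters and lowercases, which approximates what the standard analyzer does to this particular ID but is not its actual tokenization algorithm.

```python
import re

def simple_analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def term_query(indexed_terms, value):
    """A term query looks the value up as-is, without analyzing it."""
    return value in indexed_terms

article_id = "XHDK-A-1293-#fJ3"
text_terms = set(simple_analyze(article_id))   # what a text field indexes
keyword_terms = {article_id}                   # what a keyword field indexes

found_in_text = term_query(text_terms, article_id)
found_in_keyword = term_query(keyword_terms, article_id)
```

The full ID is only findable in the keyword variant, while a single analyzed term such as "xhdk" is only findable in the text variant.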

Rebuild the index

DELETE /forum

PUT /forum
{
  "mappings": {
      "properties": {
        "articleID": {
          "type": "keyword"
        }
      }
    }
}

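Run the bulk statement above again to load the test data; the direct query on articleID now returns a result

term filter: searches by exact value; numbers, booleans, and dates support it naturally
It is the equivalent of a single WHERE condition in SQL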

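9.2 How filters are executed (bitsets and caching)

(1) Look up the search term in the inverted index and get the document list.
(2) For each filter, build a bitset from the document list found in the inverted index. A bitset is an array of bits, each 0 or 1, that records whether a doc matches the filter condition: 1 for a match, 0 otherwise, for example [0, 0, 0, 1, 0, 1].
Using the simplest possible data structure for this saves memory and improves performance.
(3) Iterate over the bitsets of all filter conditions, starting with the sparsest one, to find the documents that satisfy every condition.

A single search request can carry several filter conditions, and each one has its own bitset.
The bitsets are traversed starting with the sparsest:

[0, 0, 0, 1, 0, 0]: sparser
[0, 1, 0, 1, 0, 1]

Starting with the sparser bitset rules out as many documents as possible early on.

After all bitsets have been traversed, the docs that match every filter condition are returned to the client.

(4) Cache bitsets. es tracks recent queries, and for filter conditions used more than a certain number of times within the last 256 queries it caches the bitset. Bitsets are not cached for small segments: only segments holding more than 10,000 documents (or 3% of the total documents, whichever is larger) cache the bitset.

For example, for the condition postDate=2017-01-01 the resulting bitset might be [0, 0, 1, 1, 0, 0]. It can be kept in memory so that the next time the same condition arrives, the inverted index does not have to be scanned and the bitset rebuilt, which improves performance considerably.

If a filter is used more than a certain number of times within the last 256 queries (the threshold is not fixed and depends on the query type), its bitset is cached automatically.

A segment here is a Lucene segment inside a shard, not a shard. Filter results for small segments are not cached, as described in (4).

Scanning a very small segment is fast anyway. In addition, segments are merged automatically in the background, so a small segment is soon merged with others into a larger one and disappears, which makes caching it pointless.

The advantage of a filter over a query is caching. What is cached is not the list of documents a filter returned but the filter's bitset, which encodes the full matching doc list, so the inverted index need not be scanned next time.

(5) In most cases filters run before queries, to rule out as much data as possible first.

query: computes each doc's relevance score for the search condition and sorts by that score

filter: simply selects the matching data; it computes no relevance score and does no sorting by score

(6) When documents are added or modified, cached bitsets are updated automatically;
that is, the change is reflected in the cached bitsets of the affected filters.
(7) From then on, any request with the same filter condition uses the cached bitset directly and returns the filtered data quickly.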
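The steps above can be sketched with plain Python lists as bitsets. This is a simplified model, not the Lucene implementation: the names `build_bitset`, `intersect`, and `cached_bitset` are hypothetical, and the cache here stores every filter on first use instead of tracking how often it was used.

```python
def build_bitset(posting_list, num_docs):
    """Turn a list of matching doc numbers into a 0/1 array."""
    bits = [0] * num_docs
    for doc in posting_list:
        bits[doc] = 1
    return bits

def intersect(bitsets):
    """Return the docs set in every bitset, starting from the sparsest one."""
    ordered = sorted(bitsets, key=sum)   # fewest 1-bits first
    candidates = [i for i, bit in enumerate(ordered[0]) if bit]
    for bits in ordered[1:]:
        candidates = [i for i in candidates if bits[i]]
    return candidates

cache = {}

def cached_bitset(filter_key, posting_list, num_docs):
    """Build the bitset once per filter condition and reuse it afterwards."""
    if filter_key not in cache:
        cache[filter_key] = build_bitset(posting_list, num_docs)
    return cache[filter_key]

a = cached_bitset(("postDate", "2020-09-09"), [3], 6)     # sparse
b = cached_bitset(("userID", 1), [1, 3, 5], 6)
matches = intersect([a, b])
```

Starting from the sparse bitset means only one candidate has to be checked against the second bitset.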

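9.3 Combining multiple filter conditions with bool

bool combines filter conditions through must, must_not, and should, which play the roles of AND, NOT, and OR in SQL; bool clauses can be nested

Search for posts whose post date is 2020-09-09 or whose article ID is XHDK-A-1293-#fJ3, and whose post date must not be 2020-09-10

The equivalent SQL: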

SELECT
    * 
FROM
    forum.article 
WHERE
    ( post_date = '2020-09-09' OR article_id = 'XHDK-A-1293-#fJ3' ) 
    AND post_date != '2020-09-10'

The es query

GET /forum/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            {"term": { "postDate": "2020-09-09" }},
            {"term": {"articleID": "XHDK-A-1293-#fJ3"}}
          ],
          "must_not": {
            "term": {
              "postDate": "2020-09-10"
            }
          }
        }
      }
    }
  }
}

must: must match; should: at least one of the clauses should match; must_not: must not match
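As a sanity check of these semantics, the filter above can be evaluated against the four test posts in plain Python. The helpers `term` and `bool_filter` are hypothetical and only model exact-value clauses in filter context; they do not call Elasticsearch.

```python
def term(field, value):
    """Exact-value clause on one field."""
    return lambda doc: doc.get(field) == value

def bool_filter(doc, must=(), should=(), must_not=()):
    """Evaluate a bool filter: all must, no must_not, and (without must) one should."""
    if not all(clause(doc) for clause in must):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    # With no must clause, at least one should clause has to match.
    if should and not must:
        return any(clause(doc) for clause in should)
    return True

posts = {
    1: {"articleID": "XHDK-A-1293-#fJ3", "postDate": "2020-09-09"},
    2: {"articleID": "KDKE-B-9947-#kL5", "postDate": "2020-09-10"},
    3: {"articleID": "JODL-X-1937-#pV7", "postDate": "2020-09-09"},
    4: {"articleID": "QQPX-R-3956-#aD8", "postDate": "2020-09-10"},
}

hits = [
    doc_id
    for doc_id, doc in posts.items()
    if bool_filter(
        doc,
        should=[term("postDate", "2020-09-09"),
                term("articleID", "XHDK-A-1293-#fJ3")],
        must_not=[term("postDate", "2020-09-10")],
    )
]
```

Posts 1 and 3 pass; posts 2 and 4 are excluded by the must_not clause.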

Search for posts whose article ID is XHDK-A-1293-#fJ3, or whose article ID is JODL-X-1937-#pV7 and whose post date is 2020-09-09

GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            {
              "term": {
                "articleID": "XHDK-A-1293-#fJ3"
              }
            },
            {
              "bool": {
                "must": [
                  {
                    "term":{
                      "articleID": "JODL-X-1937-#pV7"
                    }
                  },
                  {
                    "term": {
                      "postDate": "2020-09-09"
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
  }
}

9.4 Searching multiple values with terms, and refining multi-value results

term: {"field": "value"}
terms: {"field": ["value1", "value2"]}

This corresponds to IN in SQL

select * from tbl where col in ("value1", "value2")

Add a tag field to the posts

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag" : ["java", "hadoop"]} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag" : ["java"]} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag" : ["hadoop"]} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag" : ["java", "elasticsearch"]} }

Search for posts whose articleID is KDKE-B-9947-#kL5 or QQPX-R-3956-#aD8

GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "articleID": [
            "KDKE-B-9947-#kL5",
            "QQPX-R-3956-#aD8"
          ]
        }
      }
    }
  }
}

Search for posts whose tag contains java

GET /forum/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "terms" : { 
                    "tag" : ["java"]
                }
            }
        }
    }
}

Refine the result: find only posts whose tag contains nothing but java

The current data structure cannot express this, so we add a helper field that holds the number of tags

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag_cnt" : 2} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag_cnt" : 1} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag_cnt" : 1} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag_cnt" : 2} }

Run the query

GET /forum/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "tag_cnt": 1
              }
            },
            {
              "terms": {
                "tag": ["java"]
              }
            }
          ]
        }
      }
    }
  }
}

9.5 Range filtering with the range filter

Add a view count field to the posts

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"view_cnt" : 30} }
{ "update": { "_id": "2"} }
{ "doc" : {"view_cnt" : 50} }
{ "update": { "_id": "3"} }
{ "doc" : {"view_cnt" : 100} }
{ "update": { "_id": "4"} }
{ "doc" : {"view_cnt" : 80} }

Search for posts with a view count between 30 and 60 (both bounds exclusive, since gt and lt are used)

GET /forum/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "view_cnt": {
            "gt": 30,
            "lt": 60
          }
        }
      }
    }
  }
}

Search for posts published within the last month

Prepare sample data

POST /forum/_bulk
{ "index": { "_id": 5 }}
{ "articleID" : "DHJK-B-1395-#Ky5", "userID" : 3, "hidden": false, "postDate": "2020-10-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10 }

Run the queries

GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gt": "2020-10-10||-30d"
          }
        }
      }
    }
  }
}

GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gt": "now-30d"
          }
        }
      }
    }
  }
}

range corresponds to BETWEEN in SQL and filters by range
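The two date-math expressions used above, an anchor date followed by `||-30d` and `now-30d`, resolve to a concrete date. The sketch below shows the arithmetic only; `resolve_date_math` is a hypothetical helper that handles day offsets and nothing else, whereas Elasticsearch's date math supports more units and rounding.

```python
from datetime import date, timedelta

def resolve_date_math(expr, today):
    """Resolve 'now-Nd' or 'yyyy-MM-dd||-Nd' (day offsets only) to a date."""
    if expr.startswith("now"):
        anchor, offset = today, expr[3:]
    else:
        anchor_text, _, offset = expr.partition("||")
        anchor = date.fromisoformat(anchor_text)
    if not offset:
        return anchor
    days = int(offset[:-1])          # "-30d" -> -30
    return anchor + timedelta(days=days)

lower_bound = resolve_date_math("2020-10-10||-30d", today=date(2020, 10, 10))
same_via_now = resolve_date_math("now-30d", today=date(2020, 10, 10))

post_dates = {1: "2020-09-09", 2: "2020-09-10", 5: "2020-10-01"}
recent = [doc_id for doc_id, d in post_dates.items()
          if date.fromisoformat(d) > lower_bound]   # gt is exclusive
```

With 2020-10-10 as the reference date, both expressions resolve to 2020-09-10, so only the post from 2020-10-01 counts as recent.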

9.6 Controlling the precision of full-text search results

There are two ways to search for several values in full-text search: a match query, or should clauses.

Precision is controlled with the and operator and with minimum_should_match

Add a title field to the posts

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }

Search for blogs whose title contains java or elasticsearch

This differs from the term query used earlier: it does not search for an exact value but performs full-text search.
Full-text search is done with the match query. If the searched field is not analyzed (type keyword), a match query behaves like a term query.

GET /forum/_search
{
    "query": {
        "match": {
            "title": "java elasticsearch"
        }
    }
}

Search for titles that contain both java and elasticsearch

First step in controlling precision: use the and operator. If every search keyword has to match, and achieves what a plain match query (which defaults to or) cannot.

GET /forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch",
        "operator": "and"
      }
    }
  }
}

Search for titles that contain at least 3 of the 4 keywords java, elasticsearch, spark, hadoop

Second step in controlling precision: specify how many of the given keywords must match at minimum for a document to be returned

GET /forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"
      }
    }
  }
}

Combine several conditions with bool to search the title

GET /forum/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "java"
        }
      },
      "must_not": {
        "match": {
          "title": "spark"
        }
      },
      "should": [
        {
          "match": {
            "title": "hadoop"
          }
        },
        {
          "match": {
            "title": "elasticsearch"
          }
        }
      ]
    }
  }
}

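How is the relevance score computed when bool combines several conditions?

The scores of the matching must and should clauses are added up; in older versions the sum was additionally scaled by the share of clauses that matched (the coordination factor)

Ranked first: contains java and all the should keywords, hadoop and elasticsearch
Ranked second: contains java and elasticsearch from the should clauses
Ranked third: contains java but none of the should keywords

should clauses therefore influence the relevance score

must ensures that the keyword is present, and the document's relevance score for that condition is computed from the must clause
On top of the must clause, the should clauses do not have to match, but the more of them match, the higher the document's relevance score

Search for java, hadoop, spark, elasticsearch, requiring at least 3 of the keywords

By default none of the should clauses has to match. In the search above, for example, "this is java blog" matches no should clause
There is one exception: if there is no must clause, at least one should clause has to match
In the search below, should has 4 clauses, and by default matching any one of them is enough for a document to be returned

This can be controlled precisely: minimum_should_match sets how many of the 4 should clauses must match for a document to be returned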

GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "java"
          }
        },
        {
          "match": {
            "title": "elasticsearch"
          }
        },
        {
          "match": {
            "title": "hadoop"
          }
        },
        {
          "match": {
            "title": "spark"
          }
        }
      ],
      "minimum_should_match": 3
    }
  }
}

9.7 How multi-word search is implemented with term + bool

How a plain match becomes term + should

{
    "match": { "title": "java elasticsearch"}
}

When a match query like the one above searches for several words, es rewrites it internally into a bool query:
a bool should with one term query per search word

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"   }}
    ]
  }
}

How a match with operator and becomes term + must

{
    "match": {
        "title": {
            "query":    "java elasticsearch",
            "operator": "and"
        }
    }
}

This becomes:

{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"   }}
    ]
  }
}

How minimum_should_match is converted

{
    "match": {
        "title": {
            "query": "java elasticsearch hadoop spark",
            "minimum_should_match": "75%"
        }
    }
}

This becomes:

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"   }},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3 
  }
}
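The conversion from "75%" to 3 is plain arithmetic: the percentage is applied to the number of optional clauses and rounded down. A minimal sketch follows; `required_matches` is a hypothetical helper, and it covers only positive integers and positive percentages, not the negative or combined forms that minimum_should_match also accepts.

```python
import math

def required_matches(num_optional_clauses, minimum_should_match):
    """Number of should clauses that must match (positive int or 'N%' only)."""
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        return math.floor(num_optional_clauses * percent / 100)
    return int(minimum_should_match)

four_terms = required_matches(4, "75%")    # java elasticsearch hadoop spark
three_terms = required_matches(3, "75%")   # 2.25 is rounded down
```

For the four-term query above this yields 3, which matches the rewritten bool query.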

9.8 Fine-grained control of clause weights with boost

Requirement:

Search for posts whose title contains java; posts whose title also contains hadoop or elasticsearch should rank higher.
In addition, given one post that contains java hadoop and another that contains java elasticsearch, the post containing hadoop should rank above the one containing elasticsearch

Key point:

boost is the weight of a search clause. Raising the boost of one clause means that, when relevance scores are computed,
a document matching the heavier clause gets a higher relevance score than a document matching another clause, and is therefore returned first.
By default all search clauses have the same weight, 1

GET /forum/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "java"
          }
        }
      ],
      "should": [
        {
          "match": {
            "title": {
              "query": "elasticsearch"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "hadoop",
              "boost": 5
            }
          }
        }
      ]
    }
  }
}

9.9 Inaccurate relevance scores with multiple shards

If an index has several shards, relevance scores may be inaccurate: each shard computes IDF from its own documents only (local IDF), so the same term can be weighted differently on different shards

How can this be solved?

(1) In production, with large data volumes, rely on an even distribution

With a lot of data, es statistically distributes documents evenly across the shards, because routing is based on _id and balances the load
For example, with 10 documents whose title contains java and 5 shards, an even distribution puts about 2 of those docs on each shard
With evenly distributed data, the problem described above largely disappears

(2) In a test environment, set the index to a single primary shard: number_of_shards=1 in the index settings

With a single shard, all documents are in that shard and the problem does not arise

(3) In a test environment, add search_type=dfs_query_then_fetch to the search, which collects the local IDF of each shard to compute a global IDF

Before scoring, the term statistics behind the local IDF of every shard are collected and combined into a global IDF, so that each doc is scored in the context of the docs of all shards, which ensures accurate scores.
This parameter is not recommended in production, though, because it hurts performance.
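The effect of local versus global IDF can be illustrated with a small calculation. The sketch assumes the BM25-style IDF formula ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number of documents containing the term; the shard sizes and counts are made-up numbers for illustration.

```python
import math

def idf(doc_count, doc_freq):
    """BM25-style inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical shards: (documents in shard, documents containing "java")
shards = [(100, 10), (100, 1)]

local_idf = [idf(n, df) for n, df in shards]

total_docs = sum(n for n, _ in shards)
total_df = sum(df for _, df in shards)
global_idf = idf(total_docs, total_df)
```

The same term is weighted very differently on the two shards, while the global IDF lies in between; this is the gap that an even data distribution or dfs_query_then_fetch closes.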

9.10 Multi-field search with the best fields strategy based on dis_max

Add a content field to the posts

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "i like to write best elasticsearch article"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }

Search for posts whose title or content contains java or solution

GET /forum/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "java solution" }},
                { "match": { "content":  "java solution" }}
            ]
        }
    }
}

Analysis of the search results

doc5 is the expected top hit, but doc2 and doc4 rank ahead of it

The relevance score of each document is modeled here with the coordination factor of older versions: the sum of the scores of the matching queries, multiplied by the number of matched queries, divided by the total number of queries

Score of doc4

{ "match": { "title": "java solution" }} produces a score for doc4
{ "match": { "content":  "java solution" }} also produces a score for doc4

The two scores are added, say 1.1 + 1.2 = 2.3; matched queries = 2; total queries = 2; so: 2.3 * 2 / 2 = 2.3

Score of doc5

{ "match": { "title": "java solution" }} produces no score for doc5
{ "match": { "content":  "java solution" }} produces a score for doc5

Only one query produces a score, say 2.3; matched queries = 1; total queries = 2; so: 2.3 * 1 / 2 = 1.15

doc5's score = 1.15 < doc4's score = 2.3

The best fields strategy, dis_max

With the best fields strategy, a document in which a single field matches as many keywords as possible should rank first, rather than a document in which many fields each match only a few keywords

dis_max simply takes the score of the highest-scoring query among several queries

{ "match": { "title": "java solution" }} produces a score for doc4: 1.1
{ "match": { "content":  "java solution" }} also produces a score for doc4: 1.2

Maximum score: 1.2

{ "match": { "title": "java solution" }} produces no score for doc5
{ "match": { "content":  "java solution" }} produces a score for doc5: 2.3

Maximum score: 2.3

doc4's score = 1.2 < doc5's score = 2.3, so doc5 ranks higher, which is what we want

GET /forum/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "java solution" }},
                { "match": { "content":  "java solution" }}
            ]
        }
    }
}
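The two scoring models compared in this section can be replayed with the example numbers from the text. This is a sketch of the arithmetic only: `bool_should_with_coord` and `dis_max` are hypothetical helper names, and the per-field scores are the illustrative values used above, not real Elasticsearch output.

```python
def bool_should_with_coord(scores):
    """Sum of matching clause scores * matched clauses / total clauses."""
    matched = [s for s in scores if s > 0]
    return sum(matched) * len(matched) / len(scores)

def dis_max(scores):
    """Score of the best-matching clause only."""
    return max(scores)

# Per-query scores [title, content] taken from the example in the text.
doc4 = [1.1, 1.2]
doc5 = [0.0, 2.3]

bool_scores = {"doc4": bool_should_with_coord(doc4),
               "doc5": bool_should_with_coord(doc5)}
dis_max_scores = {"doc4": dis_max(doc4), "doc5": dis_max(doc5)}
```

Under the bool model doc4 outranks doc5, while dis_max puts doc5 first.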

9.11 Tuning dis_max results with the tie_breaker parameter

Search for posts whose title or content contains java beginner

GET /forum/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "java beginner" }},
                { "match": { "content":  "java beginner" }}
            ]
        }
    }
}

A situation that may come up in practice:

(1) A post, doc1, has java in the title, and its content contains neither java nor beginner
(2) A post, doc2, has beginner in the content, and its title contains neither keyword
(3) A post, doc3, has java in the title and beginner in the content
(4) The final ranking may put doc1 and doc2 ahead of doc3, rather than doc3 first as we would expect

dis_max only takes the score of the highest-scoring query and completely ignores the scores of the other queries

Use tie_breaker to take the scores of the other queries into account

tie_breaker multiplies the scores of the other queries by tie_breaker and adds them to the score of the best query;
so besides the highest score, the scores of the other queries also count. tie_breaker is a decimal between 0 and 1

GET /forum/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "java beginner" }},
                { "match": { "content":  "java beginner" }}
            ],
            "tie_breaker": 0.3
        }
    }
}
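The effect of tie_breaker is easiest to see as a formula: score = best + tie_breaker * (sum of the other scores). The sketch below applies it to the three-post scenario described above; the per-field scores of 1.0 are assumed values for illustration, not real output.

```python
def dis_max(scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the remaining clause scores."""
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

# Hypothetical per-field scores [title, content]
doc1 = [1.0, 0.0]   # java in the title only
doc2 = [0.0, 1.0]   # beginner in the content only
doc3 = [1.0, 1.0]   # java in the title and beginner in the content

plain = [dis_max(d) for d in (doc1, doc2, doc3)]
with_tie_breaker = [dis_max(d, tie_breaker=0.3) for d in (doc1, doc2, doc3)]
```

Without tie_breaker all three posts tie, so doc3 has no reason to rank first; with tie_breaker 0.3 its second matching field lifts it above the other two.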

9.12 Implementing dis_max + tie_breaker with multi_match

GET /forum/_search
{
  "query": {
    "multi_match": {
        "query":                "java solution",
        "type":                 "best_fields", 
        "fields":               [ "title^2", "content" ],
        "tie_breaker":          0.3,
        "minimum_should_match": "50%" 
    }
  } 
}

GET /forum/_search
{
  "query": {
    "dis_max": {
      "queries":  [
        {
          "match": {
            "title": {
              "query": "java beginner",
              "minimum_should_match": "50%",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "java beginner",
              "minimum_should_match": "30%"
            }
          }
        }
      ],
      "tie_breaker": 0.3
    }
  } 
}

What is minimum_should_match mainly for?

It trims the long tail. What is the long tail? Say you search for 5 keywords but many results match only 1 of them; those results are far from what you want, and they are the long tail;
minimum_should_match controls the precision of the search results: only docs that match at least a given number of keywords are returned
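
A minimal sketch of how a percentage turns into a required number of matching keywords. It assumes a positive percentage that is rounded down, and it only models plain term membership; the helper names are hypothetical.

```python
import math

def required_matches(num_terms, minimum_should_match):
    """Convert "75%" (or a plain integer) into a number of terms that must match."""
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        return math.floor(num_terms * percent / 100)  # percentages are rounded down
    return int(minimum_should_match)

def passes(query_terms, doc_terms, minimum_should_match):
    """True if the doc matches enough of the query terms (at least one in any case)."""
    matched = sum(1 for term in query_terms if term in doc_terms)
    needed = max(1, required_matches(len(query_terms), minimum_should_match))
    return matched >= needed

query = ["java", "elasticsearch", "spark", "hadoop", "solution"]
print(required_matches(len(query), "75%"))                     # 5 terms at 75%
print(passes(query, {"java"}, "75%"))                          # long-tail doc
print(passes(query, {"java", "spark", "hadoop", "x"}, "75%"))  # matches 3 terms
```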

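9.13 Multi-field search with the multi_match + most_fields strategy

Switching from the best_fields strategy to the most_fields strategy

best_fields strategy: docs in which a single field matches as many keywords as possible are returned first

most_fields strategy: docs in which as many fields as possible match the keywords are returned first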

POST /forum/_mapping
{
  "properties": {
      "sub_title": { 
          "type":     "text",
          "analyzer": "english",
          "fields": {
              "std":   { 
                  "type":     "text",
                  "analyzer": "standard"
              }
          }
      }
  }
}

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"sub_title" : "learning more courses"} }
{ "update": { "_id": "2"} }
{ "doc" : {"sub_title" : "learned a lot of course"} }
{ "update": { "_id": "3"} }
{ "doc" : {"sub_title" : "we have a lot of fun"} }
{ "update": { "_id": "4"} }
{ "doc" : {"sub_title" : "both of them are good"} }
{ "update": { "_id": "5"} }
{ "doc" : {"sub_title" : "haha, hello world"} }

GET /forum/_search
{
  "query": {
    "match": {
      "sub_title": "learning courses"
    }
  }
}

sub_title uses the english analyzer, so words are reduced to their stems

Why? Because an analyzer such as the english analyzer applies a stemmer that reduces each word to its base form

learning --> learn
learned --> learn
courses --> course


GET /forum/_search
{
   "query": {
        "multi_match": {
            "query":  "learning courses",
            "type":   "most_fields", 
            "fields": [ "sub_title", "sub_title.std" ]
        }
    }
}

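Differences from best_fields

(1) best_fields searches multiple fields and takes the score of the best-matching field; when the best scores tie, the scores of the other queries are taken into account to some extent (via tie_breaker).
In short, when you search multiple fields, you want docs in which a single field contains as many of the keywords as possible.

Pros: with the best_fields strategy, some consideration of the other fields, and minimum_should_match support, the most precisely matching results can be pushed to the top.

Cons: apart from those precise matches, the remaining results of similar quality are not ordered very evenly and are hard to tell apart.

Real-world example: search engines such as Baidu; the best matches come first, but the rest are not well differentiated

(2) most_fields searches across multiple fields together and lets the queries on as many fields as possible contribute to the total score. The result is a mixed bag, similar to the outcome at the start of the best_fields example, and is not necessarily precise:
one document may have a single field containing more of the keywords, yet other documents rank ahead of it because more of their fields matched;
this is why a field such as sub_title.std is needed: a field that matches the query string exactly contributes a higher score, so the more precise matches are ranked first

Pros: results that match as many fields as possible are pushed to the top, and the overall ordering is fairly even;

Cons: the most precise matches may not make it to the top

Real-world example: wiki, a clear case of the most_fields strategy; the results are fairly even, but you do have to page through several pages to find the best match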

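9.14 Drawbacks of using the most_fields strategy for cross-fields search

cross-fields search: a single identifier spans multiple fields.
For example, a person's identifier is their name; a building's identifier is its address. A name can be spread over several fields, such as first_name and last_name; an address can be spread over country, province, and city.

Searching for one identifier across multiple fields, such as a person's name or an address, is cross-fields search

As a first attempt, most_fields looks like the better fit, because best_fields favors the single best-matching field, while cross-fields search by nature is not about a single field.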

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }

GET /forum/_search
{
  "query": {
    "multi_match": {
      "query":       "Peter Smith",
      "type":        "most_fields",
      "fields":      [ "author_first_name", "author_last_name" ]
    }
  }
}

For the query Peter Smith, doc 2 matches Smith in author_first_name and gets a high score. Why?

Because its IDF score is high. For IDF to be high, the matched term (Smith) must appear in few docs; in the author_first_name field, Smith appears only once

For the actual Peter Smith, doc 1, Smith is in author_last_name, but Smith appears twice in author_last_name, so doc 1 gets a lower IDF score
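
The per-field IDF difference can be checked with a small sketch over the five author documents from the bulk request above. It assumes the textbook Lucene TF/IDF formula 1 + ln(numDocs / (docFreq + 1)); Elasticsearch 7 scores with BM25 by default, so real numbers differ, but the direction of the effect is the same.

```python
import math

# The five author documents indexed above
docs = {
    1: {"author_first_name": "Peter", "author_last_name": "Smith"},
    2: {"author_first_name": "Smith", "author_last_name": "Williams"},
    3: {"author_first_name": "Jack", "author_last_name": "Ma"},
    4: {"author_first_name": "Robbin", "author_last_name": "Li"},
    5: {"author_first_name": "Tonny", "author_last_name": "Peter Smith"},
}

def doc_freq(field, term):
    """Number of docs whose field contains the term (case-insensitive)."""
    return sum(1 for d in docs.values() if term.lower() in d[field].lower().split())

def classic_idf(field, term):
    """Assumed textbook formula: 1 + ln(numDocs / (docFreq + 1))."""
    return 1 + math.log(len(docs) / (doc_freq(field, term) + 1))

for field in ("author_first_name", "author_last_name"):
    print(field, doc_freq(field, "Smith"), round(classic_idf(field, "Smith"), 3))
```

Smith is rarer in author_first_name than in author_last_name, so the match in doc 2 gets the larger IDF.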

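Problem 1: it only finds docs with as many matching fields as possible, not docs in which one field matches completely

Problem 2: with most_fields there is no way to use minimum_should_match to drop the long tail, i.e. the results that match very little

Problem 3: TF/IDF skew, as in the example above: a term that is rare in one field (Smith in author_first_name) gets a very high score, so Smith Williams may rank ahead of Peter Smith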

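9.15 Using copy_to to build a combined field and fix the cross-fields search drawbacks

In the previous section we used the most_fields strategy for cross-fields search; it has 3 major drawbacks, and the search results showed them

First approach: use copy_to to combine multiple fields into one field

The root of the problem is having multiple fields, so we merge the identifier that spans multiple fields into a single field.
For example, a person's name that was split into first_name and last_name is merged into one full_name, and then we simply query full_name.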

PUT /forum/_mapping
{
  "properties": {
      "new_author_first_name": {
          "type":     "text",
          "copy_to":  "new_author_full_name" 
      },
      "new_author_last_name": {
          "type":     "text",
          "copy_to":  "new_author_full_name" 
      },
      "new_author_full_name": {
          "type":     "text"
      }
  }
}

With copy_to, the values of multiple fields are copied into one field, and an inverted index is built for that field

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"new_author_first_name" : "Peter", "new_author_last_name" : "Smith"} }
{ "update": { "_id": "2"} } 
{ "doc" : {"new_author_first_name" : "Smith", "new_author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"new_author_first_name" : "Jack", "new_author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"new_author_first_name" : "Robbin", "new_author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"new_author_first_name" : "Tonny", "new_author_last_name" : "Peter Smith"} }

GET /forum/_search
{
  "query": {
    "match": {
      "new_author_full_name": "Peter Smith"
    }
  }
}

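Problem 1: it only finds docs with as many matching fields as possible, not docs in which one field matches completely --> solved: the best-matching document is returned first

Problem 2: with most_fields there is no way to use minimum_should_match to drop the long tail, i.e. the results that match very little --> solved: minimum_should_match can now be used to drop the long tail

Problem 3: TF/IDF. Take Peter Smith and Smith Williams: when searching Peter Smith, Smith is rare in first_name, so its document frequency there is low and its score is high, and Smith Williams may rank ahead of Peter Smith --> solved: Smith and Peter are now in the same field, so their document frequencies are counted over the same field and there is no extreme skew

9.16 Using native cross_fields to fix the search drawbacks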

GET /forum/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields", 
      "operator": "and",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}

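Problem 1: it only finds docs with as many matching fields as possible, not docs in which one field matches completely --> solved: every term is required to appear in at least one of the fields

Problem 2: with most_fields there is no way to use minimum_should_match to drop the long tail, i.e. the results that match very little --> solved: since every term is required, the long tail is removed

Problem 3: TF/IDF. Take Peter Smith and Smith Williams: when searching Peter Smith, Smith is rare in first_name,
so its document frequency there is low and its score is high, and Smith Williams may rank ahead of Peter Smith --> when IDF is computed,
the IDF of each term is taken from every field and the minimum is used, so extreme high values no longer occur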
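
A simplified sketch of the "take the minimum IDF" idea described above. It is a model of the description in these notes, not Elasticsearch's actual implementation; the formula is the assumed textbook one, and the document frequencies are those of Smith in the five sample docs.

```python
import math

def classic_idf(num_docs, doc_freq):
    """Assumed textbook formula: 1 + ln(numDocs / (docFreq + 1))."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def blended_idf(num_docs, doc_freq_per_field):
    """Lowest IDF over all fields, i.e. the one from the highest document frequency."""
    return min(classic_idf(num_docs, df) for df in doc_freq_per_field.values())

# Document frequency of "smith" per field in the 5 sample docs
smith_df = {"author_first_name": 1, "author_last_name": 2}

print(round(classic_idf(5, smith_df["author_first_name"]), 3))  # inflated per-field value
print(round(blended_idf(5, smith_df), 3))                       # value used for every field
```

Because both fields now share the same, lower IDF for Smith, the rare occurrence in author_first_name no longer dominates the ranking.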

9.17 Phrase matching

If we want docs in which java and spark are close together to be returned first, with a higher relevance score, we need proximity match

Requirements:

java spark must be adjacent, with nothing in between; only such docs should be returned
java spark: the closer the two words java and spark are, the higher the doc's score and rank

A match full-text query cannot satisfy these two requirements; proximity match is needed

phrase match, proximity match

Test data for finding docs that contain `java and elasticsearch` with match_phrase

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "java elasticsearch is friend"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "java and elasticsearch very good"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "this is spark blog"} }

Use match_phrase to find the docs that contain java and elasticsearch

GET /forum/_search
{
    "query": {
        "match_phrase": {
            "content": "java and elasticsearch"
        }
    }
}

How match_phrase works

A simple example: given the following 2 documents, we want match_phrase to match java elasticsearch

doc1 : hello, java elasticsearch
doc2 : hello, elasticsearch java

First an inverted index like the following is built from the documents; each entry stores the position of the term in the doc

hello -------------- [doc1(1), doc2(1)]
java --------------- [doc1(2), doc2(3)]
elasticsearch ------ [doc1(3), doc2(2)]

At query time, the query text java elasticsearch is analyzed into java and elasticsearch, and the inverted index lookup returns the candidate docs doc1 and doc2;

The candidates are then checked further:

doc1 -->> java-doc1(2), elasticsearch-doc1(3); the position of elasticsearch is exactly 1 greater than that of java, which matches the order in the query, so doc1 qualifies;

doc2 -->> java-doc2(3), elasticsearch-doc2(2); the position of elasticsearch is 1 less than that of java, which does not match the order in the query, so doc2 does not qualify;

In the end only doc1 matches.
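
The position check above can be sketched in Python. This is a toy positional index with naive whitespace tokenization, not Lucene's implementation; function names are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """term -> {doc_id: [positions]}, positions are 1-based as in the table above."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = text.lower().replace(",", " ").split()
        for pos, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def match_phrase(index, phrase):
    terms = phrase.lower().split()
    # Step 1: candidate docs must contain every term
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    hits = []
    for doc_id in sorted(candidates):
        # Step 2: the i-th term must sit exactly i positions after the first term
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

docs = {"doc1": "hello, java elasticsearch", "doc2": "hello, elasticsearch java"}
index = build_index(docs)
print(match_phrase(index, "java elasticsearch"))  # ['doc1']
```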

9.18 Proximity matching with the slop parameter: principle and experiments

GET /forum/_search
{
    "query": {
        "match_phrase": {
            "content": {
                "query": "java elasticsearch",
                "slop":  1
            }
        }
    }
}

Meaning of slop: the number of moves the terms of the query string need in order to line up with a document is the slop;
setting slop here means the maximum number of moves allowed during matching;

In fact, a phrase match with slop is a proximity match
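
A simplified model of counting the moves, assuming each query term occurs once in the query: shift every matched position back by the term's offset in the query, and take the spread of the shifted positions. This is a sketch of the idea, not Lucene's sloppy phrase matcher; the function names are hypothetical.

```python
from itertools import product

def min_moves(query_terms, doc_tokens):
    """Smallest number of moves needed to line the query up with the doc,
    or None if some query term is missing from the doc."""
    positions = []
    for term in query_terms:
        found = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not found:
            return None
        positions.append(found)
    best = None
    for combo in product(*positions):
        # An exact phrase match gives identical shifted values (spread 0)
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        moves = max(shifted) - min(shifted)
        best = moves if best is None else min(best, moves)
    return best

def matches_with_slop(query, doc, slop):
    moves = min_moves(query.split(), doc.replace(",", " ").split())
    return moves is not None and moves <= slop

print(min_moves(["java", "elasticsearch"], "java and elasticsearch very good".split()))
print(matches_with_slop("java elasticsearch", "java and elasticsearch very good", 1))
```

In this model, one word between the two terms costs 1 move, while the reversed order elasticsearch java costs 2 moves.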

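9.19 Combining match and proximity match to balance recall and precision

Recall: when searching java elasticsearch over 100 docs in total, how many docs can be returned as results; that is recall

Precision: when searching java elasticsearch, whether docs that contain java elasticsearch, or in which java and elasticsearch are very close, can be ranked at the top; that is precision

Using match_phrase alone means a doc matches only if all terms appear in the field and their distance is within the slop limit

match_phrase and proximity match require a doc to contain all the terms before it is returned; a doc that is missing any one term is not returned

So with proximity match, recall is low and precision is very high

Sometimes, though, we want a doc that matches only some of the terms to be returned too, which raises recall.
At the same time we want match_phrase's distance-based scoring, so that the closer the terms are, the higher the score and the earlier the doc is returned.
That is, recall comes first: for the search java elasticsearch, docs containing java are returned, docs containing elasticsearch are returned, and docs containing both are returned;
precision is kept as well: among docs containing both java and elasticsearch, the closer the two terms are, the higher the doc ranks

This can be done by combining a match query and a match_phrase query with bool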

GET /forum/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { 
          "content": {
            "query": "java elasticsearch" 
          }
        }
      },
      "should": {
        "match_phrase": { 
          "content": {
            "query": "java elasticsearch",
            "slop":  50
          }
        }
      }
    }
  }
}

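The match query matches docs containing java, elasticsearch, or both, and ranks docs with both terms higher, but it cannot tell how far apart java and elasticsearch are; a doc in which they are very close is not necessarily ranked first
The match_phrase query contributes its own relevance score to a doc whenever java elasticsearch matches the doc within the slop; the closer java and elasticsearch are, the higher that score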

9.20 Using rescoring to improve proximity match performance

Differences between match and phrase match (proximity match)

match: as soon as a term is matched, the docs for that term can be returned; it scans the inverted index and a hit is enough

phrase match: first fetch the doc list of every term; find the docs that contain all the terms; then, for each doc, check whether the position of each term falls in the required range; with slop, extra computation is needed to decide whether the terms can be moved within slop to match the doc

A match query is much faster than phrase match and proximity match (with slop), because the latter two both have to compute position distances.
As a rough rule of thumb, a match query is about 10 times faster than phrase match and about 20 times faster than proximity match.

Don't worry too much, though: es usually responds in milliseconds. A match query typically takes a few milliseconds to a few tens of milliseconds, while phrase match and proximity match take tens to hundreds of milliseconds, which is still acceptable.

To speed up proximity match, reduce the number of documents it has to examine.
The idea is to select the data with a match query first, then use proximity match to raise the score of docs based on term distance,
with proximity match applied only to the top n docs by score on each shard to readjust their scores; this process is called rescoring.
Users usually page through results and only look at the first few pages, so proximity match does not need to run on every result.

As described earlier, match + proximity match gives both recall and precision

By default, match may hit 1000 docs, and proximity match then has to run on each of them, checking whether slop moves can produce a match, and then contributing its score
In many cases, though, users page through those 1000 docs and look at only the first few pages; with 10 docs per page and at most 5 pages, that is 50 docs
Proximity match then only needs to run on the top 50 docs and contribute its score to them; it does not need to compute and score all 1000 docs

GET /forum/_search 
{
  "query": {
    "match": {
      "content": "java elasticsearch"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "content": {
            "query": "java elasticsearch",
            "slop": 50
          }
        }
      }
    }
  }
}
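
The rescoring idea can be sketched as follows: only the top window_size hits of the cheap first pass are rescored by the expensive query. This is an illustration under simplifying assumptions (one shard, scores combined by a weighted sum); query_weight and rescore_query_weight mirror the rescore options of the same name, everything else is hypothetical.

```python
def rescore(hits, rescore_fn, window_size, query_weight=1.0, rescore_query_weight=1.0):
    """hits: list of (doc_id, score) from the cheap first-pass match query.
    rescore_fn: expensive scorer (e.g. proximity score), called only inside the window."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    rescored = [
        (doc_id, query_weight * score + rescore_query_weight * rescore_fn(doc_id))
        for doc_id, score in window
    ]
    rescored.sort(key=lambda h: h[1], reverse=True)
    # Docs outside the window keep their first-pass score and order
    return rescored + rest

# Hypothetical first-pass scores and proximity scores
first_pass = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
proximity = {"b": 5.0, "c": 9.0}
calls = []

def proximity_score(doc_id):
    calls.append(doc_id)            # record how often the expensive scorer runs
    return proximity.get(doc_id, 0.0)

print(rescore(first_pass, proximity_score, window_size=2))
print(calls)  # only the docs inside the window were rescored
```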

9.21 Prefix, wildcard, and regexp search in practice

Prefix search

GET /forum/_search
{
  "query": {
    "prefix": {
      "articleID.keyword": {
        "value": "X"
      }
    }
  }
}

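How prefix search works: a prefix query does not compute a relevance score; its only difference from a prefix filter is that a filter caches its bitset. It scans the whole inverted index. The shorter the prefix, the more docs have to be processed and the worse the performance, so use a long prefix where possible

How is a prefix search executed, and why is it slow?

It scans the entire inverted index against the prefix, matching terms one by one and returning the results; that is why it is slow

Wildcard search

Similar to prefix search; wildcards express more complex fuzzy matching and are more powerful

5?-*5: a 5, then any single character, then -, then any number of characters, then 5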

GET /forum/_search
{
  "query": {
    "wildcard": {
      "articleID": {
        "value": "X?K*5"
      }
    }
  }
}

?: any single character; *: zero or more characters

Performance is just as poor; the whole inverted index must be scanned
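
The wildcard semantics and the full scan can be illustrated with Python's fnmatch, whose ? and * behave the same way for a pattern like X?K*5. The term list is made up for illustration, and fnmatch also supports [..] character sets, which this sketch does not use.

```python
from fnmatch import fnmatchcase

def wildcard_scan(terms, pattern):
    """Check every term in the dictionary against the pattern (the full scan)."""
    return [t for t in terms if fnmatchcase(t, pattern)]

# Hypothetical article IDs, invented for illustration
terms = ["XHDK-A-1293-#fJ3", "XAKB-C-0005", "JODL-X-1937-#pV7", "XZK5"]
print(wildcard_scan(terms, "X?K*5"))
```

Every term is tested, which is why the cost grows with the size of the term dictionary.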

Regexp search

GET /forum/_search
{
  "query": {
    "regexp": {
      "articleID": "X[0-9].+"
    }
  }
}

wildcard and regexp work the same way as prefix: both scan the whole index, and performance is poor

9.22 Search-time suggestions with match_phrase_prefix

GET /forum/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "java e"
    }
  }
}

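It works like match_phrase; the only difference is that the last term is treated as a prefix

Rough flow:

The search text java e is first analyzed into java and e;
java is then used in a match search to find its docs;
e is used as a prefix to scan the inverted index and find all docs with a term starting with e;
then, among all docs, those that contain both java and a term starting with e are found; slop is applied to check whether, within the slop range, java e
lines up with the positions of java and the e-prefixed word in the doc; slop can be specified, but only the last term is treated as a prefix.

max_expansions: the maximum number of terms the prefix may expand to; matching stops beyond this number, which bounds the cost

Without a limit, the prefix has to scan all terms in the inverted index to find the words starting with e, which is slow. max_expansions limits how many terms the e prefix may match before the scan of the inverted index stops.

Avoid it where possible, because the final prefix always has to scan a large part of the index and performance can be poor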
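
A small sketch of how max_expansions bounds the prefix expansion. It assumes the term dictionary is a sorted list, so the scan can start at the first candidate; the default of 50 matches the documented default of match_phrase_prefix, and the term list is made up.

```python
from bisect import bisect_left

def expand_prefix(sorted_terms, prefix, max_expansions=50):
    """Collect terms that start with prefix, stopping after max_expansions terms."""
    expansions = []
    i = bisect_left(sorted_terms, prefix)   # first term >= prefix
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        if len(expansions) == max_expansions:
            break                            # limit reached, stop scanning
        expansions.append(sorted_terms[i])
        i += 1
    return expansions

# Hypothetical term dictionary of the content field
terms = sorted(["each", "elastic", "elasticsearch", "engine", "java", "spark"])
print(expand_prefix(terms, "e"))
print(expand_prefix(terms, "e", max_expansions=2))
```

A small max_expansions keeps the query cheap, at the price of possibly missing terms further down the dictionary.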

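9.23 Index-time suggestions with ngram analysis

ngram and how index-time suggestions work

What is an ngram? A word is split into pieces of a given length. For example:

quick, ngrams at 5 lengths

ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick

What is an edge ngram? The first letter is fixed and letters are appended one at a time. For example:

quick, ngrams anchored at the first letter

q
qu
qui
quic
quick

With edge ngram, every word is split further at index time, and the resulting ngrams are used to implement prefix suggestions

At search time there is no need to scan the whole inverted index for a prefix; the prefix is looked up in the inverted index directly, and a hit is a match.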
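
Both kinds of ngram are easy to reproduce in Python. This is a sketch of the splitting only; min_gram and max_gram are named after the filter settings used in the example that follows, and the tokenization is a plain whitespace split.

```python
def ngrams(word, n):
    """All substrings of length n, sliding one character at a time."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word, min_gram=1, max_gram=20):
    """Prefixes of the word with lengths from min_gram to max_gram."""
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]

def index_terms(text, min_gram=1, max_gram=20):
    """Terms that would be indexed for a text: the edge ngrams of every word."""
    terms = set()
    for word in text.lower().split():
        terms.update(edge_ngrams(word, min_gram, max_gram))
    return terms

print(ngrams("quick", 2))        # ['qu', 'ui', 'ic', 'ck']
print(edge_ngrams("quick"))      # ['q', 'qu', 'qui', 'quic', 'quick']
print("w" in index_terms("hello wiki"))  # a typed prefix is a direct index lookup
```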

ngram example

Configure ngram

PUT /ngram-demo
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

Check the analysis output

GET /ngram-demo/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}

Set the mapping

PUT /ngram-demo/_mapping
{
  "properties": {
      "title": {
          "type":     "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
      }
  }
}

Add test data

POST /ngram-demo/_bulk
{ "index": { "_id": 1 }}
{ "title" : "hello wiki", "userID" : 1, "hidden": false }

Query the data

GET /ngram-demo/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

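With match, docs that contain only hello are returned too, as in any full-text search, just with a lower score;
match_phrase is recommended: it requires every term to be present with positions exactly 1 apart, which is what we expect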

9.24 TF&IDF and the vector space model in depth

boolean model:

Logical operators such as and are used first to filter the docs that contain the given terms; for example:

query "hello world" --> filter --> hello / world / hello & world
bool --> must/must not/should --> filter --> contains / does not contain / may contain
doc --> no score --> true or false --> reduces the number of docs to score later and improves performance

Score of a single term in a doc

TF: term frequency

The more often a term appears in a doc, the higher the relevance score

IDF: inverse document frequency

The more docs in the whole index a term appears in, the lower the relevance score

length norm: the length of the searched field; the longer the field, the lower the relevance score; the shorter the field, the higher the relevance score

Finally, TF, IDF, and length norm are combined into one overall score of this term for the doc
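
The three factors can be combined in a short sketch. It assumes the textbook Lucene TF/IDF formulas (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(field length)); Elasticsearch 7 uses BM25 by default, so this shows the trends rather than real scores.

```python
import math

def tf(freq):
    """More occurrences in the doc -> higher score."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Term present in more docs -> lower score."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms_in_field):
    """Longer field -> lower score."""
    return 1 / math.sqrt(num_terms_in_field)

def term_score(freq, num_docs, doc_freq, field_length):
    return tf(freq) * idf(num_docs, doc_freq) * length_norm(field_length)

# Hypothetical numbers: 100 docs, term found in 9 of them
print(term_score(freq=1, num_docs=100, doc_freq=9, field_length=4))
print(term_score(freq=4, num_docs=100, doc_freq=9, field_length=4))   # higher TF
print(term_score(freq=1, num_docs=100, doc_freq=49, field_length=4))  # more common term
print(term_score(freq=1, num_docs=100, doc_freq=9, field_length=16))  # longer field
```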

vector space model: the total score of multiple terms for one doc

es computes a query vector from the scores of the search terms across all docs;
for each doc it computes a score per term, and the scores of all terms form a doc vector;

Plotted in one diagram, the angle between each doc vector and the query vector gives each doc's total score for the terms

For each doc vector the angle to the query vector is computed, and the doc's total score for the query terms is derived from that angle
The larger the angle, the lower the score; the smaller the angle, the higher the score

With more than two or three terms this is computed with linear algebra and can no longer be drawn
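
The angle comparison is usually expressed as cosine similarity, which works for any number of terms. A minimal sketch with made-up term weights for the query hello world:

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical weights: (hello, world)
query = [2.0, 5.0]
doc1 = [2.0, 0.0]   # contains only hello
doc2 = [0.0, 5.0]   # contains only world
doc3 = [2.0, 5.0]   # contains both

for name, vec in (("doc1", doc1), ("doc2", doc2), ("doc3", doc3)):
    print(name, round(cosine_similarity(query, vec), 3))
```

A smaller angle means a cosine closer to 1, so doc3 scores highest and doc1 lowest.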

9.25 Four common ways to tune relevance scores

query-time boost

GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "java spark",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": "java spark"
          }
        }
      ]
    }
  }
}

Restructuring the query

Restructuring the query has less and less effect in newer versions of es. In general, you can skip it unless you really need it.

 GET /forum/_search
 {
   "query": {
     "bool": {
       "should": [
         {
           "match": {
             "content": "java"
           }
         },
         {
           "match": {
             "content": "spark"
           }
         },
         {
           "bool": {
             "should": [
               {
                 "match": {
                   "content": "solution"
                 }
               },
               {
                 "match": {
                   "content": "beginner"
                 }
               }
             ]
           }
         }
       ]
     }
   }
 }

negative boost: lowering relevance

GET /forum/_search 
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": "java"
        }
      },
      "negative": {
        "match": {
          "content": "spark"
        }
      },
      "negative_boost": 0.2
    }
  }
}

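The score of docs that match the negative query is multiplied by negative_boost, which lowers it

constant_score

If you do not need relevance scoring at all, use constant_score with a filter; each constant_score clause gives every matching doc a fixed score of 1, and relevance scoring no longer applies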

GET /forum/_search 
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "match": {
                "title": "java"
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {
                "title": "spark"
              }
            }
          }
        }
      ]
    }
  }
}

9.26 Customizing the relevance score with function_score

With function_score we can define our own function that combines the value of some field with the score computed by es, so that a field of our choice boosts the score

Data preparation

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"follower_num" : 5} }
{ "update": { "_id": "2"} }
{ "doc" : {"follower_num" : 10} }
{ "update": { "_id": "3"} }
{ "doc" : {"follower_num" : 25} }
{ "update": { "_id": "4"} }
{ "doc" : {"follower_num" : 3} }
{ "update": { "_id": "5"} }
{ "doc" : {"follower_num" : 60} }

Combine the search score with follower_num so that follower_num boosts the score to some extent; the larger follower_num is, the higher the score

GET /forum/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "java spark",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "follower_num",
        "modifier": "log1p",
        "factor": 0.5
      },
      "boost_mode": "sum",
      "max_boost": 2
    }
  }
}

With only field set, each doc's score is multiplied by follower_num; if a doc has 0 followers its score becomes 0, which works poorly.
So a log1p modifier is usually added, giving new_score = old_score * log(1 + follower_num), which yields more reasonable scores;
adding factor adjusts the score further: new_score = old_score * log(1 + factor * follower_num);
boost_mode decides how the query score and the function value are combined: multiply, sum, min, max, replace;
max_boost caps the value produced by the function at the given maximum
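
The query above can be traced by hand with a short sketch. It assumes log1p means log10(1 + factor * value) and that max_boost caps the function value before it is combined with the query score; only the sum and multiply modes are modeled, and the query scores are made up.

```python
import math

def field_value_factor(value, factor=1.0, modifier="none"):
    scaled = factor * value
    if modifier == "log1p":
        return math.log10(1 + scaled)
    if modifier == "none":
        return scaled
    raise ValueError(f"modifier not modeled in this sketch: {modifier}")

def function_score(query_score, follower_num, factor=0.5,
                   boost_mode="sum", max_boost=2.0):
    # The function value is capped by max_boost before it is combined
    boost = min(field_value_factor(follower_num, factor, "log1p"), max_boost)
    if boost_mode == "sum":
        return query_score + boost
    if boost_mode == "multiply":
        return query_score * boost
    raise ValueError(f"boost_mode not modeled in this sketch: {boost_mode}")

# Hypothetical query score of 1.0 for every doc, follower_num from the test data
for followers in (3, 5, 10, 25, 60):
    print(followers, round(function_score(1.0, followers), 3))
```

With boost_mode sum, a doc with 0 followers keeps its query score instead of dropping to 0.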

9.27 Fuzzy search for misspelled input

The search text typed by a user may contain misspellings

fuzzy search --> automatically corrects the misspelled search text and then tries to match the data in the index

The user wants hello but leaves off the o

GET /forum/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "hell",
        "fuzziness": 2
      }
    }
  }
}

fuzziness is the maximum number of edits allowed; the largest accepted value is 2, and when it is omitted AUTO is used
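
The number of edits is an edit distance. The sketch below uses plain Levenshtein distance (insert, delete, substitute); Elasticsearch by default also counts a swap of two adjacent characters as one edit, which this sketch does not model. The term list is made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def fuzzy_terms(indexed_terms, value, fuzziness=2):
    """Indexed terms within the allowed number of edits of the search value."""
    return [t for t in indexed_terms if edit_distance(value, t) <= fuzziness]

print(edit_distance("hell", "hello"))   # 1
print(edit_distance("helio", "hello"))  # 1
print(fuzzy_terms(["hello", "help", "world"], "hell", fuzziness=1))
```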

GET /forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "helio",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}

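10. IK Chinese analyzer

Installation

Download the release package from GitHub (or build it yourself):

https://github.com/medcl/elasticsearch-analysis-ik

Copy the extracted files into the es Docker container (or mount the directory as a volume):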

docker cp /home/ik es7:/usr/share/elasticsearch/plugins/

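The ik directory is the directory extracted from the package
If you are not sure where the plugins directory is, enter the container with docker exec -it es7 /bin/bash and look

Then restart es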

docker restart es7

ik analyzer basics

There are two analyzers; choose according to your needs, but ik_max_word is the usual choice

ik_max_word: splits text at the finest granularity; for example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting all possible combinations;

ik_smart: splits text at the coarsest granularity; for example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”.

Using the ik analyzer

Configure the mapping:

PUT /news
{
  "mappings": {
    "properties": {
      "content":{
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

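IK configuration files and custom dictionaries

IKAnalyzer.cfg.xml: configures the custom dictionaries
main.dic: ik's built-in Chinese dictionary, more than 270,000 entries; any of these words is kept together as one token
quantifier.dic: unit and measure words
suffix.dic: suffixes
surname.dic: Chinese surnames
stopword.dic: English stop words

The two most important built-in files

main.dic: the built-in Chinese words; text is segmented according to the words in it

stopword.dic: the English stop words

Stop words are generally dropped during analysis and are not added to the inverted index

Custom dictionaries

(1) Your own word dictionary: every year new buzzwords emerge, such as 网红, 蓝瘦香菇, 喊麦, 鬼畜, which are usually not in ik's built-in dictionary

Add your own new words to ik's dictionary

Configure the custom dictionary in IKAnalyzer.cfg.xml under ext_dict, for example custom/mydict.dic

After adding your words, es must be restarted for them to take effect

(2) Your own stop word dictionary: words such as 了, 的, 啥, 么, which we may not want to index and make searchable

custom/ext_stopword.dic already contains common Chinese stop words; add your own there, then restart es

Modifying the IK source to hot-reload the dictionary from mysql

Hot reload

Manually adding new words to the es extension dictionary every time is painful
(1) es has to be restarted after every addition, which is a hassle
(2) es is distributed and may have hundreds of nodes; you cannot edit every node one by one

What we want: without stopping es, add new words somewhere external and have es hot-load them right away

Hot reload options

(1) Modify the ik source so that it loads new words from mysql automatically at a fixed interval
(2) Use ik's built-in hot reload: deploy a web server that exposes an http endpoint and signals dictionary changes through the Last-Modified and ETag http response headers

Go with option (1); option (2) is not recommended even by the ik community on GitHub, as it is considered not very stable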

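11. ICU analyzer

The ICU Analysis plugin is a set of libraries that integrate the Lucene ICU module into Elasticsearch.
Essentially, ICU adds support for Unicode and globalization, providing better text segmentation for Asian languages, along with many token filters needed for correct matching and sorting in languages other than English.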
