20. Elasticsearch: Combining match and match_phrase to Balance Recall and Precision

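1. What is recall?

Suppose you search for java spark over an index holding 100 docs. Recall describes how many of those docs come back as results. Strictly, it is the share of relevant docs that the query returns, so a query that can match more of them has higher recall.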

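2. What is precision?

Suppose you search for java spark again. Precision here means getting the docs that contain the phrase java spark, or that have java and spark close together, ranked at the top.

Using a match_phrase query on its own achieves this, but at a cost: every term must appear in the doc field, and the terms must lie within the distance allowed by slop, or the doc does not match at all.

In other words, match_phrase (proximity match) only returns docs that contain all of the terms. A doc that is missing even one term is never returned.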

For example:
java spark -> hello world java : no match
java spark -> hello world,java spark : match
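The difference between the two queries on these two docs can be reproduced with a toy sketch in Python. It is not how Elasticsearch works internally: it assumes a simple lowercase tokenizer in place of a real analyzer, models match with its default OR behavior, and models match_phrase with slop 0 only. The function names are made up for illustration.

```python
import re

def tokenize(text):
    # Lowercase and split on non-word characters; a stand-in for a real analyzer.
    return re.findall(r"\w+", text.lower())

def toy_match(query, doc):
    # match with the default OR operator: one query term in the doc is enough.
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))

def toy_match_phrase(query, doc):
    # match_phrase with slop 0: all terms must appear, adjacent and in order.
    q, d = tokenize(query), tokenize(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

docs = ["hello world java", "hello world,java spark"]
for doc in docs:
    print(doc, "| match:", toy_match("java spark", doc),
          "| match_phrase:", toy_match_phrase("java spark", doc))
```

The first doc passes toy_match but fails toy_match_phrase because spark is missing. The second passes both.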

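3. The problem
Proximity matching gives high precision but low recall. Sometimes we want a doc to be returned when it matches only some of the terms, which raises recall. At the same time we still want the match_phrase behavior of scoring by distance, so that the closer the terms are to each other, the higher the score and the earlier the doc is returned.

That is, recall comes first. For example:
java spark -> docs containing java are returned, docs containing spark are returned, and docs containing both are returned. Precision is handled through ranking: among docs containing both java and spark, the closer the two terms are, the higher the doc ranks.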

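4. The solution
Combine a match query and a match_phrase query inside a bool query.

match raises recall: docs containing java and docs containing spark are all returned.
match_phrase raises precision: docs containing both java and spark are pushed to the top.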
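As a rough sketch, the combined request body can be assembled in Python as a plain dict. The helper name build_recall_precision_query is hypothetical and not part of any Elasticsearch client; the field name, query text, and slop value are parameters you would choose yourself.

```python
import json

def build_recall_precision_query(field, text, slop):
    # must: the match clause decides which docs are returned (recall).
    # should: the match_phrase clause is optional and only adds score (precision).
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }

body = build_recall_precision_query("content", "java spark", 50)
print(json.dumps(body, indent=2))
```

Putting match_phrase under should rather than must is the key design choice: a doc that fails the phrase clause is still returned, it just misses the extra score.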

Effect 1: a bool query with match only

GET /forum/article/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ]
    }
  }
}

Result:

{
  "took": 54,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.68640786,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.68324494,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}

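Docs that contain only one of the terms are returned as well, which is good for recall. However, the doc containing only java ranks first, while the doc containing both java and spark ranks last.

Effect 2: match_phrase only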

GET /forum/article/_search 
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop" : 50
      }
    }
  }
}

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}

Only the doc containing both java and spark is returned, so recall has dropped.

Final effect: combine the two, using both the match query and match_phrase inside one bool query

GET /forum/article/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "java spark",
              "slop" : 50
            }
          }
        }
      ]
    }
  }
}

Result:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      }
    ]
  }
}

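This is the result we wanted: the doc containing both terms ranks first, with a score well above the second doc, and recall stays high because both docs are still returned.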
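The jump in score for doc 5 can be checked by hand. The sketch below assumes the bool query adds up the scores of the must and should clauses a doc matches, and copies the three scores for doc 5 from the results above.

```python
match_only_score = 0.68324494   # doc 5 under the match-only query
phrase_only_score = 0.5753642   # doc 5 under the match_phrase-only query
combined_score = 1.258609       # doc 5 under the combined bool query

total = match_only_score + phrase_only_score
print(round(total, 6))

# The sum agrees with the combined score up to the precision shown in the response.
assert abs(total - combined_score) < 1e-6
```

Doc 2 does not match the match_phrase clause, so it gets no extra score and keeps its match-only score of 0.68640786.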
