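1. What is recall?
Suppose you search for java spark in an index of 100 docs. Recall measures how many of the docs that are relevant to the query actually come back as results.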
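As a rough illustration, recall can be computed from two sets of doc ids. The sketch below also includes precision, which section 2 covers, for contrast. The function names and the doc ids are hypothetical and are not taken from the forum index used later.

```python
def recall(returned: set, relevant: set) -> float:
    """Fraction of the relevant docs that the search actually returned."""
    if not relevant:
        return 0.0
    return len(returned & relevant) / len(relevant)


def precision(returned: set, relevant: set) -> float:
    """Fraction of the returned docs that are actually relevant."""
    if not returned:
        return 0.0
    return len(returned & relevant) / len(returned)


# Hypothetical doc ids, for illustration only.
relevant = {1, 2, 3, 4}   # docs a user would consider relevant
returned = {2, 3, 9}      # docs the search returned

print(recall(returned, relevant))     # 0.5
print(precision(returned, relevant))  # 0.6666666666666666
```

Returning more docs can only keep recall the same or raise it, but it usually lowers precision, which is the trade-off the rest of this post works around.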
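2. What is precision?
Suppose you search for java spark. Precision is the share of returned docs that are actually relevant; in practice here it means whether docs that contain java spark as a phrase, or that have java and spark very close together, are ranked first. Using a match_phrase search directly means a doc matches only if every term appears in the doc field and the terms are within the distance allowed by slop.
match_phrase and proximity match require a doc to contain all of the terms before it can be returned. If a doc is missing even one term, it is not returned.
For example:
java spark --> hello world java : no match
java spark --> hello world, java spark : match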
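The difference between the two behaviours can be sketched in a few lines. This is a simplified model, not Lucene's actual implementation: `tokenize` is a rough stand-in for an analyzer, `match_any` mimics a match query with the default OR operator, and `phrase_match` treats slop as the largest allowed spread between term positions after shifting each term by its place in the query. All three names are hypothetical.

```python
import re
from itertools import product


def tokenize(text: str) -> list:
    """Lowercase and split on non-alphanumerics (rough analyzer stand-in)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_any(query: str, doc: str) -> bool:
    """match-like behaviour with OR: one matching term is enough."""
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))


def phrase_match(query: str, doc: str, slop: int = 0) -> bool:
    """Simplified match_phrase: all terms present and close enough together."""
    q_terms = tokenize(query)
    d_terms = tokenize(doc)
    # Positions of each query term inside the doc.
    positions = [[i for i, t in enumerate(d_terms) if t == q] for q in q_terms]
    if not q_terms or any(not p for p in positions):
        return False  # at least one query term is missing from the doc
    for combo in product(*positions):
        # Shift each position by the term's offset in the query; an exact
        # phrase then has identical shifted positions (spread 0).
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


print(match_any("java spark", "hello world java"))            # True
print(phrase_match("java spark", "hello world java"))         # False
print(phrase_match("java spark", "hello world, java spark"))  # True
```

The first doc is found by the match-style check but rejected by the phrase check because spark is missing, which is exactly the recall gap described above.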
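3. The problem
With proximity matching, recall is fairly low and precision is very high. Sometimes, though, we want a doc to be returned as long as it matches some of the terms, which raises recall. At the same time we still want match_phrase's ability to boost the score by distance, so that the closer the terms are to each other, the higher the score and the earlier the doc is returned.
In other words, recall comes first. For example:
java spark --> docs containing java are returned, docs containing spark are returned, and docs containing both java and spark are returned. Precision is still taken into account: among docs containing both java and spark, the closer the two terms are, the higher the doc ranks.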
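4. The solution
Combine a match query and a match_phrase query inside a bool query to get this effect.
match raises recall: docs with java and docs with spark are all returned.
match_phrase raises precision: docs containing both java and spark rank first.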
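If the request body is assembled in application code, the combination might look like the following sketch. The helper name `recall_then_precision_query` is hypothetical, and the sketch only builds the JSON body; sending it to Elasticsearch is left out.

```python
import json


def recall_then_precision_query(field: str, text: str, slop: int) -> dict:
    """Build a bool query: `must` match for recall, `should` match_phrase for ranking."""
    return {
        "query": {
            "bool": {
                # Any doc containing at least one of the terms is returned.
                "must": [{"match": {field: text}}],
                # Docs whose terms are also within `slop` get an extra score.
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }


body = recall_then_precision_query("content", "java spark", slop=50)
print(json.dumps(body, indent=2))
```

Because the match_phrase clause sits under should, it never excludes a doc; it only adds to the score of docs that satisfy it.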
Effect 1: bool with a match query only
GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ]
    }
  }
}
Result:
{
  "took": 54,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.68640786,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.68324494,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
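The result shows that a doc containing only one of java and spark is returned as well. However, the doc containing only java ranks first, while the doc containing both java and spark ranks last.
Effect 2: match_phrase only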
GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 50
      }
    }
  }
}
Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
This time only the doc containing both java and spark is returned, so recall has dropped.
Final version: combine the two, using both the bool match query and match_phrase
GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "java spark",
              "slop": 50
            }
          }
        }
      ]
    }
  }
}
Result:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      }
    ]
  }
}
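The result is what we wanted: the doc containing both terms ranks first, with a score well above the second doc, and recall is still high. The top score, 1.258609, is the match score of doc 5 from effect 1 (0.68324494) plus its match_phrase score from effect 2 (0.5753642), because a bool query adds up the scores of its matching clauses.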
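The final ranking can be reproduced by hand from the two earlier results. The sketch below is only a model of how the clause scores combine in this particular run, with the scores copied from the responses above; the `combine` helper is hypothetical.

```python
# Scores copied from the two single-query results above.
match_scores = {"2": 0.68640786, "5": 0.68324494}  # bool + match
phrase_scores = {"5": 0.5753642}                   # match_phrase, slop 50


def combine(must_scores: dict, should_scores: dict) -> list:
    """Every doc must match `must`; a matching `should` clause adds its score."""
    combined = {
        doc_id: score + should_scores.get(doc_id, 0.0)
        for doc_id, score in must_scores.items()
    }
    # Highest combined score first, as in the search response.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


for doc_id, score in combine(match_scores, phrase_scores):
    print(doc_id, round(score, 6))
# 5 1.258609
# 2 0.686408
```

Doc 2 keeps its match score unchanged because it does not satisfy the should clause, so it stays in the results but drops below doc 5.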
