Notes on how Elasticsearch/Lucene computes the relevance between a document and a query. Contents:

Lucene/ES scoring mechanism

Lucene’s Practical Scoring Function

Query-Time Boosting

Ignoring TF/IDF

Pluggable Similarity Algorithms

Lucene/ES scoring mechanism

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
http://mp.weixin.qq.com/s/By340-7g5rDxVKehY1izeQ

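ES uses the Boolean model to find matching documents; a practical scoring function (TF/IDF, BM25) to compute the relevance between a document and the query; and the vector space model to bring in additional factors (such as queryNorm, coord, norm, and boost).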

Note: a query normally targets a specific field, i.e. score(field, query). If no field is specified and the _all field is enabled, the query runs against the whole document, i.e. score(doc, query).

Query & Term

query = quick brown fox
term1 = quick 
term2 = brown
term3 = fox

Boolean Model

full AND text AND search AND (elasticsearch OR lucene).

Term Frequency/Inverse Document Frequency (TF/IDF)

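Term frequency/inverse document frequency: the importance of a term increases in proportion to the number of times it appears in a document, and decreases in proportion to how often it appears across the corpus. It has three parts:

tf: how often the term appears in one document, tf(t in d) = √frequency
idf: how many documents the term appears in (a single occurrence counts), idf(t) = 1 + log((numDocs + 1) / (docFreq + 1))
field-length norm: the text length of the doc/field, norm(d) = 1 / √numTermsInDoc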
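
The three factors can be sketched in Python as follows. This is only a minimal illustration of the formulas as written in this note, assuming the natural logarithm for log; the function names are made up here and are not Lucene's API.

```python
import math


def tf(frequency: int) -> float:
    """Term frequency factor: square root of the raw count in one document."""
    return math.sqrt(frequency)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: rarer terms get a larger value."""
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))


def field_length_norm(num_terms: int) -> float:
    """Field-length norm: shorter fields get a larger value."""
    return 1.0 / math.sqrt(num_terms)


# A term that occurs in every document carries the minimum idf of 1.0,
# while a term that occurs in a single document out of many scores higher.
common = idf(num_docs=1000, doc_freq=1000)
rare = idf(num_docs=1000, doc_freq=1)
```

Because tf grows with the square root and norm shrinks with the square root, repeating a term many times in a long field helps less than a single occurrence in a short one.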

[Figure: TF/IDF]

//disabling the field-length norm reduces the work done at index time and speeds up indexing
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string",
          "norms": { "enabled": false } 
        }
      }
    }
  }
}

Vector Space Model

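Represents the query and each document as vectors of term weights, so that the relevance between a query and the documents can be compared.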

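[Figure: the query compared with three documents]

In the figure above, query = "happy hippopotamus", with term weights 2 and 5 respectively:
doc1 = I am happy in summer.
doc2 = After Christmas I’m a hippopotamus.
doc3 = The happy hippopotamus helped Harry.
Document 3 is the most relevant to the query (smallest angle).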
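
The "smallest angle" comparison can be sketched with cosine similarity. The document vectors below are an assumption made for illustration: each document is given the query's weight for every query term it contains (happy = 2, hippopotamus = 5) and 0 otherwise.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vector axes: (happy, hippopotamus)
query = (2, 5)
docs = {
    "doc1": (2, 0),  # contains only "happy"
    "doc2": (0, 5),  # contains only "hippopotamus"
    "doc3": (2, 5),  # contains both terms
}

# Larger cosine means smaller angle, i.e. more relevant.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(query, docs[name]),
    reverse=True,
)
```

Under these assumed weights doc3 points in the same direction as the query, so it ranks first, followed by doc2 and then doc1.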

Lucene’s Practical Scoring Function

Lucene's scoring function: for multi-term queries, Lucene combines the Boolean model, TF/IDF, and the vector space model into a single package that collects matching documents and scores them.

//original multi-term query
GET /my_index/doc/_search
{
  "query": {
    "match": {
      "text": "quick fox"
    }
  }
}

//equivalent rewrite under the Boolean model
GET /my_index/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": { "text": "quick" }},
        {"term": { "text": "fox"   }}
      ]
    }
  }
}

As soon as a document matches the query, Lucene scores it and then combines the scores of the individual terms, using the practical scoring function:

score(q, d)  =  #1
            queryNorm(q)  #2
          · coord(q, d)    #3
          · ∑ (           #4
                   tf(t in d)   #5
                 · idf(t)²      #6
                 · t.getBoost() #7
                 · norm(t, d)    #8
            ) (t in q)    #9

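#1 score(q, d): the relevance score of document d for query q
#2 queryNorm(q): the query normalization factor
#3 coord(q, d): the coordination factor
#4 ∑: opens the sum that is closed at #9
#5 tf(t in d): the term frequency of term t in document d
#6 idf(t): the inverse document frequency of term t
#7 t.getBoost(): the custom boost applied in the query
#8 norm(t, d): the field-length norm of document d
#9 (t in q): together with #4, the sum of the weights of every term t in query q for document d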
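
Read as code, the formula is one sum over the query terms, scaled by two query-level factors. The sketch below only mirrors the formula above; `practical_score` and its tuple layout are names invented for this note, not Lucene's API.

```python
def practical_score(query_norm, coord, term_factors):
    """Combine the factors of the practical scoring function.

    term_factors holds one (tf, idf, boost, norm) tuple for each
    query term that was found in the document.
    """
    total = sum(
        tf * idf ** 2 * boost * norm
        for tf, idf, boost, norm in term_factors
    )
    return query_norm * coord * total


# Two matching terms, with made-up factor values:
#   term 1: 1.0 * 2.0^2 * 1.0 * 0.5 = 2.0
#   term 2: 2.0 * 1.0^2 * 2.0 * 0.5 = 2.0
example = practical_score(
    query_norm=0.5,
    coord=1.0,
    term_factors=[(1.0, 2.0, 1.0, 0.5), (2.0, 1.0, 2.0, 0.5)],
)
```

Note that idf enters squared, so a rare term dominates the sum much more than a frequent one.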

queryNorm

queryNorm tries to normalize a query so that the results of two different queries can be compared. (It is not very effective.)

coord

The coordination factor rewards documents that contain a higher share of the query terms:

query = "quick brown fox"

//without coord (the weight for each term is 1.5)
Document with fox → score: 1.5
Document with quick fox → score: 3.0
Document with quick brown fox → score: 4.5

//with coord
Document with fox → score: 1.5 * 1 / 3 = 0.5
Document with quick fox → score: 3.0 * 2 / 3 = 2.0
Document with quick brown fox → score: 4.5 * 3 / 3 = 4.5
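
The arithmetic above amounts to multiplying the raw score by the fraction of query terms that matched. A minimal sketch, with a function name chosen here for illustration:

```python
import math


def apply_coord(raw_score, matched_terms, total_terms):
    """Scale a raw score by the share of query terms the document matched."""
    return raw_score * matched_terms / total_terms


# query = "quick brown fox", weight of each term = 1.5
only_fox = apply_coord(1.5, matched_terms=1, total_terms=3)
quick_fox = apply_coord(3.0, matched_terms=2, total_terms=3)
all_three = apply_coord(4.5, matched_terms=3, total_terms=3)
```

The document matching all three terms keeps its full score, while partial matches are pushed further down than the raw sum alone would push them.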

norm

Field length. The shorter the field, the higher the weight of a term in it: norm(d) = 1 / √numTermsInDoc

boost

A custom weight.

Query-Time Boosting

Query-time boosting: at search time, one query clause is given a custom weight that differs from the other clauses, which better suits customized search needs.

GET /_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "quick brown fox",
              "boost": 2 
            }
          }
        },
        {
          "match": { 
            "content": "quick brown fox"
          }
        }
      ]
    }
  }
}

The custom weight of the query on the title field is larger than on the content field (2 > 1); the default boost is 1.

Ignoring TF/IDF

Sometimes we only care whether a term appears in a doc at all, not how often it appears. In that case the cost of computing TF/IDF can be skipped, which speeds up the search.

constant_score

constant_score is used in place of match: TF/IDF is not computed, but the remaining factors of the score still are.

//match
GET /_search
{
    "query": {
        "match": {
            "description": "wifi garden pool"
        }
    }
}

//constant_score
GET /_search
{
    "query": {
        "constant_score" : {
            "filter" : {
                "term" : { "user" : "kimchy"}
            },
            "boost" : 1.2
        }
    }
}

function_score query

https://www.elastic.co/guide/en/elasticsearch/reference/6.0/query-dsl-function-score-query.html#function-decay

By default ES sorts search results by relevance. To change the default ordering, specify one or more sort fields with sort.

GET /_search
{
    "query" : {
        "bool" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": { "date": { "order": "desc" }}
}

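Sorting directly on a field is blunt, however, and may give poor results (unless the sort field was already computed before indexing). In that case several fields need to be weighed together, which is what function_score is for: it lets us apply a scoring function to every doc that matches the query, so as to change the default ranking. The functions ES provides are:

weight: applies a plain, un-normalized boost to each doc: with weight = 2 the final result is 2 * _score (this differs from boost = 2 in constant_score, where the boost takes part in the normalization of _score; constant_score only skips TF/IDF, while the remaining factors queryNorm, coord, norm, and boost all still take part in normalization)
random_score: returns a score between 0 and 1 based on a random seed; the same seed gives the same random score. Mostly used for personalized recommendation
field_value_factor: computes a ranking score from a specified field of the doc
- field: the name of the field
- factor: a scaling coefficient, default 1
- modifier: how the field value is transformed
  - none: no transformation
  - log: logarithm
  - log1p: add 1 to the field value, then take the logarithm
  - square: square
  - sqrt: square root
  - reciprocal: reciprocal, etc.
decay_function: linear, exp (exponential), or gauss (Gaussian), with the following parameters:
- origin: the origin
- scale: the distance from the origin at which the score has dropped to decay
- offset: if non-zero, decay only starts beyond this distance from the origin; default 0
- decay: the score a doc receives at a distance of scale from the origin (beyond any offset); default 0.5
script_score: uses a script to define custom scoring logic over different fields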

//weight & random_score & score_mode & boost_mode
GET /_search
{
  "query": {
    "function_score": {
      "filter": {
        "term": { "city": "Barcelona" }
      },
      "functions": [
        {
          "filter": { "term": { "features": "wifi" }},
          "weight": 1
        },
        {
          "filter": { "term": { "features": "garden" }},
          "weight": 1
        },
        {
          "filter": { "term": { "features": "pool" }},
          "weight": 2
        },
        {
          "random_score": { 
            "seed":  "the_users_session_id" 
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}

//field_value_factor
GET /_search
{
    "query": {
        "function_score": {
            "field_value_factor": {
                "field": "likes",
                "factor": 1.2,
                "modifier": "sqrt",
                "missing": 1
            }
        }
    }
}
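
What the request above computes per document can be sketched as modifier(factor * field value). The Python below is a rough illustration, assuming log and log1p are base-10 logarithms; `field_value_factor` here is a local helper written for this note, not a client library call.

```python
import math

# Transformations named in the modifier list; log/log1p assumed to be base 10.
MODIFIERS = {
    "none": lambda v: v,
    "log": math.log10,
    "log1p": lambda v: math.log10(v + 1),
    "square": lambda v: v * v,
    "sqrt": math.sqrt,
    "reciprocal": lambda v: 1.0 / v,
}


def field_value_factor(value, factor=1.0, modifier="none", missing=None):
    """Score contribution of one numeric field value.

    value is None when the document lacks the field, in which case
    the `missing` fallback is used instead.
    """
    if value is None:
        if missing is None:
            raise ValueError("field is missing and no fallback was given")
        value = missing
    return MODIFIERS[modifier](factor * value)


# Same settings as the request above: factor 1.2, modifier sqrt, missing 1.
with_likes = field_value_factor(100, factor=1.2, modifier="sqrt", missing=1)
without_likes = field_value_factor(None, factor=1.2, modifier="sqrt", missing=1)
```

The sqrt modifier keeps a doc with 100 likes ahead of one with none, but by a factor of about 10 rather than 100.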

//decay function (d = day)
GET /_search
{
    "query": {
        "function_score": {
            "gauss": {
                "date": {
                      "origin": "2013-09-17", 
                      "scale": "10d",
                      "offset": "5d", 
                      "decay" : 0.5 
                }
            }
        }
    }
}
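
To see how origin, scale, offset, and decay interact, the gauss curve can be written out by hand. This is a sketch under the assumption that the curve is exp(-d² / 2σ²), with σ chosen so that the score equals decay at a distance of scale past the offset; the helper names are invented for this note.

```python
import math
from datetime import date


def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    """Gaussian decay score for a doc at `distance` from the origin."""
    # Pick sigma^2 so that the score is exactly `decay` at distance = scale.
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    # Inside the offset the doc keeps the full score of 1.0.
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-effective ** 2 / (2.0 * sigma_sq))


ORIGIN = date(2013, 9, 17)


def score_for(day):
    """Score for a doc dated `day`, using the settings from the request above."""
    return gauss_decay((day - ORIGIN).days, scale=10, offset=5, decay=0.5)


near = score_for(date(2013, 9, 20))   # 3 days away, inside the 5d offset
at_scale = score_for(date(2013, 10, 2))  # 15 days away = offset + scale
far = score_for(date(2013, 11, 1))    # well beyond the scale
```

Docs within 5 days of the origin are not penalized at all, and the score is halved at 15 days.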

[Figure: decay_function curves]

score combination

score_mode: how the scores of the individual functions are combined with each other
- multiply, the default
- sum
- avg
- max/min
- first
boost_mode: how the combined function score is combined with the query's _score
- multiply, the default
- sum
- avg
- max/min
- replace
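
The two modes apply in sequence: score_mode first folds the function scores into one number, then boost_mode merges that number with _score. A minimal sketch with helper names made up for this note; avg is a plain arithmetic mean here, which may differ from how Elasticsearch weights the functions.

```python
import math


def combine_functions(scores, score_mode="multiply"):
    """Fold the scores of all matching functions into one value."""
    if score_mode == "multiply":
        return math.prod(scores)
    if score_mode == "sum":
        return sum(scores)
    if score_mode == "avg":
        return sum(scores) / len(scores)
    if score_mode == "max":
        return max(scores)
    if score_mode == "min":
        return min(scores)
    if score_mode == "first":
        return scores[0]
    raise ValueError(f"unknown score_mode: {score_mode}")


def apply_boost_mode(query_score, function_score, boost_mode="multiply"):
    """Merge the combined function score with the query's _score."""
    if boost_mode == "multiply":
        return query_score * function_score
    if boost_mode == "sum":
        return query_score + function_score
    if boost_mode == "avg":
        return (query_score + function_score) / 2
    if boost_mode == "max":
        return max(query_score, function_score)
    if boost_mode == "min":
        return min(query_score, function_score)
    if boost_mode == "replace":
        return function_score
    raise ValueError(f"unknown boost_mode: {boost_mode}")


# A doc with wifi (1), garden (1), and pool (2), leaving random_score aside:
features = combine_functions([1, 1, 2], score_mode="sum")
final = apply_boost_mode(1.0, features, boost_mode="multiply")
```

With score_mode sum and boost_mode multiply, as in the weight example earlier, a doc offering all three features ends up with four times the base _score.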

Pluggable Similarity Algorithms

ES ships with several relevance algorithms to choose from:

TF/IDF, the original default
BM25
DFR, DFI, IB, etc.

Note that since 6.0 Lucene uses BM25 as the default in place of TF/IDF.

//configure BM25 in mapping setting
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type":       "string",
          "similarity": "BM25" 
        },
        "body": {
          "type":       "string",
          "similarity": "default" 
        }
      }
    }
  }
}

BM25

http://fjdu.github.io/coding/2017/03/16/bm25-elasticsearch-lucene.html
http://www.itdecent.cn/p/0b372804ff45
https://en.wikipedia.org/wiki/Okapi_BM25

Best Match 25, published in 1994, is the 25th iteration of tuning the relevance computation.
It introduces term frequency saturation, computed as follows:

[Figure: BM25 formula (BM25.png)]

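where

|D|: the document length
avgdl: the average document length over all documents
k1 and b are free parameters; Lucene defaults to k1 = 1.2, b = 0.75
IDF = log((#Docs - #DocsHit + 0.5)/(#DocsHit + 0.5))
TF = the number of times the query term occurs in one doc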
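
Since the formula image is missing, here is a sketch of the per-term score in its commonly cited Okapi BM25 form, IDF · TF · (k1 + 1) / (TF + k1 · (1 - b + b · |D| / avgdl)), using the symbols defined above. That this matches the missing image is an assumption, and the function names are invented for this note.

```python
import math


def bm25_idf(num_docs, docs_hit):
    """IDF as defined above: rarer terms get a larger value."""
    return math.log((num_docs - docs_hit + 0.5) / (docs_hit + 0.5))


def bm25_term(tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """BM25 score of one query term for one document."""
    length_factor = 1.0 - b + b * doc_len / avgdl
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_factor)


def bm25_score(term_stats, doc_len, avgdl, k1=1.2, b=0.75):
    """Sum the per-term scores; term_stats holds (tf, idf) pairs."""
    return sum(bm25_term(tf, doc_len, avgdl, idf, k1, b) for tf, idf in term_stats)


# Saturation: for a doc of average length and idf = 1, the term score
# rises quickly at first and then levels off below k1 + 1 = 2.2.
curve = [bm25_term(tf, doc_len=100, avgdl=100, idf=1.0) for tf in (1, 2, 5, 50, 500)]
```

Unlike the √frequency growth of TF/IDF, the contribution of a term is bounded here: extra repetitions add less and less, which is the term frequency saturation mentioned above.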

[Figure: term frequency saturation for TF/IDF and BM25 (snapshot, blue)]

BM25F

http://www.cnblogs.com/bentuwuying/p/6730891.html

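BM25F is an improved version of BM25. When computing the relevance between a document and a query, BM25 treats the document as a whole. With the development of advanced search, however, document structure has to be taken into account (each document can be split into several independent fields, such as title, abstract, keyword, and body text), and the contribution of each field to relevance should be handled more finely. BM25F is the weighted sum of the query's scores over the individual fields of the document.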
