IR Chapter1-3

chapter 1 boolean search

布爾檢索是數(shù)據(jù)庫(kù)檢索最基本的方法,是用邏輯“或”(+、OR)、邏輯"與"(×、AND)、邏輯"非"(-、NOT)等算符在數(shù)據(jù)庫(kù)中對(duì)相關(guān)文獻(xiàn)的定性選擇的方法。

  1. 單詞-文檔矩陣,
  2. 索引結(jié)構(gòu)是倒排索引 unit-posting,基于關(guān)鍵詞的查找
  • 查找
  • 合并merge:按照Doc Frequency排序,從最小的數(shù)組開(kāi)始合并

chapter 2 vocabulary list and posting

更多是關(guān)于web search,不是精確檢索。在search前需要mapping。

1.文檔處理document delineation and char seguence decoding

  • index granularity索引粒度:確定document unit
    (a message / a file /a message and its contained attachment / a book or a chapter or a sentence)
    PS: 系統(tǒng)需要提供可選的索引granularity

2. vocabulary

  • tokenization( segment,eliminate punctuation etc.):hyphen、space

  • match:返回內(nèi)容充分滿足expectations,windows->Windows、WINDOWS 各種變換的形式

  • common words(stop words,pressure for time and space):
    big list → small list → statistics
    Web search engines generally do not use stop word lists.More precise with stop words

  • Normalization (helps in queries, not in IR performance)
    standard way:create equivalence classes
    1). mapping【badcase U.S.A match USA,but C.A.T. not match cat】
    2). maintain relations
    a. index unnormalized tokens + a query expansion list
    b. index construction, index 含有token同義詞的document

    3). the forms of normalization
    accents /diacritics
    capitalization/case-folding:reduce letters to lower case
    idiosyncratic issue in English:color vs colour

  • Stemming and lemmatization:reduce inflectional forms
    stemming remove derivational affixes
    lemma 還原詞根

3. Faster postings list intersection

  • skip list pointer,而不是原來(lái)的逐一遍歷
    【W(wǎng)here do we place skips?】 There is a tradeoff.
    heuristic : ` √P evenly-spaced skip pointers.

4. Positional postings and phrase queries

  • ▼ Biword indexes→phrase index
    single-word term index also needed
    【cons】
    large vocabulary→partial phrase index
  • ▲ Positional postings 位置信息索引
    【cons】
    expands posting size,depending on token
  • ▲+▼ combined scheme【biword index + position index】
    index what biword?
    • query behavior:index frequently queried phrase
    • individual word high frequency, phrase comparatively low

chapter 3 Dictionaries and tolerant retrieval(容錯(cuò)檢索)

  • Search structures :Dictionary

  • Solution :

  1. hashing
  2. search tree【demand prescribed ordering】
    Binary tree
    B-tree
    binary tree

    B-Tree
  • Wildcard queries 通配符檢索

找索引單元/term/單詞,字符串查找

query:mon*
solution:regular B-Tree
process: walk down the tree following the symbol m, o, n.

query:*mon
solution:reversed B-Tree
process: walk down the tree following the symbol m, o, n.
e.g. lemon root->n->o->m->e->l

query:se*mon
solution2:permuterm indexes
solution1:regular B-Tree + reversed B-Tree
process:
search regular B-Tree, prefix se-, set A
search reversed B-Tree, suffix -mon, set B
intersect A ∩ B

General wildcard queries

  1. permuterm index
    也就是索引單元包含了首、尾的信息
    cons:the dictionary becomes quite large, lead to blowup


    image.png

    e.g. m*n → n$m → man、moron
    e.g. fi*mo*er → 首 er 尾fi → candidate terms →filter by exhaustive enumeration,檢查中間是否包含"mo"
    找完單詞以后,進(jìn)行document retrive

  2. k-gram indexes for wildcard queries
    index k characters, with ¥ denoting beginning and end.
    相比于permuterm index,數(shù)量相對(duì)少一些
    e.g. k=3,castle →¥ca, cas, ast, stl, tle, le$
    search
    e.g. re*ve → ¥re and ve¥
    filtering demanded string-matching operation
    e.g. red* → ed¥→ retired【invalid,因?yàn)閝uery左側(cè)沒(méi)有其他符號(hào)】

Spelling correction

1. implementation of spelling correction

2. 2 steps to solving the problem

  • edit distance
  • k-gram overlap
  1. implementation
  • 選擇正確答案的原則
    選擇最近的,proximity;
    選擇最常見(jiàn)的(出現(xiàn)最多;使用最多)
  • forms
    isolated-term correction
    context-sensitive correction.
  1. 編輯距離的計(jì)算【動(dòng)態(tài)規(guī)劃】
    可以用來(lái)計(jì)算相似度
    編輯距離其實(shí)是編輯操作的最小數(shù)目
    edit operation:insert, replace, delete => 帶權(quán)重的edit operation

3.k-gram索引 for spelling correction
【isolated spelling correction】
【Jaccard coefficient】
set A query里面的k-gram集合
set B term里面的k-gram集合
Jaccard coefficient = |A ∩ B| / |A ∪ B|
流程:
for t in term-posting list
???Jaccard( t ) > 閾值
???選擇t作為候選

t 的 k-gram如何獲得?
Jaccard = num(A&B) / num(A+B)-num(A+B)
num(A+B)是 posting hits
方法:
step1、經(jīng)驗(yàn)方法,直接選出candidate
step2、計(jì)算編輯距離,選出具有最小edit distance的t

【context sensitive spelling correction】
query = w1 + w2 +.....wn
列舉出每一個(gè)詞的candidate
trim,bi-words,previous queries by users

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容