chapter 1 boolean search

布爾檢索是數(shù)據(jù)庫(kù)檢索最基本的方法，是用邏輯“或”（＋、OR）、邏輯"與"(×、AND)、邏輯"非"(－、NOT)等算符在數(shù)據(jù)庫(kù)中對(duì)相關(guān)文獻(xiàn)的定性選擇的方法。

單詞-文檔矩陣，
索引結(jié)構(gòu)是倒排索引 unit-posting,基于關(guān)鍵詞的查找

查找
合并merge：按照Doc Frequency排序，從最小的數(shù)組開(kāi)始合并

chapter 2 vocabulary list and posting

更多是關(guān)于web search，不是精確檢索。在search前需要mapping。

1.文檔處理document delineation and char seguence decoding

index granularity索引粒度：確定document unit
（a message / a file /a message and its contained attachment / a book or a chapter or a sentence）
PS: 系統(tǒng)需要提供可選的索引granularity

2. vocabulary

tokenization( segment，eliminate punctuation etc.)：hyphen、space
match：返回內(nèi)容充分滿足expectations，windows->Windows、WINDOWS 各種變換的形式
common words(stop words,pressure for time and space)：
big list → small list → statistics
Web search engines generally do not use stop word lists.More precise with stop words
Normalization (helps in queries, not in IR performance)
standard way：create equivalence classes
1). mapping【badcase U.S.A match USA，but C.A.T. not match cat】
2). maintain relations
a. index unnormalized tokens + a query expansion list
b. index construction, index 含有token同義詞的document

3). the forms of normalization
accents /diacritics
capitalization/case-folding：reduce letters to lower case
idiosyncratic issue in English：color vs colour
Stemming and lemmatization：reduce inflectional forms
stemming remove derivational affixes
lemma 還原詞根

3. Faster postings list intersection

skip list pointer，而不是原來(lái)的逐一遍歷
【W(wǎng)here do we place skips?】 There is a tradeoff.
heuristic : ` √P evenly-spaced skip pointers.

4. Positional postings and phrase queries

▼ Biword indexes→phrase index
single-word term index also needed
【cons】
large vocabulary→partial phrase index
▲ Positional postings 位置信息索引
【cons】
expands posting size，depending on token
▲+▼ combined scheme【biword index + position index】
index what biword?
- query behavior：index frequently queried phrase
- individual word high frequency, phrase comparatively low

chapter 3 Dictionaries and tolerant retrieval(容錯(cuò)檢索)

Search structures ：Dictionary
Solution ：

hashing
search tree【demand prescribed ordering】
Binary tree
B-tree

binary tree

B-Tree

Wildcard queries 通配符檢索

找索引單元/term/單詞，字符串查找

query：mon*
solution：regular B-Tree
process: walk down the tree following the symbol m, o, n.

query：*mon
solution：reversed B-Tree
process: walk down the tree following the symbol m, o, n.
e.g. lemon root->n->o->m->e->l

query：se*mon
solution2：permuterm indexes
solution1：regular B-Tree + reversed B-Tree
process:
search regular B-Tree, prefix se-, set A
search reversed B-Tree, suffix -mon, set B
intersect A ∩ B

General wildcard queries

permuterm index
也就是索引單元包含了首、尾的信息
cons：the dictionary becomes quite large, lead to blowup

image.png

e.g. m*n → n$m → man、moron
e.g. fi*mo*er → 首 er 尾fi → candidate terms →filter by exhaustive enumeration，檢查中間是否包含"mo"
找完單詞以后，進(jìn)行document retrive
k-gram indexes for wildcard queries
index k characters, with ￥ denoting beginning and end.
相比于permuterm index，數(shù)量相對(duì)少一些
e.g. k=3,castle →￥ca, cas, ast, stl, tle, le$
search
e.g. re*ve → ￥re and ve￥
filtering demanded string-matching operation
e.g. red* → ed￥→ retired【invalid，因?yàn)閝uery左側(cè)沒(méi)有其他符號(hào)】

Spelling correction

1. implementation of spelling correction

2. 2 steps to solving the problem

edit distance
k-gram overlap

implementation

選擇正確答案的原則
選擇最近的，proximity；
選擇最常見(jiàn)的（出現(xiàn)最多；使用最多）
forms
isolated-term correction
context-sensitive correction.

編輯距離的計(jì)算【動(dòng)態(tài)規(guī)劃】
可以用來(lái)計(jì)算相似度
編輯距離其實(shí)是編輯操作的最小數(shù)目
edit operation：insert, replace, delete => 帶權(quán)重的edit operation

3.k-gram索引 for spelling correction
【isolated spelling correction】
【Jaccard coefficient】
set A query里面的k-gram集合
set B term里面的k-gram集合
Jaccard coefficient = |A ∩ B| / |A ∪ B|
流程：
for t in term-posting list
???Jaccard( t ) > 閾值
???選擇t作為候選

t 的 k-gram如何獲得？
Jaccard = num(A&B) / num(A+B)-num(A+B)
num(A+B)是 posting hits
方法：
step1、經(jīng)驗(yàn)方法，直接選出candidate
step2、計(jì)算編輯距離，選出具有最小edit distance的t

【context sensitive spelling correction】
query = w1 + w2 +.....wn
列舉出每一個(gè)詞的candidate
trim，bi-words,previous queries by users

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

IR Chapter1-3

IR Chapter1-3

chapter 1 boolean search

chapter 2 vocabulary list and posting

chapter 3 Dictionaries and tolerant retrieval(容錯(cuò)檢索)

Search structures ：Dictionary

Solution ：

Wildcard queries 通配符檢索

找索引單元/term/單詞，字符串查找

General wildcard queries

Spelling correction

1. implementation of spelling correction

2. 2 steps to solving the problem

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

IR Chapter1-3

chapter 1 boolean search

chapter 2 vocabulary list and posting

chapter 3 Dictionaries and tolerant retrieval(容錯(cuò)檢索)

Search structures ：Dictionary

Solution ：

Wildcard queries 通配符檢索

找索引單元/term/單詞，字符串查找

General wildcard queries

Spelling correction

1. implementation of spelling correction

2. 2 steps to solving the problem

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

找索引單元/term/單詞，字符串查找