IR chapter 2: the term vocabulary and postings lists


0. document delineation and character sequence decoding

obtain the character sequence in a document

  • determine the character encoding
  • determine the document format
  • assumption: text is a linear sequence of characters.

choose a document unit

  • indexing granularity
    a precision/recall trade-off: small units (e.g. sentences) favor precision, large units (e.g. books) favor recall.

1. determine the vocabulary of terms

tokenization

terminology
  • token
    an instance of a sequence of characters in a particular document, grouped together as a useful semantic unit.
  • type
    the class of all tokens containing the same character sequence.
  • term
    A term is a (perhaps normalized) type that is included in the IR system’s dictionary.

Example: to sleep perchance to dream

5 tokens, 4 types (because there are two instances of "to"), and 3 terms (if "to" is omitted from the index as a stop word).
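
The counts above can be reproduced with a short sketch. The regex tokenizer and the one-word stop list are assumptions made only for this example, not a recommended tokenizer.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Assumed stop list containing only "to", as in the example above.
STOP_WORDS = {"to"}

tokens = tokenize("to sleep perchance to dream")  # every occurrence counts
types = set(tokens)                               # distinct character sequences
terms = types - STOP_WORDS                        # types kept in the dictionary

print(len(tokens), len(types), len(terms))  # 5 4 3
```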

tricky cases
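  • apply exactly the same tokenization to documents and to queries, so that query tokens match dictionary terms

Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. (How should O’Neill and aren’t be split?)
Tokenization issues are language-specific: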

  • unusual specific tokens: email addresses, URLs, IP addresses
    Items such as the date of an email, which have a clear semantic
    type, are often indexed separately as document metadata.
  • hyphenation and splitting on white space
    encourage users to enter hyphens wherever they may be
    possible (this depends on user training)
  • new language, new issues
    Chinese: word segmentation

dropping common terms: stop words

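  • sort terms by collection frequency, then hand-filter the most frequent ones to build a stop list
  • modern IR systems have largely abandoned stop lists, partly because dropping stop words harms phrase queries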
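
A minimal sketch of the first step, assuming whitespace tokenization and a toy three-document collection. `build_stop_list` is a hypothetical helper name, and its output is only a candidate list that would still need hand-filtering.

```python
from collections import Counter

def build_stop_list(docs, k):
    """Return the k terms with the highest collection frequency."""
    collection_frequency = Counter()
    for doc in docs:
        # Collection frequency counts every occurrence across all documents.
        collection_frequency.update(doc.lower().split())
    return [term for term, _ in collection_frequency.most_common(k)]

docs = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog on the mat",
]

candidates = build_stop_list(docs, 1)
print(candidates)  # ['the']
```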

normalization

token normalization

the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens

equivalence classing
  • use mapping rules that remove characters like hyphens
  • alternative: maintain relations between unnormalized tokens (expansion lists)
expansion lists
  • index unnormalized tokens and expand at query time with a query expansion list
  • or perform the expansion during index construction
    This needs more space for postings, which was traditionally seen as a drawback, but with modern storage costs the increased flexibility is appealing.
common forms of normalization
  • Accents and diacritics
    e.g. Spanish
    Users often omit diacritics in queries (speed, habit, or software that made non-ASCII text hard to enter), so it might be best to equate all words to a form without diacritics.
  • Capitalization/case-folding
    Do case-folding by reducing all letters to lower case. (This can conflate words that capitalization keeps apart: companies, government organizations, personal names.)
    An alternative to making every token lowercase is to make only some tokens lowercase. (Lowercasing everything is often still the most practical choice if your users usually type lowercase regardless of the correct case of words.)
  • other issues in English:
    different varieties of English (e.g. British vs. American spelling)
  • other languages:
    Japanese: multiple alphabets
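
The first two normalizations can be sketched with the standard library. `normalize` is a hypothetical helper name, and folding every token this way is only one possible policy, with the drawbacks noted in the list above.

```python
import unicodedata

def normalize(token):
    """Case-fold a token and strip combining diacritical marks."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize("Peña"))   # pena
print(normalize("CAFÉ"))   # cafe
print(normalize("naïve"))  # naive
```

The same function has to be applied to query tokens, otherwise a query for "cafe" would not match the folded index term.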

stemming and lemmatization

  • For grammatical reasons, documents are going to use different forms of a
    word, such as organize, organizes, and organizing.
  • stemming
    a crude heuristic process that chops off the ends of words.
  • lemmatization
    doing things properly with the use of a vocabulary and morphological analysis of words, returning the dictionary form of a word (the lemma).
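
To make the contrast concrete, here is a toy sketch. The suffix list and the lemma table are invented for this example. This is not the Porter stemmer or a real morphological analyzer.

```python
# Hypothetical suffix rules, longest first; a real stemmer has many more.
SUFFIXES = ("izing", "izes", "ized", "ize", "ing", "es", "s")

def crude_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy vocabulary standing in for a dictionary plus morphological analysis.
LEMMAS = {"organizes": "organize", "organizing": "organize",
          "am": "be", "are": "be", "is": "be"}

def lemmatize(word):
    """Look the word up in the vocabulary; unknown words are returned as-is."""
    return LEMMAS.get(word, word)

for word in ("organize", "organizes", "organizing"):
    print(word, crude_stem(word), lemmatize(word))
# all three stem to "organ"; the lemma is "organize"
```

The stem "organ" is not a valid form of the word, while the lemma is. Irregular forms such as "are" cannot be handled by suffix chopping at all.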

2. faster postings list intersection via skip pointers

skip list: adding skip pointers

where to place skip pointers

For a postings list of length P, use √P evenly spaced skip pointers.

intersection with skip pointers
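
A minimal sketch of the intersection, assuming postings are sorted Python lists of docIDs and that skip pointers are implicit: position i carries a pointer to position i + step when i is a multiple of step, with step ≈ √P. A real index would store the pointers inside the postings structure.

```python
import math

def skip_interval(postings):
    """Spacing that yields roughly sqrt(P) evenly spaced skip pointers."""
    return max(1, math.isqrt(len(postings)))

def has_skip(index, postings, step):
    """True if position `index` holds a skip pointer with a valid target."""
    return index % step == 0 and index + step < len(postings)

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skip pointers when safe."""
    step1, step2 = skip_interval(p1), skip_interval(p2)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Skip only if the target does not overshoot the other docID.
            if has_skip(i, p1, step1) and p1[i + step1] <= p2[j]:
                i += step1
            else:
                i += 1
        else:
            if has_skip(j, p2, step2) and p2[j + step2] <= p1[i]:
                j += step2
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 16, 19, 23, 28, 43]
p2 = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(p1, p2))  # [2, 8]
```

A skip is taken only when its target is still less than or equal to the docID on the other list, so no match can be jumped over.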

the optimal encoding for an inverted index

  • Traditionally, CPUs were slow, and so highly compressed techniques were not optimal.
  • Now CPUs are fast and disk is slow, so reducing the size of postings lists on disk dominates.

3. positional postings and phrase queries

As many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names), entered without use of double quotes.

Biword indexes

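  • longer phrases: break them down into biwords and AND the biwords together
  • N X* N: extended biword indexes (N = noun, X = function word such as an article or preposition)
    e.g. the abolition of slavery → extended biword "abolition slavery"
  • increases the space cost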
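
A small sketch of a plain biword index (not the extended N X* N variant), assuming whitespace tokenization and an invented three-document collection. `build_biword_index` and `phrase_query` are hypothetical names.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Map each pair of consecutive words to the set of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for first, second in zip(words, words[1:]):
            index[(first, second)].add(doc_id)
    return index

def phrase_query(index, phrase):
    """Break the phrase into biwords and AND their postings together."""
    words = phrase.lower().split()
    biwords = list(zip(words, words[1:]))
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result

docs = {
    1: "stanford university palo alto",
    2: "palo alto hosts stanford university palo",
    3: "university of palo alto",
}
index = build_biword_index(docs)
print(phrase_query(index, "stanford university palo alto"))  # {1, 2}
```

Document 2 contains all three biwords but not the four-word phrase, so breaking longer phrases down can return false positives.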

Positional indexes

<doc1ID, frequency: <position1, position2, ...>;
 doc2ID, frequency: <position1, position2, ...>;
 ...>


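[Figure: an example of a positional index]
k-word proximity searches: the same positional lists answer queries of the form term1 /k term2, meaning the two terms occur within k words of each other.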
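
A minimal sketch of a positional index and a proximity check, assuming whitespace tokenization and 0-based word positions. For brevity the position lists are compared pairwise, not with a linear merge, and the function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}; frequency is the length of the list."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def proximity_query(index, term1, term2, k):
    """DocIDs where term1 and term2 occur within k words of each other."""
    postings1 = index.get(term1, {})
    postings2 = index.get(term2, {})
    hits = set()
    for doc_id in postings1.keys() & postings2.keys():
        if any(abs(a - b) <= k
               for a in postings1[doc_id] for b in postings2[doc_id]):
            hits.add(doc_id)
    return hits

def phrase_query(index, term1, term2):
    """Two-word phrase: term2 must sit exactly one position after term1."""
    postings1 = index.get(term1, {})
    postings2 = index.get(term2, {})
    return {doc_id for doc_id in postings1.keys() & postings2.keys()
            if any(a + 1 in postings2[doc_id] for a in postings1[doc_id])}

docs = {1: "to be or not to be", 2: "be good to me"}
index = build_positional_index(docs)
print(index["to"][1])                          # [0, 4]
print(phrase_query(index, "to", "be"))         # {1}
print(proximity_query(index, "to", "be", 2))   # {1, 2}
```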

Combination schemes

A combination strategy uses a phrase index, or just a biword index, for certain common queries (like Michael Jackson) and uses a positional index for other phrase queries.
