大香蕉久久视频网,熟妇日操夜操,色悠悠美女逼逼

0.document delineation and character sequence decoding

obtain the character sequence in a document

determine the character encoding
determine the document file
assumption: text is a linear sequence of characters.

choose a document unit

indexing granularity
trade-off between precision(small - sentence) and recall(large - book).

1.determine the vocabulary of terms

tokenization

terminology

token
an instance of a sequence of characters
type
the class of all tokens containing the same character sequence.
term
A term is a (perhaps normalized) type that is included in the IR system’s dictionary.

to sleep perchance to dream,

5 tokens,4 types (because there are two instances of to), 3 terms ( if to is omitted from the index as a stop word)

tricky cases

do the exact same tokenization for both the dictionary and query

Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
language specific

unusual specific tokens:email address, URL, IP
Items such as the date of an email, which have a clear semantic
type, are often indexed separately as document metadata
hyphenations, splitting on white space
encourage users to enter hyphens wherever they may be
possible(depends on user' training)
new language, new issue
Chinese:word segmentation

dropping common terms: stop words

collect frequency, hand-filtered build stop list
modern IR system abandoned, for its harm to phrase query

normalization

token normalization

the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens

equivalence classing

using mapping rules that remove characters like hyphens
maintain relations between unnormalized tokens.

expansion lists

query expansion lists
perform the expansion during index construction
Although more space for postings seemed to be less efficient, for modern IR system, the increased flexibility is more appealing.

common forms of normalization

Accents and diacritics.
Spanish
Non-ASCII text is hardly used on many computer systems, it might be best to equate all words to a form without diacritics.
Capitalization/case-folding
Do case-folding by reducing all letters to lower case.(companies, government organization, personal names)
Making every token lowercase is to just make some tokens lowercase.(if your users usually use lowercase regardless of the correct case of words?)
other issues in English:
different version of English.
other languages:
Japanese: multiple alphabets

stemming and lemmatization

For grammatical reasons, documents are going to use different forms of a
word, such as organize, organizes, and organizing.
stemming
a crude heuristic process that chops off the ends of words
lemmatization
doing things properly with the use of a vocabulary and morphological analysis of words.

2.faster posting list intersection via skip points

skip list: adding skip points

skip points

for a postings list of length P, use √P evenly spaced skip pointers.

intersect with skip list

the optimal encoding for an inverted index

Traditionally, CPUs were slow, and so highly compressed techniques were not optimal.
Now CPUs are fast and disk is slow, so reducing disk postings list size dominates.

3.positional postings and phrase queries

Many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names), entered without use of double quotes.

Biword indexes

longer phrase: break them down
NX*N: a extended biword indexes
i.e.the abolition of slavery
increase the space cost

Positional indexes

<doc1ID,frequency: <?position1, position2, . . . ?>;
doc2ID,frequency: <?position1, position2, . . . ?>;
...>

an example of positional indexes

k word proximity searches

Combination schemes

A combination strategy uses a phrase index, or just a biword index, for certain common queries(like Micheal Jackson) and uses a positional index for other phrase queries.

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

IR-chapter2:the term vocabulary and posting lists

IR-chapter2:the term vocabulary and posting lists

0.document delineation and character sequence decoding

obtain the character sequence in a document

choose a document unit

1.determine the vocabulary of terms

tokenization

terminology

tricky cases

dropping common terms: stop words

normalization

token normalization

equivalence classing

expansion lists

common forms of normalization

stemming and lemmatization

2.faster posting list intersection via skip points

skip list: adding skip points

the optimal encoding for an inverted index

3.positional postings and phrase queries

Biword indexes

Positional indexes

Combination schemes

相關閱讀更多精彩內容

友情鏈接更多精彩內容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

IR-chapter2:the term vocabulary and posting lists

0.document delineation and character sequence decoding

obtain the character sequence in a document

choose a document unit

1.determine the vocabulary of terms

tokenization

terminology

tricky cases

dropping common terms: stop words

normalization

token normalization

equivalence classing

expansion lists

common forms of normalization

stemming and lemmatization

2.faster posting list intersection via skip points

skip list: adding skip points

the optimal encoding for an inverted index

3.positional postings and phrase queries

Biword indexes

Positional indexes

Combination schemes

相關閱讀更多精彩內容

友情鏈接更多精彩內容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av