Preface
I recently attended a talk on IR by Maarten, a leading figure in the field. If I remember correctly, it was the same talk I heard at ESSIR last year, but I learn something new every time, so I am writing up my notes below.
Query Improvement (online)
- Main goals: give users shortcuts and handle errors in queries
- Main method: log analysis (AOL dataset)
- Main approaches:
- Query Auto-Completion (QAC): predict the intent users have in mind but have not yet clearly expressed
- Query Suggestion: recommendation, ranking & diversity
- Query Expansion
- Query Correction
- The key is to combine query signals, such as clicks, time, news, personal, general and location information, with the query logs
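
As a rough illustration of log-based Query Auto-Completion, here is a minimal sketch that suggests the most frequent logged queries sharing the typed prefix. The function names `build_completer` and `complete` and the toy log are my own assumptions, not something shown in the talk; a real system would also mix in the other signals listed above (time, location, personal history).

```python
from collections import Counter

def build_completer(query_log):
    """Count how often each normalized query appears in the log."""
    return Counter(q.strip().lower() for q in query_log if q.strip())

def complete(counts, prefix, k=3):
    """Return up to k logged queries starting with prefix, most frequent first."""
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically for a stable order.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]

# Toy log (made up); a real system would read something like the AOL query log.
log = ["weather today", "weather tomorrow", "weather today",
       "web search", "Weather Today"]
counts = build_completer(log)
suggestions = complete(counts, "wea")
print(suggestions)
```

Ranking purely by frequency is the simplest possible choice; it ignores freshness and personalization, which is exactly where the extra signals come in.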
Getting Content (offline)
- Common problems in crawling:
- Scale
- Content selection
- URL filtering
- Remove duplicate URLs: exact & near duplicates (compare word sequences, e.g. word n-grams)
- Spam detection: meaningful expressions, sentiment analysis & supervised learning
- Aggregation: considering anchor text on the web & information among entities.
- Inverted index construction: collect -> tokenize -> stopwords -> stem/lemma -> index
- Temporal IR: info can be images, songs, books, news, web pages, videos and apps
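
Two of the steps above are easy to sketch in code: near-duplicate detection by comparing word n-grams, and the collect -> tokenize -> stopwords -> stem -> index pipeline. The version below is a minimal sketch under my own assumptions: Jaccard similarity over word 3-gram shingles is one common choice (the talk did not name a specific measure), and `STOPWORDS` and `naive_stem` are tiny stand-ins for a real stopword list and stemmer.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # illustrative only

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def shingles(text, n=3):
    """Set of word n-grams (shingles) in the text."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def naive_stem(token):
    # Crude suffix stripping, a stand-in for a real stemmer or lemmatizer.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_inverted_index(docs):
    """Map each stemmed, non-stopword term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token in STOPWORDS:
                continue
            index[naive_stem(token)].add(doc_id)
    return index

# Toy collection (made up for illustration).
docs = {
    1: "The crawler fetches pages from the web",
    2: "The crawler fetches pages from the open web",
    3: "Indexing turns pages into postings",
}
similarity = jaccard(shingles(docs[1]), shingles(docs[2]))
index = build_inverted_index(docs)
print(round(similarity, 3), sorted(index["page"]))
```

A crawler would flag documents 1 and 2 as near duplicates once the similarity passes some threshold; at web scale the pairwise comparison is usually replaced by hashing tricks such as MinHash, which is outside the scope of these notes.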
Query Understanding (online)
- The results of query understanding can be presented on the search engine results page (SERP); some context should be considered:
- Search goals? search tasks?
- Semantic topics?
- Time-sensitive? location-sensitive?
- Classifying queries by pre-defined intent is difficult (queries are short & ambiguous): click-through data & session data help.
- Intent Discovery (Non-predefined)
- Shifting intents: intents change with time (Radinsky, 2013)
- Learning to detect intent shifts (Lefortier, 2014)
- Queries whose intent shifts from non-fresh to fresh
- More clicks to some links?
- Diversity
- Extrinsic: query with uncertainty
- Intrinsic: diversity is part of info needs
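
To make the "more clicks to some links?" signal concrete, here is a minimal sketch that compares the click distribution of one query across two time windows and flags a shift when they differ a lot. This is only my own illustration, not the method from the cited work: the total variation distance, the `threshold` value and the toy URLs are all assumptions.

```python
from collections import Counter

def click_distribution(clicks):
    """Turn a list of clicked URLs into a {url: probability} mapping."""
    counts = Counter(clicks)
    total = sum(counts.values())
    return {url: n / total for url, n in counts.items()} if total else {}

def total_variation(p, q):
    """Total variation distance between two click distributions, in [0, 1]."""
    urls = set(p) | set(q)
    return 0.5 * sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in urls)

def intent_shifted(old_clicks, new_clicks, threshold=0.5):
    # threshold is a hypothetical tuning knob, not a value from the talk.
    distance = total_variation(click_distribution(old_clicks),
                               click_distribution(new_clicks))
    return distance >= threshold

# Toy click logs for the same query in two time windows (URLs are made up).
last_month = ["wiki.example/jaguar_animal"] * 8 + ["jaguar.example/cars"] * 2
this_week = ["news.example/jaguar-launch"] * 7 + ["jaguar.example/cars"] * 3
print(intent_shifted(last_month, this_week))
```

In this toy data most clicks move from an encyclopedia page to a news page, which is the kind of non-fresh to fresh shift described above.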
Ranker (learning to rank)
- content-based
- structure-based (title, content, tags, time)
- based on interaction behaviors (click through, scanning)
- docs represented by feature vector
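
The idea of representing documents by feature vectors can be sketched as follows: each (query, document) pair becomes a small vector with one content-based, one structure-based and one interaction-based feature, and a linear scorer orders the documents. The feature definitions, the dict layout of a document and the `WEIGHTS` values are all hypothetical; in actual learning to rank the weights would be fitted from labeled or click data rather than set by hand.

```python
def features(query, doc):
    """Map a (query, doc) pair to [body match, title match, click-through rate]."""
    terms = query.lower().split()
    body_tokens = doc["body"].lower().split()
    title_tokens = doc["title"].lower().split()
    # Content-based: fraction of body tokens that are query terms.
    body_match = sum(body_tokens.count(t) for t in terms) / max(len(body_tokens), 1)
    # Structure-based: fraction of query terms that appear in the title field.
    title_match = sum(t in title_tokens for t in terms) / max(len(terms), 1)
    # Interaction-based: click-through rate from the logs.
    ctr = doc["clicks"] / doc["impressions"] if doc["impressions"] else 0.0
    return [body_match, title_match, ctr]

def rank(query, docs, weights):
    """Sort docs by a linear score over their feature vectors, best first."""
    def score(doc):
        return sum(w * x for w, x in zip(weights, features(query, doc)))
    return sorted(docs, key=score, reverse=True)

WEIGHTS = [1.0, 2.0, 0.5]  # hypothetical; a learned ranker would fit these

# Toy documents (made up for illustration).
docs = [
    {"id": "d1", "title": "Cooking pasta", "body": "how to cook pasta at home",
     "clicks": 5, "impressions": 100},
    {"id": "d2", "title": "Neural ranking", "body": "neural ranking models for search",
     "clicks": 30, "impressions": 100},
    {"id": "d3", "title": "Search basics", "body": "ranking documents for a search query",
     "clicks": 10, "impressions": 100},
]
ranked = [d["id"] for d in rank("neural ranking", docs, WEIGHTS)]
print(ranked)
```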
Responsible IR
Privacy, Fairness, Accuracy, Transparency (let the system explain why)