This article reviews the image search ranking algorithms Pinterest experimented with, covering training data, user/image featurization, and ranking models, along with performance and quality evaluations.
Background:
1. Purpose: improve the image search quality by evolving ranking techniques
The first search system was built on top of Apache Lucene and Solr; results were ranked by text relevance scores between queries and the text descriptions of images. The search system subsequently evolved by adding various advancements that address unique challenges of Pinterest (problems encountered and solutions explored).
2. Three challenges:
The first challenge is understanding why users search for images on Pinterest. User intent can be made explicit through actions on pins such as 'try it', 'close up' and 'repin'. However, this raises a new problem: it is difficult to define a universal rule for weighting these actions against each other. Which matters more, 'try it' or 'close up'? Another challenge comes from the nature of images: an image may convey rich information that is difficult to extract and describe in a few words. The third is balancing the efficiency and effectiveness of ranking algorithms; although many algorithms have been deployed in industry, it is rarely known which one is best for a given application.
3. Three aspects:
data: combine the engagement training data & human curated relevance judgement data
Featurization: include feature engineering, word embedding and visual embedding, visual relevance signal, query intent understanding, user intent understanding etc.
Modeling: cascading core ranking component ( to achieve the balance between search latency and search quality)
Section 2: How to curate training data
Maximize both engagement & relevance.
engagement data (user behavioral metrics):
Click-through engagement logs have become the standard training data for learning-to-rank optimization in search engines, in the form <q, u, (P, T)>. Here T can cover multiple actions on pins, including click-through, repin, close-up, like, hide, comment, try-it, etc.
This raises a new challenge: how should we simultaneously combine and optimize multiple feedback signals?
Option 1: train one model per engagement type, then ensemble and tune the individual models. Many variants were tried, but no approach was found that avoids sacrificing some engagement type.
Option 2: ensemble at the data level. Let l(p | q, u), written l_p, denote the engagement-based quality label of pin p for user u under query q. For the pin set P recalled for the user query, a label set L can then be generated; for each pin in P, its l_p is a linear combination over the action set T.
Equation 1: T is the set of actions (repin, close-up, etc., as listed above), c_t is the count of action t, and w_t is the weight of action t, which is inversely proportional to the action's volume (e.g., the volume of close-ups is larger than that of clicks).
Equation 2 normalizes Equation 1, adjusting each pin's label by its position and age: age_p and pos_p are the age and position of pin p, τ is the normalized weight for the ages of pins, and λ is the parameter that controls the position decay.
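The label construction above can be sketched in a few lines. The exact functional forms of Equations 1 and 2 are not reproduced in these notes, so the age discount `tau ** age` and position decay `pos ** lam` below are illustrative choices consistent with the description (weights inversely proportional to action volume, labels decayed by age and position):

```python
def engagement_label(action_counts, action_volumes, age, pos, tau=0.9, lam=0.5):
    """Engagement-based quality label l_p for one pin (illustrative form).

    action_counts:  {action: count c_t for this pin}, e.g. {"repin": 3}
    action_volumes: {action: global volume of that action}; w_t is taken
                    as 1/volume so high-volume actions get lower weight.
    """
    # Eq. 1 (sketch): linear combination of action counts with w_t ∝ 1/volume
    raw = sum(c / action_volumes[t] for t, c in action_counts.items())
    # Eq. 2 (sketch): discount by pin age and position in the result list
    return raw * (tau ** age) / (pos ** lam)

label = engagement_label({"repin": 3, "click": 10},
                         {"repin": 1e6, "click": 5e6},
                         age=2, pos=4)
```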
Another challenge: the samples are skewed, with negatives outnumbering positives. The sample set is pruned by: 1. dropping query groups in which every l_p is below zero; 2. dropping query groups whose number of negative samples exceeds a threshold.
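The two pruning rules can be sketched as follows; the threshold value is a placeholder, not the paper's setting:

```python
def prune_query_groups(groups, neg_threshold=50):
    """groups: {query: [(pin, label), ...]}.

    Rule 1: drop query groups where every label is below zero.
    Rule 2: drop query groups with more negatives than the threshold.
    """
    kept = {}
    for q, pins in groups.items():
        if all(label < 0 for _, label in pins):            # rule 1
            continue
        negatives = sum(1 for _, label in pins if label < 0)
        if negatives > neg_threshold:                      # rule 2
            continue
        kept[q] = pins
    return kept
```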
The final training data has the form <q, u, (P, T, L)>.
human relevance data (human relevance judgment):
Although aggregating large numbers of individually unreliable user search sessions yields reliable engagement training data from implicit feedback, it also introduces bias from the current ranking function; position bias is one example. To correct the ranking bias, we also curate relevance judgment data from human experts, similar to a survey with manual labeling.
combining:
We simply consider each training data source independently and feed each into the ranking function to train a specific model.
Section 3: Feature representation for ranking
how we enhance traditional ranking features to address unique challenges
Beyond Text Relevance Feature:
Background: the text description of each pin is usually short and noisy.
1. Word-level relevance: each image is annotated with text annotations in n-gram form (unigrams, bigrams and trigrams) drawn from different sources such as pin texts, descriptions, text from the crawled linked web page, and automatically classified annotation labels. A text matching score is then computed with BM25 or proximity BM25.
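As a reference point, a minimal plain BM25 scorer over a pin's n-gram annotations might look like this (standard Okapi BM25 with default k1/b; the proximity variant and Pinterest's exact parameters are not shown):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 between query terms and one document's term list.

    doc_freq: {term: number of documents in the corpus containing term}.
    """
    score = 0.0
    dl = len(doc_terms)
    for t in query_terms:
        tf = doc_terms.count(t)            # term frequency in this document
        if tf == 0:
            continue
        df = doc_freq.get(t, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score
```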
2. Categoryboost: 32 L1 categories and 500 L2 categories.
3. Topicboost: topics are discovered by analyzing word distributions with statistical topic modeling (latent Dirichlet allocation).
4. Embedding features: describe the similarity between the user's query request and the pins in a latent representation space.
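Such an embedding feature typically reduces to a similarity between two vectors; a cosine similarity between a query embedding and a pin embedding is one common choice (the notes do not specify the exact similarity Pinterest uses):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```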
User Intent Features: help the learning algorithm learn which types of pins are "really" relevant and interesting to users.
1. Navboost: describes how well a pin performs in general and under specific queries and user segments, based on historical counts of close-up, click, long-click and repin, together with country, gender, aggregation time and other information.
2. Tokenboost: how well a pin performs in general and in the context of a specific token.
3. Gender Features:
4. Personalized Features:
Socialness, Visual and other Features
Image size, aspect ratio, image hash features, pin following, friends following.
Section 4: Cascading models
Three stages: light-weight stage, full-ranking stage, and re-ranking stage. As shown in Figure 4, candidates are cut from millions to tens of thousands to about 1000.
q: query, u: user, p: pin, x: feature vector for the tuple (q, u, p), l_p as in Section 2, y is the ground truth of l_p, s_p is the quality score of p given q and u, L is the loss function, S is the score function.
Table 1 lists the models used at each stage. Rule-based models are skipped since they are very intuitive.
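The cascade itself is simple: each stage scores the survivors of the previous one with a progressively more expensive model. A sketch with placeholder scoring functions (the stage cutoffs `k1`/`k2` are illustrative, matching the millions → tens of thousands → ~1000 funnel described above):

```python
def cascade_rank(candidates, light_score, full_score, rerank, k1=10000, k2=1000):
    """Cascading ranking sketch.

    light_score: cheap per-pin scorer used to prune the huge candidate set.
    full_score:  heavier scorer applied only to stage-1 survivors.
    rerank:      final pass over the top results (freshness, diversity, ...).
    """
    stage1 = sorted(candidates, key=light_score, reverse=True)[:k1]
    stage2 = sorted(stage1, key=full_score, reverse=True)[:k2]
    return rerank(stage2)
```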
light-weight stage:
filter out negative pins
full-ranking stage:
re-ranking stage:
Improves freshness, diversity, localness and language awareness.
models:
Gradient Boost Decision Tree (GBDT): uses mean squared loss as the training loss.
Deep Neural Network (DNN): learns to predict the quality score s_p, cast as a multi-class classification problem over the 4-scale label [1, 2, 3, 4]; l_p is mapped to y ∈ [1, 2, 3, 4], and the model is trained with cross-entropy loss.
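For one example with a 4-class ground-truth label, the cross-entropy loss reduces to the negative log probability assigned to the true class:

```python
import math

def cross_entropy(probs, label):
    """probs: predicted distribution over the 4-scale labels [1, 2, 3, 4].
    label: ground-truth class in {1, 2, 3, 4}. Returns -log p(label)."""
    return -math.log(probs[label - 1])
```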
Convolutional Neural Network (CNN): identical to the DNN except for the network structure.
RankNet: learns the correct ordering of pin pairs. The model learns a ranking function S(q, u, pi, pj, θ) that predicts the probability that s_pi > s_pj. Pairs are derived from the unnormalized labels: pi should rank before pj when l_pi > l_pj. The loss function is given in Equation 6.
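The standard RankNet loss is a pairwise cross entropy on the score difference; this sketch follows the usual formulation (sigmoid of s_i - s_j), which these notes' Equation 6 is presumed to match:

```python
import math

def ranknet_loss(s_i, s_j, p_target=1.0):
    """Pairwise cross-entropy loss for one pin pair.

    The model probability that pin i ranks above pin j is sigma(s_i - s_j);
    p_target is 1.0 when l_pi > l_pj (i should rank first).
    """
    p = 1.0 / (1.0 + math.exp(-(s_i - s_j)))
    return -(p_target * math.log(p) + (1 - p_target) * math.log(1 - p))
```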
RankSVM:?
Gradient Boost Ranking Tree (GBRT): a combination of RankSVM and GBDT. Each iteration learns a function for the probability that pi ranks before pj; as in GBDT, each ranker is a limited-depth regression tree.
Model blending:
The engagement data and human relevance data produce different models, which are combined by linear blending.
Models that are cheap to train are blended during training; the rest are blended after training.
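The post-training linear blend is a weighted sum of the two models' scores; the mixing weight below is an illustrative placeholder, not a value from the paper:

```python
def blend_scores(engagement_score, relevance_score, alpha=0.7):
    """Linear blend of the engagement-trained and human-relevance-trained
    model scores; alpha is an illustrative mixing weight in [0, 1]."""
    return alpha * engagement_score + (1 - alpha) * relevance_score
```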
Section 5:Experiments
Offline Experimental Setting:
For each country and language, human-relevance data: 5000 queries with user judgments on 400 pins per query; engagement data: a random 1% sample of the last 7 days of logs. 70% was used for training, 20% for testing, 10% for validation.
Offline Measurement Metrics: Figure 6 shows the distributions of some features. Offline evaluation uses NDCG.
Online Experimental Setting:
100 buckets
Online Measurement Metrics
Measured at the user level and the query level.
For query-level measurement metrics, repin per search (Qrepin), click per search (Qclick), close-up per search (Qclose up) and engagement per search (Qengaged) were the main metrics we used.
Performance Results
5.3.1 Lightweight Ranking Comparison.
Figure 7: on offline data, with the rule-based model as the baseline, RankSVM shows a large improvement, but online the RankSVM gain is small, a common offline/online gap; the gain is modest but latency is reduced. See Table 2.
5.3.2 Full Ranking Comparison
offline: compared against RankSVM, the ordering is CNN > GBRT > DNN > RankNet > GBDT.
online: neural network models introduce latency, so their scores are precomputed offline and added as features to the online tree models.
5.3.3 Re-ranking Comparison.
When the rule-based re-ranking model was replaced with GBRT, the repin rate and CTR of fresh pins increased by 20%.
Reference:
Normalized Discounted Cumulative Gain (NDCG)
With precision and recall, every item in the returned set carries the same weight: the item at position k counts the same as the item at position 1. That is not how users perceive results; in a search engine's ranked list, the top answers matter far more than those further down.
Normalized Discounted Cumulative Gain (NDCG) accounts for this. It builds on three progressively refined metrics: Cumulative Gain (CG), Discounted Cumulative Gain (DCG), and finally normalized DCG. CG sums the relevance scores of the top k returned items. DCG multiplies each item's score by a weight inversely related to its position: the earlier the position, the larger the weight. NDCG then normalizes the weighted sum (dividing the DCG by the DCG of the ideal ranking) so the result always falls between 0.0 and 1.0. The Wikipedia article gives the detailed formulas.
DCG and NDCG are widely used in information retrieval and in methods that care about the positions of returned items.
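The metric described above can be implemented directly; this uses the common formulation with a log2 position discount (some variants use gain 2^rel - 1 instead of raw relevance):

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k items: rel_i / log2(i + 1)
    with 1-based positions."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```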