Find median using mapreduce

今天在刷面經(jīng)的時(shí)候看到了這個(gè)小問(wèn)題:

Is there a fast algorithm to run on MapReduce framework to find the Median from a huge Integer set?

查閱一些資料后確定:

對(duì)于sort或find median這類問(wèn)題,所有數(shù)據(jù)最后都需要pass到1個(gè)reducer上。

思路一:
用兩個(gè)map reduce job完成:

  1. Calculate frequencies of values in your dataset (Word Count job, basically)
  2. Identity mapper + a reducer which calculates median based on < value - frequency> pairs

step1類似于wordcount,mapper將每個(gè)value變?yōu)?lt;value,1>的pair后,經(jīng)過(guò)shuffle,value相似的pair分配到同一個(gè)recuder上,reducer再統(tǒng)計(jì)該reducer上每個(gè)value出現(xiàn)的次數(shù)。

step1結(jié)束后可以,每個(gè)value都有一個(gè)<value,freq>pair,其中freq表示該value出現(xiàn)的次數(shù)。mapper將這些pair都放在一個(gè)reducer上后排序,排序完成即可找出median。

思路二:
若已知數(shù)據(jù)的rough distribution,還有一個(gè)可以scale當(dāng)前算法的改進(jìn)方法。

  1. Use a custom partitioner that distributes the keys by range buckets (0-99 go to reducer 0, 100-199 to reducer 2, and so on).
  2. This will however require some secondary job to examine the reducer outputs and perform the final median calculation (knowing for example the number of keys in each reducer, you can calculate which reducer output will contain the median, and at which offset)

根據(jù)已知的rough distribution,可以自定義一個(gè)custom partition,在shuffle階段將value分配到不同reducer,之后根據(jù)每個(gè)reducer output的size(即有多少key在其中),確定median在哪個(gè)reducer的output里以及median在reducer中的offset。

參考:
http://stackoverflow.com/questions/6968215/find-median-of-a-large-integer-set-using-mapreduce
http://stackoverflow.com/questions/10109514/computing-median-in-map-reduce

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容