今天在刷面經(jīng)的時(shí)候看到了這個(gè)小問(wèn)題:
Is there a fast algorithm to run on MapReduce framework to find the Median from a huge Integer set?
查閱一些資料后確定:
對(duì)于sort或find median這類問(wèn)題,所有數(shù)據(jù)最后都需要pass到1個(gè)reducer上。
思路一:
用兩個(gè)map reduce job完成:
- Calculate frequencies of values in your dataset (Word Count job, basically)
- Identity mapper + a reducer which calculates median based on < value - frequency> pairs
step1類似于wordcount,mapper將每個(gè)value變?yōu)?lt;value,1>的pair后,經(jīng)過(guò)shuffle,value相似的pair分配到同一個(gè)recuder上,reducer再統(tǒng)計(jì)該reducer上每個(gè)value出現(xiàn)的次數(shù)。
step1結(jié)束后可以,每個(gè)value都有一個(gè)<value,freq>pair,其中freq表示該value出現(xiàn)的次數(shù)。mapper將這些pair都放在一個(gè)reducer上后排序,排序完成即可找出median。
思路二:
若已知數(shù)據(jù)的rough distribution,還有一個(gè)可以scale當(dāng)前算法的改進(jìn)方法。
- Use a custom partitioner that distributes the keys by range buckets (0-99 go to reducer 0, 100-199 to reducer 2, and so on).
- This will however require some secondary job to examine the reducer outputs and perform the final median calculation (knowing for example the number of keys in each reducer, you can calculate which reducer output will contain the median, and at which offset)
根據(jù)已知的rough distribution,可以自定義一個(gè)custom partition,在shuffle階段將value分配到不同reducer,之后根據(jù)每個(gè)reducer output的size(即有多少key在其中),確定median在哪個(gè)reducer的output里以及median在reducer中的offset。
參考:
http://stackoverflow.com/questions/6968215/find-median-of-a-large-integer-set-using-mapreduce
http://stackoverflow.com/questions/10109514/computing-median-in-map-reduce