蜜桃激情精品视频,99热精品大香蕉伊人

備份自：http://blog.rainy.im/2016/03/13/python-on-hadoop-mapreduce/

繼上一篇Hadoop 入門實(shí)踐之后，接下來應(yīng)該是 MapReduce 的原理與實(shí)踐操作。

MapReduce 原理

Hadoop 的 MapReduce 是基于 Google - MapReduce: Simplified Data Processing on Large Clusters 的一種實(shí)現(xiàn)。對 MapReduce 的基本介紹如下：

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

MapReduce 是一種編程模型，用于處理大規(guī)模的數(shù)據(jù)。用戶主要通過指定一個 map 函數(shù)和一個 reduce 函數(shù)來完成數(shù)據(jù)的處理。看到 map/reduce 很容易就聯(lián)想到函數(shù)式編程，而實(shí)際上論文中也提到確實(shí)受到 Lisp 和其它函數(shù)式編程語言的啟發(fā)。以 Python 為例，map/reduce 的用法如下：

from functools import reduce
from operator import add
ls = map(lambda x: len(x), ["ana", "bob", "catty", "dogge"])
# print(list(ls))
# => [3, 3, 5, 5]
reduce(add, ls)
# => 16

MapReduce 的優(yōu)勢在于對大規(guī)模數(shù)據(jù)進(jìn)行切分（split），并在分布式集群上分別運(yùn)行 map/reduce 并行加工，而用戶只需要針對數(shù)據(jù)處理邏輯編寫簡單的 map/reduce 函數(shù)，MapReduce 則負(fù)責(zé)保證分布式運(yùn)行和容錯機(jī)制。Hadoop 的 MapReduce 雖然由 Java 實(shí)現(xiàn)，但同時提供 Streaming API 可以通過標(biāo)準(zhǔn)化輸入/輸出允許我們使用任何編程語言來實(shí)現(xiàn) map/reduce。

MapReduce 在處理數(shù)據(jù)時，首先生成一個 job 將輸入文件切分成獨(dú)立的塊（chunk），切塊的大小是根據(jù)配置設(shè)定的。然后每個獨(dú)立的文件塊交給 map task 并行加工，得到一組 <k1, v1> 列表，MapReduce 再將 map 輸出的結(jié)果按 k1 進(jìn)行重新組合，再將結(jié)果傳遞給 reduce task，最后 reduce 計(jì)算得出結(jié)果。

以官方提供的 WordCount 為例，輸入為兩個文件：

hadoop fs -cat file0
# Hello World Bye World

hadoop fs -cat file1
# Hello Hadoop Goodbye Hadoop

利用 MapReduce 來計(jì)算所有文件中單詞出現(xiàn)數(shù)量的統(tǒng)計(jì)。MapReduce 的運(yùn)行過程如下圖所示：

MapReduce

Python `map`/`reduce`

Hadoop 的 Streaming API 通過 STDIN/STDOUT 傳遞數(shù)據(jù)，因此 Python 版本的 map 可以寫作：

#!/usr/bin/env python3
import sys

def read_inputs(file):
  for line in file:
    line = line.strip()
    yield line.split()
def main():
  file = sys.stdin
  lines = read_inputs(file)
  for words in lines:
    for word in words:
      print("{}\t{}".format(word, 1))
if __name__ == "__main__":
  main()

運(yùn)行一下：

chmod +x map.py
echo "Hello World Bye World" | ./map.py

# Hello   1
# World   1
# Bye     1
# World   1

reduce 函數(shù)以此讀取經(jīng)過排序之后的 map 函數(shù)的輸出，并統(tǒng)計(jì)單詞的次數(shù)：

#!/usr/bin/env python3
import sys

def read_map_outputs(file):
  for line in file:
    yield line.strip().split("\t", 1)
def main():
  current_word = None
  word_count   = 0
  lines = read_map_outputs(sys.stdin)
  for word, count in lines:
    try:
      count = int(count)
    except ValueError:
      continue
    if current_word == word:
      word_count += count
    else:
      if current_word:
        print("{}\t{}".format(current_word, word_count))
      current_word = word
      word_count = count
  if current_word:
    print("{}\t{}".format(current_word, word_count))
if __name__ == "__main__":
  main()

reduce 的輸入是排序后的 map 輸出：

chmod +x reduce.py
echo "Hello World Bye World" | ./map.py | sort | ./reduce.py

# Bye     1
# Hello   1
# World   2

這其實(shí)與 MapReduce 的執(zhí)行流程是一致的，下面我們通過 MapReduce 來執(zhí)行（已啟動 Hadoop），需要用到 hadoop-streaming-2.6.4.jar，不同的 Hadoop 版本位置可能不同：

cd $HADOOP_INSTALL && find ./ -name "hadoop-streaming*.jar"
# ./share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar
 
mkdir wordcount -p wordcount/input
cd wordcount
echo "Hello World Bye World" >> input/file0
echo "Hello Hadoop Goodbye Hadoop" >> input/file1

hadoop jar $HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar \
-input $(pwd)/input \
-output output \
-mapper $(pwd)/map.py \
-reducer $(pwd)/reduce.py

執(zhí)行完成之后會在 output 目錄產(chǎn)生結(jié)果：

hadoop fs -ls output
# Found 2 items
# -rw-r--r--   1 rainy rainy          0 2016-03-13 02:15 output/_SUCCESS
# -rw-r--r--   1 rainy rainy         41 2016-03-13 02:15 output/part-00000
hadoop fs -cat output/part-00000
# Bye     1
# Goodbye 1
# Hadoop  2
# Hello   2
# World   2

總結(jié)

Hadoop 的架構(gòu)讓 MapReduce 的實(shí)際執(zhí)行過程簡化了許多，但這里省略了很多細(xì)節(jié)的內(nèi)容，尤其是針對完全分布式模式，并且要在輸入文件足夠大的情況下才能體現(xiàn)出優(yōu)勢。這里處理純文本文檔作為示例，但我想要做的是通過連接 MongoDB 直接讀取數(shù)據(jù)到 HDFS 然后進(jìn)行 MapReduce 處理，但考慮到數(shù)據(jù)量仍然不是很大（700,000條記錄）的情況，不知道是否會比直接 Python + MongoDB 更快。

下一步目標(biāo)：

MongoDB and Hadoop。

歡迎關(guān)注公眾號 PyHub！

參考

MapReduce
MapReduce Tutorial
Writing an Hadoop MapReduce Program in Python

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

MapReduce 原理與 Python 實(shí)踐

MapReduce 原理與 Python 實(shí)踐

MapReduce 原理

Python `map`/`reduce`

總結(jié)

參考

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

MapReduce 原理與 Python 實(shí)踐

MapReduce 原理

Python map/reduce

總結(jié)

參考

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

Python `map`/`reduce`