Big Data Series (1): A First Look at Hadoop

Study plan

  • Big Data Specialization from the University of California, San Diego
  • Hadoop: The Definitive Guide

This post

  • Hadoop Platform and Application Framework, Week 1: Hadoop Basics
  • Hadoop: The Definitive Guide, Chapter 1: Meet Hadoop

What is Hadoop?

Apache Hadoop is an open source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.

What are the basic modules of the Hadoop framework?

  • Hadoop Common: contains the libraries and utilities needed by the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
    • Very large files: files that are gigabytes, terabytes, or petabytes in size.
    • Streaming data access: write once, read many times.
    • Commodity hardware: HDFS does not need to run on highly reliable hardware, so the cost is low but the node failure rate is high.
  • Hadoop YARN (Yet Another Resource Negotiator): a resource management platform responsible for managing compute resources in the cluster and using them to schedule users and applications. The fundamental idea behind YARN (MapReduce 2.0) is to split the two major functions of the job tracker, resource management and job scheduling/monitoring, into two separate units.
  • Hadoop MapReduce: a programming model for data processing.
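To make the MapReduce programming model a bit more concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop at all: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are my own assumptions for illustration, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework on a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine every value seen for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a file stored in HDFS.
    sample = ["hadoop stores data", "hadoop processes data"]
    print(word_count(sample))
```

The point of the split is that the programmer only writes the map and reduce functions; grouping by key and distributing the work are left to the framework.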

What are the main components of the Hadoop ecosystem?

[Figure: Apache Hadoop ecosystem]
  • Apache Sqoop: a tool for efficiently moving data between relational databases and HDFS.
  • Apache HBase: a distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computation using MapReduce and point queries (random reads).
  • Apache Pig: a scripting language for exploring very large data sets. It has two parts: Pig Latin, which describes data flows, and the execution environment that runs Pig Latin programs.
  • Apache Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs, for querying the data.
  • Apache Oozie: a workflow scheduler system that manages Apache Hadoop jobs.
  • Apache Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data.
  • Apache ZooKeeper: provides a distributed configuration service, a synchronization service that keeps these jobs in sync, and a naming registry for the entire distributed system.
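The HBase entry above mentions two access patterns: point queries (random reads) and batch-style computation. The toy sketch below contrasts the two on an in-memory table. It is not the HBase API; the `Table` type, the function names, the `info:city` style column names, and the sample rows are all assumptions made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical in-memory table: row key -> {column name -> value}.
Table = Dict[str, Dict[str, str]]


def point_query(table: Table, row_key: str, column: str) -> Optional[str]:
    """Random read: fetch one cell by row key without touching other rows."""
    return table.get(row_key, {}).get(column)


def batch_count(table: Table, column: str) -> Dict[str, int]:
    """Batch-style pass: visit every row and count the values in one column."""
    counts: Dict[str, int] = {}
    for row in table.values():
        value = row.get(column)
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
    return counts


# Made-up sample data; rows may have different sets of columns.
users: Table = {
    "user1": {"info:city": "Beijing", "info:lang": "zh"},
    "user2": {"info:city": "San Diego", "info:lang": "en"},
    "user3": {"info:city": "Beijing"},
}

if __name__ == "__main__":
    print(point_query(users, "user2", "info:city"))
    print(batch_count(users, "info:city"))
```

A point query touches a single row, while the batch pass reads the whole table, which is the kind of work that would be handed to MapReduce on a real cluster.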