2018-01-09 Hadoop Platform and Application Framework -- Lesson 3 Hadoop Basic Modules Introduction

Overview of Hadoop Stack

HDFS holds the data. YARN is the resource manager. MapReduce is one option for the execution engine and Spark is another. Tez is also an option in Hadoop 2.0. Applications are layered on top of these.

HBase - a scalable distributed database that supports structured data storage for large tables

Hive - a data warehouse infrastructure that provides data summarization and ad hoc querying

Pig - a high-level data-flow language and execution framework for parallel computation

Spark - a fast and general compute engine for Hadoop data. Wide range of applications: ETL, machine learning, stream processing, and graph analytics.


Cloudera Setup:


HDFS and HDFS2

Concept:

    Scalable distributed filesystem

    Distributes data on local disks on several nodes

    Low-cost commodity hardware

Design goals:

    Resilience - recover when nodes or components of nodes fail

    Scalability - spread the data as blocks across many nodes; scale namespace capacity

    Application locality - the data scales but the application does not, so computation is moved to the data: each compute task runs on a node that already holds the data it needs

    Portability - runs on widely available commodity hardware and across OS types with little change needed
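The locality goal can be illustrated with a small sketch. It is not Hadoop's scheduler; the block map, node names, and slot counts are made up for illustration. Each task is assigned to a node that already stores its block when one has a free slot, and falls back to any free node otherwise.

```python
# Hypothetical block-to-node map: which nodes hold a replica of each block.
BLOCK_LOCATIONS = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-c", "node-d"],
    "blk_3": ["node-a", "node-d", "node-e"],
}


def pick_node(block_id, free_slots):
    """Prefer a node that already stores the block; otherwise use any free node."""
    for node in BLOCK_LOCATIONS[block_id]:
        if free_slots.get(node, 0) > 0:
            return node, True   # data-local task
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False  # the block must be read over the network
    return None, False          # no capacity anywhere


def schedule(blocks, free_slots):
    """Assign one task per block and return {block_id: (node, is_local)}."""
    free_slots = dict(free_slots)  # work on a copy, leave the caller's dict alone
    plan = {}
    for block_id in blocks:
        node, is_local = pick_node(block_id, free_slots)
        if node is not None:
            free_slots[node] -= 1
        plan[block_id] = (node, is_local)
    return plan


plan = schedule(["blk_1", "blk_2", "blk_3"],
                {"node-a": 1, "node-b": 1, "node-e": 1})
for block_id, (node, is_local) in plan.items():
    print(block_id, node, "local" if is_local else "remote")
```

With three replicas per block, a data-local node is usually available, which is why replication helps locality as well as resilience.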

Architecture:

    Single NameNode - holds the metadata

        Metadata is information about filesystem state, block information, edit and transaction information, and locks

    Multiple DataNodes - data is spread as blocks across many nodes

        Manage storage - blocks of data (downward)

        Serve read/write requests from clients (upward)

        Block creation, deletion, and replication (horizontally) - the default replication factor is 3
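A rough sketch of how a file becomes replicated blocks. Two assumptions are not from these notes: the 128 MB block size (the Hadoop 2.x default) and the round-robin placement, which is a simplification (real HDFS placement is rack-aware). The function names and DataNode names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumption: 128 MB, the Hadoop 2.x default
REPLICATION = 3                 # default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement of each block's replicas on distinct DataNodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)  # two full blocks plus a 44 MB tail
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block, replicas in placement.items():
    print(f"block {block}: {blocks[block] // MB} MB on {replicas}")
```

In this picture the NameNode would own the `placement` map (metadata), while the DataNodes hold the block contents.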


    From Hadoop 2.0 (Federation):

    Multiple NameNodes instead of a single one. Multiple namespaces provide scalability. Each namespace has its own block pool, which holds the blocks that belong to that namespace. Block pools are spread out over all DataNodes.

    Standby NameNode - keeps a copy of the namespace state, but failover is handled manually by default.

    Heterogeneous storage - ARCHIVE, SSD, RAM_DISK



MapReduce Framework

Basic idea: (1) A job splits the input data into chunks, and the MapReduce framework schedules map tasks on (2) the compute nodes to process the chunks. Once the chunks have been processed, the framework sorts the map output. Reduce tasks take the sorted map output as input and perform reduction operations.

Typically, compute and data nodes are the same, so MapReduce tasks and HDFS are running on the same nodes.
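The map, sort, and reduce steps can be imitated on a single machine. This is only an in-memory sketch of the data flow, not Hadoop's API; the function names and the word-count job are chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_phase(word, counts):
    """Reduce task: combine all counts that share the same key."""
    return word, sum(counts)


def run_job(chunks):
    # Each chunk would be processed by a map task on a different node.
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    # The framework sorts the map output by key before the reduce step.
    mapped.sort(key=itemgetter(0))
    # Each group of equal keys is handed to a reduce call.
    return dict(
        reduce_phase(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    )


result = run_job([
    "HDFS holds data",
    "YARN manages resources",
    "Spark reads HDFS data",
])
print(result)
```

The sort matters: `groupby` only merges adjacent equal keys, just as a reducer relies on receiving all values for one key together.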

Before YARN was born in Hadoop 2.0:

Single master JobTracker (1) - schedules and monitors tasks and re-executes failed ones. It is the main daemon in Hadoop. It initiates TaskTrackers on the slave nodes (compute nodes/data nodes).

One slave TaskTracker per cluster node (2) - executes tasks from JobTracker requests (with HDFS handler).


YARN

Split out of MapReduce. Main idea: separate resource management from job scheduling/monitoring.

Overall coordination -- ResourceManager: runs on the master node, receives job requests from clients, receives node status from the NodeManagers about what resources are available, and receives application status from the ApplicationMasters. Its scheduler (for example the Capacity Scheduler or the Fair Scheduler) chooses containers and allocates resources to jobs based on capacity and queues.

Resource management part -- NodeManager: one on each node. It launches and monitors the containers on its node and reports their resource usage to the ResourceManager.

Job scheduling/monitoring part -- ApplicationMaster: one for each application, running on one of the nodes. Together, these components split up the work of the original single JobTracker.

So YARN takes over part (1) of MapReduce, but it schedules jobs at a finer granularity, at the container level.
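A toy model of the three roles may help. The class names mirror the YARN components, but the methods, node names, and the first-fit allocation rule are assumptions made for this sketch; the real ResourceManager delegates allocation to a pluggable scheduler such as the Capacity or Fair scheduler.

```python
class NodeManager:
    """Tracks the free resources of one node and launches containers on it."""

    def __init__(self, name, memory_mb, vcores):
        self.name = name
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores

    def can_fit(self, memory_mb, vcores):
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores

    def launch_container(self, memory_mb, vcores):
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores


class ResourceManager:
    """Cluster-wide view: grants containers on nodes that have capacity."""

    def __init__(self, node_managers):
        self.node_managers = list(node_managers)

    def allocate(self, memory_mb, vcores):
        # First-fit is a stand-in for a real scheduling policy.
        for nm in self.node_managers:
            if nm.can_fit(memory_mb, vcores):
                nm.launch_container(memory_mb, vcores)
                return nm.name
        return None


class ApplicationMaster:
    """Per-application: asks the ResourceManager for the containers it needs."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.resource_manager = resource_manager

    def request_containers(self, count, memory_mb, vcores):
        granted = []
        for _ in range(count):
            node = self.resource_manager.allocate(memory_mb, vcores)
            if node is None:
                break  # cluster is full; a real AM would wait and retry
            granted.append(node)
        return granted


rm = ResourceManager([NodeManager("node-1", 4096, 2), NodeManager("node-2", 2048, 2)])
am = ApplicationMaster("app-1", rm)
granted = am.request_containers(count=4, memory_mb=2048, vcores=1)
print(granted)
```

Only three of the four requested containers fit, which shows why resource tracking and per-application scheduling are easier to reason about as separate components.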

YARN has features below also:

    High-availability ResourceManager in the newest Hadoop release - one standby RM.

    Timeline server - tracks stored application history, such as how many map and reduce tasks have run and how many resources were used.

    cgroups - manage the resources used by containers; YARN also supports secure containers restricted to particular users.

    RESTful API providing web services for cluster access.
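As a sketch of the REST access, the ResourceManager exposes cluster information at `/ws/v1/cluster/info` on its web port (8088 by default). The host name below is a placeholder, the helper names are hypothetical, and the sample response body is invented to show the shape of the JSON, so treat the field values as assumptions.

```python
import json
from urllib.request import urlopen


def cluster_info_url(rm_host, rm_port=8088):
    """Build the ResourceManager cluster-info URL (8088 is the default web port)."""
    return f"http://{rm_host}:{rm_port}/ws/v1/cluster/info"


def parse_cluster_info(body):
    """Extract the clusterInfo object from a JSON response body."""
    return json.loads(body)["clusterInfo"]


def fetch_cluster_info(rm_host, rm_port=8088, timeout=5):
    """Query a live ResourceManager; requires network access to the cluster."""
    with urlopen(cluster_info_url(rm_host, rm_port), timeout=timeout) as response:
        return parse_cluster_info(response.read())


# Invented sample body, used here so the example runs without a cluster.
sample_body = '{"clusterInfo": {"state": "STARTED", "haState": "ACTIVE"}}'
info = parse_cluster_info(sample_body)
print(cluster_info_url("rm.example.com"), info["state"])
```

Against a real cluster, `fetch_cluster_info("<your ResourceManager host>")` would replace the sample body.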



Lesson 3 Slides
