012 What is Hadoop Cluster? Learn to Build a Cluster in Hadoop
In this blog, we will get familiar with the Hadoop cluster, the heart of the Hadoop framework. First, we will talk about what a Hadoop cluster is. Then we will look at its basic architecture and the protocols it uses for communication. Finally, we will discuss the benefits that a Hadoop cluster provides.
So, let us begin our journey of Hadoop Cluster.
1. What is Hadoop Cluster?
A Hadoop cluster is a group of computers connected over a LAN that we use to store and process large data sets. It consists of many commodity machines (the slaves) that communicate with a higher-end machine acting as the master. Together, the master and slaves implement distributed computing over distributed data storage, running open source software that provides the distributed functionality.
2. What is the Basic Architecture of Hadoop Cluster?
A Hadoop cluster has a master-slave architecture.
i. Master in Hadoop Cluster
The master is a machine with a good configuration of memory and CPU. Two daemons run on the master: NameNode and ResourceManager.
a. Functions of NameNode
Manages file system namespace
Regulates access to files by clients
Stores metadata about the actual data, for example the file path, number of blocks, block IDs, and the location of blocks
Executes file system namespace operations like opening, closing, and renaming files and directories
The NameNode keeps this metadata in memory for fast retrieval. Hence we should run it on a high-end machine with plenty of RAM.
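To make the idea of "metadata in memory" concrete, here is a minimal Python sketch of the kind of bookkeeping described above. It is only an illustration: the class and method names (`NamespaceSketch`, `add_file`, `locate`, `rename`) are hypothetical and are not part of Hadoop's API.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    """Metadata kept for one file: its path and block ids, never its contents."""
    path: str
    block_ids: list = field(default_factory=list)


class NamespaceSketch:
    """Toy stand-in for the NameNode's in-memory namespace (hypothetical)."""

    def __init__(self):
        self.files = {}            # path -> FileMeta
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, path, blocks):
        """Register a file; `blocks` maps block id -> DataNode names holding it."""
        self.files[path] = FileMeta(path, list(blocks))
        for block_id, nodes in blocks.items():
            self.block_locations[block_id] = set(nodes)

    def locate(self, path):
        """Answer a client's question: which DataNodes hold each block?"""
        meta = self.files[path]
        return [(b, sorted(self.block_locations[b])) for b in meta.block_ids]

    def rename(self, old, new):
        """A namespace operation: only metadata changes, no block is moved."""
        meta = self.files.pop(old)
        meta.path = new
        self.files[new] = meta
```

Every lookup is a dictionary access in RAM, which is why the amount of memory on the master limits how many files and blocks the cluster can track.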
b. Functions of Resource Manager
Arbitrates resources among the competing applications in the cluster
Keeps track of live and dead nodes
ii. Slaves in the Hadoop Cluster
A slave is a machine with a normal configuration. Two daemons run on each slave machine: DataNode and NodeManager.
a. Functions of DataNode
It stores the business data.
It performs read, write, and data processing operations.
Upon instruction from the master, it creates, deletes, and replicates data blocks.
b. Functions of NodeManager
It runs services on the node to check its health and reports the results to the ResourceManager.
We can easily scale a Hadoop cluster by adding more nodes to it, which is why we call it linearly scalable. Each node added increases the throughput of the cluster.
Client nodes in Hadoop cluster: We install Hadoop and configure it on client nodes.
c. Functions of the client node
Loads data onto the Hadoop cluster.
Tells the cluster how to process the data by submitting a MapReduce job.
Collects the output from a specified location.
3. Single Node Cluster VS Multi-Node Cluster
As the name suggests, a single-node cluster is deployed on a single machine, and a multi-node cluster is deployed on several machines.
In a single-node Hadoop cluster, all the daemons, such as NameNode and DataNode, run on the same machine. In Hadoop's default standalone (non-distributed) mode, everything runs as a single Java process, so the user needs almost no configuration beyond setting the JAVA_HOME variable. In pseudo-distributed mode, each daemon runs in its own Java process on that one machine. The replication factor for a single-node Hadoop cluster is set to one, because only one DataNode is available to hold each block.
In a multi-node Hadoop cluster, the daemons run on separate hosts. A multi-node Hadoop cluster has a master-slave architecture: slave daemons like DataNode and NodeManager run on cheap machines, while master daemons like NameNode and ResourceManager run on powerful servers. The slave machines can be in any location, irrespective of the physical location of the master server.
4. Communication Protocols Used in Hadoop Clusters
The HDFS communication protocols are layered on top of TCP/IP. A client establishes a connection to a configurable TCP port on the NameNode machine and talks to the NameNode using the Client Protocol. DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. The NameNode never initiates an RPC; it only responds to RPC requests issued by DataNodes or clients.
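The "NameNode never initiates an RPC" rule can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Hadoop code: `NameNodeSketch`, `DataNodeSketch`, and the command strings are hypothetical, and a plain method call stands in for the RPC layer. The point it illustrates is that the master can only hand instructions to a DataNode as the reply to a call that the DataNode itself made.

```python
class NameNodeSketch:
    """Master side: only answers calls, never contacts a DataNode itself."""

    def __init__(self):
        self.pending = {}  # DataNode id -> commands waiting for its next call
        self.seen = set()  # DataNodes that have reported in

    def queue_command(self, datanode_id, command):
        """Remember an instruction until the DataNode next calls in."""
        self.pending.setdefault(datanode_id, []).append(command)

    def heartbeat(self, datanode_id):
        """RPC handler: invoked by a DataNode; the reply carries its commands."""
        self.seen.add(datanode_id)
        return self.pending.pop(datanode_id, [])


class DataNodeSketch:
    """Slave side: initiates every exchange with the master."""

    def __init__(self, datanode_id, namenode):
        self.datanode_id = datanode_id
        self.namenode = namenode
        self.executed = []

    def send_heartbeat(self):
        """Call the master and carry out whatever instructions come back."""
        for command in self.namenode.heartbeat(self.datanode_id):
            self.executed.append(command)
```

In this model a queued command such as "replicate blk_7 to dn2" sits on the master until `send_heartbeat()` is called, which mirrors the request/response direction described above.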
5. How to Build a Cluster in Hadoop
Building a Hadoop cluster is a non-trivial job. Ultimately, the performance of our system will depend on how we have configured the cluster. In this section, we will discuss the parameters one should take into consideration while setting up a Hadoop cluster.
To choose the right hardware, one must consider the following points:
The kind of workloads the cluster will be dealing with: the volume of data the cluster needs to handle, and the kind of processing required (CPU bound, I/O bound, etc.).
The data storage methodology, such as the data compression technique used, if any.
The data retention policy, such as how frequently we need to flush old data.
Sizing the Hadoop Cluster
To determine the size of a Hadoop cluster, we need to look at how much data is in hand and at how much data is generated daily. Based on these factors, we can decide the number of machines required and their configuration. The hardware chosen should strike a balance between performance and cost.
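As a rough illustration of this sizing arithmetic, the Python sketch below estimates a node count from the data in hand and the daily growth. All parameters and defaults here (`retention_days`, `replication=3`, `usable_tb_per_node=24.0`, `headroom=0.25`) are assumptions chosen for the example, not recommendations from the original article; plug in your own figures.

```python
import math


def nodes_needed(current_tb, daily_tb, retention_days,
                 replication=3, usable_tb_per_node=24.0, headroom=0.25):
    """Estimate how many DataNodes are needed to hold the data.

    current_tb          data already in hand, in TB
    daily_tb            data generated per day, in TB
    retention_days      how many days of new data must be kept
    replication         HDFS replication factor (assumed 3 here)
    usable_tb_per_node  disk capacity per node available to HDFS (assumed)
    headroom            fraction of raw capacity kept free (assumed)
    """
    logical_tb = current_tb + daily_tb * retention_days
    raw_tb = logical_tb * replication                 # every block is stored N times
    raw_tb_with_headroom = raw_tb / (1.0 - headroom)  # leave free space on the disks
    return math.ceil(raw_tb_with_headroom / usable_tb_per_node)


# 100 TB in hand plus 1 TB/day kept for a year:
# (100 + 365) * 3 / 0.75 / 24 = 77.5, rounded up to 78 nodes.
print(nodes_needed(current_tb=100, daily_tb=1, retention_days=365))
```

This only covers storage. The workload type (CPU bound or I/O bound) and compression, discussed above, shift the answer in either direction.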
Configuring Hadoop Cluster
To decide the configuration of a Hadoop cluster, run typical Hadoop jobs on the default configuration to get a baseline. We can analyze the job history log files to check whether a job takes more time than expected; if so, change the configuration. Repeat the process to fine-tune the cluster configuration until it meets the business requirement. The performance of the cluster depends greatly on the resources allocated to the daemons. As a rule of thumb, allocate one CPU core to each DataNode for small to medium data volumes, and two CPU cores to the HDFS daemons for large data sets.
6. Hadoop Cluster Management
When you deploy your Hadoop cluster in production, it will need to scale along all dimensions: volume, velocity, and variety. To be production-ready, it must be robust, available around the clock, performant, and manageable. Hadoop cluster management is therefore a central aspect of your big data initiative.
A good cluster management tool should have the following features:
- It should provide diverse workload management, security, resource provisioning, performance optimization, and health monitoring. It also needs to provide policy management, job scheduling, and backup and recovery across one or more nodes.
- It should implement NameNode high availability with load balancing, auto-failover, and hot standbys.
- It should enable policy-based controls that prevent any application from consuming more than its share of resources.
- It should manage the deployment of any layers of software over Hadoop clusters by performing regression testing. This makes sure that jobs and data won't crash or encounter bottlenecks in daily operations.
7. Benefits of Hadoop Clusters
Here is a list of benefits provided by clusters in Hadoop:
Robustness
Data disk failure, heartbeats, and re-replication
Cluster rebalancing
Data integrity
Metadata disk failure
Snapshot
i. Robustness
The main objective of Hadoop is to store data reliably even in the event of failures. The common kinds of failure are NameNode failure, DataNode failure, and network partition. Each DataNode periodically sends a heartbeat signal to the NameNode. In a network partition, a set of DataNodes gets disconnected from the NameNode, so the NameNode stops receiving heartbeats from them. It marks these DataNodes as dead and does not forward any I/O requests to them. The replication factor of the blocks stored on these DataNodes then falls below the specified value, so the NameNode initiates replication of these blocks. In this way, the cluster recovers from the failure.
ii. Data Disks Failure, Heartbeats, and Re-replication
The NameNode receives a heartbeat from each DataNode. It may stop receiving heartbeats from some DataNodes for reasons such as a network partition. In that case, it marks those nodes as dead. This lowers the replication factor of the blocks stored on the dead nodes, so the NameNode initiates replication of those blocks, which makes the cluster fault tolerant.
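The heartbeat and re-replication logic can be sketched as a small simulation. This is a simplified model, not the NameNode's real implementation: `ReplicationMonitor`, its methods, and the 30-second `timeout` are hypothetical choices for the example, and "copying" a block is reduced to adding a node name to a set.

```python
class ReplicationMonitor:
    """Toy model of dead-node detection and re-replication (hypothetical)."""

    def __init__(self, replication=3, timeout=30.0):
        self.replication = replication  # desired number of replicas per block
        self.timeout = timeout          # seconds of silence before a node is "dead"
        self.last_heartbeat = {}        # DataNode -> time of its last heartbeat
        self.block_replicas = {}        # block id -> set of DataNodes holding it

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_heartbeat[node] = now

    def add_block(self, block_id, nodes):
        """Record which DataNodes hold a replica of the block."""
        self.block_replicas[block_id] = set(nodes)

    def live_nodes(self, now):
        """Nodes whose last heartbeat is recent enough."""
        return {n for n, t in self.last_heartbeat.items()
                if now - t <= self.timeout}

    def re_replicate(self, now):
        """Drop replicas on dead nodes and schedule copies onto live ones."""
        live = self.live_nodes(now)
        copies_made = {}
        for block_id in self.block_replicas:
            holders = self.block_replicas[block_id] & live
            if holders:  # a live replica is needed as the copy source
                missing = max(self.replication - len(holders), 0)
                targets = sorted(live - holders)[:missing]
                holders = holders | set(targets)
                if targets:
                    copies_made[block_id] = targets
            self.block_replicas[block_id] = holders
        return copies_made
```

For example, with a replication factor of 2 and a block on dn1 and dn2, once dn1 stops sending heartbeats the monitor schedules a copy onto another live node so the block is back at two replicas.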
iii. Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. For example, if the free space on a DataNode falls below a threshold level, a scheme can move some of its data to another DataNode where enough space is available.
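One way to picture such a scheme is the Python sketch below. It is a toy model and does not claim to reproduce the algorithm of any HDFS tool: the function `rebalance`, the `threshold` of 10 percentage points, and the unit `block_size` are assumptions made for the illustration. It repeatedly moves one block from the fullest node to the emptiest node until no node is more than `threshold` above the cluster-wide average utilization.

```python
def rebalance(used, capacity, threshold=0.10, block_size=1):
    """Move blocks from over-utilized nodes to the least-utilized node.

    used      node -> space used (mutated in place)
    capacity  node -> total space
    Returns the list of (source, target) moves, one block per move.
    """
    average = sum(used.values()) / sum(capacity.values())
    limit = min(1.0, average + threshold)  # highest utilization we tolerate

    def utilization(node):
        return used[node] / capacity[node]

    moves = []
    while True:
        source = max(used, key=utilization)
        target = min(used, key=utilization)
        if utilization(source) <= limit:
            break  # every node is within the tolerated band
        if (used[target] + block_size) / capacity[target] > limit:
            break  # nowhere to put the block without overfilling the target
        used[source] -= block_size
        used[target] += block_size
        moves.append((source, target))
    return moves
```

Because a target is never pushed above the limit, each move strictly reduces the data held on over-utilized nodes, so the loop terminates.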
iv. Data Integrity
A Hadoop cluster computes a checksum for each block of a file. It does so to detect corruption caused by buggy software, faults in a storage device, and so on. If a client finds that a block is corrupted, it fetches the block from another DataNode that has a replica of it.
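The checksum-and-fallback idea looks roughly like this in Python. This is a minimal sketch, assuming a plain CRC-32 from the standard `zlib` module as a stand-in for whatever checksum the cluster is configured to use; `write_block` and `read_block` are hypothetical helper names, and each replica is modeled as a small dictionary.

```python
import zlib


def write_block(data):
    """Store a block together with the checksum computed at write time."""
    return {"data": data, "checksum": zlib.crc32(data)}


def read_block(replicas):
    """Return (node, data) from the first replica whose checksum still matches.

    replicas  node name -> stored block, tried in insertion order
    """
    for node, stored in replicas.items():
        if zlib.crc32(stored["data"]) == stored["checksum"]:
            return node, stored["data"]
    raise IOError("all replicas are corrupt")


good = write_block(b"hello hadoop")
bad = dict(good, data=b"hellO hadoop")  # simulate a flipped bit on disk
node, data = read_block({"dn1": bad, "dn2": good})
print(node, data)  # the corrupt copy on dn1 is skipped in favour of dn2
```

The checksum is computed once at write time and re-verified on every read, so silent corruption on one DataNode is caught before the data reaches the application.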
v. Metadata Disk Failure
FSImage and EditLog are the central data structures of HDFS. Corruption of these files can stop HDFS from functioning. For this reason, we can configure the NameNode to maintain multiple copies of the FSImage and EditLog. Updating multiple copies can degrade the performance of namespace operations, but this is acceptable because Hadoop applications are data-intensive rather than metadata-intensive.
vi. Snapshot
A snapshot stores a copy of data as it was at a particular instant of time. One use of a snapshot is to roll back a failed HDFS instance to a known good point in time; others are disaster recovery, data backup, and protection against user error. We can take snapshots of a sub-tree of the file system or of the entire file system. We can snapshot any directory once it has been set as snapshottable, and administrators can set any directory as snapshottable. We cannot rename or delete a snapshottable directory while it contains snapshots; after removing all the snapshots from the directory, we can rename or delete it.
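The rules in this paragraph can be captured in a small Python model. It is purely illustrative: `SnapshotDirSketch` and its methods are hypothetical names, and the model copies a dictionary for simplicity, which says nothing about how HDFS stores snapshots internally.

```python
class SnapshotDirSketch:
    """Toy model of a directory that may be made snapshottable (hypothetical)."""

    def __init__(self, path):
        self.path = path
        self.snapshottable = False
        self.files = {}      # file name -> current contents
        self.snapshots = {}  # snapshot name -> frozen copy of `files`

    def allow_snapshot(self):
        """Administrator step: mark the directory as snapshottable."""
        self.snapshottable = True

    def create_snapshot(self, name):
        """Freeze the current state; only allowed on snapshottable directories."""
        if not self.snapshottable:
            raise PermissionError(f"{self.path} is not snapshottable")
        self.snapshots[name] = dict(self.files)

    def delete_snapshot(self, name):
        del self.snapshots[name]

    def can_delete_or_rename(self):
        """The directory is locked in place while it still holds snapshots."""
        return not self.snapshots

    def recover_file(self, snapshot_name, file_name):
        """Protect against user error: bring one file back from a snapshot."""
        self.files[file_name] = self.snapshots[snapshot_name][file_name]
```

A typical sequence is `allow_snapshot()`, `create_snapshot("s1")`, an accidental overwrite, and then `recover_file("s1", ...)` to get the earlier version back. The directory can be renamed or deleted only after `delete_snapshot("s1")`.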
8. Summary
There are several options for managing a Hadoop cluster. One of them is Ambari, which Hortonworks and many other players promote. We can manage more than one Hadoop cluster at a time using Ambari. Cloudera Manager is another tool for Hadoop cluster management. It lets us deploy and operate a complete Hadoop stack very easily, and it provides many features such as performance and health monitoring of the cluster. Hope this helped. Share your feedback through comments.
Translator's note: Managing and deploying a Hadoop cluster with Cloudera Manager + CDH is comparatively painless.




