What is New in Hadoop 3? Explore the Unique Hadoop 3 Features

The release of Hadoop 3.x is the next big milestone in the line of Hadoop releases. Many people wonder what feature enhancements Hadoop 3.x brings over Hadoop 2.x. So in this blog, we will take a look at what is new in Hadoop 3 and how it differs from the older versions.

What’s New in Hadoop 3?

Below are the 10 changes made in Hadoop 3 that make it unique and fast. Have a look at what's new in Hadoop 3.x –

You must check – The Essential Guide to Learn Hadoop 3

1. The minimum version of Java supported in Hadoop 3.0 is JDK 8

All the Hadoop jar files have been compiled using the Java 8 runtime version. Users now have to install Java 8 to use Hadoop 3.0, and users on JDK 7 have to upgrade to JDK 8.

2. HDFS Supports Erasure Coding

Hadoop 3.x uses erasure coding to provide fault tolerance, whereas Hadoop 2.x uses a replication technique to provide the same level of fault tolerance. Let us explore the difference between the two.

First, we will look at replication. With the default replication factor of 3, storing 6 blocks requires 6 × 3 = 18 blocks in total. Each extra replica adds 100% storage overhead, so in our case the storage overhead is 200%.

Let us see what happens in erasure coding. For 6 data blocks, 3 parity blocks get calculated; we call this process encoding. Whenever a block goes missing or gets corrupted, it is recomputed from the remaining blocks and the parity blocks; we call this process decoding. In this case, we store a total of 9 blocks for 6 blocks of data, making the storage overhead 50%. Hence we can achieve the same level of fault tolerance with much less storage. But there is always an overhead in terms of CPU and network for the encoding and decoding process, so erasure coding is typically used for rarely accessed (cold) data.
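The storage arithmetic above can be sketched in a few lines of Python (an illustrative calculation, not part of Hadoop; the 6-data/3-parity split matches Hadoop 3's default RS-6-3 erasure-coding policy):

```python
# Illustrative storage-overhead arithmetic for the example above:
# 3x replication vs. an RS(6,3) erasure-coding scheme.

def replication_overhead(data_blocks, replication=3):
    """Extra storage, as a percentage of the original data."""
    total = data_blocks * replication
    return (total - data_blocks) / data_blocks * 100

def erasure_overhead(data_blocks, parity_blocks):
    """Overhead is just the parity blocks relative to the data blocks."""
    return parity_blocks / data_blocks * 100

print(replication_overhead(6))   # 3x replication: 200.0 (%)
print(erasure_overhead(6, 3))    # RS(6,3): 50.0 (%)
```

The same fault tolerance (surviving the loss of any 3 blocks) thus costs a quarter of the extra storage.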

Recommended Reading – Hadoop Master-Slave Architecture

3. YARN Timeline Service v.2

YARN Timeline Service v.2 is new in Hadoop 3. The Timeline Server is responsible for the storage and retrieval of an application's current and historical information. This information is of two types –

Generic information about completed applications

  • Name of the queue

  • User information

  • Number of attempts per application

  • Information about containers which ran for each attempt

  • Generic data stored by the ResourceManager about a completed application, which is accessed by the web UI.

Per-framework information about running and completed applications

  • Number of map tasks

  • Number of reduce tasks

  • Counters

  • Information published by application developers to the Timeline Server via the Timeline client.

This data is queried via the REST API for rendering by application- or framework-specific UIs.

The Timeline Server v.2 addresses major shortcomings of version v.1. One of the issues is scalability: Timeline Server v.1 has a single instance of the reader/writer and storage, and it does not scale beyond a small number of nodes. In version v.2, the Timeline Server has a distributed writer architecture and scalable backend storage. It separates the collection (writing) of data from the serving (reading) of data, and it uses one collector per YARN application. The reader is a separate instance that serves query requests via the REST API. Timeline Server v.2 uses HBase for storage, which can scale to huge sizes while giving good response times for reads and writes.
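As a hedged sketch, the v.2 reader can be queried over HTTP. The hostname, port (8188 is a common default), and application id below are placeholders; check the endpoint paths against your Hadoop version's Timeline Service v.2 documentation:

```shell
READER=http://timeline-reader.example.com:8188

# Generic information about one application
curl -s "$READER/ws/v2/timeline/apps/application_1234567890123_0001"

# Flow activity on the cluster (flows group runs of the same application)
curl -s "$READER/ws/v2/timeline/flows"
```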

4. Support for Opportunistic Containers and Distributed Scheduling

Hadoop 3 has introduced the concept of execution type: Guaranteed and Opportunistic containers. Opportunistic containers can be dispatched to a NodeManager even if there are no resources available at the moment; they wait in a queue at the NodeManager until resources free up. Opportunistic containers have lower priority than Guaranteed containers, so if Guaranteed containers arrive in the middle of the execution of Opportunistic containers, the latter get preempted to make room for them.
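As a hedged sketch, centralized allocation of opportunistic containers can be switched on in yarn-site.xml. The property names below follow the Hadoop 3 documentation, but verify them (and the queue-length value, chosen arbitrarily here) against your exact version:

```xml
<property>
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- How many opportunistic containers may queue at each NodeManager -->
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>10</value>
</property>
```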

5. Support for More Than Two NameNodes

Till now, Hadoop supported a single active NameNode and a single standby NameNode. With edits replicated to three JournalNodes, this architecture could tolerate the failure of one NameNode.

But some situations require a higher level of fault tolerance. By configuring five JournalNodes we can run a system with three NameNodes, which tolerates the failure of two NameNodes. Thus, by introducing support for more than two NameNodes, Hadoop 3.0 has made the system more highly available.
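A minimal hdfs-site.xml sketch of such a three-NameNode setup; the nameservice name "mycluster" and all host names are hypothetical:

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<property>
  <!-- Five JournalNodes tolerate the loss of two -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485;jn4:8485;jn5:8485/mycluster</value>
</property>
```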

6. Default Ports of Multiple Services Changed

Prior to Hadoop 3.0, many Hadoop services had their default ports in the Linux ephemeral port range (32768-61000). Because of this, these services would often fail to bind at startup, as they conflicted with other applications.

They have moved the default ports of these services out of the ephemeral range. The affected services include the NameNode, Secondary NameNode, DataNode, and Key Management Server (KMS).
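A few well-known examples of the port moves, expressed as a small Python check; this is an illustrative subset, not the full list, so consult the Hadoop 3 release notes for the rest:

```python
# Selected HDFS default-port changes in Hadoop 3 (old -> new).
PORT_CHANGES = {
    "NameNode HTTP UI":        (50070, 9870),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode HTTP UI":        (50075, 9864),
    "DataNode data transfer":  (50010, 9866),
}

EPHEMERAL = range(32768, 61001)  # Linux ephemeral port range cited above

for service, (old, new) in PORT_CHANGES.items():
    # Every old port sat in the ephemeral range; every new one does not.
    print(f"{service}: {old} -> {new} (was ephemeral: {old in EPHEMERAL})")
```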

7. Intra-DataNode Balancer

A DataNode manages many disks. During normal write operation, these disks get filled evenly. But when we add or remove a disk, it results in a significant skew within that DataNode. The HDFS Balancer addresses inter-node data skew, not intra-node skew.

The intra-node disk balancer addresses this situation. It is invoked via the CLI command hdfs diskbalancer.
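A hedged sketch of the typical workflow; the hostname is a placeholder, and dfs.disk.balancer.enabled must be set to true in hdfs-site.xml first:

```shell
hdfs diskbalancer -plan datanode1.example.com   # generates a plan file; its path is printed
hdfs diskbalancer -execute <path-to-plan>.plan.json
hdfs diskbalancer -query datanode1.example.com  # check progress
```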

8. Daemon and Task Heap Management Reworked

There are a number of changes in the heap management of daemons and MapReduce tasks:

There are new ways to configure daemon heap sizes, and the system can auto-tune them based on the memory of the host. The HADOOP_HEAPSIZE variable is no longer used; in its place we have the HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN variables. They have also removed the internal variable JAVA_HEAP_MAX, as well as the default heap sizes, which allows auto-tuning by the JVM. All global and daemon heap-size variables support units; if the value is only a number, it is interpreted as megabytes. If you want to restore the old default, configure HADOOP_HEAPSIZE_MAX in hadoop-env.sh.

If the value of mapreduce.map/reduce.memory.mb is left at the default of -1, it is automatically inferred from the -Xmx value specified in mapreduce.map/reduce.java.opts (-Xmx is simply the JVM heap-size system property). The reverse is also possible: if no -Xmx value is specified in the mapreduce.map/reduce.java.opts keys, the system derives it from the mapreduce.map/reduce.memory.mb keys. If we don't specify either value, the default is 1024 MB. Configurations and job code that specify these values explicitly are not affected.
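The inference rule can be sketched in Python. This is an illustrative model, not Hadoop code: the 0.8 heap-to-container ratio reflects the documented default of mapreduce.job.heap.memory-mb.ratio, but the exact rounding behavior is an assumption:

```python
import math

HEAP_RATIO = 0.8   # default mapreduce.job.heap.memory-mb.ratio
DEFAULT_MB = 1024  # fallback when neither value is specified

def infer_container_mb(memory_mb, xmx_mb, ratio=HEAP_RATIO):
    """Derive mapreduce.*.memory.mb when it is left at the default of -1."""
    if memory_mb != -1:
        return memory_mb          # explicit settings are untouched
    if xmx_mb is None:
        return DEFAULT_MB         # neither value specified
    return math.ceil(xmx_mb / ratio)

def infer_xmx_mb(memory_mb, xmx_mb, ratio=HEAP_RATIO):
    """Conversely, derive -Xmx when java.opts carries no heap setting."""
    if xmx_mb is not None:
        return xmx_mb             # explicit settings are untouched
    if memory_mb == -1:
        memory_mb = DEFAULT_MB
    return math.floor(memory_mb * ratio)
```

For example, a task with -Xmx1638m and memory.mb left at -1 would land in a container of about 2048 MB under this model.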

Have a look at Hadoop Ecosystem and its Components

9. Generalization of the YARN Resource Model

They have generalized the YARN resource model to include user-defined resources apart from CPU and memory. These user-defined resources can be software licenses, GPUs, or locally attached storage. YARN tasks get scheduled on the basis of these resources.

We can extend the YARN resource model to include arbitrary "countable" resources. A countable resource is one that gets consumed by a container and that the system releases after completion. Both CPU and memory are countable resources. Likewise, GPUs (Graphics Processing Units) and software licenses are countable resources too. By default, YARN tracks CPU and memory for each node, application, and queue; it can be extended to track other user-defined countable resources like GPUs and software licenses. The integration of GPUs with containers has enhanced the performance of Data Science and AI use cases.
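As a hedged sketch, a user-defined countable resource can be declared in resource-types.xml; the resource name "licenses" and the allocation cap below are hypothetical, so check the property names against your version's YARN resource-model documentation:

```xml
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>licenses</value>
  </property>
  <property>
    <!-- Cap how many units a single container may request -->
    <name>yarn.resource-types.licenses.maximum-allocation</name>
    <value>4</value>
  </property>
</configuration>
```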

10. Consistency and Metadata Caching for S3A Client

The S3A client now has the capability to store metadata for files and directories in a fast and consistent way. It does this by using a DynamoDB table. This new feature is called S3Guard. It caches directory information so that the S3A client gets faster lookups. It also provides resilience against inconsistencies between S3 list operations and the status of objects: when files are created with S3Guard enabled, we can always find them. S3Guard is experimental and should be considered unstable.
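A hedged core-site.xml sketch switching the S3A metadata store to DynamoDB, i.e. enabling S3Guard; verify the property names against the S3Guard documentation for your Hadoop version:

```xml
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <!-- Create the DynamoDB table automatically if it does not exist -->
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
</property>
```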

So, we have explored many new features of Hadoop 3 that make it unique and popular.

Summary

As Hadoop has progressed through its versions, it has gotten better and better. The developers have incorporated many changes to fix bugs, make it more user-friendly, and give it enhanced features. The changes made to the default ports of various Hadoop services have made it more convenient to use. Hadoop 3 includes feature enhancements like erasure coding, the introduction of Timeline Service v.2, the adoption of the intra-node balancer, and so on. These changes have increased the industry's adoption of Hadoop. You must read the top Hadoop questions related to the latest version of Hadoop.

隨著我們?cè)诓煌姹镜?Hadoop 上的進(jìn)步,它變得越來越好.開發(fā)人員已經(jīng)整合了許多修改來修復(fù)錯(cuò)誤,使其更加用戶友好,并為其提供增強(qiáng)的功能.各種 Hadoop 服務(wù)的默認(rèn)端口所做的更改使得使用起來更加方便.Hadoop 包括各種功能增強(qiáng)像擦除編碼,時(shí)間線服務(wù) v.2 的引入,節(jié)點(diǎn)內(nèi)平衡器的采用等等.這些變化增加了業(yè)界使用 Hadoop 的機(jī)會(huì).你必須讀**與最新版本 Hadoop 相關(guān)的熱門 Hadoop 問題. **

Share your feedback on what's new in Hadoop 3 via the comments.

https://data-flair.training/blogs/what-is-new-in-hadoop-3
