Hadoop Tutorial for Big Data Enthusiasts – The Optimal Way of Learning Hadoop

Hadoop Tutorial – one of the most searched terms on the internet today. Do you know the reason? It is because Hadoop is the major framework for Big Data.

If you don’t know anything about Big Data, then you are in major trouble. But don’t worry, I have something for you which is completely FREE – ***520+ Big Data Tutorials***. This free tutorial series will make you a master of Big Data in just a few weeks. Also, I have explained a little about Big Data in this blog.

“Hadoop is a technology to store massive datasets on a cluster of cheap machines in a distributed manner.” It originated with Doug Cutting and Mike Cafarella.

Doug Cutting’s son had named one of his toys, a yellow elephant, Hadoop. Doug then used the name for his open-source project because it was easy to spell, pronounce, and not used elsewhere.

Interesting, right?

Hadoop Tutorial

Now, let’s begin our interesting Hadoop tutorial with the basic introduction to Big Data.

What is Big Data?

Big Data refers to datasets too large and complex for traditional systems to store and process. The major problems Big Data poses fall under three Vs: volume, velocity, and variety.

***Do you know?*** Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand Tweets, and upload 200,000 photos to Facebook.

Volume: The data is getting generated on the order of terabytes to petabytes. The largest contributor of data is social media. For instance, Facebook generates 500 TB of data every day, and Twitter generates 8 TB daily.

Velocity: Every enterprise has its own requirement for the time frame within which it has to process data. Many use cases, like credit card fraud detection, have only a few seconds to process the data in real time and detect fraud. Hence there is a need for a framework capable of high-speed data computations.

Variety: The data from various sources also has varied formats like text, XML, images, audio, video, etc. Hence Big Data technology should be capable of performing analytics on a variety of data.

Hope you have checked the Free Big Data DataFlair Tutorial Series. Here is one more interesting article for you – Top Big Data Quotes by the Experts.

Why Was Hadoop Invented?

Let us discuss the shortcomings of the traditional approach which led to the invention of Hadoop –

1. Storage for Large Datasets

The conventional RDBMS is incapable of storing huge amounts of data. The cost of data storage in available RDBMSs is very high, as it incurs the cost of both hardware and software.

傳統(tǒng)的關(guān)系數(shù)據(jù)庫不能存儲大量的數(shù)據(jù).在可用的數(shù)據(jù)庫中存儲數(shù)據(jù)的成本非常高.因為它會帶來硬件和軟件的成本.

2. Handling data in different formats

The RDBMS is capable of storing and manipulating data in a structured format. But in the real world we have to deal with data in structured, unstructured, and semi-structured formats.

關(guān)系數(shù)據(jù)庫能夠以結(jié)構(gòu)化格式存儲和操作數(shù)據(jù).但是在現(xiàn)實世界中,我們必須以結(jié)構(gòu)化、非結(jié)構(gòu)化和半結(jié)構(gòu)化的格式處理數(shù)據(jù).

3. Data Generated at High Speed

The data is oozing out on the order of tera- to petabytes daily. Hence we need a system to process data in real time within a few seconds. The traditional RDBMS fails to provide real-time processing at such speeds.

What is Hadoop?

Hadoop is the solution to the above Big Data problems. It is a technology to store massive datasets on a cluster of cheap machines in a distributed manner. Not only that, it also provides Big Data analytics through a distributed computing framework.

It is open-source software developed as a project by the Apache Software Foundation. Doug Cutting created Hadoop. In 2008, Yahoo gave Hadoop to the Apache Software Foundation. Since then, two major versions of Hadoop have been released: version 1.0 in 2011 and version 2.0.6 in 2013. Hadoop comes in various flavors like Cloudera, IBM BigInsight, MapR, and Hortonworks.

Prerequisites to Learn Hadoop

  • Familiarity with some basic Linux commands – Hadoop is set up over the Linux operating system, preferably Ubuntu. So one must know certain ***basic Linux commands***. These commands are for uploading files to HDFS, downloading files from HDFS, and so on.
  • Basic Java concepts – Folks who want to learn Hadoop can get started while simultaneously grasping basic concepts of Java. We can write map and reduce functions in Hadoop using other languages too, such as Python, Perl, C, and Ruby. This is possible via the streaming API, which supports reading from standard input and writing to standard output. Hadoop also has high-level abstraction tools like Pig and Hive which do not require familiarity with Java.

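To make the streaming idea concrete, here is a minimal word-count sketch in plain Python, run locally rather than under Hadoop. Under the real streaming API the mapper and reducer would be separate scripts reading standard input and writing standard output, with the sort between them done by the framework; the last lines here simulate that pipeline in memory.

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit tab-separated (word, 1) pairs, one per token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Reduce step: sum counts per word. Assumes input sorted by key,
    which the Hadoop framework guarantees between the two phases."""
    keyed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the streaming pipeline locally: map -> sort -> reduce
pairs = sorted(mapper(["big data", "big hadoop"]))
print(list(reducer(pairs)))
```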
Hadoop consists of three core components –

  • Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop.

  • Map-Reduce – It is the data processing layer of Hadoop.

  • YARN – It is the resource management layer of Hadoop.

Core Components of Hadoop

Let us understand these Hadoop components in detail.

1. HDFS

HDFS, short for Hadoop Distributed File System, provides distributed storage for Hadoop. HDFS has a master-slave topology.

The master is a high-end machine whereas the slaves are inexpensive computers. Big Data files get divided into a number of blocks, and Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes. On the master, we store the metadata.

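The split-and-distribute idea can be sketched as below. This is a toy model with a hypothetical 4-byte block size for illustration (real HDFS blocks default to 128 MB) and simple round-robin placement (real placement is rack-aware); the point is that the NameNode keeps only metadata while the actual bytes live on the DataNodes.

```python
BLOCK_SIZE = 4  # bytes, tiny for illustration; HDFS defaults to 128 MB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide a file into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes):
    """Round-robin placement; the master (NameNode) keeps only metadata."""
    metadata = {}                            # block id -> datanode (NameNode's view)
    storage = {dn: {} for dn in datanodes}   # what each slave actually holds
    for i, block in enumerate(blocks):
        dn = datanodes[i % len(datanodes)]
        metadata[i] = dn
        storage[dn][i] = block
    return metadata, storage

blocks = split_into_blocks(b"hello hadoop!")
meta, store = place_blocks(blocks, ["dn1", "dn2", "dn3"])
```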
HDFS has two daemons running for it. They are:

NameNode: The NameNode performs the following functions –

  • NameNode Daemon runs on the master machine.

  • It is responsible for maintaining, monitoring and managing DataNodes.

  • It records the metadata of the files like the location of blocks, file size, permission, hierarchy etc.

  • The NameNode captures all changes to the metadata, like deletion, creation, and renaming of files, in edit logs.

  • It regularly receives heartbeat and block reports from the DataNodes.

DataNode: The various functions of DataNode are as follows –

  • DataNode runs on the slave machine.

  • It stores the actual business data.

  • It serves the read-write request from the user.

  • The DataNode does the groundwork of creating, replicating, and deleting blocks on the command of the NameNode.

  • Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting the health of HDFS.

Erasure Coding in HDFS

Until Hadoop 2.x, replication was the only method for providing fault tolerance. Hadoop 3.0 introduces one more method called erasure coding. Erasure coding provides the same level of fault tolerance but with lower storage overhead.

Erasure coding is usually used in RAID (Redundant Array of Inexpensive Disks) kinds of storage. RAID provides erasure coding via striping: it divides the data into smaller units like bits/bytes/blocks and stores consecutive units on different disks. Hadoop calculates parity bits for each of these cells (units). We call this process encoding. In the event of the loss of certain cells, Hadoop recomputes them by decoding. Decoding is the process by which lost cells get recovered from the remaining original and parity cells.

Erasure coding is mostly used for warm or cold data which undergoes less frequent I/O access. The replication factor of an erasure-coded file is always one; we cannot change it with the -setrep command. Under erasure coding, storage overhead is never more than 50%.

Under conventional Hadoop storage, a replication factor of 3 is the default. It means 6 blocks will get replicated into 6*3, i.e. 18 blocks, which gives a storage overhead of 200%. As opposed to this, the erasure coding technique stores 6 data blocks and 3 parity blocks, which gives a storage overhead of 50%.

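The encode/decode idea and the overhead arithmetic above can be sketched with single-parity XOR. This is a deliberate simplification: HDFS 3.0 actually uses Reed–Solomon codes such as RS(6,3), which tolerate the loss of any 3 cells, whereas one XOR parity block tolerates only a single loss.

```python
from functools import reduce

def xor_parity(blocks):
    """Encoding: one parity block as the byte-wise XOR of all data
    blocks (RAID-5-style single parity, the simplest erasure code)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover(surviving, parity):
    """Decoding: rebuild the single lost block by XOR-ing the parity
    block with the surviving data blocks."""
    return xor_parity(surviving + [parity])

data = [b"ab", b"cd", b"ef"]
parity = xor_parity(data)
assert recover([data[0], data[2]], parity) == b"cd"  # lost block rebuilt

# The storage-overhead arithmetic from the text:
replication_overhead = (6 * 3 - 6) / 6   # 3-way replication of 6 blocks: 200%
erasure_overhead = (6 + 3 - 6) / 6       # 6 data + 3 parity blocks: 50%
```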
The File System Namespace

HDFS supports hierarchical file organization. One can create, remove, move, or rename a file. The NameNode maintains the file system namespace and records changes to it. It also stores the replication factor of each file.

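As a sketch, the namespace the NameNode maintains can be pictured as a mapping from paths to per-file attributes, supporting the create/remove/rename operations just described. This is a toy model only; the real NameNode keeps a full directory tree, permissions, and block lists.

```python
class Namespace:
    """Toy NameNode namespace: path -> replication factor."""

    def __init__(self):
        self.files = {}

    def create(self, path, replication=3):
        # Record the file and its replication factor (HDFS default: 3)
        self.files[path] = replication

    def remove(self, path):
        del self.files[path]

    def rename(self, old, new):
        # Attributes travel with the file; only the path changes
        self.files[new] = self.files.pop(old)

ns = Namespace()
ns.create("/user/data/logs.txt")
ns.rename("/user/data/logs.txt", "/user/data/archive.txt")
```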
2. MapReduce

It is the data processing layer of Hadoop. It processes data in two phases.

They are:

Map Phase – This phase applies business logic to the data. The input data gets converted into key-value pairs.

Reduce Phase – The Reduce phase takes as input the output of the Map phase. It applies aggregation based on the key of the key-value pairs.

Map-Reduce works in the following way:

  • The client specifies the file for input to the Map function. The framework splits it into tuples.

  • Map function defines key and value from the input file. The output of the map function is this key-value pair.

  • MapReduce framework sorts the key-value pair from map function.

  • The framework merges the tuples having the same key together.

  • The reducers get these merged key-value pairs as input.

  • Reducer applies aggregate functions on key-value pair.

  • The output from the reducer gets written to HDFS.

3. YARN

YARN, short for Yet Another Resource Negotiator, has the following components:

Resource Manager

  • The Resource Manager runs on the master node.

  • It knows the location of the slaves (Rack Awareness).

  • It is aware of how many resources each slave has.

  • The Resource Scheduler is one of the important services run by the Resource Manager.

  • The Resource Scheduler decides how resources get assigned to various tasks.

  • The Application Manager is one more service run by the Resource Manager.

  • The Application Manager negotiates the first container for an application.

  • The Resource Manager keeps track of the heartbeats from the Node Managers.

Node Manager

  • It runs on slave machines.

  • It manages containers. Containers are nothing but a fraction of the Node Manager’s resource capacity.

  • Node manager monitors resource utilization of each container.

  • It sends heartbeat to Resource Manager.

Job Submitter

The application startup process is as follows:

  • The client submits the job to Resource Manager.

  • Resource Manager contacts Resource Scheduler and allocates container.

  • Now Resource Manager contacts the relevant Node Manager to launch the container.

  • Container runs Application Master.

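The container-allocation step can be sketched as a toy scheduler that grants a container on the first slave with enough free capacity. The class and node names here are hypothetical; real YARN schedulers (Capacity Scheduler, Fair Scheduler) weigh queues, priorities, and locality and are far more sophisticated.

```python
class ResourceManager:
    """Toy Resource Scheduler: first-fit container allocation."""

    def __init__(self, node_capacity):
        self.free = dict(node_capacity)  # Node Manager -> free memory (GB)

    def allocate_container(self, memory):
        # Grant the container on the first node with enough free capacity
        for node, free in self.free.items():
            if free >= memory:
                self.free[node] -= memory
                return node              # this Node Manager launches the container
        return None                      # no capacity: the request has to wait

rm = ResourceManager({"nm1": 4, "nm2": 8})
assert rm.allocate_container(6) == "nm2"  # nm1 is too small for 6 GB
assert rm.allocate_container(4) == "nm1"
```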
The basic idea of YARN is to split up the tasks of resource management and job scheduling. It has one global Resource Manager and a per-application Application Master. An application can be either a single job or a DAG of jobs.

The Resource Manager’s job is to assign resources to the various competing applications. The Node Manager runs on the slave nodes. It is responsible for containers, monitoring resource utilization, and reporting the same to the Resource Manager.

The job of the Application Master is to negotiate resources from the Resource Manager. It also works with the Node Manager to execute and monitor the tasks.

***Wait before scrolling further! This is the time to read about the top 15 Hadoop Ecosystem components.***

Why Hadoop?

Let us now understand why Big Data Hadoop is very popular and why Apache Hadoop captures more than 90% of the big data market.

Apache Hadoop is not only a storage system but a platform for data storage as well as processing. It is scalable (we can add more nodes on the fly) and fault-tolerant (even if nodes go down, data is processed by another node).
The following characteristics of Hadoop make it a unique platform:

  • Flexibility to store and mine any type of data, whether it is structured, semi-structured, or unstructured. It is not bound by a single schema.

  • It excels at processing data of a complex nature. Its scale-out architecture divides workloads across many nodes. Another added advantage is that its flexible file system eliminates ETL bottlenecks.

  • It scales economically; as discussed, it can deploy on commodity hardware. Apart from this, its open-source nature guards against vendor lock-in.

What is Hadoop Architecture?

After understanding what is Apache Hadoop, let us now understand the Hadoop Architecture in detail.

How Hadoop Works

Hadoop works in a master-slave fashion. There is one master node and n slave nodes, where n can be in the thousands. The master manages, maintains, and monitors the slaves, while the slaves are the actual worker nodes. In the Hadoop architecture, the master should deploy on good-configuration hardware, not just commodity hardware, as it is the centerpiece of the Hadoop cluster.

The master stores the metadata (data about data) while the slaves are the nodes that store the data. The data is stored distributedly across the cluster. The client connects to the master node to perform any task. Now in this Hadoop tutorial for beginners, we will discuss different features of Hadoop in detail.

Hadoop Features

Here are the top Hadoop features that make it popular –

1. Reliability

In a Hadoop cluster, if any node goes down, it will not disable the whole cluster. Instead, another node will take the place of the failed node, and the Hadoop cluster will continue functioning as if nothing has happened. Hadoop has a built-in fault tolerance feature.

2. Scalable

Hadoop integrates with cloud-based services. If you are installing Hadoop on the cloud, you need not worry about scalability. You can easily procure more hardware and expand your Hadoop cluster within minutes.

Hadoop 與基于云的服務(wù)集成.如果你在云上安裝 Hadoop,你不需要擔(dān)心可擴(kuò)展性.您可以在幾分鐘內(nèi)輕松獲得更多硬件并擴(kuò)展 Hadoop 集群.

3. Economical

Hadoop gets deployed on commodity hardware, which is cheap. This makes Hadoop very economical. Also, as Hadoop is open-source software, there is no license cost either.

4. Distributed Processing

In Hadoop, any job submitted by the client gets divided into a number of sub-tasks. These sub-tasks are independent of each other, hence they execute in parallel, giving high throughput.

5. Distributed Storage

Hadoop splits each file into a number of blocks. These blocks get stored distributedly on the cluster of machines.

6. Fault Tolerance

Hadoop replicates every block of a file many times depending on the replication factor, which is 3 by default. Suppose any node goes down; the data on that node still gets recovered, because a copy of the data is available on other nodes due to replication. Hadoop is fault-tolerant.

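A small sketch of why 3-way replication tolerates node failures. Here replicas go to random distinct nodes for simplicity; real HDFS placement is rack-aware (e.g. one replica local, two on another rack).

```python
import random

def place_replicas(block_id, nodes, replication=3):
    """Place `replication` copies of a block on distinct nodes (default 3)."""
    return random.sample(nodes, replication)

def readable_after_failure(replica_nodes, failed):
    """The block survives as long as any replica lives on a healthy node."""
    return any(node not in failed for node in replica_nodes)

nodes = ["n1", "n2", "n3", "n4", "n5"]
replicas = place_replicas("blk_0", nodes)
# Losing any single node can never lose a 3-way replicated block:
assert all(readable_after_failure(replicas, {down}) for down in nodes)
```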
***Are you looking for more features? Here are the additional Hadoop features that make it special.***

Hadoop Flavors

This section of the Hadoop Tutorial talks about the various flavors of Hadoop.

  • Apache – The vanilla flavor, as the actual code resides in the Apache repositories.
  • Hortonworks – A popular distribution in the industry.
  • Cloudera – The most popular distribution in the industry.
  • MapR – It has rewritten HDFS, and its HDFS is faster compared to the others.
  • IBM – Its proprietary distribution is known as BigInsights.

All the major databases provide native connectivity with Hadoop for fast data transfer. To transfer data from Oracle to Hadoop, for example, you need a connector.

All flavors are almost the same, and if you know one, you can easily work on the other flavors as well.

Hadoop Future Scope

There is going to be a lot of investment in the Big Data industry in the coming years. According to a report by Forbes, 90% of global organizations will be investing in Big Data technology. Hence the demand for Hadoop resources will also grow. Learning Apache Hadoop will give you accelerated career growth and also tends to increase your pay package.

There is a big gap between the supply and demand of Big Data professionals. Skills in Big Data technologies continue to be in high demand, because companies grow as they try to get the most out of their data. Therefore, such professionals’ salary packages are quite high compared to those in other technologies.

The managing director of **Dice**, **Alice Hills**, has said that Hadoop jobs saw a 64% increase from the previous year. It is evident that Hadoop rules the Big Data market and its future is bright. The demand for Big Data Analytics professionals is ever increasing, as it is a known fact that data is nothing without the power to analyze it.

You must check Expert’s Prediction for the Future of Hadoop

Summary – Hadoop Tutorial

In concluding this Hadoop tutorial, we can say that Apache Hadoop is the most popular and powerful big data tool. Hadoop stores huge amounts of data in a distributed manner and processes the data in parallel on a cluster of nodes. It provides the world’s most reliable storage layer, HDFS; the batch processing engine, MapReduce; and the resource management layer, YARN.

In summarizing this Hadoop Tutorial, I want to give you a quick revision of all the topics we have discussed:

  • The concept of Big Data

  • Reason for Hadoop’s Invention

  • Prerequisites to learn Hadoop

  • Introduction to Hadoop

  • Core components of Hadoop

  • Why Hadoop

  • Hadoop Architecture

  • Features of Hadoop

  • Hadoop Flavors

  • Future Scope of Hadoop

Hope this Hadoop Tutorial helped you. If you face any difficulty while understanding Hadoop concept, comment below.

***This is the right time to start your Hadoop learning with industry experts.***

https://data-flair.training/blogs/hadoop-tutorial
