13 Big Limitations of Hadoop & Solutions to Hadoop Drawbacks
Although Hadoop is the most powerful tool for big data, it has various limitations: Hadoop is not suited to small files, it cannot handle live data reliably, its processing speed is slow, and it is not efficient for iterative processing or for caching, among other drawbacks.
In this tutorial on the limitations of Hadoop, we will first learn what Hadoop is and what its pros and cons are. We will see the features of Hadoop that make it so popular. We will then go through the 13 big disadvantages of Hadoop, due to which Apache Spark and Apache Flink came into existence, and learn about the various ways to overcome the drawbacks of Hadoop.
Drawbacks of Hadoop & Their Solutions

Hadoop – Introduction & Features
Let us start with what Hadoop is and which of its features make it so popular.
Hadoop is an open-source software framework for distributed storage and distributed processing of extremely large data sets. Important features of Hadoop are:
Apache Hadoop is an open-source project, which means its code can be modified to meet business requirements.
In Hadoop, data is highly available and accessible despite hardware failure, because multiple copies of the data are kept. If a machine or any piece of hardware crashes, the data can still be accessed from another path.
Hadoop is highly scalable, as we can easily add new hardware to a node. Hadoop also provides horizontal scalability, which means we can add nodes on the fly without any downtime.
Hadoop is **fault tolerant**: by default, 3 replicas of each block are stored across the cluster, so if any node goes down, the data on it can easily be recovered from another node.
In Hadoop, data is reliably stored on the cluster despite machine failures, thanks to the replication of data across the cluster.
Hadoop runs on a cluster of commodity hardware, which is not very expensive.
Hadoop is very easy to use: there is no need for the client to deal with distributed computing, as the framework takes care of everything.
But just as all technologies have pros and cons, Hadoop has many limitations as well. Having seen the features and advantages of Hadoop above, let us now look at the limitations of Hadoop, due to which Apache Spark and Apache Flink came into the picture.
13 Big Limitations of Hadoop for Big Data Analytics
In this section we will discuss the various limitations of Hadoop, along with their solutions:
1. Issue with Small Files
Hadoop is not suited to small data. The Hadoop Distributed File System (HDFS) lacks the ability to efficiently support random reads of small files because of its high-capacity design.
Small files are a major problem in HDFS. A small file is one significantly smaller than the HDFS block size (default 128 MB). HDFS cannot handle a huge number of such files, because HDFS was designed to work with a small number of large files storing large data sets, rather than with a large number of small files. If there are too many small files, the NameNode, which stores the namespace of HDFS, will be overloaded.
Solution-
The solution to this drawback of Hadoop is simple: just merge the small files to create bigger files, and then copy the bigger files to HDFS.
**HAR files** (Hadoop Archives) were introduced to reduce the problem of lots of files putting pressure on the NameNode's memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created with the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. Reading files in a HAR is no more efficient than reading files in HDFS; in fact it is slower, since each HAR file access requires two index files to be read in addition to the data file.
**Sequence files** work very well in practice to overcome the 'small file problem'. In this approach we use the filename as the key and the file contents as the value. With a small program we can put the files (of, say, 100 KB each) into a single sequence file, and then process them in a streaming fashion, operating on the sequence file. MapReduce can break the sequence file into chunks and operate on each chunk independently, because the sequence file is splittable. A minimal sketch of this pattern follows.
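For illustration, here is a hedged sketch of the sequence-file approach in Scala against Hadoop's Java client API; the HDFS paths /data/small and /data/packed.seq, and the choice of Text keys with BytesWritable values, are assumptions, not part of the original article:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

object SmallFilePacker {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // One SequenceFile holds many small files: key = file name, value = raw bytes.
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(new Path("/data/packed.seq")),
      SequenceFile.Writer.keyClass(classOf[Text]),
      SequenceFile.Writer.valueClass(classOf[BytesWritable]))
    try {
      for (status <- fs.listStatus(new Path("/data/small"))) {
        val in    = fs.open(status.getPath)
        val bytes = new Array[Byte](status.getLen.toInt)
        try in.readFully(bytes) finally in.close()
        writer.append(new Text(status.getPath.getName), new BytesWritable(bytes))
      }
    } finally writer.close()
  }
}
```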
Storing files in HBase is a very common design pattern for overcoming the small-file problem in HDFS. We are not actually storing millions of small files in HBase; rather, we add the binary content of each file to a cell, as sketched below.
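A hedged sketch of that HBase pattern, using the standard HBase client API; the table name small_files, the column family f, and the local file path are illustrative assumptions:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object SmallFileToHBase {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("small_files"))

    // The whole file becomes the value of a single cell, keyed by file name.
    val content = java.nio.file.Files.readAllBytes(
      java.nio.file.Paths.get("/tmp/report-0001.txt"))
    val put = new Put(Bytes.toBytes("report-0001.txt")) // row key = file name
    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("content"), content)
    table.put(put)

    table.close()
    conn.close()
  }
}
```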
2. Slow Processing Speed
In Hadoop, MapReduce processes large data sets with a parallel, distributed algorithm. There are two tasks to perform, Map and Reduce, and MapReduce requires a lot of time to perform them, thereby increasing latency. Data is distributed and processed over the cluster in MapReduce, which increases the time and reduces processing speed.
Solution-
Spark provides a solution to this limitation of Hadoop through in-memory processing of data. In-memory processing is faster, since no time is spent moving data and processes in and out of disk; Spark can be up to 100 times faster than MapReduce because it processes everything in memory. There is also Flink, which processes data faster than Spark thanks to its streaming architecture: Flink can be instructed to process only the parts of the data that have actually changed, which significantly increases job performance. A small Spark sketch follows.
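A minimal sketch of in-memory processing with Spark's Dataset API; the input path is an assumption, and the point is simply that cache() pins the filtered data in RAM so the second action never goes back to disk:

```scala
import org.apache.spark.sql.SparkSession

val spark  = SparkSession.builder.appName("InMemoryDemo").getOrCreate()
val logs   = spark.read.textFile("hdfs:///data/logs")      // hypothetical path
val errors = logs.filter(_.contains("ERROR")).cache()       // keep in memory

println(errors.count())                                     // materializes the cache
println(errors.filter(_.contains("timeout")).count())       // served from memory
```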
3. Support for Batch Processing Only
Hadoop supports batch processing only; it does not process streamed data, and hence overall performance is slower. The MapReduce framework of Hadoop does not leverage the memory of the Hadoop cluster to the maximum.
Solution-
To work around this limitation of Hadoop, Spark is used, which improves performance; however, **Spark stream processing** is not as efficient as Flink's, since Spark uses micro-batch processing. Flink improves the overall performance, as it provides a single runtime for streaming as well as batch processing. Flink also uses native closed-loop iteration operators, which make machine learning and graph processing faster. A hedged Flink example follows.
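A minimal word-count sketch using Flink's Scala DataStream API, showing the same streaming runtime consuming an unbounded source; the host and port are assumptions (e.g. a socket fed by `nc -lk 9999`):

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Unbounded input: records are processed as they arrive, not in batches.
    env.socketTextStream("localhost", 9999)
      .flatMap(_.toLowerCase.split("\\s+"))
      .map((_, 1))
      .keyBy(_._1)   // running count per word
      .sum(1)
      .print()
    env.execute("streaming-word-count")
  }
}
```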
4. No Real-time Data Processing
Apache Hadoop is designed for batch processing, which means it takes a huge amount of data as input, processes it, and produces the result. Although batch processing is very efficient for processing a high volume of data, the output can be delayed significantly, depending on the size of the data being processed and the computational power of the system. Hadoop is not suitable for real-time data processing.
Solution-
Apache Spark supports stream processing. Stream processing involves continuous input and output of data; it emphasizes the velocity of the data, which is processed within a small period of time. Learn more about **Spark Streaming APIs**.
Apache Flink provides a single runtime for streaming as well as batch processing, so one common runtime is utilized for data-streaming applications and batch-processing applications. Flink is a stream processing system able to process data row after row in real time. A hedged sketch of streaming with Spark follows.
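For illustration, a hedged sketch of continuous processing with Spark Structured Streaming (one of the Spark streaming APIs mentioned above); the socket source, host, and port are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StreamingDemo").getOrCreate()
import spark.implicits._

// Continuous input: lines arrive on a socket as an unbounded table.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuous output: the full updated counts stream to the console.
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```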
5. No Delta Iteration
Hadoop is not very efficient for iterative processing, as Hadoop does not support cyclic data flow (i.e. a chain of stages in which the output of each stage is the input to the next).
Solution-
We can use Apache Spark to overcome this limitation of Hadoop, as Spark accesses data from RAM instead of disk, which dramatically improves the performance of iterative algorithms that access the same dataset repeatedly. Spark iterates over its data in batches; for iterative processing in Spark, each iteration is scheduled and executed separately. A hedged sketch follows.
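A small hedged sketch of the pattern: gradient descent toward the mean of a dataset, where the cached RDD is re-read from memory on every pass; the input path, the loss function, and the step size are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark  = SparkSession.builder.appName("IterativeDemo").getOrCreate()
val points = spark.sparkContext
  .textFile("hdfs:///data/points")   // hypothetical input: one number per line
  .map(_.toDouble)
  .cache()                            // reused from RAM by every iteration below

var theta = 0.0
for (_ <- 1 to 10) {
  // gradient of the squared loss 0.5 * (theta - p)^2, averaged over the data
  val grad = points.map(p => theta - p).mean()
  theta -= 0.5 * grad                 // each iteration is a separate Spark job
}
println(s"estimate after 10 iterations: $theta")
```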
6. Latency
In Hadoop, the MapReduce framework is comparatively slow, since it is designed to support different formats and structures and a huge volume of data. In MapReduce, Map takes a set of data and converts it into another set of data in which individual elements are broken down into key-value pairs, and Reduce takes the output of the Map as its input and processes it further. MapReduce requires a lot of time to perform these tasks, thereby increasing latency.
Solution-
Spark is used to reduce this limitation of Hadoop. Apache Spark is yet another batch system, but it is relatively faster, since it caches much of the input data in memory as RDDs (Resilient Distributed Datasets) and keeps intermediate data in memory itself. Flink's data streaming achieves low latency and high throughput.
7. Not Easy to Use
In Hadoop, MapReduce developers need to hand-code each and every operation, which makes it very difficult to work with. MapReduce has no interactive mode, though additions such as Hive and Pig make working with MapReduce a little easier for adopters.
Solution-
To solve this drawback of Hadoop, we can use Spark. Spark has an interactive mode, so developers and users alike can get intermediate feedback for queries and other activities, and Spark is easy to program, as it has tons of high-level operators. We can use Flink just as easily, since it also has high-level operators. In this way Spark can solve many limitations of Hadoop; a small shell example follows.
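To make the contrast concrete, here is word count typed into the interactive spark-shell (where a SparkContext named `sc` is predefined); the input path is an assumption, and the equivalent hand-written MapReduce program needs separate Mapper, Reducer, and driver classes:

```scala
// Four high-level operators replace a full MapReduce job.
val counts = sc.textFile("hdfs:///data/input")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)   // immediate, interactive feedback
```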
8. Security
Hadoop is challenging in managing complex applications. If the user managing the platform does not know how to enable its security features properly, your data can be at huge risk. At the storage and network levels Hadoop is missing encryption, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage.
HDFS supports access control lists (ACLs) and a traditional file-permissions model. However, third-party vendors have enabled organizations to leverage **Active Directory Kerberos** and LDAP for authentication.
Solution-
Spark provides a security bonus that helps overcome these limitations of Hadoop. If we run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN, giving it the capability of using Kerberos authentication.
9. No Abstraction
Hadoop does not have any kind of abstraction, so MapReduce developers need to hand-code each and every operation, which makes it very difficult to work with.
Solution-
To overcome these drawbacks of Hadoop, Spark is used, which provides the RDD abstraction for batch processing. Flink has the DataSet abstraction.
10. Vulnerable by Nature
Hadoop is written entirely in Java, one of the most widely used languages. Java has been heavily exploited by cybercriminals and, as a result, has been implicated in numerous security breaches.
11. No Caching
Hadoop is not efficient at caching. In Hadoop, MapReduce cannot cache intermediate data in memory for further use, which diminishes Hadoop's performance.
Solution-
Spark and Flink can overcome this limitation of Hadoop, as both cache data in memory for further iterations, which enhances overall performance.
12. Lengthy Lines of Code
Hadoop has about 120,000 lines of code; more lines of code mean more bugs, and executing the program takes more time.
Solution-
Although Spark and Flink offer both Scala and Java APIs, their implementation is in Scala, so the number of lines of code is smaller than in Hadoop. It therefore also takes less time to execute programs, solving the lengthy-code limitation of Hadoop.
13. Uncertainty
Hadoop only ensures that a data job completes; it cannot guarantee when the job will complete.
Limitations of Hadoop and Their Solutions – Summary
As a result of these limitations of Hadoop, the need for Spark and Flink emerged, making systems friendlier for working with huge amounts of data. Spark provides in-memory processing of data, thereby improving processing speed. Flink improves overall performance by providing a single runtime for streaming as well as batch processing. Spark also provides a security bonus.
Now that the flaws of Hadoop have been exposed, will you continue to use it for your big data initiatives, or swap it for something else?
If you have any queries about the limitations of Hadoop, or any feedback, just drop a comment in the comment section and we will get back to you.
