Spark 2.0

What is Apache Spark

Apache Spark is a highly scalable open source cluster computing framework and data processing engine. It was originally developed at UC Berkeley’s AMPLab in 2009 and was open sourced in 2010 under a BSD license. In 2013 it was donated to the Apache Software Foundation (ASF), and it is now distributed under the Apache License 2.0.
Spark provides a unified and comprehensive framework that can handle the various requirements of processing large datasets. It offers high-level APIs in Java, Scala, Python and R, along with a rich set of higher-level tools referred to as libraries.

Spark 2.0 – What’s New
With the upcoming release of Spark 2.0 there have been some significant improvements in the API, libraries and abstraction layers. Spark 2.0 attempts to improve on these three components and is said to be 10X faster than Spark 1.x.
Let’s take a look at some of the changes in Spark 2.0.

More SQL Friendly – SQL 2003 Compliant
SQL is one of the primary interfaces Spark applications use. Spark 2.0 introduces a new ANSI SQL parser that provides better error reporting. Spark 2.0 also supports subqueries, both correlated and uncorrelated, and can run all 99 TPC-DS queries.
This is a major improvement that can encourage moving applications from traditional SQL engines to Spark.
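
To make the difference between the two kinds of subquery concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine, not Spark itself, and the `orders` table and its rows are made up for illustration. The queries are plain standard SQL of the kind the text describes.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine (not Spark).
# The "orders" table and its contents are hypothetical illustration data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'alice', 10.0),
        (2, 'alice', 30.0),
        (3, 'bob',   40.0),
        (4, 'bob',   60.0);
""")

# Uncorrelated subquery: the inner query does not depend on the outer row,
# so it can be evaluated once (the overall average amount).
uncorrelated = conn.execute("""
    SELECT id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()

# Correlated subquery: the inner query refers to the outer row (o.customer),
# so its value depends on which outer row is being tested.
correlated = conn.execute("""
    SELECT id FROM orders AS o
    WHERE amount > (SELECT AVG(amount) FROM orders AS i
                    WHERE i.customer = o.customer)
    ORDER BY id
""").fetchall()

conn.close()

print(uncorrelated)  # orders above the overall average
print(correlated)    # orders above their own customer's average
```

The first query compares every order with one global average, while the second compares each order with the average for its own customer, which is why the two result sets differ.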

Unified API – DataFrames & Datasets
DataFrames are a higher-level structured data API introduced in Spark 1.3 in 2015. In a nutshell, a DataFrame is a collection of rows with a schema. It provides better performance, ease of use and flexibility in comparison with the RDD (Resilient Distributed Dataset) API.
For users who prefer type safety, a new API called Datasets was introduced in Spark 1.6. A Dataset is an attempt to provide type safety on top of DataFrame.
In Spark 2.0 the two APIs are unified into a single API. Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. The new Dataset API includes both typed and untyped methods.

SparkSession – Single Entry Point
Spark 1.6 provided the SparkContext API to connect to a Spark cluster, along with several other contexts for different APIs. For instance, SQL required SQLContext and streaming required StreamingContext. When using the DataFrames API, a common source of confusion was deciding which “context” to use.
Spark 2.0 introduces SparkSession, which provides a single entry point for the DataFrame and Dataset APIs. For now SparkSession covers SQLContext and HiveContext. It will be extended to StreamingContext as well.
Please note that SQLContext and HiveContext are still present in Spark 2.0 for backward compatibility.

Spark as a Compiler – Faster Spark
Spark is known for its performance and speed, and Spark 2.0 attempts to take this a step further. Spark 1.x, like many other modern data engines, evaluates a query through many layers of function calls, so a large share of CPU cycles is spent on overhead rather than on useful work.
Spark 2.0 includes the second-generation Tungsten engine. This new engine works by taking the query plan and collapsing it into a single function, a technique known as whole-stage code generation, which eliminates the unnecessary function calls. Where possible, the generated code keeps intermediate data in CPU registers instead of writing it to memory as the traditional approach does. This method promises around a 10X improvement in performance, depending on the workload.
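
The idea of collapsing a query plan into a single function can be illustrated with a toy model. The sketch below is plain Python, not Spark or Tungsten code, and every function name in it is hypothetical. It evaluates the same query, roughly `SELECT SUM(x * 2) WHERE x % 2 = 0`, first as a chain of separate operators and then as one fused loop of the kind a code generator might emit.

```python
from typing import Callable, Iterable, Iterator

# Toy model only: these operators are hypothetical and are not Spark classes.

def scan(data: Iterable[int]) -> Iterator[int]:
    """Leaf operator: produce rows one at a time."""
    for x in data:
        yield x

def filter_op(child: Iterator[int], pred: Callable[[int], bool]) -> Iterator[int]:
    """Filter operator: pulls each row from its child and tests it."""
    for x in child:
        if pred(x):
            yield x

def project_op(child: Iterator[int], fn: Callable[[int], int]) -> Iterator[int]:
    """Projection operator: pulls each row from its child and transforms it."""
    for x in child:
        yield fn(x)

def run_operator_tree(data: Iterable[int]) -> int:
    """Operator-at-a-time evaluation: every row crosses several call boundaries."""
    plan = project_op(filter_op(scan(data), lambda x: x % 2 == 0), lambda x: x * 2)
    total = 0
    for value in plan:
        total += value
    return total

def run_fused(data: Iterable[int]) -> int:
    """The same plan collapsed into one loop, with intermediates in local variables."""
    total = 0
    for x in data:
        if x % 2 == 0:
            total += x * 2
    return total

data = list(range(10))
print(run_operator_tree(data), run_fused(data))
```

Both functions return the same answer. The fused version simply avoids the per-row hand-offs between operators, which is the overhead the paragraph above describes. In Python the speed difference is not representative of what a JVM code generator achieves, so this sketch shows the shape of the transformation only.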

Structured Streaming – Continuous Applications
The current Spark streaming API, called DStream, was introduced in Spark 0.7. It provides the ability to stream real-time data and process it. Spark 2.0 introduces Structured Streaming.
Spark Structured Streaming is a declarative API that extends DataFrames and Datasets. It is largely built on Spark SQL and also includes ideas from Spark Streaming.
Spark Streaming, which uses what’s been called a “micro-batch” architecture for streaming applications, is among the most popular Spark engines. The new Structured Streaming engine represents Spark’s second attempt at solving some of the tough problems that developers face when building real-time applications.
Essentially, Structured Streaming enables Spark developers to run the same type of DataFrame queries against data streams as they had previously been running against static data. Thanks to the Catalyst optimizer, the framework figures out the best way to make this all work in an efficient fashion, freeing the developer from worrying about the underlying plumbing.
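
The "same query on static data and on a stream" idea can be sketched in a few lines. The following is a toy model in plain Python, not the Structured Streaming API. The `word_count` query, the `IncrementalWordCount` class and the sample data are all hypothetical. It runs one query over a complete dataset, then feeds the same data in micro-batches and maintains the result incrementally.

```python
from collections import Counter
from typing import Iterable, List

# Toy model only: hypothetical names, not Spark's Structured Streaming API.

def word_count(rows: Iterable[str]) -> Counter:
    """The 'query': count the occurrences of each word."""
    return Counter(rows)

class IncrementalWordCount:
    """Keeps the query result up to date as micro-batches arrive."""

    def __init__(self) -> None:
        self.result: Counter = Counter()

    def process_batch(self, batch: List[str]) -> Counter:
        # Apply the same query to the new rows only, then merge into the running result.
        self.result.update(word_count(batch))
        return self.result

# Static case: the whole dataset is available at once.
static_data = ["spark", "sql", "spark", "stream", "sql", "spark"]
batch_result = word_count(static_data)

# Streaming case: the same rows arrive as three micro-batches.
micro_batches = [["spark", "sql"], ["spark", "stream"], ["sql", "spark"]]
stream = IncrementalWordCount()
for batch in micro_batches:
    stream.process_batch(batch)

print(batch_result == stream.result)
```

After the last micro-batch, the incrementally maintained result matches the result of running the query once over the full dataset. In Spark the developer writes only the query, and the engine takes care of the incremental bookkeeping that `IncrementalWordCount` does by hand here.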
Upcoming releases of Spark 2.x will include more features and improvements in Spark Structured Streaming.

DataFrame-based ML API
In Spark 2.0 the DataFrame-based Machine Learning “Pipeline” API will become the primary Machine Learning API.

Conclusion
Spark has already made a mark by providing an easy-to-use, unified and fast data framework. With Spark 2.0 we can expect further improvements in the overall performance of Spark. We can look forward to the GA release of Apache Spark 2.0 in the coming days.
