Spark SQL's predecessor was Shark. Like Hive, it lets users analyze data with SQL statements on top of the Spark engine instead of writing program code.

[Figure: Spark SQL architecture (screenshot 2018-08-12)]
The figure above shows Spark SQL's role well: it offers three ways to operate on RDDs (Resilient Distributed Datasets), namely JDBC, a console, and a programming interface. Users only need to write SQL, not program code.
The execution flow of Spark SQL is as follows:

[Figure: Spark SQL execution flow (screenshot 2018-08-12)]
There are roughly six steps:
- SqlParser parses the SQL statement into an Unresolved Logical Plan;
- the Analyzer binds the plan against the Catalog, producing a resolved Logical Plan;
- the Optimizer optimizes the Logical Plan into an Optimized Logical Plan;
- SparkPlanner converts the Optimized Logical Plan into a Physical Plan;
- prepareForExecution() turns the Physical Plan into an executable Physical Plan;
- execute() runs the executable physical plan and yields an RDD.
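The chain of steps above can be sketched as a series of lazily cached stages, each derived from the previous one. This is an illustrative Python mock, not Spark's actual implementation; the class and stage names are invented for the sketch:

```python
from functools import cached_property

class MockQueryExecution:
    """Illustrative mock of Spark SQL's planning pipeline (not real Spark code)."""

    def __init__(self, sql: str):
        self.sql = sql

    @cached_property
    def unresolved_plan(self) -> str:    # step 1: SqlParser parses the SQL text
        return f"Unresolved({self.sql})"

    @cached_property
    def analyzed_plan(self) -> str:      # step 2: Analyzer binds against the Catalog
        return f"Analyzed({self.unresolved_plan})"

    @cached_property
    def optimized_plan(self) -> str:     # step 3: Optimizer rewrites the logical plan
        return f"Optimized({self.analyzed_plan})"

    @cached_property
    def physical_plan(self) -> str:      # step 4: planner picks a physical plan
        return f"Physical({self.optimized_plan})"

    @cached_property
    def executed_plan(self) -> str:      # step 5: prepare the plan for execution
        return f"Executable({self.physical_plan})"

    def execute(self) -> str:            # step 6: run it, producing the "RDD"
        return f"RDD({self.executed_plan})"

qe = MockQueryExecution("SELECT 1")
print(qe.execute())
# → RDD(Executable(Physical(Optimized(Analyzed(Unresolved(SELECT 1))))))
```

Each stage only wraps the one before it here, but the shape of the dependency chain mirrors the real pipeline: calling `execute()` pulls every earlier stage in order.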
The corresponding source in Spark:
```scala
class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) {
  // TODO: Move the planner and optimizer into here from SessionState.
  protected def planner = sparkSession.sessionState.planner

  def assertAnalyzed(): Unit = analyzed

  def assertSupported(): Unit = {
    if (sparkSession.sessionState.conf.isUnsupportedOperationCheckEnabled) {
      UnsupportedOperationChecker.checkForBatch(analyzed)
    }
  }

  lazy val analyzed: LogicalPlan = {
    SparkSession.setActiveSession(sparkSession)
    sparkSession.sessionState.analyzer.executeAndCheck(logical)
  }

  lazy val withCachedData: LogicalPlan = {
    assertAnalyzed()
    assertSupported()
    sparkSession.sharedState.cacheManager.useCachedData(analyzed)
  }

  lazy val optimizedPlan: LogicalPlan = {
    sparkSession.sessionState.optimizer.execute(withCachedData)
  }

  lazy val sparkPlan: SparkPlan = {
    SparkSession.setActiveSession(sparkSession)
    // TODO: We use next(), i.e. take the first plan returned by the planner, here for now,
    // but we will implement to choose the best plan.
    planner.plan(ReturnAnswer(optimizedPlan)).next()
  }

  // executedPlan should not be used to initialize any SparkPlan. It should be
  // only used for execution.
  lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)

  /** Internal version of the RDD. Avoids copies and has no schema */
  lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
}
```
The logic is laid out very clearly: start from the last line, `lazy val toRdd: RDD[InternalRow] = executedPlan.execute()`, and trace backwards to see the whole flow. Attentive readers will also notice that every step is a `lazy val`, so nothing runs until `execute()` is called. Lazy evaluation is one of Spark's core design ideas.
Later posts will walk through each of these stages in detail.