title: An Approach to Implementing a HiveServer2-Based Azkaban Plugin
date: 2017-02-05 13:45:03
tags: [Azkaban plugin,Hive,HiveServer2]
categories: "Azkaban"
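Keywords: HiveServer2, Azkaban plugin
I recently looked into HiveServer2 and implemented a HiveServer2-based job type in Azkaban's plugin module. This post summarizes what I learned.
HiveServer & HiveServer2
First, HiveServer. HiveServer is a server that lets remote clients submit Hive jobs and retrieve their results through requests. It is built on the Thrift framework, which is why it is sometimes called the Thrift server. Because the later HiveServer2 is also built on Thrift, that name is ambiguous, so the original service is usually referred to as HiveServer (or HiveServer1).
HiveServer2 is the better of the two: it supports concurrent clients and authentication. HiveServer is no longer recommended. The Hive documentation describes it as follows: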
HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results. HiveServer is built on Apache Thrift™ (http://thrift.apache.org/), therefore it is sometimes called the Thrift server although this can lead to confusion because a newer service named HiveServer2 is also built on Thrift. Since the introduction of HiveServer2, HiveServer has also been called HiveServer1. \
HiveServer cannot handle concurrent requests from more than one client. This is actually a limitation imposed by the Thrift interface that HiveServer exports, and can't be resolved by modifying the HiveServer code. \
HiveServer2 is a rewrite of HiveServer that addresses these problems, starting with Hive 0.11.0. Use of HiveServer2 is recommended.
How Azkaban plugins work
Azkaban's Hadoop-related job plugin classes all extend the JavaProcessJob class. A JavaProcessJob is essentially a standalone Java process, and inside that process a client is invoked to run the Hadoop-related job. Schematic: \

Every plugin type must implement the run method and the cancel method.
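As a rough illustration of the run/cancel contract, the sketch below wraps a child process the way a process-based job type might. The names `SimpleJob` and `ProcessBackedJob` are hypothetical and invented for this example; they are not Azkaban's actual classes, and the real job interface has more to it.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical stand-in for the job contract; NOT Azkaban's real interface.
interface SimpleJob {
    void run() throws Exception;
    void cancel() throws Exception;
}

// Runs a command in a child process; cancel() kills that process.
class ProcessBackedJob implements SimpleJob {
    private final List<String> command;
    private volatile Process process;
    private volatile boolean cancelled;

    ProcessBackedJob(List<String> command) {
        this.command = command;
    }

    @Override
    public void run() throws IOException, InterruptedException {
        if (cancelled) {
            throw new IllegalStateException("job was cancelled before it started");
        }
        process = new ProcessBuilder(command).inheritIO().start();
        if (cancelled) {
            // cancel() raced with start(): make sure the child does not linger.
            process.destroy();
        }
        int exitCode = process.waitFor();
        if (cancelled) {
            throw new IllegalStateException("job was cancelled");
        }
        if (exitCode != 0) {
            throw new IllegalStateException("job failed with exit code " + exitCode);
        }
    }

    @Override
    public void cancel() {
        cancelled = true;
        Process p = process;
        if (p != null) {
            p.destroy();
        }
    }

    boolean isCancelled() {
        return cancelled;
    }
}
```

The point of the design is that cancel is called from a different thread than run, so the shared state is kept in volatile fields and cancel only signals and kills; run is the one that reports the outcome.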
Implementing the HiveServer2-based plugin
Here is an old diagram of HiveServer:

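Reading the code of Azkaban's Hive plugin closely shows that Azkaban submits Hive jobs through the Hive client, that is, the CLI path in the diagram. This approach has quite a few problems: because it bypasses HiveServer2 entirely, it gets neither concurrent-client support nor authentication, which leaves a lot of hidden risk.
That is why a HiveServer2-based Azkaban plugin is worth building.
The Azkaban side was briefly covered in the article "Azkaban Learning". The plugin can largely be modeled on the HadoopJava job type's implementation, so I won't go into detail here.
Clients typically submit jobs to HiveServer2 through JDBC, which sits on top of a Thrift client. Here are the calls that client exposes: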
Submit HQL synchronously:
ThriftCLIServiceClient.executeStatement(SessionHandle sessionHandle, String statement, Map<String, String> confOverlay) throws HiveSQLException

Submit HQL asynchronously:
ThriftCLIServiceClient.executeStatementAsync(SessionHandle sessionHandle, String statement, Map<String, String> confOverlay) throws HiveSQLException

Fetch logs or results:
ThriftCLIServiceClient.fetchResults(OperationHandle opHandle, FetchOrientation orientation, long maxRows, FetchType fetchType) throws HiveSQLException
(FetchType is either FetchType.LOG or FetchType.QUERY_OUTPUT, for logs and results respectively.)

Query execution status:
ThriftCLIServiceClient.getOperationStatus(OperationHandle opHandle) throws HiveSQLException
(The states are: INITIALIZED, RUNNING, FINISHED, CANCELED, CLOSED, ERROR, UNKNOWN, PENDING.)

Cancel execution:
ThriftCLIServiceClient.cancelOperation(OperationHandle opHandle) throws HiveSQLException

Close the handle:
ThriftCLIServiceClient.closeOperation(OperationHandle opHandle) throws HiveSQLException
The handle must be closed after each SQL statement finishes.

These calls are more than enough to implement the plugin. I can't publish the actual implementation, but feel free to message me privately to discuss it.
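To show how these calls might fit together, here is a minimal sketch of the loop for a single statement: submit asynchronously, poll the status, collect logs, and always close the handle. It is not the plugin's actual code. `StatementClient`, `StatementRunner`, `OpState`, and `FakeClient` are hypothetical names for this example; the interface only mirrors the calls listed above with simplified signatures (string handles instead of SessionHandle/OperationHandle), and treating FINISHED, CANCELED, CLOSED, and ERROR as terminal is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// The states listed above, as a plain enum for this sketch.
enum OpState { INITIALIZED, PENDING, RUNNING, FINISHED, CANCELED, CLOSED, ERROR, UNKNOWN }

// Hypothetical facade over the client calls listed above; handles are opaque strings here.
interface StatementClient {
    String executeAsync(String hql) throws Exception;
    OpState status(String handle) throws Exception;
    List<String> fetchLogs(String handle) throws Exception;
    void cancel(String handle) throws Exception;
    void close(String handle) throws Exception;
}

class StatementRunner {
    private final StatementClient client;
    private final long pollMillis;
    private volatile boolean cancelRequested;

    StatementRunner(StatementClient client, long pollMillis) {
        this.client = client;
        this.pollMillis = pollMillis;
    }

    // Intended to be called from another thread, e.g. by the job's cancel method.
    void requestCancel() {
        cancelRequested = true;
    }

    // Submits one statement, polls until a terminal state, and always closes the handle.
    OpState runOne(String hql, List<String> logSink) throws Exception {
        String handle = client.executeAsync(hql);
        try {
            while (true) {
                if (cancelRequested) {
                    client.cancel(handle);
                    return OpState.CANCELED;
                }
                OpState state = client.status(handle);
                logSink.addAll(client.fetchLogs(handle));
                if (isTerminal(state)) {
                    return state;
                }
                Thread.sleep(pollMillis);
            }
        } finally {
            client.close(handle);
        }
    }

    // Assumption of this sketch: these four states mean the operation will not progress further.
    static boolean isTerminal(OpState s) {
        return s == OpState.FINISHED || s == OpState.CANCELED
                || s == OpState.CLOSED || s == OpState.ERROR;
    }
}

// In-memory fake used only to exercise the loop without a server.
class FakeClient implements StatementClient {
    private int polls;
    boolean closed;
    boolean cancelled;

    public String executeAsync(String hql) { return "op-1"; }

    // Reports RUNNING for the first two polls, then FINISHED.
    public OpState status(String handle) {
        polls++;
        return polls < 3 ? OpState.RUNNING : OpState.FINISHED;
    }

    public List<String> fetchLogs(String handle) {
        List<String> lines = new ArrayList<>();
        lines.add("poll " + polls);
        return lines;
    }

    public void cancel(String handle) { cancelled = true; }

    public void close(String handle) { closed = true; }
}
```

Closing the handle in a finally block is what enforces the "close after every statement" rule, including when the statement fails or is cancelled.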
References:
- https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview
- https://hive.apache.org/
- https://github.com/azkaban/azkaban-plugins
=============2017.06.14 Update ====================
The HiveServer2 API described above is now outdated and too low-level. There is now an API that closely resembles JDBC, which calls the interfaces above under the hood.
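For comparison, a minimal sketch of the JDBC route, assuming the JDBC-like API here means the standard Hive JDBC driver. It uses only java.sql interfaces and assumes the Hive JDBC driver jar is on the classpath at runtime. The class name `HiveJdbcSketch`, its helper methods, and the host, port, and database values are illustrative, not taken from the plugin.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class HiveJdbcSketch {
    // Builds a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Runs one query, prints the first column of each row, and returns the row count.
    // Requires a reachable HiveServer2 and the Hive JDBC driver on the classpath.
    static int printFirstColumn(String url, String user, String password, String hql)
            throws SQLException {
        int rows = 0;
        // try-with-resources closes the result set, statement, and connection in reverse order.
        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(hql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
                rows++;
            }
        }
        return rows;
    }
}
```

A call might look like `HiveJdbcSketch.printFirstColumn(HiveJdbcSketch.jdbcUrl("localhost", 10000, "default"), "user", "", "show tables")`. With this route, the plugin's cancel method would most likely map to `Statement.cancel()` called from another thread, and closing the statement takes the place of closing the operation handle by hand.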