Analysis of abnormal HiveServer2 memory usage

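1. Symptoms
HiveServer2 became unreachable several times. The process was still alive and memory usage had reached about 80%. Checking port 10000 with lsof -i:10000 showed that there were not many connections. The service recovered after a restart.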
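2. Investigation
1) hiveserver2.log and the other log files showed no obvious errors;
2) A dump file was found in the log directory: /data/log/hive/hs2_heapdump.hprof, 42 GB in size;
3) Because the dump file was too large, it could only be analyzed with MAT installed on Linux. The analysis was started by running the following command in the MAT installation directory: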
sudo nohup ./ParseHeapDump.sh /data/log/hive/hs2_heapdump.hprof org.eclipse.mat.api:suspects org.eclipse.mat.api:overview org.eclipse.mat.api:top_components >nohup.out 2>&1 &
4) After the analysis finished, the result files appeared in the directory containing the dump file:
-rw-r--r-- 1 root root 108278 12月 21 11:28 hs2_heapdump_Leak_Suspects.zip
-rw-r--r-- 1 root root 73633 12月 21 11:28 hs2_heapdump_System_Overview.zip
-rw-r--r-- 1 root root 588963 12月 21 11:38 hs2_heapdump_Top_Components.zip
5) The objects occupying the most memory are shown in the figure below:


image.png

3. Analysis and solution

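According to the Hive issue https://issues.apache.org/jira/browse/HIVE-24590 , the problem is caused by a Hive bug that has been fixed in 4.0 and later. Since we cannot upgrade right away, we followed the workaround from the issue discussion and changed the log4j memory purge policy;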

The change is as follows:

image.png

Reference:
https://github.com/apache/hive/blob/master/llap-server/src/main/resources/llap-daemon-log4j2.properties (IdlePurgePolicy configuration)

——————————————————————————————
20230117
HiveServer2 dump analysis:

image.png

JProfiler showed that the largest object is queryIdOperation in OperationManager. Source analysis of how queryId is handled:
Key code in Hive 3.1.0:
1. org.apache.hadoop.hive.ql.QueryState.Builder#build creates the queryId and sets it on the hiveConf object;
image.png

2. org.apache.hive.service.cli.operation.OperationManager#addOperation reads the queryId from the hiveConf object and stores it in the queryIdOperation Map as the key, with the Operation as the value;
image.png

3. org.apache.hive.service.cli.operation.OperationManager#removeOperation reads the queryId from the hiveConf object and removes that key from the queryIdOperation Map;
image.png

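Why multithreading scrambles the queryId:
Reference article: https://www.51cto.com/article/718451.html . A likely cause of the large queryIdOperation is that multiple threads share one connection: one thread overwrites another thread's queryId, so that thread loses its own queryId and its object can no longer be removed from the queryIdOperation Map.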
image.png

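Solution:
The scheduler in the referenced article is Airflow. Our DolphinScheduler setup likewise runs multiple queries over one connection, so it triggers the same problem;
Following the article's solution, modify the source code to move queryId maintenance from the Hive session level down to the operation level;
Difficulty: the HDP code is not open source, which makes the change hard;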
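
The difference between session-level and operation-level queryId can be illustrated with a small model. This is only a sketch under stated assumptions: the classes Session, Operation and OperationManager below are simplified stand-ins written for this note, not the Hive classes, and the interleaving of two queries on one shared session is replayed deterministically instead of with real threads.

```java
import java.util.HashMap;
import java.util.Map;

public class QueryIdScopeDemo {
    /** Stand-in for the session-level HiveConf: it holds only the latest queryId. */
    static final class Session {
        String queryId;
    }

    /** Stand-in for an Operation; ownQueryId is the hypothetical operation-level copy. */
    static final class Operation {
        final Session session;
        final String ownQueryId;

        Operation(Session session, String queryId) {
            this.session = session;
            this.ownQueryId = queryId;
            session.queryId = queryId; // overwrites the value shared by the whole session
        }
    }

    /** Stand-in for OperationManager and its queryIdOperation map. */
    static final class OperationManager {
        final Map<String, Operation> queryIdOperation = new HashMap<>();
        final boolean operationLevel;

        OperationManager(boolean operationLevel) {
            this.operationLevel = operationLevel;
        }

        String key(Operation op) {
            return operationLevel ? op.ownQueryId : op.session.queryId;
        }

        void addOperation(Operation op) {
            queryIdOperation.put(key(op), op);
        }

        void removeOperation(Operation op) {
            queryIdOperation.remove(key(op));
        }
    }

    /** Replays two overlapping queries on one session and returns the entries left in the map. */
    static int leftover(boolean operationLevel) {
        Session shared = new Session();
        OperationManager manager = new OperationManager(operationLevel);
        Operation a = new Operation(shared, "query-A");
        manager.addOperation(a);
        Operation b = new Operation(shared, "query-B"); // overwrites the shared queryId
        manager.addOperation(b);
        manager.removeOperation(a); // session-level: resolves to query-B and removes b's entry
        manager.removeOperation(b); // session-level: query-B is already gone, a's entry stays
        return manager.queryIdOperation.size();
    }

    public static void main(String[] args) {
        System.out.println("session-level leftover:   " + leftover(false)); // 1
        System.out.println("operation-level leftover: " + leftover(true));  // 0
    }
}
```

With the session-level key, one entry is left behind for every overwritten queryId; with a key owned by the operation itself, removal always finds the right entry.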

Server-side code path when HiveServer2 executes SQL:
1) org.apache.hive.service.cli.CLIService#executeStatement(org.apache.hive.service.cli.SessionHandle, java.lang.String, java.util.Map<java.lang.String,java.lang.String>, long) --- the server-side entry method for the client's RPC call
2)org.apache.hive.service.cli.session.HiveSessionImpl#executeStatementInternal
3) org.apache.hive.service.cli.operation.OperationManager#newExecuteStatementOperation ---- creates an ExecuteStatementOperation instance and calls addOperation to store it in the queryIdOperation Map; ExecuteStatementOperation is a subclass of Operation;
4)org.apache.hive.service.cli.operation.Operation#Operation(org.apache.hive.service.cli.session.HiveSession, java.util.Map<java.lang.String,java.lang.String>, org.apache.hive.service.cli.OperationType)
org.apache.hadoop.hive.ql.QueryState.Builder#build ---- receives the HiveSession's hiveConf


20230130 analysis:
The HiveServer2 logs show that the periods of memory spikes coincide with the run times of the metadata collection system, together with the following:


image.png

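The time range in the figure above matches the memory spike. The SQL being executed is the metadata collection system's crawl of the ods_ewp database, which contains many tables. The closing operation entries above should be triggered when the Connection is released. Releasing the Connection for this database fails with the following error: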
[INFO] 2023-01-30 03:33:17.850 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[66] - -> 2023-01-30 03:33:16.999 ERROR 24933 --- [ main] com.erwan365.seaman.util.HiveJdbcUtils : hive,jdbc鏈接資源關系發(fā)生異常?。?!,錯誤表是:ods_ewp,temp_customer

java.sql.SQLException: Error while cleaning up the server resources
    at org.apache.hive.jdbc.HiveConnection.close(HiveConnection.java:723) ~[seaman-action-meta-1.0.jar:1.4]
    at com.erwan365.seaman.util.HiveJdbcUtils.getTableDDL(HiveJdbcUtils.java:88) ~[seaman-action-meta-1.0.jar:1.4]
    at com.erwan365.seaman.action.meta.service.hive.entitiy.DBFetcher.fetch(DBFetcher.java:73) [seaman-action-meta-1.0.jar:1.4]
    at com.erwan365.seaman.action.meta.service.hive.entitiy.DBFetcher$$FastClassBySpringCGLIB$$24b1a38a.invoke(<generated>) [seaman-action-meta-1.0.jar:1.4]
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) [seaman-action-meta-1.0.jar:1.4]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:746) [seaman-action-meta-1.0.jar:1.4]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) [seaman-action-meta-1.0.jar:1.4]
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:294) ~[seaman-action-meta-1.0.jar:1.4]
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:98) ~[seaman-action-meta-1.0.jar:1.4]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185) [seaman-action-meta-1.0.jar:1.4]
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688) ~[seaman-action-meta-1.0.jar:1.4]
    at com.erwan365.seaman.action.meta.service.hive.entitiy.DBFetcher$$EnhancerBySpringCGLIB$$327440fd.fetch(<generated>) ~[seaman-action-meta-1.0.jar:1.4]
    at com.erwan365.seaman.action.meta.service.hive.entitiy.DatasourceFetcher.fetch(DatasourceFetcher.java:82) ~[seaman-action-meta-1.0.jar:1.4]
    at com.erwan365.seaman.action.meta.service.EntityCrawlerService.doCraw(EntityCrawlerService.java:71) ~[seaman-action-meta-1.0.jar:1.4]
    at com.erwan365.seaman.action.meta.ActionMain.doExecute(ActionMain.java:73) ~[seaman-action-meta-1.0.jar:1.4]
    at com.erwan365.seaman.action.common.AbstractAction.execute(AbstractAction.java:40) ~[seaman-action-meta-1.0.jar:1.4]
    at com.erwan365.seaman.action.meta.ActionMain.main(ActionMain.java:46) ~[seaman-action-meta-1.0.jar:1.4]
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.hive.service.cli.thrift.TCLIService$Client.recv_CloseSession(TCLIService.java:179) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.hive.service.cli.thrift.TCLIService$Client.CloseSession(TCLIService.java:166) ~[seaman-action-meta-1.0.jar:1.4]
    at org.apache.hive.jdbc.HiveConnection.close(HiveConnection.java:721) ~[seaman-action-meta-1.0.jar:1.4]
    ... 16 common frames omitted
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.8.0_131]
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[na:1.8.0_131]
    at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[na:1.8.0_131]
    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[na:1.8.0_131]
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[na:1.8.0_131]
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[na:1.8.0_131]
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_131]
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[seaman-action-meta-1.0.jar:1.4]
    ... 29 common frames omitted

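Why one queryId ends up mapped to multiple Operation instances:
When one Connection is reused to execute multiple Statements and a Statement is not closed after it finishes, the server never removes that execution's Operation instance from the queryIdOperation Map. After all Statements have run, calling close on the Connection triggers close on the HiveSession, which releases all Operations. Because the queryId is session-level, however, every Operation resolves to the queryId generated by the last Statement execution. As a result, closing the Connection releases only the last entry in the Map, and the memory held by the others is never freed;
Note: one Connection corresponds to one HiveSession, one HiveConf, one queryId, and multiple Statements; one Statement corresponds to one Operation; one Operation corresponds to one QueryState, and the queryId is created in QueryState and set on the HiveConf instance held by the HiveSession;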
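
The sequence described above can be replayed with a small model. This is a sketch only: SessionConf and OperationManager here are simplified stand-ins invented for illustration, not the Hive classes, and the model assumes that removal always looks up whatever queryId the session-level conf currently holds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SessionQueryIdLeakDemo {
    /** Stand-in for the session-level HiveConf: it keeps only the most recent queryId. */
    static final class SessionConf {
        String queryId;
    }

    /** Stand-in for OperationManager and its queryIdOperation map. */
    static final class OperationManager {
        final Map<String, Object> queryIdOperation = new HashMap<>();

        void addOperation(SessionConf conf, Object operation) {
            queryIdOperation.put(conf.queryId, operation);
        }

        void removeOperation(SessionConf conf) {
            queryIdOperation.remove(conf.queryId);
        }
    }

    /** Runs N statements on one session without closing them, then closes the session. */
    static int leakedEntries(int statements) {
        SessionConf conf = new SessionConf();
        OperationManager manager = new OperationManager();
        List<Object> openOperations = new ArrayList<>();

        for (int i = 1; i <= statements; i++) {
            conf.queryId = "query-" + i;    // each execution overwrites the session-level id
            Object operation = new Object();
            manager.addOperation(conf, operation);
            openOperations.add(operation);  // the statement is never closed
        }

        // Connection.close(): the session closes every open operation, but each
        // removal resolves to the id of the last statement.
        for (int i = 0; i < openOperations.size(); i++) {
            manager.removeOperation(conf);
        }
        return manager.queryIdOperation.size();
    }

    public static void main(String[] args) {
        System.out.println(leakedEntries(5)); // prints 4
    }
}
```

In this model, N unclosed statements leave N - 1 entries behind after the connection is closed, which matches the growth pattern seen when a database with many tables is crawled over one connection.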


image.png

image.png

Solution:
Fix the metadata collection code so that close is called after each Statement finishes executing;
Note: the metadata collection system never uses one connection concurrently from multiple threads;
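
A minimal sketch of what closing every Statement could look like on the client side, using only the standard java.sql interfaces. The class name TableDdlFetcher, the method fetchDdl and the SHOW CREATE TABLE query are assumptions made for illustration; the actual metadata collection code is not shown in this note.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class TableDdlFetcher {
    /**
     * Fetches the DDL of each table over one reused Connection.
     * Every Statement and ResultSet is closed by try-with-resources before
     * the next table is processed, even if the query throws.
     */
    static List<String> fetchDdl(Connection conn, String database, List<String> tables)
            throws SQLException {
        List<String> ddls = new ArrayList<>();
        for (String table : tables) {
            String sql = "SHOW CREATE TABLE " + database + "." + table;
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                StringBuilder ddl = new StringBuilder();
                while (rs.next()) {
                    ddl.append(rs.getString(1)).append('\n');
                }
                ddls.add(ddl.toString());
            }
        }
        return ddls;
    }
}
```

Closing each Statement lets the server clean up the matching Operation while the session-level queryId still belongs to it, instead of deferring all cleanup to Connection.close().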

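Summary:
1) The main cause of the HiveServer2 memory leak is a defect in the metadata collection system's code;
2) Multiple threads sharing one Connection also leaks memory, but the current volume of offline batch jobs is small, so this should not cause a noticeable memory spike;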

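3) Hive JDBC is not suited to large-scale offline jobs. Later we will develop a Spark SQL plugin in DolphinScheduler and gradually migrate jobs to Spark SQL; for now, HiveServer2's queryId will not be changed from session level to operation level.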