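Symptom
When a large number of Spark jobs are submitted, individual Tasks occasionally hang for a period of time, which makes the overall Stage duration abnormally long.
Possible Cause
NodeManager Full GC
Analysis
Sampled job: Job 836
Abnormal Stage 2249 -> hung Task 8:
Corresponding Executor log: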
...
INFO | [Executor task launch worker-78] | Running task 8.0 in stage 2249.0 (TID 222920) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
ERROR | [shuffle-client-1] | Connection is dead; please adjust spark.network.timeout if this is wrong | org.apache.spark.network.server.TransportChannelHandler.userEventTriggered(TransportChannelHandler.java:128)
ERROR | [shuffle-client-1] | Still have 2 requests outstanding when connection from /10.12.122.244:27337 is closed | org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:102)
INFO | [shuffle-client-1] | Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms | org.apache.spark.network.shuffle.RetryingBlockFetcher.initiateRetry(RetryingBlockFetcher.java:163)
ERROR | [shuffle-client-1] | Failed while starting block fetches | org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onFailure(OneForOneBlockFetcher.java:151)
java.io.IOException: Connection from /10.12.122.244:27337 closed
at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:94)
...
INFO | [shuffle-client-1] | Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms | org.apache.spark.network.shuffle.RetryingBlockFetcher.initiateRetry(RetryingBlockFetcher.java:163)
...
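Locating the remote endpoint behind the failed fetches can be sketched as a small log scan. This is a minimal Python sketch, not part of the original investigation: the function name `summarize_fetch_failures` and both regular expressions are assumptions written against the message wording in the excerpt above, and other Spark versions may word these messages differently.

```python
import re
from collections import Counter

# Remote endpoint in lines such as
# "java.io.IOException: Connection from /10.12.122.244:27337 closed".
CLOSED_RE = re.compile(r"Connection from /(\d{1,3}(?:\.\d{1,3}){3}):(\d+) closed")
# Retry notice such as
# "Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms".
RETRY_RE = re.compile(
    r"Retrying fetch \((\d+)/(\d+)\) for (\d+) outstanding blocks after (\d+) ms"
)


def summarize_fetch_failures(lines):
    """Count closed connections per remote endpoint and sum the retry waits."""
    endpoints = Counter()
    retry_wait_ms = 0
    for line in lines:
        closed = CLOSED_RE.search(line)
        if closed:
            endpoints[(closed.group(1), int(closed.group(2)))] += 1
        retry = RETRY_RE.search(line)
        if retry:
            retry_wait_ms += int(retry.group(4))
    return endpoints, retry_wait_ms


# Three lines taken from the excerpt above (trailing source location omitted).
sample = [
    "INFO | [shuffle-client-1] | Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms",
    "java.io.IOException: Connection from /10.12.122.244:27337 closed",
    "INFO | [shuffle-client-1] | Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms",
]
endpoints, retry_wait_ms = summarize_fetch_failures(sample)
```

Endpoints that show up repeatedly across many Executor logs are the natural candidates to inspect first.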
Port 27337 on host 10.12.122.244 turned out to belong to the NodeManager. Its memory was exhausted, and its GC log showed frequent, long Full GC pauses. During those pauses the NodeManager could not respond to Executor requests for extended periods, which in turn caused the Executor to hang.
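The GC log check can be sketched in the same way. The article does not show the NodeManager GC log, so this minimal Python sketch assumes the JDK 8 `-XX:+PrintGCDetails` line layout, in which a Full GC line ends with `real=<seconds> secs]`. The function name `long_full_gc_pauses`, the sample lines, and the 120-second threshold are all hypothetical and need adapting to the actual collector and log format.

```python
import re

# Assumed JDK 8 PrintGCDetails layout: "[Full GC (...) ... real=12.35 secs]".
FULL_GC_RE = re.compile(r"\[Full GC.*real=(\d+(?:\.\d+)?) secs\]")


def long_full_gc_pauses(lines, threshold_secs):
    """Return wall-clock durations of Full GC events at or above the threshold."""
    pauses = []
    for line in lines:
        match = FULL_GC_RE.search(line)
        if match:
            secs = float(match.group(1))
            if secs >= threshold_secs:
                pauses.append(secs)
    return pauses


# Hypothetical GC log lines, for illustration only (not from the incident).
sample_gc = [
    "[GC (Allocation Failure) 5120K->4000K(6144K), 0.0200000 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]",
    "[Full GC (Ergonomics) 5120K->4000K(6144K), 0.4000000 secs] [Times: user=0.70 sys=0.00, real=0.40 secs]",
    "[Full GC (Ergonomics) 6100K->6090K(6144K), 135.2000000 secs] [Times: user=260.10 sys=0.50, real=135.20 secs]",
]
pauses = long_full_gc_pauses(sample_gc, threshold_secs=120.0)
```

A reasonable threshold is whatever pause length is long enough for the Executor side to consider the connection dead, which the log above ties to `spark.network.timeout`.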
Solution
Increase the NodeManager heap size so that it matches the memory demand of the workload.