spark-sql跑任務報錯org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location f...

spark-SQL跑任務報錯

錯誤信息如下

19/10/17 18:06:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e122_1568025476000_38356_01_000022 on host: node02. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal

19/10/17 18:06:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
19/10/17 18:06:50 INFO BlockManagerMaster: Removal of executor 21 requested
19/10/17 18:06:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
19/10/17 18:06:50 INFO TaskSetManager: Starting task 1.1 in stage 28.3 (TID 46, node02, executor 4, partition 1, NODE_LOCAL, 5066 bytes)
19/10/17 18:06:51 INFO BlockManagerInfo: Added broadcast_23_piece0 in memory on node02:27885 (size: 106.5 KB, free: 5.2 GB)
19/10/17 18:06:51 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 6 to xx.xx.xx.xx:30178
19/10/17 18:06:51 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 6 is 166 bytes
19/10/17 18:06:51 WARN TaskSetManager: Lost task 1.1 in stage 28.3 (TID 46, node02, executor 4): FetchFailed(null, shuffleId=6, mapId=-1, reduceId=166, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
    at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
    at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:165)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

)

上面的錯誤報完接著就是以下錯誤:
19/10/17 17:10:32 WARN TaskSetManager: Lost task 0.0 in stage 82.0 (TID 1855, node01, executor 5): FetchFailed(BlockManagerId(1, node01, 26969, None), shuffleId=13, mapId=1, reduceId=10, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hostname/xx.xx.xx.xx:26969
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:776)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:737)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:899)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:935)
    at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:313)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to node01/xx.xx.xx.xx:26969
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: node01/xx.xx.xx.xx:26969
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    ... 2 more

)

解決方案:

增加executor內存,把spark.executor.memory由10G調到20G就不出現(xiàn)這個異常了

原因分析

shuffle分為shuffle write和shuffle read兩部分。
shuffle write的分區(qū)數(shù)由上一階段的RDD分區(qū)數(shù)控制,shuffle read的分區(qū)數(shù)則是由Spark提供的一些參數(shù)控制。
shuffle write可以簡單理解為類似于saveAsLocalDiskFile的操作,將計算的中間結果按某種規(guī)則臨時放到各個executor所在的本地磁盤上。
shuffle read的時候數(shù)據(jù)的分區(qū)數(shù)則是由spark提供的一些參數(shù)控制。可以想到的是,如果這個參數(shù)值設置的很小,同時shuffle read的量很大,那么將會導致一個task需要處理的數(shù)據(jù)非常大。結果導致JVM crash,從而導致取shuffle數(shù)據(jù)失敗,同時executor也丟失了,看到Failed to connect to host的錯誤,也就是executor lost的意思。有時候即使不會導致JVM crash也會造成長時間的gc。

解決思路

知道原因后問題就好解決了,主要從shuffle的數(shù)據(jù)量和處理shuffle數(shù)據(jù)的分區(qū)數(shù)兩個角度入手。

  • 減少shuffle數(shù)據(jù)
    思考是否可以使用map side join或是broadcast join來規(guī)避shuffle的產(chǎn)生。
    將不必要的數(shù)據(jù)在shuffle前進行過濾,比如原始數(shù)據(jù)有20個字段,只要選取需要的字段進行處理即可,將會減少一定的shuffle數(shù)據(jù)。
  • SparkSQL和DataFrame的join,group by等操作
    通過spark.sql.shuffle.partitions控制分區(qū)數(shù),默認為200,根據(jù)shuffle的量以及計算的復雜度提高這個值。
  • Rdd的join,groupBy,reduceByKey等操作
    通過spark.default.parallelism控制shuffle read與reduce處理的分區(qū)數(shù),默認為運行任務的core的總數(shù)(mesos細粒度模式為8個,local模式為本地的core總數(shù)),官方建議為設置成運行任務的core的2-3倍。
  • 提高executor的內存
    通過spark.executor.memory適當提高executor的memory值。
    -是否存在數(shù)據(jù)傾斜的問題
    空值是否已經(jīng)過濾?異常數(shù)據(jù)(某個key數(shù)據(jù)特別大)是否可以單獨處理?考慮改變數(shù)據(jù)分區(qū)規(guī)則
最后編輯于
?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
【社區(qū)內容提示】社區(qū)部分內容疑似由AI輔助生成,瀏覽時請結合常識與多方信息審慎甄別。
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發(fā)布,文章內容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

相關閱讀更多精彩內容

友情鏈接更多精彩內容