Serialization during Spark task execution

Let's start with an example:

/*
  First we define a Search class that takes a single String parameter, query.
  The class has three member methods:
  1) isMatch: checks whether the argument string s contains the substring query
  2) getMatchRdd1: filters the RDD using the isMatch method
  3) getMatchRdd2: filters the RDD with an anonymous function defined inside filter
 */

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

class Search(query: String) {
  def isMatch(s: String): Boolean = {
    s.contains(query)
  }

  def getMatchRdd1(rdd: RDD[String]): RDD[String] = {
    rdd.filter(isMatch)
  }

  def getMatchRdd2(rdd: RDD[String]): RDD[String] = {
    rdd.filter(_.contains(query))
  }

}

object SerializableDemo {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[4]").setAppName("SerializableDemo")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Array("hello", "world", "hello", "spark"))

    val search = new Search("h")

    val matchRdd = search.getMatchRdd2(rdd)
    matchRdd.collect().foreach(println)

    sc.stop()

  }
}

Running it gives:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:388)
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.filter(RDD.scala:387)
    at adamlee.spark.Search.getMatchRdd2(SerializableDemo.scala:34)
    at adamlee.spark.SerializableDemo$.main(SerializableDemo.scala:16)
    at adamlee.spark.SerializableDemo.main(SerializableDemo.scala)
Caused by: java.io.NotSerializableException: adamlee.spark.Search
Serialization stack:
    - object not serializable (class: adamlee.spark.Search, value: adamlee.spark.Search@4cafa9aa)
    - field (class: adamlee.spark.Search$$anonfun$getMatchRdd2$1, name: $outer, type: class adamlee.spark.Search)
    - object (class adamlee.spark.Search$$anonfun$getMatchRdd2$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
    ... 12 more

The error says the Task could not be serialized. The "Caused by" line gives the root cause: object not serializable, meaning an instance of the Search class could not be serialized.

The reason is that the search object is created on the Driver. When collect() is called, the job is triggered and the Driver has to ship the tasks to the Executors. This is inter-process communication: the Driver and Executors talk over the network, and only serialized byte streams can be sent over the wire. The lambda in getMatchRdd2, _.contains(query), references query, which is a constructor parameter of Search, so the closure captures the enclosing Search instance (the $outer field shown in the serialization stack above), and that instance has to be serialized along with the task. Since Search does not extend Serializable, serialization fails.
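Besides making the class Serializable (shown next), a common alternative is to avoid capturing the object at all. A minimal sketch, using a hypothetical extra method getMatchRdd3 and local val q (neither is in the original code): copy query into a local variable, so the closure only captures a String.

  // Hypothetical method added to the non-serializable Search class:
  // copying query into a local val means the filter closure captures only
  // that String, not the enclosing Search instance, so Search itself
  // never needs to be serialized.
  def getMatchRdd3(rdd: RDD[String]): RDD[String] = {
    val q = query // local copy captured by the closure instead of this.query
    rdd.filter(_.contains(q))
  }

With this variant, the original non-serializable Search class would work unchanged for the filter, because only the String q is shipped inside the closure.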

Now let's create a new class, Search1, which extends Serializable:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

class Search1(query: String) extends Serializable {
  def isMatch(s: String): Boolean = {
    s.contains(query)
  }

  def getMatchRdd1(rdd: RDD[String]): RDD[String] = {
    rdd.filter(isMatch)
  }

  def getMatchRdd2(rdd: RDD[String]): RDD[String] = {
    rdd.filter(_.contains(query))
  }

}

object SerializableDemo {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[4]").setAppName("SerializableDemo")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Array("hello", "world", "hello", "spark"))

    val search1 = new Search1("h")

    val matchRdd = search1.getMatchRdd2(rdd)
    matchRdd.collect().foreach(println)

    sc.stop()

  }
}

Running it gives:

hello
hello
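For completeness, a small sketch (added to the same main method) showing that the method-based getMatchRdd1 also works once Search1 is Serializable: rdd.filter(isMatch) passes a method reference, which likewise captures the enclosing Search1 instance, but Spark can now serialize it.

    // Sketch: the method-reference variant also works now, because the
    // captured Search1 instance can be serialized and shipped to Executors.
    val matchRdd1 = search1.getMatchRdd1(rdd)
    matchRdd1.collect().foreach(println) // also prints: hello, hello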
