先看一個例子:
/*
首先我們定義了一個Search對象,帶有一個String類型的參數(shù)
該類擁有三個成員方法:
1)isMatch:判斷參數(shù)字符串s是否包含子串query
2)getMatchRdd1:使用isMatch方法獲取匹配結(jié)果后的RDD
3)getMatchRdd1:在filter中實(shí)現(xiàn)方法獲取匹配結(jié)果后的RDD
*/
class Search(query: String) {
def isMatch(s: String): Boolean = {
s.contains(query)
}
def getMatchRdd1(rdd: RDD[String]): RDD[String] = {
rdd.filter(isMatch)
}
def getMatchRdd2(rdd: RDD[String]): RDD[String] = {
rdd.filter(_.contains(query))
}
}
object SerializableDemo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[4]").setAppName("SerializableDemo")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(Array("hello", "world", "hello", "spark"))
val search = new Search("h")
val matchRdd = search.getMatchRdd2(rdd)
matchRdd.collect().foreach(println)
sc.stop()
}
}
運(yùn)行后結(jié)果:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner.org
spark
ClosureCleaner
anonfun
1.apply(RDD.scala:388)
at org.apache.spark.rdd.RDD$$anonfun1.apply(RDD.scala:387)
at org.apache.spark.rdd.RDDOperationScope.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.filter(RDD.scala:387)
at adamlee.spark.Search.getMatchRdd2(SerializableDemo.scala:34)
at adamlee.spark.SerializableDemo$.main(SerializableDemo.scala:16)
at adamlee.spark.SerializableDemo.main(SerializableDemo.scala)
Caused by: java.io.NotSerializableException: adamlee.spark.Search
Serialization stack:
- object not serializable (class: adamlee.spark.Search, value: adamlee.spark.Search@4cafa9aa)
- field (class: adamlee.spark.Search$$anonfun
1, name: $outer, type: class adamlee.spark.Search)
- object (class adamlee.spark.Search$$anonfun
1, <function1>)
at org.apache.spark.serializer.SerializationDebugger.ensureSerializable(ClosureCleaner.scala:342)
... 12 more
報錯提示Task未能序列化,再看Caused By提示:object not serializable,告訴我們Search這個類的對象未能序列化。
原因就是search對象初始化是在Driver端進(jìn)行的,當(dāng)我們執(zhí)行collect是,觸發(fā)計算,Driver需要將任務(wù)下發(fā)至Executor,這時候就產(chǎn)生了進(jìn)程間通信,Driver和Executor間通信是通過網(wǎng)絡(luò)傳輸,網(wǎng)絡(luò)上傳輸?shù)氖嵌M(jìn)制的比特流,由于Search類并未繼承Serializable類,所以這個類的對象就不能被序列化。
現(xiàn)在我們新建一個類Search1,繼承了Serializable:
class Search1(query: String) extends Serializable {
def isMatch(s: String): Boolean = {
s.contains(query)
}
def getMatchRdd1(rdd: RDD[String]): RDD[String] = {
rdd.filter(isMatch)
}
def getMatchRdd2(rdd: RDD[String]): RDD[String] = {
rdd.filter(_.contains(query))
}
}
object SerializableDemo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[4]").setAppName("SerializableDemo")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(Array("hello", "world", "hello", "spark"))
val search1 = new Search1("h")
val matchRdd = search1.getMatchRdd2(rdd)
matchRdd.collect().foreach(println)
sc.stop()
}
}
運(yùn)行后結(jié)果:
hello
hello