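Background: After Spark was upgraded from 2.0.0 to 2.2.1, the tool jar a colleague had written for loading PMML models in Spark started failing when run as a scheduled job.
Goal: Upgrade the PMML-loading tool jar built for Spark 2.0.0 so that it supports Spark 2.2.1.
Solution:
Usage of the old version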
spark-submit \
--class org.jpmml.spark.SparkPmmlWithHive \
--master yarn \
--queue queueName \
--deploy-mode client \
--jars /appcom/service/hive/lib/datanucleus-core-3.2.10.jar \
--files /appcom/config/hive/hive-site.xml \
${dir}/spark-pmml-1.0-SNAPSHOT.jar ${dir}/etl_lsvm-gxd-0.9.xml db.tbl_1 db.tbl_2
spark-pmml-1.0-SNAPSHOT.jar is the jar my colleague built against Spark 2.0.0. After we upgraded Spark to 2.2.1, it began throwing the following error, which caused the scheduled job to fail.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.CreateStruct.<init>(Lscala/collection/Seq;)V
    at org.jpmml.spark.PMMLTransformer.transform(PMMLTransformer.java:149)
    at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:305)
    at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:305)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
    at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
    at org.jpmml.spark.SparkPmmlWithHive.main(SparkPmmlWithHive.java:25)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
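The cause is that CreateStruct is a class in Spark 2.0.0, whereas in Spark 2.2.1 (the change was in fact already made in 2.1.x) it is defined as an object. The old jar was compiled to call the CreateStruct constructor, which no longer exists at runtime.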
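To confirm which shape of CreateStruct a given Spark installation ships, a small reflection probe can help. The sketch below is only an illustration: the class name `ScalaObjectProbe` and its methods are hypothetical, and it relies on the Scala compiler convention that `object Foo` compiles to a class `Foo$` with a static `MODULE$` field. Because a case class also has a companion object, the more telling value is the number of public constructors on the plain class name.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

/** Hypothetical helper: reports whether a name has a Scala object and how many public constructors it exposes. */
final class ScalaObjectProbe {

    /** Loads a class without running its static initializer; returns null if it cannot be loaded. */
    static Class<?> load(String name) {
        try {
            return Class.forName(name, false, ScalaObjectProbe.class.getClassLoader());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    /** True if `name$` exists and carries a public static MODULE$ field (the compiled form of a Scala object). */
    static boolean isScalaObject(String name) {
        Class<?> module = load(name + "$");
        if (module == null) {
            return false;
        }
        try {
            Field instance = module.getField("MODULE$");
            return Modifier.isStatic(instance.getModifiers());
        } catch (NoSuchFieldException e) {
            return false;
        }
    }

    /** Number of public constructors on the class itself, or -1 when the class cannot be loaded. */
    static int publicConstructorCount(String name) {
        Class<?> type = load(name);
        return type == null ? -1 : type.getConstructors().length;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "java.lang.Object";
        System.out.println(name
                + " scalaObject=" + isScalaObject(name)
                + " publicConstructors=" + publicConstructorCount(name));
    }
}
```

Assuming the Spark jars live under `$SPARK_HOME/jars` (the usual layout for Spark 2.x, but check your installation), it could be run as `java -cp "$SPARK_HOME/jars/*:." ScalaObjectProbe org.apache.spark.sql.catalyst.expressions.CreateStruct`. If no public constructor is reported, code compiled against the old constructor cannot link.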


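According to https://github.com/jpmml/jpmml-evaluator-spark/issues/11, someone has already written an upgraded version, so I pulled that project to see whether it would work with a few changes.
git clone https://github.com/sidfeiner/jpmml-spark.git
The code pulled down targets Spark 1.x. Since we need 2.2.1, it has to be modified further, following that contributor's changes.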

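On that page, he has already upgraded the Spark version to 2.1.0, which solves the problem of CreateStruct being redefined. Note, however, that I am upgrading to 2.2.1, and there is one small difference: the ScalaUDF constructor used in the PMMLTransformer class takes four parameters in 2.1.0 but five in 2.2.1.
If only four parameters are passed, the following error is raised, because the server is already running 2.2.1: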
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;)V
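Before editing PMMLTransformer, it can be worth listing the constructors that the installed Spark actually exposes for ScalaUDF. The following is a minimal sketch using plain Java reflection. The class name `ConstructorSignatures` is hypothetical, and the program only prints what it finds rather than assuming any particular signature. To my knowledge the extra fifth parameter in Spark 2.2.x is an optional UDF name (`Option[String]`), but treat that as something to confirm from the output.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: lists the public constructors of a class with their parameter types. */
final class ConstructorSignatures {

    /** Returns one sorted line per public constructor: parameter count, then fully qualified parameter types. */
    static List<String> describe(Class<?> type) {
        List<String> lines = new ArrayList<>();
        for (Constructor<?> constructor : type.getConstructors()) {
            StringBuilder line = new StringBuilder();
            line.append(constructor.getParameterCount()).append(" params: (");
            Class<?>[] params = constructor.getParameterTypes();
            for (int i = 0; i < params.length; i++) {
                if (i > 0) {
                    line.append(", ");
                }
                line.append(params[i].getName());
            }
            lines.add(line.append(")").toString());
        }
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults to a JDK class so the program also runs without Spark on the classpath.
        String name = args.length > 0 ? args[0] : "java.lang.Object";
        for (String line : describe(Class.forName(name))) {
            System.out.println(line);
        }
    }
}
```

With the Spark jars on the classpath (for example `java -cp "$SPARK_HOME/jars/*:." ConstructorSignatures org.apache.spark.sql.catalyst.expressions.ScalaUDF`, where the jar location is an assumption), the printed parameter list shows which arguments the `new ScalaUDF(...)` call in PMMLTransformer has to supply.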

Once this is fixed, the job is essentially done. Build the project:
cd jpmml-spark
mvn clean install
This produces:
pmml-spark/target/pmml-spark-1.0-SNAPSHOT.jar - Library JAR file.
pmml-spark-example/target/example-1.0-SNAPSHOT.jar - Example application JAR file.
Of these, example-1.0-SNAPSHOT.jar is the jar we need. Rename it to spark-pmml-2.0.jar and use it to replace my colleague's old jar.
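Since the spark-submit command names its entry point with `--class org.jpmml.spark.SparkPmmlWithHive`, the replacement jar has to contain that class. Whether the rebuilt example jar does depends on how the colleague's code was carried into the project, which is not shown here, so a quick check before swapping jars may save a failed run. The sketch below uses the standard `java.util.jar.JarFile` API. The class name `JarClassCheck` is hypothetical.

```java
import java.io.IOException;
import java.util.jar.JarFile;

/** Hypothetical helper: checks whether a jar file contains a given class. */
final class JarClassCheck {

    /** Converts a fully qualified class name to its entry path inside a jar. */
    static String entryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** True if the jar can be opened and holds an entry for the class; false otherwise. */
    static boolean containsClass(String jarPath, String className) {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryName(className)) != null;
        } catch (IOException e) {
            // A missing or unreadable jar is treated the same as a jar without the class.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java JarClassCheck <jar-path> <class-name>");
            return;
        }
        System.out.println(containsClass(args[0], args[1]) ? "found" : "missing");
    }
}
```

For example: `java JarClassCheck spark-pmml-2.0.jar org.jpmml.spark.SparkPmmlWithHive`. If it prints `missing`, either the class needs to be added to the project before building, or the `--class` argument needs to point at the entry point the new jar provides.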
References:
https://github.com/jpmml/jpmml-evaluator-spark
https://github.com/jpmml/jpmml-evaluator-spark/issues/11
https://github.com/sidfeiner/jpmml-spark