Dissecting a nasty pitfall when writing data from PySpark to HBase 2.x

1. Problem statement

Method not found: org.apache.hadoop.hbase.client.Put.add([B[B[B)Lorg/apache/hadoop/hbase/client/Put;

java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.add([B[B[B)Lorg/apache/hadoop/hbase/client/Put;
        at org.apache.spark.examples.pythonconverters.StringListToPutConverter.convert(HBaseConverters.scala:81)
        at org.apache.spark.examples.pythonconverters.StringListToPutConverter.convert(HBaseConverters.scala:77)
        at org.apache.spark.api.python.PythonHadoopUtil$$anonfun$convertRDD$1.apply(PythonHadoopUtil.scala:181)
        at org.apache.spark.api.python.PythonHadoopUtil$$anonfun$convertRDD$1.apply(PythonHadoopUtil.scala:181)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:129)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
21/04/07 16:58:58 ERROR Utils: Aborting task

2. Root cause

The StringListToPutConverter class in spark-examples calls the Put.add method from hbase-client. When HBase was upgraded to 2.x, the hbase-client Put API changed: Put.add(byte[], byte[], byte[]) was replaced by Put.addColumn(byte[], byte[], byte[]), so the old call site no longer links at runtime and throws NoSuchMethodError.

3. Solution

3.1 Fix the code

Download the latest Spark source, modify the StringListToPutConverter class, and repackage spark-examples. The original StringListToPutConverter source:

class StringListToPutConverter extends Converter[Any, Put] {
  override def convert(obj: Any): Put = {
    val output = obj.asInstanceOf[java.util.ArrayList[String]].asScala.map(Bytes.toBytes).toArray
    val put = new Put(output(0))
    // Put.add(byte[], byte[], byte[]) was removed in HBase 2.x, hence the NoSuchMethodError
    put.add(output(1), output(2), output(3))
  }
}

The corrected code:

class StringListToPutConverter extends Converter[Any, Put] {
  override def convert(obj: Any): Put = {
    // input is a 4-element string list: [rowkey, column family, qualifier, value]
    val output = obj.asInstanceOf[java.util.ArrayList[String]].asScala.map(Bytes.toBytes).toArray
    val put = new Put(output(0))
    // addColumn is the HBase 2.x replacement; it returns the Put, which convert returns
    put.addColumn(output(1), output(2), output(3))
  }
}
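To make the converter's expected input shape concrete, here is a pure-Python stand-in for its logic (no Spark or HBase required; the function name is mine, not from the Spark source):

```python
def string_list_to_put(record):
    """Mimic StringListToPutConverter: a 4-element list of strings
    [rowkey, family, qualifier, value] describes one cell write.
    Returns (rowkey_bytes, (family_bytes, qualifier_bytes, value_bytes)),
    a plain-Python stand-in for the resulting HBase Put."""
    if len(record) != 4:
        raise ValueError("expected [rowkey, family, qualifier, value]")
    row, family, qualifier, value = (s.encode("utf-8") for s in record)
    return (row, (family, qualifier, value))
```

Any record that is not exactly four strings would fail the cast in the real Scala converter as well, so the PySpark side must emit lists of this shape.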

3.2 Rebuild the package

Rebuild with Maven; the command used is:

mvn clean install -e -X -pl :spark-examples_2.11_hardfixed 

Note: change the corresponding artifactId in the pom.xml of the spark-examples sub-project to match the module name passed to -pl.

3.3 Replace the jar

The build produces original-spark-examples_2.11_hardfixed-2.4.3.jar in the target directory, now compatible with the new API. Replace the original spark-examples_2.11-1.6.0-typesafe-001.jar under /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/jars/hbase/ with it, and the write finally succeeds. The jar I patched is attached; download it, unpack it, and drop it into that path.
Attachment: original-spark-examples_2.11_hardfixed-2.4.3.jar
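With the patched jar in place, the converter is wired up from PySpark roughly as in Spark's hbase_outputformat.py example. Below is a hedged sketch: the table name, ZooKeeper quorum, and sample record are placeholders, and the actual write call is shown only in comments because it needs a live Spark/HBase cluster:

```python
def build_hbase_conf(table, zk_quorum="localhost"):
    """Hadoop job configuration for writing to HBase via TableOutputFormat.
    Values here are placeholders for a real cluster."""
    return {
        "hbase.zookeeper.quorum": zk_quorum,
        "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class":
            "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class":
            "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class":
            "org.apache.hadoop.io.Writable",
    }

def to_put_record(row, family, qualifier, value):
    """Shape each element as (rowkey, [rowkey, family, qualifier, value]) so the
    key/value converters can build an ImmutableBytesWritable/Put pair."""
    return (row, [row, family, qualifier, value])

# On a real cluster, the write would look roughly like:
#   sc.parallelize(raw_rows) \
#     .map(lambda r: to_put_record(*r)) \
#     .saveAsNewAPIHadoopDataset(
#         conf=build_hbase_conf("test_table", "zk-host"),
#         keyConverter="org.apache.spark.examples.pythonconverters."
#                      "StringToImmutableBytesWritableConverter",
#         valueConverter="org.apache.spark.examples.pythonconverters."
#                        "StringListToPutConverter")
```

The valueConverter string is exactly the class patched above, which is why the unmodified jar fails at this point on HBase 2.x.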

4. Problems you may hit during the build:

  • Compiler plugin errors:
   Use a Java 1.8 JDK. Building with a Java 12 environment triggers compiler-plugin errors.
  • Checker errors:
   For scalastyle and Maven checker failures, apply the fixes found via Google.
   Option 2: comment out the entire plugin in the parent pom.

Copyright notice: this article is an original work by yanpenggong, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.