Environment Setup
Maven-3.3.3
JDK 7u79
Scala 2.10.6
Hive 2.0.1
Spark 1.5.0 source
Hadoop 2.6.4
The Hive and Spark versions must match, so check the spark.version property in the pom.xml of the downloaded Hive source tree to determine which Spark version to use.
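For example, one quick way to read it off (the source directory name is an example, adjust to where you unpacked Hive):
grep -m1 '<spark.version>' apache-hive-2.0.1-src/pom.xml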
Note that you must have a version of Spark which does not include the Hive jars, i.e. one that was not built with the Hive profile.
Note: the pre-built spark-2.x packages on the Spark download page all bundle Hive, so to use Hive on Spark you must download the Spark source and compile it yourself (a quick check for bundled Hive classes is sketched below).
Recommended pairings: hive-1.2.1 on spark-1.3.1, or hive-2.0.1 on spark-1.5.2.
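To verify that a Spark build really ships without the Hive jars, list the assembly jar and search for Hive classes; the jar path below is an example and depends on how your distribution was named:
unzip -l lib/spark-assembly-1.5.0-hadoop2.6.4.jar | grep 'org/apache/hadoop/hive' && echo 'Hive classes found, rebuild without -Phive'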
Compiling Spark
By default Spark is compiled with Scala 2.10.4:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.6 -DskipTests clean package
./make-distribution.sh --name xm-spark --tgz -Phadoop-2.6 -Pyarn
If compiling with Scala 2.11.x instead:
./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
./make-distribution.sh --name xm-spark --tgz -Phadoop-2.6 -Pyarn
The resulting tar package is generated in the top-level spark directory.
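After unpacking the distribution, Hive must be able to find the Spark runtime. A minimal sketch, assuming the Hive on Spark guide's approach of linking the assembly jar into Hive's lib directory (all paths below are examples):
export SPARK_HOME=/opt/spark-1.5.0-bin-xm-spark
ln -s $SPARK_HOME/lib/spark-assembly-*.jar $HIVE_HOME/lib/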
hive-site.xml configuration
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
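If Spark runs on YARN, Hive also needs to know which Spark master to submit to. A minimal addition, assuming yarn-cluster mode (yarn-client is the other common choice; adjust to your deployment):
<property>
  <name>spark.master</name>
  <value>yarn-cluster</value>
</property>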
spark-defaults.conf configuration
spark.executor.instances=X
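Beyond the instance count, a few companion settings are usually tuned together; the property names are standard Spark 1.x configuration keys, and the values below are placeholders to adjust per cluster:
spark.executor.memory=4g
spark.executor.cores=2
spark.driver.memory=2g
spark.serializer=org.apache.spark.serializer.KryoSerializer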
Configuration recommended by the official Hive documentation:
hive.vectorized.execution.enabled=true
hive.cbo.enable=true
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.orc.splits.include.file.footer=false
hive.merge.mapfiles=true
hive.merge.sparkfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.orcfile.stripe.level=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=894435328
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.vectorized.execution.reduce.enabled=false
hive.vectorized.groupby.checkinterval=4096
hive.vectorized.groupby.flush.percent=0.1
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.exec.orc.default.stripe.size=67108864
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.fetch.task.aggr=false
mapreduce.input.fileinputformat.list-status.num-threads=5
spark.kryo.referenceTracking=false
spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
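Once everything is in place, a quick smoke test from the Hive CLI confirms that queries really run on Spark (the table name is a placeholder):
set hive.execution.engine=spark;
select count(*) from some_table;
If the engine is wired up correctly, the console shows Spark stage progress instead of a MapReduce job.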
Troubleshooting
1. Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
a. Spark was built with the -Phive or -Phive-thriftserver profile enabled (the Hive jars must not be bundled)
b. Mismatched Hive and Spark build versions
2. Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (diagnostic log commands are sketched after this list)
a. Hive and Spark versions do not match
b. A broken Scala environment keeps the Spark client from starting (install Scala properly, then restart YARN)
3. Errors caused by general environment misconfiguration
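For problem 2 above, the root cause is usually visible in the Hive log and the YARN application logs. A sketch, assuming YARN and Hive's default log location (the path and application id are examples):
tail -n 100 /tmp/$USER/hive.log
yarn logs -applicationId application_1472000000000_0001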