Hive on Spark Configuration Summary

Environment

Maven-3.3.3

JDK 7u79

Scala 2.10.6

Hive 2.0.1

Spark 1.5.0 (built from source)

Hadoop 2.6.4


The Hive and Spark versions must match, so check the spark.version property in the pom.xml of the Hive source tree to determine which Spark version to use.
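A quick way to look it up (a sketch; the source directory name is an assumption):

grep -m1 '<spark.version>' apache-hive-2.0.1-src/pom.xml
# prints e.g. <spark.version>1.5.0</spark.version>, i.e. build against Spark 1.5.x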

Note that you must have a version of Spark which does not include the Hive jars, i.e. one that was not built with the Hive profile.

Note: the pre-built spark-2.x packages on the Spark website all bundle Hive, so to use Hive on Spark you must download the Spark source and compile it yourself.

Recommended pairings: hive-1.2.1 on spark-1.3.1, or hive-2.0.1 on spark-1.5.2.


Building Spark

By default Spark is compiled with Scala 2.10.4.


export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"


mvn -Pyarn -Phadoop-2.6 -DskipTests clean package


./make-distribution.sh --name xm-spark --tgz -Phadoop-2.6 -Pyarn


If compiling with Scala 2.11.x instead:

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package

./make-distribution.sh --name xm-spark --tgz -Phadoop-2.6 -Pyarn -Dscala-2.11


The distribution tarball is generated in the root of the Spark source directory.
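Unpacking and pointing the environment at the new build might look like this (the paths are assumptions; make-distribution.sh names the archive spark-<version>-bin-<name>.tgz):

tar -xzf spark-1.5.0-bin-xm-spark.tgz -C /opt
export SPARK_HOME=/opt/spark-1.5.0-bin-xm-spark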


hive-site.xml configuration

<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
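Hive also needs Spark's assembly jar on its classpath at runtime. Prior to Hive 2.2.0 the usual approach is to link (or copy) it into Hive's lib directory; the exact jar name and paths below are assumptions:

ln -s $SPARK_HOME/lib/spark-assembly-1.5.0-hadoop2.6.0.jar $HIVE_HOME/lib/
# alternatively, set spark.home (or the SPARK_HOME environment variable) to point at the Spark installation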


spark-defaults.conf configuration

# spark.master is optional; leave it unset to use the default
spark.master=<Spark Master URL>
spark.eventLog.enabled=true
spark.eventLog.dir=<Spark event log folder (must exist)>
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.instances=X
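The same properties can also be set per session from the Hive CLI, which is convenient for testing (the table name below is hypothetical):

hive> set hive.execution.engine=spark;
hive> set spark.executor.memory=512m;
hive> select count(*) from test_table;   -- should now run as a Spark job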

Recommended configuration from the official Hive site

hive.vectorized.execution.enabled=true

hive.cbo.enable=true
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.orc.splits.include.file.footer=false
hive.merge.mapfiles=true
hive.merge.sparkfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.orcfile.stripe.level=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=894435328
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.vectorized.execution.reduce.enabled=false
hive.vectorized.groupby.checkinterval=4096
hive.vectorized.groupby.flush.percent=0.1
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.exec.orc.default.stripe.size=67108864
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.fetch.task.aggr=false
mapreduce.input.fileinputformat.list-status.num-threads=5
spark.kryo.referenceTracking=false
spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
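The hive.* entries above go into hive-site.xml using the same <property> form shown earlier, for example:

<property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
</property>

The spark.kryo.* entries at the end can instead go into spark-defaults.conf alongside the settings in the previous section.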

Common Problems

1. Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job

  a. Spark was built with the -Phive or -Phive-thriftserver profile enabled (see the check sketched below)

  b. The Hive and Spark build versions do not match
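
To confirm cause (a), list the assembly jar's contents; an assembly built without -Phive should contain no Hive classes (the jar path is an assumption):

jar tf $SPARK_HOME/lib/spark-assembly-1.5.0-hadoop2.6.0.jar | grep 'org/apache/hadoop/hive' | head
# no output expected; any hits mean the assembly bundles Hive and Spark must be rebuilt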


2. Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

  a. The Hive and Spark versions do not match

  b. A broken Scala environment made the Spark client fail to start (install Scala properly, then restart YARN); a quick check is sketched below
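
A sanity check to run on each node before restarting YARN (the expected version assumes a default Spark 1.5.0 build):

scala -version
# should report Scala 2.10.x, or 2.11.x if Spark was built with -Dscala-2.11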


3. Errors caused by environment misconfiguration


