Hive on Spark: Getting Started
Spark Installation
Follow the instructions to install Spark: https://spark.apache.org/docs/latest/spark-standalone.html. In particular:
- Install Spark (either download a pre-built Spark distribution, or build the assembly from source).
- Install or build a compatible version. The <spark.version> property in Hive's root pom.xml defines the version of Spark that Hive was built and tested with.
- Install or build a compatible distribution. Each version of Spark has several distributions, corresponding to different versions of Hadoop.
- Once Spark is installed, find and note the location of <spark-assembly-*.jar>.
- Start the Spark cluster (master and workers).
- Note the <Spark Master URL>. It can be found in the Spark master web UI.
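The two lookups in the list above (the Spark version Hive expects, and the assembly jar location) can be scripted. The following is a minimal sketch, not part of the official instructions: HIVE_SRC, SPARK_HOME, and both function names are hypothetical, and the jar's directory inside the Spark installation may differ between builds, so the sketch searches the whole tree.

```shell
#!/bin/sh
# Sketch only: HIVE_SRC and SPARK_HOME are hypothetical locations; adjust them.
HIVE_SRC="${HIVE_SRC:-$HOME/hive}"
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# Print the value of <spark.version> from a pom.xml (prints nothing if absent).
hive_spark_version() {
  [ -f "$1" ] || return 0
  sed -n 's:.*<spark\.version>\(.*\)</spark\.version>.*:\1:p' "$1" | head -n 1
}

# Print the first spark-assembly-*.jar found under a directory (nothing if none).
find_spark_assembly() {
  [ -d "$1" ] || return 0
  find "$1" -type f -name 'spark-assembly-*.jar' | head -n 1
}

echo "Hive expects Spark version: $(hive_spark_version "$HIVE_SRC/pom.xml")"
echo "Spark assembly jar: $(find_spark_assembly "$SPARK_HOME")"
```

If either line prints an empty value, the corresponding path is probably wrong for your setup; the Spark master URL still has to be read from the master web UI as described above.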
Configuring Hive
- Hive on Spark is still in development, so currently only a Hive assembly built from the Hive "spark" development branch supports Spark execution. The development branch is located here: https://github.com/apache/hive/tree/spark. Check out the branch and build the Hive assembly as described in https://cwiki.apache.org/confluence/display/Hive/HiveDeveloperFAQ.
Start Hive with <spark-assembly-*.jar> on the Hive auxpath:
hive --auxpath /location/to/spark-assembly-*.jar
Set the Hive execution engine to Spark:
hive> set hive.execution.engine=spark;
Configure the Spark application settings for Hive. See: http://spark.apache.org/docs/latest/configuration.html. This can be done either by adding a "spark-defaults.conf" file containing these properties to the Hive classpath, or by setting them in the Hive configuration:
hive> set spark.master=<Spark Master URL>;
hive> set spark.eventLog.enabled=true;
hive> set spark.executor.memory=512m;
hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
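For the "spark-defaults.conf" alternative, the same four properties can be written to a file, one whitespace-separated key and value per line. The sketch below assumes the target directory is on the Hive classpath (the Hive conf directory usually is, but verify this for your installation). The function name write_spark_defaults, the ./hive-conf fallback directory, and the spark://localhost:7077 URL are hypothetical placeholders.

```shell
#!/bin/sh
# Write a spark-defaults.conf with the same properties as the "set" commands.
# Arguments: target directory, Spark master URL.
write_spark_defaults() {
  conf_dir="$1"
  master_url="$2"
  mkdir -p "$conf_dir" || return 1
  cat > "$conf_dir/spark-defaults.conf" <<EOF
spark.master            $master_url
spark.eventLog.enabled  true
spark.executor.memory   512m
spark.serializer        org.apache.spark.serializer.KryoSerializer
EOF
}

# Placeholder values: replace with your conf directory and <Spark Master URL>.
write_spark_defaults "${HIVE_CONF_DIR:-./hive-conf}" "${SPARK_MASTER_URL:-spark://localhost:7077}"
```

Properties set with "set" inside a Hive session apply only to that session, whereas a file on the classpath is picked up each time Hive starts.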
Common Issues
Issue | Cause | Resolution |
---|---|---|
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode | Guava library version conflict between Spark and Hadoop. See HIVE-7387 and SPARK-2420 for details. | Alternatives until this is fixed: |
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0:0 had a not serializable result: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable | The Spark serializer is not set to Kryo. | Set spark.serializer to org.apache.spark.serializer.KryoSerializer, as described above. |
java.lang.NullPointerException at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:257) at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:224) | Hive is included in the Spark assembly. | Either build a version of Spark without the "hive" profile, or unjar the Spark assembly, run rm -rf org/apache/hive org/apache/hadoop/hive, and then rejar it. The fix is in SPARK-2741. |