...
- As Hive on Spark is still in development, currently only a Hive assembly built from the Hive/Spark development branch supports Spark execution. The development branch is located here: https://github.com/apache/hive/tree/spark. Check out the branch and build the Hive assembly as described in https://cwiki.apache.org/confluence/display/Hive/HiveDeveloperFAQ (see the sketch after this list).
- If you download Spark, make sure you use a 1.2.x assembly: http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.1.2.jar
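A minimal sketch of checking out and building the spark branch, assuming the Maven-based build described in the HiveDeveloperFAQ; the exact profiles and flags are an assumption here and may differ for your environment:

    # clone the Hive repository and switch to the Spark development branch
    git clone https://github.com/apache/hive.git
    cd hive
    git checkout spark
    # build the distribution, skipping tests (profiles are illustrative; see the HiveDeveloperFAQ)
    mvn clean package -DskipTests -Phadoop-2,dist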
There are several ways to add the Spark dependency to Hive:
- Set the property 'spark.home' to point to the Spark installation:

    hive> set spark.home=/location/to/sparkHome;

- Define the SPARK_HOME environment variable before starting the Hive CLI/HiveServer2:

    export SPARK_HOME=/usr/lib/spark....

- Set the spark-assembly jar on the Hive auxpath:

    hive --auxpath /location/to/spark-assembly-*.jar

- Add the spark-assembly jar for the current user session:

    hive> add jar /location/to/spark-assembly-*.jar;
- Link the spark-assembly jar to HIVE_HOME/lib.
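For the last option, one hedged way to create the link; the paths are placeholders matching the examples above, and the glob assumes a single spark-assembly jar in that directory:

    # symlink the Spark assembly into Hive's lib directory so it lands on the Hive classpath
    ln -s /location/to/spark-assembly-*.jar $HIVE_HOME/lib/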
Configure the Hive execution engine to use Spark:

    hive> set hive.execution.engine=spark;
Configure Spark application properties for Hive. See: http://spark.apache.org/docs/latest/configuration.html. This can be done either by adding a file "spark-defaults.conf" with these properties to the Hive classpath, or by setting them in the Hive configuration:

    hive> set spark.master=<Spark Master URL>;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.executor.memory=512m;
    hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
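As a sketch, the same properties expressed as a spark-defaults.conf file; the exact directory is an assumption, any location on the Hive classpath (for example the Hive conf directory) works:

    # spark-defaults.conf -- picked up from the Hive classpath
    spark.master                <Spark Master URL>
    spark.eventLog.enabled      true
    spark.executor.memory       512m
    spark.serializer            org.apache.spark.serializer.KryoSerializer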
...
Issue | Cause | Resolution
---|---|---
Error: Could not find or load main class org.apache.spark.deploy.SparkSubmit | Spark dependency not correctly set. | Add the Spark dependency to Hive, see Step 3 above.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0:0 had a not serializable result: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable | Spark serializer not set to Kryo. | Set spark.serializer to org.apache.spark.serializer.KryoSerializer, see Step 5 above.
[ERROR] Terminal initialization failed; falling back to unsupported | Hive has upgraded to Jline2 but jline 0.94 exists in the Hadoop lib. |
java.lang.SecurityException: class "javax.servlet.DispatcherType"'s | Two versions of the servlet-api are in the classpath. |
Spark executor gets killed all the time and Spark keeps retrying the failed stage; you may find similar information in the YARN nodemanager log: WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=217989,containerID=container_1421717252700_0716_01_50767235] is running beyond physical memory limits. Current usage: 43.1 GB of 43 GB physical memory used; 43.9 GB of 90.3 GB virtual memory used. Killing container. | For Spark on YARN, the nodemanager kills the Spark executor if it uses more memory than the configured size of "spark.executor.memory" plus "spark.yarn.executor.memoryOverhead". | Increase "spark.yarn.executor.memoryOverhead" to make sure it covers the executor's off-heap memory usage (see the example after this table).
Run a query and get an error like: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. The Hive logs show: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy | Happens on Mac (not officially supported). This is a general Snappy issue with Mac and is not unique to Hive on Spark, but the workaround is noted here because it is needed for startup of the Spark client. | Run this command before starting Hive or HiveServer2: export HADOOP_OPTS="-Dorg.xerial.snappy.tempdir=/tmp -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib $HADOOP_OPTS"
Stack trace: ExitCodeException exitCode=1: .../launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR.../usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*: bad substitution | The key mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml contains a variable which is invalid in bash. | Remove the entry containing the invalid variable from mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml.
Exception in thread "Driver" scala.MatchError: java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/TaskAttemptContext (of class java.lang.NoClassDefFoundError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:432) | MR is not on the YARN classpath. | If on HDP, change /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework to /hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework
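A hedged illustration of the memory-overhead tuning mentioned above; the values are placeholders, not recommendations, and spark.yarn.executor.memoryOverhead is specified in megabytes:

    hive> set spark.executor.memory=4g;
    hive> set spark.yarn.executor.memoryOverhead=1024;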
...