...
- As Hive on Spark is still in development, currently only a Hive assembly built from the Hive/Spark development branch supports Spark execution. The development branch is located here: https://github.com/apache/hive/tree/spark. Check out the branch and build the Hive assembly as described in https://cwiki.apache.org/confluence/display/Hive/HiveDeveloperFAQ (see the sketch after this list).
- If you download Spark, make sure you use a 1.2.x assembly: http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.1.2.jar
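A minimal sketch of checking out and building the spark branch, assuming the Maven-based build described in the HiveDeveloperFAQ; the exact profiles and flags are an assumption here and may differ for your environment:

    # clone the Hive repository and switch to the Spark development branch
    git clone https://github.com/apache/hive.git
    cd hive
    git checkout spark
    # build the distribution, skipping tests (profiles are illustrative; see the HiveDeveloperFAQ)
    mvn clean package -DskipTests -Phadoop-2,dist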
There are several ways to add the Spark dependency to Hive:
- Set the property 'spark.home' to point to the Spark installation:

    hive> set spark.home=/location/to/sparkHome;

- Define the SPARK_HOME environment variable before starting the Hive CLI/HiveServer2:

    export SPARK_HOME=/usr/lib/spark....

- Set the spark-assembly jar on the Hive auxpath:

    hive --auxpath /location/to/spark-assembly-*.jar

- Add the spark-assembly jar for the current user session:

    hive> add jar /location/to/spark-assembly-*.jar;
- Link the spark-assembly jar to HIVE_HOME/lib.
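For the last option, one hedged way to create the link; the paths are placeholders matching the examples above, and the glob assumes a single spark-assembly jar in that directory:

    # symlink the Spark assembly into Hive's lib directory so it lands on the Hive classpath
    ln -s /location/to/spark-assembly-*.jar $HIVE_HOME/lib/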
Configure the Hive execution engine to use Spark:

    hive> set hive.execution.engine=spark;
Configure Spark application properties for Hive. See: http://spark.apache.org/docs/latest/configuration.html. This can be done either by adding a file "spark-defaults.conf" with these properties to the Hive classpath, or by setting them in the Hive configuration:

    hive> set spark.master=<Spark Master URL>;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.executor.memory=512m;
    hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
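As a sketch, the same properties expressed as a spark-defaults.conf file; the exact directory is an assumption, any location on the Hive classpath (for example the Hive conf directory) works:

    # spark-defaults.conf -- picked up from the Hive classpath
    spark.master                <Spark Master URL>
    spark.eventLog.enabled      true
    spark.executor.memory       512m
    spark.serializer            org.apache.spark.serializer.KryoSerializer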
...
Issue | Cause | Resolution
---|---|---
Error: Could not find or load main class org.apache.spark.deploy.SparkSubmit | Spark dependency not correctly set. | Add the Spark dependency to Hive, see Step 3 above.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0:0 had a not serializable result: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable | Spark serializer not set to Kryo. | Set spark.serializer to org.apache.spark.serializer.KryoSerializer, see Step 5 above.
[ERROR] Terminal initialization failed; falling back to unsupported | Hive has upgraded to Jline2 but jline 0.94 exists in the Hadoop lib. |
java.lang.SecurityException: class "javax.servlet.DispatcherType"'s | Two versions of the servlet-api are in the classpath. |
Spark executor gets killed all the time and Spark keeps retrying the failed stage; you may find similar information in the YARN nodemanager log: WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=217989,containerID=container_1421717252700_0716_01_50767235] is running beyond physical memory limits. Current usage: 43.1 GB of 43 GB physical memory used; 43.9 GB of 90.3 GB virtual memory used. Killing container. | For Spark on YARN, the nodemanager kills the Spark executor if it uses more memory than the configured size of "spark.executor.memory" plus "spark.yarn.executor.memoryOverhead". | Increase "spark.yarn.executor.memoryOverhead" to make sure it covers the executor's off-heap memory usage (see the example after this table).
Run a query and get an error like: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. The Hive logs show: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy | Happens on Mac (not officially supported). This is a general Snappy issue with Mac and is not unique to Hive on Spark, but the workaround is noted here because it is needed for startup of the Spark client. | Run this command before starting Hive or HiveServer2: export HADOOP_OPTS="-Dorg.xerial.snappy.tempdir=/tmp -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib $HADOOP_OPTS"
Stack trace: ExitCodeException exitCode=1: .../launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR.../usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*: bad substitution | The key mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml contains a variable which is invalid in bash. | Remove the entry containing the invalid variable from mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml.
Exception in thread "Driver" scala.MatchError: java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/TaskAttemptContext (of class java.lang.NoClassDefFoundError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:432) | MR is not on the YARN classpath. | If on HDP, change /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework to /hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework
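A hedged illustration of the memory-overhead tuning mentioned above; the values are placeholders, not recommendations, and spark.yarn.executor.memoryOverhead is specified in megabytes:

    hive> set spark.executor.memory=4g;
    hive> set spark.yarn.executor.memoryOverhead=1024;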
...