...

  1. To add the Spark dependency to Hive:

    • Prior to Hive 2.2.0, link the spark-assembly jar to HIVE_HOME/lib.
    • Since Hive 2.2.0, Hive on Spark runs with Spark 2.0.0 and above, which doesn't have an assembly jar.
      • To run with YARN mode (either yarn-client or yarn-cluster), link the following jars to HIVE_HOME/lib.
        • scala-library
        • spark-core
        • spark-network-common
      • To run with LOCAL mode (for debugging only), link the following jars in addition to those above to HIVE_HOME/lib.
        • chill-java
        • chill
        • jackson-module-paranamer
        • jackson-module-scala
        • jersey-container-servlet-core
        • jersey-server
        • json4s-ast
        • kryo-shaded
        • minlog
        • scala-xml
        • spark-launcher
        • spark-network-shuffle
        • spark-unsafe
        • xbean-asm5-shaded
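The YARN-mode linking step above can be sketched as a small shell function. This is only an illustration: the function name `link_spark_jars` is hypothetical, and it assumes the jars sit in `$SPARK_HOME/jars` with names of the form `<artifact>-<version>.jar` or `<artifact>_<scala-version>-<version>.jar`. Check the actual file names in your Spark distribution before relying on the globs.

```shell
#!/bin/sh
# Sketch: symlink the three jars needed for YARN mode into HIVE_HOME/lib.
# Assumption: both arguments are absolute paths, so the symlinks resolve.
link_spark_jars() {
    spark_home=$1
    hive_home=$2
    for name in scala-library spark-core spark-network-common; do
        # Match "<name>-<version>.jar" and "<name>_<scala>-<version>.jar".
        for jar in "$spark_home"/jars/"$name"[-_]*.jar; do
            # Skip the unexpanded pattern when nothing matches.
            [ -e "$jar" ] || continue
            ln -sf "$jar" "$hive_home/lib/"
        done
    done
}

# Example call (paths are placeholders):
#   link_spark_jars "$SPARK_HOME" "$HIVE_HOME"
```

For LOCAL mode, the same loop could be extended with the additional jar names listed above.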
  2. Configure the Hive execution engine to use Spark:

    Code Block
    set hive.execution.engine=spark;

    See the Spark section of Hive Configuration Properties for other properties for configuring Hive and the Remote Spark Driver.

     

  3. Configure the Spark application settings for Hive. See http://spark.apache.org/docs/latest/configuration.html for the available properties. You can either add a file named "spark-defaults.conf" containing these properties to the Hive classpath, or set them in the Hive configuration (hive-site.xml). For instance:

    Code Block
    set spark.master=<Spark Master URL>;
    set spark.eventLog.enabled=true;
    set spark.eventLog.dir=<Spark event log folder (must exist)>;
    set spark.executor.memory=512m;
    set spark.serializer=org.apache.spark.serializer.KryoSerializer;

    Configuration property details

    • spark.executor.memory: Amount of memory to use per executor process.
    • spark.executor.cores: Number of cores per executor.
    • spark.yarn.executor.memoryOverhead: The amount of off-heap memory (in megabytes) to be allocated per executor when running Spark on YARN. This memory accounts for things like VM overheads, interned strings, and other native overheads. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and that is what this overhead is for.

    • spark.executor.instances: The number of executors assigned to each application.
    • spark.driver.memory: The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
    • spark.yarn.driver.memoryOverhead: We recommend 400 (MB).
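As an alternative to `set` commands, the same properties can go into a "spark-defaults.conf" file on the Hive classpath. The following is a minimal sketch of generating such a file. The function name `write_spark_defaults` and its arguments are hypothetical, and the target directory is assumed to be one that Hive already has on its classpath. The values mirror the example and recommendations above.

```shell
#!/bin/sh
# Sketch: write a spark-defaults.conf with the properties discussed above.
#   $1 = target directory (assumed to be on the Hive classpath)
#   $2 = Spark master URL
#   $3 = Spark event log folder (must already exist)
write_spark_defaults() {
    conf_dir=$1
    spark_master=$2
    event_log_dir=$3
    cat > "$conf_dir/spark-defaults.conf" <<EOF
spark.master                      $spark_master
spark.eventLog.enabled            true
spark.eventLog.dir                $event_log_dir
spark.executor.memory             512m
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.driver.memory               4g
spark.yarn.driver.memoryOverhead  400
EOF
}

# Example call (all three values are placeholders):
#   write_spark_defaults "$HIVE_HOME/conf" "<Spark Master URL>" "<event log folder>"
```

Each line of the file is a property name followed by whitespace and its value, which is the format Spark reads for its defaults file.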
  4. Allow YARN to cache the necessary Spark dependency jars on nodes so that they do not need to be distributed each time an application runs.

    • Prior to Hive 2.2.0, upload the spark-assembly jar to an HDFS file (for example: hdfs://xxxx:8020/spark-assembly.jar) and add the following to hive-site.xml:

      Code Block
      <property>
        <name>spark.yarn.jar</name>
        <value>hdfs://xxxx:8020/spark-assembly.jar</value>
      </property>
    • Since Hive 2.2.0, upload all jars in $SPARK_HOME/jars to an HDFS folder (for example: hdfs://xxxx:8020/spark-jars) and add the following to hive-site.xml:

      Code Block
      <property>
        <name>spark.yarn.jars</name>
        <value>hdfs://xxxx:8020/spark-jars/*</value>
      </property>
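
The upload for Hive 2.2.0 and later might look like the sketch below. The function name `upload_spark_jars` is hypothetical, and the HDFS URL is the same placeholder used in the example above. It assumes the `hdfs` command is on the PATH and that the user has write access to the target folder.

```shell
#!/bin/sh
# Sketch: copy every jar under $SPARK_HOME/jars into an HDFS folder.
#   $1 = local Spark home
#   $2 = target HDFS folder (placeholder URL in the example call)
upload_spark_jars() {
    spark_home=$1
    hdfs_dir=$2
    # Create the target folder, including parents, if it is missing.
    hdfs dfs -mkdir -p "$hdfs_dir" || return 1
    # -f overwrites jars left over from an earlier upload.
    hdfs dfs -put -f "$spark_home"/jars/*.jar "$hdfs_dir/"
}

# Example call:
#   upload_spark_jars "$SPARK_HOME" hdfs://xxxx:8020/spark-jars
```

The folder passed here should match the path given in spark.yarn.jars, which additionally ends in `/*` so that every uploaded jar is picked up.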

Configuring Spark

Setting executor memory size is more complicated than simply setting it to be as large as possible. There are several things that need to be taken into consideration:

...