...
Configure Spark-application configs for Hive. See: http://spark.apache.org/docs/latest/configuration.html. This can be done either by adding a file "spark-defaults.conf" with these properties to the Hive classpath, or by setting them in the Hive configuration:
Code Block
hive> set spark.master=<Spark Master URL>;
hive> set spark.eventLog.enabled=true;
hive> set spark.executor.memory=512m;
hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
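Equivalently, the "spark-defaults.conf" file mentioned above could contain the same properties (values here are illustrative, mirroring the session settings; replace the master URL placeholder with your own):

```
spark.master              <Spark Master URL>
spark.eventLog.enabled    true
spark.executor.memory     512m
spark.serializer          org.apache.spark.serializer.KryoSerializer
```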
A little explanation for some of the configuration properties:
spark.executor.memory
: Amount of memory to use per executor process.
spark.executor.cores
: Number of cores per executor.
spark.yarn.executor.memoryOverhead
: The amount of off-heap memory (in megabytes) to be allocated per executor, when running Spark on YARN. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this overhead is for.
spark.executor.instances
: The number of executors assigned to each application.
- More executor memory enables the mapjoin optimization for more queries.
- More executor memory, on the other hand, becomes unwieldy from a GC perspective.
- Some experiments show that the HDFS client doesn't handle concurrent writers well, so it may face race conditions if an executor has too many cores.
For {{spark.executor.memory}}, we recommend calculating {{yarn.nodemanager.resource.memory-mb}} * ({{spark.executor.cores}} / {{yarn.nodemanager.resource.cpu-vcores}}), then splitting that amount between {{spark.executor.memory}} and {{spark.yarn.executor.memoryOverhead}}. According to our experiments, we recommend setting {{spark.yarn.executor.memoryOverhead}} to be around 15-20% of the total memory.

After you've decided how much memory each executor receives, you need to decide how many executors will be allocated to queries. In the GA release Spark dynamic executor allocation will be supported. However, for this beta only static resource allocation can be used. Based on the physical memory in each node and the configuration of {{spark.executor.memory}} and {{spark.yarn.executor.memoryOverhead}}, you will need to choose the number of instances and set {{spark.executor.instances}}.
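The sizing arithmetic above can be sketched as a small calculation. This is only an illustration of the recommended formula; the helper name and the 20% overhead default are ours, not part of Hive or Spark:

```python
def executor_memory_split(nm_memory_mb, executor_cores, nm_vcores,
                          overhead_fraction=0.2):
    """Split a node's YARN memory into per-executor memory and overhead.

    Returns (spark.executor.memory, spark.yarn.executor.memoryOverhead)
    in megabytes, using the 15-20% overhead guidance from the text.
    """
    # Memory available to one executor's container, proportional to
    # the share of vcores it uses.
    container_mb = nm_memory_mb * executor_cores / nm_vcores
    overhead_mb = int(container_mb * overhead_fraction)
    executor_mb = int(container_mb - overhead_mb)
    return executor_mb, overhead_mb

# 50GB of yarn.nodemanager.resource.memory-mb, 6 of 12 vcores per executor:
mem, overhead = executor_memory_split(51200, 6, 12)
print(mem, overhead)  # 20480 5120
```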
Now a real world example. Assume 10 nodes with 64GB of memory per node and 12 virtual cores, i.e., {{yarn.nodemanager.resource.cpu-vcores}}=12. One node will be used as the master, so the cluster will have 9 slave nodes. We'll configure {{spark.executor.cores}} to 6. Given 64GB of RAM, {{yarn.nodemanager.resource.memory-mb}} will be 50GB. We'll determine the amount of memory for each executor as follows: 50GB * (6/12) = 25GB. We'll assign 20% to {{spark.yarn.executor.memoryOverhead}}, or 5120, and 80% to {{spark.executor.memory}}, or 20g.

On this 9-node cluster we'll have two executors per host. As such, you will configure {{spark.executor.instances}} somewhere between 2 and 18. A value of 18 would utilize the entire cluster.
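Putting the worked example together, the resulting session settings would look like the following (values taken from the calculation above; a full-cluster instance count is shown):

```
hive> set spark.executor.cores=6;
hive> set spark.executor.memory=20g;
hive> set spark.yarn.executor.memoryOverhead=5120;
hive> set spark.executor.instances=18;
```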
Common Issues (Green are resolved, will be removed from this list)
...