Apache Kylin : Analytical Data Warehouse for Big Data


The query engine of Kylin 4 is called Sparder, which is a long-running Spark application.

Each property below is listed with its default value, a description, and, where relevant, the Kylin version it applies to.

kylin.query.auto-sparder-context-enabled
- Default: false
- Description: Whether to automatically start Sparder (the query engine of Kylin 4.0) when Kylin starts. When this value is false, Sparder starts when the first query is executed.
- Version: 4.0+
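For example, to pay Sparder's startup cost at Kylin startup rather than on the first query, the flag can be set in $KYLIN_HOME/conf/kylin.properties. A minimal sketch:

    # Start Sparder together with Kylin instead of lazily on the first query.
    kylin.query.auto-sparder-context-enabled=true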
kylin.query.spark-conf.spark.master
- Default: yarn
- Description: The cluster manager to connect to. Kylin supports setting it to yarn or standalone.

kylin.query.spark-conf.spark.submit.deployMode
- Default: client
- Description: Only client mode is supported here.

kylin.query.spark-conf.spark.driver.cores
- Default: 1
- Description: Number of cores to use for the driver process, only in cluster mode.

kylin.query.spark-conf.spark.driver.memory
- Default: 4G
- Description: Amount of memory to use for the driver process.

kylin.query.spark-conf.spark.driver.memoryOverhead
- Default: 1G
- Description: Amount of non-heap memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified.

kylin.query.spark-conf.spark.executor.cores
- Default: 1
- Description: The number of cores to use on each executor.

kylin.query.spark-conf.spark.executor.instances
- Default: 1
- Description: The number of executors.

kylin.query.spark-conf.spark.executor.memory
- Default: 4G
- Description: Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g).

kylin.query.spark-conf.spark.executor.memoryOverhead
- Default: 1G
- Description: Amount of additional memory to be allocated per executor process, in MiB unless otherwise specified.

kylin.query.spark-conf.spark.serializer
- Default: org.apache.spark.serializer.JavaSerializer
- Description: Class to use for serializing objects that will be sent over the network or need to be cached in serialized form.

kylin.query.spark-conf.spark.sql.shuffle.partitions
- Default: 40
- Description: The default number of partitions to use when shuffling data for joins or aggregations.
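As an illustration, the Spark resources behind Sparder can be scaled up by overriding these kylin.query.spark-conf.* defaults in kylin.properties. The values below are illustrative assumptions, not recommendations:

    # Illustrative sizing for a busier cluster (tune to your hardware).
    kylin.query.spark-conf.spark.executor.instances=4
    kylin.query.spark-conf.spark.executor.cores=4
    kylin.query.spark-conf.spark.executor.memory=8G
    kylin.query.spark-conf.spark.executor.memoryOverhead=2G
    # More partitions can help large shuffles; 40 is the Kylin default.
    kylin.query.spark-conf.spark.sql.shuffle.partitions=80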

kylin.query.spark-conf.spark.executor.extraJavaOptions
- Default:

    -Dhdp.version=current
    -Dlog4j.configuration=spark-executor-log4j.properties
    -Dlog4j.debug
    -Dkylin.hdfs.working.dir=${kylin.env.hdfs-working-dir}
    -Dkylin.metadata.identifier=${kylin.metadata.url.identifier}
    -Dkylin.spark.category=sparder
    -Dkylin.spark.identifier={{APP_ID}}

- Description: A string of extra JVM options to pass to executors. Generally this parameter does not need to be changed; Kylin fills in these variables when it submits the Spark application.


kylin.query.need-replace-exactly-agg
- Default: true
- Description: When a query exactly hits a cuboid, whether to skip the aggregation step and directly return the qualifying results saved in the cuboid's Parquet file.

kylin.query.bitmap-upper-bound
- Default: 10000000
- Description: The maximum number of returned values for the intersect_value function.

kylin.query.spark.pool
- Default: Automatically selected according to the query task
- Description: The fair scheduler of Apache Spark supports grouping jobs into pools and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a "high-priority" pool for more important query jobs. The query engine of Kylin 4 supports setting the pool for a query at the project level and at the thread level, and it has built-in pools defined in $KYLIN_HOME/conf/fairscheduler.xml (see the sketch after this list):
  - lightweight_tasks: queries that do not require all available CPU cores
  - heavy_tasks: queries that require all available CPU cores
  - query_pushdown: queries that are not answered by a cube
  Please check the following links for details:
  - http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
  - https://cwiki.apache.org/confluence/display/KYLIN/Use+different+spark+pool+for+different+query
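As a sketch of the project-level setting described above, the pool name (which must match a pool defined in $KYLIN_HOME/conf/fairscheduler.xml) can be supplied as a project configuration override; the choice of heavy_tasks here is only an example:

    # Route this project's queries to the built-in heavy_tasks pool.
    kylin.query.spark.pool=heavy_tasks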


kylin.query.spark-engine.expose-sharding-trait
- Default: true

kylin.query.spark-engine.max-sharding-size-mb
- Default: 64

kylin.query.spark-engine.partition-split-size-mb
- Default: 64

kylin.query.spark-engine.spark-sql-shuffle-partitions
- Default: -1

kylin.query.sparder-context.app-name
- Default: sparder_on_${hostName}-${port}
- Description: SparderContext application name.

kylin.query.pushdown.runner-class-name
- Default: null
- Description: When a query cannot be answered by the cube, Kylin supports routing the query to a pushdown engine. When this configuration is null, query pushdown is not enabled. To enable query pushdown, set it to "org.apache.kylin.query.pushdown.PushDownRunnerSparkImpl".
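For example, enabling pushdown with the runner class named above in kylin.properties:

    # Route queries that no cube can answer to the Spark SQL pushdown engine.
    kylin.query.pushdown.runner-class-name=org.apache.kylin.query.pushdown.PushDownRunnerSparkImpl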


kylin.query.pushdown.update-enabled
- Default: false
- Description: Whether to allow update operations, such as CREATE TABLE, in the pushdown engine.
- Version: 4.0+

Spark Context Canary Configuration

Sparder Canary is a component that monitors the running status of Sparder. It periodically checks whether the current Sparder instance is running normally; if the status is abnormal, for example Sparder has exited unexpectedly or become unresponsive, Sparder Canary creates a new Sparder instance.

kylin.canary.sparder-context-canary-enabled
- Default: true
- Description: Whether to enable Sparder Canary.

kylin.canary.sparder-context-threshold-to-restart-spark
- Default: 3
- Description: When the number of failed checks exceeds this threshold, the Spark context is restarted.

kylin.canary.sparder-context-period-min
- Default: 3
- Description: Check interval, in minutes.

kylin.canary.sparder-context-error-response-ms
- Default: 3000
- Description: Timeout for a single check, in milliseconds. A check that times out is treated as the Spark context not responding.
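Putting the canary settings together, a minimal sketch in kylin.properties; the tightened values are illustrative, not defaults:

    # Check Sparder every minute; treat a check that gets no response
    # within 5 seconds as failed, and restart the Spark context once
    # the failure count exceeds the threshold.
    kylin.canary.sparder-context-canary-enabled=true
    kylin.canary.sparder-context-period-min=1
    kylin.canary.sparder-context-error-response-ms=5000
    kylin.canary.sparder-context-threshold-to-restart-spark=3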