Apache Kylin : Analytical Data Warehouse for Big Data

The query engine of Kylin 4 is called Sparder, which is a long-running Spark application.

Basic Query Configuration

| Property | Default | Description | Version |
| --- | --- | --- | --- |
| kylin.query.auto-sparder-context-enabled | false | Whether to start Sparder (the query engine of Kylin 4.0) automatically when Kylin starts. When this value is false, Sparder starts lazily when the first query is executed. | 4.0+ |
| kylin.query.spark-conf.spark.master | yarn | The cluster manager to connect to. Kylin supports setting it to yarn or standalone. | |
| kylin.query.spark-conf.spark.submit.deployMode | client | Only client mode is supported here. | |
| kylin.query.spark-conf.spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. | |
| kylin.query.spark-conf.spark.driver.memory | 4G | Amount of memory to use for the driver process. | |
| kylin.query.spark-conf.spark.driver.memoryOverhead | 1G | Amount of non-heap memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified. | |
| kylin.query.spark-conf.spark.executor.cores | 1 | The number of cores to use on each executor. | |
| kylin.query.spark-conf.spark.executor.instances | 1 | The number of executors. | |
| kylin.query.spark-conf.spark.executor.memory | 4G | Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g). | |
| kylin.query.spark-conf.spark.executor.memoryOverhead | 1G | Amount of additional memory to be allocated per executor process, in MiB unless otherwise specified. | |
| kylin.query.spark-conf.spark.serializer | org.apache.spark.serializer.JavaSerializer | Class to use for serializing objects that will be sent over the network or need to be cached in serialized form. | |
| kylin.query.spark-conf.spark.sql.shuffle.partitions | 40 | The default number of partitions to use when shuffling data for joins or aggregations. | |
| kylin.query.spark-conf.spark.executor.extraJavaOptions | `-Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=${kylin.env.hdfs-working-dir} -Dkylin.metadata.identifier=${kylin.metadata.url.identifier} -Dkylin.spark.category=sparder -Dkylin.spark.identifier={{APP_ID}}` | A string of extra JVM options to pass to executors. Generally, this parameter does not need to be changed; Kylin provides the variables when it submits the Spark application. | |
| kylin.query.need-replace-exactly-agg | true | When a query exactly hits a cuboid, whether to skip the aggregation step and directly return the qualified results saved in the cuboid Parquet file. | |
| kylin.query.bitmap-upper-bound | 10000000 | The maximum number of values returned by the intersect_value function. | |
| kylin.query.spark.pool | Automatically selected according to the query task | The fair scheduler of Apache Spark supports grouping jobs into pools and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important query jobs. The Kylin 4 query engine supports setting the pool for a query at project level and at thread level, and it has built-in pools defined in $KYLIN_HOME/conf/fairscheduler.xml: lightweight_tasks (queries that do not require all available CPU cores), heavy_tasks (queries that require all available CPU cores), and query_pushdown (queries that are not answered by a cube). For details, see http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application and https://cwiki.apache.org/confluence/display/KYLIN/Use+different+spark+pool+for+different+query | |
| kylin.query.spark-engine.expose-sharding-trait | true | | |
| kylin.query.spark-engine.max-sharding-size-mb | 64 | The maximum size in MB handled per task when using a shard-by column. If the sharding size exceeds this value, the query falls back to a non-sharding read RDD. | |
| kylin.query.spark-engine.partition-split-size-mb | 64 | Used to set `spark.sql.shuffle.partitions` automatically for each query. See the config item kylin.query.spark-engine.spark-sql-shuffle-partitions. | |
| kylin.query.spark-engine.spark-sql-shuffle-partitions | -1 | SparderContext tries to set `spark.sql.shuffle.partitions` for each query according to the number of bytes to scan. (1) Set it to -1 to let the query engine decide automatically; specifically, the value is ${total bytes of all files after pruned by FilePruner} / KylinConfigBase#getQueryPartitionSplitSizeMB. (2) Set it to any positive integer to use a fixed value. | |
| kylin.query.sparder-context.app-name | sparder_on_${hostName}-${port} | SparderContext application name. | |
| kylin.query.pushdown.runner-class-name | null | When a query cannot be answered by a cube, Kylin supports routing it to the pushdown engine. When this configuration is null, query pushdown is not enabled. To enable query pushdown, set it to "org.apache.kylin.query.pushdown.PushDownRunnerSparkImpl". | |
| kylin.query.pushdown.update-enabled | false | Whether to allow update operations in the pushdown engine, such as create table. | |
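
The automatic choice of `spark.sql.shuffle.partitions` described for kylin.query.spark-engine.spark-sql-shuffle-partitions can be illustrated with a small sketch. The function name and arguments below are hypothetical, and rounding up with a minimum of one partition is an assumption; the table only states that the pruned scan size is divided by the split size.

```python
MIB = 1024 * 1024


def shuffle_partitions(total_scan_bytes: int,
                       configured_partitions: int = -1,
                       partition_split_size_mb: int = 64) -> int:
    """Pick a value for spark.sql.shuffle.partitions for one query (sketch)."""
    if configured_partitions > 0:
        # A positive setting is used as a fixed value.
        return configured_partitions
    if configured_partitions != -1:
        raise ValueError("expected -1 or a positive integer")
    # Automatic mode: bytes left after file pruning divided by the split size.
    # Ceiling division and the floor of 1 are assumptions of this sketch.
    split_bytes = partition_split_size_mb * MIB
    partitions = (total_scan_bytes + split_bytes - 1) // split_bytes
    return max(1, partitions)
```

Under these assumptions, a query that scans 640 MiB with the default split size of 64 MB would get 10 shuffle partitions, while a fixed setting such as 40 is returned unchanged regardless of the scan size.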

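As a rough sizing aid, the executor defaults above can be added up. This sketch assumes that each executor requests its memory plus its memoryOverhead, as Spark documents for YARN containers, that off-heap and PySpark memory are unset, and that any rounding by the cluster manager is ignored. The helper names are hypothetical.

```python
# Size-unit suffixes named in the description of spark.executor.memory.
UNIT_BYTES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_size(value: str) -> int:
    """Convert a size string such as "512m" or "4G" to bytes."""
    text = value.strip().lower()
    if text and text[-1] in UNIT_BYTES:
        return int(text[:-1]) * UNIT_BYTES[text[-1]]
    # Assumption: a bare number means MiB, as stated for memoryOverhead.
    return int(text) * UNIT_BYTES["m"]


def executor_memory_request(conf: dict) -> int:
    """Total bytes requested for all executors: instances * (memory + overhead)."""
    prefix = "kylin.query.spark-conf.spark.executor."
    per_executor = (parse_size(conf[prefix + "memory"])
                    + parse_size(conf[prefix + "memoryOverhead"]))
    return int(conf[prefix + "instances"]) * per_executor


# Default values copied from the table above.
DEFAULTS = {
    "kylin.query.spark-conf.spark.executor.instances": "1",
    "kylin.query.spark-conf.spark.executor.memory": "4G",
    "kylin.query.spark-conf.spark.executor.memoryOverhead": "1G",
}
```

With the defaults, `executor_memory_request(DEFAULTS)` comes to 5 GiB for the single executor, which is a useful lower bound when checking whether the queue or cluster can host Sparder.
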
Spark Context Canary Configuration

Sparder Canary is a component that monitors the running status of Sparder. It periodically checks whether the current Sparder is running normally. If the status is abnormal, for example when Sparder exits unexpectedly or becomes unresponsive, Sparder Canary creates a new Sparder instance.

| Property | Default | Description |
| --- | --- | --- |
| kylin.canary.sparder-context-canary-enabled | true | Whether to enable the Sparder canary. |
| kylin.canary.sparder-context-threshold-to-restart-spark | 3 | When the number of abnormal detections exceeds this threshold, the Spark context is restarted. |
| kylin.canary.sparder-context-period-min | 3 | Check interval, in minutes. |
| kylin.canary.sparder-context-error-response-ms | 3000 | Timeout of a single detection, in milliseconds. If a single detection times out, the Spark context is considered unresponsive. |
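
How the threshold and timeout settings interact can be sketched as follows. This is a hypothetical model, not Kylin's actual canary class: the `probe` and `restart` callables are placeholders, resetting the count after a healthy detection is an assumption, and "exceeds" is read as a strict comparison. Scheduling `check_once` every kylin.canary.sparder-context-period-min minutes is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class SparderCanarySketch:
    """Hypothetical model of the canary settings; not Kylin's actual class."""

    def __init__(self,
                 probe: Callable[[], object],
                 restart: Callable[[], None],
                 threshold_to_restart: int = 3,   # sparder-context-threshold-to-restart-spark
                 error_response_ms: int = 3000):  # sparder-context-error-response-ms
        self.probe = probe
        self.restart = restart
        self.threshold_to_restart = threshold_to_restart
        self.error_response_ms = error_response_ms
        self.error_count = 0

    def check_once(self) -> bool:
        """Run one detection; return True if it triggered a restart."""
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            # A probe that raises or does not answer in time counts as abnormal.
            pool.submit(self.probe).result(timeout=self.error_response_ms / 1000)
            self.error_count = 0  # assumption: a healthy detection resets the count
        except Exception:
            self.error_count += 1
        finally:
            pool.shutdown(wait=False)  # do not block on a hung probe

        # "Exceeds" is read as a strict comparison, which is an assumption.
        if self.error_count > self.threshold_to_restart:
            self.restart()
            self.error_count = 0
            return True
        return False
```

In this model, with the default threshold of 3, three abnormal detections in a row leave Sparder untouched and the fourth triggers the restart.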