Kylin 4.0 Query Engine Configuration

The query engine of Kylin 4 is called Sparder. It runs as a long-running Spark application.

Basic Query Configuration

Property	Default	Description	Version
kylin.query.auto-sparder-context-enabled	false	Whether to start Sparder (the query engine of Kylin 4.0) automatically when Kylin starts. When this value is false, Sparder is started lazily when the first query is executed.	4.0+
kylin.query.spark-conf.spark.master	yarn	The cluster manager to connect to. Kylin supports yarn and standalone.
kylin.query.spark-conf.spark.submit.deployMode	client	Only client mode is supported here.
kylin.query.spark-conf.spark.driver.cores	1	Number of cores to use for the driver process, only in cluster mode.
kylin.query.spark-conf.spark.driver.memory	4G	Amount of memory to use for the driver process.
kylin.query.spark-conf.spark.driver.memoryOverhead	1G	Amount of non-heap memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified.
kylin.query.spark-conf.spark.executor.cores	1	The number of cores to use on each executor.
kylin.query.spark-conf.spark.executor.instances	1	The number of executors.
kylin.query.spark-conf.spark.executor.memory	4G	Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. `512m`, `2g`).
kylin.query.spark-conf.spark.executor.memoryOverhead	1G	Amount of additional memory to be allocated per executor process, in MiB unless otherwise specified.
kylin.query.spark-conf.spark.serializer	org.apache.spark.serializer.JavaSerializer	Class to use for serializing objects that will be sent over the network or need to be cached in serialized form.
kylin.query.spark-conf.spark.sql.shuffle.partitions	40	The default number of partitions to use when shuffling data for joins or aggregations.
kylin.query.spark-conf.spark.executor.extraJavaOptions	-Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=${kylin.env.hdfs-working-dir} -Dkylin.metadata.identifier=${kylin.metadata.url.identifier} -Dkylin.spark.category=sparder -Dkylin.spark.identifier={{APP_ID}}	A string of extra JVM options to pass to executors. Generally, this parameter does not need to be changed; Kylin supplies the variables when it submits the Spark application.
kylin.query.need-replace-exactly-agg	true	When a query exactly matches a cuboid, whether to skip the aggregation step and directly return the matching results stored in the cuboid's Parquet file.
kylin.query.bitmap-upper-bound	10000000	The maximum number of values returned by the intersect_value function.
kylin.query.spark.pool	Selected automatically according to the query task	The fair scheduler of Apache Spark supports grouping jobs into pools and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important query jobs. The query engine of Kylin 4 supports setting the pool for a query at the project level and at the thread level, and it has built-in pools (defined in $KYLIN_HOME/conf/fairscheduler.xml): lightweight_tasks, for queries that do not require all available CPU cores; heavy_tasks, for queries that require all available CPU cores; query_pushdown, for queries that cannot be answered by a cube. For details, see http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application and https://cwiki.apache.org/confluence/display/KYLIN/Use+different+spark+pool+for+different+query
kylin.query.spark-engine.expose-sharding-trait	true
kylin.query.spark-engine.max-sharding-size-mb	64	The maximum size, in MB, handled per task when using a shard-by column. If the sharding size exceeds this value, the read falls back to a non-sharding RDD.
kylin.query.spark-engine.partition-split-size-mb	64	Used to set `spark.sql.shuffle.partitions` automatically for each query. See the config item kylin.query.spark-engine.spark-sql-shuffle-partitions.
kylin.query.spark-engine.spark-sql-shuffle-partitions	-1	SparderContext tries to set `spark.sql.shuffle.partitions` for each query according to the number of bytes to scan. 1. Set to -1 to let the query engine decide automatically; specifically, the value is ${total bytes of all files after pruning by FilePruner} / KylinConfigBase#getQueryPartitionSplitSizeMB. 2. Set to any other positive integer to use a fixed value.
kylin.query.sparder-context.app-name	sparder_on_${hostName}-${port}	SparderContext application name.
kylin.query.pushdown.runner-class-name	null	When a query cannot be answered by a cube, Kylin can route it to the pushdown engine. When this configuration is null, query pushdown is not enabled. To enable query pushdown, set it to "org.apache.kylin.query.pushdown.PushDownRunnerSparkImpl".
kylin.query.pushdown.update-enabled	false	Whether to allow update operations in the pushdown engine, such as CREATE TABLE.
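
The automatic choice described for kylin.query.spark-engine.spark-sql-shuffle-partitions can be illustrated with a small calculation. The following is only a sketch, not Kylin's actual code: the function name `shuffle_partitions` is hypothetical, and it assumes that the split size is converted from MB to bytes before dividing, that the result is rounded up, and that it never drops below one partition. Any additional cap applied by the real engine is not modelled.

```python
import math

MIB = 1024 * 1024


def shuffle_partitions(scan_bytes, split_size_mb=64, configured=-1):
    """Illustrative estimate of spark.sql.shuffle.partitions for one query.

    scan_bytes    -- total bytes of all files left after pruning
    split_size_mb -- kylin.query.spark-engine.partition-split-size-mb
    configured    -- kylin.query.spark-engine.spark-sql-shuffle-partitions
    """
    # A positive configured value is used as a fixed partition count.
    if configured > 0:
        return configured
    # Assumption: round up and keep at least one partition.
    return max(1, math.ceil(scan_bytes / (split_size_mb * MIB)))


print(shuffle_partitions(640 * MIB))                  # 10
print(shuffle_partitions(640 * MIB, configured=200))  # 200
```

Under these assumptions, a query that scans 640 MiB with the default split size of 64 would use 10 shuffle partitions, while a fixed setting of 200 is returned unchanged.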

Spark Context Canary Configuration

Sparder Canary is a component that monitors the running status of Sparder. It periodically checks whether the current Sparder instance is running normally. If the status is abnormal, for example because Sparder has exited unexpectedly or become unresponsive, Sparder Canary creates a new Sparder instance.

Property	Default	Description
kylin.canary.sparder-context-canary-enabled	true	Whether to enable Sparder Canary.
kylin.canary.sparder-context-threshold-to-restart-spark	3	When the number of abnormal detections exceeds this threshold, the Spark context is restarted.
kylin.canary.sparder-context-period-min	3	Check interval, in minutes.
kylin.canary.sparder-context-error-response-ms	3000	Timeout for a single detection, in milliseconds. A detection that times out is treated as no response from the Spark context.
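
How the threshold and timeout settings interact can be sketched as a small state machine. This is a simplified model, not the Sparder Canary implementation: the class `CanaryState` and its `record` method are hypothetical names, and it assumes that abnormal detections are counted consecutively, that a healthy detection resets the count, and that the restart fires once the count strictly exceeds the threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CanaryState:
    # kylin.canary.sparder-context-threshold-to-restart-spark
    threshold_to_restart: int = 3
    # kylin.canary.sparder-context-error-response-ms
    error_response_ms: int = 3000
    errors: int = 0

    def record(self, response_ms: Optional[int]) -> bool:
        """Record one detection; return True if a restart should be triggered.

        response_ms is the detection's response time, or None if it failed.
        """
        abnormal = response_ms is None or response_ms > self.error_response_ms
        if not abnormal:
            # Assumption: a healthy detection clears the error count.
            self.errors = 0
            return False
        self.errors += 1
        if self.errors > self.threshold_to_restart:
            # Start counting again after the restart has been requested.
            self.errors = 0
            return True
        return False


state = CanaryState()
print([state.record(5000) for _ in range(4)])  # [False, False, False, True]
```

With the default values and a check every 3 minutes, this model would request a restart on the fourth consecutive abnormal detection.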
