Apache Kylin : Analytical Data Warehouse for Big Data

The query engine of Kylin 4 is called Sparder. It runs as a long-running Spark application.

Basic Query Configuration

Property

Default

Description

Version

kylin.query.auto-sparder-context-enabled

false

Whether to automatically start Sparder (the query engine of Kylin 4.0) when Kylin starts.

When this value is false, Sparder is started lazily, when the first query is executed.

4.0+

kylin.query.spark-conf.spark.master

yarn

The cluster manager to connect to. Kylin supports yarn and standalone.

kylin.query.spark-conf.spark.submit.deployMode

client

Only client mode is supported here.

kylin.query.spark-conf.spark.driver.cores

1

Number of cores to use for the driver process, only in cluster mode.

kylin.query.spark-conf.spark.driver.memory

4G

Amount of memory to use for the driver process.

kylin.query.spark-conf.spark.driver.memoryOverhead

1G

Amount of non-heap memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified.

kylin.query.spark-conf.spark.executor.cores

1

The number of cores to use on each executor.

kylin.query.spark-conf.spark.executor.instances

1

The number of executors.

kylin.query.spark-conf.spark.executor.memory

4G

Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g).

kylin.query.spark-conf.spark.executor.memoryOverhead

1G

Amount of additional memory to be allocated per executor process, in MiB unless otherwise specified.

kylin.query.spark-conf.spark.serializer

org.apache.spark.serializer.JavaSerializer

Class to use for serializing objects that will be sent over the network or need to be cached in serialized form.

kylin.query.spark-conf.spark.sql.shuffle.partitions

40

The default number of partitions to use when shuffling data for joins or aggregations.

kylin.query.spark-conf.spark.executor.extraJavaOptions

-Dhdp.version=current
-Dlog4j.configuration=spark-executor-log4j.properties
-Dlog4j.debug -Dkylin.hdfs.working.dir=${kylin.env.hdfs-working-dir}
-Dkylin.metadata.identifier=${kylin.metadata.url.identifier} -Dkylin.spark.category=sparder -Dkylin.spark.identifier={{APP_ID}}

A string of extra JVM options to pass to executors.

Generally, this parameter does not need to be changed. Kylin supplies the variable values when it submits the Spark application.


kylin.query.need-replace-exactly-agg

true

When a query exactly matches a cuboid, whether to skip the aggregation step and directly return the matching results stored in the cuboid's Parquet file.

kylin.query.bitmap-upper-bound

10000000

The maximum number of values returned by the intersect_value function.

kylin.query.spark.pool

Automatically select according to the query task

The fair scheduler of Apache Spark supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important query jobs.

The Kylin 4 query engine supports setting the pool for a query at the project level and at the thread level. It has the following built-in pools, defined in $KYLIN_HOME/conf/fairscheduler.xml:

- lightweight_tasks: queries that do not require all available CPU cores
- heavy_tasks: queries that require all available CPU cores
- query_pushdown: queries that cannot be answered by a cube

See the following links for details:
- http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
- https://cwiki.apache.org/confluence/display/KYLIN/Use+different+spark+pool+for+different+query


kylin.query.spark-engine.expose-sharding-trait

true

kylin.query.spark-engine.max-sharding-size-mb

64

The maximum size in MB handled per task when using a shard-by column. If the sharding size exceeds this value, the read falls back to a non-sharding RDD.


kylin.query.spark-engine.partition-split-size-mb

64

kylin.query.spark-engine.spark-sql-shuffle-partitions

-1

kylin.query.sparder-context.app-name

sparder_on_${hostName}-${port}

SparderContext application name.

kylin.query.pushdown.runner-class-name

null

When a query cannot be answered by a cube, Kylin can route it to the pushdown engine.

When this property is null, query pushdown is disabled. To enable query pushdown, set it to "org.apache.kylin.query.pushdown.PushDownRunnerSparkImpl".


kylin.query.pushdown.update-enabled

false

Whether to allow update operations, such as create table, in the pushdown engine.
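
As a rough illustration, some of the settings listed above might be overridden in kylin.properties as follows. This is only a sketch: the file location ($KYLIN_HOME/conf/kylin.properties) and all sizing values are assumptions chosen for the example, not recommendations. Only the property names and the pushdown runner class are taken from the table.

```properties
# Start Sparder together with Kylin instead of on the first query.
kylin.query.auto-sparder-context-enabled=true

# Illustrative sizing (values are assumptions, not recommendations).
# 4 executors x (8G heap + 2G overhead) = about 40G requested for executors.
kylin.query.spark-conf.spark.executor.instances=4
kylin.query.spark-conf.spark.executor.cores=2
kylin.query.spark-conf.spark.executor.memory=8G
kylin.query.spark-conf.spark.executor.memoryOverhead=2G

# Enable query pushdown for queries that a cube cannot answer.
kylin.query.pushdown.runner-class-name=org.apache.kylin.query.pushdown.PushDownRunnerSparkImpl
# Keep update statements (such as create table) disabled in the pushdown engine.
kylin.query.pushdown.update-enabled=false
```

Because Sparder is a long-running Spark application, changed spark-conf values would normally apply only to a newly started Sparder instance.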

Spark Context Canary Configuration

Sparder Canary is a component that monitors the running status of Sparder. It periodically checks whether the current Sparder instance is running normally. If the status is abnormal, for example because Sparder exited unexpectedly or became unresponsive, Sparder Canary creates a new Sparder instance.

Property

Default

Description

kylin.canary.sparder-context-canary-enabled

true

Whether to enable Sparder Canary.

kylin.canary.sparder-context-threshold-to-restart-spark

3

When the number of abnormal checks exceeds this threshold, the Spark context is restarted.

kylin.canary.sparder-context-period-min

3

Check interval, in minutes.

kylin.canary.sparder-context-error-response-ms

3000

Timeout for a single check, in milliseconds. If a check times out, the Spark context is considered unresponsive.
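
The canary settings can be tuned in the same way. The sketch below uses illustrative values, which are assumptions rather than recommendations, to make the canary check more often while tolerating slower responses. It assumes the file is $KYLIN_HOME/conf/kylin.properties and that the period is in minutes, as the -min suffix suggests.

```properties
# Keep the canary enabled (the default).
kylin.canary.sparder-context-canary-enabled=true
# Check once a minute instead of every 3 minutes (unit assumed to be minutes).
kylin.canary.sparder-context-period-min=1
# Treat a check as failed if there is no response within 10 seconds.
kylin.canary.sparder-context-error-response-ms=10000
# Restart the Spark context once the number of abnormal checks exceeds 5.
kylin.canary.sparder-context-threshold-to-restart-spark=5
```

With values like these, an unresponsive Sparder would be replaced after roughly threshold times period, here a little over five minutes of failed checks. How abnormal checks are counted is not specified here, so treat that estimate as approximate.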