Apache Kylin : Analytical Data Warehouse for Big Data
The query engine of Kylin 4 is called Sparder. It runs as a long-running Spark application.
Basic Query Configuration
Property | Default | Description | Version |
---|---|---|---|
kylin.query.auto-sparder-context-enabled | false | Whether to start Sparder (the Kylin 4.0 query engine) automatically when Kylin starts. When this value is false, Sparder starts lazily when the first query is executed. | 4.0+ |
kylin.query.spark-conf.spark.master | yarn | The cluster manager to connect to. Kylin supports yarn and standalone. | |
kylin.query.spark-conf.spark.submit.deployMode | client | The Spark deploy mode. Only client mode is supported. | |
kylin.query.spark-conf.spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. | |
kylin.query.spark-conf.spark.driver.memory | 4G | Amount of memory to use for the driver process. | |
kylin.query.spark-conf.spark.driver.memoryOverhead | 1G | Amount of non-heap memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified. | |
kylin.query.spark-conf.spark.executor.cores | 1 | The number of cores to use on each executor. | |
kylin.query.spark-conf.spark.executor.instances | 1 | The number of executors. | |
kylin.query.spark-conf.spark.executor.memory | 4G | Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t"), e.g. 512m, 2g. | |
kylin.query.spark-conf.spark.executor.memoryOverhead | 1G | Amount of additional memory to be allocated per executor process, in MiB unless otherwise specified. | |
kylin.query.spark-conf.spark.serializer | org.apache.spark.serializer.JavaSerializer | Class to use for serializing objects that will be sent over the network or need to be cached in serialized form. | |
kylin.query.spark-conf.spark.sql.shuffle.partitions | 40 | The default number of partitions to use when shuffling data for joins or aggregations. | |
kylin.query.spark-conf.spark.executor.extraJavaOptions | | A string of extra JVM options to pass to executors. This parameter generally does not need to be changed; Kylin supplies the value when it submits the Spark application. | |
kylin.query.need-replace-exactly-agg | true | When a query exactly matches a cuboid, whether to skip the aggregation step and directly return the qualifying results stored in the cuboid's parquet file. | |
kylin.query.bitmap-upper-bound | 10000000 | The maximum number of values returned by the intersect_value function. | |
kylin.query.spark.pool | Automatically selected according to the query task | The fair scheduler of Apache Spark supports grouping jobs into pools and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important query jobs. The Kylin 4 query engine supports setting the pool for a query at project level and at thread level, and it ships with built-in pools defined in $KYLIN_HOME/conf/fairscheduler.xml: lightweight_tasks for queries that do not require all available CPU cores; heavy_tasks for queries that require all available CPU cores; query_pushdown for queries that cannot be answered by a cube. For details, see http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application and https://cwiki.apache.org/confluence/display/KYLIN/Use+different+spark+pool+for+different+query | |
kylin.query.spark-engine.expose-sharding-trait | true | | |
kylin.query.spark-engine.max-sharding-size-mb | 64 | The maximum size in MB handled per task when using a shard-by column. If the sharding size exceeds this value, the query falls back to a non-sharding read RDD. | |
kylin.query.spark-engine.partition-split-size-mb | 64 | Used to set `spark.sql.shuffle.partitions` automatically for each query. See kylin.query.spark-engine.spark-sql-shuffle-partitions. | |
kylin.query.spark-engine.spark-sql-shuffle-partitions | -1 | SparderContext will try to set `spark.sql.shuffle.partitions` for each query according to the number of bytes to scan. | |
kylin.query.sparder-context.app-name | sparder_on_${hostName}-${port} | The SparderContext application name. | |
kylin.query.pushdown.runner-class-name | null | When a query cannot be answered by a cube, Kylin can route it to the pushdown engine. When this property is null, query pushdown is disabled. To enable query pushdown, set it to "org.apache.kylin.query.pushdown.PushDownRunnerSparkImpl". | |
kylin.query.pushdown.update-enabled | false | Whether to allow update operations in the pushdown engine, such as create table. | |
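
As an illustration, several of the properties above might be combined in `$KYLIN_HOME/conf/kylin.properties` as sketched below. The property names and the pushdown runner class are taken from the table; the resource values are placeholders chosen for the example, not recommendations, and should be sized for the actual cluster.

```properties
# Start Sparder together with Kylin instead of waiting for the first query.
kylin.query.auto-sparder-context-enabled=true

# Resources for the Sparder Spark application (illustrative values only).
kylin.query.spark-conf.spark.master=yarn
kylin.query.spark-conf.spark.submit.deployMode=client
kylin.query.spark-conf.spark.driver.memory=4G
kylin.query.spark-conf.spark.executor.instances=4
kylin.query.spark-conf.spark.executor.cores=2
kylin.query.spark-conf.spark.executor.memory=8G
kylin.query.spark-conf.spark.executor.memoryOverhead=2G

# Keep the default (-1) so SparderContext sets spark.sql.shuffle.partitions per query.
kylin.query.spark-engine.spark-sql-shuffle-partitions=-1

# Enable query pushdown for queries that cannot be answered by a cube.
kylin.query.pushdown.runner-class-name=org.apache.kylin.query.pushdown.PushDownRunnerSparkImpl
```

Properties that are left out keep the defaults listed in the table.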
Spark Context Canary Configuration
Sparder Canary is a component that monitors the running status of Sparder. It periodically checks whether the current Sparder instance is running normally. If the status is abnormal, for example because Sparder has exited unexpectedly or become unresponsive, Sparder Canary creates a new Sparder instance.
Property | Default | Description |
---|---|---|
kylin.canary.sparder-context-canary-enabled | true | Whether to enable Sparder Canary. |
kylin.canary.sparder-context-threshold-to-restart-spark | 3 | When the number of abnormal checks exceeds this threshold, the Spark context is restarted. |
kylin.canary.sparder-context-period-min | 3 | Interval between checks, in minutes. |
kylin.canary.sparder-context-error-response-ms | 3000 | Timeout for a single check, in milliseconds. A check that times out is treated as no response from the Spark context. |
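
A minimal sketch of a canary setup in `kylin.properties` follows. The values are illustrative only, and the units in the comments are inferred from the `-min` and `-ms` suffixes of the property names rather than stated in the table.

```properties
# Keep Sparder Canary enabled (this is the default).
kylin.canary.sparder-context-canary-enabled=true

# Check Sparder once per minute (assumed unit: minutes).
kylin.canary.sparder-context-period-min=1

# Treat a check as "no response" after 5 seconds (assumed unit: milliseconds).
kylin.canary.sparder-context-error-response-ms=5000

# Restart the Spark context once the number of abnormal checks exceeds 3.
kylin.canary.sparder-context-threshold-to-restart-spark=3
```

A shorter interval and a lower threshold detect a hung Sparder sooner, at the cost of more frequent checks and a higher chance of restarting on a transient slowdown.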