Apache Kylin : Analytical Data Warehouse for Big Data

Welcome to Kylin Wiki.

Some properties are not listed here because they are not intended for Kylin users.

Basic configuration

| Property | Default | Description | Since |
| --- | --- | --- | --- |
| kylin.snapshot.parallel-build-enabled | | | |
| kylin.snapshot.parallel-build-timeout-seconds | | | |
| kylin.snapshot.shard-size-mb | | | |
| kylin.storage.columnar.shard-size-mb | | | |
| kylin.storage.columnar.shard-rowcount | | | |
| kylin.storage.columnar.shard-countdistinct-rowcount | | | |
| kylin.storage.columnar.repartition-threshold-size-mb | | | |


Advanced configuration

| Property | Default | Description | Since |
| --- | --- | --- | --- |
| kylin.engine.submit-hadoop-conf-dir | null | Please refer to Read-Write Separation Deployment for Kylin 4.0. | 4.0.0 |
| kylin.engine.spark.cache-parent-dataset-storage-level | NONE | | 4.0.0 |
| kylin.engine.spark.cache-parent-dataset-count | 1 | | 4.0.0 |
| kylin.engine.build-base-cuboid-enabled | true | | 4.0.0 |
| kylin.engine.spark.repartition.dataset.after.encode-enabled | false | The global dictionary is split into several buckets. To encode a column to an int value more efficiently, the source dataset is repartitioned by the to-be-encoded column into the same number of partitions as the dictionary's bucket size. This sometimes has a side effect: repartitioning by a single column is more likely to cause serious data skew, so one task may take the majority of the time in the first layer's cuboid building. In that case, you can try repartitioning the encoded dataset by all RowKey columns to avoid the skew. The repartition size defaults to the max bucket size of all dictionaries, but it can also be set explicitly via `kylin.engine.spark.repartition.dataset.after.encode.num`. | 4.0.0 |
| kylin.engine.spark.repartition.dataset.after.encode.num | 0 | See above. | 4.0.0 |
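
If the encode-stage repartitioning causes data skew as described above, the two options combine like this in kylin.properties (the partition count here is illustrative, not a recommendation):

```properties
# Repartition the encoded dataset by all RowKey columns instead of the
# single to-be-encoded column, to avoid data skew.
kylin.engine.spark.repartition.dataset.after.encode-enabled=true
# Optional: fix the repartition size explicitly (default 0 means use the
# max bucket size of all dictionaries).
kylin.engine.spark.repartition.dataset.after.encode.num=200
```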

Spark resources automatic adjustment strategy (experimental feature)


kylin.spark-conf.auto.prior

Default: true · Since: 4.0.0

For a CubeBuildJob or CubeMergeJob, it is important to allocate enough and proper resources (CPU/memory), mainly through the following config entries:
  • spark.driver.memory
  • spark.executor.memory
  • spark.executor.cores
  • spark.executor.memoryOverhead
  • spark.executor.instances
  • spark.sql.shuffle.partitions

When `kylin.spark-conf.auto.prior` is set to true, Kylin tries to adjust the entries above according to:
  • the count of cuboids to be built
  • the max size of the fact table
  • the available resources in the current resource manager's queue

Users can still override individual entries at the Cube level in the form `kylin.engine.spark-conf.<key> = <value>`; a value configured by the user overwrites the automatically adjusted one.
Check details at How to improve cube building and query performance.
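
When the automatically chosen values are not suitable, the override form mentioned above can be set at the Cube level; the values below are illustrative only:

```properties
# Keep automatic adjustment on, but pin executor memory and the shuffle
# partition count for this cube (overrides win over auto-adjusted values).
kylin.engine.spark-conf.spark.executor.memory=8g
kylin.engine.spark-conf.spark.sql.shuffle.partitions=400
```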

kylin.engine.spark-conf.spark.master

Default: yarn

The cluster manager to connect to. Kylin supports setting it to yarn or standalone.

kylin.engine.spark-conf.spark.submit.deployMode

Default: client

The deploy mode of the Spark driver program, either "client" or "cluster", which means launching the driver program locally ("client") or remotely ("cluster") on one of the nodes inside the cluster.

kylin.engine.spark-conf.spark.yarn.queue

Default: default

kylin.engine.spark-conf.spark.shuffle.service.enabled

Default: false · Since: 4.0.0

Enables the external shuffle service. This service preserves the shuffle files written by executors so that executors can be safely removed. The external shuffle service must be set up before it can be enabled.
kylin.engine.spark-conf.spark.eventLog.enabled

Default: true

Whether to log Spark events; useful for reconstructing the Web UI after the application has finished.

kylin.engine.spark-conf.spark.eventLog.dir

Default: hdfs:///kylin/spark-history

Base directory in which Spark events are logged, if spark.eventLog.enabled is true.

kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled

Default: false

kylin.engine.spark-conf.spark.executor.extraJavaOptions

Default:
-Dfile.encoding=UTF-8
-Dhdp.version=current
-Dlog4j.configuration=spark-executor-log4j.properties
-Dlog4j.debug
-Dkylin.hdfs.working.dir=${hdfs.working.dir}
-Dkylin.metadata.identifier=${kylin.metadata.url.identifier}
-Dkylin.spark.category=job
-Dkylin.spark.project=${job.project}
-Dkylin.spark.identifier=${job.id}
-Dkylin.spark.jobName=${job.stepId}
-Duser.timezone=${user.timezone}

kylin.engine.spark-conf.spark.yarn.jars

Default: hdfs://localhost:9000/spark2_jars/*

Manually upload the spark-assembly jars to HDFS and set this property to avoid repeatedly uploading them at runtime.


kylin.engine.driver-memory-base

Default: 1024 · Since: 4.0.0

Driver memory (spark.driver.memory) is auto-adjusted based on the cuboid count and configuration.

kylin.engine.driver-memory-strategy decides the level. For example, "2,20,100" translates to four cuboid-count ranges, from low to high:

  • Level 1 : (0, 2)
  • Level 2 : (2, 20)
  • Level 3 : (20, 100)
  • Level 4 : (100, +∞)

So a proper level can be found for a specific cuboid count: 12 falls into level 2, and 230 into level 4.

Driver memory is then calculated by the following formula:

min(kylin.engine.driver-memory-base * level, kylin.engine.driver-memory-maximum)

kylin.engine.driver-memory-maximum

Default: 4096 · Since: 4.0.0

See above.

kylin.engine.driver-memory-strategy

Default: 2,20,100 · Since: 4.0.0

See above.
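
The leveling rule above can be sketched as follows. This is a minimal reading of the documented behavior, not the Kylin source; in particular, the boundary handling at exact threshold values is an assumption.

```python
def driver_memory_mb(cuboid_count: int,
                     base: int = 1024,            # kylin.engine.driver-memory-base
                     maximum: int = 4096,         # kylin.engine.driver-memory-maximum
                     strategy: str = "2,20,100",  # kylin.engine.driver-memory-strategy
                     ) -> int:
    thresholds = [int(t) for t in strategy.split(",")]
    # Level 1 covers (0, 2); each threshold passed moves the count up one level.
    level = 1 + sum(1 for t in thresholds if cuboid_count >= t)
    return min(base * level, maximum)

print(driver_memory_mb(12))   # a count of 12 lands in level 2 -> 2048
print(driver_memory_mb(230))  # level 4, capped by the maximum -> 4096
```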
kylin.engine.base-executor-instance

Default: 5 · Since: 4.0.0

kylin.engine.spark.required-cores

Default: 1 · Since: 4.0.0

kylin.engine.executor-instance-strategy

Default: 100,2,500,3,1000,4 · Since: 4.0.0

kylin.engine.retry-memory-gradient

Since: 4.0.0

Resource Detect File Summary

The following files are created under WORKING-DIR/$PROJECT/job_tmp/${JOB_ID}/share in the first step of a BuildJob. They serve the Spark resources automatic adjustment strategy (source code: ResourceDetectBeforeCubingJob).

count_distinct.json (Boolean, Binary)

Whether the cube contains a COUNT_DISTINCT(bitmap) measure.

Sample:

true

${JOB_ID}_resource_path.json (Map<String, List<String>>, Binary)

Key is the cuboid ID, and value is the cuboid's parent dataset's partition paths. -1 means the flat table.

Sample:

{
   "-1" : ["hdfs://cdh-master:8020/user/hive/warehouse/tpch_flat_orc_10.db/lineitem",
           "hdfs://cdh-master:8020/user/hive/warehouse/tpch_flat_orc_10.db/part"]
}

${JOB_ID}_cubing_detect_items.json (Map<String, Integer>, Binary)

Key is the cuboid ID, and value is the cuboid's parent dataset's partition count.

Sample:

{
  "-1": 32
}
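
Since these files are plain JSON, a quick way to inspect them is to load them directly. The snippet below parses the samples shown above; the inline strings stand in for the real files under the share directory.

```python
import json

# Inline copies of the sample contents shown above; in practice these would
# be read from WORKING-DIR/$PROJECT/job_tmp/${JOB_ID}/share.
resource_path = json.loads("""
{
  "-1": ["hdfs://cdh-master:8020/user/hive/warehouse/tpch_flat_orc_10.db/lineitem",
         "hdfs://cdh-master:8020/user/hive/warehouse/tpch_flat_orc_10.db/part"]
}
""")
cubing_detect_items = json.loads('{"-1": 32}')

# "-1" is the flat table; here it spans two source paths and 32 partitions.
flat_table_paths = resource_path["-1"]
flat_table_partitions = cubing_detect_items["-1"]
print(len(flat_table_paths), flat_table_partitions)  # -> 2 32
```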

Global dictionary

| Property | Default | Description | Since |
| --- | --- | --- | --- |
| kylin.dictionary.detect-data-skew-sample-enabled | | | |
| kylin.dictionary.detect-data-skew-sample-rate | | | |
| kylin.dictionary.detect-data-skew-percentage-threshold | | | |

