Apache Kylin : Analytical Data Warehouse for Big Data

Some properties are not listed because they are not intended for Kylin users.

Basic configuration

| Property | Default | Description | Since |
| --- | --- | --- | --- |
| kylin.snapshot.parallel-build-enabled | true | Whether to build dimension table snapshots in parallel. | 4.0.0 |
| kylin.snapshot.parallel-build-timeout-seconds | 3600 | Timeout, in seconds, for building dimension table snapshots in parallel. | 4.0.0 |
| kylin.snapshot.shard-size-mb | 128 | Dimension table snapshot shard size, in MB. | 4.0.0 |
| kylin.storage.columnar.shard-size-mb | 128 | Parquet file shard size, in MB. | 4.0.0 |
| kylin.storage.columnar.shard-rowcount | 2500000 | Parquet file shard row count. | 4.0.0 |
| kylin.storage.columnar.shard-countdistinct-rowcount | 1000000 | Parquet file shard row count for the countDistinct measure. When a cube contains a countDistinct measure, this value overrides `kylin.storage.columnar.shard-rowcount`. | 4.0.0 |
| kylin.storage.columnar.repartition-threshold-size-mb | 128 | Repartition threshold size, in MB. | 4.0.0 |


Advanced configuration

| Property | Default | Description | Since |
| --- | --- | --- | --- |
| kylin.engine.submit-hadoop-conf-dir | null | Please refer to Read-Write Separation Deployment for Kylin 4.0. | 4.0.0 |
| kylin.engine.spark.cache-parent-dataset-storage-level | NONE | Storage level at which cuboid datasets are cached during build. No cache by default. It can be set to DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2 or OFF_HEAP. | 4.0.0 |
| kylin.engine.spark.cache-parent-dataset-count | 1 | How many cuboid datasets can be cached at the same time. It takes effect only when `kylin.engine.spark.cache-parent-dataset-storage-level` is not NONE. A value that is too large occupies too much memory; a value that is too small limits the concurrency of cuboid building. A value of about 5 is recommended, adjusted according to your cube and data. | 4.0.0 |
| kylin.engine.build-base-cuboid-enabled | true | Whether to build the base cuboid. It is built by default. | 4.0.0 |
| kylin.engine.spark.repartition.dataset.after.encode-enabled | false | The global dictionary is split into several buckets. To encode a column to int values more efficiently, the source dataset is repartitioned by the to-be-encoded column into as many partitions as the dictionary has buckets. This sometimes has a side effect: repartitioning by a single column is more likely to cause serious data skew, so that one task takes the majority of the time when building the first layer of cuboids. In that case, you can try repartitioning the encoded dataset by all RowKey columns to avoid the skew. The repartition size defaults to the maximum bucket size of all dictionaries, but it can be set to another value with `kylin.engine.spark.repartition.dataset.after.encode.num`. | 4.0.0 |
| kylin.engine.spark.repartition.dataset.after.encode.num | 0 | See above. | 4.0.0 |

Spark resources automatic adjustment strategy (experimental feature)


| Property | Default | Description | Since |
| --- | --- | --- | --- |
| kylin.spark-conf.auto.prior | true | For a CubeBuildJob or CubeMergeJob, it is important to allocate sufficient and appropriate resources (CPU and memory), mainly through the following config entries: spark.driver.memory, spark.executor.memory, spark.executor.cores, spark.executor.memoryOverhead, spark.executor.instances and spark.sql.shuffle.partitions. When `kylin.spark-conf.auto.prior` is set to true, Kylin tries to adjust these entries according to the count of cuboids to be built, the max size of the fact table, and the available resources in the current resource manager's queue. Users can still override individual entries at the Cube level in the form `kylin.engine.spark-conf.<key> = <value>`; a value configured by the user overwrites the automatically adjusted value. For details, see How to improve cube building and query performance. | 4.0.0 |
| kylin.engine.spark-conf.spark.master | yarn | The cluster manager to connect to. Kylin supports yarn and standalone. |  |
| kylin.engine.spark-conf.spark.submit.deployMode | client | The deploy mode of the Spark driver program, either "client" or "cluster": the driver program is launched locally ("client") or remotely on one of the nodes inside the cluster ("cluster"). |  |
| kylin.engine.spark-conf.spark.yarn.queue | default |  |  |
| kylin.engine.spark-conf.spark.shuffle.service.enabled | false | Enables the external shuffle service. This service preserves the shuffle files written by executors so that the executors can be safely removed. The external shuffle service must be set up before it is enabled. | 4.0.0 |
| kylin.engine.spark-conf.spark.eventLog.enabled | true | Whether to log Spark events, useful for reconstructing the Web UI after the application has finished. |  |
| kylin.engine.spark-conf.spark.eventLog.dir | hdfs\:///kylin/spark-history | Base directory in which Spark events are logged, if spark.eventLog.enabled is true. |  |
| kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled | false |  |  |
| kylin.engine.spark-conf.spark.executor.extraJavaOptions | `-Dfile.encoding=UTF-8 -Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=${hdfs.working.dir} -Dkylin.metadata.identifier=${kylin.metadata.url.identifier} -Dkylin.spark.category=job -Dkylin.spark.project=${job.project} -Dkylin.spark.identifier=${job.id} -Dkylin.spark.jobName=${job.stepId} -Duser.timezone=${user.timezone}` | A string of extra JVM options to pass to executors. |  |
| kylin.engine.spark-conf.spark.yarn.jars | hdfs://localhost:9000/spark2_jars/* | Manually uploading the spark-assembly jar to HDFS and then setting this property avoids uploading the jar repeatedly at runtime. |  |
| kylin.engine.driver-memory-base | 1024 | Driver memory (spark.driver.memory) is adjusted automatically according to the cuboid count and this configuration. `kylin.engine.driver-memory-strategy` defines the levels: for example, "2,20,100" is translated into four cuboid count ranges, from low to high: Level 1: (0, 2); Level 2: (2, 20); Level 3: (20, 100); Level 4: (100, +). The level for a specific cuboid count follows from these ranges: 12 is level 2, and 230 is level 4. Driver memory is then calculated as `min(kylin.engine.driver-memory-base * level, kylin.engine.driver-memory-maximum)`. | 4.0.0 |
| kylin.engine.driver-memory-maximum | 4096 | See above. | 4.0.0 |
| kylin.engine.driver-memory-strategy | 2,20,100 | See above. | 4.0.0 |
| kylin.engine.base-executor-instance | 5 | See above. | 4.0.0 |
| kylin.engine.spark.required-cores | 1 | See above. | 4.0.0 |
| kylin.engine.executor-instance-strategy | 100,2,500,3,1000,4 | See above. | 4.0.0 |
| kylin.engine.retry-memory-gradient | 1.5 | See above. | 4.0.0 |
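
The driver memory rule described for `kylin.engine.driver-memory-base` can be sketched in Java as follows. This is a minimal illustration and not Kylin's implementation: the class `DriverMemorySketch` and its methods are hypothetical names, and the treatment of a cuboid count that exactly equals a boundary (it stays in the lower level here) is an assumption, because the ranges do not state which side is inclusive.

```java
class DriverMemorySketch {

    // Level = 1 + number of strategy boundaries strictly below the cuboid count.
    // Assumption: a count equal to a boundary stays in the lower level.
    static int level(int cuboidCount, int[] strategy) {
        int level = 1;
        for (int boundary : strategy) {
            if (cuboidCount > boundary) {
                level++;
            }
        }
        return level;
    }

    // min(driver-memory-base * level, driver-memory-maximum), values in MB.
    static int driverMemoryMb(int cuboidCount, int[] strategy, int baseMb, int maximumMb) {
        return Math.min(baseMb * level(cuboidCount, strategy), maximumMb);
    }

    public static void main(String[] args) {
        int[] strategy = {2, 20, 100}; // default kylin.engine.driver-memory-strategy
        int baseMb = 1024;             // default kylin.engine.driver-memory-base
        int maximumMb = 4096;          // default kylin.engine.driver-memory-maximum

        // 12 cuboids fall in level 2, 230 cuboids in level 4.
        System.out.println(driverMemoryMb(12, strategy, baseMb, maximumMb));  // 2048
        System.out.println(driverMemoryMb(230, strategy, baseMb, maximumMb)); // 4096
    }
}
```

With the default values, level 4 already reaches the maximum of 4096, so raising `kylin.engine.driver-memory-base` alone does not give large cubes more driver memory unless `kylin.engine.driver-memory-maximum` is raised as well.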

Resource Detect File Summary

The following files are located under WORKING-DIR/$PROJECT/job_tmp/${JOB_ID}/share and are produced in the first step of a BuildJob. They serve as input to the Spark resources automatic adjustment strategy (source code: ResourceDetectBeforeCubingJob).

| Resource Detect File | Data Type | Format | Description |
| --- | --- | --- | --- |
| count_distinct.json | Boolean | Binary | Whether the cube contains a COUNT_DISTINCT (bitmap) measure. Sample: `true` |
| ${JOB_ID}_resource_paths.json | `Map<String, List<String>>` | Binary | Key is the cuboid ID, and value is the partition paths of the cuboid's parent dataset. -1 means Flat Table. Sample: `{"-1": ["hdfs://cdh-master:8020/user/hive/warehouse/tpch_flat_orc_10.db/lineitem", "hdfs://cdh-master:8020/user/hive/warehouse/tpch_flat_orc_10.db/part"]}` |
| ${JOB_ID}_cubing_detect_items.json | `Map<String, Integer>` | Binary | Key is the cuboid ID, and value is the partition count of the cuboid's parent dataset. Sample: `{"-1": 32}` |

Global dictionary

| Property | Default | Description | Since |
| --- | --- | --- | --- |
| kylin.dictionary.detect-data-skew-sample-enabled | false | Whether to detect dataset skew in the dictionary encoding step. | 4.0.0 |
| kylin.dictionary.detect-data-skew-sample-rate | 0.1 | In some data skew cases, the repartition step during dictionary encoding is slow. Kylin can sample the dataset to detect skewed data; this configuration sets the sample rate. | 4.0.0 |
| kylin.dictionary.detect-data-skew-percentage-threshold | 0.05 | In Kylin 4, dictionaries are hashed into several buckets, and column data is repartitioned by the same hash algorithm during the encoding step. In data skew cases, the repartition step is very slow. Kylin automatically samples the source to detect skewed data and repartitions the skewed data to random partitions. This configuration sets the skew threshold, a value from 0 to 1. For example, with a value of 0.05, every value that accounts for more than 5% of the total is regarded as skewed data; as a result, there can be no more than 20 skewed values. | 4.0.0 |
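
The threshold check described for `kylin.dictionary.detect-data-skew-percentage-threshold` can be illustrated with a small Java sketch. It is an illustration under stated assumptions rather than Kylin's code: the class `SkewThresholdSketch` and the method `skewedValues` are hypothetical, and the sample is an in-memory list instead of a Spark dataset. The sketch counts how often each value occurs in a sample and flags the values whose share exceeds the threshold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SkewThresholdSketch {

    // Returns the values whose share of the sample is strictly greater
    // than the threshold (a fraction between 0 and 1), in sorted order.
    static List<String> skewedValues(List<String> sample, double threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : sample) {
            counts.merge(value, 1, Integer::sum);
        }
        List<String> skewed = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > threshold * sample.size()) {
                skewed.add(entry.getKey());
            }
        }
        Collections.sort(skewed);
        return skewed;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "a" takes 50%, "b" takes 6%, the rest 1% each.
        List<String> sample = new ArrayList<>();
        for (int i = 0; i < 50; i++) sample.add("a");
        for (int i = 0; i < 6; i++) sample.add("b");
        for (int i = 0; i < 44; i++) sample.add("u" + i);

        System.out.println(skewedValues(sample, 0.05)); // [a, b]
    }
}
```

Because every flagged value accounts for more than the threshold share of the sample, the number of flagged values is bounded by 1 / threshold, which is where the limit of 20 for a threshold of 0.05 comes from.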
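Comment: the following snippet reads a resource detect file, which stores a JSON payload prefixed with its byte length, and prints the content.

```java
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.Charset;
import java.util.Map;

public class ReadResourceDetectFile {
    public static void main(String[] args) throws Exception {
        Path p = new Path("/Users/xiaoxiang.yu/Downloads/8871d57f-7bb2-b9c4-389f-3011e63718b8_resource_paths.json");
        FileSystem fs = p.getFileSystem(new Configuration());
        try (FSDataInputStream in = fs.open(p)) { // close the stream when done
            int length = in.readInt();            // first 4 bytes: payload length
            byte[] bytes = new byte[length];
            in.readFully(bytes);                  // remaining bytes: JSON text
            Map<String, Object> content = new Gson().fromJson(
                    new String(bytes, Charset.defaultCharset()),
                    new TypeToken<Map<String, Object>>() {}.getType());
            System.out.println(content);
        }
    }
}
```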