Apache Kylin : Analytical Data Warehouse for Big Data
Background
Build Engine
You may already know that in Kylin 4, pre-calculated cuboid data is stored in parquet files. Before saving it to HDFS/object storage, Kylin repartitions the pre-calculated cuboid data. This article introduces how the data is repartitioned and how the repartitioning affects query performance.
...
When you specify a shard by column,
...
Kylin repartitions the data (a Spark DataFrame transformation) by the shard by column, and the partition number is calculated from the following configurations (kylin.storage.columnar.shard-size-mb, kylin.storage.columnar.shard-rowcount). If you do not specify a shard by column, the repartition is done with the partition number only.
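As a rough illustration (not Kylin's actual source code; the DataFrame, column name, and partitionNum below are placeholders), the two cases look like this in Spark:

Code Block
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: repartition the pre-calculated cuboid data before writing it out as parquet.
// partitionNum is the number calculated from shard-size-mb / shard-rowcount as described above.
def writeCuboid(cuboid: DataFrame, partitionNum: Int,
                shardByColumn: Option[String], outputPath: String): Unit = {
  val repartitioned = shardByColumn match {
    // With a shard by column: hash-partition the rows by that column.
    case Some(c) => cuboid.repartition(partitionNum, col(c))
    // Without one: only the partition number is used.
    case None    => cuboid.repartition(partitionNum)
  }
  repartitioned.write.mode("overwrite").parquet(outputPath)
}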
Query Engine
For example, suppose there is a high-cardinality column called seller_id, and the application always filters on this column. Here are some sample SQLs:
Code Block
select count(*) from kylin_sales left join kylin_order where SELLER_ID = '10000233'
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233,10000234,10000235)
select count(*) from kylin_sales left join kylin_order where SELLER_ID is NULL
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233,10000234,10000235) and SELLER_ID = 10000233
select count(*) from kylin_sales left join kylin_order where SELLER_ID = 10000233 or SELLER_ID = 1
When the filter in a SQL query contains one of the following operators on the shard by column:
- Equality
- In
- IsNull
the Query Engine can prune all parquet files that do not contain the specified values. So we suggest setting as the shard by column a column that both has high cardinality and is used in the where clause.
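The idea behind the pruning can be sketched as follows (illustration only, not Kylin's implementation; it assumes every distinct SELLER_ID value is hashed to exactly one parquet file, and shardOf stands in for that hash function):

Code Block
// Illustration only: given the parquet files of a segment and the values in an
// equality / IN filter on the shard by column, keep only the files whose shard
// index matches one of the filter values; all other files are skipped entirely.
def pruneFiles(files: IndexedSeq[String],   // one file per shard
               filterValues: Set[Long],     // e.g. Set(10000233L, 10000234L)
               shardOf: Long => Int         // stand-in for the real hash function
              ): Seq[String] =
  filterValues.map(shardOf).toSeq.distinct.map(i => files(i))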
How to use
Model design
Edit the cube model and add the dimension seller_id. Note that the type of the dimension should be normal, not derived.
Cube Design
From Cube Designer → Advanced Setting → Rowkeys, find the column seller_id and set Shard By to true. Note that only one shard by column is currently supported, so at most one column should have Shard By set to true.
Advanced configuration
FAQ
Before configuring the shard by column, there are some things to pay attention to:
...
For each cuboid, if you want to control the number of parquet files (or their total size), the following configurations are supported at the cube level.
Key | Default value | Description |
---|---|---|
kylin.storage.columnar.shard-size-mb | 128 | The max size of each parquet file, in MB. |
kylin.storage.columnar.shard-rowcount | 2500000 | The max row count of each parquet file; by default each file contains at most 2.5 million rows. |
kylin.storage.columnar.shard-countdistinct-rowcount | 1000000 | Because Bitmap (precise count distinct) values are large, a separate max row count can be set for cuboids that contain a Bitmap measure; by default such a file contains at most 1 million rows. |
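In other words, a cuboid that contains a precise count-distinct (Bitmap) measure uses the smaller shard-countdistinct-rowcount limit instead of shard-rowcount. A minimal sketch of that choice (the helper and its name are illustrative, not Kylin's API; the defaults are taken from the table above):

Code Block
// Pick the per-file row-count limit for one cuboid (illustrative helper).
def rowCountLimit(hasBitmapMeasure: Boolean,
                  shardRowCount: Long = 2500000L,        // kylin.storage.columnar.shard-rowcount
                  countDistinctRowCount: Long = 1000000L // kylin.storage.columnar.shard-countdistinct-rowcount
                 ): Long =
  if (hasBitmapMeasure) countDistinctRowCount else shardRowCount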
Build Segment
Considering the file size: the total size is 147.7 M, so the file count should be 147.7 M / 15 M = 11.
Considering the row count: the total row count is 300000, so the file count should be 300000 / 10000 = 3.
So the final file count is (11 + 3) / 2 = 7.
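A minimal sketch of this calculation, under the assumption that the two estimates are simply averaged and rounded as the numbers above and the log below suggest (names are illustrative, not Kylin's API):

Code Block
// Sketch: derive the repartition number from total size and total row count.
def repartitionNum(totalSizeMb: Double, totalRows: Long,
                   shardSizeMb: Double, shardRowCount: Long): Int = {
  val bySize = math.ceil(totalSizeMb / shardSizeMb).toInt          // "File length repartition num" in the build log
  val byRows = math.ceil(totalRows.toDouble / shardRowCount).toInt // "Row count Repartition num" in the build log
  math.ceil((bySize + byRows) / 2.0).toInt                         // e.g. (11 + 3) / 2 = 7
}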
Code Block
2020-09-02 14:25:22,320 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:25:22,320 INFO [thread-build-cuboid-2047] job.CubeBuildJob:344 : Cuboids are saved to temp path : hdfs://cdh-master:8020
_temp
2020-09-02 14:25:48,504 INFO [BadQueryDetector] service.BadQueryDetector:148 : Detect bad query.
2020-09-02 14:26:48,504 INFO [BadQueryDetector] service.BadQueryDetector:148 : Detect bad query.
2020-09-02 14:26:48,767 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:26:48,767 DEBUG [thread-build-cuboid-2047] util.HadoopUtil:302 : Use provider:org.apache.kylin.common.storage.DefaultStorageProvider
2020-09-02 14:26:48,777 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:26:48,777 INFO [thread-build-cuboid-2047] utils.Repartitioner:121 : File length repartition num : 11, Row count Rpartition num: 3
2020-09-02 14:26:48,777 INFO [Scheduler 2044728756 Job f73e2bca-0c0b-4a48-b8e7-a2eacb116f52-245] job.NSparkExecutable:41 : 2020-09-02 14:26:48,777 INFO [thread-build-cuboid-2047] utils.Repartitioner:124 : Repartition size is :7
Code Block
[root@cdh-worker-2 20200902]# hadoop fs -du -h /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047
21.9 M 21.9 M /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00000-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.9 M 20.9 M /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00001-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.5 M 21.5 M /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00002-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.7 M 21.7 M /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00003-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.4 M 21.4 M /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00004-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.0 M 20.0 M /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00005-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.2 M 20.2 M /LACUS/LACUS/UserActionPrj/parquet/xx/20200104000000_20200105000000_QVV/2047/part-00006-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
Query Log
You can check "Total scan files" in kylin.log.
Code Block
2020-09-02 14:49:12,366 INFO [Query 414c15ae-ac04-2ac1-918e-3fe8c97a86bd-99] service.QueryService:387 :
==========================[QUERY]===============================
Query Id: 414c15ae-ac04-2ac1-918e-3fe8c97a86bd
SQL: xxx
...
Total scan count: 1409511
Total scan files: 4
Total metadata time: 0ms
Total spark scan time: 386ms
Total scan bytes: 67481914
Result row count: 500
...
==========================[QUERY]===============================
...