Apache Kylin : Analytical Data Warehouse for Big Data
Welcome to Kylin Wiki.
Background
Build Engine
As you may know, pre-calculated cuboid data is stored in parquet files in Kylin 4. Before saving it to HDFS/object storage, Kylin repartitions the pre-calculated cuboid data. This article introduces how this data is repartitioned and how that affects query performance.
When you specify a shard by column, Kylin repartitions (a transformation of the Spark DataFrame) by that column, and the partition number is calculated from the following configurations (kylin.storage.columnar.shard-size-mb, kylin.storage.columnar.shard-rowcount). If you do not specify a shard by column, the repartition is done only by partition number.
Query Engine
For example, suppose there is a high-cardinality column called seller_id, and queries in our scenario often filter on this column. Here are some sample SQL queries:
```sql
select count(*) from kylin_sales left join kylin_order where SELLER_ID = '10000233'
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233, 10000234, 10000235)
select count(*) from kylin_sales left join kylin_order where SELLER_ID is NULL
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233, 10000234, 10000235) and SELLER_ID = 10000233
select count(*) from kylin_sales left join kylin_order where SELLER_ID = 10000233 or SELLER_ID = 1
```
When the filter in a SQL query contains one of the following operators on the shard by column:
- Equality
- In
- IsNull
the Query Engine can prune all parquet files which do not contain the specified values. So we suggest choosing as the shard by column a column that both has high cardinality and is frequently used in the where clause.
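The pruning idea can be sketched as follows. Because the data is hash-repartitioned by the shard by column, each value lands in exactly one partition (one parquet file), so the engine only needs to read the files whose shard index matches some filter value. The function names here are illustrative, and plain `hash()` stands in for the partitioner's real hash function (Spark uses Murmur3 internally):

```python
def shard_of(value, num_shards):
    # Hypothetical stand-in for the partitioner's hash function;
    # any stable hash illustrates the idea.
    return hash(value) % num_shards

def files_to_scan(filter_values, num_shards):
    # Each value can live in only one shard, so only the files whose
    # shard index matches some filter value need to be read;
    # all other files are pruned.
    return sorted({shard_of(v, num_shards) for v in filter_values})
```

With three literals in an `IN` filter and, say, 7 files, at most 3 of the 7 files are scanned; the other files are skipped entirely.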
How to use
Model design
Edit the model and add the dimension seller_id. Note that the type of the dimension should be normal, not derived.
Cube Design
From Cube Designer → Advanced Setting → Rowkeys, find the column seller_id and set its shard by to true. Note that currently only one shard by column is supported, so at most one column should have shard by set to true.
Advanced configuration
For each cuboid, you can control the count (and the size) of its parquet files. The following configurations are supported at cube level.
Key | Default value | Description |
---|---|---|
kylin.storage.columnar.shard-size-mb | 128 | The max size of each parquet file, in MB. |
kylin.storage.columnar.shard-rowcount | 2500000 | Each parquet file should contain at most 2.5 million rows. |
kylin.storage.columnar.shard-countdistinct-rowcount | 1000000 | Since Bitmap values are larger, a separate max row count can be set for cuboids that contain a Bitmap (count distinct) measure. By default such a file contains at most 1.0 million rows. |
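For example, to make each cuboid produce smaller files, you could override these at cube level (the values below match the Build Segment example that follows and are only illustrative):

```
kylin.storage.columnar.shard-size-mb=15
kylin.storage.columnar.shard-rowcount=10000
```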
Build Segment
Considering the total file size, which is 147.7 M, the file count by size should be 147.7 M / 15 M = 11.
Considering the total row count, the file count by row count should be 300000 / 10000 = 3.
So the final file count is the average of the two: (11 + 3) / 2 = 7.
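The calculation above can be sketched as follows. This is a simplification of the averaging logic described here; the function name and the exact rounding are assumptions, so the result may differ slightly from Kylin's implementation at the boundaries:

```python
import math

def file_count(total_size_mb, total_rows,
               shard_size_mb=128, shard_rowcount=2_500_000):
    # Estimate the file count twice: once from the total size,
    # once from the total row count, then average the two estimates.
    by_size = math.ceil(total_size_mb / shard_size_mb)
    by_rows = math.ceil(total_rows / shard_rowcount)
    return max(1, math.ceil((by_size + by_rows) / 2))
```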
```
[root@cdh-worker-2 20200902]# hadoop fs -du -h /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047
21.9 M  21.9 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00000-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.9 M  20.9 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00001-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.5 M  21.5 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00002-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.7 M  21.7 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00003-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
21.4 M  21.4 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00004-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.0 M  20.0 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00005-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
20.2 M  20.2 M  /LACUS/LACUS/UserActionPrj/parquet/xxx/20200104000000_20200105000000_QVV/2047/part-00006-7b12421d-1e79-45fd-8c8e-ba3628d5db3d-c000.snappy.parquet
```
Query Log
You can check "Total scan files" in kylin.log.
```
2020-09-02 14:49:12,366 INFO [Query 414c15ae-ac04-2ac1-918e-3fe8c97a86bd-99] service.QueryService:387 :
==========================[QUERY]===============================
Query Id: 414c15ae-ac04-2ac1-918e-3fe8c97a86bd
SQL: xxx
...
Total scan count: 1409511
Total scan files: 4
Total metadata time: 0ms
Total spark scan time: 386ms
Total scan bytes: 67481914
Result row count: 500
...
==========================[QUERY]===============================
```
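If you want to pull this counter out of the log programmatically rather than by eye, a small extraction helper is enough. The function name and regex below are illustrative, not part of Kylin:

```python
import re

def total_scan_files(log_text):
    # Extract the "Total scan files" counter from a query summary block.
    m = re.search(r"Total scan files:\s*(\d+)", log_text)
    return int(m.group(1)) if m else None
```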