Apache Kylin : Analytical Data Warehouse for Big Data


Welcome to Kylin Wiki.

Reference issue: KYLIN-5949


01 Background

Kylin currently uses a two-level storage structure of segments and layouts. When the user's query hardware provides sufficient concurrent computing power, precomputation delivers the expected query performance. However, this structure has significant drawbacks:

  1. Too many small index files are spread across different segments, resulting in poor storage efficiency and read I/O efficiency.
  2. The foundational (detailed/aggregate) index data files are large, and query performance against them cannot meet user expectations.
  3. The storage structure cannot be customized for a user's business scenario, leaving limited room for optimization. For example:
    1. Point queries: When users want to run point or range queries on high-cardinality columns such as UserID or phone number, they must fully scan the relevant layout to obtain results.
    2. Aggregate queries with filter conditions: Customer queries commonly filter on many fields (sometimes dozens). Kylin must hit a layout that includes all of the filter dimensions before performing the subsequent aggregation. Because Kylin precomputes over the entire data set, unlike traditional materialized views that can be precomputed against precisely defined filter conditions, a query that filters on a high-cardinality field may land on a large index, or even the foundational detailed index, so query performance fails to meet user expectations in many scenarios.


02 Dev Design

Segment logicalization:

Segments no longer manage index data; a segment is retained only as a logical concept.

Index storage as tables:

Different index types map to different table types. Tabularizing indexes makes better use of the query engine's ability to handle tables.

Extensible index storage types:

The default storage changes from Parquet to Delta Lake, and Iceberg and Hudi can be supported as quick replacements.

Dynamic tuning of build and query runtime parameters:

Execution engine parameters are tuned dynamically at runtime (build and query) according to index characteristics.

Stable query performance:

Query performance should be roughly consistent for both early and recent data.

Targeted index optimization:

For a specific query, the corresponding index can be optimized in a targeted way, enabling extreme acceleration of that query.


Storage format changes

Original Segment + Parquet storage

  1. V1 Cube results data file structure

    parquet/
    └── dc65dd61-dbe3-8f46-7d44-668b688b96c1 (Model ID)
        └── 12d2c4c1-248f-b1f8-0bdb-88b0eb9c8580 (Segment ID)
            ├── 1 (Agg Index ID)
            │   └── part-00000-393b8b08-84fc-40c6-8c2e-d579485dcc57-c000.snappy.parquet (Data)
            ├── 10001
            ├── 20001
            ├── 30001
            ├── 40001
            └── 20000000001 (Table Index ID)
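The V1 convention above can be captured by a small path helper. This is a sketch of the directory scheme shown in the tree, not Kylin code; the ID threshold for table indexes is inferred from the example IDs (small IDs for aggregate indexes, IDs of 20000000001 and up for table indexes) and should be treated as an assumption.

```python
import posixpath

# Assumed from the tree above: table-index IDs sit above this base,
# aggregate-index IDs (1, 10001, 20001, ...) below it.
TABLE_INDEX_BASE = 20_000_000_000

def v1_index_path(model_id: str, segment_id: str, index_id: int) -> str:
    """Directory holding one index's Parquet files inside one segment."""
    return posixpath.join("parquet", model_id, segment_id, str(index_id))

def is_table_index(index_id: int) -> bool:
    return index_id >= TABLE_INDEX_BASE

def total_index_dirs(num_segments: int, num_indexes: int) -> int:
    # Every segment materializes every index, so directories (and their
    # small files) multiply -- the small-file problem from the Background.
    return num_segments * num_indexes
```

`total_index_dirs` makes the first drawback concrete: a model with 100 segments and 40 indexes already scatters data across 4,000 directories.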

V3 file format - data is organized by Delta Lake and stored in Parquet format


What needs to be done is as follows:

1. Support Delta Lake as Index storage.
2. When querying, you can choose to cache the Delta Log on the driver or in RDD Cache mode.
3. V1 and V3 storage are isolated at the model level.
4. Data storage is no longer divided into segments.
5. Query storage optimization can be performed at the index level.
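Item 2 above depends on replaying the `_delta_log` commit files into the current set of live data files, which is the expensive step worth caching on the driver or in an RDD cache. Below is a minimal replay sketch; the action shapes (`add`/`remove` with a `path`) follow the Delta transaction log protocol, while the function itself is illustrative rather than Kylin or Delta library code.

```python
import json

def live_files(commits: list) -> set:
    """Replay Delta commit files into the set of live data files.

    Each element of `commits` is the text of one _delta_log/<version>.json
    file: newline-delimited JSON actions such as
    {"add": {"path": ...}} and {"remove": {"path": ...}}.
    Commits must be supplied in version order.
    """
    files = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

# Caching this resolved file set (on the driver, or distributed via an
# RDD cache) avoids re-reading and re-replaying the log for every query
# against the same index table.
```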
