Apache Kylin : Analytical Data Warehouse for Big Data

Page tree

Welcome to Kylin Wiki.

# Kylin 5.0 支持推荐索引

### 1. 背景

- Kylin 作为一个强大的 OLAP 引擎,其核心功能之一是通过预计算和索引来加速查询。然而,当前的模型定义和索引创建过程对用户来说较为复杂,特别是在设计高效的聚合组时面临挑战。
- 系统已经具备存储查询历史的能力,这为我们提供了一个宝贵的机会。通过分析这些历史数据,我们可以智能地优化现有索引结构,不仅能提升查询性能,还可以更有效地利用存储资源。
- 随着系统使用时间的增长,索引数量可能会不断增加。这不仅会导致构建时间和资源消耗的增加,还可能影响整体系统性能。因此,建立一个机制来识别和移除不再有价值的索引变得尤为重要。

通过实现这些优化,我们旨在简化 Kylin 的使用流程,提高系统的自适应能力,并在性能和资源利用之间取得更好的平衡。

### 2. 一些基本概念

- 查询历史:存储在系统中的用户查询语句,包括:命中索引、被下推到其它执行引擎(Spark)、查询失败等情况。
- 有效的查询历史:可以被加速的那一类查询历史。一个基本假定,使用它们推荐出来索引一般来说会获得更好的查询性能,但这不是绝对的,数据量越大,效果越明显。具体来说,有效的查询历史包括:
    - 没有完美击中系统中已经存在的索引
    - 碰到 NoRealizationFound 异常被下推到其它引擎的查询
- 优化建议:查询历史生成的候选索引不会直接修改模型或者索引信息,而是存储到元数据表中,这个表中的每一条记录代表一条优化建议。

### 3. 设计

- 建议表结构
  
  
    | **列名** | **类型** | **备注** |
    | --- | --- | --- |
    | id | int | 建议id |
    | project | string | 项目名 |
    | model_id | string | 模型uuid |
    | unique_flag | string(200) | 唯一标识 MD5 值 |
    | semantic_version | int | 模型的重大变更Version |
    | type | tinyint | 建议的类型:COMPUTED_COLUMN[1], DIMENSION[2], MEASURE[3], ADDITIONAL_LAYOUT[4], REMOVAL_LAYOUT[5] |
    | state | tinyint | 建议的状态:INITIAL[0], RECOMMENDED[1], APPLIED[2], DISCARD[3], BROKEN[4] |
    | rec_entity | text | 建议对应的实体信息 |
    | depend_ids | text | 该建议所对应的所有外部依赖 |
    | create_time | long | 建议的创建时间 |
    | update_time | long | 建议的更新时间 |
    | layout_metric | text | 这个layout每一天的使用次数和每一天的所有查询的查询延迟之和 |
    | cost | double | 不接受这个建议预估的成本 |
    | total_latency_of_last_day | double | layout在前一天的所有查询的查询延迟之和。每小时更新一次 |
    | hit_count | int | layout被击中的总次数 |
    | total_time | double | 击中这个layout的所有查询的延迟之和 |
    | max_time | double | 击中这个layout的所有查询的延迟中的最大值 |
    | min_time | double | 击中这个layout的所有查询的延迟中的最小值 |
    | query_history_info | text |  |
    | reserved_field1 | string | 建议的来源。对于 REMOVAL_LAYOUT 建议,它的来源有三种: LOW_FREQUENCY , INCLUDED , SIMILAR 。对于 ADDITIONAL_LAYOUT 建议,它的来源有两种: QUERY_HISTORY , IMPORTED |
    | reserved_field2 | longblob |  |
    | reserved_field3 | longblob |  |
    
    关于表结构的特殊说明
    
    - 主键:id,在插入初始建议表的时候会自增
        - 达到整型的最大值大约21亿的时候无法处理,系统无法处理(当前的Kylin限制)
    - 组合唯一键:[project, model_id, unique_flag, semantic_version]
        - 组合唯一键如果说同样的建议出现状态不同的时候会不会导致插入失败?这个请参考后面的保存逻辑部分。
    - depend_ids: 这条建议所对应的所有外部依赖。如果依赖的是模型上的列那么取列的id,如果依赖的是数据库中其他的建议,取建议的id的相反数。对于可计算列是指可计算列表达式中使用到的模型上的列,对于维度是指维度对应的那个列,对于度量是指度量的参数中使用到的列,对于 layout 则是它的 colOrder 所对应的维度或者度量。
        - 可计算列使用到的列在模型上已经不存在,那么这个可计算列会变成 BROKEN 状态,依赖这个 layout 都会变成 BROKEN
        - 度量中的参数不是列而是一个常量,那么此时的依赖会使用 Integer.MAX_VALUE
        - layout 建议对应的维度或者度量在model上已经存在,那么依赖的id就是它们在model上的id,如果不存在,那么取初始优化建议表中保存下来的维度、度量在数据库中的 id 的相反数。
    - rec_entity: 生成的建议收集的实体信息,它对应于 RecItemV2 类,它的基本属性包括 createTime、uniqueContent、uuid,它有四个子类分别用于表示 CC、dimension、measure、layout。uniqueContent 会记录这个实体的特征信息:对于dimension 是它依赖的列具体信息,对于measure是它依赖的聚合函数名称和聚合函数参数,对于 CC 是它的 innerExpression,对于 layout 暂时用不到这个信息。
    - reserved_field1: 建议的来源,包括 LOW_FREQUENCY , INCLUDED , SIMILAR , QUERY_HISTORY , IMPORTED 五种。
- 查询历史生成建议
  
    通过定时任务调度读取查询历史表中的记录,过滤出有效的查询历史,然后再转换成优化建议。基于查询历史生成建议只能对已有的模型进行优化,也就是不能创建新的模型(避免过了一段时间后,系统中会莫名其妙多出来很多模型的尴尬情况)。
    
- 导入 SQL 模式创建模型或者生成索引
  
    在模型界面导入 SQL 或者 调用 kylin 提供的 API 得到需要的索引(API 的方式可以得到建议)。一般来说这种主要用于定向优化。
    
- 推荐优化建议
    - 定时任务调度裁汰系统中的索引:通过分析系统中索引是否相似、包含或者低频使用达到淘汰索引的目的。
    - 分析建议表中的候选建议,基于统计信息选择候选价值大的N条建议推荐给用户。
- 批准优化建议转换成索引(会更改模型和索引)
  
    解析建议的依赖关系,分析建议是否失效,然后将它们转换成可计算列、维度、度量、索引并更新模型和系统中的索引集合。
    

# Kylin 5.0 Support for Recommended Indexes

### 1. Background

- Kylin, as a powerful OLAP engine, has one of its core functions in accelerating queries through pre-computation and indexing. However, the current model definition and index creation process is relatively complex for users, especially when designing efficient aggregation groups.
- The system already has the ability to store query history, which provides us with a valuable opportunity. By analyzing this historical data, we can intelligently optimize existing index structures, not only improving query performance but also more effectively utilizing storage resources.
- As the system usage time increases, the number of indexes may continue to grow. This not only leads to increased build time and resource consumption but may also affect overall system performance. Therefore, establishing a mechanism to identify and remove indexes that are no longer valuable becomes particularly important.

By implementing these optimizations, we aim to simplify the usage process of Kylin, enhance the system's adaptive capabilities, and achieve a better balance between performance and resource utilization.

### 2. Some Basic Concepts

- Query History: User query statements stored in the system, including: hit indexes, pushed down to other execution engines (Spark), query failures, etc.
- Effective Query History: The type of query history that can be accelerated. A basic assumption is that using them to recommend indexes will generally result in better query performance, but this is not absolute; the larger the data volume, the more obvious the effect. Specifically, effective query history includes:
    - Queries that did not perfectly hit existing indexes in the system
    - Queries that encountered NoRealizationFound exceptions and were pushed down to other engines
- Optimization Suggestions: Candidate indexes generated from query history will not directly modify model or index information, but will be stored in a metadata table. Each record in this table represents an optimization suggestion.

### 3. Design

- Suggestion Table Structure
  
  
    | **Column Name** | **Type** | **Remarks** |
    | --- | --- | --- |
    | id | int | Suggestion ID |
    | project | string | Project name |
    | model_id | string | Model UUID |
    | unique_flag | string(200) | Unique identifier MD5 value |
    | semantic_version | int | Major change version of the model |
    | type | tinyint | Suggestion type: COMPUTED_COLUMN[1], DIMENSION[2], MEASURE[3], ADDITIONAL_LAYOUT[4], REMOVAL_LAYOUT[5] |
    | state | tinyint | Suggestion state: INITIAL[0], RECOMMENDED[1], APPLIED[2], DISCARD[3], BROKEN[4] |
    | rec_entity | text | Entity information corresponding to the suggestion |
    | depend_ids | text | All external dependencies corresponding to this suggestion |
    | create_time | long | Creation time of the suggestion |
    | update_time | long | Update time of the suggestion |
    | layout_metric | text | The number of uses of this layout each day and the sum of query latencies for all queries each day |
    | cost | double | Estimated cost of not accepting this suggestion |
    | total_latency_of_last_day | double | Sum of query latencies for all queries on the layout in the previous day. Updated hourly |
    | hit_count | int | Total number of times the layout was hit |
    | total_time | double | Sum of latencies for all queries that hit this layout |
    | max_time | double | Maximum latency among all queries that hit this layout |
    | min_time | double | Minimum latency among all queries that hit this layout |
    | query_history_info | text |  |
    | reserved_field1 | string | Source of the suggestion. For REMOVAL_LAYOUT suggestions, there are three sources: LOW_FREQUENCY, INCLUDED, SIMILAR. For ADDITIONAL_LAYOUT suggestions, there are two sources: QUERY_HISTORY, IMPORTED |
    | reserved_field2 | longblob |  |
    | reserved_field3 | longblob |  |
    
    Special notes about the table structure
    
    - Primary key: id, which will auto-increment when inserting into the initial suggestion table
        - When it reaches the maximum value of integer (about 2.1 billion), it cannot be processed (current Kylin limitation)
    - Composite unique key: [project, model_id, unique_flag, semantic_version]
        - Will the composite unique key cause insertion failure if the same suggestion appears with different states? Please refer to the saving logic section later.
    - depend_ids: All external dependencies corresponding to this suggestion. If it depends on a column in the model, take the column's id; if it depends on other suggestions in the database, take the negative of the suggestion's id. For computed columns, it refers to the columns used in the computed column expression; for dimensions, it refers to the column corresponding to the dimension; for measures, it refers to the columns used in the measure's parameters; for layouts, it refers to the dimensions or measures corresponding to its colOrder.
        - If the column used by a computed column no longer exists in the model, that computed column will become BROKEN, and all layouts depending on it will become BROKEN
        - If the parameter in a measure is not a column but a constant, then the dependency will use Integer.MAX_VALUE
        - If the dimension or measure corresponding to a layout suggestion already exists in the model, the dependency id is their id in the model; if it doesn't exist, take the negative of the id of the dimension or measure saved in the initial optimization suggestion table in the database.
    - rec_entity: Entity information collected for the generated suggestion, corresponding to the RecItemV2 class. Its basic properties include createTime, uniqueContent, uuid, and it has four subclasses used to represent CC, dimension, measure, and layout. uniqueContent will record the characteristic information of this entity: for dimension, it's the specific information of the dependent column; for measure, it's the name and parameters of the dependent aggregate function; for CC, it's its innerExpression; for layout, this information is not used currently.
    - reserved_field1: Source of the suggestion, including five types: LOW_FREQUENCY, INCLUDED, SIMILAR, QUERY_HISTORY, IMPORTED.
- Generating Suggestions from Query History
  
    Read records from the query history table through scheduled tasks, filter out effective query histories, and then convert them into optimization suggestions. Suggestions based on query history can only optimize existing models, that is, they cannot create new models (to avoid the awkward situation where many new models appear in the system inexplicably after a period of time).
    
- Creating Models or Generating Indexes through Imported SQL Mode
  
    Import SQL in the model interface or call the API provided by Kylin to get the required indexes (suggestions can be obtained through the API method). Generally, this is mainly used for targeted optimization.
    
- Recommending Optimization Suggestions
    - Scheduled tasks to prune indexes in the system: Achieve the purpose of eliminating indexes by analyzing whether the indexes in the system are similar, included, or infrequently used.
    - Analyze candidate suggestions in the suggestion table, select N suggestions with high candidate value based on statistical information, and recommend them to users.
- Approving Optimization Suggestions to Convert into Indexes (will modify models and indexes)
  
    Parse the dependency relationships of suggestions, analyze whether suggestions are invalid, then convert them into computed columns, dimensions, measures, indexes, and update the models and index collections in the system.

  • No labels