Apache Kylin : Analytical Data Warehouse for Big Data


Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

  • For Hudi data lake source integration:
    • Allow Kylin to read its source input from Hudi-format datasets in an enterprise's raw or curated data lake layers.
  • For Kylin cube rebuild and merge optimization (out of scope):
    • The idea would be to store cuboids in Hudi format and to speed up Kylin's cube rebuild and merge process by using Hudi's upsert and incremental view query to extract only the source data that changed since the last cube build. However, a cube rebuild has to recompute aggregates (sum, count, etc.) over the whole raw flat table, so incrementally updating the flat table would not improve overall rebuild and merge performance much. It would also require major architectural changes to the cuboid storage.
    • This part of the work is therefore outside the scope of this KIP.
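
The point about sums and counts can be illustrated with a small Java sketch. It is only a toy model under stated assumptions: the flat table is a map from a record key to a measure, and the class and method names are hypothetical, not Kylin or Hudi code. It shows that adding only the rows returned by an incremental pull onto a previous sum gives a wrong result once an existing row has been upserted, so the whole flat table still has to be scanned.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a flat table: record key -> measure value (illustrative only). */
final class IncrementalSumDemo {
    private IncrementalSumDemo() {}

    /** Full recomputation: scans every row of the flat table. */
    static long fullSum(Map<String, Long> table) {
        long sum = 0;
        for (long value : table.values()) {
            sum += value;
        }
        return sum;
    }

    /** Naive incremental update: adds only the changed rows onto the previous aggregate. */
    static long naiveIncrementalSum(long previousSum, Map<String, Long> changedRows) {
        return previousSum + fullSum(changedRows);
    }

    public static void main(String[] args) {
        Map<String, Long> flatTable = new LinkedHashMap<>();
        flatTable.put("order-1", 10L);
        flatTable.put("order-2", 20L);
        long before = fullSum(flatTable); // 30

        // An upsert rewrites order-1; an incremental pull returns only this row.
        Map<String, Long> changed = new LinkedHashMap<>();
        changed.put("order-1", 15L);
        flatTable.putAll(changed);

        System.out.println("naive incremental: " + naiveIncrementalSum(before, changed)); // 45
        System.out.println("full recompute:    " + fullSum(flatTable));                   // 35
    }
}
```

Correcting the naive result would require the old value of every updated row, which the incremental pull alone does not provide.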

Q2. What problem is this proposal NOT designed to solve?

  • Other types of data sources (e.g. Kafka) that do not support Hudi are out of scope.
  • The streaming cube engine is out of scope.

Q3. How is it done today, and what are the limits of current practice?

  • Currently, Kylin connects directly to the Hive source through the Beeline JDBC mechanism, regardless of whether the input format is Hudi or not.
  • Customers write Hudi data into their raw or curated layers in several ways, such as Spark DataFrames or Spark SQL, so Hive does not always have full knowledge of the Hudi source format when Kylin extracts the source dataset.

Q4. What is new in your approach and why do you think it will be successful?

  • For Hudi source integration:
    • New approach
      • Accelerate Kylin's cube building process by using Hudi's native read-optimized view query on MOR (Merge-On-Read) tables
    • Why it will be successful
      • Hudi is released and mature in the big data domain and tech stack, and many companies already use it in their data lake raw and curated layers
      • The Hudi library is already integrated with Spark DataFrames and Spark SQL, which enables Kylin's Spark engine to query a Hudi source
      • Hudi's Parquet base files, Avro log files, and index metadata can be exposed through a Hive external table and input format definition, which Kylin can use to perform the extraction
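
As a rough sketch of how Kylin's Spark engine might choose a Hudi query type, the Java snippet below builds the read options for Hudi's Spark DataSource. The option keys `hoodie.datasource.query.type` and `hoodie.datasource.read.begin.instanttime` are taken from Hudi's querying documentation; the class and method names are made up for illustration and are not part of Kylin or Hudi.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds Hudi Spark DataSource read options; class and method names are illustrative. */
final class HudiReadOptions {
    // Option keys defined by the Hudi Spark DataSource.
    static final String QUERY_TYPE = "hoodie.datasource.query.type";
    static final String BEGIN_INSTANT = "hoodie.datasource.read.begin.instanttime";

    private HudiReadOptions() {}

    /** Read-optimized query: on a MOR table this reads only the Parquet base files. */
    static Map<String, String> readOptimized() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(QUERY_TYPE, "read_optimized");
        return opts;
    }

    /** Snapshot query: merges base files with the log files at read time. */
    static Map<String, String> snapshot() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(QUERY_TYPE, "snapshot");
        return opts;
    }

    /** Incremental query: returns only records written after the given commit instant. */
    static Map<String, String> incremental(String beginInstant) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(QUERY_TYPE, "incremental");
        opts.put(BEGIN_INSTANT, beginInstant);
        return opts;
    }

    public static void main(String[] args) {
        // With Spark and the Hudi bundle on the classpath, the map would be used as:
        //   spark.read().format("hudi").options(opts).load(basePath)
        System.out.println(readOptimized());
        System.out.println(incremental("20200101000000"));
    }
}
```

The read-optimized variant trades data freshness for scan speed, which fits a batch cube build; the snapshot variant would be the choice when the latest uncompacted changes must be included.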

Q5. Who cares? If you are successful, what difference will it make?

  • Data scientists doing data mining, exploration, reporting, etc. will get shorter cube build times if the new integration feature is enabled in Kylin

Q6. What are the risks?

No additional risks are expected: this is only an alternative configuration option for the Hudi source type, and other Kylin components and pipelines will not be affected.

Q7. How long will it take?

N/A

Q8. How does it work?

The overall logical design is as follows:

  • For Hudi source integration:

    • Add new config items in kylin.properties for the Hudi source type (e.g. isHudiSource=true, HudiType=MOR)
    • Add a new ISource interface implementation using the Hudi native client API
    • Use the Hudi client API's read-optimized view query on top of the Hive external table to extract the source Hudi dataset
  • For Hudi cuboid storage (out of scope):

    • Add a new config item in kylin.properties for the Hudi cuboid storage type (e.g. isHudiCuboidStorage=true)
    • Add a new ITarget interface and implementation using the Hudi write API for internal storage of and operations on cuboid files
  • For cube rebuild with the new Hudi source type (out of scope):

    • Use Hudi's incremental query API to extract only the data changed since the timestamp of the last cube segment
    • Use Hudi's upsert API to merge the changed data with the cuboid's historical data
  • For cube merge with the new Hudi cuboid storage type (out of scope):

    • Use Hudi's upsert API to merge the two cuboid files
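
The configuration step for the Hudi source type could look roughly like the Java sketch below. It is a minimal illustration, not the proposed implementation: the key names follow the examples given in this proposal (with the spelling isHudiSource), the class `HudiSourceConfig`, the `sourceKind` helper, and the COW default are all assumptions, and none of them exist in Kylin today.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Properties;

/** Hypothetical holder for the Hudi source settings sketched in this proposal. */
final class HudiSourceConfig {
    /** Hudi table types: Copy-On-Write and Merge-On-Read. */
    enum TableType { COW, MOR }

    final boolean hudiSource;
    final TableType tableType;

    private HudiSourceConfig(boolean hudiSource, TableType tableType) {
        this.hudiSource = hudiSource;
        this.tableType = tableType;
    }

    /** Reads the two example keys; the key names and defaults are assumptions. */
    static HudiSourceConfig from(Properties props) {
        boolean enabled = Boolean.parseBoolean(props.getProperty("isHudiSource", "false"));
        String type = props.getProperty("HudiType", "COW").trim().toUpperCase(Locale.ROOT);
        return new HudiSourceConfig(enabled, TableType.valueOf(type));
    }

    /** Picks the source implementation: the Hudi one only when the flag is set. */
    String sourceKind() {
        return hudiSource ? "hudi" : "hive";
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the relevant lines of kylin.properties.
        Properties props = new Properties();
        props.load(new StringReader("isHudiSource=true\nHudiType=MOR\n"));

        HudiSourceConfig cfg = HudiSourceConfig.from(props);
        System.out.println(cfg.sourceKind() + " / " + cfg.tableType);
    }
}
```

Keeping the flag off by default means existing Hive-based projects keep their current behaviour unless Hudi is explicitly enabled.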

Reference

Hudi framework: https://hudi.apache.org/docs/

Hive/Spark integration support for Hudi: https://hudi.apache.org/docs/querying_data.html
