Apache Kylin : Analytical Data Warehouse for Big Data



Authors: Yiming Xu, Mingming Ge

1. Background: Why Kylin on Parquet

Currently, Kylin uses Apache HBase as the storage for OLAP cubes.

HBase is very fast, but it also has some drawbacks:

  • HBase is not a real columnar storage;
  • HBase has no secondary index; the rowkey is the only index;
  • HBase has no encoding; Kylin has to do the encoding by itself;
  • HBase is not a good fit for cloud deployment and auto-scaling;
  • HBase has different API versions and has compatibility issues (e.g. 0.98, 1.0, 1.1, 2.0);
  • HBase has different vendor releases and has compatibility issues (e.g. Cloudera's release is not compatible with others);
      

This proposal is to use Apache Parquet + Spark to replace HBase:

  • Parquet is an open-source columnar file format;
  • Parquet is more cloud-friendly and works with most file systems, including HDFS, S3, Azure Blob store, Ali OSS, etc.;
  • Parquet can integrate very well with Hadoop, Hive, Spark, Impala, and others;
  • Parquet supports custom indexes;
  • Parquet is mature and stable;

2. Parquet file layouts on HDFS

  • Storage layout is important for I/O optimization

  • Do as much pruning as possible before reading the files

    • Filter by folder, file name, etc. (see the sketch after this list)
  • Each Cuboid uses a dedicated folder

  • Cube

    • Segment A
      • Cuboid-1111
        • part-0000-XXX.snappy.parquet
        • part-0001-XXX.snappy.parquet
      • Cuboid-1001
        • part-0000-XXX.snappy.parquet
        • part-0001-XXX.snappy.parquet
    • Segment B
      • Cuboid-1111
        • part-0000-XXX.snappy.parquet
        • ...
  • Advantages

    • Filter by folder is good enough
    • Can dynamically add/remove a cuboid without impacting others (ShaofengShi: Currently Kylin won't dynamically add/remove cuboid)
  • Disadvantage

    • Many folders when the cube has many cuboids (ShaofengShi: I think this will bring too many small files, it will increase the burden to HDFS, how can we overcome it?)
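
A minimal sketch of what folder-level pruning could look like from Spark, assuming the layout above; the cube root path, segment names and cuboid id are illustrative assumptions, not Kylin's actual implementation.

    import org.apache.spark.sql.SparkSession

    object CuboidPruningSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cuboid-pruning-sketch").getOrCreate()

        // Hypothetical layout: <cubeRoot>/<segment>/<cuboid-id>/part-*.snappy.parquet
        val cubeRoot = "hdfs:///kylin/cube1"            // assumed path
        val segments = Seq("segment_A", "segment_B")    // segments that survive segment pruning
        val cuboidId = "1111"                           // cuboid chosen for the query

        // Read only the folders matching the pruned segments and the chosen cuboid;
        // all other segments/cuboids are skipped before any Parquet file is opened.
        val paths = segments.map(s => s"$cubeRoot/$s/$cuboidId")
        val cuboidData = spark.read.parquet(paths: _*)

        cuboidData.show()
        spark.stop()
      }
    }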

3. Dimension/measure layouts in Parquet

  • Dimension and measure layouts in parquet files
    If there is a dimension combination of [d1, d2, d3] and measures of [m1, m2], then a parquet file like the one below will be generated:
    Columns 1, 2, and 3 correspond to dimensions d1, d2, and d3, respectively;
    Columns 110000 and 110001 correspond to measures m1 and m2, respectively (a sketch of writing such a file follows the schema notes below)
  • (ShaofengShi: Is "1", "110000" the column name in parquet? or the column name should be the original column name?)
  • (ShaofengShi: How does the "110000" number come? )


Parquet file schema:
    1:           OPTIONAL INT64 R:0 D:1
    2:           REQUIRED DOUBLE R:0 D:0
    3:           OPTIONAL INT64 R:0 D:1
    110000:      OPTIONAL INT64 R:0 D:1
    110001:      OPTIONAL INT64 R:0 D:1
  • "REQUIRED" and "OPTIONAL" correspond to NOT NULL and nullable columns in a database system.
  • Parquet's physical data types are BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY. Data with string type in Hive will be stored as BYTE_ARRAY in Parquet.
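
A minimal sketch of producing a cuboid file with this layout from Spark. The id-style column names ("1", "2", "3", "110000", "110001") simply follow the example above; how Kylin actually assigns these ids is still an open question per the comments above.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{count, sum}

    object CuboidLayoutSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cuboid-layout-sketch").getOrCreate()
        import spark.implicits._

        // Hypothetical flat-table rows: dimensions d1, d2, d3 and a raw measure column.
        val flatTable = Seq(
          (1L, 2.0, 3L, 10L),
          (1L, 2.0, 4L, 20L)
        ).toDF("d1", "d2", "d3", "m_raw")

        // Dimensions come first with small ids; measures come last with ids from a
        // separate range, so the two kinds of columns can never collide.
        val cuboid = flatTable
          .groupBy($"d1", $"d2", $"d3")
          .agg(sum($"m_raw").as("110000"), count($"m_raw").as("110001"))
          .withColumnRenamed("d1", "1")
          .withColumnRenamed("d2", "2")
          .withColumnRenamed("d3", "3")

        cuboid.write.parquet("hdfs:///tmp/cuboid_1111")  // illustrative output path
        spark.stop()
      }
    }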

  • How to deal with the order of dimensions and measures

    • In a parquet file, the columns are always ordered with dimensions first and measures last
    • There is no particular order among the dimensions or among the measures
  • Parquet file split

    • parquet.block.size (the row group size) defaults to 128 MB (see the sketch below)
    • (ShaofengShi: How many row groups in a parquet file?)
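
A minimal sketch, using the standard Hadoop/Parquet setting, of how the row group size could be controlled from Spark; whether and how Kylin tunes this value is not specified in this document.

    import org.apache.spark.sql.SparkSession

    object ParquetBlockSizeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-block-size-sketch").getOrCreate()

        // parquet.block.size is the row group size in bytes (default 128 MB).
        // A Parquet file holds one or more row groups, so the number of row groups
        // per file depends on this value and on how much data each task writes.
        spark.sparkContext.hadoopConfiguration
          .setInt("parquet.block.size", 128 * 1024 * 1024)

        spark.range(0, 1000000).write.parquet("hdfs:///tmp/block_size_demo")  // illustrative path
        spark.stop()
      }
    }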

4. Data types mapping in Parquet

  • How is the data encoded into Parquet?
    • Kylin no longer needs to encode columns itself
    • Parquet encodes the columns as needed
  • All data types can be accurately mapped to Parquet
    • Supported via ParquetWriteSupport
      • StructType, ArrayType, MapType
    • Direct mapping for the other types:
      Type             Spark type       Parquet type
      Numeric types    ByteType         INT32
      Numeric types    ShortType        INT32
      Numeric types    IntegerType      INT32
      Numeric types    LongType         INT64
      Numeric types    FloatType        FLOAT
      Numeric types    DoubleType       DOUBLE
      Numeric types    DecimalType      INT32, INT64, BINARY, FIXED_LEN_BYTE_ARRAY
      String type      StringType       BYTE_ARRAY
      Binary type      BinaryType       BYTE_ARRAY
      Boolean type     BooleanType      BOOLEAN
      Datetime type    TimestampType    INT96
      Datetime type    DateType         INT32
  • How computed columns are stored (see the sketch after this list)
    • Bitmap: BYTE_ARRAY
    • TopN: BYTE_ARRAY
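
A minimal sketch that exercises the mapping above: a Spark schema is written to Parquet and each Spark type ends up as the physical type listed in the table. The column names, and the idea of carrying a serialized bitmap measure in a BinaryType column, are illustrative assumptions.

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    object TypeMappingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("type-mapping-sketch").getOrCreate()

        // Spark types on the left end up as the Parquet physical types noted in the comments.
        val schema = StructType(Seq(
          StructField("int_dim", IntegerType),       // -> INT32
          StructField("long_dim", LongType),         // -> INT64
          StructField("price", DoubleType),          // -> DOUBLE
          StructField("name", StringType),           // -> BYTE_ARRAY (UTF8)
          StructField("ts", TimestampType),          // -> INT96
          StructField("dt", DateType),               // -> INT32
          StructField("bitmap_blob", BinaryType)     // e.g. a serialized bitmap measure -> BYTE_ARRAY
        ))

        val rows = java.util.Arrays.asList(
          Row(1, 10L, 9.9, "a", new java.sql.Timestamp(0L), new java.sql.Date(0L), Array[Byte](1, 2, 3)))

        spark.createDataFrame(rows, schema)
          .write.parquet("hdfs:///tmp/type_mapping_demo")  // inspect with parquet-tools to see physical types

        spark.stop()
      }
    }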

5. How to build Cube into Parquet

  • Reduced build steps
    • From ten to twenty steps down to only two steps
  • Build Engine
    • Simple and clear architecture
    • Spark as the only build engine
    • All builds are done via Spark
    • Adaptively adjusts Spark parameters
    • Dimension dictionaries are no longer needed
  • Supported measures
    • Sum
    • Count
    • Min
    • Max
    • TopN
    • Bitmap
    • HyperLogLog
  • Cube into parquet (a build sketch follows below)

(ShaofengShi: this part need detailed info)
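
A minimal sketch, under assumed table names and paths, of what a Spark-only cuboid build could look like: read the segment's flat table, aggregate the dimensions of each cuboid, and write each result to that cuboid's dedicated folder. It is illustrative only, not the actual build engine.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{count, sum}

    object CubeBuildSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cube-build-sketch").getOrCreate()

        // Step 1 (assumed): materialize the flat table of the segment from the source tables.
        val flatTable = spark.read.parquet("hdfs:///kylin/flat_table/segment_A")  // illustrative path

        // Step 2 (assumed): for every cuboid, aggregate its dimensions and write the
        // result into the cuboid's own folder under the segment.
        val cuboids = Seq(
          "1111" -> Seq("d1", "d2", "d3", "d4"),
          "1001" -> Seq("d1", "d4"))

        cuboids.foreach { case (cuboidId, dims) =>
          flatTable
            .groupBy(dims.map(flatTable.col): _*)
            .agg(sum("m1").as("sum_m1"), count("m1").as("count_m1"))
            .write
            .parquet(s"hdfs:///kylin/cube1/segment_A/$cuboidId")
        }

        spark.stop()
      }
    }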

6. How to query with Parquet

  • Query Engine: Sparder
    • Uses Spark as the calculation tool
    • Distributed query engine, avoiding a single point of failure
    • Unified calculation engine for building and querying
    • A substantial increase in query performance
    • Can benefit from Spark's new features and ecosystem

  

  • The basic process of a Sparder query
    1. Parser => SQL text to an AST
    2. Validation => Further verify the validity of the SQL based on metadata
    3. Optimizer => Generate the logical plan according to optimization rules
    4. Kylin's adaptation => Convert the AST's nodes to rel nodes (various classes ending with Rel, such as FilterRel)
    5. Spark plan => Convert rel nodes to a Spark plan
    6. Query execution => Read cube data based on the generated Spark plan (a sketch of the first and last steps follows this list)
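
A minimal sketch of the two ends of this pipeline: step 1 using Calcite's SQL parser, and step 6 written by hand as a Spark job over an assumed cuboid folder. Steps 2-5 depend on Kylin's metadata and rules and are omitted; all table names and paths here are assumptions.

    import org.apache.calcite.sql.parser.SqlParser
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    object SparderPipelineSketch {
      def main(args: Array[String]): Unit = {
        // Step 1: Parser - SQL text to an AST (a SqlNode tree).
        val sql = "select d1, sum(m1) from cube_table where d2 = 'US' group by d1"
        val ast = SqlParser.create(sql).parseQuery()
        println(ast)

        // Steps 2-5 (validation, optimization, rel node conversion, Spark plan
        // generation) need Kylin's metadata and rules and are omitted here.

        // Step 6: the generated Spark plan ultimately reads the matching cuboid's
        // Parquet files and evaluates the filter/aggregation, roughly like this:
        val spark = SparkSession.builder().appName("sparder-pipeline-sketch").getOrCreate()
        spark.read.parquet("hdfs:///kylin/cube1/segment_A/1101")  // assumed cuboid folder
          .filter("d2 = 'US'")
          .groupBy("d1")
          .agg(sum("m1"))
          .show()

        spark.stop()
      }
    }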

  

  • What optimizations does Kylin apply when reading parquet data? (see the sketch after this list)
    • Segment Pruning
    • Shard by
    • Parquet page index
    • Project Pushdown
    • Predicate Pushdown
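
A minimal sketch of how some of these optimizations show up at the Spark level when reading cuboid data; the config key is a standard Spark one, while the paths and column names are assumptions. Shard-by and the Parquet page index act inside the storage/reader layer and are not shown.

    import org.apache.spark.sql.SparkSession

    object ReadOptimizationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("read-optimization-sketch").getOrCreate()

        // Predicate pushdown: let the Parquet reader skip row groups/pages using
        // column statistics (this is on by default in Spark).
        spark.conf.set("spark.sql.parquet.filterPushdown", "true")

        // Segment pruning (illustrative): list only the segment/cuboid folders whose
        // data can satisfy the query, so other folders are never touched.
        val prunedPaths = Seq("hdfs:///kylin/cube1/segment_A/1101")

        spark.read.parquet(prunedPaths: _*)
          .select("d1", "sum_m1")   // project pushdown: only the needed columns are read
          .filter("d1 = 42")        // predicate pushdown: evaluated inside the Parquet scan
          .show()

        spark.stop()
      }
    }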
        

7. Performance

      Kyligence provides dataset tools for SSB and TPC-H, which contain the test SQL cases; the repositories are as follows:


  • Environment
    • 4-node Hadoop cluster
    • YARN queue with 400 GB memory and 128 CPU cores
  • Build (over SSB)

  • Query (over SSB and TPC-H)

8. Next step
