Apache Kylin : Analytical Data Warehouse for Big Data

...

Parquet file schema:
    1:           OPTIONAL INT64 R:0 D:1
    2:           REQUIRED DOUBLE R:0 D:0
    3:           OPTIONAL INT64 R:0 D:1
    110000:      OPTIONAL INT64 R:0 D:1
    110001:      OPTIONAL INT64 R:0 D:1
  • "REQUIRED" and "OPTIONAL" correspond to non-nullable and nullable columns in a database system.
  • Parquet's physical data types are BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, and BYTE_ARRAY. A column with string type in Hive is stored as BYTE_ARRAY in Parquet. (ShaofengShi: Why is column "2" "REQUIRED" while the others are "OPTIONAL"? And why are their data types all "INT64"? Please explain.)

  • How to deal with the order of dimension and measure

    • In a Parquet file, the columns are always ordered with dimensions first and measures last
    • There is no defined order among the dimensions themselves, nor among the measures
  • Parquet file split

    • parquet.block.size (the row group size) defaults to 128 MB
    • (ShaofengShi: How many row groups in a parquet file?)

...

Type             Spark type      Parquet type
Numeric types    ByteType        INT32
Numeric types    ShortType       INT32
Numeric types    IntegerType     INT32
Numeric types    LongType        INT64
Numeric types    FloatType       FLOAT
Numeric types    DoubleType      DOUBLE
Numeric types    DecimalType     INT32, INT64, BINARY, or FIXED_LEN_BYTE_ARRAY (depending on precision)
String type      StringType      BYTE_ARRAY
Binary type      BinaryType      BYTE_ARRAY
Boolean type     BooleanType     BOOLEAN
Datetime type    TimestampType   INT96
Datetime type    DateType        INT32
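For reference, the mapping above can be captured as a plain lookup table. This dict is purely illustrative, not a Kylin or Spark API:

```python
# Illustrative Spark -> Parquet physical type mapping (mirrors the table above).
SPARK_TO_PARQUET = {
    "ByteType": "INT32",
    "ShortType": "INT32",
    "IntegerType": "INT32",
    "LongType": "INT64",
    "FloatType": "FLOAT",
    "DoubleType": "DOUBLE",
    # DecimalType's physical type depends on the decimal's precision.
    "DecimalType": ("INT32", "INT64", "BINARY", "FIXED_LEN_BYTE_ARRAY"),
    "StringType": "BYTE_ARRAY",
    "BinaryType": "BYTE_ARRAY",
    "BooleanType": "BOOLEAN",
    "TimestampType": "INT96",
    "DateType": "INT32",
}

print(SPARK_TO_PARQUET["StringType"])  # BYTE_ARRAY
```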
  • How computed columns are stored
    • Bitmap: Spark BinaryType, stored as BYTE_ARRAY
    • TopN: Spark BinaryType, stored as BYTE_ARRAY
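That is, these measures are serialized into a single binary cell per row. A hedged stand-in sketch: Kylin actually serializes RoaringBitmap structures, not this simple struct packing, but the idea of "complex measure -> one BYTE_ARRAY value" is the same:

```python
# Sketch: serialize a set of row ids to bytes, as a stand-in for the
# RoaringBitmap blob Kylin stores in a BYTE_ARRAY measure column.
import struct

def serialize_bitmap(ids):
    """Pack a count header plus sorted 64-bit ids into one binary blob."""
    sorted_ids = sorted(ids)
    return struct.pack(f"<I{len(sorted_ids)}Q", len(sorted_ids), *sorted_ids)

def deserialize_bitmap(blob):
    """Recover the id set from the blob."""
    (n,) = struct.unpack_from("<I", blob)
    return set(struct.unpack_from(f"<{n}Q", blob, 4))

blob = serialize_bitmap({3, 1, 7})
print(len(blob))                 # 28 (4-byte count + 3 * 8-byte ids)
print(deserialize_bitmap(blob))  # {1, 3, 7}
```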

5. How to build Cube into Parquet

...

  • What are the optimizations of Kylin reading parquet data?
    • Segment Pruning
    • Shard by
    • Parquet page index
    • Project Pushdown
    • Predicate Pushdown
        

7. Performance

...

Build

  • TPC-H is used as the dataset for the benchmark test (ShaofengShi: What's the cluster configuration? What's the model/cube design?)
  • The detailed data is as follows:
    (image removed)

      Kyligence provides dataset tools for SSB and TPC-H that contain test SQL cases; the repositories are as follows:


  • Environment
    • 4-node Hadoop cluster
    • YARN queue with 400 GB memory and 128 CPU cores
  • Build (over SSB)
    (benchmark results attached as images)
  • Query (over SSB and TPC-H)
    (benchmark results attached as images)

Query 

...

8. Next step