Apache Kylin : Analytical Data Warehouse for Big Data
Page History
...
Parquet file schema:
1: OPTIONAL INT64 R:0 D:1
2: REQUIRED INT64DOUBLE R:0 D:0
3: OPTIONAL INT64 R:0 D:1
110000: OPTIONAL INT64 R:0 D:1
110001: OPTIONAL INT64 R:0 D:1
- "REQUIRED" and "OPTIONAL" correspond to "nullable" in database system.
- Parquet data type includes BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE and BYTE_ARRAY. The data with string type in hive will be stored as BYTE_ARRAY in parquet.(ShaofengShi: Why the column "2" is "REQUIRED" but others are "OPTIONAL", please explain. And why their data types are all "INT64"? )
How to deal with the order of dimension and measure
- In a parquet file, the order of the columns is always dimension first and measure last
- There is no order between dimensions and between measures
Parquet file split
- parquet.block.size default 128mb
- (ShaofengShi: How many row groups in a parquet file?)
...
Type | Spark | Parquet |
---|---|---|
Numeric types | ByteType | INT32 |
Numeric types | ShortType | INT32 |
Numeric types | IntegerType | INT32 |
Numeric types | LongType | INT64 |
Numeric types | FloatType | FLOAT |
Numeric types | DoubleType | DOUBLE |
Numeric types | DecimalType | INT32,INT64,BinaryType,FIXED_LEN_BYTE_ARRAY |
String type | StringTypeBinary | BYTE_ARRAY |
Binary type | BinaryTypeBinary | BYTE_ARRAY |
Boolean type | BooleanType | BOOLEAN |
Datetime type | TimestampType | INT96 |
Datetime type | DateType | INT32 |
- How computed columns are stored
- Bitmap: Binary BYTE_ARRAY
- TopN: Binary BYTE_ARRAY
5. How to build Cube into Parquet
...
- What are the optimizations of Kylin reading parquet data?
- Segment Pruning
- Shard by
- Parquet page index
- Project Pushdown
- Predicate Pushdown
7. Performance
...
Build
- Use TPCH as the dataset to remember the test (ShaofengShi: What's the cluster configuration? What's the model/cube design? )
- The detailed data is as follows
Kyligence provides dataset tool for SSB and TPC-H which contains test SQL case, the repositories are as follows:
- Environment
- 4 nodes hadoop cluster
- Yarn queue has 400G memory and 128 cpu cores
Build(Over SSB)
Query(Over SSB and TPC-H)
Query
...
8. Next step
Overview
Content Tools
ThemeBuilder
Apps