Apache Kylin : Analytical Data Warehouse for Big Data
Preparation
To give readers a simple and direct view of the performance differences between Kylin 3 and Kylin 4, I ran a performance benchmark in a standard software and hardware environment. Because I am familiar with AWS products, I chose AWS EMR as the benchmark platform.
I chose TPC-H (https://github.com/Kyligence/kylin-tpch) and SSB (https://github.com/Kyligence/ssb-kylin) as the benchmark workloads. The scale factor used in this test is 10 (meaning the fact table has about 60 million rows).
The following table lists the aspects compared between the two versions in this benchmark report.
Metric/Aspect | Description |
Cubing Duration | Duration of the pre-calculation (cube building) process, which loads the source tables into Kylin. |
Cube Size | Disk space occupied by the cube/index. |
Response Time | 95th percentile of all response times recorded during a fifteen-minute serial query test. |
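To make the Response Time metric concrete, here is a minimal sketch of how a serial query test and its 95th percentile could be computed. It is not the harness used for this report: `run_serial_test`, `percentile_nearest_rank`, and the `execute` callback are hypothetical names, and the nearest-rank percentile definition is an assumption, since the report does not say which definition was used.

```python
import math
import time
from typing import Callable, List, Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile using the nearest-rank definition."""
    if not samples:
        raise ValueError("no samples recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def run_serial_test(queries: Sequence[str],
                    execute: Callable[[str], object],
                    duration_s: float) -> List[float]:
    """Run queries one at a time, round-robin, until duration_s has elapsed.

    `execute` is a caller-supplied function (hypothetical here) that sends
    one SQL statement to the system under test and blocks until it returns.
    """
    if not queries:
        raise ValueError("at least one query is required")
    latencies: List[float] = []
    deadline = time.monotonic() + duration_s
    i = 0
    while time.monotonic() < deadline:
        sql = queries[i % len(queries)]
        start = time.perf_counter()
        execute(sql)
        latencies.append(time.perf_counter() - start)
        i += 1
    return latencies
```

For a fifteen-minute run, one would call `run_serial_test(queries, execute, 15 * 60)` and then report `percentile_nearest_rank(latencies, 95)`.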
The following table shows information about software and hardware used in this performance benchmark.
Item | Value |
Instance Type | m5.4xlarge |
Node Memory | 64 GB |
Node vCPU | 16 |
Node Disk | 400 * 2; SSD |
Network Bandwidth | Up to 10 Gbps |
Node Count | A master node and four worker nodes |
Allocated Memory on Yarn | 202 GB |
Allocated Cores on Yarn | 52 |
Kylin Version | 3.1.2 & 4.0.0 |
EMR Version | 5.31 |
Hadoop Version | 2.10.0 |
HBase Version | 1.4.13 |
Benchmark Results
Figure-1 : Cubing duration of TPC-H (sf = 10)
Figure-2 : Storage size of TPC-H (sf = 10)
Figure-3 : Avg response time of SSB Query (sf=10)
Figure-4 : Avg response time of TPC-H Query (sf=10)
Conclusions
Cubing duration and cube size.
Compared with Kylin 3's MR cube engine, Kylin 4 reduces the cubing duration by 62.6%, thanks to higher resource utilization and the removal of the steps that convert cuboids into a specific data format (HFile).
In Kylin 3, cuboid files are stored in two different formats, whereas Kylin 4 uses only Parquet. Parquet offers more efficient encoding and a higher compression ratio, so the disk space used by the same cube drops by 72.56%.
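As a reading aid, the two percentages above can be converted into "times faster" and "times smaller" ratios. The sketch below is plain arithmetic on the reported figures (62.6% and 72.56%), not an additional measurement; the function names are hypothetical.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the baseline left after a reduction of reduction_pct percent."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1 - reduction_pct / 100


def improvement_factor(reduction_pct: float) -> float:
    """How many times smaller the new value is than the baseline."""
    return 1 / remaining_fraction(reduction_pct)


if __name__ == "__main__":
    # Percentages taken from the conclusions of this report.
    for label, pct in [("cubing duration", 62.6), ("cube size", 72.56)]:
        print(f"{label}: {remaining_fraction(pct):.2%} of the Kylin 3 value, "
              f"about {improvement_factor(pct):.2f}x smaller")
```

In other words, a 62.6% shorter build takes about 37.4% of the Kylin 3 time (roughly 2.67 times faster), and a 72.56% smaller cube occupies about 27.44% of the Kylin 3 disk space (roughly 3.64 times smaller).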
Figure-5 : Kylin 3 (MR engine) has lower resource utilization
Figure-6 : Kylin 4 (new Spark engine) has higher and more stable resource utilization
Query performance.
In big-query scenarios (queries that scan a large number of partitions/files and perform complex on-site calculations on them), Kylin 3 is difficult to optimize: both the HBase RegionServer and the Kylin Query Server need to be tuned repeatedly. In stress tests, the query node is unstable because it has to do post-calculation on large data sets, and query latency gets worse as time goes by. Kylin 4 removes the single-point bottleneck of the Query Server, so both response time and QPS improve markedly, and performance stays stable during the stress test. On the TPC-H query set, the response time of Kylin 4 improves by 5-7 times, and its concurrency improves by 4 times.
Figure-7 : P95 response time of TPC-H Query under different concurrency
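A measurement like the one in Figure-7 can be sketched as follows. This is a hypothetical harness, not the one used for this report: `stress_test`, `worker`, and the `execute` callback are assumed names, each simulated client is modelled as one thread, and the P95 uses the nearest-rank definition.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def p95(samples: Sequence[float]) -> float:
    """95th percentile by the nearest-rank definition."""
    ordered = sorted(samples)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


def worker(queries: Sequence[str],
           execute: Callable[[str], object],
           rounds: int) -> List[float]:
    """One simulated client: run the whole query set `rounds` times."""
    latencies: List[float] = []
    for _ in range(rounds):
        for sql in queries:
            start = time.perf_counter()
            execute(sql)  # hypothetical callback that runs one query
            latencies.append(time.perf_counter() - start)
    return latencies


def stress_test(queries: Sequence[str],
                execute: Callable[[str], object],
                concurrency: int,
                rounds: int = 1) -> Dict[str, float]:
    """Run `concurrency` clients in parallel and summarize latency and QPS."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker, queries, execute, rounds)
                   for _ in range(concurrency)]
        latencies = [lat for f in futures for lat in f.result()]
    elapsed = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "queries": len(latencies),
        "p95_s": p95(latencies),
        "qps": len(latencies) / elapsed,
    }
```

Calling `stress_test` once per concurrency level (for example 1, 2, 4, 8 clients) and plotting `p95_s` against `concurrency` gives a curve of the same shape as the figure; a stable system shows P95 growing slowly while `qps` keeps rising.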
In point-query scenarios (queries that scan a small number of partitions/files and do not need much on-site calculation), Kylin 4 can meet the sub-second query latency requirement after some simple parameter adjustments, and its performance is close to that of Kylin 3 (to be specific, only slightly worse).
Cost of learning and difficulty of performance optimization (parameter adjustment).
Compared with Kylin 4, Kylin 3 has many build steps, and different steps depend on different components, such as Hive, MapReduce and HBase. Users have to learn and understand many architectures and technical details and become familiar with many parameters related to these components, which can be discouraging for new users.
In contrast, both cubing and querying in Kylin 4 run on the popular Spark engine, so new users only need to learn Spark in order to understand and adjust parameters. Learning materials for Spark are easy to find, and there are far fewer commonly used parameters than in Kylin 3.
Related details (comment by Xiaoxiang Yu): https://github.com/Kyligence/kylin-tpch/issues/6