Apache Kylin : Analytical Data Warehouse for Big Data

...

https://github.com/Kyligence/kylin-tpch


Environment


  • Hadoop cluster with 4 physical nodes
  • YARN queue with 400 GB of memory and 128 CPU cores

...

https://download-resource.s3.cn-north-1.amazonaws.com.cn/osspark/spark-2.4.1-os-kylin-r3

Performance of Build Engine

Over SSB

The following two figures compare build time and the storage space occupied after the build. At SSB data volumes of 60 million and 90 million rows, the new build engine roughly doubles build speed, and the resulting storage footprint is nearly halved.
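To make the two ratios concrete, here is a minimal sketch of how a build-time speedup and a storage saving could be computed from raw measurements. The function names are hypothetical and the article does not publish the underlying numbers, so no real measurements appear here.

```python
def speedup(old_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new build is (2.0 means twice as fast)."""
    if old_seconds <= 0 or new_seconds <= 0:
        raise ValueError("build times must be positive")
    return old_seconds / new_seconds


def space_saving(old_bytes: float, new_bytes: float) -> float:
    """Return the fraction of storage saved (0.5 means the new cube is half the size)."""
    if old_bytes <= 0 or new_bytes < 0:
        raise ValueError("old size must be positive and new size non-negative")
    return 1.0 - new_bytes / old_bytes
```

Under these definitions, a build that takes half as long is a 2x speedup, and a cube that occupies half the space is a 50% saving.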

...

  • Kylin on Parquet

  • Kylin on HBase


Performance of Query Engine

On the first query, the Kylin on Parquet query engine starts a long-running process on YARN that is dedicated to handling query tasks, so the first query is slower (initialization takes about 20 seconds). The time of the first query is therefore excluded from the results.
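A minimal sketch of a timing harness that follows this rule: it issues one warm-up query, discards its timing, and only then measures the benchmark queries. The `run_query` callable, the `warmup_sql` argument, and the use of repeated runs with a median are assumptions made for illustration; the article does not describe the actual test harness.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_queries(
    run_query: Callable[[str], object],  # hypothetical: submits SQL and blocks until the result returns
    queries: Dict[str, str],             # query name -> SQL text
    warmup_sql: str,                     # any cheap query; used only to trigger initialization
    repeats: int = 3,                    # assumption: the article does not state a repeat count
) -> Dict[str, float]:
    """Return the median response time in seconds for each named query."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    # Warm-up: pays the one-time initialization cost; its timing is discarded.
    run_query(warmup_sql)

    results: Dict[str, float] = {}
    for name, sql in queries.items():
        samples: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)
            samples.append(time.perf_counter() - start)
        results[name] = median(samples)
    return results
```

Any client that can submit SQL synchronously could be wrapped as `run_query`; the harness itself only depends on the Python standard library.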

...

We measure query response time using the official standard SQL for the SSB data set (90 million rows) and the TPC-H data set (12 million rows). The lower the query response time, the better the query engine performance. The standard query SQL for both data sets can be found in the SSB and TPC-H tool repositories mentioned at the beginning of the article.

Over SSB

From the figure below, we can see that on the SSB data set Kylin on Parquet responds more slowly than Kylin 3.0, but most queries still return within 1 second.

Over TPC-H

Because the main purpose of TPC-H is to test the response time of complex queries in a database system, the TPC-H SQL is more complicated and more demanding. As you can see from the figure below, Kylin on Parquet processes complex SQL queries faster and has a clear advantage.


Conclusion

According to the performance comparison of the build and query engines of Kylin on Parquet and Kylin 3.0, the Kylin on Parquet build engine is greatly improved: build time and storage space are both nearly halved. In the SSB query comparison, the new query engine lags behind Kylin 3.0 on simple queries, but most of them still return within about a second. The more complex SQL used in the TPC-H test generally requires more computation at query time after the precomputed data is read, and here the new query engine performs better.
Kylin on Parquet is still under continuous improvement. The GitHub repository is at https://github.com/apache/kylin/tree/kylin-on-parquet-v2. Issues and pull requests are welcome.

...