Apache Kylin : Analytical Data Warehouse for Big Data
Welcome to Kylin Wiki.
Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.
In Apache Kylin 4, the Kylin team has implemented a new build engine and a new query engine to provide better performance; please refer to KIP-1: Parquet storage if you are interested. But the current cuboid pruning tool (Cube Planner) is not compatible with the new build engine, so I want to make the new build engine support Cube Planner.
Q2. What problem is this proposal NOT designed to solve?
I am not going to support Cube Planner phase 2 at the moment, because phase 2 depends on some metrics in CubeVisitService.java (aggRowCount & totalRowCount) to infer the row count of unbuilt/new cuboids. HBase storage has been removed in Kylin 4, so we have to find another way to infer the row count of unbuilt/new cuboids. Besides, the System Cube (metrics system) needs to be refactored, and the metrics in METRICS_QUERY_RPC are deprecated because the storage has changed (we no longer have HBase region servers).
Q3. How is it done today, and what are the limits of current practice?
- It is almost done in my patch; please check or review it at https://github.com/apache/kylin/pull/1485 .
- Adding a new step to calculate each cuboid's HyperLogLog counter degrades build performance slightly, which looks acceptable to me.
Q4. What is new in your approach and why do you think it will be successful?
- It is not a new approach; the main logic of the newly added code follows the original implementation in FactDistinctColumnsMapper.java .
- We know that Cube Planner phase 1 depends on the row count of each cuboid to calculate BPUS (benefit per unit space). By introducing a new step that calculates a HyperLogLog counter for each candidate cuboid, we can enable Cube Planner phase 1.
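To illustrate the idea, here is a minimal, simplified sketch of how a BPUS-style score could be derived from the per-cuboid row counts that the new HyperLogLog step produces. The `bpus` helper and the numbers are hypothetical and only illustrate the benefit-per-unit-space intuition; the real Cube Planner runs a greedy selection over the whole cuboid lattice.

```java
public class BpusSketch {
    // Simplified BPUS (benefit per unit space) score for a candidate cuboid:
    // the scan-cost saving it brings to the queries it can answer, divided by
    // its own storage cost (its row count, estimated via HyperLogLog).
    // This helper is a hypothetical illustration, not Kylin's actual API.
    static double bpus(long candidateRows, long cheapestAncestorRows, int answerableQueries) {
        // Saving per answerable query: rows scanned from the cheapest already
        // materialized ancestor minus rows scanned from the candidate itself.
        long savingPerQuery = Math.max(0, cheapestAncestorRows - candidateRows);
        return (double) (savingPerQuery * answerableQueries) / candidateRows;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: the cheapest existing ancestor has 1,000,000
        // rows; the candidate aggregates down to 10,000 rows and can serve
        // 5 query patterns.
        System.out.println(bpus(10_000L, 1_000_000L, 5)); // prints 495.0
    }
}
```

A cuboid whose row count is close to its cheapest ancestor's yields a score near zero, which is why accurate per-cuboid row counts are the prerequisite for phase 1.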
Q5. Who cares? If you are successful, what difference will it make?
After this task is done, Kylin 4 will support Cube Planner phase 1, making cuboid pruning much easier than it is today (currently unsupported).
Q6. What are the risks?
No significant risks have been identified so far.
Q7. How long will it take?
I have spent about three weeks reading the original source code, writing my code, and testing it. It is almost done.
Q8. How it works?
- Use Spark to calculate each cuboid's HLLCounter for the first segment and persist it to HDFS.
- Re-enable Cube Planner by default, but do not support Cube Planner phase 2.
- Do not merge cuboid statistics (HLLCounter) when merging segments.
- By default, only calculate cuboid statistics for the FIRST segment. (Not necessary for later segments because phase 2 is not supported.)
- Cuboid statistics use an HLLCounter with precision 14.
- Calculate cuboid statistics using 100% of the input flat table data. (Sampling of the input RDD may be considered in the future.)
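The core of the statistics step above can be sketched as follows: each flat-table row is projected onto every candidate cuboid (encoded as a dimension bitmask) and the distinct keys per cuboid are counted. This is a self-contained illustration, not Kylin's actual code: a HashSet stands in for the HLLCounter (precision 14) so the sketch runs without Kylin dependencies, trading the bounded memory of HyperLogLog for exact counts.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CuboidStatsSketch {
    // For every candidate cuboid (dimension bitmask), project each flat-table
    // row onto the cuboid's dimensions and count distinct keys. In Kylin this
    // counting would be done with an HLLCounter; a HashSet is used here only
    // to keep the sketch self-contained.
    static Map<Integer, Long> cuboidRowCounts(List<String[]> flatRows, int[] cuboidMasks) {
        Map<Integer, Set<String>> distinct = new HashMap<>();
        for (int mask : cuboidMasks) {
            distinct.put(mask, new HashSet<>());
        }
        for (String[] row : flatRows) {
            for (int mask : cuboidMasks) {
                StringBuilder key = new StringBuilder();
                for (int d = 0; d < row.length; d++) {
                    if ((mask & (1 << d)) != 0) {
                        key.append(row[d]).append('\u0000'); // delimiter avoids value collisions
                    }
                }
                distinct.get(mask).add(key.toString());
            }
        }
        Map<Integer, Long> counts = new HashMap<>();
        distinct.forEach((mask, keys) -> counts.put(mask, (long) keys.size()));
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical flat-table rows with three dimensions: country, year, product.
        List<String[]> rows = List.of(
                new String[]{"CN", "2024", "A"},
                new String[]{"CN", "2024", "B"},
                new String[]{"US", "2024", "A"});
        // Cuboid 0b001 = (country), 0b011 = (country, year), 0b111 = all dimensions.
        Map<Integer, Long> counts = cuboidRowCounts(rows, new int[]{0b001, 0b011, 0b111});
        System.out.println(counts.get(0b001)); // prints 2 (distinct countries)
        System.out.println(counts.get(0b011)); // prints 2 (distinct country/year pairs)
        System.out.println(counts.get(0b111)); // prints 3 (distinct full rows)
    }
}
```

In the actual build, this projection runs as a Spark job over the first segment's flat table and the resulting counters are persisted to HDFS, per the steps above.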