Apache Kylin : Analytical Data Warehouse for Big Data
Kylin introduced the concept of buckets, which can be understood as dividing the data into several buckets (i.e., multiple partitions) so that it can be processed in parallel. When the dictionary is built for the first time, the values in each bucket are encoded starting from 1; after all buckets finish encoding, the overall dictionary values are allocated according to each bucket's offset. In the code, these two encodings are stored in two HashMaps: one holds the relative dictionary value within the bucket, and the other holds the absolute dictionary value across all buckets.
The following figure shows, for the bucket numbered 1, how the bucket's dictionary is carried across multiple build tasks. Each build creates a new version for the bucket (i.e. v1, v2, v3, and so on); the rationale for this version control is explained below. Curr (current) and Prev (previous) are the two HashMaps in a bucket: they store, respectively, the relative code values of the dictionary in the current bucket and the absolute code values of all dictionary values constructed in previous builds.
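The two-HashMap idea above can be sketched as follows. This is an illustrative Python sketch, not Kylin's actual (Java) implementation: `prev` stands in for the Prev map of absolute codes carried over from earlier builds, `curr` for the Curr map of relative codes assigned in the current build, and `assign_absolute_codes` shows how relative codes plus bucket offsets yield the global codes.

```python
class BucketDictionary:
    """Sketch of one bucket's dictionary state (names are illustrative)."""

    def __init__(self, prev=None):
        self.prev = dict(prev or {})  # value -> absolute code from earlier builds
        self.curr = {}                # value -> relative code within this build

    def encode_new_values(self, values):
        # Values not yet in the dictionary get relative codes starting from 1.
        for v in values:
            if v not in self.prev and v not in self.curr:
                self.curr[v] = len(self.curr) + 1

    def size_of_new(self):
        return len(self.curr)


def assign_absolute_codes(buckets, start_offset):
    """After all buckets finish relative encoding, convert relative codes to
    absolute ones using the running offset over the (ordered) buckets."""
    offset = start_offset
    for b in buckets:
        for value, rel in b.curr.items():
            b.prev[value] = offset + rel  # absolute = bucket offset + relative
        offset += b.size_of_new()
    return offset  # total dictionary size after this build
```

For example, a first build with values `a, b` in bucket 0 and `c` in bucket 1 yields absolute codes 1, 2, 3; a second build that adds `d` to bucket 0 starts from offset 3 and assigns it code 4, while all earlier codes are preserved in `prev`.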
- Create a flat table through Spark and obtain the distinct values that need exact deduplication;
- Decide the number of shards (buckets) according to the number of distinct values, and determine whether to expand the bucket count as needed;
- Repartition the data into multiple buckets, encode each bucket separately, and store each in its own dictionary file;
- Assign a version number to the current build task;
- Save the dictionary files and the metadata (the number of buckets and each bucket's offset);
- Delete old versions when the retention conditions are met.
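Steps 2 and 3 above can be sketched in Python as follows. The threshold and function names here are illustrative assumptions, not Kylin's actual constants or API: the bucket count is derived from the distinct-value count, and each value is routed to a bucket by hash (in the real job this is a Spark repartition).

```python
# Assumed threshold for illustration only, not Kylin's actual constant.
DEFAULT_VALUES_PER_BUCKET = 5_000_000


def decide_bucket_count(distinct_count, values_per_bucket=DEFAULT_VALUES_PER_BUCKET):
    # ceil(distinct_count / values_per_bucket), with at least one bucket
    return max(1, -(-distinct_count // values_per_bucket))


def bucket_of(value, num_buckets):
    # In the real job this is a Spark repartition by key hash;
    # Python's built-in hash() stands in for the partitioner here.
    return hash(value) % num_buckets
```

Because the bucket count can be expanded on later builds, skew in one bucket can be relieved by raising the count, which is the knob referred to in the data-skew question below.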
The global dictionary is isolated per build by assigning a timestamp-based version number to each build task. Version control is needed because build tasks may run concurrently. With versioning, every encoding pass can read a complete copy of the previously built global dictionary, which guarantees that the latest version holds the most complete set of codes; whenever a Cube's global dictionary is read, the latest version is selected. The dictionary is ultimately stored by version on the file storage system (HDFS here), as shown in the figure below.
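Reader-side version selection then reduces to picking the largest timestamp among the version directories. A minimal sketch, assuming timestamp-named directories (the path layout is illustrative, not Kylin's exact one):

```python
def latest_version(version_dirs):
    """Pick the newest timestamp-named version directory, e.g. the children
    of a per-column global-dictionary directory on HDFS (layout assumed)."""
    return max(version_dirs, key=int) if version_dirs else None
```

Comparing as integers rather than strings avoids any dependence on zero-padding in the directory names.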
- Why does a BucketDictionary need two Maps?
- At the start of the build, the dictionary values assigned to each bucket are given a relative code starting from 1; these relative codes are stored in one HashMap. Once relative encoding is complete, each bucket's offset, i.e. the number of dictionary values in the bucket, is known, and the absolute code of each value is computed from its relative code plus the offsets of all preceding buckets (the buckets are ordered). The absolute codes are stored in the other HashMap.
- Will there be data skew issues?
- With the testing we have done, the probability of a hotspot causing a build failure is very small; generally even billion-level skew does not break the build. A large number of count-distinct columns may indeed cause this problem, but the number of encoding buckets can be enlarged almost without limit, so unless the load concentrates on a single hotspot key, adjusting the parameters is enough to get the build through.
- Why can the number of values in a global dictionary exceed the maximum integer (2^31, about 2.1 billion)?
- Because we use the new bitmap data structure "Roaring64BitMap", which is 64-bit (2^64). After global dictionary encoding is completed, the codes are compressed into binary and stored in a Roaring64BitMap object; the bitmap stores Long values instead of Integer values.
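A 64-bit Roaring bitmap indexes each long by its high 32 bits and keeps a 32-bit bitmap per high part. The following Python sketch mimics that split with plain sets (the real structure uses compressed Roaring containers, not sets) just to show that codes past the 2^31 limit pose no problem:

```python
from collections import defaultdict


class TinyBitmap64:
    """Toy stand-in for a 64-bit Roaring bitmap: split each long code into
    its high and low 32-bit halves, one container per high half."""

    def __init__(self):
        self.parts = defaultdict(set)  # high 32 bits -> set of low 32 bits

    def add(self, code):
        self.parts[code >> 32].add(code & 0xFFFFFFFF)

    def contains(self, code):
        return (code & 0xFFFFFFFF) in self.parts.get(code >> 32, set())
```

Since the key space is 2^64, dictionary codes well beyond 2.1 billion can be stored and looked up without overflow.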