INTRODUCTION
Apache CarbonData has rich multi-level index support.
DESCRIPTION
Apache CarbonData uses multiple indexes at various levels to enable faster search and query processing.
Using indexes, we can efficiently find the position of the required data while skipping the parts of the data that need not be processed, which results in faster query processing.
Storing data together with its index significantly accelerates queries that contain filters and reduces the I/O scans and CPU resources they need. Because the CarbonData index consists of multiple levels, a processing framework can use it to reduce the number of tasks it needs to schedule and process. During task-side scanning, it can also skip data in finer-grained units (called blocklets) instead of scanning the whole file.
To get data using the indexes, the following steps are performed:
1. File pruning.
2. Blocklet pruning.
3. Binary search using the inverted index.
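The three steps can be sketched end to end as follows. This is a minimal illustration and not CarbonData's implementation: the `FILES` layout, the helper names, and the single-column equality filter are all assumptions made for the example, and the min/max ranges are computed on the fly here, whereas a real store would read them from a prebuilt index.

```python
from bisect import bisect_left

# Hypothetical toy layout: each file holds blocklets; each blocklet holds one
# column's values in sorted order plus the original row number of each value.
FILES = {
    "part-0": [
        {"values": [1, 3, 5], "rows": [2, 0, 1]},
        {"values": [7, 8, 9], "rows": [5, 3, 4]},
    ],
    "part-1": [
        {"values": [10, 12, 15], "rows": [0, 2, 1]},
        {"values": [20, 21, 30], "rows": [4, 5, 3]},
    ],
}


def blocklet_range(blocklet):
    """Min and max of a blocklet (its values are stored sorted)."""
    return blocklet["values"][0], blocklet["values"][-1]


def file_range(blocklets):
    """Min and max over all blocklets of a file."""
    return (min(blocklet_range(b)[0] for b in blocklets),
            max(blocklet_range(b)[1] for b in blocklets))


def lookup(files, target):
    """Return (file, blocklet number, row number) for every match of target."""
    hits = []
    for name, blocklets in files.items():
        low, high = file_range(blocklets)
        if not low <= target <= high:
            continue  # step 1: file pruning
        for number, blocklet in enumerate(blocklets):
            low, high = blocklet_range(blocklet)
            if not low <= target <= high:
                continue  # step 2: blocklet pruning
            # step 3: binary search over the sorted values
            values = blocklet["values"]
            i = bisect_left(values, target)
            while i < len(values) and values[i] == target:
                hits.append((name, number, blocklet["rows"][i]))
                i += 1
    return hits


print(lookup(FILES, 12))  # [('part-1', 0, 2)]
```

In this run, the file "part-0" and the second blocklet of "part-1" are skipped without their values being searched.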
TYPES OF INDEXES
I) Index stored in the file footer (enables two levels of B+ tree indexing):
1. Table level index: global B+ tree, efficient file level filtering.
The table level index is searched first to find the files that may hold the required data.
The file level index of each of these files is then used to find the row groups (data blocks).
Figure 1: Table Level Indexing
2. File level index: local B+ tree, efficient blocklet level filtering.
Figure 2: File Level Indexing
Figure 3: Blocklet Level Indexing
The blocklet min-max index records the minimum and maximum value of every column in the blocklet. Having min/max values for all columns helps to quickly locate the row groups (data blocks) that may contain data matching the search/filter criteria, because any blocklet whose range cannot overlap the filter is skipped.
Figure 4: Data Blocks
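How a min-max index over all columns prunes blocklets for a multi-column filter can be sketched as follows. This is an illustrative sketch only: `build_min_max`, `may_match`, and the sample rows are hypothetical names and data, and the filter is assumed to be a conjunction of inclusive ranges.

```python
def build_min_max(blocklet_rows):
    """Record (min, max) for every column of one blocklet.

    blocklet_rows is a list of dicts, one per row, all with the same keys.
    """
    columns = blocklet_rows[0].keys()
    return {c: (min(r[c] for r in blocklet_rows),
                max(r[c] for r in blocklet_rows)) for c in columns}


def may_match(min_max, ranges):
    """True unless some filtered column's range cannot overlap the blocklet.

    ranges maps column name -> (low, high), both inclusive.
    """
    for column, (low, high) in ranges.items():
        col_min, col_max = min_max[column]
        if high < col_min or low > col_max:
            return False  # no row in this blocklet can satisfy the filter
    return True


blocklets = [
    [{"id": 1, "price": 40}, {"id": 2, "price": 55}],
    [{"id": 3, "price": 10}, {"id": 4, "price": 90}],
    [{"id": 5, "price": 70}, {"id": 6, "price": 75}],
]
index = [build_min_max(rows) for rows in blocklets]

# Filter: id between 3 and 6 AND price between 60 and 80.
query = {"id": (3, 6), "price": (60, 80)}
candidates = [n for n, stats in enumerate(index) if may_match(stats, query)]
print(candidates)  # [1, 2]
```

Blocklet 1 is kept even though none of its rows match (its prices are 10 and 90). A min-max index can only rule blocklets out, so the surviving blocklets still have to be scanned.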
II) Column level index: inverted index, used for efficient column chunk scans.
The inverted index gives the actual position of each value in the column (i.e., its row number). Because it stores the values in sorted order, a binary search can locate the filter value much faster than a full scan of the column chunk.
Figure 5: Data contains Column Level Indexes
The data block level inverted index for all columns helps to quickly locate, within a row group (data block), the rows that contain data matching the search/filter criteria.
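As a rough sketch of the idea, an inverted index for one column chunk can be modelled as the sorted values together with the original row number of each value. The function names and sample data below are assumptions for illustration and do not reflect CarbonData's on-disk encoding.

```python
from bisect import bisect_left, bisect_right


def build_inverted_index(column):
    """Return (sorted_values, row_numbers) for one column chunk.

    row_numbers[i] is the original row number of sorted_values[i].
    """
    order = sorted(range(len(column)), key=lambda row: column[row])
    return [column[row] for row in order], order


def rows_equal_to(sorted_values, row_numbers, target):
    """Row numbers whose value equals target, found by binary search."""
    start = bisect_left(sorted_values, target)
    end = bisect_right(sorted_values, target)
    return sorted(row_numbers[start:end])


city = ["pune", "delhi", "agra", "delhi", "goa", "agra", "delhi"]
values, rows = build_inverted_index(city)
print(values)  # ['agra', 'agra', 'delhi', 'delhi', 'delhi', 'goa', 'pune']
print(rows)    # [2, 5, 1, 3, 6, 4, 0]
print(rows_equal_to(values, rows, "delhi"))  # [1, 3, 6]
```

The two binary searches bound the run of equal values, so the lookup cost grows with the logarithm of the chunk size plus the number of matching rows, rather than with the chunk size itself.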