Apache CarbonData has Rich Multi-Level Index Support.
Apache CarbonData uses multiple indexes at various levels to enable faster search and query processing.
Using indexes, we can efficiently locate the data a query needs while skipping the parts that need not be processed, which results in faster query processing.
Storing data together with its index significantly accelerates query performance and reduces I/O scans and CPU usage when the query contains filters. The CarbonData index consists of multiple levels of indices, so a processing framework can leverage it to reduce the number of tasks it needs to schedule and process. On the task side, it can also skip-scan in finer-grained units (called blocklets) instead of scanning the whole file.
To get data using indexing, the steps followed are:
Binary search using Inverted Index.
TYPES OF INDEXES
I) Index stored in the file footer (enables two levels of B+ tree indexing):
1. Table level index: global B+ tree, efficient file level filtering.
The table level index is searched first to find the relevant files.
These files are then searched, using the file level index, to get the row groups (Data Blocks).
Figure 1 : Table Level Indexing
2. File level index: local B+ tree, efficient blocklet level filtering
Figure 2 : File Level Indexing
A global Multi Dimensional Key (MDK) based B+ tree index over all non-measure columns aids in quickly locating the row groups (Data Blocks) that contain the data matching the search/filter criteria.
Figure 3 : Blocklet Level Indexing
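The MDK-based lookup described above can be illustrated with a minimal sketch. It assumes each index entry stores the start and end key of one blocklet and that the entries are sorted by key; a sorted list with binary search stands in for the B+ tree. The names `INDEX` and `candidate_blocklets`, and the sample keys, are hypothetical and are not CarbonData APIs.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index entries: (start_key, end_key, blocklet_id).
# Keys are multi-dimensional keys, modelled here as tuples of dimension
# values (country, year). Entries are sorted because the data is sorted by MDK.
INDEX = [
    (("CN", 2019), ("CN", 2021), "blocklet-0"),
    (("CN", 2021), ("IN", 2018), "blocklet-1"),
    (("IN", 2018), ("IN", 2022), "blocklet-2"),
    (("US", 2017), ("US", 2023), "blocklet-3"),
]


def candidate_blocklets(index, low_key, high_key):
    """Return ids of blocklets whose key range overlaps [low_key, high_key]."""
    starts = [entry[0] for entry in index]
    ends = [entry[1] for entry in index]
    # First blocklet whose end key is not below the lower bound.
    first = bisect_left(ends, low_key)
    # One past the last blocklet whose start key is not above the upper bound.
    last = bisect_right(starts, high_key)
    return [index[i][2] for i in range(first, last)]


# Filter "country = 'IN'": every key that starts with "IN".
print(candidate_blocklets(INDEX, ("IN",), ("IN", float("inf"))))
```

Only the blocklets returned here need to be scanned; the remaining ones cannot contain a matching key and are skipped.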
A Min-Max index for all columns aids in quickly locating the row groups (Data Blocks) that contain the data matching the search/filter criteria.
Figure 4 : Data Blocks
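Min-max pruning as described above can be sketched in a few lines. This is only an illustration under simple assumptions: one integer column, one (min, max) pair per blocklet, and a range filter. `BlockletStats` and `prune_by_min_max` are hypothetical names, not part of CarbonData.

```python
from dataclasses import dataclass


@dataclass
class BlockletStats:
    blocklet_id: int
    min_value: int  # smallest column value stored in the blocklet
    max_value: int  # largest column value stored in the blocklet


def prune_by_min_max(stats, low, high):
    """Keep only blocklets whose [min, max] range overlaps the filter range [low, high]."""
    return [
        s.blocklet_id
        for s in stats
        if s.max_value >= low and s.min_value <= high
    ]


stats = [
    BlockletStats(0, 1, 100),
    BlockletStats(1, 90, 250),
    BlockletStats(2, 300, 420),
]

# Filter "value BETWEEN 95 AND 120": blocklet 2 is skipped without being read.
print(prune_by_min_max(stats, 95, 120))  # [0, 1]
```

A blocklet that survives pruning may still contain no matching rows; the min-max check only rules out blocklets that certainly have none.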
II) Column level index: inverted index used for efficient column chunk scan
Figure 5 : Data contains Column Level Indexes
A Data Block level inverted index for all columns aids in quickly locating the rows that contain the data matching the search/filter criteria within a row group (Data Block).
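The binary search over an inverted index can be sketched as follows. The sketch assumes that a column chunk stores its values in sorted order together with an array that maps each sorted position back to the original row id; the actual on-disk layout in CarbonData may differ. `build_inverted_index` and `rows_matching` are hypothetical helper names.

```python
from bisect import bisect_left, bisect_right


def build_inverted_index(column_values):
    """Sort the column and remember the original row id of each sorted position."""
    order = sorted(range(len(column_values)), key=lambda i: column_values[i])
    sorted_values = [column_values[i] for i in order]
    return sorted_values, order


def rows_matching(sorted_values, row_ids, target):
    """Binary-search the sorted values, then map the hits back to row ids."""
    lo = bisect_left(sorted_values, target)
    hi = bisect_right(sorted_values, target)
    return sorted(row_ids[lo:hi])


city = ["Pune", "Delhi", "Agra", "Delhi", "Pune", "Delhi"]
sorted_values, row_ids = build_inverted_index(city)

# Filter "city = 'Delhi'": two binary searches replace a full column scan.
print(rows_matching(sorted_values, row_ids, "Delhi"))  # [1, 3, 5]
```

Because the values are sorted, all matches form one contiguous run, so the lookup costs two binary searches rather than a comparison per row.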