IndexDev Bitmap

=3D Bitmap Indexing =3D

Introduction

This document explains the proposed design for adding a bitmap index han= dler (https://issues.apache.org/jira/browse/HIVE-= 1803).
Bitmap indexing (http://en.wikipedia.org/wiki/Bitmap_i= ndex) is a standard technique for indexing columns with few distinct values, such as gender.

Approach

We want to develop a bitmap index that can reuse as much of the existing= Compact Index code as possible.

Proposal

First implementation

This implementation confers some of the benefits of bitmap indexing and = should be easy to implement given the already existing compact index, but i= t does few of the optimizations such as compression that a really good bitm= ap index should do.

Like the complex index, this implementation uses an index table. The ind= ex table on a column "key" has four or more columns: first, the columns tha= t are being indexed, then _bucketname, _offset, and _bitmaps. _bucketname i= s a string pointing to the hadoop file that is storing this block in the ta= ble, _offset is the block offset of a block, and _bitmaps is an uncompresse= d bitmap encoding (an Array of bytes) of the bitmap for this column value, = bucketname, and row offset. Each bit in the bitmap corresponds to one row i= n the block. The bit is 1 if that row has the value of the values in the co= lumns being indexed, and a 0 if not. If a key value does not appear in a bl= ock at all, the value is not stored in the map.

When querying this index, if there are boolean AND or OR operations done= on the predicates with bitmap indexes, we can use bitwise operations to tr= y to eliminate blocks as well. We can then eliminate blocks that do not con= tain the value combinations we are interested in. We can use this data to g= enerate the filename, array of block offsets format that the compact index = handler uses and reuse that in the bitmap index query.

Second iteration

The basic implementation's only compression is eliminating blocks where = all rows are 0s. This is unlikely to happen for larger blocks, so we need a= better compression format. What we can do is do byte-aligned bitmap compre= ssion, where the bitmap is an array of bytes, and a byte of all 1s or all 0= s implies one or more bytes where every value is 0 or 1. Then, we would jus= t need to add another column in the bitmap index table that is an array of = Ints that describe how long the gaps are and logic to expand the compressio= n.

Example

Suppose we have a bitmap index on a key where, on the first block, value= "a" appears in rows 5, 12, and 64, and value "b" appears in rows 7, 8, and= 9. Then, for the preliminary implementation, the first entry in the index = table will be:

https://iss= ues.apache.org/jira/secure/attachment/12460083/bitmap_index_1.png


The values in the array represent the bitmap for each block, where each =
32-bit BigInt value stores 32 rows.
For the second iteration, the first entry will be:
https://iss=
ues.apache.org/jira/secure/attachment/12460124/bitmap_index_2.png

This one uses 1-byte array entries, so each value in the array stores 8 =
rows. If an entry is 0x00 or 0xFF, it represents 1 or more consecutive byte=
s of zeros, (in this case 5 and 4, respectively)