Date: Thu, 28 Mar 2024 10:05:50 +0000 (UTC) Message-ID: <1803735333.67455.1711620350658@cwiki-he-fi.apache.org> Subject: Exported From Confluence MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_Part_67454_1601566931.1711620350658" ------=_Part_67454_1601566931.1711620350658 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Content-Location: file:///C:/exported.html
=3D Bitmap Indexing =3D
This document explains the proposed design for adding a bitmap index han=
dler (https://issues.apache.org/jira/browse/HIVE-=
1803).
Bitmap indexing (http://en.wikipedia.org/wiki/Bitmap_i=
ndex) is a standard technique for indexing columns with few distinct
values, such as gender.
We want to develop a bitmap index that can reuse as much of the existing= Compact Index code as possible.
This implementation confers some of the benefits of bitmap indexing and = should be easy to implement given the already existing compact index, but i= t does few of the optimizations such as compression that a really good bitm= ap index should do.
Like the complex index, this implementation uses an index table. The ind= ex table on a column "key" has four or more columns: first, the columns tha= t are being indexed, then _bucketname, _offset, and _bitmaps. _bucketname i= s a string pointing to the hadoop file that is storing this block in the ta= ble, _offset is the block offset of a block, and _bitmaps is an uncompresse= d bitmap encoding (an Array of bytes) of the bitmap for this column value, = bucketname, and row offset. Each bit in the bitmap corresponds to one row i= n the block. The bit is 1 if that row has the value of the values in the co= lumns being indexed, and a 0 if not. If a key value does not appear in a bl= ock at all, the value is not stored in the map.
When querying this index, if there are boolean AND or OR operations done= on the predicates with bitmap indexes, we can use bitwise operations to tr= y to eliminate blocks as well. We can then eliminate blocks that do not con= tain the value combinations we are interested in. We can use this data to g= enerate the filename, array of block offsets format that the compact index = handler uses and reuse that in the bitmap index query.
The basic implementation's only compression is eliminating blocks where = all rows are 0s. This is unlikely to happen for larger blocks, so we need a= better compression format. What we can do is do byte-aligned bitmap compre= ssion, where the bitmap is an array of bytes, and a byte of all 1s or all 0= s implies one or more bytes where every value is 0 or 1. Then, we would jus= t need to add another column in the bitmap index table that is an array of = Ints that describe how long the gaps are and logic to expand the compressio= n.
Suppose we have a bitmap index on a key where, on the first block, value= "a" appears in rows 5, 12, and 64, and value "b" appears in rows 7, 8, and= 9. Then, for the preliminary implementation, the first entry in the index = table will be:
https://iss=
ues.apache.org/jira/secure/attachment/12460083/bitmap_index_1.png
The values in the array represent the bitmap for each block, where each = 32-bit BigInt value stores 32 rows.
For the second iteration, the first entry will be:
https://iss=
ues.apache.org/jira/secure/attachment/12460124/bitmap_index_2.png
This one uses 1-byte array entries, so each value in the array stores 8 = rows. If an entry is 0x00 or 0xFF, it represents 1 or more consecutive byte= s of zeros, (in this case 5 and 4, respectively)