This Confluence has been LDAP enabled, if you are an ASF Committer, please use your LDAP Credentials to login. Any problems file an INFRA jira ticket please.

Child pages
  • IndexDev

Versions Compared


  • This line was added.
  • This line was removed.
  • Formatting was changed.


This document explains the proposed design for adding index support to
Hive ( Indexing is a
standard database technique, but with many possible variations.
Rather than trying to provide a "one-size-fits-all" index
implementation, the approach we are taking is to define indexing in a
pluggable manner (related to Hive-StorageHandlers) and provide one
concrete indexing implementation as a reference, leaving it open for
contributors to plug in other indexing schemes as time goes by. No index support will be available until Hive 0.7.


Only single-table indexes are supported. Others (such as join
indexes) may be more appropriately expressed as materialized views
once Hive has support for those.

This document currently only covers index creation and maintenance. A
follow-on will explain how indexes are used to optimize queries (building on Hive-FilterPushdownDev).


For the details of the various clauses such as ROW FORMAT, see

By default, index partitioning matches the partitioning of the
base table. The PARTITIONED BY clause may be used to specify a subset
of the table's partitioning columns (this column list may be empty to
indicate that the index spans all partitions of the table). For
example, a table may be partitioned by date+region even though the index
is partitioned by date alone (each index partition spanning all regions).

Indexes cannot be created on views. We will (eventually) support them
on non-native tables (in cases where the corresponding storage handler
indicates that it supports them).

Index handlers may require that the base table being indexed have a
particular format.

Question: should we allow indexes on EXTERNAL tables? What does this
mean for implicit DROP when the table is dropped? Is there a concept
of an EXTERNAL index?

If the index handler stores its representation in tabular form, then
index_table_name can be used to control the name of the "index table"
automatically created for this purpose. The index table storage
format can be controlled using STORED AS (e.g. RCFILE or SEQUENCFILE)
or STORED BY (e.g. to store the index table in a non-native table such
as HBase), although some index handlers may require usage of a
specific storage format. Not all index handlers store their
representation in tabular form; some may use non-table files, and
others may use structures maintained completely outside of Hive
(e.g. a persistent key/value store).


The new IDXS table in the metastore schema contains one entry per
index created. It has two relationships with the TBLS table:


In the IDXS entry for x, ORIG_TBL_ID will reference the TBL_ID of x,
and IDX_TBL_ID will reference the TBL_ID of default+t_x+.

To avoid the generated name, a user-specified name such as t_x can be
supplied instead:

Code Block
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'

Note that index names are qualified by the containing base table (like
partitions), so the same index name can be used across two different
tables. However, names of index tables are in the same namespace as
all other tables and views, so they must be unique within the same

An index has a storage descriptor which includes the subset of columns
from the original table covered by the index. If the index
representation is stored in a table, most of the other fields in the
index's own storage descriptor (e.g. LOCATION) will be irrelevant.


If WITH DEFERRED REBUILD is specified on CREATE INDEX, then the newly
created index is initially empty (regardless of whether the table
contains any data). The ALTER INDEX ... REBUILD command can be used
to build the index structure for all partitions or a single partition.

If data in the base table changes, then the REBUILD command must be
used to bring the index up to date. This is an atomic operation, so
if the table was previously indexed, and a rebuild fails, then the
stale index remains untouched.


An index can be dropped at any time with DROP INDEX. This will also
cascade to the index table (if one exists).


When an indexed base table is dropped, the DROP implicitly cascades to
all indexes (and their corresponding index tables if any).

When an indexed base table has one of its partitions dropped, this
implicitly cascades to drop corresponding partitions from all indexes.

Question: what do we do if the index partitioning granularity is not the
same as the table partitioning granularity? Probably just ignore the drop, and let the user clean up manually with a new ALTER INDEX DROP PARTITION statement.


The corresponding Java inerface is defined below, together with a
companion abstract base class which handlers should extend.


For CREATE INDEX, Hive first calls usesIndexTable() on the handler to
determine whether an index table will be created. If this returns
false, the statement fails immediately if the user specified any table
storage options for the index. However, if usesIndexTable() returns
true, then Hive creates a partial table definition for the index table
based on the index definition (such as the covered columns) combined
with any table storage options supplied by the user. Next, Hive calls
analyzeIndexDefinition (passing either null or the partial index table
definition for the indexTable parameter). The handler responds by
validating the definitions (throwing an exception if any unsupported
combination is detected) and then filling in additional information on
the index and indexTable parameters as output. Hive then stores these
results in the metastore.

TBD: we will be adding methods for calling the handler when an index
is dropped (e.g. to give a cleanup opportunity to a handler which stores
the index representation in an external system such as HBase)


The reference implementation creates what is referred to as a
"compact" index. This means that rather than storing the HDFS
location of each occurrence of a particular value, it only stores the
addresses of HDFS blocks containing that value. This is optimized for
point-lookups in the case where a value typically occurs more than
once in nearby rows; the index size is kept small since there are many
fewer blocks than rows. The tradeoff is that extra work is required
during queries in order to filter out the other rows from the indexed

The compact index is stored in an index table. The index table
columns consist of the indexed columns from the base table followed by
a _bucketname string column (indicating the name of the file
containing the indexed block) followed by an _offsets array<string>
column (indicating the block offsets within the corresponding file).
The index table is stored as sorted on the indexed columns (but not on
the generated columns).

The reference implementation can be plugged in with