This document explains the proposed design for adding index support to Hive (http://issues.apache.org/jira/browse/HIVE-417). Indexing is a standard database technique, but with many possible variations. Rather than trying to provide a "one-size-fits-all" index implementation, the approach we are taking is to define indexing in a pluggable manner (related to Hive- StorageHandlers) and provide one concrete indexing implementation as a reference, leaving it open for contributors to plug in other indexing schemes as time goes by. No index support will be available until Hive 0.7.
This document currently only covers index creation and maintenance. A follow-on will explain how indexes are used to optimize queries (building on Hive- FilterPushdownDev).
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...) AS 'index.handler.class.name' [WITH DEFERRED REBUILD] [IDXPROPERTIES (property_name=property_value, ...)] [IN TABLE index_table_name] [PARTITIONED BY (col_name, ...)] [ [ ROW FORMAT ...] STORED AS ... | STORED BY ... ] [LOCATION hdfs_path] [TBLPROPERTIES (...)] [COMMENT "index comment"]
By default, index partitioning matches the partitioning of the base table. The PARTITIONED BY clause may be used to specify a subset of the table's partitioning columns (this column list may be empty to indicate that the index spans all partitions of the table). For example, a table may be partitioned by date+region even though the index is partitioned by date alone (each index partition spanning all regions).
If WITH DEFERRED REBUILD is specified on CREATE INDEX, then the newly created index is initially empty (regardless of whether the table contains any data). The ALTER INDEX ... REBUILD command can be used to build the index structure for all partitions or a single partition.
- specs for SHOW/DESCRIBE INDEX (HIVE-1497 (http://issues.apache.org/jira/browse/HIVE-1497))
- ALTER INDEX DROP PARTITION?
- ALTER INDEX SET IDXPROPERTIES, change tableformat, etc
- what happens when the structure of a table or partition changes after it has already been indexed
- automatic indexing as part of INSERT when WITH DEFERRED REBUILD is not specified
- prevent creation of an index on an index table?
- metastore upgrade script
- stats collection for index tables
- intersection with new Hive concurrency control feature (what locks do we take for various index operations?)