Comment: Add secondary indexing information

...

These are set by appending a pound sign ('#') to the column mapping element, followed by either the long or short serialization value. The default serialization is 'string'. For example, for the value 10, "person:age#s" is synonymous with "person:age" and serializes the value as the literal string "10". If "person:age#b" is used instead, the value is serialized as four bytes: \x00\x00\x00\x0A.
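To make the difference between the two serializations concrete, here is a minimal sketch in plain Python. It does not call Hive or Accumulo; it only assumes that the binary form of an integer is a 4-byte big-endian signed value, which is what the four-byte example above suggests.

```python
import struct

value = 10

# '#s' (string, the default): the value is stored as its text form.
as_string = str(value).encode("utf-8")

# '#b' (binary): assumed here to be a 4-byte big-endian signed integer.
as_binary = struct.pack(">i", value)

print(as_string)        # b'10'
print(as_binary.hex())  # 0000000a
```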

Indexing

Starting in Hive 3.0.0 (HIVE-15795), Accumulo-backed Hive tables support indexing. Indexing uses a second Accumulo table that maps each field value to the rowId of the corresponding record in the data table. The index table is populated automatically when records are inserted through Hive.

Using index tables greatly improves the performance of non-rowId predicate queries by eliminating full table scans. Indexing works for both internally and externally managed tables, using either the Tez or MapReduce query engine. The following options control indexing behavior.

| Option Name | Description |
|---|---|
| accumulo.indextable.name | (Required) The name of the index table in Accumulo. |
| accumulo.indexed.columns | (Optional) A comma-separated list of Hive columns to index, or * to index all columns (default: *). |
| accumulo.index.rows.max | (Optional) The maximum number of predicate values to scan from the index for each search predicate (default: 20000). See the note about this value below. |
| accumulo.index.scanner | (Optional) The index scanner implementation (default: org.apache.hadoop.hive.accumulo.AccumuloDefaultIndexScanner). |

The indexes are stored in the index table using the following format:

rowId = [field value in data table]

column_family = [field column family in data table] + '_' + [field column qualifier in data table]

column_qualifier = [field rowId in data table]

visibility = [field visibility in data table]

value = [empty byte array]
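The layout above can be sketched as a small Python model. This is illustrative only: `IndexEntry` and `index_entry` are hypothetical names, no Accumulo client API is used, and the field value is passed in as already-encoded bytes.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """Hypothetical model of one key/value pair in the index table."""
    row_id: bytes
    column_family: bytes
    column_qualifier: bytes
    visibility: bytes
    value: bytes


def index_entry(data_row_id: bytes, data_cf: bytes, data_cq: bytes,
                field_value: bytes, visibility: bytes = b"") -> IndexEntry:
    """Derive the index entry for one field of one data-table row."""
    return IndexEntry(
        row_id=field_value,                      # indexed value becomes the row
        column_family=data_cf + b"_" + data_cq,  # family + '_' + qualifier
        column_qualifier=data_row_id,            # points back to the data row
        visibility=visibility,                   # copied from the data field
        value=b"",                               # always an empty byte array
    )


# Data row "row1" holds person:age = "10" (string-encoded for simplicity).
entry = index_entry(b"row1", b"person", b"age", b"10")
print(entry)
```

Scanning the index for row `10` under column family `person_age` then yields the data-table rowIds in the column qualifiers.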

 

When using a string-encoded table, indexed field values of numeric types are encoded using Accumulo Lexicoder methods. Otherwise, values are encoded using native binary encoding. This information allows applications that insert data and index values into Accumulo outside of Hive to still run high-performance queries from within Hive.

When inserting data and indexes outside of Hive, it is important to update both tables within the same unit of work. If the Hive query does not find index matches for any of the query predicates, the query will short-circuit and return empty results without searching the data table.

If a search predicate matches more entries than the value of accumulo.index.rows.max (default: 20000), the index search results are abandoned and the query falls back to a full scan of the data table with predicate filtering. Note that using large values for this option, or having very large rowId values in the data table, may require increasing Hive memory to prevent memory errors.
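The three outcomes described in this section (no index matches, too many matches, and a usable set of matches) can be summarized in a short Python sketch. This is not the AccumuloDefaultIndexScanner implementation; `plan_scan` and its return labels are hypothetical and only mirror the behavior stated in the text.

```python
def plan_scan(index_matches, max_rows=20000):
    """Decide how to read the data table for one search predicate.

    index_matches: data-table rowIds found in the index for the predicate.
    max_rows: stands in for accumulo.index.rows.max.
    """
    if not index_matches:
        # No index entries: short-circuit and return no rows.
        return ("empty", [])
    if len(index_matches) > max_rows:
        # Too many matches: abandon the index results and scan everything.
        return ("full_scan", [])
    # Otherwise fetch only the matching rows from the data table.
    return ("row_lookup", sorted(set(index_matches)))


print(plan_scan([]))                          # ('empty', [])
print(plan_scan(["r2", "r1"], max_rows=5))    # ('row_lookup', ['r1', 'r2'])
print(plan_scan(["r1", "r2", "r3"], max_rows=2))  # ('full_scan', [])
```

The first branch is why data and index tables must be kept in sync when writing outside of Hive: a missing index entry makes the matching data row invisible to indexed queries.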

Other options

The following options can also be used with SERDEPROPERTIES or TBLPROPERTIES for further control over the actions of the AccumuloStorageHandler:

...