Hive Accumulo Integration

Overview

Apache Accumulo is a sorted, distributed key-value store based on the Google BigTable paper. The API methods that Accumulo provides are in terms of Keys and Values which present the highest level of flexibility in reading and writing data; however, higher-level query abstractions are typically an exercise left to the user. Leveraging Apache Hive as a SQL interface to Accumulo complements its existing high-throughput batch access and low-latency random lookups.

Implementation

The initial implementation was added to Hive 0.14 in HIVE-7068 and is designed to work with Accumulo 1.6.x. There are two main components which make up the implementation: the AccumuloStorageHandler and the AccumuloPredicateHandler. The AccumuloStorageHandler is a StorageHandler implementation. The primary roles of this class are to manage the mapping of Hive table to Accumulo table and configures Hive queries. The AccumuloPredicateHandler is used push down filter operations to the Accumulo for more efficient reduction of data.

Usage

To issue queries against Accumulo using Hive, four parameters must be provided by the Hive configuration:

accumulo.instance.name
accumulo.zookeepers
accumulo.user.name
accumulo.user.pass

For those familiar with Accumulo, these four configurations are the normal configuration values necessary to connect to Accumulo: the Accumulo instance name, the ZooKeeper quorum (comma-separated list of hosts), and Accumulo username and password. The easiest way to provide these values is by using the -hiveconf option to the hive command. It is expected that the Accumulo user provided either has the ability to create new tables, or that the Hive queries will only be accessing existing Accumulo tables.

hive -hiveconf accumulo.instance.name=accumulo -hiveconf accumulo.zookeepers=localhost -hiveconf accumulo.user.name=hive -hiveconf accumulo.user.pass=hive

To access Accumulo tables, a Hive table must be created using the CREATE command with the STORED BY clause. If the EXTERNAL keyword is omitted from the CREATE call, the lifecycle of the Accumulo table is tied to the lifetime of the Hive table: if the Hive table is deleted, so is the Accumulo table. This is the default case. Providing the EXTERNAL keyword will create a Hive table that references an Accumulo table but will not remove the underlying Accumulo table if the Hive table is dropped.

CREATE TABLE accumulo_table(row STRING, name STRING, age INT, weight DOUBLE, height INT)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES('accumulo.columns.mapping' = ':rowid,person:name,person:age,person:weight,person:height');

Space shortcuts

Child pages

Hive Accumulo Integration

Overview

Implementation

Usage

Space shortcuts

Child pages

AccumuloIntegration

Hive Accumulo Integration

Overview

Implementation

Usage