Hive Accumulo Integration
Apache Accumulo is a sorted, distributed key-value store based on the Google BigTable paper. The API methods that Accumulo provides are in terms of Keys and Values which present the highest level of flexibility in reading and writing data; however, higher-level query abstractions are typically an exercise left to the user. Leveraging Apache Hive as a SQL interface to Accumulo complements its existing high-throughput batch access and low-latency random lookups.
The initial implementation was added to Hive 0.14 in HIVE-7068 and is designed to work with Accumulo 1.6.x. There are two main components which make up the implementation: the AccumuloStorageHandler and the AccumuloPredicateHandler. The AccumuloStorageHandler is a StorageHandler implementation. The primary roles of this class are to manage the mapping of Hive table to Accumulo table and configures Hive queries. The AccumuloPredicateHandler is used push down filter operations to the Accumulo for more efficient reduction of data.
The only additional Accumulo configuration necessary is the inclusion of the hive-
accumulo-handler.jar, provided as a part of the Hive distribution, to be included in the Accumulo server classpath. This can be accomplished a variety of ways: copying/symlink the jar into
$ACCUMULO_HOME/lib/ext or include the path to the jar in
general.classpaths in accumulo-site.xml. Be sure to restart the Accumulo tabletservers if the jar is added to the classpath in a non-dynamic fashion (using
general.classpaths in accumulo-site.xml).
To issue queries against Accumulo using Hive, four parameters must be provided by the Hive configuration:
For those familiar with Accumulo, these four configurations are the normal configuration values necessary to connect to Accumulo: the Accumulo instance name, the ZooKeeper quorum (comma-separated list of hosts), and Accumulo username and password. The easiest way to provide these values is by using the
-hiveconf option to the
hive command. It is expected that the Accumulo user provided either has the ability to create new tables, or that the Hive queries will only be accessing existing Accumulo tables.
To access Accumulo tables, a Hive table must be created using the
CREATE command with the
STORED BY clause. If the
EXTERNAL keyword is omitted from the
CREATE call, the lifecycle of the Accumulo table is tied to the lifetime of the Hive table: if the Hive table is deleted, so is the Accumulo table. This is the default case. Providing the
EXTERNAL keyword will create a Hive table that references an Accumulo table but will not remove the underlying Accumulo table if the Hive table is dropped.
Each Hive row maps to a set of Accumulo keys with the same row ID. One column in the Hive row is designated as a "special" column which is used as the Accumulo row ID. All other Hive columns in the row have some mapping to Accumulo column (column family and qualifier) where the Hive column value is placed in the Accumulo value.
In the above statement, normal Hive column name and type pairs are provided as is the case with normal create table statements. The full AccumuloStorageHandler class name is provided to inform Hive that Accumulo will back this Hive table. A number of properties can be provided to configure the AccumuloStorageHandler via SERDEPROPERTIES or TBLPROPERTIES. The most important property is "accumulo.columns.mapping" which controls how the Hive columns map to Accumulo columns. In this case, the "row" Hive column is used to populate the Accumulo row ID component of the Accumulo Key, while the other Hive columns (name, age, weight and height) are all columns within the Accumulo row.
For the above schema in the "accumulo_table", we could envision a single row in the table:
The above record would be serialized into Accumulo Key-Value pairs in the following manner given the declared accumulo.columns.mapping:
The power of the column mapping is that multiple Hive tables with differing column mappings can interact with the same Accumulo table and produce different results. When columns are excluded, the performance of Hive queries can be improved through the use of Accumulo locality groups to filter out unwanted data at the server-side.
The column mapping string is comma-separated list of encoded values whose offset corresponds to the Hive schema for the table. The order of the columns in the Hive schema can be arbitrary as long as the elements in the column mapping align to the intended Hive column. For those familiar with Accumulo, each element in the column mapping string resembles a column_family:column_qualifier; however, there are a few different variants that allow for different control.
- A single column
- This places the value for the Hive column into the Accumulo value with the given column family and column qualifier.
- A column qualifier map
- A column family is provided and a column qualifier prefix of any length is allowed, follow by an asterisk.
- The Hive column type is expected to be a Map, the key of the Hive map is appended to the column qualifier prefix
- The value of the Hive map is placed in the Accumulo value.
- The rowid
- Controls which Hive column is used as the Accumulo rowid.
- Exactly one ":rowid" element must exist in each column mapping
- ":rowid" is case insensitive (:rowID is equivalent to :rowId)
Additionally, a serialization option can be provided to each element in the column mapping which will control how the value is serialized. Currently, the options are:
- 'binary' or 'b'
- 'string' or 's'
These are set by including a pound sign ('#') after the column mapping element with either the long or short serialization value. The default serialization is 'string'. For example, for the value 10, "person:age#s" is synonymous with the "person:age" and would serialize the value as the literal string "10". If "person:age#b" was used instead, the value would be serialized as four bytes: \x00\x00\x00\xA0.
The following options are also valid to be used with SERDEPROPERTIES or TABLEPROPERTIES for further control over the actions of the AccumuloStorageHandler:
|accumulo.iterator.pushdown||Should filter predicates be satisfied within Accumulo using Iterators (default: true)|
|accumulo.default.storage||The default storage serialization method for values (default: string)|
A static ColumnVisibility string to use when writing any records to Accumulo (default: empty string)
A comma-separated list of authorizations to use when scanning Accumulo (default: no authorizations).
Note that the Accumulo user provided to connect to Accumulo must have all authorizations provided.
Extension point which allows a custom class to be provided when constructing LazyObjects from the rowid without changing
the ObjectInspector for the rowid column.
|accumulo.composite.rowid||Extension point which allows for custom parsing of the rowid column into a LazyObject.|
|accumulo.table.name||Controls what Accumulo table name is used (default: the Hive table name)|
|accumulo.mock.instance||Use a MockAccumulo instance instead of connecting to a real instance (default: false). Useful for testing.|
Override the Accumulo table name
Create a user table, consisting of some unique key for a user, a user ID, and a username. The Accumulo row ID is from the Hive column, the user ID column is written to the "f" column family and "userid" column qualifier, and the username column to the "f" column family and the "nickname" column qualifier. Instead of using the "users" Accumulo table, it is overridden in the TBLPROPERTIES to use the Accumulo table "hive_users" instead.
Store a Hive map with binary serialization
Using an asterisk in the column mapping string, a Hive map can be expanded from a single Accumulo Key-Value pair to multiple Key-Value pairs. The Hive Map is a parameterized type: in the below case, the key is a string, and the value integer. The default serialization is overriden from 'string' to 'binary' which means that the integers in the value of the Hive map will be stored as a series of bytes instead of the UTF-8 string representation.
Register an external table
Creating the Hive table with the external keyword decouples the lifecycle of the Accumulo table from that of the Hive table. Creating this table assumes that the Accumulo table "countries" already exists. This is a very useful way to use Hive to manage tables that are created and populated by some external tool (e.g. A MapReduce job). When the Hive table countries is deleted, the Accumulo table will not be deleted. Additionally, the external keyword can also be useful when creating multiple Hive tables with different options that operate on the same underlying Accumulo table.
I would be remiss to not mention the efforts made by Brian Femiano that were the basis for this storage handler. His initial prototype for Accumulo-Hive integration was the base for this work.