Hive HBase Integration
- Storage Handlers
- Column Mapping
- Put Timestamps
- Key Uniqueness
- Potential Followups
- Open Issues (JIRA)
This page documents the Hive/HBase integration support originally introduced in HIVE-705. This feature allows Hive QL statements to access HBase tables for both read (SELECT) and write (INSERT). It is even possible to combine access to HBase tables with native Hive tables via joins and unions.
A presentation is available from the HBase HUG10 Meetup.
This feature is a work in progress, and suggestions for its improvement are very welcome.
Before proceeding, please read StorageHandlers for an overview of the generic storage handler framework on which HBase integration depends.
The storage handler is built as an independent module, hive-hbase-handler-x.y.z.jar, which must be available on the Hive client auxpath, along with the HBase, Guava and ZooKeeper jars. It also requires the correct configuration property to be set in order to connect to the right HBase master. See the HBase documentation for how to set up an HBase cluster.
Here's an example using CLI from a source build environment, targeting a single-node HBase server. (Note that the jar locations and names have changed in Hive 0.9.0, so for earlier releases, some changes are needed.)
Here's an example which instead targets a distributed HBase cluster where a quorum of 3 zookeepers is used to elect the HBase master:
The handler requires Hadoop 0.20 or higher, and has only been tested with dependency versions hadoop-0.20.x, hbase-0.92.0 and zookeeper-3.3.4. If you are not using hbase-0.92.0, you will need to rebuild the handler with the HBase jar matching your version, and change the --auxpath above accordingly. Failure to use matching versions will lead to misleading connection failures such as MasterNotRunningException, since the HBase RPC protocol changes often.
In order to create a new HBase table which is to be managed by Hive, use the STORED BY clause on CREATE TABLE. The hbase.columns.mapping property is required and will be explained in the next section. The hbase.table.name property is optional; it controls the name of the table as known by HBase, and allows the Hive table to have a different name. In this example, the table is known as hbase_table_1 within Hive, and as xyz within HBase. If not specified, then the Hive and HBase table names will be identical.
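A minimal sketch of such a definition, assuming the handler class org.apache.hadoop.hive.hbase.HBaseStorageHandler is available on the auxpath. The Hive column names key and value are illustrative; the table names and the cf1:val mapping follow the description above.

```sql
-- Managed table: Hive creates the HBase table xyz along with the Hive table.
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
-- One mapping entry per Hive column: key -> row key, value -> cf1:val.
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
-- Optional: HBase-side name; omit it to reuse the Hive table name.
TBLPROPERTIES ("hbase.table.name" = "xyz");
```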
After executing the command above, you should be able to see the new (empty) table in the HBase shell:
Notice that even though a column name "val" is specified in the mapping, only the column family name "cf1" appears in the DESCRIBE output in the HBase shell. This is because in HBase, only column families (not columns) are known in the table-level metadata; column names within a column family are only present at the per-row level.
Here's how to move data from Hive into the HBase table (see GettingStarted for how to create the example table pokes in Hive first):
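A sketch of such a load, assuming pokes has the columns foo INT and bar STRING as in GettingStarted, and that hbase_table_1 maps an int key and a string value. The filter value 98 is illustrative and only keeps the example small.

```sql
-- Copy a small slice of the native Hive table into the HBase-backed table.
-- foo becomes the HBase row key, bar becomes the value in cf1:val.
INSERT OVERWRITE TABLE hbase_table_1
SELECT * FROM pokes WHERE foo = 98;
```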
Use HBase shell to verify that the data actually got loaded:
And then query it back via Hive:
Inserting large amounts of data may be slow due to WAL overhead; if you would like to disable this, make sure you have HIVE-1383 (as of Hive 0.6), and then issue this command before the INSERT:
Warning: disabling WAL may lead to data loss if an HBase failure occurs, so only use this if you have some other recovery strategy available.
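As a sketch, the session-level setting involved is hive.hbase.wal.enabled. It affects only the current Hive session, so it has to be issued again in each session that performs such a load.

```sql
-- Skip the HBase write-ahead log for subsequent INSERTs in this session.
-- See the warning above: writes not yet flushed can be lost on failure.
SET hive.hbase.wal.enabled=false;
```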
If you want to give Hive access to an existing HBase table, use CREATE EXTERNAL TABLE:
hbase.columns.mapping is required (and will be validated against the existing HBase table's column families), whereas hbase.table.name is optional.
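A sketch of an external table definition. The names hbase_table_2 and some_existing_table are hypothetical, and the HBase table is assumed to exist already with a column family cf1.

```sql
-- External table: Hive does not create the HBase table, and dropping the
-- Hive table leaves the HBase table in place.
CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "some_existing_table");
```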
There are two SERDEPROPERTIES that control the mapping of HBase columns to Hive:
- hbase.columns.mapping
- hbase.table.default.storage.type: can have a value of either string (the default) or binary; this option is only available as of Hive 0.9, and the string behavior is the only one available in earlier versions
The column mapping support currently available is somewhat cumbersome and restrictive:
- for each Hive column, the table creator must specify a corresponding entry in the comma-delimited hbase.columns.mapping string (so for a Hive table with n columns, the string should have n entries); whitespace should not be used between entries, since it will be interpreted as part of the column name, which is almost certainly not what you want
- a mapping entry must be either :key or of the form column-family-name:[column-name][#(binary|string)] (the type specification delimited by # was added in Hive 0.9.0; earlier versions interpreted everything as strings)
  - if no type specification is given, the value from hbase.table.default.storage.type will be used
  - any prefix of a valid value is valid too (i.e. #b instead of #binary)
  - if you specify a column as binary, the bytes in the corresponding HBase cells are expected to be of the form that HBase's Bytes class yields
- there must be exactly one :key mapping (this can be mapped either to a string or struct column; see Simple Composite Keys and Complex Composite Keys)
  - (note that before HIVE-1228 in Hive 0.6, :key was not supported, and the first Hive column implicitly mapped to the key; as of Hive 0.6, it is now strongly recommended that you always specify the key explicitly; we will drop support for implicit key mapping in the future)
- if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns
- there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
- Since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column
- it is not necessary to reference every HBase column family, but those that are not mapped will be inaccessible via the Hive table; it's possible to map multiple Hive tables to the same HBase table
The next few sections provide detailed examples of the kinds of column mappings currently possible.
Multiple Columns and Families
Here's an example with three Hive columns and two HBase column families, with two of the Hive columns (value1 and value2) corresponding to one of the column families (a, with HBase column names b and c), and the other Hive column corresponding to a single column (e) in its own column family (d).
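A sketch of a table definition matching this layout, followed by a load from pokes. The Hive column names (value1, value2, value3), the family and column names in the mapping, and the filter values are illustrative; pokes is assumed to have columns foo INT and bar STRING.

```sql
-- Four Hive columns, so four mapping entries:
--   key -> row key, value1 -> a:b, value2 -> a:c, value3 -> d:e
CREATE TABLE hbase_table_1(key int, value1 string, value2 int, value3 int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,a:b,a:c,d:e"
);

-- Populate two rows so that both column families receive data.
INSERT OVERWRITE TABLE hbase_table_1
SELECT foo, bar, foo + 1, foo + 2
FROM pokes WHERE foo = 98 OR foo = 100;
```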
Here's how this looks in HBase:
And when queried back into Hive:
Hive MAP to HBase Column Family
Here's how a Hive MAP datatype can be used to access an entire column family. Each row can have a different set of columns, where the column names correspond to the map keys and the column values correspond to the map values.
(This example also demonstrates using a Hive column other than the first as the HBase row key.)
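A sketch of a MAP-based definition. The table, column and family names are illustrative, and pokes is assumed to have columns foo INT and bar STRING. The mapping entry "cf:" names a family without a column, and :key is deliberately placed second.

```sql
-- value maps to the whole column family cf; row_key (the second Hive
-- column) maps to the HBase row key.
CREATE TABLE hbase_table_1(value map<string,int>, row_key int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "cf:,:key"
);

-- Each row gets one HBase column, named after bar, holding the value foo.
INSERT OVERWRITE TABLE hbase_table_1
SELECT map(bar, foo), foo
FROM pokes WHERE foo = 98 OR foo = 100;
```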
Here's how this looks in HBase (with different column names in different rows):
And when queried back into Hive:
Note that the key of the MAP must have datatype string, since it is used for naming the HBase column, so the following table definition will fail:
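As an illustration of that restriction, a definition along the following lines is expected to be rejected; the table and column names are hypothetical.

```sql
-- Expected to fail: the MAP key type is int, but HBase column names
-- can only be derived from string keys.
CREATE TABLE hbase_table_1(key int, value map<int,int>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:"
);
```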
Illegal: Hive Primitive to HBase Column Family
Table definitions such as the following are illegal because a Hive column mapped to an entire column family must have MAP type:
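A sketch of a definition that violates this rule; the table and column names are hypothetical.

```sql
-- Expected to fail: "cf:" maps an entire column family, but the Hive
-- column value is a primitive int rather than a MAP.
CREATE TABLE hbase_table_1(key int, value int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:"
);
```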
Example with binary columns
Relying on the default value of hbase.table.default.storage.type:
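Two sketches of binary storage, assuming Hive 0.9 or later. The table names hbase_binary_1 and hbase_binary_2 are hypothetical. The first leaves hbase.table.default.storage.type at its default (string) and marks individual columns with the #b suffix; the second sets the table default to binary and opts one column back into string storage with #s.

```sql
-- Default storage type is string; key and foobar are stored as binary.
CREATE TABLE hbase_binary_1(key int, value string, foobar double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key#b,cf:val,cf:foo#b"
);

-- Default storage type is binary; only value is stored as a string.
CREATE TABLE hbase_binary_2(key int, value string, foobar double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:val#s,cf:foo",
  "hbase.table.default.storage.type" = "binary"
);
```

Under these assumptions both tables should end up with the same on-disk encoding for each column; the difference is only in which side is spelled out explicitly.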
Simple Composite Row Keys
Hive can read and write delimited composite keys to HBase by mapping the HBase row key to a Hive struct, and using ROW FORMAT DELIMITED ... COLLECTION ITEMS TERMINATED BY to specify the delimiter. Example:
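A sketch of a delimited composite key. The table name delimited_example, the struct field names, the '~' delimiter and the mapping f:c1 are illustrative; because the table is external and no hbase.table.name is given, an HBase table named delimited_example with column family f is assumed to exist.

```sql
-- A row key such as "abc~def" is exposed as key.f1 = 'abc', key.f2 = 'def'.
CREATE EXTERNAL TABLE delimited_example(key struct<f1:string, f2:string>, value string)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,f:c1'
);
```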
Complex Composite Row Keys and HBaseKeyFactory
For more complex use cases, Hive allows users to specify an HBaseKeyFactory which defines the mapping of a key to fields in a Hive struct. This can be configured using the property "hbase.composite.key.factory" in the SERDEPROPERTIES option:
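A sketch of such a configuration. The table name complex, the struct fields and the mapping are illustrative, and the fully qualified class name assumes that the sample factory lives in the org.apache.hadoop.hive.hbase package of the handler's test code; substitute your own implementation as needed.

```sql
-- The key factory class decides how the row key bytes are split into the
-- struct fields f1 and f2.
CREATE EXTERNAL TABLE complex(key struct<f1:string, f2:string>, value string)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,f:c1',
  'hbase.composite.key.factory' = 'org.apache.hadoop.hive.hbase.SampleHBaseKeyFactory2'
);
```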
"hbase.composite.key.factory" should be the fully qualified class name of a class implementing HBaseKeyFactory. See SampleHBaseKeyFactory2 for a fixed length example in the same package. This class must be on your classpath in order for the above example to work. TODO: place these in an accessible place; they're currently only in test code.
When inserting into an HBase table using Hive, the HBase default timestamp is added, which is usually the current timestamp. This can be overridden on a per-table basis using the SERDEPROPERTIES option hbase.put.timestamp, which must be a valid timestamp or -1 to reenable the default strategy.
One subtle difference between HBase tables and Hive tables is that HBase tables have a unique key, whereas Hive tables do not. When multiple rows with the same key are inserted into HBase, only one of them is stored (the choice is arbitrary, so do not rely on HBase to pick the right one). This is in contrast to Hive, which is happy to store multiple rows with the same key and different values.
For example, the pokes table contains rows with duplicate keys. If it is copied into another Hive table, the duplicates are preserved:
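A sketch of that comparison, assuming pokes has columns foo INT and bar STRING. The table name pokes2 is hypothetical, and 498 stands in for any key that occurs more than once in pokes.

```sql
CREATE TABLE pokes2(foo INT, bar STRING);
INSERT OVERWRITE TABLE pokes2 SELECT * FROM pokes;

-- Both counts should be equal: the native Hive copy keeps every duplicate.
SELECT COUNT(1) FROM pokes WHERE foo = 498;
SELECT COUNT(1) FROM pokes2 WHERE foo = 498;
```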
But in HBase, the duplicates are silently eliminated:
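The same copy into an HBase-backed table can be sketched as follows. The table name pokes3 and the mapping cf:bar are hypothetical, and 498 again stands in for any key that occurs more than once in pokes.

```sql
CREATE TABLE pokes3(foo INT, bar STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:bar"
);
INSERT OVERWRITE TABLE pokes3 SELECT * FROM pokes;

-- At most one row per key survives, regardless of how many rows in pokes
-- shared that key.
SELECT COUNT(1) FROM pokes3 WHERE foo = 498;
```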
Another difference to note between HBase tables and other Hive tables is that when INSERT OVERWRITE is used, existing rows are not deleted from the table. However, existing rows are overwritten if they have keys which match new rows.
There are a number of areas where Hive/HBase integration could definitely use more love:
- more flexible column mapping (HIVE-806, HIVE-1245)
- default column mapping in cases where no mapping spec is given
- filter pushdown and indexing (see FilterPushdownDev and IndexDev)
- expose timestamp attribute, possibly also with support for treating it as a partition key
- allow per-table hbase.master configuration
- run profiler and minimize any per-row overhead in column mapping
- user defined routines for lookups and data loads via HBase client API (HIVE-758 and HIVE-791)
- logging is very noisy, with a lot of spurious exceptions; investigate these and either fix their cause or squelch them
Code for the storage handler is located under
HBase and Zookeeper dependencies are fetched via ivy.
Class-level unit tests are provided under
Positive QL tests are under hbase-handler/src/test/queries. These use an HBase+Zookeeper mini-cluster for hosting the fixture tables in-process, so no free-standing HBase installation is needed in order to run them. To avoid failures due to port conflicts, don't try to run these tests on the same machine where a real HBase master or zookeeper is running.
The QL tests can be executed via ant like this:
An Eclipse launch template remains to be defined.
- For information on how to bulk load data from Hive into HBase, see HBaseBulkLoad.
- For another project which adds SQL-like query language support on top of HBase, see HBQL (unrelated to Hive).
- Primary credit for this feature goes to Samuel Guo, who did most of the development work in the early drafts of the patch
Open Issues (JIRA)
|Key||Summary||Assignee||Reporter||Status||Resolution||Created||Updated|
|HIVE-10545||Implement predicate pushdown for queries over HBase snapshots||Unassigned||Andrew Mains||Open||Unresolved||Apr 30, 2015||Apr 30, 2015|
|HIVE-10491||Refactor HBaseStorageHandler::configureJobConf() and configureTableJobProperties||Swarnim Kulkarni||Ashutosh Chauhan||Open||Unresolved||Apr 26, 2015||May 14, 2015|
|HIVE-9591||Add support for OrderedByte encodings||Unassigned||Nick Dimiduk||Open||Unresolved||Feb 05, 2015||Feb 05, 2015|
|HIVE-8871||Hive Hbase Integration : Support for NULL value columns||Unassigned||Jasper Knulst||Open||Unresolved||Nov 14, 2014||Dec 17, 2014|
|HIVE-8267||Exposing hbase cell latest timestamp through hbase columns mappings to hive columns.||Unassigned||Muhammad Ehsan ul Haque||Patch Available||Unresolved||Sep 26, 2014||Nov 12, 2014|
|HIVE-8020||Add avro serialization support for HBase||Unassigned||Swarnim Kulkarni||Open||Unresolved||Sep 08, 2014||Sep 08, 2014|
|HIVE-7849||Support more generic predicate pushdown for hbase handler||Navis||Navis||Patch Available||Unresolved||Aug 22, 2014||Feb 05, 2015|
|HIVE-7805||Support running multiple scans in hbase-handler||Andrew Mains||Andrew Mains||Patch Available||Unresolved||Aug 20, 2014||Apr 26, 2015|
|HIVE-7566||HIVE can't count hbase NULL column value properly||Unassigned||Kent Kong||Open||Unresolved||Jul 31, 2014||Jul 31, 2014|
|HIVE-7534||remove reflection from HBaseSplit||Unassigned||Nick Dimiduk||Open||Unresolved||Jul 28, 2014||Jul 28, 2014|
|HIVE-7248||UNION ALL in hive returns incorrect results on Hbase backed table||Navis||Mala Chikka Kempanna||Patch Available||Unresolved||Jun 17, 2014||Dec 16, 2014|
|HIVE-7197||Enable and address flakiness of hbase_bulk.m||Unassigned||Nick Dimiduk||Open||Unresolved||Jun 09, 2014||Jun 09, 2014|
|HIVE-7179||hive connect to hbase cause select results error||Unassigned||zhengzhuangjie||Open||Unresolved||Jun 05, 2014||Jun 05, 2014|
|HIVE-7128||Add direct support for creating and managing salted hbase tables||Swarnim Kulkarni||Swarnim Kulkarni||In Progress||Unresolved||May 27, 2014||Dec 29, 2014|
|HIVE-7103||Add additional tests for HIVE-6411||Swarnim Kulkarni||Swarnim Kulkarni||Open||Unresolved||May 20, 2014||May 20, 2014|
|HIVE-7058||Cleanup HiveHBase*InputFormat||Unassigned||Nick Dimiduk||Open||Unresolved||May 14, 2014||May 14, 2014|
|HIVE-6195||Create unit tests to exercise behaviour when creating a HBase Table in Hive||Viraj Bhat||Viraj Bhat||Open||Unresolved||Jan 14, 2014||Nov 14, 2014|
|HIVE-5927||wrong start/stop key on hbase scan with inner join and where clause on id||Unassigned||Jan Van Besien||Open||Unresolved||Dec 03, 2013||Dec 03, 2013|
|HIVE-5277||HBase handler skips rows with null valued first cells when only row key is selected||Teddy Choi||Teddy Choi||Patch Available||Unresolved||Sep 12, 2013||Oct 12, 2013|
|HIVE-4765||Improve HBase bulk loading facility||Navis||Navis||Patch Available||Unresolved||Jun 20, 2013||May 09, 2015|