HBase Output Storage Driver - Design

Description

This twiki outlines the design of the HCatOutputStorageDriver for HBase. As an iterative approach the two drivers will be created: HBaseDirectOutputStorageDriver and HBaseBulkOutputStorageDriver.

HBaseDirectOutputStorageDriver makes use of the HBase client (HTable) to make random writes for each record that needs to be written. This driver will validate the fundamental changes and enhancements required to support HBase via HCatalog before moving on to the more complex implementation. Also it would be nice determine if this driver would be sufficient for some use cases.

HBaseBulkOutputStorageDriver makes use of HBase's bulk import facility wherein generated HFile's (HBase's internal storage format) are loaded onto their respective region servers. This is the solution we wish to achieve and will be mainly using in production environments.

HBaseDirectOutputStorageDriver

Diagram

Discussion

As one of the requirements for batch loading data onto HBase all revisions must be written with the same revision number to uniquely identify each batch update. Thus we have to add a new field to
OutputJobInfo which enables us to pass implementation specific parameters to the underlying storage driver. This method of passing application specific information is a non-invasive step which we will reevaluate once we have some deliverables running.

OutputJobInfo.java
  private Map<String,String> properties;

  /**
   * Set/Get Property information to be passed down to *StorageDriver implementation
   * put implementation specific storage driver configurations here
   * @return
   */
  public Map<String,String> getProperties() {
    return properties;
  }

Not Depicted in the diagram is a Constants class for storing the property keys relevant to this storage driver:

HBaseConstants.java
public class HBaseConstants {

  public static final String CONF_OUTPUT_VERSION_KEY = HCatConstants.HCAT_DEFAULT_TOPIC_PREFIX+".hbase.outputVersion";

   ....

}

HBaseDirectStorageDriver itself is a pretty straightforward implementation. HBaseDirectOutputFormat decorates HBase's TableOutputFormat or we can implement one ourselves controlling the client directly enabling us better flexibility with tuning ie disabling WAL for higher write rates. This OutputFormat's key is not used and the Value can only be an HBase Put.

HCatDirectOutputFormat.java
public class HBaseDirectOutputFormat extends OutputFormat<WritableComparable<?>,Writable> implements Configurable {
....
}

One of the main HCat changes that's needed to be made at this stage is it's "assumption" that table are always stored as files HCatOutputFormat and HCatOutputCommitter makes such assumptions such as checking the existence of a path to verify the existence of a partition.

ResultConverter is a utility class which can be shared by both the input and output storage driver converting data from HCat to native HBase forms. We can initially make use of Hive's HBaseSerDe to power this class and later on move to our own implementation. Though it is not clear whether HCat driver developers are expected to use Hive SerDes to maintain interoperability with Hive.

HBaseBulkOutputStorageDriver

Diagram

Discussion

With this driver data is first stored as a sequencefile using HBaseBulkOutputFormat and a second job is run to sort and write data as HBase region server files (HFileOutputFormat) and finally instruct region servers to "bulkImport" their respective regions.

HBaseBulkOutputFormat also does not use the key field and the value can only be a Put. These classes implements Writable so no extra work is needed to serialize them.

HBaseBulkOutputFormat.java
public class HBaseBulkOutputFormat extends OutputFormat<WritableComparable<?>,Writable> implements Configurable {
...
}

ImportSequenceFile is the MR job which does the actual bulk import. It's tasks involves sorting and partitioning the data correctly and finally loading the partition onto their respective region servers. A good reference implementation for this Class is HBase's ImportTSV which does bulk imports on TSV files.

HBaseBulkOutputCommitter is used to update the MetaStore the final status of a HBaseBulkOutputFormat write via the thrift client. Essentially ending in either a run of ImportSequenceFile on success or invalidation of the revision on failure. This commit task is done only after the baseCommitTask has completed.

HBaseBulkOutputCommiter
public class HBaseBulkOutputCommiter extends OutputCommitter {
    private OutputCommitter baseOutputCommiter;

    public HBaseBulkOutputCommiter (OutputCommitter outputCommitter) {
        this.baseOutputCommiter = outputCommitter;
    }
...
}
  • No labels