Hadoop-compatible Input-Output Format for Hive

Overview

This is a proposal for adding an API to Hive that allows reading and writing through a Hadoop-compatible API. Specifically, the interfaces being implemented are Hadoop's InputFormat and OutputFormat.

The classes will be named HiveApiInputFormat and HiveApiOutputFormat.

InputFormat (reading from Hive)

Usage:

  1. Create a HiveInputDescription object
  2. Fill it with information about the table to read from
  3. Initialize HiveApiInputFormat with that information
  4. Go to town using HiveApiInputFormat with your Hadoop-compatible reading system.
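The four steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the proposed API: InputDescriptionSketch and InputFormatSketch are hypothetical stand-ins for HiveInputDescription and HiveApiInputFormat, a plain Map stands in for the Hadoop job configuration, and the field names, key prefix, table name, and filter value are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for HiveInputDescription; field names are assumptions.
class InputDescriptionSketch {
    String database = "default";
    String table = "";
    List<String> columns = new ArrayList<>();  // empty list means "all columns" in this sketch
    String partitionFilter = "";               // empty string means "all partitions"
}

// Hypothetical stand-in for HiveApiInputFormat; a Map plays the role of the job configuration.
class InputFormatSketch {
    static final String PREFIX = "sketch.hive.input.";

    // Step 3: copy the description into the configuration so readers can find it later.
    static void initialize(Map<String, String> conf, InputDescriptionSketch desc) {
        conf.put(PREFIX + "database", desc.database);
        conf.put(PREFIX + "table", desc.table);
        conf.put(PREFIX + "columns", String.join(",", desc.columns));
        conf.put(PREFIX + "partitionFilter", desc.partitionFilter);
    }

    // Step 4 (reading side): summarize what a reader would scan, based only on the configuration.
    static String describeScan(Map<String, String> conf) {
        String cols = conf.get(PREFIX + "columns");
        String filter = conf.get(PREFIX + "partitionFilter");
        return "SELECT " + (cols.isEmpty() ? "*" : cols)
            + " FROM " + conf.get(PREFIX + "database") + "." + conf.get(PREFIX + "table")
            + (filter.isEmpty() ? "" : " WHERE " + filter);
    }
}

class UsageSketch {
    static String run() {
        InputDescriptionSketch desc = new InputDescriptionSketch();  // step 1: create the description
        desc.database = "default";                                    // step 2: fill it in
        desc.table = "page_views";
        desc.columns = Arrays.asList("user_id", "url");
        desc.partitionFilter = "ds='2013-01-01'";

        Map<String, String> conf = new HashMap<>();
        InputFormatSketch.initialize(conf, desc);                     // step 3: initialize the input format
        return InputFormatSketch.describeScan(conf);                  // step 4: hand off to the reader
    }

    public static void main(String[] args) {
        // prints: SELECT user_id,url FROM default.page_views WHERE ds='2013-01-01'
        System.out.println(run());
    }
}
```

The point of the sketch is the flow of information: the description is written into the configuration once, and everything on the reading side works from the configuration alone.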

More detailed information:

  • The HiveInputDescription describes the database, table and columns to select. It also has a partition filter property that can be used to read from only the partitions that match the filter statement.
  • HiveApiInputFormat supports reading from multiple tables through the concept of profiles. Each profile stores its input description in a separate section, and HiveApiInputFormat has a member that tells it which profile to read from. When you initialize HiveApiInputFormat with an input description, you can pair that description with a profile. If no profile is selected, a default profile is used.
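One way to picture the profile mechanism is a per-profile key prefix in the job configuration. The sketch below is an assumption about how this could work, not the proposal's implementation: ProfiledInputFormatSketch, its method names, the key layout, and the "default-profile" identifier are all hypothetical, and a plain Map again stands in for the Hadoop configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for HiveApiInputFormat's profile handling; names and key layout are assumptions.
class ProfiledInputFormatSketch {
    static final String DEFAULT_PROFILE = "default-profile";

    // The member that tells this instance which profile to read from.
    private String profileId = DEFAULT_PROFILE;

    void setProfileId(String profileId) {
        this.profileId = profileId;
    }

    // Each profile gets its own section of the configuration via a key prefix.
    private static String key(String profileId, String field) {
        return "sketch.hive.input." + profileId + "." + field;
    }

    // Pair an input table with a named profile at initialization time.
    static void initProfile(Map<String, String> conf, String profileId,
                            String database, String table) {
        conf.put(key(profileId, "database"), database);
        conf.put(key(profileId, "table"), table);
    }

    // For callers that do not select a profile: the default profile is used.
    static void initDefaultProfile(Map<String, String> conf, String database, String table) {
        initProfile(conf, DEFAULT_PROFILE, database, table);
    }

    // Resolve the table this instance would read, based on its profile member.
    String tableToRead(Map<String, String> conf) {
        return conf.get(key(profileId, "database")) + "." + conf.get(key(profileId, "table"));
    }
}

class ProfileDemo {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        ProfiledInputFormatSketch.initProfile(conf, "users", "default", "users");
        ProfiledInputFormatSketch.initProfile(conf, "clicks", "logs", "clicks");
        ProfiledInputFormatSketch.initDefaultProfile(conf, "default", "events");

        ProfiledInputFormatSketch clicksReader = new ProfiledInputFormatSketch();
        clicksReader.setProfileId("clicks");
        System.out.println(clicksReader.tableToRead(conf));                     // prints: logs.clicks
        System.out.println(new ProfiledInputFormatSketch().tableToRead(conf));  // prints: default.events
    }
}
```

Because the sections do not share keys, several input formats can be configured in the same job without overwriting each other, which is what makes multi-table reads possible in this picture.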

Future plans:

  • Lots of performance work, including exposing more direct, byte[]-style access to the data.
  • Filtering of rows returned.

OutputFormat (writing to Hive)

TODO
