
...

Hive’s metastore has long been used by other projects in the Hadoop ecosystem to store and access metadata. Apache Impala, Apache Spark, Apache Drill, Presto, and other systems all use Hive’s metastore. Some, like Impala and Presto, can use it as their own metadata system with the rest of Hive not present.

...

Any Hive PMC member or committer will be welcome to join the new project at the same level. We propose that this project become a top-level project directly. Given that the initial PMC will be formed from experienced Hive PMC members, we do not believe incubation will be necessary. (Note that the Apache board will need to approve this.)

More Details

Use Cases

As noted above, the metastore will continue to focus on being a metadata system for SQL systems like Hive and Impala. It should also be able to store metadata for streaming data, such as data stored in Kafka. Additionally, it should be easy to use from machine learning systems and other systems that need to access the data stored in SQL engines, streams, etc.

A use case that we are not initially targeting is the larger area of a full data catalog, storing information such as lineage, user tags, etc. with support for end user discoverability and interaction.

Supporting these use cases will drive requirements such as:

The ability to support various big data engines and frameworks, including relational, batch, and streaming
The ability to scale to support a system with petabytes of data and thousands of users and their jobs
High reliability and/or fault tolerance
The ability to support multiple co-located systems (e.g. multiple Hive instances in one cloud, or Impala and a streaming system in the same on-premises facility)
Low response time (< 200ms) to support interactive and high throughput systems
Support for transactional SQL systems
Support for versioned schemas
The ability to work on premises and in the cloud
Maintaining backwards compatibility (including specifying public versus private APIs). This will be very important as the metastore already has a significant user community.

...

Moving the code from Hive into a new project is not straightforward and will take some time. The following steps are proposed:

A new TLP is established. As mentioned above, any existing Hive PMC members will be welcome to join the PMC, and any existing Hive committers will be granted committership in the new project.
Hive begins the process of detangling the metastore code inside the Hive project. This will be done inside Hive to avoid a period in which the code lives in both Hive and the new project, which would require patching every new feature or bug fix twice.
In order to enable the new project to begin adding layers around the core metastore and make releases, Hive can make source-only releases of only the metastore code during this interim period, similar to how the storage-api is released now. The new project can then depend on those releases.
Once the detangling is complete and Hive is satisfied that the result works, the code will be moved from Hive to the new project.

There are many technical questions about how to separate out the code. These mainly center on which pieces of code should be moved into the new project, and on whether the new project continues to depend on Hive’s storage-api (as ORC does today) or copies any code that both it and Hive require (such as parts of the shim layer) in order to avoid any Hive dependencies. There are also places where the metastore "calls back" into QL via reflection (e.g. partition expression evaluation). We will need to determine how to continue this without pulling a dependency on all of Hive into the new project. Discussions and decisions on this will happen throughout the process via the normal methods.
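To illustrate the reflection "call back" pattern mentioned above, here is a minimal sketch of how a metastore module could own a small interface and load the engine-side implementation by class name at runtime, so that it needs no compile-time dependency on the query engine. All names here (`ReflectiveCallbackSketch`, `PartitionFilter`, `PrefixFilter`, `load`) are hypothetical and chosen for illustration only. This is not the actual Hive code, and the prefix match merely stands in for real partition expression evaluation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ReflectiveCallbackSketch {

    /** Hypothetical interface that the metastore module would own. */
    public interface PartitionFilter {
        List<String> filter(List<String> partitionNames, String expression);
    }

    /**
     * Hypothetical implementation that would live in the query engine.
     * A simple prefix match stands in for real expression evaluation.
     */
    public static class PrefixFilter implements PartitionFilter {
        @Override
        public List<String> filter(List<String> partitionNames, String expression) {
            return partitionNames.stream()
                    .filter(name -> name.startsWith(expression))
                    .collect(Collectors.toList());
        }
    }

    /**
     * Loads an implementation by class name. The caller only needs the
     * interface on its compile-time classpath; the implementation class
     * is resolved at runtime.
     */
    public static PartitionFilter load(String className) throws ReflectiveOperationException {
        Class<?> raw = Class.forName(className);
        // asSubclass fails fast if the named class does not implement the interface.
        return raw.asSubclass(PartitionFilter.class)
                  .getDeclaredConstructor()
                  .newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: in a real deployment the class name would come from
        // configuration rather than from a class literal.
        PartitionFilter filter = load(PrefixFilter.class.getName());
        List<String> partitions =
                Arrays.asList("ds=2017-01-01", "ds=2017-01-02", "ds=2018-01-01");
        System.out.println(filter.filter(partitions, "ds=2017"));
    }
}
```

With this shape, the open question in the paragraph above becomes where the interface lives and how the implementation class name is supplied, rather than whether the new project must depend on all of Hive.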

Backwards Compatibility

There are already many users of the Hive metastore outside of Hive. We do not want to break backwards compatibility for those users. Our goal will be to make sure there is a binary-compatible metastore client available for these users that will support interoperation across versions of the metastore, both in Hive and as a standalone system. Another possible approach is to ensure that the Thrift interface continues to accept old clients (e.g. Hive 1.x and 2.x), rather than focusing on binary or source compatibility of the Hive client itself.

Project Name

The following have been suggested as a name for this project:

Flora
Honeycomb
Metastore (NOTE: there are concerns that this would be too generic for Apache to defend the trademark and that it would not be clear enough to users that this was no longer just the Hive metastore)
Omegastore
Riven
ZCatalog
