This roadmap is proposed and has not yet been accepted/approved by the HCatalog community.
Hive today provides its users with a simple and familiar database-like tabular model of data management. HCatalog seeks to generalize this table model so that Hive tables become Hadoop tables: tables that can be backed by HDFS or by alternate storage systems such as cloud stores, NoSQL stores, or databases, and that can be easily used by all current and future Hadoop programming and data management frameworks, including MapReduce, Pig, Streaming, and many others. HCatalog APIs are being created to enable future data management frameworks to provide data migration, replication, transformation, archival, and other services.
Goals:
1. Enable sharing of Hive table data between diverse tools.
2. Present users of these tools with an abstraction that insulates them from the details of where and in what format data and metadata are stored.
3. Provide APIs to enable tools that manage the lifecycle of data in Hadoop.
4. Provide APIs to external systems and external users that allow them to interact efficiently with Hive table data in Hadoop. This includes creating, altering, removing, exploring, reading, and writing table data.
5. Provide APIs that allow Hive and other HCatalog clients to transparently connect to external data stores and use them as Hive tables (e.g. S3, HBase, or any database or NoSQL store could be used to store a Hive table).
6. Support data in all its forms in Hadoop. This includes structured, semi-structured, and unstructured data, as well as handling schema transitions over time and HBase- or MongoDB-like tables where each row can present a different set of fields.
7. Provide a shared data type model across tools that includes the data types users expect in modern SQL.
8. Embrace the Hive security model, while extending it to provide the needed protection to Hadoop data accessed via any tool or UDF.
9. Provide tables that can accept streams of records and/or row updates efficiently.
10. Provide a registry of SerDes, InputFormats, OutputFormats, and StorageHandlers that allows HCatalog clients to reference any Hive table in any storage format or on any data source without needing to install code on their machines.
11. Provide a registry of UDFs and table functions that allows clients to utilize registered UDFs from compatible tools by invoking them by name.
The numbers following each feature below refer to the goals that feature serves.
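As a concrete illustration of the shared type model goal above, the sketch below shows one direction of the Hive-to-Pig type correspondence that HCatLoader exposes. This is illustrative Python, not HCatalog code; the mapping pairs follow the documented Hive/Pig correspondences (e.g. Hive string to Pig chararray, Hive array to Pig bag).

```python
# Illustrative sketch (not HCatalog source): a shared type model expressed
# as a mapping from Hive column types to their Pig counterparts.
HIVE_TO_PIG = {
    "int": "int",
    "bigint": "long",
    "float": "float",
    "double": "double",
    "string": "chararray",
    "binary": "bytearray",
    "array": "bag",
    "map": "map",
    "struct": "tuple",
}

def pig_type(hive_type: str) -> str:
    """Translate a Hive column type to its Pig equivalent."""
    try:
        return HIVE_TO_PIG[hive_type]
    except KeyError:
        raise ValueError(f"no Pig mapping for Hive type {hive_type!r}")

print(pig_type("string"))  # chararray
print(pig_type("array"))   # bag
```

A real implementation must also translate in the other direction (for HCatStorer) and recurse into complex types; the table above only covers the top-level names.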
Current Features Based On Goals:
- HCatLoader and HCatStorer functions for Pig that provide a table abstraction to Pig users. These functions also provide Pig access to the data schema and a mapping between Hive's and Pig's type models. 1,2
- HCatInputFormat and HCatOutputFormat for MapReduce users that provide a table abstraction. These classes also provide MapReduce programs access to the schema and a defined type model. 1,2
- Use of Hive SerDes and MapReduce InputFormats and OutputFormats to minimize the need for developers to write new code and to maximize sharing of code between different tools. 1
- The REST interface to the metadata server that allows new data management frameworks to create, update, delete and explore Hive tables. 4
- HCatReader and HCatWriter interfaces that allow parallel reads and writes of records in and out of tables where the parallelism is determined by the reader or writer rather than by the Hadoop system. HCatReader allows readers to push down partition pruning predicates and column projections. 4
- Support for storing data in binary format without interpreting or translating it. 6
- Support Kerberos based authentication of user identity. 8
- Support for adding columns to partitions without requiring restatement of existing stored data. 6
- Support for presenting HBase tables as Hive tables. 5
Envisioned Future Features Based on Goals:
- Access to statistics concerning data sets via HCatLoader and the ability to generate and store statistics via HCatStorer. 1
- Access to statistics concerning data sets via HCatInputFormat and the ability to generate and store statistics via HCatOutputFormat. 1
- Ability for MapReduce streaming users to read data from and write data to HCatalog tables. Schema information should also be communicated via environment variables. 1,2
- A REST interface that will allow parallel read and write of records where the parallelism is determined by the reader or writer. This interface must also support partition pruning predicates, simple predicates (equality, inequality, is/is not null, boolean) on columns, and column projections. 4
- APIs for and reference implementations of data lifecycle management tools such as cleaners that remove old data, archivers that archive data by reformatting it on HDFS (partition aggregation, erasure coding, compression, ...) or by relocating it to another storage system, and replication tools that mirror data between clusters. These APIs will be in REST or a similar language-independent format. 3
- Allowing the user to assert a desired schema at the time the data is read. For formats where the schema is stored in HCatalog's metadata, this will mean merging the known schema with the user-asserted schema. Where the schema is not stored in the metadata but is stored in the data, it will mean merging the user-asserted schema with the schema stored in the data. For data where the schema is stored neither in the metadata nor in the data, it will mean parsing the data in a user-provided way (such as CSV). This schema merging and parsing must gracefully handle the case where columns specified by the user are missing. It must not assume that every record will have the same fields in the same order, and it must allow the user to specify what action to take when a desired field is missing (e.g. insert a null, discard the row, fail). 6
- Provide an authorization model that allows finer-grained access to data than the current storage-based model, while not running user-provided code effectively as super-user. In the case of HDFS-stored data, this finer-grained control means not relying on POSIX group semantics for file access but rather allowing individual users to be granted specific access rights on a table or partition. It also means providing columnar and potentially row-wise access controls. 8
- Support transition of column types over time without requiring restatement of existing data. Not all type transitions would be supported, but many should be (such as integer to long, long to floating point, integer or long to fixed point, etc.). 6
- Add support for a fixed point type. 7
- Connect the boolean and datetime types that already exist in Hive and Pig via HCatalog. 7
- Expand support for connecting to HBase tables to include the ability to alter tables and push down non-key predicates. 5
- Implement StorageHandlers to connect HCatalog to the metadata of RDBMSs, NoSQL stores, etc. 5
- Add a REGISTER FUNCTION command to Hive (in addition to the existing REGISTER TEMP FUNCTION) and devise a way for Hive to store code associated with those functions. 11
- Integrate Pig and MR with registered functions to allow them to make use of the code stored by users. 11
- Ability to import and export metadata into and out of HCatalog. Some support for this was added in version 0.2 but is currently broken. This would let users back up HCatalog server metadata and import it into another cluster, rather than having to replay add/alter-partition operations. For example: to move a project with over a year's worth of data from one cluster to another, copying the data is easily done with distcp (by copying the top-level table directory), but copying the metadata is more cumbersome.
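The streaming feature above calls for passing schema information through environment variables. The sketch below shows one plausible encoding; the variable name HCAT_SCHEMA and the JSON layout are hypothetical, chosen only to make the idea concrete.

```python
# Sketch of handing a table schema to a streaming task via the environment.
# The variable name HCAT_SCHEMA and the JSON encoding are hypothetical.
import json
import os

# The framework would set this before launching the streaming executable.
os.environ["HCAT_SCHEMA"] = json.dumps(
    [{"name": "user", "type": "string"}, {"name": "clicks", "type": "int"}])

def read_schema():
    """What a streaming mapper/reducer would do on startup."""
    return json.loads(os.environ["HCAT_SCHEMA"])

print([c["name"] for c in read_schema()])  # ['user', 'clicks']
```

An environment variable works for streaming because the child process gets a copy of the launcher's environment for free, with no extra protocol between the framework and the user's script.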
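The user-asserted-schema feature above specifies three policies when a requested field is missing: insert a null, discard the row, or fail. A minimal sketch of that reconciliation step (function and parameter names are illustrative, not a real API):

```python
# Sketch of reconciling a record against a user-asserted schema, with a
# per-read policy for missing fields. Names here are illustrative.
def apply_schema(record, schema, on_missing="null"):
    """schema: list of column names; on_missing: 'null', 'discard', or 'fail'."""
    out = {}
    for col in schema:
        if col in record:
            out[col] = record[col]
        elif on_missing == "null":
            out[col] = None
        elif on_missing == "discard":
            return None  # caller drops the whole row
        else:
            raise KeyError(f"missing field {col!r}")
    return out

print(apply_schema({"a": 1}, ["a", "b"], on_missing="null"))
# {'a': 1, 'b': None}
```

Note the function never assumes records share a field order or field set, which is the requirement for HBase- or MongoDB-style tables where every row may look different.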
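The column type transitions described above are widening-only: old data stays as written, and readers promote it to the new type. A sketch of such a compatibility check, using the transitions the roadmap names (the table itself is an assumption about which transitions would ultimately be allowed):

```python
# Sketch of a widening-only type transition check (per the roadmap:
# integer to long, long to floating point, etc.). The exact set of
# permitted transitions is an assumption, not a committed design.
WIDENS_TO = {
    "int": {"bigint", "float", "double", "decimal"},
    "bigint": {"float", "double", "decimal"},
    "float": {"double"},
}

def can_transition(old_type, new_type):
    """True if a column may change from old_type to new_type without restatement."""
    return old_type == new_type or new_type in WIDENS_TO.get(old_type, set())

print(can_transition("int", "bigint"))  # True
print(can_transition("double", "int"))  # False
```

Narrowing transitions (e.g. double to int) are rejected because already-written values could not be represented faithfully without restating the data.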
Features Under Discussion
This section contains features the HCatalog community has discussed but has not yet committed to adding to HCatalog. These features may be added to HCatalog, or it may be determined that they belong in different projects or tools.
- Process data from one cluster and store it into another, i.e. HCatLoader reads from one cluster's metastore server and HCatStorer writes to another's.
- A metastore server that can handle metadata for multiple clusters, so that one HCatalog instance could serve a group of co-located clusters rather than one instance existing per cluster.
- Ability to store data provenance/lineage in addition to statistics on the data.
- Ability to discover data. For example, if a new user needs to know where click data for search ads is stored, today they must dig through wikis or mailing lists to find exactly where it lives. Users need the ability to query by keyword or by producer of the data and find which table contains it.
- Consolidate the many service components (HiveMetaStore, the Hive Thrift service, HiveWebInterface, WebHCat) into a smaller number to simplify running HCatalog in production.
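The data discovery idea above amounts to a keyword index over table metadata. A toy sketch (table names and descriptions below are made up; a real implementation would index the metastore's table comments and ownership information):

```python
# Toy sketch of keyword-based data discovery over table metadata.
# Table names and descriptions are invented for illustration.
TABLES = {
    "search_ad_clicks": "click data for search ads, partitioned by dt",
    "web_logs": "raw frontend access logs",
}

def find_tables(keyword):
    """Return table names whose name or description mentions the keyword."""
    kw = keyword.lower()
    return sorted(
        name for name, desc in TABLES.items()
        if kw in name.lower() or kw in desc.lower())

print(find_tables("click"))  # ['search_ad_clicks']
```

This replaces the "search the wiki, mail the list" workflow with a single metadata query, which is why it fits naturally next to the metastore rather than in a separate tool.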