Last updated: 7/31/2020
- To modernize Hive Metastore’s interface with a state-of-the-art serving layer based on gRPC while also keeping it backwards compatible with Thrift for minimal upgrade toil;
- To achieve this by adding a proxy layer between the Thrift interface and a new gRPC interface that translates requests and responses in memory;
- To expand the Hive client to work with Hive Metastore server in both gRPC and Thrift mode.
Hive Metastore is the central metadata repository for Apache Hive and for other engines such as Presto and Spark. It stores metadata for tables (e.g., schema, location, and statistics) and partitions in a relational database, and gives clients access to this information through a Thrift Metastore API.
Providing gRPC as an option to access Metastore brings several benefits. Compared to Thrift, gRPC supports streaming, which can improve performance for large requests. In addition, it can be extended with more advanced authentication features and is fully compatible with Google’s IAM service, which supports fine-grained permission checks. This proposal sketches a path to integrate gRPC with Hive Metastore.
The overall design of gRPC support in Hive Metastore is illustrated in Figure 1. On the server side, depending on user configuration, the Hive Metastore server can listen on a port for Thrift or gRPC requests. The lifecycle of a Thrift request is unchanged. For a gRPC request, the new HiveMetastoreGrpcServer translates the incoming gRPC request into a Thrift request, transparently passes it to HiveMetastoreThriftServer, and translates the response back into gRPC.
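The request path described above can be sketched in plain Java. This is a minimal illustration, not the proposed implementation: every type here (ThriftTable, GrpcGetTableRequest, GrpcGetTableResponse, ThriftHandler, GrpcProxySketch) is a hypothetical stand-in for the Thrift-generated and protobuf-generated classes a real build would use, and no gRPC runtime is involved.

```java
import java.util.HashMap;
import java.util.Map;

// All types below are simplified, hypothetical stand-ins for generated classes.

/** Stand-in for a Thrift table struct. */
final class ThriftTable {
    final String dbName;
    final String tableName;
    final String location;

    ThriftTable(String dbName, String tableName, String location) {
        this.dbName = dbName;
        this.tableName = tableName;
        this.location = location;
    }
}

/** Stand-in for a protobuf request message. */
final class GrpcGetTableRequest {
    final String dbName;
    final String tableName;

    GrpcGetTableRequest(String dbName, String tableName) {
        this.dbName = dbName;
        this.tableName = tableName;
    }
}

/** Stand-in for a protobuf response message. */
final class GrpcGetTableResponse {
    final String dbName;
    final String tableName;
    final String location;

    GrpcGetTableResponse(String dbName, String tableName, String location) {
        this.dbName = dbName;
        this.tableName = tableName;
        this.location = location;
    }
}

/** Stand-in for the existing Thrift-facing handler, whose lifecycle is unchanged. */
interface ThriftHandler {
    ThriftTable getTable(String dbName, String tableName);
}

/** In-memory handler used only to exercise the sketch. */
final class InMemoryThriftHandler implements ThriftHandler {
    private final Map<String, ThriftTable> tables = new HashMap<>();

    void addTable(ThriftTable table) {
        tables.put(table.dbName + "." + table.tableName, table);
    }

    @Override
    public ThriftTable getTable(String dbName, String tableName) {
        ThriftTable table = tables.get(dbName + "." + tableName);
        if (table == null) {
            throw new IllegalArgumentException("No such table: " + dbName + "." + tableName);
        }
        return table;
    }
}

/** Proxy sketch: gRPC request in, in-memory Thrift call, gRPC response out. */
final class GrpcProxySketch {
    private final ThriftHandler handler;

    GrpcProxySketch(ThriftHandler handler) {
        this.handler = handler;
    }

    GrpcGetTableResponse getTable(GrpcGetTableRequest request) {
        // Translate the gRPC request into Thrift arguments and call the handler in memory.
        ThriftTable table = handler.getTable(request.dbName, request.tableName);
        // Translate the Thrift result back into a gRPC response.
        return new GrpcGetTableResponse(table.dbName, table.tableName, table.location);
    }
}
```

Because the proxy only depends on the handler interface, the Thrift code path stays untouched and the translation logic can be tested without opening a network port.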
The implementation details are described in the following sections.
Pluggable gRPC Support
To keep Hive Metastore loosely coupled to the gRPC layer, we propose a pluggable design: the Hive Metastore repository implements only a hook, while the gRPC proxy library lives in a separate repository. To enable the gRPC server, a user sets “metastore.custom.server.class” in the Hive configuration to the fully qualified class name of the server in the gRPC library. Hive Metastore then instantiates this class and starts the gRPC server, as described below. Here is an example of a similar pluggable library in Hive.
The gRPC layer on the client side is implemented similarly in the separate repository. The Hive repository needs changes to load the gRPC Hive client when it is enabled by configuration. For example, both SessionHiveMetastoreClient.java and RetryingMetastoreClient.java can be amended to dynamically load the Hive Metastore gRPC client if metastore.uris starts with “grpc://”.
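A minimal sketch of such a hook, assuming the configured value is a class name with a no-argument constructor. The CustomMetastoreServer interface, the CustomServerLoader class, and the use of java.util.Properties in place of Hive's configuration object are all assumptions made for illustration; only the “metastore.custom.server.class” key comes from this proposal.

```java
import java.util.Properties;

/** Hypothetical hook interface that a pluggable server would implement. */
interface CustomMetastoreServer {
    void start();
}

/** Sketch of the hook on the Hive Metastore side. */
final class CustomServerLoader {
    static final String SERVER_CLASS_KEY = "metastore.custom.server.class";

    /** Returns an instance of the configured server class, or null if the key is unset. */
    static CustomMetastoreServer load(Properties conf) {
        String className = conf.getProperty(SERVER_CLASS_KEY);
        if (className == null || className.trim().isEmpty()) {
            return null; // Nothing configured: only the Thrift server runs.
        }
        try {
            // Load the class by name and require that it implements the hook interface.
            Class<? extends CustomMetastoreServer> serverClass =
                Class.forName(className.trim()).asSubclass(CustomMetastoreServer.class);
            return serverClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }
}

/** Toy server used only to exercise the loader. */
class DemoGrpcServer implements CustomMetastoreServer {
    boolean started;

    @Override
    public void start() {
        started = true;
    }
}
```

With this shape, the Hive Metastore repository only needs to know the hook interface; the server class itself can ship in the separate gRPC library.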
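The scheme check on the client side could look roughly like the following. ClientTransport and ClientTransportSelector are hypothetical names; the sketch follows the rule stated above (inspect the start of metastore.uris) and additionally assumes that the scheme should be matched case-insensitively and that anything else falls back to Thrift.

```java
import java.util.Locale;

/** Hypothetical marker for which client implementation to load. */
enum ClientTransport { THRIFT, GRPC }

final class ClientTransportSelector {
    private static final String GRPC_PREFIX = "grpc://";

    /** Picks the transport from the raw metastore.uris value. */
    static ClientTransport select(String metastoreUris) {
        if (metastoreUris == null) {
            return ClientTransport.THRIFT; // Assumption: default to the existing Thrift client.
        }
        // Assumption: ignore surrounding whitespace and the case of the scheme.
        String normalized = metastoreUris.trim().toLowerCase(Locale.ROOT);
        return normalized.startsWith(GRPC_PREFIX) ? ClientTransport.GRPC : ClientTransport.THRIFT;
    }
}
```

A caller would branch on the result to decide whether to load the gRPC client class dynamically or to keep the existing Thrift client.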
Hive Metastore Server
The following assumes modifications to the standalone-metastore package.
As shown in Figure 1, green elements are newly added classes, while yellow elements are modified from the current design.
Because the server can now expose a new reachable endpoint, additional hive-site.xml configuration values are required. The currently proposed values are listed below.
- metastore.grpc.service.account.keyfile - string, optional; the path to the JSON keyfile of the service account that the Metastore server will run as.
- metastore.grpc.authentication.class - string, optional; the gRPC server will use this class to perform authn/authz on gRPC requests.
- The detailed implementation of auth support is not in scope for this design proposal.
- Additional gRPC server configs: maximum request size, maximum connections, port, etc.
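As an illustration, a hive-site.xml fragment using the properties named in this proposal might look like the following. The class names and the keyfile path are placeholders, not real artifacts, and the additional server settings (request size, connections, port) are left out because their property names are not defined here.

```xml
<configuration>
  <!-- Placeholder class name: the real one depends on the separate gRPC library. -->
  <property>
    <name>metastore.custom.server.class</name>
    <value>org.example.metastore.grpc.HiveMetaStoreGrpcServer</value>
  </property>
  <!-- Optional. Placeholder path to the service account JSON keyfile. -->
  <property>
    <name>metastore.grpc.service.account.keyfile</name>
    <value>/etc/hive/conf/metastore-service-account.json</value>
  </property>
  <!-- Optional. Placeholder authn/authz class. -->
  <property>
    <name>metastore.grpc.authentication.class</name>
    <value>org.example.metastore.grpc.ExampleAuthenticator</value>
  </property>
</configuration>
```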
Hive Metastore Client
A Hive Metastore that supports gRPC requests is useful even without a matching client, but it would be helpful for the Hive client to support gRPC communication with Hive Metastore as well. The changes are similar to those in the previous section but are described separately for clarity.
The getTable method, for example, is already defined in the metastore server spec, so the client only needs to handle the conversion of Thrift objects to gRPC and decide which gRPC method to call.
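The client-side conversion for getTable can be sketched as below. All types are hypothetical stand-ins: TableClient represents the small slice of the client interface used here, MetastoreStub represents a generated gRPC stub, and TableStruct, GetTableRpcRequest, and GetTableRpcResponse represent the Thrift and protobuf messages.

```java
// All types below are simplified, hypothetical stand-ins for generated classes.

/** Stand-in for the Thrift table object that existing callers expect. */
final class TableStruct {
    final String dbName;
    final String tableName;
    final String location;

    TableStruct(String dbName, String tableName, String location) {
        this.dbName = dbName;
        this.tableName = tableName;
        this.location = location;
    }
}

/** Stand-in for a protobuf request message. */
final class GetTableRpcRequest {
    final String dbName;
    final String tableName;

    GetTableRpcRequest(String dbName, String tableName) {
        this.dbName = dbName;
        this.tableName = tableName;
    }
}

/** Stand-in for a protobuf response message. */
final class GetTableRpcResponse {
    final String dbName;
    final String tableName;
    final String location;

    GetTableRpcResponse(String dbName, String tableName, String location) {
        this.dbName = dbName;
        this.tableName = tableName;
        this.location = location;
    }
}

/** Stand-in for a generated gRPC stub. */
interface MetastoreStub {
    GetTableRpcResponse getTable(GetTableRpcRequest request);
}

/** Stand-in for the part of the client interface used in this sketch. */
interface TableClient {
    TableStruct getTable(String dbName, String tableName);
}

/** Client sketch: Thrift-style call in, gRPC call on the wire, Thrift object out. */
final class GrpcTableClientSketch implements TableClient {
    private final MetastoreStub stub;

    GrpcTableClientSketch(MetastoreStub stub) {
        this.stub = stub;
    }

    @Override
    public TableStruct getTable(String dbName, String tableName) {
        // Build the gRPC request from the Thrift-style arguments and pick the matching RPC.
        GetTableRpcResponse response = stub.getTable(new GetTableRpcRequest(dbName, tableName));
        // Convert the gRPC response back into the object that existing callers expect.
        return new TableStruct(response.dbName, response.tableName, response.location);
    }
}
```

Callers keep using the same method signature, so code above the client interface does not need to know which transport is in use.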
Similar to the server configuration changes, a user can populate the following field to use a gRPC-enabled client:
- metastore.uris (reuse existing field) - string; the socket address of the listening gRPC server. Multiple addresses can be separated by commas, in which case one is chosen at random for load balancing. gRPC connections should be prefixed with grpc://
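Parsing such a value and choosing one endpoint at random could be sketched as follows. MetastoreUriPicker is a hypothetical helper; it assumes that entries are well-formed URIs, that whitespace around commas should be ignored, and that non-gRPC entries are simply skipped.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class MetastoreUriPicker {
    /** Splits a comma-separated metastore.uris value and keeps only the grpc:// entries. */
    static List<URI> parseGrpcUris(String metastoreUris) {
        List<URI> result = new ArrayList<>();
        for (String part : metastoreUris.split(",")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue; // Tolerate stray commas.
            }
            URI uri = URI.create(trimmed);
            if ("grpc".equalsIgnoreCase(uri.getScheme())) {
                result.add(uri);
            }
        }
        return result;
    }

    /** Chooses one endpoint uniformly at random from the parsed list. */
    static URI pick(List<URI> uris, Random random) {
        if (uris.isEmpty()) {
            throw new IllegalArgumentException("No grpc:// URIs configured");
        }
        return uris.get(random.nextInt(uris.size()));
    }
}
```

The chosen URI exposes the host and port through getHost() and getPort(), which is what a gRPC channel builder would typically need.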
- Creation of the protobuf definition files for the gRPC client and server
- [a separate ASF licensed repo]:
- Add a HiveMetaStoreGrpcServer class that implements the gRPC service methods by translating each request and explicitly calling the existing Thrift implementation.
- Add a HiveMetaStoreGrpcClient class that implements IMetaStoreClient, opens a gRPC connection to the gRPC metastore server, translates Thrift API requests to gRPC, and sends them to the server.
- Separate the logic of HiveMetaStore.java so that a gRPC server can be initialized in addition to the Thrift server, making a clear distinction between the Thrift and gRPC implementations.
- Add the required configuration values and implement the dynamic gRPC class loading/instantiation wiring code inside Hive Metastore Server and Client (e.g., SessionHiveMetastoreClient.java and RetryingMetastoreClient.java).
- Because this proposal simply adds gRPC support by calling the Thrift APIs in memory rather than over network I/O, streaming support is not inherently gained. The goal for now is to get gRPC support a foot in the door; from there we can iterate on the implementation to add true streaming (or other) support.
- Look into adding a layer below Hive.java but above IMetaStoreClient.java for Thrift/gRPC