HCat Security Design
HCat, as a data catalog, has two kinds of operations: metadata operations on the metastore and data operations at the storage level. Both types of operations have authentication and authorization aspects.
Metastore authentication
Hive's metastore Thrift server already offers Kerberos and delegation-token-based authentication using SASL. However, when requests go through HiveServer or Templeton, that server should authenticate the client, and data operations should be performed on behalf of the user. The SASL authentication that the metastore Thrift server provides should therefore also be ported to HiveServer for the case where Hive runs as a standalone server.
Metastore authorization
Hive supports a very fine-grained data model and QL for access control (see background). All of the privileges are stored in the metastore, but they are enforced by the client, which makes it possible to circumvent the authorization checks.
For HCat, however, metadata operations are ultimately always tied to the actual data. To prevent any discrepancy between the data and the metadata in terms of access semantics, the HCat metastore should allow or deny mutation requests based on the access semantics of the underlying data. That is, the metastore should delegate the authorization request to the storage handler, and if the storage handler allows mutation (write/admin) on the data path, the metastore should allow the metadata mutation as well. This means that if a user is able to change the actual table data in DFS, for example, she should be able to alter the metadata for the table as well. For the sake of consistency, DFS operation checks will also be wrapped in a fake storage handler.
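The delegation described above can be illustrated with a minimal sketch. It is not Hive or HCat code: `StorageAuthorizer`, `InMemoryAuthorizer`, `Metastore`, and `alter_table` are hypothetical names invented for this illustration, and the in-memory dictionary stands in for a real DFS or HBase permission check.

```python
from abc import ABC, abstractmethod
from typing import Dict, Set, Tuple


class StorageAuthorizer(ABC):
    """Hypothetical hook that a storage handler would expose to the metastore."""

    @abstractmethod
    def can_mutate(self, user: str, data_path: str) -> bool:
        """Return True if `user` has write/admin access to `data_path`."""


class InMemoryAuthorizer(StorageAuthorizer):
    """Toy stand-in for a DFS or HBase check, backed by a dict (assumption)."""

    def __init__(self, writers: Dict[str, Set[str]]) -> None:
        self._writers = writers  # data path -> users allowed to mutate it

    def can_mutate(self, user: str, data_path: str) -> bool:
        return user in self._writers.get(data_path, set())


class Metastore:
    """Toy metastore that defers every mutation decision to the storage layer."""

    def __init__(
        self,
        authorizers: Dict[str, StorageAuthorizer],
        tables: Dict[str, Tuple[str, str]],
    ) -> None:
        self._authorizers = authorizers  # storage kind -> authorizer
        self._tables = tables            # table name -> (storage kind, data path)
        self.properties: Dict[str, Dict[str, str]] = {name: {} for name in tables}

    def alter_table(self, user: str, table: str, changes: Dict[str, str]) -> None:
        storage_kind, data_path = self._tables[table]
        # The metadata mutation is allowed only if the data mutation would be.
        if not self._authorizers[storage_kind].can_mutate(user, data_path):
            raise PermissionError(f"{user} may not alter {table}")
        self.properties[table].update(changes)


metastore = Metastore(
    authorizers={"dfs": InMemoryAuthorizer({"/warehouse/t1": {"alice"}})},
    tables={"t1": ("dfs", "/warehouse/t1")},
)
metastore.alter_table("alice", "t1", {"comment": "updated"})
```

In this sketch the metastore keeps no privilege table of its own, so metadata access cannot drift away from data access.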
Data store authentication
For HBase and HDFS, the client connects directly to the storage layer, so authentication will be delegated to it. For the MR jobs launched by HCat/Hive, Hadoop and HBase delegation tokens should be obtained before launching the job and saved into the job conf. The Hive metastore already has an API for obtaining and renewing delegation tokens, so for jobs that result in Hive/HCat metastore operations, those tokens should be obtained and saved to the conf as well. HiveStorageHandler.configureTableJobProperties() should be enough to configure MR jobs.
Data store authorization
For data operations, all of the authorization checks will be delegated to the actual data store. Since the user can always read from or write to the actual data store directly, there is no need to enforce these checks at the HCat server.
The authorization checks will be performed on the client side, but this is only a convenience check to avoid launching a job with insufficient permissions only to see it fail. For the delegation to work, we should implement a storage-handler-specific HiveAuthorizationProvider, obtained from the HiveStorageHandler. This work is tracked as part of HCATALOG-237.
However, this framework makes it difficult for a data admin to manage permissions, since the table's metadata is managed by HCat but the permissions have to be managed in HDFS/HBase. To deal with that problem, we propose as future work to plug in to the GRANT QL syntax and delegate permission change operations to the storage handler, with a well-defined mapping between the storage layer's permission model and a limited version of Hive's model (leaving out roles, etc.). So for example, running:
will be the same as running in hbase shell 0.92+:
However, since HDFS does not provide true ACLs, the mapping for GRANT commands can be something like:
This kind of QL interface would allow the data admin to manage permissions the same way she manages other table metadata.
Access control in HDFS
DFS implements POSIX-like user/group-based access control. See http://hadoop.apache.org/common/docs/r0.20.2/hdfs_permissions_guide.html
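As a rough illustration of what POSIX-like checking means here, the sketch below selects exactly one permission class (owner, group, or other) and tests the requested bits against it. The function name `posix_like_check` and its parameters are assumptions made for this example, and it leaves out details such as the super-user bypass and directory traversal.

```python
READ, WRITE, EXECUTE = 4, 2, 1  # standard rwx bit values within one class


def posix_like_check(user, user_groups, owner, group, mode, wanted):
    """Return True if `user` holds all `wanted` bits on an object with `mode`.

    Only one class is consulted: owner bits if the user is the owner,
    otherwise group bits if the user belongs to the object's group,
    otherwise the "other" bits.
    """
    if user == owner:
        bits = (mode >> 6) & 0o7
    elif group in user_groups:
        bits = (mode >> 3) & 0o7
    else:
        bits = mode & 0o7
    return bits & wanted == wanted


# Mode 640: owner may read and write, group may read, others get nothing.
alice_can_write = posix_like_check("alice", {"users"}, "alice", "staff", 0o640, WRITE)
bob_can_read = posix_like_check("bob", {"staff"}, "alice", "staff", 0o640, READ)
bob_can_write = posix_like_check("bob", {"staff"}, "alice", "staff", 0o640, WRITE)
carol_can_read = posix_like_check("carol", {"guests"}, "alice", "staff", 0o640, READ)
```

Because there is a single owner and a single group per object, this model cannot express per-user grants directly, which is why the GRANT mapping for HDFS has to be approximate.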
Access control in HBase
HBase 0.92 provides access control in terms of ACLs for users and groups. The user/group information is obtained from the Hadoop authentication mechanisms. ACLs can be defined at the global, table, column family, or column qualifier level. There are 5 actions (privileges): READ, WRITE, EXEC, CREATE, and ADMIN, with the codes R, W, X, C, and A, respectively. A grant is a tuple of:
Note that group names start with @, and column_family and column_qualifier are optional. Every table has a dedicated owner, which is saved in the table metadata under the key 'OWNER'. You can alter the table to change the ownership if you have sufficient permissions.
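One way to picture a grant is as a small record, sketched below. This is an illustration only, not HBase's internal representation: the `Grant` class, its field names, and the helper functions are assumptions. The action codes and the @ prefix for groups come from the description above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Code letter -> action name, as listed in the text.
ACTION_CODES = {"R": "READ", "W": "WRITE", "X": "EXEC", "C": "CREATE", "A": "ADMIN"}


def decode_actions(codes: str) -> FrozenSet[str]:
    """Turn a code string such as "RW" into a set of action names."""
    unknown = set(codes) - set(ACTION_CODES)
    if unknown:
        raise ValueError(f"unknown action codes: {sorted(unknown)}")
    return frozenset(ACTION_CODES[code] for code in codes)


@dataclass(frozen=True)
class Grant:
    """Hypothetical record for one grant; field names are illustrative."""

    principal: str                   # user name, or group name prefixed with "@"
    actions: FrozenSet[str]
    table: str
    family: Optional[str] = None     # omitted for a table-level grant
    qualifier: Optional[str] = None  # omitted for a family-level grant

    @property
    def is_group(self) -> bool:
        return self.principal.startswith("@")


table_grant = Grant("@analysts", decode_actions("R"), "t1")
column_grant = Grant("alice", decode_actions("RW"), "t1", "cf1", "q1")
```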
There are 3 new shell commands related to access control:
Create/drop/alter table operations are governed by global-level permissions: each action checks that the user or one of the user's groups holds the CREATE/DROP or ADMIN permission globally.
Permissions for put/get/scan operations are defined per table, column family (cf), or column qualifier (cq), and the checks follow this logic:
1. All users need read access to the .META. and -ROOT- tables.
2. The table owner has full privileges.
3. Check for table-level access; if successful, we can short-circuit.
4. Check permissions against the requested families. For all families:
   a) check for family-level access
   b) if no family-level grant is found, check for qualifier-level access
5. If there are no families to check and table-level access failed, deny the request.
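The steps above can be sketched as a single function. This is a simplified reading of the list, not the AccessController source: `Grant`, `is_allowed`, and the `requested` argument are assumptions made for the example, group grants are left out for brevity, and a family requested without qualifiers is treated as needing a family-level grant.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional, Set

CATALOG_TABLES = {".META.", "-ROOT-"}


@dataclass(frozen=True)
class Grant:
    user: str
    actions: FrozenSet[str]
    table: str
    family: Optional[str] = None
    qualifier: Optional[str] = None


def is_allowed(
    user: str,
    action: str,
    table: str,
    owner: str,
    grants: Iterable[Grant],
    requested: Dict[str, Set[str]],
) -> bool:
    """`requested` maps each requested family to its requested qualifiers."""
    # 1. Everyone may read the catalog tables.
    if table in CATALOG_TABLES and action == "READ":
        return True
    # 2. The table owner has full privileges.
    if user == owner:
        return True
    mine = [g for g in grants
            if g.user == user and g.table == table and action in g.actions]
    # 3. A table-level grant short-circuits the remaining checks.
    if any(g.family is None for g in mine):
        return True
    # 5. No families to check and table-level access failed.
    if not requested:
        return False
    # 4. Every requested family must pass.
    for family, qualifiers in requested.items():
        # a) family-level grant
        if any(g.family == family and g.qualifier is None for g in mine):
            continue
        # b) otherwise each requested qualifier needs its own grant
        if not qualifiers:
            return False
        for qualifier in qualifiers:
            if not any(g.family == family and g.qualifier == qualifier for g in mine):
                return False
    return True


grants = [
    Grant("alice", frozenset({"READ"}), "t1"),
    Grant("bob", frozenset({"READ"}), "t1", "cf1"),
    Grant("carol", frozenset({"READ"}), "t1", "cf1", "q1"),
]
```

For example, `is_allowed("carol", "READ", "t1", "owner", grants, {"cf1": {"q1"}})` passes through step 4b, while the same request for qualifier `q2` is denied.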
HBase enforces authorization for user tables at the region server, using the AccessController coprocessor.
Access control in Hive
Hive's authorization is well documented at the links below: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization https://issues.apache.org/jira/browse/HIVE-78