1. Use Cases and Motivations

1.1   Hive Privilege Changes as Result of SQL Object Changes

A SQL “DROP TABLE/DATABASE” command should automatically delete from Ranger all privileges granted directly on the table or database.

 

Similarly, a SQL “RENAME TABLE” command should move all privileges granted directly on the table to the renamed table.

 

Likewise, SQL commands that drop or rename columns should adjust the privileges granted directly on those columns accordingly.

 

1.2  Hive CLI Users Contact the Metastore Directly or through the Hive Metastore Server

Hive supports metastore authorization for this use case (HIVE-3705), but it does not work with the existing Ranger Hive Plugin, which currently works only for HiveServer2.

 

1.3   Hadoop Users of M/R, Pig, Hive CLI want to access data sets created by HiveServer2

Specifically, HiveServer2 users want to use HiveServer2 as a SQL data source with SQL-flavored access control on finer-grained objects such as columns, among other advantages of a SQL server. Currently HiveServer2 supports two modes of authorization: “storage based authorization” and “SQL Standard based authorization”.

 

The first mode is the default and is intended to share data between Hive and other Hadoop applications. The downside is that Hive SQL access privileges have to be used in combination with the underlying HDFS privileges, which is neither convenient nor natural for SQL users.

 

The second mode is enabled by setting the “impersonate” flag to false and is intended to provide the access controls a SQL user would expect. It is realized through a “superuser” named “hive” who has full access to the Hive tables. The downside is that there is virtually no data sharing with other Hadoop applications.

 

What is needed, then, is a seamless way of controlling access to, and supporting the sharing of, Hadoop data between Hive and other Hadoop applications.

2. Functionalities

2.1 System Authorization

The “hive” user is required to be a Ranger admin user so that it can manipulate HDFS privileges (see Section 2.2.3). Otherwise system authentication and authorization remain as they are now.

 

2.2 Hive

2.2.1 Meta Store Plugin and Listeners

This is a new Ranger plugin. It uses the same Hive service name as the existing Hive plugin to communicate with the Ranger Admin server, and it is co-enabled with the existing Ranger Hive Plugin through the same enabling script, enable-hive-plugin.sh.

 

The new metastore plugin will be used as a static instance by two Hive metastore listener classes to communicate with the Hive service in the Ranger Admin. During the “enabling” process, the two listeners will be added to hive-site.xml so that Hive instantiates them. The two listeners can optionally enable logging.

 

The first Hive metastore listener class extends Hive’s MetaStorePreEventListener abstract class and has two responsibilities. 1) It provides Ranger-based authorization on the Hive metastore. Specifically, all DML requests, and query requests on databases and tables, are authorized this way. Query requests at finer-grained levels such as columns or partitions are not checked here; they are checked by the normal RangerHiveAuthorizer, which uses the existing RangerHivePlugin for authorization against the Ranger Admin. 2) It handles any need to sync proper privileges to the HDFS files underlying a Hive table; details are in Section 2.2.3. An object of this class will listen on all Hive metastore events.

 

The second Hive metastore listener class extends Hive’s MetaStoreEventListener abstract class to adjust Ranger Hive privileges as a result of DDL operations. Details are in Section 2.2.2.

 

The new plugin will extend RangerBasePlugin and handle authorization requests as that class does. It will also send the new requests for the HDFS privilege sync to the Ranger Admin.
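
The division of authorization work between the pre-event listener and the authorizer described above can be sketched as a small routing function. This is only an illustration of the stated rule; the function name, the request-kind strings, and the component labels are hypothetical and are not part of the Hive or Ranger APIs.

```python
# Hypothetical sketch of the routing rule; none of these names exist in Hive or Ranger.
COARSE_LEVELS = {"database", "table"}    # checked by the metastore pre-event listener
FINE_LEVELS = {"column", "partition"}    # left to RangerHiveAuthorizer


def checking_component(request_kind: str, resource_level: str) -> str:
    """Return which component authorizes a request under the split described above."""
    if request_kind == "dml":
        # All DML requests are authorized by the metastore listener.
        return "metastore-listener"
    if request_kind == "query":
        if resource_level in COARSE_LEVELS:
            return "metastore-listener"
        if resource_level in FINE_LEVELS:
            return "hive-authorizer"
    raise ValueError(f"unsupported request: {request_kind!r} on {resource_level!r}")
```

For example, a query on a table is routed to the listener, while a query on a column is routed to the authorizer.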

2.2.2 Ranger Hive Privilege Adjustments as Result of Hive DDL Operations

Hive SQL DDL operations that add, remove, or change an HDFS resource name will cause the Ranger policy on the exactly matched resource to be added, removed, or changed accordingly. A failure of such an adjustment will not cause the operation to fail; it will only be logged as a warning. An example of such a failure is a “rename” operation that finds an existing policy already on the renamed resource. This is possible because a Ranger policy can exist on nonexistent objects, whereas SQL does not allow such a scenario.
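
The warn-and-continue behavior for a rename can be sketched as follows. This is a minimal illustration, assuming policies are held in a dictionary keyed by the exact resource name; the function name, the logger name, and the data layout are hypothetical and do not reflect Ranger’s actual policy store.

```python
import logging

# Hypothetical logger name; Ranger's real logging setup is not described in this page.
logger = logging.getLogger("ranger.hive.policy-sync")


def rename_policy(policies: dict, old_resource: str, new_resource: str) -> bool:
    """Move the policy that exactly matches old_resource to new_resource.

    Returns True when the policy was moved. A conflict is logged as a warning
    and reported as False; it never raises, so the DDL operation can proceed.
    """
    if old_resource not in policies:
        return False  # no exactly matched policy, nothing to adjust
    if new_resource in policies:
        # A policy may already exist on a resource that did not exist in SQL.
        logger.warning("policy already exists on %s; rename of policy on %s skipped",
                       new_resource, old_resource)
        return False
    policies[new_resource] = policies.pop(old_resource)
    return True
```

The return value lets a caller record whether the adjustment happened without turning a policy conflict into a failed DDL statement.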

2.2.3     Ranger HDFS Privilege Changes as Result of Hive Metadata Changes

A new String member named “resourceService” will be introduced in the “configs” list of Hive’s servicedef JSON file. It specifies the name of the HDFS service whose entries under a Hive table will have access policies added or deleted according to the existence of the Hive table’s data objects. The default value of null disables the HDFS privilege sync driven by Hive metadata changes. This member can be set through the GUI and the RESTful API.

 

To enable the HDFS privilege sync driven by Hive metadata changes, both a proper setting of this new member and the listener class configuration described in Section 2.2.1 are required.

 

The functionality has four parts.

 

The first part handles HDFS policy changes resulting from Hive DDL operations. This covers any HDFS location creation or deletion caused by SQL operations that create, alter, or delete tables or partitions. If the login user is different from the current user, the policy will be for the login user, applied recursively to the HDFS directories at the object’s storage location. This is handled by the new implementation of MetaStorePreEventListener.

 

The second part adjusts the corresponding HDFS policies to reflect privilege changes resulting from SQL GRANT/REVOKE calls, provided that Hive is not impersonated and that such a policy is not already present (for GRANT) or is already present (for REVOKE). This is handled through enhancements to the grant/revokePrivileges methods of the existing RangerHiveAuthorizer class. The name of a Hive-synced HDFS policy will be of the form hive-grant-<timestamp>.

A GRANT will add a policy of recursive access on the HDFS path underlying the Hive object named in the GRANT. A REVOKE will remove the policy on the exactly matched resource and the corresponding privilege.
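
The GRANT/REVOKE sync rules above can be sketched as two small functions. This is a sketch under stated assumptions: policies are modeled as plain dictionaries in a list, the timestamp in the policy name is taken to be epoch seconds (the page does not specify its format), and all function and field names are hypothetical rather than Ranger APIs.

```python
import time


def _matches(policy, path, user, privilege):
    """Exact match on resource path, user, and privilege."""
    return (policy["path"] == path and policy["user"] == user
            and policy["privilege"] == privilege)


def sync_grant(hdfs_policies, path, user, privilege, impersonated, now=None):
    """Add a recursive HDFS policy for a Hive GRANT unless one is already present."""
    if impersonated:
        return None  # the sync applies only when Hive is not impersonated
    if any(_matches(p, path, user, privilege) for p in hdfs_policies):
        return None  # policy already present, nothing to add
    timestamp = int(time.time() if now is None else now)  # assumed format: epoch seconds
    policy = {"name": f"hive-grant-{timestamp}", "path": path, "recursive": True,
              "user": user, "privilege": privilege}
    hdfs_policies.append(policy)
    return policy


def sync_revoke(hdfs_policies, path, user, privilege, impersonated):
    """Remove the exactly matched HDFS policy for a Hive REVOKE, if present."""
    if impersonated:
        return False
    for index, policy in enumerate(hdfs_policies):
        if _matches(policy, path, user, privilege):
            del hdfs_policies[index]
            return True
    return False
```

The optional `now` argument exists only so that the generated policy name is reproducible in tests.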

 

The third part adjusts Ranger Hive policies as a result of SQL GRANT/REVOKE calls. At present the Ranger Hive Plugin is enabled only for HiveServer2, so Ranger policies are not adjusted when GRANT/REVOKE calls are issued from the Hive CLI. An installation change is required to enable the plugin for the Hive CLI as well as for HiveServer2. See Section 2.7.

 

The names of the new policies created from the sync of the Hive metadata objects will be of the form hive-grant-<timestamp>.

 

The fourth part adjusts the corresponding HDFS policies to reflect privilege changes resulting from Ranger Hive policy changes. The corresponding HDFS policies will be named hive-grant-<hive policy name> and will map the resources, resource patterns, privileges, and tags from the Hive policies.

 

Note that only SQL objects that have direct backing storage can trigger HDFS policy changes. These objects include tables; they do not include views, locks, or databases, which have no direct backing store.

 

A “prohibitive” approach will be adopted when privileges are managed at a finer granularity than files, the finest unit of backing-storage ACLs. On one hand, if a user is allowed to access only some, but not all, of the columns of a Hive table, then the table’s backing files are not accessible to that user. For example, suppose a Hive user is allowed to view the “age” and “address” fields but not the “SSN” field of a “customer” table. The “prohibitive” approach will not give that user access to the HDFS files backing the “customer” table. If the user has access to all of the columns of the table, the user will be allowed to access the backing files on HDFS.
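
The column rule above reduces to a subset test, sketched here using the “customer” example from the text. The function name is hypothetical; it only illustrates the decision and is not part of Ranger.

```python
def may_access_backing_files(table_columns, granted_columns) -> bool:
    """Prohibitive rule: file access requires access to every column of the table."""
    return set(table_columns) <= set(granted_columns)


customer_columns = ["age", "address", "SSN"]

# Access to only some columns: the backing HDFS files stay inaccessible.
partial = may_access_backing_files(customer_columns, ["age", "address"])

# Access to all columns: the backing HDFS files become accessible.
full = may_access_backing_files(customer_columns, ["age", "address", "SSN"])
```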

 

On the other hand, Hive privileges will be mapped to HDFS privileges in a “prohibitive” manner. For instance, both SQL CREATE and DROP must be allowed for HDFS “write” on a backing store to be allowed. The full mapping could conceivably be complex and could be made more comprehensive in phases.
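
A prohibitive privilege mapping can be sketched as a table of required Hive privileges per HDFS privilege. Only the “write” row comes from the text above; the “read” row is an assumption added to make the example less trivial, and the names are hypothetical.

```python
# HDFS privilege -> Hive privileges that must ALL be held (prohibitive mapping).
REQUIRED_HIVE_PRIVILEGES = {
    "write": frozenset({"CREATE", "DROP"}),  # stated in the text
    "read": frozenset({"SELECT"}),           # assumption, for illustration only
}


def allowed_hdfs_privileges(hive_privileges):
    """Return the HDFS privileges whose required Hive privileges are all granted."""
    granted = set(hive_privileges)
    return {hdfs for hdfs, required in REQUIRED_HIVE_PRIVILEGES.items()
            if required <= granted}
```

Holding CREATE alone therefore yields no HDFS privilege, while holding both CREATE and DROP yields “write”.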

2.2.4  Sequence Diagram of HDFS Policy Sync from Hive Privilege Changes

2.3 Ranger Admin

RangerServiceREST’s grant/revokeAccess methods will handle policy adjustments as they do now, even though requests can come from both the existing Hive plugin and the new Hive metastore plugin.

 

In addition, when the service’s configured “resourceService” key has a non-null value, the grant/revokeAccess methods will locate the HDFS service with that name and adjust its policies accordingly.

 

A new RangerServiceREST method, “alterResource”, will be added to handle resource renaming requests resulting from SQL “ALTER … RENAME …” operations.

 

2.4  The “ServicePolicies” Class

A new “Map<String, String> serviceConfigs” field will be added to this class to hold service-specific configurations. For now, if the corresponding serviceDef has a non-null “resourceService” field, a map entry of “resourceService=>true” will be used; once fetched by a plugin’s refresher (see 2.5), it will trigger the Hive plugin to send the table storage information to the Admin.

 

2.5 Refresher

The refresher will be enhanced to fetch the “serviceConfigs” of the “ServicePolicies” objects from the Admin.

 

2.6 Hive Plugins

If the “resourceService” Boolean flag fetched from the Admin is true (see 2.4), the Hive plugins will send the table storage information to the Admin on DDL commands.
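
The plugin-side check can be sketched as below, assuming the fetched “serviceConfigs” arrive as a string-to-string map, so that the flag is the string “true” rather than a native Boolean. The function name is hypothetical.

```python
def should_send_storage_info(service_configs: dict) -> bool:
    """True when the fetched serviceConfigs carry resourceService=>true."""
    # Values are strings in a Map<String, String>; compare case-insensitively.
    return str(service_configs.get("resourceService", "")).lower() == "true"
```

A missing key is treated the same as a false flag, so plugins fetching policies from a service without “resourceService” keep their current behavior.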

 

2.7 Installation

The Hive configuration needs to enable Hive Metastore Security. Specifically, the hive.metastore.pre.event.listeners and hive.metastore.event.listeners properties need to be configured to use the Ranger implementations.

 

In addition, to support Ranger Hive policy changes resulting from Hive GRANT/REVOKE calls issued from the Hive CLI, the Ranger Hive Plugin is to be enabled in hive-site.xml instead of hiveserver2-site.xml.

 

Essentially, through these configuration settings, both Hive Security and Hive Metastore Security are enabled simultaneously through Ranger. Enabling just one of the two, as Hive itself allows, is not supported.

 

2.8 Ranger DB Store

The new “resourceService” configuration field of the servicedef will be added to the persistent data store. Backward compatibility should be retained through addition to the x_service_config_map table.

 

2.9 GUI

A new entry named “Storage Service” will be added to the “Config Properties” list of Hive’s “Create Service” page. It defaults to empty; otherwise it holds the name of the HDFS service that will receive the synced policies resulting from privilege changes on the driving Hive table. If an HDFS service with that name does not already exist, an error will be returned and the creation of the Hive service will fail.

 

3 Appendix

3.1  Hook Invocations by Hive

The invocations by Hive of the two hooks, MetaStorePreEventListener and HiveAuthorizer, were examined across different configurations and runtimes. The results are shown in the tables below for future reference, in case questions arise as to which hooks are or should be invoked. MetaStoreEventListener invocations are not examined here and could be added in the future if clarification is needed. The experiments were performed using MySQL as the backing store for the metastore. No other backing stores, embedded stores in particular, have been tested.

In the tables, “Listener” denotes “MetaStorePreEventListener”; “Authorizer” denotes “HiveAuthorizer”; “x” means no invocation at all; “*” means “seemingly always denied before possibly proceeding further”.

The conclusions are: 1) Hive metastore security needs to be enabled to provide access controls for the Hive CLI; 2) when metastore security is enabled, some checks may be performed redundantly by both hooks, which may be inefficient. When this occurs, the metastore checks seem to be performed before those of the authorizer, which favors the former over the latter for the sake of performance. The authorizer, however, is capable of finer-grained checks such as column-level access checks. It remains to be seen how to invoke just one hook rather than the other depending on the target of the access control; this might require changes on the Hive side.

 

3.1.1 Hive CLI, HiveAuthorizer specified in hive-site.xml

Metastore Security                                | SELECT    | DDL/DML   | GRANT/REVOKE
None (hive.metastore.pre.event.listeners not set) | x         | x         | Authorizer
Storage-Based                                     | Listener  | Listener  | Listener+Authorizer
Default                                           | Listener* | Listener* | Listener*

 

3.1.2 Hive CLI, HiveAuthorizer specified in hiveserver2-site.xml

Metastore Security                                | SELECT    | DDL/DML   | GRANT/REVOKE
None (hive.metastore.pre.event.listeners not set) | x         | x         | x
Storage-Based                                     | Listener  | Listener  | Listener
Default                                           | Listener* | Listener* | Listener*

 

3.1.3 Hive Server2, HiveAuthorizer specified in hiveserver2-site.xml

Metastore Security                                | SELECT              | DDL/DML             | GRANT/REVOKE
None (hive.metastore.pre.event.listeners not set) | Authorizer          | Authorizer          | Authorizer
Storage-Based                                     | Listener+Authorizer | Listener+Authorizer | Listener+Authorizer
Default                                           | Listener*           | Listener*           | Listener+Authorizer

 

3.1.4 Hive Server2, HiveAuthorizer specified in hive-site.xml

Metastore Security                                | SELECT              | DDL/DML             | GRANT/REVOKE
None (hive.metastore.pre.event.listeners not set) | Authorizer          | Authorizer          | Authorizer
Storage-Based                                     | Listener+Authorizer | Listener+Authorizer | Listener+Authorizer
Default                                           | Listener*           | Listener*           | Listener+Authorizer

 

3.2 Future Extensions

It is conceivable that the sync mechanism described in Section 2.2.3 could be applied to other Hadoop applications as well. In particular, the new “resourceService” field can serve as a link between an application and its underlying storage. It could even be extended to form a “sync chain” of more than two levels, for instance Hive on HBase on HDFS.

 
