High Level Concepts
Tag: A tag is an arbitrary named string applied to a resource. A resource can be at any granularity that can be identified. E.g. the table Customer can be tagged as PII, where Customer is the resource and PII is the tag. Tagging resources enables multiple use cases, e.g. Access Control (who does or doesn’t have access to a resource based on its tags), Reporting (who accessed the resources carrying a given tag), etc.
Tag Attribute: When a resource is tagged, the tag can carry key/value pairs. E.g. a stored customer tax file might be tagged with an optional attribute like “ExpiryTime=2022-03-06 GMT”, which means the document should not be accessible after the ExpiryTime. Like tags, tag attributes are subject to interpretation by the policy.
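As an illustration of a tag carrying attributes, here is a minimal sketch in Python. The class names, the tag name TAX, and the helper is_expired are hypothetical and are not part of Ranger; the sketch only assumes the attribute value uses the “YYYY-MM-DD GMT” form shown in the example above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Tag:
    """A tag name plus optional key/value attributes (hypothetical model)."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class TaggedResource:
    """A resource identifier and the tags applied to it (hypothetical model)."""
    resource: str
    tags: list


def is_expired(tag, now):
    """Interpret the optional ExpiryTime attribute; no attribute means no expiry."""
    raw = tag.attributes.get("ExpiryTime")
    if raw is None:
        return False
    # Assumed format: "2022-03-06 GMT", taken from the example in the text.
    expiry = datetime.strptime(raw, "%Y-%m-%d GMT").replace(tzinfo=timezone.utc)
    return now >= expiry


tax_file = TaggedResource(
    resource="customer_tax_file",
    tags=[Tag("TAX", {"ExpiryTime": "2022-03-06 GMT"})],
)
```

The tag itself stays a plain label; it is the policy side (here, is_expired) that decides what ExpiryTime means.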
Tag Source System: The source of the tags is generally an external system, e.g. Apache Atlas. It is highly recommended to have only one source of truth for tags.
Ranger Tag API Interface: A Ranger API that lets an external system send the list of tags and the resources that are tagged. This is useful when an external system manages the tags for Ranger.
Tag Synchronizer: A custom adapter process responsible for keeping the tagged resources in sync with Ranger. It can poll the source system at a regular interval or, if the source system supports a message queue, subscribe to it; in either case it calls the Ranger Tag API to update the Ranger Tag Database.
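The polling variant of the synchronizer could be sketched as follows. This is only an outline under stated assumptions: fetch_from_source and push_to_ranger are hypothetical callables standing in for the tag source client and the Ranger Tag API call, and a snapshot is assumed to be a dict mapping a resource name to its set of tags.

```python
def diff_tagged_resources(previous, current):
    """Compare two {resource: set_of_tags} snapshots.

    Returns (upserts, deletes): resources whose tags are new or changed,
    and resources that no longer carry any tags in the source system.
    """
    upserts = {res: tags for res, tags in current.items() if previous.get(res) != tags}
    deletes = [res for res in previous if res not in current]
    return upserts, deletes


def sync_once(fetch_from_source, push_to_ranger, previous):
    """Run one poll cycle and return the snapshot to use for the next cycle."""
    current = fetch_from_source()            # hypothetical: read tags from the source system
    upserts, deletes = diff_tagged_resources(previous, current)
    if upserts or deletes:
        push_to_ranger(upserts, deletes)     # hypothetical: wraps the Ranger Tag API call
    return current
```

A scheduler would call sync_once at the chosen interval, passing the snapshot returned by the previous call. A message-queue based synchronizer would replace the polling and diffing with a subscription but could reuse the same push step.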
Ranger Tag database: This database, or set of tables, stores the resources that are tagged, along with the attributes associated with each resource for the tag. The Ranger tag database should be able to store static or meta level tags. However, tags at the row or cell level should be stored at the component level, or should be queried from the Tag Source System by the component plugin during policy execution.
Ranger Tag policies: Ranger needs to support policies that are defined at the tag level. Since tag policies are configured at the global level, they need to address the permission sets supported by the different components. TODO DISCUSS: Tag policies across repositories
Tag caching at Plugin: The current policy pull mechanism used by the plugins needs to be extended to also pull the tagged resources at a regular interval.
Tag policy execution: Plugins at the components enforce the policies that are defined at the tag/global level. If a policy exists at the tag/global level, it trumps all policies defined at the resource level. E.g. if a column is tagged and has a policy that restricts who can access it, then that policy will be enforced, regardless of what is granted at the resource level. (TODO: Fall back specifics)
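The precedence rule can be sketched as below. Since the fall back specifics are still open, this sketch assumes the simplest reading: if any tag on the resource has a tag policy, only tag policies decide; otherwise the resource level policy decides. The function name and the dict-based policy shapes are hypothetical.

```python
def evaluate(user, resource, resource_tags, tag_policies, resource_policies):
    """Return True if access is allowed.

    tag_policies:      {tag_name: set of users allowed by the tag policy}
    resource_policies: {resource_name: set of users allowed at resource level}
    """
    # Collect the tag policies that apply to this resource.
    applicable = [tag_policies[tag] for tag in resource_tags if tag in tag_policies]
    if applicable:
        # A tag level policy exists, so the resource level policy is ignored.
        return any(user in allowed for allowed in applicable)
    # No tag policy applies: behave as resource based policies do today.
    return user in resource_policies.get(resource, set())


tag_policies = {"PII": {"alice"}}
resource_policies = {"customer.ssn": {"alice", "bob"}}
```

With these sample policies, bob is allowed on customer.ssn by the resource policy but loses access once the column is tagged PII, because the tag policy lists only alice.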
Dynamic policy execution: These extensible policies support advanced use cases that need special understanding of the tag and its attribute values. E.g. if a policy currently says a resource should expire in “90” days, but the requirement later changes to “60” days, the customer might design the tag based policy so that the number of days is supplied via the policy definition or from another source, and the computation is done in real time based on when the resource was created. Here, the resource would have a tag with the attribute “CreateTime”, set when the resource is tagged and sent to Ranger.
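A rough sketch of such a dynamic check is shown here. It assumes CreateTime is stored as an ISO 8601 timestamp with a UTC offset, which the text does not specify, and the function name is hypothetical. The retention period is a parameter of the policy, so changing “90” to “60” requires no change to the tags.

```python
from datetime import datetime, timedelta, timezone


def is_access_allowed(tag_attributes, retention_days, now):
    """Allow access only while the resource is younger than retention_days.

    tag_attributes: attributes stored with the tag, including "CreateTime"
    retention_days: value supplied by the policy definition (e.g. 90 or 60)
    now:            evaluation time, passed in so the check is easy to test
    """
    # Assumed format, e.g. "2022-01-01T00:00:00+00:00".
    created = datetime.fromisoformat(tag_attributes["CreateTime"])
    return now < created + timedelta(days=retention_days)


attrs = {"CreateTime": "2022-01-01T00:00:00+00:00"}
now = datetime(2022, 3, 15, tzinfo=timezone.utc)
allowed_at_90 = is_access_allowed(attrs, 90, now)  # expiry falls on 2022-04-01
allowed_at_60 = is_access_allowed(attrs, 60, now)  # expiry falls on 2022-03-02
```

Because the expiry is computed at evaluation time from CreateTime, the same tagged resource is accessible under the 90 day policy and not under the 60 day policy.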
Auditing Support for Tagged Resources: Audit reporting based on tags would be a very useful feature. To support this, we should store the tags along with the audit records, so that reports can be run at a later time.
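To show why storing tags with each audit record helps, here is a small sketch. The record fields and function names are hypothetical and do not reflect Ranger's actual audit schema.

```python
from datetime import datetime, timezone


def make_audit_record(user, resource, action, allowed, tags, when):
    """Build an audit record that captures the tags present at access time."""
    return {
        "time": when.isoformat(),
        "user": user,
        "resource": resource,
        "action": action,
        "allowed": allowed,
        "tags": sorted(tags),  # stored with the record, so later tag changes do not alter history
    }


def users_who_accessed(records, tag):
    """Report: which users successfully accessed resources carrying the tag."""
    return sorted({r["user"] for r in records if tag in r["tags"] and r["allowed"]})


t = datetime(2022, 3, 1, tzinfo=timezone.utc)
records = [
    make_audit_record("alice", "customer", "select", True, {"PII"}, t),
    make_audit_record("bob", "orders", "select", True, set(), t),
    make_audit_record("carol", "customer", "select", False, {"PII"}, t),
]
```

A report such as users_who_accessed(records, "PII") can then be answered from the audit store alone, without querying the tag source for historical tag assignments.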
All of the requirements assume that Apache Atlas or an external system would be able to classify data within Hadoop, and be able to store classification labels against metadata. For example, a column in a Hive table could be classified as "sensitive".
- Users would classify data externally in Apache Atlas or an external system
- Ranger would need to sync with the external metastore to retrieve the classification labels/tags and the associated metadata
- Ranger would need to provide an ability for users to create security policies based on the classification. Refer to scenarios below
- Ranger would continue to provide the ability for users to create policies for each component, as is done today
- Ranger would implement a separate service to sync tags and related metadata with the external service
- If data in Hadoop is tagged or classified, users would create tag based policies in Ranger granting access to certain users or groups for the data classified with the specific tag. Ranger plugins would use these policies for enforcing access control.
- If data in Hadoop is not classified or tagged, then Ranger plugins would continue to enforce policies which are created for each component/service resource, similar to how Ranger plugins work today.
- Data could be classified for lineage or analytical purposes, and may not have security implications. In this case, tags would exist for the data but no security policies would exist in Ranger for these tags. Ranger plugins should continue to enforce the component resource based policies.
- In compliance driven or sensitive environments, some of the data could be classified as sensitive or PII, and access to such data could be restricted. Users should be able to specify policies for data tagged as sensitive where certain users are allowed access and certain users are restricted. Here we need to think a little about deny policies and the order in which they should be evaluated.
- If data is classified with multiple tags, different policies could exist for different tags. Users should be given access if any of the policies provides access to the user or the group. Exceptions would be sensitive or classified policies where users could be explicitly granted or denied permissions. If a user is denied permission in a policy, the denial would take precedence over any access given in other policies.
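The multi-tag rule in the last scenario (any allow grants access, but an explicit deny wins) could be evaluated as in the sketch below. The policy shape and the result strings are hypothetical, groups are left out for brevity, and the evaluation order of deny policies is still an open design question in this document.

```python
def decide(user, resource_tags, policies):
    """Evaluate tag policies for a resource that may carry several tags.

    policies: list of dicts like {"tag": "PII", "allow": {...}, "deny": {...}}
    Returns "DENY", "ALLOW", or "NOT_DETERMINED" (fall through to resource policies).
    """
    relevant = [p for p in policies if p["tag"] in resource_tags]
    # An explicit deny in any relevant policy takes precedence over every allow.
    if any(user in p["deny"] for p in relevant):
        return "DENY"
    # Otherwise a single allowing policy is enough.
    if any(user in p["allow"] for p in relevant):
        return "ALLOW"
    return "NOT_DETERMINED"


policies = [
    {"tag": "PII", "allow": {"alice", "bob"}, "deny": set()},
    {"tag": "SENSITIVE", "allow": set(), "deny": {"bob"}},
]
```

For a resource tagged both PII and SENSITIVE, alice is allowed through the PII policy, while bob is denied because the SENSITIVE policy denies him even though the PII policy would allow him.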