GaianDB as a virtualizer with governance enforcement

Related JIRAs: ATLAS-1696, RANGER-1488

This page introduces the GaianDB technology and explains how we will use it in the VDC project to enforce access to data. Any code & designs mentioned here do not form part of the Atlas project work per se, but are driving additional capabilities in Atlas and demonstrating a real-world scenario for virtualization. It's also important to note that the frameworks we refer to below are open and pluggable.

What is Gaian?

GaianDB is an IBM-developed open source project that is available on GitHub. It is effectively a wrapper around Apache Derby, with additional VTIs (virtual table interfaces), that supports highly distributed, loosely coupled, self-learning, federated database views. For this project we are not concerned with the self-learning aspect, nor necessarily with the highly distributed topologies that are possible, but we are making use of its ability to federate across multiple databases.

In our initial project we will offer two different "views" of the same database. One will look very much like the real database, with technical column names. The other will have more business-oriented names (these names come via glossary lookups) and may have only a subset of the columns (omitting technical detail not needed by a business user). GaianDB is capable of an awful lot more, including joining data across multiple sources (multiple RDBMSs, files, REST API calls).
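As a rough illustration of how the business view relates to the technical one, the sketch below derives the business view's columns from the technical column names and a glossary lookup. The class, the method and the column names are hypothetical and are not part of GaianDB or Atlas. It also assumes that a column with no glossary term is simply left out of the business view.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only (not GaianDB or Atlas API).
class BusinessViewSketch {

    // Returns technical column name -> business name, in the original column
    // order. Assumption: columns without a glossary term are dropped.
    static Map<String, String> businessColumns(List<String> technicalColumns,
                                               Map<String, String> glossary) {
        Map<String, String> view = new LinkedHashMap<>();
        for (String column : technicalColumns) {
            String term = glossary.get(column);
            if (term != null) {
                view.put(column, term);
            }
        }
        return view;
    }
}
```

With a glossary that maps CUST_NM to "Customer Name" and CUST_INC to "Customer Income", a technical column such as ROW_TS would not appear in the business view.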

The use cases explain the process in more detail.

Issues that we find with GaianDB itself are recorded via the GaianDB GitHub Issue Tracker.

Supported data sources with Gaian

The GaianDB README lists the supported data stores as:

DB2
SQL Server
MySQL
Oracle
CSV Files
Excel spreadsheets (requires Apache POI)

The GaianDB prereq document additionally lists:

Hive
Lucene
BigSQL

I (Nigel) have personally also had success using:

Postgres (though use of TEXT rather than VARCHAR resulted in a performance issue)
MariaDB (which is a fork of MySQL)

See specific links on versions (my tests were current as of Nov 2017)

The GaianDB team have also successfully used Hive (requiring hive-jdbc and hadoop-commons).

Changing your application to make use of GaianDB

Any application that previously might have gone directly to the underlying data source should now connect to the GaianDB database via JDBC.

The current Apache Derby driver is the preferred client driver to use, following the Derby URL prefix of jdbc:derby:net
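A minimal sketch of such a client is shown here. It assumes the Derby network client jar is on the classpath and a GaianDB node is running. It builds the URL in the standard Derby network client form jdbc:derby://host:port/database; adjust the prefix if your driver level needs a different one. The class and method names are illustrative only.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative client sketch; class and method names are not GaianDB API.
class GaianClientSketch {

    // Builds a URL in the standard Derby network client form.
    static String url(String host, int port, String database) {
        return "jdbc:derby://" + host + ":" + port + "/" + database;
    }

    // Runs a query and returns the number of rows it produced.
    // Needs a reachable server and the Derby client driver on the classpath.
    static int countRows(String jdbcUrl, String user, String password, String sql)
            throws SQLException {
        try (Connection connection = DriverManager.getConnection(jdbcUrl, user, password);
             Statement statement = connection.createStatement();
             ResultSet results = statement.executeQuery(sql)) {
            int rows = 0;
            while (results.next()) {
                rows++;
            }
            return rows;
        }
    }
}
```

An application would then call something like GaianClientSketch.countRows(GaianClientSketch.url("localhost", 6414, "gaiandb"), user, password, query), where the host, port, database name, credentials and query are placeholders for your own installation.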

If an application does not permit additional JDBC drivers to be used, it may be possible to use the DB2 unified driver, with a prefix of either jdbc:derby:net or jdbc:db2, but this is not supported by gaiandb/derby/db2 at this point.

An issue has been raised for further clarification on the exact levels supported.

A note on User Impersonation

Many existing databases allow user impersonation, i.e. the JDBC connection to the database is made using an NPA (non-personal account) such as admin/admin (not really!), and additional properties are then passed on the connection to specify the effective user id (so that policies/security can be applied).

Additionally, applications such as UIs may have a web application behind them that acts on behalf of many users, effectively using a pooled connection to execute queries on their behalf. Here the effective user may change even during the lifetime of the connection, using additional API calls.

Work to add this capability to GaianDB is currently underway in RANGER-1850.
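The general pattern of connecting as an NPA while naming an effective user can be sketched as below. The "user" and "password" keys are the standard JDBC connection properties. The "effectiveUser" property name is purely hypothetical: each database defines its own, and this does not describe what GaianDB will accept once the work above is complete.

```java
import java.util.Properties;

// Illustrative only; not GaianDB API.
class ImpersonationSketch {

    // HYPOTHETICAL property name, chosen for this example only.
    static final String EFFECTIVE_USER_PROPERTY = "effectiveUser";

    // Builds connection properties: the NPA credentials authenticate the
    // connection, the extra property names the user the query runs on behalf of.
    static Properties connectionProperties(String npaUser, String npaPassword,
                                           String effectiveUser) {
        Properties properties = new Properties();
        properties.setProperty("user", npaUser);
        properties.setProperty("password", npaPassword);
        properties.setProperty(EFFECTIVE_USER_PROPERTY, effectiveUser);
        return properties;
    }
}
```

The resulting Properties object would be passed to DriverManager.getConnection(url, properties) in place of a plain user and password.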

NOTE: At a future point it is expected an OCF-compliant connector will be available.

Enforcing data access through gaian

For this project, we want to use metadata to drive enforcement. We will use Apache Ranger as the enforcement technology & build a Ranger plugin for GaianDB in a very similar way to the Ranger plugin for Hive.

Resource definitions (assets) and classifications will be defined in Atlas & can then be used within Ranger rules. For example, we may tag a column with "confidentiality=secret". With the improved glossary support in Atlas this won't be a direct association; instead it will be derived from the fact that the column is associated with a business term such as "customer income". Our updated GAF OMAS, which integrates with Ranger's tagsync, will support this.
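The indirection from column to business term to classification can be sketched as follows. This is not Atlas or Ranger API: plain in-memory maps stand in for the glossary and the classifications, and all names are hypothetical. The sketch also assumes that a column with no classification is readable by anyone, which a real policy may well not do.

```java
import java.util.Map;
import java.util.Set;

// Illustrative only; in-memory maps stand in for Atlas and Ranger.
class TagPolicySketch {

    // Resolves a column's classification indirectly through its business term.
    // Returns null when the column has no term or the term has no classification.
    static String classificationOf(String column,
                                   Map<String, String> columnToTerm,
                                   Map<String, String> termToClassification) {
        String term = columnToTerm.get(column);
        return term == null ? null : termToClassification.get(term);
    }

    // Assumption: unclassified columns are readable by anyone; classified
    // columns need the matching clearance.
    static boolean mayRead(String column, Set<String> clearances,
                           Map<String, String> columnToTerm,
                           Map<String, String> termToClassification) {
        String classification =
                classificationOf(column, columnToTerm, termToClassification);
        return classification == null || clearances.contains(classification);
    }
}
```

For example, if CUST_INC is linked to the term "customer income" and that term carries "confidentiality=secret", only a user holding that clearance could read the column.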

This plugin will also record audit events. Ranger itself could also be enhanced to provide additional metadata about its rules & configuration back to Atlas, where they can be combined with Atlas's knowledge (likely through some stewardship process) so that we can link business policies with implemented rules.

The Ranger plugin for GaianDB was initially managed via GitHub, but it's now in the process of going into Ranger through the JIRA and ReviewBoard process.

Note that the term Governance Action Framework could be read as referring only to this OMAS & the use of Ranger, but it should in fact be seen more broadly as the set of capabilities we need to actively govern the system (enforcement, audit, etc.).
