...

Typically, end users want to use meaningful business terms to describe the data they need. They may also want to see related descriptions of the data, the profile of its data values, and its lineage.  Information about the owners/stewards of the data, the organization they come from, and any license associated with the data is also relevant.  To provide this information, the VDC project needs to: expand the types defined in Apache Atlas; extend the glossary so that it supports categories and other types of semantic relationships that help the end user locate the right data; and provide a new catalog API and interface for discovering data based on these values.

...


Figure 1: Catalog self service UI

Figure 1 shows a mock-up of the catalog search UI that the VDC supports.  A person can enter search queries, and a list of potential data sources is displayed on the left-hand side of the screen.  Selecting one of the search results displays more details of the metadata for that entry in the top right-hand side of the screen. Underneath it, a preview of the data is shown if the end user has permission to access the data.


At the start of the use case, details of the data repositories, the mappings to the business glossary terms and the security classifications are managed in IBM's Information Governance Catalog (IGC).  This is shown in Figure 2.

 

The first step is to replicate the metadata from IGC to Apache Atlas so it can be extended to support the virtual views.

This is shown in Figure 3.

 

Figure 2: IBM's Information Governance Catalog (IGC) holding data lake metadata

 
 

Figure 3: Replicating metadata from IGC to Atlas

 

Since IGC remains the master copy of the original metadata, the replication must be ongoing so that Atlas remains up to date with the latest metadata from IGC.

Thus the replication capability listens for IGC events and converts them into OMRS events that can then be used to drive updates through the OMRS connector API to the Apache Atlas repository.
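The conversion step described above can be sketched as follows. This is only an illustration under stated assumptions: the IGC event fields (`action`, `assetId`, `assetType`), the event type names, and the `OmrsEvent` class are hypothetical and are not taken from the IGC or OMRS specifications. The `apply_to_atlas` callable stands in for the OMRS connector API.

```python
from dataclasses import dataclass

# Hypothetical mapping from IGC change actions to OMRS-style event types.
IGC_ACTION_TO_EVENT_TYPE = {
    "CREATE": "entity_created",
    "MODIFY": "entity_updated",
    "DELETE": "entity_deleted",
}


@dataclass(frozen=True)
class OmrsEvent:
    """Illustrative stand-in for an OMRS event (not the real OMRS payload)."""
    event_type: str
    entity_guid: str
    entity_type: str
    source: str = "IGC"


def convert_igc_event(igc_event: dict) -> OmrsEvent:
    """Translate one (assumed) IGC event dictionary into an OmrsEvent."""
    action = igc_event["action"]
    if action not in IGC_ACTION_TO_EVENT_TYPE:
        raise ValueError(f"Unsupported IGC action: {action}")
    return OmrsEvent(
        event_type=IGC_ACTION_TO_EVENT_TYPE[action],
        entity_guid=igc_event["assetId"],
        entity_type=igc_event["assetType"],
    )


def replicate(igc_events, apply_to_atlas) -> None:
    """Convert each IGC event and pass it to the callable that updates Atlas."""
    for igc_event in igc_events:
        apply_to_atlas(convert_igc_event(igc_event))
```

Because the replication is ongoing, a real implementation would run `replicate` against a continuous event stream rather than a finite list; the list form is used here only to keep the sketch self-contained.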

 

 

The virtualizer is an optional component of Atlas that receives notifications from Apache Atlas through the Information View OMAS event topic and builds logical tables in Gaian as well as information view metadata in Atlas.

Gaian is an open source information virtualization technology.  The virtualizer is written to be modular so calls to a different virtualization technology can be made at this point with a small change to the virtualizer. 
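One way to picture this modularity is a virtualizer that talks to the virtualization technology only through a narrow interface. The sketch below is an assumption about how such a design could look, not the project's actual code: `VirtualizationBackend`, `InMemoryBackend`, the event fields, and the published metadata shape are all hypothetical, and the in-memory backend merely stands in for Gaian.

```python
from typing import Callable, Dict, Protocol, Sequence


class VirtualizationBackend(Protocol):
    """Hypothetical interface that any virtualization technology would implement."""

    def create_logical_table(self, name: str, columns: Sequence[str], source: str) -> None:
        ...


class InMemoryBackend:
    """Stand-in for Gaian: records logical table definitions in a dictionary."""

    def __init__(self) -> None:
        self.tables: Dict[str, dict] = {}

    def create_logical_table(self, name: str, columns: Sequence[str], source: str) -> None:
        self.tables[name] = {"columns": list(columns), "source": source}


class Virtualizer:
    """Reacts to (assumed) Information View events from the event topic."""

    def __init__(self, backend: VirtualizationBackend,
                 publish_view_metadata: Callable[[dict], None]) -> None:
        self._backend = backend
        self._publish = publish_view_metadata

    def on_event(self, event: dict) -> None:
        # Build the logical table in the virtualization technology ...
        self._backend.create_logical_table(
            event["table"], event["columns"], event["source"]
        )
        # ... and describe the resulting information view for the metadata repository.
        self._publish({
            "informationView": event["table"],
            "columns": list(event["columns"]),
            "derivedFrom": event["source"],
        })
```

Under this design, switching to a different virtualization technology means supplying another class with the same `create_logical_table` method; the `Virtualizer` itself is unchanged.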

The aim of the MVP is to prove out the use of Apache Atlas as a manager for an information virtualization technology.

 

 

Figure 4: Building information views with the virtualizer

 

 

 

Figure 5: Configuring enforcement points in Gaian using Apache Ranger

 

Using a similar technique, the synchronization processes for Apache Ranger learn from the Governance Action OMAS that Information Views have been created or changed in Apache Atlas.  They push the metadata needed to control access to the Ranger server, which then configures the Ranger plugins in Gaian.  See Figure 5.
 

 

The Ranger plugins in Gaian cache all of the metadata they need to make access decisions based on the user information passed on a request.
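The caching behaviour can be illustrated with a minimal sketch. It does not use the Apache Ranger plugin API; the class name, the policy shape (`resource` and `groups`), and the deny-by-default rule are assumptions made for the example.

```python
from typing import Dict, Iterable, Set


class CachedPolicyPlugin:
    """Illustrative plugin that decides access from a locally cached policy set."""

    def __init__(self) -> None:
        # Maps a resource name to the set of groups allowed to read it.
        self._allowed_groups: Dict[str, Set[str]] = {}

    def refresh(self, policies: Iterable[dict]) -> None:
        """Replace the cache with the policies pushed by the policy server."""
        self._allowed_groups = {
            policy["resource"]: set(policy["groups"]) for policy in policies
        }

    def is_access_allowed(self, resource: str, user_groups: Iterable[str]) -> bool:
        """Decide locally, using only the cache and the user info on the request."""
        allowed = self._allowed_groups.get(resource, set())
        # Resources without a cached policy are denied (assumed default).
        return not allowed.isdisjoint(user_groups)
```

Because every decision is answered from the cache, no call to the policy server is needed on the request path; the cache only changes when `refresh` is called.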

Read more about GaianDB and the Ranger plugins here

The system is now configured.  Changes to the IGC metadata will ripple through Atlas, Virtualizer, Ranger and Gaian so they are consistent and up-to-date.

 

When the end user makes a search request, or clicks on a search result to see more detail, the request and response flow through the Catalog OMAS to and from Apache Atlas.  See Figure 6.

Figure 6: Requesting catalog information from Atlas

 


 

Figure 7: Requesting data from Gaian

 

 

 

When the data preview is requested, Gaian is called to extract the data.  The Ranger plugins validate the access request, and if it is permitted, Gaian retrieves the data from the data lake.  See Figure 7.
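The order of operations in the preview step (validate first, then retrieve) can be sketched as below. The function and parameter names are hypothetical: `is_allowed` stands in for the access check made by the Ranger plugins, and `fetch_rows` stands in for the retrieval from the data lake.

```python
from typing import Callable, List, Sequence


def preview_data(user: str,
                 table: str,
                 is_allowed: Callable[[str, str], bool],
                 fetch_rows: Callable[[str], Sequence[tuple]],
                 limit: int = 10) -> List[tuple]:
    """Return at most `limit` rows of `table`, but only if `user` may read it."""
    # The access check happens before any data is retrieved.
    if not is_allowed(user, table):
        raise PermissionError(f"{user} is not permitted to read {table}")
    return list(fetch_rows(table)[:limit])
```

Raising an error on denial is one possible choice; a UI could equally catch it and show the metadata without the preview pane, which matches the behaviour described for Figure 1.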


Figure 8 summarizes the whole end-to-end flow.

Figure 8: VDC end-to-end flow (MVP1)

 

...