One important reason for managing metadata about data assets is that it lets individuals search for and locate the data they need to make new uses of an organization's data. The metadata describes the location, structure and content of the data to varying degrees of detail. By bringing together the metadata for data assets from across the enterprise, an organization is able to manage and use this data more effectively.
Apache Atlas should bring together all of the knowledge the organization has about each data source in order to have enough information to differentiate between them during the data selection process. Figure 1 below shows the types of information about a data source that should be available through Apache Atlas.
Figure 1: Drill-down from catalog search results to explore the content and qualities of a data set
This metadata is assembled through notifications from data processing engines (via the bridges/hooks), from metadata discovery pipelines, and from user interfaces and API calls. The result is a rich description of the data source and its content. All of this detail is necessary to support the catalog search because an organization is likely to have many hundreds of data sources that seem to contain the same type of data, but each may differ in quality, coverage of attributes, scope of instances, currency, precision and so on.
During the search for data for a data project, the Atlas user needs to be able to iteratively search, review results and refine the search to narrow down the list of candidate data sources as quickly as possible. When they have identified the assets of interest, they can request that the data be provisioned to a sandbox for further analysis.
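The search, review and refine loop described above can be illustrated with a small sketch. It is not the Atlas search API: the `DataSourceSummary` record, its fields, the `refine` function and the sample catalog entries are all hypothetical, chosen only to mirror the qualities named in the text (quality, coverage of attributes, currency).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSourceSummary:
    """Hypothetical catalog entry; the fields are assumptions for this sketch."""
    name: str
    topic: str
    quality_score: float       # assumed scale: 0.0 (poor) to 1.0 (excellent)
    attributes: frozenset      # names of the attributes the source covers
    last_updated_days: int     # currency: days since the last refresh


def refine(candidates, *, topic=None, min_quality=None,
           required_attributes=(), max_age_days=None):
    """Return only the candidates that satisfy every criterion supplied."""
    required = frozenset(required_attributes)
    result = []
    for source in candidates:
        if topic is not None and source.topic != topic:
            continue
        if min_quality is not None and source.quality_score < min_quality:
            continue
        if not required <= source.attributes:
            continue
        if max_age_days is not None and source.last_updated_days > max_age_days:
            continue
        result.append(source)
    return result


# Invented sample data: three sources that all appear to hold customer data.
catalog = [
    DataSourceSummary("crm_customers", "customer", 0.9,
                      frozenset({"name", "email", "postcode"}), 2),
    DataSourceSummary("legacy_customers", "customer", 0.6,
                      frozenset({"name", "postcode"}), 400),
    DataSourceSummary("web_signups", "customer", 0.8,
                      frozenset({"name", "email"}), 1),
    DataSourceSummary("product_master", "product", 0.95,
                      frozenset({"sku", "name"}), 10),
]

# Each pass narrows the previous result, as a user would when reviewing hits.
first_pass = refine(catalog, topic="customer")
second_pass = refine(first_pass, min_quality=0.75)
shortlist = refine(second_pass, required_attributes={"email", "postcode"})
print([source.name for source in shortlist])
```

Each pass takes the previous result as its input, so the candidate list can only shrink. This is the property that lets a user converge quickly on a short list.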
The architecture that supports the catalog search is shown in Figures 2 and 3. In all cases, the catalog search UI accesses Apache Atlas through the Catalog Open Metadata Access Service (OMAS) REST API. This interface interacts with an Open Metadata Repository Service (OMRS) Connector that it retrieves from the Open Connector Framework (OCF). All OMRS Connectors support the same interfaces:
- The entity and relationship types supported by the metadata repository or repositories
- The entity and relationship APIs to access all types of metadata in a common manner
- Specialized, type-safe interfaces for the core metadata types that are included in the Apache Atlas build
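The three groups of capability listed above could be sketched as a single abstract interface. The following is illustrative only: the class and method names (`MetadataConnector`, `find_entities`, `find_data_sets` and so on), the `DataSet` type name and the in-memory repository are assumptions made for this example, not the actual OMRS Connector API.

```python
from abc import ABC, abstractmethod


class MetadataConnector(ABC):
    """Hypothetical common interface; names are illustrative, not the OMRS API."""

    # 1. Discovery of the supported entity and relationship types.
    @abstractmethod
    def get_entity_types(self):
        """Return the names of the entity types the repository supports."""

    @abstractmethod
    def get_relationship_types(self):
        """Return the names of the relationship types the repository supports."""

    # 2. Generic entity and relationship access, the same for every type.
    @abstractmethod
    def find_entities(self, type_name, **properties):
        """Return entities of type_name whose properties equal the given values."""

    @abstractmethod
    def get_relationships(self, guid):
        """Return relationships that have the entity `guid` at either end."""

    # 3. Type-safe convenience method layered on the generic API.
    #    "DataSet" is an assumed type name used only for this sketch.
    def find_data_sets(self, **properties):
        return self.find_entities("DataSet", **properties)


class InMemoryConnector(MetadataConnector):
    """Toy repository backed by lists of dicts, used to exercise the interface."""

    def __init__(self, entities, relationships):
        self._entities = list(entities)
        self._relationships = list(relationships)

    def get_entity_types(self):
        # Simplification: report only the types that actually have instances.
        return sorted({e["type"] for e in self._entities})

    def get_relationship_types(self):
        return sorted({r["type"] for r in self._relationships})

    def find_entities(self, type_name, **properties):
        return [
            e for e in self._entities
            if e["type"] == type_name
            and all(e.get(key) == value for key, value in properties.items())
        ]

    def get_relationships(self, guid):
        return [r for r in self._relationships if guid in (r["end1"], r["end2"])]


connector = InMemoryConnector(
    entities=[
        {"guid": "e1", "type": "DataSet", "name": "customers", "owner": "sales"},
        {"guid": "e2", "type": "DataSet", "name": "orders", "owner": "finance"},
        {"guid": "e3", "type": "Database", "name": "warehouse"},
    ],
    relationships=[
        {"type": "Contains", "end1": "e3", "end2": "e1"},
        {"type": "Contains", "end1": "e3", "end2": "e2"},
    ],
)

print(connector.get_entity_types())
print([e["name"] for e in connector.find_data_sets(owner="sales")])
print(len(connector.get_relationships("e3")))
```

Because callers depend only on the abstract class, any repository that supplies these four methods can sit behind the same calling code.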
There are two implementations of the OMRS Connector provided for Apache Atlas: a Local Atlas OMRS Connector for accessing a local Apache Atlas metadata repository and an Enterprise OMRS Connector for making federated queries across many metadata repositories.
Figure 2: Catalog search using a single instance of an Apache Atlas Repository
Figure 2 shows the Catalog OMAS API calling the Local Atlas OMRS Connector, which provides access to the local metadata repository. Within the Atlas metadata repository is the graph that holds the metadata entities and the relationships that link them. The graph is supported by other data stores that provide logs and other supporting information. The repository service provides a search API over all of the repository stores as well as a query and update interface.
Figure 3: Catalog search across an enterprise
Figure 3 shows the Catalog OMAS API calling the Enterprise OMRS Connector. This connector makes calls to the local OMRS Connector as well as REST API calls to the OMRS Connectors on remote metadata repositories.
If a remote repository is Apache Atlas, then the OMRS Connector called is a Local Atlas OMRS Connector. However, other metadata repositories may be connected by implementing their own OMRS Connector that translates the OMRS requests into their local API calls.
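The federated pattern in Figure 3 can be sketched as a connector that fans a request out to the local connector and each remote connector, then merges the answers. Everything in the sketch is hypothetical: the class names, the `find_entities` method, the `NativeCatalogClient` with its `lookup` call, and the sample data. The remote REST calls are replaced by in-process calls to keep the example self-contained, and merging duplicates by `guid` is an assumption of this sketch rather than documented OMRS behaviour.

```python
class LocalConnector:
    """Hypothetical connector over one repository, modelled as a list of dicts."""

    def __init__(self, repository_name, entities):
        self.repository_name = repository_name
        self._entities = list(entities)

    def find_entities(self, type_name):
        # Tag each result with the repository that supplied it.
        return [
            dict(entity, repository=self.repository_name)
            for entity in self._entities
            if entity["type"] == type_name
        ]


class NativeCatalogClient:
    """Hypothetical native API of a non-Atlas repository with its own vocabulary."""

    def __init__(self, rows):
        self._rows = list(rows)

    def lookup(self, kind):
        return [row for row in self._rows if row["kind"] == kind]


class AdapterConnector:
    """Translates the common request into the other repository's native call."""

    def __init__(self, repository_name, native_client):
        self.repository_name = repository_name
        self._native = native_client

    def find_entities(self, type_name):
        # Map the native field names (id, kind, label) onto the common shape.
        return [
            {"guid": row["id"], "type": row["kind"], "name": row["label"],
             "repository": self.repository_name}
            for row in self._native.lookup(kind=type_name)
        ]


class EnterpriseConnector:
    """Fans a query out to the local connector and every remote connector."""

    def __init__(self, local, remotes):
        self._connectors = [local, *remotes]

    def find_entities(self, type_name):
        merged = {}
        for connector in self._connectors:
            for entity in connector.find_entities(type_name):
                # Assumption: the first repository to answer for a guid wins,
                # and the local repository is always asked first.
                merged.setdefault(entity["guid"], entity)
        return list(merged.values())


local = LocalConnector("atlas-local", [
    {"guid": "a1", "type": "DataSet", "name": "customers"},
])
remote_atlas = LocalConnector("atlas-remote", [
    {"guid": "a1", "type": "DataSet", "name": "customers"},  # replicated copy
    {"guid": "b7", "type": "DataSet", "name": "orders"},
])
third_party = AdapterConnector("other-catalog", NativeCatalogClient([
    {"id": "x9", "kind": "DataSet", "label": "shipments"},
    {"id": "x10", "kind": "Report", "label": "weekly"},
]))

enterprise = EnterpriseConnector(local, [remote_atlas, third_party])
results = enterprise.find_entities("DataSet")
print([(entity["name"], entity["repository"]) for entity in results])
```

The caller issues the same `find_entities` request whether it holds a local, adapter or enterprise connector, which is what allows the catalog search to switch between the single-repository and enterprise configurations without change.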
<more to come>