Apache Atlas


Atlas provides open metadata management and governance capabilities for organizations that use data-intensive platforms such as Apache Hadoop, cloud platforms, and mobile and IoT systems, all of which need to be integrated with their traditional systems to exchange data for analytics and data-driven decisions. Through these capabilities, an organization can build a catalog of its data assets, classify and govern these assets, and provide collaboration capabilities around them for data scientists, analysts and the data governance team.


Why Atlas?

Atlas targets a scalable and extensible set of core foundation metadata management and governance services, enabling enterprises to effectively and efficiently meet their compliance requirements on individual data platforms while ensuring integration with the whole data ecosystem. Apache Atlas is organized around two guiding principles:

  • Metadata truth through automation, collaboration and open standards: Atlas should provide true visibility of the data assets in an organization.  
    • Modern organizations have many IT systems hosting data that collectively use a wide range of technologies. Atlas, as an open source project, will help establish standards for metadata and governance that all technology providers can rally around, helping to break down the data silos that organizations struggle with today.
    • Through APIs, hooks and bridges, Atlas facilitates easy exchange of metadata through open standards, enabling interoperability across many metadata producers.
    • Atlas focuses on the automation of metadata and governance.  It captures details of new data assets as they are created and their lineage as data is processed and copied around.
    • With the extensible typesystem, Atlas is able to bring different perspectives and expertise around data assets together to enable collaboration and innovative use of data.

  • Developed in the open: Atlas was incubated by Hortonworks under the umbrella of the Data Governance Initiative (DGI) in collaboration with a variety of organizations in multiple verticals including financial services, healthcare, oil and gas, retail, and pharma. Engineers from Aetna, JPMorgan Chase, Merck, SAS, Schlumberger, and Target collaborated with Hortonworks on incubating Atlas as an open metadata and governance platform for the Hadoop ecosystem. After 2+ years of maturation, and building on this start, Hortonworks, IBM, ING and many other organizations are now extending Atlas to address data governance problems across a wide range of industries. This approach is an example of open source community innovation that helps accelerate product maturity and time-to-value for a data-driven enterprise.


Atlas today

Figure 1 below shows the initial architecture proposed for Apache Atlas as it went into the incubator.


Figure 1: the initial vision for Apache Atlas


The core capabilities defined by the incubator project included the following:

  • Data Classification – to create an understanding of the data within a data platform such as Hadoop and provide a classification of this data to external and internal sources
  • Centralized Auditing – to provide a framework for capturing and reporting on access to and modifications of data within Hadoop
  • Search and Lineage – to allow pre-defined and ad-hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed
  • Security and Policy Engine – to protect data and rationalize data access according to compliance policy.

The Atlas community has delivered on those requirements with the following components:

  1. Flexible knowledge store and type system
  2. Automatic cataloguing of data assets and lineage through hooks and bridges
  3. APIs and a simple UI to provide access to the metadata
  4. Integration with Apache Ranger to add real-time, tag-based access control to Ranger’s already strong role-based access control capabilities.
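
To make the interplay of these components more concrete, here is a minimal, illustrative sketch in Python. It is not the Atlas API or the Atlas type system: the `Entity` and `Catalog` classes and all of their methods are hypothetical names invented for this example. It models a tiny in-memory catalog in which a hook-like call records lineage, a classification tag is attached to one asset, and an access check treats tags as inherited by downstream assets. That inheritance rule is an assumption of this sketch, chosen to show why lineage and tag-based access control are useful together.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A cataloged data asset (hypothetical model, not the Atlas type system)."""
    name: str
    type_name: str
    tags: set = field(default_factory=set)


class Catalog:
    """Toy in-memory metadata store with lineage and classification tags."""

    def __init__(self):
        self.entities = {}  # asset name -> Entity
        self.inputs = {}    # asset name -> set of direct upstream asset names

    def register(self, name, type_name):
        entity = Entity(name, type_name)
        self.entities[name] = entity
        self.inputs.setdefault(name, set())
        return entity

    def record_process(self, sources, target):
        """What a hook might report: `target` was derived from `sources`."""
        self.inputs[target].update(sources)

    def classify(self, name, tag):
        self.entities[name].tags.add(tag)

    def lineage(self, name):
        """Return every upstream asset by walking the lineage edges."""
        seen = set()
        stack = list(self.inputs[name])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.inputs[current])
        return seen

    def effective_tags(self, name):
        """Own tags plus tags inherited from upstream (assumption of this sketch)."""
        tags = set(self.entities[name].tags)
        for upstream in self.lineage(name):
            tags |= self.entities[upstream].tags
        return tags

    def can_read(self, name, denied_tags):
        """Tag-based check: deny if the asset carries any denied tag."""
        return not (self.effective_tags(name) & denied_tags)


catalog = Catalog()
catalog.register("raw_customers", "table")
catalog.register("clean_customers", "table")
catalog.register("sales_report", "table")
catalog.record_process(["raw_customers"], "clean_customers")
catalog.record_process(["clean_customers"], "sales_report")
catalog.classify("raw_customers", "PII")

print(sorted(catalog.lineage("sales_report")))    # ['clean_customers', 'raw_customers']
print(catalog.can_read("sales_report", {"PII"}))  # False
```

In this sketch the report is never tagged directly, yet a user who is denied `PII` data cannot read it, because the tag is found by following lineage back to the raw table.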

Stay Tuned for More to Come

Atlas today focuses on the Apache Hadoop platform.  However, at its core, Atlas is designed to exchange metadata with other tools and processes within and outside of the Hadoop ecosystem, thereby enabling platform-agnostic governance controls that effectively address compliance requirements.

The projects underway today will expand the platforms Atlas can operate on and its core capabilities for metadata discovery and governance automation. They will also create an open interchange ecosystem of message exchange and connectors, allowing different instances of Apache Atlas and other types of metadata tools to integrate into an enterprise view of an organization's data assets, their governance and use.

Atlas is only as good as the people who are contributing. If metadata management and governance is an area of interest or expertise for you, then please consider becoming part of the Atlas community and Getting Involved.