Abstract

Gravitino is a high-performance, geo-distributed, and federated metadata lake designed to manage metadata seamlessly across diverse data sources, vendors, and regions. Its primary goal is to provide users with unified metadata access for both data and AI assets.

Background

Gravitino addresses the growing need for multi-cloud, vendor-neutral solutions in the metadata management space. Recognizing the importance of data interoperability, Gravitino ensures compatibility across various cloud providers, preventing vendor lock-in and fostering an open and collaborative ecosystem.

Rationale

The demand for vendor-agnostic solutions is evident in today's tech landscape. Gravitino fulfils this need, offering a flexible and interoperable platform. As enterprises increasingly adopt diverse cloud environments, a unified metadata layer under an open source license makes Gravitino an attractive solution.

Current Status

Last year (2023), Datastrato, a startup founded by three Apache members, began developing Gravitino, initially to solve the multi-cloud data silo problem. The scope of Gravitino is to build a unified metadata lake that consolidates different catalog systems for table and non-table data for analytics and AI workloads.


Having transitioned to open source six months ago, Gravitino has made multiple releases and attracted contributors and interest from industry-leading companies, including Amazon, Apple, eBay, Pinterest, Tencent, and Xiaomi. Fortune 500 companies are actively testing Gravitino and using it in production, underscoring its reliability, scalability, and usefulness.

Meritocracy

Gravitino operates with a governance structure similar to established ASF projects. While some adjustments to the release and committer selection processes may be required, these changes are minor. The project already has several experienced ASF members among its contributors.

Community

Over the past several months, Gravitino has cultivated a diverse and vibrant community, including contributions from developers and vendors from different backgrounds. The project's adoption by Fortune 500 companies shows its real-world applicability.

Core Developers

While most of the core development team comes from a single company, that company, Datastrato, is committed to open source and is a sponsor of the ASF and LF. The founders all have experience with open source, the ASF Incubator, and ASF projects such as Apache Hadoop, Spark, and YuniKorn.

Alignment

Gravitino leverages various ASF projects, including Apache Iceberg, Apache Hive, Apache Ranger, Apache Spark, and Apache Hadoop. This should foster collaboration within these Apache communities.

Known Risks

The project was developed when the usage of generative AI technology became popular in most open source projects. Inevitably, some AI-generated code could be introduced during its development. A fragment checker has been used to ensure no accidental inclusion of any incompatible third-party code.


The Gravitino Web UI project currently includes a dependency with an incompatible license (CC-BY-4.0). Plans are in place to address this licensing issue during incubation.

Project Name

An informal search revealed no significant trademark conflicts or clashes with existing open-source software names. The chosen name, Gravitino, comes from the idea that data has weight and can grow over time.

Orphaned Products

The commitment of the core developers to the project's longevity and community health minimizes the risk of abandonment and provides assurance of ongoing development and support. During incubation, we want to widen our contributor base further to decrease the risk of this happening.

Inexperience with Open Source

The project includes individuals with extensive prior experience in open source software, the ASF Incubator, and ASF projects. This collective experience will ensure a smooth transition into the ASF ecosystem. Some of the contributors to the project are involved in other ASF or open source projects. A couple of the initial committers have had no experience contributing to open source before now.

Length of Incubation

Gravitino anticipates an incubation period of a year or so, during which it aims to grow its community base, enhance diversity among contributors, and align further with ASF practices.

Homogeneous Developers

Gravitino is committed to expanding its contributor base beyond its current mostly single-company core developer team. Efforts during incubation will focus on fostering a more diverse community.

Reliance on Salaried Developers

While most contributors are affiliated with Datastrato, the company was founded by open source enthusiasts and has a genuine commitment to the principles of open source. The project has also had contributions from 50+ people outside of Datastrato.

Relationships with Other Apache Products

Gravitino's integration with Apache Iceberg, Apache Hive, Apache Spark, Apache Ranger and Apache Hadoop enhances its collaborative potential within the Apache ecosystem. The project looks forward to engaging with these communities to broaden its contributor base.

Excessive Fascination with the Apache Brand

Gravitino's interest in joining the ASF is rooted in its extensive use of ASF technologies and its alignment with the principles and practices of the Apache communities. The focus is on collaboration and integration rather than on the promotional value of being part of the ASF.

Documentation

Comprehensive information about Gravitino is available on its website (https://datastrato.ai/) and GitHub pages (https://github.com/datastrato/gravitino).

Initial Source

The source code for Gravitino can be found in the following repositories:

Source and Intellectual Property Submission Plan

Datastrato intends to submit a software grant for the mentioned GitHub repositories. The code is licensed under the Apache license or compatible licenses, and all significant contributors have agreed to provide their contributions under the Apache license. The project produces SBOMs, and each release has undergone an exhaustive review of all included third-party code and dependencies. The project's LICENSE and NOTICE files closely follow current ASF policy.

The project uses AI-generated code in some places. A fragment checker has been used to ensure no accidental inclusion of any incompatible third-party code. 

External Dependencies

Gravitino's external dependencies align with the ASF license, as outlined in the LICENSE and NOTICE files. Plans are in place to fix the Gravitino Web UI's dependency with an incompatible license (CC-BY-4.0) after entering the Incubator.

Required Resources

Mailing Lists

Git Repositories

Issue Tracking

The project will use GitHub for issue tracking.

Other Resources

Gravitino makes extensive use of GitHub Actions. Recognizing the need to comply with the ASF's rules on the use of GitHub Actions, the project is prepared to make the necessary adjustments during incubation.

Sponsors

Initial Committers

Champion

Jean-Baptiste Onofré

Nominated Mentors


Sponsoring Entity

We expect the Apache Incubator to sponsor this project.


