Abstract
Gravitino is a high-performance, geo-distributed, and federated metadata lake designed to manage metadata seamlessly across diverse data sources, vendors, and regions. Its primary goal is to provide users with unified metadata access for both data and AI assets.
Background
Gravitino addresses the growing need for multi-cloud, vendor-neutral solutions in the metadata management space. Recognizing the importance of data interoperability, Gravitino ensures compatibility across various cloud providers, preventing vendor lock-in and fostering an open and collaborative ecosystem.
Rationale
The demand for vendor-agnostic solutions is evident in today's tech landscape. Gravitino fulfils this need, offering a flexible and interoperable platform. As enterprises increasingly adopt diverse cloud environments, a unified metadata layer provided under an open source license becomes an attractive solution.
Current Status
In 2023, Datastrato, a startup founded by three Apache members, began developing Gravitino, initially to solve the multi-cloud data silo problem. The scope of Gravitino is to build a unified metadata lake that consolidates different catalog systems for table and non-table data for analytics and AI workloads.
Having transitioned to open source six months ago, Gravitino has made multiple releases and attracted contributors and interest from industry-leading companies such as Amazon, Apple, eBay, Pinterest, Tencent, and Xiaomi. Fortune 500 companies are actively testing Gravitino and using it in production, underscoring its reliability, scalability, and usefulness.
Meritocracy
Gravitino operates with a governance structure similar to established ASF projects. While some adjustments to the release and committer selection processes may be required, these changes are minor. The project already has several experienced ASF members, PMC members, and committers among its contributors. Some of the initial committers have been selected not for their code contributions but for other contributions to the project.
Community
Over the past several months, Gravitino has cultivated a diverse and vibrant community, including contributions from developers and vendors from different backgrounds. The project's adoption by Fortune 500 companies shows its real-world applicability.
Core Developers
While most of the core development team comes from a single company, that company, Datastrato, is committed to open source and is a sponsor of the ASF and LF. The founders and most developers have experience with open source, the ASF Incubator, and ASF projects such as Apache Hadoop, Spark, and YuniKorn.
Alignment
Gravitino leverages various ASF projects, including Apache Iceberg, Apache Hive, Apache Ranger, Apache Spark, and Apache Hadoop. This should foster collaboration within these Apache communities.
Known Risks
The project was developed at a time when generative AI tools became widely used in open source projects, so some AI-generated code may have been introduced during its development. A fragment checker has been used to ensure no accidental inclusion of any incompatible third-party code.
The Gravitino Web UI project currently includes a dependency with an incompatible license (CC-BY-4.0). Plans are in place to address this licensing issue during incubation.
Project Name
An informal search revealed no significant trademark conflicts or clashes with existing open-source software names. The chosen name, Gravitino, comes from the idea that data has weight and can grow over time.
Orphaned Products
The commitment of the core developers to the project's longevity and community health minimizes the risk of abandonment and provides assurance of ongoing development and support. During incubation, we want to widen our contributor base further to decrease the risk of this happening.
Inexperience with Open Source
The project's contributors include individuals with extensive prior experience in open source software, the ASF Incubator, and ASF projects. This collective experience will ensure a smooth transition into the ASF ecosystem. Some of the contributors to the project are involved in other ASF or open source projects. A couple of the initial committers have had no experience contributing to open source before now.
Length of Incubation
Gravitino anticipates an incubation period of a year or so, during which it aims to grow its community base, enhance diversity among contributors, and align further with ASF practices.
Homogeneous Developers
Gravitino is committed to expanding its contributor base beyond its current mostly single-company core developer team. Efforts during incubation will focus on fostering a more diverse community.
Reliance on Salaried Developers
While most contributors are affiliated with Datastrato, the organization was founded by open source enthusiasts and has a genuine commitment to the principles of open source. The project has also received contributions from 50+ people outside of Datastrato.
Relationships with Other Apache Products
Gravitino's integration with Apache Iceberg, Apache Hive, Apache Spark, Apache Ranger and Apache Hadoop enhances its collaborative potential within the Apache ecosystem. The project looks forward to engaging with these communities to broaden its contributor base.
Excessive Fascination with the Apache Brand
Gravitino's interest in joining the ASF is rooted in its extensive use of ASF technologies and its alignment with the principles and practices of the Apache communities. The focus is on collaboration and integration rather than on the promotional value of being part of the ASF.
Documentation
Comprehensive information about Gravitino is available on its website (https://datastrato.ai/) and GitHub pages (https://github.com/datastrato/gravitino).
Initial Source
The source code for Gravitino can be found in the following repositories:
Source and Intellectual Property Submission Plan
Datastrato intends to submit a software grant for the mentioned GitHub repositories. The code is licensed under the Apache license or compatible licenses, and all significant contributors have agreed to provide their contributions under the Apache license (see https://github.com/datastrato/gravitino/blob/main/MAINTAINERS.md). The project produces SBOMs, and each release has undergone an exhaustive review of all included third-party code and dependencies. The project LICENSE and NOTICE files closely follow current ASF policy.
The project uses AI-generated code in some places. A fragment checker has been used to ensure no accidental inclusion of any incompatible third-party code.
External Dependencies
Gravitino's external dependencies align with the ASF license, as outlined in the LICENSE and NOTICE files. Plans are in place to fix the Gravitino Web UI's dependency with an incompatible license (CC-BY-4.0) after entering the Incubator.
Required Resources
Mailing Lists
- private@gravitino.apache.org
- dev@gravitino.apache.org
- user@gravitino.apache.org
- commits@gravitino.apache.org
Git Repositories
Issue Tracking
The project will use GitHub for issue tracking.
Other Resources
Gravitino makes extensive use of GitHub Actions. Recognizing the need to comply with ASF policy on the use of GitHub Actions, the project is prepared to make the necessary adjustments during incubation.
Sponsors
Initial Committers
- Ashish Singh (asingh@apache.org) (GitHub:singhasdev) - Pinterest
- Charlie Cheng (charlie.cheng630@gmail.com) (GitHub:charliecheng630) - cacaFly
- Jerry Shao (jshao@apache.org) (GitHub:jerryshoa) - Datastrato
- Kang Zhou (zhoukang1@xiaomi.com) (GitHub:zhoukangcn) - Xiaomi
- Minghuang Li (liminghuang@datastrato.com) (GitHub:mchades) - Datastrato
- Nicholas Jiang (nicholasjiang@apache.org) (GitHub:SteNicholas) - Bilibili
- Qi Yu (yuqi@datastrato.com) (GitHub:yuqi1129) - Datastrato
- Xing Yong (yongxing@xiaomi.com) (GitHub:YxAc) - Xiaomi
- Xiaojing Fang (xiaojin@datastrato.com) (GitHub:FANNG1) - Datastrato
- Xiao Xu (theoryxu@tencent.com) (GitHub:theoryxu) - Tencent
- Xun Liu (liuxun@apache.org) (GitHub:xunliu) - Datastrato
- Yunqing Wei (tianranjuan0617@gmail.com) (GitHub:Clearvive) - Construct Tech
- Ziva Li (zivalee87@gmail.com) (GitHub:zivali) - Yahoo
Champion
Jean-Baptiste Onofré
Nominated Mentors
- Daniel Dai (daijy@apache.org)
- Junping Du (junping_du@apache.org)
- Justin Mclean (jmclean@apache.org)
- Shaofeng Shi (shaofengshi@apache.org)
- Larry Mccay (larry.mccay@gmail.com)
Sponsoring Entity
We expect the Apache Incubator to sponsor this project.