Abstract

Polaris is a catalog for data lakes. It provides new levels of choice, flexibility and control over data, with full enterprise security and Apache Iceberg interoperability across a multitude of engines and infrastructure. Polaris builds on standards such as those created by Apache Iceberg, providing the following benefits for the ecosystem:

  • Multi-engine interoperability over a single copy of data, eliminating the need for moving and copying data across different engines and catalogs.
  • An interoperable security model providing a unified authorization layer independent from the engines processing analytical tables.
  • For multi-catalog scenarios, a unified catalog-level view of data across multiple catalogs via catalog notification integrations.
  • The ability to host Polaris Catalog on the infrastructure of your choice.

Background

Open source file and table formats have garnered much interest in the data industry because of their potential for interoperability, unlocking the ability for many technologies to safely operate over a single copy of data. Greater interoperability not only reduces the complexity and costs associated with using many tools and processing engines in parallel, but it also reduces potential risks associated with vendor lock-in.

Despite rapid adoption of open file and table formats, many interdependent limitations exist between engines and catalogs, which create lock-in that diminishes the value of Apache Iceberg’s open standards. This leaves data architects and engineers with the task of navigating these constraints and making difficult trade-offs between complexity and lock-in. In an effort to improve interoperability, the Apache Iceberg community has developed an open standard for a REST protocol. The open API specification is a big step toward achieving interoperability, but many challenges remain. With Polaris, we aim to:

  • Support vendor neutrality: Most implementations of the Iceberg Catalog API today are vendor specific. With Polaris, we aim to provide a vendor-neutral, open source catalog option for the ecosystem.

  • Provide a full, production-ready reference implementation: Most implementations today implement a subset of the Iceberg Catalog API. With Polaris, we aim to continue to provide a full reference implementation of the entire Catalog API spec over time; if something is in the spec, it’s available in Polaris and ready for production.

  • Further drive innovation in the community: Specifications are challenging to evolve in a vacuum, and specifications which are primarily driven by vendor-specific implementations often tend towards stagnation and least common denominator subsets of functionality. With Polaris, we aim to help increase the velocity of innovation in the Iceberg Catalog API Spec by approaching missing features, such as security, from a first-principles perspective and with a community-first mindset, with the goal of pushing forward the state of the art in the way most befitting for the community.

Rationale

With Polaris, we believe we can provide a state-of-the-art, open source, vendor-neutral Iceberg catalog built both as a production-ready reference implementation and as a proving ground for potential new Iceberg Catalog Spec features such as security and catalog-level versioning. As of today, Polaris already supports the following key pieces of functionality:

  • Cross-engine read and write interoperability: Many organizations either use various processing engines to perform specific workloads or seek the flexibility to easily add or swap processing engines in the future. Either way, they want the freedom to safely use multiple engines on a single copy of data to minimize the storage and compute costs associated with moving data or maintaining multiple copies.

    Catalogs play a critical role in a multi-engine architecture. They make operations on tables reliable by supporting atomic transactions. This means that data engineers and their pipelines can modify tables concurrently, and queries on these tables produce accurate results. To accomplish this, all Apache Iceberg table read and write operations, even from different engines, are routed through a catalog.

    Polaris implements the full Apache Iceberg open REST API to maximize the number of engines you can utilize, and integrates credential vending with all major public cloud storage vendors.

  • Security: Polaris implements a brand new role-based access control (RBAC) security model, designed from first principles to provide a generalized foundation for fine-grained security. Security is a critical piece of enterprise catalogs, and with Polaris we aim to help push forward the state of the art in OSS catalog security.

  • Multi-tenancy: Built from the ground up to support serving multiple catalogs and multiple users from a single instance.

  • Run anywhere, no lock-in: Polaris can be deployed on the infrastructure of your choice, whether cloud infrastructure or your own infrastructure, using containers with tools such as Docker or Kubernetes. Regardless of how you deploy Polaris, there’s no lock-in.
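To illustrate why routing every read and write through a catalog makes concurrent modification safe, here is a minimal sketch of an atomic compare-and-swap commit with retry. It is a toy in-memory model written for this explanation, not Polaris code or the Iceberg REST API: ToyCatalog, CommitConflict, append_with_retry, the table name, and the integer "metadata version" are all hypothetical stand-ins.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed after the writer last read it."""


class ToyCatalog:
    """Hypothetical in-memory catalog: table name -> current metadata version."""

    def __init__(self):
        self._versions = {}
        self._lock = threading.Lock()

    def load(self, table):
        # Return the version a writer should base its changes on.
        with self._lock:
            return self._versions.get(table, 0)

    def commit(self, table, expected, new):
        # Atomic compare-and-swap: the commit succeeds only if nobody
        # else committed since this writer loaded the table.
        with self._lock:
            current = self._versions.get(table, 0)
            if current != expected:
                raise CommitConflict(
                    f"{table}: expected version {expected}, found {current}"
                )
            self._versions[table] = new


def append_with_retry(catalog, table, max_attempts=5):
    """Reload and retry when another writer wins the race."""
    for _ in range(max_attempts):
        base = catalog.load(table)
        try:
            catalog.commit(table, expected=base, new=base + 1)
            return base + 1
        except CommitConflict:
            continue  # someone else committed first; rebase and try again
    raise CommitConflict(f"{table}: gave up after {max_attempts} attempts")
```

In this sketch, two engines that both start from version 0 cannot both commit: the second one receives CommitConflict and must reload before retrying, so no update is silently lost.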

Future Plans

Though we expect most of the future roadmap to be decided by the community if this proposal is accepted, our intent is to continue pushing forward on the Polaris core with features like catalog-level versioning, additional governance support (Row and Column Access Control (RCAC), User-Based Access Management system (UBAC), etc.), catalog support for non-Iceberg data lakes, and more.

Current Status

Meritocracy

Polaris was initially developed based on ideas from many employees within Snowflake. As an Apache 2.0 open source project, Polaris is receiving contributions from several contributors. As a project under incubation, we are committed to expanding our effort to build an environment that supports meritocracy. We are focused on engaging the community and other related projects for support and contributions. Moreover, we are committed to ensuring that contributors and committers to Polaris come from a broad mix of organizations through a merit-based decision process during incubation. We believe strongly in the Polaris model and are committed to growing an inclusive community for Polaris contributors.

Community

Polaris is cultivating a diverse and vibrant community, with contributions from a number of organizations and individuals. Polaris is currently the main production catalog in a very large company. We believe that bringing Polaris to the ASF will allow us to consolidate existing Polaris-related work, grow the Polaris community, and deepen connections between Polaris and other open source projects.

Core Developers

Most of the core development team comes from two companies, Snowflake and Dremio, both of which are committed to open source and have experience with several ASF projects. Most developers have experience with open source, the ASF Incubator, and ASF projects such as Apache Beam, Iceberg, Camel, ActiveMQ, …

Alignment

Polaris leverages various Apache projects, including Apache Iceberg, …
This should foster collaboration with these Apache projects.

Known Risks

Project Name

Snowflake has filed a US trademark application for “POLARIS CATALOG” (S/N 98575771), and this application will be donated to the ASF.

Orphaned Products

Polaris is presently used by organizations as their main catalog. Snowflake and Dremio have a long-term commitment to advancing the Polaris project; moreover, Polaris is seeing increasing interest, development, and adoption from organizations outside of Snowflake.

Inexperience with Open Source

The project’s composition includes individuals with extensive experience in open source software, the ASF incubator and ASF projects. Most of the Polaris contributors know the Apache way and how to execute projects at ASF.

Homogeneous Developers

Polaris is committed to expanding its contributor base beyond its current core developer team, which comes mostly from two companies. Effort during incubation will focus on fostering a more diverse community.

Reliance on Salaried Developers

Most of the contributors are affiliated with Snowflake and Dremio; both companies are committed to the Apache way and to open source principles.

Relationship with Other Apache Products

Polaris is connected with Apache Iceberg, Apache Hadoop, Apache Spark, Apache Flink, and potentially Apache Hudi, Apache XTable, …
Polaris aims to expand its coverage of other Apache projects as use cases arise, and looks forward to engaging with these communities to broaden its contributor base.

Excessive Fascination with the Apache Brand

Polaris’s interest in joining the ASF is rooted in its extensive use of ASF technologies and its alignment with the principles and practices of the Apache communities. The focus is on collaboration, community building, and integration, rather than on promoting the project as part of the ASF.

Documentation

Polaris documentation is available here: https://polaris-catalog.github.io/polaris/

Initial Source

The initial source for Polaris, which we will submit to the Apache Software Foundation, is available on GitHub: https://github.com/polaris-catalog/polaris
The project is Apache 2.0 licensed.

Source and Intellectual Property Submission Plan

Snowflake intends to submit a software grant agreement for the mentioned GitHub repository. The code is licensed under the Apache 2.0 license, and all contributors have signed contributor license agreements based substantially on ASF CLAs.
The project LICENSE and NOTICE files are similar to the ASF policies.

External Dependencies

Polaris’s external dependencies align with the ASF third-party license policy (according to https://www.apache.org/legal/resolved.html), as outlined in the LICENSE and NOTICE files.
We plan to do a new full pass on our dependencies to double-check that there are no Category B or Category X dependencies.

Required Resources

Mailing Lists

Git Repository

Issue Tracking

The project will use GitHub for issue tracking.

Other Resources

Polaris will use GitHub Actions for CI/CD, in compliance with the ASF policies.

Sponsors

Initial PPMC (with affiliation)

Initial Committers (with affiliation)

  • Aihua Xu (GitHub: aihuaxu) - Snowflake 
  • Ajantha Bhat (GitHub: ajantha-bhat) - Dremio
  • Alex Dutra (GitHub: adutra) - Dremio
  • Dennis Huo (GitHub: dennishuo) - Snowflake
  • Dmitri Bourlatchkov (GitHub: dimas-b) - Dremio
  • Laurent Goujon (GitHub: laurentgo) - Dremio
  • Maninder Parmar (GitHub: manisin) - Snowflake
  • Micah Kornfield (GitHub: emkornfield) - Google 
  • Michael Collado (GitHub: collado-mike) - Snowflake 
  • Vivo Xu (GitHub: hivivo) - Snowflake

Champion

Nominated Mentors

Sponsoring Entity

We expect the Apache Incubator to sponsor the project.

