Abstract

Cloudberry Database, built on a newer PostgreSQL kernel (currently 14.4), is one of the most advanced and mature open-source MPP databases available. It offers features such as high concurrency and high availability, and it performs fast, efficient computation on complex tasks, meeting the demands of managing and analyzing vast amounts of data. It evolved from the open-source version of the Pivotal Greenplum DatabaseⓇ and inherits its original Apache License v2.

Cloudberry is 100% ANSI SQL compliant (supporting ANSI SQL-92, SQL-99, and SQL-2003, plus OLAP extensions) and supports both Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC). Most business intelligence, data analysis, and data visualization tools work with Cloudberry out of the box, with no need for specialized drivers.

Given the abandonment of the original open-source version of the Pivotal Greenplum DatabaseⓇ (after the Broadcom acquisition of VMware) and the complete shutdown of its open-source community, we expect many refugee developers and users who have historically coalesced around Pivotal Greenplum DatabaseⓇ to migrate to Cloudberry. In fact, many of these constituents have already said that they want a neutrally governed, community-first version of Pivotal Greenplum DatabaseⓇ.

Cloudberry is implemented in C and C++.

Cloudberry has a few runtime dependencies whose licenses are on the ASF Category X list:

  • bash (GPL 3)
  • iproute (GPL Version 2.0) 
  • perl (GPL)
  • readline (GPL Version 3)
  • rsync (GPL Version 3)
  • sed (GPL Version 3)
  • tar (GPL Version 3)
  • which (GPL 2)

However, given the runtime (dynamic linking) nature of these dependencies, they do not prevent Cloudberry from being considered as an ASF project.

Proposal

The goal of this proposal is to bring the existing Cloudberry codebase and its existing community into the Apache Software Foundation (ASF) in order to continue fostering a vibrant, diverse, and self-governed open source community around the technology. HashData Technology has agreed to transfer the brand name "Cloudberry" to the Apache Software Foundation and will stop using Cloudberry to refer to this software if the project is accepted into the ASF Incubator under the name "Apache Cloudberry (incubating)". HashData Technology will continue to market and sell a data warehouse product that includes Apache Cloudberry (incubating). While Cloudberry is our primary choice for the project name, in anticipation of potential issues with PODLINGNAMESEARCH we have come up with two alternative names: (1) Apache CloudberryDB; or (2) Apache Lemon.

HashData Technology is submitting this proposal to donate the Cloudberry source code and associated artifacts (documentation, website content, wiki, etc.; see the artifacts list below) to the Apache Software Foundation Incubator under the Apache License, Version 2.0, and is asking the Incubator PMC to establish an open source community.

The artifacts are currently available on GitHub at https://github.com/cloudberrydb, including:

Background

The History of Greenplum Database: Closed-source -> Open-source (October 2015) -> Closed-source (May 2024)

The Pivotal Greenplum DatabaseⓇ was originally developed by a company called Greenplum Inc., which was founded in 2003. It is based on the massively parallel processing (MPP) architecture and PostgreSQL technology. Greenplum Database can serve as a data warehouse and be used for large-scale data analytics.

In 2010, Greenplum Inc. was acquired by EMC Corporation. In 2012, EMC and VMware (a subsidiary of EMC) combined several of their software assets, including Greenplum Database, into a new company called Pivotal Software, Inc. Pivotal open-sourced Greenplum’s core engine in 2015 and rebranded it as Pivotal Greenplum DatabaseⓇ. Pivotal Greenplum DatabaseⓇ became the first open-source MPP data warehouse and had a big impact on similar projects. To date, the Greenplum Database has been widely adopted by small, medium, and large teams across different industries. Greenplum Database is also listed among the top 50 most popular databases out of 400+ systems, according to the DB-Engines ranking (https://db-engines.com/en/ranking).

In 2019, VMware acquired Pivotal Software. This acquisition brought the Pivotal Greenplum DatabaseⓇ back into VMware. VMware continued to support the Greenplum Database development and its open-source community and provided VMware Tanzu Greenplum as the commercial product in the following years.

In November 2023, Broadcom completed its acquisition of VMware, bringing Greenplum under Broadcom (https://investors.broadcom.com/news-releases/news-release-details/broadcom-completes-acquisition-vmware). After this acquisition, the Greenplum Database went closed-source in May 2024. Greenplum’s GitHub repository was archived in May 2024 and became read-only, and all pull requests, issues, and releases were purged. The Slack workspace (https://greenplum.slack.com) was shut down and deleted. The user (https://groups.google.com/a/greenplum.org/g/gpdb-users) and dev (https://groups.google.com/a/greenplum.org/g/gpdb-dev) community mailing lists have also gone silent. Other projects under the Greenplum GitHub organization were archived as well. Its owner took these actions without any announcement.

Why Cloudberry Database

In recent years, ownership of the Greenplum Database changed many times, which raised concerns among ecosystem partners and among the users and developers of the open-source Greenplum community. We also recognized that the Greenplum Database had long since lost its drive for innovation and major feature updates. Greenplum lags behind user demands in areas such as performance, cloud-native deployment, and lakehouse support; it cannot keep up with modern industry trends and has become less competitive with newer open source data warehouse and data analytics projects. More importantly, Greenplum has always been controlled by a single vendor and has no open governance model that lets the community participate in decisions and votes.

Cloudberry Database was therefore created as a fork of Greenplum 7 Beta 3 in 2022, and its source code was opened in 2023. Cloudberry Database is not simply a rebrand. It comes with high ambitions and ships many advanced features and highlights, including:

  • Newer PostgreSQL kernel. Cloudberry currently uses 14.4 and plans to upgrade annually. Greenplum uses 12, which reaches EOL in November 2024; see the PostgreSQL Versioning Policy: https://www.postgresql.org/support/versioning/
  • Enhanced security, e.g., support for password policies and more encryption algorithms.
  • End-to-end performance optimization, e.g., parallel query execution, aggregation pushdown, runtime filters, incremental materialized views, vectorized batch processing, JIT compilation, etc.
  • Support for AI/ML workloads, including Directory Table, pgvector, PostgresML, etc.
  • Streaming, lakehouse integration, and more. 

More new features are listed here: https://cloudberrydb.org/docs/cbdb-vs-gp-features. Our roadmap for Cloudberry, some items of which are already in progress, is available at https://github.com/orgs/cloudberrydb/discussions/369.

We value the experience of users and developers. One goal of Cloudberry is to stay compatible with Greenplum so that users can use Cloudberry the same way they use Greenplum. We will also open source our commercial migration tool (and donate it to the ASF as well) to help open-source users migrate from Greenplum to Cloudberry painlessly.

More importantly, we are thinking about how to ensure that Cloudberry has open, community-driven governance so that it does not suffer the same fate as Greenplum. If Cloudberry remains controlled by a single vendor, the project still rests on one fragile root with an unpredictable future. The community cannot bear this happening again. That’s why we are here.

Rationale

MPP-based data management architectures continue their expansion into the enterprise. As the amount of data stored in enterprise clusters grows, unlocking the analytics capabilities and democratizing access to that treasure trove of data becomes one of the key concerns. While the Data Warehouse market has no shortage of purpose-built MPP databases, the easiest and most cost-effective way to onboard the largest number of data consumers is to offer SQL APIs for data retrieval at scale. Today, the undeniable truth is that the hotbed of Open Source SQL API implementation remains the PostgreSQL community. Given the high velocity of innovation happening in the underlying PostgreSQL ecosystem, any Open Source MPP solution has to keep up with the community. We strongly believe that in the Data Warehouse space, this can be optimally achieved through a vibrant, diverse, self-governed community collectively innovating around a single codebase while at the same time cross-pollinating with various other data management communities. The Apache Software Foundation is the ideal place to meet those ambitious goals.

Initial Goals

Our initial goals are to bring Cloudberry into the ASF, transition internal engineering processes into the open, and foster a collaborative development model according to the "Apache Way." HashData Technology and its partners plan to develop new functionality in an open, community-driven way. To get there, the existing internal build, test and release processes will be refactored to support open development.

Current Status

Currently, the project code base is licensed under the Apache License v2 and is available to the general public. The project and its community could best be described as a single-vendor Open Source project on GitHub. The documentation and wiki pages are available at https://cloudberrydb.org/docs. Although Cloudberry has so far been developed mainly by a single company, its roots are in the PostgreSQL community, and the internal engineering practices adopted by the development team lend themselves well to an open, collaborative, and meritocratic environment.

The Cloudberry team has always focused on building a robust end-user community of paying and non-paying customers. The existing documentation, along with GitHub Discussions (https://github.com/orgs/cloudberrydb/discussions), Slack (https://cloudberrydb.slack.com/), and other similar forums, is expected to facilitate conversations among our existing users so as to transform them into an active community of Cloudberry members, stakeholders, and developers.

Meritocracy

Our proposed list of initial committers includes the current Cloudberry R&D team, several existing partners and refugees from the Open Source version of the Pivotal Greenplum DatabaseⓇ. This group will form a base for the broader community we will invite to collaborate on the codebase. We intend to radically expand the initial developer and user community by running the project in accordance with the "Apache Way". Users and new contributors will be treated with respect and welcomed. By participating in the community and providing quality patches/support that move the project forward, contributors will earn merit. They also will be encouraged to provide non-code contributions (documentation, events, community management, etc.) and will gain merit for doing so. Those with a proven support and quality track record will be encouraged to become committers.

Community

If Cloudberry is accepted for incubation, the primary initial goal will be transitioning the core community towards embracing the Apache Way of project governance. We would solicit major existing contributors to become committers on the project from the start.

Core Developers

A few of Cloudberry's core developers are skilled in working as part of openly governed Apache communities (mainly around Hadoop ecosystem). That said, most of the core developers are currently NOT affiliated with the ASF and would require new ICLAs before committing to the project.

Alignment

The following existing ASF projects can be considered when reviewing the Cloudberry proposal:

Apache MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical, graph, and machine learning methods for structured and unstructured data. Cloudberry can integrate with Apache MADlib to provide a range of statistical and machine learning capabilities that can be invoked natively from SQL or from PL/Python, PL/Java, or PL/R. The Apache MADlib community has shown interest in supporting Cloudberry (https://lists.apache.org/thread/p3mlspbjb5vhh94md9lwyxz7xwxm5p9q).

Cloudberry can also integrate with various Apache projects to provide modern solutions for enterprises and users. For instance, Cloudberry will integrate with Apache Kafka, allowing it to stream and process real-time data from multiple sources into the database for quick analysis. Additionally, Cloudberry will integrate with Apache Iceberg, Apache Hudi, or Hadoop to help build a robust lakehouse data platform. These integrations are all part of our open-source roadmap.

Known Risks

Development has been sponsored mostly by a single company thus far and coordinated mainly by the core Cloudberry team.

For the project to fully transition to the Apache Way governance model, development must shift towards the meritocracy-centric model of growing a community of contributors balanced with the needs for extreme stability and core implementation coherency.

The tools and development practices in place for the Cloudberry product are compatible with the ASF infrastructure and thus we do not anticipate any on-boarding pains.

The project currently includes a modified version of PostgreSQL 14.4 source code. Given the ASF's position that the PostgreSQL License is compatible with the Apache License version 2.0, we do NOT anticipate any issues with licensing the code base. However, any new capabilities developed by the Cloudberry team once part of the ASF would need to be consumed by the PostgreSQL community under the Apache License version 2.0. This is very similar to how Apache HAWQ was developed for more than 6 years without any licensing issues raised.

Orphaned products

HashData Technology is fully committed to maintaining its position as one of the leading providers of MPP solutions, and the corresponding Cloudberry commercial product will continue to be based on the Apache Cloudberry project. Moreover, HashData Technology has a vested interest in making Apache Cloudberry successful by driving its close integration with its sister ASF project, Apache MADlib. We expect this to further reduce the risk of orphaning the product.

Inexperience with Open Source

HashData Technology has embraced open source software since its formation by employing contributors/committers and by shepherding open source projects like Apache HAWQ, Apache MADlib, Apache Pulsar, Apache Arrow, Greenplum and PostgreSQL. Although some of the initial committers or PMC members have not had the experience of developing entirely open source, community-driven projects, we expect to bring to bear the open development practices that have proven successful on longstanding HashData Technology open source projects to the Cloudberry community. Additionally, several ASF veterans have agreed to mentor the project and are listed in this proposal. The project will rely on their collective guidance and wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.

Homogeneous Developers

While most of the initial committers are employed by HashData Technology, we have already seen a healthy level of interest from existing customers and partners. We intend to convert that interest directly into participation and will be investing in activities to recruit additional committers from other companies.

Reliance on Salaried Developers

Most of the contributors are paid to work in the MPP space. While they might wander from their current employers, they are unlikely to venture far from their core expertise and thus will continue to be engaged with the project regardless of their current employers.

Relationships with Other Apache Products

As mentioned in the Alignment section, Cloudberry may consider various degrees of integration and code exchange with Apache MADlib, Apache Kafka, Apache Iceberg, Apache Hudi and Apache Hadoop projects. We expect integration points to be inside and outside the project. 

We look forward to collaborating with these communities as well as other communities. A few examples of possible Cloudberry integrations, similar to those of Greenplum, under the Apache umbrella:

An Excessive Fascination with the Apache Brand

While we intend to leverage the Apache ‘branding’ when talking to other projects as testament of our project’s ‘neutrality’, we have no plans for making use of Apache brand in press releases nor posting billboards advertising acceptance of Cloudberry into Apache Incubator.

Documentation

The documentation is currently available at https://cloudberrydb.org/docs/

Initial Source

Initial source code is currently available at https://github.com/cloudberrydb and is licensed under the Apache License v2.

Note: PostGIS for Cloudberry (https://github.com/cloudberrydb/postgis) and PL/R for Cloudberry (https://github.com/cloudberrydb/plr) are licensed under the GPL. The community is discussing contributing the changes back to the upstream projects; if the upstream projects do not accept them, it will be necessary to keep them in separate repositories on GitHub.

Source and Intellectual Property Submission Plan

As soon as Cloudberry is approved to join the Incubator, the source code will be transitioned via a Software Grant Agreement onto ASF infrastructure. We know of no legal encumbrances that would inhibit the transfer of source code to the ASF.

External Dependencies

Cloudberry source code

(1) Runtime dependencies:

  • apr (Apache License 2.0)
  • apr-util (Apache License 2.0)
  • bash (GPL 3)
  • bzip2 (BSD-style License)
  • curl (MIT/X Derivative License)
  • iproute (GPL 2)
  • libuuid (LGPL 2)
  • libuv (MIT)
  • libcurl (MIT/X Derivative License)
  • libevent (BSD-3-Clause License)
  • libedit (BSD-style license)
  • libxml2 (MIT)
  • libyaml (MIT)
  • libssl (OpenSSL License and the Original SSLeay License, BSD style)
  • libzstd (BSD)
  • openldap (The OpenLDAP Public License)
  • openssh (BSD style license)
  • openssh-client (BSD style license)
  • openssh-server (BSD style license)
  • openssl (OpenSSL License and the Original SSLeay License, BSD style)
  • openssl-libs (Apache License 2.0)
  • python (Python Software Foundation License Version 2)
  • psutil - Python Module (BSD-3-Clause License)
  • pyyaml - Python Module (MIT)
  • pygresql - Python Module (PostgreSQL License)
  • psycopg2 - Python Module (LGPL 3)
  • readline (GPL Version 3)
  • rsync (GPL Version 3)
  • sed (GPL Version 3)
  • tar (GPL Version 3)
  • which (GPL 2)
  • zlib (Permissive Free Software License)
  • zip (Info-ZIP license, BSD-like license)

Some runtime dependencies are needed only when ./configure is run with the corresponding option to enable PostgreSQL features that are not built by default:

  • pam (BSD-style License) (needed with the option --with-pam, which builds with PAM support)
  • libicu (ICU License) (needed with the option --with-icu, which builds with support for the ICU library, enabling use of ICU collation features)
  • krb5-libs (MIT) (needed with the option --with-gssapi)
  • lz4 (GPL2, BSD 2-Clause License) (needed with the option --with-lz4, which builds with LZ4 support)
  • Perl (GPL) (needed when the option --with-perl is used or TAP tests are enabled)

(2) Build only dependencies:

  • bison (GPL Version 3)
  • cmake (BSD 3-clause License)
  • flex (BSD License)
  • gcc (GPL Version 3)
  • gcc-c++ (GPL Version 3)

If you enable the options --with-pam, --with-icu, --with-gssapi, --with-lz4, and --with-perl, you also need the corresponding build dependencies: pam-devel, libicu-devel, krb5-devel, lz4-devel, and perl.
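
The pairing between the optional ./configure flags and their extra build packages can be sketched as a small lookup. This is only an illustration under stated assumptions, not part of the Cloudberry build tooling: the names OPTIONAL_FEATURES and plan_build are hypothetical, and the package names are simply the ones listed above.

```python
# Hypothetical helper: maps each optional ./configure flag named in this
# section to the extra build package listed for it. Not Cloudberry tooling.
OPTIONAL_FEATURES = {
    "--with-pam": "pam-devel",
    "--with-icu": "libicu-devel",
    "--with-gssapi": "krb5-devel",
    "--with-lz4": "lz4-devel",
    "--with-perl": "perl",
}


def plan_build(flags):
    """Return (configure command, extra build packages) for the chosen flags."""
    unknown = [flag for flag in flags if flag not in OPTIONAL_FEATURES]
    if unknown:
        raise ValueError("unsupported option(s): " + ", ".join(unknown))
    command = " ".join(["./configure", *flags])
    packages = [OPTIONAL_FEATURES[flag] for flag in flags]
    return command, packages


if __name__ == "__main__":
    command, packages = plan_build(["--with-pam", "--with-lz4"])
    print(command)
    print("extra build packages:", ", ".join(packages))
```

Passing no flags yields a plain ./configure with no extra packages, which matches the default build described above.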

(3) Test only dependencies:

  • googletest (BSD License) (for gpcloud test)
  • perl (GPL) (for TAP tests)
  • perl-ExtUtils-Embed (GPL, Artistic License) (for TAP tests)
  • perl-IPC-Run (GPL, Artistic License) (for TAP tests)
  • perl-Test-Simple (GPL, Artistic License) (for TAP tests)
  • perl-core (GPL) (for TAP tests)

Website

  • MIT License:
    • docusaurus
    • docusaurus-plugin-sass
    • easyops-cn/docusaurus-search-loca
    • floating-ui/react
    • mdx-js/react
    • popperjs/core
    • types/lodash-es
    • ahooks
    • clsx
    • dayjs
    • lodash-es
    • prism-react-renderer
    • react-dom
    • react
    • sass
    • typed.js
    • node
  • Apache License 2.0
    • typescript

gpbackup for Cloudberry

(1) Direct dependencies list

(2) Indirect dependencies list

  • Apache License 2.0
  • BSD-3-Clause license

gpbackup-s3-plugin for Cloudberry

(1) Direct dependencies list

(2) Indirect dependencies list

pxf for Cloudberry

(1) Server dependencies

  • BSD
    • com.esotericsoftware:kryo:3.0.3 (BSD)
    • com.esotericsoftware:minlog:1.3.0 (BSD 3-clause)
    • com.esotericsoftware:reflectasm:1.11.6 (BSD 3-clause)
    • com.google.protobuf:protobuf-java:2.5.0 (BSD 3-clause)
    • org.antlr:antlr-runtime:3.5.2 (BSD 3-clause)
    • org.codehaus.woodstox:stax2-api:3.1.4 (BSD 2-clause)
    • org.jodd:jodd-core:3.5.2 (BSD 2-clause)
    • org.postgresql:postgresql:42.4.1 (BSD 2-clause)
    • org.threeten:threeten-extra:1.5.0 (BSD 3-clause)
  • Apache 2.0
    • com.fasterxml.woodstox:woodstox-core:5.0.3 
    • com.google.guava:guava:20.0 
    • com.google.cloud.bigdataoss:gcs-connector:hadoop2-1.9.17 
    • com.microsoft.azure:azure-storage:5.4.0 
    • com.yammer.metrics:metrics-core:2.2.0 
    • com.zaxxer:HikariCP:3.4.5 
    • commons-codec:commons-codec:1.14 
    • commons-collections:commons-collections:3.2.2 
    • commons-configuration:commons-configuration:1.10 
    • commons-io:commons-io:2.7 
    • commons-lang:commons-lang:2.6 
    • commons-logging:commons-logging:1.1.3 
    • io.airlift:aircompressor:0.8 
    • javax.jdo:jdo-api:3.0.1
    • joda-time:joda-time:2.8.1 
    • net.sf.opencsv:opencsv:2.3 
    • org.apache.commons:commons-compress:1.20 
    • org.apache.htrace:htrace-core:3.1.0-incubating 
    • org.apache.htrace:htrace-core4:4.0.1-incubating 
    • org.apache.zookeeper:zookeeper:3.4.6 
    • org.datanucleus:datanucleus-api-jdo:4.2.4 
    • org.datanucleus:datanucleus-core:4.1.17 
    • org.mortbay.jetty:jetty-util:6.1.26 (EPL 2.0)
    • org.objenesis:objenesis:2.1 
    • org.apache.tomcat.embed:tomcat-embed-core:9.0.72 
    • org.apache.tomcat.embed:tomcat-embed-el:9.0.72 
    • org.apache.tomcat.embed:tomcat-embed-websocket:9.0.72 
    • org.wildfly.openssl:wildfly-openssl:1.0.7.Final 
    • org.xerial.snappy:snappy-java:1.1.8.4 
    • org.apache.hadoop 
    • org.apache.hbase 
    • org.apache.hive 
    • org.apache.parquet:parquet-format:2.7.0 
    • org.apache.thrif 
    • org.apache.orc 
    • org.apache.avro 
    • org.codehaus.jackson 
  • MIT
    • com.microsoft.azure:azure-data-lake-store-sdk:2.3.9
    • org.simplify4u:slf4j-mock:2.1.0
  • LGPL
    • com.google.code.findbugs:annotations:1.3.9
  • Public
    • org.json:json:20090211 (Public)
    • org.tukaani:xz:1.8 (Public)

(2) Command line dependencies

Direct dependencies

Indirect dependencies

Cryptography

The proposal does not include cryptographic code.

Required Resources

Mailing lists

Git Repository

https://github.com/apache/cloudberry-incubating

Issue Tracking

JIRA Project CLOUDBERRY (CLOUDBERRY)

Cloudberry currently uses GitHub Issues (https://github.com/cloudberrydb/cloudberrydb/issues) to track issues. The community would like to continue using GitHub Issues and can also introduce JIRA for community users who are used to it.

Other Resources

Cloudberry uses GitHub Actions for its CI/CD pipeline. Acknowledging the need to comply with the ASF's policies on the use of GitHub Actions, the project is ready to make the necessary adjustments during incubation.

Initial Committers

Affiliations

  • Roman Shaposhnik (Ainekko)
  • Ed Espino (Individual)
  • Dianjin Wang (HashData)
  • Max Yang (HashData)
  • Jianghua Yang (HashData)
  • Zhang Mingli (HashData)
  • Hao Wu (HashData)
  • Jiaqi Zhou (HashData)
  • Antonio Petrole (Individual)
  • Hope Gao (HighGo)
  • Kirill Reshke (Yandex Cloud)
  • Andrey Borodin (Yandex Cloud)
  • Maxim Smyatkin (Yandex Cloud)
  • Sen Hu (HashData)
  • Xiaoran Wang (HashData)
  • Jinbao Chen (HashData)
  • Weinan WANG (HashData)

Sponsors

Champion

Roman Shaposhnik

Nominated Mentors

The initial mentors are listed below:

  • Willem Jiang - Apache Member, ByteDance
  • Roman Shaposhnik - Apache Member, Ainekko
  • XXX

Sponsoring Entity

We would like to propose that the Apache Incubator sponsor this project.
