Abstract

Celeborn is an intermediate data service for big data compute engines (e.g. ETL, OLAP, and streaming engines) that boosts performance, stability, and flexibility. Intermediate data typically includes shuffle and spilled data.

Proposal

Celeborn provides a high-performance intermediate data service for big data compute engines. Instead of writing to local disks, compute engines can push shuffle and spilled data to Celeborn workers, freeing compute nodes from the dependency on large local disks. In addition, Celeborn re-arranges the layout of the intermediate data to be more IO-friendly, thus improving performance and stability.

Background

Intermediate data typically includes shuffle and spilled data, and it is often a major cause of inefficiency, instability, and inflexibility in the lifecycle of a distributed job.

Shuffle is one of the most resource-consuming operators in big-data processing. The existing pull-style implementations have several drawbacks:

  1. They rely on local disks, which prohibits the adoption of a disaggregated architecture and limits elasticity.
  2. They issue massive random disk accesses, which easily saturate IOPS capacity and cause inefficiency and instability.
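To make the second drawback concrete, here is a back-of-the-envelope count of disk reads. It is only a sketch, assuming a job with M map tasks and R reduce partitions in which every mapper writes one local file holding a segment per reducer, and assuming the aggregated layout keeps exactly one file per partition. The function names and the job size are illustrative, not measurements.

```python
def pull_style_reads(num_mappers: int, num_reducers: int) -> int:
    """Each reducer fetches its own segment from every mapper's output file,
    so the number of small, scattered reads grows as M * R."""
    return num_mappers * num_reducers


def aggregated_reads(num_reducers: int) -> int:
    """If the data of each partition is aggregated into a single file,
    each reducer reads one file sequentially, giving R reads in total."""
    return num_reducers


if __name__ == "__main__":
    m, r = 1000, 1000  # hypothetical job size
    print("pull-style segment reads:", pull_style_reads(m, r))
    print("aggregated partition reads:", aggregated_reads(r))
```

Under these assumptions the number of read requests drops from M × R to R, while the total bytes read stay the same. The gain comes from replacing many small random reads with a few large sequential ones.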

Spilled data is another source of intermediate data. Taking Spark as an example, it spills in-memory data to disk whenever a memory limit is reached (e.g. during sort or aggregation). Spilled data must therefore also be handled to support disaggregation and free compute nodes from large local disks.

To solve the above problems, Celeborn takes over both kinds of intermediate data by providing push-style APIs, re-arranging the data layout to be more IO-friendly, and leveraging DFS and object stores to provide massive storage space.

Rationale

Disaggregated architecture has several benefits, such as provisioning compute and storage nodes separately, better elasticity, and better cost-effectiveness. However, in many cases intermediate data depends on large local disks, which prevents a shift to such an architecture. Thus, taking over all kinds of intermediate data, i.e. shuffle and spilled data, is one of the key motivations of Celeborn.

Another key motivation of Celeborn is to solve the inefficiency and instability issues caused by the existing pull-style shuffle, as described in the Background section.

Celeborn solves the problems through the following core designs:

  1. Push-based shuffle plus partition data aggregation to turn random IO access into sequential access.
  2. FileSystem-like API to support writing spilled data.
  3. Hierarchical storage from memory to DFS/object store to enable fast cache and massive storage space.
  4. Engine-agnostic APIs for easy integration with various engines.
  5. Extended fault tolerance and optional data replication to increase reliability.
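The first design, push-based shuffle with partition data aggregation, can be illustrated with a toy in-memory model. This is a minimal sketch and not Celeborn's actual API: the class `ToyShuffleWorker`, the functions `partition_of` and `run_mapper`, and the byte-sum partitioner are all hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int]


class ToyShuffleWorker:
    """Hypothetical stand-in for a shuffle worker. It keeps one append-only
    buffer per partition, so data pushed by many mappers ends up contiguous."""

    def __init__(self) -> None:
        self._partitions: Dict[int, List[Record]] = defaultdict(list)

    def push(self, partition_id: int, records: List[Record]) -> None:
        # Appending models a sequential write to the partition's file.
        self._partitions[partition_id].extend(records)

    def read_partition(self, partition_id: int) -> List[Record]:
        # A reducer reads its whole partition in one sequential pass.
        return list(self._partitions.get(partition_id, []))


def partition_of(key: str, num_partitions: int) -> int:
    # Deterministic toy partitioner (sum of UTF-8 byte values); an assumption
    # made for this sketch, not the partitioner of any real engine.
    return sum(key.encode("utf-8")) % num_partitions


def run_mapper(worker: ToyShuffleWorker, records: List[Record],
               num_partitions: int) -> None:
    # Group the mapper's output by target partition, then push each batch
    # to the worker instead of writing it to a local disk.
    grouped: Dict[int, List[Record]] = defaultdict(list)
    for key, value in records:
        grouped[partition_of(key, num_partitions)].append((key, value))
    for partition_id, batch in grouped.items():
        worker.push(partition_id, batch)


if __name__ == "__main__":
    worker = ToyShuffleWorker()
    run_mapper(worker, [("a", 1), ("b", 2)], num_partitions=2)
    run_mapper(worker, [("a", 3), ("c", 4)], num_partitions=2)
    for pid in range(2):
        print(pid, worker.read_partition(pid))
```

In this model the records of one partition from different mappers are stored together, so a reducer issues a single read per partition rather than one read per mapper. A real service would additionally need durable storage, replication, and failure handling, which the sketch leaves out.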

Celeborn is currently used in production at Alibaba and many other companies, serving petabytes of data per day, which demonstrates its ability to solve real-world problems.

Initial Goals

Although many of the core features of Celeborn have been developed and verified in production (mainly with Spark jobs), there is still much to improve, such as the following requirements from users:

  1. Extension: support more engines of various categories, e.g. streaming engines (Apache Flink), OLAP engines, and engines written in Python and native languages.
  2. Completeness: implement the support for spilled data and possibly other kinds of intermediate data.
  3. Enhancement: rolling upgrade, multi-tenant, layered storage, columnar shuffle, enhanced performance and fault tolerance, etc.

Current Status

Meritocracy

Celeborn was started at Alibaba in 2020 under the project name "RemoteShuffleService" and was open-sourced in 2021. Since then, Celeborn has gained strong interest from dozens of companies and individuals. Roadmaps, issues, and design docs are accessible to everyone and discussed among developers. Celeborn already has active contributors from different organizations.

We encourage community members to participate, and those who invest time and talent in the project will have privileges, authority, and influence over decisions. We will try our best to build an environment that supports meritocracy. We believe the community and project will grow better if we run in the Apache Way.

Community

Celeborn has been building a community of users and contributors since it was open-sourced. We have organized several online meetups with data-infra teams from various well-known companies, and we have a dozen contributors from outside Alibaba, some of whom are among the most active contributors. We frequently hold online discussions with our contributors. By bringing Celeborn to Apache, we believe we can grow the contributor base by inviting and encouraging more contributors in the Apache Way.

Users

Celeborn is one of the components of the E-MapReduce (EMR) product in Alibaba Cloud, and a dozen customers are using it. Beyond that, we have even more users unrelated to EMR who got in touch with us through the community, including Shopee, NetEase, Bilibili, BOSS, and Synnex, to name a few. Most of our users have made contributions to the project.

Core Developers

  • Keyong Zhou. He is the founder of this project, from Alibaba. (GitHub ID: waitinfuture)
  • Ethan Feng. He loves open source and coding, from Alibaba. (GitHub ID: FMX)
  • Kerwin Zhang. He is a developer of the project, from Alibaba. (GitHub ID: kerwin-zk)
  • AngersZhuuuu. He is an Apache Spark/YARN/Hadoop contributor and an individual open-source enthusiast, from Shopee. (GitHub ID: AngersZhuuuu)
  • Cheng Pan. He is an Apache Kyuubi (Incubating) PPMC member and an Apache Iceberg/Spark contributor, from NetEase. (GitHub ID: pan3793)

Alignment

Celeborn aims to store and serve intermediate data for big-data compute engines, many of which are Apache projects, such as Apache Spark, Apache Flink, and Apache Hive. Celeborn currently supports mainly Spark, and we plan to support Flink and others in the near future. Celeborn is already licensed under Apache License 2.0, and many of the core developers have experience with Apache projects.

Known Risks

Project Name

Celeborn is the name of the White Tree in J. R. R. Tolkien's fantasy stories. "Celeb" is Sindarin for "silver" and "orn" is Sindarin for "tree", and we think it's cool to introduce this fantasy tree to the Apache world. Based on our search results, the term Celeborn is not used as a trademark under any class, so we believe it can legally be used as our project name.

Orphaned Products

Celeborn is widely adopted in Alibaba and is a built-in component of Alibaba Cloud's E-MapReduce product. Developers at Alibaba will keep improving the project to ensure it fulfills current and emerging requirements. Many organizations outside Alibaba are also using Celeborn to enhance their production jobs, and several of them have already contributed. We are actively operating the community and will continue to increase its vitality to attract more contributors.

Inexperience with Open Source

Many of the Celeborn contributors have experience working on open source projects, and by working with our mentors and the Apache community we believe we will be able to conduct ourselves in accordance with Apache Incubator guidelines.

Homogenous Developers

The current contributors come from various organizations, including Alibaba, Shopee, NetEase, Xiaomi, Bilibili, BOSS, and Xiaohongshu. We are committed to recruiting additional committers based on their contributions to the project.

Reliance on Salaried Developers

Most of the developers are paid by their employers to contribute to this project. Nevertheless, given the volunteer developers already involved and the committers' sense of ownership of the code, the project could continue even if no salaried developers contributed to it.

Relationships with Other Apache Products

Celeborn was originally built to enhance Apache Spark and uses Apache Ratis for high availability, and Apache Hadoop for layered storage. Celeborn also uses Apache Commons and Apache Logging. In the future, as Celeborn supports more engines, it will cooperate with Apache Flink, Apache Hive, etc.

An Excessive Fascination with the Apache Brand

We believe the Apache Way, not the brand, will help Celeborn grow and persist. We hope to build a very inclusive, diverse, and meritocratic community outside the umbrella of a single company.

Documentation

Documentation can be found on the project's GitHub Pages site.

Initial Source

The initial source code for Celeborn is hosted at https://github.com/alibaba/RemoteShuffleService.

The repository has not been renamed yet. We will rename it from RemoteShuffleService to Celeborn after the incubation proposal is approved.

Source and Intellectual Property Submission Plan

As soon as Celeborn is approved to join Apache Incubator, our initial committers will submit iCLA(s), SGA, and CCLA(s). The codebase is already licensed under Apache License 2.0.

We will also deprecate the initial source repository and redirect it to the new incubator project repository after approval.

External Dependencies

Apache License 2.0

  • com.fasterxml.jackson.core:jackson-annotations
  • com.fasterxml.jackson.core:jackson-core
  • com.fasterxml.jackson.core:jackson-databind
  • com.fasterxml.jackson.jaxrs:jackson-jaxrs-base
  • com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider
  • com.fasterxml.jackson.module:jackson-module-jaxb-annotations
  • com.fasterxml.woodstox:woodstox-core
  • com.github.stephenc.jcip:jcip-annotations
  • com.google.code.findbugs:jsr305
  • com.google.code.gson:gson
  • com.google.guava:guava
  • com.nimbusds:nimbus-jose-jwt
  • com.squareup.okhttp:okhttp
  • com.squareup.okio:okio
  • commons-beanutils:commons-beanutils
  • commons-cli:commons-cli
  • commons-codec:commons-codec
  • commons-collections:commons-collections
  • commons-io:commons-io
  • commons-net:commons-net
  • io.dropwizard.metrics:metrics-core
  • io.dropwizard.metrics:metrics-graphite
  • io.dropwizard.metrics:metrics-jvm
  • io.netty:netty-all
  • io.netty:netty-buffer
  • io.netty:netty-codec
  • io.netty:netty-codec-dns
  • io.netty:netty-codec-haproxy
  • io.netty:netty-codec-http
  • io.netty:netty-codec-http2
  • io.netty:netty-codec-memcache
  • io.netty:netty-codec-mqtt
  • io.netty:netty-codec-redis
  • io.netty:netty-codec-smtp
  • io.netty:netty-codec-socks
  • io.netty:netty-codec-stomp
  • io.netty:netty-codec-xml
  • io.netty:netty-common
  • io.netty:netty-handler
  • io.netty:netty-handler-proxy
  • io.netty:netty-resolver
  • io.netty:netty-resolver-dns
  • io.netty:netty-resolver-dns-classes-macos
  • io.netty:netty-transport
  • io.netty:netty-transport-classes-epoll
  • io.netty:netty-transport-classes-kqueue
  • io.netty:netty-transport-native-unix-common
  • io.netty:netty-transport-rxtx
  • io.netty:netty-transport-sctp
  • io.netty:netty-transport-udt
  • net.minidev:accessors-smart
  • net.minidev:json-smart
  • org.apache.avro:avro
  • org.apache.commons:commons-compress
  • org.apache.commons:commons-configuration2
  • org.apache.commons:commons-crypto
  • org.apache.commons:commons-lang3
  • org.apache.commons:commons-math3
  • org.apache.commons:commons-text
  • org.apache.curator:curator-client
  • org.apache.curator:curator-framework
  • org.apache.curator:curator-recipes
  • org.apache.hadoop:hadoop-annotations
  • org.apache.hadoop:hadoop-auth
  • org.apache.hadoop:hadoop-client
  • org.apache.hadoop:hadoop-common
  • org.apache.hadoop:hadoop-hdfs-client
  • org.apache.hadoop:hadoop-mapreduce-client-common
  • org.apache.hadoop:hadoop-mapreduce-client-core
  • org.apache.hadoop:hadoop-mapreduce-client-jobclient
  • org.apache.hadoop:hadoop-yarn-api
  • org.apache.hadoop:hadoop-yarn-client
  • org.apache.hadoop:hadoop-yarn-common
  • org.apache.hadoop.thirdparty:hadoop-shaded-guava
  • org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7
  • org.apache.htrace:htrace-core4
  • org.apache.httpcomponents:httpclient
  • org.apache.httpcomponents:httpcore
  • org.apache.kerby:kerb-admin
  • org.apache.kerby:kerb-client
  • org.apache.kerby:kerb-common
  • org.apache.kerby:kerb-core
  • org.apache.kerby:kerb-crypto
  • org.apache.kerby:kerb-identity
  • org.apache.kerby:kerb-server
  • org.apache.kerby:kerb-simplekdc
  • org.apache.kerby:kerb-util
  • org.apache.kerby:kerby-asn1
  • org.apache.kerby:kerby-config
  • org.apache.kerby:kerby-pkix
  • org.apache.kerby:kerby-util
  • org.apache.kerby:kerby-xdr
  • org.apache.kerby:token-provider
  • org.apache.ratis:ratis-client
  • org.apache.ratis:ratis-common
  • org.apache.ratis:ratis-proto
  • org.apache.ratis:ratis-thirdparty-misc
  • org.codehaus.jackson:jackson-core-asl
  • org.eclipse.jetty:jetty-client
  • org.eclipse.jetty:jetty-http
  • org.eclipse.jetty:jetty-io
  • org.eclipse.jetty:jetty-security
  • org.eclipse.jetty:jetty-servlet
  • org.eclipse.jetty:jetty-util
  • org.eclipse.jetty:jetty-util-ajax
  • org.eclipse.jetty.websocket:websocket-api
  • org.eclipse.jetty.websocket:websocket-client
  • org.eclipse.jetty.websocket:websocket-common
  • org.scala-lang:scala-library
  • org.scala-lang:scala-reflect
  • org.xerial.snappy:snappy-java
  • org.lz4:lz4-java
  • org.apache.ratis:ratis-grpc
  • org.apache.ratis:ratis-metrics
  • org.apache.ratis:ratis-netty
  • org.apache.ratis:ratis-server
  • org.apache.ratis:ratis-server-api
  • org.apache.logging.log4j:log4j-1.2-api
  • org.apache.logging.log4j:log4j-api
  • org.apache.logging.log4j:log4j-core
  • org.apache.logging.log4j:log4j-slf4j-impl
  • org.roaringbitmap:RoaringBitmap
  • org.roaringbitmap:shims

BSD

  • org.codehaus.woodstox:stax2-api
  • org.jline:jline

BSD 3-clause

  • com.google.protobuf:protobuf-java
  • org.fusesource.leveldbjni:leveldbjni-all
  • org.codehaus.janino:commons-compiler

BSD 2-clause

  • dnsjava:dnsjava

CDDL 1.1

  • javax.xml.bind:jaxb-api

CDDL + GPLv2

  • javax.servlet:javax.servlet-api

Eclipse Distribution License 1.0

  • jakarta.activation:jakarta.activation-api
  • jakarta.xml.bind:jakarta.xml.bind-api

MIT License

  • org.slf4j:slf4j-api
  • org.slf4j:jcl-over-slf4j
  • org.slf4j:jul-to-slf4j

The Go License

  • com.google.re2j:re2j

Cryptography

Celeborn does not currently include any cryptography-related code.

Required Resources

Mailing lists

Git Repositories

Issue Tracking

The community would like to continue using GitHub Issues (the tracker will be moved to github.com/apache/).

Other Resources

  • The community has already chosen GitHub Actions as its continuous integration tool.

Initial Committers

Sponsors

Champion

Yu Li (liyu@apache.org)

Nominated Mentors

Becket Qin (jqin@apache.org)

Duo Zhang (zhangduo@apache.org)

Lidong Dai (lidongdai@apache.org)

Willem Jiang (ningjiang@apache.org)

Sponsoring Entity

We expect the Apache Incubator to sponsor this project.
