Abstract
Celeborn is an intermediate data service for big data compute engines (i.e. ETL, OLAP, and Streaming engines) to boost performance, stability, and flexibility. Intermediate data typically include shuffle and spilled data.
Proposal
Celeborn provides a high-performance intermediate data service for big data compute engines. Instead of writing to local disks, compute engines could push shuffle and spilled data to Celeborn workers, freeing compute nodes from the dependency on large local disks. In addition, Celeborn re-arranges the layout of the intermediate data to be more IO-friendly, thus improving performance and stability.
Background
Intermediate data typically include shuffle and spilled data, and it's normally a major cause of inefficiency, instability, and inflexibility in the lifecycle of a distributed job.
Shuffle is one of the most resource-consuming operators in big-data processing. The existing pull-style implementations have several drawbacks:
- Rely on local disks, prohibiting the adoption of disaggregated architecture, and limiting elasticity.
- Massive random access to disks, easily saturates IOPS capability, causing inefficiency and instability.
Spilled data is another source of intermediate data. Taking Spark as an example, it spills in-memory data to disks whenever memory limitation is reached (i.e. sort, aggregation). It's necessary to consider spilled data to support disaggregation and free compute nodes from large local disks.
To solve the above problems, Celeborn takes over both kinds of intermediate data by providing push-style APIs, re-arranging data layout to be more IO friendly, and leveraging DFS and object store to enable massive storage space.
Rationale
Disaggregated architecture has several benefits like flexible provision of computing and storage nodes separately, better elasticity, and more cost-effectiveness. However, in a lot of cases, intermediate data depends on large local disks, which prohibits shifting to such architecture. Thus, taking over all kinds of intermediate data, i.e. shuffle and spilled data is one of the key motivations of Celeborn.
Another key motivation of Celeborn is to solve inefficiency and instability issues caused by the existing pull-style shuffle, as described in the former section.
Celeborn solves the problems through the following core designs:
- Push-based shuffle plus partition data aggregation to turn random IO access into sequential access.
- FileSystem-like API to support writing spilled data.
- Hierarchical storage from memory to DFS/object store to enable fast cache and massive storage space.
- Engine-irrelevant APIs for easy integrating to various engines.
- Extended fault tolerance and data replication(optional) to increase reliability.
Celeborn is currently adopted in the production environment at both Alibaba and many other companies, serving petabytes of data per day, which demonstrates its capability of resolving real-world problems.
Initial Goals
Although many of the core features of Celeborn have been developed and verified in the production environment (mainly Spark jobs), there is still a lot to evolve, such as the following requirements from users:
- Extension: support more engines of various categories, i.e. Streaming engines (Apache Flink), OLAP engines, and engines written in python and native languages.
- Completeness: implement the support for spilled data and possibly other kinds of intermediate data.
- Enhancement: rolling upgrade, multi-tenant, layered storage, columnar shuffle, enhanced performance and fault tolerance, etc.
Current Status
Meritocracy
Celeborn was started at Alibaba in 2020 with the project name "RemoteShuffleService" and open-sourced in 2021. Since then, Celeborn has gained strong interest from dozens of companies and individuals. Roadmaps, issues, and design docs are accessible to everyone and discussed across developers. Celeborn already has active contributors from different organizations.
We encourage community members to participate, and those who invest time and talent in the project will have privileges, authority, and influence over decisions. We will try our best to build an environment that supports meritocracy. We believe the community and project will grow better if we run in the Apache Way.
Community
Celeborn has been building a community around users and contributors since it was open-sourced. We have organized several online meetups with data-infra teams from various well-known companies, and we have a dozen of contributors out of Alibaba, some of them are among the most active contributors. We frequently have online discussions with our contributors. By bringing Celeborn to Apache, we believe we can grow the base of contributors by inviting and encouraging more contributors in The Apache Way.
Users
Celeborn is adopted as one of the components of the E-MapReduce (EMR) product in Alibaba Cloud, and a dozen of customers are using it. Beyond that, we have even more users irrelevant to EMR who got in touch with us through the community, including Shopee, NetEase, Bilibily, BOSS, and Synnex, to name a few. Most of our users have made contributions to the project.
Core Developers
- Keyong Zhou. He is the founder of this project, from Alibaba. (GitHub ID: waitinfuture)
- Ethan Feng. He loves open source and coding, from Alibaba. (GitHub ID: FMX)
- Kerwin Zhang. He is a developer of the project, from Alibaba. (GitHub ID: kerwin-zk)
- AngersZhuuuu. He is Apache Spark/YARN/HADOOP Contributor, an individual open-source enthusiast, from Shopee. (GitHub ID: AngersZhuuuu)
- Cheng Pan. He is the Apache Kyuubi (Incubating) PPMC member, Apache Iceberg/Spark Contributor, from NetEase. (GitHub ID: pan3793)
Alignment
Celeborn aims to store and serve intermediate data for big-data compute engines, and many of the projects are from Apache, such as Apache Spark, Apache Flink, Apache Hive, etc. Celeborn currently mainly supports Spark, and we plan to support Flink and others in the near future. Celeborn is already under Apache License 2.0, and many of the core developers have experience on Apache projects.
Known Risks
Project Name
Celeborn is the name of the White Tree in J. R. R. Tolkien's fantasy stories. "Celeb" is the sindarin for "silver", while "orn" is that for "tree", and we think it's cool to introduce this fantasy tree to the Apache world. Based on our search results, the term Celeborn is not used as a trademark under any class, so it is legal to use it as our project name.
Orphaned products
Celeborn is vastly adopted in Alibaba and is a built-in component in Alibaba Cloud's E-MapReduce product. Developers in Alibaba will keep improving the project to ensure it will fulfill the current and to-emerge requirements. Many organizations outside Alibaba are also using Celeborn to enhance their production jobs, and several of them have already participated in. We are now actively operating the community and will continue to increase the vitality of the community to attract more contributors to the community.
Inexperience with Open Source
Many of the Celeborn contributors have experience working on open source projects, and by working with our mentors and the Apache community we believe we will be able to conduct ourselves in accordance with Apache Incubator guidelines.
Homogenous Developers
The current contributors are across various organizations, including Alibaba, Shopee, NetEase, Xiaomi, Bilibily, BOSS, Xiaohongshu, etc. We are committed to recruiting additional committers based on their contributions to the project.
Reliance on Salaried Developers
Most of the developers are paid by their employers to contribute to this project. Given some volunteer developers and the committers' sense of ownership of the code, the project could continue even if no salaried developers contributed to the project.
Relationships with Other Apache Products
Celeborn was originally built to enhance Apache Spark and uses Apache Ratis for high availability, and Apache Hadoop for layered storage. Celeborn also uses Apache Commons and Apache Logging. In the future, as Celeborn supports more engines, it will cooperate with Apache Flink, Apache Hive, etc.
An Excessive Fascination with the Apache Brand
We believe the Apache way, not the brand, will help Celeborn grow and persist. We hope to make sure that a very inclusive, diverse, and meritocratic community is built outside the umbrella of a single company.
Documentation
Documentation can be found on the GitHub Pages.
Initial Source
The initial source code for Celeborn is hosted at https://github.com/alibaba/RemoteShuffleService.
The project name is still not renamed and we will rename it from RemoteShuffleService to Celeborn after the incubation proposal is approved.
Source and Intellectual Property Submission Plan
As soon as Celeborn is approved to join Apache Incubator, our initial committers will submit iCLA(s), SGA, and CCLA(s). The codebase is already licensed under Apache License 2.0.
We will also deprecate the initial source repository and redirect it to the new incubator project repository after approved.
External Dependencies
Apache Licence 2.0
- com.fasterxml.jackson.core:jackson-annotations
- com.fasterxml.jackson.core:jackson-core
- com.fasterxml.jackson.core:jackson-databind
- com.fasterxml.jackson.jaxrs:jackson-jaxrs-base
- com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider
- com.fasterxml.jackson.module:jackson-module-jaxb-annotations
- com.fasterxml.woodstox:woodstox-core
- com.github.stephenc.jcip:jcip-annotations
- com.google.code.findbugs:jsr305
- com.google.code.gson:gson
- com.google.guava:guava
- com.nimbusds:nimbus-jose-jwt
- com.squareup.okhttp:okhttp
- com.squareup.okio:okio
- commons-beanutils:commons-beanutils
- commons-cli:commons-cli
- commons-codec:commons-codec
- commons-collections:commons-collections
- commons-io:commons-io
- commons-net:commons-net
- io.dropwizard.metrics:metrics-core
- io.dropwizard.metrics:metrics-graphite
- io.dropwizard.metrics:metrics-jvm
- io.netty:netty-all
- io.netty:netty-buffer
- io.netty:netty-codec
- io.netty:netty-codec-dns
- io.netty:netty-codec-haproxy
- io.netty:netty-codec-http
- io.netty:netty-codec-http2
- io.netty:netty-codec-memcache
- io.netty:netty-codec-mqtt
- io.netty:netty-codec-redis
- io.netty:netty-codec-smtp
- io.netty:netty-codec-socks
- io.netty:netty-codec-stomp
- io.netty:netty-codec-xml
- io.netty:netty-common
- io.netty:netty-handler
- io.netty:netty-handler-proxy
- io.netty:netty-resolver
- io.netty:netty-resolver-dns
- io.netty:netty-resolver-dns-classes-macos
- io.netty:netty-transport
- io.netty:netty-transport-classes-epoll
- io.netty:netty-transport-classes-kqueue
- io.netty:netty-transport-native-unix-common
- io.netty:netty-transport-rxtx
- io.netty:netty-transport-sctp
- io.netty:netty-transport-udt
- net.minidev:accessors-smart
- net.minidev:json-smart
- org.apache.avro:avro
- org.apache.commons:commons-compress
- org.apache.commons:commons-configuration2
- org.apache.commons:commons-crypto
- org.apache.commons:commons-lang3
- org.apache.commons:commons-math3
- org.apache.commons:commons-text
- org.apache.curator:curator-client
- org.apache.curator:curator-framework
- org.apache.curator:curator-recipes
- org.apache.hadoop:hadoop-annotations
- org.apache.hadoop:hadoop-auth
- org.apache.hadoop:hadoop-client
- org.apache.hadoop:hadoop-common
- org.apache.hadoop:hadoop-hdfs-client
- org.apache.hadoop:hadoop-mapreduce-client-common
- org.apache.hadoop:hadoop-mapreduce-client-core
- org.apache.hadoop:hadoop-mapreduce-client-jobclient
- org.apache.hadoop:hadoop-yarn-api
- org.apache.hadoop:hadoop-yarn-client
- org.apache.hadoop:hadoop-yarn-common
- org.apache.hadoop.thirdparty:hadoop-shaded-guava
- org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7
- org.apache.htrace:htrace-core4:
- org.apache.httpcomponents:httpclient
- org.apache.httpcomponents:httpcore
- org.apache.kerby:kerb-admin
- org.apache.kerby:kerb-client
- org.apache.kerby:kerb-common
- org.apache.kerby:kerb-core
- org.apache.kerby:kerb-crypto
- org.apache.kerby:kerb-identity
- org.apache.kerby:kerb-server
- org.apache.kerby:kerb-simplekdc
- org.apache.kerby:kerb-util
- org.apache.kerby:kerby-asn1
- org.apache.kerby:kerby-config
- org.apache.kerby:kerby-pkix
- org.apache.kerby:kerby-util
- org.apache.kerby:kerby-xdr
- org.apache.kerby:token-provider
- org.apache.ratis:ratis-client
- org.apache.ratis:ratis-common
- org.apache.ratis:ratis-proto
- org.apache.ratis:ratis-thirdparty-misc
- org.codehaus.jackson:jackson-core-asl
- org.eclipse.jetty:jetty-client
- org.eclipse.jetty:jetty-http
- org.eclipse.jetty:jetty-io
- org.eclipse.jetty:jetty-security
- org.eclipse.jetty:jetty-servlet
- org.eclipse.jetty:jetty-util
- org.eclipse.jetty:jetty-util-ajax
- org.eclipse.jetty.websocket:websocket-api
- org.eclipse.jetty.websocket:websocket-client
- org.eclipse.jetty.websocket:websocket-common
- org.scala-lang:scala-library
- org.scala-lang:scala-reflect
- org.xerial.snappy:snappy-java
- org.lz4:lz4-java
- org.apache.ratis:ratis-grpc
- org.apache.ratis:ratis-metrics
- org.apache.ratis:ratis-netty
- org.apache.ratis:ratis-server
- org.apache.ratis:ratis-server-api
- commons-codec:commons-codec
- org.apache.commons:commons-compress
- org.apache.commons:commons-math3
- org.apache.commons:commons-text
- org.apache.curator:curator-framework
- org.apache.curator:curator-recipes
- org.xerial.snappy:snappy-java
- org.apache.logging.log4j:log4j-1.2-api
- org.apache.logging.log4j:log4j-api
- org.apache.logging.log4j:log4j-core
- org.apache.logging.log4j:log4j-slf4j-impl
- org.roaringbitmap:RoaringBitmap
- org.roaringbitmap:shims
BSD
- org.codehaus.woodstox:stax2-api
- org.jline:jline
BSD 3-clause
- com.google.protobuf:protobuf-java
- org.fusesource.leveldbjni:leveldbjni-all
- org.codehaus.janino:commons-compiler
BSD 2-clause
- dnsjava:dnsjava
CDDL 1.1
- javax.xml.bind:jaxb-api
CDDL + GPLv2
- javax.servlet:javax.servlet-api
Eclipse Distribution License 1.0
- jakarta.activation:jakarta.activation-api
- jakarta.xml.bind:jakarta.xml.bind-api
MIT License
- org.slf4j:slf4j-api
- org.slf4j:jcl-over-slf4j
- org.slf4j:jul-to-slf4j
The Go License
- com.google.re2j:re2j
Cryptography
Celeborn does not currently include any cryptography-related code.
Required Resources
Mailing lists
Git Repositories:
Issue Tracking
The community would like to continue using GitHub Issues (but will be moved to github.com/apache/).
Other Resources
- The community has already chosen GitHub actions as continuous integration tools.
Initial Committers
- Keyong Zhou (waitinfuture@gmail.com)
- Ethan Feng (ethan.aquarius.fmx@gmail.com)
- Kerwin Zhang (kerwin.libra.zk@gmail.com)
- WU Wei (woo.wei@gmail.com)
- AngersZhuuuu (angers.zhu@gmail.com)
- Cheng Pan (chengpan@apache.org)
Sponsors
Champion
Yu Li (liyu@apache.org)
Nominated Mentors
Becket Qin (jqin@apache.org)
Duo Zhang (zhangduo@apache.org)
Lidong Dai (lidongdai@apache.org)
Willem Jiang (ningjiang@apache.org)
Sponsoring Entity
We are expecting the Apache Incubator could sponsor this project.