Abstract
Paimon is a unified lake storage for building dynamic tables for both streaming and batch processing with big data compute engines (e.g. Flink, Spark, Hive, Trino), supporting high-speed data ingestion and real-time data query.
Proposal
Paimon is a lake storage backed by data files on a distributed file system (HDFS, S3, etc.). Unlike other data lake storage projects, Paimon is designed to support both high throughput and low end-to-end latency (better data freshness), especially for intensive UPDATE and DELETE workloads.
Background
Data lake storage is increasingly widely used. It can store data on object storage and provide atomic control of data files at low cost.
Meanwhile, with the adoption of stream processing in production (with technologies such as Flink, Spark Streaming, etc.), there is an increasing demand for storage that simultaneously supports updates, deletes and streaming reads. To support such requirements:
One option is to use OLAP systems like ClickHouse and Apache Doris, which can serve high-speed data ingestion. However, they do not support streaming reads, and their storage cost is relatively high.
Another option is to use existing lake storages, such as Apache Hudi and Apache Iceberg. However, high-speed ingestion of fresh (updating) data from a real-time processing system poses a big challenge and can overwhelm both systems.
To address the limitations of existing solutions, we created Paimon, a lake storage that aims to:
Support storage of large datasets and allow read/write in both batch and streaming mode.
Support incremental snapshots for stream consumption.
Support streaming queries with minimum latency down to milliseconds.
Support Batch/OLAP queries with minimum latency down to the second level.
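To illustrate how incremental snapshots can serve both batch and streaming readers, here is a minimal Python sketch. It is not Paimon's actual snapshot format or API: the class `ToySnapshotLog`, the helper `apply`, and the change kinds "+I", "+U" and "-D" are all hypothetical names chosen for this example. Each commit records only its own changes, a batch reader folds all snapshots into the latest state, and a streaming reader remembers the last snapshot id it consumed and fetches only what was committed afterwards.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical change record: (kind, key, value); kind is "+I", "+U" or "-D".
Change = Tuple[str, int, str]


def apply(state: Dict[int, str], changes: List[Change]) -> None:
    """Apply a list of changes to an in-memory table state, in order."""
    for kind, key, value in changes:
        if kind == "-D":
            state.pop(key, None)   # delete removes the key if present
        else:
            state[key] = value     # insert and update both overwrite the key


@dataclass
class ToySnapshotLog:
    """Each commit appends one snapshot holding only that commit's changes."""
    snapshots: List[List[Change]] = field(default_factory=list)

    def commit(self, changes: List[Change]) -> int:
        self.snapshots.append(list(changes))
        return len(self.snapshots)  # snapshot ids start at 1

    def read_batch(self) -> Dict[int, str]:
        """Batch read: fold every snapshot into the latest table state."""
        state: Dict[int, str] = {}
        for changes in self.snapshots:
            apply(state, changes)
        return state

    def read_incremental(self, after_id: int) -> Tuple[int, List[Change]]:
        """Streaming read: return the newest snapshot id and only the
        changes committed after snapshot `after_id` (0 means from the start)."""
        new = [c for changes in self.snapshots[after_id:] for c in changes]
        return len(self.snapshots), new
```

A streaming consumer would call `read_incremental` repeatedly, passing back the id it received last time; applying the returned changes in order yields the same state as `read_batch`.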
Rationale
Paimon natively adopts LSM (Log-Structured Merge-tree) as its underlying data structure, and provides enhanced performance for data with primary keys, in addition to common lake storage capabilities. What's more, Paimon supports both batch and stream operations (reads and writes), facilitating applications pursuing batch-stream-unified semantics. Specifically:
Paimon provides excellent performance on intensive update / delete workloads, leveraging the append-write feature of the LSM data structure.
Paimon utilizes the ordered feature of LSM to support effective filter pushdown, and can reduce the latency of queries with primary key filtering to milliseconds.
Paimon supports various (row-based or row-columnar) file formats including Apache Avro, Apache ORC and Apache Parquet (rows will be sorted by the primary key before writing out).
Tables provided by Paimon can be queried by various engines, including Apache Flink, Apache Spark, Apache Hive, Trino, etc.
Paimon's metadata is self-managed, stored on the distributed file system, and can be synchronized to the Hive Metastore (HMS).
Besides the common batch read and write support, Paimon also supports streaming read and change data feed.
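The two LSM properties mentioned above (append-only writes and key-ordered files) can be illustrated with a toy Python sketch. This is a simplified model under stated assumptions, not Paimon's implementation: `ToyLsmTable`, `SortedRun`, `TOMBSTONE` and `buffer_limit` are hypothetical names, records live in memory rather than in ORC/Parquet/Avro files, and compaction is a single full merge. Updates and deletes never rewrite existing runs; they are appended as new records, and a point lookup skips any run whose key range cannot contain the requested primary key.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical marker for a delete record; not a Paimon identifier.
TOMBSTONE = object()


@dataclass
class SortedRun:
    """An immutable batch of records sorted by primary key (stands in for a file)."""
    keys: List[int]
    values: List[object]

    def get(self, key: int):
        """Binary search inside the run; returns (found, value)."""
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return True, self.values[i]
        return False, None


class ToyLsmTable:
    def __init__(self, buffer_limit: int = 2):
        self.buffer: Dict[int, object] = {}  # in-memory write buffer
        self.runs: List[SortedRun] = []      # flushed runs, oldest first
        self.buffer_limit = buffer_limit
        self.runs_probed = 0                 # runs actually searched by get()

    def upsert(self, key: int, value: object) -> None:
        """Insert or update: only touches the buffer, never an existing run."""
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def delete(self, key: int) -> None:
        """A delete is just another appended record (a tombstone)."""
        self.upsert(key, TOMBSTONE)

    def flush(self) -> None:
        """Sort buffered records by primary key and append them as a new run."""
        if not self.buffer:
            return
        keys = sorted(self.buffer)
        self.runs.append(SortedRun(keys, [self.buffer[k] for k in keys]))
        self.buffer = {}

    def get(self, key: int):
        """Point lookup: newest data wins; runs are pruned by key range."""
        if key in self.buffer:
            value = self.buffer[key]
            return None if value is TOMBSTONE else value
        for run in reversed(self.runs):
            if key < run.keys[0] or key > run.keys[-1]:
                continue  # key range excludes this run, skip without reading it
            self.runs_probed += 1
            found, value = run.get(key)
            if found:
                return None if value is TOMBSTONE else value
        return None

    def compact(self) -> None:
        """Full merge of all runs: newer records overwrite older, tombstones drop."""
        merged: Dict[int, object] = {}
        for run in self.runs:  # oldest first, so later runs overwrite earlier ones
            merged.update(zip(run.keys, run.values))
        live = sorted(k for k, v in merged.items() if v is not TOMBSTONE)
        self.runs = [SortedRun(live, [merged[k] for k in live])] if live else []
```

In this sketch, the `runs_probed` counter makes the effect of key-range pruning observable: a lookup only searches runs whose minimum and maximum keys bracket the requested key.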
Initial Goals
Paimon was founded in the Flink community in 2022 under the name "Flink Table Store" and has been developed for more than one year. As its adoption expands to more computing engines, some ecosystem users have expressed concerns about the neutrality of the project. This prompted us to rethink the positioning of Flink Table Store, which can be an independent lake storage.
After thorough discussion, we have obtained the support of the Flink community to enter Apache incubation, with the following expectations:
Expand Paimon's ecosystem, providing independent Java APIs to support reading and writing from more big data engines such as Spark, Hive, Trino, Presto, Doris, etc.
Supplement key capabilities, especially streaming reads and intensive updates/deletes, for creating a unified and easy-to-use streaming data warehouse (lakehouse).
Grow into a more vibrant and neutral open source community.
Current Status
Paimon was founded in the Apache Flink community in 2022 under the name "Flink Table Store". It has been developed for one year and has produced 4 releases.
Meritocracy
Paimon has been run in the Apache Way from the very beginning. Meritocracy is the foundation of this project, and many companies and individuals have been encouraged to contribute to it.
Community
Due to the origin of the project, most existing developers of Paimon come from the Apache Flink community. Everyone can access the roadmap, issues and design documents and discuss them publicly through the mailing list, JIRA, Slack channels, etc. Our main developers are from Alibaba, joined by developers from other companies. By entering the Apache Incubator, we believe Paimon can be promoted to more companies and individuals.
Users
Paimon has been used by various users and companies, including Alibaba, Bilibili, ByteDance and so on. Paimon is also integrated into Alibaba Cloud's E-MapReduce and Realtime Compute products to provide cloud services.
Core Developers
Jingsong Lee. He is the founder of this project and a PMC member of Apache Flink, from Alibaba. (GitHub ID: JingsongLi)
Caizhi Weng. He is a developer of the project and a Committer of Apache Flink, from Alibaba. (GitHub ID: tsreaper)
Fang Yong. He is a developer of the project, from ByteDance. (GitHub ID: FangYongs)
Nicholas Jiang. He is a developer of the project and Committer of many Apache projects such as Apache ShardingSphere, from Bilibili. (GitHub ID: SteNicholas)
Feng Wang. He is the chief architect as well as a developer of the project, and a Committer of Apache Flink, from Alibaba. (GitHub ID: wangfengpro)
Timo Walther. He is an architect of the project, actively involved in many technical (FLIP) discussions, and a PMC member of Apache Flink, from Confluent. (GitHub ID: twalthr)
Alignment
Paimon aims to integrate with various big data computing engines, many of which are Apache projects, such as Apache Flink, Apache Hive and Apache Spark. We will continue to expand Paimon's ecosystem.
Known Risks
Project Name
Paimon is a spirit named in early grimoires. These include The Lesser Key of Solomon (in the Ars Goetia), Johann Weyer's Pseudomonarchia Daemonum, Collin de Plancy's Dictionnaire Infernal and so on.
Paimon is an NPC in the game "Genshin Impact" who accompanies the Traveler throughout their adventure in Teyvat as their guide.
Orphaned products
Paimon originates from Apache Flink and is used, or is being adopted, by many companies such as Alibaba, Bilibili, ByteDance, etc. Developers from both the community and these companies are committed to the future development of Paimon. We are now actively operating the project and will continue to increase the vitality of the community to attract more contributors.
Inexperience with Open Source
Paimon was born as a sub-project of Apache Flink, and the Flink community decided to donate it to the Apache Incubator. All existing contributors of Paimon have been involved in the Flink community and are familiar with the Apache Way.
Homogenous Developers
The current contributors come from various organizations, including Alibaba, Bilibili, ByteDance, Confluent, etc. We are committed to recruiting additional committers based on their contributions to the project.
Reliance on Salaried Developers
Most of the developers are paid by their employers to contribute to this project. Given some volunteer developers and the committers' sense of ownership of the code, the project could continue even if no salaried developers contributed to the project.
Relationships with Other Apache Products
Paimon aims to become the de facto data lake storage that can cooperate with most computing engines to provide mature data lake solutions. It can currently work with Apache Flink, Apache Spark, Apache Hive, Presto, etc., and we will actively develop its ecosystem and integrate it with more computing engines, such as Apache Doris.
An Excessive Fascination with the Apache Brand
We believe the Apache way, not the brand, will help Paimon grow and persist. We hope to make sure that a very inclusive, diverse, and meritocratic community is built outside the umbrella of a single company.
Documentation
Documentation can be found at https://nightlies.apache.org/flink/flink-table-store-docs-master/
Initial Source
The initial source code for Paimon is hosted at https://github.com/apache/flink-table-store
The repository has not yet been renamed; we will rename it from flink-table-store to Paimon after the incubation proposal is approved.
Source and Intellectual Property Submission Plan
Paimon originates as a sub-project of Apache Flink, and all sources and IP belong to the ASF. Most initial committers are already Apache committers with ICLAs submitted, and we will ask non-Apache committers to submit their ICLAs as soon as Paimon is approved to join the Apache Incubator. The codebase is already licensed under Apache License 2.0. We will deprecate the current source repository and redirect it to the new incubator project repository after approval.
External Dependencies
ASF Projects
Apache Flink
Apache Hadoop
Apache Hive
Apache ORC
Apache Parquet
Apache Spark
Apache Avro
Apache License 2.0
org.apache.orc:orc-core
org.apache.orc:orc-shims
org.apache.hive:hive-storage-api
io.airlift:aircompressor
commons-lang:commons-lang
org.apache.avro:avro
com.fasterxml.jackson.core:jackson-core
com.fasterxml.jackson.core:jackson-databind
com.fasterxml.jackson.core:jackson-annotations
org.apache.commons:commons-compress
org.apache.parquet:parquet-avro
org.apache.parquet:parquet-hadoop
org.apache.parquet:parquet-column
org.apache.parquet:parquet-common
org.apache.parquet:parquet-encoding
org.apache.parquet:parquet-format-structures
org.apache.parquet:parquet-jackson
commons-pool:commons-pool
com.aliyun.oss:aliyun-sdk-oss
com.aliyun:aliyun-java-sdk-core
com.aliyun:aliyun-java-sdk-kms
com.aliyun:aliyun-java-sdk-ram
com.google.code.gson:gson
commons-codec:commons-codec
commons-logging:commons-logging
io.opentracing:opentracing-api
io.opentracing:opentracing-noop
io.opentracing:opentracing-util
org.apache.hadoop:hadoop-aliyun
org.apache.httpcomponents:httpclient
org.apache.httpcomponents:httpcore
org.codehaus.jettison:jettison
org.ini4j:ini4j
stax:stax-api
com.amazonaws:aws-java-sdk-core
com.amazonaws:aws-java-sdk-dynamodb
com.amazonaws:aws-java-sdk-kms
com.amazonaws:aws-java-sdk-s3
com.amazonaws:aws-java-sdk-sts
com.amazonaws:jmespath-java
BSD
com.google.protobuf:protobuf-java
BSD 3-clause
org.antlr:antlr4-runtime
Eclipse Distribution License 2.0
org.jacoco:org.jacoco.agent:runtime
MIT License
org.slf4j:slf4j-api
The JDOM License
org.jdom:jdom2
Cryptography
Paimon does not currently include any cryptography-related code.
Required Resources
Mailing lists
Git Repositories:
Issue Tracking
The community would like to continue using GitHub Issues.
Other Resources
The community has already chosen GitHub Actions as its continuous integration tool.
Initial Committers
Jingsong Lee (lzljs3620320@apache.org)
Caizhi Weng (czweng@apache.org)
Fang Yong (zjureel@gmail.com)
Nicholas Jiang (nicholasjiang@apache.org)
Feng Wang (fengwang@apache.org)
Timo Walther (twalthr@apache.org)
Sponsors
Champion
Yu Li (liyu@apache.org)
Nominated Mentors
Becket Qin (jqin@apache.org)
Robert Metzger (rmetzger@apache.org)
Stephan Ewen (sewen@apache.org)
Yu Li (liyu@apache.org)
Sponsoring Entity
We expect the Apache Incubator to sponsor this project.