Abstract
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time). It was initially known as Waterdrop and renamed SeaTunnel on October 12, 2021.
Proposal
SeaTunnel is a distributed, high-performance data integration platform with the following features:
- Integrates data from different data platforms in real time
- Delivers high performance on billions of rows of data
- Ensures no data loss or duplication
- Makes it easy to monitor the overall status of the system
SeaTunnel already has a strong community. We believe that bringing SeaTunnel into the ASF would help it develop a much stronger and more diverse open source community.
The purpose of the proposal is to donate the existing SeaTunnel code repository, developers, and communities to the Apache Software Foundation (ASF) to build a global, diversified and autonomous open-source community in the field of data synchronization.
The related work is currently available on GitHub, and the code is already licensed under the Apache License, Version 2.0:
https://github.com/InterestingLab/SeaTunnel Code repository
https://github.com/InterestingLab/SeaTunnel-docs Document repository
https://github.com/InterestingLab/SeaTunnel-example Examples
Background
In the era of big data, with the emergence of excellent projects such as Apache Spark, Apache Flink, Apache Hadoop, and Apache Kafka, data storage and processing capabilities are rapidly increasing. With the growth of all kinds of data platforms, users often encounter the following issues when synchronizing data among different data platforms:
- Data loss and duplication
- Task accumulation and delay
- Low throughput
- Lack of application running status monitoring
- Complex redevelopment and expansion
We developed SeaTunnel at Leshi in 2017 and open-sourced it to address the issues above. SeaTunnel now runs in many users' production environments and has won wide recognition from them.
Rationale
SeaTunnel processes data in a single flow, Input → Filter → Output, which together form a SeaTunnel pipeline. Users only need to write a SeaTunnel configuration file to read, process, and write data. Input and Output define the data source and the destination, and Filter holds a series of transformations that cover most data analysis needs. SeaTunnel provides a variety of plugins for real-time and offline synchronization of massive data in different scenarios, and users can implement new plugins to meet their own requirements.
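The Input → Filter → Output flow can be illustrated with a minimal sketch. This is not SeaTunnel's actual API or configuration format: the registries, plugin names (memory, drop_nulls, uppercase, collect), and the run_pipeline function are all hypothetical, chosen only to show how a configuration can select plugins and chain them into a pipeline.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]

# Hypothetical plugin registries; every name here is illustrative,
# not taken from SeaTunnel itself.
INPUTS: Dict[str, Callable[[dict], Iterable[Row]]] = {
    # Input plugin: reads rows from an in-memory list given in the config.
    "memory": lambda opts: iter(opts["rows"]),
}


def _drop_nulls(rows: Iterable[Row], opts: dict) -> Iterable[Row]:
    # Filter plugin: skips rows whose configured field is missing or None.
    name = opts["field"]
    return (r for r in rows if r.get(name) is not None)


def _uppercase(rows: Iterable[Row], opts: dict) -> Iterable[Row]:
    # Filter plugin: upper-cases the configured field in each row.
    name = opts["field"]
    return ({**r, name: str(r[name]).upper()} for r in rows)


FILTERS: Dict[str, Callable[[Iterable[Row], dict], Iterable[Row]]] = {
    "drop_nulls": _drop_nulls,
    "uppercase": _uppercase,
}


def _collect(rows: Iterable[Row], opts: dict) -> None:
    # Output plugin: appends every row to the list supplied in the config.
    opts["sink"].extend(rows)


OUTPUTS: Dict[str, Callable[[Iterable[Row], dict], None]] = {
    "collect": _collect,
}


def run_pipeline(config: dict) -> None:
    """Run Input -> Filter(s) -> Output as described by the config."""
    source = config["input"]
    rows = INPUTS[source["plugin"]](source)
    for step in config.get("filter", []):
        rows = FILTERS[step["plugin"]](rows, step)
    target = config["output"]
    OUTPUTS[target["plugin"]](rows, target)


sink: List[Row] = []
config = {
    "input": {
        "plugin": "memory",
        "rows": [{"name": "ada"}, {"name": None}, {"name": "linus"}],
    },
    "filter": [
        {"plugin": "drop_nulls", "field": "name"},
        {"plugin": "uppercase", "field": "name"},
    ],
    "output": {"plugin": "collect", "sink": sink},
}
run_pipeline(config)
print(sink)  # [{'name': 'ADA'}, {'name': 'LINUS'}]
```

In this sketch, adding a new plugin amounts to registering one more function in the relevant dictionary, which mirrors the idea that users can extend the pipeline without changing its core.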
Current Status
Meritocracy
SeaTunnel was founded in 2017 and open-sourced on GitHub on July 30 of the same year. The project now has contributors and users from more than a dozen companies.
Community
SeaTunnel has devoted itself to building an active community since it was open-sourced. In four years, we have organized multiple meetups and gained contributors ranging from coders and community managers (Lifeng Nie) to evangelists. Currently, we have a user base of more than 2,000, and we will move to mailing lists and other communication methods that are more in line with the Apache Way to better develop our project.
We invite everyone who follows the Apache Way to contribute, and we hope the contributor community will thrive with new talent. The code of SeaTunnel is currently hosted on GitHub, and we will use a Slack channel for community communication.
Furthermore, SeaTunnel will welcome more developers and user communities to join the project during the incubation period.
Core Developers
The core developers of SeaTunnel are all experienced open-source developers and team leaders, and they bring considerable diversity to the team.
Known Risks
Orphaned products
Contributors and community cooperation have turned the project from a small concept into a real data processing tool. SeaTunnel has been widely adopted by companies in China, and several large companies (Shuidichou, Weibo, yixia) use it extensively in practice. For example, at Shuidichou, the data set that SeaTunnel synchronizes every day has exceeded 3 T. In addition, two of the initial committers will devote themselves full-time to the project.
So far, SeaTunnel has released 32 versions, including 2 major versions. More than ten contributors and thousands of forks further show that SeaTunnel is actively supported, and we seek to grow the community further with the help of Apache. As a consequence, SeaTunnel is unlikely to become an orphaned project.
Inexperience with Open Source
One of the initial committers, Calvin Kirs, is a PMC member of Apache DolphinScheduler. He participated in the entire process of DolphinScheduler's graduation from the Apache Incubator and has a deep understanding of both the Apache release process and the Apache Way. Jiajie Zhong is a committer on Apache Airflow. Other members are also experienced in open-source projects, and we will continue to learn the Apache Way during incubation to gain more open-source experience. Therefore, we believe that we are experienced enough to operate a well-run community.
Homogenous Developers
The core developers currently work at Tencent, WhaleOps, and other companies, and some individual developers have been accepted as SeaTunnel developers. Because many users still show great interest in SeaTunnel, we are encouraging and inviting them to become contributors and help develop the project.
Reliance on Salaried Developers
Some of the committers are paid by their employers to contribute to SeaTunnel. Both SeaTunnel and massive data synchronization in general are attractive and important to the companies that the contributors work for, which will help the community grow. As new contributors continue to join, we will work hard to become more diversified.
Relationships with Other Apache Products
Apache DolphinScheduler supports SeaTunnel as one of its task plugins.
We have integrated with Apache Spark, Apache Flink, and Apache Commons, and we plan to integrate more closely with other Apache projects (mainly big data projects) in the future.
An Excessive Fascination with the Apache Brand
We believe that the Apache brand will bring greater value and reputation to SeaTunnel. However, it is the community provided by the Apache Software Foundation that we hope will enable the project to achieve long-term stable development. Therefore, we propose SeaTunnel be incubated in Apache to help diversify the community, instead of taking advantage of the Apache brand.
Documentation
A complete set of documents is provided on GitHub, including English and Simplified Chinese versions.
Initial Code repository
The project consists of three different code repositories, and each of them is offered as a separate git repository.
- https://github.com/InterestingLab/seatunnel Code
- https://github.com/InterestingLab/seatunnel-docs Documentation repository
- https://github.com/InterestingLab/seatunnel-example Development Tutorial
Initial Source
SeaTunnel was initially developed within Leshi and was open-sourced under the Apache License 2.0 by the Interesting Lab group on GitHub in 2017. Once SeaTunnel is approved to join the Apache Incubator, Leshi will provide a Software Grant Agreement (SGA), and the initial committers will submit ICLAs. The code is already licensed under the Apache License, Version 2.0.
Source and Intellectual Property Submission Plan
All dependencies use Apache-compatible licenses. The details are as follows:
External Dependencies
Apache-2.0 licenses
- Spark
- Flink
- Parquet
- Phoenix
- Apache-commons
- HBase-connectors
- Elasticsearch (6.3.1)
- elasticsearch-spark (6.8.3)
- Fastjson
- clickhouse-jdbc
- org.lz4:lz4-java
- net.jpountz.lz4:lz4
- com.101tec:zkclient
- com.norbitltd:spoiwo
- lightbend/config (we ported part of the lightbend/config code and will switch to a direct dependency as soon as possible)
MIT licenses
- slf4j
- play-mailer
- com.github.scopt (3.7.1)
Required Resources
Git Repositories
- https://github.com/apache/incubator-seatunnel
- https://github.com/apache/incubator-seatunnel-website
- https://github.com/apache/incubator-seatunnel-example
Issue Tracking
The community would like to continue using GitHub Issues.
Continuous Integration tool
GitHub Actions
Mailing Lists
- seatunnel-dev: for development discussions
- seatunnel-private: for PPMC discussions
- seatunnel-notifications: for user notifications
Initial Committers
- Gary Gao (garygaowork@gmail.com)
- RickyHuo (rickyhuo1994@gmail.com)
- kid-Xiong (ridxiong@gmail.com)
- Calvin Kirs (kirs@apache.org)
- Jiajie Zhong (zhongjiajie@apache.org)
- Lifeng Nie (Nielifeng@gmail.com)
Affiliations
- Individuals: RickyHuo, kid-Xiong, Gary Gao
- WhaleOps: Calvin Kirs, Jiajie Zhong, Lifeng Nie
Sponsors
Champion
- Willem Ning Jiang (ningjiang@apache.org)
Mentors
- Lidong Dai (lidongdai@apache.org)
- Ted Liu (tedliu@apache.org)
- William Guo (guowei@apache.org)
- Zhenxu Ke (kezhenxu94@apache.org)
- Kevin Ratnasekera (djkevincr1989@gmail.com)
- JB Onofré (jb@nanthrax.net)
Sponsoring Entity
We request that the Apache Incubator sponsor this project.