Abstract

SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time). It was initially known as Waterdrop and renamed SeaTunnel on October 12, 2021.

Proposal

SeaTunnel is a distributed, high-performance data integration platform with the following features:

  • Integrates data from different data platforms in real time
  • High performance with billions of rows of data
  • Ensures no data loss or duplication
  • Easy monitoring of the overall system status

SeaTunnel already has a strong community. We believe that bringing SeaTunnel into the ASF would help build a much stronger and more diverse open source community.

The purpose of the proposal is to donate the existing SeaTunnel code repository, developers, and communities to the Apache Software Foundation (ASF) to build a global, diversified and autonomous open-source community in the field of data synchronization. 

The related work is currently available on GitHub, and the code is already licensed under the Apache License, Version 2.0:

https://github.com/InterestingLab/SeaTunnel Code repository 

https://github.com/InterestingLab/SeaTunnel-docs Document repository

https://github.com/InterestingLab/SeaTunnel-example  Examples

Background

In the era of big data, with the emergence of excellent projects such as Apache Spark, Apache Flink, Apache Hadoop, and Apache Kafka, data storage and processing capabilities are rapidly increasing. With the growth of all kinds of data platforms, users often encounter the following issues when synchronizing data among different data platforms:

  • Data loss and duplication
  • Task accumulation and delay
  • Low throughput
  • Lack of application running status monitoring
  • Complex redevelopment and expansion

We developed SeaTunnel at Leshi in 2017 and open-sourced it to address the above issues. At present, SeaTunnel runs in many users' production environments and has won wide recognition from users.

Rationale

The processing flow of SeaTunnel is a pipeline of three stages: Input → Filter → Output. Users only need to write a SeaTunnel configuration file to complete data reading, processing, and writing. Input and Output define the source and destination of the data, and Filter configures a series of transformations, which is enough to satisfy most data analysis demands. SeaTunnel provides a variety of plugins to meet users' needs for real-time and offline synchronization of massive data in different scenarios. Users can also implement new plugins to meet their own requirements.
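The Input → Filter → Output flow can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not SeaTunnel's actual API or configuration syntax: the plugin names (list, drop_null, rename, collect), the registries, and the layout of the configuration dictionary are all hypothetical. It only shows how a configuration can select one input plugin, a chain of filter plugins, and one output plugin.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def list_input(options: Dict[str, Any]) -> Iterator[Row]:
    """Hypothetical input plugin: yields rows given directly in the config."""
    for row in options["rows"]:
        yield dict(row)


def drop_null_filter(rows: Iterable[Row], options: Dict[str, Any]) -> Iterator[Row]:
    """Hypothetical filter plugin: drops rows whose field is missing or None."""
    field = options["field"]
    return (row for row in rows if row.get(field) is not None)


def rename_filter(rows: Iterable[Row], options: Dict[str, Any]) -> Iterator[Row]:
    """Hypothetical filter plugin: renames one field in every row."""
    source, target = options["source"], options["target"]
    for row in rows:
        renamed = dict(row)
        if source in renamed:
            renamed[target] = renamed.pop(source)
        yield renamed


def collect_output(rows: Iterable[Row], options: Dict[str, Any]) -> List[Row]:
    """Hypothetical output plugin: collects rows in a list (stands in for a sink)."""
    return list(rows)


# Illustrative plugin registries; a new plugin is added by registering a function.
INPUT_PLUGINS = {"list": list_input}
FILTER_PLUGINS = {"drop_null": drop_null_filter, "rename": rename_filter}
OUTPUT_PLUGINS = {"collect": collect_output}


def run_pipeline(config: Dict[str, Any]) -> List[Row]:
    """Builds and runs Input -> Filter(s) -> Output as described by the config."""
    source = config["input"]
    rows = INPUT_PLUGINS[source["plugin"]](source.get("options", {}))
    for step in config.get("filter", []):
        rows = FILTER_PLUGINS[step["plugin"]](rows, step.get("options", {}))
    sink = config["output"]
    return OUTPUT_PLUGINS[sink["plugin"]](rows, sink.get("options", {}))


example_config = {
    "input": {
        "plugin": "list",
        "options": {
            "rows": [
                {"name": "alice", "clicks": 3},
                {"name": None, "clicks": 5},
                {"name": "bob", "clicks": 7},
            ]
        },
    },
    "filter": [
        {"plugin": "drop_null", "options": {"field": "name"}},
        {"plugin": "rename", "options": {"source": "clicks", "target": "click_count"}},
    ],
    "output": {"plugin": "collect"},
}

result = run_pipeline(example_config)
```

In this sketch the filters are chained lazily, so each row passes through every stage before the output plugin materializes it. Supporting a new source, transformation, or sink only requires registering another function, which mirrors the plugin-based extension model described above.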

Current Status

Meritocracy

SeaTunnel was founded in 2017 and open-sourced on GitHub on July 30 of the same year. The project now has contributors and users from more than a dozen companies.

 

Community

SeaTunnel has devoted itself to building an active community since it was open-sourced. In four years, we have organized multiple meetups and gained contributors ranging from coders and community managers (Lifeng Nie) to evangelists. Currently, we have a user base of more than 2,000, and we will move to mailing lists and other communication methods that are more in line with the Apache Way to better develop our project.

We invite everyone who follows the Apache Way to contribute, and we hope that new talent will help the contributor community thrive. The code of SeaTunnel is currently hosted on GitHub, and we will use a Slack channel for community communication.

Furthermore, SeaTunnel will welcome more developers and user communities to join the project during the incubation period.


Core Developers

The core developers of SeaTunnel are all experienced open-source developers and team leaders, and bring diverse backgrounds to the team.

Known Risks


Orphaned products

Cooperation between the contributors and the community has turned the project from a small concept into a real data processing tool. SeaTunnel has also been widely adopted by companies in China, and many large companies (Shuidichou, Weibo, yixia) have used it extensively in practice. For example, at Shuidichou, the data that SeaTunnel synchronizes every day has exceeded 3 TB. In addition, two of the initial committers will devote themselves full-time to the project.


So far, SeaTunnel has published 32 releases, including 2 major versions. More than ten contributors and thousands of forks further show that SeaTunnel is actively supported, and we seek to grow the community further with the help of Apache. Consequently, SeaTunnel is unlikely to become an orphaned project.


Inexperience with Open Source

Initial committer Calvin Kirs is a PMC member of Apache DolphinScheduler. He participated in the entire process of DolphinScheduler's graduation from the Apache Incubator and has a deep understanding of both the Apache release process and the Apache Way. Jiajie Zhong is a committer on Apache Airflow. Other members are also experienced in open-source projects, and we will continue to learn the Apache Way during incubation to gain more open-source experience. Therefore, we believe that we are experienced enough to operate a well-run community.


Homogenous Developers

Currently, the core developers work at Tencent, WhaleOps, and other companies, and some individual developers have been accepted as SeaTunnel developers. Since many users still show great interest in SeaTunnel, we are encouraging and inviting them to become contributors and help develop the project.


Reliance on Salaried Developers

Some of the committers are paid by their employers to contribute to SeaTunnel. Both SeaTunnel and massive data synchronization are attractive and important to the companies that the contributors work for, which will help the community grow. As new contributors continue to join, we will work hard to become more diverse.


Relationships with Other Apache Products

Apache DolphinScheduler supports SeaTunnel as one of its task plugins.

We have integrated with Apache Spark, Apache Flink, and Apache Commons, and plan to integrate more closely with other Apache projects (mainly big data projects) in the future.


An Excessive Fascination with the Apache Brand

We believe that the Apache brand will bring greater value and reputation to SeaTunnel. However, it is the community provided by the Apache Software Foundation that we hope will enable the project to achieve long-term stable development. Therefore, we propose SeaTunnel be incubated in Apache to help diversify the community, instead of taking advantage of the Apache brand.

Documentation

A complete set of documents is provided on GitHub, including English and Simplified Chinese versions.

Initial Code repository

The project consists of three different code repositories, and each of them is offered as a separate git repository.

Initial Source

SeaTunnel was initially developed within Leshi and was open-sourced under the Apache License 2.0 by the Interesting Lab group on GitHub in 2017. Once SeaTunnel is approved to join the Apache Incubator, Leshi will provide a Software Grant Agreement (SGA), and the initial committers will submit ICLAs. The code is already licensed under the Apache License, Version 2.0.

Source and Intellectual Property Submission Plan

All dependencies are under Apache-compatible licenses. The details are as follows:

External Dependencies

Apache-2.0 licenses

  • Spark
  • Flink
  • Parquet
  • Phoenix
  • Apache-commons
  • HBase-connectors
  • ElasticSearch (6.3.1)
  • elasticsearch-spark (6.8.3)
  • Fastjson
  • clickhouse-jdbc
  • org.lz4:lz4-java
  • net.jpountz.lz4:lz4
  • com.101tec:zkclient
  • com.norbitltd:spoiwo
  • lightbend/config (we currently include a ported copy of part of the lightbend/config code and will switch to a direct dependency as soon as possible)

MIT licenses

  • slf4j
  • play-mailer
  • com.github.scopt (3.7.1)

 

Required Resources

Git Repositories

Issue Tracking

The community would like to continue using GitHub Issues.


Continuous Integration tool

GitHub Actions

Mailing Lists

  • seatunnel-dev: for development discussions
  • seatunnel-private: for PPMC discussions
  • seatunnel-notifications: for user notifications

Initial Committers

Affiliations

  • Individuals: RickyHuo, kid-Xiong, Gary Gao
  • WhaleOps: Calvin Kirs, Jiajie Zhong, LiFeng Nie

Sponsors

Champion

Mentors

 

Sponsoring Entity

We expect the Apache Incubator to sponsor this project.
