Abstract
Texera is an open-source system to support collaborative data science, AI, and ML using GUI-based workflows. Our vision is to develop a system to support cloud platforms on which users can easily analyze data and use AI/ML techniques provided as operators. Users with various backgrounds, whether or not they know how to code, can collaborate on the same project to construct a pipeline. Experienced users can use programming languages such as Python, R, Java, and Scala to implement customized computation logic. The platform allows users to pause the execution of a workflow to investigate the operator states and resume the execution at a later time. The platform can be used by a research community to publish valuable resources such as data sets, workflows, and ML models to share their domain-specific knowledge and support reproducibility of scientific research. The platform also allows users to elastically request computing resources from public clouds for computationally intensive tasks.
Proposal
The goal of this proposal is to bring the existing Texera codebase, developers, and community to the Apache Software Foundation. We plan to donate the source code of Texera and its related materials, such as documentation and wiki. These materials are already licensed under the Apache License 2.0.
The relevant materials can be found at https://texera.io/ and https://github.com/Texera/texera.
Background
With the increasing importance and popularity of data science, AI, and ML in many application domains, more and more people want to use state-of-the-art techniques in their daily jobs, studies, and scientific research. Many people face a barrier due to their limited IT and coding skills and, as a result, cannot utilize these techniques. In addition, projects often require people from multiple disciplines with various and complementary backgrounds to collaborate to complete a task together. For tasks that require a lot of computing resources, such as biosequence analysis, researchers also face the challenge of not having enough local resources to finish a task efficiently. We develop the Texera system to enable cloud-based platforms to meet these needs.
Rationale
Typically, people do data science and AI/ML using programming languages such as Python, R, and SQL in development environments such as Jupyter. Recent cloud-based platforms, including Databricks, Snowflake, and Microsoft Fabric, also require users to have sufficient IT skills. People who do not have these skills, including researchers in various scientific disciplines, are unable to use their domain knowledge to participate in data science and benefit from the latest AI/ML techniques.
Existing workflow systems, including RapidMiner, KNIME, and Alteryx, address these challenges by providing a GUI on which people can drag and drop operators to construct a workflow. While these solutions have been widely used in many domains, they are not easily deployable to provide a cloud-based service. In particular, they require users to download and install their software, and do not provide a web-based interface. Their runtime engine is not parallel and cannot run on a compute cluster, limiting their capabilities to process large amounts of data. They have limited support for collaboration functionalities such as shared editing and shared execution. They also have limited capabilities to request computing resources from public clouds (Amazon AWS, Google GCP, and Microsoft Azure). These capabilities are critical for data-intensive data science tasks, e.g., training an ML model using GPUs.
To address these limitations, we develop the Texera system to support cloud-based platforms with the following core features:
- Supporting low/no-coding data science using workflows (i.e., DAGs of operators);
- Providing a rich set of operators for AI and ML;
- Supporting parallel data processing on computing clusters;
- Allowing real-time collaborations of multiple users;
- Allowing shared editing and execution, version control, and debugging;
- Supporting languages such as Python, R, Java, and Scala;
- Enabling access control and community-based sharing of workflows and datasets;
- Being deployable on both on-prem and public clouds.
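To make the notion of a workflow as a DAG of operators concrete, the following is a minimal, purely illustrative sketch in Python. It is not Texera's API or execution engine: the operator functions, the `upstream` mapping, and `run_workflow` are hypothetical names introduced here, and the sketch runs operators one at a time rather than in parallel on a cluster.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical operators (not Texera's): each receives a list of input
# tables (lists of dict rows) and returns one output table.
def source(_inputs):
    return [
        {"name": "a", "score": 3},
        {"name": "b", "score": 8},
        {"name": "c", "score": 5},
    ]

def keep_high_scores(inputs):
    # Keep rows of the single upstream table whose score is at least 5.
    return [row for row in inputs[0] if row["score"] >= 5]

def count_rows(inputs):
    # Emit a one-row table holding the size of the upstream table.
    return [{"count": len(inputs[0])}]

# The workflow: operator ids mapped to functions, and each operator id
# mapped to the list of upstream operators that feed it (the DAG edges).
operators = {"source": source, "filter": keep_high_scores, "count": count_rows}
upstream = {"source": [], "filter": ["source"], "count": ["filter"]}

def run_workflow(operators, upstream):
    """Run every operator after all of its upstream operators have run."""
    results = {}
    for op_id in TopologicalSorter(upstream).static_order():
        inputs = [results[u] for u in upstream[op_id]]
        results[op_id] = operators[op_id](inputs)
    return results

results = run_workflow(operators, upstream)
print(results["count"])
```

In a GUI-based system, users would assemble the equivalent of `operators` and `upstream` by dragging and connecting boxes instead of writing this code; the topological ordering only illustrates why the workflow must be acyclic.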
Current Status
Meritocracy
Texera originated as a university project, and has since grown into a vibrant, meritocratic community. Many of the original committers began their contributions during their academic journeys and have continued to actively engage and contribute beyond graduation. Our goal is to further expand and strengthen both the developer and user communities. One key avenue for growth has been ongoing academic collaborations, including partnerships with Cornell University, Case Western Reserve University, and the University of California, Los Angeles. During the incubation process, we also plan to actively pursue greater industrial participation to broaden our community.
From the beginning, Texera has upheld the principles of meritocracy, creating an environment where contributions are valued and opportunities are based on demonstrated merit. Advancement to committership is achieved through consensus among core contributors, ensuring fairness and transparency. We are fully committed to fostering open, merit-based interactions within our community. Our decision to join the Apache Software Foundation incubation process reflects our dedication to adopting Apache’s best practices, which align with our meritocratic values. By embracing “the Apache way,” we aim to significantly expand the community while providing contributors with clear and equitable pathways to recognition and leadership based on their merit and contributions.
Community
Contributors: Texera has established a thriving open-source community comprising 141 developers, with 3,259 pull requests merged and 6,665 commits recorded.
Users: The platform supports four active deployments catering to diverse use cases, engaging over 390 users who have collectively created more than 120 projects and 2,900 workflows, generated over 270,000 workflow versions, and executed workflows over 58,000 times.
Users include:
- Researchers: Interdisciplinary researchers from UCI departments such as Public Health, Informatics, Biology, and Mathematics;
- Students: Undergraduate, high school, and college students from non-CS majors; and
- Scientists: from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK).
Core Developers
The core developers are all experienced open-source developers. They have been running the Texera community for 8 years.
- Chen Li, a founder and principal investigator of the project. (GitHub ID: chenlica)
- Zuozhi Wang, a founding researcher and developer, architect, Ph.D. from UC Irvine. (GitHub ID: zuozhiw)
- Yicong Huang, a founding researcher and developer, architect, Ph.D. candidate at UC Irvine, main contributor for Python UDF and Python engine. (GitHub ID: Yicong-Huang)
- Shengquan Ni, a founding researcher and developer, architect, Ph.D. candidate at UC Irvine, main contributor for the backend distributed engine called Amber. (GitHub ID: shengquan-ni)
- Avinash Kumar, a founding researcher and developer, Ph.D. from UC Irvine, main contributor to the backend engine. (GitHub ID: avinash0161)
- Sadeem Alsudais, a founding researcher and developer, Ph.D. from UC Irvine, main contributor to workflow versions and execution history. (GitHub ID: sadeemsaleh)
- Xinyuan Lin, a founding researcher and developer, Ph.D. candidate at UC Irvine, main contributor for the UI and control blocks. (GitHub ID: aglinxinyuan)
- Xiaozhen Liu, a researcher and developer, Ph.D. candidate at UC Irvine, main contributor to shared editing and scheduler for pipelined execution. (GitHub ID: Xiao-zhen-Liu)
- Jiadong Bai, a researcher and developer, Ph.D. student at UC Irvine, main contributor to the compiling service, datasets, and storage. (GitHub ID: bobbai00)
- Kun Woo Park, a researcher and developer, Ph.D. student at UC Irvine, main contributor to storage and run-time statistics. (GitHub ID: kunwp1)
Alignment
Texera leverages a wide range of open-source projects, including Apache Arrow, Apache Iceberg, Apache Lucene, and Apache Hadoop, to enhance its capabilities for scalable and efficient data analytics. The system is designed to integrate with open-source projects, particularly those from the Apache Software Foundation, such as Apache Pekko and Apache Polaris. As we advance, we are committed to continuously expanding Texera’s ecosystem to support diverse use cases and emerging technologies.
Known Risks
Orphaned products
Given the significant intellectual investment in Texera, the risk of the project being abandoned is minimal. Since its inception as an open-source project in 2016, Texera has contributed to over 20 research papers published in top-tier conferences and journals. UCI faculty member Chen Li is highly motivated to continue its development, as the Information Systems Group (ISG) at UC Irvine relies on Texera as a foundational platform for long-term graduate research projects.
We plan to use Texera in the diabetes research community, supported by the NIH NIDDK, over the next 4 years to enhance reproducibility and AI/ML computing. Additionally, Cerritos College is working on integrating Texera as a platform in their data science course.
The project is actively managed through weekly status meetings, attended both locally and via Zoom, with an average of 17 active contributors participating regularly. This collaborative environment ensures the ongoing development and sustainability of Texera.
Inexperience with Open Source
Texera has been developed as an open-source project since 2016, adhering to best practices in open-source development. The project incorporates issue tracking, pull request (PR) reviews, continuous integration (CI) automation, and documentation to ensure quality and transparency. Additionally, several team members, such as Yicong Huang, bring valuable experience from other prominent open-source projects, including gRPC.
Homogenous Developers
We understand that the initial list of committers may not provide the ideal long-term diversity for the project. Therefore, we are strongly committed to expanding the project by fostering a more diverse and inclusive development team. A key motivation for entering the Apache incubation process is to make the project available to a broader community of interested participants, encouraging greater collaboration and contribution.
Reliance on Salaried Developers
Among the initial committers, only Chen Li is an active staff member at UCI. The remaining committers consist of a diverse group, including current students, alumni who continue to contribute to the project, and individuals who dedicate part of their time or work on the project in their spare time with appropriate permissions.
Relationships with Other Apache Products
Texera leverages Apache Arrow for efficient data transfer between its Scala and Python engines. It utilizes Apache Commons for managing virtual file systems and depends on Apache Iceberg for table format storage and Apache Hadoop for file storage. Additionally, the Keyword Search operator integrates Apache Lucene to provide advanced fuzzy search functionality.
Documentation
Documentation is currently maintained in the GitHub Wiki (https://github.com/Texera/texera/wiki).
Source and Intellectual Property Submission Plan
Initial Source
The initial source code can be found at https://github.com/Texera/texera
External Dependencies
Most dependencies are compatible with the Apache License 2.0. A few dependencies are GPL-licensed; they will be removed or made optional before we attempt a release.
ASF Projects:
Apache Arrow, Apache 2.0, v14.0.1
Apache Commons, Apache 2.0, v2.9.0
Apache Hadoop, Apache 2.0, v3.3.1
Apache Iceberg, Apache 2.0, v1.7.1
Apache Lucene, Apache 2.0, v8.7.0
and other open source projects (organized by license):
Apache License 2.0
Jasmine Spec Reporter
edit-distance.js
Fuse.js
JSON Ref Resolver
RxJS
ts-proto
Typescript
pyarrow
python-dateutil
overrides
dataclasses
transformers
pyiceberg
tenacity
Apache Arrow Flight
Apache Commons IO
Apache Commons JCS
Apache Commons VFS
Apache Iceberg
Apache Lucene
AssertJ
Bean Validation API
Akka Kryo Serialization
Cloning
DateParser
Dropwizard
Dropwizard Asset Bundle
Dropwizard Auth JWT
Dropwizard Redirect Bundle
Ehcache SizeOf
Google APIs Client Library For Java
Google Java API Client Services (To be Removed)
Google OAuth Client Library For Java
Guava
hadoop
Jackson Annotations
Jackson Core
Jackson Databind
Jackson Datatype: Joda
Jackson Module: JsonSchema
Jackson Module: Kotlin
Jackson Module: No Constructor Deserialization
Jackson Module: Scala
Jackson Modules: Java 8
JASYPT: Java Simplified Encryption
Jetty
Joda Time
jOOQ
Json Formatting for ScalaPB
lucene
macwire
MariaDB4j
MongoDB Driver Sync
Nscala Time
Play JSON
RxJava
sbt-scalafmt
Scala Collection Contrib
Scala CSV
Scala Language
Scala Logging
ScalacticDotty
ScalaPB Runtime
ScalaTest
SIGAR Loader
SnakeYAML
twittered
Univocity Parsers
Utilities Core
Zjsonpatch
SBT Native Packager
Akka (2.6)
JSON Schema Validator
BSD License
numpy
praw
cached-property
psutil
Pg8000
BSD 0-Clause License
tslib
XZ for Java
BSD 2-Clause License
pybase64
PostgreSQL JDBC Driver
JUnit Interface
Linked Blocking Multi Queue
BSD 3-Clause License
Quill Rich Text Editor
Eslint Plugin Jsdoc
torch
pandas
protobuf
Leveldbjni All
sbt-scalafix
CC-BY-ND-3.0
Nx Cloud
COMMON DEVELOPMENT AND DISTRIBUTION LICENSE (CDDL) 1.1
JavaMail API
Eclipse Distribution License 1.0
JGit
JUnit 4
Logback
Eclipse Distribution License 2.0
JGraphT
Jersey
GPLv2+
rpy2 (we aim to make this dependency optional before the next release)
Historical Permission Notice and Disclaimer
pillow
ISC License
es6-weak-map
tinyqueue
JBCrypt
sanitize-filename
MIT
@codingame/monaco-vscode-api
Ajv JSON schema validator
Angular
Angular Color Picker
Angular PDF Viewer
Angular Social Login
angular tree component
angular-jwt
backbone
dagre - Graph layout for JavaScript
Deep Map
DefinitelyTyped
FileSaver.js
Formly
html2canvas
js-abbreviation-number
JSZip
lodash
Marked
Monaco Language Client
monaco-breakpoints
NG-ZORRO
ngx-file-drop
ngx-json-viewer
ngx-markdown
NgxImageViewer
Papa Parse
path-browserify
plotly.js
Popper.js
quill-cursors
read-excel-file
Ring Buffer
Unsubscribe For Pros
uuid
validator.js
y-monaco
y-quill
y-websocket
Yjs
Yjs Protocols
Angular
Angular Builders
Angular Cli
Angular Eslint
Babel Plugin Dynamic Import Node
Concurrently
Dart Sass
DefinitelyTyped
Eslint
Eslint Plugin Import
Eslint Plugin Prefer Arrow
Eslint Plugin Prettier
Eslint Plugin Rxjs
Eslint Plugin Rxjs Angular
Jasmine
Karma
Karma Chrome Launcher
Karma Jasmine
Node Fs Extra
Node Git Describe
Nodecat
Nx
Nz Tslint Rules
Prettier
Prettier Eslint Cli
Rxjs Marbles
Style Loader
Ts Node
Typescript Eslint
Webpack Bundle Analyzer
wheel
flake8
black
iniconfig
Loguru
pytest
pytest-timeout
betterproto
pampy
pytest-reraise
Deprecated
fs
python-lsp-server
tzlocal
readerwriterlock
SQLAlchemy
wordcloud
plotly
rpy2-arrow
ClassGraph
Dropwizard Websocket Support
Fs2 GRPC
Mockito
ScalaMock
SLF4J
Unirest for Java
Mozilla Public License 2.0
JointJS
bidict
new BSD
scikit-learn
Python Software Foundation License
typing
typing_extensions
Full Dependencies List:
https://github.com/Texera/texera/blob/master/core/gui/package.json
https://github.com/Texera/texera/blob/master/core/amber/requirements.txt
https://github.com/Texera/texera/blob/master/core/amber/operator-requirements.txt
https://github.com/Texera/texera/blob/master/core/amber/r-requirements.txt
https://github.com/Texera/texera/blob/master/core/build.sbt
https://github.com/Texera/texera/blob/master/core/project/plugins.sbt
https://github.com/Texera/texera/blob/master/core/amber/build.sbt
https://github.com/Texera/texera/blob/master/core/amber/project/plugins.sbt
https://github.com/Texera/texera/blob/master/core/workflow-operator/build.sbt
https://github.com/Texera/texera/blob/master/core/workflow-compiling-service/build.sbt
https://github.com/Texera/texera/blob/master/core/workflow-core/build.sbt
https://github.com/Texera/texera/blob/master/core/dao/build.sbt
Required Resources
Mailing lists
Currently, we use GitHub as our primary communication channel, where people subscribe to the GitHub project to stay updated on the latest commits and discussions.
We plan to establish these mailing lists:
dev@texera.apache.org for development-related discussions
general@texera.apache.org for broader community engagement
private@texera.apache.org for security and new committer discussions
Git Repositories
The GitHub repository can be found at https://github.com/Texera/texera
The aim is to move it to https://github.com/apache/texera
Issue Tracking
The Texera community utilizes GitHub Issues to track bugs, feature requests, and other development tasks: https://github.com/Texera/texera/issues.
Other Resources
The community has adopted GitHub Actions as the continuous integration (CI) tool to automate testing workflows: https://github.com/Texera/texera/actions.
Initial Committers
The following is a list of the planned initial Apache committers (the active subset of the committers for the current repository at GitHub).
Chen Li, chenli AT ics DOT uci DOT edu
Yicong Huang, yicongh1 AT uci DOT edu
Shengquan Ni, shengqun AT uci DOT edu
Xinyuan Lin, xinyual3 AT uci DOT edu
Xiaozhen Liu, xiaozl3 AT uci DOT edu
Jiadong Bai, jiadongb AT uci DOT edu
Yunyan Ding, yunyad1 AT uci DOT edu
Kun Woo Park, kunwp1 AT uci DOT edu
Ali Risheh, arisheh AT uci DOT edu
Sponsors
Champion
PJ Fanning (fanningpj AT apache DOT org)
Mentors
PJ Fanning (fanningpj AT apache DOT org)
Cezar Andrei (cezar AT apache DOT org)
Gordon King (garyw AT apache DOT org)
Sponsoring Entity
Apache Incubator