Abstract


Texera is an open-source system that supports collaborative data science, AI, and ML using GUI-based workflows. Our vision is to develop a system to support cloud platforms on which users can easily analyze data and use AI/ML techniques provided as operators. Users with various backgrounds, whether or not they know how to code, can collaborate on the same project to construct a pipeline. Experienced users can use programming languages such as Python, R, Java, and Scala to implement customized computation logic. The platform allows users to pause the execution of a workflow to investigate the operator states, and resume the execution at a later time. The platform can be used by a research community to publish valuable resources such as data sets, workflows, and ML models to share their domain-specific knowledge and support reproducibility of scientific research. The platform also allows users to elastically request computing resources from public clouds for computationally intensive tasks.

Proposal


The goal of this proposal is to bring the existing Texera codebase, developers, and community to the Apache Software Foundation. We plan to donate the source code of Texera and its related materials such as documentation and wiki. These materials are already licensed under the Apache License 2.0.

The relevant materials can be found on https://texera.io/ and https://github.com/Texera/texera.

Background


With the increasing importance and popularity of data science, AI, and ML in many application domains, more and more people want to use state-of-the-art techniques in their daily jobs, studies, and scientific research. Many people face a barrier due to their limited IT and coding skills and, as a result, cannot utilize these techniques. In addition, projects often require people from multiple disciplines with varied and complementary backgrounds to collaborate on a task. For tasks that require a lot of computing resources, such as biosequence analysis, researchers also face the challenge of not having enough local resources to finish a task efficiently. We develop the Texera system to enable cloud-based platforms to meet these needs.

Rationale


Typically, people do data science and AI/ML using programming languages such as Python, R, and SQL, in development environments such as Jupyter. Recent cloud-based platforms, including Databricks, Snowflake, and Microsoft Fabric, also require users to have sufficient IT skills. People who do not have these skills, including researchers in various scientific disciplines, are unable to apply their domain knowledge to data science or benefit from the latest AI/ML techniques.

Existing workflow systems, including RapidMiner, KNIME, and Alteryx, address these challenges by providing a GUI on which people can drag and drop operators to construct a workflow. While these solutions have been widely used in many domains, they are not easily deployable as a cloud-based service. In particular, they require users to download and install their software, and do not provide a web-based interface. Their runtime engine is not parallel and cannot run on a compute cluster, limiting their ability to process large amounts of data. They have limited support for collaboration features such as shared editing and shared execution. They also have limited capabilities to request computing resources from public clouds (Amazon AWS, Google GCP, and Microsoft Azure). These capabilities are critical for data-intensive data science tasks, e.g., training an ML model using GPUs.

To address these limitations, we develop the Texera system to support cloud-based platforms with the following core features:

  • Supporting low/no-coding data science using workflows (i.e., DAGs of operators);
  • Providing a rich set of operators for AI and ML;
  • Supporting parallel data processing on computing clusters;
  • Allowing real-time collaborations of multiple users;
  • Allowing shared editing and execution, version control, and debugging;
  • Supporting languages such as Python, R, Java, and Scala;
  • Enabling access control and community-based sharing of workflows and datasets;
  • Being deployable both on premises and on public clouds.
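
To make the "DAG of operators" idea in the first feature concrete, the following is a minimal Python sketch under stated assumptions. It is not Texera's API or engine: the operator functions, the `upstream` mapping, and `run_workflow` are hypothetical names introduced only for illustration, and the sketch runs operators one at a time on a single machine, with none of the parallelism or collaboration features listed above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical operators: each receives a list of upstream outputs
# and returns a list of rows (dicts).
def scan(_inputs):
    return [
        {"name": "a", "score": 3},
        {"name": "b", "score": 8},
        {"name": "c", "score": 5},
    ]

def keep_high_scores(inputs):
    (rows,) = inputs  # exactly one upstream operator
    return [row for row in rows if row["score"] >= 5]

def count_rows(inputs):
    (rows,) = inputs
    return [{"count": len(rows)}]

# The workflow: operator id -> function, and operator id -> upstream ids.
operators = {"scan": scan, "filter": keep_high_scores, "count": count_rows}
upstream = {"scan": [], "filter": ["scan"], "count": ["filter"]}

def run_workflow(operators, upstream):
    """Run every operator after all of its upstream operators have finished."""
    results = {}
    for op_id in TopologicalSorter(upstream).static_order():
        inputs = [results[u] for u in upstream[op_id]]
        results[op_id] = operators[op_id](inputs)
    return results

results = run_workflow(operators, upstream)
print(results["count"])
```

The topological order guarantees that each operator sees the outputs of its predecessors. A real engine would additionally pipeline data between operators and distribute them across workers.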

Current Status

Meritocracy

Texera originated as a university project, and has since grown into a vibrant, meritocratic community. Many of the original committers began their contributions during their academic journeys and have continued to actively engage and contribute beyond graduation. Our goal is to further expand and strengthen both the developer and user communities. One key avenue for growth has been ongoing academic collaborations, including partnerships with Cornell University, Case Western Reserve University, and the University of California, Los Angeles. During the incubation process, we also plan to actively pursue greater industrial participation to broaden our community.

From the beginning, Texera has upheld the principles of meritocracy, creating an environment where contributions are valued and opportunities are based on demonstrated merit. Advancement to committership is achieved through consensus among core contributors, ensuring fairness and transparency. We are fully committed to fostering open, merit-based interactions within our community. Our decision to join the Apache Software Foundation incubation process reflects our dedication to adopting Apache’s best practices, which align with our meritocratic values. By embracing “the Apache way,” we aim to significantly expand the community while providing contributors with clear and equitable pathways to recognition and leadership based on their merit and contributions.

Community


Contributors: Texera has established a thriving open-source community comprising 141 developers, with 3,259 pull requests merged and 6,665 commits recorded.

Users: The platform supports four active deployments catering to diverse use cases, engaging over 390 users who have collectively created more than 120 projects and 2,900 workflows, generated over 270,000 workflow versions, and executed workflows over 58,000 times. 

Users include:

  1. Researchers: Interdisciplinary researchers from UCI departments such as Public Health, Informatics, Biology, and Mathematics;
  2. Students: Undergraduate, high school, and college students from non-CS majors; and
  3. Scientists: from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK).


Core Developers


The core developers are all experienced open-source developers. They have been running the Texera community for 8 years. 

  • Chen Li, a founder and principal investigator of the project. (GitHub ID: chenlica)
  • Zuozhi Wang, a founding researcher and developer, architect, Ph.D. from UC Irvine. (GitHub ID: zuozhiw)
  • Yicong Huang, a founding researcher and developer, architect, Ph.D. candidate at UC Irvine, main contributor to the Python UDF and Python engine. (GitHub ID: Yicong-Huang)
  • Shengquan Ni, a founding researcher and developer, architect, Ph.D. candidate at UC Irvine, main contributor to the backend distributed engine called Amber. (GitHub ID: shengquan-ni)
  • Avinash Kumar, a founding researcher and developer, Ph.D. from UC Irvine, main contributor to the backend engine. (GitHub ID: avinash0161)
  • Sadeem Alsudais, a founding researcher and developer, Ph.D. from UC Irvine, main contributor to workflow versions and execution history. (GitHub ID: sadeemsaleh)
  • Xinyuan Lin, a founding researcher and developer, Ph.D. candidate at UC Irvine, main contributor to the UI and control blocks. (GitHub ID: aglinxinyuan)
  • Xiaozhen Liu, a researcher and developer, Ph.D. candidate at UC Irvine, main contributor to shared editing and the scheduler for pipelined execution. (GitHub ID: Xiao-zhen-Liu)
  • Jiadong Bai, a researcher and developer, Ph.D. student at UC Irvine, main contributor to the compiling service, datasets, and storage. (GitHub ID: bobbai00)
  • Kun Woo Park, a researcher and developer, Ph.D. student at UC Irvine, main contributor to storage and run-time statistics. (GitHub ID: kunwp1)

Alignment

Texera leverages a wide range of open-source projects, including Apache Arrow, Apache Iceberg, Apache Lucene, and Apache Hadoop, to enhance its capabilities for scalable and efficient data analytics. The system is designed to integrate with open-source projects, particularly those from the Apache Software Foundation, such as Apache Pekko and Apache Polaris. As we advance, we are committed to continuously expanding the Texera ecosystem to support diverse use cases and emerging technologies.

Known Risks

Orphaned products

Given the significant intellectual investment in Texera, the risk of the project being abandoned is minimal. Since its inception as an open-source project in 2016, Texera has contributed to over 20 research papers published in top-tier conferences and journals. UCI faculty member Chen Li is highly motivated to continue its development, as the Information Systems Group (ISG) at UC Irvine relies on Texera as a foundational platform for long-term graduate research projects.

We plan to use Texera in the diabetes research community, supported by the NIH NIDDK, over the next 4 years to enhance reproducibility and AI/ML computing. Additionally, Cerritos College is working on integrating Texera as a platform in their data science course.

The project is actively managed through weekly status meetings, attended both locally and via Zoom, with an average of 17 active contributors participating regularly. This collaborative environment ensures the ongoing development and sustainability of Texera.

Inexperience with Open Source

Texera has been developed as an open-source project since 2016, adhering to best practices in open-source development. The project incorporates issue tracking, pull request (PR) reviews, continuous integration (CI) automation, and documentation to ensure quality and transparency. Additionally, several team members, such as Yicong Huang, bring valuable experience from other prominent open-source projects, including gRPC.



Homogenous Developers

We understand that the initial list of committers may not provide the ideal long-term diversity for the project. Therefore, we are strongly committed to expanding the project by fostering a more diverse and inclusive development team. A key motivation for entering the Apache incubation process is to make the project available to a broader community of interested participants, encouraging greater collaboration and contribution.

Reliance on Salaried Developers

Among the initial committers, only Chen Li is an active staff member at UCI. The remaining committers consist of a diverse group, including current students, alumni who continue to contribute to the project, and individuals who dedicate part of their time or work on the project in their spare time with appropriate permissions.

Relationships with Other Apache Products

Texera leverages Apache Arrow for efficient data transfer between its Scala and Python engines. It utilizes Apache Commons for managing virtual file systems and depends on Apache Iceberg for table format storage and Apache Hadoop for file storage. Additionally, the Keyword Search operator integrates Apache Lucene to provide advanced fuzzy search functionality.

Documentation

Documentation is currently written in GitHub Wiki (https://github.com/Texera/texera/wiki).

Source and Intellectual Property Submission Plan

Initial Source

The initial source code can be found on https://github.com/Texera/texera

External Dependencies

Most dependencies are compatible with the Apache License 2.0. There are some GPL-licensed dependencies, which will be removed or made optional before we attempt a release.


ASF Projects:

Apache Arrow, Apache 2.0, v14.0.1

Apache Commons, Apache 2.0, v2.9.0

Apache Hadoop, Apache 2.0, v3.3.1

Apache Iceberg, Apache 2.0, v1.7.1

Apache Lucene, Apache 2.0, v8.7.0


Other open-source projects (organized by license):

Apache License 2.0

Jasmine Spec Reporter

edit-distance.js

Fuse.js

JSON Ref Resolver

RxJS

ts-proto

Typescript

pyarrow

python-dateutil

overrides

dataclasses

transformers

pyiceberg

tenacity

Apache Arrow Flight

Apache Commons IO

Apache Commons JCS

Apache Commons VFS

Apache Iceberg

Apache Lucene

AssertJ

Bean Validation API

Akka Kryo Serialization

Cloning

DateParser

Dropwizard

Dropwizard Asset Bundle

Dropwizard Auth JWT

Dropwizard Redirect Bundle

Ehcache SizeOf

Google APIs Client Library For Java

Google Java API Client Services (To be Removed)

Google OAuth Client Library For Java

Guava

hadoop

Jackson Annotations

Jackson Core

Jackson Databind

Jackson Datatype: Joda

Jackson Module: JsonSchema

Jackson Module: Kotlin

Jackson Module: No Constructor Deserialization

Jackson Module: Scala

Jackson Modules: Java 8

JASYPT: Java Simplified Encryption

Jetty

Joda Time

jOOQ

Json Formatting for ScalaPB

lucene

macwire

MariaDB4j

MongoDB Driver Sync

Nscala Time

Play JSON

RxJava

sbt-scalafmt

Scala Collection Contrib

Scala CSV

Scala Language

Scala Logging

ScalacticDotty

ScalaPB Runtime

ScalaTest

SIGAR Loader

SnakeYAML

twittered

Univocity Parsers

Utilities Core

Zjsonpatch

SBT Native Packager

Akka (2.6)

JSON Schema Validator

BSD License

numpy

praw

cached-property

psutil

Pg8000

BSD 0-Clause License

tslib

XZ for Java

BSD 2-Clause License

pybase64

PostgreSQL JDBC Driver

JUnit Interface

Linked Blocking Multi Queue

BSD 3-Clause License

Quill Rich Text Editor

Eslint Plugin Jsdoc

torch

pandas

protobuf

Leveldbjni All

sbt-scalafix

CC-BY-ND-3.0

Nx Cloud

COMMON DEVELOPMENT AND DISTRIBUTION LICENSE (CDDL) 1.1

JavaMail API

Eclipse Distribution License 1.0

JGit

JUnit 4

Logback

Eclipse Public License 2.0

JGraphT

Jersey

GPLv2+

rpy2 (we aim to make this dependency optional before the next release)

Historical Permission Notice and Disclaimer

pillow

ISC License

es6-weak-map

tinyqueue

JBCrypt

sanitize-filename

MIT

@codingame/monaco-vscode-api

Ajv JSON schema validator

Angular

Angular Color Picker

Angular PDF Viewer

Angular Social Login

angular tree component

angular-jwt

backbone

dagre - Graph layout for JavaScript

Deep Map

DefinitelyTyped

FileSaver.js

Formly

html2canvas

js-abbreviation-number

JSZip

loaders.gl

lodash

luma.gl

Marked

Monaco Language Client

monaco-breakpoints

NG-ZORRO

ngx-file-drop

ngx-json-viewer

ngx-markdown

NgxImageViewer

Papa Parse

path-browserify

plotly.js

Popper.js

quill-cursors

read-excel-file

Ring Buffer

Unsubscribe For Pros

uuid

validator.js

y-monaco

y-quill

y-websocket

Yjs

Yjs Protocols

Angular Builders

Angular Cli

Angular Eslint

Babel Plugin Dynamic Import Node

Concurrently

Dart Sass

Eslint

Eslint Plugin Import

Eslint Plugin Prefer Arrow

Eslint Plugin Prettier

Eslint Plugin Rxjs

Eslint Plugin Rxjs Angular

Jasmine

Karma

Karma Chrome Launcher

Karma Jasmine

Node Fs Extra

Node Git Describe

Nodecat

Nx

Nz Tslint Rules

Prettier

Prettier Eslint Cli

Rxjs Marbles

Style Loader

Ts Node

Typescript Eslint

Webpack Bundle Analyzer

wheel

flake8

black

iniconfig

Loguru

pytest

pytest-timeout

betterproto

pampy

pytest-reraise

Deprecated

fs

python-lsp-server

tzlocal

readerwriterlock

SQLAlchemy

wordcloud

plotly

rpy2-arrow

ClassGraph

Dropwizard Websocket Support

Fs2 GRPC

Mockito

ScalaMock

SLF4J

Unirest for Java

Mozilla Public License 2.0

JointJS

bidict

new BSD

scikit-learn

Python Software Foundation License

typing

typing_extensions


Full Dependencies List:

https://github.com/Texera/texera/blob/master/core/gui/package.json

https://github.com/Texera/texera/blob/master/core/amber/requirements.txt

https://github.com/Texera/texera/blob/master/core/amber/operator-requirements.txt

https://github.com/Texera/texera/blob/master/core/amber/r-requirements.txt

https://github.com/Texera/texera/blob/master/core/build.sbt

https://github.com/Texera/texera/blob/master/core/project/plugins.sbt

https://github.com/Texera/texera/blob/master/core/amber/build.sbt

https://github.com/Texera/texera/blob/master/core/amber/project/plugins.sbt

https://github.com/Texera/texera/blob/master/core/workflow-operator/build.sbt

https://github.com/Texera/texera/blob/master/core/workflow-compiling-service/build.sbt

https://github.com/Texera/texera/blob/master/core/workflow-core/build.sbt

https://github.com/Texera/texera/blob/master/core/dao/build.sbt

Required Resources

Mailing lists

Currently, we use GitHub as our primary communication channel, where people subscribe to the GitHub project to stay updated on the latest commits and discussions.

We plan to establish these mailing lists:

dev@texera.apache.org for development-related discussions

general@texera.apache.org for broader community engagement

private@texera.apache.org for security and new committer discussions

Git Repositories

The GitHub repository can be found at https://github.com/Texera/texera

The aim is to move this to https://github.com/apache/texera

Issue Tracking

The Texera community utilizes GitHub Issues to track bugs, feature requests, and other development tasks: https://github.com/Texera/texera/issues

Other Resources

The community has adopted GitHub Actions as the continuous integration (CI) tool to automate testing workflows: https://github.com/Texera/texera/actions.

Initial Committers

The following is a list of the planned initial Apache committers (the active subset of the committers for the current repository at GitHub).

Chen Li, chenli AT ics DOT uci DOT edu

Yicong Huang, yicongh1 AT uci DOT edu

Shengquan Ni, shengqun AT uci DOT edu

Xinyuan Lin, xinyual3 AT uci DOT edu

Xiaozhen Liu, xiaozl3 AT uci DOT edu

Jiadong Bai, jiadongb AT uci DOT edu

Yunyan Ding, yunyad1 AT uci DOT edu

Kun Woo Park, kunwp1 AT uci DOT edu

Ali Risheh, arisheh AT uci DOT edu

Sponsors

Champion

PJ Fanning (fanningpj AT apache DOT org)

Mentors

PJ Fanning (fanningpj AT apache DOT org)

Cezar Andrei (cezar AT apache DOT org)

Gordon King (garyw AT apache DOT org)

Sponsoring Entity

Apache Incubator
