Abstract

Hop is short for the Hop Orchestration Platform. Written completely in Java, it aims to provide a wide range of data orchestration tools, including a visual development environment, servers, metadata analysis, auditing services and so on. As a platform, Hop also aims to be a library that other software can easily reuse.

Proposal

Hop provides all the tools to build, maintain and deploy data orchestration, ETL and data integration solutions. For example, Hop allows you to diagram a data flow that propagates changes from a database via Apache Kafka to a data warehouse and deploy it as an Apache Beam pipeline.  The core concepts of Hop are Pipelines and Workflows. 

If these terms sound familiar, it’s because they are taken from the Apache Beam and Apache Airflow projects.

The main components of the Hop platform are: 


The cornerstone of the Hop platform is extensibility: all major components of the platform are designed to be pluggable. This allows missing functionality to be added in a short amount of time.

Background

The Hop Orchestration Platform has its origins in the Kettle community. Kettle was acquired by Pentaho and, after Pentaho’s acquisition by Hitachi in 2015, the community struck out to solve problems less aligned with Hitachi’s interests.

Rationale

In the Hop community, we have always aimed to function as a meritocracy, where contributions are accepted based on merit, and individuals gain status in the community based on their contributions (coding and otherwise). We’re proud to have a diverse group of people doing all the required things in a project: development, documentation, tutorials, architecture, testing, graphic design and much more. Bringing the project under the Apache Software Foundation would allow us to continue and grow, and would also give our users confidence about the governance, IP status, and future of the project.

ASF Preparation Phase

The very first goal of project Hop is to find a good way to cooperate on development across wide geographical, economic and social spectra. To make this possible, real changes were needed to a codebase that is essentially 20 years old. Most of these changes have now been tackled. We think it’s fair to say that Hop is now a new platform, even though it shares a common background with Kettle, having partly started from the Kettle code base.

Here are a few of the key focus areas we’re trying to safeguard going forward:


For a list of the changes, see the monthly roundups that have been compiled since February 2020. They document the hard work of our community so far:

    http://www.project-hop.org/news/roundup-2020-02/

    http://www.project-hop.org/news/roundup-2020-03/

    http://www.project-hop.org/news/roundup-2020-04/

    http://www.project-hop.org/news/roundup-2020-05/

    http://www.project-hop.org/news/roundup-2020-06/

    http://www.project-hop.org/news/roundup-2020-08/

Goals 

Here are a few more details and specifics of things we still want to take on going forward:

Current Status

Meritocracy

With Project Hop, we actively work to foster the existing community and encourage community contributions. As of September 1st 2020, we have received over 250 pull requests, have around 600 tickets in our JIRA platform (many of which were created by community members) and have active discussions on our Mattermost chat platform with over 80 members.

Over the last half year, we have been asking users on our chat server for specific feedback on terminology, features and so on. It’s been a wonderfully positive experience to have in-depth discussions on complex issues with industry experts. We look forward to moving these discussions and votes to an Apache mailing list.

Community

Hop is developed, extended and maintained by a global community of users and developers. The Hop community is what has driven its development and growth. 

Hop’s particular history has generated a lot of interest in the project and has already led to a number of contributions, documentation and translations.

Core Developers

We have a diverse group of core developers, with people joining on a regular basis. Matt Casters, Rodrigo Haces and David Rosenblum are part-time developers on Hop, salaried by Neo Solutions. Bart Maertens, Hans Van Akelyen and Yannick Mols are part-time Hop developers paid by the company know.bi. Doug and Gretchen Moran were Pentaho employees; along with Rafael Valenzuela, Dan Keeley, Jason Chu, Sergio Ramazzina and many others, they have been consultants and community members for over a decade and joined the Hop community in the last year or two.

Alignment

We want to anchor and safeguard our development and community building efforts for the future, and we strongly believe this can best be achieved as an Apache project. The Hop project has also started to align with projects like Apache Beam, Spark and Flink in its use of terminology, tools, manner of configuration and so on. As mentioned elsewhere in this document, Hop is a large user of other Apache projects and libraries, and we believe that becoming an Apache project is mutually beneficial. For Apache Beam specifically, we believe that providing a visual pipeline development tool can be of great value.

Known Risks

While the Kettle codebase from which we started is already released under the Apache License 2.0, proper attribution to Hitachi Vantara is required.

We have no knowledge of existing patents on any part of the Kettle codebase.

To further reduce the risk of any discussion on naming, the Hop team decided to rename the project, its tools (also to make them more self-evident), the Java API and even the main concepts (Transformations are now called Pipelines, in line with Apache Beam naming conventions).

Orphaned products

There is little risk that the project will become orphaned. The list of active developers is large and consists of a mix of developers who have been working on the code for several years and recent arrivals in the community.

Inexperience with Open Source

The project team has a long history in open source and has contributed to Apache licensed open source projects, mostly in the Kettle ecosystem such as Kettle itself and the many plugins and projects surrounding it. The experience gained there has allowed us to quickly set up all required build tools and processes.  In its fairly short history, Hop has been advocating open source in all aspects of the project. Our submission to the Apache Software Foundation is a logical extension of our commitment to open source software.

Licensing

The original source code we started from (see below) has been open source since December 2005, initially under the Lesser GPL and, since January 2012, entirely under the Apache License version 2.0. All Hop code has been scanned for compliance with the Apache License 2.0. We integrated Apache Rat with our build process.

Heterogeneous Developers

Hop is built, developed and maintained by a global community of developers. Input comes from a large group of developers and users from all over the world. At this moment, more than 7 companies contribute to Hop through their developers, along with a number of individuals and consultants.

Reliance on Salaried Developers

Hop developers are a mix of volunteers, enthusiasts and people working for an employer. There is also a group of consultants who want to be involved in Hop because it allows them to do projects with it.  They are in fact our most important users and developers since they provide valuable feedback from the trenches.

Relationships with Other Apache Products

Hop is a heavy user of Apache software libraries.

Apache Commons usage:


Other libraries:


Other usage of Apache projects related to Hop (plugins):


For the build process

An Excessive Fascination with the Apache Brand

With this proposal we are not seeking attention or publicity. Rather, we firmly believe in Hop, visual data pipeline development and the ability to treat the developed data pipelines (ETL) as software code. While the original Hop code has been open source for about 15 years, we believe putting code on GitHub can only go so far. We see the Apache community, processes, and mission as critical for ensuring Hop is truly community-driven, positively impactful, and innovative open source software. We believe Hop is a great fit for the Apache Software Foundation due to its focus on visual data processing and its relationships to existing ASF projects.

Documentation

Over the years, the community has contributed extensive documentation to wiki.pentaho.com. Over time, areas of the available information have become incomplete or outdated. Most of this documentation has been reviewed and updated, and will be contributed to the Apache Software Foundation with the Hop source code. Documentation for the extensive new functionality that was added to Hop in recent months is being written.

We consider documentation to be a core piece of the Hop platform and will treat documentation like any other item of code.

Initial Source

While there isn’t a Java class in Hop which is unchanged from its origins, we should mention that we selected this source code to form the base of Hop:

https://github.com/pentaho/pentaho-kettle/tree/8.2.0.7-R

We merged various changes from the WebSpoon fork found over here:

https://github.com/HiromuHota/pentaho-kettle

Various community-driven Kettle plugins were written to bypass bugs, slow down code rot and implement missing features. They were merged into Hop from these locations:

https://github.com/mattcasters/kettle-debug-plugin (better debugging)

https://github.com/mattcasters/kettle-beam (Apache Beam support)

https://github.com/mattcasters/pentaho-pdi-dataset (Unit Testing)

https://github.com/mattcasters/kettle-needful-things (Bug fixes & workarounds)

https://github.com/mattcasters/kettle-environment (Environment management)

The Hop repositories are currently hosted at: 

 https://github.com/project-hop/

Source and Intellectual Property Submission Plan

The originating source code is already licensed under an Apache 2 license:


For all contributions we have an agreement in place: https://cla-assistant.io/project-hop/hop

External Dependencies

Over the course of the last year, we removed non-essential dependencies as much as possible and replaced them with interfaces and plugin types. We did this to simplify the architecture.

It’s important to note that all external dependencies are licensed under the Apache 2.0 license or an Apache-compatible license. As we grow the Hop community, we will configure our build process to require and validate that all contributions and dependencies are licensed under the Apache 2.0 license or an Apache-compatible license.

Cryptography

Required Resources

Mailing lists

We currently use a mix of email and Mattermost. We will migrate our existing mailing lists to the following:

dev@hop.incubator.apache.org

user@hop.incubator.apache.org

private@hop.incubator.apache.org

commits@hop.incubator.apache.org

Git Repository

The Hop code is currently in git, and we’d like to keep it that way. We request a git repository for incubator-hop with mirroring to GitHub.

Issue Tracking

We request the creation of an Apache-hosted JIRA. 

Jira ID: HOP

Other Resources

To allow other projects to use Hop as a library we would love to publish artifacts on a Maven server like maven.apache.org.

Initial Committers

Affiliations

Sponsors

Champion

Maximilian Michels (mxm@apache.org)

Nominated Mentors

Tom Barber (magicaltrout@apache.org)

Julian Hyde (jhyde@apache.org)

Maximilian Michels (mxm@apache.org)

Francois Papon (fpapon@apache.org)

Kevin Ratnasekera (djkevincr@apache.org)