Status

State: Draft

Discussion thread: http://mail-archives.apache.org/mod_mbox/airflow-dev/201901.mbox/%3CCAJM+deJtsNGWXwsq_W+OwsxDkbNVGEy6WDrdTwwP6dgtwQPsGQ@mail.gmail.com%3E

JIRA: AIRFLOW-3644

Motivation

Apache Airflow integrates with many different software systems, with hooks and operators for Amazon Web Services, Google Cloud Platform, Microsoft Azure, and more. As cloud APIs evolve, they require changes to the Airflow operator and/or hook to support new features and bug fixes.

Cloud API launch schedules do not align with the Airflow release schedule. To use the latest cloud features or bug fixes, developers must either run Airflow from master or maintain a fork of Airflow and backport patches.

In addition, the Airflow issue tracker receives many questions and bug reports about operator-specific issues. No single contributor has expertise across all hooks/operators, so triage produces overhead and delays while issues are routed to someone with the appropriate expertise.

Considerations

Requirements / Constraints

  • Hook and Operator packages are releasable separately from the core Airflow package.

  • Feedback on a specific Hook or Operator is easily routed to the person or group with relevant expertise.

  • Hooks/operators that are not tied to a specific external system, or whose underlying module is part of the standard Python distribution (e.g. sqlite3), remain in the core package.

    • No changes to the hooks: BaseHook, DbApiHook, HttpHook, SqliteHook

    • No changes to the operators: DummyOperator, PythonOperator, LatestOnlyOperator

Proposed changes to the workflow / infrastructure

Short-term structure

The existing core and contrib namespaces are kept for backwards compatibility of DAGs, but the hook/operator logic itself moves to a separate "airflow-backward-compatibility-operators" package.

  • Create a new top-level directory in the airflow repository called "operators".

  • Exclude the "operators" directory from the root setup.py file.

  • Copy the hooks and operators from both core and contrib into the /operators/airflow_operators directory.

  • Add a /operators/setup.py file to package the "operators" directory as airflow_operators.

  • Verify that DAGs can be created and run using the airflow_operators namespace when the /operators directory is installed locally in development mode (e.g. with pip install -e).

  • Release a PyPI package for "airflow-backward-compatibility-operators".

  • Make the root /setup.py depend on the new "airflow-backward-compatibility-operators" package.

  • For each hook and operator, remove the class definition and import the class from the "airflow-backward-compatibility-operators" package (see the import-shim sketch after this list).

  • Update get_hook to use the new hook locations in the "airflow-backward-compatibility-operators" package.

  • The "airflow" package can now be released independently from the "airflow-backward-compatibility-operators" package.

Long-term structure

One hook or operator per package, following the "micro package" philosophy.

  • Deprecate each of the moved hooks/operators in core and contrib.

  • Package each hook/operator separately. The details of this will be covered in a later AIP.

  • Point hooks/operators in the airflow-backward-compatibility-operators package to their new packages (see the deprecation sketch after this list).

  • Deprecate moved hooks/operators in the airflow-backward-compatibility-operators package.
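
Once a hook/operator graduates to its own micro package, the backward-compatibility package becomes a thin redirect. A hedged sketch, assuming a hypothetical airflow-operators-bigquery package exposing BigQueryOperator (package and module names are illustrative; the real layout is deferred to the later AIP):

    # airflow_operators/bigquery_operator.py -- redirect with a deprecation warning
    import warnings

    # hypothetical new home for the operator
    from airflow_operators_bigquery import BigQueryOperator  # noqa: F401

    warnings.warn(
        "Importing BigQueryOperator from airflow_operators is deprecated; "
        "install airflow-operators-bigquery and import it from there instead.",
        DeprecationWarning,
        stacklevel=2,
    )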

Challenges

  • In the long-term structure, there are many more repositories and packages to maintain.

    • Since the packages are distributed separately, they needn't all be released at the same time.

    • Testing of each package will be faster. Someone contributing a fix to a BigQuery operator needn't be concerned with running the tests for the Hive operators.

    • It will be easier for people with relevant expertise, such as employees and customers of cloud providers, to help maintain these packages.

    • These repositories needn't be in the Apache org. For example, they could be created in the relevant cloud provider's GitHub org or another open source org.

    • It will be easier to create new operators; the only change needed in core is creating the relevant connection type.

  • Airflow will be tightly coupled with many new packages.

    • Airflow only needs to include connection configuration and hook construction in the core package. Hooks change more slowly than operators, since authentication methods rarely change.

    • Releases need not be tightly coupled. For example, pandas split its I/O modules into separate packages, such as pandas-gbq for BigQuery support. pandas-gbq supports several versions of pandas and vice versa; no special coordination is needed when releasing pandas-gbq or pandas beyond documenting the supported versions (see the version-range sketch after this list).

  • The Airflow community may seem less active, because contributors will be less likely to need to send a PR to the core Airflow repository. Only new hooks need PRs to the core repo, so that the connection configuration they require can be created in the database; new operators that use existing hooks need not touch the core repository.

    • There will still be plenty of work needed on the core Airflow package; git commits won't slow to a crawl.

    • Download statistics are unaffected. Airflow users still need to install the core Airflow package from PyPI.

    • The bar for contributing to the overall community is much lower. With operators pulled out of the central repository, it is clearer that anyone can create and package an operator for Airflow, especially one that makes use of an existing hook. This is similar to the ecosystem of JavaScript callbag libraries, which actively encourages distributed ownership of related packages.
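
Loose release coupling mostly comes down to declaring honest version ranges, pandas/pandas-gbq style. A sketch of what a hypothetical airflow-operators-bigquery micro package's setup.py might declare (names and version bounds are assumptions, not decided policy):

    # setup.py for a hypothetical airflow-operators-bigquery micro package
    from setuptools import setup

    setup(
        name="airflow-operators-bigquery",
        version="0.1.0",
        py_modules=["bigquery_operator"],
        install_requires=[
            "apache-airflow>=1.10",        # assumed minimum supported core
            "google-cloud-bigquery>=1.8",  # assumed client library dependency
        ],
    )

As with pandas and pandas-gbq, each side documents the versions it supports and releases on its own schedule.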