Status

State: Accepted
Discussion Thread:
Vote Thread: https://lists.apache.org/thread/9570fr5b2jv6hb2fd5z43jmsws42ls1z
Vote Result Thread: https://lists.apache.org/thread/f0rvbdbpq7bylt3kv0v6gn90qr3f98ng
Progress Tracking (PR/GitHub Project/Issue Label):
Date Created: 2024-06-26
Version Released:
Authors:

Motivation

Airflow has become the standard for orchestrating complex data workflows. However, it operates with limited visibility into the actual data it processes or produces. While it understands task execution order and attributes like the operators and parameters in use, it lacks insight into the nature of data inputs and outputs. This link between data and tasks is fundamental to data engineering, and vital for providing insights into the state and health of data as it moves through the workflow. Orchestrators are the heart of data platforms; if they understand this link, they can make orchestration decisions based on data quality and freshness, while also giving data engineers insight into system and data reliability in one place.

Rationale

In the context of data engineering, assets are the data entities that move through pipelines, undergo transformations, and ultimately drive business insights and power data products. Although data engineers spend considerable time building and maintaining data pipelines (and that work is important!), pipelines are a means to an end. The data asset itself is ultimately what drives business value.

Most data tools like Fivetran, Snowflake, dbt, Monte Carlo, Feast, and Tableau have adopted data assets as their central concept. Airflow, on the other hand, prioritizes task execution and treats the data asset as a byproduct. This conscious decision has made Airflow flexible enough to serve a myriad of use cases outside of data engineering, and it is a key factor in Airflow's viral adoption. However, with changes to the data landscape and the advent of Generative AI, stakeholders' access to high-quality data is more important than ever, making it critical for data orchestration tools to provide data engineers with actionable insights about the data's state and health as it moves through the workflow.

The current disconnect hampers Airflow's ability to make informed orchestration decisions based on the actual data and creates a gap between task management and data management. The result is clunky integrations, limited understanding of how data flows from point A to point B, and data engineers forced to translate between the workflow-oriented world of Airflow and the asset-oriented world of the rest of the data platform.

On the road towards expanding data awareness in Airflow, there are two key principles that we would like to adhere to: progressive adoptability and the ability to handle different incremental load strategies.

Progressive Adoptability

Enhancing data awareness in Airflow requires a strategy that supports progressive adoptability to bridge the gap between task-centric and asset-oriented approaches. This strategy is crucial as it is impractical to rewrite existing pipelines entirely. Evolution, not revolution, is the key.

  • Resource and Effort Considerations: Defining and maintaining detailed data semantics introduces complexity. Users need the flexibility to adopt these features gradually, implementing them as their needs evolve without an overwhelming burden.
  • Legacy Support: Airflow is widely used, with many critical existing DAGs. Progressive adoptability ensures that enhancements in data awareness can be integrated without disrupting or rendering these workflows obsolete, thus protecting users' investments.
  • Selective Enforcement: Similar to CI tools with customizable rulesets, Airflow can allow users to incrementally adopt data-awareness features. This approach enables users to apply specific aspects of data awareness to their workflows, aligning with their requirements and maturity levels.
  • Leveraging Existing Practices: Drawing from software development practices, such as Python type annotations, Airflow can enable users to gradually enhance their DAGs with data-related semantics and validations, improving data handling and pipeline reliability at their own pace.

By facilitating a smooth, incremental transition, Airflow can address the gap effectively, allowing it to evolve into a more data-aware platform while ensuring continuity and minimizing disruption.
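
To make this incremental path concrete with primitives that already exist today (Airflow's Dataset API, available since Airflow 2.4), the sketch below annotates one existing task's output and lets a downstream DAG schedule on it, leaving all other task logic untouched. The URIs and task bodies are placeholders:

    import pendulum

    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    # Placeholder URI; any string identifying the data entity works.
    orders_table = Dataset("postgres://analytics/public/orders")

    @dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
    def produce_orders():
        # Step 1: annotate an existing task with `outlets`; the task body
        # itself does not need to change.
        @task(outlets=[orders_table])
        def load_orders():
            ...  # existing load logic stays as-is

        load_orders()

    @dag(schedule=[orders_table], start_date=pendulum.datetime(2024, 1, 1), catchup=False)
    def consume_orders():
        # Step 2: this DAG now runs whenever the dataset is updated,
        # rather than on a fixed timetable.
        @task
        def build_report():
            ...

        build_report()

    produce_orders()
    consume_orders()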

Proposed Changes

Due to the scope of the proposed changes, we will use the current AIP as an umbrella AIP and split the proposal into the following sub-AIPs:

  1. AIP-74 Introducing Data Assets: This proposal focuses on introducing and formally defining data assets within Airflow. Assets represent logically related data entities such as tables, machine learning models, reports, or directories of files. By renaming Datasets to Assets and potentially introducing subtypes, this sub-AIP aims to align Airflow's data management with modern data engineering practices. This will provide clearer insights into the state and transformations of data within workflows, enabling better tracking and lineage.
  2. AIP-75 New Asset-Centric Syntax: This proposal extends the Taskflow API with asset-centric primitives, allowing more intuitive definitions of data assets. New syntax features include the @asset decorator for defining assets with functions, support for producing multiple assets from a single function, and ways to declare asset dependencies (see the sketch after this list). This will make Airflow more user-friendly for data engineers and enhance its ability to model complex data dependencies.
  3. AIP-76 Asset Partitions: This proposal introduces the concept of partitions to represent slices of data that can be processed independently. Partitions can be time-based or segment-based, allowing for more granular control and efficient handling of large datasets. By decoupling partitions from task runs, Airflow will better support incremental data processing and enable more detailed tracking of data lineage and dependencies.
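
To give a concrete feel for the direction of AIP-75 (and, tentatively, AIP-76), below is a minimal sketch of what the proposed asset-centric syntax might look like. The @asset decorator name comes from this proposal, but the import path, its parameters, and especially the partitions argument are illustrative assumptions, not a finalized API:

    # Illustrative sketch only: the import path and all parameters shown
    # here are assumptions based on this proposal, not a finalized API.
    from airflow.sdk import asset  # assumed import location

    @asset(schedule="@daily", uri="s3://warehouse/raw_orders")
    def raw_orders():
        # The function body materializes the asset; conceptually, Airflow
        # wraps it in a DAG with a single task.
        ...

    @asset(schedule=raw_orders)  # asset dependency: run when raw_orders updates
    def cleaned_orders():
        ...

    # Hypothetical illustration of AIP-76: the `partitions` parameter and
    # per-slice argument are placeholders, not part of any finalized design.
    @asset(schedule="@hourly", partitions="hourly")
    def hourly_events(partition_key):
        # Process only the slice identified by partition_key.
        ...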

This AIP will be considered done once all sub-AIPs listed above are done.

Future Work

As we continue to enhance Airflow's data awareness capabilities, there are several key areas of improvement that we plan to address in future iterations. These enhancements aim to further optimize data processing, improve collaboration, and ensure robustness in diverse deployment scenarios.

  • AIP-77 Asset Validations: This proposal will add capabilities for defining and verifying expectations around data assets. It introduces pre-runtime and post-runtime validations to ensure data reliability and adherence to established data flow contracts. Data engineers will be able to specify validation rules for data inputs and outputs, improving the robustness of data pipelines and reducing runtime errors.
  • Backfilling Partitions: Enhancing support for backfilling partitions will enable efficient re-processing of specific slices of data. This will allow users to correct historical data errors or update datasets with new logic without reprocessing the entire dataset. By leveraging the new partitioning capabilities, Airflow will streamline the backfill process, ensuring data accuracy and consistency across time-based and segment-based partitions. This work will be defined and delivered in line with the Backfills at Scale proposal.
  • Scheduling Tolerations and Timeouts: Introducing scheduling tolerations and timeouts will enhance the flexibility and reliability of data workflows in Airflow. Scheduling tolerations will allow tasks to proceed with available data even if some dependencies aren't met, provided those dependencies were updated within an acceptable window (for example, within the last month). Timeouts will provide a mechanism to enforce maximum wait times for task executions, ensuring that workflows progress efficiently and do not stall due to delays in data availability. Together, tolerations and timeouts accommodate scenarios where timely data generation is critical despite incomplete or stale upstream data, improving overall workflow robustness (see the sketch after this list).
  • Sharing Data Assets in Multi-Team Deployments: In multi-team deployments, sharing data assets seamlessly across teams is crucial for collaboration and efficiency. Future enhancements will focus on developing mechanisms to manage and share assets with proper access controls and metadata tagging. This will facilitate better collaboration between teams, reduce asset duplication, and ensure that teams can leverage shared data assets effectively while maintaining data governance and security.
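
To make the tolerations and timeouts idea tangible, the sketch below uses purely hypothetical parameter names; nothing like this exists in Airflow today, and the eventual design may differ entirely:

    import datetime

    # Purely hypothetical sketch: `tolerate_staleness` and `wait_timeout` are
    # illustrative names, not proposed or existing Airflow parameters.
    # `cleaned_orders` refers to the asset sketched under "Proposed Changes".
    @asset(
        schedule=cleaned_orders,
        # Toleration: proceed if the upstream asset was updated within the
        # last month, even if the newest expected update has not landed yet.
        tolerate_staleness=datetime.timedelta(days=30),
        # Timeout: stop waiting after six hours rather than stalling.
        wait_timeout=datetime.timedelta(hours=6),
    )
    def monthly_report():
        ...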