Status

State: Draft
Discussion Thread:
Vote Thread:
Vote Result Thread:
Progress Tracking (PR/GitHub Project/Issue Label):
Date Created: 2025.12.04
Version Released:
Authors: Stefan Wang

Motivation

Summary

This document proposes improving Airflow's task execution reliability by making failure and termination handling infrastructure-aware. It introduces Execution Context Propagation and Infrastructure Failure Auto-Retry so that Airflow can distinguish infrastructure issues (worker crashes, pod evictions) from application errors (user code bugs), which enables smarter retry budgets and better operational clarity.

Key Benefits:

  • Platform teams can accurately attribute failures (infrastructure vs application)
  • Users' retry budgets are protected from infrastructure disruptions
  • Operators can make intelligent cleanup decisions (preserve vs cancel remote jobs)
  • Clear observability through listener hooks with rich failure context

Motivation

Current Behavior and Limitations

Problem 1: No Observability Context for Root Cause Analysis

When a task fails, listener hooks receive only the exception object, with no structured context:

from airflow.listeners import hookimpl

@hookimpl
def on_task_instance_failed(previous_state, task_instance, error):
    # 'error' is just the exception
    # No category (infrastructure vs application)
    # No source (executor, scheduler, worker)
    # No reason (pod eviction vs timeout vs OOM)
    ...

Platform teams must manually traverse logs and task states to determine root causes.
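
To make the gap concrete, the sketch below shows the kind of structured context that the comments above describe as missing. It is illustrative only: FailureCategory, FailureContext, and describe are hypothetical names invented for this example. They are not part of Airflow's listener API, and they are not necessarily the design this proposal will settle on.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    """Hypothetical category: which side is responsible for the failure."""
    INFRASTRUCTURE = "infrastructure"
    APPLICATION = "application"


@dataclass(frozen=True)
class FailureContext:
    """Hypothetical structured context a listener could receive."""
    category: FailureCategory
    source: str   # e.g. "executor", "scheduler", "worker"
    reason: str   # e.g. "pod_eviction", "timeout", "oom"


def describe(error: BaseException, context: FailureContext) -> str:
    """Render a one-line summary suitable for logs or alerts."""
    return (
        f"{context.category.value} failure from {context.source} "
        f"({context.reason}): {type(error).__name__}: {error}"
    )


ctx = FailureContext(FailureCategory.INFRASTRUCTURE, "executor", "pod_eviction")
print(describe(RuntimeError("pod deleted"), ctx))
```

With fields like these, a platform team could route or count failures by category without reading logs.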

Problem 2: Infrastructure Failures Consume User Retry Budgets

Users configure retries for application issues, but infrastructure failures silently consume them:

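from airflow.sdk import task

# clean_data, validate, and DataProcessingError stand in for user-defined code.
@task(retries=3)  # "I want 3 retries for data processing job issues"
def process_data(raw_data):
    cleaned = clean_data(raw_data)
    if not validate(cleaned):
        raise DataProcessingError("Invalid data")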

What actually happens:

  • Try 1: DNS failure during worker init → Infrastructure
  • Try 2: K8s pod evicted → Infrastructure
  • Try 3: DataProcessingError in user code → Application

Result: The task fails permanently → User: "I only got 1 real retry!"
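
The accounting in the scenario above can be sketched in plain Python. This is a simplified model, not Airflow's scheduler logic: budget_consumed is a hypothetical helper, and the rule that infrastructure failures are exempt from the user's budget is an assumption about how the proposed auto-retry could behave.

```python
def budget_consumed(failures, exempt_infrastructure):
    """Count how many failed tries are charged to the user's retry budget.

    failures: "infrastructure" / "application" labels, one per failed try.
    exempt_infrastructure: if True, infrastructure failures are not charged
    (assumed behaviour of the proposal); if False, every failure is charged
    (the current behaviour described above).
    """
    if exempt_infrastructure:
        return sum(1 for kind in failures if kind == "application")
    return len(failures)


# The three failed tries from the scenario above.
tries = ["infrastructure", "infrastructure", "application"]

print(budget_consumed(tries, exempt_infrastructure=False))  # 3
print(budget_consumed(tries, exempt_infrastructure=True))   # 1
```

Under the assumed exemption, only the application failure would count against the budget the user configured.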

Considerations

https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M/edit?usp=sharing


What change do you propose to make?


What problem does it solve?


Why is it needed?


Are there any downsides to this change?


Which users are affected by the change?


How are users affected by the change? (e.g. DB upgrade required?)


What is the level of migration effort (manual and automated) needed for the users to adapt to the breaking changes? (especially in context of Airflow 3)


Other considerations?


What defines this AIP as "done"?