Status
Motivation
Summary
This document proposes enhancing Airflow's task execution reliability by enabling infrastructure-aware decisions during failures and terminations. It introduces Execution Context Propagation and Infrastructure Failure Auto-Retry to help Airflow distinguish between infrastructure issues (worker crashes, pod evictions) and application errors (user code bugs), enabling smarter retry budgets and better operational clarity.
Key Benefits:
- Platform teams can accurately attribute failures (infrastructure vs application)
- Users' retry budgets are protected from infrastructure disruptions
- Operators can make intelligent cleanup decisions (preserve vs cancel remote jobs)
- Clear observability through listener hooks with rich failure context
Motivation
Current Behavior and Limitations
Problem 1: No Observability Context for Root Cause Analysis
When a task instance fails, the listener hook receives only the exception object, with no structured context:
from airflow.listeners import hookimpl

@hookimpl
def on_task_instance_failed(previous_state, task_instance, error):
    # 'error' is just the exception
    # No category (infrastructure vs application)
    # No source (executor, scheduler, worker)
    # No reason (pod eviction vs timeout vs OOM)
    ...
Platform teams must manually traverse logs and task states to determine root causes.
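To make the gap concrete, the following is a minimal sketch of the kind of structured context a listener lacks today. The names `FailureCategory`, `FailureSource`, `FailureContext`, and `describe` are hypothetical and chosen only for illustration; they are not existing Airflow APIs, and the fields simply mirror the three missing pieces listed above (category, source, reason).

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    """Hypothetical: who is at fault for the failure."""
    INFRASTRUCTURE = "infrastructure"
    APPLICATION = "application"


class FailureSource(Enum):
    """Hypothetical: which component detected or reported the failure."""
    EXECUTOR = "executor"
    SCHEDULER = "scheduler"
    WORKER = "worker"


@dataclass(frozen=True)
class FailureContext:
    """Hypothetical structured context a listener could receive with the error."""
    category: FailureCategory
    source: FailureSource
    reason: str  # free-form label, e.g. "pod_eviction", "timeout", "oom"


def describe(ctx: FailureContext) -> str:
    """Render a one-line summary suitable for an alert or a log line."""
    return f"{ctx.category.value} failure from {ctx.source.value}: {ctx.reason}"
```

With something of this shape attached to the failure, a platform team could route or label an alert from the context alone instead of reading task logs.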
Problem 2: Infrastructure Failures Consume User Retry Budgets
Users configure retries for application issues, but infrastructure failures silently consume them:
@task(retries=3)  # "I want 3 retries for data processing job issues"
def process_data():
    cleaned = clean_data(raw_data)
    if not validate(cleaned):
        raise DataProcessingError("Invalid data")
What actually happens:
- Try 1: DNS failure during worker init → Infrastructure
- Try 2: K8s pod evicted → Infrastructure
- Try 3: DataProcessingError in user code → Application
Result: Failed permanently → User: "I only got 1 real retry!"
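The accounting in this walkthrough can be sketched as a small, simplified model. It assumes each failure consumes one unit of the retry budget, and it uses a hypothetical function `remaining_retries` with a hypothetical flag `infra_consumes_budget`; neither is an Airflow API. The flag contrasts today's behavior with a budget that is protected from infrastructure failures.

```python
def remaining_retries(
    retries: int,
    failure_categories: list[str],
    infra_consumes_budget: bool,
) -> int:
    """Return the retry budget left after the given sequence of failures.

    Each entry in failure_categories is "infrastructure" or "application".
    Application failures always consume budget; infrastructure failures
    consume it only when infra_consumes_budget is True.
    """
    remaining = retries
    for category in failure_categories:
        if category == "application" or infra_consumes_budget:
            remaining -= 1
    return max(remaining, 0)


# The three failures from the walkthrough above.
failures = ["infrastructure", "infrastructure", "application"]

# Every failure counts against the user's budget.
budget_today = remaining_retries(3, failures, infra_consumes_budget=True)

# Only the application failure counts against the user's budget.
budget_protected = remaining_retries(3, failures, infra_consumes_budget=False)
```

Under this model the same three failures leave no budget when every failure counts, and two retries when only the application failure counts.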
Considerations
https://docs.google.com/document/d/1BAOJTAPfWK93JnN6LQrISo8IqDiE7LpnfG2Q42fnn7M/edit?usp=sharing
What change do you propose to make?