...

The Apache Airflow website does not have an architectural overview. Such an overview would enable new contributors and users to develop a mental model of the workings of Apache Airflow and start contributing sooner.

This project involves documenting the different parts that make up Apache Airflow and how they are developed. This documentation should answer the following questions:

  • What are the different components of Apache Airflow[1]?
  • Which components are stateful, and which are stateless?
  • How does Apache Airflow distribute tasks to workers[2]?
  • How can Apache Airflow run on different deployment architectures and databases?
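The task-distribution question above rests on one core idea: a task becomes runnable only once all of its upstream tasks have completed. As a rough intuition-builder (a simplification, not Airflow's actual scheduler code), this can be sketched as a topological ordering:

```python
from collections import deque

# A DAG modelled as {task_id: [upstream_task_ids]}. In essence, the
# scheduler repeatedly dispatches tasks whose upstream tasks are all done.
def run_order(dag):
    remaining = {task: set(deps) for task, deps in dag.items()}
    done, order = set(), []
    # Tasks with no upstream dependencies are runnable immediately.
    ready = deque(sorted(t for t, deps in remaining.items() if not deps))
    while ready:
        task = ready.popleft()
        order.append(task)
        done.add(task)
        # A task becomes ready once all of its upstream tasks have finished.
        for t, deps in sorted(remaining.items()):
            if t not in done and t not in ready and deps <= done:
                ready.append(t)
    return order

dag = {"extract": [], "transform": ["extract"], "load": ["transform"]}
print(run_order(dag))  # ['extract', 'transform', 'load']
```

In a real deployment the "dispatch" step hands the task to a worker via the configured executor rather than running it inline.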

Expected deliverables

  • A page that describes the architecture
  • The page should have detailed descriptions of the following components:
    • Scheduler
    • Web Server
    • Worker
    • Metadata DB
  • The page should also contain a diagram of the Apache Airflow architecture (e.g. [1])
  • Description of how Apache Airflow schedules tasks[2]
  • Detailed examples with diagrams and text using PlantUML[3]

Related resources

[1] https://imgur.com/a/YGpg5Wa

[2] https://blog.sicara.com/using-airflow-with-celery-workers-54cb5212d405 

[3] https://github.com/plantuml/plantuml


2. Deployment

Jira issue: AIRFLOW-4369

Project description

Apache Airflow automates and orchestrates complex workflows. It hides the complexity of managing dependencies between operators and scheduling tasks, enabling users to focus on the logic of their workflows.

Deploying Apache Airflow in a resilient manner is the first step toward using it. There are many strategies to deploy it, and each has advantages and disadvantages. By documenting deployment, newcomers to the project will be able to adopt Apache Airflow with confidence.

This project will document strategies to deploy Apache Airflow in the following environments:

  • Cloud (AWS / GCP / Azure)
  • On-premises
  • Special attention can be given to the Kubernetes executor, as Kubernetes is a very popular technology to manage workloads.

Expected deliverables

  • A page that introduces and describes deployment techniques
  • A page that describes the deployment models and helps users choose the best one (full management, GKE-like service, PaaS - Astronomer/Google Composer)
  • A page that helps users choose the best executor
  • A section that describes how to deploy Apache Airflow with Kubernetes. The section should include snippets for Kubernetes files and scripts, backed up with a PlantUML diagram for clarity.
  • A section on running Apache Airflow on different cloud providers (AWS / Azure / GCP)
  • A table comparing different executors
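For context on comparing executors: Airflow selects its executor through the executor option in the [core] section of airflow.cfg. A minimal sketch of reading that setting with the standard library (the config fragment here is illustrative):

```python
import configparser

# Illustrative airflow.cfg fragment: the `executor` option in [core] is
# where Airflow selects its executor (SequentialExecutor, LocalExecutor,
# CeleryExecutor, KubernetesExecutor, ...).
CFG = """\
[core]
executor = CeleryExecutor
"""

def read_executor(cfg_text):
    """Parse an airflow.cfg-style snippet and return the configured executor."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get("core", "executor")

print(read_executor(CFG))  # CeleryExecutor
```

A comparison table could then be organized around exactly these executor values, so readers can map each row directly to a one-line configuration change.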

Related resources

[1] https://github.com/jghoman/awesome-apache-airflow

[2] https://apache-airflow.slack.com/archives/CCV3FV9KL/p1554319091033300; https://apache-airflow.slack.com/archives/CCV3FV9KL/p1553708569192000

...

Project description

Apache Airflow enables people to perform complex workflows that might affect many components in their infrastructure. It is important to be able to test an Apache Airflow workflow and ensure that it works as intended when run in a production environment.

The existing documentation lacks information that helps users test their workflows (known as DAGs), schedule DAGs properly, and write their own custom operators.

Users who know best practices for creating Apache Airflow DAGs and using operators will be able to adopt Apache Airflow more easily and with fewer mishaps.

Expected deliverables

  • A page or section that introduces testing workflows. The page should include information about the following testing stages:
    • Unit tests: apply to one class
    • DAG integrity tests: check DAG code for missing variables, imports, etc.
    • System tests
    • Data tests: check whether the DAG performs its purpose
  • A page for designing and testing DAGs that includes the following information:
    • Tips and working examples of good practices for designing DAGs
    • Descriptions of how to perform dry-runs of DAGs
    • Descriptions of how to write unit tests for DAGs
    • Snippets with working examples of DAGs and tests for them. Use PlantUML diagrams to complement all the new documentation
  • A section on how to develop testable operators
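As a taste of what a DAG integrity test could cover, here is a self-contained toy check (a hypothetical helper, not the Airflow API) that models a DAG as a mapping from task to upstream tasks and looks for undefined references and dependency cycles:

```python
# Toy "DAG integrity test" helper: a DAG is {task_id: [upstream_task_ids]}.
# Real Airflow integrity tests would import the DAG files themselves; this
# sketch only illustrates the kinds of structural errors worth checking.
def find_integrity_errors(dag):
    errors = []
    # 1. Every referenced upstream task must exist.
    for task, upstream in dag.items():
        for dep in upstream:
            if dep not in dag:
                errors.append(f"{task} depends on undefined task {dep}")
    # 2. The graph must be acyclic (depth-first search with coloring).
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {task: WHITE for task in dag}

    def has_cycle(task):
        color[task] = GRAY
        for dep in dag[task]:
            if dep not in dag:
                continue  # already reported as undefined above
            if color[dep] == GRAY:
                return True
            if color[dep] == WHITE and has_cycle(dep):
                return True
        color[task] = BLACK
        return False

    if any(color[t] == WHITE and has_cycle(t) for t in dag):
        errors.append("cycle detected")
    return errors

good = {"extract": [], "transform": ["extract"], "load": ["transform"]}
bad = {"a": ["b"], "b": ["a"]}
print(find_integrity_errors(good))  # []
print(find_integrity_errors(bad))   # ['cycle detected']
```

A documentation page could pair a check like this with a pytest run against the real dags/ folder, so every DAG is validated on each commit.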

Related resources

[1] https://github.com/jghoman/awesome-apache-airflow

[2] https://airflow.apache.org/scheduler.html

[3] https://github.com/PolideaInternal/airflow/blob/simplified-development-workflow/CONTRIBUTING.md 

[4] Airflow Breeze


4. How to create a workflow

Jira issue: AIRFLOW-4371

Project description

In Apache Airflow, workflows are saved as code. DAGs use operators to build complex workflows. A DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A developer can describe the relationships in several ways. Task logic is saved in operators. Apache Airflow has many ready-made operators that integrate with many services, but often developers need to write their own operators. Tasks can use XCom, backed by the metadata database, to communicate.

Expected deliverables

  • A page on how to create a DAG that also includes:
    • Revamping the page related to DAG scheduling
    • Adding tips for specific DAG conditions, such as rerunning a failed task
  • A page for developing custom operators that includes:
    • Describing mechanisms that are important when creating an operator, such as template fields, UI color, hooks, connections, etc.
    • Describing the division of responsibility between the operator and the hook
    • Considerations for dealing with shared resources, such as connections and hooks
  • A page that describes how to define the relationships between tasks. The page should include information about:
    • >> <<
    • set_upstream/set_downstream
    • helpers method ex. chain
  • A page that describes the communication between tasks that also includes:
    • Revamping the page related to macros and XCom
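The relationship syntax listed above can be illustrated with a toy model (not Airflow's BaseOperator, though the idea is the same): overloading __rshift__ makes a >> b record that b runs downstream of a, and returning the right-hand task allows chains like a >> b >> c:

```python
# Toy model of Airflow-style task relationships (illustrative, not the
# real BaseOperator): `>>` wires up downstream/upstream links.
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []
        self.downstream = []

    def set_downstream(self, other):
        """Record that `other` runs after this task."""
        self.downstream.append(other)
        other.upstream.append(self)

    def __rshift__(self, other):
        self.set_downstream(other)
        return other  # returning `other` enables chaining: a >> b >> c

extract = Task("extract")
transform = Task("transform")
load = Task("load")

extract >> transform >> load  # same effect as two set_downstream calls

print([t.task_id for t in transform.upstream])    # ['extract']
print([t.task_id for t in transform.downstream])  # ['load']
```

A page on task relationships could use a worked example like this to show that >>, set_upstream/set_downstream, and chain are three spellings of the same underlying dependency edges.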

...

Jira issue: AIRFLOW-4372

Project description

Currently, people who want to join the Apache Airflow community and start contributing might find it very difficult to get on board. Setting up a local development environment is difficult. Depending on the level of testing needed, Apache Airflow might require manual setup of a combination of environment variables, external dependencies (Postgres and MySQL databases, Kerberos, and a number of others), proper configuration, and database initialization. Additionally, contributors have to know how to run the tests. There are scripts that run in the CI environment to help, but they are typically used for running a full set of tests rather than individual tests.

All 3600+ tests in Apache Airflow are executed in CI, and the problem we are trying to solve is that it is very difficult for developers to recreate CI failures locally and fix them. It takes time and effort to run and re-run the tests and iterate while trying to fix failures. Apache Airflow is also being continuously developed, with lots of changes from multiple people. It is hard to keep up with the changes in your local development environment (currently, it requires a full rebuild after every change).

There are three different types of environments, and it is not easy to decide which one to use based on the limitations and benefits of those environments. The environments are: a local virtualenv for IDE integration and running unit tests, a self-managed Docker-image-based environment for simple integration tests with SQLite, and a CI Docker-Compose-based environment.

We have a Simplified Development Environment (work in progress), which makes it very easy to create and manage the CI Docker-based environment. You can have a working environment from scratch in fewer than 10 minutes. You can run tests immediately and iterate quickly (re-running tests has sub-second overhead compared to 20-30 seconds previously), and it has built-in self-management features: the environment rebuilds incrementally when there are incoming changes from other people. The environment is called Breeze (as in "It's a Breeze to develop Apache Airflow").

We would like to not only provide the environment but also improve the documentation so that it is easy to discover and understand when you should use which environment and how to use it. The benefit is a much faster learning curve for new developers joining the project, thus opening the community to more developers. It will also help experienced developers, who will be able to iterate faster while fixing problems and implementing new features.

The relevant documentation is currently a work in progress, but an initial version is already complete [1][2].

There are two relevant documents: CONTRIBUTING.md and BREEZE.rst, though we may ultimately consider a different structure.

Expected deliverables

  • A chapter or page of onboarding documentation that will be easy to find for new developers joining the Apache Airflow community, or for someone who wants to start working on Apache Airflow development on a new PC. Ideally, the documentation could be a step-by-step guide, an interactive tutorial, or a video guide - generally something easy to follow. Specifically, it should be clear that there are different local development environments depending on your needs and experience - from a local virtualenv through a Docker image to a full-blown replica of the CI integration testing environment.

...

  • and that choosing one depends on your needs and experience level.

[1] https://github.com/PolideaInternal/airflow/blob/simplified-development-workflow/CONTRIBUTING.md 

[2] Airflow Breeze


6. System maintenance

Jira issue: AIRFLOW-4373

Project description

Users rely on Apache Airflow to provide a reliable scheduler to orchestrate and run tasks. This means that an Airflow deployment should be resilient and low-maintenance. This project involves documenting how to ensure a reliable deployment and maintain a healthy Apache Airflow instance.

Examples of things that can be documented include:

  • Good practices for ensuring continuous, trouble-free operation of the system
  • Methods and mechanisms for system monitoring
  • Description of the SLA mechanism, such as:
    • Monitoring a running Apache Airflow instance and doing health checks, etc.
    • Setting up Prometheus and Grafana, the two most common monitoring tools, to monitor Apache Airflow metrics

Expected deliverables

  • Instructions and a step-by-step guide on how to set up monitoring for Apache Airflow, including Prometheus and Grafana (the two most common monitoring tools)
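One concrete building block for such a guide: the Airflow webserver exposes a /health endpoint that reports the status of the metadatabase and scheduler as JSON. A minimal sketch of a health probe (the sample payload shape is an assumption and can vary between Airflow versions):

```python
import json

# Sample payload shaped like the Airflow webserver's /health response
# (illustrative; the exact fields differ between Airflow versions).
SAMPLE = json.dumps({
    "metadatabase": {"status": "healthy"},
    "scheduler": {
        "status": "healthy",
        "latest_scheduler_heartbeat": "2019-04-15T08:27:40+00:00",
    },
})

def is_healthy(payload):
    """Return True only if every reported component says 'healthy'."""
    health = json.loads(payload)
    return all(component.get("status") == "healthy"
               for component in health.values())

print(is_healthy(SAMPLE))  # True
```

In practice a monitoring system would fetch the endpoint periodically and alert when a component reports anything other than healthy, or when the scheduler heartbeat grows stale.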

---