The Apache Airflow community is happy to share that we have applied to participate in the first edition of Season of Docs.
Season of Docs is a program organized by Google Open Source to match technical writers with mentors to work on documentation for open source projects. We, at Apache Airflow, couldn’t be more excited about this opportunity: as a small but fast-growing project, we need to make sure that our documentation stays up to date and in good condition.
After a discussion that involved many members of the community, where lots of ideas were shared, we were able to gather a set of projects that we would love a technical writer to work on. In fact, we have asked for two technical writers in our application, as we believe there is a big interest in the community, and lots of work to be done.
If you like the idea of spending a few months working on our awesome open source project, in a welcoming community that will be thrilled to see your contributions, please take a look at the projects list and consider applying for Season of Docs. If you have any questions, do not hesitate to reach out to us on the Apache Airflow mailing list at email@example.com (you will need to subscribe first by emailing firstname.lastname@example.org), or in the Airflow Slack channel.
Our team of mentors is looking forward to hearing from you <3
1. Apache Airflow architecture
Jira issue: AIRFLOW-4368
The Apache Airflow website does not have a section with an overview of the general architecture. Such an overview would enable new contributors and users to build a mental model of how Apache Airflow works, and to start contributing sooner.
This project involves documenting the different parts that make up Apache Airflow, and how they are developed. Some of the questions that this documentation should answer are: What are the different components of Apache Airflow? Which ones are stateful, and which ones are stateless? How does Apache Airflow distribute tasks to workers? How can Apache Airflow run on different deployment architectures and databases?
- A page describing the architecture
- The page should have detailed descriptions of each component:
- Web Server
- Metadata DB
- The page should also contain a diagram of the Apache Airflow architecture
- Description of how Apache Airflow schedules tasks
- Detailed examples with diagrams and text using PlantUML
2. Deployment
Jira issue: AIRFLOW-4369
Apache Airflow is used to automate and orchestrate complex workflows. It hides the complexity of managing dependencies between operators and of scheduling tasks, so that users can focus on their own logic.
On the other hand, deploying Apache Airflow in a resilient manner is the first step toward using it. There are many strategies to deploy it, and each has advantages and disadvantages. By documenting this, newcomers to the project will be able to adopt Apache Airflow with confidence.
This project will consist of documenting strategies to deploy Apache Airflow in different environments:
- Cloud (AWS / GCP / Azure)
- Special attention can be given to the Kubernetes executor, as Kubernetes is a very popular technology to manage workloads
- A page introducing and describing deployment techniques
- A page that describes the deployment models and helps you choose the best one (full management, GKE-like service, PaaS - Astronomer/Google Composer)
- A page that will allow you to choose the best executor
- A section describing how to deploy Apache Airflow with Kubernetes. The section should include snippets of Kubernetes files and scripts, backed up with a PlantUML diagram for clarity.
- A section on running Apache Airflow on different cloud providers (AWS / Azure / GCP)
- A table comparing different executors
3. Testing
Jira issue: AIRFLOW-4370
Apache Airflow allows people to run complex workflows that may affect many components in their infrastructure. It is therefore important to be able to test an Apache Airflow workflow, and to be sure that it will work as intended when it runs in production.
The existing documentation has little information on testing workflows (known as DAGs), ensuring that they will be scheduled properly, and enabling users to write their own custom operators.
Users who know the “best practices” for dealing with Apache Airflow DAGs and operators will be able to adopt it in their infrastructure much more smoothly, and with fewer mishaps.
- Introduction to testing workflows.
- A page for Designing and Testing DAGs
- Tips and working examples on “good practices” for designing DAGs
- Descriptions on how to perform dry-runs of DAGs
- Descriptions on how to write unit tests for DAGs
- Snippets with working examples of DAGs and tests for them. Use PlantUML diagrams to complement all the new documentation
- How to develop operators that are testable?
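As a hedged illustration of what a structural unit test for a workflow could look like, the sketch below checks a dependency graph for unknown tasks and cycles. It deliberately does not use Airflow: the workflow is represented as a plain dictionary, and the names `DEPENDENCIES` and `find_problems` are hypothetical. Real DAG tests would load the DAG objects through Airflow itself.

```python
import unittest
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def find_problems(dependencies):
    """Return a list of human-readable structural problems."""
    problems = []
    for task, upstream in dependencies.items():
        for dep in sorted(upstream):
            if dep not in dependencies:
                problems.append(f"{task} depends on unknown task {dep}")
    try:
        # static_order() raises CycleError when the graph contains a cycle.
        list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        problems.append("dependency cycle detected")
    return problems


class WorkflowStructureTest(unittest.TestCase):
    def test_workflow_is_well_formed(self):
        self.assertEqual(find_problems(DEPENDENCIES), [])

    def test_cycle_is_reported(self):
        cyclic = {"a": {"b"}, "b": {"a"}}
        self.assertIn("dependency cycle detected", find_problems(cyclic))
```

Saved as a module, the tests can be run with `python -m unittest`. Keeping the check in a small pure function makes it easy to test without a running scheduler or database.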
4. How to create a workflow
Jira issue: AIRFLOW-4371
In Apache Airflow, workflows are saved as code. Elements such as operators and DAGs are used to build complex workflows. A DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. The developer can describe these relationships in several ways. Task logic is saved in operators. Apache Airflow has many ready-made operators that integrate with many services, but often you need to write your own. Tasks can communicate with each other using XCom, which is stored in the metadata database.
- A page for how to create a DAG
- Revamping the page related to scheduling of DAG
- Tips for specific DAG conditions, such as rerunning a failed task
- A page for developing Custom Operators
- Describing mechanisms that are important when creating an operator such as template fields, UI color, hooks, connection etc.
- Describing the responsibility between the operator and the hook
- Things to keep in mind when dealing with shared resources (e.g. connections, hooks)
- A page that describes how to define the relations between tasks
- The >> and << operators
- Helper methods, e.g. chain
- A page that describes the communication between tasks
- Revamping the page related to macros and XCom
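To make the ideas in this project more concrete (tasks, relations declared with >> and <<, and small values passed between tasks), here is a toy model in plain Python. It does not use Airflow and is not how Airflow is implemented: the `Task` class, the `run_all` function, and the `xcom` dictionary are hypothetical stand-ins that only show the shape of the concepts.

```python
from graphlib import TopologicalSorter


class Task:
    """Toy stand-in for an operator instance; not Airflow's API."""

    def __init__(self, task_id, fn):
        self.task_id = task_id
        self.fn = fn
        self.upstream = set()

    def __rshift__(self, other):
        # a >> b means: b runs after a.
        other.upstream.add(self)
        return other

    def __lshift__(self, other):
        # a << b means: a runs after b.
        self.upstream.add(other)
        return other


def run_all(tasks):
    """Run tasks in dependency order, sharing results through a dict."""
    xcom = {}
    order = []
    graph = {task: task.upstream for task in tasks}
    for task in TopologicalSorter(graph).static_order():
        # Each task can read the values returned by earlier tasks.
        xcom[task.task_id] = task.fn(xcom)
        order.append(task.task_id)
    return order, xcom


extract = Task("extract", lambda xcom: [1, 2, 3])
transform = Task("transform", lambda xcom: [x * 10 for x in xcom["extract"]])
load = Task("load", lambda xcom: sum(xcom["transform"]))

# Declare the chain extract -> transform -> load.
extract >> transform >> load

order, results = run_all([extract, transform, load])
print(order)
print(results["load"])
```

Returning the right-hand task from `>>` is what lets several relations be chained on one line; a helper such as chain could be built on the same idea.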
5. Documenting using local development environments
Jira issue: AIRFLOW-4372
At the moment, people who want to join the Apache Airflow community and start contributing might find it very difficult to get on board. Setting up a local development environment is not easy. Depending on the level of testing needed, Apache Airflow might require manual setup of a combination of environment variables, external dependencies (Postgres and MySQL databases, Kerberos, and a number of others), proper configuration, and database initialization, and contributors also have to know how to run the tests. There are scripts that help with this and that run in the CI environment, but they are mainly used for running the full set of tests rather than individual tests.
All 3600+ tests in Apache Airflow are executed in CI, and the problem we are trying to solve is that it’s very difficult for developers to recreate CI failures locally and fix them. It takes time and effort to run and re-run the tests and iterate while trying to fix the failures. Also, the Apache Airflow project is being continuously developed and there are lots of changes from multiple people, so it’s rather hard to keep up with the changes in your local development environment (currently it requires a full rebuild after every change).
There are three different types of environment, and it’s not easy to decide which one to use or to understand the limitations and benefits of each. The environments are: a local virtualenv for IDE integration and running unit tests; a self-managed Docker image based environment for simple integration tests with SQLite; and a CI Docker Compose based environment.
We have a work-in-progress Simplified Development Environment that makes it very easy to create and manage the CI Docker based environment. You can have a working environment in less than 10 minutes from scratch, one that manages itself as the project is developed by others, so that you can run tests immediately and iterate quickly (re-running tests has sub-second overhead, compared to 20-30 seconds previously). It also has built-in self-management features (the environment rebuilds incrementally when there are incoming changes from other people). The environment is called Breeze (as in “It’s a Breeze to develop Apache Airflow”).
We would like not only to get the environment but also to improve the documentation, so that it is easy to discover and understand which environment to use, when, and how. The benefit is a much gentler learning curve for new developers joining the project, thus opening the community to more developers. It will also help experienced developers, who will be able to iterate faster while fixing problems and implementing new features.
The relevant documentation is currently a work in progress, but it is already largely complete.
There are two relevant documents: CONTRIBUTING.md and BREEZE.rst. In the end, though, we can consider a different structure.
- On-boarding documentation chapter/page that will be easily discoverable for new developers joining Apache Airflow community or someone who wants to start working on Apache Airflow development on a new PC. Ideally that could be a step-by-step guide or some kind of video guide - generally something easy to follow. Specifically it should be clear that there are different local development environments depending on your needs and experience - from local virtualenv through docker image to full-blown replica of CI integration testing environment. Maybe some kind of interactive tutorial would be good as well.
6. System maintenance
Jira issue: AIRFLOW-4373
Users rely on Apache Airflow to provide a reliable scheduler to orchestrate and run tasks. This means that an Airflow deployment should be resilient and low-maintenance. This project involves collecting and documenting ways in which one can ensure a reliable deployment and maintain a healthy Apache Airflow instance.
Some of the things that can be documented are:
- Good practices on how to ensure continuous and trouble-free operation of the system
- Ways and mechanisms for ensuring system monitoring
- Description of the SLA mechanism
- Monitoring a running Apache Airflow instance. Doing health checks, etc.
- Setting up Prometheus and Grafana to monitor Apache Airflow metrics
- Instructions and a step-by-step guide on how to set up monitoring for Apache Airflow, including the two most common monitoring tools, Prometheus and Grafana
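As a small sketch of the kind of health check such a guide could describe, the snippet below inspects a JSON health document and reports which components are not healthy. The payload shape (one entry per component, each with a `status` field) and the function name `unhealthy_components` are assumptions made for illustration; the actual response format should be checked against the Airflow version in use.

```python
import json


def unhealthy_components(payload):
    """Return the sorted names of components whose status is not 'healthy'."""
    return sorted(
        name
        for name, info in payload.items()
        if info.get("status") != "healthy"
    )


# Sample response body; the shape is an assumption for illustration only.
sample = json.loads("""
{
  "metadatabase": {"status": "healthy"},
  "scheduler": {"status": "unhealthy", "latest_scheduler_heartbeat": null}
}
""")

print(unhealthy_components(sample))
```

A monitoring job could call a function like this on a schedule and raise an alert whenever the returned list is not empty.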