Status
Motivation
I propose to enforce the use of a linter and code formatter for Airflow so that all code follows the same conventions, resulting in more consistent and more readable code.
Currently the Airflow codebase is written by many different people (which is great!), but also in many different programming styles. There are many inconsistencies in code style which are functionally correct but make the Airflow code harder to read. I think enforcing a code format would benefit Airflow's code readability, because everything would be structured in the same style and people would stick to the same conventions when writing Airflow code. This would also prevent occasional PRs such as https://github.com/apache/incubator-airflow/pull/3714.
I created this AIP since introducing a code formatter will most likely change a large number of lines and result in a few code conflicts.
Some examples where I think a code formatter would help:
Example #1: imports in /airflow/bin/cli.py
from __future__ import print_function
import logging
import os
import subprocess
import textwrap
import random
import string
from importlib import import_module
import getpass
import reprlib
import argparse
from builtins import input
from collections import namedtuple
from airflow.models.connection import Connection
from airflow.utils.timezone import parse as parsedate
import json
from tabulate import tabulate
import daemon
from daemon.pidfile import TimeoutPIDLockFile
import signal
import sys
import threading
import traceback
import time
import psutil
import re
from urllib.parse import urlunparse
import airflow
from airflow import api
from airflow import jobs, settings
from airflow import configuration as conf
from airflow.exceptions import AirflowException, AirflowWebServerTimeout
from airflow.executors import GetDefaultExecutor
from airflow.models import (DagModel, DagBag, TaskInstance, DagPickle, DagRun, Variable, DagStat, DAG)
from airflow.ti_deps.dep_context import (DepContext, SCHEDULER_DEPS)
from airflow.utils import cli as cli_utils
from airflow.utils import db as db_utils
from airflow.utils.net import get_hostname
from airflow.utils.log.logging_mixin import (LoggingMixin, redirect_stderr, redirect_stdout)
from airflow.www.app import (cached_app, create_app)
from airflow.www_rbac.app import cached_app as cached_app_rbac
from airflow.www_rbac.app import create_app as create_app_rbac
from airflow.www_rbac.app import cached_appbuilder
from sqlalchemy import func
from sqlalchemy.orm import exc
PEP8 convention states to group imports in 3 groups in the following order, separated by a blank line:
- Standard library imports.
- Related third party imports.
- Local application/library specific imports.
The example above, formatted with PyCharm optimize imports:
from __future__ import print_function

import argparse
import getpass
import json
import logging
import os
import random
import re
import reprlib
import signal
import string
import subprocess
import sys
import textwrap
import threading
import time
import traceback
from builtins import input
from collections import namedtuple
from importlib import import_module
from urllib.parse import urlunparse

import daemon
import psutil
from daemon.pidfile import TimeoutPIDLockFile
from sqlalchemy import func
from sqlalchemy.orm import exc
from tabulate import tabulate

import airflow
from airflow import api
from airflow import configuration as conf
from airflow import jobs, settings
from airflow.exceptions import AirflowException, AirflowWebServerTimeout
from airflow.executors import GetDefaultExecutor
from airflow.models import (DagModel, DagBag, TaskInstance, DagPickle, DagRun, Variable, DagStat, DAG)
from airflow.models.connection import Connection
from airflow.ti_deps.dep_context import (DepContext, SCHEDULER_DEPS)
from airflow.utils import cli as cli_utils
from airflow.utils import db as db_utils
from airflow.utils.log.logging_mixin import (LoggingMixin, redirect_stderr, redirect_stdout)
from airflow.utils.net import get_hostname
from airflow.utils.timezone import parse as parsedate
from airflow.www.app import (cached_app, create_app)
from airflow.www_rbac.app import cached_app as cached_app_rbac
from airflow.www_rbac.app import cached_appbuilder
from airflow.www_rbac.app import create_app as create_app_rbac
Example #1 contained 6 groups of imports; the formatted result contains 4, with all imports ordered alphabetically, which IMO makes them much more readable.
Example #2: example_trigger_controller_dag.py
dag = DAG(
    dag_id='example_trigger_controller_dag',
    default_args={
        "owner": "airflow",
        "start_date": datetime.utcnow(),
    },
    schedule_interval='@once',
)
Formatted with Black:
dag = DAG( dag_id="example_trigger_controller_dag", default_args={"owner": "airflow", "start_date": datetime.utcnow()}, schedule_interval="@once", )
All quotes (single and double) are normalized to a single type (double in this case) for consistency. Black is configured with a maximum line length, and if a collection fits on a single line it is auto-formatted onto a single line. The .flake8 config sets a max-line-length of 110, but looking at the code, most of it appears to be formatted to a line length of roughly 90 characters.
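As a sketch of how that line length could be pinned when running Black from Python (assuming the black package's format_str and FileMode API with a line_length argument, which current Black releases expose):

import black

# Flattened version of example #2 above; format_str only needs the code to parse.
source = (
    "dag = DAG( dag_id='example_trigger_controller_dag', "
    'default_args={"owner": "airflow", "start_date": datetime.utcnow()}, '
    "schedule_interval='@once', )\n"
)

# line_length mirrors the max-line-length of 110 currently set in .flake8.
formatted = black.format_str(source, mode=black.FileMode(line_length=110))
print(formatted)

The same result can of course be obtained from the command line with black --line-length=110.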
Example #3: airflow/contrib/kubernetes/worker_configuration.py
return Pod(
    namespace=namespace,
    name=pod_id,
    image=kube_executor_config.image or self.kube_config.kube_image,
    image_pull_policy=(kube_executor_config.image_pull_policy or
                       self.kube_config.kube_image_pull_policy),
    cmds=airflow_command,
    labels={
        'airflow-worker': worker_uuid,
        'dag_id': dag_id,
        'task_id': task_id,
        'execution_date': execution_date
    },
    envs=self._get_environment(),
    secrets=self._get_secrets(),
    service_account_name=self.kube_config.worker_service_account_name,
    image_pull_secrets=self.kube_config.image_pull_secrets,
    init_containers=worker_init_container_spec,
    volumes=volumes,
    volume_mounts=volume_mounts,
    resources=resources,
    annotations=annotations,
    node_selectors=(kube_executor_config.node_selectors or
                    self.kube_config.kube_node_selectors),
    affinity=kube_executor_config.affinity
)
Formatted with Black:
return Pod(
    namespace=namespace,
    name=pod_id,
    image=kube_executor_config.image or self.kube_config.kube_image,
    image_pull_policy=(
        kube_executor_config.image_pull_policy or self.kube_config.kube_image_pull_policy
    ),
    cmds=airflow_command,
    labels={
        "airflow-worker": worker_uuid,
        "dag_id": dag_id,
        "task_id": task_id,
        "execution_date": execution_date,
    },
    envs=self._get_environment(),
    secrets=self._get_secrets(),
    service_account_name=self.kube_config.worker_service_account_name,
    image_pull_secrets=self.kube_config.image_pull_secrets,
    init_containers=worker_init_container_spec,
    volumes=volumes,
    volume_mounts=volume_mounts,
    resources=resources,
    annotations=annotations,
    node_selectors=(
        kube_executor_config.node_selectors or self.kube_config.kube_node_selectors
    ),
    affinity=kube_executor_config.affinity,
)
All bracketed items are given the same indentation, resulting in consistent nesting and (IMO) more readable code.
Considerations
I know of several code formatters: Black, YAPF and autopep8.
This blog post compares all 3.
Flake8 has a similar purpose but is limited to checking PEP8 compliance, which helps but is nowhere near the "strictness" of the tools above.
- Black is a tool by a Python core developer and deliberately offers very few configuration options. An important note is that Python 3.6+ is required to run it, although older Python code can also be formatted with Black.
- YAPF is a tool by Google and, in contrast to Black, very configurable.
- I don't consider autopep8 for this, as it doesn't enforce the same level of "strictness" as Black and YAPF do.
My suggestion is to go with YAPF, since it is configurable and we can define a style that follows most of the current Airflow conventions. IMO it's important to enforce a style for consistency, but I don't care too much about the actual style itself.
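To illustrate the configurability argument, here is a similarly small sketch of driving YAPF from Python, assuming its yapf_api.FormatCode entry point with a pep8-based style (a checked-in .style.yapf raising column_limit to 110 could be passed via style_config instead):

from yapf.yapflib.yapf_api import FormatCode

# The same flattened snippet as in the Black sketch above.
source = (
    "dag = DAG( dag_id='example_trigger_controller_dag', "
    'default_args={"owner": "airflow", "start_date": datetime.utcnow()}, '
    "schedule_interval='@once', )\n"
)

# FormatCode returns the reformatted source plus a flag saying whether anything changed;
# style_config can also point at a style file instead of a predefined style name.
formatted, changed = FormatCode(source, style_config="pep8")
print(formatted)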
12 Comments
Ash Berlin-Taylor
> IMO it's important to enforce a style for consistency, but I don't care too much about the actual style itself.
100% to this.
Bas Harenslak
PR ordering imports → https://github.com/apache/airflow/pull/4784
Chen Tong
Big +1.
IMO, the formatter needs to be:
1. Compatible with, or based on, PEP8 or any other style-related PEP.
2. Supported by the Python community in the long run.
Both tools do a good job on the two points above.
3. Able to support most of the Python versions we use in Airflow.
Black:
> It requires Python 3.6.0+ to run but you can reformat Python 2 code with it, too. https://github.com/ambv/black
YAPF:
> YAPF supports Python 2.7 and 3.6.4+. (Note that some Python 3 features may fail to parse with Python versions before 3.6.4.) https://github.com/google/yapf
These two tools satisfy my first two points, but both have limited support for Python versions below 3.6.
4. Easy to integrate into CI.
YAPF can be integrated into a CI process easily (a possible check is sketched below). See https://github.com/google/yapf#return-codes
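As a rough sketch of such a CI gate (assuming a yapf executable on the PATH and relying on the documented non-zero exit code of --diff when changes are needed; the paths below are only illustrative):

import subprocess
import sys

# Run YAPF in diff mode over the source trees; --diff makes it exit non-zero
# whenever at least one file would be reformatted.
result = subprocess.run(["yapf", "--diff", "--recursive", "airflow", "tests"])

# Propagate the return code so the CI job fails when formatting is off.
sys.exit(result.returncode)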
Thus, YAPF seems better in terms of functionality, ease of use, etc.
Fokko Driesprong
We're now using Python 3.6+ on master, so Black is an option. Is this something that we would reconsider? I'm all in for Black.
Jarek Potiuk
All for it, Fokko Driesprong. The only problem is how to introduce it. Introducing Black is probably going to change every single line in every single file.
Kamil Bregula
I'm also in favor of introducing Black. For a significant portion of the code, we can do this along with the introduction of AIP-21.
Jarek Potiuk
True. Why not take a similar approach as with pylint_todo?
Bas Harenslak
There's manual work to be done to fix all the Pylint errors, while with Black we can simply run it once over the entire codebase and we're done. It should be much simpler than Pylint; clashing with existing PRs is inevitable. I prefer to make the change as quickly as possible, to avoid half-done situations.
Running Black on Airflow gives me (black airflow --line-length=110 --exclude "airflow/www/node_modules/*"):
692 files changed, 20507 insertions(+), 21570 deletions(-)
(1 file failed to format, didn't bother investigating right now)
Bas Harenslak
The person making such a commit will gain eternal fame in the Airflow contributor charts.
Jarek Potiuk
Well. I'd argue that this is a rather modest change: just -1063 lines net. The good old wisdom says that your productivity is measured by the total number of lines of code removed, not added.
I agree that for Black we should make a one-time change + add a pre-commit hook + apply Black in the 1.10* branch.
There is only one problem: if you want to run `git blame`... But we might switch to git hyper-blame for that: https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html
J.
Fokko Driesprong
If someone has a PR that was submitted before Black was applied, they can apply Black themselves, and then the merge conflicts should be fixed, right?
Jarek Potiuk
Correct. There is just a problem with history/blame, but git-hyper-blame solves it nicely. I am all for moving to Black ASAP. Maybe we should start a thread on the devlist?