Status

State: Completed
Discussion Thread: AIP-6: Enforce the usage of a code formatter
JIRA:
Created:


In Release: 2.0.0

Motivation

I propose to enforce the usage of a linter and code formatter for Airflow so that all code follows the same conventions, resulting in more consistent and more readable code.

Currently the Airflow codebase is written by many different people (which is great!), but also in many different programming styles. There are many inconsistencies in code style which run correctly but make the Airflow code harder to read. I think enforcing a code format would benefit Airflow's readability because everything would be structured in the same style and people would stick to the same conventions when writing Airflow code. This would also prevent occasional PRs such as https://github.com/apache/incubator-airflow/pull/3714.

I created this AIP since introducing a code formatter will most likely change a large number of lines and result in some merge conflicts.

Some examples where I think a code formatter would help:

Example #1: imports in /airflow/bin/cli.py

from __future__ import print_function
import logging

import os
import subprocess
import textwrap
import random
import string
from importlib import import_module

import getpass
import reprlib
import argparse
from builtins import input
from collections import namedtuple

from airflow.models.connection import Connection
from airflow.utils.timezone import parse as parsedate
import json
from tabulate import tabulate

import daemon
from daemon.pidfile import TimeoutPIDLockFile
import signal
import sys
import threading
import traceback
import time
import psutil
import re
from urllib.parse import urlunparse

import airflow
from airflow import api
from airflow import jobs, settings
from airflow import configuration as conf
from airflow.exceptions import AirflowException, AirflowWebServerTimeout
from airflow.executors import GetDefaultExecutor
from airflow.models import (DagModel, DagBag, TaskInstance,
                            DagPickle, DagRun, Variable, DagStat, DAG)

from airflow.ti_deps.dep_context import (DepContext, SCHEDULER_DEPS)
from airflow.utils import cli as cli_utils
from airflow.utils import db as db_utils
from airflow.utils.net import get_hostname
from airflow.utils.log.logging_mixin import (LoggingMixin, redirect_stderr,
                                             redirect_stdout)
from airflow.www.app import (cached_app, create_app)
from airflow.www_rbac.app import cached_app as cached_app_rbac
from airflow.www_rbac.app import create_app as create_app_rbac
from airflow.www_rbac.app import cached_appbuilder

from sqlalchemy import func
from sqlalchemy.orm import exc

PEP8 convention states to group imports in 3 groups in the following order, separated by a blank line:

  1. Standard library imports.
  2. Related third party imports.
  3. Local application/library specific imports.

The example above, formatted with PyCharm's Optimize Imports:

from __future__ import print_function

import argparse
import getpass
import json
import logging
import os
import random
import re
import reprlib
import signal
import string
import subprocess
import sys
import textwrap
import threading
import time
import traceback
from builtins import input
from collections import namedtuple
from importlib import import_module
from urllib.parse import urlunparse

import daemon
import psutil
from daemon.pidfile import TimeoutPIDLockFile
from sqlalchemy import func
from sqlalchemy.orm import exc
from tabulate import tabulate

import airflow
from airflow import api
from airflow import configuration as conf
from airflow import jobs, settings
from airflow.exceptions import AirflowException, AirflowWebServerTimeout
from airflow.executors import GetDefaultExecutor
from airflow.models import (DagModel, DagBag, TaskInstance,
                            DagPickle, DagRun, Variable, DagStat, DAG)
from airflow.models.connection import Connection
from airflow.ti_deps.dep_context import (DepContext, SCHEDULER_DEPS)
from airflow.utils import cli as cli_utils
from airflow.utils import db as db_utils
from airflow.utils.log.logging_mixin import (LoggingMixin, redirect_stderr,
                                             redirect_stdout)
from airflow.utils.net import get_hostname
from airflow.utils.timezone import parse as parsedate
from airflow.www.app import (cached_app, create_app)
from airflow.www_rbac.app import cached_app as cached_app_rbac
from airflow.www_rbac.app import cached_appbuilder
from airflow.www_rbac.app import create_app as create_app_rbac

Example #1 contained imports scattered across many blank-separated groups, while the formatted result contains 4 groups, with all imports ordered alphabetically, which IMO makes them much more readable.

Example #2: example_trigger_controller_dag.py

dag = DAG(
    dag_id='example_trigger_controller_dag',
    default_args={
        "owner": "airflow",
        "start_date": datetime.utcnow(),
    },
    schedule_interval='@once',
)

Formatted with Black:

dag = DAG(
    dag_id="example_trigger_controller_dag",
    default_args={"owner": "airflow", "start_date": datetime.utcnow()},
    schedule_interval="@once",
)

All quotes (single and double) are normalized to one type (double in this case) for consistency. Black is configured with a maximum line length, and if a collection fits on a single line, it is auto-formatted onto a single line. The .flake8 file sets a max-line-length of 110, but looking at the code, most of it seems to be formatted to a line length of roughly 90 characters.
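If Black were chosen, its line length could be aligned with the existing flake8 limit via pyproject.toml. A hypothetical fragment (the value is illustrative, not an agreed-upon style):

```toml
# Illustrative only: align Black with the existing .flake8 max-line-length
[tool.black]
line-length = 110
```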

Example #3: airflow/contrib/kubernetes/worker_configuration.py

return Pod(
    namespace=namespace,
    name=pod_id,
    image=kube_executor_config.image or self.kube_config.kube_image,
    image_pull_policy=(kube_executor_config.image_pull_policy or
                       self.kube_config.kube_image_pull_policy),
    cmds=airflow_command,
    labels={
        'airflow-worker': worker_uuid,
        'dag_id': dag_id,
        'task_id': task_id,
        'execution_date': execution_date
    },
    envs=self._get_environment(),
    secrets=self._get_secrets(),
    service_account_name=self.kube_config.worker_service_account_name,
    image_pull_secrets=self.kube_config.image_pull_secrets,
    init_containers=worker_init_container_spec,
    volumes=volumes,
    volume_mounts=volume_mounts,
    resources=resources,
    annotations=annotations,
    node_selectors=(kube_executor_config.node_selectors or
                    self.kube_config.kube_node_selectors),
    affinity=kube_executor_config.affinity
)

Formatted with Black:

return Pod(
    namespace=namespace,
    name=pod_id,
    image=kube_executor_config.image or self.kube_config.kube_image,
    image_pull_policy=(
        kube_executor_config.image_pull_policy
        or self.kube_config.kube_image_pull_policy
    ),
    cmds=airflow_command,
    labels={
        "airflow-worker": worker_uuid,
        "dag_id": dag_id,
        "task_id": task_id,
        "execution_date": execution_date,
    },
    envs=self._get_environment(),
    secrets=self._get_secrets(),
    service_account_name=self.kube_config.worker_service_account_name,
    image_pull_secrets=self.kube_config.image_pull_secrets,
    init_containers=worker_init_container_spec,
    volumes=volumes,
    volume_mounts=volume_mounts,
    resources=resources,
    annotations=annotations,
    node_selectors=(
        kube_executor_config.node_selectors
        or self.kube_config.kube_node_selectors
    ),
    affinity=kube_executor_config.affinity,
)

All bracketed items are given the same indentation, resulting in consistent indentation levels and (IMO) more readable code.

Considerations

I know of several code formatters:

  1. Black
  2. YAPF
  3. Autopep8

This blog post compares all 3.

Flake8 has a similar purpose but is restricted to checking PEP8 compliance, which helps but is nowhere near the "strictness" of the tools above.

  • Black is a tool by a Python core developer and is deliberately almost unconfigurable: there are very few configuration options. Note that Python 3.6+ is required to run it, although older Python code can be formatted with Black too.
  • YAPF is a tool by Google and, in contrast to Black, very configurable.
  • I don't consider autopep8 for this as it doesn't enforce a level of "strictness" similar to Black and YAPF.

My suggestion is to go with YAPF, since it is configurable and we can define a style which follows most of the current Airflow conventions. IMO it's important to enforce a style for consistency, but I don't care too much about the actual style itself.
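For completeness, a minimal sketch of what such a YAPF configuration might look like (the specific knobs here are illustrative assumptions, not a proposed style):

```ini
; .style.yapf — hypothetical starting point, to be tuned to Airflow conventions
[style]
based_on_style = pep8
column_limit = 110
```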

12 Comments

  1. > IMO it's important to enforce a style for consistency, but I don't care too much about the actual style itself.

    100% to this.

  2. Big +1.

    IMO, the formatter needs to be:

    1. compatible with/based on PEP8 or any other style PEP.

    2. supported by the Python community in the long run.

    Both tools do a good job on these two points.

    3. supportive of most of the Python versions we use in Airflow.

    Black:

    > It requires Python 3.6.0+ to run but you can reformat Python 2 code with it, too. https://github.com/ambv/black

    YAPF:

    > YAPF supports Python 2.7 and 3.6.4+. (Note that some Python 3 features may fail to parse with Python versions before 3.6.4.) https://github.com/google/yapf

    Both tools satisfy my first two points, but both have limited support for running under Python versions below 3.6.

    4. easy to integrate into CI.

    YAPF can be easily integrated into a CI process. See https://github.com/google/yapf#return-codes


    Thus, YAPF seems the better choice in functionality, ease of use, etc.

  3. We're now using Python 3.6+ on Master, so Black is an option. Is this something that we would reconsider? I'm all in for Black (smile)

  4. All for it Fokko Driesprong. The only problem is how to introduce it. Introducing Black is probably going to change every single line in every single file (smile)

  5. I'm also in favor of Black's introduction. For a significant portion of the code, we can do this with the introduction of AIP-21

  6. True. Why not take a similar approach as with pylint_todo?

  7. There's manual work to be done fixing all the Pylint errors, while with Black we can simply run it once over the entire codebase and we're done. It should be much simpler than Pylint, although clashing with existing PRs is inevitable. I prefer to make the change as quickly as possible, to avoid half-done situations.

    Running Black on Airflow gives me (black airflow --line-length=110 --exclude "airflow/www/node_modules/*"):

    692 files changed, 20507 insertions(+), 21570 deletions(-)

    (1 file failed to format, didn't bother investigating right now)

  8. The person making such a commit will gain eternal fame in the Airflow contributors charts(big grin)

  9. Well, I argue that this is a rather modest change. Just -1063 lines net. The good old wisdom says that your productivity is measured by the total number of lines of code removed, not added (wink).

    I agree for black we should make one-time change + add pre-commit + apply black in 1.10* branch.

    There is only one problem: if you want to run `git blame`... But we might switch to git-hyper-blame for that: https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html

    J.

  10. If someone has a PR that was submitted before black was applied, they can apply black, and then the merge conflicts should be fixed, right?


  11. Correct. There is just a problem with history/blame. But git-hyper-blame solves it nicely. I am all for moving to black asap. Maybe we should start a thread @ devlist?
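The one-time reformat plus pre-commit hook discussed in the comments could be wired up roughly as follows (a sketch: the rev is a placeholder version, the line length mirrors the flake8 limit, and the repo URL is the one the proposal itself references):

```yaml
# Hypothetical .pre-commit-config.yaml entry enforcing Black on every commit
repos:
  - repo: https://github.com/ambv/black
    rev: 19.3b0  # illustrative pinned version
    hooks:
      - id: black
        args: [--line-length=110]
```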