Status
Motivation
I propose to enforce the use of a linter and code formatter for Airflow so that all code follows the same conventions, resulting in more consistent and more readable code.
Currently the Airflow codebase is written by many different people (which is great!), but also in many different programming styles. There are many inconsistencies in code style which are functionally correct but make the Airflow code harder to read. I think enforcing a code format would benefit Airflow's code readability, because everything would be structured in the same style and people would stick to the same conventions when writing Airflow code. This would also prevent occasional PRs such as https://github.com/apache/incubator-airflow/pull/3714.
I created this AIP since introducing a code formatter will most likely change a large number of lines and result in a few code conflicts.
Some examples where I think a code formatter would help:
Example #1: imports in /airflow/bin/cli.py
from __future__ import print_function
import logging
import os
import subprocess
import textwrap
import random
import string
from importlib import import_module
import getpass
import reprlib
import argparse
from builtins import input
from collections import namedtuple
from airflow.models.connection import Connection
from airflow.utils.timezone import parse as parsedate
import json
from tabulate import tabulate
import daemon
from daemon.pidfile import TimeoutPIDLockFile
import signal
import sys
import threading
import traceback
import time
import psutil
import re
from urllib.parse import urlunparse
import airflow
from airflow import api
from airflow import jobs, settings
from airflow import configuration as conf
from airflow.exceptions import AirflowException, AirflowWebServerTimeout
from airflow.executors import GetDefaultExecutor
from airflow.models import (DagModel, DagBag, TaskInstance, DagPickle, DagRun, Variable, DagStat, DAG)
from airflow.ti_deps.dep_context import (DepContext, SCHEDULER_DEPS)
from airflow.utils import cli as cli_utils
from airflow.utils import db as db_utils
from airflow.utils.net import get_hostname
from airflow.utils.log.logging_mixin import (LoggingMixin, redirect_stderr, redirect_stdout)
from airflow.www.app import (cached_app, create_app)
from airflow.www_rbac.app import cached_app as cached_app_rbac
from airflow.www_rbac.app import create_app as create_app_rbac
from airflow.www_rbac.app import cached_appbuilder
from sqlalchemy import func
from sqlalchemy.orm import exc
PEP8 convention states to group imports in 3 groups in the following order, separated by a blank line:
- Standard library imports.
- Related third party imports.
- Local application/library specific imports.
The example above, formatted with PyCharm optimize imports:
from __future__ import print_function

import argparse
import getpass
import json
import logging
import os
import random
import re
import reprlib
import signal
import string
import subprocess
import sys
import textwrap
import threading
import time
import traceback
from builtins import input
from collections import namedtuple
from importlib import import_module
from urllib.parse import urlunparse

import daemon
import psutil
from daemon.pidfile import TimeoutPIDLockFile
from sqlalchemy import func
from sqlalchemy.orm import exc
from tabulate import tabulate

import airflow
from airflow import api
from airflow import configuration as conf
from airflow import jobs, settings
from airflow.exceptions import AirflowException, AirflowWebServerTimeout
from airflow.executors import GetDefaultExecutor
from airflow.models import (DagModel, DagBag, TaskInstance, DagPickle, DagRun, Variable, DagStat, DAG)
from airflow.models.connection import Connection
from airflow.ti_deps.dep_context import (DepContext, SCHEDULER_DEPS)
from airflow.utils import cli as cli_utils
from airflow.utils import db as db_utils
from airflow.utils.log.logging_mixin import (LoggingMixin, redirect_stderr, redirect_stdout)
from airflow.utils.net import get_hostname
from airflow.utils.timezone import parse as parsedate
from airflow.www.app import (cached_app, create_app)
from airflow.www_rbac.app import cached_app as cached_app_rbac
from airflow.www_rbac.app import cached_appbuilder
from airflow.www_rbac.app import create_app as create_app_rbac
Example #1 contained 6 groups of imports; the formatted result contains 4, with all imports ordered alphabetically, which IMO makes them much more readable.
Example #2: example_trigger_controller_dag.py
dag = DAG(
    dag_id='example_trigger_controller_dag',
    default_args={
        "owner": "airflow",
        "start_date": datetime.utcnow(),
    },
    schedule_interval='@once',
)
Formatted with Black:
dag = DAG( dag_id="example_trigger_controller_dag", default_args={"owner": "airflow", "start_date": datetime.utcnow()}, schedule_interval="@once", )
All quotes (single and double) are normalized to a single type (double in this case) for consistency. Black is configured with a maximum line length, and if a collection fits on a single line it is auto-formatted onto a single line. The .flake8 config sets a max-line-length of 110, but looking at the code, most of it appears to be formatted to a line length of roughly 90 characters.
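As a sketch of how that line length could be pinned when running Black from Python (assuming the black package's format_str and FileMode API with a line_length argument, which current Black releases expose):

import black

# Flattened version of example #2 above; format_str only needs the code to parse.
source = (
    "dag = DAG( dag_id='example_trigger_controller_dag', "
    'default_args={"owner": "airflow", "start_date": datetime.utcnow()}, '
    "schedule_interval='@once', )\n"
)

# line_length mirrors the max-line-length of 110 currently set in .flake8.
formatted = black.format_str(source, mode=black.FileMode(line_length=110))
print(formatted)

The same result can of course be obtained from the command line with black --line-length=110.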
Example #3: airflow/contrib/kubernetes/worker_configuration.py
return Pod(
    namespace=namespace,
    name=pod_id,
    image=kube_executor_config.image or self.kube_config.kube_image,
    image_pull_policy=(kube_executor_config.image_pull_policy or
                       self.kube_config.kube_image_pull_policy),
    cmds=airflow_command,
    labels={
        'airflow-worker': worker_uuid,
        'dag_id': dag_id,
        'task_id': task_id,
        'execution_date': execution_date
    },
    envs=self._get_environment(),
    secrets=self._get_secrets(),
    service_account_name=self.kube_config.worker_service_account_name,
    image_pull_secrets=self.kube_config.image_pull_secrets,
    init_containers=worker_init_container_spec,
    volumes=volumes,
    volume_mounts=volume_mounts,
    resources=resources,
    annotations=annotations,
    node_selectors=(kube_executor_config.node_selectors or
                    self.kube_config.kube_node_selectors),
    affinity=kube_executor_config.affinity
)
Formatted with Black:
return Pod(
    namespace=namespace,
    name=pod_id,
    image=kube_executor_config.image or self.kube_config.kube_image,
    image_pull_policy=(
        kube_executor_config.image_pull_policy or self.kube_config.kube_image_pull_policy
    ),
    cmds=airflow_command,
    labels={
        "airflow-worker": worker_uuid,
        "dag_id": dag_id,
        "task_id": task_id,
        "execution_date": execution_date,
    },
    envs=self._get_environment(),
    secrets=self._get_secrets(),
    service_account_name=self.kube_config.worker_service_account_name,
    image_pull_secrets=self.kube_config.image_pull_secrets,
    init_containers=worker_init_container_spec,
    volumes=volumes,
    volume_mounts=volume_mounts,
    resources=resources,
    annotations=annotations,
    node_selectors=(
        kube_executor_config.node_selectors or self.kube_config.kube_node_selectors
    ),
    affinity=kube_executor_config.affinity,
)
All bracketed items are given the same indentation, resulting in consistent nesting and (IMO) more readable code.
Considerations
I know of several code formatters: Black, YAPF and autopep8.
This blog post compares all 3.
Flake8 has a similar purpose but is limited to checking PEP8 compliance, which helps but is nowhere near the "strictness" of the tools above.
- Black is a tool by a Python core developer and deliberately offers very few configuration options. An important note is that Python 3.6+ is required to run it, although older Python code can also be formatted with Black.
- YAPF is a tool by Google and, in contrast to Black, very configurable.
- I don't consider autopep8 for this, as it doesn't enforce the same level of "strictness" as Black and YAPF do.
My suggestion is to go with YAPF, since it is configurable and we can define a style that follows most of the current Airflow conventions. IMO it's important to enforce a style for consistency, but I don't care too much about the actual style itself.
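To illustrate the configurability argument, here is a similarly small sketch of driving YAPF from Python, assuming its yapf_api.FormatCode entry point with a pep8-based style (a checked-in .style.yapf raising column_limit to 110 could be passed via style_config instead):

from yapf.yapflib.yapf_api import FormatCode

# The same flattened snippet as in the Black sketch above.
source = (
    "dag = DAG( dag_id='example_trigger_controller_dag', "
    'default_args={"owner": "airflow", "start_date": datetime.utcnow()}, '
    "schedule_interval='@once', )\n"
)

# FormatCode returns the reformatted source plus a flag saying whether anything changed;
# style_config can also point at a style file instead of a predefined style name.
formatted, changed = FormatCode(source, style_config="pep8")
print(formatted)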
12 Comments
Ash Berlin-Taylor
> IMO it's important to enforce a style for consistency, but I don't care too much about the actual style itself.
100% to this.
Bas Harenslak
PR ordering imports → https://github.com/apache/airflow/pull/4784
Chen Tong
Big +1.
IMO, the formatter needs to be:
1. Compatible with, or based on, PEP8 or any other style-related PEP.
2. Supported by the Python community in the long run.
Both tools do a good job on the two points above.
3. Able to support most of the Python versions we use in Airflow.
Black:
> It requires Python 3.6.0+ to run but you can reformat Python 2 code with it, too. https://github.com/ambv/black
YAPF:
> YAPF supports Python 2.7 and 3.6.4+. (Note that some Python 3 features may fail to parse with Python versions before 3.6.4.) https://github.com/google/yapf
These two tools satisfy my first two points, but both have limited support for Python versions below 3.6.
4. Easy to integrate into CI.
YAPF can be integrated into a CI process easily (a possible check is sketched below). See https://github.com/google/yapf#return-codes
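As a rough sketch of such a CI gate (assuming a yapf executable on the PATH and relying on the documented non-zero exit code of --diff when changes are needed; the paths below are only illustrative):

import subprocess
import sys

# Run YAPF in diff mode over the source trees; --diff makes it exit non-zero
# whenever at least one file would be reformatted.
result = subprocess.run(["yapf", "--diff", "--recursive", "airflow", "tests"])

# Propagate the return code so the CI job fails when formatting is off.
sys.exit(result.returncode)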
Thus, YAPF seems better in terms of functionality, ease of use, etc.
Fokko Driesprong
We're now using Python 3.6+ on master, so Black is an option. Is this something that we would reconsider? I'm all in for Black.
Jarek Potiuk
All for it, Fokko Driesprong. The only problem is how to introduce it. Introducing Black is probably going to change every single line in every single file.
Kamil Bregula
I'm also in favor of introducing Black. For a significant portion of the code, we can do this along with the introduction of AIP-21.
Jarek Potiuk
True. Why not take a similar approach as with pylint_todo?
Bas Harenslak
There's manual work to be done to fix all the Pylint errors, while with Black we can simply run it once over the entire codebase and we're done. It should be much simpler than Pylint; clashing with existing PRs is inevitable. I prefer to make the change as quickly as possible, to avoid half-done situations.
Running Black on Airflow gives me (black airflow --line-length=110 --exclude "airflow/www/node_modules/*"):
692 files changed, 20507 insertions(+), 21570 deletions(-)
(1 file failed to format, didn't bother investigating right now)
Bas Harenslak
The person making such a commit will gain eternal fame in the Airflow contributor charts.
Jarek Potiuk
Well. I'd argue that this is a rather modest change: just -1063 lines net. The good old wisdom says that your productivity is measured by the total number of lines of code removed, not added.
I agree that for Black we should make a one-time change + add a pre-commit hook + apply Black in the 1.10* branch.
There is only one problem: if you want to run `git blame`... But we might switch to git hyper-blame for that: https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html
J.
Fokko Driesprong
If someone has a PR that was submitted before Black was applied, they can apply Black themselves, and then the merge conflicts should be fixed, right?
Jarek Potiuk
Correct. There is just a problem with history/blame, but git-hyper-blame solves it nicely. I am all for moving to Black ASAP. Maybe we should start a thread on the devlist?