
Contact: fenglu@google.com, in collaboration with etune@google.com, daniel.imberman@gmail.com, wwlian@google.com, ramanathana@google.com

Motivation

The ongoing Airflow KubernetesExecutor discussion does not yet address binding credentials (e.g., GCP service accounts) to task Pods. Depending on how the Kubernetes cluster is provisioned (in the case of GKE), the default Compute Engine service account is inherited by the Pods created. This becomes a problem when users wish to attach different service accounts to a task Pod. This document suggests a set of mechanisms to be incorporated into the Airflow KubernetesExecutor design so that any Airflow task can specify a set of credentials to be pre-configured on its task Pod. We limit the scope of this document to GCP service accounts, but the design can be seamlessly applied to other types of credentials (e.g., AWS access keys).

Background

We start by reviewing the current practice of specifying and using service accounts in Airflow. For GCP-specific operators (e.g., BigQueryOperator), the service account is indirectly specified by the connection ID, a primary key into the connections table in the Airflow metadata database. The connections table stores additional information on how to retrieve the service account details (the way service accounts are stored in the connections table today is not very satisfactory; details). For non-GCP operators (e.g., PythonOperator and BashOperator), handling service accounts is not explicitly supported by the Airflow framework and is left to individual workflows/dags. This approach unfortunately mixes credential-management code with workflow business logic and is potentially insecure. The Airflow Kubernetes integration work opens up the possibility of managing task credentials inside the framework and freeing workflows from handling sensitive account information.

Design

This design aims to address the following objectives:

  • Any task, including GCP-related tasks, can independently specify a service account to be attached to its execution environment. This particular service account is used whenever get_application_default() is invoked.

  • Preserve existing connection ID semantics. Specifically, workflows that include GCP operators should continue to work under this new design with zero modification.

  • The mechanism to specify a per-task service account should be simple and should not involve handling sensitive information.

We defer the following goals to a future design:

  • Dynamic service account binding during task execution.

  • Revoking a service account in the middle of task execution.

  • ACL support around service accounts (e.g., which group of workflow/dag users can access a particular service account).


High-Level Overview


Our design primarily leverages the Kubernetes admission controller mechanism to offload service account configuration when each task Pod is started. The set of service accounts used by Airflow workflows/dags is injected as secrets into the Kubernetes cluster. In addition, a service account initializer (proposed by ahmetb@google.com and etune@google.com) is started. An initializer is somewhat similar to a PodPreset but offers more flexibility after cluster creation. Setting up the service account initializer is a one-time configuration step; the initializer performs the actual Pod-manifest modifications for service accounts (i.e., volumes, volumeMounts, and the GOOGLE_APPLICATION_CREDENTIALS env variable) based on Pod annotations. The Pod annotations, derived from Airflow task properties, are provided by the KubernetesExecutor during task creation.

KubernetesExecutor

Config

The KubernetesExecutor config is extended to include a list of service accounts used:

[kubernetes]
gcp_service_accounts=key_name1=key_path1,key_name2=key_path2,key_name3=key_path3

where key_name is the service account ID (e.g., service-account@xxx.iam.gserviceaccount.com) and key_path is the service account key file location, accessible from the KubernetesExecutor.
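As a sketch of how the executor might consume this value, the helper below parses the comma-separated pairs into a name-to-path mapping. The function name and the sample account/path are illustrative assumptions, not part of the design:

```python
def parse_gcp_service_accounts(raw: str) -> dict:
    """Split the comma-separated key_name=key_path pairs from the
    [kubernetes] gcp_service_accounts config value into a dict."""
    accounts = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas / blank entries
        name, _, path = pair.partition("=")
        accounts[name.strip()] = path.strip()
    return accounts

# Hypothetical example value:
accounts = parse_gcp_service_accounts(
    "sa-etl@proj.iam.gserviceaccount.com=/etc/keys/etl.json")
```

Splitting on the first "=" only is safe here because service account IDs are email-style names that cannot contain "=".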

Setup

When the Airflow Scheduler and KubernetesExecutor are initialized, the following steps related to service account management are executed:

  1. Read the gcp_service_accounts config and inject these service accounts into the Kubernetes cluster as secrets: 

    kubectl create secret generic service-account-name --from-file=key.json=<PATH-TO-KEY-FILE>.json 
  2. Start the service account initializer controller (the controller watches for uninitialized Pod objects and modifies each Pod manifest to include the service account volumes, volumeMounts, and GOOGLE_APPLICATION_CREDENTIALS env variable before the Kubernetes master schedules the Pod): 

    kubectl create -f initializer-controller-deployment.yaml 
  3. Create the service account initializer config: 

    apiVersion: admissionregistration.k8s.io/v1alpha1
    kind: InitializerConfiguration
    metadata:
      name: example-service-account-config 
    initializers:
    - name: serviceaccounts.google.com
      rules: 
      - apiGroups:
        - "" 
        apiVersions:
        - v1
        resources:
        - pods  
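To illustrate step 2, here is a minimal sketch of the manifest mutation the initializer controller would perform, expressed as a function over a Pod manifest dict. The volume name and mount path are assumptions for illustration; the annotation key matches the one introduced in the Task Execution section below:

```python
ANNOTATION = "iam.cloud.google.com/service-account"

def initialize_pod(pod: dict) -> dict:
    """Mutate a Pod manifest (as a dict) the way the initializer would:
    mount the secret named by the annotation and point
    GOOGLE_APPLICATION_CREDENTIALS at the key file."""
    sa = pod.get("metadata", {}).get("annotations", {}).get(ANNOTATION)
    if not sa:
        return pod  # no annotation: leave the Pod untouched
    spec = pod.setdefault("spec", {})
    # Assumed volume name and mount path, for illustration only.
    spec.setdefault("volumes", []).append(
        {"name": "gcp-sa-key", "secret": {"secretName": sa}})
    for container in spec.get("containers", []):
        container.setdefault("volumeMounts", []).append(
            {"name": "gcp-sa-key",
             "mountPath": "/var/run/secrets/gcp",
             "readOnly": True})
        container.setdefault("env", []).append(
            {"name": "GOOGLE_APPLICATION_CREDENTIALS",
             "value": "/var/run/secrets/gcp/key.json"})
    return pod
```

The key.json filename matches the --from-file argument used when the secret was created in step 1, so get_application_default() finds a valid key at the path exported in GOOGLE_APPLICATION_CREDENTIALS.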

Task Execution

When a task is about to run, the following annotation (if present in task properties) is added to the Pod spec by KubernetesExecutor:

annotations:
  iam.cloud.google.com/service-account: "service-account-name"
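Inside the executor, this amounts to a small mapping from the task's executor_config into Pod metadata. The helper below is a sketch under that assumption; the gcp-service-account key matches the executor_config example in the next section:

```python
ANNOTATION_KEY = "iam.cloud.google.com/service-account"

def annotate_pod_metadata(metadata: dict, executor_config: dict) -> dict:
    """Copy the task's requested service account (if any) from its
    executor_config into the Pod annotations for the initializer."""
    sa = (executor_config or {}).get("gcp-service-account")
    if sa:
        metadata.setdefault("annotations", {})[ANNOTATION_KEY] = sa
    return metadata
```

Tasks that do not set gcp-service-account produce no annotation, so their Pods pass through the initializer unmodified and keep the cluster's default behavior.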

Airflow Operator/Task

The GCP service account annotation is specified as part of the task executor_config. Here is a concrete example:

t = BashOperator(
  task_id='account-test',
  bash_command='gcloud auth application-default login',
  dag=dag,
  executor_config={
    'gcp-service-account': 'service-account@xxx.iam.gserviceaccount.com'
  }
)

Conclusion

In summary, the mechanisms discussed in this document fulfill our initial objectives while introducing minimal implementation and configuration complexity. We can easily extend this design to other types of credentials (e.g., AWS access keys) by modifying the service account initializer and attaching additional annotations during task Pod creation.


