Contact: fenglu@google.com, in collaboration with etune@google.com, daniel.imberman@gmail.com, wwlian@google.com, ramanathana@google.com
This feature has been discontinued. Details: https://github.com/apache/airflow/pull/6768 |
---|
The ongoing Airflow KubernetesExecutor discussion does not yet cover how to bind credentials (e.g., GCP service accounts) to task Pods. Which credentials a Pod receives by default depends on how the Kubernetes cluster is provisioned; on GKE, for example, the Pods created inherit the default Compute Engine service account. This becomes a problem when users wish to attach a different service account to a task Pod. This document proposes a set of mechanisms to incorporate into the Airflow KubernetesExecutor design so that any Airflow task can specify a set of credentials to be pre-configured on its task Pod. We limit the scope of this document to GCP service accounts, but the same design can be applied to other types of credentials (e.g., AWS access keys).
We start by reviewing the current practice of specifying and using service accounts in Airflow. For GCP-specific Airflow operators (e.g., BigQueryOperator), the service account is specified indirectly by the connection ID, which is a primary key into the connections table in the Airflow metadata database. The connections table stores additional information on how to retrieve the service account details (the way service accounts are stored in the connections table today is not very satisfactory). For non-GCP operators (e.g., PythonOperator and BashOperator), handling service accounts is not explicitly supported by the Airflow framework and is left to individual workflows/DAGs. This approach unfortunately mixes credential management code with workflow business logic and is potentially insecure. The Airflow Kubernetes integration work opens up the possibility of managing task credentials inside the framework, freeing workflows from handling sensitive account information.
This design aims to address the following objectives:
Any task, including GCP-related tasks, can independently specify a service account to be attached to its task execution environment. This service account is the one used whenever get_application_default() is invoked.
Preserve existing connection ID semantics. Specifically, workflows that include GCP operators should continue to work under this new design with zero modification.
The mechanism for specifying a per-task service account should be simple and should not involve handling sensitive information.
We defer the following goals to a future design:
Dynamic service account binding during task execution.
Revoking a service account in the middle of task execution.
ACL support around service accounts (e.g., which group of workflow/dag users can access a particular service account).
The KubernetesExecutor config is extended to include a list of service accounts used:

    [kubernetes]
    gcp_service_accounts=key_name1=key_path1,key_name2=key_path2,key_name3=key_path3

where key_name is the service account ID (e.g., service-account@xxx.iam.gserviceaccount.com) and key_path is the location of the service account key file, which must be accessible from the KubernetesExecutor.
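To make the config format concrete, here is a minimal sketch of how the gcp_service_accounts value could be parsed into a mapping from service account ID to key file path. The function name parse_gcp_service_accounts is hypothetical and not part of Airflow; the sketch also assumes that key paths contain no commas.

```python
def parse_gcp_service_accounts(raw):
    """Parse 'key_name1=key_path1,key_name2=key_path2' into {key_name: key_path}.

    Hypothetical helper for illustration; not an Airflow API.
    """
    accounts = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and blank entries
        # Split on the first '=' only, so a path may itself contain '='.
        name, sep, path = entry.partition("=")
        name, path = name.strip(), path.strip()
        if not sep or not name or not path:
            raise ValueError("expected key_name=key_path, got %r" % entry)
        if name in accounts:
            raise ValueError("duplicate service account %r" % name)
        accounts[name] = path
    return accounts
```

Rejecting duplicates and malformed entries at startup keeps configuration mistakes from surfacing later as a task Pod with missing credentials.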
When the Airflow Scheduler and KubernetesExecutor are initialized, the following steps related to service account management are executed:
Read the gcp_service_accounts config and inject these service accounts into the Kubernetes cluster as secrets:

    kubectl create secret generic service-account-name --from-file=key.json=<PATH-TO-KEY-FILE>.json

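For readers who prefer to see the resulting object, the following sketch builds a Secret manifest equivalent to what the kubectl command above creates: an Opaque Secret with a single key.json entry. The helper names and the rule for deriving a Secret name from a service account ID are assumptions made for illustration; the source does not specify how Secret names are chosen.

```python
import base64
import re


def secret_name_for(account_id):
    """Derive a Secret name from a service account ID (hypothetical rule).

    Kubernetes object names must be lowercase DNS subdomain names, so
    characters such as '@' are replaced with '-'.
    """
    name = re.sub(r"[^a-z0-9.-]+", "-", account_id.lower())
    name = name[:253].strip("-.")  # 253 is the DNS subdomain length limit
    if not name:
        raise ValueError("cannot derive a Secret name from %r" % account_id)
    return name


def build_secret_manifest(account_id, key_bytes):
    """Return a Secret manifest (as a dict) holding the key file as key.json."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name_for(account_id)},
        "type": "Opaque",
        # Values under 'data' must be base64-encoded strings.
        "data": {"key.json": base64.b64encode(key_bytes).decode("ascii")},
    }
```

The resulting dict could be serialized to YAML or JSON and submitted to the cluster by whatever client the executor already uses.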
Start the service account initializer controller. The controller watches for uninitialized Pod objects and modifies each Pod manifest to include the service account volumes, volumeMounts, and the GOOGLE_APPLICATION_CREDENTIALS environment variable, after which the Kubernetes master can schedule the Pod:

    kubectl create -f initializer-controller-deployment.yaml

Create the service account initializer config:

    apiVersion: admissionregistration.k8s.io/v1alpha1
    kind: InitializerConfiguration
    metadata:
      name: example-service-account-config
    initializers:
      - name: serviceaccounts.google.com
        rules:
          - apiGroups:
              - ""
            apiVersions:
              - v1
            resources:
              - pods

When a task is about to run, the KubernetesExecutor adds the following annotation (if present in the task properties) to the Pod metadata:

    annotations:
      iam.cloud.google.com/service-account: "service-account-name"

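The mutation performed by the initializer controller can be sketched as a pure function over a Pod manifest. This is an illustration under stated assumptions, not the controller's actual code: the annotation value is taken to be the name of the Secret holding the key, and the volume name and mount path are hypothetical choices.

```python
import copy

ANNOTATION = "iam.cloud.google.com/service-account"
INITIALIZER = "serviceaccounts.google.com"
VOLUME_NAME = "gcp-service-account-key"  # hypothetical volume name
MOUNT_PATH = "/var/secrets/google"       # hypothetical mount location


def inject_service_account(pod):
    """Return a copy of the Pod manifest with the key Secret mounted.

    Assumes the annotation value is the name of the Secret that holds key.json.
    """
    pod = copy.deepcopy(pod)  # leave the caller's manifest untouched
    metadata = pod.setdefault("metadata", {})
    secret_name = (metadata.get("annotations") or {}).get(ANNOTATION)
    if secret_name:
        spec = pod.setdefault("spec", {})
        spec.setdefault("volumes", []).append(
            {"name": VOLUME_NAME, "secret": {"secretName": secret_name}}
        )
        for container in spec.get("containers", []):
            container.setdefault("volumeMounts", []).append(
                {"name": VOLUME_NAME, "mountPath": MOUNT_PATH, "readOnly": True}
            )
            container.setdefault("env", []).append(
                {
                    "name": "GOOGLE_APPLICATION_CREDENTIALS",
                    "value": MOUNT_PATH + "/key.json",
                }
            )
    # Remove this initializer from the pending list so the Pod can proceed.
    pending = (metadata.get("initializers") or {}).get("pending") or []
    remaining = [p for p in pending if p.get("name") != INITIALIZER]
    if remaining:
        metadata["initializers"]["pending"] = remaining
    else:
        metadata.pop("initializers", None)
    return pod
```

Pods without the annotation pass through with only the initializer entry removed, so tasks that do not request a service account keep the cluster default.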
The GCP service account annotation is specified as part of the task executor_config. Here is a concrete example:
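
    from airflow.operators.bash_operator import BashOperator

    # `dag` is assumed to be a DAG object defined earlier in the file.
    t = BashOperator(
        task_id='account-test',
        bash_command='gcloud auth application-default login',
        dag=dag,
        executor_config={
            'gcp-service-account': 'service-account@xxx.iam.gserviceaccount.com'
        },
    )
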
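On the executor side, translating the executor_config entry into the Pod annotation could look roughly like the sketch below. The function name pod_annotations is hypothetical. How a service account ID maps to the Secret name used in the annotation is not specified in this document, so the sketch passes the configured value through unchanged.

```python
GCP_SERVICE_ACCOUNT_KEY = "gcp-service-account"
ANNOTATION = "iam.cloud.google.com/service-account"


def pod_annotations(executor_config):
    """Build the Pod annotations implied by a task's executor_config.

    Hypothetical helper: returns an empty dict when the task does not
    request a service account, so the Pod keeps the cluster default.
    """
    account = (executor_config or {}).get(GCP_SERVICE_ACCOUNT_KEY)
    if not account:
        return {}
    return {ANNOTATION: account}
```

Because the task only names a service account and never handles key material, workflow code stays free of sensitive information.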
In summary, the mechanisms discussed in this document fulfill our initial objectives and introduce minimal implementation and configuration complexity. The design can easily be extended to other types of credentials (e.g., AWS access keys) by modifying the service account initializer and attaching additional annotations during task Pod creation.