Data Platform/Systems/Airflow/Kubernetes
This page covers the Airflow instances we deploy to Kubernetes and what is specific to them. We assume that each Airflow instance is deployed alongside a dedicated CloudnativePG cluster running in the same namespace.
Note: replace airflow-test-k8s with other instance names where appropriate.
Architecture
Components
Airflow is deployed via the airflow chart. The release is composed of 3 Deployment resources:
- the webserver
- the scheduler
- the kerberos ticket renewer
Executor
While we migrate the Airflow instances to Kubernetes, the scheduler will be configured with LocalExecutor, meaning that tasks are executed as subprocesses of the scheduler process. However, the chart was developed with KubernetesExecutor in mind, meaning that every DAG task should ultimately run in a Kubernetes pod.
DAGs deployment
For the moment, any Airflow instance running on Kubernetes syncs the main branch of airflow-dags every 5 minutes using https://github.com/kubernetes/git-sync, meaning that any merged MR should be reflected in Airflow within about 5 minutes.
In the near future, the model will change from pull to push. When an MR is merged, the GitLab CI will send a POST request to blunderbuss, which will trigger a sync of the main branch into the volume mounted by the Airflow schedulers.
Even when that is in place, we'll still be able to use git-sync to synchronize feature branches in development instances.
Logging
The logs of the Airflow components themselves are sent to our observability pipeline and are accessible through logstash. The DAG task logs, however, are uploaded to S3 after completion. Streaming the logs of an ongoing DAG task can be done from the web UI, and relies on the Kubernetes Logs API (when using KubernetesExecutor) or simply tails local logs (when using LocalExecutor).
Security
Webserver authentication and authorization
Access to the webserver is OIDC authenticated, and a user's role is derived from their LDAP groups. For example, SREs (members of the ops LDAP group) are automatically given the Admin role. The mapping can be customized per instance, so that we can define LDAP groups for per-instance admins and per-instance members.
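The group-to-role mapping can be pictured as a lookup table. The following is a minimal sketch, not the chart's actual configuration: only the ops to Admin entry comes from this page, while the other group names, the `resolve_roles` helper and the `Public` fallback are hypothetical and used purely for illustration.

```python
# Hypothetical mapping from LDAP group to Airflow roles.
# Only the "ops" -> "Admin" entry is taken from this page.
ROLE_MAPPING = {
    "ops": ["Admin"],
    "example-instance-admins": ["Admin"],   # hypothetical group name
    "example-instance-members": ["User"],   # hypothetical group name
}


def resolve_roles(ldap_groups, mapping=ROLE_MAPPING, default="Public"):
    """Return the sorted, de-duplicated roles granted by a user's LDAP groups.

    Falls back to `default` (an assumption) when no group is mapped.
    """
    roles = {role for group in ldap_groups for role in mapping.get(group, [])}
    return sorted(roles) if roles else [default]


print(resolve_roles(["ops", "some-unmapped-group"]))
print(resolve_roles(["example-instance-members"]))
```

Because the mapping is plain data, a per-instance override only needs to add or replace entries in the table.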
API access
Access to the Airflow API will be Kerberos authenticated, meaning that:
- services will be able to access the API by authenticating to Kerberos via their own keytab
- users will be able to access the API by authenticating to Kerberos via their password and kinit
Kerberos
We generate a keytab for each instance. It will be stored as a base64-encoded secret and only mounted on the airflow-kerberos pod, which is in charge of obtaining (and regularly renewing) a TGT. The TGT is itself mounted into every pod that needs to communicate with Kerberised systems (aka the worker pods).
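Note that base64 is an encoding and not encryption: anyone who can read the secret can recover the keytab. A small sketch of the round trip, using placeholder bytes in place of a real keytab (the helper names are hypothetical):

```python
import base64


def encode_keytab(raw: bytes) -> str:
    """Encode binary keytab content as a base64 string."""
    return base64.b64encode(raw).decode("ascii")


def decode_keytab(encoded: str) -> bytes:
    """Recover the binary keytab content from its base64 representation."""
    return base64.b64decode(encoded)


# Placeholder bytes only; this is not a real keytab.
fake_keytab = b"placeholder keytab bytes \x00\x01\xff"
encoded = encode_keytab(fake_keytab)
print(decode_keytab(encoded) == fake_keytab)
```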
Kubernetes RBAC
When using the KubernetesExecutor, the scheduler needs to be able to perform CRUD operations on Pods, and the webserver needs to be able to tail Pod logs. As the user used to deploy charts does not have permissions to create Role and RoleBinding resources, we deploy the chart with a specific user/role that can, called deploy-airflow.
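To make the required permissions concrete, here is a rough sketch of what such Role manifests could look like, expressed as Python dictionaries. The role names, the namespace and the exact verb lists are assumptions standing in for "CRUD on Pods" and "tail Pod logs"; the chart's actual templates may differ.

```python
import json


def pod_role(name, namespace, resources, verbs):
    """Build a namespaced Role manifest granting `verbs` on core-API `resources`."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        # The empty string designates the core API group, where Pods live.
        "rules": [{"apiGroups": [""], "resources": resources, "verbs": verbs}],
    }


# Hypothetical names; the verb lists are an assumption, not the chart's values.
scheduler_role = pod_role(
    "example-scheduler-pods", "airflow-test-k8s", ["pods"],
    ["create", "get", "list", "watch", "patch", "delete"],
)
webserver_role = pod_role(
    "example-webserver-pod-logs", "airflow-test-k8s", ["pods/log"], ["get"],
)

print(json.dumps(scheduler_role, indent=2))
```

Each Role would then be bound to the relevant service account with a RoleBinding, which is why the deploying user needs permission to create both resource kinds.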
UNIX user impersonation
Each Airflow instance has a dedicated keytab, whose first principal is of the form <user>/airflow-<instance-name>.discovery.wmnet@WIKIMEDIA. This ensures that any interaction with HDFS, Spark, etc. will impersonate the <user> user.
For example, the first principal of the airflow-test-k8s instance is analytics/airflow-test-k8s.discovery.wmnet@WIKIMEDIA, which enables impersonation of the analytics user in Hadoop.
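The structure of such a principal can be illustrated with a short parser. This is a sketch for explanation only; the `Principal` type and `parse_principal` helper are hypothetical and not part of the chart or of any Kerberos library.

```python
from typing import NamedTuple


class Principal(NamedTuple):
    user: str   # the UNIX user that gets impersonated
    host: str   # the instance's service hostname
    realm: str  # the Kerberos realm


def parse_principal(principal: str) -> Principal:
    """Split '<user>/<host>@<REALM>' into its three components."""
    name, _, realm = principal.rpartition("@")
    user, sep, host = name.partition("/")
    if not (user and sep and host and realm):
        raise ValueError(f"unexpected principal format: {principal!r}")
    return Principal(user, host, realm)


p = parse_principal("analytics/airflow-test-k8s.discovery.wmnet@WIKIMEDIA")
print(p.user, p.host, p.realm)
```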
Database access
The airflow chart was designed to run alongside a CloudnativePG cluster running in the same namespace. However, it can be configured to use an "external" PG database, such as an-db1001.eqiad.wmnet, for transitioning purposes. The ultimate goal is to have each instance run alongside its own PG cluster.
When configured to use a CloudnativePG cluster, access to the DB goes through PGBouncer instead of hitting PG directly. This was done to mitigate the following issue, as described in the Airflow documentation:
Airflow is known - especially in high-performance setup - to open many connections to metadata database. This might cause problems for Postgres resource usage, because in Postgres, each connection creates a new process and it makes Postgres resource-hungry when a lot of connections are opened. Therefore we recommend to use PGBouncer as database proxy for all Postgres production installations. PGBouncer can handle connection pooling from multiple components, but also in case you have remote database with potentially unstable connectivity, it will make your DB connectivity much more resilient to temporary network problems.
Connections
Connections are managed via helm values, under .Values.config.airflow.connections. As such, they are managed by a LocalFilesystemBackend secret manager, and will not be visible in the web UI.
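A file-backed secrets backend reads connections from a file rather than from the metadata database. As a rough illustration, the sketch below builds an Airflow-style connection URI and writes a mapping of connection ID to URI as JSON. The connection ID, host, credentials and the `connection_uri` helper are all hypothetical, and this page does not state which file format the chart actually renders.

```python
import json
from urllib.parse import quote, urlencode


def connection_uri(conn_type, host, port=None, login=None, password=None, extra=None):
    """Build a URI of the form type://login:password@host:port/?key=value."""
    auth = ""
    if login is not None:
        # Percent-encode credentials so characters like '@' do not break parsing.
        auth = quote(login, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    netloc = auth + host + (f":{port}" if port is not None else "")
    query = "/?" + urlencode(extra) if extra else ""
    return f"{conn_type}://{netloc}{query}"


# Hypothetical connection, for illustration only.
connections = {
    "example_postgres": connection_uri(
        "postgres", "db.example.org", 5432, "example_user", "p@ss"
    ),
}
print(json.dumps(connections, indent=2))
```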
Management DAGs
The Kubernetes Airflow instances come with built-in maintenance DAGs, performing actions such as:
- removing task logs from S3 after they reach a certain age
- expunging data that has reached a certain age from DB tables
- removing obsolete Airflow DAG/Task lineage data from DataHub
- ...
These DAGs are tagged with airflow_maintenance.
You can set the following Airflow variables in your release values, under config.airflow.variables, to configure the Airflow maintenance DAGs:
- s3_log_retention_days (default value: 30): number of days of task logs to keep in S3
- db_cleanup_tables: a comma-separated list of tables that will be regularly expunged of old data, to keep the database as lean as possible
- db_cleanup_retention_days: if specified along with db_cleanup_tables, specifies the number of days after which data will be cleaned from these tables
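To illustrate how these variables could be interpreted, the sketch below parses a comma-separated table list and computes the cutoff date implied by a retention period. The helper names are hypothetical and the table names are only examples; the actual maintenance DAGs may implement this differently.

```python
from datetime import datetime, timedelta, timezone


def parse_tables(raw: str) -> list[str]:
    """Split a comma-separated variable value into clean table names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def retention_cutoff(retention_days: int, now: datetime) -> datetime:
    """Return the timestamp before which data is considered expired."""
    return now - timedelta(days=retention_days)


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
print(retention_cutoff(30, now).date())            # 2024-05-31
print(parse_tables("task_instance, dag_run,log"))  # example table names
```

With s3_log_retention_days left at its default of 30, a cleanup run on 30 June would treat anything older than 31 May as expired.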
Operations
Moved to Data Platform/Systems/Airflow/Kubernetes/Operations
I'm getting paged
Pods are not running
If you're getting an alert or getting paged because the app isn't running, investigate whether something in the application logs (see the checklist section) could explain the crash. In case of a recurring crash, the pod will be in the CrashLoopBackOff state in Kubernetes. To check whether this is the case, ssh to the deployment server and run the following commands:
kube_env <namespace> dse-k8s-eqiad
kubectl get pods
Then you can tail the logs as needed. Feel free to refer to the log dashboard listed in the checklist.
If no pod at all is displayed, re-deploy the app by following the Kubernetes deployment instructions.
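When many pods are listed, the check can be scripted. The following sketch assumes input shaped like the output of `kubectl get pods -o json` and reports pods that have a container waiting in CrashLoopBackOff. The `crashlooping_pods` helper and the sample pod names are hypothetical, and init containers are not inspected.

```python
def crashlooping_pods(pod_list: dict) -> list[str]:
    """Return names of pods with a container waiting in CrashLoopBackOff."""
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        if any(
            s.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff"
            for s in statuses
        ):
            names.append(pod["metadata"]["name"])
    return names


# Hand-written sample mimicking the structure of a PodList.
sample = {
    "items": [
        {
            "metadata": {"name": "example-scheduler-abc"},
            "status": {"containerStatuses": [
                {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
            ]},
        },
        {
            "metadata": {"name": "example-webserver-def"},
            "status": {"containerStatuses": [{"state": {"running": {}}}]},
        },
    ]
}
print(crashlooping_pods(sample))
```

In practice the dictionary would come from parsing the JSON printed by kubectl on the deployment server.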
How to
Use the airflow CLI
brouberol@deploy2002:~$ kube_env airflow-test-k8s-deploy dse-k8s-eqiad
brouberol@deploy2002:~$ kubectl exec -it $(kubectl get pod -l app=airflow,component=webserver --no-headers -o custom-columns=":metadata.name") -c airflow-production -- airflow
Usage: airflow [-h] GROUP_OR_COMMAND ...
Positional Arguments:
GROUP_OR_COMMAND
Groups
config View configuration
connections Manage connections
dags Manage DAGs
db Database operations
jobs Manage jobs
kubernetes Tools to help run the KubernetesExecutor
pools Manage pools
providers Display providers
roles Manage roles
tasks Manage tasks
users Manage users
variables Manage variables
Commands:
cheat-sheet Display cheat sheet
dag-processor Start a standalone Dag Processor instance
info Show information about current Airflow and environment
kerberos Start a kerberos ticket renewer
plugins Dump information about loaded plugins
rotate-fernet-key
Rotate encrypted connection credentials and variables
scheduler Start a scheduler instance
standalone Run an all-in-one copy of Airflow
sync-perm Update permissions for existing roles and optionally DAGs
triggerer Start a triggerer instance
version Show the version
webserver Start a Airflow webserver instance
Options:
-h, --help show this help message and exit
airflow command error: the following arguments are required: GROUP_OR_COMMAND, see help above.
command terminated with exit code 2