Data Platform/Systems/Airflow
Apache Airflow is a workflow scheduler. Developers declare job workflows as DAGs (directed acyclic graphs) using Airflow's Python API.
This page documents the Data Engineering managed Airflow instances in the Analytics Cluster. As of May 2025, we are running Airflow 2.10.5 (docs).
If you wish to develop DAGs with Airflow, you can find more information on the Airflow Developer guide page.
Airflow setup and conventions
The Data Engineering team maintains several Airflow instances. These instances are usually team-specific. Teams have full control over their Airflow instance, while Data Platform Engineering manages the tooling needed to deploy and run these instances.
All of the Airflow instances run on Kubernetes on the dse-k8s cluster in eqiad.
The instances all have access to Hadoop and other Analytics Cluster related tools.
Within a DAG, it is now possible to choose where the computational work of a pipeline will run: on YARN or on Kubernetes.
Authentication
We use an authentication mechanism that is integrated with Airflow and backed by our CAS-SSO system. Users authenticate with their MediaWiki developer account, and an LDAP group mapping determines the level of access permitted. Membership of the wmf or nda group is required for read-only access. Each instance then has a specific LDAP group that maps to the operations-user capability for that instance. Members of ops are granted admin rights on the instances.
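A hedged sketch of how such a group-to-role mapping can be expressed, using Flask-AppBuilder's generic AUTH_ROLES_MAPPING setting in Airflow's webserver_config.py (the per-instance group name is hypothetical, and the wiring to CAS-SSO/LDAP itself is not shown; our actual configuration may differ):

```python
# webserver_config.py (sketch): map external groups to Airflow roles.
# Group names mirror the ones described above; "airflow-<instance>-ops"
# is a hypothetical placeholder for the per-instance LDAP group.
AUTH_ROLES_MAPPING = {
    "wmf": ["Viewer"],                 # read-only access
    "nda": ["Viewer"],                 # read-only access
    "airflow-<instance>-ops": ["Op"],  # per-instance operations capability
    "ops": ["Admin"],                  # full admin rights
}
AUTH_ROLES_SYNC_AT_LOGIN = True  # re-evaluate the mapping on every login
```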
Metadata Database
Each of these instances has its own PostgreSQL cluster that is deployed with the instance by the CloudNativePG operator. There are two PostgreSQL database instances in each cluster, operating in a high-availability mode with automatic failover. The storage layer for these databases is the Ceph storage cluster operated by the Data Platform Engineering team.
These CloudNativePG clusters also include a set of three PgBouncer pods, which act as a connection pooler.
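A hedged sketch of what such a deployment looks like as CloudNativePG resources (the names, sizes, and storage class below are illustrative assumptions, not our actual manifests):

```yaml
# Two-instance HA Postgres cluster with automatic failover (sketch).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: airflow-example-db    # hypothetical name
spec:
  instances: 2
  storage:
    size: 20Gi                # illustrative size
    storageClass: ceph-rbd    # hypothetical Ceph-backed storage class
---
# Three PgBouncer pods fronting the cluster as a connection pooler.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: airflow-example-pooler
spec:
  cluster:
    name: airflow-example-db
  instances: 3
  type: rw
  pgbouncer:
    poolMode: session
```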
Airflow DAGs
To develop best practices around Airflow, we use a single shared git repository for Airflow DAGs for all instances: data-engineering/airflow-dags. Airflow instance (and team) specific DAGs live in subdirectories of this repository, e.g. in <instance_name>/dags.
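Schematically, the repository layout follows the <instance_name>/dags convention above (directory and file names here are placeholders):

```
airflow-dags/
├── <instance_name>/
│   └── dags/              # DAGs loaded only by that team's instance
│       └── example_dag.py
└── <other_instance>/
    └── dags/
```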
Continuous DAG deployment
We use a continuous-deployment model for this, although the precise mechanism is still being developed. See task T368033 for the current work.
For the time being, we use a git-sync pod, as documented, to pull from the main branch of airflow-dags every 5 minutes.
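A hedged sketch of such a git-sync sidecar container (the image version, repository URL, and paths are illustrative assumptions, not our actual deployment configuration):

```yaml
# git-sync sidecar (sketch): keeps a local checkout of airflow-dags fresh.
- name: git-sync
  image: registry.k8s.io/git-sync/git-sync:v4.2.3   # illustrative version
  args:
    - --repo=https://example.org/data-engineering/airflow-dags.git  # hypothetical URL
    - --ref=main       # branch to track
    - --period=5m      # poll interval, matching the 5-minute sync above
    - --root=/git      # checkout location shared with the Airflow pods
```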
Artifact syncing
Each Airflow instance has its own artifacts.yaml file that contains a list of the software artifacts required by the DAGs, e.g. main/main/config/artifacts.yaml.
These artifacts are deployed automatically by Blunderbuss, following any merge to the master branch of the airflow-dags repository.
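A hedged sketch of what an artifacts.yaml entry can look like (the schema, names, and coordinates here are hypothetical, illustrating the idea of declaring versioned artifacts rather than the file's actual format):

```yaml
# Hypothetical artifacts.yaml entries: each named artifact points at a
# versioned package that the sync tooling fetches for the DAGs to use.
artifacts:
  example_job:
    id: org.example:example-job:0.3.0   # hypothetical versioned coordinate
    source: example_source              # hypothetical artifact source
```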
Skein
We run Skein as a way to schedule Python Spark jobs on YARN from Airflow-scheduled jobs.
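A hedged sketch of a Skein application specification in its YAML form (the resource sizes and script are illustrative; the specs for real jobs are generated by our tooling):

```yaml
# Hypothetical Skein spec: runs a PySpark driver as a YARN application.
name: example-spark-job
queue: default
services:
  driver:
    resources:
      memory: 2 GiB
      vcores: 1
    script: |
      spark-submit --master yarn my_job.py   # hypothetical job script
```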
See also
- Shared Airflow - Design Document
- phab:T272973
- Analytics/Systems/Cluster/Workflow_management_tools_study
- Phabricator project
Airflow Instances
Kept up to date at: Data_Engineering/Systems/Airflow/Instances#List_of_instances
Airflow on Kubernetes
Kept up to date at Data Platform/Systems/Airflow/Kubernetes
Airflow Upgrades
The Airflow upgrade procedure is documented at: Data_engineering/Systems/Airflow/Upgrading
Administration
Please see: Data Platform/Systems/Airflow/Kubernetes/Operations
Incident reports & known issues
Add incident reports and known issues in the following table. Please add a short description of the issue and a link to a more detailed write-up: either a wiki page or a Phabricator task. Thanks! :]
| Date | Incident / Issue description | link |
|---|---|---|
| 2022-07-26 | This is an example incident description. | example.link |