Data Platform/Systems/Airflow/Instances

From Wikitech

WMF's Airflow system comprises several Airflow instances. Each instance schedules and orchestrates the jobs belonging to a particular team, or to a group of teams with a common purpose.

In general, each instance is associated with a service user account, which equates to a POSIX user ID and corresponding file ownership on the Hadoop HDFS file system. It is now possible to override the service user for a particular DAG, so this mapping of instances to HDFS file ownership is no longer a strict requirement.
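As a rough illustration of what a per-DAG service-user override can look like, the sketch below uses Airflow's standard `run_as_user` argument, which can be set in `default_args` for a whole DAG or on an individual task. This is a hypothetical example: the DAG id, schedule, and user names are illustrative, and whether the WMF instances implement the override through this exact parameter is an assumption.

```python
# Hypothetical sketch of a per-DAG service-user override, assuming the
# standard Airflow `run_as_user` mechanism. Names are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_service_user_override",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    # DAG-level default: every task runs as this user unless it overrides it.
    default_args={"run_as_user": "analytics-search"},
) as dag:
    # Runs as the DAG-level user (analytics-search in this sketch).
    extract = BashOperator(task_id="extract", bash_command="echo extract")

    # A single task can override the user again at the task level.
    load = BashOperator(
        task_id="load",
        bash_command="echo load",
        run_as_user="analytics",
    )

    extract >> load
```

Files written by a task are then owned on HDFS by whichever user the task ran as, which is why the instance-to-ownership mapping is no longer strict.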

The primary Airflow instance is called main; it schedules the jobs that generate and process analytics data sets. Another instance, called research, orchestrates jobs that process research-related data sets. Each Airflow instance is usually managed by a given WMF team; for example, the main instance is managed by the Data Engineering team, and most of its jobs were developed by them. However, an Airflow instance can also be shared by several teams, and one team can take part in developing jobs for multiple Airflow instances.

Multi-instance vs. single instance

During the development of WMF's Airflow system, we've had discussions about using a single-instance approach versus a multi-instance approach. There are advantages and disadvantages to both. This thread contains most of the arguments we discussed, including the following:

Single instance
Pros: A single configuration and no custom stacks per team, and thus easy upgrades and maintenance.
Cons: Airflow does not (yet) support Kerberos multitenancy, so a single instance would require all WMF jobs to access Hadoop with the same Kerberos credentials, allowing neither access control nor job-specific permissions.

Multi-instance
Pros: No single point of failure: if a team deploys code that breaks Airflow services, the other instances keep working. Teams have more independence when deploying.
Cons: When doing maintenance, the Data Engineering team has to wrangle multiple Airflow instances to stop jobs.

We decided to kick off the project with a multi-instance approach, mainly because of the Kerberos limitation, but we don't rule out switching to a single instance in the future.

Airflow and Kubernetes

When we started, all Airflow instances ran on either bare-metal hosts or VMs, and their configuration was managed by Puppet. Since then, we have completed a migration to Kubernetes: all instances except the analytics instance now run on the dse-k8s-eqiad Kubernetes cluster. The analytics instance is retained only for historical reference and back-fill purposes. The VMs are also retained temporarily to assist with artifact deployment, and will be decommissioned as soon as a replacement for this component has been deployed.

List of instances

main

Airflow instance currently owned by the Data Engineering team. It was named main in the expectation that all instances would be folded into this one once Airflow becomes multi-tenant.

Web UI Access https://airflow.wikimedia.org/
Service user analytics
Dags airflow-dags/main/dags
Dags deployment Data Platform/Systems/Airflow/Kubernetes#DAGs deployment
Artifact deployment Data Platform/Systems/Airflow/Kubernetes#Artifacts deployment

analytics_test

Airflow test instance owned by the Data Engineering team. It contains jobs analogous to the ones in the main instance.

n.b. This instance is configured to use the analytics_test_hadoop cluster.

Web UI Access https://airflow-analytics-test.wikimedia.org/
Service user analytics
Dags airflow-dags/analytics_test/dags
Dags deployment Data Platform/Systems/Airflow/Kubernetes#DAGs deployment
Artifact deployment Data Platform/Systems/Airflow/Kubernetes#Artifacts deployment

search

Airflow instance owned by the Search team.

Web UI Access https://airflow-search.wikimedia.org/
Service user analytics-search
Dags airflow-dags/search/dags
Dags deployment Data Platform/Systems/Airflow/Kubernetes#DAGs deployment
Artifact deployment Data Platform/Systems/Airflow/Kubernetes#Artifacts deployment

research

Airflow instance owned by the Research team.

Web UI Access https://airflow-research.wikimedia.org/
Service user analytics-research
Dags airflow-dags/research/dags
Dags deployment Data Platform/Systems/Airflow/Kubernetes#DAGs deployment
Artifact deployment Data Platform/Systems/Airflow/Kubernetes#Artifacts deployment

platform_eng

Airflow instance owned by the Platform Engineering team.

Web UI Access https://airflow-platform-eng.wikimedia.org/
Service user analytics-platform-eng
Dags airflow-dags/platform_eng/dags
Dags deployment Data Platform/Systems/Airflow/Kubernetes#DAGs deployment
Artifact deployment Data Platform/Systems/Airflow/Kubernetes#Artifacts deployment

analytics_product

Airflow instance owned by the Product Analytics engineering team. Contains all production jobs historically developed by the team.

Web UI Access https://airflow-analytics-product.wikimedia.org/
Service user analytics-product
Dags airflow-dags/analytics_product/dags
Dags deployment Data Platform/Systems/Airflow/Kubernetes#DAGs deployment
Artifact deployment Data Platform/Systems/Airflow/Kubernetes#Artifacts deployment

wmde

Airflow instance owned by the WMDE engineering team. Contains all production jobs historically developed by the team.

Web UI Access https://airflow-wmde.wikimedia.org/
Service user analytics-wmde
Dags airflow-dags/wmde/dags
Dags deployment Data Platform/Systems/Airflow/Kubernetes#DAGs deployment
Artifact deployment Data Platform/Systems/Airflow/Kubernetes#Artifacts deployment

ml

Airflow instance owned by the ML team. Contains all production jobs developed by the team.

Web UI Access https://airflow-ml.wikimedia.org/
Service user analytics-ml
Dags airflow-dags/ml/dags
Dags deployment Data Platform/Systems/Airflow/Kubernetes#DAGs deployment
Artifact deployment Data Platform/Systems/Airflow/Kubernetes#Artifacts deployment

test-k8s

Instance owned by the Data Platform SRE team, used to test new Airflow features and versions. It is currently also used to run dumps v1 DAGs.

Web UI Access https://airflow-test-k8s.wikimedia.org
Service user Default: analytics, but this can be overridden at the DAG and task levels
Dags airflow-dags/test_k8s/dags
Dags deployment Data Platform/Systems/Airflow/Kubernetes#DAGs deployment
Artifact deployment Data Platform/Systems/Airflow/Kubernetes#Artifacts deployment

Custom test instance

More at Analytics/Systems/Airflow/Airflow_testing_instance_tutorial