Data Engineering/Systems/Airflow/Instances

WMF's Airflow system is composed of several Airflow instances. Each instance schedules and orchestrates the jobs belonging to a particular grouping. For example, the analytics instance schedules jobs that generate and process analytics data sets, and the research instance orchestrates jobs that process research-related data sets. Usually, each Airflow instance is managed by one WMF team: the analytics instance, for example, is managed by the Data Engineering team, which has also developed most of its jobs. However, an Airflow instance can be shared by several teams, and one team can take part in developing jobs in multiple Airflow instances.

Multi-instance vs. single instance

During the development of WMF's Airflow system, we discussed using a single-instance approach versus a multi-instance approach. Both have advantages and disadvantages. This thread contains most of the arguments we discussed, which include the following:

Single instance
Pros: Single configuration, no custom stacks for teams, and thus, easy upgrades and maintenance.
Cons: Airflow does not support Kerberos multitenancy (yet), so one single instance would require that all WMF jobs accessed Hadoop with the same Kerberos credentials, not allowing for access control or specific permissions.

Multi-instance
Pros: No single point of failure, if a team deploys code that breaks Airflow services, the other instances continue working. Teams have more independence when deploying.
Cons: When doing maintenance, the Data Engineering team will have to wrangle multiple airflow instances to stop jobs.

We (Data Engineering) decided to kick off the project with a multi-instance approach, mainly because of the Kerberos issue, but we don't rule out switching to a single instance in the future. All WMF Airflow instances are set up by the same Puppet configuration, so even though we provide multiple instances, they all have the same stack (see: https://github.com/wikimedia/puppet/tree/production/modules/airflow).

Access

The web UI access commands below show how an instance owner can connect directly to the instance. The commands will not work for non-owners.

However, anyone in the analytics-privatedata-users group can reach the web UI of any instance by routing their SSH connection through one of the analytics clients. This is useful for tracking the status of jobs on different instances. For example, to connect to the analytics instance through stat1011, use the following command:

ssh -N -L 8600:an-launcher1002.eqiad.wmnet:8600 stat1011.eqiad.wmnet

You can now access the web UI at http://localhost:8600.
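
Because every instance listed below serves its web UI on port 8600, tunnelling to two instances at the same time needs distinct local ports. The following is a minimal sketch, assuming a hypothetical shell helper named airflow_tunnel_cmd (it is not part of any WMF tooling). It only prints the command, so it can be reviewed before being run:

```shell
# Hypothetical helper, not part of WMF tooling: prints (does not run) the
# SSH tunnel command for an Airflow host, routed through an analytics client.
#   $1 = Airflow host
#   $2 = local port (default 8600)
#   $3 = analytics client to route through (default stat1011.eqiad.wmnet)
airflow_tunnel_cmd() {
    host="$1"
    local_port="${2:-8600}"
    client="${3:-stat1011.eqiad.wmnet}"
    # The remote side is always 8600, the web UI port of every instance.
    printf 'ssh -N -L %s:%s:8600 %s\n' "$local_port" "$host" "$client"
}

# Default local port, same command as the example above.
airflow_tunnel_cmd an-launcher1002.eqiad.wmnet
# A second tunnel on local port 8601, here to the research instance host.
airflow_tunnel_cmd an-airflow1002.eqiad.wmnet 8601
```

With the second command running, that instance's web UI would be reachable at http://localhost:8601 while the first tunnel keeps using http://localhost:8600.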

If you can access the Airflow web UI, you can change any setting, including pausing or deleting jobs. Be careful! (T358137)

List of instances

analytics

Airflow instance owned by the Data / Analytics engineering team. Contains all production jobs historically developed by the team.

Host: an-launcher1002.eqiad.wmnet
Service user: analytics
Web UI port: 8600
Web UI access: ssh -t -N -L8600:127.0.0.1:8600 an-launcher1002.eqiad.wmnet, then open http://localhost:8600
Dags: airflow-dags/analytics/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/analytics
git fetch && git rebase
scap deploy

analytics_test

Airflow test instance owned by the Data / Analytics engineering team. Contains some jobs analogous to the ones in the analytics instance, just to create some data flows in Data Engineering's test cluster.

Host: an-test-client1002.eqiad.wmnet
Service user: analytics
Web UI port: 8600
Web UI access: ssh -t -N -L8600:127.0.0.1:8600 an-test-client1002.eqiad.wmnet, then open http://localhost:8600
Dags: airflow-dags/analytics_test/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/analytics_test
git fetch && git rebase
scap deploy

search

Airflow instance owned by the Search team.

Host: an-airflow1005.eqiad.wmnet
Service user: analytics-search
Web UI port: 8600
Web UI access: ssh -t -N -L8600:localhost:8600 an-airflow1005.eqiad.wmnet, then open http://localhost:8600
Dags: airflow-dags/search/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/search
git fetch && git rebase
scap deploy

research

Airflow instance owned by the Research team.

Host: an-airflow1002.eqiad.wmnet
Service user: analytics-research
Web UI port: 8600
Web UI access: ssh -t -N -L8600:127.0.0.1:8600 an-airflow1002.eqiad.wmnet, then open http://localhost:8600
Dags: airflow-dags/research/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/research
git fetch && git rebase
scap deploy

platform_eng

Airflow instance owned by the Platform Engineering team.

Host: an-airflow1004.eqiad.wmnet
Service user: analytics-platform-eng
Web UI port: 8600
Web UI access: ssh -t -N -L8600:127.0.0.1:8600 an-airflow1004.eqiad.wmnet, then open http://localhost:8600
Dags: airflow-dags/platform_eng/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/platform_eng
git fetch && git rebase
scap deploy

analytics_product

Airflow instance owned by the Product Analytics engineering team. Contains all production jobs historically developed by the team.

Host: an-airflow1006.eqiad.wmnet
Service user: product-analytics
Web UI port: 8600
Web UI access: ssh -t -N -L8600:127.0.0.1:8600 an-airflow1006.eqiad.wmnet, then open http://localhost:8600
Dags: airflow-dags/analytics_product/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/analytics_product
git fetch && git rebase
scap deploy

wmde

Airflow instance owned by the WMDE engineering team. Contains all production jobs historically developed by the team.

Host: an-airflow1007.eqiad.wmnet
Service user: analytics-wmde
Web UI port: 8600
Web UI access: ssh -t -N -L8600:127.0.0.1:8600 an-airflow1007.eqiad.wmnet, then open http://localhost:8600
Dags: airflow-dags/wmde/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/wmde
git fetch && git rebase
scap deploy
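
The deployment steps are the same for every instance above; only the directory under /srv/deployment/airflow-dags changes. As a minimal sketch, assuming a hypothetical function named airflow_deploy_steps (not part of any WMF tooling), the steps to run on deployment.eqiad.wmnet can be printed for a given instance name. The function does not execute anything and does not check that the instance exists:

```shell
# Hypothetical helper, not part of WMF tooling: prints the deployment
# commands for one Airflow instance, to be run on deployment.eqiad.wmnet.
#   $1 = instance name, e.g. analytics, search, wmde
airflow_deploy_steps() {
    instance="$1"
    if [ -z "$instance" ]; then
        echo "usage: airflow_deploy_steps INSTANCE" >&2
        return 1
    fi
    # Each instance has its own checkout under airflow-dags.
    printf 'cd /srv/deployment/airflow-dags/%s\n' "$instance"
    printf 'git fetch && git rebase\n'
    printf 'scap deploy\n'
}

airflow_deploy_steps wmde
```

Printing the steps instead of running them keeps the sketch safe to try anywhere; the actual deployment is still done by hand as described for each instance.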

Custom test instance

More at Analytics/Systems/Airflow/Airflow_testing_instance_tutorial