Data Platform/Systems/Airflow/Kubernetes/Administration

From Wikitech

Deleting all failed Airflow task pods

When failed task pods accumulate, the used CPU/memory reporting in https://grafana.wikimedia.org/goto/qG7AGR3HR?orgId=1 becomes inaccurate. It does not hurt to delete them all from time to time.

brouberol@deploy1003:~$ sudo -i
root@deploy1003:~# kube_env admin dse-k8s-eqiad
root@deploy1003:~# kubectl delete pods -l app=airflow,component=task-pod --field-selector status.phase=Failed -A
pod "canary-events-produce-canary-event-ivht08mg" deleted
pod "projectview-hourly-aggregrate-pageview-to-projectview-fmggyosn" deleted
pod "refine-to-hive-hourly-refine-hive-dataset-evolve-and-r-3lf6rljw" deleted
pod "refine-to-hive-hourly-refine-hive-dataset-evolve-and-r-mp8293gz" deleted
pod "refine-to-hive-hourly-refine-hive-dataset-mark-input-p-d0xax0k8" deleted
pod "refine-to-hive-hourly-refine-hive-dataset-wait-for-gob-r2e242ce" deleted
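The get and delete commands use the same label and field selectors, so what you preview is exactly what gets deleted. A minimal sketch (the helper name is hypothetical, not an existing tool) builds both commands from one place and prints them for review:

```shell
#!/bin/bash
# Hypothetical helper: print a preview (get) and a cleanup (delete) command
# built from the same selectors, so what you review is what you delete.
set -u

failed_task_pod_cmds() {
  local selector="app=airflow,component=task-pod"
  local phase="status.phase=Failed"
  echo "kubectl get pods -A -l ${selector} --field-selector ${phase}"
  echo "kubectl delete pods -A -l ${selector} --field-selector ${phase}"
}

# Print both commands; copy/paste them (or pipe to sh) on the deploy host.
failed_task_pod_cmds
```

Running the get command first lets you sanity-check the list of matching pods before issuing the delete.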

Restart all Airflow schedulers

When a change to wmf_airflow_common is merged, you need to restart all schedulers for every Airflow instance to pick it up (DAGs are reloaded regularly, but wmf_airflow_common is only loaded at startup).

brouberol@deploy1003:~$ sudo -i
root@deploy1003:~# kube_env admin dse-k8s-eqiad
root@deploy1003:~# kubectl get pod -A -l app=airflow,release=production,component=scheduler
NAMESPACE                   NAME                                 READY   STATUS    RESTARTS        AGE
airflow-analytics-product   airflow-scheduler-65855dd558-jjwzn   2/2     Running   1 (16d ago)     36d
airflow-analytics-test      airflow-scheduler-7457499855-n5r87   2/2     Running   1 (41d ago)     43d
airflow-main                airflow-scheduler-bd6d6c67d-r2qd9    2/2     Running   4 (2d20h ago)   13d
airflow-ml                  airflow-scheduler-7f9b469d7f-fwnnr   2/2     Running   0               6d5h
airflow-platform-eng        airflow-scheduler-6fdcbb4d8c-dshmh   2/2     Running   1 (13d ago)     28d
airflow-research            airflow-scheduler-5db5db6dc-b2brx    2/2     Running   1 (41d ago)     42d
airflow-search              airflow-scheduler-5fbb8d86c7-djtcn   2/2     Running   0               13d
airflow-test-k8s            airflow-scheduler-55d4675d7b-qfj7d   1/1     Running   0               6d22h
airflow-wmde                airflow-scheduler-5fbdcd75c-5njcj    1/1     Running   0               11d
root@deploy1003:~# kubectl delete pod -A -l app=airflow,release=production,component=scheduler
pod "airflow-scheduler-65855dd558-jjwzn" deleted
pod "airflow-scheduler-7457499855-n5r87" deleted
pod "airflow-scheduler-bd6d6c67d-r2qd9" deleted
pod "airflow-scheduler-7f9b469d7f-fwnnr" deleted
pod "airflow-scheduler-6fdcbb4d8c-dshmh" deleted
pod "airflow-scheduler-5db5db6dc-b2brx" deleted
pod "airflow-scheduler-5fbb8d86c7-djtcn" deleted
pod "airflow-scheduler-55d4675d7b-qfj7d" deleted
pod "airflow-scheduler-5fbdcd75c-5njcj" deleted
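Deleting all scheduler pods at once briefly leaves every instance without a scheduler. A more cautious sketch (the helper function and the airflow-scheduler deployment name are assumptions inferred from the pod names above) restarts them one namespace at a time, printing the commands for review rather than running them:

```shell
#!/bin/bash
# Hypothetical helper: emit a per-namespace restart sequence for the Airflow
# schedulers, waiting for each rollout before moving on to the next namespace.
# The deployment name airflow-scheduler is an assumption based on pod names.
set -u

rolling_scheduler_restart() {
  local ns
  for ns in "$@"; do
    echo "kubectl delete pod -n ${ns} -l app=airflow,release=production,component=scheduler"
    echo "kubectl rollout status deployment/airflow-scheduler -n ${ns} --timeout=5m"
  done
}

# Namespaces taken from the `kubectl get pod` output above; pipe to sh to run.
rolling_scheduler_restart airflow-main airflow-search airflow-research
```

Waiting on `kubectl rollout status` between namespaces ensures each scheduler is back up before the next one is restarted.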

Upgrading Airflow

To upgrade Airflow, we first need to rebuild the Docker image, installing a more recent apache-airflow package version (example). Once the patch is merged, a publish:airflow job is kicked off for each airflow image.

Then, use the CLI described at User:Brouberol#Finding the published docker image name and tag from the logs of a Gitlab image publishing pipeline to automatically get the docker image tag of the newly published airflow image from the Gitlab jobs (or copy it manually from the Gitlab build job logs).

Now, deploy the new image to the airflow-test-k8s instance by changing the app.version field in deployment_charts/helmfile.d/dse-k8s-services/airflow-example/values-production.yaml and redeploying the test instance. Any outstanding DB migrations will be applied automatically. If everything goes well, bump the airflow version under deployment_charts/helmfile.d/dse-k8s-services/_airflow_common_/values-dse-k8s-eqiad.yaml and redeploy every instance, one after the other.
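The redeploy step can be sketched as follows. The /srv/deployment-charts checkout path and the helmfile -e dse-k8s-eqiad diff/apply invocation are assumptions about the usual deployment workflow, not taken from this page, and the helper only prints the commands rather than running them:

```shell
#!/bin/bash
# Hypothetical helper: print the redeploy commands for one Airflow instance.
# Paths and helmfile flags are assumptions; verify against your deploy host.
set -u

redeploy_instance() {
  local service="$1"
  echo "cd /srv/deployment-charts/helmfile.d/dse-k8s-services/${service}"
  echo "helmfile -e dse-k8s-eqiad diff   # review the image version bump"
  echo "helmfile -e dse-k8s-eqiad apply  # roll out; DB migrations run automatically"
}

# Test instance first, then the remaining instances one by one once it is healthy.
redeploy_instance airflow-test-k8s
```

Reviewing the `diff` output before `apply` confirms that only the intended image version change is being rolled out.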