Data Platform/Systems/Airflow/Kubernetes/Administration
Deleting all failed Airflow task pods
When failed task pods accumulate, the used CPU/memory reporting in https://grafana.wikimedia.org/goto/qG7AGR3HR?orgId=1 becomes skewed. From time to time, it does not hurt to delete them all.
brouberol@deploy1003:~$ sudo -i
root@deploy1003:~# kube_env admin dse-k8s-eqiad
root@deploy1003:~# kubectl delete pods -l app=airflow,component=task-pod --field-selector status.phase=Failed -A
pod "canary-events-produce-canary-event-ivht08mg" deleted
pod "projectview-hourly-aggregrate-pageview-to-projectview-fmggyosn" deleted
pod "refine-to-hive-hourly-refine-hive-dataset-evolve-and-r-3lf6rljw" deleted
pod "refine-to-hive-hourly-refine-hive-dataset-evolve-and-r-mp8293gz" deleted
pod "refine-to-hive-hourly-refine-hive-dataset-mark-input-p-d0xax0k8" deleted
pod "refine-to-hive-hourly-refine-hive-dataset-wait-for-gob-r2e242ce" deleted
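If you want to review what would be deleted first, the same label and field selectors work with a read-only kubectl get. A small sketch (hypothetical helper function, not an existing script on the deployment host):

```shell
# Hypothetical helper: list the failed Airflow task pods that the
# `kubectl delete` one-liner above would remove, without deleting anything.
preview_failed_task_pods() {
    kubectl get pods -A \
        -l app=airflow,component=task-pod \
        --field-selector status.phase=Failed
}
```

Run it in the same kube_env admin session; if the list looks right, run the delete command with the identical selectors.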
Restart all Airflow schedulers
When a change to wmf_airflow_common is merged, you need to restart all schedulers so that every Airflow instance picks it up (DAG files are reloaded regularly, but wmf_airflow_common is only loaded at scheduler startup).
brouberol@deploy1003:~$ sudo -i
root@deploy1003:~# kube_env admin dse-k8s-eqiad
root@deploy1003:~# kubectl get pod -A -l app=airflow,release=production,component=scheduler
NAMESPACE                   NAME                                 READY   STATUS    RESTARTS        AGE
airflow-analytics-product   airflow-scheduler-65855dd558-jjwzn   2/2     Running   1 (16d ago)     36d
airflow-analytics-test      airflow-scheduler-7457499855-n5r87   2/2     Running   1 (41d ago)     43d
airflow-main                airflow-scheduler-bd6d6c67d-r2qd9    2/2     Running   4 (2d20h ago)   13d
airflow-ml                  airflow-scheduler-7f9b469d7f-fwnnr   2/2     Running   0               6d5h
airflow-platform-eng        airflow-scheduler-6fdcbb4d8c-dshmh   2/2     Running   1 (13d ago)     28d
airflow-research            airflow-scheduler-5db5db6dc-b2brx    2/2     Running   1 (41d ago)     42d
airflow-search              airflow-scheduler-5fbb8d86c7-djtcn   2/2     Running   0               13d
airflow-test-k8s            airflow-scheduler-55d4675d7b-qfj7d   1/1     Running   0               6d22h
airflow-wmde                airflow-scheduler-5fbdcd75c-5njcj    1/1     Running   0               11d
root@deploy1003:~# kubectl delete pod -A -l app=airflow,release=production,component=scheduler
pod "airflow-scheduler-65855dd558-jjwzn" deleted
pod "airflow-scheduler-7457499855-n5r87" deleted
pod "airflow-scheduler-bd6d6c67d-r2qd9" deleted
pod "airflow-scheduler-7f9b469d7f-fwnnr" deleted
pod "airflow-scheduler-6fdcbb4d8c-dshmh" deleted
pod "airflow-scheduler-5db5db6dc-b2brx" deleted
pod "airflow-scheduler-5fbb8d86c7-djtcn" deleted
pod "airflow-scheduler-55d4675d7b-qfj7d" deleted
pod "airflow-scheduler-5fbdcd75c-5njcj" deleted
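Deleting the pods is safe because each scheduler is managed by a Deployment, which immediately recreates it. An equivalent sketch using a per-namespace rolling restart instead of deleting pods (assumptions: your kubectl supports `rollout restart` with a label selector, and the namespace list, copied from the kubectl get pod output above, is still current):

```shell
# Sketch: trigger a rolling restart of each production scheduler Deployment
# instead of deleting pods directly. Namespaces copied from the listing above;
# verify the list before running.
restart_all_schedulers() {
    local ns
    for ns in airflow-analytics-product airflow-analytics-test airflow-main \
              airflow-ml airflow-platform-eng airflow-research airflow-search \
              airflow-test-k8s airflow-wmde; do
        kubectl -n "$ns" rollout restart deployment \
            -l app=airflow,release=production,component=scheduler
    done
}
```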
Upgrading Airflow
To upgrade Airflow, we first need to build a new docker image installing a more recent apache-airflow package version (example). Once the patch is merged, a publish:airflow job will be kicked off for each airflow image.
Then, use the CLI described at User:Brouberol#Finding the published docker image name and tag from the logs of a Gitlab image publishing pipeline to automatically get the docker image tag of the newly published airflow image from the Gitlab jobs (or copy it manually from the Gitlab build job logs).
Now, deploy the new image to the airflow-test-k8s instance by changing the app.version field in deployment_charts/helmfile.d/dse-k8s-services/airflow-example/values-production.yaml, then redeploy the test instance. Any outstanding DB migrations will be applied automatically. If everything goes well, bump the airflow version under deployment_charts/helmfile.d/dse-k8s-services/_airflow_common_/values-dse-k8s-eqiad.yaml and redeploy every instance, one after the other.
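The final rollout to every instance can be scripted as a sequential loop. A sketch, assuming a helmfile-based deployment flow on the deployment host, the /srv/deployment-charts checkout path, and instance directory names matching the scheduler namespaces listed earlier (all of these are assumptions to verify before running):

```shell
# Sketch: redeploy each Airflow instance one after the other once the version
# bump in _airflow_common_ is merged. Paths, environment name, and instance
# names are assumptions; check them on the deployment host first.
deploy_all_airflow_instances() {
    local base=/srv/deployment-charts/helmfile.d/dse-k8s-services
    local inst
    for inst in airflow-test-k8s airflow-analytics-product airflow-analytics-test \
                airflow-main airflow-ml airflow-platform-eng airflow-research \
                airflow-search airflow-wmde; do
        helmfile -e dse-k8s-eqiad -f "$base/$inst/helmfile.yaml" apply
    done
}
```

Deploying sequentially (rather than in parallel) makes it easy to stop and investigate if one instance misbehaves after the upgrade.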