Incidents/2023-05-19 videoscaler/jobrunner

document status: draft


Incident metadata (see Incident Scorecard)
Incident ID: jobrunner
Task: T279100
Start: 2023-05-19 19:04:00
End: 2023-05-19 19:49:00
People paged: 2
Responder count: 1
Coordinators: Dzahn
Affected metrics/SLOs: No relevant SLOs exist
Impact: The user who uploaded the videos had to wait a bit longer to get the different formats. Other users may also have waited a bit longer for other jobs.

A large video-scaling job kept server mw1469 so busy that ffmpeg processes maxed out its capacity. Since mw1469 was both a jobrunner and a videoscaler, this led to alerts for both the jobrunner and videoscaler services.

For the first part of the incident, alerts were visible on IRC but no pages had been sent yet. The alerts were flapping, including on mw1469 specifically. Around 19:24 the flapping eventually triggered a page. Dzahn and Aokoth were paged, started looking at it, and kept an eye on it for a while.

Since the alerts kept flapping, the runbook (Application servers/Runbook#Jobrunners) was followed: mw1469 was depooled from videoscaler but kept pooled in jobrunner, and the ffmpeg processes on mw1469 were killed. This protected the jobrunner service, which is much more important than videoscaling (quoting the runbook). Jobrunner alerts recovered.

A little while later, server mw1495 was depooled from jobrunner and turned into a dedicated videoscaler. Videoscaler alerts recovered.


All times in UTC.

  • 19:04 flapping of IRC alerts begins
  • 19:24 page is sent
  • 19:25 Dzahn ACKs alert, starts investigating, watches the situation
  • 19:45 Since alerts are still flapping, Dzahn depools mw1469 from videoscaler, to protect jobrunner
  • 19:46 Dzahn kills ffmpeg processes on mw1469 (as instructed per runbook)
  • 19:47 alerts recover on mw1469
  • 20:23 alerts start on mw1495
  • 20:52 mw1495 is depooled from jobrunner and made a dedicated videoscaler, so that its ffmpeg processes can eventually finish
  • 20:53 alerts for videoscaler recover
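
The intervals implied by the timeline above can be worked out with a short script. This is only an illustrative sketch: the helper name minutes_between and the variable names are made up for this example, and the timestamps are copied from the list above on the assumption that all events fall on the same UTC day.

```python
from datetime import datetime

# Hypothetical helper for this example; assumes both times are "HH:MM"
# on the same UTC day, as in the timeline above.
def minutes_between(start: str, end: str) -> int:
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timestamps copied from the timeline.
first_alert = "19:04"       # flapping of IRC alerts begins
page_sent = "19:24"         # page is sent
page_acked = "19:25"        # Dzahn ACKs alert
mw1469_recovered = "19:47"  # alerts recover on mw1469
last_recovery = "20:53"     # alerts for videoscaler recover

time_to_page = minutes_between(first_alert, page_sent)
time_to_ack = minutes_between(page_sent, page_acked)
time_to_first_recovery = minutes_between(first_alert, mw1469_recovered)
time_to_last_recovery = minutes_between(first_alert, last_recovery)

print(f"first alert to page:           {time_to_page} min")
print(f"page to ACK:                   {time_to_ack} min")
print(f"first alert to mw1469 recover: {time_to_first_recovery} min")
print(f"first alert to last recovery:  {time_to_last_recovery} min")
```

Under these assumptions, the page arrived 20 minutes after the first IRC alert and was acknowledged one minute later.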

Icinga first started reporting via icinga-wm on IRC; a little later, the SREs on duty were paged via Alertmanager.


