Incidents/2023-05-19 videoscaler/jobrunner


document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: jobrunner
Start: 2023-05-19 19:04:00
End: 2023-05-19 19:49:00
Task: T279100
People paged: 2
Responder count: 1
Coordinators: Dzahn
Affected metrics/SLOs: No relevant SLOs exist
Impact: The user who uploaded the videos had to wait longer for the transcoded formats to become available. Other users' jobs may also have been delayed.

A large video-scaling job saturated server mw1469 with ffmpeg processes, maxing out its capacity. Because mw1469 was both a jobrunner and a videoscaler, this led to alerts for both the jobrunner and videoscaler services.

During the first part of the incident, alerts were visible on IRC but no pages had been sent. The alerts were flapping, including for mw1469 specifically. Around 19:24 a page was eventually triggered; Dzahn and Aokoth were paged, started investigating, and kept an eye on the situation for a while.

Since the alerts kept flapping, the runbook (Application servers/Runbook#Jobrunners) was followed: mw1469 was depooled from the videoscaler service but remained pooled as a jobrunner, and the ffmpeg processes on it were killed. This protected the jobrunner service, which (per the runbook) is much more important than video scaling. The jobrunner alerts then recovered.
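
For reference, the selective depool described above would look roughly like the following with conftool. This is a sketch only; the exact confctl tags (service vs. cluster, datacenter) and the use of pkill are assumptions for illustration, not a transcript of the commands run during the incident.

  # From a cluster management host: remove mw1469 from the videoscaler
  # pool only, leaving it pooled as a jobrunner.
  sudo confctl select 'name=mw1469.eqiad.wmnet,service=videoscaler' set/pooled=no

  # On mw1469 itself: kill the ffmpeg transcodes saturating the host.
  sudo pkill ffmpeg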

A little while later, server mw1495 was depooled from the jobrunner service and turned into a dedicated videoscaler, after which the videoscaler alerts recovered.
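
The change for mw1495 is the inverse: it stays in the videoscaler pool and is removed only from the jobrunner pool. Again, the selector below is illustrative and assumed, not the exact command that was used.

  # Depool mw1495 from the jobrunner service only, leaving it as a
  # dedicated videoscaler until its transcodes finish.
  sudo confctl select 'name=mw1495.eqiad.wmnet,service=jobrunner' set/pooled=no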

Timeline

All times in UTC.

  • 19:04 flapping of IRC alerts begins
  • 19:24 page is sent
  • 19:25 Dzahn ACKs alert, starts investigating, watches the situation
  • 19:45 Since alerts are still flapping, Dzahn depools mw1469 from videoscaler to protect jobrunner
  • 19:46 Dzahn kills ffmpeg processes on mw1469 (as instructed by the runbook)
  • 19:47 alerts recover on mw1469
  • 20:23 alerts start on mw1495
  • 20:52 mw1495 is depooled from jobrunner and made a dedicated videoscaler, so it can eventually finish its ffmpeg processes
  • 20:53 alerts for videoscaler recover

TODO: Clearly indicate when the user-visible outage began and ended.

Detection

Icinga first started reporting via icinga-wm on IRC; a little later, the SRE on duty was paged via Alertmanager.

Conclusions

OPTIONAL: General conclusions (bullet points or narrative)

What went well?

OPTIONAL: (Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc

What went poorly?

OPTIONAL: (Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc

Where did we get lucky?

OPTIONAL: (Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc

Links to relevant documentation

Actionables

Scorecard

Incident Engagement Scorecard

Question | Answer (yes/no) | Notes

People
  • Were the people responding to this incident sufficiently different than the previous five incidents?
  • Were the people who responded prepared enough to respond effectively?
  • Were fewer than five people paged?
  • Were pages routed to the correct sub-team(s)?
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident?
  • Was a public wikimediastatus.net entry created?
  • Is there a phabricator task for the incident?
  • Are the documented action items assigned?
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.
  • Were the people responding able to communicate effectively during the incident with the existing tooling?
  • Did existing monitoring notify the initial responders?
  • Were the engineering tools that were to be used during the incident available and in service?
  • Were the steps taken to mitigate guided by an existing runbook?

Total score (count of all “yes” answers above)