Incident documentation/2021-03-30 Jobqueue overload

document status: in-review

Summary

A server-side upload of 65 4K video files caused high CPU usage and socket timeout errors on the jobrunners (all jobrunner hosts are also videoscalers). This increased the job backlog and made several MediaWiki-related servers (job queue runners, etc.) unavailable. The overload appears to have resulted from a combination of three factors: the files were 4K (and thus required many different downscales), they were long (averaging an hour in length), and they were uploaded from a local server (mwmaint) with a fast connection to the rest of our infrastructure. Together these placed too much load on the jobqueue infrastructure.

Halting the uploads and temporarily splitting the jobqueue into videoscalers and other jobrunners allowed the infrastructure to catch up.
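
To give a rough sense of how the factors in the summary multiply, the following is a minimal sketch that counts transcode jobs for a batch of uploads. The list of target heights is a hypothetical transcode ladder chosen for illustration, not the production configuration, and the function name is likewise made up for this example. It assumes each file produces one job per target height at or below its source height.

```python
# Hypothetical transcode ladder (target heights in pixels).
# These values are illustrative assumptions, not the real configuration.
ASSUMED_TARGET_HEIGHTS = [240, 360, 480, 720, 1080, 1440, 2160]


def transcode_jobs(num_files, source_height, target_heights=ASSUMED_TARGET_HEIGHTS):
    """Count jobs, assuming one job per file per target height <= source height."""
    per_file = sum(1 for height in target_heights if height <= source_height)
    return num_files * per_file


if __name__ == "__main__":
    # 65 files at 4K (2160p) versus the same batch at 720p, under the assumed ladder.
    print("4K batch:", transcode_jobs(65, 2160), "jobs")
    print("720p batch:", transcode_jobs(65, 720), "jobs")
```

Under these assumptions a 4K batch produces noticeably more jobs than a lower-resolution batch of the same size, and each job processes roughly an hour of video. Because the files arrived over a fast local connection, those jobs would all have been enqueued within a short window.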

Actionables