document status: in-review
Jobs stopped being processed after rollout of MediaWiki 1.34.0-wmf.10.
With the Kafka job queue the jobs are serialized into JSON and then via EventBus and ChangeProp are delivered to an internal endpoint on job runners for execution. in
page_title property was removed from the job specification. We have been preparing for this, removing the property from the job JSON schema, however, we've forgot that the internal endpoint checks for the field to be present and returns 400 if it's not. The check is done very early in the codepath, so no logging was done, and change-prop ignored the 400. So the issue was not observable. It was fixed by
Jobs were not processed for group0 and group1 wikis for 12 hours. Jobs that go around all the wikis, like
GlobalRename, got stuck reaching a group0 or group1 wiki.
The issue was not detected automatically by monitoring tool, logs or graphs, it was found by accident due to
MassMessage job being stuck
- 06-18 19:00-20:00 group0 and group1 deployed to wmf.10
- 06-19 07:30 Issue discovered and ticket created
- 06-19 07:40 group1 rolled back to wmf.8
- 06-19 08:40 group0 rolled back to wmf.8
- 06-24 Patch for the job executor deployed, train unblocked.
What went poorly?
- No indication of a breakage in logs, graphs or monitoring tools
- Due to cpt offsite a long time has passed between the issue discovery and a fix
Where did we get lucky?
- The outage was noticed quickly due to stuck
Links to relevant documentation
- task T229277 Change-Prop should report 400 errors from endpoints
- Better observability for jobqueue (TODO: link to existing task)
- Automated integration test? (TODO: create task)