Incidents/2019-06-19 JobQueue
document status: in-review
Summary
Jobs stopped being processed after rollout of MediaWiki 1.34.0-wmf.10.
With the Kafka job queue the jobs are serialized into JSON and then via EventBus and ChangeProp are delivered to an internal endpoint on job runners for execution. in 516241 a page_title
property was removed from the job specification. We have been preparing for this, removing the property from the job JSON schema, however, we've forgot that the internal endpoint checks for the field to be present and returns 400 if it's not. The check is done very early in the codepath, so no logging was done, and change-prop ignored the 400. So the issue was not observable. It was fixed by 518751
Impact
Jobs were not processed for group0 and group1 wikis for 12 hours. Jobs that go around all the wikis, like GlobalRename
, got stuck reaching a group0 or group1 wiki.
Detection
The issue was not detected automatically by monitoring tool, logs or graphs, it was found by accident due to MassMessage
job being stuck
Timeline
- 06-18 19:00-20:00 group0 and group1 deployed to wmf.10
- 06-19 07:30 Issue discovered and ticket created
- 06-19 07:40 group1 rolled back to wmf.8
- 06-19 08:40 group0 rolled back to wmf.8
- 06-24 Patch for the job executor deployed, train unblocked.
Conclusions
What went poorly?
- No indication of a breakage in logs, graphs or monitoring tools
- Due to cpt offsite a long time has passed between the issue discovery and a fix
Where did we get lucky?
- The outage was noticed quickly due to stuck
MassMessage
jobs
Links to relevant documentation
Actionables
- task T229277 Change-Prop should report 400 errors from endpoints
- Better observability for jobqueue (TODO: link to existing task)
- Automated integration test? (TODO: create task)