The main (gallery) namespace on Wikimedia Commons became uneditable for some 19 hours.
Editors on Wikimedia Commons were affected, though the gallery namespace is not as frequently edited as others (file uploads and categorization were unaffected, for instance).
End users reported the error on-wiki fairly soon after it became effective, and a Phabricator task was created as well. However, it took over half a day for operations to become aware of the task.
All times in UTC.
- 2019-03-27 16:44 last edit in the gallery namespace (according to recent changes)
- 2019-03-27 16:55 config change deployed OUTAGE BEGINS
- 2019-03-27 18:43 first report on Commons’ Village pump
- 2019-03-27 20:43 second report on Commons’ Help desk
- 2017-03-27 20:53 phabricator:T219450 created
- 2019-03-28 11:26 User:Yann mentions on the Phabricator task that this affects all NS0 edits, not just "some" as the task had been filed
- 2019-03-28 11:29 User:Yann mentions the Phabricator task in #wikimedia-operations connect
- 2019-03-28 12:02 config revert deployed OUTAGE ENDS
What went well?
- Once the bug was reported in #wikimedia-operations, response was relatively quick
What went poorly?
- The bug was not detected on Test Commons or Beta Cluster Commons over months of testing.
- Wikibase in production (for Wikidata) uses NS0 for entities, but this is not true by default or on most development machines. This area of the code is thus not well-tested.
- A ticket about an UBN! situation had no response from the technical community for several hours, no escalation or notification method is in place for it.
Where did we get lucky?
- The main namespace is not frequently used on Commons, so work on files, categories etc. was unhindered
Links to relevant documentation
- I followed SWAT deploys/Deployers when deploying the revert
- Wikibase config documentation?
NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.
- Fix the underlying bug and deploy the config again while testing *all* namespaces (bug T219450)
- [task management] The largest issue was not the time between deployment and notice, but the time between the a ticket was first filed (untriaged, no tags) and it was acted on. Analyze what failed and either change Phabricator usage recommendations/reporting best practices, or something else (if considered necessary) when a clear outage is happening. (bug T219589)
- Consider setting up production (regression) testing