Incident documentation/20200225-mediawiki interface language

From Wikitech

document status: in-review

Summary

Logged-out users saw the user interface in their browser language (from the Accept-Language header) rather than in the default wiki content language. A configuration change inadvertently unset the $wgULSLanguageDetection setting, which then fell back to the default in the UniversalLanguageSelector extension.
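The fallback described above can be sketched as follows. This is a minimal illustration in Python, not MediaWiki or ULS code: the function names, the dictionary-based configuration, and the simplified Accept-Language parsing are all hypothetical. It assumes, as the summary implies, that the extension default enables language detection, and it assumes registered users were unaffected because a stored language preference takes priority.

```python
# Hypothetical model of the setting fallback; not actual MediaWiki/ULS code.
# Assumption inferred from the summary: the extension default enables detection.
EXTENSION_DEFAULTS = {"wgULSLanguageDetection": True}


def resolve_setting(name, site_config):
    """Return the site-level value if set, otherwise the extension default."""
    if name in site_config:
        return site_config[name]
    return EXTENSION_DEFAULTS[name]


def first_accepted_language(accept_language):
    """Very simplified parsing: take the first tag and drop any q-value."""
    first = accept_language.split(",")[0]
    return first.split(";")[0].strip().lower()


def interface_language(site_config, content_language,
                       accept_language=None, user_language=None):
    # Assumption: a registered user's stored preference always wins.
    if user_language is not None:
        return user_language
    if resolve_setting("wgULSLanguageDetection", site_config) and accept_language:
        return first_accepted_language(accept_language)
    return content_language


# Before the change: the site explicitly disables detection.
config_before = {"wgULSLanguageDetection": False}
# After the change: the key is gone, so the extension default applies.
config_after = {}

print(interface_language(config_before, "en", "de,en;q=0.8"))
print(interface_language(config_after, "en", "de,en;q=0.8"))
```

With the explicit override in place an anonymous request gets the content language; once the key is missing, the same request gets the browser language.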

Impact

Anonymous users were affected; registered users were not.

Appserver load increased because the cache was split by Accept-Language. Average response time also went up considerably [1].

Detection

Reported by users at phabricator:T246071 and other tasks. There were no alerts as far as Lucas Werkmeister (WMDE) is aware.

If human only, an actionable should probably be "add alerting".

Timeline

This is a step by step outline of what happened to cause the incident and how it was remedied. Include the lead-up to the incident, as well as any epilogue, and clearly indicate when the user-visible outage began and ended.

All times in UTC.

Other links:

Conclusions

What weaknesses did we learn about and how can we address them?

What went well?

  • No major user impact (arguably this is a feature that we would like for multilingual sites, but done in a planned way)
  • Cause was correctly identified quickly.

What went poorly?

  • The initial scap sync to deploy the fix apparently did not affect all servers, due to T236104. Another scap sync happened to take place soon afterwards (for unrelated reasons), with the side-effect of ensuring that the fix reached the remaining servers.
  • Impact was not clear from the beginning, since the additional load was not significant enough to trigger any alerts (even though average app server response time jumped 33%).
  • Maintainers of ULS were not aware that a ULS configuration setting was being changed.
  • The configuration setting in InitializeSettings.php was not documented as required, or as dangerous to change.

Where did we get lucky?

  • An unrelated second scap sync mitigated the impact of T236104 (the first scap sync not reaching all servers).
  • Of the three possible outcomes (cache pollution, cache splitting, no cache), we got the "best" one: it only increased app server load, and not by enough to overload the servers. Cache pollution would have caused logged-out users to see pages in random languages, and no cache could have brought the site down.
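The difference between the three cache outcomes can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the production cache: the strategy names, the `simulate` function, and the request list are hypothetical, and "rendering" stands in for an app server request.

```python
from collections import Counter

# Hypothetical request stream: (page, browser language) for anonymous users.
REQUESTS = [
    ("Main_Page", "en"),
    ("Main_Page", "de"),
    ("Main_Page", "en"),
    ("Main_Page", "de"),
    ("Main_Page", "fr"),
    ("Main_Page", "en"),
]


def simulate(requests, strategy):
    """Count app server renders, cache hits, and wrong-language responses.

    strategy is one of:
      "polluted" - cache keyed by page only, although output varies by language
      "split"    - cache keyed by (page, language)
      "none"     - nothing is cached
    """
    cache = {}
    stats = Counter()
    for page, lang in requests:
        if strategy == "polluted":
            key = page
        elif strategy == "split":
            key = (page, lang)
        elif strategy == "none":
            key = None
        else:
            raise ValueError(f"unknown strategy: {strategy}")

        if key is not None and key in cache:
            served_lang = cache[key]
            stats["hits"] += 1
        else:
            served_lang = lang  # the app server renders in the requested language
            stats["renders"] += 1
            if key is not None:
                cache[key] = served_lang

        if served_lang != lang:
            stats["wrong_language"] += 1
    return stats


for name in ("polluted", "split", "none"):
    result = simulate(REQUESTS, name)
    print(name, result["renders"], result["hits"], result["wrong_language"])
```

In this toy run, pollution keeps renders low but serves some users the wrong language, splitting serves everyone correctly at the cost of one render per language, and no cache renders every request.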

How many people were involved in the remediation?

Mainly Brian Wolff, Lucas Werkmeister (WMDE), Nikerabbit


Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, there should be an action item to create it.

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.