Incidents/2020-02-25 mediawiki interface language

document status: in-review

Summary

Logged-out users saw the user interface in their browser language (Accept-Language header) rather than the default wiki content language, due to a configuration change inadvertently causing the $wgULSLanguageDetection setting to be unset and falling back to the default in the UniversalLanguageSelector extension.

Impact

Anonymous users were affected, registered users were not.

Appserver load was increased due to the cache being split by Accept-Language. Average response time also went up considerably [1].

Detection

Reported by users at phabricator:T246071 and other tasks. There were no alerts as far as Lucas Werkmeister (WMDE) is aware.

If human only, an actionable should probably be "add alerting".

Timeline

This is a step by step outline of what happened to cause the incident and how it was remedied. Include the lead-up to the incident, as well as any epilogue, and clearly indicate when the user-visible outage began and ended.

All times in UTC.

00:35 Config patch to delete fixcopyrightwiki and related config was merged: https://gerrit.wikimedia.org/r/552549
00:43 Synchronized dblists/all.dblist: T238803: Remove fixcopyrightwiki from all.dblist (duration: 00m 56s)
00:45 rebuilt and synchronized wikiversions files: T238803: Remove fixcopyrightwiki from wikiversions
00:46 Synchronized dblists/ (Delete fixcopyrightwiki). Average response time rises immediately (from ca. 180 ms to ca. 240 ms), though it’s not clear why. The issue may have begun here already.
00:51 Synchronized wmf-config/CommonSettings.php: Stop trying to read wmgUseSkinPerPage or wmgUseEUCopyrightCampaign (duration: 00m 55s)
00:51 Ran DELETE FROM globalimagelinks WHERE gil_wiki='fixcopyrightwiki'; (1 row removed) T238803
00:53 ISSUE BEGINS: Synchronized wmf-config/InitialiseSettings.php Remove all IS config related to fixcopyrightwiki (duration: 00m 55s)
08:41 T246071 created
09:59 T246081 created (duplicate)
10:04 T246071 escalated to "Unbreak Now!" priority
10:41 likely cause (turned out to be correct) pointed out at T246071#5915117
12:00 (ca.) Average response time begins to rise much further, not leveling until incident was resolved. Probably the first sync of the EU SWAT finished the previous sync from 00:53.
12:15 T246095 created (duplicate)
12:26 T246071 brought up on #mediawiki-i18n (IRC), asking for Language Team attention
~12:30 Average response time starts going up considerably [more traffic?]
13:01 T246071 brought up in #mediawiki-releng (IRC)
13:28 gerrit:574743 uploaded
14:13 gerrit:574743 merged
14:15 Synchronized wmf-config/InitialiseSettings.php (Reinstate wgULSLanguageDetection setting)
14:35 ISSUE ENDS: Synchronized php-1.35.0-wmf.20/extensions/Wikibase/lib (wbterms: only select entity terms that are requested), effectively finishing the previous sync (T236104)

Conclusions

What weaknesses did we learn about and how can we address them?

What went well?

No major user impact (arguably this is a feature that we would like for multilingual sites, but done in a planned way)
Cause was correctly identified quickly.

What went poorly?

The initial scap sync to deploy the fix apparently did not affect all servers, due to T236104. Another scap sync happened to take place soon afterwards (for unrelated reasons), with the side-effect of ensuring that the fix reached the remaining servers.
Impact was not clear from the beginning, since the additional load was not significant to trigger any alerts (even through average app server response time jumped 33%).
Maintainers of ULS were not aware of ULS configuration setting being changed.
The configuration setting in InitializeSettings.php was not documented to be required and dangerous.

Where did we get lucky?

An unrelated second scap sync mitigated the impact of T236104 (the first scap sync not reaching all servers).
Out of the three alternatives (cache pollution, cache splitting, no cache) we got the "best" option that only increased app server load, and not too much to overload it. Cache pollution would have caused logged out users to see pages in random languages and no cache could have brought the site down.

How many people were involved in the remediation?

Mainly Brian Wolff, Lucas Werkmeister (WMDE), Nikerabbit

for example: 2 SREs and 1 software engineer troubleshooting the issue plus 1 incident commander

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, there should be an action item to create it.

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.

Done Remind more people of double sync workaround for T236104 – ops-l email by Ladsgroup, SWAT documentation update
T236104
T246212 - the setting was moved to CommonSettings.php and documented.
…