Incidents/2018-05-22 MediaWiki
Summary
MediaWiki issues caused by the Translate extension. Symptoms initially looked like database or network issues, then like a SWAT patch had caused the problem, but none of these explanations lined up well.
See https://phabricator.wikimedia.org/T195293#4224220
Timeline
Patches merged (as part of SWAT)
- 13:21 addshore@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert: Temp rate limit for arwiki due to mass vandalism T192668 (duration: 01m 18s)
- 13:25 addshore@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable $wgUseRCPatrol on azwiki T194389 (duration: 01m 20s)
meta.wikimedia.org actions
- 13:26, 22 May 2018 FuzzyBot (talk | contribs) changed the state of Russian translations of Privacy policy from Needs updating to In progress
- 13:28, 22 May 2018 Kaganer (talk | contribs) m . . (52,758 bytes) (+3) . . (thank) - https://meta.wikimedia.org/w/index.php?title=Privacy_policy&diff=18066512&oldid=18063747
- 13:29, 22 May 2018 Kaganer (talk | contribs) marked Privacy policy for translation
- 13:32, 22 May 2018 Kaganer (talk | contribs) changed the state of Russian translations of Privacy policy from In progress to Needs updating
Issue
- 13:34 paladox@#wikimedia-operations: hmm https://meta.wikimedia.org/wiki/Privacy_policy is not loading for me
- 13:35 NotASpy@#wikimedia-operations: yeah, en.wp is crawling along for me.
- 13:35 addshore@#wikimedia-operations: *looks around*
- 13:35 addshore@#wikimedia-operations: I can see a bunch of db errors
- 13:35 addshore@#wikimedia-operations: spike in lag or issue with replication
- 13:37 https://phabricator.wikimedia.org/T195293 - 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly)
- 13:40 addshore@#wikimedia-operations: started at 13:31
- <discussion>
- 13:43 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
Reverts
- 13:44 addshore - Started first revert
- 13:46 addshore@tin: Synchronized wmf-config/InitialiseSettings.php: Revert Enable $wgUseRCPatrol on azwiki (duration: 01m 19s)
- 13:47 marostegui@#wikimedia-operations: all the connection errors I am seeing are on s7
- 13:48 addshore@#wikimedia-operations: s7 has ar wiki?
- 13:49 marostegui@#wikimedia-operations: addshore: actually yes
- 13:49 addshore - started second revert
- 13:51 addshore@tin: Synchronized wmf-config/InitialiseSettings.php: Revert Revert Temp rate limit for arwiki due to mass vandalism (duration: 01m 18s)
Recovery
- 13:52 marostegui@#wikimedia-operations: connections are decreasing on db1094
- 13:52 __joe__@#wikimedia-operations: yes, queues on the appservers are vanishing
- 13:52 volans@#wikimedia-operations: 500s goind down
- 14:11 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
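To put the timeline in perspective, the intervals between the key events can be worked out from the timestamps above. The following is a minimal Python sketch and not part of the original report: the event names and the helpers `minutes_between` and `summarize` are hypothetical labels chosen for illustration, and it assumes all timestamps fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the timeline above (22 May 2018).
# The keys are illustrative labels, not terms used in the report.
EVENTS = {
    "onset": "13:31",                 # "started at 13:31"
    "first_report": "13:34",          # first report in #wikimedia-operations
    "alert": "13:43",                 # PROBLEM alert for exceptions and fatals
    "second_revert_synced": "13:51",  # second revert synchronized
    "recovery_signs": "13:52",        # connections and queues decreasing
    "alert_recovery": "14:11",        # RECOVERY alert
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end (both HH:MM, same day)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def summarize(events: dict) -> dict:
    """Map each event other than the onset to its offset from the onset."""
    onset = events["onset"]
    return {
        name: minutes_between(onset, time)
        for name, time in events.items()
        if name != "onset"
    }


if __name__ == "__main__":
    for name, minutes in summarize(EVENTS).items():
        print(f"{name}: +{minutes} min")
```

With the timestamps above, this gives +3 minutes to the first report, +12 minutes to the alert, +21 minutes to the first signs of recovery, and +40 minutes to the alert clearing.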
Conclusions
TODO
Actionables
TODO