Incident documentation/20140206-Math

From Wikitech
Jump to: navigation, search

Summary

https://gerrit.wikimedia.org/r/#/c/104991/ changed the parser cache keys for pages with <math> in them, causing a spike in cache misses and thus the cluster fell over.

This has been slowly rolling out on small wikis, mostly unnoticed since math isn't widely used there. Rolling out today to larger wikis (dewiki, etc) caused the cache stampede to be more obvious and cause downtime. Reverting the change didn't work because of incompatibilities between core + the extension, but was ok because we had mostly gotten through the invalidation before the roll back.

This would've been a problem if we weren't having fatals, we would've started invalidating to the old version again. We got lucky. Going back to new version caused a little more invalidations, but seems reasonable and should level off soon probably.

Timeline

1.23wmf13 was branched. Checked out onto tin before 15:00 UTC
16:00 logmsgbot: reedy started scap: testwiki to 1.23wmf13 and build l10n cache
16:04 logmsgbot: reedy started scap: testwiki to 1.23wmf13 and build l10n cache
Both scaps failed, no errors reported to IRC. Same issue happened on old scap, https://bugzilla.wikimedia.org/show_bug.cgi?id=51174 updated
Bug for error handling (and notification) opened as https://bugzilla.wikimedia.org/show_bug.cgi?id=60967
cp wmf-config/ExtensionMessages-1.23wmf12.php wmf-config/ExtensionMessages-1.23wmf13.php
16:07 logmsgbot: reedy started scap: testwiki to 1.23wmf13 and build l10n cache
// Apache criticals started
Stuff broke.
[16:29:03] <icinga-wm> PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
16:41 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: # Forcing mediawikiwiki back to 1.23wmf12
[16:41:32] <paravoid> what's with all the texvc stuff?
[16:41:35] <Reedy> That's running, but I'm not sure it's at fault
[16:41:55] <paravoid> root@mw1051:~# ps aux|grep -c texvc
[16:41:55] <paravoid> 67
[16:46:43] <icinga-wm> RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.076 second response time // Last apache recovery
16:48 Nemo_bis: #Wikipedia and #Wikimedia wikis have been partly down for about 10 min, now recovered/ing
[16:48:25] <Reedy> 2014-02-06 16:48:10 mw1174 mediawikiwiki: [627387b9] /wiki/Special:CentralAutoLogin/createSession?token=e091e533ecb821612b11518a7d4bfbfb&type=1x1&from=enwiki&proto=https   Exception from line 468 of /usr/local/apache/common-local/php-1.23wmf13/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php.
[16:48:26] <Reedy> Again
[16:48:28] <paravoid> Reedy: can you explain what was supposed to happen and what happened?
[16:48:39] <Reedy> wtf is putting mediawikiwiki on wmf13
[16:49:10] <paravoid> are you aware scap was replaced (I think) yesterday?
[16:49:41] <Nemo_bis> paravoid: 17.34 < Reedy> There's nothing changed at all for mediawkiwiki
[16:49:42] <Reedy> reedy@tin:/usr/local/apache/common$ grep mediawikiwiki wikiversions.dat
[16:49:42] <Reedy> mediawikiwiki php-1.23wmf12 *
[16:50:03] <Reedy> scap doesn't change mediawikiwiki versions though
[16:50:10] <Reedy> well, it doesn't modify wikiversions.dat
[16:50:48] <logmsgbot> !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:# Forcing mediawikiwiki back to 1.23wmf12 (again)
[16:50:54] <greg-g> new scap may misinterpret though?
[16:50:57] <morebots> Logged the message, Master
[16:51:01] <paravoid> that's a lot of math generation btw
[16:51:03] <Reedy> misinterpret what?
[16:51:08] --> shadhu (75d356ec@gateway/web/freenode/ip.117.211.86.236) has joined #wikimedia-operations
[16:51:19] <greg-g> the dat
[16:51:22] <Reedy> No
[16:51:24] -*- greg-g shrugs
[16:51:24] <Reedy> 1.23wmf12 doesn't magically become 1.23wmf13
[16:51:52] <Reedy> the scap source on tin doesn't have mediawikiwiki set to 1.23wmf13 either
17:17 logmsgbot: reedy finished scap: testwiki to 1.23wmf13 and build l10n cache (duration: 69m 51s)
17:18 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.23wmf12 till window
17:42 Reedy: mw1165 is segfaulting a lot
17:45 paravoid: depooling mw1165
18:36 logmsgbot: demon synchronized wmf-config/CirrusSearch-labs.php 'Enabled interwiki searches for beta -- no-op in prod'
19:13 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.23wmf12
19:28 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.23wmf13
19:31 logmsgbot: reedy started scap: run 2, should be a noop
[19:33:44] <Reedy> ctrl + c'd
[19:35:08] <ori> texvccheck
[19:35:14] <ori> /bin/bash /usr/local/apache/common-local/php-1.23wmf12/includes/limit.sh /usr/local/apache/uncommon/bin/texvccheck 'W^{k,p}(\Omega)' MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50; MW_CGROUP='/sys/fs/cgroup/memory/mediawiki/job'; MW_MEM_LIMIT=307200; MW_FILE_SIZE_LIMIT=102400; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes
[19:35:18] <Reedy> http://p.defau.lt/?ttkQySf5OkEgGNzsSNV45g // Shell log of "run 2, should be a noop"
19:45 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Revert earlier version changes // Reverts wikipedias 1.23wmf11->1.23wmf12 and group0 1.23wmf12->1.23wmf13
Stuff (e.g. pages that heavily use wikidata) worked fine on enwiki / wmf12 before outage (aude notes), at 19:21:35 UTC
apache procs on mw1162: http://p.defau.lt/?byo0WsKNCeD5KIRbPHQ04g
texvc log volume increase: https://logstash.wikimedia.org/index.html#dashboard/temp/Q2lRL7sHR_y5yvxiYLQGSA
https://www.mediawiki.org/w/index.php?title=MediaWiki_1.23/wmf12/Changelog#Math
Rate of texvc log messages went up dematically starting at approximately 16:17 and ending at 16:56. Pattern repeated from 19:13 to 20:04

Conclusions

  • We really need to process through the backlog of Math extension changesets from physikerwelt who's done great work on the extension but is lacking review.
  • Automated browser testing in beta labs by QA identified two problems, one with new dependencies for Math and the other with failure to update ParserOptions for the new changes to Math, either one of which would have caused an outage in production. In hindsight, these test failures should have triggered closer review for changes not discernable in the UI in beta labs, such as the issue that actually did cause the outage. (These bugs are marked as duplicate but they are reporting two separate issues with changes to Math https://bugzilla.wikimedia.org/show_bug.cgi?id=60486 https://bugzilla.wikimedia.org/show_bug.cgi?id=60546)

Actionables

  • Status:    Done - wrap Math stuff in PoolCounter so it doesn't kill apaches so easily. More review on recent changes to Math. Be careful in rolling this release out further.
  • Status:    Not done - Let's get better at reviewing the Math extension
    • need client side knowledge, and caching
    • Brion Vibber?
    • Greg to ping Brion - (done)
  • Status:    Not done - implement true code deployment pipleine (so that all code spends a comparable amount of time in testing/beta cluster before hitting production)