Incident documentation/20170814-Train

From Wikitech

Summary

The train for 1.30.0-wmf.14 (week of 2017-08-14) was rolled back on Wednesday, after it had gone to group1, because of Task T173462 ("Cannot flush pre-lock snapshot because writes are pending"). On Thursday morning, database lag caused by a problem in Wikidata 1.30.0-wmf.12 (a submodule of MW core 1.30.0-wmf.13), tracked as Task T164173 ("Cache invalidations coming from the JobQueue are causing lag on several wikis"), made it urgent to roll group1 forward to 1.30.0-wmf.14 despite its known problems.

Timeline

  • 2017-08-15 WMF holiday, shortened train schedule starting Wednesday 2017-08-16
  • 2017-08-16 Several tasks were blocking 1.30.0-wmf.14. On reading, all of them appeared to relate to a new Wikidata release, which should be a submodule of the new release. Commented on the tasks (Task T172394#3528232, Task T172320#3528235, Task T172394#3528232)
  • 2017-08-16 19:35:13 <logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.14
  • 2017-08-16 21:10 Noticed a slow but steady increase in "Cannot flush pre-lock snapshot because writes are pending" errors in logstash. Filed as Task T173462
  • 2017-08-16 21:21:28 <logmsgbot> !log thcipriani@tin Synchronized php: revert group1 wikis to 1.30.0-wmf.14 for T173462 (duration: 00m 47s)
  • 2017-08-16 21:21:54 Sent email to engineering-l, wikitech-l: https://lists.wikimedia.org/pipermail/engineering/2017-August/000457.html
  • 2017-08-17 15:43:26 <marostegui> Actually it is not that, it was our friend: https://phabricator.wikimedia.org/T164173
  • 2017-08-17 16:10:20 <Goatification> jynus: regarding https://phabricator.wikimedia.org/T164173 I'm around to work on it (I'm Amir1), but it's outside of Wikidata team because the fix is merged and not deployed because of https://phabricator.wikimedia.org/T173462
  • 2017-08-17 16:19:14 <thcipriani> sigh. I'm not sure what the user impact is for https://phabricator.wikimedia.org/T173462 but it sounds like the user impact from halting the train may outsize it?
  • 2017-08-17 16:27:11 <greg-g> anyone else agree we need aaron's input?
  • 2017-08-17 16:33:58 <thcipriani> so my understanding is that https://phabricator.wikimedia.org/T164173 has a fix that is in wmf.14 (just from reading that ticket) the rollout of which is blocked on https://phabricator.wikimedia.org/T173462 which AaronSchulz has a patch for
  • 2017-08-17 16:37:42 <AaronSchulz> thcipriani: I did some quick local testing and un-WIP'ed it
  • 2017-08-17 16:51:30 <logmsgbot> !log thcipriani@tin Synchronized php-1.30.0-wmf.14/includes/jobqueue/jobs/RefreshLinksJob.php: Avoid lock acquisition errors for multi-title refreshlinks jobs T173462 (duration: 00m 51s)
  • 2017-08-17 16:54:53 <logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis back to wmf.14 now for T164173
  • 2017-08-17 16:57:32 <thcipriani> AaronSchulz: Goatification hrm now after rolling forward I'm seeing a lot of error: Stack overflow in /srv/mediawiki/php-1.30.0-wmf.14/includes/libs/objectcache/WANObjectCache.php on line 251 and error: Stack overflow in /srv/mediawiki/php-1.30.0-wmf.14/includes/libs/objectcache/MemcachedBagOStuff.php on line 182
  • Created Task T173520
  • 2017-08-17 17:20:06 <logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis back to wmf.13 now T173520
  • 2017-08-17 17:53:17 <AaronSchulz> thcipriani: I'm betting on b48f361d7d606eff5ab48cc2a64c1cae4e794c84
  • 2017-08-17 18:22:30 <thcipriani> AaronSchulz: I think this is everything, but it's definitely a ton: https://gerrit.wikimedia.org/r/#/c/372427/
  • 2017-08-17 19:07:41 <logmsgbot> !log thcipriani@tin Finished scap: ProofReadPage Revert to db7507246665e69384c1d92af2aedc62263a5116 T173520 (duration: 06m 13s)
  • 2017-08-17 19:12:13 <logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to wmf.14
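
The 21:10 entry above describes spotting a slow but steady increase of one error message in logstash. A minimal sketch of that kind of check follows, assuming log events are already available as (timestamp, message) pairs. The function names, the 10-minute bucket size, and the "strictly increasing" rule are illustrative assumptions, not part of the tooling used during the incident.

```python
from collections import Counter
from datetime import datetime


def count_per_bucket(events, needle, bucket_minutes=10):
    """Count messages containing `needle`, grouped into fixed time buckets.

    `events` is an iterable of (datetime, str) pairs (hypothetical input shape).
    `bucket_minutes` is assumed to divide 60 evenly.
    Buckets with no matching message are simply absent, not counted as zero.
    """
    counts = Counter()
    for timestamp, message in events:
        if needle in message:
            # Round the timestamp down to the start of its bucket.
            minute = timestamp.minute - timestamp.minute % bucket_minutes
            bucket = timestamp.replace(minute=minute, second=0, microsecond=0)
            counts[bucket] += 1
    return sorted(counts.items())


def is_steadily_increasing(bucket_counts, min_buckets=3):
    """Return True if every bucket has more matches than the one before it."""
    values = [count for _, count in bucket_counts]
    if len(values) < min_buckets:
        return False  # Too little data to call it a trend.
    return all(later > earlier for earlier, later in zip(values, values[1:]))
```

Passing the error text from T173462 as `needle` and feeding the result of `count_per_bucket` into `is_steadily_increasing` gives a rough yes/no signal. A real check would likely tolerate noise instead of requiring a strict increase in every bucket.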

Conclusions

  • We put the train in a position where the previous version had serious problems and the new version had different problems of its own
  • The Wikidata build process made it hard to reason about backporting fixes
  • The deployment process requires a lot of people to be around to fix things
  • Patches related to change propagation (core and Wikibase) should be tested locally on pages with multiple backlinks, using edits that actually change some of the links, property, or other tracking tables
  • Change propagation is complex and involves multiple wikis and manual testing of patches; it might be worth investigating a more automated approach

Actionables