Incident documentation/20180717-Train

From Wikitech
Jump to navigation Jump to search

Summary

There were several problems with 1.32.0-wmf.13. Tasks are sorted from oldest to newest.

  • phab:T199495 Allow $input in LogEventsListGetExtraInputs hook, deprecate it
  • phab:T199504 Editing of content model other than wikitext fails
  • phab:T199763 Special:Log UnexpectedValueException after changes in HTMLTitleTextField and MapCacheLRU
  • phab:T199856 Special:Log form should be using GET not POST
  • phab:T199941 Fatal MWException in Babel: "Language::isValidBuiltInCode must be passed a string"
  • phab:T199983 Wikidata showing wrong language for page elements
  • phab:T200026 RepoGroup exceptions due to "false" being passed as a key to MapCacheLRU
  • phab:T200072 MCR broke Undelete action (and made Cognate tests fail) "The given Title does not belong to page ID" / phab:T200269 Unable to undelete revision (Fatal error: given Title does not belong to page ID, RevisionStoreRecord)
  • phab:T200136 Does not work for change a log type drop down when the log type specified by URL / argument @ 1.32.0-wmf.13 (360f7b5)

Timeline

Events are sorted from oldest to newest. Times are UTC.

2018-07-13 Friday

  • 💣 20:31 Jdforrester-WMF added a subtask: T199504: Editing of content model other than wikitext fails.

2018-07-17 Tuesday

  • 💣 05:40 Krinkle added a subtask: T199763: Special:Log UnexpectedValueException after changes in HTMLTitleTextField and MapCacheLRU.
  • ✅ 05:45 Jdforrester-WMF closed subtask T199504: Editing of content model other than wikitext fails as Resolved.
  • 🚂 wmf.13→0 14:47 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.13
  • 💣 19:10 matmarex added a subtask: T199495: Allow $input in LogEventsListGetExtraInputs hook, deprecate it.
  • ✅ 23:51 Krinkle removed a subtask: T199763: Special:Log UnexpectedValueException after changes in HTMLTitleTextField and MapCacheLRU.

2018-07-18 Wednesday

  • ✅ 00:10 matmarex closed subtask T199495: Allow $input in LogEventsListGetExtraInputs hook, deprecate it as Resolved.
  • 🚂 wmf.13→1 13:13 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.13
  • 💣 17:49 Krinkle added a subtask: T199856: Special:Log form should be using GET not POST.
  • 💣 19:51 Krinkle added a subtask: T199941: Fatal MWException in Babel: "Language::isValidBuiltInCode must be passed a string".

2018-07-19 Thursday

  • ✅ 14:59 matmarex removed a subtask: T199856: Special:Log form should be using GET not POST.
  • 💣 16:20 aaron created subtask T200026: RepoGroup exceptions due to "false" being passed as a key to MapCacheLRU

2018-07-20 Friday

  • ✅ 00:33 Krinkle closed subtask T200026: RepoGroup exceptions due to "false" being passed as a key to MapCacheLRU as Resolved.
  • 💣 14:23 Reedy added a subtask: T199983: Wikidata showing wrong language for page elements.

2018-07-21 Saturday

  • 💣 20:38 Rxy added a subtask: T200136: Does not work for change a log type drop down when the log type specified by URL / argument @ 1.32.0-wmf.13 (360f7b5).

2018-07-24 Tuesday

  • ✅ 01:41 Prtksxna closed subtask T200136: Does not work for change a log type drop down when the log type specified by URL / argument @ 1.32.0-wmf.13 (360f7b5) as Resolved.
  • ✅ 10:08 zeljkofilipin removed a subtask: T199983: Wikidata showing wrong language for page elements.
  • ✅ 13:03 thcipriani closed subtask T199941: Fatal MWException in Babel: "Language::isValidBuiltInCode must be passed a string" as Resolved.
  • 💣 13:47 Krinkle added a subtask: T200269: Unable to undelete revision (Fatal error: given Title does not belong to page ID, RevisionStoreRecord).
  • 💣 15:33 greg added a subtask: T200072: MCR broke Undelete action (and made Cognate tests fail) "The given Title does not belong to page ID".
  • ✅ 15:34 greg closed T200269 Unable to undelete revision (Fatal error: given Title does not belong to page ID, RevisionStoreRecord) as a duplicate of T200072: MCR broke Undelete action (and made Cognate tests fail) "The given Title does not belong to page ID".
  • ✅ 17:03 zeljkofilipin removed a subtask: T200072: MCR broke Undelete action (and made Cognate tests fail) "The given Title does not belong to page ID".
  • 🚂 wmf.13→2 17:33 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.13

Conclusions

What weakness did we learn about and how can we address them?

  • This was the second ever train during European working hours, and the first that had problems. It took more time than usual to move the train forward because the train conductor would stop working just as people in the US would start working, delaying everything until the next day. Moving cutting the branch and deploying to group 0 on Monday (instead of Tuesday, a day earlier than now) might help, because it would add another day to resolve problems.
  • Production is the integration environment. Some problems were not caught earlier because production is the only place where everything is finally integrated.
  • When there was a problem, it was not clear who should fix it. It was not hard, and the right people/teams were found in a timely manner, but there is room for improvement. Developers/Maintainers page was a lot of help, but it is far from complete. Maybe something like T190891 (Develop canonical/single record of origin, machine readable list of all repos deployed to WMF sites) would help.
  • This train overlapped with Wikimania 2018 and as a result there was limited developer availability.
    • It's not as simple as "availability": all the participants in T199941 were at Wikimania which didn't prevent them from contributing. But they were distracted enough that resolution was delayed because communication was disrupted.
  • 2 problems were noticed before deploying 1.32.0-wmf.13 to group 0.
  • 1 problem was noticed after deploying 1.32.0-wmf.13 to group 0.
  • 6 problems were noticed after deploying 1.32.0-wmf.13 to group 1.

Before wmf.13 → group0

  • T199504 Editing of content model other than wikitext fails
    • Daniel Kinzler reverted a bad patch by Adam Wight, in gerrit:445658. This is resolved now.
    • The patch changed most of the conditionals in EditPage, a notoriously complex and fragile legacy class.
    • Switching to strict equality created new possibilities where false, empty string, and null values were not properly managed.
    • More analysis will be done in Task T202913, where AWight will write the tests that were missing from EditPage.
  • T199763 Special:Log UnexpectedValueException after changes in HTMLTitleTextField and MapCacheLRU
    • Underway In progress Feedback needed from Prateek Saxena (Audiences Design), Timo Tijhof (Performance), Bartosz Dziewoński (Contributors).
    • A change by Aaron in low-level code in the Title class (445160) broke an implicit assumption in HTMLTitleTextField that has been there for years.
    • For a low-level change like that, it's practically impossible how it will affect everything else, you can only hope that it won't.
    • Potentially could have been avoided by a unit test for HTMLTitleTextField::validate (although the failing scenario is not obvious to check).

wmf.13 → group0

  • T199495 Allow $input in LogEventsListGetExtraInputs hook, deprecate it
    • Underway In progress Feedback needed from Prateek Saxena (Audiences Design).
    • This is basically the same situation as T199856 below, and caused by the same patch.

wmf.13 → group1

  • T199856 Special:Log form should be using GET not POST
    • Underway In progress Feedback needed from Prateek Saxena (Audiences Design), Timo Tijhof (Performance), Bartosz Dziewoński (Contributors).
    • This was a simple bug that should have been caught in code review of 428871. (Bartosz did notice it in a late post-merge review.)
    • Perhaps we could have been more careful, or waited for more people to review it.
    • Very impractical to check for this in unit tests. Theoretically there could be unit tests to check that the HTML output on Special:Log is as expected, but:
      • Tests like that are extremely annoying (constantly broken due to harmless changes to the HTML structure).
      • After a total rewrite of the page, like we had here, the test would probably have been incorrectly updated anyway by copy-paste.
    • The patch caused several issues which were fixed separately (also T199495, T200136); maybe it should have been reverted.
  • T199941 Fatal MWException in Babel: "Language::isValidBuiltInCode must be passed a string"
    • Started after gerrit:442200 got merged: this caused LanguageCode::bcp47('nrm') == 'nrf', among other normalizations.
    • As Krinkle noted on gerrit:446766, the fatal exception came from mGenerateContent calling LanguageBabelBox::render() for a LanguageBabelBox instance having an invalid language code that Language::factory rejects.
      • In mParseParameter(): BabelLanguageCodes::getCode('nrm') == 'nrm' ("Narom")
      • In LanguageBabelBox::__construct(): bcp47('nrm') == 'nrf' ("Norman")
      • In LanguageBabelBox::render(): getCode('nrf') == false (not in LanguageCode::deprecatedLanguageCodeMapping() or MediaWiki\Languages\Data\Names)
      • In LanguageBabelBox::render(): Language:getFactory(false) => CRASH
    • gerrit:446766 was proposed as fix (it would ensure getCode('nrf') == 'nrf', preventing the crash)
    • Due to confusion about who could issue a C+2 ("Good general direction, but Niklas is the expert who'll care about the details like that.") and how the patch worked ("It's not clear to me how this patch addresses the fatal exception.") it was decided to revert the changes to bcp47 in core instead of fixing Babel: gerrit:447469/gerrit:447470
    • Revert was merged 2018-07-23T21:53:44Z and stopped the fatal exceptions.
    • Some observations:
      • Fix delayed by confusion of responsibilities and unclear communication during Wikimania.
      • Babel's test suite doesn't appear to exercise all possible valid language code parameters which could be passed to mGenerateContent
      • Implicit assumption was made that getCode(bcp47(getCode(x))) is non-false if getCode(x) is non-false.
  • T199983 Wikidata showing wrong language for page elements
    • Yes Done Feedback needed from Wikidata team (Wikimedia Deutschland).
    • Started after gerrit:444950 got merged (follow up to MCR-related changes in mediawiki)
    • With the above change, parser cache was no longer split by user language as expected for item pages
    • phab:T201170 was merged as a good enough quick fix (it does not do parser cache split but prevents caching at all for cases when the user language is different than the language defined in parser options object
    • Follow up for adjusting the underlying behaviour "the right way": phab:T201170 (not yet resolved as of 2018-08-09)
    • WMDE has done the internal post-mortem about this particular incident with outcome being:
      • More effort should be put into automated testing of parser side-effect behaviour that affects Wikibase/Wikidata
      • Wikidata developers at WMDE should gather more knowledge about and be more involved into review of changes in mediawiki that have impact on wikibase
  • T200026 RepoGroup exceptions due to "false" being passed as a key to MapCacheLRU
    • N Not done Feedback needed from Aaron Schulz (Performance).
  • T200072 MCR broke Undelete action (and made Cognate tests fail) "The given Title does not belong to page ID"
    • Bug was noticed in a failing unit test Friday, July 20, and a fix submitted on Monday, July 23. It wasn't realized this might also affect production.
    • Once it was found on group 0 on Tuesday, the patch was reasonably quickly reviewed and backported.
  • T200136 Does not work for change a log type drop down when the log type specified by URL / argument @ 1.32.0-wmf.13 (360f7b5)
    • Underway In progress Feedback needed from Prateek Saxena (Audiences Design), Bartosz Dziewoński (Contributors).
    • This is basically the same situation as T199856 above, and caused by the same patch.

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (cookbook / runbook). If that documentation does not exist, there should be an action item to create it.

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.

  • phab:T199495 Allow $input in LogEventsListGetExtraInputs hook, deprecate it
    • TODO: how a similar problem could be prevented? N Not done
  • phab:T199504 Editing of content model other than wikitext fails
    • TODO: how a similar problem could be prevented? N Not done
  • phab:T199763 Special:Log UnexpectedValueException after changes in HTMLTitleTextField and MapCacheLRU
    • Should updates to complex Special pages (like Special:Log) be review by at least two people (who?) before being merged? Not sure if its required, just asking. —Prtksxna (talk) 02:52, 2 August 2018 (UTC)
  • phab:T199856 Special:Log form should be using GET not POST
    • Should updates to complex Special pages (like Special:Log) be review by at least two people (who?) before being merged? Not sure if its required, just asking. —Prtksxna (talk) 02:52, 2 August 2018 (UTC)
  • phab:T199941 Fatal MWException in Babel: "Language::isValidBuiltInCode must be passed a string"
  • phab:T199983 Wikidata showing wrong language for page elements
    • phab:T201170 Allow wikibase to render entities for an arbitrary target language, instead of the user's UI language
  • phab:T200026 RepoGroup exceptions due to "false" being passed as a key to MapCacheLRU
    • TODO: how a similar problem could be prevented? N Not done
  • phab:T200072 MCR broke Undelete action (and made Cognate tests fail) "The given Title does not belong to page ID" / phab:T200269 Unable to undelete revision (Fatal error: given Title does not belong to page ID, RevisionStoreRecord)
    • TODO: how a similar problem could be prevented?
      • Even more unit tests, somehow. There were tests covering this code, but they didn't (and still don't) test the specific case that caused the problem here.
  • phab:T200136 Does not work for change a log type drop down when the log type specified by URL / argument @ 1.32.0-wmf.13 (360f7b5)
    • TODO: how a similar problem could be prevented? N Not done
  • I really don't think deployments during Wikimania are wise. Even if we *think* we have adequate coverage, with the travel and distractions usual lines of communication are disrupted. C. Scott Ananian (talk) 16:42, 10 August 2018 (UTC)
    Agreed. Legoktm (talk) 17:14, 22 August 2018 (UTC)