Incident documentation/20180724-Train

Summary

There were several problems with 1.32.0-wmf.14. Tasks are sorted from oldest to newest.

  • T200257 `scap sync` fails with `Error: You are missing some external dependencies.`
  • T200340 Wikibase\DataModel\Entity\EntityIdParsingException $serialization must not be an empty string
  • T200346 wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure"
  • T200412 PageTriage requires ORES to be installed
  • T200420 Wikidata dispatching stuck (not releasing lockmanager locks)
  • T200456 MapCacheLRU::has called with invalid key. Must be string or integer.

Timeline

Events are sorted from newest to oldest. Times are UTC.

2018-07-30 Monday

  • 🚂 wmf.14→2 13:58 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.14

2018-07-26 Thursday

  • ✅ 21:34 Tgr closed subtask T200456: MapCacheLRU::has called with invalid key. Must be string or integer. as Resolved.
  • 🚂 wmf.14→1 18:19 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: Revert "all wikis to 1.32.0-wmf.14"
  • 💣 18:16 zeljkofilipin added a subtask: T200456: MapCacheLRU::has called with invalid key. Must be string or integer.
  • 🚂 wmf.14→2 18:13 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.14
  • ✅ 17:01 zeljkofilipin removed a subtask: T200420: Wikidata dispatching stuck (not releasing lockmanager locks).
  • 💣 13:18 zeljkofilipin added a subtask: T200420: Wikidata dispatching stuck (not releasing lockmanager locks).
  • 🚂 wikidatawiki>wmf.13 12:38 <reedy@deploy1001> rebuilt and synchronized wikiversions files: wikidatawiki back to .13 T200420
  • ✅ 10:51 zeljkofilipin closed subtask T200412: PageTriage requires ORES to be installed as Resolved.
  • 🚂 wmf.14→1 10:45 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.14
  • 🚂 wmf.14→0 10:08 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.32.0-wmf.14"
  • 💣 10:00 zeljkofilipin added a subtask: T200412: PageTriage requires ORES to be installed.
  • 🚂 wmf.14→1 09:49 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.14

2018-07-25 Wednesday

  • ✅ 20:13 Krinkle closed subtask T200346: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" as Resolved.
  • ✅ 17:09 Krinkle removed a subtask: T200340: Wikibase\DataModel\Entity\EntityIdParsingException $serialization must not be an empty string.
  • 💣 15:15 Krinkle added a subtask: T200346: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure".
  • 🚂 wmf.14→0 14:39 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: (no justification provided) (Revert "group1 wikis to 1.32.0-wmf.14")
  • 💣 14:28 zeljkofilipin added a subtask: T200340: Wikibase\DataModel\Entity\EntityIdParsingException $serialization must not be an empty string.
  • 🚂 wmf.14→1 13:59 <zfilipin@deploy1001> rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.14

2018-07-24 Tuesday

  • ✅ 12:50 thcipriani closed subtask T200257: `scap sync` fails with `Error: You are missing some external dependencies.` as Resolved.
  • 💣 12:04 zeljkofilipin added a subtask: T200257: `scap sync` fails with `Error: You are missing some external dependencies.`

Conclusions

What weaknesses did we learn about and how can we address them?

  • Scap should perform canary checks for sync-wikiversions.
  • 1 problem was caused by train conductor inexperience, before deploying 1.32.0-wmf.14 to group 0.
  • 4 problems were noticed after deploying 1.32.0-wmf.14 to group 1.
  • 1 problem was noticed after deploying 1.32.0-wmf.14 to group 2.
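The first conclusion above (canary checks for sync-wikiversions) can be illustrated with a minimal sketch. This is not Scap's implementation: the function names, the `(errors, requests)` tuples, and the `max_ratio` and `min_rate` thresholds are all hypothetical, chosen only to show the idea of comparing the canary error rate against a baseline before syncing everywhere.

```python
# Hypothetical sketch of a canary gate; none of these names come from Scap.

def error_rate(errors, requests):
    """Fraction of requests that produced an error (0.0 when there was no traffic)."""
    return errors / requests if requests else 0.0


def canary_passes(baseline, canary, max_ratio=2.0, min_rate=0.001):
    """Decide whether the canary hosts look healthy.

    baseline and canary are (errors, requests) tuples. The thresholds are
    assumptions: rates at or below min_rate always pass, otherwise the canary
    rate may be at most max_ratio times the baseline rate.
    """
    base = error_rate(*baseline)
    new = error_rate(*canary)
    if new <= min_rate:
        return True
    return new <= base * max_ratio


def sync_wikiversions(baseline, canary, sync_all):
    """Run sync_all() only when the canary check passes."""
    if not canary_passes(baseline, canary):
        return "aborted"
    sync_all()
    return "synced"
```

With a gate of this kind, a version that produces a burst of new errors on the canaries is stopped before it reaches all wikis, instead of being reverted afterwards.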

Before wmf.14 → group0

wmf.14 → group1

  • T200340 Wikibase\DataModel\Entity\EntityIdParsingException $serialization must not be an empty string
    • Done Feedback needed from Adam Shorland (Wikimedia Deutschland).
  • T200346 wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure"
    • Done Feedback needed from Gergő Tisza (Readers), Ian Marlier (Performance), Brad Jorsch (MediaWiki Platform), Timo Tijhof (Performance).
    • This was not a new error, but rather an error being reported incorrectly. A change unrelated to the ThumbnailRender job itself caused MWHttpRequest to return an HTTP status of 0 instead of an HTTP status of 200. ThumbnailRender was configured to treat a status of 200 as successful, but not a status of 0, and so it logged an error message. Realistically this should not have stopped the train, but it took investigation to establish that. The actual remediation is the work that BPirkle is doing in phab:T202110 and related tasks.
  • T200412 PageTriage requires ORES to be installed
    • Done Feedback needed from Amir Sarabadani (Wikimedia Deutschland), Adam Wight (Scoring Platform), Stephane Bisson (Contributors).
    • Done It could have been prevented transparently by softening the dependency (done in 448098), and could have been mitigated manually by knowing that it was necessary to enable ORES.
  • T200420 Wikidata dispatching stuck (not releasing lockmanager locks)
    • Done Feedback needed from Adam Shorland (Wikimedia Deutschland).
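The T200346 explanation above (a status of 0 reported where 200 was expected) can be sketched as follows. This is an illustration only, not the MediaWiki code: `SUCCESS_STATUSES`, `classify_render_result` and `check_thumbnail_render` are hypothetical names, and the plain list stands in for a logger.

```python
# Hypothetical sketch; the real job and MWHttpRequest are PHP and differ in detail.

# Assumption: the job only treats HTTP 200 as a successful render request.
SUCCESS_STATUSES = {200}


def classify_render_result(status_code):
    """Return "ok" when the status counts as success, otherwise "error"."""
    return "ok" if status_code in SUCCESS_STATUSES else "error"


def check_thumbnail_render(status_code, log):
    """Classify the response and append an error message to log on failure."""
    result = classify_render_result(status_code)
    if result == "error":
        log.append("ThumbnailRenderJob::run: HTTP request failure")
    return result


log = []
check_thumbnail_render(200, log)  # treated as success, nothing logged
check_thumbnail_render(0, log)    # status 0 is not in the success set, so an error is logged
```

Under these assumptions, any change that makes the HTTP layer report 0 for a request that used to report 200 produces an error log entry even though the job itself is unchanged, which matches the description above.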

wmf.14 → group2

  • T200456 MapCacheLRU::has called with invalid key. Must be string or integer
    • Done Feedback needed from Gergő Tisza (Readers), Aaron Schulz (Performance).

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (cookbook / runbook)? If that documentation does not exist, there should be an action item to create it.

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.

Feedback from various teams is needed on how each problem could have been prevented:

  • (Release Engineering) phab:T200257 `scap sync` fails with `Error: You are missing some external dependencies.`
    • Done No further action needed.
  • (Wikidata) phab:T200340 EntityIdParsingException $serialization must not be an empty string
    • Done The fix contains a regression test.
  • (Reading Infrastructure) phab:T200346 Failing to execute ThumbnailRender jobs
  • (ORES/Wikidata) phab:T200412 PageTriage requires ORES to be installed
    • phab:T200944 Detect missing extension dependencies before production
  • (Wikidata) phab:T200420 Wikidata dispatching stuck (not releasing lockmanager locks)
    • gerrit:448103 Use getClientLockName value for releaseClientLock when dispatching
    • Done The above patch fixed the issue.
    • What triggered the dispatching issues is still not clear.
  • (Readers) phab:T200456 MapCacheLRU::has called with invalid key. Must be string or integer
  • (Release Engineering) phab:T198640 Perform scap canary checks after sync-wikiversions
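The title of gerrit:448103 above ("Use getClientLockName value for releaseClientLock when dispatching") suggests that the release call needed to use the same lock name as the acquire call. The sketch below illustrates that class of mismatch only. It is not the Wikibase code: `LockManager`, the lock name format, and both dispatch functions are hypothetical, and it does not explain what triggered the incident.

```python
# Hypothetical illustration of a lock being released under the wrong name.

class LockManager:
    """Minimal in-memory lock manager keyed by lock name."""

    def __init__(self):
        self._held = set()

    def lock(self, name):
        """Acquire the named lock; return False if it is already held."""
        if name in self._held:
            return False
        self._held.add(name)
        return True

    def unlock(self, name):
        """Release the named lock; return False if it was not held."""
        if name in self._held:
            self._held.remove(name)
            return True
        return False


def get_client_lock_name(client_id):
    # Assumed name format, for illustration only.
    return "dispatch-lock:" + client_id


def dispatch_mismatched(manager, client_id):
    """Acquire under the derived name but release under the raw client id."""
    if not manager.lock(get_client_lock_name(client_id)):
        return "skipped"
    manager.unlock(client_id)  # wrong name: the lock acquired above stays held
    return "dispatched"


def dispatch_consistent(manager, client_id):
    """Acquire and release under the same derived name."""
    name = get_client_lock_name(client_id)
    if not manager.lock(name):
        return "skipped"
    try:
        return "dispatched"
    finally:
        manager.unlock(name)
```

In the mismatched variant the first dispatch succeeds and every later dispatch for the same client is skipped because the lock is never released, which is the kind of "stuck" behaviour named in the task title.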