Incidents/2023-02-07 mediawiki.page-undelete event stream

From Wikitech

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2023-02-07 mediawiki.page-undelete event stream
Start: 2022-10-31
End: 2023-02-07
Task: T329064
People paged:
Responder count: 2
Coordinators: Andrew Otto
Affected metrics/SLOs:
Impact: On 2022-10-31, the Event Platform team merged a change to the EventBus extension to produce the new mediawiki.page-change event stream. This change accidentally unregistered the older hook handler that produced the mediawiki.page-undelete stream, so events for that stream stopped being produced.


We do not know every consumer of this stream, and no one noticed the missing events for over three months.

Root Cause

The developer (Andrew Otto) and reviewers did not catch the accidental change to extension.json that unregistered the EventBusHooks:onPageUndelete hook handler.
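
As an illustration of how this kind of accidental removal could be surfaced during review, the sketch below compares the hook handlers registered in two versions of an extension.json file and lists any that disappeared. It is a minimal sketch, not a check that exists in EventBus or in CI: the function names, the hook keys ("PageUndelete", "PageSaveComplete") and the before/after fragments are hypothetical, and it assumes only that handlers are listed under a top-level "Hooks" object.

```python
import json


def registered_handlers(extension_json: str) -> set[tuple[str, str]]:
    """Return (hook, handler) pairs found under the "Hooks" key."""
    hooks = json.loads(extension_json).get("Hooks", {})
    pairs = set()
    for hook, handlers in hooks.items():
        # A hook may map to a single handler or to a list of handlers.
        if not isinstance(handlers, list):
            handlers = [handlers]
        for handler in handlers:
            pairs.add((hook, str(handler)))
    return pairs


def removed_handlers(before: str, after: str) -> list[tuple[str, str]]:
    """List handlers present in `before` but missing from `after`."""
    return sorted(registered_handlers(before) - registered_handlers(after))


# Hypothetical fragments; the hook keys are illustrative only.
BEFORE = json.dumps({"Hooks": {
    "PageUndelete": "EventBusHooks:onPageUndelete",
    "PageSaveComplete": "EventBusHooks:onPageSaveComplete",
}})
AFTER = json.dumps({"Hooks": {
    "PageSaveComplete": "EventBusHooks:onPageSaveComplete",
}})

for hook, handler in removed_handlers(BEFORE, AFTER):
    print(f"handler removed: {hook} -> {handler}")
```

A check along these lines would not decide whether a removal is intended. It would only make the removal explicit so that a reviewer has to acknowledge it.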

Affected Datasets and Services

The main fallout is that WDQS and WCQS will have inconsistencies in their downstream datastores: any wiki pages that were undeleted during this time period will not be available in WDQS. There may be exceptions to this: the WDQS updater is supposed to detect inconsistencies (i.e. getting an edit on a deleted page) and apply some reconciliation, but apparently this system did not work as expected here. The inconsistencies in WDQS will be resolved by a full data reload, which was already in progress.
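
The inconsistency described above (an edit arriving for a page the consumer believes is deleted) can be illustrated with a small sketch. This is not the WDQS updater's implementation: the event shape `(kind, page_id)` and the function name are assumptions made for the example.

```python
def pages_needing_reconciliation(events):
    """Return page IDs that receive an edit while tracked as deleted.

    `events` is an iterable of (kind, page_id) tuples, where kind is
    "delete", "undelete" or "edit". This shape is hypothetical.
    """
    deleted = set()
    suspicious = []
    for kind, page_id in events:
        if kind == "delete":
            deleted.add(page_id)
        elif kind == "undelete":
            deleted.discard(page_id)
        elif kind == "edit" and page_id in deleted:
            # An edit on a "deleted" page implies a missed undelete event.
            suspicious.append(page_id)
            deleted.discard(page_id)  # assume reconciliation restores it
    return suspicious


# Page 42 is undeleted without an undelete event; page 7 behaves normally.
stream = [
    ("delete", 42), ("edit", 42),
    ("delete", 7), ("undelete", 7), ("edit", 7),
]
print(pages_needing_reconciliation(stream))
```

Note that a check like this only fires when a later edit arrives, so a page that is undeleted and never edited again would stay inconsistent.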

There may be other affected services as well. The event.mediawiki_page_undelete table in Hive will be empty for this period. We also expose this stream publicly via stream.wikimedia.org, so any external consumers (Internet Archive?) will also have missed these page undelete events.
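
Because the stream was silent for months without anyone noticing, a simple volume check is one way such a gap could be detected. The sketch below is a hedged illustration, not existing monitoring: it assumes daily event counts have already been obtained from somewhere (for example, by counting rows per day in the Hive table), and the function name, threshold and sample counts are made up.

```python
from datetime import date, timedelta


def silent_periods(daily_counts, start, end, min_days=7):
    """Return (first_day, last_day) runs of at least `min_days` days with no events.

    `daily_counts` maps a date to the number of events seen that day;
    days missing from the mapping are treated as zero.
    """
    gaps = []
    run_start = None
    day = start
    while day <= end:
        if daily_counts.get(day, 0) == 0:
            if run_start is None:
                run_start = day
        else:
            if run_start is not None and (day - run_start).days >= min_days:
                gaps.append((run_start, day - timedelta(days=1)))
            run_start = None
        day += timedelta(days=1)
    # Close a run that is still open at the end of the window.
    if run_start is not None and (end - run_start).days + 1 >= min_days:
        gaps.append((run_start, end))
    return gaps


# Made-up counts: events stop after 2022-10-31 and resume on 2023-02-08.
counts = {date(2022, 10, 30): 5, date(2022, 10, 31): 3, date(2023, 2, 8): 4}
for first, last in silent_periods(counts, date(2022, 10, 30), date(2023, 2, 8)):
    print(f"no events from {first} to {last}")
```

The `min_days` threshold is a design choice: a low-volume stream can legitimately be quiet for a day or two, so the threshold would need tuning per stream.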

Timeline

All times in UTC.

  • 2022-10-31 - Andrew Otto merges a change to the EventBus extension that causes mediawiki.page-undelete events to not be sent. This is deployed over the next week as part of the MediaWiki deployment train.
  • 2023-02-07 - A user reports inconsistencies in WDQS results. David Causse asks Andrew Otto about any known issues with mediawiki.page-undelete.
  • 2023-02-07 - Andrew Otto discovers the mistake, pushes a fix, and has the fix deployed in a backport deploy window.
  • 2023-02-07 - OUTAGE ENDS

Links to relevant documentation


Actionables

Scorecard

Incident Engagement ScoreCard

Each question is followed by its answer (yes/no). The Notes column is empty for every row.

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? Answer: yes
  • Were the people who responded prepared enough to respond effectively? Answer: yes
  • Were fewer than five people paged? Answer: yes
  • Were pages routed to the correct sub-team(s)? Answer: no
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. Answer: no

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? Answer:
  • Was a public wikimediastatus.net entry created? Answer: no
  • Is there a phabricator task for the incident? Answer: no
  • Are the documented action items assigned? Answer: no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? Answer: yes

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. Answer: yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? Answer: yes
  • Did existing monitoring notify the initial responders? Answer: no
  • Were the engineering tools that were to be used during the incident available and in service? Answer: yes
  • Were the steps taken to mitigate guided by an existing runbook? Answer: no

Total score (count of all “yes” answers above):