Incidents/2023-05-28 wikireplicas lag


document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2023-05-28 wikireplicas lag
Task: T337446
Start: 2023-05-25
End: 2023-06-01
People paged: 0
Responder count: 3
Coordinators: none
Affected metrics/SLOs: WMCS services (bots and tools) were affected, mostly used by the community. No production services were affected.
Impact: Wikireplicas had outdated data and were unavailable for around one week. There were periods where not even old data was available. Tools most likely experienced intermittent unavailability or outdated data from 2023-05-25 to 2023-06-01. Replication of multiple sections (s1, s2, s5, s7) broke, most probably due to a bug in MariaDB 10.4.29. A version downgrade to 10.4.26 was made and all hosts were recloned.
Example tools which rely on the wikireplicas:

Timeline

All times in UTC.

Outage begins

  • 2023-05-25 05:11: Icinga alert fires: <icinga-wm> PROBLEM - MariaDB Replica SQL: s1 on db1154 is CRITICAL; replication lag on db1154, db1155 and the clouddb hosts starts to increase
  • 2023-05-25 05:11: Replication breaks on two different sanitarium hosts (db1154 and db1155): <icinga-wm> PROBLEM - MariaDB Replica SQL: s1 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table enwiki.user_properties: Cant find record in user_properties, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1196-bin.001099, end_log_pos 654625806
  • 2023-05-25 05:18: Manuel starts looking into it and says so on IRC
  • 2023-05-25 05:35: Incident opened following the broken-replication alerts. Amir opens task T337446 on Phabricator and tags the relevant teams and individuals. Mitigations are explored, such as injecting a transaction to get replication flowing again, but are discarded as they would compromise data integrity.
  • 2023-05-25 05:47: Manuel tries to fix a row in s5; the host crashes
  • 2023-05-25 05:51: Manuel: the first attempts to manually fix the rows result in the hosts crashing (a sketch of this kind of manual fix follows this list)
  • 2023-05-25 06:13: Fixing the row in s1 also fails; the host crashes
  • 2023-05-25 06:27: The decision to rebuild everything is made and work starts on sanitarium hosts
  • 2023-05-25 14:40: The initial clones for the broken sections are done on the sanitarium hosts and the sanitizing process starts.
  • 2023-05-26 05:10 The sanitarium hosts are still being worked on for s1, s2, s5 and s7
  • 2023-05-26 13:23 Manuel pings btullis on IRC and Phabricator, asking Data Engineering to prioritize the outage
  • 2023-05-26 15:08 Cookbook cookbooks.sre.wikireplicas.update-views run by nskaggs: Started updating wikireplica views
  • 2023-05-26 16:50 Manuel calls for help on IRC from WMCS and other people who own or have owned the replicas, to run the cookbooks responsible for updating views. There is some response but nothing gets done by the end of the day.
  • 2023-05-26 Replication on the s3 section also breaks
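
The manual fixes attempted above follow a common pattern for replication error 1032 (HA_ERR_KEY_NOT_FOUND): re-create the row that the failing event expects to find, then restart the SQL thread. A minimal sketch is below; the row values are hypothetical and would normally be recovered from the failing binlog event. In this incident these inserts crashed the hosts, which is why a full rebuild was chosen instead.

  -- Sketch only: hypothetical values, not the actual row from the incident.
  STOP SLAVE SQL_THREAD;

  -- Re-create the row the Delete_rows event on enwiki.user_properties expects
  -- to find (primary key: up_user, up_property), so the event can apply.
  INSERT INTO enwiki.user_properties (up_user, up_property, up_value)
  VALUES (12345, 'gender', 'unknown');

  START SLAVE SQL_THREAD;
  SHOW SLAVE STATUS\G  -- confirm Slave_SQL_Running: Yes and Last_SQL_Errno: 0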

Replication breaks again

  • 2023-05-27 05:12 replication on s1 section breaks again
  • 2023-05-27 05:16 replication on s5 section breaks again
  • 2023-05-27 05:33 replication on s7 section breaks again
  • 2023-05-27 19:44 Manuel realizes that the fixed sections got broken again (at pretty much the same time)
  • 2023-05-28 06:10 Upgrade MariaDB from 10.4.26 to 10.4.28 (T337446#8885816). Stopping sections.
  • 2023-05-28 06:12 SAL <marostegui> Change innodb_fast_shutdown to 0 on db1154 before downgrading
  • 2023-05-29 05:08 <marostegui> I have had to kill them, it's been more than 24h waiting to stop. Going to downgrade + rebuild
  • 2023-05-29 05:32 recloning db1154:3311 db1154:3313 db1154:3315 db1155:3312 clouddb1021:3317
  • 2023-05-29 07:58 clouddb1021 (s7) has been recloned
  • 2023-05-29 08:08 Started to reclone clouddb1014 and clouddb1018
  • 2023-05-27 / 2023-05-28: More investigation into why the second breakage happened starts on Phabricator, and the decision is made to reclone everything again and downgrade from 10.4.29 to 10.4.26 (a sketch of the pre-downgrade shutdown step follows this list)
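
The downgrade path above relies on a clean, slow InnoDB shutdown (innodb_fast_shutdown = 0, as in the 2023-05-28 06:12 SAL entry) so that the data files can safely be started under a different minor version. A minimal sketch follows, glossing over the package downgrade itself; note that in this incident the shutdown hung for more than 24 hours and the instances had to be killed and recloned anyway.

  -- Sketch only; the exact procedure used during the incident may have differed.
  SET GLOBAL innodb_fast_shutdown = 0;  -- force a full purge/flush on shutdown
  STOP ALL SLAVES;                      -- stop the replication threads first
  SHUTDOWN;                             -- here this waited >24h and was killed

  -- After downgrading the package and restarting the instance:
  SELECT @@version;                     -- expect 10.4.26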

Email sent to wikitech concerning outage

  • 2023-05-29 13:59 Email sent to wikitech-l about outage.
  • 2023-05-29 14:21 s7 is fully recloned
  • 2023-05-29 14:35 clouddb1021:s2 is fully ready with views, grants, users etc
  • 2023-05-29 15:46 clouddb1021:s3 is fully ready with views, grants, users etc
  • 2023-05-29 16:52 An issue with dbproxy1018 and dbproxy1019 (the WMCS wikireplicas proxies) is detected by Valentin: health checks against the unavailable wikireplicas overwhelm PyBal, almost causing a general outage; there is no further context on why this happens. Several steps turn out to be missing from the documentation on rebuilding the cloud DBs; that is being fixed.
  • 2023-05-29 19:10 WMCS is pinged again to help restore some missing tables, indexes and data, because the Data Persistence team did not have enough context to proceed confidently.
  • 2023-05-29 21:59 A response is sent on another thread on the same subject to reassure users that work is ongoing to restore the replicas. Users complained about the lack of access to the meta_p and heartbeat_p databases (a sketch of how tools use these views follows this list).
  • 2023-05-28 22:45: Replication lag in dewiki is mentioned on wikitech-l
  • 2023-05-30 04:38 Manuel detects that s4 has broken too and manually fixes it by inserting the missing row; the host will still be recloned later
  • 2023-05-30 05:19 db1154 s4 fixed by inserting the missing row
  • 2023-05-30 06:33 s5 is fully recloned
  • 2023-05-30 09:21 Arturo fixes indexes/permissions (maintain-meta_p --all-databases --bootstrap) T337446#8887809
  • 2023-05-30 09:58 clouddb1014:3312 is now catching up
  • 2023-05-30 09:59 https://phabricator.wikimedia.org/T337721 is created to investigate the proxy issue affecting pybal
  • 2023-05-30 10:43 clouddb1021:3311 (s1) is fully ready with grants, views, etc. Once it has caught up, the other two s1 hosts will be cloned.
  • 2023-05-30 12:56 The IdleConnection monitor is disabled via https://gerrit.wikimedia.org/r/c/operations/puppet/+/924342
  • 2023-05-30 14:34 Users complain that while the clouddbs are accessible and updated, they are extremely slow to query. T337734 is filed and WMCS is pinged for help. It turns out that the script that adds indexes is broken and WMCS cannot figure out why. Amir starts investigating, fixes the script and starts running it.
  • 2023-05-30 15:17 s2 is fully recloned
  • 2023-05-30 18:44 s3 is fully recloned and it is now catching up (it is 8h behind)
  • 2023-05-31 06:58 Manuel starts recloning s4
  • 2023-05-31 08:10 s4 has been fully recloned, clouddb1019:3314 is now catching up with its master
  • 2023-05-31 17:51 WMCS (Nicholas) checks in and Manuel flags some items that still need attention: help with https://phabricator.wikimedia.org/T337734 and communication with users.
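
For context on the meta_p and heartbeat_p complaints above: tools typically use these small helper databases on the wikireplicas to map a wiki to its section and to check how fresh the replicated data is. A hedged sketch follows, assuming the heartbeat_p.heartbeat and meta_p.wiki views with the column names documented on Wikitech (shard/last_updated/lag and dbname/slice); treat the exact names as illustrative.

  -- Replication freshness per section, as a tool would see it:
  SELECT shard, last_updated, lag
  FROM heartbeat_p.heartbeat;

  -- Which section serves a given wiki (e.g. dewiki, whose lag was reported
  -- on wikitech-l during this incident):
  SELECT dbname, slice
  FROM meta_p.wiki
  WHERE dbname = 'dewiki';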

Databases are restored, but accessing data is very slow for tools

  • 2023-06-01 08:47 Replication caught up on all sections (T337446#8894523). Manuel reduces the task's priority because all the broken sections have been successfully recloned.
  • 2023-06-01 Amir finishes creating indexes using the fixed script (a generic sketch of this kind of index work follows this list).
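
The index work above is performed by a maintenance script on the clouddb hosts. As a purely generic illustration (the index name and column below are hypothetical and not taken from that script), checking for and re-adding such an index looks roughly like this:

  -- Generic sketch with a hypothetical index; not the script's actual definitions.
  SHOW INDEX FROM enwiki.revision;  -- list the indexes currently on a table

  -- Re-add a missing index, for example:
  ALTER TABLE enwiki.revision
    ADD INDEX example_rev_timestamp (rev_timestamp);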

Outage ends

Detection

Replication lag was detected first by Icinga alerts. A Phabricator task was created and triaged.

2023-05-25 05:11:16 <icinga-wm> PROBLEM - MariaDB Replica SQL: s1 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table enwiki.user_properties: Cant find record in user_properties, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1196-bin.001099, end_log_pos 654625806 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
2023-05-25 05:11:44 <icinga-wm> PROBLEM - MariaDB Replica SQL: s2 on db1155 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table plwiki.user_properties: Cant find record in user_properties, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1156-bin.003729, end_log_pos 633898246 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
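
On the affected instance, the broken state behind these alerts can be confirmed directly in MariaDB. A minimal sketch; on the multi-instance sanitarium hosts this is run against the instance serving the broken section (e.g. the s1 instance on db1154):

  SHOW SLAVE STATUS\G
  -- Fields of interest for this incident:
  --   Slave_SQL_Running: No
  --   Last_SQL_Errno:    1032
  --   Last_SQL_Error:    Could not execute Delete_rows_v1 event on table
  --                      enwiki.user_properties ... HA_ERR_KEY_NOT_FOUND
  --   Seconds_Behind_Master: NULL while the SQL thread is stopped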

Broken replication on sanitarium host:

PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.flaggedpage_pending: Cant find record in flaggedpage_pending, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1161-bin.001646, end_log_pos 385492288

Example log of MariaDB crashing when inserting the missing row:

Conclusions

  • MariaDB 10.4.29 may have a replication bug; it was hard or impossible to reproduce
  • The wikireplicas are quite critical for the community even though they are not considered a "production" service

What went well?

  • Database operations such as version downgrades and reclones are well-known exercises and went well.

What went poorly?

  • Database replication broke multiple times on different sections, needing manual troubleshooting and intervention
  • Documentation of wikireplicas setup had missing pieces (like grants?) and is scattered across multiple places
  • Troubleshooting the issue was difficult
  • Databases (db1154) did not shut down properly
  • Recloning a database takes a lot of time

Where did we get lucky?

  • Not all database sections broke
    • We could copy/check grants from those non-broken sections
  • Amir was able to quickly fix the bit-rotted indexing script

Links to relevant documentation

S1 replication lag

Actionables

  • Add more documentation of wikireplicas setup
  • Alerting for wikireplicas lag
  • Discuss SLO for wikireplicas and level of "production", per Leon Ziemba
    • "I wanted to ask something I've genuinely been curious about for years -- since the wiki replicas are relied upon so heavily by the editing communities (and to some degree, readers), should we as an org treat their health with more scrutiny? This of course is insignificant compared to the production replicas going down, but nonetheless the effects were surely felt all across the movement (editathons don't have live tracking, stewards can't query for global contribs, important bots stop working, etc.). I.e. I wonder if there's any appetite to file an incident report, especially if we feel there are lessons to be learned to prevent similar future outages? I noticed other comparatively low-impact incidents have been documented, such as PAWS outages." [1]

Scorecard

Incident Engagement Scorecard
Question | Answer (yes/no) | Notes
People: Were the people responding to this incident sufficiently different than the previous five incidents? | no | At least not for database problems
Were the people who responded prepared enough to respond effectively? | yes | Only for the database-related parts
Were fewer than five people paged? | yes | No page
Were pages routed to the correct sub-team(s)? | no | No page
Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. | no | No page
Process: Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | No incident document
Was a public wikimediastatus.net entry created? | no |
Is there a Phabricator task for the incident? | yes | T337446
Are the documented action items assigned? | no |
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes |
Tooling: To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes |
Were the people responding able to communicate effectively during the incident with the existing tooling? | yes |
Did existing monitoring notify the initial responders? | yes |
Were the engineering tools that were to be used during the incident available and in service? | no | Documentation lacking; the index rebuild script was bit-rotted.
Were the steps taken to mitigate guided by an existing runbook? | no |
Total score (count of all “yes” answers above)