Incidents/2024-09-18 replication

document status: final

Summary

Incident metadata (see Incident Scorecard)
  • Incident ID: 2024-09-18 replication
  • Start: 2024-09-18 21:42:19
  • End: 2024-09-18 21:48:19
  • Task: T375144
  • People paged: 0
  • Responder count: 3
  • Coordinators: SWFrench
  • Affected metrics/SLOs: No relevant SLOs exist
  • Impact: For 6 minutes, replication of the x1 section between eqiad and codfw stopped. It didn't cause a visible number of errors, but stale data from some x1-based services (reading lists, Echo, etc.) could have been served to clients in the Americas for a while.

In anticipation of the datacenter switchover the following week, core databases were reconfigured from one-way replication (eqiad -> codfw) to circular replication, so that when codfw was set as the primary and only active datacenter for testing, data would automatically flow back to eqiad. Sadly, while the cookbook worked perfectly for the STATEMENT-based replication sections, the ROW-based ones hid an unexpected hurdle: when replication in the codfw -> eqiad direction was set up and pt-heartbeat was started, the previously cleaned-up heartbeat rows meant that pt-heartbeat updated an existing row on the codfw servers but could not do so on the eqiad ones, where that row was missing. This causes no issue under STATEMENT-based replication, as pt-heartbeat does a REPLACE (an upsert) and the statement is simply re-executed on the replica; row format, however, only logs row events (a write, or a delete plus a write of a new row), and applying an update to a row that does not exist broke replication, which stopped with a "can't find record" error on the eqiad x1 servers.
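For illustration, a minimal sketch of the failure mode (the real heartbeat table has more columns than shown here, and <codfw-server-id> is a placeholder, not an actual value):

  -- On the codfw side, where binlog_format is ROW and the heartbeat row for
  -- this server still exists, pt-heartbeat's upsert...
  REPLACE INTO heartbeat.heartbeat (server_id, shard, datacenter, ts)
    VALUES (<codfw-server-id>, 'x1', 'codfw', NOW());
  -- ...is written to the binary log as an Update_rows event against that row.
  -- The eqiad replica, where the row had been cleaned up, cannot apply it and
  -- stops with: "Could not execute Update_rows_v1 event on table
  -- heartbeat.heartbeat; Can't find record in 'heartbeat'".
  -- Under STATEMENT format the whole REPLACE would instead be re-executed on
  -- eqiad and would simply insert the missing row.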

At first this went unnoticed due to a gap in monitoring, so replication broke on both the x1 and es6 sections. While handling the issue, an attempt to delete all heartbeat table rows did not fix x1: it only moved the problem to codfw, which is when the user impact started (replication being broken in the codfw -> eqiad direction had not been an issue by itself). This was quickly mitigated by skipping heartbeat replication with SET GLOBAL replicate_wild_ignore_table='heartbeat.%';. Once the right fix was devised (inserting the missing placeholder row on every codfw host individually, with binary logging disabled: set sql_log_bin=0; INSERT INTO heartbeat.heartbeat (server_id, shard, datacenter) VALUES (<eqiad-server-id>, 'x1', 'eqiad');), replication worked again, and an equivalent fix was applied to the es hosts, but in the originally broken direction (inserting on every eqiad server individually): set sql_log_bin=0; INSERT INTO heartbeat.heartbeat (server_id, shard, datacenter) VALUES (<codfw-server-id>, 'es6' [or 'es7'], 'codfw');
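For readability, the same mitigation and fix laid out as they would be run (a sketch using the same placeholders as above; the real heartbeat table has additional columns, and changing replication filters dynamically requires the replica to be stopped, which it already was after the breakage):

  -- Temporary mitigation on the broken replica: skip all heartbeat.* tables.
  STOP SLAVE;
  SET GLOBAL replicate_wild_ignore_table = 'heartbeat.%';
  START SLAVE;

  -- Final fix for x1, run on every codfw host individually, with binary
  -- logging disabled for the session so the insert does not replicate anywhere:
  SET SESSION sql_log_bin = 0;
  INSERT INTO heartbeat.heartbeat (server_id, shard, datacenter)
    VALUES (<eqiad-server-id>, 'x1', 'eqiad');

  -- Equivalent fix for the es sections, run on every eqiad host individually:
  SET SESSION sql_log_bin = 0;
  INSERT INTO heartbeat.heartbeat (server_id, shard, datacenter)
    VALUES (<codfw-server-id>, 'es6', 'codfw');  -- or 'es7' for the es7 hosts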

Once all the intended circular replication channels were set up, the incident was declared closed, and work started on preventing it in the future and fixing the monitoring gaps.

EDIT: After testing an initial fix, it became clear that it would not be a proper solution, as it would only move the issue to the other primary server. After further research, the issue was narrowed down to this change which, together with the procedure of cleaning up the extra pt-heartbeat rows introduced after Orchestrator, affected x1 and the read-write es hosts: those servers use ROW as their binlog format, and pt-heartbeat-wikimedia was no longer overriding it to STATEMENT for its session. This was finally fixed at this deploy.
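A hedged sketch of the kind of session-level override the linked fix restores in pt-heartbeat-wikimedia (the actual implementation is in the linked change and deploy; this only illustrates the idea, with placeholder values):

  -- Executed by the heartbeat-writing connection on x1 and read-write es
  -- primaries, whose server-wide binlog_format is ROW, so that the heartbeat
  -- upsert is logged as a statement and replays cleanly even on hosts where
  -- the row does not yet exist:
  SET SESSION binlog_format = 'STATEMENT';
  REPLACE INTO heartbeat.heartbeat (server_id, shard, datacenter, ts)
    VALUES (<local-server-id>, 'x1', 'eqiad', NOW());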

Timeline

All times in UTC.

  • Cookbook starts executing:
  • [20:58:52] <Amir1> swfrench-wmf: about to start the circular replication cookbook
  • [20:59:12] <logmsgbot> !log ladsgroup@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from eqiad to codfw
  • s1 maintenance completes without problems:
  • [21:00:44] <jynus> looking good s1
  • [21:00:59] <Amir1> s1 is good
  • All other STATEMENT-based replication sections (s1-s8) complete without problems
  • x1 maintenance starts:
  • 2024-09-18 21:26:01,731 ladsgroup 2097984 [INFO] ==> Performing steps for section x1
  • At some point after this, x1 codfw -> eqiad replication breaks (which is not checked)
  • [21:27:22] ladsgroup 2097984 [INFO] Completed command 'enable-puppet
  • [21:28:41] <jynus> there may be an issue on x1
  • [21:29:00] <jynus> the primary master is not replicating
  • [21:29:29] <Amir1> yeah and es6 too
  • Issue is debugged and identified:
  • [21:31:47] <jynus> Could not execute Update_rows_v1 event on table heartbeat.heartbeat; Can't find record in 'heartbeat'
  • [21:31:57] <jynus> heartbeat table wasn't properly cleaned up
  • [21:33:24] <jynus> I know why
  • [21:33:37] <jynus> x1 and es use row based replicatin
  • [21:34:05] <jynus> this means that the REPLACE gets translated into update row
  • [21:34:20] <jynus> but that row doesn't exist on eqiad
  • Heartbeat rows are wiped as a first fix, which fixes the eqiad issue but moves it to codfw, where it is slightly more impacting (the codfw -> eqiad direction is mostly idle). OUTAGE BEGINS:
  • [21:42:19] <icinga-wm_> PROBLEM - MariaDB Replica SQL: x1 on db2196 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table heartbeat.heartbeat: Duplicate entry 180360966 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1220-bin.015294, end_log_pos 946695232 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
  • [21:43:43] <domas> oh no
  • [21:43:57] <domas> heartbeat dupe is weird!
  • [21:44:23] <swfrench-wmf> jynus: is the alert about x1 just a delayed response to what you've now repaired manually?
  • [21:44:34] <jynus> yeah, but it is causing fallout
  • A temporary fix is set up so replication can flow. OUTAGE ENDS:
  • [21:48:19] <icinga-wm_> RECOVERY - MariaDB Replica SQL: x1 on db2196 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
  • [21:48:27] <jynus> I have deployed a temporary fix, which is replicate-wild-ignore-table=heartbeat.%
  • A proper fix is tested on es6, which has its codfw -> eqiad replication broken but is otherwise unaffected:
  • [22:00:32] <jynus> let me try to fix es6 in a cleaner way
  • [22:00:48] <jynus> by inserting without binlog a new codfw row
  • [22:05:52] <jynus> yeah, that works for es6
  • The proper fix is applied to es7 before it is affected at all:
  • [22:06:09] <jynus> will do it now for es7 ahead of the circular
  • [22:06:36] <Amir1> let me know once you're done and I will run it
  • [22:06:58] <jynus> yep, taking my time, to make sure I dont break stuff
  • Circular replication setup finishes, moving on to the last ROW-based section without issues:
  • [22:11:08] <jynus> Amir1: done, you should be good to go and it should just work
  • [22:11:18] <Amir1> going
  • [22:11:32] <jynus> waiting to check everthing is ok before fixing x1 for real
  • [22:11:55] <logmsgbot> !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from eqiad to codfw
  • [22:12:05] <Amir1> done ^
  • Finally, we replace the interim fix with the final one on x1, too:
  • [22:12:50] <jynus> ok, the fixing x1 codfw to remove the ignore table
  • [22:19:12] <jynus> !log inserting without binlog missing heartbeat reecod on x1 codfw hosts
  • [22:19:15] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
  • [22:22:36] <jynus> and we should be ok
  • Incident declared closed

Detection

The incident was first detected manually on Orchestrator, because the general monitoring of circular replication had not been enabled yet (it is a post-setup task that requires a puppet change, but also a pre-switchover task, as the channel is passive/idle until the actual dc switchover happens). Only the impacting failure (the eqiad -> codfw direction) was correctly identified by our monitoring, which is more or less expected.

The other gap was in the automation: the check after setup was "green" because replication didn't break immediately, only after pt-heartbeat was restarted on codfw. An additional check validating this replication channel should have run after all tasks were completed, and not just immediately after setup.

Funnily enough, because the automation for setting up replication was being run for the first time, there were more eyes than usual on the status of the databases, which led to fast identification. The automation wasn't the cause of the issue; the cause was the undocumented/unexpected failure mode, which would have happened anyway, given the combination of configuration and data state of the databases.

Conclusions

This incident report was written not because it was a high-impact issue (quite the opposite) but precisely because it was a close call: most of the user-impacting effect was avoided, but it could have been messy due to the highly specific conditions that led to it.

What went well?

  • No sql_slave_skip_counter was run on any host, as it could have caused worse issues or even data loss
  • In practice, the outage probably went completely unnoticed by users: data was available the whole time and no error codes were produced; at most, stale data could (theoretically) have been shown to some users for a few minutes
  • The issue was quickly detected and debugged despite its complexity

What went poorly?

  • Because circular replication is not a normal working mode, the first fix (wiping the table) didn't solve the issue (it did for eqiad, but it only moved it to codfw). A TRUNCATE TABLE command would have worked better in this case than a DELETE, as TRUNCATE is always transmitted in STATEMENT mode (it is a DDL query), but it was not considered as it is more dangerous (see the sketch after this list)
  • The combination of ROW-based replication, pt-heartbeat, and rows differing between heartbeat tables is an unfortunate issue that no one could have foreseen. It was not mentioned in past maintenances, so it must be a new situation due to a change in configuration or practices
  • It is unknown why this blew up on this run, as the process has been done multiple times, and while there is detailed documentation on how to do it manually, there is no mention of this issue. Either some practice has changed lately (e.g. cleaning up heartbeat tables) or something changed since the last dc switchover (this is unknown at the moment, and the person who may have the answer is currently unavailable). See edit above.
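A minimal sketch of the difference mentioned above, assuming a session where binlog_format is ROW (as on the x1 and es primaries):

  -- DELETE is logged as one Delete_rows event per row found on the server
  -- where it runs; rows that exist only on the other side are untouched, so
  -- the heartbeat tables stay out of sync and later heartbeat writes can
  -- collide (as seen in the duplicate-key error during the outage).
  DELETE FROM heartbeat.heartbeat;

  -- TRUNCATE is DDL, so it is always logged as a statement and empties the
  -- table on every server in the topology regardless of which rows each one
  -- holds. It is nonetheless riskier: it commits implicitly and cannot be
  -- rolled back, which is why it was not used here.
  TRUNCATE TABLE heartbeat.heartbeat;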

Where did we get lucky?

  • People were more aware than usual of failures, due to the unrelated first run of the full automation of this task

Actionables

Scorecard

Incident Engagement ScoreCard

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes (?)
  • Were fewer than five people paged? yes (no paging occurred, as the failure was on a non-critical maintenance/setup task)
  • Were pages routed to the correct sub-team(s)? n/a
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. n/a

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? no (a doc was not set up)
  • Was a public wikimediastatus.net entry created? no (it may not have been needed due to low impact)
  • Is there a phabricator task for the incident? yes (task T375144)
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? no (people were faster than monitoring, as the databases were actively being watched for other reasons)
  • Were the engineering tools that were to be used during the incident available and in service? n/a
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 7