Incidents/2024-06-10 puppet volatile data broken sync
document status: draft
Summary
Incident ID | 2024-06-10 puppet volatile data broken sync | Start | 2024-04-27 00:00:00 |
---|---|---|---|
Task | T367113 | End | 2024-06-10 |
People paged | 0 | Responder count | 3 |
Coordinators | | Affected metrics/SLOs | |
Impact | Unknown. | | |
Something happened between 2024-04-27 00:00 and 2024-04-27 00:08 UTC that caused the rsync clients which most puppetmasters/puppetservers use to sync data from puppetmaster1001 to hang indefinitely. On 2024-06-10, they were killed and restarted manually.
This means that new data from puppet 'volatile' was only rarely/intermittently synced to much of the fleet during this window.
The problem was particularly bad in codfw, where all 3 puppetservers had failed to rsync data for the entire duration.
Getting a firm idea of the impact of this is difficult. The new GeoLite2 files were unavailable, but also not yet in use in production(?). The older 'enterprise' file was still in use in many places, however, and would have grown stale. Analytics (a heavy user of GeoIP data) was probably not very affected because it is eqiad-only. Any CheckUser calls would likely have been affected by stale data.
On the other hand, aside from the one service wishing to use newly-added files ... no one noticed? So this puts a sort of upper bound on the potential impact, however unsatisfying.
Timeline
Write a step by step outline of what happened to cause the incident, and how it was remedied. Include the lead-up to the incident, and any epilogue.
Consider including a graph of the error rate or another surrogate metric.
Link to a specific offset in SAL using the SAL tool at https://sal.toolforge.org/ (example)
All times in UTC.
- 2024-04-27 00:00–00:08: the rsync clients that most puppetmasters/puppetservers use to sync volatile data from puppetmaster1001 hang indefinitely; new volatile data largely stops reaching the fleet, including all 3 codfw puppetservers. OUTAGE BEGINS
- 2024-06-10: Kosta Harlan asks on #wikimedia-sre-foundations whether the new GeoLite2 files are available (T366272); cdanis investigates and finds the files missing on most codfw hosts where they were expected to exist (P64540).
- 2024-06-10: the stuck rsync clients are killed and restarted manually; volatile data syncs resume. OUTAGE ENDS
Detection
Manual.
Kosta Harlan asked on #wikimedia-sre-foundations IRC to confirm that the new GeoLite2 files were available, as part of work on https://phabricator.wikimedia.org/T366272. cdanis began investigating and discovered that the files were missing on most hosts in codfw where they were expected to exist: https://phabricator.wikimedia.org/P64540
Conclusions
What went well?
What went poorly?
- Zero monitoring
  - No end-to-end alerting on data freshness (probably unnecessary, given the items below, but it would have been sufficient)
  - Monitoring and logging were explicitly disabled on the sync-puppet-volatile systemd::timer::job used to invoke the rsync clients. This removes the possibility of alerting/notification on any hypothetical sync failures (had we set timeouts).
- Infinite timeouts were permitted, allowing the rsync clients to get stuck forever (see the sketch after this list)
  - No TimeoutStartSec on systemd::timer::job or on puppetmaster::rsync's invocations thereof
  - No use of --timeout or --contimeout in the invocation of rsync
- This (somehow!) allowed multiple rsync clients to get stuck in an "impossible" situation
  - An strace of one client, inspected while the situation persisted, showed it was blocked in a select() call, waiting for its socket to the server to become readable
  - However, the logs on the server side do not show any of the stuck clients connecting, except for the previous runs ~15 minutes before the run that got perma-wedged. The only cause that seems possible is rsyncd processes somehow getting deadlocked.
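For illustration only, here is a minimal sketch of what a bounded version of the sync job could look like, written as a plain systemd oneshot unit rather than the actual systemd::timer::job puppetization. The unit name, rsync module name, host FQDN, destination path, and timeout values are assumptions, not what modules/puppetmaster/manifests/rsync.pp really contains:

```
# sync-puppet-volatile.service -- hypothetical sketch, not the real unit.
[Unit]
Description=Sync puppet volatile data from puppetmaster1001

[Service]
Type=oneshot
# --contimeout bounds the initial connection to rsyncd; --timeout bounds
# I/O stalls once connected, so a wedged server can no longer hang the
# client indefinitely.
ExecStart=/usr/bin/rsync -a --delete --contimeout=30 --timeout=600 \
    rsync://puppetmaster1001.eqiad.wmnet/puppet-volatile/ /var/lib/puppet/volatile/
# Type=oneshot services default to an infinite start timeout, so without
# this a stuck run stays "activating" forever; cap the whole run.
TimeoutStartSec=1h
```

Either layer alone (the rsync flags or the systemd-level cap) would likely have turned this incident into a few failed runs rather than weeks of silent staleness.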
Where did we get lucky?
- Nothing obviously broke?
Links to relevant documentation
Actionables
- Add a generous default TimeoutStartSec in systemd::timer::job. It cannot be infinite.
- Enable monitoring and logging for the systemd::timer::jobs defined in modules/puppetmaster/manifests/rsync.pp
- Ensure that timer failures for sync-puppet-volatile get reported somewhere (#-sre-foundations IRC?); one possible shape is sketched below
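One possible shape for the notification item, sketched as a systemd drop-in; the handler unit name is a placeholder, and the actual notification path (IRC, email, Alertmanager, ...) is deliberately left open:

```
# Hypothetical drop-in for the service behind the sync-puppet-volatile timer.
[Unit]
# %n expands to the name of the failing unit, so one templated handler
# service can report failures for every timer job.
OnFailure=notify-failed-timer@%n.service
```

Note this only becomes useful once timeouts exist: a hung unit never fails, it just stays running.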
Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.
Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.
Scorecard
 | Question | Answer (yes/no) | Notes |
---|---|---|---|
People | Were the people responding to this incident sufficiently different than the previous five incidents? | | |
 | Were the people who responded prepared enough to respond effectively? | | |
 | Were fewer than five people paged? | | |
 | Were pages routed to the correct sub-team(s)? | | |
 | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | | |
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | | |
 | Was a public wikimediastatus.net entry created? | | |
 | Is there a phabricator task for the incident? | | |
 | Are the documented action items assigned? | | |
 | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | | |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | | |
 | Were the people responding able to communicate effectively during the incident with the existing tooling? | | |
 | Did existing monitoring notify the initial responders? | | |
 | Were the engineering tools that were to be used during the incident available and in service? | | |
 | Were the steps taken to mitigate guided by an existing runbook? | | |
Total score (count of all “yes” answers above) | | | |