Incidents/2022-05-09 confctl

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2022-05-09 confctl
Task: T309691
Start: 2022-05-09 07:44:00
End: 2022-05-09 07:51:00
People paged: 26
Responder count: 6
Coordinators:
Affected metrics/SLOs:
Impact: For 5 minutes, all web traffic routed to codfw received error responses. This affected the central USA and South America (where local time was after midnight).

The confctl command to depool a server was accidentally run with an invalid selection parameter (host=mw1415 instead of name=mw1415; details at T308100). No "host" parameter exists, and confctl did not validate it but silently ignored it. As a result, the depool command was interpreted as applying to all hosts, of all services, in all data centers. The command was cancelled partway through the first DC it iterated over (codfw).
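
As a hedged illustration (this is not conftool's actual code; the attribute names are assumed), the Python sketch below shows why silently dropping an unknown selector attribute widens the selection to every host: once the "host" clause is discarded, the remaining criteria are empty, and empty criteria match everything.

    # Illustrative only, not conftool's actual implementation.
    # 'host=mw1415' was passed where 'name=mw1415' was intended.
    KNOWN_ATTRIBUTES = {"name", "dc", "cluster", "service"}  # assumed schema

    hosts = [
        {"name": "mw1415", "dc": "eqiad"},
        {"name": "mw2301", "dc": "codfw"},
    ]

    def build_criteria(selector):
        criteria = {}
        for clause in selector.split(","):
            key, _, value = clause.partition("=")
            if key in KNOWN_ATTRIBUTES:
                criteria[key] = value
            # unknown keys such as "host" are silently dropped here
        return criteria

    criteria = build_criteria("host=mw1415")   # -> {} (the clause was dropped)
    selected = [h for h in hosts
                if all(h.get(k) == v for k, v in criteria.items())]
    print(selected)  # all() over empty criteria is True, so every host matches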

Confctl-managed services were set as inactive across most of the codfw data center. This caused all end-user traffic being routed to codfw at the time (central US and South America, at a low-traffic moment) to receive error responses. While the appservers in codfw were "passive" at that moment (not receiving end-user traffic), other services that were active there were affected (CDN edge cache, Swift media files, Elasticsearch, WDQS…).

The most visible effect during the incident was that approximately 1.4k HTTP requests per second failed to be served from the text edges, and approximately 800 HTTP requests per second failed to be served from the upload edges. The trigger for the issue was a gap in tooling that allowed running a command with invalid input.

Timeline

All times in UTC.

  • 07:44 confctl command with invalid parameters is executed OUTAGE BEGINS
  • 07:44 Engineer executing the change realizes the change is running against more servers than expected and cancels the execution midway
  • 07:46 Monitoring system detects the app server unavailability; 15 pages are sent
  • 07:46 Engineer executing the change notifies others via IRC
  • 07:50 confctl command to repool all codfw servers is executed OUTAGE ENDS
[Graph: 5xx errors during the confctl incident]
[Screenshot: wikimediastatus.net status page]

Detection

The issue was detected both by monitoring, with the expected alerts firing, and by the engineer executing the change.

Example alerts:

07:46:18: <jinxer-wm> (ProbeDown) firing: (27) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown

07:46:19: <jinxer-wm> (ProbeDown) firing: (29) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown

Conclusions

When provided with invalid input, confctl executes the command against all hosts; it should fail instead.
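
A minimal sketch of the fail-fast behavior this conclusion calls for, assuming the same hypothetical selector parser as above (names are illustrative, not conftool's actual code):

    # Hypothetical fix: reject unknown selector attributes outright
    # instead of silently ignoring them.
    KNOWN_ATTRIBUTES = {"name", "dc", "cluster", "service"}  # assumed schema

    def build_criteria(selector):
        criteria = {}
        for clause in selector.split(","):
            key, _, value = clause.partition("=")
            if key not in KNOWN_ATTRIBUTES:
                # Aborting here keeps a typo like 'host=' from being
                # interpreted as "select every host everywhere".
                raise ValueError("unknown selector attribute: %r" % key)
            criteria[key] = value
        return criteria

    build_criteria("host=mw1415")  # raises ValueError instead of matching all hosts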

What went well?

  • Monitoring detected the issue
  • Rollback was performed quickly

What went poorly?

  • Tooling allowed executing a command with bad input

Where did we get lucky?

  • The engineer executing the change realized what was going on and stopped the command from completing

How many people were involved in the remediation?

  • 6 SREs

Links to relevant documentation

Conftool#The tools

Actionables

Scorecard

Incident Engagement™ ScoreCard (answers: yes/no)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? no
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? no
  • Were pages routed to the correct sub-team(s)? no
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. yes

Process
  • Was the incident status section actively updated during the incident? no
  • Was the public status page updated? no
  • Is there a phabricator task for the incident? yes
  • Are the documented action items assigned? yes
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were all engineering tools required available and in service? yes
  • Was there a runbook for all known issues present? no

Total score (count of all “yes” answers above): 9