Incidents/2022-03-01 ulsfo network

From Wikitech

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
Incident ID 2022-03-01 ulsfo network Start 2022-03-01 22:35:25
Task End 2022-03-01 22:55:00
People paged Responder count 4
Coordinators jhathaway Affected metrics/SLOs
Impact For 20 minutes, clients normally routed to Ulsfo were unable to reach any of our projects. This includes New Zealand, parts of Canada, and the United States west coast.

Multiple of our redundant network providers for the San Francisco datacenter simultaneously experienced connectivity loss. After 20 minutes, clients were rerouted to other datacenters.

Documentation:

Actionables

  • T303219 Integrate DNS depools with Etcd and automate/remove the need for writing a Git commit.
  • Can we increase fiber redundancy?

Scorecard

Incident Engagement™ ScoreCard
Question Score Notes
People Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no) 0 Info not logged
Were the people who responded prepared enough to respond effectively (score 1 for yes, 0 for no) 1
Were more than 5 people paged? (score 0 for yes, 1 for no) 0
Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no) 0
Were pages routed to online (business hours) engineers? (score 1 for yes,  0 if people were paged after business hours) 0
Process Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no) 1
Was the public status page updated? (score 1 for yes, 0 for no) 1
Is there a phabricator task for the incident? (score 1 for yes, 0 for no) 0
Are the documented action items assigned?  (score 1 for yes, 0 for no) 0 one appears to be an open question
Is this a repeat of an earlier incident (score 0 for yes, 1 for no) 0
Tooling Was there, before the incident occurred, open tasks that would prevent this incident / make mitigation easier if implemented? (score 0 for yes, 1 for no) 0
Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 or no) 1
Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no) 1
Were all engineering tools required available and in service? (score 1 for yes, 0 for no) 1
Was there a runbook for all known issues present? (score 1 for yes, 0 for no) 1
Total score 7