Incidents/2018-04-10 Routing
Revision as of 04:44, 12 April 2018
Summary
A configuration change on routers located in the Ashburn and Singapore datacenters caused a service interruption of ~10 min (22:53-23:03 UTC) for users redirected to Ashburn, and ~40 min (22:47-23:24 UTC) for users redirected to Singapore.
More details on: task T191940
Timeline
- 22:47 Change pushed to cr1-eqsin
- 22:53 Change pushed to cr2-eqiad
- 22:58 cr2-eqiad rolled back
- 23:03 eqiad full recovery (after routing convergence)
- 23:22 cr1-eqsin rolled back (partial recovery)
- 23:31 eqsin depooled
- 23:36 eqsin full recovery
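The durations quoted in the summary can be checked against the timeline with a little arithmetic. The following is a minimal sketch: the `minutes_between` helper is a hypothetical name, and it assumes all timestamps fall on the same UTC day, as they do in the entries above.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# eqiad: change pushed to cr2-eqiad -> full recovery
eqiad_impact = minutes_between("22:53", "23:03")
# eqsin: change pushed to cr1-eqsin -> rollback (partial recovery)
eqsin_until_rollback = minutes_between("22:47", "23:22")
# eqsin: change pushed to cr1-eqsin -> full recovery
eqsin_impact = minutes_between("22:47", "23:36")

print(eqiad_impact, eqsin_until_rollback, eqsin_impact)
```

The eqiad figure matches the ~10 min in the summary. For eqsin, the window quoted in the summary (22:47-23:24) is 37 minutes, which falls between the rollback and full-recovery marks in the timeline.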
Conclusions
- Changes, even if already live in part of the infrastructure, need to be discussed more thoroughly with the team
- POPs (especially non-redundant ones) should be depooled before applying changes if there is any doubt
- The same change had different results across the deployment:
  - No issues, working as expected (e.g. switches, cr2-esams)
  - Partial failure (cr1-eqsin): connectivity to the router and rpd appeared healthy, but user traffic was being dropped
  - Full failure (cr2-eqiad): connectivity to the router was lost instantly
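The depooling guideline above could be expressed as a small pre-change check. This is only an illustrative sketch: the `Site` type, its `redundant` field, and `plan_change` are hypothetical names rather than part of any existing tooling, and the site names in the usage example are placeholders. It also includes an explicit user-traffic check, since the partial failure showed that a router can look healthy while dropping traffic.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Site:
    name: str
    redundant: bool  # hypothetical flag: can another router at the site carry the traffic?

def plan_change(site: Site, any_doubt: bool) -> List[str]:
    """Return the ordered steps for applying a change at one site."""
    steps: List[str] = []
    # Assumption: non-redundant sites are always depooled first, a stricter
    # reading of "especially non-redundant ones"; others only when in doubt.
    depool = any_doubt or not site.redundant
    if depool:
        steps.append(f"depool {site.name}")
    steps.append(f"apply change on {site.name}")
    # Router reachability alone is not enough; check user traffic too.
    steps.append(f"verify user traffic on {site.name}")
    if depool:
        steps.append(f"repool {site.name}")
    return steps

# Usage with placeholder site names
for step in plan_change(Site("pop-a", redundant=False), any_doubt=False):
    print(step)
```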
Actionables
Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.
- Tickets have been opened with the vendor (task T191667)