Incidents/2018-04-10 Routing


Summary

A configuration change on routers in the Ashburn and Singapore datacenters caused a service interruption of ~10min (22:53-23:03 UTC) for users redirected to Ashburn, and ~40min (22:47-23:24 UTC) for users redirected to Singapore.

More details on: task T191940

Timeline

  • 22:47 Change pushed to cr1-eqsin
  • 22:53 Change pushed to cr2-eqiad
  • 22:58 cr2-eqiad rolled back
  • 23:03 eqiad full recovery (after routing convergence)
  • 23:22 cr1-eqsin rolled back (partial recovery)
  • 23:31 eqsin depooled
  • 23:36 eqsin full recovery
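
The intervals implied by the timeline above can be checked with a short calculation. This is only a minimal sketch: the helper name `minutes_between` and the `TIMELINE` dictionary are illustrative and not part of any Wikimedia tooling, and it assumes all events fall on the same UTC day (2018-04-10).

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline; the key names are illustrative.
TIMELINE = {
    "eqsin_change": "22:47",
    "eqiad_change": "22:53",
    "eqiad_rollback": "22:58",
    "eqiad_recovered": "23:03",
    "eqsin_rollback": "23:22",
    "eqsin_depooled": "23:31",
    "eqsin_recovered": "23:36",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Time from each change being pushed to the later events at that site.
eqiad_to_rollback = minutes_between(TIMELINE["eqiad_change"], TIMELINE["eqiad_rollback"])
eqiad_to_recovery = minutes_between(TIMELINE["eqiad_change"], TIMELINE["eqiad_recovered"])
eqsin_to_rollback = minutes_between(TIMELINE["eqsin_change"], TIMELINE["eqsin_rollback"])
eqsin_to_recovery = minutes_between(TIMELINE["eqsin_change"], TIMELINE["eqsin_recovered"])

print(f"eqiad: rollback after {eqiad_to_rollback} min, full recovery after {eqiad_to_recovery} min")
print(f"eqsin: rollback after {eqsin_to_rollback} min, full recovery after {eqsin_to_recovery} min")
```

With the timestamps listed, this gives 5 and 10 minutes for eqiad, and 35 and 49 minutes for eqsin, measured from the moment each change was pushed.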

Conclusions

  • Changes, even if already live in part of the infrastructure, need to be discussed more thoroughly with the team
  • POPs (especially non-redundant ones) should be depooled before applying changes if there is any doubt
  • The same change had different results across the deployment:
    • No issues, working as expected (e.g. switches, cr2-esams)
    • Partial failure (cr1-eqsin): connectivity to the router and rpd appeared healthy, but user traffic was being dropped
    • Full failure (cr2-eqiad): connectivity to the router was lost instantly

Actionables


  • Tickets have been opened with the vendor: phab:T191667 (update: crash reason found)