Incidents/20190603-maps


Summary

Work was (and still is) in progress to address out-of-disk-space issues on the maps codfw servers (task T224395), so maps2001 and maps2004 were depooled at the time. A spike in requests overloaded the codfw servers that were still pooled and caused request timeouts. We suspected further problems related to the disk space issue, so maps2002 and maps2003 were depooled as well. For an unknown reason, eqiad was depooled in dnsdisc, so all maps traffic was directed to codfw; with maps200[23] depooled, no server was available to serve traffic. Switching all traffic to eqiad fixed the issue.

Impact

Service was unavailable for users as tile requests timed out (for how long?)

Detection

  • Icinga alerts for kartotherian endpoints and varnish 5xx alerts
  • Checking PyBal, which showed that at some point only maps codfw was pooled

Timeline

All times in UTC.

  • 9:46: First alert seen on irc (#-operations). It showed tile requests timing out for kartotherian. PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2003.codfw.wmnet are marked down but pooled
  • 9:48: Maps200[23] were depooled. This was done in the belief that maps eqiad was pooled, but it was not.
  • 9:49: More alerts were seen on Icinga, and this time varnish started reporting errors too.
  • 9:59: We detected that only maps codfw was pooled.
  • 10:02: maps eqiad was pooled
  • 10:03: Maps codfw was depooled.
  • 10:05: 503s in varnish started recovering

Conclusions

What weaknesses did we learn about and how can we address them?

The following sub-sections should have a couple brief bullet points each.

What went well?

  • It was easy to trace the maps codfw problems to PostgreSQL replication lag on the slaves.

What went poorly?

  • We should have pooled maps eqiad.
  • Depooling maps200[23] meant that no server was available to serve traffic. The intent was to depool kartotherian / codfw, which should have been done via dnsdisc.
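
The failure mode described above can be expressed as a simple guard to run before depooling backends. What follows is a minimal sketch, not existing tooling: the function name `servers_left_after_depool`, the data shapes, and the eqiad hostnames are all hypothetical, and it assumes the operator already knows which datacenters are pooled in dnsdisc and which backends are pooled in PyBal.

```python
def servers_left_after_depool(dc_pooled, backends_pooled, to_depool):
    """Return the backends that would still receive traffic after a depool.

    dc_pooled:       {datacenter: bool}, discovery (dnsdisc) state per datacenter
    backends_pooled: {datacenter: set of currently pooled hostnames}
    to_depool:       set of hostnames about to be depooled
    All three shapes are hypothetical, chosen for illustration only.
    """
    remaining = set()
    for dc, pooled in dc_pooled.items():
        if not pooled:
            # A datacenter depooled in discovery receives no traffic,
            # so its backends do not count as serving capacity.
            continue
        remaining |= backends_pooled.get(dc, set()) - set(to_depool)
    return remaining


# State approximating the incident: eqiad depooled in dnsdisc,
# maps2001 and maps2004 already depooled for the disk space work.
dc_pooled = {"eqiad": False, "codfw": True}
backends_pooled = {
    "eqiad": {"maps1001", "maps1002"},  # hypothetical hostnames
    "codfw": {"maps2002", "maps2003"},
}

left = servers_left_after_depool(dc_pooled, backends_pooled, {"maps2002", "maps2003"})
if not left:
    print("REFUSE: depool would leave no server available to serve traffic")
```

Under these assumed inputs the guard reports that nothing would be left to serve traffic, which is the situation reached at 9:48.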

Where did we get lucky?

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, there should be an action item to create it.

  • how to depool kartotherian codfw (or more generically, how to depool a service for a datacenter using confctl) [TBD]
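
Until that documentation exists, the sketch below shows roughly what a datacenter-level depool through confctl might look like. It only builds the command and does not run it. The object type, the selector keys and the `set/pooled=false` action are assumptions not taken from this report, and should be checked against the conftool documentation before use; the helper name `dnsdisc_depool_command` is hypothetical.

```python
import shlex


def dnsdisc_depool_command(service, datacenter):
    """Build (but do not run) a confctl command that depools one datacenter
    for one discovery service.

    ASSUMPTION: the object type, selector keys and set/pooled syntax below
    are not taken from this report; verify them against the conftool docs.
    """
    selector = f"dnsdisc={service},name={datacenter}"
    return ["confctl", "--object-type", "discovery",
            "select", selector, "set/pooled=false"]


cmd = dnsdisc_depool_command("kartotherian", "codfw")
# Print the command for review instead of executing it.
print(shlex.join(cmd))
```

Building the command as a list and printing it keeps the example side-effect free; an operator would review the output before running anything on a production host.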

Actionables

  • Maybe add a check to confirm that maps eqiad and codfw are always pooled?
  • Fix the maps codfw disk space issues quickly, as we only have maps eqiad for now
  • Refresher on how dnsdisc works
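
The first actionable could take the shape of an Icinga-style check. This is a minimal sketch under stated assumptions: the input mapping of datacenter to pooled state is hypothetical, how it would be fetched from the discovery configuration is left out, and the function name `check_maps_pooled` is illustrative. Only the exit codes (0 for OK, 2 for CRITICAL) follow the standard Nagios/Icinga plugin convention.

```python
OK, CRITICAL = 0, 2  # standard Nagios/Icinga plugin exit codes


def check_maps_pooled(dc_pooled, expected=("eqiad", "codfw")):
    """Return (exit_code, message) for an Icinga-style check.

    dc_pooled is a hypothetical {datacenter: bool} mapping of discovery state;
    how it is fetched is out of scope for this sketch.
    """
    # A datacenter missing from the mapping is treated as not pooled.
    missing = [dc for dc in expected if not dc_pooled.get(dc, False)]
    if missing:
        return CRITICAL, "CRITICAL - maps not pooled in: " + ", ".join(missing)
    return OK, "OK - maps pooled in: " + ", ".join(expected)


# State at the start of the incident: eqiad depooled, codfw pooled.
code, message = check_maps_pooled({"eqiad": False, "codfw": True})
print(message)
```

A real check would probably need a way to acknowledge intentional depools, such as the codfw depool at 10:03, so that planned maintenance does not page.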