Incident documentation/20151218-payments wiki

On December 18, 2015 at 11:15 UTC, icinga started sending alerts about payments webserver timeouts. We found apache processes backing up while waiting on trivial queries to the fundraising drupal database.

Sequence of Events

  • 11:15 icinga alerts begin, Faidon investigates
  • 11:38 Jeff contacted and starts investigating
  • 11:40 Peter takes banners down
  • 12:20 Jeff adjusts mysql table_open_cache and flushes tables on the mysql server; services promptly recover
  • 12:30 Peter puts banners back up
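
For reference, the 12:20 step can be sketched roughly as follows. This is a minimal illustration, not a record of the exact commands that were run: the helper name remediation_statements and the value 4000 are hypothetical, and the actual cache size chosen during the incident is not recorded here.

```python
from typing import List


def remediation_statements(new_table_open_cache: int) -> List[str]:
    """Build the SQL statements for raising the table cache and flushing tables.

    Hypothetical helper for illustration; it only builds strings and does not
    connect to any server.
    """
    if new_table_open_cache <= 0:
        raise ValueError("table_open_cache must be a positive integer")
    return [
        # table_open_cache is a dynamic global variable, so no restart is needed.
        f"SET GLOBAL table_open_cache = {int(new_table_open_cache)}",
        # Closes all open tables so they are reopened under the new limit.
        "FLUSH TABLES",
    ]


if __name__ == "__main__":
    # 4000 is an illustrative value, not the one used during the incident.
    for statement in remediation_statements(4000):
        print(statement + ";")
```

A value set with SET GLOBAL does not survive a server restart, so the same setting would also need to go into the mysql configuration file to persist.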

Impact

While the database was sluggish, payment processing performance was degraded, and donors may have seen error messages or timeouts.

Root cause

We appear to have overrun mysql's configured open table cache (table_open_cache), which made mysql slow to open and close tables for queries.
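
One way to spot this condition is to compare mysql's Open_tables and Opened_tables status counters against the table_open_cache setting. The sketch below assumes those numbers have already been read, for example from SHOW GLOBAL STATUS and SHOW GLOBAL VARIABLES. The function name table_cache_pressure and the sample figures are hypothetical and are not measurements from this incident.

```python
def table_cache_pressure(open_tables, opened_tables, table_open_cache, uptime_seconds):
    """Summarize how hard the table cache is being pushed.

    open_tables:      Open_tables status value (tables currently open)
    opened_tables:    Opened_tables status value (tables opened since startup)
    table_open_cache: configured cache size
    uptime_seconds:   Uptime status value
    """
    if table_open_cache <= 0 or uptime_seconds <= 0:
        raise ValueError("table_open_cache and uptime_seconds must be positive")
    return {
        # A ratio at or near 1.0 means the cache has no free slots.
        "fill_ratio": open_tables / table_open_cache,
        # A steadily high rate suggests tables are being evicted and reopened.
        "opens_per_second": opened_tables / uptime_seconds,
        "cache_full": open_tables >= table_open_cache,
    }


if __name__ == "__main__":
    # Illustrative numbers only.
    print(table_cache_pressure(400, 90000, 400, 3600))
```

A full cache combined with a continuously climbing Opened_tables counter is consistent with the behaviour described above, though it is only a heuristic and not proof of the root cause.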

Recommendations

Additional mysql tuning was performed after the outage to prevent similar outages.