Incident documentation/20141113-EventLogging

From Wikitech

Summary

Restarting the EventLogging database consumer after a temporary switch of the m2-master CNAME caused a roughly 20-minute outage in writing events to the database, from 2014-11-13T00:59 to 2014-11-13T01:21. (No data was lost, as the events are still available in log files.)

Timeline

2014-11-13T00:34
Change I463760 was merged, updating m2-master to point to dbproxy1002 in order to test database proxying.
2014-11-13T00:59
EventLogging's mysql-m2-master consumer was restarted by hand to pick up the above DNS change.
The consumer failed to connect to m2-master (which now pointed to dbproxy1002).
From this time on, no EventLogging events could be written to the database.
(The other consumers, such as the plain log file writer, continued to work as expected.)
2014-11-13T01:19
The above DNS change was reverted by change I862947.
2014-11-13T01:21
EventLogging's mysql-m2-master consumer was restarted by hand to pick up the reverted DNS record.
The consumer could again connect to m2-master (which now pointed to db1020 again).
EventLogging events were written to the database again.

Conclusions

  • EventLogging writes data synchronously and does not buffer database writes, so it cannot handle database connection issues gracefully. This is not widely enough known throughout the foundation.
  • Testing in labs is not sufficient on its own: production firewalls can still get in the way.
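
To illustrate the first conclusion, here is a minimal Python sketch of what buffering database writes could look like. It is not EventLogging's actual consumer code: the class BufferedWriter, its write_fn callback, and the max_buffered limit are all hypothetical names chosen for this example. The idea is that events which cannot be written are held in memory and retried on the next write, instead of being dropped by the database consumer.

```python
import collections


class BufferedWriter:
    """Hypothetical wrapper that queues events while the database is unreachable.

    This is an illustration only, not EventLogging's real implementation.
    """

    def __init__(self, write_fn, max_buffered=10000):
        # write_fn is assumed to raise ConnectionError when the database
        # cannot be reached, and to return normally on success.
        self._write_fn = write_fn
        # A bounded deque: once max_buffered is reached, appending a new
        # event silently discards the oldest one.
        self._pending = collections.deque(maxlen=max_buffered)

    def write(self, event):
        """Queue the event, then try to write everything that is queued."""
        self._pending.append(event)
        return self.flush()

    def flush(self):
        """Write queued events in order; stop at the first connection failure."""
        while self._pending:
            event = self._pending[0]
            try:
                self._write_fn(event)
            except ConnectionError:
                # Keep the event queued and retry on the next call.
                return False
            self._pending.popleft()
        return True

    @property
    def pending(self):
        """Number of events still waiting to be written."""
        return len(self._pending)
```

With a wrapper of this kind, an interruption like the one in the timeline would leave events queued in memory until the connection returned, at the cost of bounded memory use and of losing the queue if the process is restarted. The log files would therefore remain the source of truth for backfilling.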

Actionables

  • Status:    Done Make sure that log files exist for the affected period, so we can backfill.
Logs exist on vanadium (not yet on stat1002 or stat1003) and look good.
  • Status:    Done Backfill the data.
  • Status:    Done Make sure vanadium can connect to dbproxy1002 (RT8863).
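
A connectivity check of the kind the last actionable calls for could be sketched in Python as follows. This is only an illustration: the function name can_connect is hypothetical, and the host and port to test are assumptions that depend on the actual setup (3306 is the default MySQL port, but this report does not state which port is in use).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Name resolution failures, refused connections, and timeouts all
    surface as OSError subclasses and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from the consumer's host before switching the CNAME, for example as can_connect("dbproxy1002", 3306) with the real host name and port substituted, a False result would point to a network-level block before any consumer is restarted. A True result only shows that the TCP port is reachable. It does not verify database credentials or grants.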