Incidents/20141113-EventLogging
Summary
Restarting consumers after a temporary switch of the m2-master CNAME caused a ~20 minute outage of writing events to the database, from 2014-11-13T00:59 to 2014-11-13T01:21. (No data was lost, as the events are available in log files.)
Timeline
- 2014-11-13T00:34
  - Change I463760 got merged and updated m2-master to point to dbproxy1002 to test db proxying.
- 2014-11-13T00:59
  - EventLogging's mysql-m2-master consumer got restarted by hand to pick up the above DNS change.
  - The consumer failed to connect to m2-master (which now pointed to dbproxy1002).
  - So from this time on, no EventLogging events could get written to the database.
  - (But the other consumers (like the plain log file writer) continued to work as expected.)
- 2014-11-13T01:19
  - The above DNS change got reverted by change I862947.
- 2014-11-13T01:21
  - EventLogging's mysql-m2-master consumer got restarted by hand to pick up the reverted DNS change.
  - The consumer could again connect to m2-master (which now pointed to db1020 again).
  - EventLogging events started to get written again to the database.
Conclusions
- EventLogging writes data synchronously and does not buffer database writes, so it cannot handle database connection issues gracefully. This is not widely enough known throughout the Foundation.
- Regardless of the testing done in labs, production firewalls get in the way.
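To illustrate the first conclusion, here is a minimal sketch of what buffering database writes could look like. It is not EventLogging's actual code: `BufferedWriter`, `write_fn`, and `max_buffered` are hypothetical names, and the sketch assumes that a failed write surfaces as Python's built-in `ConnectionError` (a real MySQL driver raises its own exception types).

```python
from collections import deque


class BufferedWriter:
    """Queue events in memory and write them once the database is reachable."""

    def __init__(self, write_fn, max_buffered=10000):
        # write_fn is a hypothetical callable that inserts one event and
        # raises ConnectionError when the database cannot be reached.
        self.write_fn = write_fn
        # Bounded queue: when full, the oldest pending events are dropped.
        self.pending = deque(maxlen=max_buffered)

    def submit(self, event):
        """Queue the event, then try to write everything that is pending."""
        self.pending.append(event)
        return self.flush()

    def flush(self):
        """Write pending events in order; stop at the first connection failure."""
        while self.pending:
            try:
                self.write_fn(self.pending[0])
            except ConnectionError:
                return False  # keep the event queued and retry on the next flush
            self.pending.popleft()
        return True
```

With a writer like this, a connection failure such as the one at 00:59 would leave events queued in memory instead of dropping them from the database path, at the cost of bounded memory use and possible loss if the process restarts before the queue drains.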
Actionables
- Status: Done Make sure that log files exist for the affected period, so we can backfill.
  - Logs exist (on vanadium; not yet on stat1002 or stat1003) and look good.
- Status: Done Backfill the data.
- Status: Done Make sure vanadium can connect to dbproxy1002 (RT8863).
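The last actionable amounts to a reachability check from the consumer host to the proxy. A rough sketch of such a check, using only the Python standard library, might look like the following. The function name `can_connect` is hypothetical, and the port is an assumption (3306 is the MySQL default; the port actually used by dbproxy1002 is not stated here).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run on vanadium, a call such as `can_connect("dbproxy1002", 3306)` would show whether a firewall blocks the path before any consumer is restarted. It only tests TCP reachability, not MySQL authentication or grants.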