Incidents/20140318-EventLogging
Summary
Due to repeated database connection failures, an EventLogging writer was automatically shut down on 2014-03-18. Although Icinga alerted about the issue, it was fixed only after people noticed that reports were showing no up-to-date data.
Timeline
From Ori's email to the Analytics list
At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming connections. At some point during the subsequent hour, MariaDB had either crashed or been manually restarted. Sean noticed that the database was choking on some queries from the researchers and notified the wmfresearch list.
While the database server was down or rejecting connections, the EventLogging writer that writes to db1047 repeatedly failed to connect to it:
sqlalchemy.exc.OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on 'db1047.eqiad.wmnet' (111)")
The Upstart job for EventLogging is configured to re-spawn the writer, up to a certain threshold of failures. Because the writer repeatedly failed to connect, it hit the threshold, and was not re-spawned.
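The respawn behaviour described above can be illustrated with a small simulation. This is only a sketch of the general idea behind Upstart's `respawn limit COUNT INTERVAL` stanza (stop respawning once a job has been respawned COUNT times within INTERVAL seconds), not Upstart's actual implementation. The names `RespawnLimiter`, `supervise` and `failing_writer` are hypothetical, and the limit values are placeholders, since the threshold configured for the EventLogging job is not stated here.

```python
from collections import deque


class RespawnLimiter:
    """Allow at most `count` respawns within any `interval`-second window."""

    def __init__(self, count, interval):
        self.count = count          # max respawns allowed inside the window
        self.interval = interval    # window length in seconds
        self._respawns = deque()    # timestamps of recent respawns

    def allow_respawn(self, now):
        # Forget respawns that have fallen out of the sliding window.
        while self._respawns and now - self._respawns[0] > self.interval:
            self._respawns.popleft()
        if len(self._respawns) >= self.count:
            return False            # threshold hit: give up on the job
        self._respawns.append(now)
        return True


def supervise(run_writer, limiter, clock):
    """Run the writer, respawning on connection failure until the limit is hit.

    Returns a (state, attempts) tuple, where state is "exited" if the writer
    returned normally, or "stopped" if the respawn limit was reached.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            run_writer()
            return "exited", attempts
        except ConnectionError:
            if not limiter.allow_respawn(clock()):
                return "stopped", attempts


def failing_writer():
    # Hypothetical stand-in for a writer whose database is unreachable.
    raise ConnectionError("Can't connect to MySQL server on 'db1047.eqiad.wmnet' (111)")
```

With a writer that fails immediately every time, the respawns pile up inside the window and the job ends up stopped, which matches what happened here. If the failures are spaced further apart than the interval, the limit is never reached and the writer keeps being respawned.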
This triggered an Icinga alert:
[00:04:24] <icinga-wm> PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-db1047
This alert was not responded to. Ori finally got pinged by Tillman, who noticed the blog visitor stats report was blank, and by Gilles, who noticed image loading performance data was missing.
Conclusions
- Mixing analytics slaves with other work is fragile, as the root cause of the db1047 downtime was research queries[1].
- There is no clarity about who is to respond to EventLogging alerts, or by when.
Actionables
- Status: Ongoing - Clarify who is to respond to which EventLogging alerts, and by when.
  - Analytics now owns EL, and can respond to some alerts, but not all. Discussion between Ops and Analytics is still ongoing.
- Status: Done - RT #7081 - Move EventLogging database to m2.
- Status: Done - Figure out a way to allow joining EventLogging data against enwiki, as this seems to be critical for researchers.
  - Replication back to db1047 is included in the steps required to move the EventLogging database to m2.
References
- [1] See timestamp 08:54:10f in the Ops channel log.