Incidents/20140318-EventLogging

Summary

Due to repeated database connection failures, an EventLogging writer automatically shut down on 2014-03-18. Although Icinga alerted on the issue, it was fixed only after people noticed that reports showed no up-to-date data.

Timeline

From Ori's email to the Analytics list

At about 2014-03-18 00:04 UTC, db1047 stopped accepting incoming connections. At some point during the subsequent hour, MariaDB either crashed or was manually restarted. Sean noticed that the database was choking on some queries from researchers and notified the wmfresearch list.

While the database server was down or rejecting connections, the EventLogging writer that writes to db1047 repeatedly failed to connect to it:

sqlalchemy.exc.OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on 'db1047.eqiad.wmnet' (111)")

The Upstart job for EventLogging is configured to re-spawn the writer, up to a certain threshold of failures. Because the writer repeatedly failed to connect, it hit the threshold, and was not re-spawned.
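The respawn threshold described above can be illustrated with a small simulation. This is only a simplified sketch of a "give up after too many failures within a time window" policy, not Upstart's actual implementation. The function names and the `limit` and `interval` values are hypothetical; the report does not state the values configured for the EventLogging job.

```python
from collections import deque


def should_respawn(failure_times, now, limit=10, interval=5.0):
    """Record a failure at `now` and decide whether to respawn.

    failure_times is a deque of earlier failure timestamps (mutated in place).
    limit and interval are assumed values, not the real job configuration.
    """
    failure_times.append(now)
    # Forget failures that are older than the sliding window.
    while failure_times and now - failure_times[0] > interval:
        failure_times.popleft()
    # Keep respawning only while failures in the window stay within the limit.
    return len(failure_times) <= limit


def simulate(crash_times, limit=10, interval=5.0):
    """Return the time at which the job is left stopped, or None."""
    window = deque()
    for t in crash_times:
        if not should_respawn(window, t, limit, interval):
            return t
    return None


if __name__ == "__main__":
    # A writer that fails once per second against an unreachable database.
    print(simulate(range(0, 60), limit=3, interval=10))
    # A writer that fails only occasionally is always respawned.
    print(simulate([0, 20, 40, 60], limit=3, interval=10))
```

Under such a policy, a writer that fails immediately on every connection attempt exhausts the threshold quickly and then stays stopped even after the database comes back, which matches what happened here.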

This triggered an Icinga alert:

[00:04:24] <icinga-wm> PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-db1047

Nobody responded to this alert. Ori was eventually pinged by Tillman, who noticed that the blog visitor stats report was blank, and by Gilles, who noticed that image loading performance data was missing.

Conclusions

Mixing analytics slaves with other work is fragile: the root cause of the db1047 downtime was research queries [1].
It is unclear who is to respond to EventLogging alerts, and by when.

Actionables

Status: Ongoing - Clarify who is to respond to which EventLogging alerts, and by when.
- Analytics now owns EL, and can respond to some alerts, but not all. Discussion between Ops and Analytics is still ongoing.
Status: Done - RT #7081 - Move EventLogging database to m2.
Status: Done - Figure out a way to allow joining EventLogging data against enwiki, as this seems to be critical for researchers.

Replication back to db1047 is part of the work required to move the EventLogging database to m2.

References

[1] Look at timestamp 08:54:10f in the Ops channel log