Incident documentation/20150107-EventLogging

Summary

For about 12 hours, 70% of EventLogging events did not include a schema field and therefore failed validation.

The issue affected client-side events, which make up the majority of events sent. Server-side events and events sent via client apps were unaffected.
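
To illustrate the failure mode, here is a minimal sketch of how an event without a schema field would be rejected. The function name `validate_event`, the list of required fields, and the sample events are assumptions made for this example; they are not taken from the EventLogging code base.

```python
# Hypothetical sketch: this is NOT the actual EventLogging validator.
# The required field names below are assumed for illustration.
REQUIRED_FIELDS = ("schema", "revision", "event")


def validate_event(capsule):
    """Return (ok, reason) for a decoded event capsule (a dict)."""
    for field in REQUIRED_FIELDS:
        if field not in capsule:
            return False, "missing required field: " + field
    return True, "ok"


# Made-up sample events.
good = {"schema": "ExampleSchema", "revision": 1, "event": {"action": "click"}}
bad = {"revision": 1, "event": {"action": "click"}}  # schema field is absent

print(validate_event(good))
print(validate_event(bad))
```

In this sketch the second event is rejected only because the schema field is absent, even though its payload is otherwise well formed.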

EventLogging issue January 7th 2015

Timeline

  • Changeset [1] is merged on January 6th
  • The changeset is deployed to MediaWiki at 2 am UTC on January 7th
  • Alarms are raised on the throughput of events
  • Logs are examined, the cause is established, and a patchset is sent to fix it
  • The fix is deployed in the UTC evening (morning on the US West Coast)

Conclusions

Deployments should be tested on the Beta Cluster first; we should avoid going directly to production.

Actionables

We learned about this issue via the alarms on throughput, so they did their job. We could also add alarms on the database, which we do not currently have.
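
A throughput alarm of the kind mentioned above could be sketched as follows, assuming it compares the current rate of valid events against a baseline rate. The function name `throughput_alarm`, the `max_drop` threshold, and the sample rates are hypothetical and do not describe the actual alarm configuration.

```python
# Hypothetical sketch: not the production alarm. Threshold and rates are
# illustrative assumptions.
def throughput_alarm(current_rate, baseline_rate, max_drop=0.5):
    """Return True when current_rate fell more than max_drop below baseline."""
    if baseline_rate <= 0:
        # No usable baseline, so no relative drop can be computed.
        return False
    drop = (baseline_rate - current_rate) / baseline_rate
    return drop > max_drop


# With 70% of events failing validation, only about 30% of the usual
# valid-event rate would remain, which exceeds the assumed 50% threshold.
print(throughput_alarm(current_rate=30, baseline_rate=100))
```

A relative threshold is used here so that the same check works regardless of the absolute event volume.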