Incident documentation/20150205-EventLogging

From Wikitech

Summary

Accidentally deployed EventLogging code broke EventLogging's validation of client-side events between 2015-02-04 23:45 PST (2015-02-05 07:45 UTC) and 2015-02-05 08:15 PST (2015-02-05 16:15 UTC).

Server-side events were not affected, but they constitute only about 30% of the pipeline. Client-side events were not entered in the database at all during this outage.

Timeline

2015-02-04 23:45 PST (2015-02-05 07:45 UTC)
Reviewed but untested EventLogging code was deployed unintentionally.
EventLogging's processor could not validate client-side events because the capsule contained two fields that it no longer expected: 'clientValidated' and 'isTruncated'. Clients were not yet running the latest version of the code, in which these two fields are no longer sent, so they were still sending them.

Icinga alarms regarding validation were triggered.

The normal procedure for removing optional fields is to deploy the change to clients first and, once it has propagated to all clients, deploy the removal to the server. This keeps the change backwards compatible during the transition.

2015-02-05 08:00 PST (2015-02-05 16:00 UTC)
Team wakes up, figures out what happened and reverts the code
2015-02-05 08:15 PST (2015-02-05 16:15 UTC)
Validation proceeds normally
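
The failure mode in this timeline can be illustrated with a minimal sketch. It is not EventLogging's actual validation code: the `validate` helper, the field sets and the sample events are hypothetical, and only the field names 'clientValidated' and 'isTruncated' come from this report. It assumes a strict validator that rejects any capsule field it does not expect, and compares removing the fields on the server first (what happened here) with removing them on the clients first (the normal procedure).

```python
# Hypothetical capsule field sets. Only 'clientValidated' and 'isTruncated'
# come from the incident report; the other names are placeholders.
REQUIRED_FIELDS = {"schema", "revision", "event"}
REMOVED_OPTIONAL_FIELDS = {"clientValidated", "isTruncated"}

# Fields accepted by the server before and after the removal is deployed.
OLD_SERVER_ALLOWED = REQUIRED_FIELDS | REMOVED_OPTIONAL_FIELDS
NEW_SERVER_ALLOWED = REQUIRED_FIELDS


def validate(capsule, allowed_fields):
    """Return True if all required fields are present and none is unexpected."""
    keys = set(capsule)
    missing = REQUIRED_FIELDS - keys
    unexpected = keys - allowed_fields
    return not missing and not unexpected


# Event from a client that still sends the two optional fields.
old_client_event = {
    "schema": "ExampleSchema", "revision": 1, "event": {},
    "clientValidated": True, "isTruncated": False,
}
# Event from a client that has already stopped sending them.
new_client_event = {"schema": "ExampleSchema", "revision": 1, "event": {}}

# Server-first removal: events from old clients are rejected.
server_first_ok = validate(old_client_event, NEW_SERVER_ALLOWED)

# Client-first removal: the old server accepts both old and new clients,
# because the fields being removed are optional.
client_first_ok = (
    validate(old_client_event, OLD_SERVER_ALLOWED)
    and validate(new_client_event, OLD_SERVER_ALLOWED)
)
```

Under these assumptions, the server-side removal is only safe once no client sends the two fields any more, which is why the client change is deployed and allowed to propagate first.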

Conclusions

  • EventLogging deployments are anonymous. However, regardless of who deployed, what got deployed was an existing, reviewed commit from Gerrit and not rogue code, so the deployment appears to have been an accident by a trusted person rather than ill-intentioned. We are making sure all team members are aware that code must go through Beta Cluster and be tested there before it is deployed to production.

Actionables