Incidents/20150205-EventLogging
Summary
Accidentally deployed EventLogging code broke EventLogging's validation of client-side events between 2015-02-04 11:45 PST (2015-02-05 07:45 UTC) and 2015-02-05 08:15 PST (2015-02-05 16:15 UTC).
Server-side events were not affected, but they constitute only about 30% of the pipeline. No client-side events were written to the database during the outage.
Timeline
- 2015-02-04 11:45 PST
- Reviewed but untested EventLogging code was deployed unintentionally.
- EventLogging's processor could not validate client-side events because the capsule contained two unexpected items: 'clientValidated' and 'isTruncated'. Clients were not yet running the latest version of the code, which no longer sends these two fields.
- Icinga alarms regarding validation were triggered
The normal procedure for removing optional fields is to deploy the change to clients first and, once it has propagated to all clients, deploy the removal to the server. This keeps the change backwards compatible during the transition.
- 2015-02-05 08:00 PST (2015-02-05 16:00 UTC)
- Team wakes up, figures out what happened, and reverts the code
- 2015-02-05 08:15 PST (2015-02-05 16:15 UTC)
- Validation proceeds normally
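The ordering problem behind this outage can be illustrated with a minimal sketch. This is not the actual EventLogging processor code: the function `validate_capsule`, the set names, the required field names, and the schema name "Edit" are assumptions made for illustration. Only 'clientValidated' and 'isTruncated' come from this report. The sketch shows why the server has to keep accepting the optional fields until every client has stopped sending them.

```python
# Hypothetical sketch; not the actual EventLogging processor code.
# The required field names below are assumptions for illustration.
REQUIRED_FIELDS = {"schema", "revision", "event"}
# The two optional capsule fields that were being retired.
OPTIONAL_FIELDS = {"clientValidated", "isTruncated"}


def validate_capsule(capsule, optional_fields):
    """Return a list of problems; an empty list means the capsule is valid."""
    keys = set(capsule)
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - keys)]
    unexpected = keys - REQUIRED_FIELDS - optional_fields
    problems += [f"unexpected field: {name}" for name in sorted(unexpected)]
    return problems


# A client that has not picked up the new code still sends both fields.
old_client = {"schema": "Edit", "revision": 1, "event": {},
              "clientValidated": True, "isTruncated": False}
# An updated client no longer sends them.
new_client = {"schema": "Edit", "revision": 1, "event": {}}

# Transition period: the server still tolerates the optional fields,
# so events from both old and new clients validate.
assert validate_capsule(old_client, OPTIONAL_FIELDS) == []
assert validate_capsule(new_client, OPTIONAL_FIELDS) == []

# Removing the fields on the server first rejects every old client.
print(validate_capsule(old_client, set()))
# ['unexpected field: clientValidated', 'unexpected field: isTruncated']
```

Under these assumptions, the safe order is to pass `OPTIONAL_FIELDS` until all clients behave like `new_client`, and only then switch the server to the empty set.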
Conclusions
- EventLogging deployments are anonymous. However, regardless of who deployed, what got deployed was an existing, reviewed commit from Gerrit rather than rogue code, so the deployment appears to have been an accident by a trusted person and not ill-intentioned. We are making sure all team members are aware that code must go through the Beta Cluster and be tested there before it is deployed to production.
Actionables
- Status: Done. Make sure that log files exist for the affected period, so we can backfill.
- Status: Done. Backfill the data: https://phabricator.wikimedia.org/T88692