Incident documentation/20150206-EventLogging

From Wikitech

Summary

EventLogging code was sporadically dropping events from 2015-02-06 to 2015-02-10.

Timeline

2015-02-05 08:15 PST
EventLogging (EL) code was deployed from mainline (and not logged in SAL) to fix issues from the incident on 20150205.

Code seems to be working normally.

2015-02-06/2015-02-09

Alarms regarding throughput get triggered.

2015-02-11
Developers compare events in the db against validated events in the log and find discrepancies. The two should agree not 100% but about 99%. (Some valid events do not get inserted due to encoding issues and other errors.)
mysql --defaults-extra-file=/etc/mysql/conf.d/research-client.cnf --host dbstore1002.eqiad.wmnet -e "select left(timestamp,8) ts,
COUNT(*) from log.ServerSideAccountCreation_5487345 where left(timestamp,8) >= '20150128' group by ts order by ts;"
ts      COUNT(*)
20150128        18237
20150129        17546 
20150130        16556
20150131        15814
20150201        17079
20150202        17387
20150203        17888
20150204        11496
20150205        6640
20150206        11159
20150207        10307
20150208        10095
20150209        10375
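
The drop in the counts above can be made explicit with a small script. The following is a minimal sketch, not the check the developers actually ran: it compares each day against the mean of the first seven days in the table rather than against validated events in the log, because the log counts are not listed on this page. The function name flag_low_days, the 7-day baseline and the 0.75 threshold are all assumptions chosen for illustration.

```python
# Daily row counts copied from the query output above.
DB_COUNTS = {
    "20150128": 18237,
    "20150129": 17546,
    "20150130": 16556,
    "20150131": 15814,
    "20150201": 17079,
    "20150202": 17387,
    "20150203": 17888,
    "20150204": 11496,
    "20150205": 6640,
    "20150206": 11159,
    "20150207": 10307,
    "20150208": 10095,
    "20150209": 10375,
}


def flag_low_days(counts, baseline_days=7, threshold=0.75):
    """Return (day, ratio) pairs whose count is below threshold * baseline.

    The baseline is the mean count of the first `baseline_days` days in
    sorted order. Both parameters are illustrative assumptions.
    """
    days = sorted(counts)
    baseline = sum(counts[d] for d in days[:baseline_days]) / baseline_days
    return [
        (d, round(counts[d] / baseline, 2))
        for d in days
        if counts[d] < threshold * baseline
    ]


if __name__ == "__main__":
    # Print each day that falls well below the baseline, with its ratio.
    for day, ratio in flag_low_days(DB_COUNTS):
        print(day, ratio)
```

The same function could take per-day validated-event counts from the log as the baseline instead, with a threshold near 0.99, which would be closer to the comparison described above.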

Conclusions

  • We would benefit from looking at alarms right away rather than waiting several days
  • Alarms that say more precisely what is going wrong would help

Actionables