Incidents/20150206-EventLogging
Summary
EventLogging code was sporadically dropping events from 2015-02-06 to 2015-02-10.
Timeline
- 2015-02-05 08:15 PST
- EL code was deployed from mainline (the deployment was not logged in the SAL) to fix issues from the 20150205 incident.
The code appeared to be working normally.
- 2015-02-06/2015-02-09
Alarms regarding throughput are triggered.
- 2015-02-11
- Developers compare events in the database against validated events in the log and find discrepancies. The two should agree to about 99%, not 100%, because some valid events are not inserted due to encoding issues and other errors.
mysql --defaults-extra-file=/etc/mysql/conf.d/research-client.cnf --host dbstore1002.eqiad.wmnet -e "select left(timestamp,8) ts , COUNT(*) from log.ServerSideAccountCreation_5487345 where left(timestamp,8) >= '20150128' group by ts order by ts;"

ts        COUNT(*)
20150128  18237
20150129  17546
20150130  16556
20150131  15814
20150201  17079
20150202  17387
20150203  17888
20150204  11496
20150205  6640
20150206  11159
20150207  10307
20150208  10095
20150209  10375
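To make the drop in the query output above easier to see, the following is a minimal Python sketch that compares each day's row count against a baseline. The daily counts are copied from the output; the choice of baseline window (20150128 through 20150203), the 80% threshold, and the name `low_days` are assumptions made for illustration, not part of the original investigation.

```python
from statistics import mean

# Daily row counts copied from the query output above.
daily_counts = {
    "20150128": 18237, "20150129": 17546, "20150130": 16556,
    "20150131": 15814, "20150201": 17079, "20150202": 17387,
    "20150203": 17888, "20150204": 11496, "20150205": 6640,
    "20150206": 11159, "20150207": 10307, "20150208": 10095,
    "20150209": 10375,
}

# Assumption: the first seven days represent normal throughput.
BASELINE_END = "20150203"
baseline = mean(n for day, n in daily_counts.items() if day <= BASELINE_END)

# Assumption: a day is suspicious if it falls below 80% of the baseline.
THRESHOLD = 0.80


def low_days(counts, baseline, threshold=THRESHOLD):
    """Return the days (sorted) whose count is below threshold * baseline."""
    return [day for day, n in sorted(counts.items()) if n < threshold * baseline]


for day in low_days(daily_counts, baseline):
    ratio = daily_counts[day] / baseline
    print(f"{day}: {daily_counts[day]} rows ({ratio:.0%} of baseline)")
```

Under these assumptions, every day from 20150204 through 20150209 falls below the threshold, while all days in the baseline window stay above it.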
Conclusions
- We would benefit from looking at alarms right away instead of waiting several days.
- More precise alarms that indicate what is actually going wrong would help.
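As one possible shape for a more precise alarm, the sketch below turns the comparison described in the timeline (database rows versus validated events in the log, expected to agree to about 99%) into a check that reports how many events are missing. The function names, the message format, and the counts in the usage example are hypothetical; the source does not give validated-event counts.

```python
# From the timeline: DB rows and validated log events should agree to about 99%.
EXPECTED_RATIO = 0.99


def insertion_ratio(db_rows, validated_events):
    """Fraction of validated events that made it into the database."""
    if validated_events <= 0:
        raise ValueError("no validated events to compare against")
    return db_rows / validated_events


def describe_discrepancy(schema, day, db_rows, validated_events,
                         expected=EXPECTED_RATIO):
    """Return an alarm message, or None if the ratio is within expectations."""
    ratio = insertion_ratio(db_rows, validated_events)
    if ratio >= expected:
        return None
    missing = validated_events - db_rows
    return (f"{schema} {day}: {missing} of {validated_events} validated events "
            f"missing from DB ({ratio:.1%} inserted, expected >= {expected:.0%})")


# Hypothetical counts, for illustration only.
message = describe_discrepancy("ServerSideAccountCreation", "20150206",
                               db_rows=600, validated_events=1000)
if message:
    print(message)
```

An alarm phrased this way names the schema, the day, and the size of the gap, which is more actionable than a bare throughput warning.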
Actionables
- Status: Done Make sure that log files exist for the affected period, so we can backfill.
- Status: Done Backfill the data https://phabricator.wikimedia.org/T88692