Quarterly Review of post-mortems - 2014-03
Questions we want to be able to answer
- Have all of the issues that came out of the post-mortem been addressed? If not, why not?
- Are we satisfied with the current state of that part of the infra? Are there further actions to take (upon further reflection)?
- Anything else?
- Go through the post-mortems and their respective action items and make sure they have been followed up appropriately.
- If you have details that are relevant to the post-mortem in BZ/etc, please link from the post-mortem.
- Discuss if there is anything else that we learned from the situation and follow up to better inform future decisions.
- Notes written up by all, collaboratively, so that others in the organization will learn from these as well.
The post mortems
site outage ~ 2014-01-11 22:10 UTC
- TODO: Follow up with Sean and Tim about this. (Greg) - Status: Not done
- greg pinged sean 20140320
- Status: Done - Analyze CategoryTree problem and implement workaround
- Status: Not done - Fix monitoring of poolcounter service
- Status: Not done - Improve poolcounter extension error messages. Some context would be helpful, such as the poolcounter server contacted, the pool context, and the URL. And perhaps error messages even if only in English (as opposed to what's displayed to the user)
- Status: Done - Investigate page_restrictions query slowness on db1006
SELECT /* Title::loadRestrictions */ pr_type,pr_expiry,pr_level,pr_cascade FROM `page_restrictions` WHERE pr_page = '2720924'; domas says needs
- https://rt.wikimedia.org/Ticket/Display.html?id=7126 (closed/rejected)
- Forcing an index seems like the wrong approach here, or perhaps there was some miscommunication somewhere. The example query is properly indexed and fast on any S6 slave today including db1006, so it's likely something else was affecting these queries during the outage, or there was a storm of them, or something else has been fixed in the meantime. Springle (talk) 06:03, 15 April 2014 (UTC)
Most of the issues have been addressed (as of 2014-03-19); the remaining items are nice-to-haves, not critical.
- Status: Done - We should explicitly monitor some critical sysctl active values on systems.
- Status: Not done - LVS testing needs to include internal services testing, and simple TCP port connects may not tell the whole story.
- Status: Done - Check remaining uses of sysctl::parameters and their priorities (Andrew Bogott has committed to handling this).
- "Restore sysctl priorities." - Andrew Bogott
- Status: Done - We need to reinvestigate the performance impact of ntpd on present day LVS (which was found detrimental on old kernels years ago), or find a solution for maintaining the clocks on these systems if it’s still a problem.
- Upstream: Swift daemons die when syslog stops running LP:1094230
- Abandoned (because of inactivity) change to fix: https://review.openstack.org/#/c/24871
- Status: Done - Figure out something since the upstream issue probably won't be resolved:
- ALT1: use udp for syslog messages from swift?
- ALT2: upstart hook to restart swift when syslog is restarted?
- DONE: swift machines moved to trusty (bug T125024), which doesn't seem to be affected
- Status: Done - ping swift people (Faidon) - (unnecessary)
- Status: Done - syslog is not autoupgraded, so that shouldn't happen again
- Status: Done - wrap Math stuff in PoolCounter so it doesn't kill apaches so easily. More review on recent changes to Math. Be careful in rolling this release out further.
- PoolCounter: https://gerrit.wikimedia.org/r/#/c/111916/
- Status: Not done - Let's get better at reviewing the Math extension
- need client side knowledge, and caching
- Brion Vibber?
- Greg to ping Brion - (done)
- Status: Not done - Implement a true code deployment pipeline (so that all code spends a comparable amount of time in testing/beta cluster before hitting production)
- backlog entry for RelEng team - Long term
- Status: Done - Fix log rotation, run it hourly instead of daily
- Status: Done - Remove old init scripts and update documentation on the log file path
- Status: Done - Lower the warning threshold on parsoid node disk space to provide time to react
- Status: in-progress - Finish migration to async logging backend in Parsoid so that a full disk does not affect the service availability
- partly done, framework merged mid-March
- Status: Not done - Check the logging volume in Parsoid unit tests, less critical once logging is async
What we're doing to prevent it from happening again:
- Status: Done - We're going to monitor the slow query log and have icinga start complaining if it grows very quickly. We normally get a couple of slow queries per day so this shouldn't be too noisy. We're going to also have to monitor error counts, especially once we get more timeouts.
- Status: Done - We're going to sprinkle more timeouts all over the place. Certainly in Cirrus while waiting on Elasticsearch and figure out how to tell Elasticsearch what the shard timeouts should be as well.
- Status: Done - We're going to figure out why we only got half the settings. This is complicated because we can't let puppet restart Elasticsearch because Elasticsearch restarts must be done one node at a time.
- Was a puppet coding error. Fixed.
- Status: informational - Search is broken, with latency quadrupling to crazy numbers on a daily basis. We kinda knew that :( I'll leave the decision of what to do (fix or wait for ElasticSearch) to Nik.
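The slow query log check from the first item above could look roughly like the following. This is a minimal sketch, not the check that was actually deployed: the function names and the `warn`/`crit` thresholds are hypothetical, and only the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL) is the standard Nagios/Icinga plugin one.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def count_lines(path):
    """Count lines in the slow query log (path is supplied by the caller)."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def growth_status(previous_lines, current_lines, warn=20, crit=200):
    """Classify how much the log grew since the previous check run.

    warn and crit are hypothetical thresholds, not values from these notes.
    """
    if current_lines >= previous_lines:
        grown = current_lines - previous_lines
    else:
        # Fewer lines than last time: assume the log was rotated,
        # so every line in the current file is new.
        grown = current_lines
    if grown >= crit:
        return CRITICAL, grown
    if grown >= warn:
        return WARNING, grown
    return OK, grown
```

For instance, `growth_status(100, 150)` returns WARNING with 50 new lines. A wrapper would need to persist the previous count between runs and exit with the returned code.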
- Status: Not done - Investigate throttling of the API, address DOS vectors
- https://bugzilla.wikimedia.org/show_bug.cgi?id=62615 (private bug)
- Status: Not done - The Pybal DNS bug needs to be fixed. Until then, remove servers from /h/w/conf/pybal before renumbering or decommissioning them. It's probably a good idea anyway.
- Status: Not done - Make Pybal better about error detection/logging (it hides/makes opaque some backend errors)
- It's unfortunate that we noticed such an issue hours later via a user report. We should have an alert for unusual/high API latency (among others). The data is there, in Graphite, but we need a check_graphite to poll it. Matanya started that but it needs more work.
- Status: On hold - https://gerrit.wikimedia.org/r/#/c/118435/
- Status: Not done - https://bugzilla.wikimedia.org/show_bug.cgi?id=57882
- Status: Not done - Talk with Brad on his thoughts on monitoring the API (error rates etc) - NEEDS BUG
- Similarly, we probably need to monitor reports for failing/retried requests & alert when they happen. The current reqstats/reqerror graphs report errors from frontends which in this case were showing no errors as they were retrying and succeeding. We really need to overhaul the whole metric collection & alerting there.
- Status: Not done - T83580
- Status: Not done - Talk with Analytics team, they probably have related analytics
- Status: informational - Good candidate for a monitoring Sprint at the Hackathon
- Status: Done - Reconcile the Parsoid/Varnish connect/TTFB timeout differences.
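The check_graphite-style alert on API latency discussed above could evaluate Graphite's data along these lines. This is a sketch under stated assumptions, not the change in the linked Gerrit patch: the thresholds and function names are hypothetical, and the input is assumed to be the body of a Graphite render response requested with `format=json`, which is a list of series, each with a `datapoints` list of `[value, timestamp]` pairs.

```python
import json

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(render_json):
    """Return the most recent non-null value of the first series, or None."""
    series = json.loads(render_json)
    if not series:
        return None
    # Graphite pads recent buckets with nulls, so walk backwards.
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


def latency_status(render_json, warn_ms, crit_ms):
    """Map the latest latency value to a plugin exit code.

    warn_ms and crit_ms are caller-chosen, hypothetical thresholds.
    """
    value = latest_value(render_json)
    if value is None:
        # No data is reported as UNKNOWN rather than silently OK.
        return UNKNOWN
    if value >= crit_ms:
        return CRITICAL
    if value >= warn_ms:
        return WARNING
    return OK
```

Returning UNKNOWN when no datapoints are available is a deliberate choice here, so that a broken metric pipeline is not mistaken for a healthy API.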
- Status: Not done - documentation update needed for when moving boxes?
- Greg to talk to RobH and Chris FILE TICKET
- Suggestion: Have mark/faidon work jointly on future moves?
- Suggestion: Investigate sharing the same source file? Or having some minimal automatic checking?
- Status: Not done - Investigate potential scap bug with the change of mw versions
- Why didn't puppet pull in the latest versions of deployed mw?
- see also: the eventual consistency requirement for deployment tooling
- Status: Done - make scap report rsync errors
- Status: Done - put rsync proxy config in puppet
- Status: Done - Add mw1161 and mw1201 as scap proxies for EQIAD row C and D