Incident documentation meeting/QR201407/group1/notes
- migrated to m2 shard, shouldn't have too many load issues in future
- analytics is responsible for responding to alerts that come from analytics
- ops is responsible for generic-looking database alerts
- EL can be down or lagging for up to 48 hours (weekends) - "Tier 2" support
- would have been good to have Ariel on the call
- greg to follow up on explicit next steps with Bryan and Reedy
- Add to next group's list
- all green :)
- seems all bases are covered here, any disagreement? :)
- blog work, loop back with RobH re future of that box? HA? etc?
- how far away to get rid of blog?
- still need to create reproducible steps for this to be reported upstream
- still need to manually remove a sick node (on purpose)
- Friday! :)
- need a bug for "Add monitoring for individual job types on single machines."
- Should deploy https://gerrit.wikimedia.org/r/#/c/144612/ before hhvm goes to jobrunners
- MediaWiki failed to stop trying to use the bogged down machine
- Greg: need to get this diagnosed and tracked
- HHVM's impact here?
- proposal 4 related to Rashomon?
- epically awesome fix: http://ur1.ca/hpjpa  ;)
- need the Swift bandwidth
- metrics: https://bugzilla.wikimedia.org/show_bug.cgi?id=67116 Please help :)
- Still have the feature request for scap here