Incident documentation meeting/QR201407/group1/notes

migrated to m2 shard, shouldn't have too many load issues in future
analytics is responsible for responding to alerts that come from analytics
ops is responsible for generic-looking database alerts
EL can be down or lagging for up to 48 hours (weekends) - "Tier 2" support
- https://www.mediawiki.org/wiki/EventLogging/OperationalSupport

Friday! :)
need a bug for "Add monitoring for individual job types on single machines."
Should deploy https://gerrit.wikimedia.org/r/#/c/144612/ before HHVM goes to jobrunners

MediaWiki did not stop trying to use the bogged-down machine
- Greg: need to get this diagnosed and tracked
- HHVM's impact here?
proposal 4 related to Rashomon?