Incident documentation meeting/QR201407/group1/notes

20140318-EventLogging

  • migrated to m2 shard, shouldn't have too many load issues in future
  • analytics is responsible for responding to alerts that come from analytics
  • ops is responsible for generic-looking database alerts
  • EL can be down or lagging for up to 48 hours (weekends) - "Tier 2" support

20140328-DB-Queries

  • would have been good to have Ariel on the call
  • greg to follow up on explicit next steps with Bryan and Reedy
  • Add to next group's list

20140509-EventLogging

  • all green :)
  • seems all bases are covered here, any disagreement? :)

20140526-m1

  • blog work, loop back with RobH re future of that box? HA? etc?
  • how far away are we from getting rid of the blog?

20140607-Elasticsearch

  • still need to create reproducible steps for this to be reported upstream
  • still need to manually remove a sick node (on purpose)

20140613-Videoscalers

20140619-parsercache

  • MediaWiki failed to stop trying to use the bogged down machine
    • Greg: need to get this diagnosed and tracked
    • HHVM's impact here?
  • proposal 4 related to Rashomon?

20140622-es1006

20140622-imagescaler

20140625-CirrusSearch