Talk:Incident documentation/20160123-SessionManagerRolloutFailure

Thanks for writing that up, Bryan, it's a very thorough summary! Some ideas for actionables (apart from the ones already mentioned):

  • make debugging on mw1017 easier - apart from Ori's suggestion, there is T117020
  • make that debug log available from logstash (with a short TTL since the volume is huge) and add some field to it which makes it easy to identify requests (e.g. user agent, which can be set to a custom value)
  • start some test documentation for CentralAuth - there are some branches that are hard to guess (e.g. how it can behave significantly differently depending on the browser's handling of third-party cookies)
  • figure out how mobile apps can be tested in time (the Android app can visit testwiki, but can it log in there? Beta would of course be better still. And what about the iPhone app?)

--tgr (talk) 06:03, 27 January 2016 (UTC)

* make that debug log available from logstash (with a short TTL since the volume is huge) -- having a different TTL for a given log group would take a special setup in logstash and the Kibana front end. Not impossible, but not trivial either. The technical reason is that Elasticsearch index segments are write-once (immutable), so a delete only flags a document as deleted; the space is reclaimed once a significant fraction of the documents in a segment are flagged, at which point the segment is rewritten to a new one that excludes the deleted docs and the old one is discarded. We put all log events for each day in a single index regardless of log channel and expire logs by dropping entire indices. To cleanly have a different expiration we would need to populate and manage a separate collection of indices for the shorter duration events. --BryanDavis (talk) 18:50, 30 January 2016 (UTC)
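
For illustration only, here is a rough sketch of what "a separate collection of indices for the shorter duration events" could look like from the expiry side. It assumes daily indices named with a prefix plus a YYYY.MM.DD date (the Logstash default style); the `logstash-debug-` prefix and both retention values are made up for the example and are not our actual configuration. The function only decides which whole indices are old enough to drop; it does not talk to Elasticsearch.

```python
from datetime import date, timedelta

# Hypothetical retention policy: index name prefix -> days to keep.
# Both the "logstash-debug-" prefix and the numbers are assumptions.
RETENTION_DAYS = {
    "logstash-debug-": 3,
    "logstash-": 31,
}


def index_day(name, prefix):
    """Parse the trailing YYYY.MM.DD of a daily index name, or return None."""
    suffix = name[len(prefix):]
    try:
        year, month, day = (int(part) for part in suffix.split("."))
        return date(year, month, day)
    except ValueError:
        # Not a daily index for this prefix (wrong shape or invalid date).
        return None


def expired_indices(names, today, retention=RETENTION_DAYS):
    """Return the index names older than the retention for their prefix."""
    expired = []
    # Try the longest prefix first so "logstash-debug-" wins over "logstash-".
    prefixes = sorted(retention, key=len, reverse=True)
    for name in names:
        for prefix in prefixes:
            if name.startswith(prefix):
                day = index_day(name, prefix)
                keep_for = timedelta(days=retention[prefix])
                if day is not None and today - day > keep_for:
                    expired.append(name)
                break  # only the most specific prefix applies
    return expired
```

Because each retention class lives in its own set of daily indices, expiry stays a cheap "drop the whole index" operation for both classes, which is the point of keeping the short-lived events out of the main daily index.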

Structuring the document

As Gergo said, great writeup Bryan! There's a lot we can learn from this.

A low-priority fixme for someone (possibly me) would be to structure this document to make it more suited to collaborative editing and for organized reading. The current doc is in storytelling form, where a couple of bulleted lists would be useful: "what went well" and "what we should look into". A low-priority phuture Phab ticket phor me: write up retrospective/incident-document/postmortem best practices doc. -- RobLa (talk) 02:28, 5 February 2016 (UTC)

Credit where credit is due

The very detailed list of events was in large part due to help from @Aklapper, who was a great collaborator on the Etherpad. --BryanDavis (talk) 03:28, 5 February 2016 (UTC)

Etherpad archive