Incident Response Process Improvement

ONgoing reFormulation of Incident Response Efforts (ONFIRE)

The problem

Incident Response has been raised repeatedly by the WMF SRE Team as an area for improvement in various ways. In January 2019, a discussion amongst SREs raised the following key pain points:

  • “I don’t know if I need to do something”
  • very unequal distribution of burden
  • not clear what to do or how to do it or who to escalate to
  • don’t have shared understanding of process and definitions.

Paging & Incident Response Working Group

At the January 2019 offsite, the SRE team agreed that a working group, with Joel facilitating/PMing, will identify and offer solutions for the problems with status quo incident reporting, with something to implement before June 2019.

P&IR Working Group Notes

Participation

Data Center Ops: Rob

Data Persistence: Manuel

Infrastructure Foundations: Riccardo, Chris Danis

Service Ops: Giuseppe, Effie

Traffic: Arzhel

PM: Joel

Notes from January 2019 SRE team discussion

Scope of Work

  • What pain points can be fixed?
  • Who should react during weekends or public holidays?
  • How to react when something alerts and the "experts" aren't around? e.g. Networking? Databases?
  • How to assess if immediate action is needed?
  • Initially: Cookbooks?
  • Who's responsible for triaging alerts that don't have a clear owner? e.g. MediaWiki alerts
  • What self-evident improvements should be made?
  • What systems (people processes, computer systems, automation) may reasonably be changed in this time frame at acceptable risk?

Alerts and Incidents

  • An alert is anything sent out on icinga (or other similar monitoring system).
  • It is not the same if it is sent via IRC or via a page.
  • An incident is a service problem with some set procedures for resolving, communicating, and following up.
  • We should probably split between service problem and user facing problem, as it is not always the same (or at least to the same degree - ie: gerrit)
  • An alert may be the first warning of what turns out to be an incident, but incidents are identified/declared in the middle of reacting, not at the first moment.
  • Dashboards
  • Is it possible to have a short list or a single dashboard indicating "all good" for all systems?

Definition of Done

  • When is this project done?
  • What would constitute success?
  • What qualitative and quantitative metrics are worth tracking?

How to work

  • Should the working group meet synchronously on a regular basis?
  • How will work be identified, documented, tracked, and completed?
  • Should we be completing a comprehensive proposal, or trying to roll out changes incrementally?

Timing

  • Proposal by June 2019?
    • If so, intermediate steps? OR, incremental changes or proposals up through June 2019? If so, when?

Decision-making

  • If the working group proposes a change prior to June, who decides what changes we should make, and if/when we should roll back?
  • Who will decide which proposed changes in June will be tried?
    • Options: SRE directors. The Working Group? All SREs? All affected parties? Decisions by pure consensus? Rough consensus? Manager decision with consultation?

Stakeholders

Whose problems are we trying to improve outside of the SRE team? Who could suggest other pain points? Who will be affected directly by changes, and who will need to approve any changes?

Notes from 2019-03-13 meeting

What is our output scope?

OPTIONS:

  • a document for Dublin
  • doc plus pilot, low-hanging fruit, and/or trials
    • What would a pilot be? We are a small group, too small to pilot changes for a big group.
    • Fix some low-hanging fruit. Make a small change like making a rule (for all SRE) that if you add a new alert then you must write the doc [...]
    • Maybe get feedback on ideas
    • Small experiments
    • Skepticism that there is any low-hanging fruit in Incident Response
  • doc + anything else

Consensus: We should produce a proposal doc and also feel free to test small or low-risk changes that either solve a part of the problem or validate something in the proposal.

Problem Statement

(We think that these two high-level problems cover all problems in scope. Other notes on this talk page include other details that could be grouped in one of these two mega-problems. All related issues are in scope to start with, but our proposal may not ultimately address them all.)

  1. alarm is inaccurate and imprecise. some examples:
    1. Bad alarming; we are alarming on the wrong things and not the right things
    2. Icinga doesn't group alarms, so lots of repetition
    3. Alarm priority is not clear - trivial and bad alarms both "sound" the same.
    4. We do not have a clear definition of "Is service X up?"
    5. superfluous secondary alarms
      1. e.g. a switch failure will cause dozens/hundreds of alerts that don't point to the actual problem
    6. ownership unclear for services and incidents
  2. systems and practices for coordinating response are inadequate
    1. responding to simple alarms has many problems
      1. current system spreads the personal load unevenly,
      2. is hard for newcomers to onboard to
      3. often fails to answer the question "should I be doing something about this right now?" for most participants. … Who owns this? Who should I be pinging?
      4. not standardized
    2. Major disaster response is not well-defined, documented, or practiced
      1. including escalation and acknowledgment paths
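
The "superfluous secondary alarms" example above (a switch failure producing dozens of alerts) can be illustrated with a minimal sketch. It assumes each alerting host has a known upstream parent, such as its switch. The host names, the `PARENTS` map, and `group_alerts` are hypothetical and are not part of Icinga or any existing tooling.

```python
from collections import defaultdict

# Hypothetical topology: host -> upstream device it depends on.
PARENTS = {
    "db1001": "switch-a1",
    "db1002": "switch-a1",
    "mw1001": "switch-a1",
    "mw2001": "switch-b1",
}


def group_alerts(alerting_hosts, parents=PARENTS):
    """Group alerting hosts under their parent when the parent is also alerting.

    Returns a dict mapping each probable root cause to the sorted list of
    secondary alerts it explains (an empty list when it explains none).
    """
    alerting = set(alerting_hosts)
    groups = defaultdict(list)
    for host in sorted(alerting):
        parent = parents.get(host)
        if parent in alerting:
            # The parent is down too, so this alert is probably secondary.
            groups[parent].append(host)
        else:
            # No alerting parent: treat this host as its own root cause.
            groups.setdefault(host, [])
    return dict(groups)
```

With this grouping, a responder would see one entry for the switch and one for any unrelated host, rather than one line per affected machine.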

Notes on the work

possible next steps and constraints
  1. Work on the definitions
  2. Solve 1 before 2
    1. problem 1 is cheaper to fix and may redefine or shrink problem 2.
    2. does this mean we should fully finish a round of fixing 1 before even defining 2 in more detail? TBD.
  3. Separate technical vs cultural issues
  4. tie to existing tasks & prioritize
Conceptual model of the project work
  1. high-level problem definition
  2. detailed problem definition; use cases
  3. solution design
  4. Some kind of approval (anything from the working group self-approving a trial to the whole SRE team approving a broad proposal)
  5. Implement (test, pilot, or complete rollout) one or more solutions to one or more problems
  6. Evaluate if the solution(s) solved the problem(s)
How we should work on this project

Some options:

  1. Do all the steps waterfall-style. Have a complete problem definition before considering solutions. design all solutions before implementing anything.
    1. This was implied by the Jan 2019 offsite discussion, but not considered directly.
  2. Waterfall problem definition; incremental fixes. Complete steps 1–2, and then design and roll out fixes incrementally
  3. agile spiral: pick a high-level problem, pick a detailed problem within it, design a fix, implement it. repeat.
  4. Undirected exploration: work on one or more levels at once and explore the problem and solution space simultaneously. E.g., try to fix something smallish that seems broken, then work backwards to clearly define what "broken" meant, what "fixed" should mean, and revisit the proposed fix, and possibly learn something about adjacent problems in the process.

Since the Working Group is not expecting to comprehensively fix the problem by June, the "final" output in many cases may be a proposal, rather than an actual change to tools and process and culture. This complicates all of these working models, so it will be very important to remain clear on where the output is an actual change in real life vs where the whole cycle is on paper only.

Possible next actions
  • Create a standard vocabulary
    • make a list of all words and phrases that are in use or may be considered
    • document status quo definitions
    • propose new definitions
  • Improve program/service ownership records
    • This? https://phabricator.wikimedia.org/T216088
    • Do a deep dive on several key programs and improve/complete/elaborate documentation on who is responsible
      • The most paged services are DB and Cloud, so those may be good candidates for deep dive
        • I don't think that is super accurate. Taking a look at the last pages over the last year, whilst looks like Cloud does page the most we also have lots of LBs, wdqs. Should we try to identify first alert -> owner kind of thing first maybe? Starting to identify owners based on the already triggered alerts could be a good way to start narrowing things down Marostegui (talk) 07:10, 14 March 2019 (UTC)
    • AND/OR Do a broad and shallow survey to refresh and fill out the list of everything
  • look at existing bugs/alerts
    • And do what?
  • Further develop the high- and medium-level problem definition
    • review all relevant text on this page and try to consolidate into the 2 problem areas, or make new problem areas.

Next Steps

  1. joel to book a next meeting; and look for a weekly slot
  2. joel to send out notes for
  3. everybody else to pick one of the "possible next actions" top bullets (or add a new one) and write up a work plan and possibly start working on it.

Notes from 2019-03-22 meeting

Scalability Issues

  • blocks everybody's work: even if not working on it, paying attention to it and being ready to respond
    • everyone looking at logs, suspecting security incidents in every area
  • Allocation of work to different people can be based on people deciding what to do for themselves, with minimal coordination
  • Don't consistently have a "stand down" statement, and way to distribute/check it, allowing people to lower their vigilance
  • We need coordination, if not direction
  • Same people involved every single day
    • no "fresh air"
    • Those people didn't get to take a break
    • Other people not sure how to help

more proposed improvements

  • have an incident response coordinator
    • quickly decide
      • who
      • How to get 24-hr coverage
      • how does everybody know we are in an incident and who the coordinator is and what tasks are pending and how non-participants are affected?
  • Runbooks for workarounds for major tool outages
    • X is down; how do I keep doing my routine job?
    • X is down, how do I get (unrelated to the incident) emergency work done?
    • X is down; how do I do incident response deployment?
    • How to firewall a service + how staff can work around the firewall
    • X: Gerrit, Phab, (Broken deploy train doesn't affect SRE)
      • which break pushing changes to puppet and DNS
      • There was a list of [] and a long-term action item list and people were grabbing from it
      • Faidon was doing coordination when he was online
    • External communication: announce status; respond to inquiries; provide updates

Possible working group outputs

  • work on "proposal drafting and work breakdown" for specific issues
    • break down to whatever level of detail is reasonable/doable
    • ask experts
    • How should we go about making proposals for people to do preparatory work for theoretical incident response work?
      • model from previous incident responses
  • Prep a decision: identify the decision, identify the options, pros and cons,
    • either for the P&IR group to decide, or to send to all SRE to decide
Possible work units
  • catalog the last 6 months of incidents/alerts and look for commonalities/patterns
    • https://wikitech.wikimedia.org/wiki/Incident_documentation
    • for incidents
    • for alerts
    • find examples of "incident response": pages that led to some work but weren't labeled as incidents. did require waking people up
    • for each workflow/intensity, define how we connect (Riccardo's example from $JOB-N)
    • review IRC logs for an incident (or alert w/o incident?) and document through the workflow
      • Output: pick ~5 and try to group into patterns/levels of intensity. How many people got involved, how long did it take, how did it escalate?
  • Tool-outage workaround runbooks (e.g. How do we work if Phab is down?)
    • Deploy DNS if gerrit is down
    • Deploy a puppet change if gerrit is down (... later)
    • see row 24 et al above
  • Define what is a page, what is an incident
    • do definitions now
    • flesh out/validate by looking at history (LATER)
  • Define rough levels of escalation
    • examples/state machine
      • normal, no incident identified
      • potential incident being explored
      • potential major incident being explored
      • incident, no coordinator
      • incident, coordinator
      • incident, coordinator, all hands on deck
    • including escalation and acknowledgment paths
  • create index system to help people find incident response material
    • (past) prefixing system
    • wikitech namespace
    • wikitech categories and labels
    • Proposed overall Vocabulary
      • maybe start an etherpad NOW?
    • browse model?
    • push model (in training, onboarding,
  • Notes url for each icinga alert; non-existing ones get Phab links or [non-empty] wikitech pages
  • Use cases
    • got an icinga alert but need to find and look at a different runbook
    • starting from scratch with [some mystery cue or prompt], need to figure out wtf
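
The rough escalation levels listed under "Define rough levels of escalation" above can be sketched as a small state machine. The six states are taken from the notes. The allowed transitions are an assumption made for illustration only, since the notes list the states but not the moves between them, and `Level`, `TRANSITIONS`, and `move` are hypothetical names.

```python
from enum import Enum


class Level(Enum):
    NORMAL = "normal, no incident identified"
    POTENTIAL = "potential incident being explored"
    POTENTIAL_MAJOR = "potential major incident being explored"
    INCIDENT = "incident, no coordinator"
    COORDINATED = "incident, coordinator"
    ALL_HANDS = "incident, coordinator, all hands on deck"


# Assumed transitions; the notes list the states but not the allowed moves.
TRANSITIONS = {
    Level.NORMAL: {Level.POTENTIAL, Level.POTENTIAL_MAJOR},
    Level.POTENTIAL: {Level.NORMAL, Level.POTENTIAL_MAJOR, Level.INCIDENT},
    Level.POTENTIAL_MAJOR: {Level.NORMAL, Level.POTENTIAL, Level.INCIDENT},
    Level.INCIDENT: {Level.NORMAL, Level.COORDINATED},
    Level.COORDINATED: {Level.NORMAL, Level.ALL_HANDS},
    Level.ALL_HANDS: {Level.COORDINATED, Level.NORMAL},
}


def move(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target
```

Writing the transitions down explicitly forces decisions such as whether an incident can jump straight to "all hands on deck" without a coordinator being named first.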

These work units are probably better done after everything above:

  • Define a policy regarding phone availability and escalation
  • Incident response coordinator SOP
    • how do i decide which 20 things to try, in which order, and when to stop trying something
    • how do we decide who is working on which thing?
    • standard response vs freeform troubleshooting or debugging
  • typical workflow (meta-runbook) in responding to a page (from acknowledging the page on IRC to finishing an incident post-mortem), including coordination, helping
    • what different workflows or intensity levels are common? NOW
  • Figure out alarm grouping solution
    • Icinga doesn't group alarms, so lots of repetition
    • Example of something that we (P&IR WG) would probably produce a decision matrix, not a proposal
  • Figure out alarm priority protocol
    • Alarm priority is not clear - trivial and bad alarms both "sound" the same.
    • something P&IR WG could not do itself
  • List of services
    • (See "Improve program/service ownership records" in 2019-03-13 notes)
  • Create clear definition of "Is service X up?"
    • depends on list of services, and their boundaries
  • reduce superfluous secondary alarms
    • identify common ones

Next Steps

Do initial work on these work units; present material at next meeting to confirm we are all going in the same, and productive, direction.

  • Chris + Riccardo
    • Define what is a page, what is an incident
    • Define rough levels of escalation
  • Effie (Manuel might be able to help)
    • catalog the last 6 months of incidents/alerts and look for commonalities/patterns
  • Riccardo
    • Tool-outage workaround runbooks (e.g. How do we work if Phab is down?)

Notes from 2019-03-27 P&IRPI-WG Meeting

Agenda

  1. Check in on results of research threads
    1. Is each going in the right direction, or defining a new right direction?
  2. Make decisions if needed
  3. How/when do we check in with SREs overall?
  4. Plan next actions
  • Is it time to try and summarize our status for all SREs, so that they can give us feedback?
    • not yet; too many cooks
    • maybe in 2 weeks
  • What kinds of next steps should we consider? If we aren't sure what to do next in a work thread, consider the following steps:
    • start writing a "straw dog"—a proposal for reaction. MVP is to be "usefully wrong"
    • enumerate all related open questions and decisions
    • identify choices for open questions/decisions
    • flesh out options; identify pros/cons; create prototypes, mockups, proposals

Incident History Review

  • catalog the last 4 months of incidents/alerts and look for commonalities/patterns
  • Manuel & Effie
  • Output: https://etherpad.wikimedia.org/p/incidents
  • Reviewed ~8 incident reports
  • Analysis
    • There are things missing and things that could be better
    • Template for an incident is good
    • People under-provide information, or provide too much information
    • Could use, e.g., the person on (Clinic) duty to curate the incident response
      • Identify correct level of detail
      • identify things missing and ping people to fill them in
      • Not a duty to re-write it, but to [organize people to complete it]
        • e.g. author didn't provide graphs.
    • Quality is variable
      • "fire and forget"—we finish the incident and don't do followup/cleanup
      • Don't always document: what commands were used; will it happen again; …
    • an incident report should be considered closed only if all the action items have been tracked in phabricator/documentation or fixed
    • Not clear why some things get an incident report and others don't
      • only heuristic I can see is, do Mark and Faidon say there should be an incident report
  • Suggested Work
    • improve the template & force requirement of use
    • define how an incident is 'declared/identified'
    • set norm/SOP/prioritization around finishing/following up and mechanism to verify
      • [Maybe followup in Monday meeting?]
      • [Incident Review Board]
      • Tooling
        • "OMG" oversight management tool: collects data for incident in progress
        • other tooling to assist documenting steps taken during firefighting / action items needed afterwards? IRC bot like meetingbot?
        • (Custom Phab template)
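
The closure rule from the analysis above (a report is closed only once every action item is tracked or fixed) is simple enough to check mechanically. The following is a minimal sketch under that assumption. `ActionItem`, `IncidentReport`, and the `tracked_in` field are hypothetical and do not describe an existing template or tool.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    summary: str
    tracked_in: str = ""  # assumed field: a Phabricator task or documentation link
    fixed: bool = False


@dataclass
class IncidentReport:
    title: str
    action_items: list = field(default_factory=list)

    def open_items(self):
        """Action items that are neither fixed nor tracked anywhere."""
        return [a for a in self.action_items if not (a.fixed or a.tracked_in)]

    def can_close(self):
        """A report may close only when no action item is left untracked."""
        return not self.open_items()
```

Whoever curates the report could use `open_items()` as the list of things to ping people about.
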
Next Action
  • This is the right kind of information
  • This is enough background research, no need for further past incident report reviews, there is enough info to make/take actionables.
  • Needs more discussion?
    • No, we can work on straw dog.
  • TODO: start writing proposal/straw dog
    • scope: better incident organizing and reporting
    • Contents
      • proposed process changes
      • proposed cultural changes
      • Decisions we would need to make
      • prototypes and examples of artifacts, specific enough to elicit meaningful responses

Definitions of alert/page/incident

  • Define what is a page, what is an incident
  • Define rough levels of escalation
  • Chris + Riccardo
  • Output: https://etherpad.wikimedia.org/p/igPdLbfJVjI9LgX8ayXI
  • Summary
    • anyone in SRE can declare that an incident is taking place (be bold). incident is finished when clear consensus within incident response team is reached.
  • Responses
    • Escalations, response to pages, expected behavior, all are ill-defined
    • Since no one has come up with a solid definition, my only objection is to spending hours trying to define what an incident is; it is not easy to quantify
    • my definition is "you know one when you see it" ;)
    • Need to address epistemological issues
      • how do I know we are "in an incident"?
      • How do I know whether I know enough to decide if we are in an incident?
      • If someone decides retroactively that an incident has already started, what does anyone need to do with that new information?
      • How do I know what I don't know (about other services and impact)?
      • How do I know when I don't know enough and need to escalate?
      • How do we know when there is a clear consensus to close an incident?
    • How do we notify everybody that there's an ongoing incident? Implementation detail, but important one
    • Is it helpful to think of definitions in terms of "what actions do I take next" state machine?
      • E.g., In state "alerted", I don't know whether or not there is an incident, but I have some reason to apply extra scrutiny to available information, and maybe a clue of where to look further.

Next action

  • Continue with more research
    • in what direction? Ran out of time to discuss.

deploy a DNS change without Gerrit

Next action
  • Move on to the puppet runbook proposal
    • Enlist joe if possible

Icinga Dashboard research

Next action

Ran out of time to discuss

Overall Next Steps

  • (Riccardo, Effie, Manuel, Chris, _joe_) some more research/work for the three or four current research areas
  • (Joel) move notes to wikitech; summary email describing next steps
  • (Joel) start a google doc to collect proposal content
  • (Joel) collate all of the ideas in the notes (this wiki page and talk page) into one or more lists
  • (Joel) try to fit in another meeting this week?
  • (Joel) remember to introduce David next meeting

2019-04-04 Status Meeting

Proposal for better incident report SOP

  • Manuel, Effie:
  • previous TODO: start writing proposal/straw dog
    • scope: better incident organizing and reporting
    • Contents
      • proposed process changes
      • proposed cultural changes
      • Decisions we would need to make
      • prototypes and examples of artifacts, specific enough to elicit meaningful responses
  • Proposals we have discussed:
    • who is Incident Commander? ( / Coordinator)
      • either clinic duty person is the incident commander
        • Con: overloads that position
      • anyone within the team can do that
      • One of the SMEs for the affected system
      • Incident Commander (according to Pagerduty e.g.) doesn't fix anything: only coordinates information
        • Marshals the SMEs rather than being the SME
        • coordinates the person fixing the problem, rather than fixing the problem
        • works from a runbook, doesn't make many or any decisions
      • Change the template
      • …(see past meeting notes for ideas)
    • Incident Review Board
      • Group that regularly follows up on last week of reports
      • who?
        • fixed group
        • rotating, longer than clinic duty but shorter than a quarter
        • next N IRs
        • Case by case during Monday morning review
      • how?
        • group communicating by mailing list
      • What?
        • evaluate the incident report
      • ask for more details if needed
        • ensure uniformity of writeups (including that there is a writeup)
        • suggest followup actions (for specific people, for the incident responders, or for the backlog)
    • Tooling for followup
      • "OMG" oversight management tool: collects data for incident in progress
      • other tooling to assist documenting steps taken during firefighting / action items needed afterwards? IRC bot like meetingbot?
      • (Custom Phab template)
      • example template for decom: https://phabricator.wikimedia.org/maniphest/task/edit/form/52/
  • TODO: Joel to write this up into a Proposal
    • Very rough and abbreviated outline of a proposal for Incident Commander, IRB, Tooling
    • prototype of what we want to present to SREs in Dublin
      • Identify Problem
      • Describe potential solution
      • describe alternatives & decisions
        • pros and cons
        • recommendation

Definitions of alert/page/incident

  • Not clear what to do next
  • TODO: Giuseppe + Chris, first draft by next meeting

Continuity Runbooks

  • Runbooks for keeping things working during an incident-related critical service outage
    • deploy a DNS change without Gerrit - DONE
    • deploy a puppet change without Gerrit?
      • Lay out what work is needed
  • TODO: Riccardo

2019-04-10 Status Meeting

Actionables

Topics

Incident Severity Levels for SRE:

  • manuel: colours and numbers are fine, but they may not make much sense if there is no oncall rotation
  • riccardo: maybe it is not doable without rotation
  • joe: maybe we could distinguish if there are things needing a rotation
  • arzhel: have two separate "proposals" one with what we can achieve with oncall rotation and one without oncall
  • chris: this looks ok
  • effie: I disagree with the colours
  • joe: they should be different
  • we need to find something to visualise the severity level
  • Riccardo: what about incidents that involve more teams or are not SRE specific
  • joe: when we define something internally, we could communicate it to other teams
  • Manuel: so do we need to get the alert? fix it? page others?
  • david: We will need to have different SLAs on the security side. For example, there isn't a lot of difference between the SLAs for Black through Yellow.
  • joe: if we have e.g. a red alert, this could be escalated to the rest of SRE. The responsibility would be to respond
  • effie: what if we had a rotation where one would just keep an eye on things, and page people if needed
  • joe: that is up to the teams to decide
  • manuel: we should discuss a bit more the codes
  • chris: we could add more examples about each one
  • manuel: we could go back to post-mortems and try to add levels
  • arzhel: we can have cases where something starts with one level and escalates to a higher one
  • joe: alerts could mean different things to service/server owners
  • manuel/volans: what about service-based alerts
  • joe: maybe we should try to keep it simple
  • david: better use 0-4 or 1-5 levels, 0 being UBN
  • joe: at some point we will need service level alerts

Incident review board:

  • Effie: are we going to be able to find volunteers to the board/workgroup
  • Manuel: If our templates are good enough and if we improve them, the amount of work will be little
  • joe: maybe a fixed group won't work, unless we could have the ops duty make sure that tasks are being taken care of. Volunteering might not be optional here; maybe everyone should be in that rotation
  • Manuel: I think having a fixed group means that we'll mostly have the same line of work and criteria
  • Arzhel: deciding who will do it is a detail; for now we can have guidelines

Icinga Dashboard

  • More research
    • screenshots, other ways

2019-04-17 Meeting Notes

What are we working on?

grid

The problem/solution/priority grid.

proposal doc

Model/template for our output to Dublin

specific items

  • Come up with better names/scheme on alert levels
    • PAST TODO: Select a couple of services and try to see how they fit with this scheme
    • Do we have a working doc for this?
      • some lines in the meeting notes
      • some overlap with Incident Severity: alert severity will feed into incident severity
      • Joe's proposal: take some sample systems and try to map their existing alerts into the spreadsheet
        • Work breakdown:
          • Pick a system
            • TODO: Riccardo: cumin, debmonitor, puppetmaster
            • TODO: Arzhel: network + traffic (best effort)
            • TODO: CDanis: prometheus
            • TODO: manuel: databases
          • add a new column for each system to Incident Severity Levels spreadsheet
            • In the column, try to list all of the levels of alert for the system and figure out which rows they go into
            • Chris: think about system-level failure scenarios.  e.g. "app servers are crashlooping" would correlate with 100s of alerts, not just 1
            • so, in the spreadsheet, you could put "1 icinga alert saying site is unreachable" as an incident trigger; you could also put a trend or pattern, "see disk space alerts for multiple machines in the same cluster repeating over 10-minute period"
    • David: let's keep the incident severity levels the same between Ops and Security
    • Why are we doing this - i.e., what is the deliverable for Dublin?
      • a model on how to recognize and classify problems.
      • enough data to have validation, thought experiments of a model.
      • outline and prototype on how to proceed
    • Note that this implies lots of other things: response SOPs, lists, etc, all of the other solution ideas.
      • "how do we declare and communicate status" is a separable bit
      • "what is the duty rotation?" is a separable bit
  • Riccardo will finish up with no gerrit puppet deploys
    • Why are we doing this?  i.e., what is the deliverable for Dublin?
      • is this a finished deliverable for Dublin, "hey, we've filled in this gap, a small but very intense failure case is now solved", or is it "we've done two, this is proof of concept, let's do 10 more"?
  • Effie and Manuel will come up with recommendations about incident reports
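
The trend-style incident trigger mentioned above ("disk space alerts for multiple machines in the same cluster repeating over 10-minute period") can be sketched as a sliding-window check. This is an illustration only: the `(timestamp, cluster, host)` tuple shape, the `min_hosts=3` threshold, and the function name are assumptions, not something the working group decided.

```python
from datetime import datetime, timedelta


def is_cluster_trend(alerts, cluster, min_hosts=3, window=timedelta(minutes=10)):
    """Return True if at least min_hosts distinct hosts in `cluster` alerted
    within any single time window of the given length.

    `alerts` is an iterable of (timestamp, cluster, host) tuples.
    """
    events = sorted((ts, host) for ts, c, host in alerts if c == cluster)
    start = 0
    for end in range(len(events)):
        # Shrink the window from the left until it spans at most `window`.
        while events[end][0] - events[start][0] > window:
            start += 1
        hosts = {host for _, host in events[start:end + 1]}
        if len(hosts) >= min_hosts:
            return True
    return False
```

A single "site is unreachable" alert would be a trigger on its own. A pattern like this one only becomes a trigger once enough distinct hosts alert close together in time.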

2019-04-24 Meeting Notes

Agenda

  • review Proposal doc
    • roadmap
    • levels of completion
    • Figure out what we want to happen at Dublin
  • Look at next items for SRE preview

Review the Proposal 0th draft

  • Is this ready to show SREs to determine if we are on track for Dublin?
    • what do we want to happen at Dublin?
      • decision-making:  agree on framework for thinking about process changes, list and priorities of problems, list and priorities of process changes
      • decision-making.  yes/no on recommendations and finished products.
      • problem-solving: do group work on Options items.

TODO: Joel: make a version of the roadmap that is just a prioritized backlog, not a whole table, for better readability on 1st page.  See also Arzhel's version/vision.

  • We should agree on incident levels ahead of time.
    • that WG agrees internally, or that all SREs already agree with?
      • WG, at least.
  • What should we preview to SREs now?
    • our process & deliverables (framework, list of problems, list of process changes, priorities)
    • specific process changes that we want to get done or close-to before Dublin
      • definitions
      • incident levels (partial)
      • incident commander SOP (get more options) (too risky?)
    • what is ready?
      • the whole document? no. is not ready for next Monday, maybe the week after
        • e.g., 'tooling' is confusing, needs at least 1 sentence.  do this for all.
      • list of process changes?
        • same issue for something like Tooling,
      • list of problems?
        • don't show until we have a list of mitigating process changes?
      • show a specific process change example
        • definition proposal
        • incident commander proposal
  • Still confusing to think about: is this a policy proposal that is complete and ready, or a proposal to go and do a bunch of work?
  • Plan to present to SREs monday?  TOO SOON - aim for May 6 for initial presentation.

Next steps

  • Joel to book mtg with mark and faidon and WG to preview whatever we will show to SREs
  • on mailing list, Joel to start thread: clarify terminology for levels of proposal completeness
  • All, see roadmap table for assignments: add some detail to proposals
    • fill out at least a sentence or paragraph for the proposal sections, so it's more clear what it is intended to mean
    • where practical, do more, like a full template (paragraph intro, paragraph 'what is it', complete the grid, link to more detail)
    • Email to the list when section is ready for review.

2019-05-08

  • Feedback from Mark & from Faidon from late April
    • Confirms this is generally the right direction
    • Current draft of proposal is too complicated at the top to share
    • Get feedback from SREs ASAP or wait until Dublin.
    • Share the document in detail a week before Dublin so people have time to read it  
  • How are we communicating status of this project?
    • List of Proposals & their state/purpose
      • switch to arzhel's table (done)
      • Potential to be confusing around what we are proposing as complete, proposing as a proposal to do more work, etc. And 'done' in the sense that the working group is done, vs 'done' that the proposal is ready, vs 'done' that the proposal is adopted
        • The breakout of output/complete/maturity/work remaining is too complicated to include in the main part. Move it to an appendix or remove it
    • Simplify the proposals in the master doc
      • Each Proposal should have:
        • One short paragraph
        • Definition Standards
        • Status Grid

Next Steps

  • May 31 target to share doc with all SREs
  • longer, more frequent meetings in May (Joel to book)
    • how?  do we need another status half-hour
    • move this meeting 30 min earlier?
    • book some "sprint" meetings where everyone is optional?
      • not yet, maybe later.
  • fill out the proposal status table offline
  • start breaking out proposals into subdocs

2019-05-22

What are we re-arranging?

  • summary in doc, details in appendices
  • per-proposal summary: 2 paragraphs and 1 table max per proposal
  • share the whole doc before dublin but only expect people to read summaries.  Explain this clearly

Disable comments before dublin?

  • have to allow comments because people will split conversations, comments will be messy
  • we need to provide a way for introverts or people that do not feel confident to speak in front of 30 people to speak up

What exactly are we asking in dublin?

  • different for each proposal.  some we ask for final decision.  others, for agreement in principle and support for further work.
  • should have manager endorsement for some things (Incident Coordinator SOP)

Next Steps

  • Go through the doc and make additions/reductions according to what we have discussed

Notes for May 29th

This Friday is the deadline to share with SRE. What should we share?

  • send whole doc on , details in appendix  <--- this one
  • send summary only, no details

What document cleanup is required before we send it?

  • Paragraph only for each proposal; all details to appendices
  • Resolve comments in summary section
  • Redo the proposal table (re-order into logical order)
  • Finalize order of everything
    • make section (and thus ToC) match the proposal table
    • make details appendix match order
  • move working notes and other rough material from dublin deliverables appendices to another document before distribution
  • disable comments?  
    • No.  Not worth document split; would cause confusion

Next Steps

  • (Joel/Chris to share) document cleanup
  • Chris/Giuseppe to follow up on concerns
  • Joel to send out doc Friday (if confirmed)
  • Effie to move working notes to separate doc
  • Joel to move next week's meeting 30 min earlier