Incident response/Process improvement/Meetings/2019

2019-01

Scope of Work

What pain points can be fixed?
Who should react during weekends or public holidays?
How to react when something alerts and the "experts" aren't around? ie: Networking? Databases?
How to assess if immediate action is needed?
Initially: Cookbooks?
Who's responsible for triaging alerts that don't have a clear owner? ie: MediaWiki alerts
What self-evident improvements should be made?
What systems (people processes, computer systems, automation) may reasonably be changed in this time frame at acceptable risk?

Alerts and Incidents

An alert is anything sent out on icinga (or other similar monitoring system).
It is not the same if it is sent via IRC or via a page.
An incident is a service problem with some set procedures for resolving, communicating, and following up.
We should probably split between service problem and user facing problem, as it is not always the same (or at least to the same degree - ie: gerrit)
An alert may be the first warning of what turns out to be an incident, but incidents are identified/declared in the middle of reacting, not at the first moment.
Dashboards
Is it possible to have a short list or a single dashboard indicating that "all good" Systems

Definition of Done

When is this project done?
What would constitute success?
What qualitative and quantitative metrics are worth tracking?

How to work

Should the working group meet synchronously on a regular basis?
How will work be identified, documented, tracked, and completed?
Should we be completing a comprehensive proposal, or trying to roll out changes incrementally?

Timing

Proposal by June 2019?
- If so, intermediate steps? OR, incremental changes or proposals up through June 2019? If so, when?

Decision-making

If the working group proposes a change prior to June, who decides what changes we should make, and if/when we should roll back?
Who will decide which proposed changes changes in June will be tried?
- Options: SRE directors. The Working Group? All SREs? All affected parties? Decisions by pure consensus? Rough consensus? Manager decision with consultation?

Stakeholders

Whose problems are we trying to improve outside of the SRE team? Who could suggest other pain points? Who will be affected directly by changes, and who will need to approve any changes?

2019-03-13

What is our output scope?

OPTIONS:

a document for Dublin
doc plus pilot, low-hanging fruit, and/or trials
- What would a pilot be? We are a small group, too small to pilot changes for a big group.
- Fix some low-hanging fruit. Make a small change like making a rule (for all SRE) that if you add a new alert then you must write the doc [...]
- Maybe get feedback on ideas
- Small experiments
- Skepticism that there is any low-hanging fruit in Incident Response
  - Counter-example: the notes_url experiment (link to experiment details?) - https://phabricator.wikimedia.org/T197873
doc + anything else

Consensus: We should produce a proposal doc and also feel free test small or low-risk changes that either solve a part of the problem or validate something in the proposal.

Problem Statement

(We think that these two high-level problems cover all problems in scope. Other notes on this talk page include other details that could be grouped in one of these two mega-problems. All related issues are in scope to start with, but our proposal may not ultimately address them all.)

alarm is inaccurate and imprecise. some examples:
1. Bad alarming; we are alarming on the wrong things and not the right things
2. Icinga doesn't group alarms, so lots of repetition
3. Alarm priority is not clear - trivial and bad alarms both "sound" the same.
4. We do not have a clear definition of "Is service X up?"
5. superfluous secondary alarms
  1. e.g. a switch failure will cause dozens/hundreds of alerts that don't point to the actual problem
6. ownership unclear for services and incidents
systems and practices for coordinating response are inadequate
1. responding to simple alarms has many problems
  1. current system spreads the personal load unevenly,
  2. is hard for newcomers to onboard to
  3. often fails to answer the question "should I be doing something about this right now?" for most participants. … Who owns this? Who should I be pinging?
  4. not standardized
2. Major disaster response is not well-defined, documented, or practiced
  1. including escalation and acknowledgment paths

Notes on the work

possible next steps and constraints

Work on the definitions
Solve 1 before 2
1. problem 1 is cheaper to fix and may redefine or shrink problem 2.
2. does this mean we should fully finish a round of fixing 1 before even defining 2 in more detail? TBD.
Separate technical vs cultural issues
tie to existing tasks & prioritize

Conceptual model of the project work

high-level problem definition
detailed problem definition; use cases
solution design
Some kind of approval (anything from, working group self-approves a trial to whole SRE teams approves a broad proposal)
Implement (test, pilot, or complete rollout) one or more solutions to one or more problems
Evaluate if the solution(s) solved the problem(s)

How we should work on this project

Some options:

Do all the steps waterfall-style. Have a complete problem definition before considering solutions. design all solutions before implementing anything.
1. This was implied by the Jan 2019 offsite discussion, but not considered directly.
Waterfall problem definition; incremental fixes. Complete steps 1–2, and then design and roll out fixes incrementally
agile spiral: pick a high-level problem, pick a detailed problem within it, design a fix, implement it. repeat.
Undirected exploration: work on one or more levels at once and explore the problem and solution space simultanously. E.g., Try to fix something smallish that seems broken, then work backwards to clearly define what "broken" meant, what "fixed" should mean, and revisit the proposed fix, and possibly learn something about adjacent problems in the process.

Since the Working Group is not expecting to comprehensively fix the problem by June, the "final" output in many cases may be a proposal, rather than an actual change to tools and process and culture. This complicates all of these working models, so it will be very important to remain clear on where the output is an actual change in real life vs where the whole cycle is on paper only.

Possible next actions

Create a standard vocabulary
- make a list of all words and phrases that are in use or may be considered
- document status quo definitions
- propose new definitions
Improve program/service ownership records
- This? https://phabricator.wikimedia.org/T216088
- Do a deep dive on several key programs and improve/complete/elaborate documentation on who is responsible
  - The most paged services are DB and Cloud, so those may be good candidates for deep dive
    - I don't think that is super accurate. Taking a look at the last pages over the last year, whilst looks like Cloud does page the most we also have lots of LBs, wdqs. Should we try to identify first alert -> owner kind of thing first maybe? Starting to identify owners based on the already triggered alerts could be a good way to start narrowing things down Marostegui (talk) 07:10, 14 March 2019 (UTC)[reply]
- AND/OR Do a broad and shallow survey to refresh and fill out the list of everything
  - https://www.mediawiki.org/wiki/Developers/Maintainers
  - There have been previous efforts to do this; check with Greg G to get lessons learned before proceeding.

look at existing bugs/alerts
- And do what?
Further develop the high- and medium-level problem definition
- review all relevant text on this page and try to consolidate into the 2 problem areas, or make new problem areas.

Next Steps

joel to book a next meeting; and look for a weekly slot
joel to send out notes for
everybody else to pick one of the "possible next actions" top bullets (or add a new one) and write up a work plan and possibly start working on it.

2019-03-22

Scalability Issues

blocks everybody's work: even if not working on it, paying attention to it and being ready to respond
- everyone looking at logs, suspecting security incidents in every area
Allocation of work to different people can be based on people deciding what to do for themselves, with minimal coordination
Don't consistently have a "stand down" statement, and way to distribute/check it, allowing people to lower their vigilance
We need coordination, if not direction
Same people involved every single day
- no "fresh air"
- Those people didn't get to take a break
- Other people not sure how to help

more proposed improvements

have an incident response coordinator
- quickly decide
  - who
  - How to get 24-hr coverage
  - how does everybody know we are in an incident and who the coordinator is and what tasks are pending and how non-articipants are affected?
Runbooks for workarounds for major tool outages
- X is down; how do I keep doing my routine job?
- X is down, how do I get (unrelated to the incident) emergency work done?
- X is down; how do I do incident response deployment?
- How to firewall a service + how staff can workaround the firewall
- X: Gerrit, Phab, (Broken deploy train doesn't affect SRE)
  - which break pushing changes to puppet and DNS
  - There was a list of [] and a long-term action item list and people were grabbing from it
  - Faidon was doing coordination when he was online
- External communication: announce status; respond to inquiries; provide updates

Possible working group outputs

work on "proposal drafting and work breakdown" for specific issues
- break down to whatever level of detail is reasonable/doable
- ask experts
- How should we go about making proposals for people to do preparatory work for theoretical incident response work?
  - model from previous incident responses
Prep a decision: identify the decision, identify the options, pros and cons,
- either for the P&IR group to decide, or to send to all SRE to decide

Possible work units

catalog the last 6 months of incidents/alerts and look for commonalities/patterns
- https://wikitech.wikimedia.org/wiki/Incident_documentation
- for incidents
- for alerts
- find examples of "incident response": pages that led to some work but weren't labeled as incidents. did require waking people up
- for each workflow/intensity, define how we connect (Riccardo's example from $JOB-N)
- review IRC logs for an incident (or alert w/o incident?) and document through the workflow
  - Output: pick ~5 and try to group into patterns/levels of intensity. How many people got involved, how long did it take, how did it escalate?
Tool-outage workaround runbooks (e.g. How do we work if Phab is down?)
- Deploy DNS if gerrit is down
- Deploy a puppet change if gerrit is down (... later)
- see row 24 et al above
Define what is a page, what is an incident
- do definitions now
- flesh out/validate by looking at history (LATER)
Define rough levels of escalation
- exampes/state machine
  - normal, no incident identified
  - potential incident being explored
  - potential major incident being explored
  - incident, no coordinator
  - incident, coordinator
  - incident, coordinator, all hands on deck
- including escalation and acknowledgment paths
create index system to help people find incident response material
- (past) prefixing system
- wikitech namespace
- wikitech categories and labels
- Proposed overall Vocabulary
  - maybe start an etherpad NOW?
- browse model?
- push model (in training, onboarding,
Notes url for each icinga alert; non-existing ones get Phab links or [non-empty] wikitech pages
Use cases
- got an icinga alert but need to find and look at a different runbook
- starting from scratch with [some mystery cue or prompt], need to figure out wtf

These work units are probably better done after everything above:

Define a policy regarding phone availability and escalation
Incident response coordinator SOP
- how do i decide which 20 things to try, in which order, and when to stop trying something
- how do we decide who is working on which thing?
- standard response vs freeform troubleshooting or debugging
typical workflow (meta-runbook) in responding to a page (from acknowledging the page on IRC to finishing an incident post-mortem), including coordination, helping
- what different workflows or intensity levels are common? NOW
Figure out alarm grouping solution
- Icinga doesn't group alarms, so lots of repetition
- Example of something that we (P&IR WG) would probably produce a decision matrix, not a proposal
Figure out alarm priority protocol
- Alarm priority is not clear - trivial and bad alarms both "sound" the same.
- something P&IR WG could not do itself
List of services
- (See "Improve program/service ownership records" in 2019-03-13 notes)
Create clear definition of "Is service X up?"
- depends on list of services, and their boundaries
reduce superfluous secondary alarms
- identify common ones

Next Steps

Do initial work on these work units; present material at next meeting to confirm we are all going in the same, and productive, direction.

Chris + Riccardo
- Define what is a page, what is an incident
- Define rough levels of escalation
Effie (Manuel might be able to help)
- catalog the last 6 months of incidents/alerts and look for commonalities/patterns
Riccardo
- Tool-outage workaround runbooks (e.g. How do we work if Phab is down?)

2019-03-27

Agenda

Check in on results of research threads
1. Is each going in the right direction, or defining a new right direction?
Make decisions if needed
How/when do we check in with SREs overall?
Plan next actions

Is it time to try and summarize our status for all SREs, so that they can give us feedback?
- not yet; too many cooks
- maybe in 2 weeks
What kinds of next steps should we consider? If we aren't sure what to do next in a work thread, consider the following steps:
- start writing a "straw dog"—a proposal for reaction. MVP is to be "usefully wrong"
- enumerate all related open questions and decisions
- identify choices for open questions/decisions
- flesh out options; identify pros/cons; create prototypes, mockups, proposals

Incident History Review

catalog the last 4 months of incidents/alerts and look for commonalities/patterns
Manuel & Effie
Output: https://etherpad.wikimedia.org/p/incidents
Reviewed ~8 incident reports
Analysis
- There are things missing and things that could be better
- Template for an incident is good
- People under-provide information, or provide too much information
- Could use, e.g., the person on (Clinic) duty to curate the incident response
  - Identify correct level of detail
  - identify things missing and ping people to fill them in
  - Not a duty to re-write it, but to [organize people to complete it]
    - e.g. author didn't provide graphs.
- Quality is variable
  - "fire and forget"—we finish the incident and don't do followup/cleanup
  - Don't always document: what commands were used; will it happen again; …
- an incident report should be considered closed only if all the action items have been tracked in phabricator/documentation or fixed
- Not clear why some things get an incident report and others don't
  - only heuristic I can see is, do Mark and Faidon say there should be an incident report
Suggested Work
- improve the template & force requirement of use
- define how an incident is 'declared/identified'
- set norm/SOP/prioritization around finishing/following up and mechanism to verify
  - [Maybe followup in Monday meeting?]
  - [Incident Review Board]
  - Tooling
    - "OMG" oversight management tool: collects data for incident in progress
    - other tooling to assist documenting steps taken during firefighting / action items needed afterwards? IRC bot like meetingbot?
    - (Custom Phab template)
      - example template for decom: https://phabricator.wikimedia.org/maniphest/task/edit/form/52/

Next Action

This is the right kind of information
This is enough background research, no need for further past incident report reviews, there is enough info to make/take actionables.
Needs more discussion?
- No, we can work on straw dog.
TODO: start writing proposal/straw dog
- scope: better incident organizing and reporting
- Contents
  - proposed process changes
  - proposed cultural changes
  - Decisions we would need to make
  - prototypes and examples of artifacts, specific enough to elicit meaningful responses

Definitions of alert/page/incident

Define what is a page, what is an incident
Define rough levels of escalation
Chris + Riccardo
Output: https://etherpad.wikimedia.org/p/igPdLbfJVjI9LgX8ayXI
Summary
- anyone in SRE can declare that an incident is taking place (be bold). incident is finished when clear consensus within incident response team is reached.
Responses
- Escalations, response to pages, expected behavior, all are ill-defined
- Since no one has come up with a solid definition, my only objection is not to spend hours on trying to define what is an incident it is not easy to quantify
- my definition is "you know one when you see it" ;)
- Need to address epistemological issues
  - how do I know we are "in an incident"?
  - How do I know whether I know enough to decide if we are in an incident?
  - If someone decides retroactively that an incident has already started, what does anyone need to do with that new information?
  - How do I know what I don't know (about other services and impact)?
  - How do I know when I don't know enough and need to escalate?
  - How do we know when there is a clear consensus to close an incident?
- How do we notify everybody that there's an ongoing incident? Implementation detail, but important one
- Is it helpful to think of definitions in terms of "what actions do I take next" state machine?
  - E.g., In state "alerted", I don't know whether or not there is an incident, but I have some reason to apply extra scrutiny to available information, and maybe a clue of where to look further.

Next action

Continue with more research
- in what direction? Ran out of time to discuss.

deploy a DNS change without Gerrit

Tool-outage workaround runbooks (e.g. How do we work if Phab is down?)
Riccardo
added paragraph to current documentation (thanks Jaime): https://wikitech.wikimedia.org/wiki/DNS#Update_DNS_if_gerrit_or_DNS_are_down_(on_an_emergency_only)
Need a "straw dog" proposal/design
- Spoke to Brandon, created task for next steps: https://phabricator.wikimedia.org/T219400

Next action

Move on to the puppet runbook proposal
- Enlist joe if possible

Icinga Dasboard research

Prototyping
[Arzhel] I added parsing to Icinga notifications syslog in Logstash, and did this quick dashboard to have more visibility on alerts:
- https://logstash.wikimedia.org/app/dashboards#/view/AWm67Kpk8aQffZ3HmRpW
  - You can deactivate the "icinga_contact.raw:irc" filter to see all the alerts (not only the ones going to IRC).
  - Will be nice to have a monthly or whatever report being sent to the list to get people attention

Next action

Ran out of time to discuss

Overall Next Steps

(Riccardo, Effie, Manuel, Chris, _joe_) some more research/work for the three or four current research areas
(Joel) move notes to wikitech; summary email describing next steps
(Joel) start a google doc to collect proposal content
(Joel) collate all of the ideas in the notes (this wiki page and talk page) into one or more lists
(Joel) try to fit in another meeting this week?
(Joel) remember to introduce David next meeting

2019-04-04

Proposal for better incident report SOP

Manuel, Effie:
previous TODO: start writing proposal/straw dog
- scope: better incident organizing and reporting
- Contents
  - proposed process changes
  - proposed cultural changes
  - Decisions we would need to make
  - prototypes and examples of artifacts, specific enough to elicit meaningful responses
Proposals we have discussed:
- who is Incident Commander? ( / Coordinator)
  - either clinic duty person is the incident commander
    - Con: overloads that position
  - anyone within the team can do that
  - One of the SMEs for the affected system
  - Incident Commander (according to Pagerduty e.g.) doesn't fix anything: only coordinates information
    - Marshalls the SMEs rather than being the SME
    - coordinates the person fixing the problem, rather than fixing the problem
    - works from a runbook, doesn't make many or any decisions
  - Change the template
  - …(see past meeting notes for ideas)
- Incident Review Board
  - Group that regularly follows up on last week of reports
  - who?
    - fixed group
    - rotating, longer than clinic duty but shorter than a quarter
    - next N IRs
    - Case by case during Monday morning review
  - how?
    - group communicating by mailing list
  - What?
    - evaluate the incident report
  - ask for more details if needed
    - ensure uniformity of writeups (including that there is a writeup)
    - suggest followup actions (for specific people, for the incident responders, or for the backlog)
- Tooling for followup
  - "OMG" oversight management tool: collects data for incident in progress
  - other tooling to assist documenting steps taken during firefighting / action items needed afterwards? IRC bot like meetingbot?
  - (Custom Phab template)
  - example template for decom: https://phabricator.wikimedia.org/maniphest/task/edit/form/52/
TODO: Joel to write this up into a Proposal
- Very rough and abbreviated outline of a proposal for Incident Commander, IRB, Tooling
- prototype of what we want to present to SREs in Dublin
  - Identify Problem
  - Describe potential solution
  - describe alternatives & decisions
    - pros and cons
    - recommendation

Definitions of alert/page/incident

Not clear what to do next
- strawdog definitions & brief 'life of an incident' at https://etherpad.wikimedia.org/p/igPdLbfJVjI9LgX8ayXI
- proposal: Any page is an incident until declared otherwise
- adapt PagerDuty's SOP?
- Severity Scale
- including how many people, what roles are needed
TODO: Giuseppe + Chris/, first draft by next meeting

Continuity Runbooks

Runbooks for keeping things working during an incident-related critical service outage
- deploy a DNS change without Gerrit - DONE
- deploy a puppet change without? Gerrit
  - Lay out what work is needed
TODO: Riccardo

2019-04-10

Actionables

Come up with better names/scheme on alert levels
- Select a couple of services a try to see how they fit with this scheme
Take a look at joel's doc and make amendments/comments
Riccardo will finish up with no gerrit puppet deploys
Effie and Manuel will come up with recomendations about incident reports
Giuseppe is working on Incident severity levels
- Doc: https://docs.google.com/spreadsheets/d/1IijvozfBjDF9WDIfnjQtNx5ZZoX5TgcHBHNcE4JZrvs/edit?ts=5cab3287#gid=0

Topics

Incident Severity Levels for SRE:

manuel: colours and numbers are fine, those maybe not make much sense if there is no oncall rotation
riccardo: maybe it is not doable without rotation
joe: maybe we could distinguish if there are things needing a rotation
arzhel: have two separate "proposals" one with what we can achieve with oncall rotation and one without oncall
chris: this look ok
effie: I disagree with the colours
joe: they shoud be different
we need to find something to visualise the severity level
Riccardo: what about incidents trhat involve more teams or are not SRE specific
joe: when we define something internally, we could communicate it to other teams
Manuel: so do we need to get the alert? fix it? page others?
david: We will need to have different SLAs on the security side. For example, there isn't a lot of difference between the SLAs for Black through Yellow.
joe: if we have eg a red alert, this could be escallated to rest of sre. The responsibility would be to respond
effie: what if we had a rotation where one would just keep an eye on things, and page people if needed
joe: that is up to the teams to decide
manuel: we should discuss a bit more the codes
chris: we could add more examples about each one
manuel: we could gi back ti post morterms and try to add level
arzhel: we can have cases where something starts with one level and escallates to a higher one
joe: alerts could bean different things to service/server owners
manuel/volans: what about service-based alers
joe: maybe we should try to keep it simple
david: better use 0-4 or 1-5 levels, 0 being UBN
joe: at some point we will need service level alerts

Incident review board:

Effie: are we going to be able to find volunteers to the board/workgroup
Manuel: If our templates are good enought and if we improve them, the amount of work will be little
joe: maube a fix group wont work, unless we could have the ops duty to make sure that tasks are being taken care of. Volunteering might not be optional here, maybe everyone should be in that rotation
Manuel: I think having a fix group means that we'll mostly have the same line of work and criterium
Arzhel: decided who will do it is a detail, we can have for now guidelines

Icinga Dasboard

More research
- screenshots, other ways

2019-04-17

What we are working on?

grid

The problem/solution/priority grid.

proposal doc

Model/template for our output to Dublin

specific items

Come up with better names/scheme on alert levels
- PAST TODO: Select a couple of services and try to see how they fit with this scheme
- Do we have a working doc for this?
  - some lines in the meeting notes
  - some overlap with Incident Severity: alert severity will feed into incident severity
  - Joe's proposal: take some sample systems and try to map their existing alerts into the spreadsheet
    - Work breakdown:
      - Pick a system
        TODO: Riccardo: cumin, debmonitor, puppermaster
        
        TODO: Arzhel: network + traffic (best effort)
        
        TODO: CDanis: prometheus
        
        TODO: manuel: databases
      - add a new column for each system to Incident Severity Levels spreadsheet
        In the column, try to list all of the levels of alert for the system and figure out which rows they go into
        
        Chris: think about system-level failure scenarios. e.g. "app servers are crashlooping" would correlate with 100s of alerts, not just 1
        
        so, in the spreadsheet, you could put "1 icinga alert saying site is unreachable" as an incident trigger; you could also put a trend or pattern, "see disk space alerts for multiple machines in the same cluster repeating over 10-minute period"
- David: let's keep the incident severity levels the same between Ops and Security
- Why are we doing this - i.e., what is the deliverable for Dublin?
  - a model on how to recognize and classify problems.
  - enough data to have validation, thought experiments of a model.
  - outline and prototype on how to proceed
- Note that this implies lots of other things: response SOPs, lists, etc, all of the other solution ideas.
  - "how do we declare and communicate status" is a seperable bit
  - "what is the duty rotation?" is a seperable bit
Riccardo will finish up with no gerrit puppet deploys
- Why are we doing this? i.e., what is the deliverable for Dublin?
  - is this a finished deliverable for Dublin, "hey, we've filled in this gap, a small but very intense failure case is now solved", or is it "we've done two, this is proof of concept, let's do 10 more"?
Effie and Manuel will come up with recomendations about incident reports
- Incidents Etherpad

Take a look at joel's doc and make amendments/comments
- Process Change Proposals 0th draft

2019-04-24

Proposal Working Doc

Agenda

review Proposal doc
- roadmap
- levels of completion
- Figure out what we want to happen at Dublin
Look at next items for SRE preview

Review the Proposal 0th draft

Is this ready to show SREs to determine if we are on track for Dublin?
- what do we want to happen at Dublin?
  - decision-making: agree on framework for thinking about process changes, list and priorities of problems, list and priorities of process changes
  - decision-making. yes/no on recommendations and finished products.
  - problem-solving: do group work on Options items.

TODO: Joel: make a version of the roadmap that is just a prioritized backlog, not a whole table, for better readability on 1st page. See also Arzhel's version/vision.

We should agree on incident levels ahead of time.
- that WG agrees internally, or that all SREs already agree with?
  - WG, at least.
What should we preview to SREs now?
- our process & deliverables (framework, list of problems, list of process changes, priorities)
- specific process changes that we want to get done or close-to before Dublin
  - definitions
  - incident levels (partial)
  - incident commander SOP (get more options) (too risky?)
- what is ready?
  - the whole document? no. is not ready for next Monday, maybe the week after
    - e.g., 'tooling' is confusing, needs at least 1 sentence. do this for all.
  - list of process changes?
    - same issue for something like Tooling,
  - list of problems?
    - don't show until we have a list of mitigating process changes?
  - show a specific process change example
    - definition proposal
    - incident commander proposal

Still confusing to think about, is this is a policy proposal that is complete and ready vs this is a proposal to go and do a bunch of work.
Plan to present to SREs monday? TOO SOON - aim for May 6 for initial presentation.

Next steps

Joel to book mtg with mark and faidon and WG to preview whatever we will show to SREs
on mailing list, Joel to start thread: clarify terminology for levels of proposal completeness

All, see roadmap table for assignments: add some detail to proposals
- fill out at least a sentence or paragraph for the proposal sections, so it's more clear what it is intended to mean
- where practical, do more, like a full template (paragraph intro, paragraph 'what is it', complete the grid, link to more detail)
- Email to the list when section is ready for review.

2019-05-08

Feedback from Mark & from Faidon from late April
- Confirms this is generally the right direction
- Current draft of proposal is too complicated at the top to share
- Get feedback from SREs ASAP or wait until Dublin.
- Share the document in detail a week before Dublin so people have time to read it
How are we communicating status of this project?
- List of Proposals & their state/purpose
  - switch to arzhel's table (done)
  - Potential to be confusing around what we are proposing as complete, proposing as a proposal to do more work, etc. And 'done' in the sense that the working group is done, vs 'done' that the proposal is ready, vs 'done' that the proposal is adopted
    - The breakout of output/complete/maturity/work remaining is too complicated to include in the main part. Move it to an appendix or remove it
- Simplify the proposals in the master doc
  - Each Proposal should have:
    - One short paragraph
    - Definition Standards
    - Status Grid

Next Steps

May 31 target to share doc with all SREs
longer, more frequent meetings in May Joel to Book
- how? do we need another status half-hour
- move this meeting 30 min earlier?
- book some "sprint" meetings where everyone is optional?
  - not yet, maybe later.
fill out the proposal status table offline
start breaking out proposals into subdocs

2019-05-22

What are we re-arranging?

summary in doc, details in appendices
per-proposal summary: 2 paragraphs and 1 table max per proposal
share the whole doc before dublin but only expect people to read summaries. Explain this clearly

Disable comments before dublin?

have to allow comments because people will split conversations, comments will be messy
we need to provide a way for intronverts or people that do not feel confident to speak in front of 30 people to speak up

What exactly are we asking in dublin?

different for each proposal. some we ask for final decision. others, for agreement in principle and support for further work.
should have manager endorsement for some things (Incident Coordinator SOP)

Next Steps

Go through the doc and make additions/reductions according to what we have discussed

2019-05-29

This Friday is the deadline to share with SRE. What should we share?

send whole doc on , details in appendix <--- this one

send summary only, no details

What document cleanup is required before we send it?

Paragraph only for each proposal; all details to appendices
Resolve comments in summary section
Redo the proposal table (re-order into logical order)
Finalize order of everything
- make section (and thus ToC) match the proposal table
- make details appendix match order
move working notes and other rough material from dublin deliverables appendices to another document before distribution

disable comments?
- No. Not worth document split; would cause confusion

Next Steps

(Joel/Chris to share) document cleanup
Chris/Giuseppe to follow up on concerns
Joel to send out doc Friday (if confirmed)
Effie to move working notes to separate doc
Joel to move next week's meeting 30 min earlier