SRE/Working Groups/ONFIRE

From Wikitech

ONFIRE Working Group Charter

Overview

The ONFIRE working group (ONgoing reFormulation of Incident Response Efforts) was established to improve incident response practices within the Site Reliability Engineering (SRE) team.

The group exists to foster an environment of continuous learning and improvement. We act as advocates, mentors, leaders, and engineers who design, implement, and promote the best incident response practices for the organization.

Objective

Our primary aim is to improve incident response in three key areas: the skills and capabilities of our team members, our response protocols, and our tooling.

The incident response process covers the entire life span of an incident, from initiating the response through the post-mortem review.

PPT Framework Application

The PPT (People, Process, Tooling) framework structures our initiatives:

  • People: enhancing the incident response skills and knowledge of SRE members.
  • Process: improving incident response procedures and protocols.
  • Tooling: identifying and implementing technologies and tools that aid incident response.

Roles and Responsibilities

All members of ONFIRE are expected to contribute actively and to champion the adoption and refinement of best practices. A Chair and, optionally, a Vice-Chair are assigned to lead the group and to organize and manage its work.

As a member, you are expected to:

  • Lead incident rituals
  • Communicate important ONFIRE notices to your team
  • Capture and share feedback about incident response with the ONFIRE team
  • Act as an advocate within the SRE organization
  • Devote time to ONFIRE projects
  • Set an example of good incident response practices

Meeting Structure and Frequency

The working group meets every other week to review progress, discuss challenges, and plan upcoming activities. Emergency meetings may be called when necessary to address critical incident response issues. Planning occurs on a yearly and quarterly basis.

Decision-Making Process

Decisions are made by consensus wherever possible. When consensus cannot be reached, the Chair makes the final decision after considering input from all members.

Deliverables and Timelines

The key deliverables will include updated incident response procedures, training programs, and recommendations for tooling enhancements. Timelines for these deliverables will be established in the working group meetings.

Dependencies and Assumptions

This charter assumes that the SRE organization provides full support and necessary resources to the ONFIRE working group. Any potential dependencies, such as reliance on external departments for tooling decisions, will be identified and managed actively.

Review and Revision Processes

The working group's outputs will be reviewed quarterly by stakeholders and revised based on feedback. The charter will also be reviewed annually and updated to remain relevant and practical.

The ONFIRE team, armed with passion, knowledge, and the determination to improve, is poised to significantly enhance the SRE organization's incident response practices. We invite all stakeholders to join us in this vital endeavor.

Time Commitment

Each core team member is required to dedicate 10% of their time, equivalent to one half-day per week, unless they have pressing incident response commitments. Contributors are not subject to time commitment requirements. Members are asked to participate for at least one calendar year from their joining date to ensure continuity.

  • 2024: the team is experimenting with different engagement models, including a one-week trial of a “themed sprint” during Q3 2024.

Accountability

Accountability is a critical aspect of the ONFIRE working group.

  • Each member is responsible for fulfilling their roles and responsibilities, contributing actively to the group's initiatives, and championing the adoption of best practices.
  • The Chair and Vice-Chair are accountable for managing and organizing the group's activities, ensuring that meetings are held regularly, and ensuring all members are informed about important notices and updates.
  • Updates on the group's progress will be shared with and presented to the larger SRE group to ensure transparency, collect feedback, and encourage collaboration.

All members must take their accountability seriously to ensure the group's success.

Membership

Each SRE team is represented by one engineer on the core team. Additional members, from inside or outside SRE, can join based on interest and availability, with their manager's approval.

Current (2023/2024)

Core Team

  • Chair: Leo Mata
  • Members:
    • Filippo Giunchedi
    • Hugh Nowlan
    • Jesse Hathaway
    • Brett Cornwall
    • Eric Evans

Contributors

  • Chris Danis

Past Members

2022/2023

  • <stub>

2021/2022

  • <stub>

2020/2021

  • <stub>

2019/2020

  • <stub>