SRE/Observability/Intake Standards

From Wikitech

Intake Basics

Our team operates on a structured intake process to ensure that all tasks are properly categorized, prioritized, and assigned. Here are our intake standards:

  • '''Request Submission''': All requests should be submitted via Phabricator using the appropriate tags. This includes requests for new features, bug reports, and general inquiries.
  • '''Prioritization''': Upon receipt, requests are prioritized based on their urgency, impact, and alignment with our team's goals. Urgent issues related to incidents, security events, and privacy concerns are given the highest priority.
  • '''Scheduling''': Prioritized tasks are scheduled for a specific quarter and tagged with a FY-Q# tag in Phabricator. This helps us manage our workload and ensure that tasks are completed in a timely manner.
  • '''Task Assignment''': Tasks are assigned to team members based on their expertise and current workload. We strive to distribute tasks evenly and ensure that each team member has the capacity to complete their assigned tasks.
  • '''Status Updates''': The status of tasks is regularly updated in Phabricator. This includes updates when tasks are started, when progress is made, and when tasks are completed.
  • '''Resolution''': Completed tasks are reviewed to ensure that they meet the required standards and that all objectives have been met. Feedback is provided and any necessary revisions are requested.

We encourage all team members and stakeholders to follow these standards to ensure efficient and effective task management.

How We Organize Our Work

Work reaches us from various sources, but requests should primarily be submitted via Phabricator using the #sre-observability tag. This tag is used for all incoming work, which is then classified into one of six states: Inbox, Backlog, Scheduled, In Progress, Radar, or Done/Closed. Scheduled is a designation of state rather than a workboard column.

Inbox: This is the default state where all new tasks land.

Backlog: Tasks that have been accepted but not yet scheduled are placed here. These are tasks we acknowledge as valuable, but their execution timeline is undefined.

Scheduled: These are prioritized tasks with a clear plan of action, assigned personnel, and a defined timeline. They are tagged with a FY-Q# milestone tag (e.g., FY2021/2022 Q1) and placed in the "Up Next" or to-do column of that quarter's workboard. Each quarterly workboard covers only its own quarter. Scheduled is not a column in Phabricator but a designation of state: the task sits on a specific quarter's workboard.
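As an illustration of the FY-Q# tagging convention, the sketch below derives a milestone label from a date. It assumes a fiscal year that starts on 1 July and a label format copied from the "FY2021/2022 Q1" example above; the function name and the constant are hypothetical and not part of any Phabricator API.

```python
from datetime import date

# Assumption: the fiscal year starts in July. Adjust if your calendar differs.
FY_START_MONTH = 7


def fy_quarter_tag(day: date) -> str:
    """Return a milestone label such as 'FY2021/2022 Q1' for the given date."""
    # Number of whole months elapsed since the start of the fiscal year (0-11).
    months_into_fy = (day.month - FY_START_MONTH) % 12
    quarter = months_into_fy // 3 + 1
    # Dates before the start month belong to the fiscal year that began last year.
    start_year = day.year if day.month >= FY_START_MONTH else day.year - 1
    return f"FY{start_year}/{start_year + 1} Q{quarter}"


if __name__ == "__main__":
    print(fy_quarter_tag(date(2021, 8, 15)))
```

Under these assumptions, a task scheduled for mid-August 2021 would land on the first-quarter milestone of fiscal year 2021/2022.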

In Progress: Tasks that are currently being worked on.

Radar: Tasks we want to monitor but are not on our workboard. These are tagged with #observability and kept for visibility purposes.

Done/Closed: Tasks that have been completed are moved to the 'Done' column under the specific milestone.

Our team reviews and grooms incoming tasks weekly, typically during planning meetings on Wednesdays at 8:00 AM Pacific. Some requests may receive an out-of-band prioritization effort.

We review the inbox for both the #observability (component) workboard and the #sre-observability (group) workboard. Tasks are then quickly prioritized as either "to be done this quarter" (if time-sensitive) or moved to the backlog if they are actionable but not urgent. Tasks that cannot be moved forward are blocked in the backlog or placed on the radar.

Tasks lacking sufficient information remain in the general backlog, unprioritized, until enough information is collected. These tasks will receive a follow-up comment requesting the necessary details.
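The triage rules above can be sketched as a small decision function. This is an illustrative model only: the `Task` fields (including `owned_by_team`, used here to decide between blocking a task in the backlog and placing it on the radar) are hypothetical and do not correspond to Phabricator fields, and the ordering of the checks is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class State(Enum):
    INBOX = "Inbox"
    BACKLOG = "Backlog"
    SCHEDULED = "Scheduled"
    IN_PROGRESS = "In Progress"
    RADAR = "Radar"
    DONE = "Done/Closed"


@dataclass
class Task:
    # Hypothetical fields, chosen only to mirror the prose above.
    title: str
    has_enough_info: bool = True
    actionable: bool = True
    time_sensitive: bool = False
    owned_by_team: bool = True


def triage(task: Task) -> Tuple[State, str]:
    """Return the state an Inbox task moves to, plus a short note."""
    if not task.has_enough_info:
        # Stays unprioritized until the requested details arrive.
        return State.BACKLOG, "unprioritized; follow-up comment requests details"
    if not task.owned_by_team:
        # Monitored for visibility, but not on the team workboard.
        return State.RADAR, "monitored only"
    if not task.actionable:
        return State.BACKLOG, "blocked"
    if task.time_sensitive:
        return State.SCHEDULED, "to be done this quarter"
    return State.BACKLOG, "actionable but not urgent"


if __name__ == "__main__":
    print(triage(Task("Alert on disk usage", time_sensitive=True)))
```

Returning a note alongside the state keeps the reason for each decision visible, which matches the practice of leaving a follow-up comment on tasks that need more information.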

Phabricator workflow

[Figure: SRE Observability Phabricator workflow. This chart explains how tickets transition from request to completed state.]

Tasks are initially submitted to the Observability intake tags, which include #observability and component tags such as #icinga, #prometheus, and #logstash. Once a task is submitted, a member of the Observability team grooms it and triages it based on its priority.

The task is then tagged with the appropriate subcomponent area, which falls under one of the following subprojects (tags):

  • Observability-Alerting
  • Observability-Metrics
  • Observability-Logging
  • Observability-Tracing (note: this is not in active use yet)

Each of these areas represents a distinct work-stream and has its own product roadmap. Tasks are prioritized within these subprojects to ensure effective management and execution.

Following this, tasks are tagged into a milestone for execution, effectively scheduling the work. Depending on the priority and timeline, tasks are either marked as 'Up Next' or scheduled for a specific quarter.

How We Plan Our Roadmap

Our roadmap planning is a rolling process that spans over a year. The goal is to have a list of pre-groomed and prioritized tasks that are reviewed and updated periodically, typically on a quarterly basis. Our team's efforts are driven by six major work categories (see team page for more detail):

  • Alerting
  • Metrics
  • Logs
  • Tracing (future)
  • Maintenance/Incidents
  • Incident Management

The aim of this process is to:

  • Drive each of these major workstreams
  • Set clear goals and deliverables
  • Quantify effort and time investment per workstream, aligning with the organization's interest in each initiative
  • Allocate adequate time to each initiative

Prioritization

Prioritization is both a scheduled and ongoing effort to size up work and assess the importance and impact of specific workstreams. We use a simple forced rank list of priorities, which are derived from the intake process and groomed by the team. These projects are then scored in a spreadsheet for an overall assessment of value and capacity.

The order of prioritization is as follows:

  • Tier1 (high): Incidents, security events, privacy concerns, PII in logs, urgent tasks
  • Tier2 (medium): Project work, outside requests, maintenance
  • Tier3 (low): Non-critical maintenance
  • Tier4 (lowest): Work to be done when all other prioritized work is completed
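The tier ordering above can be expressed as a simple forced-rank sort. The sketch below is a minimal illustration, not the team's scoring spreadsheet: the `Tier` enum, the `forced_rank` helper, and the sample task titles are all hypothetical.

```python
from enum import IntEnum
from typing import List, Tuple


class Tier(IntEnum):
    TIER1 = 1  # incidents, security events, privacy concerns, PII in logs, urgent tasks
    TIER2 = 2  # project work, outside requests, maintenance
    TIER3 = 3  # non-critical maintenance
    TIER4 = 4  # work done once all other prioritized work is completed


def forced_rank(tasks: List[Tuple[str, Tier]]) -> List[str]:
    """Order task titles by tier, highest priority first.

    sorted() is stable, so tasks in the same tier keep their groomed order.
    """
    return [title for title, _ in sorted(tasks, key=lambda item: item[1])]


if __name__ == "__main__":
    groomed = [
        ("Upgrade dashboards", Tier.TIER3),
        ("PII found in logs", Tier.TIER1),
        ("New alerting rules", Tier.TIER2),
        ("Rotate old indices", Tier.TIER3),
    ]
    for title in forced_rank(groomed):
        print(title)
```

Because the sort is stable, the order agreed during grooming is preserved within each tier, so the tier only decides placement between groups.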

Project Work (Hypothesis)

All project work should be prioritized and groomed in advance. Overarching project tasks are created in Phabricator with subtasks, both of which are tagged with a FY/Quarter "milestone" to indicate scheduling for projects that span multiple quarters or years.

Maintenance (core work)

Planned maintenance follows the same workflow as regular project work. Unplanned maintenance or requests are groomed and prioritized based on urgency and severity.

Work Cadence Summary

Intake Grooming + Prioritization: Weekly during the Observability (o11y) team meeting, and continuously by the team

Planning (rolling roadmap): Quarterly during OKR Meetings

Annual Planning: Yearly