
Alert review

From Wikitech

The Alert Review Process is a method for systematically evaluating alert data to identify patterns, recurring issues, and opportunities for improvement.

This process aims to:

  • Identify recurring patterns and issues that require targeted attention and resolution.
  • Generate actionable insights to refine and enhance our alerting framework.

These goals guide our proactive efforts to enhance system reliability and efficiency.

Analysis Framework

Alert analysis is primarily conducted through our Logstash instance, using the Alert Review visualization to examine alert data from multiple sources, including Icinga, LibreNMS, and Alert Manager.

Additionally, Google Apps Script is used to analyze specific datasets: the top root@ mail alerts and Splunk On-Call data.

The Apps Script code is available in the `alertreview` GitLab repository.

The project environment is hosted on the Alert Review GCP Project.

Doing an Alert Review

Data Collection and Analysis

Logstash and OpenSearch Dashboards Visualization

The primary tool for alert analysis is the Alert Review visualization on our Logstash instance. It supports detailed examination of alerts from various sources, including Icinga, LibreNMS, and Alert Manager.
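The kind of aggregation behind such a visualization can be sketched as an OpenSearch query body. This is a minimal illustration, not the saved visualization's actual definition: the field names (`severity`, `alertname.keyword`), the helper name `buildTopAlertsQuery`, and the default window and bucket count are all assumptions and would need to match the real index mapping.

```javascript
// Sketch only: field names and defaults are assumptions, not the real index mapping.
function buildTopAlertsQuery(options) {
  const opts = Object.assign(
    { severityField: 'severity', nameField: 'alertname.keyword', days: 60, size: 20 },
    options || {}
  );
  return {
    size: 0, // return aggregation buckets only, no individual documents
    query: {
      bool: {
        filter: [
          // keep only critical alerts (assumed field and value)
          { term: { [opts.severityField]: 'critical' } },
          // restrict to the last N days using date math
          { range: { '@timestamp': { gte: 'now-' + opts.days + 'd', lte: 'now' } } }
        ]
      }
    },
    aggs: {
      // count alerts per name and keep the most frequent ones
      top_alerts: {
        terms: { field: opts.nameField, size: opts.size }
      }
    }
  };
}

// Print the query body so it can be pasted into a query console.
console.log(JSON.stringify(buildTopAlertsQuery({ size: 10 }), null, 2));
```

Using `filter` clauses instead of `must` skips relevance scoring, which is not needed when only bucket counts matter.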

Access to top critical alert data of the past 2 months is available in:

Focused Analysis with Google Apps Script

Google Apps Script is used to process and analyze the top root@ mail alerts and Splunk On-Call data.
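As an illustration of this kind of processing, the sketch below ranks mail subjects by frequency. It is not taken from the `alertreview` codebase: the function names, the normalization rule (replacing digits so that near-identical subjects group together), and the sample subjects are hypothetical. It uses plain JavaScript, so it should run unchanged in the Apps Script V8 runtime or in Node.js.

```javascript
// Normalize a subject so messages that differ only in numbers group together.
function normalizeSubject(subject) {
  return subject.trim().toLowerCase().replace(/\d+/g, 'N').replace(/\s+/g, ' ');
}

// Return the `limit` most frequent normalized subjects with their counts.
function topSubjects(subjects, limit) {
  const counts = new Map();
  for (const s of subjects) {
    const key = normalizeSubject(s);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Array.from(counts.entries())
    .map(([subject, count]) => ({ subject, count }))
    // most frequent first; ties broken alphabetically for stable output
    .sort((a, b) => b.count - a.count || a.subject.localeCompare(b.subject))
    .slice(0, limit);
}

// Hypothetical sample data; real subjects would come from the mailbox.
const sampleSubjects = [
  'Cron <root@host1> /usr/bin/job',
  'Cron <root@host2> /usr/bin/job',
  'Disk full on db1'
];
console.log(topSubjects(sampleSubjects, 2));
```

Grouping on a normalized key keeps one noisy job that runs on many hosts from being split into many small buckets.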

Presentation

1. Create a copy of the Alert Review template.

2. Create a new Etherpad to gather feedback and ideas.

3. Prioritize targeted actions for noise reduction.

Feedback Loop: Use the feedback collected in the Etherpad to refine the alert analysis process. This iterative approach aims to enhance the precision and effectiveness of data collection, analysis, and presentation.

Developing Apps Script

  • For local development of Google Apps Script projects, install Clasp.
  • Clone the `alertreview` repository with: `git clone git@gitlab.wikimedia.org:repos/sre/alertreview.git`.
  • Use `clasp push` to deploy updates to GCP. Store sensitive information in script properties rather than in the source code, so that it is not publicly exposed.
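Reading a secret from script properties might look like the following sketch. In Apps Script, `PropertiesService.getScriptProperties()` returns a store whose `getProperty(key)` method yields the value, or `null` if the key is unset. The helper names and the key `EXAMPLE_API_TOKEN` are hypothetical, and the stub store exists only so the code can be exercised outside Apps Script.

```javascript
// Read a value from a Properties-like store, failing loudly if it is unset.
function getRequiredProperty(store, key) {
  const value = store.getProperty(key);
  if (value === null || value === undefined || value === '') {
    throw new Error('Missing script property: ' + key);
  }
  return value;
}

// Stand-in for local runs; it mimics only the getProperty method.
function makeStubStore(values) {
  return {
    getProperty: (key) =>
      Object.prototype.hasOwnProperty.call(values, key) ? values[key] : null
  };
}

// Inside Apps Script the store would instead be:
//   const store = PropertiesService.getScriptProperties();
const store = makeStubStore({ EXAMPLE_API_TOKEN: 'dummy-value' }); // hypothetical key
const token = getRequiredProperty(store, 'EXAMPLE_API_TOKEN');
console.log('token length:', token.length);
```

Failing early on a missing property tends to surface configuration mistakes at startup instead of as an obscure authentication error later.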

Key Data Sources

The following data sources are integral to the alert review process, each contributing unique insights:

  • Alert Manager: Analysis focuses on alert volumes, resolution metrics, and patterns of recurring alerts.
  • Icinga: Key metrics include service uptime, response times, and the frequency of critical alerts.
  • Splunk On-Call: Relevant data covers on-call rotation efficiency, incident categorization, and the handling of after-hours incidents.
  • Grafana, Graphite, Prometheus, and Thanos: These tools provide valuable metrics on system performance, alerting thresholds, and long-term data trends.
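To make the Alert Manager metrics above concrete, the sketch below computes per-alert volume and mean time to resolve from a list of resolved alerts. The record shape (`name`, `startsAt`, `endsAt` as ISO 8601 strings) and the function name are assumptions chosen for illustration. Real exports may be structured differently.

```javascript
// Summarize resolved alerts: how often each fired and how long it took to resolve.
// Assumed record shape: { name, startsAt, endsAt } with ISO 8601 timestamps.
function summarizeAlerts(alerts) {
  const byName = new Map();
  for (const a of alerts) {
    const entry = byName.get(a.name) || { name: a.name, count: 0, totalMinutes: 0 };
    entry.count += 1;
    // duration in minutes between firing and resolution
    entry.totalMinutes += (Date.parse(a.endsAt) - Date.parse(a.startsAt)) / 60000;
    byName.set(a.name, entry);
  }
  return Array.from(byName.values())
    .map((e) => ({
      name: e.name,
      count: e.count,
      meanMinutesToResolve: e.totalMinutes / e.count
    }))
    // most frequently recurring alerts first
    .sort((x, y) => y.count - x.count);
}

// Hypothetical sample records.
const sampleAlerts = [
  { name: 'HighLatency', startsAt: '2024-01-01T00:00:00Z', endsAt: '2024-01-01T00:30:00Z' },
  { name: 'HighLatency', startsAt: '2024-01-02T00:00:00Z', endsAt: '2024-01-02T00:10:00Z' },
  { name: 'DiskFull', startsAt: '2024-01-03T00:00:00Z', endsAt: '2024-01-03T01:00:00Z' }
];
console.log(summarizeAlerts(sampleAlerts));
```

An alert with a high count and a short mean resolution time is often a candidate for threshold tuning, since it may be resolving on its own before anyone acts.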