Graphite/Deprecation Roadmap

From Wikitech

Context

We have been using Prometheus in production for several years, as it offers several benefits over Graphite. Migrating MW off Graphite keeps us on a supported, scalable metrics platform that allows more effective, multidimensional metrics analysis and storage. Prometheus provides more robust data labeling, storage, and query capabilities. This initiative is fundamental to unifying our metrics, enhancing monitoring, improving MW observability, and reducing tool fragmentation.

  • Last year, the team set out to test whether a new interface was viable and determined that long-term sustainability required migrating MediaWiki metrics to Prometheus using StatsLib, a new, internally developed, Prometheus-capable metrics interface. By the end of Q2, the team had successfully tested the component in production, and by the end of Q4, the migration was about 42% complete.

As the WMF improves its culture around MW ecosystem sustainability, our goal is to complete the migration of active, in-use production metrics (those used by dashboards or alerts) to Prometheus, so that the Graphite cluster can be placed in read-only mode by the end of Q3 FY 2024/2025.

For this exercise, we define as “in-use” any metric emitted to Graphite that is mapped to a dashboard panel or alert active in Grafana. See the Graphite Utilization Dashboard.
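The “in-use” definition above can be read as a set intersection: the metrics emitted to Graphite, intersected with the metrics referenced by active Grafana panels or alerts. The following is a minimal sketch of that reading, not the method used by the Graphite Utilization Dashboard. The `GrafanaReference` record, the `in_use_metrics` function, and the metric names are all hypothetical, and the sketch assumes exact metric names (Graphite wildcard queries are not expanded).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrafanaReference:
    """A Graphite metric referenced from Grafana (hypothetical record)."""
    metric: str   # exact Graphite metric name; wildcards are not handled here
    kind: str     # "panel" or "alert"
    active: bool  # whether the panel or alert is currently active


def in_use_metrics(emitted, references):
    """Return emitted metrics referenced by at least one active panel or alert."""
    referenced = {ref.metric for ref in references if ref.active}
    return set(emitted) & referenced


# Hypothetical example data, for illustration only.
emitted = {"MediaWiki.example.a", "MediaWiki.example.b", "MediaWiki.example.c"}
references = [
    GrafanaReference("MediaWiki.example.a", "panel", active=True),
    GrafanaReference("MediaWiki.example.b", "alert", active=False),
]

in_use = in_use_metrics(emitted, references)
unused = emitted - in_use  # candidates to disable rather than migrate
```

Under this reading, metrics that are emitted but never referenced by an active panel or alert fall outside the migration scope.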

Project Roadmap

Based on our project plan, we have identified global target milestones for the whole project, as well as goals and targets for each quarter.

Global

Global metrics and goals cover the entire fiscal year. Given how the key result and working group are structured, teams and contributing hypotheses are expected to work on their hypothesis for three quarters and assess the impact during Q4.

Goals

  • Ensure MediaWiki platform sustainability.
  • Complete migration of metrics to Prometheus.
  • Sunset Graphite into “read-only mode” by the end of Q3.
  • Formally announce Graphite's final deprecation date/timeline one year after Q3.

Success Metrics:

  1. Migration percentage of dashboard panels using Graphite queries (based on ingested metrics used in the last 90 days)
  2. Overall StatsLib utilization compared with the Graphite data source (based on metrics emitted in the last 90 days)
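As a rough illustration of how the two success metrics could be computed, here is a minimal sketch under stated assumptions. It is one plausible reading of the metrics, not the project's actual calculation. The function names, panel identifiers, and counts are hypothetical. It assumes the panel metric is measured against a baseline set of panels that originally used Graphite queries, and that StatsLib utilization is the share of all emitted metrics that go through StatsLib.

```python
def panel_migration_percent(baseline_graphite_panels, current_graphite_panels):
    """Percentage of baseline Graphite panels that no longer query Graphite."""
    baseline = set(baseline_graphite_panels)
    if not baseline:
        return 100.0  # nothing to migrate
    remaining = baseline & set(current_graphite_panels)
    return 100.0 * (len(baseline) - len(remaining)) / len(baseline)


def statslib_share_percent(statslib_metrics, graphite_metrics):
    """Share of emitted metrics (by count) that are emitted through StatsLib."""
    total = statslib_metrics + graphite_metrics
    if total == 0:
        return 0.0
    return 100.0 * statslib_metrics / total


# Hypothetical panel identifiers and counts, for illustration only.
baseline = ["p1", "p2", "p3", "p4", "p5"]  # panels that used Graphite at the start
still_graphite = ["p1", "p2", "p3"]        # panels that still use Graphite today

panel_pct = panel_migration_percent(baseline, still_graphite)
share_pct = statslib_share_percent(statslib_metrics=420, graphite_metrics=580)
```

Both functions would be fed only with panels and metrics active in the last 90 days, matching the window given in the metric definitions.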

Q1-FY2024/2025

Goals

  • [In Progress] Identify (and disable) unused MW Graphite metrics to reduce noise and narrow the set of actionable metrics to migrate.
  • Update dashboards in Grafana to use Prometheus-sourced metrics instead of Graphite-sourced ones.
  • Update the default data source in Grafana to be Prometheus, not Graphite: https://phabricator.wikimedia.org/T269333
  • Formally announce technical deprecation of Graphite (read-only Q3, termination one year later).
    • Phabricator: https://phabricator.wikimedia.org/T228380
    • Wikitech/Docs: https://wikitech.wikimedia.org/wiki/Graphite
    • Grafana: https://grafana.wikimedia.org (under service updates)
    • wikitech-l: [draft] WE5.1.2 Graphite deprecation notice for wikitech-l and tech-all
    • Tech-all: [draft] WE5.1.2 Graphite deprecation notice for wikitech-l and tech-all

Success Metric Targets

  • Increase migration progress (by intake) by an additional 30% (currently at 40%).
  • Increase migration progress by 30% (in panels/dashboards converted).

Q2-FY2024/2025

Goals

Success Metric Targets

  • Increase migration progress (by intake) by another 20%.
  • Increase migration progress by x% (in panels/dashboards converted)

Q3-FY2024/2025

Goals

Success Metric Targets

  • Increase migration progress (by intake) to 90% as a target for “read-only” implementation.
  • Increase migration progress to 95% (in panels/dashboards converted).

Q4-FY2024/2025

Goals

  • Analysis and retrospective
  • Updated dashboard panels.
  • Sustainability intervention reports.

Success Metric Targets

  • Increase migration progress (by intake) to as close to 100% as possible.
  • Increase migration progress to 100% (in panels/dashboards converted)