Graphite/Deprecation Roadmap
Context
We have been using Prometheus in production for several years as it offers several benefits over Graphite. Migrating MW off Graphite ensures we stay ahead with a supported, scalable metrics platform for more effective, multidimensional metrics analysis and storage. Prometheus provides more robust data labeling, storage, and query capabilities. This initiative is fundamental in unifying our metrics, enhancing monitoring, improving MW observability, and reducing tool fragmentation.
Last year, the team set out to test whether a new interface was viable and determined that long-term sustainability required us to migrate MediaWiki metrics to Prometheus using StatsLib, a new, internally developed, Prometheus-capable metrics interface. By the end of Q2, the team had successfully tested the component in production, and by the end of Q4, the migration was about 42% complete.
As the WMF improves its culture around MW ecosystem sustainability, we are setting our goals to complete the migration of active, production, and in-use (by dashboards/alerts) metrics to Prometheus to enable read-only mode on the Graphite cluster by the end of Q3 FY 2024/2025.
For this exercise, we define as “in-use” any metric emitted to Graphite that is mapped to a dashboard panel or alert active in Grafana. See the Graphite Utilization Dashboard.
Project Roadmap
Based on our project plan, we’re identifying some target milestones globally for the whole project and per-quarter goals and targets.
Global
Global metrics and goals cover the entirety of the fiscal year. As the key result and working group are structured, teams and contributing hypotheses are expected to work on their hypothesis for three quarters and assess the impact during Q4.
Goals
- Ensure MediaWiki platform sustainability.
- Complete migration of metrics to Prometheus.
- Sunset Graphite into “read-only mode” by the end of Q3.
- Formally announce Graphite's final deprecation date/timeline one year after Q3.
Success Metrics:
- Migration % of dashboard panels using Graphite queries (metrics ingested used last 90d)
- Overall StatsLib utilization in contrast to the Graphite data source (metrics emitted last 90d)
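As a rough illustration of the first success metric, the sketch below computes the share of dashboard panels that no longer use Graphite queries. The `Panel` record, its fields, and the sample data are hypothetical and are not taken from the Graphite Utilization Dashboard. The sketch also assumes that a panel mixing Prometheus and Graphite queries still counts as not migrated.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Panel:
    """Hypothetical record describing one Grafana dashboard panel."""
    dashboard: str
    title: str
    datasources: Tuple[str, ...]  # data source types used by the panel's queries


def migration_percent(panels: Sequence[Panel]) -> float:
    """Return the percentage of panels with no remaining Graphite queries."""
    if not panels:
        return 0.0
    # Assumption: a panel counts as migrated only when none of its queries
    # still reads from a Graphite data source.
    migrated = sum(1 for p in panels if "graphite" not in p.datasources)
    return 100.0 * migrated / len(panels)


# Made-up sample data, for illustration only.
panels = [
    Panel("Example A", "Requests", ("prometheus",)),
    Panel("Example A", "Latency", ("graphite",)),
    Panel("Example B", "Errors", ("prometheus", "graphite")),
    Panel("Example B", "Saves", ("prometheus",)),
]

print(f"{migration_percent(panels):.1f}% of panels migrated")
```

Counting mixed panels as not migrated keeps the figure conservative, which suits a metric that gates the switch to read-only mode.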
Q1-FY2024/2025
Goals
- [In Progress] Identify (and disable) unused MW Graphite metrics to reduce noise and narrow down the actionable metrics to migrate.
- Update dashboards in Grafana to use Prometheus-sourced metrics instead of Graphite-source.
- Update the default data source in Grafana to be Prometheus, not Graphite. https://phabricator.wikimedia.org/T269333
- Formally announce technical deprecation of Graphite (read-only Q3, termination one year later).
  - Phabricator: https://phabricator.wikimedia.org/T228380
  - Wikitech/Docs: https://wikitech.wikimedia.org/wiki/Graphite
  - Grafana: https://grafana.wikimedia.org (under service updates)
  - wikitech-l: [draft] WE5.1.2 Graphite deprecation notice for wikitech-l and tech-all
  - Tech-all: [draft] WE5.1.2 Graphite deprecation notice for wikitech-l and tech-all
Success Metric Targets
- Increase migration progress (by intake) by 30% (currently at 40%).
- Increase migration progress by 30% (in panels/dashboards converted)
Q2-FY2024/2025
Goals
- Migrate non-MW metrics producers completely off Graphite
- Continue updating dashboards to use Prometheus-sourced metrics instead of Graphite-source https://phabricator.wikimedia.org/T350592
- Implementation plan and approach to configuring Graphite as “read-only” for a year before sunset. https://phabricator.wikimedia.org/T372856
- Identify and mitigate unknown metrics/sources (should there be any).
- Establish office hours support for the rest of the organization regarding StatsLib/Migration.
Success Metric Targets
- Increase migration progress (by intake) by another 20%.
- Increase migration progress by x% (in panels/dashboards converted)
Q3-FY2024/2025
Goals
- Continue updating dashboards to use Prometheus-sourced metrics instead of Graphite-source. https://phabricator.wikimedia.org/T350592
- Identify and mitigate unknown metrics/sources (should there be any). https://phabricator.wikimedia.org/T228380
- Continue office hours support for the rest of the organization regarding StatsLib/Migration.
- Close the tail end of the migration, identify and migrate any pending extensions/modules/sources.
- Implementation and prep for enabling “read-only” mode on Graphite by end of quarter. https://phabricator.wikimedia.org/T372856
Success Metric Targets
- Increase migration progress (by intake) to 90-95+% as a target for “read only”
- Increase migration progress to 95%
Q4-FY2024/2025
Goals
- Analysis and retrospective
- Updated dashboard panels.
- Sustainability intervention reports.
Success Metric Targets
- Increase migration progress (by intake) to as close to 100% as possible.
- Increase migration progress to 100% (in panels/dashboards converted)
Relevant links
- T228380: Graphite technical deprecation – this is the main task tracking the overall deprecation.
- T350592: Migrate "in-use" metrics and dashboards to StatsLib – this task tracks WM-emitted Graphite metrics. Once it is close to completion, the bulk of the work will be done.