Maps/Maintenance


Maps maintenance

This document outlines the maintenance activities and points to further documentation explaining each process. The idea is to make the maps infrastructure:

Understood

Teams responsible for aspects of the service understand where their responsibilities begin and end, and have the information required to fulfill those responsibilities (alerting, documentation, SLOs, etc.).

Supported

Components are modern, run up-to-date versions (where possible and realistic) and, if possible, are internally standardized. Examples include Prometheus, the discussion around Cassandra that emerged in our meeting, running maps services on Buster where bare metal is needed, and Node.js updates.

Automated

Wherever possible, updates and self-healing require no manual intervention. This is not a pressing problem at the moment, but it would be excellent to avoid things like resyncing the databases the way we do now.

Distributed/fault tolerant

Currently, if we lose an individual maps node, we lose four services at once. The plan to move components to k8s where possible greatly improves this situation. We should also avoid tightly coupling application components and state.

Known issues

  • Services are not in k8s
  • Currently, if we lose an individual maps node, we lose four services at once
  • Components are tightly coupled
  • Resyncing the DB has a high cost
  • Metrics need to move from Graphite to Prometheus
  • Service is not paging
  • Kartotherian is not publicly available for 3rd parties

Maintenance activities responsibility and support

What's needed for the maintenance work? R = Responsible, S = Support

Application layer

Tilerator

Activity SRE PI Automated?
Monitor tile generation triggered by OSM replication
Monitor z0 - z9 monthly tile regeneration
Manually trigger a tile regeneration for a specific part of the planet
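
As a rough illustration of what "a specific part of the planet" means in tile terms, the sketch below converts a bounding box into the tiles that cover it at a given zoom level. It assumes the standard XYZ (slippy map) tile scheme. The function names and the sample bounding box are hypothetical and are not part of Tilerator.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Convert a WGS84 lon/lat pair to XYZ tile indices at the given zoom."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom
    lat_rad = math.radians(lat)
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that edge coordinates (for example lon = 180) stay in range.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


def tiles_for_bbox(west, south, east, north, zoom):
    """List the (zoom, x, y) tiles covering a bounding box.

    Tile y grows southwards, so the northern edge gives the smallest y.
    """
    x_min, y_min = lonlat_to_tile(west, north, zoom)
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return [
        (zoom, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]


if __name__ == "__main__":
    # Hypothetical bounding box, used only to show the call.
    for zoom in (5, 9):
        tiles = tiles_for_bbox(-10.0, 35.0, 5.0, 45.0, zoom)
        print(zoom, len(tiles))
```

The tile count grows roughly fourfold per zoom level, which is one reason to limit a manual regeneration to the smallest area and zoom range that covers the problem.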

Kartotherian

Activity SRE PI Automated?
Investigate and fix application production errors

Infrastructure

Beta Cluster

Activity SRE PI Automated?

Varnish

Activity SRE PI Automated?
Purge tile cache when vandalism occurs

Cassandra

Activity SRE PI Automated?
Restore node replica when there is a disk space issue
Set up storage keyspace on master machine
Enable data replication for the remaining nodes

PostgreSQL

Activity SRE PI Automated?
Restore a replica that is out of sync with the main DB
Initial OSM import
Restore a DB that is causing a disk space issue
Restore OSM replication when the replication lag keeps growing
Make sure that the proper binaries are successfully installed in the infrastructure
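
To make "OSM lag" concrete, the sketch below estimates replication lag by comparing two replication state files. It assumes the replication tooling keeps an osmosis-style `state.txt` with a `timestamp=` line; this page does not say which tool is used, so treat the file format, the function names, and the sample values as assumptions.

```python
from datetime import datetime, timezone


def parse_state_timestamp(state_text):
    """Extract the UTC timestamp from an osmosis-style state.txt body.

    The file uses Java properties syntax, so colons are escaped as "\\:".
    """
    for line in state_text.splitlines():
        line = line.strip()
        if line.startswith("timestamp="):
            raw = line.split("=", 1)[1].replace("\\", "")
            parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no timestamp line found in state text")


def replication_lag_seconds(upstream_state, local_state):
    """Return how many seconds the local state is behind the upstream state."""
    upstream = parse_state_timestamp(upstream_state)
    local = parse_state_timestamp(local_state)
    return (upstream - local).total_seconds()


if __name__ == "__main__":
    # Made-up sample states, used only to show the call.
    upstream = r"""sequenceNumber=100
timestamp=2021-01-01T12\:00\:00Z"""
    local = r"""sequenceNumber=90
timestamp=2021-01-01T10\:30\:00Z"""
    print(replication_lag_seconds(upstream, local))
```

A lag that keeps increasing across successive checks, rather than a single high reading, is the signal that replication needs to be restored.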

Redis

> No known production issues or maintenance tasks