Maps maintenance

This document outlines the maintenance activities and points to further documentation explaining each process. The idea is to make the maps infrastructure:

Understood

Teams responsible for aspects of the service understand where their responsibilities begin and end, and have the information required to fulfill them (alerting, documentation, SLOs, etc.).

Supported

Components are modern, run up-to-date versions where possible and realistic, and are internally standardized where possible. Examples include Prometheus, and possibly also the discussion around Cassandra that emerged in our meeting, running maps services on Buster where bare metal is needed, Node.js updates, etc.

Automated

Wherever possible, updates and self-healing require no manual intervention. This isn’t a problem as such at the moment, but it would be excellent if we could avoid things like resyncing the databases the way we do now.

Distributed/fault tolerant

Currently, if we lose an individual maps node, we lose four services at once. The plan to move components to k8s where possible greatly improves this situation. We should also avoid tightly coupling application components and state.
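The effect of co-locating services on one node can be illustrated with a small sketch. The node and service names below are hypothetical placeholders, not the actual maps hosts or the actual four services; the point is only to compare how many services a single node failure touches under each layout.

```python
from typing import Dict, Set


def impacted_services(placement: Dict[str, Set[str]], failed_node: str) -> Set[str]:
    """Return the services that lose an instance when failed_node goes down."""
    return set(placement.get(failed_node, set()))


def worst_case_blast_radius(placement: Dict[str, Set[str]]) -> int:
    """Return the largest number of services hit by any single node failure."""
    return max((len(services) for services in placement.values()), default=0)


# Hypothetical layout: every node runs all four services (the current situation).
ALL_SERVICES = {"service-1", "service-2", "service-3", "service-4"}
COLOCATED = {
    "maps-node-1": set(ALL_SERVICES),
    "maps-node-2": set(ALL_SERVICES),
}

# Hypothetical layout: each service is scheduled on its own host.
SPREAD = {
    "host-a": {"service-1"},
    "host-b": {"service-2"},
    "host-c": {"service-3"},
    "host-d": {"service-4"},
}

if __name__ == "__main__":
    print(worst_case_blast_radius(COLOCATED))  # 4
    print(worst_case_blast_radius(SPREAD))     # 1
```

This is a simplification: a real scheduler such as k8s also restarts lost instances elsewhere, which the sketch does not model.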

Known issues

Services are not in k8s
Currently if we lose an individual maps node, we lose four services at once
Components are tightly coupled
Resyncing the DB has a high cost
Metrics need to move from Graphite to Prometheus
Service is not paging
Kartotherian is not publicly available to third parties

Maintenance activities responsibility and support

What's needed for the maintenance work? R = Responsible, S = Support

Application layer

Kartotherian

Activity	Responsible	Consulted	Informed	Automated?
Investigate and fix application production errors
Package kartotherian code for deployment
Deploy code and configuration into maps clusters
Configure load balancing between maps sources

Tegola

Activity	Responsible	Consulted	Informed	Automated?
Investigate and fix application production errors and submit fixes upstream

Infrastructure

Beta Cluster (*.maps-experiments.eqiad1.wikimedia.cloud)

Activity	Responsible	Consulted	Informed	Automated?

Varnish

Activity	Responsible	Consulted	Informed	Automated?
Purge tile cache when vandalism occurs

PostgreSQL/OSM

Activity	Responsible	Consulted	Informed	Automated?
Restore a replica that is out of sync with the main DB
Initial OSM import
Restore a DB that is causing disk space issues
Restore OSM replication when it is falling behind
Make sure that the proper binaries are successfully installed in the infrastructure
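For the OSM replication activity in the table above, a lag check could be sketched as follows. This assumes the replication state is available as an OSM-style `state.txt` (Java properties format with a `timestamp=` line and escaped colons); whether the maps import tooling exposes its state in this form is an assumption, and the two-hour threshold and function names are purely illustrative.

```python
from datetime import datetime, timedelta, timezone


def parse_state_timestamp(state_text: str) -> datetime:
    """Extract the UTC timestamp from the contents of an OSM-style state.txt."""
    for line in state_text.splitlines():
        if line.startswith("timestamp="):
            # Colons are escaped as "\:" in the properties format.
            raw = line.split("=", 1)[1].strip().replace("\\:", ":")
            parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no timestamp= line found in state text")


def replication_lag(state_text: str, now: datetime) -> timedelta:
    """Return how far the last applied replication state is behind `now`."""
    return now - parse_state_timestamp(state_text)


def is_falling_behind(
    state_text: str,
    now: datetime,
    threshold: timedelta = timedelta(hours=2),  # illustrative value only
) -> bool:
    """Return True when the replication lag exceeds the threshold."""
    return replication_lag(state_text, now) > threshold


# Sample state contents in the upstream OSM replication format.
SAMPLE_STATE = (
    "#Sat Jan 02 00:00:05 UTC 2021\n"
    "sequenceNumber=4343425\n"
    "timestamp=2021-01-02T00\\:00\\:02Z\n"
)
```

A check like this could feed an alert so that lag is noticed before a manual restore becomes necessary.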

Swift

Activity	Responsible	Consulted	Informed	Automated?

Kafka

Activity	Responsible	Consulted	Informed	Automated?
Wipe Kafka topic (empty queue) by skipping all events in the stream
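One way to read "skipping all events" is moving the consumer group's committed offset on every partition to the end of the log, so that pending events are never consumed. The sketch below models only that offset arithmetic in plain Python; it does not talk to a broker, and the function names are hypothetical. It is an assumption that the maps procedure works this way.

```python
from typing import Dict


def pending_events(committed: Dict[int, int], end_offsets: Dict[int, int]) -> int:
    """Count events between the committed offset and the log end, per partition."""
    return sum(
        max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def skip_to_end(end_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return new committed offsets that point at the log end of each partition."""
    return dict(end_offsets)


if __name__ == "__main__":
    # Partition id -> offset; values are made up for illustration.
    committed = {0: 100, 1: 250}
    end_offsets = {0: 180, 1: 250, 2: 40}

    print(pending_events(committed, end_offsets))                 # 120
    print(pending_events(skip_to_end(end_offsets), end_offsets))  # 0
```

The events themselves stay in the topic until retention removes them; only the consumer's position changes. On a real cluster, Kafka's `kafka-consumer-groups` tool offers a comparable `--reset-offsets --to-latest` operation.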

Tegola

Activity	Responsible	Consulted	Informed	Automated?
Enable pre-generation on eqiad/codfw tegola