Maps/Maintenance

Maps maintenance

This document outlines the maintenance activities and points to further documentation explaining each process. The idea is to make the maps infrastructure:

Understood

Teams responsible for aspects of the service understand where their responsibilities begin and end, and have the information they need to fulfill them (alerting, documentation, SLOs, etc.).

Supported

Components are modern, run up-to-date versions where possible and realistic, and are internally standardized where possible. Examples include Prometheus, and possibly also the discussion around Cassandra that emerged in our meeting, running maps services on Buster where bare metal is needed, Node.js updates, etc.

Automated

Wherever possible, updates and self-healing do not require manual intervention. This is not a pressing problem at the moment, but it would be excellent to avoid things like resyncing the databases the way we do now.

Distributed/fault tolerant

Currently, if we lose an individual maps node, we lose four services at once. The plan to move components to k8s where possible greatly improves this situation, as does avoiding tight coupling between application components and state.

Known issues

  • Services are not in k8s
  • Currently, if we lose an individual maps node, we lose four services at once
  • Components are tightly coupled
  • Resyncing the DB has a high cost
  • Metrics need to move from Graphite to Prometheus
  • Service is not paging
  • Kartotherian is not publicly available for 3rd parties

Maintenance activities responsibility and support

What's needed for the maintenance work? R = Responsible, S = Support

Application layer

Kartotherian

Activity Responsible Consulted Informed Automated?
Investigate and fix application production errors
Package kartotherian code for deployment
Deploy code and configuration into maps clusters
Configure load balancing between maps sources


Tegola

Activity Responsible Consulted Informed Automated?
Investigate and fix application production errors, and submit fixes upstream

Infrastructure

Beta Cluster (*.maps-experiments.eqiad1.wikimedia.cloud)

Activity Responsible Consulted Informed Automated?

Varnish

Activity Responsible Consulted Informed Automated?
Purge tile cache when vandalism occurs
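
To purge tiles after vandalism, one first needs to know which tiles cover the affected area. The following is a minimal sketch of that step only, using the standard slippy-map (Web Mercator) tile numbering. The function names are hypothetical, and how the resulting tiles are actually purged from Varnish is not covered here.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) slippy-map tile containing a WGS84 point."""
    n = 2 ** zoom
    lat_rad = math.radians(lat)
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edge stay inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_in_bbox(west, south, east, north, zoom):
    """List (zoom, x, y) for every tile touching a bounding box.

    Hypothetical helper: assumes the box does not cross the antimeridian.
    """
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right corner
    return [
        (zoom, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]
```

The list grows quickly with zoom level, so a purge covering many zoom levels is better computed per level and bounded to the area that was actually vandalised.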

PostgreSQL/OSM

Activity Responsible Consulted Informed Automated?
Restore replica out of sync with main DB
Initial OSM import
Restore a DB that is causing disk space issues
Restore OSM replication when replication lag keeps growing
Make sure that the proper binaries are successfully installed in the infrastructure
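
Deciding when OSM replication needs restoring requires a measure of replication lag. The following is a minimal sketch, assuming the last applied replication state is available in the key=value `state.txt` format that OSM publishes alongside its replication diffs. The maps stack may track this differently, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_state(text):
    """Parse key=value lines from an OSM replication state file."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # The file is Java-properties style, so colons are backslash-escaped.
        state[key] = value.replace("\\:", ":")
    return state


def replication_lag_seconds(state_text, now=None):
    """Return seconds between the state timestamp and `now` (UTC)."""
    state = parse_state(state_text)
    stamp = datetime.strptime(
        state["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - stamp).total_seconds()
```

A value that keeps increasing between checks indicates that replication has stalled rather than merely being slow.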

Swift

Activity Responsible Consulted Informed Automated?

Kafka

Activity Responsible Consulted Informed Automated?
Wipe Kafka topic (empty queue) by skipping all events in the stream
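
Skipping all events amounts to moving each partition's consumer offset to the end of the log, so nothing is deleted from the topic itself. The following is a minimal sketch of that calculation only. The dictionaries mapping partition number to offset are hypothetical inputs, and applying the result depends on the Kafka client or tooling in use.

```python
def pending_events(committed, log_end):
    """Count events not yet consumed across all partitions."""
    return sum(
        max(end - committed.get(partition, 0), 0)
        for partition, end in log_end.items()
    )


def offsets_to_skip_backlog(committed, log_end):
    """Return per-partition offsets that mark every pending event as consumed."""
    new_offsets = {}
    for partition, end in log_end.items():
        current = committed.get(partition, 0)
        # Never move an offset backwards, which would replay old events.
        new_offsets[partition] = max(current, end)
    return new_offsets
```

Checking `pending_events` before and after the change gives a simple confirmation that the queue is empty from the consumer's point of view.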

Tegola

Activity Responsible Consulted Informed Automated?
Enable pre-generation on eqiad/codfw tegola