Maps maintenance
This document outlines the maintenance activities and points to further documentation explaining each process. The idea is to make the maps infrastructure:
Understood
Teams responsible for aspects of the service understand where their responsibilities begin and end, and have the information required to fulfill those responsibilities (alerting, documentation, SLOs, etc.).
Supported
Components are modern, run up-to-date versions (where possible and realistic), and are internally standardized where possible. Examples include Prometheus, the discussion around Cassandra that emerged in our meeting, running maps services on Buster where bare metal is needed, and Node.js updates.
Automated
Wherever possible, manual intervention isn't required for updates and self-healing. This isn't a problem as such at the moment, but it would be excellent to avoid things like resyncing the databases the way we do now.
Distributed/fault tolerant
Currently, if we lose an individual maps node, we lose four services at once. The plan to move components to k8s where possible greatly improves this situation. We should also avoid tightly coupling application components and state.
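The effect of co-locating services on one node can be illustrated with a small sketch. The node names, service names, and the "spread" layout below are hypothetical placeholders, not the actual maps topology. The sketch only compares how many services are degraded by a single node failure under each layout.

```python
from collections import Counter


def remaining_instances(placement, failed_node):
    """Count the instances of each service that survive the loss of one node."""
    counts = Counter()
    for node, services in placement.items():
        for service in services:
            # Adding 0 still registers the service, so a total outage shows up as 0.
            counts[service] += 0 if node == failed_node else 1
    return dict(counts)


def degraded(placement, failed_node):
    """List the services that lose at least one instance when failed_node goes down."""
    return sorted(placement.get(failed_node, []))


# Hypothetical placeholder names; the real node and service names are not given here.
SERVICES = ["service-a", "service-b", "service-c", "service-d"]

# Co-located layout: every node runs all four services.
colocated = {"node-1": list(SERVICES), "node-2": list(SERVICES)}

# Spread layout (assumed): service instances are scheduled independently across nodes.
spread = {
    "node-1": ["service-a", "service-b"],
    "node-2": ["service-c", "service-d"],
    "node-3": ["service-a", "service-c"],
    "node-4": ["service-b", "service-d"],
}

print(degraded(colocated, "node-1"))   # all four services are affected
print(degraded(spread, "node-1"))      # only the two services on that node
print(remaining_instances(colocated, "node-1"))
```

In the co-located layout a single failure touches every service at once, while in the assumed spread layout it touches only the services scheduled on that node.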
Known issues
- Services are not in k8s
- Currently, if we lose an individual maps node, we lose four services at once
- Components are tightly coupled
- Resyncing the DB has a high cost
- Metrics need to move from Graphite to Prometheus
- Service is not paging
- Kartotherian is not publicly available to third parties
Maintenance activities responsibility and support
What's needed for the maintenance work? R = Responsible, S = Support
Application layer
Kartotherian
Activity | Responsible | Consulted | Informed | Automated? |
---|---|---|---|---|
Investigate and fix application production errors | | | | |
Package Kartotherian code for deployment | | | | |
Deploy code and configuration into maps clusters | | | | |
Configure load balancing between maps sources | | | | |
Tegola
Activity | Responsible | Consulted | Informed | Automated? |
---|---|---|---|---|
Investigate and fix application production errors and submit code upstream | | | | |
Infrastructure
Beta Cluster (*.maps-experiments.eqiad1.wikimedia.cloud)
Activity | Responsible | Consulted | Informed | Automated? |
---|---|---|---|---|
Varnish
Activity | Responsible | Consulted | Informed | Automated? |
---|---|---|---|---|
Purge tile cache when vandalism occurs | | | | |
PostgreSQL/OSM
Activity | Responsible | Consulted | Informed | Automated? |
---|---|---|---|---|
Restore a replica that is out of sync with the main DB | | | | |
Initial OSM import | | | | |
Restore a DB that is causing disk space issues | | | | |
Restore OSM replication when it is falling behind | | | | |
Make sure that the proper binaries are successfully installed in the infrastructure | | | | |
Swift
Activity | Responsible | Consulted | Informed | Automated? |
---|---|---|---|---|
Kafka
Activity | Responsible | Consulted | Informed | Automated? |
---|---|---|---|---|
Wipe Kafka topic (empty queue) by skipping all events in the stream | | | | |
Tegola
Activity | Responsible | Consulted | Informed | Automated? |
---|---|---|---|---|
Enable pre-generation on eqiad/codfw Tegola | | | | |