Incidents/20161021-Maps
Summary
Between 18:50 UTC and 19:20 UTC on October 21st, 2016, maps.wikimedia.org stopped rendering tiles because the Cassandra backend was unavailable.
Timeline
- 18:50 UTC: Cassandra wrongly reinitialized on maps2004.codfw.wmnet, deleting all Cassandra data on maps2004. Kartotherian starts failing with org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level LOCAL_ONE.
- 19:20 UTC: traffic redirected to the maps eqiad cluster; user traffic is served again without error
- 19:40 UTC: full deployment of new traffic configuration
- 21:13 UTC: permissions are reset on the maps/cassandra codfw cluster; Kartotherian starts working again on the codfw cluster
Conclusions
- The main trigger for this is human error.
- maps/cassandra has a replication factor of 1 on the "system_auth" keyspace. This means that losing a single node can break authentication, because each authentication row is stored on only one node.
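The effect of the replication factor can be illustrated with a small model. This is a simplified sketch, not how Cassandra actually places data: it assumes each row is stored on a run of consecutive nodes in a ring, and that a read at consistency level ONE or LOCAL_ONE succeeds when at least one replica is alive. The node names maps2001 to maps2003 are hypothetical; only maps2004 appears in this report.

```python
def replicas_for(position, nodes, rf):
    """Return the rf consecutive ring nodes that hold the row at `position`."""
    return [nodes[(position + i) % len(nodes)] for i in range(rf)]


def readable_at_one(position, nodes, rf, down):
    """A ONE/LOCAL_ONE read needs at least one live replica."""
    return any(node not in down for node in replicas_for(position, nodes, rf))


# Hypothetical four-node ring; maps2004 is the node that lost its data.
nodes = ["maps2001", "maps2002", "maps2003", "maps2004"]
down = {"maps2004"}

for rf in (1, 3):
    unreadable = [
        pos for pos in range(len(nodes))
        if not readable_at_one(pos, nodes, rf, down)
    ]
    print(f"rf={rf}: unreadable ring positions {unreadable}")
```

In this model, a replication factor of 1 leaves every row owned by maps2004 unreadable, while a replication factor of 3 keeps all rows readable with one node down.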
Actionables
- Increase the replication factor on the system_auth keyspace (task T149074)
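As a rough sketch of what this actionable could look like, the helper below builds a CQL ALTER KEYSPACE statement. The function name, the use of NetworkTopologyStrategy, the datacenter name "codfw" as seen by Cassandra, and the target replication factor of 3 are all assumptions for illustration; the actual values chosen are tracked in the task.

```python
def alter_replication_cql(keyspace, rf_per_dc):
    """Build an ALTER KEYSPACE statement (hypothetical helper).

    rf_per_dc maps a Cassandra datacenter name to its replication factor.
    """
    options = ["'class': 'NetworkTopologyStrategy'"]
    # Sort datacenters so the generated statement is deterministic.
    options += [f"'{dc}': {rf}" for dc, rf in sorted(rf_per_dc.items())]
    body = ", ".join(options)
    return f"ALTER KEYSPACE {keyspace} WITH replication = {{{body}}};"


# Assumed datacenter name and replication factor, for illustration only.
statement = alter_replication_cql("system_auth", {"codfw": 3})
print(statement)
```

Changing the replication settings does not by itself copy existing rows to the new replicas; a repair of the keyspace (for example with nodetool repair system_auth) is normally needed afterwards.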