Incident documentation/20190308-maps


Summary

A monkey patch phab:T214350 deployed during the jessie -> stretch migration phab:T198622 was erased by mistake. This broke the geoshape service for some use cases, which in turn broke a number of map frames.

The failure happened with Geoshapes v1.0.3, which had been published without the fix. As a result, the Pull Request that changed it had no effect.

For context: the migration from jessie to stretch requires a re-generation of tiles, which is taking a long time (weeks). During that time, the cluster is in a mixed jessie / stretch configuration. Some code changes were needed for the stretch upgrade, so different application versions are deployed on different servers.

Timeline

  • 2019-01-22 14:34Z: 'monkey patch' deployed to production
  • 2019-03-07 18:32Z: deployment of kartotherian 1.0.0 on maps100[1-4] + maps2004 (servers already migrated to stretch); this erased the previous monkey patch
  • 2019-03-07 18:32Z: repool of maps2004
  • 2019-03-07 14:14Z: phab:T217898 created for an issue on geoshape service
  • 2019-03-07 15:05Z: first report of wiki pages with blank maps on IRC, in staff channel
  • 2019-03-07 15:25Z: issue investigated and reproduced on a number of wiki pages with mapframe + geoshape
  • 2019-03-07 15:45Z: issue identified as affecting the nodes upgraded to stretch, but not the ones on jessie
  • 2019-03-07 15:47Z: cause identified as an existing patch that was not deployed
  • 2019-03-07 16:22Z: patch packaged and properly deployed on all stretch nodes except maps1004 (typo in deployment config)
  • 2019-03-07 16:43Z: missing deployment on maps1004 identified
  • 2019-03-07 16:47Z: patch deployed on maps1004, situation back to normal
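The missed deployment on maps1004 is the kind of gap a post-deployment comparison can surface. The following is a minimal sketch, not the tooling actually used: the function name `find_unpatched_hosts`, the `reported` mapping, and the version string `"patched"` are all hypothetical, and only the host names come from the timeline above.

```python
def find_unpatched_hosts(expected_hosts, reported_versions, wanted_version):
    """Return (host, version) pairs for hosts that are missing from the
    report or that run a version other than the wanted one."""
    stale = []
    for host in expected_hosts:
        # .get() yields None when a host was skipped entirely,
        # e.g. because of a typo in the deployment configuration.
        version = reported_versions.get(host)
        if version != wanted_version:
            stale.append((host, version))
    return stale


# Hosts named in the timeline; the version strings are hypothetical.
stretch_hosts = ["maps1001", "maps1002", "maps1003", "maps1004", "maps2004"]
reported = {
    "maps1001": "patched",
    "maps1002": "patched",
    "maps1003": "patched",
    "maps2004": "patched",
    # maps1004 is absent from the report
}

print(find_unpatched_hosts(stretch_hosts, reported, "patched"))
# prints [('maps1004', None)]
```

Comparing against an explicit list of expected hosts, rather than against whatever the deployment tool reports, is what makes a silently skipped host visible.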

Conclusions

  • our monitoring did not catch this issue: no test is performed on the geoshape service as part of the usual spec.yaml checks phab:T217910
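A check of the geoshape service could validate the shape of a response rather than only its status code. The sketch below is illustrative and rests on an assumption not stated in this report: that a healthy response is a GeoJSON FeatureCollection with at least one feature. The function name `geoshape_problems` is hypothetical, and the function takes an already fetched status code and parsed body so that it makes no claim about the service's URL or parameters.

```python
def geoshape_problems(status_code, payload):
    """Return a list of human-readable problems; an empty list means OK.

    Assumes (hypothetically) that a healthy geoshape response is a
    GeoJSON FeatureCollection containing at least one feature.
    """
    problems = []
    if status_code != 200:
        problems.append(f"unexpected HTTP status {status_code}")
        return problems
    if not isinstance(payload, dict):
        problems.append("body is not a JSON object")
        return problems
    if payload.get("type") != "FeatureCollection":
        problems.append("body is not a FeatureCollection")
    features = payload.get("features")
    if not isinstance(features, list) or not features:
        problems.append("no features returned")
    return problems


healthy = {
    "type": "FeatureCollection",
    "features": [{"type": "Feature", "geometry": None, "properties": {}}],
}
print(geoshape_problems(200, healthy))  # prints []
print(geoshape_problems(200, {"type": "FeatureCollection", "features": []}))
# prints ['no features returned']
```

A blank map frame can come from a response that is well-formed but empty, which is why the sketch inspects the body as well as the status.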

During investigation, a number of issues were raised, not all of which are directly related to this specific incident. Still, it makes sense to track and address them:

  • kartotherian acts as a SPARQL proxy to the wikidata query service; this should be constrained in some way so that it does not expose the full power of SPARQL
  • there was little confidence that codfw alone could cope with production load while in a "degraded" state (with maps2004 down), highlighting a potential need to grow the cluster and to have a dashboard showing service capacity
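The capacity question in the last point comes down to simple arithmetic: do the nodes that remain after a failure cover peak load? The sketch below illustrates that calculation only. Every number in it is hypothetical, since this report gives neither the size of the codfw cluster nor any throughput figures, and the function names are made up for the example.

```python
import math


def nodes_needed(peak_rps, rps_per_node):
    """Smallest number of nodes able to serve the given peak load."""
    return math.ceil(peak_rps / rps_per_node)


def survives_node_loss(total_nodes, nodes_down, peak_rps, rps_per_node):
    """True if the remaining nodes can still serve the peak load."""
    remaining = total_nodes - nodes_down
    return remaining >= nodes_needed(peak_rps, rps_per_node)


# Hypothetical figures, for illustration only.
print(nodes_needed(900, 250))              # prints 4
print(survives_node_loss(4, 1, 900, 250))  # prints False
print(survives_node_loss(5, 1, 900, 250))  # prints True
```

A dashboard showing service capacity would effectively plot the two sides of that comparison over time.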

Links to relevant documentation

Some (not enough) documentation on maps is available at Maps. There is no specific documentation for this particular problem, and since it is unlikely that the same situation (the same undeployed patch) will happen again, none will be created.

Actionables

  • Improve monitoring of geoshapes phab:T217910
  • Review maps server capacity