Incident documentation/2020-09-09 mobileapps config change
document status: draft
While reconfiguring all services to use our service proxy middleware to make remote procedure calls, a faulty configuration was deployed by yours truly for mobileapps at 08:40. This caused mobileapps to create mobile-html content with broken css and js links for pages regenerated during the day.
The issue was reported at 16:40 and the issue was quickly reverted. Then we needed a few hours to actually clear all the caching layers (RESTBase, edge caches). All pages affected were purged by 20:20.
- The biggest actionable is of course to always wait for validation from service owners before merging a patch - and the whole outage would've been avoided if that was done. Anything else listed here is purely a second-order actionable.
- While this deployment was the result of bad judgement, SRE need to be able to deploy a configuration change with confidence. The fact that the mobileapps spec tests all passed in staging lulled SRE into a false sense of security. The openAPI spec should be extended to include a test for the aforementioned URLs. (TODO: create task)
- We need staging to become a functional environment where we can test more than just a swagger spec test. Maybe linking it to restbase-dev, and making it possible to compare results of urls with production would help (TODO: create task)
TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.