Incident documentation/2019-03-01 proton
Proton, one of the two currently deployed PDF rendering solutions in WMF, received an increased amount of requests due to a bot for a period of 2 hours, which it was unable to handle, leading to a degraded service for parts of that period
All times are UTC
11:00 traffic increases for proton by a factor of 3. However nothing bad happens as the infrastructure is able to handle it
11:39 traffic increases even more, now by a factor of 6. During the next 1.5h the traffic will gradually increase even more, reaching up to 25req/s for the a4 desktop endpoint, an increase of over 25x. A graph can be seen at https://grafana.wikimedia.org/d/000000563/proton?orgId=1&from=1551437310272&to=1551448110272. Logstash at https://logstash.wikimedia.org/goto/524c3c5e029d90bb9a84ced759540f79 pointing out the the queue is full
12:04 traffic has reached a peak.
12:04 icinga-wm alerts, but the service recovers quickly leading to believe it's a false positive
12:22 icinga-wm alerts again, alex starts looking into it
12:23 alex reports finding 32 chromium processes on 1 of the 2 proton hosts in eqiad and another 99 chromium processes on the other one. Some have dates of Jan 30.
12:31 akosiaris: restart proton on proton1001 
12:32 OOM locks up proton1002. The service keeps on flapping. 
12:35 alex validates theory that some 3rd party is creating a lot of pdfs
12:38 alex realizes some 3rd party is trying to create pdfs of many dewiki articles
12:41 icinga-wm sends the first SMS pages, recovers quickly
12:44 alex silences the SMS alerts for the next 2 hours, to avoid unnecessarily spamming the rest of the team while investigation is ongoing
12:49 after a discussion with ariel: akosiaris: lowers the max_render_queue_size 
12:50 Giuseppe suggests varnish rate limiting just for the PDF endpoint of the API
13:00 alex states the requests are for restbase aggregated PDF endpoint API.
13:02 alex lowers the limits for both concurrent rendering and queue to 1 and 10 respectively.
13:05 alex after digging through logs realizes that an external IP in the EU has requested over 36K PDF versions of articles in the span of 2 hours, which back of the envelope calculations amount to ~5req/s, which is on average below the current limit of 10req/s
13:08 Ariel suggests we look at the User-Agent
13:09 alex reports that accept-language, user-agent and referrer headers are all null, making it clear it's a bot talking to the API directly
13:18 Emanuelle merges a patch limit the API to 1req/s at the edge cache level
13:20 the bot has already stopped
14:00 the bot presumably tried to restart but was rate limited successfully at the edge cache level
15:02 alex restores the proton config values
The infrastructure powering the proton API was not sufficient to address an unexpected 25x increase in the amount of requests. We did not have proper rate limit for a particularly expensive API endpoint. If the service was on the kubernetes infrastructure it would have been possible to increase the resources for the service, something not possible currently.
Links to relevant documentation
- runbook: None currently exists
- Migrate the proton service to kubernetes (TODO: Create task)
- Create a runbook for issues with the proton service (TODO: Create task)
- Add PDF API specific rate-limiting per IP Done
- Investigate 2019-03-01 Proton incident - phab:T217724