Incident documentation/20120607-LastModifiedExtension

From Wikitech

Outage Summary

  • Duration: from about 0235 UTC to 0355 UTC; approximately 80 minutes
  • Impact: API service was flapping throughout that period
  • Cause: The LastModified extension was causing every page view to trigger an attempted POST request to http://en.wikipedia.org/w/api.php.
  • Resolution: The extension was manually disabled.


Detail

Here is the update from Tim Starling on this site incident.

A number of people complained on IRC that the API was mostly down, with 503 errors especially coming from cp1004 and cp1005.

On cp1005, the CPU for the backend squid was maxed out at 100% of one core. Cachemgr for cp1005:3128 reported that there were 1000 connections open to the non-API backend, almost all of which were for the URL "http://en.wikipedia.org/w/api.php". Presumably cp1004:3128 was also overloaded and the frontend squids were failing over to cp1005.
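The "almost all connections were for one URL" observation can be reproduced with a simple tally. The following is a minimal sketch only: it assumes the URLs of the open requests have already been extracted into a list, and the function name top_urls is hypothetical. It does not parse real cachemgr output.

```python
from collections import Counter

def top_urls(active_requests, n=3):
    """Return the n most common URLs as (url, count, share of all requests)."""
    counts = Counter(active_requests)
    total = sum(counts.values())
    # most_common() yields nothing for an empty input, so no division by zero.
    return [(url, count, count / total) for url, count in counts.most_common(n)]

# Illustrative data only, not figures from the incident.
sample = ["http://en.wikipedia.org/w/api.php"] * 950 + ["http://en.wikipedia.org/wiki/Main_Page"] * 50
for url, count, share in top_urls(sample):
    print(f"{count:5d}  {share:6.1%}  {url}")
```

A single URL dominating the tally, as it did here, points at one client-side caller rather than a general rise in traffic.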

I navigated to Wikipedia with Firebug enabled and discovered that every page view was leading to an attempted POST request to http://en.wikipedia.org/w/api.php. Armed with the POST parameters, and with jeremyb's help on IRC, I was able to identify LastModified as the cause. I disabled it, and the CPU usage of the backend squid on cp1005 dropped to 40%.

If cp1004 and cp1005 hadn't implicitly limited the request rate, then it's very likely that the main Apache cluster would have been overloaded and the whole site would have gone down.
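The protective effect described above can be illustrated with a bounded connection pool: once every slot is taken, further requests are rejected at the proxy instead of reaching the backend. This is a sketch of the general idea only. The class BoundedBackend and its parameters are hypothetical, and the actual mechanism by which the squids limited the request rate is not described in this report.

```python
import threading

class BoundedBackend:
    """Forward requests only while a connection slot is free (hypothetical sketch)."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def handle(self, forward):
        # Non-blocking acquire: when all slots are busy, shed the request
        # with a 503 rather than queueing it onto the backend.
        if not self._slots.acquire(blocking=False):
            return 503
        try:
            return forward()
        finally:
            self._slots.release()

backend = BoundedBackend(max_connections=1)
# A second request arriving while the only slot is held gets a 503.
nested = []
backend.handle(lambda: nested.append(backend.handle(lambda: 200)) or 200)
print(nested)
```

Clients see 503 errors, as they did during this incident, but the load passed on to the backend stays capped at max_connections.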

Analysis was made somewhat more difficult by the fact that Ganglia was not working. On the two gmond aggregators for the eqiad text cluster, cp1001 and cp1002, gmond was using 100% CPU and tens of gigabytes of RAM. No data had been visible for the previous 3 days. I haven't identified the root cause yet. Restarting gmond didn't fix it.

-- Tim Starling