Incidents/20120206-Squid

Outage Summary

Duration: about 25 minutes
Effect: Thumbnails failed to load, returning a squid error. Some wiki pages on all projects and languages returned a database error for both logged-in and logged-out users. Roughly 17% of all page views resulted in 500 errors for about 5 minutes, ~8% for about 10 minutes, and a small percentage for the other 10 minutes.
Cause: Human error (restarted too many squids too fast)
Resolution: Manually brought each failed squid back into service, then waited for the disk cache checks to complete.

Detail

During the swift deploy I pushed out a squid config for upload.wikimedia.org that contained errors. When squid reads a config that contains errors, it exits. The deploy process only HUPs squid (never restarts it), so after I reverted the change, the affected squids, which had already exited, did not come back up. Hoping to kick them back into service, I restarted them. Unfortunately, I restarted both text and upload squids instead of just upload squids.

After restarting, many of them went into a forced cache consistency check. During this check, squid passes all requests through to the back end instead of returning cached results. The apache cluster was able to handle the flood of traffic, but it overwhelmed db40, the parser cache. The parser cache is backed first by memcached and then by db40, so frequently accessed pages (such as those that nagios and watchmouse check) remained available, since their parser-cache data is in memcached. This split (where some pages have their parser data in memcached and the rest in the db) is why only a portion of page loads failed. Ganglia graphs from emery show the total number of 5xx responses which, when compared to the total number of page views, gives the percentage of failed page views.

Once the upload squids had fully restarted, ms5 also became overloaded, and loading and creating thumbnails slowed down to the point where some thumbnails failed to load. As a result, thumbnails were completely unavailable for a short time at the beginning of the outage and intermittently slow or unavailable for the remaining 20 minutes while ms5 caught up with demand.
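To illustrate why only a portion of page views failed, here is a toy model of the two-tier parser cache described above. It is a sketch under stated assumptions, not the real MediaWiki code: the function names, the page names, and the 83/17 traffic split are all made up for illustration, and the real lookup is more involved.

```python
def serve(page, memcached, db_available):
    """Return True if the page view succeeds, False if it would be a 5xx.

    Assumption: a page whose parser data is in memcached never touches
    the database tier (db40 in this incident).
    """
    if page in memcached:
        return True
    # Cache miss in the first tier: the request depends on the database.
    return db_available


def failed_percentage(page_views, memcached, db_available):
    """Failed page views as a percentage of all page views."""
    failures = sum(
        1 for page in page_views if not serve(page, memcached, db_available)
    )
    return 100.0 * failures / len(page_views)


# Hypothetical traffic: 83 views of a hot page, 17 views of a cold page.
views = ["Main_Page"] * 83 + ["Rare_Page"] * 17
hot_pages = {"Main_Page"}  # parser data already in memcached

print(failed_percentage(views, hot_pages, db_available=False))  # 17.0
print(failed_percentage(views, hot_pages, db_available=True))   # 0.0
```

The same ratio (5xx responses over total page views) is what the ganglia comparison gives you, just with real counts instead of invented ones.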

For next time

Always deploy squid config changes to a specific host and test there before deploying to the rest of the cluster. Our deploy tools make this easy, and 'curl' is your friend:
- ./deploy sq86
  - # only deploy to sq86
- curl -o /tmp/foo -vvv -H "Host: upload.wikimedia.org" http://sq86.wikimedia.org:3128/wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/62px-Little_kitten_.jpg
  - # request specifically from sq86's backend squid process
Avoid restarting squids at all costs. A restart can take a long time (sometimes 20 minutes) if the squid needs to verify its cache.
Our squids are not in sequential groups; it just looks that way sometimes. Always use a specific list of squids, not a range.
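If you want to script the single-host test recommended above instead of running curl by hand, something like the following could work. This is only a sketch: the function names are hypothetical and not part of our deploy tools, and it assumes that a plain HTTP 200 from the squid is a good enough signal that the new config loaded.

```python
import urllib.request


def build_canary_request(squid_host, port, virtual_host, path):
    """Build a request aimed at one specific squid, like the curl example.

    The Host header selects the site (e.g. upload.wikimedia.org) while the
    URL itself points at the individual squid under test.
    """
    url = f"http://{squid_host}:{port}{path}"
    return urllib.request.Request(url, headers={"Host": virtual_host})


def canary_ok(request, timeout=10):
    """Return True only if the squid answers the request with HTTP 200.

    urlopen raises for 4xx/5xx responses, refused connections and
    timeouts; all of those are OSError subclasses and count as failure.
    """
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False
```

For example, build the request with `build_canary_request("sq86.wikimedia.org", 3128, "upload.wikimedia.org", "/wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/62px-Little_kitten_.jpg")` and only deploy to the rest of your explicit host list if `canary_ok(...)` returns True.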