Incidents/2019-05-03 varnish

Summary

The traffic layer in eqiad reported various HTTP failures when fetching from the application layer. The FetchError messages differed from those seen in the usual varnish-be scalability issue (e.g. not the same as in Incident documentation/20190416-varnish): instead of "Could not get storage" we were getting "HTC status -1" and, less frequently, "http format error". See https://phabricator.wikimedia.org/P8470 and https://phabricator.wikimedia.org/P8469 respectively. "HTC status -1" is reported when the function HTC_RxStuff returns HTC_S_EOF. The "http format error" case was verified not to be occurring at the appserver layer.

These differences initially led us to suspect a separate issue. However, the varnish backends throwing 503 errors were confirmed to be consistently those that had been running the longest, and restarting them solved the problem.

Impact

Approximately 900,000 HTTP 503s were served, mostly between 05:00 and 05:40 UTC.

Detection

Automated: Icinga paged.

Timeline

All times in UTC.

05:01 HTTP 503 errors begin at a slow but increasing rate OUTAGE BEGINS
05:09 Icinga pages about LVS in eqsin and codfw
05:32 Joe restarts varnish backend on cp1077
05:33 failed fetches move to cp1085
05:37 ema summoned
05:41 failed fetches from cp1085 disappear
05:51 failed fetches move to cp1089
06:00 failed fetches from cp1089 disappear OUTAGE ENDS
07:03 failed fetches from cp1089 return OUTAGE BEGINS
07:16 ema restarts varnish backend on cp1089
07:20 error rate returns to normal OUTAGE ENDS
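
As a rough cross-check of the timeline above, the two outage windows can be summed to get the total time spent in an outage state. This is a minimal sketch, not part of the original report: the window boundaries are copied from the OUTAGE BEGINS / OUTAGE ENDS markers, and the helper names (`OUTAGE_WINDOWS`, `window_minutes`, `total_outage_minutes`) are hypothetical.

```python
from datetime import datetime

# Outage windows copied from the timeline markers (all times UTC, same day).
OUTAGE_WINDOWS = [
    ("05:01", "06:00"),  # first OUTAGE BEGINS / OUTAGE ENDS pair
    ("07:03", "07:20"),  # second OUTAGE BEGINS / OUTAGE ENDS pair
]


def window_minutes(start: str, end: str) -> int:
    """Return the length of one HH:MM window in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def total_outage_minutes(windows) -> int:
    """Sum the lengths of all outage windows, in minutes."""
    return sum(window_minutes(start, end) for start, end in windows)


if __name__ == "__main__":
    for start, end in OUTAGE_WINDOWS:
        print(f"{start}-{end}: {window_minutes(start, end)} min")
    print(f"total: {total_outage_minutes(OUTAGE_WINDOWS)} min")
```

Under these assumptions the windows are 59 and 17 minutes long, for 76 minutes in total. This measures only the marked windows, not the severity of the errors within them.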

Graphs:

Conclusions

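The on-disk Varnish storage backend does not scale: the backends serving 503 errors were consistently the ones that had been running the longest, and restarting them cleared the errors.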

Actionables

Unfortunately, there is no action that can be taken immediately. In the medium term, our strategy is to move all cache backends to Apache Traffic Server.