Incidents/2021-02-01 swift-codfw

From Wikitech

document status: in-review

Summary

Today at 11:41 the icinga check for `ms-fe.svc.codfw.wmnet` timed out and thus paged (and recovered three minutes later):

11:41 -icinga-wm:#wikimedia-operations- PROBLEM - LVS swift-https codfw port 443/tcp - Swift/Ceph media 
          storage IPv4 #page on ms-fe.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 
          seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
...
11:44 -icinga-wm:#wikimedia-operations- RECOVERY - LVS swift-https codfw port 443/tcp - Swift/Ceph media 
          storage IPv4 #page on ms-fe.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 
          0.140 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

Users in codfw/ulsfo/eqsin have experienced ~15min of higher latency (possibly timeouts) for hit-local and miss requests (10-25% of the site's requests, depending on the site).

Screenshot from https://grafana.wikimedia.org/d/8T2XA-5Gz/frontend-ats-tls-ttfb-latency

Specifically hitting /monitoring/backend timed out, this in turn meant that some of the backend server(s) where the monitoring container lives were slow/unhealthy. Case in point ms-be2033.codfw.wmnet was reported as slow from /var/log/swift/server.log on e.g. ms-fe2006.codfw.wmnet:

Feb  1 11:44:29 ms-fe2006 proxy-server: ERROR with Object server 10.192.16.15:6000/sdk1 re: Trying to GET /v1/AUTH_mw/monitoring/backend: Timeout (10.0s) (txn: txe96767fb630b4828af04a-006017e993) (client_ip: 208.80.154.88)

The slowness was induced by an earlier swift rebalance (bug T272837) and the way we do rebalances at the moment means that such operations are generally noisy/impactful to the cluster (e.g. bug T221904, bug T271415). Swift has been depooled internally from its discovery record (essentially anticipating bug T267338).


Actionables

  • Change /monitoring/backend to /monitoring/frontend (i.e. check the frontend itself) for icinga service check and pybal's proxyfetch bug T273453
  • Consider depooling swift's discovery records during rebalances bug T273453