Incident documentation/2021-02-01 swift-codfw
document status: in-review
Today at 11:41 the icinga check for `ms-fe.svc.codfw.wmnet` timed out and thus paged (and recovered three minutes later):
11:41 -icinga-wm:#wikimedia-operations- PROBLEM - LVS swift-https codfw port 443/tcp - Swift/Ceph media storage IPv4 #page on ms-fe.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems ... 11:44 -icinga-wm:#wikimedia-operations- RECOVERY - LVS swift-https codfw port 443/tcp - Swift/Ceph media storage IPv4 #page on ms-fe.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
Users in codfw/ulsfo/eqsin have experienced ~15min of higher latency (possibly timeouts) for hit-local and miss requests (10-25% of the site's requests, depending on the site).
Specifically hitting /monitoring/backend timed out, this in turn meant that some of the backend server(s) where the monitoring container lives were slow/unhealthy. Case in point ms-be2033.codfw.wmnet was reported as slow from /var/log/swift/server.log on e.g. ms-fe2006.codfw.wmnet:
Feb 1 11:44:29 ms-fe2006 proxy-server: ERROR with Object server 10.192.16.15:6000/sdk1 re: Trying to GET /v1/AUTH_mw/monitoring/backend: Timeout (10.0s) (txn: txe96767fb630b4828af04a-006017e993) (client_ip: 126.96.36.199)
The slowness was induced by an earlier swift rebalance (bug T272837) and the way we do rebalances at the moment means that such operations are generally noisy/impactful to the cluster (e.g. bug T221904, bug T271415). Swift has been depooled internally from its discovery record (essentially anticipating bug T267338).