Incident documentation/2021-11-25 eventgate-main outage

From Wikitech
Jump to navigation Jump to search

document status: in-review

Summary

During the helm3 migration of eqiad Kubernetes cluster the service eventgate-main experience an outage. The service was not available between 7:32 and 7:35 UTC.

For the helm3 migration the service had to be removed and re-deployed to the cluster. Most Kubernetes services were explicitly pooled in codfw-only during the re-deployments. eventgate-main was also falsely assumed to be served by Codfw but was still pooled in Eqiad. So during the time of removing and re-creating the pods, no traffic could be served for this service.

The commands used to migrate and re-deploy codfw (see T251305#7492328) were adapted and re-used for eqiad (see T251305#7526591). Due to a small difference in what Kubernetes services are pooled as active-active and what are active-passive, eventgate-main was missing in the depooling command (as is it not pooled in codfw currently).

Impact: For about 3 minutes (from 7:32 to 7:35 UTC), eventgate-main was unavailable. This resulted in 25,000 unrecoverable MediaWiki backend errors due to inability to queue new jobs. About 1,000 user-facing web requests and API requests failed with an HTTP 500 error. Event intake processing rate measured by eventgate briefly dropped from ~3000/second to 0/second during the outage.

Documentation:

Actionables