Incidents/2021-11-25 eventgate-main outage


document status: in-review

Summary and Metadata

Incident ID: 2021-11-25 eventgate-main outage
Incident Task: https://phabricator.wikimedia.org/T299970
UTC Start Timestamp: 2021-11-25 07:32
UTC End Timestamp: 2021-11-25 07:35
People Paged: 0
Responder Count: 1
Coordinator(s): No coordinator needed
Relevant Metrics / SLO(s) affected:
  • 25k MediaWiki backend errors
  • 1k web & API requests resulted in a 500
  • Event intake dropped to 0 (from 3k) for the duration

No SLO defined, no error budget consumed.

Summary: During the helm3 migration of the eqiad Kubernetes cluster, the service eventgate-main experienced a short outage: it was unavailable between 07:32 and 07:35 UTC. The user-facing effects are detailed under Impact below.

For the helm3 migration, the service had to be removed and re-deployed to the cluster. Most Kubernetes services were explicitly pooled in codfw only during the re-deployments. eventgate-main was incorrectly assumed to be served from codfw as well, but was in fact still pooled in eqiad. So while its pods were being removed and re-created, no traffic could be served for this service.
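
For context, pooling for these services is controlled through conftool's discovery objects. A hedged illustration of how a service's pooled state can be inspected and changed follows; the exact commands used during the migration are recorded in T251305, and these lines are only a sketch:

    # Illustrative only; see T251305 for the commands actually used.
    # Inspect where eventgate-main is currently pooled.
    confctl --object-type discovery select 'dnsdisc=eventgate-main' get

    # Depool it from eqiad so codfw serves its traffic during re-deployment.
    confctl --object-type discovery select 'dnsdisc=eventgate-main,name=eqiad' set/pooled=false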

The commands used to migrate and re-deploy codfw (see T251305#7492328) were adapted and re-used for eqiad (see T251305#7526591). Due to a small difference between which Kubernetes services are pooled active-active and which active-passive, eventgate-main was missing from the depooling command (as it is not currently pooled in codfw).
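
To illustrate how the gap arose (the service list below is hypothetical; the real commands are in the task comments linked above): the depool loop was derived from the codfw migration, where eventgate-main was never pooled, so it was absent from the list when the loop was re-run against eqiad.

    # Hypothetical sketch of the re-used depool loop. The service list came
    # from the codfw migration; eventgate-main was not in it because it is
    # active-passive and was pooled in eqiad only.
    for svc in eventgate-analytics eventgate-logging-external eventstreams; do
        confctl --object-type discovery select "dnsdisc=${svc},name=eqiad" set/pooled=false
    done
    # Never executed, hence the outage:
    # confctl --object-type discovery select 'dnsdisc=eventgate-main,name=eqiad' set/pooled=false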

Impact: For about 3 minutes (07:32 to 07:35 UTC), eventgate-main was unavailable. This resulted in about 25,000 unrecoverable MediaWiki backend errors due to the inability to queue new jobs. About 1,000 user-facing web and API requests failed with an HTTP 500 error. The event intake rate measured by eventgate briefly dropped from ~3,000/second to 0/second during the outage.

Documentation:

Scorecard

Question Score Notes

People
Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no) 1
Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no) 1
Were more than 5 people paged? (score 0 for yes, 1 for no) 1
Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no) N/A
Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours) N/A

Process
Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no) N/A
Was the public status page updated? (score 1 for yes, 0 for no) N/A
Is there a Phabricator task for the incident? (score 1 for yes, 0 for no) 0
Are the documented action items assigned? (score 1 for yes, 0 for no) 1
Is this a repeat of an earlier incident? (score 0 for yes, 1 for no) 0

Tooling
Were there, before the incident occurred, open tasks that would have prevented this incident or made mitigation easier if implemented? (score 0 for yes, 1 for no) 0
Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no) N/A
Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no) N/A
Were all engineering tools required available and in service? (score 1 for yes, 0 for no) 1
Was there a runbook for all known issues present? (score 1 for yes, 0 for no) 0

Total score 5

Actionables