Incidents/2021-07-14 eventgate-analytics latency spike caused MW app server overload

document status: in-review

Summary

While working on updating EventGate to support Prometheus, Andrew Otto deployed the changes to eventgate-analytics in codfw (then-active DC). This change removed the prometheus-statsd-exporter container in favor of direct Prometheus support, as added in recent versions of service-runner and service-template-node.

The deploy went fine in the idle "staging" and "eqiad" clusters. When it reached codfw, request latency from MediaWiki to eventgate-analytics spiked. PHP worker slots on the MediaWiki app servers filled up as a result, and some MediaWiki API requests failed.

The helm tool noticed that the eventgate-analytics deploy to codfw itself was not doing well, and auto-rolled back the deployment:

$ kube_env eventgate-analytics codfw; helm history production
REVISION	UPDATED                 	STATUS    	CHART           	APP VERSION	DESCRIPTION
[...]
4       	Wed Jul 14 16:07:12 2021	SUPERSEDED	eventgate-0.3.1 	           	Upgrade "production" failed: timed out waiting for the co...
5       	Wed Jul 14 16:17:18 2021	DEPLOYED  	eventgate-0.2.14	           	Rollback to 3

Impact: For ~10 minutes, MediaWiki API clients experienced request failures.

Documentation:

Actionables

Figure out why this happened and fix it. Based on this log message, it seems likely that a bug in the service-runner Prometheus integration caused the Node.js worker process to die. [DONE]
- Further investigation uncovered that calling require('prom-client') within a worker causes the observed issue. Both service-runner and node-rdkafka-prometheus require prom-client. It was proposed to patch node-rdkafka-prometheus so that it accepts a prom-client instance passed in by the caller.
- node-rdkafka-prometheus is an unmaintained project, so we have forked it to @wikimedia/node-rdkafka-prometheus and fixed the issue there. Additionally, if this issue in prom-client is fixed, we probably won't need the patch we made to node-rdkafka-prometheus for this fix.
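
The idea behind the fix, passing the metrics client in instead of calling require('prom-client') a second time inside the library, can be sketched as plain dependency injection. This is an illustrative sketch only, not the actual patch to node-rdkafka-prometheus: createKafkaMetrics, FakeGauge, fakePromClient and the metric name are all hypothetical, and a small stand-in object is used in place of prom-client so the example runs without any dependencies.

```javascript
// Hypothetical stand-in for the metrics client module, so this sketch
// runs without installing anything. It is NOT the real prom-client.
class FakeGauge {
  constructor({ name, help }) {
    this.name = name;
    this.help = help;
    this.value = 0;
  }

  set(value) {
    this.value = value;
  }
}

const fakePromClient = { Gauge: FakeGauge };

// Hypothetical collector factory. It never requires a metrics module
// itself; the caller (for example the service framework) passes in the
// one instance it already loaded.
function createKafkaMetrics({ client }) {
  if (!client) {
    throw new Error('a metrics client instance must be passed in');
  }

  // Illustrative metric; the name is made up for this sketch.
  const queueSize = new client.Gauge({
    name: 'example_kafka_producer_queue_size',
    help: 'Messages waiting in the producer queue (illustrative)',
  });

  return {
    queueSize,
    // Record a value taken from a stats object supplied by the caller.
    observe(stats) {
      queueSize.set(stats.queued);
    },
  };
}

// The caller owns the single client instance and hands it over.
const metrics = createKafkaMetrics({ client: fakePromClient });
metrics.observe({ queued: 7 });
console.log(metrics.queueSize.name, metrics.queueSize.value);
```

With this shape, only one module in the worker process loads the metrics client, and every other component reuses that instance.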