
Event Platform/EventGate occasionally fails to ingest specific schemas

From Wikitech

This page contains some notes and a post mortem of https://phabricator.wikimedia.org/T326002.

This incident might not technically be an outage, but it is a good case study for troubleshooting and operational practices.

The root cause was an instance of EventGate configured with dynamic schema loading (no bundled schemas). Cache refreshes requested a larger config than the EventStreamConfig service was set to return, causing schema loading to fail.
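
The failure mode can be illustrated with a minimal sketch. It assumes a config service that silently caps the number of entries it returns. `fetch_stream_configs`, `MAX_RESULTS`, and the stream names are hypothetical and are not the actual EventGate or EventStreamConfig API. Comparing requested names against returned names makes a truncated response visible at refresh time, rather than surfacing later as a schema loading failure.

```python
# Hypothetical stand-in for a config service that caps its response size.
MAX_RESULTS = 3  # assumed server-side limit, for illustration only

ALL_CONFIGS = {f"stream_{i}": {"schema_title": f"schema/{i}"} for i in range(5)}


def fetch_stream_configs(names):
    """Return configs for the requested names, truncated at MAX_RESULTS."""
    found = [n for n in names if n in ALL_CONFIGS]
    return {n: ALL_CONFIGS[n] for n in found[:MAX_RESULTS]}


def missing_streams(requested, response):
    """Names that were requested but are absent from the response."""
    return sorted(set(requested) - set(response))


requested = list(ALL_CONFIGS)
response = fetch_stream_configs(requested)
missing = missing_streams(requested, response)
if missing:
    print(f"incomplete config response, missing: {missing}")
```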

Timeline

1. Dec 29 2022, 10:43 AM. The Phabricator task is opened. It reports a data analytics job that fails to produce canary events.

2. Fri, Sep 15, 9:05 AM. Joseph flags the task in Slack. Joseph correctly identifies the EventGate service as the root cause and escalates to an EventGate maintainer (Gabriele).

3. Fri, Sep 15, 1:06 PM. Joseph and Gabriele triage and find that the issue only affects the eventgate-analytics-external instance. Error rates correlate with eventbus log entries.

4. Fri, Sep 15, 2:50 PM. Sam helps narrow down the issue, and provides a workaround. The workaround patches the EventStreamConfig service (Sam is the maintainer).

5. Mon, Sep 18, 9:17 PM. Follow-up from Gabriele: OK on the patch, plus a proposal to improve alert coverage for all EventGate instances.

6. Now. The EventStreamConfig patch is scheduled for deployment on the upcoming train. The Alertmanager patch is in review.

Lessons learnt

1. This issue would have been caught earlier if we had alerting in place.

2. Severity was underestimated. Since the issue seemed to only relate to canary events, it was not prioritized.
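
As a rough illustration of why per-instance alerting matters here, the sketch below evaluates an error ratio for each instance separately. The request counts, the threshold, and the `other-instance-*` names are made up for illustration; only `eventgate-analytics-external` comes from this page. This is not the Alertmanager rule under review, just the underlying arithmetic.

```python
# Illustrative per-instance request counts over one evaluation window.
# The numbers are made up; only the first instance name comes from this page.
WINDOW_COUNTS = {
    "eventgate-analytics-external": {"total": 2_000, "failed": 120},
    "other-instance-a": {"total": 50_000, "failed": 5},
    "other-instance-b": {"total": 48_000, "failed": 3},
}

ERROR_RATIO_THRESHOLD = 0.01  # assumed threshold: alert above 1% failures


def error_ratio(counts):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return counts["failed"] / counts["total"] if counts["total"] else 0.0


def instances_over_threshold(window_counts, threshold):
    """Names of instances whose error ratio exceeds the threshold."""
    return sorted(
        name for name, counts in window_counts.items()
        if error_ratio(counts) > threshold
    )


total = sum(c["total"] for c in WINDOW_COUNTS.values())
failed = sum(c["failed"] for c in WINDOW_COUNTS.values())
aggregate_ratio = failed / total
alerting = instances_over_threshold(WINDOW_COUNTS, ERROR_RATIO_THRESHOLD)
print(f"aggregate error ratio: {aggregate_ratio:.4f}")
print(f"instances over threshold: {alerting}")
```

With these made-up numbers the aggregate ratio stays under the threshold while one instance is well above it, which is the situation a single fleet-wide alert would miss.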


Impact

This incident impacted an instance of EventGate that supports legacy instrumentation systems: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#eventgate-analytics-external

Mobile clients might have been impacted, but it is unclear to what extent.

Follow up

1. EventGate should be able to fetch configs with result pagination. We should evaluate the need for dynamic configs. We should improve logging and error reporting.

2. EventGate will be covered by an SLO.

3. EventBus will be covered by an SLO.

4. All EventGate instances should alert on SLOs.
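
The pagination follow-up in item 1 could look roughly like the sketch below. It assumes a service that returns one page of configs together with a marker for the next page. `fetch_page`, `PAGE_SIZE`, and the offset-based marker are hypothetical and do not describe the real EventStreamConfig API. The client keeps requesting pages until the service reports there are none left, so the result does not depend on a single response being large enough.

```python
# Hypothetical paginated config service; not the real EventStreamConfig API.
PAGE_SIZE = 2  # assumed server-side page size

ALL_CONFIGS = {f"stream_{i}": {"schema_title": f"schema/{i}"} for i in range(5)}


def fetch_page(offset):
    """Return one page of configs and the next offset (None on the last page)."""
    names = sorted(ALL_CONFIGS)
    page = {n: ALL_CONFIGS[n] for n in names[offset:offset + PAGE_SIZE]}
    next_offset = offset + PAGE_SIZE
    return page, (next_offset if next_offset < len(names) else None)


def fetch_all_configs():
    """Follow next-offset markers until the service reports no more pages."""
    configs = {}
    offset = 0
    while offset is not None:
        page, offset = fetch_page(offset)
        configs.update(page)
    return configs


configs = fetch_all_configs()
print(f"fetched {len(configs)} stream configs")
```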