Jump to content

Incidents/2023-07-19 enrich job outage

From Wikitech

document status: in-review


Incident metadata (see Incident Scorecard)
Incident ID 2023-07-19 enrich job outage Start 2023-07-19 18:50:30
Task End 2023-07-19 20:30:00
People paged 0 Responder count 2
Coordinators GModena Affected metrics/SLOs Flink Taskmanager uptime, service availability.
Impact The mw-page-content-change-enrich application (eqiad) has not been producing enriched events during the outage.


An enriched message exceeding Kafka's max.request.size caused the application's Kafka producer to crash. This in turn resulted in a Flink Taskmanager shutdown.

Full stacktrace is available at: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.29?id=EeKfdIkB6U_kV85ADO_D

HA tried to restart the application but the restart strategy, failed more than restart-strategy.fixed-delay.attempts times.

A manual application restart was required. The offending message has seemingly been discarded.

A fix is proposed in task T342399


  • 20:10:38 UTC: GModena opens thread to ACK the outage in #data-platform-engineering
  • 20:20:24 UTC: TChin identifies the root cause of the issue (
  • 20:24:00 UTC: GModena manually restarts the application
  • 20:44:00 UTC: GModena silences Kafka Consumer lag alerts while the application catches up with queues messages.


GModena and TChin reacted to alerts triggered by degrading SLIs.


  • Application is running.
  • We know what needs to be fixed.

What went well?

What went poorly?

Where did we get lucky?

Links to relevant documentation


  • increase max message size allowed by Kafka
  • Filter out messages larger than the max allowed size; task T342399