MediaWiki Event Enrichment/Incidents/2023 07 19 enrich job outage

document status: in-review

Summary

Incident metadata (see Incident Scorecard)

  • Incident ID: 2023 07 19 enrich job outage
  • Task: (none listed)
  • Start: 2023-07-19 18:50:30
  • End: 2023-07-19 20:30:00
  • People paged: 0
  • Responder count: 2
  • Coordinators: GModena
  • Affected metrics/SLOs: Flink Taskmanager uptime, service availability.
  • Impact: The mw-page-content-change-enrich application (eqiad) did not produce enriched events during the outage.

Description

An enriched message exceeding Kafka's max.request.size caused the application's Kafka producer to crash. This in turn resulted in a Flink Taskmanager shutdown.

Full stacktrace is available at: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.29?id=EeKfdIkB6U_kV85ADO_D

HA tried to restart the application, but the restarts failed more than restart-strategy.fixed-delay.attempts times, which exhausted the restart strategy.
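
The restart behaviour described above can be illustrated with a small simulation. This is a minimal sketch in Python, not Flink's implementation: `simulate_fixed_delay` and `always_crashing_job` are hypothetical names, the delay between attempts is omitted, and the sketch assumes every restart hits the same failure, which this report does not confirm.

```python
from typing import Callable


def simulate_fixed_delay(run_job: Callable[[], None], max_attempts: int) -> str:
    """Model a fixed-delay restart strategy (the delay itself is omitted).

    The job is restarted after each failure. Once the number of restarts
    would exceed max_attempts, the job fails terminally.
    """
    restarts = 0
    while True:
        try:
            run_job()
            return "FINISHED"
        except Exception:
            if restarts >= max_attempts:
                # Restart budget exhausted: no further automatic recovery.
                return "FAILED"
            restarts += 1


def always_crashing_job() -> None:
    # Hypothetical stand-in for a producer that rejects the same record each time.
    raise RuntimeError("record too large")


print(simulate_fixed_delay(always_crashing_job, max_attempts=3))  # FAILED
```

A failure that repeats on every attempt uses up the restart budget, and the job then stays down until someone intervenes, which matches the manual restart noted in this report.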

A manual application restart was required. The offending message has seemingly been discarded.

A fix is proposed in task T342399.

Timeline

  • 20:10:38 UTC: GModena opens thread to ACK the outage in #data-platform-engineering
  • 20:20:24 UTC: TChin identifies the root cause of the issue (
  • 20:24:00 UTC: GModena manually restarts the application
  • 20:44:00 UTC: GModena silences Kafka Consumer lag alerts while the application catches up with queued messages.

Detection

GModena and TChin reacted to alerts triggered by degrading SLIs.

Conclusions

  • Application is running.
  • We know what needs to be fixed.

What went well?

What went poorly?

Where did we get lucky?

Links to relevant documentation

Actionables

  • Increase the maximum message size allowed by Kafka
  • Filter out messages larger than the max allowed size; task T342399
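
The filtering actionable could look roughly like the sketch below. It is not the fix tracked in T342399: `split_by_size`, `MAX_REQUEST_SIZE`, the JSON serialization and the `page_id`/`content` fields are assumptions made for illustration. The 1048576-byte limit is the Kafka producer's default for max.request.size; the value configured for this application is not stated in this report. The check measures only the serialized value, so it slightly underestimates the real request size (key, headers and record overhead are ignored).

```python
import json
from typing import Iterable

# Kafka producer default for max.request.size (1 MiB). Assumed here; the
# application's configured value may differ.
MAX_REQUEST_SIZE = 1_048_576


def split_by_size(events: Iterable[dict], max_bytes: int = MAX_REQUEST_SIZE):
    """Separate events whose serialized form fits in max_bytes from those that do not."""
    accepted, oversized = [], []
    for event in events:
        payload = json.dumps(event).encode("utf-8")
        if len(payload) <= max_bytes:
            accepted.append(payload)
        else:
            # Record only the size so the rejection can be logged or counted
            # without holding on to the large payload.
            oversized.append({"size": len(payload)})
    return accepted, oversized


small = {"page_id": 1, "content": "x" * 10}
large = {"page_id": 2, "content": "x" * (2 * MAX_REQUEST_SIZE)}
ok, rejected = split_by_size([small, large])
print(len(ok), len(rejected))  # 1 1
```

Routing oversized events to a separate list, rather than letting the producer raise, keeps a single large message from taking the whole job down.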

Scorecard