Jump to content

Portal:Toolforge/Admin/Runbooks/ToolforgeWebHighErrorRate

From Wikitech
The procedures in this runbook require admin permissions to complete.

The ToolforgeWebHighErrorRate alert fires when a the number of overall 5xx responses from Toolforge tool web requests exceeds 25%.

Debugging

The alert links to the HAProxy Grafana dashboard. If that does not show anything obvious, then the issue is likely on the Kubernetes cluster end.

Note that the 5xx count includes individual broken tools. That's "normal" and an issue for the maintainers of those individual tools to fix. This alert is there to detect large, platform-wide issues that need urgent attention.

HAProxy debugging

Some tricks to help debug HAProxy:

Show currently active connections per tool
$ echo "show table limit-per-tool" | sudo socat unix-connect:/run/haproxy/admin.sock stdio

Common issues

Support contacts

Old incidents

  • T399281 2025-07-11 Ceph issues
  • T404471 2025-09-22 HAProxy connection limit reached due to individual slow popular tools