Service status: Production

Documentation status: Ready

Goal: Lower the high ICMP load on LVS/CP servers by offloading echo requests to a dedicated server.

Linux has internal ICMP rate limiters that can cause the kernel to drop valuable ICMP packets under heavy load. Offloading ICMP echo ensures that the "important" ICMP traffic (e.g. PMTU discovery) doesn't get dropped.

Deployment

Deployment task: https://phabricator.wikimedia.org/T190090

Rebuilt in: https://phabricator.wikimedia.org/T295767

eqiad: ping1002.eqiad.wmnet

codfw: ping2002.codfw.wmnet

esams: ping3002.esams.wmnet

Monitoring

Icinga: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ping1001&style=hostservicedetail (and ping2001)

Grafana dashboard: https://grafana.wikimedia.org/d/000000513/ping-offload

External monitoring: Ping to VIPs configured in Watchmouse

InAddrErrors alert

This alert is raised by the Grafana dashboard alerting.

It means the server is receiving packets for an IP address that is not configured on the server.
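To check the counter directly on the ping host, the kernel's IP statistics can be read from /proc/net/snmp. The following is a minimal sketch; in_addr_errors is a hypothetical helper name, and it is an assumption that the dashboard alert is based on this same kernel counter.

```shell
# Hypothetical helper: print the InAddrErrors value from /proc/net/snmp-style
# input on stdin. The file has a header line and a value line, both starting
# with "Ip:", so the column index is looked up in the header first.
in_addr_errors() {
    awk '
        $1 == "Ip:" && !col {
            for (i = 2; i <= NF; i++) if ($i == "InAddrErrors") col = i
            next
        }
        $1 == "Ip:" && col { print $col; exit }
    '
}

# Usage on the ping host (run it twice to see whether the counter is growing):
#   in_addr_errors < /proc/net/snmp
```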

  1. Run ip addr to check that all the redirected IPs are present on the loopback interface
    1. If not, they can be added manually as a temporary fix with ip addr add <ip>/32 dev lo:ping_offload
  2. If the IPs are present, use tcpdump to find the IP in question (e.g. by filtering out all the present IPs)
  3. In any case, or if the troubleshooting takes too long, disable the redirect (see below)
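Steps 1 and 2 above can be sketched as two small shell helpers. This is an illustration under assumptions, not part of the deployed tooling: missing_ips and exclude_filter are hypothetical names, the 198.51.100.x addresses are documentation placeholders standing in for the real VIPs, and ens5 is the interface name taken from the tcpdump example further down this page.

```shell
# Hypothetical helper: read the addresses currently configured (one per line)
# on stdin and print every expected IP given as an argument that is missing.
missing_ips() {
    present="$(cat)"
    for ip in "$@"; do
        printf '%s\n' "$present" | grep -qxF -- "$ip" || printf '%s\n' "$ip"
    done
}

# Hypothetical helper: build a tcpdump filter that hides packets destined to
# the given (already present) IPs, leaving only the unexpected destinations.
exclude_filter() {
    filter=""
    for ip in "$@"; do
        filter="${filter:+$filter and }not dst host $ip"
    done
    printf '%s\n' "$filter"
}

# Usage on the ping host (placeholder VIPs):
#   ip -4 -o addr show dev lo | awk '{print $4}' | cut -d/ -f1 \
#       | missing_ips 198.51.100.10 198.51.100.11
#   sudo tcpdump -nn -i ens5 "$(exclude_filter 198.51.100.10 198.51.100.11)"
```

If missing_ips prints nothing, all expected IPs are on the loopback and the tcpdump output should show only traffic for the address that is causing the errors.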

How-to

Deploy a new host

  1. Create a VM (see the existing VMs at https://netbox.wikimedia.org/virtualization/virtual-machines/?q=ping)
  2. Assign the ping_offload role in Puppet (e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/564873)
  3. Add the target VIP to its configuration (e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/564908)
  4. Set the VIP and ping host in Homer (e.g. https://gerrit.wikimedia.org/r/c/operations/homer/public/+/564917)

Temporarily stop the ICMP echo redirect

Use this procedure if the system is showing signs of issues or needs to go down for maintenance.

On both cr1 and cr2 routers of the target site, enter the following commands:

# deactivate firewall family inet filter border-in4 term offload-ping4

# deactivate firewall family inet filter transport-in4 term offload-ping4

Then verify that the pending changes are correct. The output should be similar to:

# show | compare
[edit firewall family inet filter border-in4]
!       inactive: term offload-ping4 { ... }
[edit firewall family inet filter transport-in4]
!       inactive: term offload-ping4 { ... }

Finish by committing the changes (replace <TASK #> with a Phabricator task ID or a relevant comment):

# commit comment "<TASK #>"

To confirm that the change is effective, monitor tcpdump on the ping host (for example sudo tcpdump -i ens5 icmp -nn) or the dashboard.

To re-activate the redirect, repeat the steps above, replacing deactivate with activate.
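Because the deactivate and activate commands differ only in the verb, they can be generated instead of retyped. The following is a minimal sketch; redirect_commands is a hypothetical helper name, not an existing tool. Its output still has to be pasted into configuration mode on both cr1 and cr2, reviewed with show | compare, and committed as described above.

```shell
# Hypothetical helper: print the Junos configuration commands that toggle the
# offload-ping4 term in both filters named in this procedure.
redirect_commands() {
    action="$1"
    case "$action" in
        activate|deactivate) ;;
        *)
            echo "usage: redirect_commands activate|deactivate" >&2
            return 1
            ;;
    esac
    for filter_name in border-in4 transport-in4; do
        printf '%s firewall family inet filter %s term offload-ping4\n' \
            "$action" "$filter_name"
    done
}

# Usage:
#   redirect_commands deactivate   # before maintenance
#   redirect_commands activate     # to restore the redirect
```

Rejecting any verb other than activate or deactivate keeps a typo from producing a command that Junos would interpret differently.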

Possible improvements

  • Use BGP flowspec to automatically advertise/remove the redirect
  • Add IPv6 support
  • Have multiple ping servers per site for redundancy