Ping offload
Service status: Production
Documentation status: Ready
Goal: Lower the high ICMP load on LVS/CP servers by offloading echo requests to a dedicated server.
Linux has internal ICMP rate limiters that can cause the kernel to drop valuable ICMP packets. By offloading ICMP echo, we make sure that the "important" ICMP traffic (e.g. PMTU discovery) doesn't get dropped.
Deployment
Deployment task: https://phabricator.wikimedia.org/T190090
Rebuilt in: https://phabricator.wikimedia.org/T295767
eqiad: ping1002.eqiad.wmnet
codfw: ping2002.codfw.wmnet
esams: ping3002.esams.wmnet
Monitoring
Icinga: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ping1001&style=hostservicedetail (and ping2001)
Grafana dashboard: https://grafana.wikimedia.org/d/000000513/ping-offload
External monitoring: Ping to VIPs configured in Watchmouse
InAddrErrors alert
This alert is raised by the Grafana dashboard alerting.
It means the server is receiving packets for an IP address that is not configured on the server.
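To confirm on the host that the counter is still increasing, it can be read directly from the kernel. The sketch below assumes the alert is derived from the kernel's IP-level InAddrErrors counter as exposed in /proc/net/snmp; that mapping is an assumption and is not stated on this page, and the helper name ip_counter is hypothetical.

```python
def ip_counter(snmp_text, name):
    """Return one counter from the "Ip:" section of /proc/net/snmp content.

    The file holds pairs of lines per protocol: a header line with the
    field names, followed by a line with the matching values.
    """
    header = None
    for line in snmp_text.splitlines():
        if not line.startswith("Ip:"):
            continue
        fields = line.split()[1:]
        if header is None:
            header = fields          # first "Ip:" line: field names
        else:
            return int(fields[header.index(name)])  # second line: values
    raise ValueError("no complete Ip: section found")


def read_in_addr_errors(path="/proc/net/snmp"):
    """Read the current InAddrErrors value on a Linux host."""
    with open(path) as handle:
        return ip_counter(handle.read(), "InAddrErrors")
```

Calling read_in_addr_errors() twice a few seconds apart shows whether packets for unknown addresses are still arriving.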
- Run ip addr to check whether all the redirected IPs are present on the loopback interface.
  - If not, they can be added manually as a temporary measure with: ip addr add <ip>/32 dev lo:ping_offload
- If the IPs are present, use tcpdump to find the IP in question (e.g. by filtering out all the IPs that are present).
- In any case, or if the troubleshooting takes too long, disable the redirect (see "Temporarily stop the ICMP echo redirect" below).
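The two checks above can be sketched as a small helper: one function lists the expected IPs that are missing from the loopback interface, the other builds a tcpdump filter that hides ICMP sent to the IPs that are present. This is a minimal sketch, not the tooling used in production. It assumes the JSON printed by ip -j -4 addr show dev lo is passed in as text, and the function names and the example addresses (from the 192.0.2.0/24 documentation range) are hypothetical.

```python
import json


def loopback_ips(ip_json):
    """Return the set of IPv4 addresses in `ip -j -4 addr show dev lo` output."""
    found = set()
    for iface in json.loads(ip_json):
        for addr in iface.get("addr_info", []):
            if addr.get("family") == "inet":
                found.add(addr["local"])
    return found


def missing_ips(expected, ip_json):
    """Return the expected IPs that are not configured, in the given order."""
    present = loopback_ips(ip_json)
    return [ip for ip in expected if ip not in present]


def tcpdump_filter(present):
    """Build a capture filter for ICMP sent to any address NOT in `present`."""
    if not present:
        return "icmp"
    known = " or ".join(f"dst host {ip}" for ip in present)
    return f"icmp and not ({known})"


# Hypothetical sample data standing in for the real command output.
SAMPLE = json.dumps([{
    "ifname": "lo",
    "addr_info": [
        {"family": "inet", "local": "127.0.0.1", "prefixlen": 8},
        {"family": "inet", "local": "192.0.2.10", "prefixlen": 32},
    ],
}])

print(missing_ips(["192.0.2.10", "192.0.2.11"], SAMPLE))
print(tcpdump_filter(sorted(loopback_ips(SAMPLE) - {"127.0.0.1"})))
```

The resulting filter string has to be quoted when passed to tcpdump on a shell command line, because it contains parentheses.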
How-to
Deploy a new host
- Create a VM, see existing VMs on https://netbox.wikimedia.org/virtualization/virtual-machines/?q=ping
- Assign the ping_offload role in Puppet (e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/564873)
- Add the target VIP to its configuration (e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/564908)
- Set the VIP and ping host in Homer (e.g. https://gerrit.wikimedia.org/r/c/operations/homer/public/+/564917)
Temporarily stop the ICMP echo redirect
Use this procedure if the system is showing signs of issues or needs to go down for maintenance.
On both cr1 and cr2 routers of the target site, enter the following commands:
# deactivate firewall family inet filter border-in4 term offload-ping4
# deactivate firewall family inet filter transport-in4 term offload-ping4
Then verify that the changes about to be made are correct. The output should be similar to:
# show | compare
[edit firewall family inet filter border-in4]
! inactive: term offload-ping4 { ... }
[edit firewall family inet filter transport-in4]
! inactive: term offload-ping4 { ... }
Finish by committing the changes (replace <TASK #> with a Phabricator task ID or a relevant comment):
# commit comment "<TASK #>"
To confirm that the change is effective, monitor tcpdump on the ping host (for example sudo tcpdump -i ens5 icmp -nn) or the dashboard.
To re-activate the redirect, repeat the steps above on both routers, replacing deactivate with activate in each command.
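Because the two procedures differ only in the verb, the command sequence for either direction can be generated from one template. The sketch below assumes the filter and term names shown in the commands above. The helper name redirect_commands and the task ID T000000 are hypothetical placeholders, and the output is meant to be reviewed and pasted into a router configuration session by hand, not applied automatically.

```python
# Filter and term names as they appear in the commands on this page.
FILTERS = ("border-in4", "transport-in4")
TERM = "offload-ping4"


def redirect_commands(action, task):
    """Return the configuration-mode commands that toggle the redirect.

    action: "deactivate" to stop the redirect, "activate" to restore it.
    task:   text used as the commit comment (e.g. a Phabricator task ID).
    """
    if action not in ("deactivate", "activate"):
        raise ValueError(f"unsupported action: {action!r}")
    commands = [
        f"{action} firewall family inet filter {name} term {TERM}"
        for name in FILTERS
    ]
    commands.append("show | compare")            # review before committing
    commands.append(f'commit comment "{task}"')
    return commands


# T000000 is a placeholder, not a real task.
for line in redirect_commands("activate", "T000000"):
    print(line)
```

The same sequence has to be entered on both cr1 and cr2 of the target site.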
Possible improvements
- Use BGP flowspec to automatically advertise/remove the redirect
- Add IPv6 support
- Have multiple ping servers per site for redundancy