Monitoring/check eth

Check Eth Alerts

These AlertManager checks flag problems that physical servers report about their Ethernet NICs.

In all cases the most likely cause is a faulty cable, so the first action should be to check or replace the physical cable running to the port.

InterfaceDuplexError

This will alert if any server interface (filtered to those whose names start with 'e') reports a duplex status other than "full" (desired for an up interface) or "unknown" (expected for a down interface). The information comes from the Prometheus metric node_network_info.

The worry here is that due to a bad cable the auto-negotiation process has failed on a copper link and the device has defaulted to half-duplex.
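The rule described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the actual alert expression: the function name and the sample data are hypothetical, and it assumes the duplex values arrive as plain strings such as "full", "half" or "unknown".

```python
def duplex_problems(duplex_by_interface):
    """Return the interfaces whose duplex value would trip the rule described above."""
    acceptable = {"full", "unknown"}  # "full" for up links, "unknown" for down links
    return sorted(
        name
        for name, duplex in duplex_by_interface.items()
        # Only interfaces whose names start with 'e' are considered.
        if name.startswith("e") and duplex not in acceptable
    )


# Hypothetical sample data, not taken from a real host.
sample = {"eno1": "full", "eno2": "unknown", "enp5s0f0": "half", "lo": "half"}
print(duplex_problems(sample))  # prints ['enp5s0f0']
```

Note that "lo" is ignored even though its sample value is "half", because its name does not start with 'e'.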

InterfaceSpeedError

This will alert if any server interface (filtered to those whose names start with 'e') reports an operational speed of less than 1000 Mb/s. 1G is the minimum speed we connect servers at, but in some cases a faulty cable could result in 100 Mb/s or even 10 Mb/s operation. The information comes from the Prometheus metric node_network_speed_bytes.
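As the metric name suggests, the speed is reported in bytes per second rather than in Mb/s, so 1G corresponds to 125,000,000. The following sketch shows the conversion and the comparison against the 1000 Mb/s minimum; the function and constant names are hypothetical, and the real alert threshold may be expressed differently.

```python
MIN_SPEED_MBPS = 1000  # 1G, the minimum speed servers are connected at


def bytes_per_sec_to_mbps(speed_bytes):
    """Convert a speed in bytes per second to megabits per second."""
    return speed_bytes * 8 / 1_000_000


def is_below_minimum(speed_bytes):
    """True if the reported speed is under the 1000 Mb/s minimum."""
    return bytes_per_sec_to_mbps(speed_bytes) < MIN_SPEED_MBPS


# 1G, 100M and 10M links expressed in bytes per second.
for label, value in [("1G", 125_000_000), ("100M", 12_500_000), ("10M", 1_250_000)]:
    print(label, bytes_per_sec_to_mbps(value), is_below_minimum(value))
```

Only the 100M and 10M values fall below the minimum here.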

InterfaceReceiveErrors

This will alert if any server interface (filtered to those whose names start with 'e') reports inbound interface 'errors' in the past 5 minutes. These come from the stats that Linux exposes at /proc/net/dev, exported in the Prometheus metric node_network_receive_errs_total.

All invalid packets received, such as those with a bad length or a bad CRC, cause this counter to increment. The most common cause is a bad cable, so that should be checked first. If the cable looks good, further investigation may be needed to isolate exactly which bad packets are observed. The 'ip' command can help show the type of errors:

   ip -s -s -d link show dev <interface_name>
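The raw counters behind this metric can also be read from /proc/net/dev on the host. Below is a minimal sketch of a parser, assuming the usual layout of that file: two header lines, then one row per interface with 'errs' as the third receive column and the eleventh column overall. The function name and the sample text are hypothetical.

```python
def parse_proc_net_dev(text):
    """Map interface name -> receive/transmit error counters from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        fields = values.split()
        # Columns 0-7 are receive stats, columns 8-15 are transmit stats.
        counters[name.strip()] = {
            "rx_errs": int(fields[2]),
            "tx_errs": int(fields[10]),
        }
    return counters


# Hypothetical sample in /proc/net/dev format, not taken from a real host.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1234 10 0 0 0 0 0 0 1234 10 0 0 0 0 0 0
  eno1: 987654 4321 7 0 0 7 0 12 55555 3210 0 0 0 0 0 0
"""
print(parse_proc_net_dev(SAMPLE)["eno1"])  # prints {'rx_errs': 7, 'tx_errs': 0}
```

On a real host the same function could be fed the contents of /proc/net/dev directly.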

InterfaceTransmitErrors

This will alert if any server interface (filtered to those whose names start with 'e') reports outbound interface 'errors' in the past 5 minutes. These come from the stats that Linux exposes at /proc/net/dev, exported in the Prometheus metric node_network_transmit_errs_total.

Outbound errors include aborted transmits and FIFO errors. The most common cause is a bad cable, so this should be checked first. These errors are to be expected when an interface has gone into half-duplex mode (see the duplex alert above). If more investigation is required, the exact type of the errors can be seen in the counters exposed by 'ip':

   ip -s -s -d link show dev <interface_name>
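Because the underlying metric is an ever-growing counter, "errors in the past 5 minutes" means the counter increased during that window. The exact alert expression is not shown on this page, so the following is only a simplified sketch of how such an increase can be computed from a series of samples, treating any drop in the value as a counter reset (for example after a reboot). The function name and the sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across consecutive samples, treating a drop as a counter reset."""
    total = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted from zero, so everything seen since counts as new.
            total += current
    return total


# Hypothetical samples of a transmit error counter within one window.
print(counter_increase([40, 40, 43, 2]))  # prints 5
print(counter_increase([10, 10, 10]))     # prints 0
```

A non-zero result for the window is what would indicate new transmit errors.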

Legacy Icinga / NRPE based check

TODO: Remove once NRPE check has been disabled.

The legacy Icinga alert reports that a health check of a network interface has failed.

It uses the NRPE command /usr/local/lib/nagios/plugins/check_eth on the affected host, which you can also run yourself.

Since that is a shell script, you can also look inside to see exactly what it does.

It loops through the following interfaces: "eno1 eno2 enp5s0f0 enp5s0f1 lo" and checks:

  • Whether all of them report a carrier (a cable is plugged in); it exits with a CRIT if any interface is missing one.
  • Whether any of the interfaces is administratively configured as DOWN.
  • Whether all interfaces have negotiated a speed of 1000 Mb/s, using /sbin/ethtool.
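The three checks in the list above can be illustrated with a short Python sketch. It is not the logic of the check_eth shell script itself, which should be read directly for the details. The function name, the dictionary keys and the sample states are all hypothetical.

```python
EXPECTED_SPEED_MBPS = 1000  # speed the legacy check expects, per the list above


def check_interfaces(states):
    """Return a list of problem strings; an empty list means every check passed.

    `states` maps interface name -> dict with the hypothetical keys
    'admin_up', 'carrier' and 'speed_mbps', or None if the interface is absent.
    """
    problems = []
    for name, state in states.items():
        if state is None:
            problems.append(f"{name}: interface missing")
        elif not state["admin_up"]:
            problems.append(f"{name}: administratively DOWN")
        elif not state["carrier"]:
            problems.append(f"{name}: no carrier")
        elif state["speed_mbps"] != EXPECTED_SPEED_MBPS:
            problems.append(f"{name}: speed {state['speed_mbps']} Mb/s")
    return problems


# Hypothetical interface states, not taken from a real host.
sample = {
    "eno1": {"admin_up": True, "carrier": True, "speed_mbps": 1000},
    "eno2": {"admin_up": True, "carrier": False, "speed_mbps": None},
    "enp5s0f0": {"admin_up": True, "carrier": True, "speed_mbps": 100},
    "enp5s0f1": None,
}
for problem in check_interfaces(sample):
    print(problem)
```

With this sample, eno1 passes and the other three interfaces each produce one problem line.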

In the case of a "missing cable" (no carrier), a missing interface or a wrong speed, you should create a ticket for the dcops team to check the hardware.

If an interface is configured as DOWN, try to find out who did it and why; ask on IRC or on the ops list.