Network monitoring

Monitoring resources

Tool                          | Auth      | Alerts                            | Link
LibreNMS                      | LDAP      |                                   | https://librenms.wikimedia.org/
Prometheus                    | Open      |                                   | https://grafana.wikimedia.org/d/-K8NgsUnz/home?orgId=1&search=open&tag=netops
Icinga                        | LDAP      | Network monitoring#Icinga alerts  | https://icinga.wikimedia.org/icinga/
Logstash                      | LDAP      |                                   |
External monitoring           | Open      |                                   | https://www.wikimediastatus.net/
RIPE Atlas                    | Semi-open |                                   | https://atlas.ripe.net
Rancid                        | Internal  |                                   | N/A
BGPalerter                    | Internal  | Network monitoring#BGPmon alerts  |
Cloudflare BGP leak detection | External  | emails to noc@                    | https://blog.cloudflare.com/route-leak-detection/
RIPE RPKI                     | External  | Network monitoring#RIPE Alerts    | https://my.ripe.net/#/rpki

Runbooks

Icinga alerts

host (ipv6) down

  • If service impacting (eg. full switch stack down).
    1. Depool the site if possible
    2. Ping/page netops
  • If not service impacting (eg. loss of redundancy, management network)
    1. Decide if depooling the site is necessary
    2. Ping and open high priority task for netops

Router interface down

Example incident tracking beginning to end: https://phabricator.wikimedia.org/T314978

Example alert:

CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>

Look at the alert on the Icinga portal, where the full description is visible (IRC only shows the one-line version).

The part that interests us is the one between the <BR> tags. In this example:

  • Interface name is xe-3/2/3
  • Description is Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]
    • Type is Transport, other types are: Core, Peering, Transit, OOB.
    • The circuit is operated by Zayo, with the aforementioned circuit ID
    • Optional: the other side of the link is cr2-codfw:xe-5/0/1
    • The remaining information is optional (latency, speed, cable #)

If such an alert shows up

Identify the type of interface going down

  • 3rd-party provider: Type can be Core/Transport/Transit/Peering/OOB, with a provider name identifiable and present in Netbox
  • Internal link: Type is Core, no provider name listed

If 3rd party provider link

  1. Check whether the provider has a planned maintenance for that circuit ID on the maintenance calendar
  2. Check whether the provider sent a last-minute maintenance or outage notification email to maint-announce
  • If scheduled, or the provider is aware of the incident:
    1. Downtime/ACK the alert for the duration of the maintenance
    2. Monitor that no other links go down (risk of total loss of redundancy)
  • If unplanned:
    1. Open a Phabricator task, tag Netops, include the alert and timestamp
    2. Run the network debug cookbook: cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface (see the example after this list)
    3. Contact the provider using the information present in Netbox; include the circuit ID, the time the outage started, and the output of the cookbook, and CC noc@wiki
    4. Monitor for recovery; if no reply to the email within 30min, call them
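
For illustration, a hypothetical cookbook invocation for the example alert above (the task ID and device name are placeholders):

# Run from a cookbook-capable host (e.g. a cumin host)
cookbook sre.network.debug --task-id T000000 interface cr1-eqiad:xe-3/2/3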

If internal link:

  1. Open a high severity phabricator task, tag the local DCops team (eg. ops-eqiad, ops-codfw, etc), include the alert and timestamp.
  2. Run the network debug cookbook: cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface

Juniper alarm

  • If warning/yellow: open a phabricator task, tag netops
  • If critical/red: open a phabricator task, tag netops, ping/page netops

You can get more information about the alarm by issuing the command show system alarms on the device.
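
For example (illustrative output only, the exact alarms will differ):

user@cr1-eqiad> show system alarms
1 alarms currently active
Alarm time               Class  Description
2022-01-01 00:00:00 UTC  Minor  Rescue configuration is not set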

BFD status

Follow Network monitoring#Router interface down

If the interface is not down, please check the following:

  • show bfd session will give you a summary of which link(s) BFD considers down.
  • show ospf neighbor / show ospf3 neighbor - is the peer up? If not, check whether there are ongoing OSPF alarms.
  • If OSPF looks good, BFD might be stuck in a bad state.
  • Run clear bfd session address $ADDRESS (with $ADDRESS being the IP address gathered from show bfd session); see the example session below.
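
A minimal illustrative session (router name and address are placeholders):

user@cr1-eqiad> show bfd session
                                                  Detect   Transmit
Address                  State     Interface      Time     Interval  Multiplier
198.51.100.1             Down      xe-3/2/3.0     1.500     0.500        3
user@cr1-eqiad> clear bfd session address 198.51.100.1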

OSPF status

Follow Network monitoring#Router interface down

BGP status

BGP peerings are connections between routers to exchange routing information. Each organisation has an AS (autonomous system) number assigned, and we classify our peerings from any router based on the ASN of the other side. The BGP status check fires if any BGP session we consider critical (internal or external) goes down.

If warning/yellow, follow Peering management#Managing down sessions

If critical/red:

  • Find out the type of BGP session that went down.
    • You can find the name of the ASN from the alert in the nagios config.
    • Some of these are external providers, some are our internal connections.
    • If external, find the provider and circuit
      • Look up the provider's page in Netbox
      • Check through the circuits from that provider at the site
      • You can identify the specific circuit because it will reference an interface on the router that alerted
  • Follow the instructions on router interface down from 'If 3rd party provider link' onwards (a quick way to check the session state on the router itself is shown below)
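
If you have access to the router, a sketch of how to check the session state (the ASN, peer address, and router name are illustrative):

user@cr1-eqiad> show bgp summary | match 6939     # find the peer address and its state for that ASN
user@cr1-eqiad> show bgp neighbor 198.51.100.2    # details: Established vs Active/Idle, last error, flap count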

In the case of Hurricane Electric / AS6939 we do not peer with them on dedicated circuits, but instead at shared internet exchange points. The issue may be that the IX interface is down, in which case again follow the router interface down flow. Otherwise it could be a problem with HE; this is likely non-critical, so open a Phabricator task for netops to investigate when available.

VCP status

  1. Open high priority DCops task and tag netops
  2. For DCops:
    1. run show virtual-chassis vc-port and identify the faulty port(s)
      1. eg. 1/2         Configured         -1    Down         40000
    2. re-seat the cable on both sides, if no success, replace optics or DAC
    3. If still down, escalate to Netops

Atlas alerts

Example:

PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map

This one is a bit more complex, as it usually needs some digging to find out where exactly the issue is.

It means there is an issue somewhere between the RIPE Atlas constellation, the "in-between" transit providers, our providers, and our network.

As a rule of thumb, though:

First, monitor for drops in real HTTP traffic (e.g. on the Varnish dashboard) and check the NEL dashboard for signals of connectivity issues from real user traffic.

Be ready to de-pool the site (when possible) and page Netops if signs of a larger issue.

  • If a high number of probes fail (eg. >75%), or both IPv4 and IPv6 are failing simultaneously, and there is no quick recovery (~5min), it is less likely to be a false positive: ping Netops
  • If the alert is flapping with the number of failing probes close to the threshold, it is possibly a false positive: monitor/downtime it and open a high priority Netops task
  • If it matches an (un)scheduled provider maintenance, it is possibly a side effect; if there is no quick recovery, page Netops to potentially drain that specific link

Lastly, sometimes this alert is raised due to 500 errors from the RIPE Atlas servers; there is not much we can do in that case. (You should then see a slightly different error message from the one above, as there won't be a valid number of failed probes.)

To run the check manually, use the following from one of the icinga/alert servers: /usr/lib/nagios/plugins/check_ripe_atlas.py $msg_id 50 35 -v (add -vv for debug info), e.g.

$ /usr/lib/nagios/plugins/check_ripe_atlas.py  11645085 50 35 -v
UDM  11645085
Allowed Failures  35
Allowed % percent loss 50
Total 672
Failed (9): ['6641', '6482', '6457', '6650', '6397', '6209', '6722', '6409', '6718']
https://atlas.ripe.net/api/v2/measurements/11645085/status-check/?permitted_total_alerts=35&max_packet_loss=50

OK - failed 9 probes of 672 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map

NEL alerts

This one is a bit more complex, as it usually needs some digging to find out where exactly the issue is.

It means there is an issue somewhere between our userbase, the "in-between" transit providers, our providers, and our network.

Presently we only alert on a much higher-than-usual rate of tcp.timed_out and tcp.address_unreachable reports, which tend to indicate real connectivity issues. However, the problem may not always be actionable by us -- a large ISP having internal issues can trip this alert.

Things to check:

  • check the NEL dashboard using the various breakdowns (GeoIP country, AS number/ISP, Wikimedia server domain, etc) to attempt a differential diagnosis of the issue
  • check for drops in received HTTP traffic (e.g. on the Frontend traffic dashboard)
  • check for any corresponding RIPE Atlas alerts

If the pattern of reports implicates one edge site, be ready to depool it and see if this resolves the issue.

VRRP status

Open high priority Netops task.

BGPmon alerts

TO BE UPDATED TO MATCH BGPalerter ALERTS.

RPKI Validation Failed

  1. Verify that the alert isn't a false positive
  2. If the alert seems genuine, escalate to netops, as it might mean:
    • That prefix is being hijacked (voluntarily or not)
    • A misconfiguration on our side is resulting in sub-optimal routing for that prefix

RIPE alerts

Resource Certification (RPKI) alerts

See BGPmon/RPKI; you can use the other validation methods listed below.

Note that RIPE will only alert for the prefixes it is in charge of. See IP and AS allocations for the list.

  • https://bgp.he.net/${ASNUMBER} (under prefixes, there should be a green key icon).
  • Using whois: whois -h whois.bgpmon.net " --roa ${ASNUMBER} ${PREFIX}", for example whois -h whois.bgpmon.net " --roa 14907 185.15.56.0/24"
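
Another option, assuming the public RIPEstat data API is reachable (a sketch, not an official runbook step), is to query the validation status directly, reusing the ASN and prefix from the whois example:

# RPKI validation state of one of our prefixes against our origin ASN
curl -s 'https://stat.ripe.net/data/rpki-validation/data.json?resource=AS14907&prefix=185.15.56.0/24' | jq '.data.status'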

LibreNMS alerts

The list of current alerts is at https://librenms.wikimedia.org/alerts/

Unless stated otherwise, open a tracking task for netops, then ack the alert (on the page above). Page if it's causing larger issues (or if you have any doubt).

If an alert is too noisy, you can mute it on https://librenms.wikimedia.org/alert-rules/: edit the alert and flip the "mute" switch.

Primary outbound port utilization over 80%

The interface description will begin with the type of link that is saturating (or close to saturation). This alert firing means some users will have delayed or no access to the sites at all, so it requires a resolution (get help if you're stuck), even if the headline site metrics look OK. A quick way to confirm the saturation on the router itself is shown after the list below.

  • Transit or peering: usually means someone (eg. T192688) is sending us lots of queries whose replies are saturating an outbound link
    1. Identify the source IP or prefix (webrequest logs, etc)
    2. Rate limit, block, or temporarily move traffic to another DC (eg. with DNS)
    3. Contact the offender
  • Core: usually a heavy cross-DC transfer
    1. Identify who started the transfer (SAL, IRC), or which hosts are involved (manually dig into LibreNMS's graphs)
    2. Ask them to stop or rate limit their transfer
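
To confirm the saturation on the router itself (illustrative, reusing the interface name from the earlier alert example):

user@cr1-eqiad> show interfaces xe-3/2/3 | match "Description|rate"   # current input/output rates in bps
user@cr1-eqiad> monitor interface xe-3/2/3                            # live, interactive view of the rates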

Sensor over limit

Can mean a lot of things; often a faulty optic.

Juniper environment status

Often a faulty part in a Juniper device (eg. power supply).

Juniper alarm active

See Network monitoring#Juniper alarm

DC uplink low traffic

Means that a still-active link saw its outbound traffic drop. Can mean that something is wrong with the device or routing.

Ensure that the site has proper connectivity. Depool the site if not.

Processor usage over 85% or Memory over 85%

  1. Gather data by issuing the commands show system processes summary and show chassis routing-engine (example output below)
  2. Watch the site for other signs of malfunctions
  3. If no quick recovery (~30min), escalate to netops
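
Illustrative (trimmed) output, showing the values to look at:

user@cr1-eqiad> show chassis routing-engine
Routing Engine status:
  Slot 0:
    Current state                  Master
    Memory utilization          35 percent
    CPU utilization:
      User                       5 percent
      Kernel                     3 percent
      Idle                      90 percent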

Storage /var over 50% or Storage over 90%

  1. Look for core dumps with show system core-dumps
    • If any are present, escalate to JTAC
  2. Look for other large files in /var/tmp and /var/log
  3. If it is normal growth, clean up storage with request system storage cleanup (see the example below)
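
A sketch of the sequence on the device:

user@cr1-eqiad> show system storage                      # which filesystem is filling up?
user@cr1-eqiad> show system core-dumps                   # any core files? if so, escalate to JTAC
user@cr1-eqiad> request system storage cleanup dry-run   # list the files that would be deleted
user@cr1-eqiad> request system storage cleanup           # delete rotated logs and temporary files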

Critical or emergency syslog messages

Escalate to netops, watch the site for other signs of failure.

Inbound/outbound interface errors

Usually means a faulty optic/cable/port.

If you can connect to the network device, run show interfaces <interface> extensive | match error to get more information about the errors.

  • If a server, notify its owner and assign the task to DCops.
  • If a core/transit/peering/etc link, look for any provider maintenance notification (expected or not).
    • If none, assign the task to DCops, CC netops.
    • If there is one, wait for the maintenance to end and watch for other signs of failure.

Similar task: T203576

Traffic bill over quota

Because checks are attached to devices, a bill going over threshold will alert for every device linked to that bill.

  1. Open a WMF-NDA Netops task (as it's about contracts)
  2. CC directors (as it's about billing)
  3. Ack the alerts in LibreNMS
  4. Use Netflow to figure out what traffic to steer away
  5. Use the AVOID-PATH feature of Homer

Poller is taking too long

Might indicate a connectivity issue to the device's mgmt interface, or an issue with its SNMP daemon.

BGP peer above prefix limit

Could mean either that the peer has grown past the currently configured limit, or that they made an error and temporarily exceeded it.

Follow the instructions from Peering management#Managing down sessions

Port with no description on access switch

For DCops, in Netbox: either disable the port, or connect it to the correct device (then run homer).

Port down

Open DCops task to investigate.

Traffic on tunnel link

Means that all links to a site are down and traffic is going through the last resort path.

  1. Escalate to netops
  2. Depool site
  3. Watch for provider maintenance notification

Duplicate IP on mgmt network

Open task for DCops to investigate/fix.

In the email there will be a line like:

arp info overwritten for 10.65.7.94 from 4c:d9:8f:80:74:8c to 4c:d9:8f:80:23:9a

This means the IP 10.65.7.94 is shared between two MAC addresses.

It usually means someone typoed an IP recently.

Try to SSH to the IP and run "racadm getsysinfo" to get the service tag, then use Netbox to get the hostname. Compare that hostname to the one in DNS (see the example below).
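
A hypothetical session (IP taken from the example above):

$ ssh root@10.65.7.94        # the iDRAC of whichever host currently answers on that IP
> racadm getsysinfo          # note the "Service Tag" line, look it up in Netbox
$ host 10.65.7.94            # back on the bastion: what does DNS say?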

Storm control in effect

More information on Storm_control

This means that something triggered a broadcast storm on the port being shut down, for example by looping a cable.

  1. Open a DCops task
  2. Identify and remove the source of the storm
  3. Clear the error with clear ethernet-switching port-error <port_name>
  4. Monitor for recovery

Outbound discards

See T284593 for details.

If no other signs of issues, open a low priority task for Netops.

virtual-chassis crash

This means a virtual chassis lost one of its members. This alert can help find the root cause of a larger issue, as it means one rack is down or misbehaving.

For example:

asw2-c-eqiad chassisd[1837]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 5

Means asw2-c5-eqiad is misbehaving and all servers in rack C5 could be offline.

  • Make sure services failed over properly (otherwise help them fail over)
  • Open a netops task
  • Escalate to Netops if needed

Access port speed <= 100Mbps

Usually means a faulty cable. You can check what's going on with:

show interfaces ge-6/0/41 media (on the switch)

sudo ethtool eno1 (on the host)

Follow up with a DCops task to check/replace the cable.

Not accepting/receiving prefixes from anycast BGP peer

See also Anycast

Either the peer is not yet supposed to advertise prefixes (eg. it is still being set up), in which case the alert can be ignored, or something is wrong with Bird.

Previous investigations show that it can be a (rare) "stuck" bird process; in that case a restart solves it (see the sketch below).
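
A minimal sketch on the affected anycast host (the exact unit name, e.g. bird vs bird6, is an assumption, check the host):

sudo systemctl status bird            # is the daemon running, is it logging errors?
sudo journalctl -u bird --since=-1h   # recent logs
sudo systemctl restart bird           # previous incidents were resolved by a restart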

Blackbox Probes (Prometheus)

The network probes are run by the Prometheus Blackbox exporter and perform a variety of checks, most notably HTTP. The probes are typically kept within the same site (i.e. they don't cross the WAN, and therefore are not affected by inter-site communication problems).

ProbeDown

The specified probe has failed repeatedly. The alert will contain a link to the logs dashboard filtered for the failing probe. Also check the linked Grafana dashboard for a metrics view of the probe's availability.

For service::catalog probes (job probes/service) the reported instance will be something in the form of service:port where service is the service's key in the catalog.

Query examples for filtering and drilling down:

  • service.name:*foo* - show logs for service foo
  • labels.status_code:[500 TO 599] - show logs for all HTTP errors (make sure you are using Lucene syntax in the search, not DQL)


Probes defined by prometheus::blackbox::check::http in puppet (job probes/custom) are slightly different but act essentially the same. The links to logs and the dashboard work as expected, and there is one blackbox module defined per probe (module label) specifying e.g. which headers to send, SNI, etc., whereas the instance label is set to the hostname against which the probe is run.
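
To look at failing probes outside of Grafana you can query Prometheus directly; a minimal sketch, assuming the standard blackbox-exporter probe_success metric and a reachable Prometheus endpoint (the URL is an assumption, adjust to the local instance):

# List probes that are currently failing, with their job/module/instance labels
curl -sG 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query' \
     --data-urlencode 'query=probe_success == 0' | jq '.data.result[].metric'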

PingUnreachable

The device(s) in question are failing their ICMP probes; in other words, the device appears unreachable when pinged from one or more sites. Consult the ICMP dashboard to drill down into the impact.

DNSUnavailable

The probe in question (module label) is failing when run against one of the DNS servers. The failure could be network-related (e.g. unreachability) or protocol-related (generic failure, not serving the expected response, etc). Consult the DNS dashboard to assess the impact, and also check the "logs" link attached to the alert.
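
A quick manual cross-check from any host (a sketch; substitute the server and record named in the alert):

dig @ns0.wikimedia.org www.wikipedia.org +short        # does the server answer with the expected record?
dig @ns0.wikimedia.org www.wikipedia.org +tcp +short   # also try TCP in case only UDP is affected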

Syslog

Some of the syslog messages seen across the infra and their fix or workaround are listed on https://phabricator.wikimedia.org/T174397

Operations

netmon failover

TODO