Network monitoring
Monitoring resources
Tool | Auth | Alerts | Link |
---|---|---|---|
LibreNMS | LDAP | Network monitoring#LibreNMS alerts | https://librenms.wikimedia.org/ |
Prometheus | Open | Network probes overview (Prometheus blackbox) | |
Icinga | LDAP | Network monitoring#Icinga alerts | https://icinga.wikimedia.org/icinga/ |
Logstash | LDAP | | |
External monitoring | Open | | https://www.wikimediastatus.net/ |
RIPE Atlas | Semi-open | Network monitoring#Atlas alerts | https://atlas.ripe.net |
Rancid | Internal | N/A | |
BGPalerter | Internal | Network monitoring#BGPmon alerts | |
Cloudflare BGP leak detection | External | emails to noc@ | https://blog.cloudflare.com/route-leak-detection/ |
RIPE RPKI | External | Network monitoring#RIPE alerts | https://my.ripe.net/#/rpki |
Runbooks
Icinga alerts
host (ipv6) down
- If service impacting (e.g. full switch stack down):
  - Depool the site if possible
  - Ping/page netops
- If not service impacting (e.g. loss of redundancy, management network):
  - Decide if depooling the site is necessary
  - Ping and open a high priority task for netops
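Before (or while) escalating, a quick reachability check from a bastion or any monitoring host helps confirm whether the device is down on both address families or only on IPv6; the device FQDN below is a placeholder:
ping -c 3 <device fqdn>       # IPv4 reachability
ping -6 -c 3 <device fqdn>    # IPv6 reachability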
Router interface down
Example incident, tracked from beginning to end: https://phabricator.wikimedia.org/T314978
Example alert:
CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>
Go look at the alert on the Icinga portal; the full description will be visible there (IRC only shows the one-line version).
The part that interests us is the one between the <BR> tags. In this example:
- Interface name is xe-3/2/3
- Description is Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]
- Type is Transport; other types are: Core, Peering, Transit, OOB
- The circuit is operated by Zayo, with the aforementioned circuit ID
- Optional: the other side of the link is cr2-codfw:xe-5/0/1
- The remaining information is optional (latency, speed, cable #)
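To confirm the interface state and description directly on the router that alerted, the following standard Junos operational commands can be used (the interface name is taken from the example above):
show interfaces descriptions | match xe-3/2/3
show interfaces xe-3/2/3 terse
The first shows the configured description (type, provider, circuit ID); the second shows the admin/link status of the interface.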
If such an alert shows up:
Identify the type of interface going down:
- 3rd party provider: Type can be Core/Transport/Transit/Peering/OOB, with a provider name identifiable and present in Netbox
- Internal link: Type is Core, no provider name listed
If 3rd party provider link
- Check whether the provider has a planned maintenance for that circuit ID on the maintenance calendar
- Check whether the provider sent a last-minute maintenance or outage email notification to maint-announce
- If the work is scheduled or the provider is aware of the incident:
  - Downtime/ACK the alert for the duration of the maintenance
  - Monitor that no other links are going down (risk of total loss of redundancy)
- If unplanned:
  - Open a Phabricator task, tag Netops, include the alert and timestamp
  - Run the network debug cookbook (see the example invocation after this list):
    cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface
  - Contact the provider using the contact information present in Netbox; include the circuit ID, the time the outage started, and the output of the cookbook, and CC noc@wiki
  - Monitor for recovery; if there is no reply to the email within 30 minutes, call them
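For example, a concrete invocation for the incident referenced above might look like the following; the device shortname and interface are illustrative and should be replaced with the router and interface that alerted:
cookbook sre.network.debug --task-id T314978 interface cr2-eqiad:xe-3/2/3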
If internal link:
- Open a high severity Phabricator task, tag the local DCops team (e.g. ops-eqiad, ops-codfw, etc.), include the alert and timestamp.
- Run the network debug cookbook:
cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface
Juniper alarm
- If warning/yellow: open a phabricator task, tag netops
- If critical/red: open a phabricator task, tag netops, ping/page netops
You can get more information about the alarm by issuing the command show system alarms on the device.
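Both of the following are standard Junos operational commands; the first lists system-level alarms, the second hardware (chassis) alarms such as power supplies or fans:
show system alarms
show chassis alarms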
BFD status
Follow Network monitoring#Router interface down
If the interface is not down, please check the following:
- show bfd session will give you a summary of which link(s) are considered down by BFD.
- show ospf neighbor / show ospf3 neighbor - is the peer up? If not, please check if there are OSPF alarms ongoing.
- If OSPF looks good, it might be due to BFD being stuck in some weird state (see the worked sequence after this list).
  - Run clear bfd session address $ADDRESS (with $ADDRESS == IP address gathered in show bfd session)
  - It's possible that the clear needs to be done on the remote side of the session.
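Putting those steps together, a typical troubleshooting sequence on the affected router looks roughly like this ($ADDRESS is the peer address taken from the show bfd session output):
show bfd session
show ospf neighbor
show ospf3 neighbor
clear bfd session address $ADDRESS
Only run the clear if OSPF looks healthy and the BFD session appears stuck; it may need to be run on the remote side instead.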
OSPF status
Follow Network monitoring#Router interface down
BGP status
BGP peerings are connections between routers to exchange routing information. Each organisation has an AS (autonomous system) number assigned, and we classify our peerings from any router based on the ASN of the other side. The BGP status check fires if any BGP session we consider critical (internal or external) goes down.
If warning/yellow, follow Peering management#Managing down sessions
If critical/red:
- Find out the type of BGP session that went down.
- You can find the name of the ASN from the alert in the nagios config.
- Some of these are external providers, some are our internal connections.
- If external, find the provider and circuit
- Look up the provider's page in Netbox
- Check through the circuits from that provider at the site
- You can identify the specific circuit as it will show an interface on the router that alerted
- Follow the instructions on router interface down from 'If 3rd party provider link' onwards
In the case of Hurricane Electric / AS6939 we do not peer with them on dedicated circuits, but instead at shared internet exchange points. In that case the issue may be that the IX interface is down, in which case again follow the router interface down flow. Otherwise it could be a problem with HE, but this is likely non-critical; please open a Phabricator task for netops to investigate when available.
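To see which session went down and for how long, the usual Junos commands on the router that alerted are the following (the neighbor address is a placeholder):
show bgp summary
show bgp neighbor 192.0.2.1
show bgp summary lists all peers with their state and how long they have been up or down; show bgp neighbor gives the details (including the last error) for a specific peer.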
VCP status
- Open a high priority DCops task and tag netops
- For DCops:
  - Run show virtual-chassis vc-port and identify the faulty port(s)
    - e.g. 1/2 Configured -1 Down 40000
  - Re-seat the cable on both sides; if no success, replace the optics or DAC
  - If still down, escalate to Netops
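In addition to the vc-port check above, show virtual-chassis status (also a standard Junos command) lists all members of the stack and their roles, which helps confirm whether a member dropped out entirely:
show virtual-chassis status
show virtual-chassis vc-port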
Atlas alerts
Example:
PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
This one is a bit more complex, as it usually needs some digging to figure out where exactly the issue is.
It means there is an issue somewhere between the RIPE Atlas constellation, the "in-between" transit providers, our providers, and our network.
As a rule of thumb though:
First, monitor for drops in real HTTP traffic (e.g. on the Varnish dashboard) and check the NEL dashboard for signals of connectivity issues from real user traffic.
Be ready to de-pool the site (when possible) and page Netops if there are signs of a larger issue.
- If a high number of probes fail (e.g. >75%), or both IPv4 and IPv6 are failing simultaneously, and there is no quick recovery (~5 min), it is less likely to be a false positive: ping Netops
- If the alert is flapping with the number of failing probes close to the threshold, it is possibly a false positive: monitor/downtime it and open a high priority Netops task
- If it matches an (un)scheduled provider maintenance, it is possibly a side effect; if there is no quick recovery, page Netops to potentially drain that specific link
Lastly, sometimes this alert can be raised due to 500 errors from the RIPE Atlas servers; there is not much we can do in that case. (You should then see a slightly different error message from the above, as there won't be a valid number of failed probes.)
To run the check manually, use the following from one of the icinga/alert servers (add -vv for more debug info):
/usr/lib/nagios/plugins/check_ripe_atlas.py $msg_id 50 35 -v
For example:
$ /usr/lib/nagios/plugins/check_ripe_atlas.py 11645085 50 35 -v
UDM 11645085
Allowed Failures 35
Allowed % percent loss 50
Total 672
Failed (9): ['6641', '6482', '6457', '6650', '6397', '6209', '6722', '6409', '6718']
https://atlas.ripe.net/api/v2/measurements/11645085/status-check/?permitted_total_alerts=35&max_packet_loss=50
OK - failed 9 probes of 672 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map
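The status-check endpoint queried by the plugin can also be fetched directly with curl, which is handy if the icinga/alert hosts are unavailable (URL taken from the output above):
curl -s 'https://atlas.ripe.net/api/v2/measurements/11645085/status-check/?permitted_total_alerts=35&max_packet_loss=50'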
NEL alerts
This one is a bit more complex, as it usually needs some digging to figure out where exactly the issue is.
It means there is an issue somewhere between our userbase, the "in-between" transit providers, our providers, and our network.
Presently we only alert on a much higher-than-usual rate of tcp.timed_out and tcp.address_unreachable reports, which tend to indicate real connectivity issues. However, the problem may not always be actionable by us; a large ISP having internal issues can trip this alert.
Things to check:
- Check the NEL dashboard using the various breakdowns (GeoIP country, AS number/ISP, Wikimedia server domain, etc.) to attempt a differential diagnosis of the issue
- check for drops in received HTTP traffic (e.g. on the Frontend traffic dashboard)
- check for any corresponding RIPE Atlas alerts
If the pattern of reports implicates one edge site, be ready to depool it and see if this resolves the issue.
VRRP status
Open high priority Netops task.
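If you have access to the routers, show vrrp summary (a standard Junos command) lists the VRRP groups with their interfaces and master/backup state, which helps scope the task:
show vrrp summary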
BGPmon alerts
TO BE UPDATED TO MATCH BGPalerter ALERTS.
RPKI Validation Failed
- Verify that the alert isn't a false positive
- One option is to use RIPE's validator: http://localcert.ripe.net:8088/api/v1/validity/${ASNUMBER}/${PREFIX}
- Where ${ASNUMBER} is the full AS advertising the prefix, and ${PREFIX} the prefix with its mask
- For example: http://localcert.ripe.net:8088/api/v1/validity/AS14907/185.15.56.0/24 says "state":"Valid"
- If the alert seems genuine, escalate to netops, as it might mean:
  - That prefix is being hijacked (voluntarily or not)
  - A misconfiguration on our side, which can result in sub-optimal routing for that prefix
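The validator URL above can also be queried from the command line; for example, for the prefix used in the example:
curl -s 'http://localcert.ripe.net:8088/api/v1/validity/AS14907/185.15.56.0/24'
Look at the "state" field in the JSON response (the example above returns "Valid").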
RIPE alerts
Resource Certification (RPKI) alerts
See BGPmon/RPKI; you can also use the other validation methods listed below.
Note that RIPE will only alert for the prefixes it is in charge of. See IP and AS allocations for the list.
- https://bgp.he.net/${ASNUMBER} (under prefixes, there should be a green key icon).
- Using whois: whois -h whois.bgpmon.net " --roa ${ASNUMBER} ${PREFIX}", for example whois -h whois.bgpmon.net " --roa 14907 185.15.56.0/24"
LibreNMS alerts
The list of current alerts is available at https://librenms.wikimedia.org/alerts/
Unless stated otherwise, open a tracking task for netops, then ack the alert (on the page above). Page if it's causing larger issues (or if you have any doubt).
If an alert is too noisy, you can mute it on https://librenms.wikimedia.org/alert-rules/: edit the alert and flip the "mute" switch.
Primary outbound port utilization over 80%
The interface description will begin with the type of link saturating (or close to saturation). This alert firing means some users will have delayed or no access to the sites at all, so it requires a resolution (get help if you're stuck), even if the headline site metrics look OK.
- Transit or peering: usually means someone (e.g. T192688) is sending us lots of queries whose replies are saturating an outbound link
  - Identify the source IP or prefix (webrequest logs, etc.)
  - Rate limit, block, or temporarily move traffic to another DC (e.g. with DNS)
  - Contact the offender
- Core: usually a heavy cross-DC transfer
  - Identify who started the transfer (SAL, IRC), or which hosts are involved (manually dig down in LibreNMS's graphs)
  - Ask them to stop or rate limit their transfer
Sensor over limit
Can mean a lot of things, often a faulty optic.
Juniper environment status
Often a faulty part in a Juniper device (e.g. power supply).
Juniper alarm active
See Network monitoring#Juniper alarm
DC uplink low traffic
Means that a still active link saw its outbound traffic drop. It can mean that something is wrong with the device or routing.
Ensure that the site has proper connectivity. Depool the site if not.
Processor usage over 85% or Memory over 85%
- Gather data by issuing the commands show system processes summary and show chassis routing-engine
- Watch the site for other signs of malfunctions
- If no quick recovery (~30min), escalate to netops
Storage /var over 50% or Storage over 90%
- Look for core dumps with show system core-dumps
  - If there are any, escalate to JTAC
- Look for other large files in /var/tmp and /var/log
- If normal growth, clean up storage with request system storage cleanup
Critical or emergency syslog messages
Escalate to netops, watch the site for other signs of failure.
Inbound/outbound interface errors
Usually means a faulty optic, cable, or port.
If you can connect to the network device, run show interfaces <interface> extensive | match error to get more information on the errors.
You can also look at the number of errors in LibreNMS, browsing through 'Devices'... 'All Devices'... 'Network' and then searching for the network device in question. When the page for the device appears you will see all the interface names displayed on the left panel below the overall throughput graph. Select (or search for) the interface in this list and click it. The page that appears for that particular interface has the 'Interface Errors' graph on the very last row. This should give you a sense of whether the errors are constant or just a brief blip.
- If a server, notify its owner and assign the task to DCops.
- If a core/transit/peering/etc link, look for any provider maintenance notification (expected or not).
- If none, assign the task to DCops, CC netops.
- If any, wait for maintenance to end, watch for other signs of failures
Similar tasks: T203576 T362486
Traffic bill over quota
Because checks are attached to devices, a bill going over threshold will alert for every device linked to that bill.
- Open a WMF-NDA Netops task (as it's about contracts)
- CC directors (as it's about billing)
- Ack the alerts in LibreNMS
- Use Netflow to figure out what traffic to steer away
- Use the AVOID-PATH feature of Homer
Poller is taking too long
Might indicate a connectivity issue to the device's mgmt interface, or an issue with its SNMP daemon.
BGP peer above prefix limit
Could either mean the peer has grown past the currently configured limit, or that they made an error and temporarily exceeded it.
Follow the instructions from Peering management#Managing down sessions
Port with no description on access switch
For DCops, in Netbox: either disable the port, or connect it to the correct device (then run homer).
Port down
Open DCops task to investigate.
Traffic on tunnel link
Means that all links to a site are down and traffic is going through the last resort path.
- Escalate to netops
- Depool site
- Watch for provider maintenance notification
Duplicate IP on mgmt network
Open task for DCops to investigate/fix.
In the email there will be a line like:
arp info overwritten for 10.65.7.94 from 4c:d9:8f:80:74:8c to 4c:d9:8f:80:23:9a
This means the IP 10.65.7.94 is shared between the two MAC addresses.
It usually means someone typoed an IP recently.
Try to SSH to the IP and run "racadm getsysinfo" to get the service tag, then use Netbox to get the hostname. Compare that hostname to the one in DNS.
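A sketch of those steps, assuming the management controller answering on the IP is a Dell iDRAC (IP taken from the example line above):
ssh root@10.65.7.94
racadm getsysinfo
Note the service tag from the output, look it up in Netbox to get the hostname, and compare it to what DNS says for that IP.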
Storm control in effect
More information on Storm_control
This means that something triggered a broadcast storm on the port that was shut down, for example by looping a cable.
- Open a DCops task
- Identify and remove the source of the storm
- Clear the error with clear ethernet-switching port-error <port_name>
- Monitor for recovery
Outbound discards
See T284593 for details.
If no other signs of issues, open a low priority task for Netops.
virtual-chassis crash
This means a virtual chassis lost one of its members. This alert can help find the root cause of a larger issue, as it means one rack is down or misbehaving.
For example:
asw2-c-eqiad chassisd[1837]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 5
Means asw2-c5-eqiad is misbehaving and all servers in rack C5 could be offline.
- Make sure services failed over properly (otherwise help them fail over)
- Open a netops task
- Escalate to Netops if needed
Access port speed <= 100Mbps
Usually means a faulty cable; you can check what's up with:
$ show interfaces ge-6/0/41 media
on the switch
$ sudo ethtool eno1
on the host
Follow up with a DCops task to check/replace the cable.
Not accepting/receiving prefixes from anycast BGP peer
See also Anycast
Either the peer is not supposed to be advertising prefixes yet (e.g. it is still being set up), in which case the alert can be ignored; otherwise it means that something is wrong with Bird.
Previous investigations show that it can be a (rare) "stuck" bird process; in that case a restart solves it (see the sketch below).
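If the peer is expected to be up, a minimal sketch of checking and (if needed) restarting Bird on the affected host; the systemd unit is assumed to be named bird and may differ:
sudo birdc show protocols      # state of the BGP protocol(s) as seen by Bird
sudo systemctl restart bird    # only if the process looks stuck; verify with birdc afterwards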
Blackbox Probes (Prometheus)
The network probes are run by the Prometheus Blackbox exporter and perform a variety of checks, most notably HTTP. The probes are typically kept within the same site (i.e. they don't cross the WAN, and are therefore not affected by inter-site communication problems).
ProbeDown
The specified probe has failed repeatedly. The alert will contain a link to the logs dashboard filtered for the failing probe. Check out also the linked grafana dashboard for a metrics view of the probe's availability.
For service::catalog probes (job probes/service) the reported instance will be something in the form of service:port where service is the service's key in the catalog.
Example queries for filtering and drilling down:
- service.name:*foo* - Show logs for service foo
- labels.status_code:[500 TO 599] - Show logs for all HTTP errors (make sure you are using Lucene syntax in the search, not DQL)
Probes defined by prometheus::blackbox::check::http in puppet (job probes/custom) are slightly different but act essentially the same. The links to logs and dashboard work as expected, and there is one blackbox module defined per probe (module label) specifying e.g. which headers to send, SNI, etc., whereas the instance label is set to the hostname against which the probe is run.
PingUnreachable
The device(s) in question are failing their ICMP probes; in other words, the device appears unreachable when pinged from one or multiple sites. Consult the ICMP dashboard to drill down into the impact.
DNSUnavailable
The probe in question (module label) is failing when run against one of the DNS servers. The failure could be network-related (e.g. unreachability) or protocol-related (generic failure, we're not serving the expected response, etc.). Consult the DNS dashboard to assess the impact, and also check the "logs" link attached to the alert.
Syslog
Some of the syslog messages seen across the infra and their fix or workaround are listed on https://phabricator.wikimedia.org/T174397
Operations
netmon failover
TODO