Monitoring resources

Tool	Auth	Alerts	Link
LibreNMS	LDAP		https://librenms.wikimedia.org/
Prometheus	Open		Network probes overview (Prometheus blackbox)
Icinga	LDAP	Network monitoring#Icinga alerts	https://icinga.wikimedia.org/icinga/
Logstash	LDAP
External monitoring	Open		https://www.wikimediastatus.net/
RIPE Atlas	Semi-open		https://atlas.ripe.net
Rancid	Internal	N/A
BGPalerter	Internal	Network monitoring#BGPmon alerts
Cloudflare BGP leak detection	External	emails to noc@	https://blog.cloudflare.com/route-leak-detection/
RIPE RPKI	External	Network monitoring#RIPE Alerts	https://my.ripe.net/#/rpki

Runbooks

Icinga alerts

host (ipv6) down

If service impacting (eg. full switch stack down).
1. Depool the site if possible
2. Ping/page netops
If not service impacting (eg. loss of redundancy, management network)
1. Decide if depooling the site is necessary
2. Ping and open high priority task for netops

Router interface down

Example incident tracking beginning to end: https://phabricator.wikimedia.org/T314978

Example alert:

CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>

Look at the alert on the Icinga portal, where the full description is visible (IRC only shows the one-line version).

The part that interests us is the one between the <BR> tags. In this example:

Interface name is xe-3/2/3
Description is Transport: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]
- Type is Transport; other types are: Core, Peering, Transit, OOB.
- The circuit is operated by Zayo; the circuit ID follows the provider name
- Optional: the other side of the link is cr2-codfw:xe-5/0/1
- The remaining information is optional (latency, speed, cable#)
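
As a rough illustration of the fields described above, the following Python sketch splits an alert on the <BR> tags and extracts the interface name, type, remote side, provider and circuit ID. The regular expression and the names LinkAlert and parse_interface_alert are assumptions made for this example, derived only from the single alert shown here; real descriptions may not all follow this layout.

```python
import re
from typing import List, NamedTuple, Optional


class LinkAlert(NamedTuple):
    """Fields recovered from one '<BR>'-delimited alert segment."""
    interface: str
    link_type: str
    remote: Optional[str]
    provider: Optional[str]
    circuit_id: Optional[str]


# Assumed layout: "<interface>: down -> <Type>: [<remote>] [(<provider>, <circuit ID>)] ..."
PATTERN = re.compile(
    r"(?P<interface>[\w/.-]+): down -> "
    r"(?P<link_type>\w+): "
    r"(?:(?P<remote>[\w.-]+:[\w/.-]+) )?"
    r"(?:\((?P<provider>[^,)]+), (?P<circuit_id>[^)]+)\))?"
)


def parse_interface_alert(alert: str) -> List[LinkAlert]:
    """Return one LinkAlert per segment that describes a down interface."""
    results = []
    for segment in alert.split("<BR>"):
        match = PATTERN.search(segment)
        if match:
            results.append(LinkAlert(**match.groupdict()))
    return results


example = (
    "CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, "
    "excluded: 0, unused: 0<BR>xe-3/2/3: down -> Transport: cr2-codfw:xe-5/0/1 "
    "(Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>"
)
for link in parse_interface_alert(example):
    print(link)
```

The summary segment before the first <BR> is skipped because it does not contain the "down ->" marker.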

If such an alert shows up

Identify the type of interface going down

3rd party provider: Type can be Core/Transport/Transit/Peering/OOB, with an identifiable provider name that is present in Netbox
Internal link: Type is Core, no provider name listed
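
The rule above can be sketched as a small Python helper. This is only an illustration: the function name classify_link and its return strings are made up for this example, and it cannot check that the provider actually exists in Netbox, which still has to be verified by hand.

```python
from typing import Optional

# Types listed above as possible for a 3rd party provider link.
THIRD_PARTY_TYPES = {"Core", "Transport", "Transit", "Peering", "OOB"}


def classify_link(link_type: str, provider: Optional[str]) -> str:
    """Classify a link from its description type and (optional) provider name."""
    if link_type in THIRD_PARTY_TYPES and provider:
        return "3rd party provider link"
    if link_type == "Core" and not provider:
        return "internal link"
    # Anything else does not match either rule and needs a human to look at it.
    return "unknown"


print(classify_link("Transport", "Zayo"))
print(classify_link("Core", None))
```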

If 3rd party provider link

Check whether the provider has a planned maintenance for that circuit ID on the maintenance calendar
Check whether the provider sent a last-minute maintenance or outage email notification to maint-announce

If scheduled or provider aware of the incident

Downtime/ACK the alert for the duration of the maintenance
Monitor that no other links are going down (risk of total loss of redundancy)

If unplanned

Open a phabricator task, tag Netops, include the alert and timestamp
Run the network debug cookbook: cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface
Contact the provider using the information present in Netbox. Include the circuit ID, the time the outage started and the output of the cookbook, and cc noc@wiki.
Monitor for recovery; if there is no reply to the email within 30min, call them

If internal link:

Open a high severity phabricator task, tag the local DCops team (eg. ops-eqiad, ops-codfw, etc), include the alert and timestamp.
Run the network debug cookbook: cookbook sre.network.debug --task-id {phab task} interface device-shortname:interface
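
To avoid typos when filling in the placeholders of the cookbook command, the invocation can be assembled programmatically. The Python sketch below only builds the argument list shown above; the helper name debug_cookbook_argv and the device name cr1-example are hypothetical, and the task ID is the example incident linked earlier.

```python
import shlex
from typing import List


def debug_cookbook_argv(task_id: str, device: str, interface: str) -> List[str]:
    """Build the argument list for the network debug cookbook, as documented above."""
    return [
        "cookbook",
        "sre.network.debug",
        "--task-id",
        task_id,
        "interface",
        f"{device}:{interface}",  # device-shortname:interface
    ]


# Hypothetical device short name; use the router that raised the alert.
argv = debug_cookbook_argv("T314978", "cr1-example", "xe-3/2/3")
print(shlex.join(argv))
```

The printed line can be pasted into a shell on the host where cookbooks are normally run.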

Juniper alarm

If warning/yellow: open a phabricator task, tag netops
If critical/red: open a phabricator task, tag netops, ping/page netops

You can get more information about the alarm by issuing the command show system alarms on the device.

BFD status

Follow Network monitoring#Router interface down

If the interface is not down, please check the following:

show bfd session will give you a summary of what link(s) are considered down by BFD.
show ospf neighbor / show ospf3 neighbor - is the peer up? If not, please check if there are OSPF alarms ongoing.
If OSPF looks good, BFD may be stuck in an inconsistent state.
In that case, run clear bfd session address $ADDRESS (where $ADDRESS is the IP address gathered from show bfd session)

OSPF status

Follow Network monitoring#Router interface down

BGP status

BGP peerings are connections between routers to exchange routing information. Each organisation has an AS (autonomous system) number assigned, and we classify our peerings from any router based on the ASN of the other side. The BGP status check fires if any BGP session we consider critical (internal or external) goes down.

If warning/yellow, follow Peering management#Managing down sessions

If critical/red:

Find out the type of BGP session that went down.
- You can find the name of the ASN from the alert in the nagios config.
- Some of these are external providers, some are our internal connections.
- If external, find the provider and circuit
  - Look up the provider's page in Netbox
  - Check through the circuits from that provider at the site
  - You can identify the specific circuit as it will show an interface on the router that alerted
Follow the instructions on router interface down from 'If 3rd party provider link' onwards

In the case of Hurricane Electric / AS6939, we do not peer with them on dedicated circuits, but at shared internet exchange points. The issue may be that the IX interface is down, in which case follow the router interface down flow again. Otherwise it could be a problem with HE, which is likely non-critical: open a Phabricator task for netops to investigate when available.

VCP status

Open high priority DCops task and tag netops
For DCops:
1. run show virtual-chassis vc-port and identify the faulty port(s)
  1. eg. 1/2 Configured -1 Down 40000
2. re-seat the cable on both sides, if no success, replace optics or DAC
3. If still down, escalate to Netops

Atlas alerts

NEL alerts

VRRP status

Open high priority Netops task.

BGPmon alerts

TO BE UPDATED TO MATCH BGPalerter ALERTS.

RPKI Validation Failed

Verify that the alert isn't a false positive
- One option is to use RIPE's validator: http://localcert.ripe.net:8088/api/v1/validity/${ASNUMBER}/${PREFIX}
- Where ${ASNUMBER} is the full AS advertising the prefix, and ${PREFIX} the prefix with its mask
- For example: http://localcert.ripe.net:8088/api/v1/validity/AS14907/185.15.56.0/24 says "state":"Valid"
If the alert seems genuine, escalate to netops as it might mean
- That prefix is being hijacked (voluntarily or not)
- A misconfiguration on our side, which can result in sub-optimal routing for that prefix
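
The false-positive check can be sketched in Python as follows. The URL layout comes from the example above; the exact JSON structure returned by the validator is not documented here, so the helper searches for the first "state" key at any depth rather than assuming a fixed path. The function names and the sample payload are illustrative assumptions, and the sketch does not perform the HTTP request itself.

```python
import json
from typing import Any, Optional

VALIDATOR_BASE = "http://localcert.ripe.net:8088/api/v1/validity"


def validity_url(asn: str, prefix: str) -> str:
    """Build the validator URL, e.g. asn='AS14907', prefix='185.15.56.0/24'."""
    return f"{VALIDATOR_BASE}/{asn}/{prefix}"


def find_state(payload: Any) -> Optional[str]:
    """Return the first string value stored under a 'state' key, at any depth."""
    if isinstance(payload, dict):
        state = payload.get("state")
        if isinstance(state, str):
            return state
        children = list(payload.values())
    elif isinstance(payload, list):
        children = payload
    else:
        return None
    for child in children:
        found = find_state(child)
        if found is not None:
            return found
    return None


# Illustrative payload only: the nesting is an assumption, "state":"Valid" is from the text.
sample = json.loads('{"validated_route": {"validity": {"state": "Valid"}}}')
print(validity_url("AS14907", "185.15.56.0/24"))
print(find_state(sample))
```

Any state other than "Valid" would then be a reason to treat the alert as genuine and escalate.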

RIPE alerts

Resource Certification (RPKI) alerts

See BGPmon/RPKI. You can also use the other validation methods listed below.

Note that RIPE will only alert for the prefixes it is in charge of. See IP and AS allocations for the list.

https://bgp.he.net/${ASNUMBER} (under prefixes, there should be a green key icon).
Using whois: whois -h whois.bgpmon.net " --roa ${ASNUMBER} ${PREFIX}", for example whois -h whois.bgpmon.net " --roa 14907 185.15.56.0/24"

LibreNMS alerts

The current alerts are listed on https://librenms.wikimedia.org/alerts/

Unless stated otherwise, open a tracking task for netops, then ack the alert (on the page above). Page if it's causing larger issues (or if you have any doubt).

If an alert is too noisy, you can mute it on https://librenms.wikimedia.org/alert-rules/ (edit the alert and flip the "mute" switch).

Primary outbound port utilization over 80%

The interface description will begin with the type of link saturating (or close to saturation). This alert firing means some users will have delayed or no access to the sites at all so requires a resolution (get help if you're stuck), even if the headline site metrics look OK.

Transit or peering: usually means someone (eg. T192688) is sending us lots of queries, the replies to which are saturating an outbound link
1. Identify the source IP or prefix (webrequest logs, etc)
2. Rate limit, block, or temporarily move traffic to another DC (eg. with DNS)
3. Contact the offender
Core: usually a heavy cross DC transfer
1. Identify who started the transfer (SAL, IRC), or which hosts are involved (manually dig down in LibreNMS's graphs)
2. Ask them to stop or rate limit their transfer
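
For reference, the 80% threshold in the alert name is a simple ratio of outbound rate to link speed. The Python sketch below shows the arithmetic; the function names are made up for this example, and it assumes both values are expressed in bits per second.

```python
def utilization_percent(rate_bps: float, speed_bps: float) -> float:
    """Outbound utilization of a link as a percentage of its speed."""
    if speed_bps <= 0:
        raise ValueError("link speed must be positive")
    return 100.0 * rate_bps / speed_bps


def is_over_threshold(rate_bps: float, speed_bps: float, threshold: float = 80.0) -> bool:
    """True when utilization exceeds the alert threshold (80% by default)."""
    return utilization_percent(rate_bps, speed_bps) > threshold


# A 10Gbps link carrying 8.5Gbps outbound is over the threshold.
print(utilization_percent(8.5e9, 10e9))
print(is_over_threshold(8.5e9, 10e9))
```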

Sensor over limit

Can mean a lot of things, often a faulty optic.

Juniper environment status

Often a faulty part in a Juniper device (eg. power supply).

Juniper alarm active

See Network monitoring#Juniper alarm

DC uplink low traffic

Means that a still active link saw its outbound traffic drop. Can mean that something is wrong with the device or routing.

Ensure that the site has proper connectivity. Depool the site if not.

Processor usage over 85% or Memory over 85%

Gather data by issuing the commands show system processes summary and show chassis routing-engine
Watch the site for other signs of malfunctions
If no quick recovery (~30min), escalate to netops

Storage /var over 50% or Storage over 90%

Look for core dumps with show system core-dumps
- If there are any, escalate to JTAC
Look for other large files in /var/tmp and /var/log
If normal growth, cleanup storage with request system storage cleanup

Critical or emergency syslog messages

Escalate to netops, watch the site for other signs of failure.

Inbound/outbound interface errors

Usually means faulty optics/cable/port.

If you can connect to the network device, run show interfaces <interface> extensive | match error in order to have more information on the errors.

You can also look at the number of errors in LibreNMS, browsing through 'Devices'... 'All Devices'... 'Network' and then searching for the network device in question. When the page for the device appears you will see all the interface names displayed on the left panel below the overall throughput graph. Select (or search for) the interface in this list and click it. The page that appears for that particular interface has the 'Interface Errors' graph on the very last row. This should give you a sense of if the errors are constant, or if it was just a brief blip.

If a server, notify its owner and assign the task to DCops.
If a core/transit/peering/etc link, look for any provider maintenance notification (expected or not).
- If none, assign the task to DCops, CC netops.
- If any, wait for maintenance to end, watch for other signs of failures

Similar tasks: T203576 T362486

Traffic bill over quota

Because checks are attached to devices, a bill going over threshold will alert for every device linked to that bill.

Open a WMF-NDA Netops task (as it's about contracts)
CC directors (as it's about billing)
Ack the alerts in LibreNMS
Use Netflow to figure out what traffic to steer away
Use the AVOID-PATH feature of Homer

Poller is taking too long

Might indicate connectivity issue to the device's mgmt or an issue with its SNMP daemon.

BGP peer above prefix limit

Could either mean the peer has grown past the currently configured limit, or that they made an error and temporarily exceeded the limit.

Follow the instructions from Peering management#Managing down sessions

Port with no description on access switch

For DCops, in Netbox: either disable the port, or connect it to the correct device (then run homer).

Port down

Open DCops task to investigate.

Traffic on tunnel link

Means that all links to a site are down and traffic is going through the last resort path.

Escalate to netops
Depool site
Watch for provider maintenance notification

Duplicate IP on mgmt network

Open task for DCops to investigate/fix.

In the email there will be a line like:

arp info overwritten for 10.65.7.94 from 4c:d9:8f:80:74:8c to 4c:d9:8f:80:23:9a

This means the IP 10.65.7.94 is shared between the two MAC addresses.

It usually means someone recently typoed an IP.

Try to ssh to the IP and run "racadm getsysinfo" to get the service tag, then look the tag up in Netbox to get the hostname. Compare that hostname to the one in DNS.
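
When several of these emails arrive at once, extracting the IP and both MAC addresses from the log line can be scripted. The Python sketch below is a minimal example; the regular expression is an assumption based on the single sample line above, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"
ARP_RE = re.compile(
    r"arp info overwritten for (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    rf"from (?P<old_mac>{MAC}) to (?P<new_mac>{MAC})",
    re.IGNORECASE,
)


def parse_arp_overwrite(line: str) -> Optional[Dict[str, str]]:
    """Return the IP and the two MAC addresses, or None if the line does not match."""
    match = ARP_RE.search(line)
    return match.groupdict() if match else None


line = "arp info overwritten for 10.65.7.94 from 4c:d9:8f:80:74:8c to 4c:d9:8f:80:23:9a"
print(parse_arp_overwrite(line))
```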

Storm control in effect

More information on Storm_control

This means that something triggered a broadcast storm on the port that was shut down, for example a looped cable.

Open a DCops task
Identify and remove the source of the storm
Clear the error: clear ethernet-switching port-error <port_name>
Monitor for recovery

Outbound discards

See T284593 for details.

If no other signs of issues, open a low priority task for Netops.

virtual-chassis crash

This means a virtual chassis lost one of its members. This alert helps find the root cause of a larger issue, as it means one rack is down or misbehaving.

For example:

asw2-c-eqiad chassisd[1837]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 5

Means asw2-c5-eqiad is misbehaving and all servers in rack C5 could be offline.

Make sure services failed over properly (otherwise help them fail over)
Open a netops task
Escalate to Netops if needed
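
The mapping from the syslog line to the affected switch in the example above can be sketched in Python. This assumes, based only on that single example, that the stack is named role-row-site and that the member number equals the rack number; the function name is hypothetical, and the result should be double-checked in Netbox.

```python
import re
from typing import Optional

VC_RE = re.compile(
    r"(?P<stack>\S+) chassisd\[\d+\]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: "
    r"Member change: vc delete of member (?P<member>\d+)"
)


def affected_member(line: str) -> Optional[str]:
    """Guess the member switch name, e.g. asw2-c-eqiad + member 5 -> asw2-c5-eqiad."""
    match = VC_RE.search(line)
    if not match:
        return None
    parts = match.group("stack").split("-")
    if len(parts) != 3:
        # The name does not follow the assumed role-row-site pattern.
        return None
    role, row, site = parts
    return f"{role}-{row}{match.group('member')}-{site}"


line = (
    "asw2-c-eqiad chassisd[1837]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: "
    "Member change: vc delete of member 5"
)
print(affected_member(line))
```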

Access port speed <= 100Mbps

Usually means a faulty cable. You can check the link with:

On the switch: show interfaces ge-6/0/41 media

On the host: sudo ethtool eno1

Follow up with a DCops task to check/replace the cable.
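
If the ethtool output has been captured on the host, the negotiated speed can be pulled out of it with a short script. The Python sketch below is an illustration only: the function names are made up, and the sample text is an abbreviated stand-in for real ethtool output, which contains many more fields.

```python
import re
from typing import Optional

SPEED_RE = re.compile(r"^\s*Speed:\s*(\d+)Mb/s", re.MULTILINE)


def link_speed_mbps(ethtool_output: str) -> Optional[int]:
    """Return the negotiated speed in Mbps, or None if it is not reported."""
    match = SPEED_RE.search(ethtool_output)
    return int(match.group(1)) if match else None


def is_degraded(ethtool_output: str, threshold_mbps: int = 100) -> bool:
    """True when the port negotiated at or below the alert threshold (100Mbps)."""
    speed = link_speed_mbps(ethtool_output)
    return speed is not None and speed <= threshold_mbps


# Abbreviated, illustrative sample of `sudo ethtool eno1` output.
sample = "Settings for eno1:\n\tSpeed: 100Mb/s\n\tDuplex: Full\n"
print(link_speed_mbps(sample), is_degraded(sample))
```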

Not accepting/receiving prefixes from anycast BGP peer

Blackbox Probes (Prometheus)

The network probes are run by Prometheus Blackbox exporter and perform a variety of checks, most notably HTTP. The probes are typically kept to within the same site (i.e. don't cross the WAN, and therefore are not affected by inter-site communication problems)

ProbeDown

The specified probe has failed repeatedly. The alert will contain a link to the logs dashboard filtered for the failing probe. Check out also the linked grafana dashboard for a metrics view of the probe's availability.

For service::catalog probes (job probes/service) the reported instance will be something in the form of service:port where service is the service's key in the catalog.

Query examples for filtering and drilling down:

service.name:*foo*: Show logs for service foo
labels.status_code:[500 TO 599]: Show logs for all HTTP errors (make sure you are using lucene syntax in the search, not DQL)
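
The two filters above can also be combined into a single query. The Python sketch below only assembles query strings; the field names come from the examples above, the helper names are hypothetical, and it assumes the logs dashboard accepts the standard Lucene AND operator.

```python
def service_query(service: str) -> str:
    """Wildcard match on the service name, as in the first example above."""
    return f"service.name:*{service}*"


def status_range_query(low: int, high: int) -> str:
    """Inclusive range on the HTTP status code, as in the second example above."""
    return f"labels.status_code:[{low} TO {high}]"


def all_of(*clauses: str) -> str:
    """Join clauses with AND, parenthesising each one to keep precedence explicit."""
    return " AND ".join(f"({clause})" for clause in clauses)


print(all_of(service_query("foo"), status_range_query(500, 599)))
```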

Probes defined by prometheus::blackbox::check::http in puppet (job probes/custom) are slightly different but act essentially the same. Links to logs and dashboards work as expected. There is one blackbox module defined per probe (module label), specifying e.g. which headers to send, SNI, etc., while the instance label is set to the hostname against which the probe is run.

PingUnreachable

The device(s) in question is failing its ICMP probes, in other words the device appears unreachable when pinged from one or multiple sites. Consult the ICMP dashboard to drill down into the impact.

DNSUnavailable

The probe in question (module label) is failing when run against one of the DNS servers. The failure could be network-related (e.g. unreachability) or protocol-related (generic failure, we're not serving the expected response, etc). Consult the DNS dashboard to assess the impact, and also check the "logs" link attached to the alert.

Syslog

Some of the syslog messages seen across the infra and their fix or workaround are listed on https://phabricator.wikimedia.org/T174397

Operations

netmon failover

TODO