Network monitoring

Monitoring resources

Tool                | Auth      | Alerts                           | Link
LibreNMS            | LDAP      |                                  | https://librenms.wikimedia.org/
Smokeping           | Open      |                                  | https://smokeping.wikimedia.org/
Prometheus          | Open      |                                  | https://grafana.wikimedia.org/dashboard/db/network-performances-global
Icinga              | LDAP      | Network monitoring#Icinga alerts | https://icinga.wikimedia.org/icinga/
Logstash            | LDAP      |                                  | https://logstash.wikimedia.org/app/kibana#/dashboard/6bcd2a10-7d21-11e7-86fb-51c84229aeb7
External monitoring | Open      |                                  | https://status.wikimedia.org/ (see bug T199816)
RIPE Atlas          | Semi-open |                                  | https://atlas.ripe.net
Rancid              | Internal  |                                  | N/A
BGPmon              | External  | Network monitoring#BGPmon alerts | https://bgpmon.net/
RIPE RPKI           | External  | Network monitoring#RIPE alerts   | https://my.ripe.net/#/rpki

Runbooks

Icinga alerts

host (ipv6) down

  • If service impacting (e.g. full switch stack down):
    1. Depool the site if possible
    2. Ping/page netops
  • If not service impacting (e.g. loss of redundancy, management network):
    1. Decide if depooling the site is necessary
    2. Ping and open a high priority task for netops

Router interface down

Example

CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>

The part that interests us is the one between the <BR> tags. In this example:

  • Interface name is xe-3/2/3
  • Description is Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]
    • Type is Core; other types are for example: Peering, Transit, OOB.
    • The other side of the link is cr2-codfw:xe-5/0/1
    • The circuit is operated by Zayo, with the aforementioned circuit ID
    • The remaining information is optional (latency, speed, cable#)
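If you have CLI access to the router named in the alert, you can confirm the interface state and its description from the Junos operational prompt. A minimal sketch, using the interface from the example above (substitute the one from your alert):

  # link/admin state of the interface
  show interfaces xe-3/2/3 terse
  # its description, i.e. the Type/remote-end/provider string explained above
  show interfaces descriptions | match xe-3/2/3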

If such an alert shows up:

First, all links are redundant, but don't hesitate to depool the site if it's showing signs of a larger outage.

Identify the type of interface going down

  • 3rd party provider: Type can be Core/Transit/Peering/OOB, with a provider name identifiable and present on that list
  • Internal link: Type is Core, no provider name listed

If 3rd party provider link

  1. Verify whether the provider has a planned maintenance for that circuit ID on the maintenance calendar
  2. Verify whether the provider sent a last-minute maintenance or outage email notification
  • If scheduled, or the provider is aware of the incident:
    1. Downtime the alert for the duration of the maintenance
    2. Monitor that no other links are going down (risk of total loss of redundancy)
  • If unplanned:
    1. Open a phabricator task, tag netops, include the alert and timestamp
    2. Contact the provider using the information present on that list; make sure to include the circuit ID and the time the outage started
    3. If needed, escalate to netops
    4. Monitor for recovery; if no reply to the email within 30min, call them
    5. Close the task if there is a quick recovery

If internal link

  1. Open a phabricator task, tag netops and dcops, include the alert and timestamp
  2. Most likely the optic needs to be replaced on one of the ends.
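
To help narrow it down before hardware gets swapped, the optical light levels can be read from the router CLI. A sketch, again using the example interface (swap in the one from the alert):

  # TX/RX optical power and alarm thresholds for the suspect optic
  show interfaces diagnostics optics xe-3/2/3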

Juniper alarm

  • If warning/yellow: open a phabricator task, tag netops
  • If critical/red: open a phabricator task, tag netops, ping/page netops

You can get more information about the alarm by issuing the command show system alarms on the device.
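
Chassis-level problems (fans, power supplies, temperature) show up in the chassis alarm list rather than the system one, so it is worth checking both. A minimal sketch, assuming CLI access to the device:

  # system alarms (the command mentioned above)
  show system alarms
  # chassis/hardware alarms (fans, PEMs, temperature, ...)
  show chassis alarms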

BGP status

  • If warning/yellow: open a phabricator task, tag/ping netops.
    • This is most likely an IXP peer session down
  • If critical/red: treat it similarly to a router interface down.
    1. Identify the peer name: in a terminal, type `whois AS#####` (see the example after this list) or look up the AS number on http://peeringdb.com/
    2. Follow the router interface down instructions.
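
For example, to identify the peer behind an AS number taken from the alert (the AS number below is only an illustration):

  # the as-name / descr / org fields in the answer identify the peer
  whois AS1299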

Atlas alerts

Example:

PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 22 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map

This one is a bit more complex, as it usually needs some digging to find out where exactly the issue is.

It means there is an issue somewhere between the RIPE Atlas constellation, the "in-between" transit providers, our providers, and our network.

As a rule of thumb though:

First, monitor real traffic usage (e.g. on https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?orgId=1) and be ready to depool the site (when possible) and page Netops if there are signs of a larger issue.

  • If a high number of probes fail (e.g. >75%), or if both IPv4 and IPv6 are failing simultaneously, with no quick recovery (~5min), it is less likely to be a false positive: ping Netops
  • If it is flapping with a number of failing probes close to the threshold, it is possibly a false positive: monitor/downtime and open a high priority Netops task
  • If it matches an (un)scheduled provider maintenance, it is possibly a side effect: if there is no quick recovery, page Netops to potentially drain that specific link
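
To dig into which probes are failing, the measurement ID from the alert URL can be queried from the RIPE Atlas API. A rough sketch using the v2 results endpoint (double-check parameters and output fields against the current Atlas API documentation):

  # fetch results for the measurement referenced in the example alert (1790947)
  # and count how many probe results came back
  curl -s 'https://atlas.ripe.net/api/v2/measurements/1790947/results/?format=json' | jq length
  # inspect individual entries to see which probes / source networks are failing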

BGPmon alerts

RPKI Validation Failed

  1. Verify that the alert isn't a false positive (see the check sketched after this list)
  2. If the alert seems genuine, escalate to netops, as it might mean:
    • that prefix is being hijacked (voluntarily or not)
    • a misconfiguration on our side, which can result in sub-optimal routing for that prefix
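
One way to cross-check the RPKI status of an announcement is the RIPEstat rpki-validation data call. A sketch using our ASN (AS14907) and one of our prefixes as an example; the exact fields in the reply may differ, so refer to the RIPEstat documentation:

  # does RIPEstat consider the announcement of this prefix by AS14907 RPKI-valid?
  curl -s 'https://stat.ripe.net/data/rpki-validation/data.json?resource=AS14907&prefix=208.80.154.0/23' | jq .data
  # look for the validation status and the matching ROAs in the output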

RIPE alerts

Resource Certification (RPKI) alerts

See BGPmon/RPKI; you can use the other validation methods listed below.

Note that RIPE will only alert for the prefixes it is in charge of. See IP and AS allocations for the list.

LibreNMS alerts

The list of current alerts is on https://librenms.wikimedia.org/alerts/

Unless stated otherwise, open a tracking task for netops, then ack the alert (on the page above). Page if it's causing larger issues (or if you have any doubt).

If an alert is too noisy, you can mute it on https://librenms.wikimedia.org/alert-rules/: edit the alert and flip the "mute" switch.

Primary outbound port utilization over 80%

The interface description will begin with the type of link saturating (or close to saturation).

  • Transit or peering: usually means someone (e.g. T192688) is sending us lots of queries whose replies are saturating an outbound link
    1. Identify which source IP or prefix (webrequest logs, etc)
    2. Rate limit, block, or temporarily move traffic to another DC (e.g. with DNS)
    3. Contact the offender
  • Core: usually a heavy cross-DC transfer
    1. Identify who started the transfer (SAL, IRC), or which hosts are involved (manually dig down in LibreNMS's graphs)
    2. Ask them to stop or rate limit their transfer

Sensor over limit

Can mean a lot of things; often faulty optics.

Juniper environment status

Often a faulty part in a Juniper device (e.g. power supply).
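
A couple of operational commands can point at the failing part before the task is opened. A sketch, assuming CLI access to the device:

  # temperature, fan and power supply status
  show chassis environment
  # installed hardware inventory (useful to identify the part to RMA)
  show chassis hardware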

Juniper alarm active

See Network monitoring#Juniper alarm

DC uplink low traffic

Means that a still-active link saw its outbound traffic drop. Can mean that something is wrong with the device or routing.

Ensure that the site has proper connectivity. Depool the site if not.
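
One quick sanity check is a report-mode traceroute toward a host you know is at the affected site (the hostname below is a placeholder, not a real host):

  # 10 probes per hop, wide report with AS numbers
  mtr -rwzc 10 <host-at-affected-site>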

Processor usage over 85% or Memory over 85%

  1. Gather data by issuing the commands show system processes summary and show chassis routing-engine
  2. Watch the site for other signs of malfunctions
  3. If no quick recovery (~30min), escalate to netops

Storage /var over 50% or Storage over 85%

  1. Look for core dumps with show system core-dumps
    • If there are any, they need to be escalated to JTAC
  2. Look for other large files in /var/tmp and /var/log
  3. If it is normal growth, clean up storage with request system storage cleanup
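
A possible sequence from the router CLI (a sketch; the dry-run step lets you review what would be deleted before cleaning up):

  # current filesystem usage
  show system storage
  # any core dumps? escalate to JTAC if so
  show system core-dumps
  # look for large leftover files
  file list detail /var/tmp/
  # preview, then perform, the cleanup
  request system storage cleanup dry-run
  request system storage cleanup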

Critical or emergency syslog messages

Escalate to netops, watch the site for other signs of failure.

Inbound/outbound interface errors

Usually means a faulty optic, cable, or port.

If you can connect to the network device, run show interfaces <interface> extensive | match error in order to have more information on the errors.

  • If a server, notify its owner and assign the task to DCops.
  • If a core/transit/peering/etc link, look for any provider maintenance notification (expected or not).
    • If none, assign the task to DCops, CC netops.
    • If any, wait for maintenance to end, watch for other signs of failures

Similar task: T203576

CDR bill over 75% used

Traffic might need to be steered.

Poller is taking too long

Might indicate a connectivity issue to the device's mgmt interface, or an issue with its SNMP daemon.
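
To tell the two cases apart, check whether the mgmt interface answers at all, then whether its SNMP daemon does. A sketch; the device name and community string are placeholders:

  # is the mgmt interface reachable?
  ping -c3 <device-mgmt-fqdn>
  # is the SNMP daemon answering? (sysUpTime, numeric OID)
  snmpget -v2c -c <community> <device-mgmt-fqdn> 1.3.6.1.2.1.1.3.0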

BGP peer above prefix limit

Could either mean:

  • Peer had a misconfiguration and started to export prefixes that it shouldn't:
    1. Wait a few hours
    2. Clear the BGP session: clear bgp neighbor <IP>
    3. If it is still triggering the limit, keep it down and contact the peer (info in PeeringDB)
  • Peer has naturally grown past the current limit:
    1. Identify the faulty peer (IP and ASN in the message)
    2. Show the current limit for that peer: show configuration protocols bgp
    3. Get their recommended limit on PeeringDB
    4. Update the configuration
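
A quick way to check the configured limit and the peer state from the router CLI (a sketch; use the peer IP from the alert):

  # configured prefix limits, in set-style output for easy grepping
  show configuration protocols bgp | display set | match prefix-limit
  # session state and received prefix count for that peer
  show bgp neighbor <peer-IP>
  # once the limit is raised (or the peer has fixed its export), reset the session
  clear bgp neighbor <peer-IP>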

Port with no description on access switch

Open DCops task to update description or disable port.

Port down

Open DCops task to investigate.

Traffic on tunnel link

Means that all links to a site are down and traffic is going through the last resort path.

  1. Escalate to netops
  2. Depool site
  3. Watch for provider maintenance notification

Duplicate IP on mgmt network

Open a task for DCops to investigate/fix.

Syslog

Some of the syslog messages seen across the infra, and their fixes or workarounds, are listed on https://phabricator.wikimedia.org/T174397