Network monitoring


Monitoring resources

Tool                 Auth       Alerts                            Link
LibreNMS             LDAP       -                                 https://librenms.wikimedia.org/
Smokeping            Open       -                                 https://smokeping.wikimedia.org/
Prometheus           Open       -                                 https://grafana.wikimedia.org/dashboard/db/network-performances-global
Icinga               LDAP       Network monitoring#Icinga alerts  https://icinga.wikimedia.org/icinga/
Logstash             LDAP       -                                 https://logstash.wikimedia.org/app/kibana#/dashboard/6bcd2a10-7d21-11e7-86fb-51c84229aeb7
External monitoring  Open       -                                 https://status.wikimedia.org/
RIPE Atlas           Semi-open  -                                 https://atlas.ripe.net
Rancid               Internal   -                                 N/A
BGPmon               External   Network monitoring#BGPmon alerts  https://bgpmon.net/
RIPE RPKI            External   Network monitoring#RIPE Alerts    https://my.ripe.net/#/rpki

Runbooks

Icinga alerts

host (ipv6) down

  • If service-impacting (e.g. full switch stack down):
    1. Depool the site if possible
    2. Ping/page netops
  • If not service-impacting (e.g. loss of redundancy, management network):
    1. Decide whether depooling the site is necessary
    2. Ping netops and open a high-priority task for them

Router interface down

Example

CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>

The part that interests us is the one between the <BR> tags. In this example:

  • Interface name is xe-3/2/3
  • Description is Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]
    • Type is Core, other types are for example: Peering, Transit, OOB.
    • The other side of the link is cr2-codfw:xe-5/0/1
    • The circuit is operated by Zayo, and the circuit ID follows the provider name
    • The remaining information is optional (latency, cable number, speed)
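The breakdown above can be sketched as a small parser. This is only an illustration that assumes every description follows the shape of the example (`Type: remote (Provider, circuitID) latency {#cable} [speed]`). The names `InterfaceDown` and `parse_alert` are hypothetical and not part of Icinga or any existing tooling, and descriptions with a different layout are returned with only the interface name and raw description filled in.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# One "<interface>: down -> <description>" segment between <BR> tags.
SEGMENT_RE = re.compile(r"^(?P<name>\S+): down -> (?P<description>.+)$")

# Assumed description layout, modelled on the single example in this page.
DESCRIPTION_RE = re.compile(
    r"^(?P<type>\w+):\s*"
    r"(?P<remote>[^\s(]+)?\s*"
    r"(?:\((?P<provider>[^,)]+)(?:,\s*(?P<circuit_id>[^)]+))?\))?\s*"
    r"(?P<latency>\d+ms)?\s*"
    r"(?:\{#(?P<cable>\d+)\})?\s*"
    r"(?:\[(?P<speed>[^\]]+)\])?\s*$"
)


@dataclass
class InterfaceDown:
    """Hypothetical container for one interface reported as down."""
    name: str
    description: str
    link_type: Optional[str] = None
    remote: Optional[str] = None
    provider: Optional[str] = None
    circuit_id: Optional[str] = None
    latency: Optional[str] = None
    cable: Optional[str] = None
    speed: Optional[str] = None


def parse_alert(output: str) -> List[InterfaceDown]:
    """Extract the down interfaces from the raw alert output."""
    results = []
    # The summary line and the empty trailing segment do not match
    # SEGMENT_RE, so they are skipped automatically.
    for segment in output.split("<BR>"):
        seg_match = SEGMENT_RE.match(segment.strip())
        if seg_match is None:
            continue
        entry = InterfaceDown(
            name=seg_match.group("name"),
            description=seg_match.group("description").strip(),
        )
        desc_match = DESCRIPTION_RE.match(entry.description)
        if desc_match is not None:
            entry.link_type = desc_match.group("type")
            entry.remote = desc_match.group("remote")
            provider = desc_match.group("provider")
            circuit_id = desc_match.group("circuit_id")
            entry.provider = provider.strip() if provider else None
            entry.circuit_id = circuit_id.strip() if circuit_id else None
            entry.latency = desc_match.group("latency")
            entry.cable = desc_match.group("cable")
            entry.speed = desc_match.group("speed")
        results.append(entry)
    return results
```

Calling `parse_alert` on the example alert yields one entry whose `name` is `xe-3/2/3`, `link_type` is `Core` and `provider` is `Zayo`.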

If such an alert shows up:

First, note that all links are redundant. Still, don't hesitate to depool the site if it shows signs of a larger outage.

Identify the type of interface going down

  • 3rd party provider: Type can be Core/Transit/Peering/OOB, and an identifiable provider name is present on that list
  • Internal link: Type is Core, no provider name listed
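The two cases above can be expressed as a short classification sketch. It assumes the provider name, when present, appears in parentheses as in the example description. `KNOWN_PROVIDERS` is a hypothetical stand-in for the provider list referenced above, and `classify_link` is an illustrative name, not an existing tool. Anything that fits neither rule is reported as `unknown` so that a human can decide.

```python
import re

# Hypothetical stand-in for the provider list referenced in this page.
KNOWN_PROVIDERS = {"Zayo"}

# Link types that may be carried by a 3rd party provider.
THIRD_PARTY_TYPES = {"Core", "Transit", "Peering", "OOB"}


def classify_link(description, known_providers=KNOWN_PROVIDERS):
    """Return 'third-party', 'internal' or 'unknown' for a description."""
    # The type is the text before the first colon, e.g. "Core".
    link_type = description.split(":", 1)[0].strip()
    # The provider, if any, is the first item inside parentheses.
    match = re.search(r"\(([^,)]+)", description)
    provider = match.group(1).strip() if match else None

    if (
        provider is not None
        and provider in known_providers
        and link_type in THIRD_PARTY_TYPES
    ):
        return "third-party"
    if link_type == "Core" and provider is None:
        return "internal"
    # Unlisted provider or unexpected type: leave the decision to a human.
    return "unknown"
```

For the example description, `classify_link` returns `third-party`. A Core description without a parenthesized provider is classified as `internal`.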

If 3rd party provider link

  1. Check whether the provider has a planned maintenance for that circuit ID on the maintenance calendar
  2. Check whether the provider sent a last-minute maintenance or outage email notification
  • If scheduled, or if the provider is aware of the incident
    1. Downtime the alert for the duration of the maintenance
    2. Monitor that no other links are going down (risk of total loss of redundancy)
  • If unplanned
    1. Open a Phabricator task, tag netops, include the alert and timestamp
    2. Contact the provider using the information present on that list, making sure to include the circuit ID and the time when the outage started
    3. If needed, escalate to netops
    4. Monitor for recovery; if there is no reply to the email within 30 minutes, call them
    5. Close the task if the link recovers quickly

If internal link

  1. Open a phabricator task, tag netops and dcops, include the alert and timestamp
  2. Most likely, the optic needs to be replaced on one of the ends.

Juniper alarm

  • If warning/yellow: open a phabricator task, tag netops
  • If critical/red: open a phabricator task, tag netops, ping/page netops

BGP status

  • If warning/yellow: open a phabricator task, tag/ping netops.
    • This is most likely an IXP peer session down
  • If critical/red: treat it like a router interface down alert.
    1. Identify the peer name: in a terminal type `whois as#####` or lookup the AS number on http://peeringdb.com/
    2. Follow the router interface down instructions.
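The `whois` lookup in step 1 can be scripted. The following is a minimal sketch, assuming a standard `whois` client is installed on the machine running it. The function names are hypothetical, and the AS number used in the usage note comes from the range reserved for documentation.

```python
import re
import subprocess


def normalize_asn(value):
    """Accept '64496', 'as64496' or 'AS64496' and return the integer."""
    match = re.fullmatch(r"\s*(?:AS)?(\d+)\s*", str(value), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not an AS number: {value!r}")
    asn = int(match.group(1))
    # AS numbers are 32-bit values; 0 is reserved.
    if not 0 < asn < 2**32:
        raise ValueError(f"AS number out of range: {asn}")
    return asn


def whois_command(value):
    """Build the argument list for the whois lookup described above."""
    return ["whois", f"AS{normalize_asn(value)}"]


def lookup_peer(value, timeout=10.0):
    """Run whois and return its raw text output (requires network access)."""
    result = subprocess.run(
        whois_command(value),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.stdout
```

For instance, `whois_command("as64496")` builds `["whois", "AS64496"]`. The raw output of `lookup_peer` still has to be read by a human to identify the peer name.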

BGPmon alerts

RPKI Validation Failed

  1. Verify that the alert isn't a false positive
  2. If the alert seems genuine, escalate to netops, as it might mean either of the following:
    • That prefix is being hijacked (voluntarily or not)
    • A misconfiguration on our side, which can result in sub-optimal routing for that prefix

RIPE Alerts

Resource Certification (RPKI) alerts

See BGPmon/RPKI. You can also use the other validation methods listed below.

Note that RIPE will only alert for the prefixes it is in charge of. See IP and AS allocations for the list.

Syslog

Some of the syslog messages seen across the infrastructure, along with their fixes or workarounds, are listed at https://phabricator.wikimedia.org/T174397