
Network telemetry


Work in progress.

The short-term goal is to complement LibreNMS for some metrics (whether or not they are exposed via SNMP), as well as to provide more real-time data for critical metrics (LibreNMS has a 5-minute granularity).

The long-term goal is to replace LibreNMS.

Example dashboards

https://grafana.wikimedia.org/d/iUATvNzSz/network-queues

https://grafana.wikimedia.org/d/5p97dAASz/cathal-network-queue-stats

Infrastructure

Network devices (exporters)

Use the sre.network.tls cookbook to create or update the TLS certificate.

netflow VMs (collectors)

gNMIc is the cornerstone of this pipeline. It connects to the network devices in its area of influence (e.g. the same site), asks them to send it the relevant metrics, optionally transforms them, then exposes them as a Prometheus endpoint.

Configuration

The configuration of gnmic is driven by Puppet, using the class profile::gnmi_telemetry, which is currently enabled for the "netinsights" role applied to our netflow VMs.

The gnmic configuration on the VMs themselves is a YAML file, built directly from the data in our Puppet repo at hieradata/common/profile/gnmi_telemetry.yaml. The configuration has four main elements:
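At a high level the file is organised as follows (an empty skeleton, just to show the layout; the real content of each section is described below):

# /etc/gnmic.yaml (skeleton only)
subscriptions: {}   # what data to ask the devices for
processors: {}      # optional transformations of the received data
targets: {}         # which devices to connect to, and with which subscriptions
outputs: {}         # where the collected data goes (the Prometheus endpoint)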

Subscriptions

The subscriptions define the gNMI path(s) to subscribe to on the network device targets, and the parameters to use for the connection. Key parameters we set include:

encoding: This is set to 'proto' to use protobuf encoding.

sampling-interval: We have the sampling-interval set to 60 seconds, which matches how often the Prometheus servers connect to gnmic to pull in the data. In general these two values should always match (there is no point in gnmic collecting data more frequently than we store it in the database, or vice versa).
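As an illustration, a subscription in /etc/gnmic.yaml looks roughly like the following (a minimal sketch; the paths shown are examples, the real values live in hieradata/common/profile/gnmi_telemetry.yaml):

subscriptions:
  interfaces-states:
    paths:
      - /interfaces/interface/state    # example path
    mode: stream
    stream-mode: sample
    sample-interval: 60s               # matches the Prometheus scrape interval
    encoding: proto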

Processors

These can be used to process the raw data received by gnmic as defined in a subscription. They allow for things like grouping metrics, adding additional metadata, or dropping some measurements.
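For example, a processor dropping an unwanted measurement could look roughly like this (a hedged sketch; the processor name and regex are made up, and event-drop is only one of the processor types gnmic offers):

processors:
  drop-unwanted:                       # hypothetical processor name
    event-drop:
      value-names:
        - "^some-unwanted-counter$"    # hypothetical value name to drop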

Targets

These are the devices to connect to and what subscriptions to enable for each.

The list is generated from the profile::netbox::data::network_devices Hiera key, which is generated from Netbox using the sre.puppet.sync-netbox-hiera cookbook.

If a device is missing, make sure its status is set to Active in Netbox and that the cookbook has been run. Also check that you're looking at the collector local to the device you're looking for.
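A target entry generated from that data looks roughly like this (the device name, port and subscription list are illustrative):

targets:
  "cr1-eqiad.wikimedia.org:<gnmi-port>":   # illustrative device and port
    subscriptions:
      - interfaces-states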

Outputs

Finally, the outputs define what to do with the collected data. We have one output enabled, the prometheus output. When enabled, this causes gnmic to run a local web server and expose the measurements received from the network device subscriptions as Prometheus metrics over HTTP. Some important options we have set here are:

export-timestamps: Setting this to true ensures the prometheus metrics are exported with the timestamp of when the data was received from the router. This ensures it is entered in the Prometheus database with the correct time rather than the time it was requested by Prometheus.

timeout: This sets an upper limit on how long the gnmic process will spend returning Prometheus metrics when they are requested over HTTP. It defaults to 10 seconds, but as the number of metrics grew this was no longer enough in codfw, and we had to increase it so that all metrics could be sent within the limit. In general this value should always match the scrape_timeout configured on the Prometheus server for this job.
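Put together, the prometheus output section looks roughly like this (the listen port matches the one used in the troubleshooting examples below; the output name and timeout value are only illustrative):

outputs:
  prom:                       # illustrative output name
    type: prometheus
    listen: ":9804"
    export-timestamps: true   # keep the device's timestamp rather than the scrape time
    timeout: 45s              # illustrative; should match Prometheus's scrape_timeout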


Prometheus Configuration

The Prometheus configuration is defined in the same way as any other scraping job within our "ops" configuration at each site.

A key value we configure here is the scrape_timeout attribute. This needs to be long enough to allow all the metrics to be served by the gnmic output. In general this value should always match the timeout configured for the gnmic prometheus output.
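A sketch of such a scrape job, with hypothetical names and values, just to show where the two settings live:

scrape_configs:
  - job_name: gnmic                 # hypothetical job name
    scrape_interval: 60s            # matches the gnmic sample-interval
    scrape_timeout: 45s             # should match the gnmic prometheus output timeout
    static_configs:
      - targets: ['netflow1002:9804']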

Monitoring the monitoring

Network devices gNMI endpoint monitoring

We use two different monitoring methods for the gNMI endpoints.

First, the Prometheus blackbox exporter. However, its gRPC check doesn't work with gNMI ("rpc error: code = Unavailable desc = JGrpcServer: Unknown RPC /grpc.health.v1.Health/Check received"), which is why we use a custom TCP check that also verifies the TLS certificate's expiration.

Note that the device's TLS certificate doesn't include the whole chain (see also https://phabricator.wikimedia.org/T375513), so we have to pass the network root CA as parameter to the check.
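In blackbox exporter terms, such a check corresponds roughly to a TCP module with TLS enabled and the network root CA supplied, along these lines (the module name is hypothetical, and the CA path shown is the one used on the netflow hosts, which may differ on the Prometheus side):

modules:
  gnmi_tls:                     # hypothetical module name
    prober: tcp
    tcp:
      tls: true
      tls_config:
        ca_file: /etc/ssl/localcerts/network_devices_bundle.pem   # network root CA bundle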

Probe results are available in https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?orgId=1&var-job=probes%2Fgrpc&var-module=All&var-site=All and automatically benefit from the existing alerting.

If a target on this page (or a similar one) says "DOWN", it means blackbox can't establish a TCP handshake: https://prometheus-eqiad.wikimedia.org/ops/targets?scrapePool=probes%2Fgrpc&search=

If there is any issue, it's also possible to filter for "service.name:gnmi_connect" in https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2

Second (and more recently), gNMIc exports its subscription status as Prometheus metrics, which can be monitored in the dashboard below.

We should eventually settle on a single monitoring method to prevent duplicated effort.

gNMIc monitoring

We also collect gNMIc health data from its dedicated (API) Prometheus endpoint.

gNMIc health dashboard: https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&var-site=All

Troubleshooting

Get the currently exposed TLS cert

openssl s_client -showcerts -connect <fqdn>:<port> 2>/dev/null | openssl x509 -text

Validate the currently exposed TLS cert

From the (netflow) host running gnmic:

openssl s_client -showcerts -connect <fqdn>:<port> 2>/dev/null | openssl x509 | tee /tmp/device.pem

openssl verify -CAfile /etc/ssl/localcerts/network_devices_bundle.pem /tmp/device.pem

Get the currently exposed Prometheus metrics

prometheus1005:~$ curl netflow1002:9804/metrics

Run gnmic manually (debug mode)

sudo service gnmic stop && sudo -u gnmic /usr/local/bin/gnmic --config /etc/gnmic.yaml subscribe --debug

Using --log instead of --debug is less verbose.

Show the Juniper gRPC daemon's status

show extension-service request-response servers

Check Juniper "Analytics Agent" is running correctly on a target device

Sometimes gnmic cannot subscribe to stats for a device, with errors like the following shown when it's run in debug mode:

subscription interfaces-states rcv error: rpc error: code = Unavailable

This can occur if the JunOS "analytics agent" (agentd) service isn't working correctly. You can see if this is the case by running:

show agent sensors

The system should return a list of sensors and information about them in response to this command; if it doesn't, that is likely the issue. To fix this, restart the service:

restart analytics-agent gracefully

Future improvements

History

https://phabricator.wikimedia.org/T326322 - Add per-output queue monitoring for Juniper network devices - (Main tracking task)

https://phabricator.wikimedia.org/T334594 - TLS certificates for network devices

https://phabricator.wikimedia.org/phame/post/view/304/multi-platform_network_configuration/