Network telemetry

From Wikitech

Work in progress.

The short-term goal is to complement LibreNMS for some metrics (whether or not they are exposed over SNMP), and to provide more real-time data for critical metrics (LibreNMS has a 5-minute granularity).

The long-term goal is to replace LibreNMS.

Example dashboard

https://grafana.wikimedia.org/d/iUATvNzSz/network-queues

Infrastructure

Network devices (exporters)

Use the sre.network.tls cookbook to create or update the TLS certificate.

netflow VMs (collectors)

gNMIc is the cornerstone of this pipeline. It connects to the network devices in its area of influence (e.g. the same site), asks them to send it relevant metrics, optionally mangles them, then exposes them as a Prometheus endpoint.
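The subscribe-then-expose flow described above can be sketched with a minimal gnmic configuration. This is an illustration only, not the production /etc/gnmic.yaml: the device address, gRPC port, CA path, subscription name and OpenConfig path are all assumptions, and authentication settings are left out.

```shell
# Write a scratch config; every value below is a placeholder, not the production /etc/gnmic.yaml.
cfg="${TMPDIR:-/tmp}/gnmic-sketch.yaml"
cat > "$cfg" <<'EOF'
targets:
  router.example.net:32767:          # hypothetical device address and gRPC port
    tls-ca: /path/to/ca.pem          # hypothetical CA bundle used to verify the device certificate
    subscriptions:
      - interface-counters
subscriptions:
  interface-counters:                # hypothetical subscription name
    paths:
      - /interfaces/interface/state/counters   # assumed OpenConfig path
    mode: stream
    stream-mode: sample
    sample-interval: 10s
outputs:
  prom:
    type: prometheus
    listen: ":9804"                  # same port as the curl example on this page
    path: /metrics
EOF

# To try it by hand (needs gnmic installed and a reachable device):
#   gnmic --config "$cfg" subscribe
```

A short sample-interval is what gives finer granularity than a 5-minute poll; the 10s value here is only an example.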

Troubleshooting

Get the currently exposed TLS cert

openssl s_client -showcerts -connect <fqdn>:<port> 2>/dev/null | openssl x509 -text
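When the question is only whether the certificate is the expected one and how long it remains valid, the full text dump can be narrowed down. The two helpers below are hypothetical names introduced for illustration; they rely only on standard openssl x509 options and read a PEM certificate from stdin.

```shell
# Hypothetical helper: print subject, issuer and validity dates of the PEM certificate on stdin.
cert_summary() {
    openssl x509 -noout -subject -issuer -dates
}

# Hypothetical helper: succeed only if the certificate on stdin is still valid $1 days from now.
cert_valid_for_days() {
    openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}

# Usage against a live device (same placeholders as the command above):
#   openssl s_client -connect <fqdn>:<port> </dev/null 2>/dev/null | cert_summary
#   openssl s_client -connect <fqdn>:<port> </dev/null 2>/dev/null | cert_valid_for_days 30 || echo "expires within 30 days"
```

Redirecting stdin from /dev/null makes s_client exit once the handshake completes instead of waiting for input.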

Get the currently exposed Prometheus metrics

prometheus1005:~$ curl netflow1002:9804/metrics
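The raw output can be long, so it may help to filter it. The helpers below are a sketch with hypothetical names; they assume the standard Prometheus text exposition format, where comment lines start with # and each sample line starts with the metric name.

```shell
# Hypothetical helper: keep only samples whose metric name starts with prefix $1,
# dropping "# HELP" / "# TYPE" comment lines and blank lines.
filter_metrics() {
    grep -v -e '^#' -e '^$' | grep "^$1"
}

# Hypothetical helper: count samples per metric name (the part before '{' or the first space),
# most frequent first.
count_metrics() {
    grep -v -e '^#' -e '^$' | sed 's/[{ ].*//' | sort | uniq -c | sort -rn
}

# Usage, from a host that can reach the collector:
#   curl -s netflow1002:9804/metrics | count_metrics
#   curl -s netflow1002:9804/metrics | filter_metrics <metric_prefix>
```

An empty result from count_metrics suggests gnmic is running but not receiving data from any device.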

Run gnmic manually (debug mode)

sudo service gnmic stop && sudo -u gnmic /usr/local/bin/gnmic --config /etc/gnmic.yaml subscribe --debug

Using --log instead of --debug produces less verbose output.

Show the status of Juniper's gRPC daemon

show extension-service request-response servers

Future improvements

History

https://phabricator.wikimedia.org/T326322 - Add per-output queue monitoring for Juniper network devices - (Main tracking task)

https://phabricator.wikimedia.org/T334594 - TLS certificates for network devices

External links

https://phabricator.wikimedia.org/phame/post/view/304/multi-platform_network_configuration/