Network telemetry
Work in progress.
The short-term goal is to complement LibreNMS for some metrics (whether or not they are exposed via SNMP) and to provide closer to real-time data for critical metrics (LibreNMS has a 5-minute granularity).
The long-term goal is to replace LibreNMS.
Example dashboard
https://grafana.wikimedia.org/d/iUATvNzSz/network-queues
Infrastructure
Network devices (exporters)
Use the sre.network.tls cookbook to create or update the TLS certificate.
netflow VMs (collectors)
gNMIc is the cornerstone of this pipeline. It connects to the network devices in its area of influence (e.g. the same site), asks them to send it relevant metrics, optionally transforms them, then exposes them on a Prometheus endpoint.
Troubleshooting
Get the currently exposed TLS cert
openssl s_client -showcerts -connect <fqdn>:<port> 2>/dev/null | openssl x509 -text
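To check how long the exposed certificate remains valid, the same handshake can be scripted. The following is a minimal sketch using only the Python standard library, not part of the existing tooling. The function names and the `cafile` parameter are hypothetical, and it assumes the device certificate validates against a CA bundle available on the host running it.

```python
import socket
import ssl
import time


def days_left(not_after, now=None):
    """Return the number of days between `now` and a certificate's notAfter date.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2030 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def peer_not_after(host, port, cafile=None):
    """Connect to host:port over TLS and return the peer certificate's notAfter.

    Assumption: the certificate is signed by a CA trusted by the default
    context, or by the bundle passed as `cafile` (hypothetical parameter).
    """
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, `days_left(peer_not_after("<fqdn>", <port>))` would give the remaining validity in days, with the same `<fqdn>` and `<port>` placeholders as the openssl command above.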
Get the currently exposed Prometheus metrics
prometheus1005:~$ curl netflow1002:9804/metrics
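The endpoint returns every exported series, which can be a lot of output. As a rough sketch (not part of the existing tooling), the response can be fetched and filtered in Python. The function names and the idea of filtering on a metric name prefix are illustrative assumptions, and the parser only handles the basic Prometheus text exposition format.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, raw_labels, value) tuples.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. Labels are
    kept as the raw string between the braces, and an optional trailing
    timestamp is ignored.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            start = line.index("{")
            end = line.rindex("}")
            name = line[:start]
            labels = line[start + 1:end]
            fields = line[end + 1:].split()
            value = float(fields[0])
        else:
            fields = line.split()
            name = fields[0]
            labels = ""
            value = float(fields[1])
        samples.append((name, labels, value))
    return samples


def filter_by_prefix(samples, prefix):
    """Keep only the samples whose metric name starts with `prefix`."""
    return [sample for sample in samples if sample[0].startswith(prefix)]


def fetch_metrics(url, timeout=5):
    """Fetch a metrics endpoint and return its parsed samples."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

A call such as `filter_by_prefix(fetch_metrics("http://netflow1002:9804/metrics"), "<prefix>")` would then list only the matching series. The `http://` scheme is assumed here, and `<prefix>` stands for whichever metric name prefix is of interest.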
Run gnmic manually (debug mode)
sudo service gnmic stop && sudo -u gnmic /usr/local/bin/gnmic --config /etc/gnmic.yaml subscribe --debug
Using --log instead of --debug will be less verbose.
Show the status of Juniper's gRPC daemon
show extension-service request-response servers
Future improvements
- On Junos, once all devices are running > 22.2, use the device's PKI stack instead of storing the key/cert as a text blob: see use-pki in https://www.juniper.net/documentation/us/en/software/junos/interfaces-telemetry/topics/ref/statement/ssl-edit-system-services-grpc-jet.html
- In the sre.network.tls cookbook, use the timeout parameter once cumin hosts are running Python >= 3.10, to speed things up
- Expose BGP sessions
- Graph and alert on gNMIc's subscriptions once https://github.com/openconfig/gnmic/pull/89 is merged and released.
History
https://phabricator.wikimedia.org/T326322 - Add per-output queue monitoring for Juniper network devices - (Main tracking task)
https://phabricator.wikimedia.org/T334594 - TLS certificates for network devices
External links
https://phabricator.wikimedia.org/phame/post/view/304/multi-platform_network_configuration/