Portal:Cloud VPS/Admin/Runbooks/NovaFullstackStaleStats
Overview
The procedures in this runbook require admin permissions to complete.
Error / Incident
Some of the stats used for the novafullstack alerts have not been reported for some time.
Debugging
Check that the Prometheus series exist. Search for the alert name in the alerts.git repo and look for the expr line, which should look something like:
expr: count(cloudvps_novafullstack_instances_count) == 0 or count(cloudvps_novafullstack_instances_max) == 0
From there, any of the stats inside a count function (or all of them!) might be the one failing, so you can go to Thanos or Grafana and query each of them there.
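As a rough sketch of the step above, the following Python pulls the metric names out of an alert expression and builds one instant-query URL per metric, so each can be checked separately. It assumes the standard Prometheus HTTP API path /api/v1/query, which Thanos Query also serves. The base URL https://thanos.example.org is a placeholder, not the real endpoint, and the helper names are hypothetical.

```python
import re
from urllib.parse import urlencode


def metrics_in_expr(expr: str) -> list[str]:
    """Return the metric names wrapped in count(...), in order, without duplicates."""
    names = re.findall(r"count\(\s*([a-zA-Z_:][a-zA-Z0-9_:]*)", expr)
    return list(dict.fromkeys(names))


def instant_query_url(base_url: str, metric: str) -> str:
    """Build an instant-query URL for count(<metric>).

    base_url is an assumption: replace it with your Thanos or Prometheus endpoint.
    """
    query = urlencode({"query": f"count({metric})"})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


if __name__ == "__main__":
    expr = (
        "count(cloudvps_novafullstack_instances_count) == 0 "
        "or count(cloudvps_novafullstack_instances_max) == 0"
    )
    for metric in metrics_in_expr(expr):
        # Placeholder host; an empty result for a metric means that series is missing.
        print(instant_query_url("https://thanos.example.org", metric))
```

Opening each printed URL (or pasting the count(...) query into the Thanos or Grafana UI) shows which of the series has stopped being reported.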
This might mean:
- That the novafullstack service is misbehaving
- That the stat names changed
- That the service is down
- That the Prometheus stats are not being generated (usually under /var/lib/prometheus/node.d/novafullstack.prom on the cloudcontrol that is running the novafullstack service).
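To check the last possibility in the list above, a small script can report whether the stats file exists, how old it is, and which metric names it contains. This is only a sketch: the function name is hypothetical, the 15-minute staleness threshold is an assumption rather than a documented value, and it relies on the file being in the Prometheus text exposition format.

```python
import os
import time


def prom_file_status(path: str, max_age_seconds: float = 900.0) -> dict:
    """Summarise a stats file: existence, age, staleness and metric names.

    max_age_seconds defaults to 15 minutes, an assumed threshold; adjust it
    to match how often the novafullstack service writes its stats.
    """
    if not os.path.exists(path):
        return {"exists": False, "age_seconds": None, "stale": True, "metrics": []}

    age = time.time() - os.path.getmtime(path)
    metrics = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and "# HELP" / "# TYPE" comments.
            if not line or line.startswith("#"):
                continue
            # The metric name ends at the first "{" (labels) or whitespace.
            name = line.split("{", 1)[0].split()[0]
            if name not in metrics:
                metrics.append(name)

    return {
        "exists": True,
        "age_seconds": age,
        "stale": age > max_age_seconds,
        "metrics": metrics,
    }
```

Run it on the cloudcontrol host with the path of the stats file as the argument. A missing or stale file points at the service not writing stats, while a fresh file whose metric names differ from those in the alert expression points at renamed stats.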
Common issues
Add any new issues you find here.
Related information
Support contacts
Communication and support
Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia Movement volunteers. Please reach out with questions and join the conversation:
Discuss and receive general support
- Chat in real time in the IRC channel #wikimedia-cloud, the bridged Telegram group, or the bridged Mattermost channel
- Discuss via email after subscribing to the cloud@ mailing list
Receive mail announcements about critical changes
- Subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
Track work tasks and report bugs
- Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself
Learn about major near-term plans
- Read the News wiki page
Read news and stories about Wikimedia Cloud Services
- Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)
Old incidents
Add any tasks for incidents related to this alert here.