Portal:Toolforge/Admin/Runbooks/ToolsNFSDown
The ToolsNFSDown alert fires when the nfs-server
service is not running or cannot be found in the stats.
Error / Incident
If the value is 0, the service is down. If the value is -1, Prometheus is not gathering the stats correctly (the service might be down, we don't know).
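The two alert values can be summarized in a small shell helper. This is only an illustrative sketch: the function name `describe_alert_value` is hypothetical and is not part of the Toolforge tooling, and any value other than 0 or -1 is reported as not covered, since this runbook does not describe it.

```shell
#!/bin/sh
# Hypothetical helper: translate the alert value into the meaning
# given in this runbook. Only 0 and -1 are documented here.
describe_alert_value() {
    case "$1" in
        0)  printf '%s\n' "service is down" ;;
        -1) printf '%s\n' "stats missing, service state unknown" ;;
        *)  printf 'value %s is not covered by this runbook\n' "$1" ;;
    esac
}
```

For example, `describe_alert_value -1` points you to the section about missing stats rather than to the service itself.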
As of 2024-07-30, the NFS server is tools-nfs-2.tools.eqiad1.wikimedia.cloud.
Debugging
Check the service status
SSH to the server and check the service status:
dcaro@tools-nfs-2:~$ sudo systemctl status nfs-server.service
● nfs-server.service - NFS server and services
Loaded: loaded (/lib/systemd/system/nfs-server.service; enabled; vendor preset: enabled)
Active: active (exited) since Mon 2024-06-24 14:25:57 UTC; 1 months 5 days ago
Main PID: 721 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 77152)
Memory: 0B
CPU: 0
CGroup: /system.slice/nfs-server.service
Jun 24 14:25:55 tools-nfs-2 systemd[1]: Starting NFS server and services...
Jun 24 14:25:57 tools-nfs-2 systemd[1]: Finished NFS server and services.
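If you prefer a scriptable check over reading the full status output, the following sketch classifies the one-word state printed by `systemctl is-active`. The function names `classify_unit_state` and `check_unit` are hypothetical, and the mapping to "ok", "wait" and "down" is an assumption made for illustration, not an existing Toolforge convention. Note that a unit shown as "active (exited)", like the one above, is reported as `active`.

```shell
#!/bin/sh
# Hypothetical helper: map the state printed by `systemctl is-active`
# to a coarse verdict. The verdict names are illustrative only.
classify_unit_state() {
    case "$1" in
        active)               echo "ok" ;;
        activating|reloading) echo "wait" ;;
        "")                   echo "unknown" ;;  # no output, e.g. systemctl missing
        *)                    echo "down" ;;     # inactive, failed, deactivating, ...
    esac
}

# Run on the NFS server; defaults to the unit checked in this runbook.
check_unit() {
    unit="${1:-nfs-server.service}"
    # is-active exits non-zero for non-active units, so ignore the exit code
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    classify_unit_state "$state"
}
```

If `check_unit` prints "down", `sudo journalctl -u nfs-server.service -n 50 --no-pager` shows the most recent log lines of the unit, and `sudo exportfs -v` lists what the server is currently exporting.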
If there are no stats
This one is tricky, and it will be related to the way we gather metrics on tools/toolsbeta.
Note that this is not directly related to the metricsinfra monitoring project, but to Toolforge's own setup.
You can start by going to the project's Prometheus page and trying to query the stats there.
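The same check can be done from a terminal through the Prometheus HTTP API (`/api/v1/query`). The sketch below rests on several assumptions: `PROM_URL` is a placeholder for the project's Prometheus base URL, the function names are hypothetical, and the metric `node_systemd_unit_state` is the one exposed by node_exporter's systemd collector, which may not be the expression the alert actually uses.

```shell
#!/bin/sh
# ASSUMPTION: the metric name and labels are those of node_exporter's
# systemd collector; the real alert may use a different expression.
nfs_unit_query() {
    unit="${1:-nfs-server.service}"
    printf 'node_systemd_unit_state{name="%s",state="active"}\n' "$unit"
}

# PROM_URL is a placeholder: set it to the project's Prometheus base URL.
query_prometheus() {
    curl -sG --data-urlencode "query=$(nfs_unit_query "$1")" \
        "${PROM_URL:?set PROM_URL first}/api/v1/query"
}
```

An empty `result` array in the JSON response means Prometheus has no series matching that selector, which would be consistent with the stats not being gathered.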
Common issues
Add here any new common issues you find.
Related information
- Portal:Data_Services/Admin/Runbooks/Enable_NFS_for_a_project
- Portal:Data_Services/Admin/Shared_storage
Old incidents
Add here any new tasks for incidents you might encounter.