Portal:Cloud VPS/Admin/Runbooks/Check unit status of backup vms
Error / Incident
The backup_vms unit is failing, which means that the backups for virtual machines are not working as expected.
Debugging
Maintain-dbusers daemon
To gather logs, SSH to the affected host and check the status of the maintain-dbusers.service unit:
root@cloudcontrol1005:~# systemctl status maintain-dbusers
● maintain-dbusers.service - Maintain labsdb accounts
     Loaded: loaded (/lib/systemd/system/maintain-dbusers.service; static)
     Active: active (running) since Tue 2023-03-14 21:10:31 UTC; 12h ago
   Main PID: 3971369 (maintain-dbuser)
      Tasks: 1 (limit: 154192)
     Memory: 37.2M
     CGroup: /system.slice/maintain-dbusers.service
             └─3971369 /usr/bin/python3 /usr/local/sbin/maintain-dbusers maintain
Mar 15 09:27:11 cloudcontrol1005 /usr/local/sbin/maintain-dbusers[3971369]: INFO [root.inner:161] Skipping Account piccardi: Parent directory (/srv/tools/shared/tools/home/piccardi) does not exist yet, ...
Mar 15 09:27:12 cloudcontrol1005 /usr/local/sbin/maintain-dbusers[3971369]: INFO [root.inner:161] Skipping Account aar888: Parent directory (/srv/tools/shared/tools/home/aar888) does not exist yet, ...
...
You can also check the logs for a longer view:
root@cloudcontrol1005:~# journalctl -u maintain-dbusers
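When the journal is dominated by INFO entries such as the "Skipping Account" lines in the sample output, it can help to filter them out and read only what remains. The following is a minimal sketch, not part of the daemon's tooling: it assumes the log level appears as a separate word (INFO) as in the sample, and the function name non_info_lines is hypothetical.

```shell
# Hypothetical helper: print only journal lines that are not INFO-level.
# Assumes the level is logged as " INFO ", as in the sample output above.
non_info_lines() {
    # grep exits non-zero when nothing is left; "|| true" keeps that from
    # being treated as a failure by the calling shell.
    grep -v ' INFO ' || true
}

# Example usage on the affected host (not run here):
#   journalctl -u maintain-dbusers --since "1 hour ago" --no-pager | non_info_lines
```

If the filtered output is empty, the daemon logged nothing but INFO messages in that time window.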
replica_cnf API
This component is the one that writes the replica.my.cnf and .my.cnf files into the NFS filesystem. It runs on each NFS server.
To check it, SSH to the NFS server (e.g. paws-nfs-1.paws.eqiad1.wikimedia.cloud, labstore1004.eqiad.wmnet) and check the status of the uwsgi-toolsdb-replica-cnf-web service:
root@labstore1004:~# systemctl status uwsgi-toolsdb-replica-cnf-web.service
● uwsgi-toolsdb-replica-cnf-web.service - uwsgi-toolsdb-replica-cnf-web uwsgi app
     Loaded: loaded (/lib/systemd/system/uwsgi-toolsdb-replica-cnf-web.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-03-14 21:24:03 UTC; 12h ago
    Process: 58326 ExecStartPre=/bin/bash -c rm -rf /run/toolsdb-replica-cnf-metrics/* (code=exited, status=0/SUCCESS)
   Main PID: 58376 (uwsgi)
      Tasks: 9 (limit: 9830)
     CGroup: /system.slice/uwsgi-toolsdb-replica-cnf-web.service
             ├─58376 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/toolsdb-replica-cnf-web.ini
             ├─58406 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/toolsdb-replica-cnf-web.ini
             ├─58407 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/toolsdb-replica-cnf-web.ini
             ├─58408 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/toolsdb-replica-cnf-web.ini
             ├─58409 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/toolsdb-replica-cnf-web.ini
             ├─58410 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/toolsdb-replica-cnf-web.ini
             ├─58411 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/toolsdb-replica-cnf-web.ini
             ├─58412 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/toolsdb-replica-cnf-web.ini
             └─58413 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/toolsdb-replica-cnf-web.ini
Mar 15 09:27:11 labstore1004 uwsgi-toolsdb-replica-cnf-web[58376]: [pid: 58413|app: 0|req: 27370/77981] 208.80.154.85 () {42 vars in 766 bytes} [Wed Mar 15 09:27:11 2023] POST /v1/write-replica-cnf => generated 327 bytes in 1 msecs (HTTP/1.1 200) 2 headers in 72 bytes (3 switch
Mar 15 09:27:12 labstore1004 uwsgi-toolsdb-replica-cnf-web[58376]: [pid: 58413|app: 0|req: 27371/77982] 208.80.154.85 () {42 vars in 766 bytes} [Wed Mar 15 09:27:12 2023] POST /v1/write-replica-cnf => generated 321 bytes in 1 msecs (HTTP/1.1 200) 2 headers in 72 bytes (3 switch
Mar 15 09:27:12 labstore1004 uwsgi-toolsdb-replica-cnf-web[58376]: [pid: 58411|app: 0|req: 12138/77983] 208.80.154.85 () {42 vars in 766 bytes} [Wed Mar 15 09:27:12 2023] POST /v1/write-replica-cnf => generated 327 bytes in 1 msecs (HTTP/1.1 200) 2 headers in 72 bytes (3 switch
Mar 15 09:27:12 labstore1004 uwsgi-toolsdb-replica-cnf-web[58376]: [pid: 58411|app: 0|req: 12139/77984] 208.80.154.85 () {42 vars in 766 bytes} [Wed Mar 15 09:27:12 2023] POST /v1/write-replica-cnf => generated 336 bytes in 1 msecs (HTTP/1.1 200) 2 headers in 72 bytes (3 switch
Mar 15 09:27:12 labstore1004 uwsgi-toolsdb-replica-cnf-web[58376]: [pid: 58
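In the sample output every request to /v1/write-replica-cnf is logged with its HTTP status, e.g. "(HTTP/1.1 200)". To spot failing requests in a longer log, one option is to keep only the lines whose status is not 2xx. This is a sketch under the assumption that the request log keeps the format shown in the sample; the function name failed_requests is hypothetical.

```shell
# Hypothetical helper: keep only uwsgi request lines whose HTTP status is not 2xx.
# Assumes the "(HTTP/1.1 200)" request-log format shown in the sample output.
failed_requests() {
    # First grep keeps request lines, second grep drops the successful ones.
    grep -E '\(HTTP/[0-9.]+ [0-9]{3}\)' \
        | grep -vE '\(HTTP/[0-9.]+ 2[0-9]{2}\)' || true
}

# Example usage on the NFS server (not run here):
#   journalctl -u uwsgi-toolsdb-replica-cnf-web --since "1 hour ago" --no-pager | failed_requests
```

Lines that are not request logs (for example startup messages) are dropped by the first filter, so empty output only means that no non-2xx request was logged in that window.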
Common issues
Add here any issues you encounter.
Connectivity issues
The maintain-dbusers daemon connects to the replica_cnf API on the NFS servers, to the toolsdb database, and to every wikireplica database. For the list of IPs and the database user/credentials for each of them, check the file /etc/dbusers.yaml.
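A quick way to rule out a network problem is to test, from the host running maintain-dbusers, whether a TCP connection to each endpoint can be opened. The sketch below is an illustration only: the function name check_tcp is hypothetical, it relies on bash's /dev/tcp pseudo-device and the coreutils timeout command, and the host and port in the usage example are placeholders, since the layout of /etc/dbusers.yaml is not shown here.

```shell
# Hypothetical helper: report whether a TCP connection to host:port can be opened.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed on the host.
check_tcp() {
    host=$1
    port=$2
    # Give up after 3 seconds so a filtered port does not hang the check.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "$host:$port reachable"
    else
        echo "$host:$port unreachable"
    fi
}

# Example usage (placeholder host and port; take the real ones from /etc/dbusers.yaml):
#   check_tcp db.example.invalid 3306
```

A successful connection only shows that the port is open; authentication problems would still need to be checked in the daemon logs.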
Related information
- Portal:Data_Services/Admin/Shared_storage#maintain-dbusers
- Portal:Toolforge/Admin#Regenerate_replica.my.cnf
Old incidents
Add any new incidents here if you end up on this page.