Portal:Cloud VPS/Admin/Runbooks/Check unit status of backup vms
Error / Incident
The backup_vms unit is failing, which means that the backups for virtual machines are not working as expected.
Debugging
To gather logs, ssh to the host having the issue and check the backup_vms.service unit status:
root@cloudvirt1021:~# systemctl status backup_vms
● backup_vms.service - Backup vms assigned to this host
     Loaded: loaded (/lib/systemd/system/backup_vms.service; static)
     Active: failed (Result: exit-code) since Mon 2022-03-21 20:01:53 UTC; 17h ago
TriggeredBy: ● backup_vms.timer
    Process: 4099425 ExecStart=/usr/local/sbin/wmcs-backup instances backup-assigned-vms (code=exited, status=1/FAILURE)
   Main PID: 4099425 (code=exited, status=1/FAILURE)

Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     backup_entries = get_backups()
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 600, in get_backups
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     backup = subprocess.check_output([BACKY, "-ms", "ls"])
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     raise CalledProcessError(retcode, process.args,
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]: subprocess.CalledProcessError: Command '['/usr/bin/backy2', '-ms', 'ls']' returned non-zero exit status 100.
Mar 21 20:01:53 cloudvirt1021 systemd[1]: backup_vms.service: Main process exited, code=exited, status=1/FAILURE
Mar 21 20:01:53 cloudvirt1021 systemd[1]: backup_vms.service: Failed with result 'exit-code'.
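In the example above the traceback already names the command that actually failed (/usr/bin/backy2 -ms ls), so you can re-run it by hand to see backy2's own error output directly:

root@cloudvirt1021:~# /usr/bin/backy2 -ms ls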
You can also check the logs for a longer view:
root@cloudvirt1021:~# journalctl -u backup_vms
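The journal keeps logs from previous runs too; standard journalctl options help narrow things down, for example:

root@cloudvirt1021:~# journalctl -u backup_vms --since "2 days ago"
root@cloudvirt1021:~# journalctl -u backup_vms -n 100 --no-pager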
Common issues
Trying to back up a VM that is not there anymore
Currently the backups gather the list of VMs to back up at the start, so it might happen that one of them gets removed while the script is backing up other VMs. In that case the next run will refresh the VM list and fix the issue on its own.
There's currently a task to do something more elegant: task T276892.
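To confirm that this is what happened, one option (a sketch, not part of the original runbook) is to check whether the disk image named in the failing log line still exists in the Ceph pool; the pool and image names below are placeholders taken from the example logs further down:

root@cloudvirt1021:~# rbd -p eqiad1-compute ls | grep fff879b9-8300-4a11-9cf6-3d424be9ffa3

If it returns nothing, the VM (and its disk) is indeed gone and the next run should recover by itself.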
There's a new VM/project that should not be backed up
The projects/VMs lists control which ones do or do not get backed up; just add or remove the relevant entry there.
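The page doesn't say where those lists live, so the snippet below is purely hypothetical: it assumes a YAML config read by wmcs-backup, with file name and keys invented for illustration. Check the actual configuration source (e.g. puppet) for the real file and schema.

# HYPOTHETICAL example, file name and keys are invented for illustration
excluded_projects:
  - some-project                            # no VM in this project gets backed up
excluded_vms:
  - fff879b9-8300-4a11-9cf6-3d424be9ffa3    # single VM opted out of backups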
Disk full
If this happens, there's a simple workaround that might work: a full cleanup. First, make sure that backy will not run during the cleanup:
root@cloudvirt1021:~# systemctl stop backup_vms.timer
root@cloudvirt1021:~# systemctl status backup_vms.service  # wait for this to finish if it's running, probably not as it broke
Then run:
root@cloudvirt1021:~# backy2 cleanup --full
    INFO: [backy2.logging] $ /usr/bin/backy2 cleanup --full
    INFO: [backy2.logging] Cleanup: Removed 500709 blobs
    INFO: [backy2.logging] Backy complete.
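To check how much space the cleanup freed (assuming the backups live under /srv/backy, the same path that holds backy.sqlite elsewhere in this page):

root@cloudvirt1021:~# df -h /srv/backy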
Start the timer again:
root@cloudvirt1021:~# systemctl start backup_vms.timer
Corrupted database
This one usually presents as the following errors in the logs:
Mar 19 20:03:45 cloudvirt1021 wmcs-backup[2952831]: INFO:[2022-03-19 20:03:45,731] Creating full backup of pool:eqiad1-compute, image_name:fff879b9-8300-4a11-9cf6-3d424be9ffa3_disk, snapshot_name:2022-03-19T20:03:45_cloudvirt1021
Mar 19 20:04:10 cloudvirt1021 wmcs-backup[2954847]: ERROR: [backy2.logging] Couldn't parse datetime string: 'b0094916-a7bf-11ec-95b0-dd68c0a50dd7'
Mar 19 20:04:10 cloudvirt1021 wmcs-backup[2954847]: ERROR: [backy2.logging] Backy failed.
You can verify by checking the db integrity (any output other than ok is bad):
root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> pragma integrity_check;
*** in database main ***
On tree page 71625 cell 219: Rowid 861598 out of order
On tree page 4 cell 0: 2nd reference to page 71847
row 304958 missing from index ix_blocks_checksum
...
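The same check can be run non-interactively; on a healthy database it prints just ok:

root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite "pragma integrity_check;"
ok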
And to fix it, dump the DB and reload it. First, make sure backy is not running, as doing this mid-DB-update might break things:
root@cloudvirt1021:~# systemctl stop backup_vms.timer
root@cloudvirt1021:~# systemctl status backup_vms.service  # wait for this to finish if it's running, probably not as it broke
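It's also a good idea (not part of the original procedure, just a safety net) to keep a copy of the corrupted file before rewriting anything:

root@cloudvirt1021:~# cp -a /srv/backy/backy.sqlite /root/backy.sqlite.corrupt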
Then dump the DB:
root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite
sqlite> .output backup.db
sqlite> .dump
sqlite> .quit

# the backy.fixed.sqlite file did not exist yet, sqlite3 creates it
root@cloudvirt1021:~# sqlite3 backy.fixed.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .read backup.db
sqlite> .quit

# restore the DB
root@cloudvirt1021:~# cp backy.fixed.sqlite /srv/backy/backy.sqlite
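After the restore, re-run the integrity check against the restored file; it should now print ok:

root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite "pragma integrity_check;"
ok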
Finally, as some entries might have been lost, you can do a full cleanup of the backups. This might take some time:
root@cloudvirt1021:~# backy2 cleanup --full
    INFO: [backy2.logging] $ /usr/bin/backy2 cleanup --full
    INFO: [backy2.logging] Cleanup: Removed 500709 blobs
    INFO: [backy2.logging] Backy complete.
Start the timer again:
root@cloudvirt1021:~# systemctl start backup_vms.timer
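You can double-check that the timer is scheduled again, and optionally trigger a run immediately instead of waiting for the next scheduled one:

root@cloudvirt1021:~# systemctl list-timers backup_vms.timer
root@cloudvirt1021:~# systemctl start backup_vms.service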
Related information
Old incidents
- task T304408 cloudvirt1021 - CRITICAL: Status of the systemd unit backup_vms
- task T303870 cloudvirt1022/Check unit status of backup_vms is CRITICAL
Contacts
Communication and support
Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia Movement volunteers. Please reach out with questions and join the conversation:
- Chat in real time in the IRC channel #wikimedia-cloud, the bridged Telegram group, or the bridged Mattermost channel
- Discuss via email after subscribing to the cloud@ mailing list