Portal:Cloud VPS/Admin/Runbooks/NovafullstackSustainedLeakedVMs

The procedures in this runbook require admin permissions to complete.

Error / Incident

Usually an email/icinga alert with the subject ** PROBLEM alert - <hostname>/Check for VMs leaked by the nova-fullstack test is CRITICAL **

This happens when there's an error in the creation and/or deletion of the nova-fullstack VM. To see the logs, you have to login to the instance that triggered the alert (if you are looking at the https://alerts.wikimedia.org, it is shown as a blue tag). For example cloudcontrol1003:

Debugging

Quick check

 ssh cloudcontrol1003.wikimedia.org

There you can check the systemctl status:

 user@cloudcontrol1003:~$ sudo systemctl status nova-fullstack.service

It might be that the service re-triggered since then and now it's running. Logs are in Kibana: https://logstash.wikimedia.org/app/dashboards#/view/13149940-ff6c-11eb-85b7-9d1831ce7631

Quick way to check the last error (now that we have structured errors):

 user@cloudcontrol1003:~$ echo "$(sudo journalctl -n 5000 -u  nova-fullstack.service | grep '"log.level": "ERROR"' | tail -n 1 | sed -e 's/^.*@cee: //')" | jq '.'

View the current list of VMs

 user@cloudcontrol1003:~$ sudo wmcs-openstack server list --os-project-id admin-monitoring

Another possible source of information is the Grafana dasboard.^[1]

Debugging service errors

DNS entry not being created

This is related to the designate service that is running on cloudcontrols, to check it's status ssh and run:

root@cumin1002:~# cumin cloudcontrol1* 'systemctl status designate-sink'

One common issue is designate-sink getting stuck, the reason is not known yet (see task T309929), but the symptom is that it stopped logging anything, so if there are no logs newer than 2h, it's probably this, the workaround is to restart the service:

root@cumin1002:~# cumin cloudcontrol1* 'systemctl restart designate-sink.service'

It will take a couple minutes to start up, but after you should start seeing entries like:

Jun 04 20:42:02 cloudcontrol1006 designate-sink[3363330]: 2022-06-04 20:42:02.192 3363330 WARNING nova_fixed_multi.base [req-6caefd0f-9620-49f4-94b3-53710ce2b128 - - - - -] Deleting forward record 0ddc6e2d-10db-4246-9a43-3fe81ff1fa20 in recordset becbf645-db72-4dd3-b704-1a1cd380e85c

If that did not help, you can try restarting all the designate services, sometimes they get stuck too:

root@cumin1002:~# cumin cloudcontrol1* 'systemctl restart designate\*'

Checking zone transfers

To test if the zone transfer between designate-mdns and pdns is working ok, you can run the following, from a cloudcontrol:

root@cloudcontrol1006:~# host -l eqiad1.wikimedia.cloud ns1.openstack.eqiad1.wikimediacloud.org  # only one of them is the authoratitative and will reply
root@cloudcontrol1006:~# host -l eqiad1.wikimedia.cloud ns0.openstack.eqiad1.wikimediacloud.org

Or from the pdns server (cloudservice):

root@cloudservices1006:/# host -p 5354 -l eqiad1.wikimedia.cloud 172.20.1.3  # ip of cloudcontrol you want to test

SSH timeout to the machine

This can have many reasons, one common one is that the machine failed to run cloud-init properly.

To check it up, connect to the console of the machine, that is, check the host that VM is running on (on horizon, or run):

user@cloudcontrol1003:~$ sudo wmcs-openstack server show --os-project-id admin-monitoring afe478d0-d950-467f-abd0-6f3a41e50552
user@cloudcontrol1003:~$ sudo wmcs-openstack server show --os-project-id admin-monitoring afe478d0-d950-467f-abd0-6f3a41e50552 | grep -E '(OS-EXT-SRV-ATTR:host|instance_name)'
| OS-EXT-SRV-ATTR:host                | cloudvirt1045
| OS-EXT-SRV-ATTR:instance_name       | i-000626d2

Ssh to that host, and run:

user@cloudvirt1045:~$ sudo virsh
Welcome to virsh, the virtualization interactive terminal.

Type: 'help' for help with commands

      'quit' to quit

virsh # console i-000626f9

Connected to domain 'i-000626f9'
Escape character is ^] (Ctrl + ])

With the instance_name you got before, and that will connect you to the console (to exit hit CTRL+]), from there, check the cloud-init log:

root@buildvm-84424939-7022-4ea6-9e97-0ef7d3df1af0:~# cat /var/log/cloud-init.log

No active metadata service found

If you see this in the cloud-init.log file:

2022-06-03 23:46:57,664 - util.py[WARNING]: No active metadata service found
2022-06-03 23:46:57,665 - util.py[DEBUG]: No active metadata service found
Traceback (most recent call last):
 File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOpenStack.py", line 145, in _get_data
   results = self._crawl_metadata()
 File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOpenStack.py", line 181, in _crawl_metadata
   raise sources.InvalidMetaDataException(
cloudinit.sources.InvalidMetaDataException: No active metadata service found

Then the VM was unable to reach the metadata service to get the vendor-data with the custom settings to initialize it.

A common problem is the nova-api-metadata not working properly on the cloudcontrol nodes, you can check it out and try restarting it (workaround):

root@cloudcontrol1003:~# systemctl restart nova-api-metadata.service

It's also possible that the neutron metadata agent to be down, there's a specific alert for it, but to manually check:

root@cloudcontrol1003:~# source novaagent.sh
root@cloudcontrol1003:~# neutron agent-list

You might have to restart them on the neutron machines:

root@cloudnet1005:~# systemctl restart neutron-metadata-agent.service

If all the agents show as down, it might also be that the neutron-rpc-server service is down on the cloudcontrols, you can just restart it:

root@cloudcontrol1003:~# systemctl restart neutron-rpc-server.service

Per-VM debugging

If there's nothing in the logs (sometimes the log has been rotated), you can check the servers for the admin-monitoring project:

 dcaro@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server list
 +--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
 | ID                                   | Name                      | Status | Networks                               | Image                                      | Flavor                |
 +--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
 | d603b2e0-7b8b-462f-b74d-c782c2d34fea | fullstackd-20210110160929 | BUILD  |                                        | debian-10.0-buster (deprecated 2021-02-22) |                       |
 | 33766360-bbbe-4bef-8294-65fca6722e20 | fullstackd-20210415002301 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.230 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
 | 697ebb69-0394-4e29-82fc-530153a38a1b | fullstackd-20210414162903 | ACTIVE | lan-flat-cloudinstances2b=172.16.5.251 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
 | 1812e03b-c978-43a5-a07e-6e3e240a9bf0 | fullstackd-20210414123145 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.184 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
 | d70a111d-6f23-40e1-8c01-846dedb5f2ca | fullstackd-20210414110500 | ACTIVE | lan-flat-cloudinstances2b=172.16.2.117 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
 | 0b476a1a-a75e-4b56-bf51-8f9d43ec9201 | fullstackd-20210413182752 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.198 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
 +--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+

There we see that there's a few instances stuck from April 14th, and one from the 13th, and a building instance from January.

Very old failed instances

Usually the best of action there is just to delete the server, as any traces will be lost already, in this case fullstackd-20210110160929.

 sudo wmcs-openstack --os-project-id admin-monitoring server delete fullstackd-20210110160929

If it replies with:

 No server with a name or ID of 'fullstackd-20210110160929' exists.

Then you have a stuck server entry, follow this: Portal:Cloud VPS/Admin/Troubleshooting#Instance Troubleshooting

In this case, it was an old build request that got lost (quite uncommon): Portal:Cloud VPS/Admin/Troubleshooting#Deleting an orphaned build request

Steady, intermittent failures

Sometimes the fullstack test will exhibit a slow leak, with one failed VM every few hours or every few days. That suggests an intermittent failure. Here are some things that have caused that in the past:

Race conditions in the instance startup script, usually related to getting clean puppet runs (puppet/modules/openstack/templates/nova/vendordata.txt.erb). This usually shows up in the cloud-init logs on the failed instances, /var/log/cloud-init*.

Partial outage of OpenStack services, resulting either from one service on one cloudcontrol being out, or an haproxy configuration error. First place to look for that is in /var/log/haproxy/haproxy.log

Bad tokens in a service cache, causing intermittent auth failures between openstack services. To clear the caches, restart memcached on all cloudcontrols.

Bad Rabbitmq behavior

New instances

Now we can dig a bit deeper, and check for openstack logs containing those instance ids, let's pick 697ebb69-0394-4e29-82fc-530153a38a1b You can now go to kibana^[2] and search for any openstack entry that has that server id in it, for example:

https://logstash.wikimedia.org/goto/d1dd6f5c0b090d2c903b88c7eac734ab

Remember to set the time span accordingly to the instance creation date

In this case there seems not to be anything interesting there for that instance.

Going to the new instance

As a last resort, we can try sshing to the instance and running puppet, see if anything is broken there:

 ssh fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud

Check Cloud-Init / Logs:

 $ cat /var/log/cloud-init.log | grep error

Check Puppet:

 dcaro@fullstackd-20210415002301:~$ sudo run-puppet-agent
  Info: Using configured environment 'production'
  Info: Retrieving pluginfacts
  Info: Retrieving plugin
  Info: Retrieving locales
  Info: Loading facts
  Info: Caching catalog for fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud
  Info: Applying configuration version '(dd0bf90505) Manuel Arostegui - mariadb: Productionzie db1182'
  Notice: The LDAP client stack for this host is: sssd/sudo
  Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
  Notice: Applied catalog in 5.74 seconds

In this case nothing failed.

Cleanup

As we did not find anything in the logs (some might be lost already), and the service is running ok currently, all that's left to do is to cleanup the VMs, see if deleting them fails, and freeing up the resources if not.

 useer@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server list  -f value -c ID | xargs -I{} sudo wmcs-openstack --os-project-id admin-monitoring server delete {}

If any of the failures were a result of DNS issues, it's best to clean up any possible stray DNS records as well to avoid future collisions:

 root@cloudcontrol1005:~# wmcs-dnsleaks # dry-run to make sure this won't do anything horrible
 skipping public zone: cloudinfra.wmflabs.org.
 checking zone: eqiad1.wikimedia.cloud.
 335997b9-e2ff-47af-8fca-b0a5c73efb04 is linked to missing instance fullstackd-20221216045943.admin-monitoring.eqiad1.wikimedia.cloud.
 skipping public zone: wmcloud.org.
 skipping public zone: gitlab-test.eqiad1.wmcloud.org.
 <etc>

 root@cloudcontrol1005:~# wmcs-dnsleaks --delete # actually delete stray records
 skipping public zone: cloudinfra.wmflabs.org.
 checking zone: eqiad1.wikimedia.cloud.
 335997b9-e2ff-47af-8fca-b0a5c73efb04 is linked to missing instance fullstackd-20221216045943.admin-monitoring.eqiad1.wikimedia.cloud.
 Deleting 335997b9-e2ff-47af-8fca-b0a5c73efb04 with https://openstack.eqiad1.wikimediacloud.org:29001/v2/zones/67603ef4-3d64-40d6-90d3-5b7776a99034/recordsets/335997b9-e2ff-47af-8fca-b0a5c73efb04
 skipping public zone: wmcloud.org.
 skipping public zone: gitlab-test.eqiad1.wmcloud.org.
 skipping public zone: mx-out.wmcloud.org.
 <etc>

Support contacts

Usually anyone in the WMCS team should be able to help/debug the issue, the main subject matter expert would be Andrew Bogott.

Related information

References

[1] Nova fullstack Grafana dashboard

[2] Wikimedia Kibana

[1]

[2]