Portal:Cloud VPS/Admin/Runbooks/NovafullstackSustainedLeakedVMs
Error / Incident
Usually an email/icinga alert with the subject ** PROBLEM alert - <hostname>/Check for VMs leaked by the nova-fullstack test is CRITICAL **
This happens when there's an error in the creation and/or deletion of the nova-fullstack VM.
To see the logs, you have to log in to the instance that triggered the alert (if you are looking at https://alerts.wikimedia.org, it is shown as a blue tag), for example cloudcontrol1003:
Debugging
Quick check
ssh cloudcontrol1003.wikimedia.org
There you can check the systemctl status:
user@cloudcontrol1003:~$ sudo systemctl status nova-fullstack.service
It might be that the service has re-run since then and is now passing. Logs are in Kibana: https://logstash.wikimedia.org/app/dashboards#/view/13149940-ff6c-11eb-85b7-9d1831ce7631
Quick way to check the last error (now that we have structured errors):
user@cloudcontrol1003:~$ echo "$(sudo journalctl -n 5000 -u nova-fullstack.service | grep '"log.level": "ERROR"' | tail -n 1 | sed -e 's/^.*@cee: //')" | jq '.'
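To illustrate what that one-liner does, here is the same sed extraction applied to a made-up journal line (the log content below is a stand-in for real nova-fullstack output, not actual log data):

```shell
# Hypothetical journald line carrying an @cee-prefixed JSON payload; the sed
# step strips everything up to and including '@cee: ', leaving bare JSON that
# can then be pretty-printed with jq as in the one-liner above.
echo 'Jun 04 12:00:00 cloudcontrol1003 nova-fullstack[123]: @cee: {"log.level": "ERROR", "message": "leaked vm"}' \
  | sed -e 's/^.*@cee: //'
```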
View the current list of VMs
user@cloudcontrol1003:~$ sudo wmcs-openstack server list --os-project-id admin-monitoring
Another possible source of information is the Grafana dashboard.[1]
Debugging service errors
DNS entry not being created
This is related to the designate service running on the cloudcontrols; to check its status, ssh in and run:
root@cumin1002:~# cumin cloudcontrol1* 'systemctl status designate-sink'
One common issue is designate-sink getting stuck; the reason is not yet known (see task T309929). The symptom is that it stops logging anything, so if there are no logs newer than 2h, it's probably this. The workaround is to restart the service:
root@cumin1002:~# cumin cloudcontrol1* 'systemctl restart designate-sink.service'
It will take a couple of minutes to start up, but after that you should start seeing entries like:
Jun 04 20:42:02 cloudcontrol1006 designate-sink[3363330]: 2022-06-04 20:42:02.192 3363330 WARNING nova_fixed_multi.base [req-6caefd0f-9620-49f4-94b3-53710ce2b128 - - - - -] Deleting forward record 0ddc6e2d-10db-4246-9a43-3fe81ff1fa20 in recordset becbf645-db72-4dd3-b704-1a1cd380e85c
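A minimal sketch of the "no logs newer than 2h" check, with a stand-in timestamp instead of the real last journal entry (on a cloudcontrol you would get that value from something like `journalctl -u designate-sink -n 1 -o short-unix`):

```shell
# Stand-in for the epoch timestamp of the unit's last journal entry; on the
# real host this would come from journalctl, not from date.
last_log_ts=$(date -d '3 hours ago' +%s)
now=$(date +%s)
# Flag the service as likely stuck if it has been silent for over 2 hours.
if [ $(( now - last_log_ts )) -gt 7200 ]; then
    echo "designate-sink looks stuck, consider restarting it"
fi
```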
If that did not help, you can try restarting all the designate services, sometimes they get stuck too:
root@cumin1002:~# cumin cloudcontrol1* 'systemctl restart designate\*'
Checking zone transfers
To test if the zone transfer between designate-mdns and pdns is working ok, you can run the following from a cloudcontrol:
root@cloudcontrol1006:~# host -l eqiad1.wikimedia.cloud ns1.openstack.eqiad1.wikimediacloud.org # only one of them is authoritative and will reply
root@cloudcontrol1006:~# host -l eqiad1.wikimedia.cloud ns0.openstack.eqiad1.wikimediacloud.org
Or from the pdns server (cloudservices):
root@cloudservices1006:/# host -p 5354 -l eqiad1.wikimedia.cloud 172.20.1.3 # ip of cloudcontrol you want to test
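To confirm the two sides are actually in sync, you can also compare SOA serials. A hedged sketch, with hard-coded sample serials standing in for the output of `dig +short SOA eqiad1.wikimedia.cloud @<server> | awk '{print $3}'` against each nameserver:

```shell
# Sample serials standing in for what dig would return from designate-mdns
# and from pdns; matching serials mean transfers are keeping up.
serial_designate=2024061801
serial_pdns=2024061801
if [ "$serial_designate" = "$serial_pdns" ]; then
    echo "zones in sync"
else
    echo "zone transfer lagging, check designate-mdns <-> pdns"
fi
```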
SSH timeout to the machine
This can have many reasons, one common one is that the machine failed to run cloud-init properly.
To check, connect to the machine's console. First find the host the VM is running on (via Horizon, or run):
user@cloudcontrol1003:~$ sudo wmcs-openstack server show --os-project-id admin-monitoring afe478d0-d950-467f-abd0-6f3a41e50552 | grep -E '(OS-EXT-SRV-ATTR:host|instance_name)'
| OS-EXT-SRV-ATTR:host          | cloudvirt1045 |
| OS-EXT-SRV-ATTR:instance_name | i-000626d2    |
SSH to that host and run:
user@cloudvirt1045:~$ sudo virsh
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # console i-000626f9
Connected to domain 'i-000626f9'
Escape character is ^] (Ctrl + ])
Use the instance_name you got before; that will connect you to the console (to exit, hit CTRL+]). From there, check the cloud-init log:
root@buildvm-84424939-7022-4ea6-9e97-0ef7d3df1af0:~# cat /var/log/cloud-init.log
No active metadata service found
If you see this in the cloud-init.log file:
2022-06-03 23:46:57,664 - util.py[WARNING]: No active metadata service found
2022-06-03 23:46:57,665 - util.py[DEBUG]: No active metadata service found
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOpenStack.py", line 145, in _get_data
    results = self._crawl_metadata()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOpenStack.py", line 181, in _crawl_metadata
    raise sources.InvalidMetaDataException(
cloudinit.sources.InvalidMetaDataException: No active metadata service found
Then the VM was unable to reach the metadata service to get the vendor-data with the custom settings to initialize it.
A common problem is nova-api-metadata not working properly on the cloudcontrol nodes; you can check it and try restarting it (workaround):
root@cloudcontrol1003:~# systemctl restart nova-api-metadata.service
It's also possible that the neutron metadata agent is down; there's a specific alert for it, but to check manually:
root@cloudcontrol1003:~# source novaagent.sh
root@cloudcontrol1003:~# neutron agent-list
You might have to restart them on the neutron machines:
root@cloudnet1005:~# systemctl restart neutron-metadata-agent.service
If all the agents show as down, it might also be that the neutron-rpc-server service is down on the cloudcontrols; you can just restart it:
root@cloudcontrol1003:~# systemctl restart neutron-rpc-server.service
Per-VM debugging
If there's nothing in the logs (sometimes the log has been rotated), you can check the servers in the admin-monitoring project:
dcaro@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server list
+--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
| ID                                   | Name                      | Status | Networks                               | Image                                      | Flavor                |
+--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
| d603b2e0-7b8b-462f-b74d-c782c2d34fea | fullstackd-20210110160929 | BUILD  |                                        | debian-10.0-buster (deprecated 2021-02-22) |                       |
| 33766360-bbbe-4bef-8294-65fca6722e20 | fullstackd-20210415002301 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.230 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
| 697ebb69-0394-4e29-82fc-530153a38a1b | fullstackd-20210414162903 | ACTIVE | lan-flat-cloudinstances2b=172.16.5.251 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
| 1812e03b-c978-43a5-a07e-6e3e240a9bf0 | fullstackd-20210414123145 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.184 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
| d70a111d-6f23-40e1-8c01-846dedb5f2ca | fullstackd-20210414110500 | ACTIVE | lan-flat-cloudinstances2b=172.16.2.117 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
| 0b476a1a-a75e-4b56-bf51-8f9d43ec9201 | fullstackd-20210413182752 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.198 | debian-10.0-buster                         | g3.cores1.ram2.disk20 |
+--------------------------------------+---------------------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
There we see a few instances stuck from April 14th, one from the 13th, and an instance stuck in BUILD since January.
Very old failed instances
Usually the best course of action is to just delete the server, as any traces will already be lost; in this case fullstackd-20210110160929.
sudo wmcs-openstack --os-project-id admin-monitoring server delete fullstackd-20210110160929
If it replies with:
No server with a name or ID of 'fullstackd-20210110160929' exists.
Then you have a stuck server entry, follow this: Portal:Cloud VPS/Admin/Troubleshooting#Instance Troubleshooting
In this case, it was an old build request that got lost (quite uncommon): Portal:Cloud VPS/Admin/Troubleshooting#Deleting an orphaned build request
Steady, intermittent failures
Sometimes the fullstack test will exhibit a slow leak, with one failed VM every few hours or every few days. That suggests an intermittent failure. Here are some things that have caused that in the past:
- Race conditions in the instance startup script, usually related to getting clean puppet runs (puppet/modules/openstack/templates/nova/vendordata.txt.erb). This usually shows up in the cloud-init logs on the failed instances, /var/log/cloud-init*.
- Partial outage of OpenStack services, resulting either from one service on one cloudcontrol being out, or an haproxy configuration error. First place to look for that is in /var/log/haproxy/haproxy.log
- Bad tokens in a service cache, causing intermittent auth failures between openstack services. To clear the caches, restart memcached on all cloudcontrols.
- Bad Rabbitmq behavior
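For the haproxy case above, a quick way to spot intermittent failures is to tally status codes in the log. A self-contained sketch, with sample lines standing in for /var/log/haproxy/haproxy.log (real haproxy log lines have more fields; adjust the awk column to your log format):

```shell
# Tally status codes (2nd field in this simplified sample) so that an
# intermittent spike of 5xx responses stands out at a glance.
cat <<'EOF' | awk '{codes[$2]++} END {for (c in codes) print c, codes[c]}' | sort
GET 200
POST 503
GET 200
EOF
```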
New instances
Now we can dig a bit deeper and check the OpenStack logs for those instance IDs; let's pick 697ebb69-0394-4e29-82fc-530153a38a1b.
You can now go to kibana[2] and search for any openstack entry that has that server id in it, for example:
https://logstash.wikimedia.org/goto/d1dd6f5c0b090d2c903b88c7eac734ab
Remember to set the time span according to the instance creation date.
In this case there seems not to be anything interesting there for that instance.
Going to the new instance
As a last resort, we can try sshing to the instance and running puppet, see if anything is broken there:
ssh fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud
Check Cloud-Init / Logs:
$ grep error /var/log/cloud-init.log
Check Puppet:
dcaro@fullstackd-20210415002301:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for fullstackd-20210415002301.admin-monitoring.eqiad1.wikimedia.cloud
Info: Applying configuration version '(dd0bf90505) Manuel Arostegui - mariadb: Productionzie db1182'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: Applied catalog in 5.74 seconds
In this case nothing failed.
Cleanup
As we did not find anything in the logs (some might already be lost) and the service is currently running ok, all that's left is to clean up the VMs, check whether deleting them fails, and free up the resources if not.
user@cloudcontrol1003:~$ sudo wmcs-openstack --os-project-id admin-monitoring server list -f value -c ID | xargs -I{} sudo wmcs-openstack --os-project-id admin-monitoring server delete {}
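Before running that bulk delete, it can be worth a dry run that only prints what would be removed. A sketch with sample IDs standing in for the real `server list -f value -c ID` output:

```shell
# Pipe the IDs through echo first to eyeball the deletions; on a cloudcontrol
# the printf would be replaced by the wmcs-openstack server list call above.
printf '%s\n' d603b2e0-7b8b-462f-b74d-c782c2d34fea 33766360-bbbe-4bef-8294-65fca6722e20 \
  | xargs -I{} echo "would delete {}"
```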
If any of the failures were a result of DNS issues, it's best to clean up any possible stray DNS records as well to avoid future collisions:
root@cloudcontrol1005:~# wmcs-dnsleaks # dry-run to make sure this won't do anything horrible
skipping public zone: cloudinfra.wmflabs.org.
checking zone: eqiad1.wikimedia.cloud.
335997b9-e2ff-47af-8fca-b0a5c73efb04 is linked to missing instance fullstackd-20221216045943.admin-monitoring.eqiad1.wikimedia.cloud.
skipping public zone: wmcloud.org.
skipping public zone: gitlab-test.eqiad1.wmcloud.org.
<etc>
root@cloudcontrol1005:~# wmcs-dnsleaks --delete # actually delete stray records
skipping public zone: cloudinfra.wmflabs.org.
checking zone: eqiad1.wikimedia.cloud.
335997b9-e2ff-47af-8fca-b0a5c73efb04 is linked to missing instance fullstackd-20221216045943.admin-monitoring.eqiad1.wikimedia.cloud.
Deleting 335997b9-e2ff-47af-8fca-b0a5c73efb04 with https://openstack.eqiad1.wikimediacloud.org:29001/v2/zones/67603ef4-3d64-40d6-90d3-5b7776a99034/recordsets/335997b9-e2ff-47af-8fca-b0a5c73efb04
skipping public zone: wmcloud.org.
skipping public zone: gitlab-test.eqiad1.wmcloud.org.
skipping public zone: mx-out.wmcloud.org.
<etc>
Support contacts
Usually anyone on the WMCS team should be able to help debug the issue; the main subject matter expert is Andrew Bogott.
Related information
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Nova-fullstack
- https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring#nova-fullstack