This page may need a refresh. We now have a new deployment (eqiad1), which may have slightly different maintenance mechanisms.
This page is for routine maintenance tasks performed by Cloud Services operators. For troubleshooting immediate issues, see the Labs_troubleshooting page.
Labvirt reboot checklist
- Notify users on cloud-announce -- one week in advance if possible
- 'schedule downtime for this host and all services' in icinga
- 'schedule downtime for this host and all services' for checker.tools.wmflabs.org in icinga
- If VMs will be affected:
- Collect a list of nodes and their current state on the labvirt in question: 'nova list --all-tenants --host <hostname>'
- Disable puppet and stop the 'shinken' service on shinken-01.eqiad.wmflabs -- this isn't subtle but will keep alerts to a minimum
- depool all affected tool exec nodes
- failover tools nodes as needed https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Failover
- failover nova-proxy as needed
- Kubernetes nodes should generally be fine as long as only one labvirt is rebooted at a time
- Reboot host
- Wait for host to reboot, verify ssh access still works
- If VMs were affected
- once Nova has caught up to the change, all hosted VMs should switch to an 'off' state -- wait until that happens so we know that Nova is up to speed (again, check this with 'nova list --all-tenants --host <hostname>')
- refer back to your list of nodes and pre-reboot state from earlier; restart all VMs that were previously running, waiting 5-10 seconds after each restart to avoid flooding the Nova control plane (probably by making a script out of the output from 'nova list')
- repool all affected exec nodes
- Re-enable puppet on shinken-01.eqiad.wmflabs (puppet will restart shinken.)
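The restart step in the checklist above can be scripted. The following is a minimal sketch, not an existing tool: it assumes the pre-reboot output of 'nova list --all-tenants --host <hostname>' was saved in the default ASCII table format with 'ID' and 'Status' columns, and that previously running VMs show the status ACTIVE. The function names are hypothetical.

```python
import subprocess
import time


def parse_nova_list(table_text):
    """Parse the ASCII table printed by 'nova list' into a list of dicts."""
    rows = []
    header = None
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        # Drop the empty strings produced by the leading and trailing '|'.
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells  # the first '|' row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def previously_running(rows):
    """Return the IDs of instances that were ACTIVE before the reboot."""
    return [row["ID"] for row in rows if row.get("Status") == "ACTIVE"]


def restart_instances(instance_ids, delay=10,
                      run=subprocess.check_call, sleep=time.sleep):
    """Start each instance with 'nova start', pausing between starts
    so the Nova control plane is not flooded."""
    for instance_id in instance_ids:
        run(["nova", "start", instance_id])
        sleep(delay)
```

For example, if the pre-reboot listing was saved to a file (the name pre-reboot.txt is hypothetical), restart_instances(previously_running(parse_nova_list(open('pre-reboot.txt').read()))) would start each previously active VM in turn. The 'run' and 'sleep' parameters exist only so the loop can be dry-run without touching Nova.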
Openstack Upgrade test plan
Upgrading openstack mostly consists of updating config files, changing openstack::version in hiera and then running puppet a bunch of times. In theory each individual openstack service is compatible with the n+1 and n-1 versions of the others, so the components don't have to be upgraded in a particular order.
That said, we have a test cluster, so it's best to run a test upgrade there before rolling things out for prod. Here are things to test:
- Openstack service list
- Openstack endpoint list
- Create new account via wikitech
- Set up 2fa for new account
- Verify new user can log in on wikitech
- Create new project via wikitech
- Set keystone cmdline auth to new user
- Verify new user has no access to new project
- Keystone commandline roles
- Assign a role to the new account
- Remove role from new account
- Wikitech project management
- Add new user to a project
- Promote user to projectadmin
- Verify new user can log in on Horizon
- Verify new user can view instance page for new project
- Demote user to normal project member
- Instance creation
- verify dns entry created
- Verify ldap record created
- ssh access
- check puppet run output
- Assignment/Removal of floating IPs
- Security groups
- Remove instance from ssh security group, verify ssh is blocked
- Replace instance in ssh security group, verify ssh works again
- Add/remove source group and verify that networking between existing and new instances in the same project changes accordingly
- Instance deletion
- Verify dns entry cleaned up
- Verify ldap record removed
- Instance creation
- Openstack image list
- Create new image
- Test instance creation with new image
- Assign/remove dns entry from Horizon
- Dynamic Proxy
- Create/delete proxy
Puppet installs novaenv.sh on the openstack controller. In order to run nova and glance shell commands without having to add a thousand args to the commandline, source it first, either directly as root or via sudo from an unprivileged shell:
$ source /root/novaenv.sh
$ source <(sudo cat /root/novaenv.sh)
The cold-migrate tool will shut down an instance, copy it to the specified target host, and boot it on the new host.
$ nova list --all-tenants --host <source>
$ /root/cold-migrate <args> 7d4a9768-c301-4e95-8bb9-d5aa70e94a64 <destination>
Puppet installs cold-migrate.sh in /root on the nova controller. A migration can take quite a while, so run it in a 'screen' session.
The imagestats script can be run periodically to list which images are currently in use -- it can also answer the question 'what instances use image xxx?'. As obsolete images fall out of use, they can be deleted from glance to save disk space.
Puppet installs imagestats in /root/novastats on the nova controller.
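To illustrate the kind of question imagestats answers, here is a minimal sketch; it is not the imagestats source. It assumes instance records shaped like the JSON rows printed by 'openstack server list --all-projects -f json', with 'Name' and 'Image' keys. Those key names are an assumption and may differ between client versions, and both function names are hypothetical.

```python
from collections import defaultdict


def instances_by_image(servers):
    """Group instance names by the image each instance was built from."""
    usage = defaultdict(list)
    for server in servers:
        # Instances whose image field is empty are grouped separately.
        image = server.get("Image") or "(unknown)"
        usage[image].append(server["Name"])
    return dict(usage)


def unused_images(all_images, usage):
    """Return the images that no instance currently uses, sorted by name."""
    return sorted(set(all_images) - set(usage))
```

Images returned by unused_images are only candidates for deletion; whether an image is really obsolete still needs a human decision.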
Novastats.py is a simple python library which (among other things) creates a dictionary of instance data. It's useful for writing simple one-off scripts during cluster management.
Puppet installs novastats.py in /root/novastats.py on the nova controller. You'll need to source novaenv.sh before using any of its functions.
Novastats /should/ use python openstack libraries to talk to nova, but it doesn't. Rather, it uses the openstack commandline tools.
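The approach described above, shelling out to the commandline tools and building a dictionary from the result, can be sketched as follows. This is an illustration only and not the novastats.py code: the function names are hypothetical, and it assumes the 'openstack' client is installed, that novaenv.sh has been sourced so credentials are in the environment, and that each JSON row carries an 'ID' key.

```python
import json
import subprocess


def parse_server_list(json_text):
    """Turn the JSON printed by the CLI into a dict keyed by instance ID."""
    return {server["ID"]: server for server in json.loads(json_text)}


def all_instances(run=subprocess.check_output):
    """Fetch every instance in every project via the openstack CLI."""
    output = run(["openstack", "server", "list", "--all-projects", "-f", "json"])
    return parse_server_list(output)
```

A one-off script could then do something like 'for instance_id, info in all_instances().items(): print(instance_id, info["Name"])'. The 'run' parameter is there so the parsing can be exercised with canned output instead of a live cluster.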