Portal:Cloud VPS/Admin/Maintenance


This page is for routine maintenance tasks performed by Cloud Services operators. For troubleshooting immediate issues, see the Labs_troubleshooting page.

Labvirt reboot checklist

  1. Notify users on cloud-announce -- one week in advance if possible
  2. 'schedule downtime for this host and all services' in icinga
  3. 'schedule downtime for this host and all services' for checker.tools.wmflabs.org in icinga
  4. If VMs will be affected:
    1. Collect a list of nodes and their current state on the labvirt in question: 'nova list --all-tenants --host <hostname>'
    2. Disable puppet and stop the 'shinken' service on shinken-01.eqiad.wmflabs -- this isn't subtle but will keep alerts to a minimum
    3. Depool all affected tool exec nodes
    4. Fail over tools nodes as needed: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Failover
    5. Fail over nova-proxy as needed
    6. Kubernetes nodes should generally be fine as long as only one labvirt is rebooted at a time
  5. Reboot host
  6. Wait for host to reboot, verify ssh access still works
  7. If VMs were affected:
    1. Once Nova has caught up to the change, all hosted VMs should switch to an 'off' state. Wait until that happens so we know that Nova is up to speed (again, check this with 'nova list --all-tenants --host <hostname>').
    2. Refer back to your list of nodes and their pre-reboot state from earlier. Restart all VMs that were previously running, waiting 5-10 seconds after each restart to avoid flooding the Nova control plane (probably by making a script out of the output from 'nova list').
    3. Repool all affected exec nodes
    4. Re-enable puppet on shinken-01.eqiad.wmflabs (puppet will restart shinken.)
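
The "script out of the output from 'nova list'" step could look roughly like the following. This is a minimal sketch, not an existing WMCS tool: it assumes the pre-reboot output of 'nova list --all-tenants --host <hostname>' was saved to a file, that the table has 'ID', 'Name' and 'Status' columns, and that running VMs show the status ACTIVE. The function names are made up for illustration.

```python
import shlex
import sys


def parse_nova_list(table_text):
    """Parse the ASCII table printed by 'nova list' into a list of dicts."""
    rows = []
    header = None
    for line in table_text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first table row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def restart_script(before_text, delay_seconds=10):
    """Build shell commands restarting every VM that was ACTIVE before the reboot."""
    lines = ["#!/bin/bash"]
    for row in parse_nova_list(before_text):
        # Assumption: 'ACTIVE' is the status of a running VM in the saved list.
        if row.get("Status") == "ACTIVE":
            lines.append("nova start {}".format(shlex.quote(row["ID"])))
            # Pause between restarts to avoid flooding the Nova control plane.
            lines.append("sleep {}".format(delay_seconds))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Usage: python3 restart_plan.py saved-nova-list.txt > restart.sh
    with open(sys.argv[1]) as saved_list:
        sys.stdout.write(restart_script(saved_list.read()))
```

Review the generated commands before running them; VMs that were already shut off before the reboot are deliberately left out.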

Openstack Upgrade test plan

Upgrading openstack mostly consists of updating config files, changing openstack::version in hiera and then running puppet a bunch of times. In theory, each individual openstack service is compatible with versions n+1 and n-1 of the other services, so the components don't have to be upgraded in a particular order.

That said, we have a test cluster, so it's best to run a test upgrade there before rolling things out for prod. Here are things to test:

  • Keystone/Ldap
    • Openstack service list
    • Openstack endpoint list
    • Create new account via wikitech
      • Set up 2fa for new account
      • Verify new user can log in on wikitech
    • Create new project via wikitech
      • Set keystone cmdline auth to new user
      • Verify new user has no access to new project
    • Keystone commandline roles
      • Assign a role to the new account
      • Remove role from new account
    • Wikitech project management
      • Add new user to a project
      • Promote user to projectadmin
      • Verify new user can log in on Horizon
      • Verify new user can view instance page for new project
      • Demote user to normal project member
  • Nova
    • Instance creation
      • verify dns entry created
      • Verify ldap record created
      • ssh access
      • check puppet run output
    • Assignment/Removal of floating IPs
    • Security groups
      • Remove instance from ssh security group, verify ssh is blocked
      • Replace instance in ssh security group, verify ssh works again
      • Add/remove source group and verify that networking between existing and new instances in the same project changes accordingly
    • Instance deletion
      • Verify dns entry cleaned up
      • Verify ldap record removed
  • Glance
    • Openstack image list
    • Create new image
    • Test instance creation with new image
  • Designate
    • Assign/remove dns entry from Horizon
  • Dynamic Proxy
    • Create/delete proxy

Admin/Maintenance scripts

Information on our admin/maintenance scripts.


Puppet installs novaenv.sh on the openstack controller. Source it to run nova and glance shell commands without having to pass a long list of arguments on the command line.

As root:

root@cloudcontrol1003:~# source novaenv.sh

As a regular user with sudo rights:

user@cloudcontrol1003:~$ source <(sudo cat /root/novaenv.sh)

For basic use of the openstack command through sudo, use our custom wrapper, which simply loads credentials on the fly:

user@cloudcontrol1003:~$ sudo wmcs-openstack server list --all-projects
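
To illustrate what "loading credentials on the fly" can mean, here is a hedged sketch of a wrapper of this kind. It is not the actual wmcs-openstack implementation. It assumes that novaenv.sh consists of simple 'export NAME=value' lines, and the helper names are invented for the example.

```python
import os
import shlex
import sys


def parse_env_file(text):
    """Extract NAME=value pairs from lines of the form 'export NAME=value'."""
    env = {}
    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)  # drops '# ...' comments
        if len(tokens) == 2 and tokens[0] == "export" and "=" in tokens[1]:
            name, value = tokens[1].split("=", 1)
            env[name] = value
    return env


def build_invocation(env_text, args, base_env=None):
    """Return (argv, environment) for running 'openstack' with the credentials."""
    environment = dict(os.environ if base_env is None else base_env)
    environment.update(parse_env_file(env_text))
    return ["openstack"] + list(args), environment


if __name__ == "__main__":
    # Assumption: the credentials file lives at /root/novaenv.sh, as shown above.
    with open("/root/novaenv.sh") as env_file:
        argv, environment = build_invocation(env_file.read(), sys.argv[1:])
    # Replace this process with the real client, credentials included.
    os.execvpe(argv[0], argv, environment)
```

The point of this design is that the credentials live only in the environment of the child process and never have to be exported into the operator's shell.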

migration scripts

A collection of scripts to move VM instances around.


The wmcs-cold-migrate tool will shut down an instance, copy it to the specified target host, and boot it on the new host.

root@cloudcontrol:~# nova list --all-tenants --host <source>
root@cloudcontrol:~# wmcs-cold-migrate <args> 7d4a9768-c301-4e95-8bb9-d5aa70e94a64 <destination>

This can take quite a while, so run this in a 'screen' session. At the end of the migration, a prompt will show up asking for confirmation before the final cleanup (including origin disk deletion).

This can be used in other deployments. This example shows how to use it in the codfw1dev deployment:

root@labtestcontrol2003:~# wmcs-cold-migrate --datacenter codfw --nova-db-server localhost --nova-db nova 85d106d7-5c4e-4955-871d-a88dfb5d2a1e labtestmetal2001

To move VMs in the old main region:

root@cloudcontrol1004:~# wmcs-cold-migrate --region eqiad --nova-db nova 5a41a2b1-5bdd-4d52-ba1c-72273b4fe6f3 labvirt1005

Check --help for details of the input arguments and their default values.


TODO: fill me. Apparently not working.


TODO: fill me.


See Neutron migration.


See Neutron migration.


See Neutron migration.


A collection of scripts to fetch info from nova.


The wmcs-novastats-imagestats script can be run periodically to list which images are currently in use. It can also answer the question 'what instances use image xxx?'. Once an obsolete image is no longer used by any instance, it can be deleted from glance to save disk space.
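
The kind of bookkeeping involved can be sketched as follows. This is an illustration only, not the code of wmcs-novastats-imagestats: it works on plain (instance name, image ID) pairs supplied by the caller, and both function names are hypothetical.

```python
from collections import defaultdict


def image_usage(instances):
    """Map each image ID to the sorted names of the instances built from it."""
    usage = defaultdict(list)
    for name, image_id in instances:
        usage[image_id].append(name)
    return {image_id: sorted(names) for image_id, names in usage.items()}


def unused_images(all_image_ids, instances):
    """Return the image IDs that no instance references, in sorted order."""
    in_use = {image_id for _, image_id in instances}
    return sorted(set(all_image_ids) - in_use)


if __name__ == "__main__":
    # Made-up sample data: two instances share one image, one image is unused.
    sample = [("vm-a", "img-1"), ("vm-b", "img-1"), ("vm-c", "img-2")]
    print(image_usage(sample))
    print(unused_images(["img-1", "img-2", "img-3"], sample))
```

Images returned by unused_images are the candidates for deletion; image_usage answers the "which instances use image xxx" question.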


TODO: fill me.


TODO: fill me.


TODO: fill me.


TODO: fill me.


TODO: fill me.


See detecting leaked DNS records.


TODO: fill me.


Admin scripts related to DNS operations.


This script creates subdomains under the wmflabs.org domain. The base domain belongs to the wmflabsdotorg project, so the script creates the subdomain there and then transfers ownership to the desired project.

root@cloudcontrol1004:~# wmcs-makedomain --help
usage: makesubdomain [-h] [--designate-user DESIGNATE_USER]
                     [--designate-pass DESIGNATE_PASS]
                     [--keystone-url KEYSTONE_URL] --project PROJECT
                     [--domain DOMAIN] [--delete] [--all]

Create a subdomain of wmflabs.org

optional arguments:
  -h, --help            show this help message and exit
  --designate-user DESIGNATE_USER
                        username for nova auth
  --designate-pass DESIGNATE_PASS
                        password for nova auth
  --keystone-url KEYSTONE_URL
                        url for keystone auth and catalog
  --project PROJECT     project for domain creation
  --domain DOMAIN       domain to create
  --delete              delete domain rather than create
  --all                 with --delete, delete all domains in a project
root@cloudcontrol1004:~# source novaenv.sh
root@cloudcontrol1004:~# wmcs-makedomain --project traffic --domain traffic.wmflabs.org

This script is what we use to handle this general-user FAQ entry: Help:Horizon_FAQ#Can_I_create_a_new_DNS_domain/zone_for_my_project,_or_records_under_the_wmflabs.org_domain?


This script is used to create and update FQDN and domain names related to wikireplicas and other services.

See Portal:Data_Services/Admin/Wiki_Replica_DNS for more details.

TODO: should we simply merge the wmcs-wikireplica-dns docs here?


Other useful scripts.


This script checks whether instances of the same type are spread across enough different cloudvirt servers. It is useful in a project with many similar VM instances, which should run on separate cloudvirts so that losing one server does not take out a whole class of instances.
It is usually set up as an NRPE script for icinga. The script requires a yaml configuration file like this:

project: tools

  bastion-: bastion
  checker-: checker
  elastic-: elastic
  flannel-etcd-: flannel-etcd
  k8s-etcd-: k8s-etcd
  mail: mail
  paws-worker-: paws
  prometheus-: prometheus
  proxy-: proxy
  redis-: redis
  services-: services
  sgebastion-: sgebastion
  sgeexec-: sgeexec
  sgegrid-master: sgemaster
  sgegrid-shadow: sgemaster
  sgewebgrid-generic-: sgewebgrid-generic
  sgewebgrid-lighttpd-: sgewebgrid-lighttpd
  static-: static
  worker-: worker

And it is usually used like this:

root@cloudcontrol1003:~# wmcs-spreadcheck --config /etc/wmcs-spreadcheck-tools.yaml
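
The idea behind the check can be sketched as follows. This is not the wmcs-spreadcheck source. It assumes that the prefix-to-class mapping from the configuration file is available as a dict, that instances are given as (name, cloudvirt) pairs, and that "not spread enough" means two or more instances of one class sharing a cloudvirt. The function names are hypothetical.

```python
from collections import defaultdict


def classify(instance_name, classifier):
    """Return the class of the longest matching name prefix, or None."""
    matches = [prefix for prefix in classifier if instance_name.startswith(prefix)]
    if not matches:
        return None
    return classifier[max(matches, key=len)]


def crowded_hosts(instances, classifier):
    """Return {class: {host: [names]}} for hosts holding 2+ instances of a class."""
    placement = defaultdict(lambda: defaultdict(list))
    for name, host in instances:
        instance_class = classify(name, classifier)
        if instance_class is not None:  # unclassified instances are ignored
            placement[instance_class][host].append(name)
    crowded = {}
    for instance_class, hosts in placement.items():
        shared = {
            host: sorted(names) for host, names in hosts.items() if len(names) > 1
        }
        if shared:
            crowded[instance_class] = shared
    return crowded


if __name__ == "__main__":
    # Made-up placement data; the prefixes come from the example config above.
    classifier = {"bastion-": "bastion", "redis-": "redis"}
    instances = [
        ("bastion-01", "cloudvirt1001"),
        ("bastion-02", "cloudvirt1001"),
        ("redis-1", "cloudvirt1002"),
    ]
    print(crowded_hosts(instances, classifier))
```

An empty result means every class has at most one instance per cloudvirt; a monitoring check built on this would alert on anything else.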

See also