Portal:Cloud VPS/Admin/Maintenance

This page is for routine maintenance tasks performed by Cloud Services operators. For troubleshooting immediate issues, see the troubleshooting page.

Rebooting hosts (e.g. for security upgrades)

cloudbackupXXXX

TODO

cloudcontrolXXXX

TODO

cloudnetXXXX

To reboot these hosts in the right order (standby first), you can use the cookbook roll_reboot_cloudnets:

$ cookbook wmcs.openstack.roll_reboot_cloudnets --cluster-name eqiad1 --task-id TXXXXXX

The cookbook will also perform additional checks to verify that the network is working correctly after the reboots.
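
If you want to spot-check the network yourself after the cookbook finishes, listing the Neutron agents should show them all alive and up. This uses the wmcs-openstack wrapper documented later on this page; the host prompt is illustrative:

cloudcontrolXXXX:~$ sudo wmcs-openstack network agent list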

cloudrabbitXXXX

Reboot the hosts one at a time using sre.hosts.reboot-single. The order shouldn't matter, but in the past we started from the highest-numbered host (1003):

cumin1002:~$ sudo cookbook sre.hosts.reboot-single cloudrabbit1003.eqiad.wmnet --reason TXXXXXX

After rebooting each one of the cloudrabbit hosts, verify that rabbitmqctl cluster_status shows that all 3 nodes are back in the cluster:

cloudrabbit1001:~# sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudrabbit1001.private.eqiad.wikimedia.cloud ...
Basics

Cluster name: rabbit@cloudrabbit1001.eqiad.wmnet

Disk Nodes

rabbit@cloudrabbit1001.private.eqiad.wikimedia.cloud
rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud
rabbit@cloudrabbit1003.private.eqiad.wikimedia.cloud

Running Nodes

rabbit@cloudrabbit1001.private.eqiad.wikimedia.cloud
rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud
rabbit@cloudrabbit1003.private.eqiad.wikimedia.cloud

When all 3 nodes have been rebooted, check the admin-monitoring project in Horizon. Every 10 minutes a new instance should be created and deleted; if instances start piling up, something is broken with the Rabbit cluster.
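
The same check can be done from the command line (the host prompt is illustrative); a steadily growing list of instances in this project points to the same problem:

cloudcontrolXXXX:~$ sudo wmcs-openstack server list --project admin-monitoring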

In case something breaks during the upgrade (our cloudrabbit setup is sometimes unstable), you can rebuild the full cluster following the instructions at Portal:Cloud VPS/Admin/RabbitMQ#Resetting the HA setup.

cloudservicesXXXX

Reboot them one at a time using sre.hosts.reboot-single:

cumin1002:~$ sudo cookbook sre.hosts.reboot-single cloudservices1005.eqiad.wmnet --reason TXXXXXX

After rebooting each cloudservices host, run a nova-fullstack test to verify that the DNS is working correctly. nova-fullstack is only available on one specific cloudcontrol node (currently cloudcontrol1006):

cloudcontrol1006:~# sudo systemctl restart nova-fullstack.service

To check that the nova-fullstack test is completing successfully, verify that a test VM is created in the admin-monitoring project as soon as you restart the systemd unit, and that the VM disappears after 2-3 minutes:

cloudcontrol1006:~$ sudo wmcs-openstack server list --project admin-monitoring
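
As an extra sanity check of the authoritative DNS backed by the cloudservices hosts, you can query a nameserver directly. The nameserver and zone below are only illustrative examples; adjust them to the deployment you are working on:

user@laptop:~$ dig +short SOA wmcloud.org @ns0.openstack.eqiad1.wikimediacloud.org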

cloudvirtXXXX

In theory you can just use the wmcs.openstack.cloudvirt.safe_reboot cookbook to restart everything:

# run this in a screen/tmux:
cloudcuminXXXX:~$ sudo cookbook wmcs.openstack.cloudvirt.safe_reboot --cluster-name $CLUSTER --ceph-only

However, a single node failing to drain will make the entire cookbook fail, which isn't always very reliable. You can re-run the all-hosts cookbook and it will start again from the beginning, but at least as of this writing you may prefer to run the cookbook for one node at a time:

cloudcuminXXXX:~$ sudo cookbook wmcs.openstack.cloudvirt.safe_reboot --fqdn cloudvirtXXXX.SITE.wmnet
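
After each node comes back up, it's worth confirming that its VMs are running again. A hedged spot check using the wmcs-openstack wrapper (hostnames illustrative):

cloudcontrolXXXX:~$ sudo wmcs-openstack server list --all-projects --host cloudvirtXXXX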

cloudvirtlocalXXXX

These hosts are special cloudvirts with local storage attached. We only run two VMs on each of those: tools-k8s-etcd-XX and toolsbeta-test-k8s-etcd-XX.

Don't try rebooting them with the wmcs.openstack.cloudvirt.safe_reboot cookbook: it only works with standard cloudvirt hosts and fails on cloudvirtlocal hosts when it tries to migrate their VMs.

To reboot cloudvirtlocal hosts, use instead the production cookbook sre.hosts.reboot-single:

cumin1002:~$ sudo cookbook sre.hosts.reboot-single cloudvirtlocal1001.eqiad.wmnet --reason TXXXXXX

After rebooting each host, check that etcd reports all 3 nodes as healthy:

root@tools-k8s-etcd-16:~# ETCDCTL_API=3 etcdctl --key /etc/etcd/ssl/tools-k8s-etcd-16.tools.eqiad1.wikimedia.cloud.priv --cert=/etc/etcd/ssl/tools-k8s-etcd-16.tools.eqiad1.wikimedia.cloud.pem  --insecure-skip-tls-verify=true --endpoints=https://tools-k8s-etcd-16.tools.eqiad1.wikimedia.cloud:2379,https://tools-k8s-etcd-17.tools.eqiad1.wikimedia.cloud:2379,https://tools-k8s-etcd-18.tools.eqiad1.wikimedia.cloud:2379 endpoint health
https://tools-k8s-etcd-16.tools.eqiad1.wikimedia.cloud:2379 is healthy: successfully committed proposal: took = 8.133759ms
https://tools-k8s-etcd-17.tools.eqiad1.wikimedia.cloud:2379 is healthy: successfully committed proposal: took = 3.920964ms
https://tools-k8s-etcd-18.tools.eqiad1.wikimedia.cloud:2379 is healthy: successfully committed proposal: took = 1.771278ms

cloudvirt-wdqsXXXX

These hosts are special cloudvirts dedicated to the Wikidata Query Service.

Before rebooting them, check if there's any VM running on them:

cloudcontrol1005:~$ sudo wmcs-openstack server list --all-projects --host cloudvirt-wdqsXXXX

If there are VMs running (and not just the "canary" VM which can be ignored), get in touch with the owners (TODO: who are the owners?) before rebooting the host. If only the "canary" VM is running, you can reboot the host using the cloudvirt.safe_reboot cookbook:

$ cookbook wmcs.openstack.cloudvirt.safe_reboot --fqdn cloudvirt-wdqsXXXX.eqiad.wmnet --task-id TXXXXXX

Live Migrating Virtual Machines

All hypervisors using the Puppet role `role::wmcs::openstack::eqiad1::virt_ceph` support live migration. Note that the libvirt CPU architecture and capabilities must match between the source and destination hypervisors.

Live migration command:

$ openstack server migrate <virtual machine uuid> --live <target hypervisor>

After running the command, Nova will begin the migration process. The Nova compute log files (`/var/log/nova/nova-compute.log`) on the source and destination hypervisors can be used to follow along or to troubleshoot the migration.
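
For example, to follow along on either hypervisor while the migration runs (hostname illustrative):

user@cloudvirtXXXX:~$ sudo tail -f /var/log/nova/nova-compute.log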

Openstack Upgrade test plan

Upgrading OpenStack mostly consists of updating config files, changing openstack::version in Hiera, and then running Puppet a number of times. In theory each individual OpenStack service is compatible with the n+1 and n-1 versions, so the components don't have to be upgraded in a particular order.
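
Once the Hiera change bumping openstack::version is merged, applying it is just a matter of running Puppet on the affected hosts; a minimal sketch for a single host, assuming the standard production Puppet wrapper (hostname illustrative):

cloudcontrolXXXX:~$ sudo run-puppet-agent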

That said, we have a test cluster, so it's best to run a test upgrade there before rolling things out to production. Here are the things to test (a few CLI equivalents are sketched after the list):

  • Keystone/Ldap
    • Openstack service list
    • Openstack endpoint list
    • Create new account via wikitech
      • Set up 2fa for new account
      • Verify new user can log in on wikitech
    • Create new project via wikitech
      • Set keystone cmdline auth to new user
      • Verify new user has no access to new project
    • Keystone commandline roles
      • Assign a role to the new account
      • Remove role from new account
    • Wikitech project management
      • Add new user to a project
      • Promote user to projectadmin
      • Verify new user can log in on Horizon
      • Verify new user can view instance page for new project
      • Demote user to normal project member
  • Nova
    • Instance creation and boot with different glance images
      • verify dns entry created
      • Verify ldap record created
      • ssh access
      • check puppet run output
    • Assignment/Removal of floating IPs
    • Security groups
      • Remove instance from ssh security group, verify ssh is blocked
      • Replace instance in ssh security group, verify ssh works again
      • Add/remove source group and verify that networking between existing and new instances in the same project changes accordingly
    • Instance deletion
      • Verify dns entry cleaned up
      • Verify ldap record removed
  • Glance
    • Openstack image list
    • Create new image
    • Test instance creation with new image
  • Designate
    • Assign/remove dns entry from Horizon
  • Dynamic Proxy
    • Create/delete proxy
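
Some of the read-only checks in the list map directly to CLI commands on a cloudcontrol node, for example (using the wmcs-openstack wrapper documented below; host prompt illustrative):

cloudcontrolXXXX:~$ sudo wmcs-openstack service list
cloudcontrolXXXX:~$ sudo wmcs-openstack endpoint list
cloudcontrolXXXX:~$ sudo wmcs-openstack image list
cloudcontrolXXXX:~$ sudo wmcs-openstack server list --project admin-monitoring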

Admin/Maintenance scripts

Information on our admin/maintenance scripts.

credentials

Puppet installs novaenv.sh on the OpenStack controller. Source it to run nova and glance shell commands without having to pass a long list of arguments on the command line:

root@cloudcontrol1003:~# source novaenv.sh

or:

user@cloudcontrol1003:~$ source <(sudo cat /root/novaenv.sh)

wmcs-openstack

You can run the openstack command under sudo using our custom wrapper, which simply loads credentials on the fly:

user@cloudcontrol1003:~$ sudo wmcs-openstack server list --all-projects
[...]

migration scripts

A collection of scripts to move VM instances around.

wmcs-cold-migrate

The wmcs-cold-migrate tool will shut down an instance, copy it to the specified target host, and boot it on the new host.

root@cloudcontrol:~# openstack server list --all-projects --host <cloudvirt_hostname_without_domain>
root@cloudcontrol:~# wmcs-cold-migrate <args> 7d4a9768-c301-4e95-8bb9-d5aa70e94a64 <destination>

This can take quite a while, so run this in a 'screen' session. At the end of the migration, a prompt will show up asking for confirmation before the final cleanup (including origin disk deletion).

This can be used in other deployments. This example shows how to use it in the codfw1dev deployment:

root@cloudcontrol2001-dev:~# wmcs-cold-migrate --datacenter codfw --nova-db-server localhost --nova-db nova 85d106d7-5c4e-4955-871d-a88dfb5d2a1e cloudvirt2001-dev

To move VMs in the old main region:

root@cloudcontrol1004:~# wmcs-cold-migrate --region eqiad --nova-db nova 5a41a2b1-5bdd-4d52-ba1c-72273b4fe6f3 cloudvirt1005

Check --help for details of the input arguments and their default values.

wmcs-cold-nova-migrate

TODO: fill me. Apparently not working.

wmcs-live-migrate

TODO: fill me.

wmcs-region-migrate

See Neutron migration.

wmcs-region-migrate-security-groups

See Neutron migration.

wmcs-region-migrate-quotas

See Neutron migration.

novastats

A collection of scripts to fetch info from nova.

wmcs-novastats-imagestats

The wmcs-novastats-imagestats script can be run periodically to list which images are currently in use; it can also answer the question 'what instances use image xxx?'. As obsolete images are abandoned they can be deleted from Glance to save disk space.
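
A hedged invocation sketch: like the other novastats scripts it lives on the cloudcontrol hosts, and sourcing novaenv.sh first is an assumption based on the credentials section above; check --help on the host for the actual arguments:

root@cloudcontrol1003:~# source novaenv.sh
root@cloudcontrol1003:~# wmcs-novastats-imagestats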

wmcs-novastats-alltrusty

TODO: fill me.

wmcs-novastats-flavorreport

TODO: fill me.

wmcs-novastats-puppetleaks

TODO: fill me.

wmcs-novastats-capacity

TODO: fill me.

wmcs-dnsleaks

See detecting leaked DNS records.

wmcs-novastats-proxyleaks

TODO: fill me.

DNS

Admin scripts related to DNS operations.

wmcs-makedomain

Creates subdomains under the wmflabs.org domain or any other primary domain. The base domain belongs to the wmflabsdotorg project (or another project), so this script creates the subdomain there and then transfers ownership to the desired project.

We need this because designate doesn't allow us to create subdomains in a given project if the superdomain belongs to a different project.

root@cloudcontrol1004:~# wmcs-makedomain --help
usage: wmcs-makedomain [-h] [--designate-user DESIGNATE_USER]
                       [--designate-pass DESIGNATE_PASS]
                       [--keystone-url KEYSTONE_URL] --project PROJECT
                       [--domain DOMAIN] [--delete] [--all]
                       [--orig-project ORIG_PROJECT]

Create a subdomain and transfer ownership

optional arguments:
  -h, --help            show this help message and exit
  --designate-user DESIGNATE_USER
                        username for nova auth
  --designate-pass DESIGNATE_PASS
                        password for nova auth
  --keystone-url KEYSTONE_URL
                        url for keystone auth and catalog
  --project PROJECT     project for domain creation
  --domain DOMAIN       domain to create
  --delete              delete domain rather than create
  --all                 with --delete, delete all domains in a project
  --orig-project ORIG_PROJECT
                        the project that is oiginally owner of the superdomain
                        in which the subdomain is being created. Typical
                        values are either wmflabsdotorg or admin. Default:
                        wmflabsdotorg  
root@cloudcontrol1004:~# source novaenv.sh   
root@cloudcontrol1004:~# wmcs-makedomain --project traffic --domain traffic.wmflabs.org
root@cloudcontrol1004:~# wmcs-makedomain --project toolsbeta --domain toolsbeta.eqiad1.wikimedia.cloud --orig-project admin
root@cloudcontrol1003:~# wmcs-makedomain --project huggle --domain huggle.wmcloud.org --orig-project cloudinfra

This script is what we use to handle this general-user FAQ entry: Help:Horizon_FAQ#Can_I_create_a_new_DNS_domain/zone_for_my_project,_or_records_under_the_wmflabs.org_domain?

wmcs-wikireplica-dns

This script is used to create and update FQDNs and domain names related to the wikireplicas and other services.

See Portal:Data_Services/Admin/Wiki_Replica_DNS for more details.

TODO: should we simply merge the wmcs-wikireplica-dns docs here?

wmcs-populate-domains

When a new designate/pdns node is set up, designate does not pre-emptively populate domains in pdns; it only updates a domain when that domain is edited.

This script enumerates domains (aka 'zones' in designate) and creates a dummy record in each. It then cleans up those domains. That seeming no-op prompts designate to create and sync each domain with pdns.

Like most wmcs admin scripts, this should be run on a cloudcontrol node with standard novaadmin environment settings.
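
A hedged invocation sketch under that assumption (check --help on the host for the actual arguments):

root@cloudcontrolXXXX:~# source novaenv.sh
root@cloudcontrolXXXX:~# wmcs-populate-domains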

others

Other useful scripts.

wmcs-spreadcheck

This script checks whether different instance types are spread across enough different cloudvirt servers. It is useful for a project with many similar VM instances that should run on separate cloudvirts to avoid potential disasters.
It is usually set up as an NRPE check for Icinga. The script requires a YAML configuration file like this:

project: tools

classifier:
  bastion-: bastion
  checker-: checker
  elastic-: elastic
  flannel-etcd-: flannel-etcd
  k8s-etcd-: k8s-etcd
  mail: mail
  paws-worker-: paws
  prometheus-: prometheus
  proxy-: proxy
  redis-: redis
  services-: services
  sgebastion-: sgebastion
  sgeexec-: sgeexec
  sgegrid-master: sgemaster
  sgegrid-shadow: sgemaster
  sgewebgrid-generic-: sgewebgrid-generic
  sgewebgrid-lighttpd-: sgewebgrid-lighttpd
  static-: static
  worker-: worker

It is usually used like this:

root@cloudcontrol1003:~# wmcs-spreadcheck --config /etc/wmcs-spreadcheck-tools.yaml

wmcs-vm-extra-specs

When a virtual machine is created, the requested flavor data is copied to the instance, and any future updates to the flavor are ignored. This script connects to the Nova database and directly modifies the extra specs of an existing VM.

$ wmcs-vm-extra-specs --help
usage: wmcs-vm-extra-specs [-h] [--nova-db-server NOVA_DB_SERVER] [--nova-db NOVA_DB] [--mysql-password MYSQL_PASSWORD]
                          uuid
                          {quota:disk_read_bytes_sec,
                           quota:disk_read_iops_sec,
                           quota:disk_total_bytes_sec,
                           quota:disk_total_iops_sec,
                           quota:disk_write_bytes_sec,
                           quota:disk_write_iops_sec}
                          spec_value
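
For example, based on the usage above (the UUID and value are purely illustrative), capping a VM's read IOPS would look like:

root@cloudcontrolXXXX:~# wmcs-vm-extra-specs 7d4a9768-c301-4e95-8bb9-d5aa70e94a64 quota:disk_read_iops_sec 5000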

wmcs-instancepurge

Some projects have a limited VM lifespan. For those projects, a systemd timer periodically runs the wmcs-instancepurge script, which checks the age of each VM in the project and deletes VMs over a certain age (after emailing warnings for a few days first).

To expose a new project to lifespan limits, add a section to the profile::openstack::eqiad1::purge_projects key in puppet/hiera/common/profile/openstack/eqiad1.yaml. Be sure to include documentation about why the project is being auto-purged.

profile::openstack::eqiad1::purge_projects:
   - project: sre-sandbox
     days-to-nag: 10
     days-to-delete: 15

See also