Wikimedia Cloud Services team/EnhancementProposals/OpenStackHA

DRAFT WORK IN PROGRESS

Current configuration

Each core OpenStack service provides a stateless RESTful API that supports any number of active instances. For each service, an endpoint is defined and used by clients for service discovery.

The current WMCS VPS OpenStack services run on active/standby controllers. Each endpoint is configured to point directly at a single host:

openstack endpoint list --interface public -c "Service Name" -c "Service Type" -c "URL"
+--------------+--------------+------------------------------------------------------------------------+
| Service Name | Service Type | URL                                                                    |
+--------------+--------------+------------------------------------------------------------------------+
| glance       | image        | http://cloudcontrol1003.wikimedia.org:9292                             |
| designate    | dns          | http://cloudservices1003.wikimedia.org:9001                            |
| neutron      | network      | http://cloudcontrol1003.wikimedia.org:9696                             |
| keystone     | identity     | http://cloudcontrol1003.wikimedia.org:5000/v3                          |
| nova         | compute      | http://cloudcontrol1003.wikimedia.org:8774/v2.1                        |
| proxy        | proxy        | http://proxy-eqiad1.wmflabs.org:5668/dynamicproxy-api/v1/$(tenant_id)s |
+--------------+--------------+------------------------------------------------------------------------+
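One way to see how few hosts the catalog above depends on is to reduce the endpoint URLs to their distinct hostnames. The helper below is a minimal sketch; the function name `endpoint_hosts` is made up for illustration, and it only post-processes text that the `openstack` CLI has already printed.

```shell
# Print the unique hostnames found in endpoint URLs read from stdin.
# endpoint_hosts is a hypothetical helper name, not part of any tool.
endpoint_hosts() {
    # Keep only the scheme://host part of each URL, strip the scheme,
    # then de-duplicate.
    grep -oE 'https?://[^:/ ]+' | sed -E 's#^https?://##' | sort -u
}

# Example usage (needs admin credentials and the openstack CLI):
#   openstack endpoint list --interface public -c URL -f value | endpoint_hosts
```

Every hostname that appears in the output is a host whose failure interrupts the services mapped to it.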

This approach to HA is simple and straightforward, but it allows for periods of service interruption. It also adds operational complexity: recovering from, or surviving, an outage or configuration change requires manual changes to the endpoint configuration of each service.

For more information on the process to switch active/standby endpoints, see https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Fail-over
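To give a sense of the manual work involved, the sketch below prints (but does not run) one `openstack endpoint set --url` command per endpoint that points at the failed host. It is only an illustration and not the documented procedure; the linked page remains authoritative. The function name `repoint_endpoints` and the standby hostname in the usage comment are assumptions.

```shell
# Dry run: print the commands that would repoint endpoints from one host
# to another. Reads "ID URL" pairs on stdin, one endpoint per line.
# repoint_endpoints is a hypothetical helper name.
repoint_endpoints() {
    local old_host="$1" new_host="$2" id url prefix suffix
    while read -r id url; do
        case $url in
            *"//$old_host:"*)
                # Split the URL around "//old_host:" and rebuild it
                # with the new host, keeping scheme, port, and path.
                prefix=${url%%"//$old_host:"*}
                suffix=${url#*"//$old_host:"}
                printf "openstack endpoint set --url '%s' %s\n" \
                    "$prefix//$new_host:$suffix" "$id"
                ;;
        esac
    done
}

# Example usage (cloudcontrol1004 is an assumed standby name):
#   openstack endpoint list -c ID -c URL -f value |
#       repoint_endpoints cloudcontrol1003.wikimedia.org cloudcontrol1004.wikimedia.org
```

Endpoints on other hosts are left out of the output, so each interface of each affected service still needs its own command.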

Proposed design

requirements

  • Each OpenStack controller must be able to fully support the OpenStack control plane on its own with no cross dependencies.
  • No manual intervention is required to survive a service failure, restart, or single-host outage.

load balancing

To take full advantage of multiple OpenStack controllers and the services' native RESTful capabilities, a load balancer can be used to spread traffic across multiple backend instances. HAProxy is a layer 7 load balancer that provides advanced routing and health-check capabilities. It is already used for other services at WMF and is a popular choice in many other OpenStack deployments.

To ensure that there are no cross dependencies and to maintain client sticky sessions, each controller will run a standalone (no peering) HAProxy instance.
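As a rough sketch of what one load-balanced service could look like, the function below renders an HAProxy `listen` section with a health check on every backend. Several details are assumptions rather than part of this proposal: the helper name `render_listen`, the bind address 192.0.2.10 (a documentation-only address), the second controller name, the `GET /` health check path, and the use of `balance source` as one way to keep a client on the same backend without peering between HAProxy instances.

```shell
# Render one HAProxy "listen" section to stdout.
# Args: service name, bind address, port, then one or more backend hosts.
# render_listen is a hypothetical helper name.
render_listen() {
    local name="$1" addr="$2" port="$3" backend
    shift 3
    printf 'listen %s\n' "$name"
    printf '    bind %s:%s\n' "$addr" "$port"
    printf '    mode http\n'
    # Hash on the client address so that a client keeps hitting the same
    # backend without shared state between HAProxy instances (assumption).
    printf '    balance source\n'
    # HTTP health check; the path is an assumption.
    printf '    option httpchk GET /\n'
    for backend in "$@"; do
        # Short hostname as the server label; "check" enables health checks.
        printf '    server %s %s:%s check\n' "${backend%%.*}" "$backend" "$port"
    done
}

# Example usage:
#   render_listen glance 192.0.2.10 9292 \
#       cloudcontrol1003.wikimedia.org cloudcontrol1004.wikimedia.org
```

Because every instance would be rendered from the same inputs, each controller can carry an identical configuration and take over without coordination.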

Networking

HAProxy will need an IPv4 or IPv6 interface to bind to. This should be a separate interface that can be activated on any of the hosts running HAProxy.

endpoint mapping

A service endpoint can have only one definition per public, internal, or admin interface. Endpoints will be directed to the active HAProxy instance.

The DNS entry used for the endpoint URL will map to a single IPv4 or IPv6 address that is dedicated to HAProxy. This address should be different from the existing interface the OpenStack services are bound to. Because HAProxy then listens on its own address, it can reuse each service's standard port, and we avoid having to maintain custom port maps.
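The effect on the endpoint catalog can be sketched as a host-only substitution: the scheme, port, and path of each URL stay the same, and only the hostname changes to the DNS name dedicated to HAProxy. The helper name `to_lb_url` and the DNS name `openstack.example.org` are placeholders; the proposal does not name the entry.

```shell
# Replace only the host part of a URL, keeping scheme, port, and path.
# to_lb_url is a hypothetical helper name.
to_lb_url() {
    local lb_host="$1" url="$2"
    printf '%s\n' "$url" | sed -E "s#^(https?://)[^:/]+#\1${lb_host}#"
}

# Example usage (openstack.example.org is a placeholder DNS name):
#   to_lb_url openstack.example.org http://cloudcontrol1003.wikimedia.org:8774/v2.1
```

Since the port is untouched, clients and services keep using the well-known port of each API.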