Portal:Cloud VPS/Admin/Runbooks/OpenstackAPIResponse
Error / Incident
This alert fires when OpenStack API performance is detected to be "poor". At the time of writing, "poor" means a 12-hour average response time greater than 1 second.
The performance problem is likely affecting the OpenStack control plane, making it unreliable, slow, and/or prone to transient failures.
Debugging
This alert is based on the Prometheus metrics of HAProxy. You can check this Grafana dashboard:
https://grafana-rw.wikimedia.org/d/UUmLqqX4k/openstack-api-performance
At the time of writing, there should be a single primary HAProxy node, which you can identify with:
user@laptop:~ $ host openstack.eqiad1.wikimediacloud.org
openstack.eqiad1.wikimediacloud.org is an alias for cloudcontrol1007.wikimedia.org.
[..]
You can run some OpenStack commands by hand and see how they behave:
user@cloudrabbit1007:~$ sudo wmcs-openstack endpoint list
[..]
user@cloudrabbit1007:~$ sudo wmcs-openstack server list --all-projects
[..]
user@cloudrabbit1007:~$ sudo wmcs-openstack zone list --all-projects
[..]
user@cloudrabbit1007:~$ sudo wmcs-openstack subnet list --all-projects
[..]
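To put numbers on "how they behave", the commands above can be wrapped in a small timing helper. This is a minimal sketch, not part of the WMCS tooling: the function name `time_cmd` is hypothetical, and it assumes bash with GNU `date` (for `%N`) and `awk` available on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the WMCS tooling): run a command,
# discard its output, and print the elapsed wall-clock time in seconds
# together with the command's exit status.
time_cmd() {
    local start end rc
    start=$(date +%s.%N)        # GNU date: seconds.nanoseconds
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" -v rc="$rc" -v cmd="$*" \
        'BEGIN { printf "%.2fs (exit %d): %s\n", e - s, rc, cmd }'
}

# Example usage with the commands from this runbook:
#   time_cmd sudo wmcs-openstack endpoint list
#   time_cmd sudo wmcs-openstack server list --all-projects
```

Commands that consistently take much longer than the others can hint at which OpenStack service is the slow one.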
You can also check basic server stats such as CPU, memory, and load average, using tools like htop.
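As a quick non-interactive alternative, the load average can be compared against the number of CPUs. This is a rough sketch: the function name `load_per_cpu` is hypothetical, it assumes a Linux host with `/proc/loadavg` and `nproc`, and a value well above 1.00 is only a hint that the host may be CPU-saturated.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 1-minute load average divided by the
# number of CPUs. Both values can be passed explicitly; by default they
# are read from the local host (/proc/loadavg and nproc).
load_per_cpu() {
    local load cpus
    load=${1:-$(cut -d ' ' -f1 /proc/loadavg)}
    cpus=${2:-$(nproc)}
    awk -v l="$load" -v c="$cpus" 'BEGIN { printf "%.2f\n", l / c }'
}

# Example usage on the affected host:
#   load_per_cpu
#   free -h    # memory usage overview
#   uptime     # load averages over 1, 5 and 15 minutes
```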
Operations
Some common operations to handle this alert.
Services restart
In the past, restarting the entire OpenStack control plane has mostly resolved the performance problem.
There is an automated cookbook to do it:
user@laptop:~ $ cookbook wmcs.openstack.restart_openstack --task-id T11111 --cluster-name eqiad1 --all
[..]
Increase number of workers
In the past, increasing the number of API workers has helped accommodate additional load.
Example ticket: T336379 - Openstack API slowdowns.
Old incidents
- phab:T345084 (open task to track this recurring issue)
- In the past, Prometheus scraping of the OpenStack metrics has caused performance to suffer.
See for example phabricator ticket: T335943