Incidents/20151118-LVS-PyBal
Summary
PyBal was stopped on the primary eqiad LVS servers (lvs100[123]) for maintenance, with the expectation that traffic would be unaffected and shift to the backup servers (lvs100[456]). Some services were not operating correctly on the backups, causing a partial outage. pybal was restarted ~5 minutes later on the primaries, ending the outage.
The impact during the window was limited. Most service traffic moved successfully, and only a few specific service+proto+port combinations failed, with the primary public-facing affected services being:
Service | Protocol | Port |
---|---|---|
text-lb | IPv4 | 80 |
mobile-lb | IPv6 | 443 |
misc-web-lb | IPv4 | 443 |
Because most of our traffic is IPv4 on port 443 (HTTPS) for all services, the impact to text and mobile services was limited (HTTP->HTTPS redirects for text, IPv6 users for mobile). misc-web was affected for most users, denying access to services such as phabricator, gerrit, racktables, etc.
Traffic graphs of the primary clusters in eqiad: https://phabricator.wikimedia.org/F2972690
Timeline
- 13:56 - pybal stopped on lvs100[123] by bblack
- 14:00 - first user report on IRC: "< aude> did someone kill phabricator?"
- 14:01 - first automated report on IRC: "< icinga-wm> PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: Connection refused"
- 14:01 - pybal restarted on lvs100[123] by bblack
- 14:02 - traffic restored to normal
Conclusions
Because the same pattern of pybal failover to the same version of software had happened successfully at 3 other datacenters the day before, confidence was too high and not enough pre-flight checking was done. More verification on the state of lvs100[456] should have been done prior to the start of maintenance. The problem issues were obvious prior to the outage in the output of "ipvsadm -Ln" as well as the pybal service logs in "journalctl" if they were examined in depth.
The actual technical issues are due to bugs in PyBal.
Actionables
- Fix PyBal (task T118948)