PyBal

From Wikitech

PyBal is an automated manager for LVS. We use it to continuously monitor Varnish or Apache servers and adjust the LVS load balancer pooling and weights accordingly. The service is written in Python using the Twisted framework.

For more information about Wikimedia's LVS setup in general, see LVS.

Features

PyBal distinguishes itself from lvsmon in a few aspects:

  • It uses asynchronous communication, and thus runs all checks in parallel instead of sequentially
  • It has an extra monitoring method called IdleConnection, which keeps an idle connection open to every Squid server and therefore notices immediately when a Squid process shuts down or crashes
  • It can fetch server lists over HTTP as well as from the local filesystem
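The difference between parallel and sequential checks can be illustrated with a small sketch. It uses the standard-library asyncio module rather than Twisted, and the check itself is a stand-in that only sleeps, so this is not PyBal's monitoring code; the host names and the check_server and run_checks helpers are hypothetical.

```python
import asyncio
import time


async def check_server(host: str, delay: float) -> tuple[str, bool]:
    """Stand-in for a health check: sleeps instead of opening a connection."""
    await asyncio.sleep(delay)
    return host, True


async def run_checks(hosts: list[str], delay: float = 0.2) -> dict[str, bool]:
    """Start every check at once and collect the results keyed by host."""
    results = await asyncio.gather(*(check_server(h, delay) for h in hosts))
    return dict(results)


hosts = ["sq1.example", "sq2.example", "sq3.example"]  # hypothetical host names
start = time.monotonic()
status = asyncio.run(run_checks(hosts))
elapsed = time.monotonic() - start

# Three 0.2 s checks finish in roughly 0.2 s, not 0.6 s, because they overlap.
print(status, f"{elapsed:.2f}s")
```

With sequential checks the total time grows with the number of servers; with concurrent checks it stays close to the duration of the slowest single check.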

Setup

PyBal is currently installed on our LVS hosts, in the directory /usr/sbin. Start or stop it with systemctl start pybal.service or systemctl stop pybal.service.

Configuration is in /etc/pybal/. pybal.conf defines the LVS service parameters.

The list of pooled hosts lives in the directory /srv/pybal-config on whichever host has the pybal::web class installed via puppet. It is reachable at the internal address http://configuration-master.$site.wmnet/pybal, with one file per LVS service. Each entry has these attributes:

  • weight: a larger number means that more requests get sent to this server in comparison with others
  • enabled: either True or False, depending on whether you want requests to be sent to this server

The format should be fairly self-explanatory; the files more or less use Python assignment / dictionary syntax.
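As a rough illustration of what reading such a file could look like, the sketch below assumes that each line is a Python dictionary literal carrying a host key next to weight and enabled. The host key, the sample host names, and the parse_pool helper are assumptions made for this example, not taken from a real pool file.

```python
import ast

# Hypothetical pool file contents; only weight and enabled are documented above.
SAMPLE = """\
{'host': 'sq1.example', 'weight': 10, 'enabled': True}
{'host': 'sq2.example', 'weight': 20, 'enabled': False}
# comment lines and blank lines are skipped in this sketch
"""


def parse_pool(text: str) -> list[dict]:
    """Parse one dictionary literal per line without executing any code."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        entry = ast.literal_eval(line)  # accepts literals only, unlike eval()
        if not isinstance(entry, dict):
            raise ValueError(f"expected a dict literal, got: {line!r}")
        servers.append(entry)
    return servers


pool = parse_pool(SAMPLE)
enabled = [s["host"] for s in pool if s["enabled"]]
print(enabled)  # ['sq1.example']
```

ast.literal_eval is used here because it evaluates literals only, which matters when the list is fetched over HTTP.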

PyBal supports multiple LVS services through a single instance and configuration file pybal.conf, e.g.:

[text]
protocol = tcp
ip = 145.97.39.155
port = 80
scheduler = wlc
config = file:///etc/pybal/text-squids

[images]
protocol = tcp
ip = 145.97.39.156
port = 80
scheduler = wlc
config = file:///etc/pybal/upload-squids
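Because the file uses INI-style sections, one per LVS service, it can be inspected with Python's standard configparser module. The sketch below only shows how the sections map to services; it makes no claim about how PyBal itself loads the file.

```python
import configparser

# The two-service example from above, inlined so the sketch is self-contained.
PYBAL_CONF = """\
[text]
protocol = tcp
ip = 145.97.39.155
port = 80
scheduler = wlc
config = file:///etc/pybal/text-squids

[images]
protocol = tcp
ip = 145.97.39.156
port = 80
scheduler = wlc
config = file:///etc/pybal/upload-squids
"""

parser = configparser.ConfigParser()
parser.read_string(PYBAL_CONF)

# Map each section name to its (ip, port, scheduler) triple.
services = {
    name: (parser[name]["ip"], parser.getint(name, "port"), parser[name]["scheduler"])
    for name in parser.sections()
}

for name, (ip, port, scheduler) in services.items():
    print(f"{name}: {ip}:{port} scheduler={scheduler}")
```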

Beware: the code as checked out from git has DryRun = True set in ipvs.py, meaning that it will not modify any actual IPVS state and will only show the commands for debugging. This should become a command-line option, but for now edit that file and set DryRun = False.

The configuration files are generated via puppet.

How to

See LVS.

Updating PyBal on LVS instances

After testing new releases on pybal-test2003.codfw.wmnet, PyBal should be updated in the following order:

  • ulsfo
  • eqsin
  • drmrs
  • esams
  • codfw
  • eqiad

Within each datacenter, first update the passive instances and check that everything is fine, then move on to the active instances. As long as BGP is enabled, redirecting traffic from an active instance to the passive one should be as easy as stopping PyBal on the active instance. After stopping it, you should see the number of active connections increase on the passive instance when running the ipvsadm command described below.

After updating a BGP-enabled PyBal instance, you can check that everything is good on the router side with the following commands:

show bgp summary | match <pybal_ip>
show bgp neighbor <pybal_ip>
show route receive-protocol bgp <pybal_ip>

On the PyBal instance the following commands are useful:

ipvsadm -Ln
/usr/local/lib/nagios/plugins/check_pybal_ipvs_diff --prometheus-url http://<pybal_ip>:9100/metrics
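To compare active connections before and after a failover, the ipvsadm -Ln listing can be summed per virtual service. The sketch below parses a hard-coded sample in the usual ipvsadm -Ln layout; the addresses and counters are made up for illustration, and active_connections is a hypothetical helper, not part of PyBal or ipvsadm.

```python
# Illustrative sample in the layout printed by ipvsadm -Ln (values are made up).
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.2.1.1:80 wlc
  -> 10.64.0.10:80                Route   10     12         30
  -> 10.64.0.11:80                Route   10     8          25
TCP  10.2.1.2:443 sh
  -> 10.64.0.20:443               Route   1      5          2
"""


def active_connections(output: str) -> dict[str, int]:
    """Sum the ActiveConn column of the real servers under each virtual service."""
    totals: dict[str, int] = {}
    current = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("TCP", "UDP"):
            # Virtual service line: protocol followed by address:port.
            current = f"{fields[0]} {fields[1]}"
            totals[current] = 0
        elif fields[0] == "->" and current is not None and fields[-2].isdigit():
            # Real server line: ActiveConn is the second-to-last column.
            totals[current] += int(fields[-2])
    return totals


print(active_connections(SAMPLE))
# {'TCP 10.2.1.1:80': 20, 'TCP 10.2.1.2:443': 5}
```

On a live host the same function could be fed the output of the ipvsadm command, for example via subprocess.run with capture_output=True.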

Testing

New PyBal releases can be tested on pybal-test2003.codfw.wmnet. The systems are deployed with role(pybaltest). Configuration example:

# /etc/pybal/pybal.conf on pybal-test2001
[global]
bgp = yes
bgp-local-asn = 64496
bgp-peer-address = 10.192.16.140
#bgp-as-path = 64460
bgp-nexthop-ipv4 = 10.192.16.139
bgp-nexthop-ipv6 = 2620:0:860:101:10:192:1:3
instrumentation = yes
instrumentation_ips = [ '127.0.0.1', '::1', '10.192.16.139' ]
# Lower is preferred
bgp-med = 50

# Service definition
[textlb6_80]
protocol = tcp
ip = 2620:0:860:ed1a::1
port = 80
scheduler = sh

config = etcd://conf2001.codfw.wmnet/conftool/v1/pools/codfw/cache_text/varnish-fe/

depool-threshold = .5
monitors = ["IdleConnection"]

# IdleConnection monitor configuration
idleconnection.max-delay = 300
idleconnection.timeout-clean-reconnect = 3

# /etc/pybal/pybal.conf on pybal-test2003
[global]
bgp = yes
bgp-local-asn = 64496
bgp-peer-address = 10.192.16.140
#bgp-as-path = 64460
bgp-nexthop-ipv4 = 10.192.16.141
bgp-nexthop-ipv6 = 2620:0:860:101:10:192:1:3
instrumentation = yes
instrumentation_ips = [ '127.0.0.1', '::1', '10.192.16.141' ]
# Lower is preferred
bgp-med = 100

# Service definition
[...]
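The bgp-med values in the two configurations decide which test instance the BGP peer prefers. The sketch below shows only that final comparison: it assumes every other BGP attribute of the two announcements is equal, which is the situation in which MED becomes the tie-breaker between routes from the same neighboring AS. The preferred_instance helper is hypothetical.

```python
# bgp-med values taken from the two example configurations above.
announcements = {
    "pybal-test2001": 50,
    "pybal-test2003": 100,
}


def preferred_instance(meds: dict[str, int]) -> str:
    """Return the instance whose route wins when only MED differs (lowest wins)."""
    return min(meds, key=meds.get)


print(preferred_instance(announcements))  # pybal-test2001
```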

A Quagga instance is installed on pybal-test2002 and can be used to test the BGP component of PyBal:

log file /var/log/quagga/quagga.log
!
debug zebra rib
debug bgp events
debug bgp updates
debug bgp zebra
!
password SECRET
!
interface eth0
 ipv6 nd suppress-ra
!
interface lo
!
router bgp 64460
 bgp router-id 10.192.16.140
 no bgp default ipv4-unicast
 network 127.0.0.2/32
 neighbor 10.192.16.139 remote-as 64496
 neighbor 10.192.16.139 description PyBal on pybal-test2001
 neighbor 10.192.16.139 activate
 neighbor 10.192.16.139 prefix-list NONE out

 neighbor 10.192.16.141 remote-as 64496
 neighbor 10.192.16.141 description PyBal on pybal-test2003
 neighbor 10.192.16.141 activate
 neighbor 10.192.16.141 prefix-list NONE out
!
 address-family ipv6
 network 2620:0:860:102::/64
 neighbor 10.192.16.139 activate
 neighbor 10.192.16.141 activate
 exit-address-family
!
ip prefix-list NONE seq 5 deny any
!
ip forwarding
ipv6 forwarding
!
line vty
!

The IPv4 routing table can be inspected with:

vtysh -c 'show ip route'

Similarly, to inspect the IPv6 routing table:

vtysh -c 'show ipv6 route'

Alerts

PyBal IPVS diff check

The alert fires whenever PyBal and IPVS disagree on the current configuration.

Services in IPVS but unknown to PyBal

For example, after services are removed from PyBal (or their ports are changed), the stale IPVS virtual services might not get removed. The check then reports: CRITICAL: Services in IPVS but unknown to PyBal: set(['addr:port']). In such cases it is sufficient to delete the stale TCP service on the LVS pair:

 ipvsadm --delete-service --tcp-service addr:port

Services known to PyBal but not to IPVS

This alert is usually temporary. It is caused by new services having been set up (i.e. they are in etcd, so PyBal knows about them) while PyBal hasn't been restarted yet and thus hasn't had a chance to program IPVS correctly. The fix is to restart PyBal.

For example:

 PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.54:443])
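Both alert cases amount to a set difference between the services PyBal knows about and the services programmed in IPVS. The sketch below illustrates that comparison; it is not the code of the check_pybal_ipvs_diff plugin, the ipvs_diff helper is hypothetical, and the first address in the sample data is made up.

```python
def ipvs_diff(pybal_services: set[str], ipvs_services: set[str]) -> list[str]:
    """Describe every disagreement between PyBal's view and the IPVS state."""
    problems = []
    stale = ipvs_services - pybal_services    # left behind in IPVS
    missing = pybal_services - ipvs_services  # not yet programmed into IPVS
    if stale:
        problems.append(f"Services in IPVS but unknown to PyBal: {sorted(stale)}")
    if missing:
        problems.append(f"Services known to PyBal but not to IPVS: {sorted(missing)}")
    return problems


# Example: PyBal has learned about 10.2.1.54:443 but has not been restarted yet.
pybal = {"10.2.1.1:80", "10.2.1.54:443"}
ipvs = {"10.2.1.1:80"}

for problem in ipvs_diff(pybal, ipvs):
    print("CRITICAL:", problem)
# CRITICAL: Services known to PyBal but not to IPVS: ['10.2.1.54:443']
```

An empty result means the two views agree; a stale entry calls for the ipvsadm --delete-service step, while a missing one calls for a PyBal restart.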

See also

  • lvsmon: The predecessor to PyBal
