PyBal

From Wikitech

PyBal is an LVS monitoring script. It's written in Python using the Twisted framework.

For more information about Wikimedia's LVS setup in general, see LVS.

At this moment, just a few features distinguish it from lvsmon:

  • It's using asynchronous communication, and thus runs all checks in parallel instead of sequentially
  • It has an extra monitoring method called IdleConnection, which keeps an idle connection open to all squids, and therefore notices immediately when the Squid processes shut down or crash
  • It can fetch server lists over HTTP as well as from the local filesystem
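
The first point, running all checks in parallel rather than one after another, can be illustrated with a minimal sketch. It uses the standard library's asyncio instead of Twisted so that it stays self-contained; it is not PyBal's monitoring code, and the host names and the simulated check are made up for illustration.

```python
import asyncio
import time

# Hypothetical host names, used only for illustration.
HOSTS = ["sq1", "sq2", "sq3", "sq4", "sq5"]
CHECK_DURATION = 0.1  # seconds that each simulated check takes


async def check_host(host):
    """Simulate one health check; a real monitor would do network I/O here."""
    await asyncio.sleep(CHECK_DURATION)
    return host, True


async def run_checks_in_parallel(hosts):
    """Start every check at once and wait until all of them have finished."""
    results = await asyncio.gather(*(check_host(h) for h in hosts))
    return dict(results)


def timed_run(hosts):
    """Run all checks and return (results, elapsed seconds)."""
    start = time.monotonic()
    results = asyncio.run(run_checks_in_parallel(hosts))
    return results, time.monotonic() - start
```

Because the checks overlap, the total time stays close to the duration of a single check instead of growing with the number of hosts.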

...but I intend to polish it more, and extend it with useful things.

The script is in Wikimedia's Git repo "operations/debs", subdirectory pybal.

-- Mark

Setup

PyBal is currently installed on our LVS hosts, in directory /usr/sbin. Start or stop it with:

 /etc/init.d/pybal start|stop

Configuration is in /etc/pybal/. The file pybal.conf defines the LVS service parameters.

The list of pooled hosts lives on whichever host has the pybal::web class installed via Puppet, under the directory /srv/pybal-config, with one file per LVS service. It is reachable at the internal address http://configuration-master.$site.wmnet/pybal. Each host entry has these attributes:

  • weight: a larger number means that more requests get sent to this server in comparison with others
  • enabled: either True or False, depending on whether you want requests to be sent to this server

The format should be fairly self-explanatory; the files more or less use Python assignment / dictionary syntax.
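
As a rough illustration of that dictionary syntax, the sketch below parses a made-up pool file with one dictionary literal per line. The weight and enabled attributes come from the description above; the 'host' key, the host names, and the comment handling are assumptions of this sketch, not a specification of the real file format.

```python
import ast

# Hypothetical pool file contents; real files may differ in detail.
SAMPLE_POOL = """\
# one server per line
{'host': 'sq1.example.wmnet', 'weight': 10, 'enabled': True}
{'host': 'sq2.example.wmnet', 'weight': 20, 'enabled': False}
"""


def parse_server_list(text):
    """Parse one Python dict literal per line, skipping blanks and comments."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # literal_eval only accepts literals, so it never executes code.
        servers.append(ast.literal_eval(line))
    return servers


def pooled_hosts(servers):
    """Return the names of the servers that should receive requests."""
    return [s["host"] for s in servers if s["enabled"]]
```

Using ast.literal_eval rather than eval keeps the parsing safe even if a file contains something other than plain literals.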

PyBal supports multiple LVS services through a single instance and configuration file pybal.conf, e.g.:

[text]
protocol = tcp
ip = 145.97.39.155
port = 80
scheduler = wlc
config = file:///etc/pybal/text-squids

[images]
protocol = tcp
ip = 145.97.39.156
port = 80
scheduler = wlc
config = file:///etc/pybal/upload-squids
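
To show how one file can carry several services, here is a small sketch that reads the example above with Python's standard configparser module. It only demonstrates the INI-style layout; whether PyBal itself loads the file this way is not stated here, and the load_services helper is a name invented for this example.

```python
import configparser

# The two service sections from the example above.
PYBAL_CONF = """\
[text]
protocol = tcp
ip = 145.97.39.155
port = 80
scheduler = wlc
config = file:///etc/pybal/text-squids

[images]
protocol = tcp
ip = 145.97.39.156
port = 80
scheduler = wlc
config = file:///etc/pybal/upload-squids
"""


def load_services(text):
    """Return a dict mapping each section name to its service parameters."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    services = {}
    for name in parser.sections():
        section = parser[name]
        services[name] = {
            "protocol": section["protocol"],
            "ip": section["ip"],
            "port": section.getint("port"),  # convert the port to an integer
            "scheduler": section["scheduler"],
            "config": section["config"],
        }
    return services
```

Each section name becomes one LVS service, so adding a service is a matter of adding another section.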

Beware: the code as checked out from Git has DryRun = True set in ipvs.py, so it will not modify any IPVS state and only shows the commands for debugging. This should become a command-line option, but for now edit that file and set DryRun = False.
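
The dry-run idea can be sketched as follows. This is not the code in ipvs.py: the apply_ipvs_command helper and the DRY_RUN constant are hypothetical names, and the sketch only shows the general pattern of printing a command instead of executing it.

```python
import subprocess

DRY_RUN = True  # this sketch's own flag, mirroring the idea of DryRun = True


def apply_ipvs_command(args, dry_run=DRY_RUN):
    """Print the ipvsadm command in dry-run mode, otherwise execute it."""
    command = ["ipvsadm"] + list(args)
    if dry_run:
        # Show what would be run without touching any IPVS state.
        print(" ".join(command))
        return command
    subprocess.run(command, check=True)
    return command


# Dry run: prints the command that would add a TCP service with the wlc scheduler.
apply_ipvs_command(["-A", "-t", "145.97.39.155:80", "-s", "wlc"])
```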

The configuration files are generated via puppet.

How to

See LVS.

Updating PyBal on LVS instances

After testing new releases on pybal-test200[123].codfw.wmnet, PyBal should be updated first on lvs1007-lvs1010.eqiad.wmnet and, if it works as expected, then on

  • eqsin
  • ulsfo
  • esams
  • codfw
  • eqiad

in that specific order. Within each datacenter, first update the passive instances and check that everything is fine, then move on to the active instances. As long as BGP is enabled, redirecting traffic from an active instance to the passive one should be as easy as stopping PyBal on the active instance. After stopping it, you should see the number of active connections increase on the passive instance when running the ipvsadm command described below.
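
The ordering described above (sites in the listed order, passive before active within each site) can be written down as a short sketch. The role labels and the rollout_steps helper are names made up for this example, and the initial step on lvs1007-lvs1010.eqiad.wmnet is left out.

```python
# Site order taken from the list above; the role names are this sketch's own labels.
SITE_ORDER = ["eqsin", "ulsfo", "esams", "codfw", "eqiad"]
ROLE_ORDER = ["passive", "active"]


def rollout_steps(sites=SITE_ORDER, roles=ROLE_ORDER):
    """Return (site, role) pairs in the order they should be updated."""
    return [(site, role) for site in sites for role in roles]


for number, (site, role) in enumerate(rollout_steps(), start=1):
    print(f"{number}. update {role} instances in {site}")
```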

After updating a BGP enabled PyBal instance you can check that everything is good on the router side with the following commands:

show bgp summary | match <pybal_ip>
show bgp neighbor <pybal_ip>
show route receive-protocol bgp <pybal_ip>

On the PyBal instance the following commands are useful:

ipvsadm -Ln
/usr/local/lib/nagios/plugins/check_pybal_ipvs_diff --prometheus-url http://<pybal_ip>:9100/metrics
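
To compare active connections before and after stopping PyBal on an active instance, the ipvsadm -Ln output can be summed per service. The sketch below works on a made-up sample in the usual ipvsadm table layout; the real-server addresses and counters are invented, and active_connections is a hypothetical helper, not part of PyBal.

```python
# Invented sample in the layout printed by ipvsadm -Ln.
SAMPLE_OUTPUT = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  145.97.39.155:80 wlc
  -> 10.0.0.1:80                  Route   10     120        30
  -> 10.0.0.2:80                  Route   10     95         12
TCP  145.97.39.156:80 wlc
  -> 10.0.0.3:80                  Route   10     40         7
"""


def active_connections(output):
    """Sum the ActiveConn column per virtual service."""
    totals = {}
    current = None
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("TCP", "UDP"):
            # A new virtual service starts here.
            current = fields[1]
            totals[current] = 0
        elif (current is not None and len(fields) == 6
              and fields[0] == "->" and fields[4].isdigit()):
            # A real-server row; the column header row fails the isdigit test.
            totals[current] += int(fields[4])
    return totals


print(active_connections(SAMPLE_OUTPUT))
```

In practice the input would come from running ipvsadm -Ln on the instance, for example twice a few seconds apart, to see whether the totals on the passive instance rise.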

Testing

New PyBal releases can be tested on pybal-test200[123].codfw.wmnet. The systems are deployed with role(pybaltest). Configuration example:

# /etc/pybal/pybal.conf on pybal-test2001
[global]
bgp = yes
bgp-local-asn = 64496
bgp-peer-address = 10.192.16.140
#bgp-as-path = 64460
bgp-nexthop-ipv4 = 10.192.16.139
bgp-nexthop-ipv6 = 2620:0:860:101:10:192:1:3
instrumentation = yes
instrumentation_ips = [ '127.0.0.1', '::1', '10.192.16.139' ]
# Lower is preferred
bgp-med = 50

# Service definition
[textlb6_80]
protocol = tcp
ip = 2620:0:860:ed1a::1
port = 80
scheduler = sh

config = etcd://conf2001.codfw.wmnet/conftool/v1/pools/codfw/cache_text/varnish-fe/

depool-threshold = .5
monitors = ["IdleConnection"]

# IdleConnection monitor configuration
idleconnection.max-delay = 300
idleconnection.timeout-clean-reconnect = 3

# /etc/pybal/pybal.conf on pybal-test2003
[global]
bgp = yes
bgp-local-asn = 64496
bgp-peer-address = 10.192.16.140
#bgp-as-path = 64460
bgp-nexthop-ipv4 = 10.192.16.141
bgp-nexthop-ipv6 = 2620:0:860:101:10:192:1:3
instrumentation = yes
instrumentation_ips = [ '127.0.0.1', '::1', '10.192.16.141' ]
# Lower is preferred
bgp-med = 100

# Service definition
[...]
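
The "lower is preferred" comment on bgp-med can be illustrated by comparing the two test configurations. The sketch below reads trimmed-down copies of the [global] sections with configparser and picks the instance with the lowest MED. It deliberately ignores the rest of BGP best-path selection, and read_med and preferred_instance are names invented for this example.

```python
import configparser

# Trimmed-down copies of the [global] sections shown above.
CONF_2001 = """\
[global]
bgp = yes
bgp-nexthop-ipv4 = 10.192.16.139
# Lower is preferred
bgp-med = 50
"""

CONF_2003 = """\
[global]
bgp = yes
bgp-nexthop-ipv4 = 10.192.16.141
# Lower is preferred
bgp-med = 100
"""


def read_med(text):
    """Return the bgp-med value from the [global] section as an integer."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser["global"].getint("bgp-med")


def preferred_instance(configs):
    """Return the name of the instance announcing the lowest MED."""
    return min(configs, key=lambda name: read_med(configs[name]))


configs = {"pybal-test2001": CONF_2001, "pybal-test2003": CONF_2003}
print(preferred_instance(configs))  # prints "pybal-test2001"
```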

A Quagga instance is installed on pybal-test2002 and can be used to test the BGP component of PyBal:

log file /var/log/quagga/quagga.log
!
debug zebra rib
debug bgp events
debug bgp updates
debug bgp zebra
!
password SECRET
!
interface eth0
 ipv6 nd suppress-ra
!
interface lo
!
router bgp 64460
 bgp router-id 10.192.16.140
 no bgp default ipv4-unicast
 network 127.0.0.2/32
 neighbor 10.192.16.139 remote-as 64496
 neighbor 10.192.16.139 description PyBal on pybal-test2001
 neighbor 10.192.16.139 activate
 neighbor 10.192.16.139 prefix-list NONE out

 neighbor 10.192.16.141 remote-as 64496
 neighbor 10.192.16.141 description PyBal on pybal-test2003
 neighbor 10.192.16.141 activate
 neighbor 10.192.16.141 prefix-list NONE out
!
 address-family ipv6
 network 2620:0:860:102::/64
 neighbor 10.192.16.139 activate
 neighbor 10.192.16.141 activate
 exit-address-family
!
ip prefix-list NONE seq 5 deny any
!
ip forwarding
ipv6 forwarding
!
line vty
!

The IPv4 routing table can be inspected with:

vtysh -c 'show ip route'

Similarly, to inspect the IPv6 routing table:

vtysh -c 'show ipv6 route'

See also

  • lvsmon: The predecessor to PyBal