
HAProxy

From Wikitech

HAProxy is a mature and relatively simple proxy for L4 (TCP) and L7 (HTTP/HTTPS) traffic. We use it in several parts of our infrastructure as a TLS terminator, a load balancer, or an automatic failover handler. It is not overloaded with features, but it is reliable and efficient ("it just does the job").

It can perform simple built-in health checks for services such as MySQL. For more complex checks, a typical deployment sets up a small service listening on an IP socket that answers with an HTTP status code, which HAProxy then uses to decide whether a backend is usable. For an example, see: https://www.percona.com/doc/percona-xtradb-cluster/5.7/howtos/haproxy.html
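To illustrate that pattern, here is a minimal sketch of such a responder using only the Python standard library. It is not the Percona clustercheck script. The backend test is only a TCP connect to 127.0.0.1:3306, and the listening port 9200 is an assumption. A real check would run a SQL query or inspect replication state instead.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumptions for this sketch: the database listens locally on 3306 and
# the health check is exposed on port 9200. Adjust both for real use.
DB_HOST = "127.0.0.1"
DB_PORT = 3306
CHECK_PORT = 9200


def backend_is_healthy(host: str = DB_HOST, port: int = DB_PORT,
                       timeout: float = 1.0) -> bool:
    """Very shallow check: can a TCP connection to the backend be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def status_for(healthy: bool) -> int:
    """Map health to an HTTP code; HAProxy's httpchk treats 2xx/3xx as up."""
    return 200 if healthy else 503


class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        code = status_for(backend_is_healthy())
        body = b"OK\n" if code == 200 else b"DOWN\n"
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = CHECK_PORT) -> None:
    """Block forever, answering health checks on the given port."""
    HTTPServer(("0.0.0.0", port), CheckHandler).serve_forever()
```

On the HAProxy side, this would pair with option httpchk in the listen block and check port 9200 on each server line. That directs the health check at the responder instead of at MySQL itself.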

HAProxy for edge caching

Since April 2022, we use HAProxy for TLS termination (T290005).

Since July 2023, we also use HAProxy for port 80 management (T323557).

This makes HAProxy the front-most L7 layer for most of our traffic.

Session state at disconnection

When a session ends, both TCP and HTTP logs record a session termination indicator in the "termination_state" field, just before the number of active connections. It is two characters long in TCP mode and four characters long in HTTP mode. Each character has a specific meaning, which makes anomalous session terminations easy to spot.

A list of codes is provided at: HAProxy/session states

This has been imported from this upstream source.

Related Phabricator tickets are: T308952, T308940.
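When reading logs by hand, it can help to expand the first two characters into words. The sketch below covers the codes listed in the upstream HAProxy documentation for those two positions. Treat HAProxy/session states as authoritative, because newer HAProxy versions may add codes. The third and fourth characters (HTTP mode only) describe persistence cookie handling and are deliberately not decoded.

```python
from typing import Tuple

# First character: the event that ended the session.
EVENT = {
    "C": "client aborted the session",
    "S": "server aborted or refused the session",
    "P": "proxy aborted the session (limit, deny rule or security check)",
    "L": "session processed locally by HAProxy",
    "R": "a proxy resource was exhausted",
    "I": "internal error detected by the proxy",
    "D": "killed because the server was detected as down",
    "U": "killed on a backup server because an active server came up",
    "K": "killed by an admin",
    "c": "client-side timeout expired",
    "s": "server-side timeout expired",
    "-": "normal completion",
}

# Second character: what the session was doing when it ended.
PHASE = {
    "R": "waiting for a complete client request",
    "Q": "waiting in the queue",
    "C": "waiting for the server connection",
    "H": "waiting for server response headers",
    "D": "data phase",
    "L": "sending last data to the client",
    "T": "request tarpitted",
    "-": "normal completion",
}


def decode_termination_state(state: str) -> Tuple[str, str]:
    """Explain the first two characters of a termination_state field.

    Unknown codes are reported as "unknown" rather than guessed.
    """
    if len(state) not in (2, 4):
        raise ValueError(f"expected 2 or 4 characters, got {state!r}")
    return (EVENT.get(state[0], "unknown"), PHASE.get(state[1], "unknown"))
```

For example, decode_termination_state("sH") reports a server-side timeout that expired while HAProxy was waiting for response headers.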

Silent-drop

HAProxy enforces certain concurrency limits by silently dropping requests once a limit is reached. Silent-drop events produce one of two log messages, depending on whether they occur on the TLS frontend or on the HTTP (port 80) frontend. Recent events can be checked via cumin:

  • $ sudo cumin --ignore-exit-codes A:cp 'journalctl -u haproxy --since=-1h | grep silent-drop_for'
  • $ sudo cumin --ignore-exit-codes A:cp 'journalctl -u haproxy --since=-1h | grep silent-drop_port80_for'
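To get a per-frontend count instead of raw lines, you can pass the output of the journalctl part of those commands to a small helper. The following is a minimal sketch. The marker strings are the ones grepped above. Mapping silent-drop_for to the TLS frontend is inferred from the port80 naming, and the frontend labels are illustrative.

```python
from collections import Counter
from typing import Iterable

# Marker substrings from the cumin commands above. "silent-drop_for" is not
# a substring of "silent-drop_port80_for", so the two patterns do not overlap.
MARKERS = {
    "silent-drop_port80_for": "port 80 (HTTP) frontend",
    "silent-drop_for": "TLS frontend",
}


def count_silent_drops(lines: Iterable[str]) -> Counter:
    """Count silent-drop log lines per frontend."""
    counts: Counter = Counter()
    for line in lines:
        for marker, frontend in MARKERS.items():
            if marker in line:
                counts[frontend] += 1
    return counts
```

On a single cp host, a (hypothetical) script calling count_silent_drops(sys.stdin) could be fed with journalctl -u haproxy --since=-1h.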

HAProxy for MariaDB

Find out if a proxy is being actively used

As of 24 September 2019, not all services use a proxy, even though each has one ready to be used. To find out which proxies are in use, run:

$ host m1-master ; host m2-master ; host m3-master ; host labsdb-analytics ; host labsdb-web

If the result points to a dbproxy, it means it is in use. If it points to a database, it means there is no proxy being used in front of it.
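You can run the same check from Python by following the CNAME chain and looking at the canonical name, which is what the host output shows. The following is a sketch, not an existing tool. It assumes the service names are CNAMEs resolvable through the local search domain, as with host, and that proxies follow the dbproxyNNNN naming used on this page.

```python
import socket
from typing import Iterable

NAMES = ["m1-master", "m2-master", "m3-master", "labsdb-analytics", "labsdb-web"]


def canonical_name(name: str) -> str:
    """Resolve a name and return its canonical host name (after CNAMEs)."""
    canonical, _aliases, _addresses = socket.gethostbyname_ex(name)
    return canonical


def uses_proxy(canonical: str) -> bool:
    """True if the canonical name looks like a dbproxy host (naming assumption)."""
    return canonical.split(".")[0].startswith("dbproxy")


def report(names: Iterable[str] = NAMES) -> None:
    for name in names:
        target = canonical_name(name)
        verdict = "proxy in use" if uses_proxy(target) else "no proxy"
        print(f"{name} -> {target}: {verdict}")
```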

Failover

A typical proxy is configured like this (/etc/haproxy/conf.d/*). It knows about a primary (the DB master) and a secondary (a DB replication slave), but only one node is active at any time:

listen mariadb 0.0.0.0:3306
   mode tcp
   balance roundrobin
   option tcpka
   option mysql-check user haproxy
   server <%= @primary_name %> <%= @primary_addr %> check inter 3s fall 3 rise 99999999
   server <%= @secondary_name %> <%= @secondary_addr %> check backup

If the primary fails its health checks (three consecutive failures, checked every 3 seconds, per inter 3s fall 3), the backup is brought online. The rise 99999999 trick (about 10 years of successful checks) means that the primary does not come back without human intervention, even if it starts passing HAProxy health checks again. This prevents flapping back and forth once a failure happens, which would be worse than just losing a node.
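The "about 10 years" figure can be checked from the server line. This assumes checks keep running every inter 3s while the primary is marked down, because no separate downinter is set. Under that assumption, 99999999 consecutive successful checks take:

```latex
$$
99\,999\,999 \times 3\ \text{s} = 299\,999\,997\ \text{s} \approx 3472\ \text{days} \approx 9.5\ \text{years}
$$
```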

Now, this all sounds good, but there are still some catches:

  • At present, many misc slaves still run with read_only=1. Read traffic will fail over, but writes will be blocked until a human verifies that the old master is properly dead and runs SET GLOBAL read_only=0; on the slave. Applications on m2 such as gerrit, ieg, otrs, exim and scholarships will complain but remain semi-useful.
  • Persistent connections, such as those from eventlogging, bacula and phabricator, did not fail over cleanly during trials. Instead they hit a TCP timeout, causing about as much annoyance (and backfilling) as having no HAProxy at all. This needs more research.

When a dbproxy complains, it raises a non-paging alert, even though the reported state is CRITICAL:

<icinga-wm> PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1

You can check the status of a proxy by running the following as root on the proxy host itself:

$ echo "show stat" | socat unix-connect:/run/haproxy/haproxy.sock stdio
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
mariadb,FRONTEND,,,0,2,5000,361,60844,537174,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,1,0,1,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
mariadb,labsdb1009,0,0,0,2,,226,40594,501669,,0,,0,0,0,0,DOWN,1,1,0,3,1,138,138,,1,2,1,,226,,2,0,,1,L4TOUT,,3000,,,,,,,0,,,,2,0,,,,,138,,,0,0,0,7658,
mariadb,labsdb1010,0,0,0,1,,135,20250,35505,,0,,0,0,0,0,UP,1,0,1,0,0,1543670,0,,1,2,2,,135,,2,1,,1,L7OK,0,0,,,,,,,0,,,,0,0,,,,,1,5.5.5-10.1.19-MariaDB,,0,0,0,1,
mariadb,BACKEND,0,0,0,2,500,361,60844,537174,0,0,,0,0,0,0,UP,1,0,1,,0,1543670,0,,1,2,0,,361,,1,1,,1,,,,,,,,,,,,,,2,0,0,0,0,0,1,,,0,0,0,5882,

Here we see that labsdb1009 has gone down and labsdb1010 (the backup) is now serving the mariadb backend, which is therefore still UP.
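You can also read the same status programmatically. The sketch below queries the admin socket directly instead of going through socat, parses the CSV, and counts real servers that are up and down in the mariadb proxy. This mirrors the "servers up 1 down 1" summary in the alert, but it is not the actual check_failover plugin, whose logic is not described here. Only the socket path and the pxname, svname and status columns come from the output above.

```python
import csv
import io
import socket
from typing import Dict, List, Tuple

# Admin socket path as used in the commands on this page.
STATS_SOCKET = "/run/haproxy/haproxy.sock"


def fetch_stats(path: str = STATS_SOCKET) -> str:
    """Rough equivalent of: echo "show stat" | socat unix-connect:PATH stdio"""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(b"show stat\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the socket after a single command
                break
            chunks.append(data)
    return b"".join(chunks).decode()


def parse_stats(raw: str) -> List[Dict[str, str]]:
    """Turn 'show stat' CSV into one dict per row; the header starts with '# '."""
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith("# "):
        lines[0] = lines[0][2:]
    return list(csv.DictReader(io.StringIO("\n".join(lines))))


def server_summary(rows: List[Dict[str, str]],
                   proxy: str = "mariadb") -> Tuple[int, int]:
    """Count servers UP and DOWN in one proxy, skipping FRONTEND/BACKEND rows."""
    up = down = 0
    for row in rows:
        if row["pxname"] != proxy or row["svname"] in ("FRONTEND", "BACKEND"):
            continue
        status = row["status"]
        if status.startswith("UP"):      # also matches transitional "UP 1/3"
            up += 1
        elif status.startswith("DOWN"):  # also matches transitional "DOWN 1/2"
            down += 1
        # Other states (e.g. MAINT, "no check") are deliberately not counted.
    return up, down
```

For the first output above, server_summary(parse_stats(fetch_stats())) would return (1, 1). After the reload described below, it would return (2, 0).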

So for the present, if a dbproxyXXX complains:

  1. Check that the original master is really down. If it is not, restart haproxy on the affected dbproxy and figure out why the health checks failed.
  2. If the master is broken, make sure its mysqld is stopped before setting read_only=0 on the slave.
  3. If the slave is broken, most apps probably don't care, so do nothing.
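Those three steps amount to a small decision table. The function below encodes that table purely as an illustration. The function and its arguments are hypothetical, and the code automates nothing: the operator still checks the node by hand and runs each step.

```python
from typing import List


def next_steps(failed_node: str, really_down: bool) -> List[str]:
    """Return the manual runbook steps for a dbproxy failover alert.

    failed_node is "master" or "slave" (hypothetical encoding); really_down
    is the operator's verdict after checking the node by hand.
    """
    if failed_node == "slave":
        # Step 3: most apps probably don't care about the slave.
        return ["do nothing"]
    if failed_node != "master":
        raise ValueError(f"unknown node role: {failed_node!r}")
    if not really_down:
        # Step 1: the master is alive, so the failover was unwarranted.
        return ["restart haproxy on the affected dbproxy",
                "find out why the health checks failed"]
    # Step 2: stop the old master first, presumably so that the two nodes
    # never both accept writes.
    return ["stop mysqld on the old master",
            "run SET GLOBAL read_only=0; on the slave"]
```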

Once the original server has recovered and you run sudo systemctl reload haproxy, the status shows both servers UP again:

$ echo "show stat" | socat unix-connect:/run/haproxy/haproxy.sock stdio
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
mariadb,FRONTEND,,,0,1,5000,3,450,789,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,1,0,1,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
mariadb,labsdb1009,0,0,0,1,,3,450,789,,0,,0,0,0,0,UP,1,1,0,0,0,3,0,,1,2,1,,3,,2,1,,1,L7OK,0,0,,,,,,,0,,,,0,0,,,,,1,5.5.5-10.1.19-MariaDB,,0,0,0,1,
mariadb,labsdb1010,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,0,1,0,0,3,0,,1,2,2,,0,,2,0,,0,L7OK,0,0,,,,,,,0,,,,0,0,,,,,-1,5.5.5-10.1.19-MariaDB,,0,0,0,0,
mariadb,BACKEND,0,0,0,1,500,3,450,789,0,0,,0,0,0,0,UP,1,1,1,,0,3,0,,1,2,0,,3,,1,1,,1,,,,,,,,,,,,,,0,0,0,0,0,0,1,,,0,0,0,1,

On IRC:

<icinga-wm> RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover

(all servers are back)

Reloading configuration

Two annoying particularities of HAProxy:

  • At the time of writing, HAProxy does not automatically read all configuration files inside `/etc/haproxy/conf.d`. This is worked around by a systemd pre-execution script (generate_haproxy_default.sh) that reads all files present there and generates an explicit command line in /etc/default/haproxy. This can be misleading if you delete a configuration file from Puppet but not from disk, because the leftover file is still loaded.
  • HAProxy can reload its configuration (and reset its status) cleanly with reload, which is the recommended way. However, if you rename configuration files or change their number, you need to restart the service (which is fast, but drops connections). This is a limitation of HAProxy itself and not of our Puppetization.
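The conf.d workaround can be sketched as follows. This is a hypothetical re-implementation of the idea, not the real generate_haproxy_default.sh, which is not shown on this page. The EXTRAOPTS variable name follows the Debian /etc/default/haproxy convention and is an assumption. The sketch shows why a file left on disk keeps being loaded: every regular file in the directory is passed with -f.

```python
import shlex
from pathlib import Path

# Hypothetical sketch; the real script's output format may differ.
CONF_DIR = "/etc/haproxy/conf.d"


def build_extraopts(conf_dir: str = CONF_DIR) -> str:
    """Build a line passing every regular file in conf_dir to HAProxy via -f."""
    # Every file is included, even one that Puppet no longer manages.
    # Sorting keeps the order stable between runs.
    files = sorted(p for p in Path(conf_dir).iterdir() if p.is_file())
    args = []
    for path in files:
        args += ["-f", str(path)]
    # EXTRAOPTS follows the Debian /etc/default/haproxy convention (assumption).
    return f'EXTRAOPTS="{shlex.join(args)}"'
```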

Other interesting commands

$ echo "show info" | socat unix-connect:/run/haproxy/haproxy.sock stdio
Name: HAProxy
Version: 1.5.8
Release_date: 2014/10/31
Nbproc: 1
Process_num: 1
Pid: 25297
Uptime: 0d 0h59m29s
Uptime_sec: 3569
Memmax_MB: 0
Ulimit-n: 4033
Maxsock: 4033
Maxconn: 2000
Hard_maxconn: 2000
CurrConns: 0
CumConns: 330
CumReq: 330
MaxSslConns: 0
CurrSslConns: 0
CumSslConns: 0
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 0
ConnRateLimit: 0
MaxConnRate: 1
SessRate: 0
SessRateLimit: 0
MaxSessRate: 1
SslRate: 0
SslRateLimit: 0
MaxSslRate: 0
SslFrontendKeyRate: 0
SslFrontendMaxKeyRate: 0
SslFrontendSessionReuse_pct: 0
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 0
SslCacheLookups: 0
SslCacheMisses: 0
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
ZlibMemUsage: 0
MaxZlibMemUsage: 0
Tasks: 6
Run_queue: 1
Idle_pct: 100
node: dbproxy1010
description:
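The show info output is a flat list of "Key: Value" lines, so it converts easily into a dictionary. This is useful, for example, to compare CurrConns with Maxconn. The sketch below takes the raw text, such as output captured from the socat command above. The conn_usage helper is an illustrative addition.

```python
from typing import Dict


def parse_info(raw: str) -> Dict[str, str]:
    """Parse 'show info' output into a dict; splits on the first colon only."""
    info: Dict[str, str] = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def conn_usage(info: Dict[str, str]) -> float:
    """Fraction of the configured connection limit currently in use."""
    return int(info["CurrConns"]) / int(info["Maxconn"])
```

Splitting on the first colon only keeps values such as "Release_date: 2014/10/31" intact, and an empty field like "description:" maps to an empty string.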