LVS

LVS, or Linux Virtual Server, is used at the Wikimedia Foundation as a high-traffic Layer 4 load balancer.

Introduction

If you're familiar with proxy-based load balancing (such as HAProxy, Nginx, ATS, or Varnish), there are three key differences between those and the way LVS is used at Wikimedia (specifically the LVS/DR mode of IPVS):

It operates low-level on individual TCP/IP or UDP network packets (known as Layer 4 or Transport Layer). This means it does not buffer or wait for network packets to form a complete HTTP request, nor does it parse, validate, or otherwise interpret HTTP headers etc. (those would be Layer 7 or Application Layer).
It is implemented within the Linux kernel, without even invoking or communicating with any program or process ("userland").
It lets the chosen "real" backend server respond directly to the traffic source. This means that the backend will not respond to LVS but to where LVS got the packet from (e.g. the local subnet gateway/router). This further improves latency and multiplies our effective network capacity.

Overview

We use LVS-DR, or Direct Routing. This means that only forward (incoming) traffic is balanced by the load balancer, and return traffic does not even go through the load balancer. Essentially, the LVS balancer receives traffic for a given service IP and port, selects one out of multiple "real servers", and then forwards the packet to that real server with only a modified destination MAC address. The destination servers also listen to, and accept, traffic for the service IP, but don't advertise it over ARP. Return traffic is simply sent directly to the gateway or core router.

The LVS balancer and the real servers need to be in the same subnet for this to work.
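To make the "only a modified destination MAC address" point concrete, here is a minimal Python sketch of the forwarding decision. It is an illustration under assumptions, not how IPVS is implemented: the Frame type, the MAC addresses and the client address are made up, plain round-robin stands in for the real scheduler, and connection tracking is ignored.

```python
from dataclasses import dataclass, replace
from itertools import cycle


@dataclass(frozen=True)
class Frame:
    """Hypothetical, heavily simplified view of one incoming frame."""
    dst_mac: str
    src_ip: str    # the client
    dst_ip: str    # the service IP
    dst_port: int


def make_forwarder(real_server_macs):
    """Return a function that hands frames to real servers in round-robin order."""
    next_mac = cycle(real_server_macs)

    def forward(frame):
        # Only the destination MAC changes. Source and destination IP stay as
        # they were, so the real server sees the client address and can send
        # the reply straight to its gateway instead of back to the balancer.
        return replace(frame, dst_mac=next(next_mac))

    return forward


forward = make_forwarder(["aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"])
incoming = Frame(dst_mac="aa:aa:aa:aa:aa:ff", src_ip="203.0.113.7",
                 dst_ip="198.35.26.98", dst_port=443)
first = forward(incoming)
second = forward(incoming)
print(first.dst_mac, first.dst_ip)
print(second.dst_mac, second.dst_ip)
```

Because the destination IP is left untouched, the real servers must accept traffic for the service IP themselves, which is why it is bound on them without being advertised over ARP.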

The real servers are monitored by a Python program called PyBal. It performs periodic health checks to determine which servers can be used, and pools or depools them accordingly. You can follow what PyBal is doing in the log file /var/log/pybal.log.
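The pooling decision can be sketched roughly as follows. This is not PyBal's code. It assumes that the depool_threshold value seen in the service catalog is the minimum fraction of servers that must stay pooled, which is our reading of the "Could not depool server ... because of too many down!" log line shown later on this page. The function name and the tie-breaking order are made up.

```python
import math


def select_pooled(health, depool_threshold=0.5):
    """Return the set of servers to keep pooled.

    health maps a server name to True (checks pass) or False (checks fail).
    """
    servers = sorted(health)
    # Assumption: at least this many servers stay pooled, healthy or not.
    min_pooled = math.ceil(len(servers) * depool_threshold)
    pooled = [s for s in servers if health[s]]
    # Keep failing servers pooled rather than drop below the minimum.
    for server in servers:
        if len(pooled) >= min_pooled:
            break
        if server not in pooled:
            pooled.append(server)
    return set(pooled)


# Both backends fail their check, but one has to stay pooled.
print(select_pooled({"ms-fe1": False, "ms-fe2": False}))
```

The point of such a threshold is that a broken health check (for example a wrong URL) cannot depool an entire service.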

PyBal also has an integrated BGP module that Mark Bergsma wrote (Twisted BGP, available in the PyBal repository). This is used as a failover/high availability protocol between the LVS balancers (PyBal) and the routers. PyBal announces the LVS service IPs to the router(s) to indicate that it is alive and can serve traffic. This also removes the need to manually configure the service IPs on the active balancers. All LVS servers are now using this setup.

Upstream documentation:

IPIP encapsulation experiments

As part of the work to replace PyBal with LiBerica (T332027), we are switching from IPVS route mode to tunneling mode using IPIP encapsulation. This narrows the gap between the current setup and Katran, and so reduces the risk of migrating the forwarding plane of our load balancers from IPVS to Katran, because Katran requires traffic headed to real servers to be encapsulated using IPIP or GUE.

The first experiment will be performed using ncredir (T351069). As soon as IPIP encapsulation is enabled, the ipvsadm output will change slightly from:

vgutierrez@lvs4008:~$ sudo -i ipvsadm -Ln |grep -A2 $(dig +short ncredir-lb.ulsfo.wikimedia.org)
TCP  198.35.26.98:80 wrr
  -> 10.128.0.32:80               Route   1      11         136       
  -> 10.128.0.33:80               Route   1      9          132       
TCP  198.35.26.98:443 wrr
  -> 10.128.0.32:443              Route   1      17         90        
  -> 10.128.0.33:443              Route   1      20         89

to:

vgutierrez@lvs4008:~$ sudo -i ipvsadm -Ln |grep -A2 $(dig +short ncredir-lb.ulsfo.wikimedia.org)
TCP  198.35.26.98:80 wrr
  -> 10.128.0.32:80               Tunnel   1      11         136       
  -> 10.128.0.33:80               Tunnel   1      9          132       
TCP  198.35.26.98:443 wrr
  -> 10.128.0.32:443              Tunnel   1      17         90        
  -> 10.128.0.33:443              Tunnel   1      20         89
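If you want to check the forwarding method for many real servers at once, output in the shape shown above can be parsed with a few lines of Python. This is a sketch only: the parser assumes the column layout of the examples on this page, and the mixed Tunnel/Route sample is invented to show a partially converted service.

```python
def parse_ipvsadm(text):
    """Map each 'PROTO VIP:PORT' service to a list of (real_server, forward_method)."""
    services = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("TCP", "UDP"):
            # Service line, e.g. "TCP  198.35.26.98:80 wrr"
            current = f"{fields[0]} {fields[1]}"
            services[current] = []
        elif fields[0] == "->" and current is not None:
            # Real server line: address, forwarding method, weight, connections
            services[current].append((fields[1], fields[2]))
    return services


# Invented sample: one real server already tunneled, one still in route mode.
SAMPLE = """\
TCP  198.35.26.98:80 wrr
  -> 10.128.0.32:80               Tunnel   1      11         136
  -> 10.128.0.33:80               Route    1      9          132
"""
parsed = parse_ipvsadm(SAMPLE)
not_tunneled = [rs for rs, method in parsed["TCP 198.35.26.98:80"] if method != "Tunnel"]
print(not_tunneled)
```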

IPIP encapsulation requires TCP MSS clamping to avoid getting ingress traffic that cannot be encapsulated into a single packet. That would lead to fragmentation or packets being dropped. TCP MSS clamping is performed on the realservers (ncredir instances) using tcp-mss-clamper. Metrics for tcp-mss-clamper can be seen here.

Another side effect of performing IPIP encapsulation is that all the traffic forwarded to real servers would come from the same IP (the load balancer one) and same port (0). This makes balancing traffic to several NIC queues harder. To add some variability, we randomize the source IP per flow using ipip-multiqueue-optimizer on the load balancers. Metrics for ipip-multiqueue-optimizer can be seen here.
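One way to get a per-flow source address is to derive it from the inner flow, as in the sketch below. This is not how ipip-multiqueue-optimizer is implemented (its internals are not described here). The address pool, the flow tuple layout and the use of a hash instead of a random number are all assumptions made for illustration.

```python
import hashlib
import ipaddress


def outer_source_ip(flow, pool="10.0.0.0/24"):
    """Pick a stable, pseudo-random outer source address for one flow.

    flow is a hypothetical (src_ip, src_port, dst_ip, dst_port, proto) tuple.
    pool is a made-up range of addresses the balancer may use as source.
    """
    network = ipaddress.ip_network(pool)
    # Hash the flow so every packet of the same flow gets the same address,
    # while different flows spread over the pool (and thus over NIC queues).
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % network.num_addresses
    return str(network[index])


flow = ("203.0.113.7", 51234, "198.35.26.98", 443, "tcp")
print(outer_source_ip(flow))
```

Keeping the mapping stable per flow matters because packets of one TCP connection should land on the same NIC queue on the real server.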

If you're seeing issues with any service that has IPIP encapsulation enabled, the fastest way of disabling it is setting the ipip_encapsulation parameter to false in the service catalog (hieradata/common/service.yaml) for the impacted service. If the service has no explicit ipip_encapsulation parameter set to true, IPIP encapsulation should not be in effect for it.

HOWTO

Make sure you know whether you are using Etcd or not!

Etcd as a backend to Pybal (All of production)

In order to manage Pybal pools in Etcd use Conftool and confctl.

High level overview:

Define node and services in conftool-data/ in ops-puppet
puppet-merge and conftool-merge your change
Nodes will usually inherit a default pool/weight value based on their service default
To modify the state of the Node per service in Etcd use confctl
Pybal is consuming from etcd directly, using HTTP Long Polling to watch for changes in the service definition in Etcd.
Any change should be picked up by pybal within a very short timespan (usually, less than a second)
If you want to see what is currently defined on pybal, you can browse the pools under https://config-master.wikimedia.org/pybal/

Depooling servers provides examples of confctl usage. Load Balanced Services And Conftool has further details on what various states in conftool mean for LVS pools, and what helper scripts are available and their inner workings.

Planned reboot of LVS servers

Reboot the secondaries and verify they look sane afterwards: pybal is actually running, ipvsadm -L output looks right.
Disable puppet on the primary.
Stop pybal on the primary.
Stay logged into the matching secondary and confirm (dstat 10, ipvsadm -L) that traffic is coming in when pybal stops on the primary.
Reboot the primary with the sre.hosts.reboot-single cookbook.
Enable Puppet/Start pybal on the primary.
Wait for traffic to flip back post-reboot.

Planned reboots of Varnish frontends

A systemd service called traffic-pool is installed on all cpNNN machines to assist with planned reboots. This service will cause an etcd depool of all services hosted on the machine on shutdown/reboot with a 45 second pause between depool and the stop of nginx/varnish services. It will also repool when the host comes back up if /var/lib/traffic-pool/pool-once is present.

For a planned reboot, you need to execute the following commands:

touch /var/lib/traffic-pool/pool-once
reboot

Pool or depool hosts (for non-Etcd managed pools)

Edit the files in /srv/pybal-config/pybal/$colo on config-master.$colo and wait a minute - PyBal will fetch the file over HTTP. Please don't forget to commit your changes locally.

If you set a host to disabled, PyBal will continue to monitor it but just not pool it:

{ 'host': 'knsq1.esams.wikimedia.org', 'weight': 10, 'enabled': False }

If you comment out the line, PyBal will forget about it completely.
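The difference between a disabled host and a commented-out host can be illustrated with a small parser for lines in the format above. This is a sketch, not PyBal's own parser, and the hosts knsq2 and knsq3 are invented for the example.

```python
import ast


def parse_pool_file(text):
    """Return the server entries that are visible in a pool file."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # commented-out hosts are not seen at all
        # Each remaining line is a Python dict literal.
        servers.append(ast.literal_eval(line))
    return servers


POOL = """\
{ 'host': 'knsq1.esams.wikimedia.org', 'weight': 10, 'enabled': False }
#{ 'host': 'knsq2.esams.wikimedia.org', 'weight': 10, 'enabled': True }
{ 'host': 'knsq3.esams.wikimedia.org', 'weight': 10, 'enabled': True }
"""
servers = parse_pool_file(POOL)
monitored = [s["host"] for s in servers]                 # still health-checked
pooled = [s["host"] for s in servers if s["enabled"]]    # eligible for traffic
print(monitored)
print(pooled)
```

Here knsq1 is still monitored but receives no traffic, while knsq2 is unknown to the balancer.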

Emergency situations

In an emergency, for example if PyBal is not working for some reason, you can depool a real server manually using ipvsadm:

ipvsadm -d -t VIP:PORT -r REALSERVER

Such as:

ipvsadm -d -t 91.198.174.232:80 -r knsq1.esams.wikimedia.org

Note that PyBal won't know about this, so make sure you bring the situation back in sync.
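What "back in sync" means can be expressed as a comparison of two sets: the real servers present in IPVS and the ones PyBal believes are pooled. The sketch below only does the comparison. How you collect the two lists is up to you, and the function name and the second host name are made up.

```python
def ipvs_drift(ipvs_servers, pybal_pooled):
    """Compare real servers present in IPVS with those PyBal believes are pooled."""
    ipvs, pybal = set(ipvs_servers), set(pybal_pooled)
    return {
        # Removed by hand (or never added): PyBal thinks they serve traffic.
        "missing_from_ipvs": sorted(pybal - ipvs),
        # Present in the kernel table but not tracked by PyBal.
        "unknown_to_pybal": sorted(ipvs - pybal),
    }


# State after the manual "ipvsadm -d" example above (knsq2 is hypothetical).
drift = ipvs_drift(
    ipvs_servers=["knsq2.esams.wikimedia.org"],
    pybal_pooled=["knsq1.esams.wikimedia.org", "knsq2.esams.wikimedia.org"],
)
print(drift)
```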

Example request for checking the current status for restbase in eqiad via http:

curl http://config-master.eqiad.wmnet/conftool/eqiad/restbase

See which LVS balancer is active for a given service

If you have ssh access to the host in question, sshing to the IP address will land you in a shell on whichever system is active.

 $ ssh root@ms-fe.eqiad.wmnet
 root@lvs4:~#

If you don't want to connect (or can't connect) to the system, ask the directly attached routers. You can request the route for a given service IP. E.g. on Foundry:

csw1-esams#show ip route 91.198.174.234
Type Codes - B:BGP D:Connected I:ISIS S:Static R:RIP O:OSPF; Cost - Dist/Metric
Uptime - Days:Hours:Minutes:Seconds 
        Destination        Gateway         Port        Cost     Type Uptime
1       91.198.174.234/32  91.198.174.110  ve 1        20/1     B    10:14:28:44

So 91.198.174.110 (amslvs2) is active for Upload LVS service IP 91.198.174.234.

On Juniper:

csw2-esams> show route 91.198.174.232 

inet.0: 38 destinations, 41 routes (38 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

91.198.174.232/32  *[BGP/170] 19:38:18, localpref 100, from 91.198.174.247
                      AS path: 64600 I
                    > to 91.198.174.109 via vlan.100
                    [BGP/170] 1w3d 14:24:52, MED 10, localpref 100
                      AS path: 64600 I
                    > to 91.198.174.111 via vlan.100

So 91.198.174.109 (*) is active for Text LVS service IP 91.198.174.232.

To see all LVS servers configured for a service

To see which servers are configured for a service, but not which server is currently active, look in the puppet configs.

configuration is stored in hieradata/common/service.yaml
look for your service (eg 'swift' or 'upload')
look for the lvs -> class entry, which will be something like low-traffic
open modules/profile/manifests/lvs/configuration.pp and go to the section defining the $lvs_class_hosts variable.
look for your class (eg low-traffic)
you should see sections for production and labs, with variables for each data center listing the lvs servers responsible.

Deploy a change to an existing service

Preconditions:

you have already made the change in puppet and puppet-merged it on the puppet master (puppetmaster1001.eqiad.wmnet).
you have tested the change on the backend real servers directly (eg if you were changing a health check URL you have already queried the backend servers for that URL successfully).

Deploy steps:

find out which LVS servers host your service (see above). For this example, I'll use lvs3 and lvs4.
find out which LVS server is active (see above). For this example, I'll assume lvs4.
log into the inactive host twice.
in one session, tail the pybal log looking for one (or more) of your backend servers. eg journalctl -u pybal -f and look for errors

in the other session, run puppet and verify your change exists on the local filesystem
get a list of all IP addresses served by this LVS server - you're going to check that they all exist after your change
- run ip addr and save the output for later
restart pybal:
- systemctl stop pybal.service
- Verify that it stopped correctly with systemctl status pybal.service
- systemctl start pybal.service
check that all the expected IP addresses exist
- run 'ip addr' and compare against the list you collected before making your change
in the log you're tailing, you should see a few messages like:

 2012-03-13 19:21:26.015393 New enabled server ms-fe1.pmtpa.wmnet, weight 40
 2012-03-13 19:21:26.015611 New enabled server ms-fe2.pmtpa.wmnet, weight 40
 2012-03-13 19:21:26.015666 ['-a -t 10.2.1.27:80 -r ms-fe2.pmtpa.wmnet -w 40', '-a -t 10.2.1.27:80 -r ms-fe1.pmtpa.wmnet -w 40']

Look for errors in the next couple of minutes!
- example of a failed change (note that it still says enabled/up/pooled for a few lines - look for the Fetch line):

 2012-03-13 19:27:23.626787 [IdleConnection] ms-fe2.pmtpa.wmnet (enabled/up/pooled): Connection established.
 2012-03-13 19:27:23.632928 [IdleConnection] ms-fe1.pmtpa.wmnet (enabled/up/pooled): Connection established.
 2012-03-13 19:27:33.555879 [ProxyFetch] ms-fe2.pmtpa.wmnet (enabled/up/pooled): Fetch failed, 0.005 s
 2012-03-13 19:27:33.555917 Monitoring instance ProxyFetch reports servers ms-fe2.pmtpa.wmnet (enabled/up/pooled) down: 404 Not Found
 2012-03-13 19:27:33.556022 ['-d -t 10.2.1.27:80 -r ms-fe2.pmtpa.wmnet']
 2012-03-13 19:27:33.562458 [ProxyFetch] ms-fe1.pmtpa.wmnet (enabled/up/pooled): Fetch failed, 0.012 s
 2012-03-13 19:27:33.562533 Monitoring instance ProxyFetch reports servers ms-fe1.pmtpa.wmnet (enabled/up/pooled) down: 404 Not Found
 2012-03-13 19:27:33.562589 Could not depool server ms-fe1.pmtpa.wmnet because of too many down!
 2012-03-13 19:27:43.561745 [ProxyFetch] ms-fe2.pmtpa.wmnet (enabled/partially up/not pooled): Fetch failed, 0.002 s
 2012-03-13 19:27:43.565608 [ProxyFetch] ms-fe1.pmtpa.wmnet (enabled/partially up/pooled): Fetch failed, 0.003 s

if your change was successful, repeat the procedure on the active host.
- when you stop pybal on the active host, traffic will immediately fail over to the standby host.
- when you restart pybal on the formerly active host, traffic will immediately fail back (the LVS pairs are configured with a default and a standby so traffic always flows to the default if it's up).
You can check the status of your service pool at any time by fetching http://localhost:9090/pools/<pool-name>
Example output:

lvs1002 $ curl localhost:9090/pools 
streamlb_80 
dns_rec_53 
...

lvs1002 $ curl localhost:9090/pools/dns_rec_53 
chromium.wikimedia.org:	enabled/up/pooled 
hydrogen.wikimedia.org:	enabled/up/pooled

Add a new load balanced service

Before you begin: Making changes to LVS can be dangerous. All LVS services share the same configuration and infrastructure, and a misconfiguration could cause an outage of many or even all production load-balanced services. DO NOT do this without review and help from an SRE with LVS experience.

A service can be either high-traffic (public-facing) or low-traffic ("internal" services, which also means traffic from the CDN edge proxies to any backend, e.g. MediaWiki appservers). Yes, the naming is bad.

There are different phases to adding a load-balanced service, which can be logically summarized as follows:

Ensure the service is running on all the backend servers
Add relevant data in etcd
Add DNS records, allocate service IPs in all datacenters where the service is running
Create an entry in the service::catalog
Add this IP to the loopback interface on all the servers where the service is present
Configure the load balancers to provide balancing across those backends
Add the puppet-generated discovery DNS resources, start sending network probes/monitoring
Make the service page
Add discovery DNS records for the service

These logical steps translate to a series of steps to perform across the operations/dns and the operations/puppet repositories.

Let's go through all of them – minus setting up the backends running the service, which we'll assume you have already done. Please note the procedure below is optimized for reducing manual intervention and for avoiding the "oops new service, disregard" pages.

Please read before we get started...

Adding a new service takes time and involves coordination with the Traffic team. It's best to inform Traffic in advance if possible, by talking to us on IRC (#wikimedia-traffic) or sending an email to sre-traffic@.
For the steps that follow below, ensure you have independent patches for each and then add the relevant Traffic member that's working with you on this so that they can be reviewed in advance. Stacking the patches is helpful to make review easier as well as to help you wrap your head around the order of operations. The idea is to merge these patches quickly and move on to the next step to complete the process from start to end, ideally in one go.
The instructions below were updated in August 2024, but if something is not clear (expected), please ask the Traffic team.

A quick note about SSL/TLS

As of early 2020, when adding new services, you almost certainly want only one, TLS-enabled service.

In ancient times, almost no services were TLS-enabled. Then there was a transition period as part of the switch to ATS where more and more services exposed a TLS endpoint in addition to their standard cleartext HTTP one (which required configuring a new LVS service). If you're defining a new service, don't be led astray by this: just create a single, TLS-enabled service. Use Envoy for the TLS terminator (in Puppet, you can use profile::tlsproxy::envoy). To create new internal certificates see cfssl.

Add data in etcd

etcd data for backend selection

If your service runs solely on Kubernetes, skip this sub-section and go to #etcd data for DNS Discovery. (Later, when creating an entry in service::catalog you'll use kubesvc as the service name instead of a dedicated conftool service name.)

You need to add the relevant data in conftool-data for adding your new service to the servers where it's running.

So for instance, if you're adding service foo to servers srv* part of the bar cluster in eqiad, you'll need something like:

# File: node/eqiad.yaml
eqiad:
   bar:
     srv1: [foo]
     srv2: [foo]
...

Please note this will add the service with status pooled: inactive, weight: 0. You will need to set pooled status and weight with confctl.

Following up on our example:

$ sudo confctl select 'cluster=bar,service=foo' set/pooled=yes:weight=1
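If you script this step, it helps to build the selector and the action from data instead of hand-editing strings. The helper below is a hypothetical convenience wrapper, not part of conftool. It only builds the command line in the form shown above and does not run anything.

```python
import shlex


def confctl_set(selector, **values):
    """Build a `confctl select ... set/...` command line; does not run it."""
    # Selector tags are comma-separated, e.g. cluster=bar,service=foo
    selection = ",".join(f"{key}={value}" for key, value in selector.items())
    # Values to set are colon-separated, e.g. set/pooled=yes:weight=1
    action = "set/" + ":".join(f"{key}={value}" for key, value in values.items())
    return shlex.join(["sudo", "confctl", "select", selection, action])


command = confctl_set({"cluster": "bar", "service": "foo"}, pooled="yes", weight=1)
print(command)
```

shlex.join only adds quotes where the shell needs them, so the selector may be printed without the single quotes used in the example above. The resulting command is equivalent.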

etcd data for DNS Discovery

DNS changes (svc zone only)

allocate an IP address per colo to serve your content on Netbox, see DNS/Netbox#How to manually allocate a special purpose IP address in Netbox
- Until the related zone files are migrated to the automated DNS system, a manual patch in the DNS repository is still needed.
- Check https://phabricator.wikimedia.org/T270071 for more info.
- You might be tempted to copy/paste from other VIP service configs in the dns repo, that may contain dns-disc settings. Don't add any dns-disc settings (there is a step later).
internal addresses should have names *.svc.$colo.wmnet:
- codfw should be in the 10.2.1.0/24 range
- eqiad should be in the 10.2.2.0/24 range
external addresses:
- These need to be allocated from the (small!) public IP address pool, and may need specific configuration on the routers. Talk to the network admins first (i.e. netops/traffic: Arzhel/Brandon/etc).
Examples:
- eventgate-analytics.svc.eqiad.wmnet
- eventgate-analytics.svc.codfw.wmnet
Follow DNS#Changing records in a zonefile to create and deploy the zone
Run the sre.dns.netbox cookbook

Create an entry in the service::catalog

In hiera, under hieradata/common/service.yaml there is an entry called service::catalog. This data structure contains the full definition of your service, as far as puppet is concerned – the configuration of LVS and of monitoring is derived from the definition here. Here is a full entry you can model yours around:

service::catalog:
  echostore:
    description: Echo store, echostore.svc.%{::site}.wmnet
    encryption: true # If the service offers TLS encryption or not
    ip: # Hash of site: list of IPs with a label. If you have only one IP, use "default"
      codfw:
        default: 10.2.1.49
      eqiad:
        default: 10.2.2.49
    lvs: # Properties that are related to LVS setup.
      class: low-traffic
      conftool: # any service on k8s should re-use this same definition. Other services should use their own values.
        cluster: kubernetes
        service: kubesvc
      depool_threshold: '.5'
      enabled: true
      monitors:
        IdleConnection:
          max-delay: 300
          timeout-clean-reconnect: 3
        ProxyFetch:
          url:
          - https://localhost/healthz
      scheduler: wrr
      protocol: tcp
    page: false # probe failures page if true (or 'page' not present)
    probes:
      - type: http  # there's no HTTPS, TLS usage is governed by 'encryption: true'
        path: /healthz
    port: 8082  # for k8s services, choose based on https://wikitech.wikimedia.org/wiki/Service_ports
    sites:
    - eqiad
    - codfw
    state: service_setup # this is the most important entry in your service definition! See below for details
    discovery: # Discovery DNS configuration. You can have multiple entries for a single service.
    - dnsdisc: echostore
      active_active: true

The state is a pretty important entry. Have a look at the diagram to the right about the supported state transitions. Don't diverge from those.

The supported state transitions for state parameter

Here we've defined state to be "service_setup"; this means that this service will not be included in monitoring, LVS configuration, or DNS Discovery at the moment. Until you perform the next step in the procedure, adding this stanza will be a no-op.
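As a rough aid, the state changes that the procedures on this page actually perform can be encoded and checked before writing a patch. This sketch covers only those transitions (service_setup to lvs_setup to production, and the same path in reverse for removal). The diagram is authoritative and may allow others, and the function name is made up.

```python
# Transitions performed by the add and remove procedures on this page.
# Assumption: the authoritative diagram may contain more states or edges.
ALLOWED_TRANSITIONS = {
    ("service_setup", "lvs_setup"),
    ("lvs_setup", "production"),
    ("production", "lvs_setup"),
    ("lvs_setup", "service_setup"),
}


def check_transition(current, new):
    """Return the new state, or raise if the step skips a phase."""
    if (current, new) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unsupported state transition: {current} -> {new}")
    return new


state = "service_setup"
state = check_transition(state, "lvs_setup")
state = check_transition(state, "production")
print(state)
```

Going straight from service_setup to production raises an error here, which mirrors the advice not to diverge from the supported transitions.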

For k8s services:

If you are using Ingress, follow the specific steps in the ingress documentation
Choose a port number based on the table at Service ports.
Please remember to merge this config change alongside with the next one (hieradata/role/common/kubernetes/worker.yaml).
If you require httpbb monitoring, you can add an httpbb_dir: $dir stanza where $dir is the /srv/deployment/httpbb-tests/ subdirectory where your tests are located.

Add the IPs on the backend servers

If you aren't using Kubernetes, and the Puppet role on your backend servers doesn't already include profile::lvs::realserver, add it.
Once that profile is included, you'll also need to add a hiera configuration as follows:

profile::lvs::realserver::pools:
  echostore: {}

Use the same label you used in service::catalog. Once puppet runs on the backends, the LVS IP will be configured on their loopback device, allowing them to respond to traffic directed to the LVS service.

If the service is using conftool then you'll need to add the relevant services to the pools configuration, for example:

profile::lvs::realserver::pools:
  kibana-next:
    services:
      - kibana
      - apache2

The services above are the systemd services that are needed to serve the LVS pool. (When in doubt, see other examples of profile::lvs::realserver::pools).

If you are using Kubernetes, you will need to add your service to the Kubernetes specific pools by editing hieradata/role/common/kubernetes/worker.yaml and adding an empty stanza for your service as shown above.

Configure the load balancers

This phase has risk of collateral damage to other services. Check in with #wikimedia-traffic and be careful!

Start by disabling Puppet on the eqiad and codfw lvs* servers:
1. sudo cumin 'A:lvs and (A:eqiad or A:codfw)' 'disable-puppet "adding new service foo"'

To add the configuration to PyBal and add the LVS endpoint on the load-balancers, you just need to change the state of your service to lvs_setup:

[...]
    sites:
    - eqiad
    - codfw
    state: lvs_setup <-----
    discovery:
    - dnsdisc: echostore
      active_active: true

Once puppet has run on the LVS servers, you will have to restart PyBal for your changes to take effect. Restarting PyBal requires some care - follow this procedure. We will change one DC at a time, so you will need to repeat the steps below for each.

Check in with #wikimedia-traffic that your change looks good and that now is a good time for a PyBal restart.
Enable and run Puppet on the first data center you wish to effect the change on (such as eqiad here):
- sudo cumin 'A:lvs and A:eqiad' 'run-puppet-agent --enable "adding new service foo"'
Acknowledge the upcoming "PyBal IPVS diff check" and "PyBal connections to etcd" Icinga alerts regarding your change. The LVS hosts for each class are defined in modules/profile/manifests/lvs/configuration.pp.
For the secondary LVS, do:
1. cumin query: sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service' to restart Pybal on the backup LVS servers in eqiad (low-traffic). Log this in the ops channel.
Check that sudo ipvsadm -L -n output on the backup LVS server contains your newly added service (and sane list of backends)
Wait 120 secs (while looking at https://icinga.wikimedia.org/alerts)
Restart PyBal on the primary LVS, do:
1. cumin query: sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service' to restart Pybal on the primary LVS servers in eqiad (low-traffic). Log this in the ops channel.
2. Ensure you're using the alias for the correct class (high-traffic1, high-traffic2, low-traffic) for your service.
Run a test (like curl -v -k http://eventgate-analytics.svc.eqiad.wmnet:31192/_info)

Repeat the above for the other relevant data center as well, such as codfw: enable Puppet in A:lvs and A:codfw and repeat the above steps.

Note: Make sure you've pooled your service's backend hosts, otherwise the IPVS diff check errors will not resolve. A pooled host is also required for /srv/config-master/pybal/${DC}/${SERVICE_NAME} to render on the puppetmasters.
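The ordering of the restart steps for one data center can be written out as a small plan generator, which is handy for pasting into a change log or review. It only builds strings and runs nothing. The secondary alias follows the examples on this page, while the primary alias pattern A:lvs-CLASS-DC for classes other than low-traffic is an assumption you must verify first.

```python
def pybal_restart_plan(dc, lvs_class="low-traffic", wait_seconds=120):
    """Return the ordered steps for restarting PyBal in one data center."""
    restart = "systemctl restart pybal.service"
    return [
        # 1. Backup balancers first, so a bad config never hits live traffic.
        f"sudo cumin 'A:lvs-secondary-{dc}' '{restart}'",
        # 2. Check the new service and its backends on the backup LVS.
        "sudo ipvsadm -L -n",
        # 3. Give alerts time to show up before touching the primary.
        f"sleep {wait_seconds}",
        # 4. Primary balancers. The alias pattern is an assumption: verify it.
        f"sudo cumin 'A:lvs-{lvs_class}-{dc}' '{restart}'",
    ]


plan = pybal_restart_plan("eqiad")
for step in plan:
    print(step)
```

Run the same plan for the next data center only after the first one looks healthy.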

Add discovery/DNS resources

Switching from lvs_setup → production does not require anything more than a Puppet run. Change the state of your service to production, merge, and run the Puppet agent on the A:dnsbox hosts:

$ sudo cumin 'A:dnsbox' run-puppet-agent

This will create the DNS resources (e.g., gdnsd state files) that will later be referenced by your DNS discovery records.

Make the service page (Optional)

This step can page the whole SRE team. Make sure monitoring is happy with your service first!

The step above will start sending network probes for your service (probes section). When the probes fail, the ProbeDown alert fires, so make sure that alert is not firing for your service! Once you are happy the probes are working as expected, set the following in service::catalog:

page: true # default if not specified

Add the DNS Discovery Record

What you need to do depends on the nature of your service. Don't forget that irrespective of the type of service below (active/active or active/passive), you will need to run authdns-update at the very end to pull in the changes to the zone files (see the last section).

For active/active services

If you declared active_active = true

Add discovery entries to templates/wmnet in operations/dns. Your entry should be of the geoip type. Also add an entry to utils/mock_etc/discovery-geo-resources. See for instance this change
Pool both datacenters in confctl

$ confctl --object-type discovery select 'dnsdisc=echostore' set/pooled=true

For active/passive services

If you have an active/passive service instead (i.e. you declared active_active = false):

Add discovery entries to templates/wmnet in operations/dns. Your entry should be of the metafo type. Also add an entry to utils/mock_etc/discovery-metafo-resources. See for instance this change
Pool one datacenter in confctl

$ confctl --object-type discovery select 'name=eqiad,dnsdisc=echostore' set/pooled=true

Make sure the other datacenters are not pooled:

$ confctl --object-type discovery select 'dnsdisc=echostore' get
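The two cases differ only in the record type, the mock file to touch and which data centers end up pooled. The sketch below summarises that choice from the rules above. The function and its return format are made up for illustration and do not correspond to any existing tool.

```python
def discovery_plan(dnsdisc, active_active, sites, active_site=None):
    """Summarise the DNS discovery choices for one service."""
    if active_active:
        record_type, mock_file, pooled = "geoip", "discovery-geo-resources", list(sites)
    else:
        if active_site not in sites:
            raise ValueError("active/passive services need one pooled site")
        record_type, mock_file, pooled = "metafo", "discovery-metafo-resources", [active_site]
    return {
        "dnsdisc": dnsdisc,
        "record_type": record_type,
        "mock_file": f"utils/mock_etc/{mock_file}",
        "pooled_sites": pooled,
        # These must show as not pooled in the confctl "get" output.
        "depooled_sites": [site for site in sites if site not in pooled],
    }


plan = discovery_plan("echostore", active_active=False,
                      sites=["eqiad", "codfw"], active_site="eqiad")
print(plan)
```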

For both active/active and active/passive

Prepare a patch in the operations/dns repo as described in DNS/Discovery to add records for your service, making sure to use the appropriate DYNA record plugin type for your service (i.e., geoip vs. metafo).

Merge the DNS change. Choose one authoritative DNS (eg: dns1004.wikimedia.org), and run sudo -i authdns-update. That script will deploy your change to all our DNS servers.

You're done! Your service is now correctly configured to work in production.

[End of adding a new service]

Remove a load balanced service

The procedure for removal of a service should more or less follow the inverse order of what gets done adding it. It is important to perform the following actions in order. Specifically:

Silence network probes for your service
Remove the discovery DNS record
Remove network probes
Remove the service from the load-balancers and the backend servers
Remove conftool data for both the lvs pool and dns discovery
Remove entry from service::catalog

Silence network probes

Since you'll be working on turning down the service, make sure to issue a silence for ProbeDown alert for your service (instance="servicename:port"). This makes sure no spurious alerts will be issued while you work on the service.

Remove the discovery DNS record

Simply remove the entry from the operations/dns repository:

Remove the entry from templates/wmnet
Remove it from the relevant files under utils/mock_etc/
Merge the DNS change. Choose a DNS server (eg: dns1004.wikimedia.org), and run sudo -i authdns-update

Remove network probes / monitoring

To remove network probes and the discovery templates from the dns servers, change state: production to state: lvs_setup in hieradata/common/service.yaml, then run puppet on the auth dns servers. The DNS record must have been removed in the previous step, otherwise it will trigger an alert.

$ sudo cumin 'A:dnsbox' run-puppet-agent

Remove the service from the load-balancers and the backend servers

Change state: lvs_setup to state: service_setup, and remove the service stanza from profile::lvs::realserver::pools. Then:

Run puppet on all LVS servers
- cumin 'O:lvs::balancer' 'run-puppet-agent'
You will see some CRITICAL: Services in IPVS but unknown to PyBal: set(['addr:port']) alerts (see the PyBal section for details); we will deal with them after the PyBal restarts
Ask on #wikimedia-traffic which are the backup LVS servers for the LVS class of your service in both datacenters, and restart pybal on those
1. (Recommended): You can use the cumin query: sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service' to restart Pybal on the backup LVS servers in eqiad (low-traffic). Wait for some time (if the above looks good) and repeat the same for the primary server, A:lvs-low-traffic-eqiad
2. If the cookbook is not used and you restart Pybal manually, please !log the restart on the #wikimedia-operations channel.
3. If the cookbook is used, note that it will poll for Icinga checks to clear after the restart, but this cannot happen until you ACK the now-firing PyBal IPVS diff check (which we remedy below using ipvsadm).
4. low-traffic above is just an example; to find the actual LVS class, look for it under the service entry you are trying to remove in hieradata/common/service.yaml.
Wait 300 sec (no need to wait if the cookbook was used and it completed successfully)
Restart pybal on the active LVS server, ask #wikimedia-traffic
1. Equivalent cumin command: sudo cumin 'A:lvs-secondary-codfw' 'systemctl restart pybal.service' to restart Pybal on the backup LVS servers in codfw (low-traffic).
2. Wait for some time (if the above looks good) and restart on primary in codfw.
3. If the cookbook is not used and you restart Pybal manually, please !log the restart on the #wikimedia-operations channel.
Run ipvsadm --delete-service --tcp-service addr:port on the LVS servers, in the same order in which you ran the cookbook above (this helps ensure that the agent run and the Pybal restart have completed). addr must match the service IP of the datacenter the LVS server is in. If you get it wrong (e.g. you type the codfw IP while working on eqiad) and the entry you're trying to delete doesn't exist, the error message to expect is: "Memory allocation problem"
Run puppet on the service backends.

Final removal

You can now remove the service stanza from service::catalog, and all the references to the service inside conftool-data safely. When you merge the puppet change, your service will be successfully removed from production.

Cleanup

You may remove code related to the discontinued service (e.g. Puppet modules/roles/profiles, Kubernetes deployments). It's probably a good idea to grep for the service name that was removed so that all instances of it are removed (such as from the CI).

LVS installation

LVS now uses Puppet and automatic BGP failover. Puppet arranges the service IP configuration, and installation of packages. To configure the service IPs that an LVS balancer should serve (both primary and backup!), set the $lvs_balancer_ips variable:

node /amslvs[1-4]\.esams\.wikimedia\.org/ {
        $cluster = "misc_esams"

        $lvs_balancer_ips = [ "91.198.174.2", "91.198.174.232", "91.198.174.233", "91.198.174.234" ]

        include base,
                lvs::balancer
}

In this setup, all 4 hosts amslvs1-amslvs4 are configured to accept all service IPs, although in practice every service IP is only ever serviced by one out of two hosts due to the router configuration.

Puppet uses the (now misleadingly named) wikimedia-lvs-realserver package to bind these IPs to the loopback (!) interface. This is to make sure that a server answers on these IPs, but does not announce them via ARP - we'll use BGP for that.

LVS service configuration

In file lvs.pp the services themselves are configured, from which the PyBal configuration file /etc/pybal/pybal.conf is generated by Puppet.

Most configuration is in a large associative hash, $lvs_services. Each key in this hash is the name of one LVS service, and points to a hash of PyBal configuration variables:

description: Textual description of the LVS service.
class: The class the LVS service belongs to, i.e. on which LVS balancers it is active (see below).
ip: A hash of service IP addresses for the service. All IP addresses are aliases, and are translated to separate LVS services in pybal.conf, but with identical configuration.

The other configuration variables are described in the PyBal article.

Global PyBal configuration options can be specified in the $pybal hash.
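The way one $lvs_services entry with several IP aliases turns into several identically configured PyBal services can be sketched as follows. This is an illustration rather than the actual Puppet template: the service name, alias names, addresses, and the name_alias section naming are all assumptions.

```python
def expand_services(lvs_services):
    """Yield one (section_name, config) pair per service IP alias."""
    for name, service in lvs_services.items():
        # Everything except the IP hash is shared by all aliases.
        shared = {key: value for key, value in service.items() if key != "ip"}
        for alias, address in sorted(service["ip"].items()):
            yield f"{name}_{alias}", {**shared, "ip": address}


# Hypothetical example entry using documentation-range addresses.
lvs_services = {
    "text": {
        "description": "Example text service",
        "class": "high-traffic1",
        "ip": {"textlb": "192.0.2.10", "textlb6": "2001:db8::10"},
    },
}

for section, config in expand_services(lvs_services):
    print(section, config["ip"], config["class"])
```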

Classes

The $lvs_class_hosts variable determines which LVS services are active on which hosts: for each class, it lists the hosts that should carry the services in that class. The pybal.conf template uses it to generate the LVS services. The following classes are used to distribute traffic over the LVS balancer hosts:

high-traffic1 (text, bits)
high-traffic2 (text, upload)
https (HTTPS services corresponding to the 'high-traffic' HTTP services; should be active on all hosts that carry either class)
specials (special LVS services, especially those that do not have BGP enabled)
low-traffic (internal load balancing, e.g. from the Squids to the Apaches)
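The class lookup described above can be sketched as a two-step filter: first find the classes a host belongs to, then keep the services in those classes. The host names and service names below are hypothetical, and this sketch is not the actual template logic.

```python
def services_for_host(host, lvs_class_hosts, lvs_services):
    """Return the sorted names of services that should be active on host."""
    classes = {cls for cls, hosts in lvs_class_hosts.items() if host in hosts}
    return sorted(
        name for name, service in lvs_services.items()
        if service["class"] in classes
    )


# Hypothetical class-to-host and service-to-class assignments.
lvs_class_hosts = {
    "high-traffic1": ["lvs1", "lvs4"],
    "high-traffic2": ["lvs2", "lvs5"],
    "low-traffic": ["lvs3", "lvs6"],
}
lvs_services = {
    "text": {"class": "high-traffic1"},
    "upload": {"class": "high-traffic2"},
    "apaches": {"class": "low-traffic"},
}

print(services_for_host("lvs1", lvs_class_hosts, lvs_services))
```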

BGP failover and load sharing

Previously, the LVS balancer that had a certain service IP bound to its eth0 interface was active for that IP. To do failovers, the IP had to be moved manually.

In the new setup, multiple servers announce the service IP(s) via BGP to the router(s), which then pick which server(s) to use based on BGP routing policy.

PyBal BGP configuration

In the global section, the following BGP related settings typically exist:

bgp = yes

Enables BGP globally; this can be overridden per service.

bgp-local-asn =  64600

The ASN to use when communicating with the routers. All announced prefixes carry this ASN as their AS path.

bgp-peer-address = 91.198.174.247

The IP of the router this PyBal instance speaks BGP to.

#bgp-as-path = 64600 64601

An optional modified AS path. Can be used e.g. to make the AS path longer and thus less attractive (on a backup balancer).

Example BGP configuration for Juniper (cr1-esams)

mark@cr1-esams> show configuration protocols bgp group PyBal 
type external;
multihop {
    ttl 2;
}
local-address 91.198.174.245;
hold-time 30;
import LVS_import;
family inet {
    unicast {
        prefix-limit {
            maximum 50;
            teardown;
        }
    }
}
family inet6 {
    unicast {
        prefix-limit {
            maximum 50;
            teardown;
        }
    }
}
export NONE;
peer-as 64600;
neighbor 91.198.174.109;
neighbor 91.198.174.110;

mark@cr1-esams> show configuration policy-options prefix-list LVS-service-ips            
10.2.0.0/16;
91.198.174.224/28;

mark@cr1-esams> show configuration policy-options prefix-list LVS-service-ips6   
2620:0:862:ed1a::/64;

mark@cr1-esams> show configuration routing-options aggregate                               
route 91.198.174.224/28;
route 10.2.3.0/24;

mark@cr1-esams> show configuration routing-options rib inet6.0 aggregate 
route 2620:0:862:ed1a::/64;

mark@cr1-esams> show configuration policy-options policy-statement ospf_export 
term 1 {
    from protocol direct;
    then accept;
}
term statics {
    from protocol [ static aggregate ];
    then accept;
}
then reject;

The LVS_import policy adds metric 10 to the "routes" (service IPs) received from the secondary (backup) LVS balancers. This means that the router will regard them as less preferred.
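The effect of the added metric, and of a lengthened AS path, on the router's choice can be sketched with a deliberately simplified model. Real BGP best-path selection has more steps than shown here. The sketch assumes that only AS path length and then metric matter, that secondaries are identified by next hop (the policy body is not shown above), and that 91.198.174.110 is the secondary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    as_path: tuple
    metric: int = 0


def apply_lvs_import(route, secondaries):
    """Mimic LVS_import: add metric 10 to routes from secondary balancers."""
    if route.next_hop in secondaries:
        return Route(route.prefix, route.next_hop, route.as_path, route.metric + 10)
    return route


def best_route(routes):
    """Prefer the shortest AS path, then the lowest metric (simplified)."""
    return min(routes, key=lambda route: (len(route.as_path), route.metric))


secondaries = {"91.198.174.110"}  # assumption: which neighbor is the backup
received = [
    Route("10.2.3.1/32", "91.198.174.110", (64600,)),
    Route("10.2.3.1/32", "91.198.174.109", (64600,)),
]
imported = [apply_lvs_import(route, secondaries) for route in received]
print(best_route(imported).next_hop)
```

If the primary stops announcing the prefix, only the secondary's route remains and the router falls back to it without manual intervention.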

The individual /32 and /128 service IPs are announced by PyBal and exchanged between the routers using IBGP. Aggregates for the service IP ranges are generated by the core routers and redistributed into OSPF as well.

SSH checking

Because the Apache cluster often suffers from broken disks, which break SSH but keep Apache up, I have implemented a RunCommand monitor in PyBal that periodically runs an arbitrary command and checks the server's health by the return code. If the command does not return within a certain timeout, the server is marked down as well.

The RunCommand configuration is in /etc/pybal/pybal.conf:

runcommand.command = /bin/sh
runcommand.arguments = [ '/etc/pybal/runcommand/check-apache', server.host ]
runcommand.interval = 60
runcommand.timeout = 10

runcommand.command: The path to the command which is being run. Since we are using a shell script and PyBal does not invoke a shell by itself, we have to do that explicitly.
runcommand.arguments: A (Python) list of command arguments. This list can refer to the monitor's server object, as shown here.
runcommand.interval: How often to run the check (seconds).
runcommand.timeout: The command timeout; after this many seconds the entire process group of the command will be KILLed, and the server is marked down.
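The semantics of these four settings can be illustrated with a small standalone sketch. It is not PyBal's implementation (PyBal is built on Twisted); it only shows the idea of running a command in its own process group, treating a zero exit code as healthy, and killing the whole group on timeout. The run_check name and the demo commands are made up for illustration.

```python
import os
import signal
import subprocess
import sys


def run_check(command, arguments, timeout):
    """Return True if the command exits 0 within timeout seconds."""
    # start_new_session puts the child in its own process group, so that
    # anything it spawns (e.g. ssh) can be killed together with it.
    proc = subprocess.Popen([command, *arguments], start_new_session=True)
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed process
        return False


# Demo: one check that succeeds quickly, one that exceeds its timeout.
print(run_check(sys.executable, ["-c", "raise SystemExit(0)"], timeout=10))
print(run_check(sys.executable, ["-c", "import time; time.sleep(5)"], timeout=0.5))
```

A monitor would call such a function once per interval and mark the server down whenever it returns False.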

Currently we're using the following RunCommand script, in /etc/pybal/runcommand/check-apache:

#!/bin/sh

set -e

HOST=$1
SSH_USER=pybal-check
SSH_OPTIONS="-o PasswordAuthentication=no -o StrictHostKeyChecking=no -o ConnectTimeout=8"

# Open an SSH connection to the real-server. The command is overridden by the authorized_keys file.
ssh -i /root/.ssh/pybal-check $SSH_OPTIONS $SSH_USER@$HOST true

exit 0

The limited ssh accounts on the application servers are managed by the wikimedia-task-appserver package.

Diagnosing problems

If the alert is about jobrunners, it might be that jobrunners became overwhelmed by video encoding jobs. See this page for more details.

A common alert one may encounter that is related to LVS is

LVS ncredir esams port 80/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.esams.wikimedia.org_ipv6 is CRITICAL:

Ncredir are the servers that redirect clients from non-canonical domains (e.g. wikipedia.gr) to canonical ones like en.wikipedia.org. When this alert fires, either ncredir itself is saturated, or LVS is overwhelmed.

Run ipvsadm -l on the director. Healthy output looks like this:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  upload.pmtpa.wikimedia.org:h wlc
  -> sq10.pmtpa.wmnet:http        Route   10     5202       5295
  -> sq1.pmtpa.wmnet:http         Route   10     8183       12213
  -> sq4.pmtpa.wmnet:http         Route   10     7824       13360
  -> sq5.pmtpa.wmnet:http         Route   10     7843       12936
  -> sq6.pmtpa.wmnet:http         Route   10     7930       12769
  -> sq8.pmtpa.wmnet:http         Route   10     7955       11010
  -> sq2.pmtpa.wmnet:http         Route   10     7987       13190
  -> sq7.pmtpa.wmnet:http         Route   10     8003       7953

All the servers are getting a decent amount of traffic; there's just normal variation.

If a realserver is refusing connections or doesn't have the VIP configured, it will look like this:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  upload.pmtpa.wikimedia.org:h wlc
  -> sq10.pmtpa.wmnet:http        Route   10     2          151577
  -> sq1.pmtpa.wmnet:http         Route   10     2497       1014
  -> sq4.pmtpa.wmnet:http         Route   10     2459       1047
  -> sq5.pmtpa.wmnet:http         Route   10     2389       1048
  -> sq6.pmtpa.wmnet:http         Route   10     2429       1123
  -> sq8.pmtpa.wmnet:http         Route   10     2416       1024
  -> sq2.pmtpa.wmnet:http         Route   10     2389       970
  -> sq7.pmtpa.wmnet:http         Route   10     2457       1008

Active connections for the problem server are depressed, while its inactive connections are normal or above normal. This problem must be fixed immediately, because in wlc (weighted least-connection) mode LVS load balances based on the ActiveConn column, meaning that servers that are down, and therefore show few active connections, get all the traffic.
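Spotting the depressed ActiveConn value can be automated with a short parser. This is a sketch under assumptions: it expects the column layout shown above, and the threshold of 10% of the pool median is an arbitrary choice for illustration, not an established rule.

```python
from statistics import median


def parse_realservers(output):
    """Return {realserver: (active, inactive)} parsed from `ipvsadm -l` text."""
    servers = {}
    for line in output.splitlines():
        fields = line.split()
        # Realserver rows: -> host:port Forward Weight ActiveConn InActConn
        if len(fields) == 6 and fields[0] == "->" and fields[4].isdigit():
            servers[fields[1]] = (int(fields[4]), int(fields[5]))
    return servers


def suspicious(servers, ratio=0.1):
    """List servers whose ActiveConn is far below the pool median."""
    mid = median(active for active, _ in servers.values())
    return [name for name, (active, _) in servers.items() if active < ratio * mid]


SAMPLE = """\
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  upload.pmtpa.wikimedia.org:h wlc
  -> sq10.pmtpa.wmnet:http        Route   10     2          151577
  -> sq1.pmtpa.wmnet:http         Route   10     2497       1014
  -> sq4.pmtpa.wmnet:http         Route   10     2459       1047
  -> sq5.pmtpa.wmnet:http         Route   10     2389       1048
"""

print(suspicious(parse_realservers(SAMPLE)))
```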

Incorrectly bound interfaces

Don't ever bind IP addresses directly to lo in /etc/network/interfaces. If you do, stuff breaks. (This applies not just to LVS servers but any real server as well. Anything with the wikimedia-lvs-realserver package will break if you bind addresses manually.)

When it's broken, it looks like this. Notice that all the balanced IP addresses are tagged lo:LVS except 10.2.1.13. That entry is broken and causes the ifup script that reloads the IPs to fail.

    inet 127.0.0.1/8 scope host lo
    inet 10.2.1.13/32 scope global lo
    inet 10.2.1.1/32 scope global lo:LVS
    inet 10.2.1.11/32 scope global lo:LVS
    inet 10.2.1.12/32 scope global lo:LVS
    inet6 ::1/128 scope host

The solution here is to remove the broken IP from the "lo" interface (ip addr del 10.2.1.13/32 dev lo), then run dpkg-reconfigure wikimedia-lvs-realserver. This triggers the scripts that will re-add all the IP addresses.
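Finding the offending address can be scripted by looking for global-scope addresses on lo that lack the lo:LVS label. The sketch below assumes the `ip addr` line format shown above and only handles IPv4 (inet) lines; the function name is made up for illustration.

```python
def unlabeled_loopback_addresses(ip_addr_output):
    """Return global-scope IPv4 addresses bound to lo without the lo:LVS label."""
    bad = []
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Expected shape: inet 10.2.1.13/32 scope global lo
        if (
            len(fields) >= 5
            and fields[0] == "inet"
            and fields[2:4] == ["scope", "global"]
            and fields[-1] == "lo"
        ):
            bad.append(fields[1])
    return bad


BROKEN = """\
    inet 127.0.0.1/8 scope host lo
    inet 10.2.1.13/32 scope global lo
    inet 10.2.1.1/32 scope global lo:LVS
    inet 10.2.1.11/32 scope global lo:LVS
    inet6 ::1/128 scope host
"""

for address in unlabeled_loopback_addresses(BROKEN):
    print(f"ip addr del {address} dev lo")
```

The sketch only prints the removal commands, so they can be reviewed before anything is changed on the host.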

Happy ip addr output looks like this:

root@lvs4:/etc/network# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet 10.2.1.1/32 scope global lo:LVS
    inet 10.2.1.11/32 scope global lo:LVS
    inet 10.2.1.12/32 scope global lo:LVS
    inet 10.2.1.13/32 scope global lo:LVS
    inet 10.2.1.21/32 scope global lo:LVS
    inet 10.2.1.22/32 scope global lo:LVS
    inet 10.2.1.27/32 scope global lo:LVS
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    etc...

LVSRealserverMSS alert

This alert is triggered if the MSS configured on tcp-mss-clamper and the one observed during the TCP three-way handshake don't match. The script used to check MSS values is prometheus-lvs-realserver-mss.py. MSS is configured via two hiera keys:

profile::lvs::realserver::ipip::ipv4_mss
profile::lvs::realserver::ipip::ipv6_mss

MSS is effectively clamped on the SYN/ACK packet that is sent after an initial SYN packet. If for some reason the kernel is unable to answer the initial SYN packet, or it answers with an RST packet, this alert will fire as a false positive.

tcp-mss-clamper exposes basic metrics of clamped packets on prometheus. A dashboard is available here.