
Varnish


Varnish is a caching HTTP proxy used as the frontend (in-memory) component of Wikimedia's CDN. On-disk, persistent caching is done by cache backends running Apache Traffic Server.

TTL

Varnish object lifetime

In general, the time-to-live (TTL) of an object is set to the max-age value specified by application servers in the Cache-Control header. A ceiling for max-age is provided by the so-called ttl_cap: max-age values greater than 1 day are clamped to 24 hours. If the Cache-Control header is absent, the Varnish TTL is set to the default_ttl configured in hiera, which is 24 hours as of March 2019. In some cases we might override all of this by setting beresp.ttl in VCL.

How long is an object kept in cache? It depends on the ttl and on two other settings called keep and grace. Like the ttl, keep and grace can be set in VCL, or default to varnishd's default_keep and default_grace.

An object is kept in cache for ttl + keep + grace seconds. Whether it is returned to clients requesting it depends on when the request comes in. Let t_origin be the moment an object enters the cache: the object is considered fresh, and hence unconditionally served to clients, until t_origin + ttl. If the object is past its ttl but still within t_origin + ttl + grace, and a fetch for it fails because the origin server is marked as sick, grace mode kicks in and the stale object is returned. Objects within t_origin + ttl + grace + keep are kept around for conditional requests such as If-Modified-Since.
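The three timers can also be set explicitly in VCL. A minimal sketch, with hypothetical values chosen for illustration rather than our production settings:

   sub vcl_backend_response {
       // Hypothetical values: serve fresh for 1 hour...
       set beresp.ttl = 1h;
       // ...then allow grace-mode delivery of the stale object for 10 minutes...
       set beresp.grace = 10m;
       // ...and keep it for one more day to answer conditional requests.
       set beresp.keep = 1d;
   }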

All of the above applies to a single Varnish instance. The object you are fetching has gone through multiple Varnishes, from a minimum of 2 layers to a maximum of 4 (as of March 2019). Thus, in the worst case the object can be as old as ttl * 4: with the default 24-hour ttl at every layer, that is up to 4 days.

Historical TTL

Historically, the Wikimedia CDN used a max-age of 31 days (for Varnish and, before that, for Squid). In 2016, an initiative began to reduce the operational risk of relying on such long-lived cache entries and to measure the actual hitrate distribution in practice (T124954). In May 2016 the CDN frontend ttl was lowered to 1 day, and in May 2017 the CDN backend ttl was lowered to 1 day as well.

Request coalescing

If a request from client A results in a cache miss, Varnish fetches the object from an origin server. If more requests for the same object arrive before the fetch completes, they are put on a waiting list instead of being sent to the origin server, avoiding pointless origin load. Once the response is fetched, Varnish decides whether or not it is cacheable. If it is, the response is sent to all clients whose requests are on the waiting list. This feature is called request coalescing.

If the object is not cacheable, on the other hand, the response received for client A cannot be sent to the others: the requests on the waiting list must each be sent to the origin server, and waiting for the response to A's request was pointless. All requests for the uncacheable object are serialized by Varnish (sent to the origin server one after the other).

hit-for-pass

It is possible to create a special type of object marking uncacheable responses as such for a certain amount of time. This allows the cache to remember that requests for, say, the URI /test will not end up being a cache hit. After a hit-for-pass object for /test has been created, concurrent requests for that URI are no longer coalesced as described above: they are all sent to the origin server in parallel. Every request hitting the object is turned into a pass (hence the name: hit-for-pass).

A hit-for-pass object with a 120s ttl can be created in Varnish 5.1 as follows:

   sub vcl_backend_response {
       // [...]
       return(pass(120s));
   }

hit-for-miss

Certain objects might not stay uncacheable forever. A drawback of hit-for-pass is that no response will be cached for the duration of the ttl chosen for the hit-for-pass object, even if the response becomes cacheable in the meantime.

Another feature, called hit-for-miss, is available to make sure that responses which become cacheable do indeed get cached.

A hit-for-miss object with a 120s ttl can be created in Varnish 5.1 as follows:

   sub vcl_backend_response {
       // [...]
       set beresp.ttl = 120s;
       set beresp.uncacheable = true;
       return(deliver);
   }

Conditional requests (requests with the If-Modified-Since or If-None-Match headers) hitting a hit-for-miss object are turned by Varnish into unconditional requests: the full response body will be needed if the response turns out to be cacheable.

Cache admission policies

(Figure: caching probability exponentially decreasing with object size. The graph shows three different curves depending on algorithm settings.)

The decision of which objects to cache at the frontend layer, and which are best left to the backend, is crucial to achieving a good hitrate. As of May 2021 we use two cache admission policies: a trivial size-based policy implemented with a static size cutoff, and a probabilistic policy whose caching probability decreases exponentially with object size.

The static size-based policy is very simple: cache all objects smaller than a threshold defined in hiera as large_objects_cutoff. The probabilistic policy, also known as the Exp policy, assigns a caching probability between 0.0 and 1.0 depending on object size: smaller objects are more likely to be cached, while larger objects are progressively less likely to be. This implicitly brings popularity into the equation, ensuring that larger but popular objects still have a chance to be cached. The policy is enabled and configured with a few tunable hiera settings:

profile::cache::varnish::frontend::fe_vcl_config:
  admission_policy: 'exp'
  large_objects_cutoff: 8388608
  exp_policy_rate: 0.2
  exp_policy_base: -20.3

Different values of the rate and base settings produce different probability curves; these can be visualized with a spreadsheet made by the Traffic team for this purpose.

To observe the behavior of the algorithm at runtime, use the following command. In the output below, p is the probability for an object of size s to be cached given the configured base and rate settings. If the random number r is smaller than p, as is the case in the example, the object is cached.

 sudo varnishncsa -b -n frontend -q 'BerespStatus eq 200 and BereqMethod eq "GET"' -F 'p=%{VCL_Log:Admission Probability}x r=%{VCL_Log:Admission Urand}x s=%b %{X-Cache-Int}o %s %r'
 [...]
 p=0.962 r=0.192 s=36075 cp3055 miss 200 GET http://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Rose_Melrose_20070601_2.jpg/640px-Rose_Melrose_20070601_2.jpg HTTP/1.1
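The caching decision itself is just a comparison between the random draw r and the size-dependent probability p. The following self-contained awk sketch illustrates the idea; the curve used here, p = exp(-size/scale), and the scale value are illustrative assumptions rather than the exact formula implemented in our VCL:

 awk -v size=36075 -v scale=100000 'BEGIN {
     # Illustrative curve only, not the production formula:
     # admission probability decays exponentially with object size.
     p = exp(-size / scale)
     srand(); r = rand()    # uniform random draw in [0,1)
     printf("p=%.3f r=%.3f s=%d -> %s\n", p, r, size, (r < p) ? "cache" : "skip")
 }'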

HOWTO

Diagnose connection count flooding

For the Varnish/nginx cache pool, the client is the browser, and disconnections occur both due to humans pressing the "stop" button, and due to automated timeouts. It's rare for any queue size limit to be reached in Varnish, since queue slots are fairly cheap. Varnish's client-side timeouts tend to prevent the queue from becoming too large.

See request logs

As explained below, there are no access logs. However, you can see NCSA style log entries matching a given pattern on all cache hosts using:

$ sudo cumin 'A:cp' 'timeout 30 varnishncsa -g request -q "ReqURL ~ \"/wiki/Banana\"" -n frontend'

Block requests from public clouds

The hiera setting profile::cache::varnish::frontend::fe_vcl_config has an attribute called public_clouds_shutdown, which defaults to 'false'. Set it to 'true' to return a 429 response to all requests from public clouds such as AWS EC2.

This rule makes use of the abuse_networks['public_cloud_nets'] netmasks defined in hieradata/common.yaml in the private puppet repo, which list known IP blocks of popular cloud providers. The same netmasks can also be used to write custom ratelimiting/filtering rules.
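As a sketch of such a custom rule, assuming puppet renders these netmasks into a VCL ACL named public_cloud_nets (a hypothetical name) and that the vsthrottle vmod described in the next section is imported, public cloud traffic could for instance be throttled rather than blocked outright:

 if (client.ip ~ public_cloud_nets) {
   // Hypothetical limit: at most 500 requests every second per client IP
   if (vsthrottle.is_denied("cloud_limiter:" + req.http.X-Client-IP, 500, 1s)) {
     return (synth(429, "Too Many Requests"));
   }
 }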

Rate limiting

The vsthrottle vmod can be used to rate limit certain URLs/IPs. To add a limit specific to the text cluster, add a VCL snippet similar to the following to cluster_fe_ratelimit in modules/varnish/templates/text-frontend.inc.vcl.erb:

 if (req.url ~ "^/api/rest_v1/page/pdf/") {
   // Allow a maximum of 10 requests every 10 seconds for a certain IP
   if (vsthrottle.is_denied("proton_limiter:" + req.http.X-Client-IP, 10, 10s)) {
     return (synth(429, "Too Many Requests"));
   }
 }
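vsthrottle.is_denied(key, limit, period) keeps a token bucket per key and returns true once the request rate for the given key exceeds limit per period; including X-Client-IP in the key, as above, makes the limit per-client.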

Blacklist an IP

In case of service abuse, all requests from a given network can be blocked by adding the source network to the abuse_networks['blocked_nets'] hiera structure in /srv/private/hieradata/common.yaml. Users in the NDA LDAP group can access this list on config-master.wikimedia.org.

It looks like this:

@def $T123456_DESCRIPTION_HERE = (
  192.0.2.21/31
);
@def $BLOCKED_NETS = (
  198.51.100.1/31
  203.0.113.12/31
);
@def $BOT_BLOCKED_NETS = (
  198.51.100.50/30
  203.0.113.120/30
);
[…]

Per above, this file also supports a BOT_BLOCKED_NETS ACL that blocks traffic only from obviously bot-like User-Agents (for example python-requests/x.y.z). This is useful when an IP range is generating unacceptable bot traffic but also hosts real users.

Either way: add the network to the right hiera key, commit your changes, and run puppet on all cache nodes with:

 $ sudo cumin -b 15 'A:cp' 'run-puppet-agent -q'

If you know whether the IP is hitting the text or upload caches, you can use A:cp-text or A:cp-upload respectively for faster recovery. If a host is down, you can lower the success threshold with e.g. -p 90.

All requests with a source IP in the ACL will get a 403 response suggesting to contact us at noc@wikimedia.org.

See backend health

Run

# varnishlog -i Backend_health -O

Force your requests through a specific Varnish frontend

nginx/varnish-fe server selection at the LVS layer is done by consistent hashing of client source IPs, which means there is no easy way to reach a specific frontend node. As a workaround, if you have production access, you can use SSH tunnels and /etc/hosts. Assuming that the goal is choosing cp2002 for all your requests to upload.wikimedia.org:

 sudo setcap 'cap_net_bind_service=+ep' /usr/bin/ssh
 ssh -L 443:localhost:443 cp2002.codfw.wmnet

Then add a line to /etc/hosts such as

 127.0.0.1 upload.wikimedia.org 

This way, all your requests to upload.wikimedia.org will be served by cp2002.

Alternatively, curl's --resolve parameter can be used:

 curl --resolve upload.wikimedia.org:443:127.0.0.1 https://upload.wikimedia.org/wikipedia/en/thumb/d/d2/U.C._Sampdoria_logo.svg/263px-U.C._Sampdoria_logo.svg.png

Package a new Varnish release

  • Download the latest upstream release from http://varnish-cache.org/releases/
  • Verify the SHA256 checksum with sha256sum from coreutils
  • Clone the operations/debs/varnish4 repository
  • Checkout the debian-wmf branch
  • Import the tarball with gbp import-orig --pristine-tar /tmp/varnish-${version}.tar.gz
  • Push both the upstream and pristine-tar branches straight to gerrit without code review. For example: git push gerrit pristine-tar and git push gerrit upstream
  • Checkout the debian-wmf branch
  • Edit debian/changelog, commit, and open a code review. Eg: git push gerrit HEAD:refs/for/debian-wmf

Build and upload the new release

This section assumes that we are targeting the amd64 architecture, and Debian bullseye as a distro.

Once the packaging work described in the previous section is done, the next step is building the package and uploading it to WMF's APT repository.

Package building must be done using git-buildpackage (aka gbp) on a host with role(builder). After cloning the operations/debs/varnish4 repository on the builder, invoke gbp with the right incantation:

GIT_PBUILDER_AUTOCONF=no WIKIMEDIA=yes ARCH=amd64 GBP_PBUILDER_DIST=bullseye DIST=bullseye gbp buildpackage -jauto -us -uc -sa --git-builder=git-pbuilder

After successful compilation, all build artifacts will be available under /var/cache/pbuilder/result/bullseye-amd64/.

To upload the packages to apt.wikimedia.org, ssh to apt1001.wikimedia.org and run the following commands (replace ~ema/varnish_6.0.9-1wm1_amd64.changes with the appropriate .changes file):

$ rsync -v build2001.codfw.wmnet::pbuilder-result/bullseye-amd64/*varnish* .

$ sudo -i reprepro -C component/varnish6 include bullseye-wikimedia ~ema/varnish_6.0.9-1wm1_amd64.changes

At this point the package should be available to be installed on all production nodes. Double-check that this is the case with the following command on a production node:

$ sudo apt update ; sudo apt policy varnish

Deploy the new release on Beta

The new release should be tested on the Beta Cluster. To do so, ssh onto the cache node serving en.wikipedia.beta.wmflabs.org (currently deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud), and upgrade the package running the following commands:

$ sudo apt update ; sudo apt install varnish libvarnishapi2

Assuming that package installation went well, the service can be restarted on deployment-cache-text08 as follows:

# CAUTION! The following command does not depool the service before restarting it, and
# it should therefore NOT be executed on production nodes.
# This is not the command to run in prod. Do not run this in prod. :)
# On production nodes you would run /usr/local/sbin/varnish-frontend-restart instead.
# Making varnish-frontend-restart just work on the Beta Cluster is left as an exercise for the reader:
# https://phabricator.wikimedia.org/T299054
$ sudo systemctl restart varnish-frontend

Once the unit has been restarted successfully, go browse https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page and see if it works. Also run the following to see the logs of requests going through:

$ sudo varnishncsa -n frontend

Deploy the new release to Prod

Now that you've verified that the new release works great, announce to #wikimedia-operations your intention to upgrade a prod node. For example:

!log cp4021: upgrade varnish to 6.0.9-1wm1 T298758

SSH onto the host and upgrade varnish:

$ sudo apt update ; sudo apt install varnish libvarnishapi2

Restart the service using varnish-frontend-restart, which takes care of depooling the node before stopping the service and repooling it after starting it.

$ sudo -i varnish-frontend-restart

Look at the various stats on cache-hosts-comparison and check that there are no obvious issues. If something looks wrong (eg: 50x error spikes), depool the node with the following command and investigate:

$ sudo -i depool

If everything looks good, the whole cluster can be upgraded. The procedure to upgrade and roll-restart all cache nodes is as follows (from a cumin node: cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet):

Note that you can run cumin with --force to avoid being prompted for confirmation in each screen session about the list of nodes the command will be run on. Some people prefer to manually check that.

for dc in eqiad eqsin codfw esams ulsfo drmrs ; do
        for cluster in text upload; do
                sudo -i screen -dmSU ${cluster}_${dc} cumin -b 1 -s 1200 "A:cp-${cluster}_${dc}" 'apt -y install varnish libvarnishapi2 && varnish-frontend-restart'
        done
done

The above upgrades all DC/cluster combinations in parallel (eg: eqiad_text together with eqiad_upload), waiting 20 minutes between nodes to let caches refill a little before proceeding.

Nov 2022: Please note that the above procedure (varnish-frontend-restart) will also repool hosts that were already depooled. We are working on a fix and will update this page.

Diagnosing Varnish alerts

There are multiple possible sources of 5xx errors.

  1. Check 5xx errors on the aggregate client status code dashboard
    1. Use the site dropdown to see if the issue is affecting multiple sites or if it is isolated
    2. Use the cache_type dropdown and see if the errors affect only text, upload, or both. If both clusters are affected, contact #wikimedia-netops on IRC as the issue is potentially network-related
  2. If upload is affected, check the following dashboards:
    1. Thumbor eqiad and codfw
    2. Swift eqiad and codfw
  3. If text is affected, check the following Grafana dashboards:
    1. MediaWiki Graphite Alerts
    2. RESTBase
  4. Check if anything stands out in Logstash, on the Varnish-Webrequest-50X and Varnish Fetch Errors Grafana visualizations
  5. If none of the above steps helped, the issue might be due to problems with Varnish or ATS. See the Varnish Failed Fetches dashboard for the DC/cluster affected by the issue
    1. It could be that most fetch failures affect one single Varnish backend (backends can be shown by using the layer dropdown). See if there's anything interesting for the given backend in the "Mailbox Lag" graph and contact #wikimedia-traffic
    2. In case nobody from the Traffic team is around, SSH onto the host affected by mailbox lag. Varnish backends are restarted twice a week by cron (see /etc/cron.d/varnish-backend-restart), so the maximum uptime for varnish.service is 3.5 days. Check the uptime with sudo systemctl status varnish | grep Active. If it is on the order of a few hours, something is wrong: page Traffic. Otherwise, restart the backend with sudo -i varnish-backend-restart. Do not use any other command to restart varnish, and do not restart more than one varnish backend.
  6. If the given cluster uses ATS instead of Varnish (there's no result when choosing backend in the layer dropdown), see the ATS Cluster View.

Some more tricks

// Query Times
varnishncsa -F '%t %{VCL_Log:Backend}x %Dμs %bB %s %{Varnish:hitmiss}x "%r"'

// Top URLs
varnishtop -i ReqURL

// Top Referer, User-Agent, etc.
varnishtop -i ReqHeader -I Referer
varnishtop -i ReqHeader -I User-Agent

// Cache Misses (URLs fetched from the backend)
varnishtop -i BereqURL

Configuration

Deployment and configuration of Varnish is done using Puppet.

See the modules/varnish and modules/varnishkafka directories in operations/puppet.

We use a custom varnish package with several local patches applied. Varnish behavior is controlled through VCL (Varnish Configuration Language), a DSL whose subroutines are compiled into C and executed during each request. The VCL files are located at

/etc/varnish/*.vcl
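To inspect which VCL configurations are loaded on a running frontend instance, and which one is currently active, varnishadm's vcl.list command can be used:

 $ sudo varnishadm -n frontend vcl.list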

One-off purges (bans)

Note: if all you need is purging objects based on their URLs, see how to perform One-off purges.

Sometimes it's necessary to do one-off purges to fix operational issues, and these are accomplished with the varnishadm "ban" command. It's best to read up on this thoroughly ahead of time! What a ban effectively does in practice is mark as invalid all objects that match the ban conditions and that entered the cache before the ban command's execution. (Cached objects are checked against the ban list at lookup time, and in the background by the "ban lurker" thread.)

Keep in mind that bans are not routine operations! They are expected to be isolated, low-rate operations performed in emergencies or after some kind of screw-up has happened. They are low-level tools which can be very dangerous to site performance and uptime, and Varnish in general doesn't deal well with a high rate of ban requests. These instructions are mostly for SRE use (with great care). Depending on the cluster and situation, under normal circumstances anywhere from 85 to 98 percent of all our incoming traffic is absorbed by the cache layer, so broad invalidation can greatly multiply applayer request traffic until the caches refill, causing serious outages in the process.


How to execute a ban (on one machine)

The varnishadm ban command is generally going to take the form:

varnishadm -n frontend ban [ban conditions]

Note that every machine runs two cache daemons: the frontend Varnish and the backend ATS. To effectively remove an object from both, you must first force the cache miss from ATS, then ban it from Varnish.

Examples of ban conditions

Ban all 301 redirect objects for hostnames ending in wikimedia.org:

obj.status == 301 && req.http.host ~ "wikimedia\.org$"

Ban all URLs that start with /static/, regardless of hostname:

 req.url ~ "^/static/"

Ban all content on zh.wikipedia.org:

req.http.host == "zh.wikipedia.org"

Notably, OR is not supported, so something like the following does not work:

 req.http.host == "zh.wikipedia.org" || req.http.host == "it.wikipedia.org"

Instead, you need multiple bans to get OR semantics:

 req.http.host == "zh.wikipedia.org"
 req.http.host == "it.wikipedia.org"

Execute a ban on a cluster

The following example shows how to ban all objects with Content-Length: 0 and status code 200 in ulsfo cache_upload:

cumin -b 1 A:cp-upload_ulsfo "varnishadm -n frontend ban 'obj.status == 200 && obj.http.content-length == 0'"

With nested quotation marks:

cumin -b 1 A:cp-upload_ulsfo "varnishadm -n frontend ban 'req.http.host == \"it.wikipedia.org\"'"

Ban condition for MediaWiki outputs by datestamp of generation

1. Determine the start/end timestamps you need in the same standard format as date -u:

 > date -u
 Thu Apr 21 12:16:52 UTC 2016

2. Convert your start/end timestamps to unix epoch time integers:

 > date -d "Thu Apr 21 12:16:52 UTC 2016" +%s
 1461241012
 > date -d "Thu Apr 21 12:36:01 UTC 2016" +%s
 1461242161

3. Note that you can reverse this conversion, which will come in handy below, like this:

 > date -ud @1461241012
 Thu Apr 21 12:16:52 UTC 2016

4. For ban purposes, we need a regex matching a range of epoch timestamp numbers. It's probably easiest to approximate it, rounding outwards to a slightly wider range to keep the regex simple. The regex below rounds the range above to 1461241000 - 1461242199 which, converted back via step 3, covers 12:16:40 through 12:36:39:

 146124(1[0-9]|2[01])

5. MediaWiki emits a Backend-Timing header with fields D and t, where t is a microsecond-resolution epoch number (we'll ignore its final 6 digits), like so:

 Backend-Timing: D=31193 t=1461241832338645

6. To ban on this header using the epoch-seconds regex built in step 4:

 ban obj.http.Backend-Timing ~ "t=146124(1[0-9]|2[01])"
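From the shell, rather than from the varnishadm prompt, the same ban follows the pattern shown earlier:

 varnishadm -n frontend ban 'obj.http.Backend-Timing ~ "t=146124(1[0-9]|2[01])"'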
