Varnish

Varnish is a fast caching HTTP reverse proxy.

Cache Clusters

We currently host the following Varnish cache clusters at all of our datacenters:

  • cache_text - Primary cluster for MediaWiki and various app/service (e.g. RESTBase, phabricator) traffic
  • cache_upload - Serves upload.wikimedia.org and maps.wikimedia.org exclusively (images, thumbnails, map tiles)

Old clusters that no longer exist:

  • cache_bits - Used to exist just for static content and ResourceLoader, now decommed (traffic went to cache_text)
  • cache_mobile - Was like cache_text but just for (m|zero)\. mobile hostnames, now decommed (traffic went to cache_text)
  • cache_parsoid - Legacy entrypoint for parsoid and related *oid services, now decommed (traffic goes via cache_text to RESTBase)
  • cache_maps - Served maps.wikimedia.org exclusively, which is now serviced by cache_upload
  • cache_misc - Miscellaneous lower-traffic / support services (e.g. phabricator, metrics, etherpad, graphite, etc). Now moved to cache_text.

Headers

X-Cache

X-Cache is a comma-separated list of cache hostnames with information such as hit/miss status for each entry. The header is read right to left: the rightmost entry is the outermost cache, and entries to the left are progressively deeper towards the applayer. The rightmost cache is the in-memory (frontend) cache; all others are disk (backend) caches.

In case of a cache hit, the number of times the object has been returned is also reported. Once a "hit" is encountered while reading right to left, everything to its left describes the cached object that was hit: those entries record whether the deeper caches missed, passed, or hit when the object was first pulled into the cache that is now reporting the hit. For example:

X-Cache: cp1066 hit/6, cp3043 hit/1, cp3040 hit/26603
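
You can inspect the header yourself with curl; the hostname and article below are arbitrary examples:

 curl -sI https://en.wikipedia.org/wiki/Banana | grep -i '^x-cache'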

An explanation of the possible information contained in X-Cache follows.

Not talking to other servers

  • hit: a cache hit in cache storage. There was no need to query a deeper cache server (or the applayer, if already at the last cache server)
  • int: a locally-generated response from the cache, for example a 301 redirect. The cache did not use a cached object and did not need to contact another server

Talking to other servers

  • miss: the object might be cacheable, but we don't have it
  • pass: the object was uncacheable, talk to a deeper level

Some subtleties on pass: different caches (e.g. in-memory vs. on-disk) might disagree on whether an object is cacheable. A pass on the in-memory cache (for example, because the object is too big) could be a hit for an on-disk cache. Also, it is sometimes not clear that an object is uncacheable until the moment we fetch it. In that case, we cache for a short while the fact that the object is uncacheable. In Varnish terminology, this is a hit-for-pass.

If we don't know an object is uncacheable until after we fetch it, the first request is initially identical to a normal miss, which means coalescing: other requests for the same object wait for the first response. But that first fetch returns an uncacheable object, which cannot answer the other requests that may have queued. As a result they all get serialized, destroying the performance of hot (high-parallelism) objects that are uncacheable. hit-for-pass is the answer to that problem: when that first request (made with no prior knowledge) gets an uncacheable response, we create a special cache entry that says, in effect, "this object cannot be cached, remember that for 10 minutes", and all requests over the next 10 minutes proceed in parallel without coalescing, because it is already known that the object isn't cacheable.

The content of the X-Cache header is recorded for every request in the webrequest log table.

Request coalescing

In case a request from client A results in a cache miss, Varnish fetches the object from an origin server. If multiple requests for the same object arrive before Varnish is done fetching, they are put on a waiting list instead of being sent to the origin server, to avoid pointless origin server load. Once the response is fetched, Varnish decides whether or not it is cacheable. If it is, the response is sent to all clients whose requests are on the waiting list. This feature is called request coalescing.

If the object is not cacheable, on the other hand, the response received for client A cannot be sent to the others. The requests on the waiting list must be sent to the origin server; waiting for the response to A's request was pointless. All requests for the uncacheable object will be serialized by Varnish (sent to the origin server one after the other).

hit-for-pass

It is possible to create a special type of object marking uncacheable responses as such for a certain amount of time. This allows the cache to remember that requests for, say, uri /test will not end up being a cache hit. After a hit-for-pass (hfp) object for /test has been created, concurrent requests for that object will not be coalesced as described previously. Instead, they will all be sent to the origin server in parallel. All requests hitting the object will be turned into pass (hence the name: hit-for-pass).

A hit-for-pass object with 120s ttl can be created in Varnish 5.1 as follows:

   sub vcl_backend_response {
       // [...]
       return(pass(120s));
   }

hit-for-miss

Certain objects are not uncacheable forever. A drawback of hit-for-pass is that no response will be cached for the duration of the ttl chosen for the hit-for-pass object, even if the response becomes cacheable in the meantime.

Another feature called hit-for-miss is available to make sure that responses becoming cacheable do indeed get cached.

A hit-for-miss object with ttl 120s can be created in Varnish 5.1 as follows:

   sub vcl_backend_response {
       // [...]
       set beresp.ttl = 120s;
       set beresp.uncacheable = true;
       return(deliver);
   }

Conditional requests (requests with the If-Modified-Since or If-None-Match headers) hitting a hit-for-miss object will be turned by Varnish into unconditional requests. The reason for this is that the response body is going to be necessary if the response becomes cacheable.

TTL

Varnish object life time

In general, the ttl of an object is set to the max-age value specified by the application servers in the Cache-Control header. A ceiling for max-age is provided by the so-called ttl_cap: max-age values greater than 24 hours are overridden and set to 24 hours. If the Cache-Control header is absent, the ttl is set to the default_ttl configured in hiera, which is 24 hours as of March 2019. In some cases, we might override all of this by setting beresp.ttl in VCL.

How long is an object kept in cache? It depends on the ttl and two other settings called keep and grace. Like the ttl, keep and grace can be set from VCL, or they default to varnishd's default_keep and default_grace parameters.
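
These ttl/keep/grace defaults are standard varnishd runtime parameters; on a cache host you can inspect them (read-only) with varnishadm, for example:

 varnishadm param.show default_ttl
 varnishadm param.show default_grace
 varnishadm param.show default_keep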

An object is kept in cache for ttl + keep + grace seconds. Whether or not it is returned to clients requesting it depends on when the request comes in. Let t_origin be the moment the object enters the cache: the object is considered fresh, and hence unconditionally served to clients, until t_origin + ttl. If a fetch for the object fails because the origin server is marked as sick, and the object is past its ttl but within t_origin + ttl + grace, grace mode kicks in and the stale object is returned. Objects within t_origin + ttl + grace + keep are kept around for conditional requests such as If-Modified-Since.

All of the above applies to a single varnish instance. The object you are fetching has gone through multiple varnishes, from a minimum of 2 to a maximum of 4 (again as of March 2019). Thus, in the worst case the object can be as old as ttl * 4.
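
To get a feel for how old the copy you received actually is, you can look at the Age response header (which accumulates across caching layers) together with X-Cache; hostname and article are arbitrary examples:

 curl -sI https://en.wikipedia.org/wiki/Banana | grep -iE '^(age|x-cache)'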

HOWTO

See request logs

As explained below, there are no access logs. However, you can see NCSA-style log entries matching a given pattern on all cache hosts using:

$ sudo cumin 'A:cp' 'timeout 30 varnishncsa -g request -q "ReqURL ~ \"/wiki/Banana\"" -n frontend'
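
On a single cache host, the same inner command can be run directly (drop the timeout to keep watching):

 sudo varnishncsa -n frontend -g request -q 'ReqURL ~ "/wiki/Banana"'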

Rate limiting

The vsthrottle vmod can be used to rate limit certain URLs/IPs. To add some text-specific limit, add a VCL snippet similar to the following to cluster_fe_ratelimit in modules/varnish/templates/text-frontend.inc.vcl.erb:

 if (req.url ~ "^/api/rest_v1/page/pdf/") {
   // Allow a maximum of 10 requests every 10 seconds for a certain IP
   if (vsthrottle.is_denied("proton_limiter:" + req.http.X-Client-IP, 10, 10s)) {
     return (synth(429, "Too Many Requests"));
   }
 }
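
As a rough check that such a limit behaves as intended (assuming the snippet above is deployed and the example URL below actually matches it; preferably try this against a test setup rather than production), repeated requests from the same IP beyond the threshold should start returning 429:

 for i in $(seq 1 12); do
   # print only the HTTP status code of each request
   curl -s -o /dev/null -w '%{http_code}\n' "https://en.wikipedia.org/api/rest_v1/page/pdf/Banana"
 done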

See backend health

Run

# varnishlog -g raw -i Backend_health
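
Alternatively, varnishadm can list the backends together with their current health state as seen by varnishd (add -n frontend for the frontend instance):

 varnishadm backend.list
 varnishadm -n frontend backend.list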

Force your requests through a specific Varnish frontend

nginx/varnish-fe server selection at the LVS layer is done using a consistent hash of the client source IP. This means there is no easy way to choose a given frontend node. As a workaround, if you have production access, you can use an SSH tunnel and /etc/hosts. Assuming the goal is choosing cp2002 for all your requests to upload.wikimedia.org:

 sudo setcap 'cap_net_bind_service=+ep' /usr/bin/ssh
 ssh -L 443:localhost:443 cp2002.codfw.wmnet

Then add a line to /etc/hosts such as

 127.0.0.1 upload.wikimedia.org

This way, all your requests to upload.wikimedia.org will be served by cp2002.
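
To verify that the tunnel is actually being used, check that cp2002 shows up in the X-Cache header of a response (any URL under upload.wikimedia.org will do):

 curl -sI https://upload.wikimedia.org/ | grep -i '^x-cache'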

Restart a backend

Sometimes, for example when the 'mailbox lag' alert fires on a cp* host, you need to restart the backend and only the backend.

The script for this is:

sudo -i varnish-backend-restart

Package a new Varnish release

 git checkout debian-wmf
 gbp import-orig --pristine-tar /tmp/varnish-${version}.tar.gz
 git push gerrit pristine-tar
 git push gerrit upstream
 # edit changelog, commit and open a code review:
 git push gerrit HEAD:refs/for/debian-wmf

Upgrading to a new minor Varnish release

Run the following commands to upgrade Varnish to a new minor release:

depool ; sleep 3 ; puppet agent --disable 'Upgrading varnish' ; run-no-puppet echo; apt update; service varnish-frontend stop; service varnish stop ; apt install varnish varnish-dbg libvarnishapi1 ; puppet agent --enable ; puppet agent -tv ; pool

Note that it's important to avoid race conditions with cron-scheduled puppet agent runs. The run-no-puppet command can be used for that purpose.

Upgrading from Varnish 3 to Varnish 4

In the specific case of upgrading from Varnish 3 to Varnish 4, follow this procedure:

  • Disable puppet on the node: puppet agent --disable "Upgrading to Varnish 4"
  • Set varnish_version4 to true in hieradata
  • Depool the node and wait a bit for it to be drained: depool ; sleep 15
  • Verify that no user requests are being served by the frontend varnish: varnishncsa -n frontend -m 'RxRequest:^(?!PURGE$)' | grep -v PageGetter
  • Verify that no user requests are being served by the backend varnish: varnishncsa -m 'RxRequest:^(?!PURGE$)' | grep -v 'backend check'
  • Enable our experimental repo: echo deb http://apt.wikimedia.org/wikimedia jessie-wikimedia experimental > /etc/apt/sources.list.d/wikimedia-experimental.list ; apt update
  • Stop Varnish 3: service varnish-frontend stop; service varnish stop
  • Remove libvarnishapi1: apt-get -y remove libvarnishapi1
  • Wipe on-disk storage: rm -f /srv/sd*/varnish*
  • Re-enable puppet: puppet agent --enable
  • Run puppet agent a few times and ensure it completes successfully: puppet agent -t; puppet agent -t; puppet agent -t; puppet agent -t
  • Test the upgrade and if everything is fine repool the node

Downgrading from Varnish 4 to Varnish 3

  • Disable puppet on the node: puppet agent --disable "Downgrading to Varnish 3"
  • Remove varnish_version4 from hieradata, or set it to false
  • Depool the node and wait a bit for it to be drained: depool ; sleep 15
  • Verify that no user requests are being served by the frontend varnish: varnishncsa -n frontend -q 'not ReqMethod eq PURGE' | grep -v PageGetter
  • Verify that no user requests are being served by the backend varnish: varnishncsa -q 'not ReqMethod eq PURGE' | grep -v 'backend check'
  • Remove our experimental repo: rm /etc/apt/sources.list.d/wikimedia-experimental.list ; apt update
  • Stop Varnish 4: service varnish-frontend stop; service varnish stop
  • Remove libvarnishapi1: apt-get -y remove libvarnishapi1
  • Wipe on-disk storage: rm -f /srv/sd*/varnish*
  • Re-enable puppet: puppet agent --enable
  • Run puppet agent a few times and ensure it completes successfully: puppet agent -t; puppet agent -t; puppet agent -t; puppet agent -t. If something goes wrong here, you might have to unmask varnish.service: systemctl unmask varnish.service
  • Test the downgrade and if everything is fine repool the node

Upgrading from Varnish 4 to Varnish 5

  • Disable puppet on the node to be upgraded: puppet agent --disable "Upgrading to Varnish 5"
  • Set profile::cache::base::varnish_version: 5 and apt::use_experimental: true in hiera
  • Depool the node and wait a bit for it to be drained: depool ; sleep 15
  • Enable our experimental repo: echo deb http://apt.wikimedia.org/wikimedia jessie-wikimedia experimental > /etc/apt/sources.list.d/wikimedia-experimental.list ; apt update
  • Stop Varnish 4: service varnish-frontend stop; service varnish stop
  • Check that indeed no varnishd process is running any longer
  • Remove libvarnishapi1: apt-get -y remove libvarnishapi1
  • Re-enable puppet: puppet agent --enable
  • Run puppet agent a few times and ensure it completes successfully: puppet agent -t; puppet agent -t
  • Repool the node if everything looks fine: pool

Downgrading from Varnish 5 to Varnish 4

  • Disable puppet on the node to be downgraded: puppet agent --disable "Downgrading to Varnish 4"
  • Set profile::cache::base::varnish_version: 4 and apt::use_experimental: false in hiera
  • Depool the node and wait a bit for it to be drained: depool ; sleep 15
  • Disable our experimental repo: rm /etc/apt/sources.list.d/wikimedia-experimental.list ; apt update
  • Stop Varnish 5: service varnish-frontend stop; service varnish stop
  • Remove libvarnishapi1: apt-get -y remove libvarnishapi1
  • Re-enable puppet: puppet agent --enable
  • Run puppet agent a few times and ensure it completes successfully: puppet agent -t; puppet agent -t
  • Repool the node if everything looks fine: pool

Diagnosing Varnish alerts

There are multiple possible sources of 5xx errors.

  1. Check 5xx errors on the aggregate client status code dashboard
    1. Use the site dropdown to see if the issue is affecting multiple sites or if it is isolated
    2. Use the cache_type dropdown and see if the errors affect only text, upload, or both. If both clusters are affected, contact #wikimedia-netops on IRC as the issue is potentially network-related
  2. If upload is affected, check the following dashboards:
    1. Thumbor eqiad and codfw
    2. Swift eqiad and codfw
  3. If text is affected, check the following dashboards:
    1. MediaWiki Graphite Alerts
    2. RESTBase
  4. See if anything stands out in the Varnish-Webrequest-50X and Varnish Fetch Errors Grafana visualizations
  5. If none of the above steps helped, the issue might be due to problems with Varnish or ATS. See the Varnish Failed Fetches dashboard for the DC/cluster affected by the issue
    1. It could be that most fetch failures affect one single Varnish backend (backends can be shown by using the layer dropdown). See if there's anything interesting for the given backend in the "Mailbox Lag" graph and contact #wikimedia-traffic
    2. In case nobody from the Traffic team is around, SSH onto the host affected by mailbox lag (the lag can also be checked on the host with the varnishstat command shown after this list). Varnish backends are restarted twice a week by cron (see /etc/cron.d/varnish-backend-restart), so the maximum uptime for varnish.service is 3.5 days. Check the uptime with sudo systemctl status varnish | grep Active. If it is on the order of a few hours, something is wrong: page Traffic. Otherwise, restart the backend with sudo -i varnish-backend-restart. Do not use any other command to restart varnish. Do not restart more than one varnish backend.
  6. If the given cluster uses ATS instead of Varnish (there's no result when choosing backend in the layer dropdown), see the ATS Cluster View.
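
Mailbox lag can also be checked directly on an affected host: it is the gap between objects handed to and picked up by the expiry thread, exposed by the MAIN.exp_mailed and MAIN.exp_received counters:

 varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received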

Some more tricks

// Query Times
varnishncsa -F '%t %{VCL_Log:Backend}x %Dμs %bB %s %{Varnish:hitmiss}x "%r"'

// Top URLs (Varnish 4+ tag names)
varnishtop -i ReqURL

// Top Referer, User-Agent, etc.
varnishtop -I ReqHeader:Referer
varnishtop -I ReqHeader:User-Agent

// Cache misses (URLs fetched from the backend)
varnishtop -i BereqURL

Configuration

Deployment and configuration of Varnish is done using Puppet.

See the production/modules/varnish and varnishkafka repositories in operations/puppet.

We use a custom Varnish 5.1.x package with several local patches applied. Varnish is configured through VCL (Varnish Configuration Language) files, a DSL in which Varnish behavior is controlled by subroutines that are compiled to C and executed during each request. The VCL files are located at

/etc/varnish/*.vcl

The VCL configuration can be tested using vagrant and the VTC (Varnish Test Case) files shipped with the operations/puppet repo. For example, if the VCL changes that need to be tested have been published as Gerrit change 506868, and they need to be tested against a cache_upload node (cp1076.eqiad.wmnet):

cd ./modules/varnish/files/tests
vagrant up
./run.sh 506868 cp1076.eqiad.wmnet

One-off purges (bans)

Note: if all you need is purging objects based on their URLs, see how to perform One-off purges.

Sometimes it's necessary to do one-off purges to fix operational issues; these are accomplished with the varnishadm "ban" command. It's best to read up on this thoroughly ahead of time! What a ban effectively does in practice is mark as invalid all objects that were in the cache prior to the ban command's execution and that match the ban conditions.

Keep in mind that bans are not routine operations! These are expected to be isolated low-rate operations we perform in emergencies or after some kind of screw-up has happened. These are low-level tools which can be very dangerous to our site performance and uptime, and Varnish doesn't deal well in general with a high rate of ban requests. These instructions are mostly for operations staff use (with great care). Depending on the cluster and situation, under normal circumstances anywhere from 85 to 98 percent of all our incoming traffic is absorbed by the cache layer, so broad invalidation can greatly multiply applayer request traffic until the caches refill, causing serious outages in the process.


How to execute a ban (on one machine)

The varnishadm ban command is generally going to take the form:

varnishadm [-n frontend] ban [ban conditions]

Note that every machine runs two varnish daemons: the default (backend) instance, which requires no '-n' parameter, and the frontend instance, which requires '-n frontend'.
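
For example, to ban a single URL on one machine in both instances and then verify that the bans were registered (the URL here is purely illustrative):

 varnishadm ban 'req.url == "/favicon.ico"'
 varnishadm -n frontend ban 'req.url == "/favicon.ico"'
 varnishadm ban.list
 varnishadm -n frontend ban.list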

Execute a ban on a cluster

The following example shows how to ban all objects with Content-Length: 0 and status code 200 in ulsfo cache_upload:

salt -b 1 -C 'G@site:ulsfo and G@cluster:cache_upload' cmd.run "varnishadm -n frontend ban 'obj.status == 200 && obj.http.content-length == 0'"

Examples of ban conditions

Ban all content on zh.wikipedia.org

req.http.host == "zh.wikipedia.org"

Ban all 301 redirect objects in hostnames ending in wikimedia.org

obj.status == 301 && req.http.host ~ "wikimedia\.org$"

Ban all URLs that start with /static/, regardless of hostname:

 req.url ~ "^/static/"

Ban condition for MediaWiki outputs by datestamp of generation

1. Determine the start/end timestamps you need in the same standard format as date -u:

 > date -u
 Thu Apr 21 12:16:52 UTC 2016

2. Convert your start/end timestamps to unix epoch time integers:

 > date -d "Thu Apr 21 12:16:52 UTC 2016" +%s
 1461241012
 > date -d "Thu Apr 21 12:36:01 UTC 2016" +%s
 1461242161

3. Note you can reverse this conversion, which will come in handy below, like this:

 > date -ud "Jan 1, 1970 00:00:00 +0000 + 1461241012 seconds"
 Thu Apr 21 12:16:52 UTC 2016

4. For ban purposes, we need a regex matching a range of epoch timestamp numbers. It's probably easiest to approximate it and round outwards to a slightly wider range to make the regex simpler. This regex rounds the range above to 1461241000 - 1461242199, which, converted back via step 3, shows we've rounded the range outwards to cover 12:16:40 through 12:36:39:

 146124(1[0-9]|2[01])

5. MediaWiki emits a Backend-Timing header with fields D and t, where t is a microsecond-resolution epoch number (we'll ignore those final 6 digits), like so:

 Backend-Timing: D=31193 t=1461241832338645

6. To ban on this header using the epoch seconds regex we build in step 4:

 ban obj.http.Backend-Timing ~ "t=146124(1[0-9]|2[01])"

How to execute a ban across a cluster

The first step is selecting the correct cluster from the list at the top of this page for the traffic you're trying to ban.

Keeping in mind the architecture of our cache tiers and layers, there are ordering rules that must be followed:

  1. Backends at Tier-1 datacenters must be banned before backends at Tier-2 datacenters.
  2. Backends at any datacenter must be banned before frontends at the same datacenter.

Currently, only eqiad is a Tier-1 datacenter at the Traffic layer, and all others are Tier-2. Therefore a reasonable procedure that obeys the rules above is:

  1. Ban eqiad backend instances
  2. Ban codfw backend instances
  3. Ban ulsfo and esams backend instances
  4. Ban all frontend instances

For distributed execution of ban commands, cache clusters and sites can be selected with salt grain conditionals on "site" and "cluster".

Putting this all together, this is a real example of banning all 404 objects with request URL "/apple-app-site-association":

salt -b 1 -v -t 30 -C 'G@cluster:cache_text and G@site:eqiad' cmd.run "varnishadm ban 'req.url == \"/apple-app-site-association\" && obj.status == 404'"
salt -b 1 -v -t 30 -C 'G@cluster:cache_text and G@site:codfw' cmd.run "varnishadm ban 'req.url == \"/apple-app-site-association\" && obj.status == 404'"
salt -b 1 -v -t 30 -C 'G@cluster:cache_text and not G@site:codfw and not G@site:eqiad' cmd.run "varnishadm ban 'req.url == \"/apple-app-site-association\" && obj.status == 404'"
salt -b 1 -v -t 30 -C 'G@cluster:cache_text' cmd.run "varnishadm -n frontend ban 'req.url == \"/apple-app-site-association\" && obj.status == 404'"

How to execute a ban across a cluster with cumin

Cluster-wide ban of objects with Content-Type text on cache_upload:

cumin -b 1 'R:class = role::cache::upload and *.eqiad.wmnet' "varnishadm ban 'obj.http.content-type ~ \"^text\"'"
cumin -b 1 'R:class = role::cache::upload and *.codfw.wmnet' "varnishadm ban 'obj.http.content-type ~ \"^text\"'"
cumin -b 1 'R:class = role::cache::upload and not *.eqiad.wmnet and not *.codfw.wmnet' "varnishadm ban 'obj.http.content-type ~ \"^text\"'"
cumin -b 1 'R:class = role::cache::upload' "varnishadm -n frontend ban 'obj.http.content-type ~ \"^text\"'"

Cluster-wide ban of objects with specific path:

cumin -b 1 'R:class = role::cache::upload and *.eqiad.wmnet' "varnishadm ban 'req.url ~ \"^/path\"'"
cumin -b 1 'R:class = role::cache::upload and *.codfw.wmnet' "varnishadm ban 'req.url ~ \"^/path\"'"
cumin -b 1 'R:class = role::cache::upload and not *.eqiad.wmnet and not *.codfw.wmnet' "varnishadm ban 'req.url ~ \"^/path\"'"
cumin -b 1 'R:class = role::cache::upload' "varnishadm -n frontend ban 'req.url ~ \"^/path\"'"
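
After issuing cluster-wide bans, the registered bans can be listed on the hosts to confirm they went through:

 cumin 'R:class = role::cache::upload' "varnishadm ban.list"
 cumin 'R:class = role::cache::upload' "varnishadm -n frontend ban.list"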
