Obsolete:Squids
Proxies (url-downloader / webproxy)
Although the rest of this page is historical, we still use Squid as an HTTP proxy in two places: Url-downloader and webproxy. If you are wondering what the difference is, see the explanation at:
Why do we have 2 sets of squid proxies?
There are 4 clusters of squid servers: one upload and one text cluster at each of our two locations, esams and pmtpa. Each server runs two instances of squid: a frontend squid listening on port 80, and a cache squid listening on port 3128. The purpose of the frontend squid is to distribute load to the cache squids based on URL hash, using the CARP algorithm.
LVS is used to balance incoming requests across the frontends, which then use CARP to distribute the traffic to the backends.
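As a rough sketch of what that looks like in squid terms (Squid 2.6+ cache_peer syntax; the host names and weights here are invented for illustration, not taken from the production config), a frontend instance forwards to the cache instances with CARP-enabled cache_peer lines:

# Illustrative frontend snippet only; real host names and weights differ
http_port 80 vhost defaultsite=example.org
# Each cache (backend) squid listens on 3128; the carp option selects the
# peer by URL hash, so a given URL consistently maps to the same backend
cache_peer sq1.example.org parent 3128 0 carp weight=10
cache_peer sq2.example.org parent 3128 0 carp weight=10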
Overview
Why Squid?
Squid is a high-performance proxy server that can also be used as an HTTP accelerator for the webserver. In layman's terms, Squid stores a copy of the pages served by the webserver; the next time the same page is requested, Squid serves the copy. This process is called "caching", and it removes the need for the webserver to regenerate that page again, resulting in a tremendous performance boost for the webserver.
Since MediaWiki websites are generated entirely dynamically, there is a substantial performance gain in running Squid or Varnish as a HTTP accelerator for your webserver. In fact, sites like Wikipedia use several Squid caches to enhance their performance.
Because of this performance gain, MediaWiki has been designed to integrate closely with Squid. For example, MediaWiki will notify Squid when a page should be purged from the cache in order to be regenerated.
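These purge notifications are ordinary HTTP requests using the PURGE method. Assuming the ACLs shown later on this page (which only allow purges from localhost), you can simulate one by hand from the server itself, e.g. with curl (the URL is an example):

curl -X PURGE http://127.0.0.1/wiki/Main_Page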
The architecture
How to set up a combination of Squid, Apache and MediaWiki on a single server is outlined below. It is possible to use a more complex caching strategy or different port numbers and IP addresses, but this simple example strives for the following single-server architecture:
Outside world  <--->  [ Squid (public IP:80)  --->  Apache (127.0.0.1:80) ]
                                       Server
To the outside world, Squid will seem to act as the webserver. In reality, it passes on requests to the Apache webserver, but only when necessary. Apache runs on the same server, but it only listens to requests from localhost (127.0.0.1). Rest assured, running both services on port 80 will not cause conflicts, since both services are bound to different IP addresses.
Setting it up like this means Apache cannot be accessed from the outside world directly, only through Squid. With this configuration, Apache can only be accessed directly from the console of the server it runs on. To bypass Squid completely for testing and troubleshooting purposes, one can use Elinks (http://elinks.or.cz/) and browse to http://127.0.0.1/.
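Any client that lets you set the Host header works equally well for bypassing Squid; for instance with curl (the host name is an example):

curl -H 'Host: meta.wikimedia.org' http://127.0.0.1/wiki/Main_Page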
Installation
sudo apt-get install squid3
Configuring Squid 3
Due to its versatility, Squid has a very large squid.conf configuration file. However, only a few settings are relevant when using Squid in accelerator mode.
http_port 207.142.131.205:80 transparent vhost defaultsite=<sitename>
cache_peer 127.0.0.1 parent 80 3130 originserver

acl manager proto cache_object
acl localhost src 127.0.0.1/32

# Allow access to the web ports
acl web_ports port 80
http_access allow web_ports

# Allow cachemgr access from localhost only for maintenance purposes
http_access allow manager localhost
http_access deny manager

# Allow cache purge requests from MediaWiki/localhost only
acl purge method PURGE
http_access allow purge localhost
http_access deny purge

# And finally deny all other access to this proxy
http_access deny all
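A quick, rough way to check that the accelerator is actually caching is to fetch the same page twice and inspect the X-Cache header Squid adds; the second response should show a HIT (the URL is a placeholder):

curl -sI http://<sitename>/wiki/Main_Page | grep -i X-Cache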
Note: MediaWiki's OutputPage::sendCacheControl() function mentions additional rules that should be added to replace Cache-Control headers; see http://wiki.aulinx.de/Cache-Control.
If necessary, Squid 3.1.5 can handle both IPv4 and IPv6 connections. If IPv6 is not of concern to you, ignore the remainder of this section and skip to the common ACL configuration.
http_port <Your external IPv4>:80 defaultsite=<Your DNS sitename> vhost
http_port [<Your external IPv6>]:80 defaultsite=<Your DNS sitename> vhost
cache_peer 127.0.0.1 parent 80 0 no-query originserver round-robin name=wiki
where multiple outside IP addresses may be listed, one per line, in either IPv4 or IPv6 protocol:
http_port [2001:db8::2]:80 vhost defaultsite=example.org
http_port [2001:db8::123:456]:80 vhost defaultsite=example.org
Note that, as Squid handles the task of listening for all outside connections and Apache merely sits behind it on a local loopback address (127.0.0.1:80), it is not necessary to configure Apache to be IPv6-aware in this instance.
If you intend your wiki to be IPv6-compatible and are using Squid, only your cache server (Squid in this instance), your domain name server (IN AAAA records) and your network configuration (ifconfig, route) need to contain IPv6-specific information.
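A rough way to verify the IPv6 side once this is in place (the domain is an example) is to check the AAAA record and then fetch the site over IPv6 explicitly:

dig AAAA example.org +short
curl -6 -I http://example.org/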
Configuring Apache
The Apache webserver now needs to be configured to listen only on localhost port 80. The file httpd.conf (or possibly ports.conf) should contain the following line:
Listen 127.0.0.1:80
and, if you are using virtual hosts, also lines like:
NameVirtualHost 127.0.0.1:80
<VirtualHost 127.0.0.1:80>
    ServerName meta.wikimedia.org
    ...
</VirtualHost>
Please see http://wiki.apache.org/httpd/CouldNotBindToAddress for more on troubleshooting this step.
If Apache is issuing the header
Vary: cookie
then caching will not be effective. You can stop this behaviour by adding the following to httpd.conf:
SetEnv force-no-vary
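After restarting Apache you can verify, e.g. with curl, that the header is gone (the URL is an example):

curl -sI http://127.0.0.1/wiki/Main_Page | grep -i Vary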
Configuring MediaWiki
When configuring MediaWiki, act as if there were no Squid: use the server name the outside world uses instead of the internal IP address. E.g., use "meta.wikimedia.org" as the server name instead of "127.0.0.1".
Since Squid makes its requests from localhost, Apache will receive "127.0.0.1" as the direct remote address. However, when Squid forwards a request to Apache, it adds an "X-Forwarded-For" header containing the remote address as received by Squid. This way the remote address from the outside world is preserved.
By default MediaWiki will use the direct remote address for recent changes and so on, so it must be configured to use the "X-Forwarded-For" header instead in order to function correctly. Make sure the LocalSettings.php file contains the following lines:
$wgUseSquid = true;
$wgSquidServers = [ '<your IPv4 address>' ];
$wgSquidServersNoPurge = [ '127.0.0.1' ];
This ensures both that addresses internal to your network (such as the Squid server or the 127.0.0.1 loopback) do not appear in MediaWiki's recent changes, and that notifications to discard changed pages are sent to Squid (not Apache).
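To illustrate what Apache and MediaWiki see (the addresses are invented for this example): when a client at 203.0.113.5 requests a page, Squid connects from 127.0.0.1, and the request Apache receives looks roughly like:

GET /wiki/Main_Page HTTP/1.0
Host: meta.wikimedia.org
X-Forwarded-For: 203.0.113.5

With the settings above, MediaWiki trusts 127.0.0.1 as a proxy and attributes the request to 203.0.113.5.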
Statistics
In this setup, Squid will shield Apache from most of the traffic. Therefore, if you need reliable web statistics from a package such as AWStats, you will need to set it up to analyze Squid's access_log instead of Apache's.
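With Squid 2.6 or later, you can alternatively have Squid write its access log directly in Apache's "combined" format, which packages such as AWStats understand natively; a sketch for squid.conf (the log path is an example):

logformat combined %>a %ui %un [%tl] "%rm %ru HTTP/%rv" %>Hs %<st "%{Referer}>h" "%{User-Agent}>h" %Ss:%Sh
access_log /var/log/squid/access.log combined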
Squid 2.6 Configuration Settings
Squid 2.6 has simplified the HTTP accelerator configuration, and these settings should work:
http_port 10.10.10.1:80 defaultsite=<Your DNS sitename> vhost
cache_peer 127.0.0.1 parent 80 0 no-query originserver round-robin name=wiki
acl mySites dstdomain <Your DNS sitename> <any other vhosts>
cache_peer_access wiki allow mySites
cache_peer_access wiki deny all
http_access allow mySites
Also, a URL rewriter is not necessary for redirecting from *.com and *.net domains to your *.org domain: if you have $wgServer set in your LocalSettings.php, MediaWiki will take care of this for you.
Apache 2.x Logfile Settings
The Apache webserver only sees requests coming from "127.0.0.1:80". Within Apache you can use the "X-Forwarded-For" header, which is provided by Squid, e.g. in a custom logfile format. The sample below is similar to the "combined" format.
Settings
mod_log_config.conf
LogFormat "%{X-Forwarded-for}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" cached
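To actually write logs in this format, reference the "cached" nickname from a CustomLog directive (the path is an example):

CustomLog /var/log/apache2/access.log cached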
squid.conf
forwarded_for on
See also
- Instructions on using Apache's mod_disk_cache with MediaWiki
- Additional information about installing squid3
Reinstallation
Please note that NEW squid servers need to be set up by someone who understands the full setup, as a number of settings have to be configured. The instructions below therefore cover reinstallation only.
To reinstall a previously existing squid server:
- Reinstall the server OS.
- After boot, copy the old SSH host key back using scp -o StrictHostKeyChecking=no files hostname:/etc/ssh/
- The host keys should all be saved on tridge in the /data/hostkeys/ directory.
- Follow the instructions on Puppet#Reinstalls
- You will have to run puppet a couple of times; expect a dependency error until the configuration files from the step below have been deployed:
- Deploy the Squid configuration files on fenari:
# cd /home/w/conf/squid
# ./deploy servername
- If the system has been offline for over 2 hours, its cache will need to be cleaned with:
/etc/init.d/squid clean
- Manually run puppet update and ensure system is still online.
- Check the LVS server to ensure the system is fully online.
- Check the CacheManager interface for open connections, and ensure they normalize on the reinstalled squid BEFORE taking any more offline.
Deploying more Squids
17:53:02 * mark extends the squid configuration (text-settings.php) with config for the new squids
17:57:38 * mark deploys the squid configs to the new squid hosts only, so puppet can do its task. old config remains on the rest of the squids, so they're still unaffected
17:57:52 <mark> (I promised Rob to show every step of squid deployment in case anyone's wondering ;)
17:58:23 * mark checks whether MediaWiki is setup to recognize the new squids as proxies (CommonSettings.php)
17:58:55 <mark> yes it is
18:01:42 * mark checks whether puppet has initialized the squids; i.e. both squid instances are running, and the correct LVS ip is bound
18:03:11 <mark> where puppet hasn't run yet since the squid config deploy, I trigger it with "puppetd --test"
18:04:10 <mark> they've all nicely joined ganglia as well
18:08:56 <mark> alright, both squid instances are running on the new text squids
18:09:06 <mark> time to setup statistics so we can see what's happening and we're not missing any requests in our graphs
18:09:15 <mark> both torrus and cricket
18:11:29 <mark> cricket done...
18:14:58 <mark> torrus done as well
18:15:03 * mark watches the graphs to see if they're working
18:15:22 <mark> if not, probably something went wrong earlier with puppet setup or anything
18:17:45 <mark> in the mean time, backend squids are still starting up and reading their COSS partitions (which are empty), which takes a while
18:17:48 <mark> nicely visible in ganglia
18:21:32 <mark> alright, all squids have finished reading their COSS partition, and torrus is showing reasonable values in graphs
18:21:43 <mark> so all squids are correctly configured and ready for service
18:21:50 <mark> but they have EMPTY CACHES
18:22:11 <mark> giving them the full load now, would mean that they would start off with forwarding every request they get onto the backend apaches
18:22:51 <mark> I am going to seed the caches of the backend squids first
18:22:55 <mark> we have a couple of ways of doing that
18:23:19 <mark> first, I'll deploy the *new* squid config (which has all the new backend squids in it) to *one* of the frontend squids on the previously existing servers
18:23:33 <mark> that way that frontend squid will start using the new servers, and filling their caches with the most common requests
18:23:44 <mark> let's use the frontend squid on sq66
18:24:38 * mark runs "./deploy sq66"
18:24:52 <mark> so only sq66 is sending traffic to sq71-78 backend squids now
18:25:02 <mark> which is why they're all using approximately 1% cpu
18:25:31 <mark> now we wait a while and watch the hit rate rise on the new backend squids
18:25:51 <mark> e.g. http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2Fsq77.wikimedia.org%2Fbackend%2FPerformance%2FHit_ratios
18:29:26 <mark> no problems visible in the squid logs either
18:32:23 <mark> each of the new squids is serving about 1 Mbps of backend traffic
18:37:10 <mark> the majority of all requests are being forwarded to the backend... let's wait until the hit ratio is a bit higher
18:38:04 <mark> I'll deploy the config to a few more frontend squids so it goes a bit faster
18:54:02 <mark> sq77 is weird in torrus
18:54:10 <mark> it reports 100% request hit ratio and byte hit ratio
18:54:29 <mark> and is still empty in terms of swap..
18:54:33 * mark investigates
18:54:51 <mark> it's not getting traffic
18:58:27 <mark> I think that's just the awful CARP hashing algorithm :(
18:58:31 <mark> it has an extremely low CARP weight
19:04:21 * mark deploys the new squid conf to a few more frontend squids
19:05:29 <mark> they're getting some serious traffic now
19:29:03 <mark> ok
19:29:07 <mark> hit rate is up to roughly 45% now
19:29:30 <mark> swap around 800M per server, and around 60k objects
19:29:40 <mark> I'm confident enough to pool all the new backend squids
19:29:46 <mark> but with a lower CARP weight (10 instead of 30)
19:32:31 <mark> in a few days, when all the newe servers have filled their caches, we can decommision sq40 and lower
19:33:02 * mark watches backend requests graphs and hit ratios
19:40:05 <mark> looks like the site is not bothered at all by the extra load - the seeding worked well
19:40:08 <mark> nice time to get some dinner
19:40:19 <mark> afterwards I'll increase the CARP weight, and pool the frontends
19:40:38 <mark> ...and then repeat that for upload squids
20:19:18 <mark> ok.. the hit ratio is not high enough to my liking, but I can go on with the frontends
20:19:26 <mark> the frontend squids are pretty much independent from the backends
20:19:35 <mark> we need to seed their caches as well, but it's quick
20:20:05 <mark> they don't like to get an instant 2000 requests/s from nothing when we pool them in LVS, so pretty much the only way we can mitigate that is to pool them with low load (1)
20:29:47 <mark> ok, frontend text squids now fully deployed
Configuration
Configuration is done by editing the master files in /home/wikipedia/conf/squid, then running make to rebuild the configuration files, and ./deploy to deploy them to the remote servers. The configuration files are:
- squid.conf.php
- Template file for the cache (backend) instances
- frontend.conf.php
- Template file for the frontend instances
- text-settings.php
- A settings array which applies to text squids. All elements in this array will become available as variables during execution of squid.conf.php and frontend.conf.php. The settings array can be used to give server-specific configuration.
- upload-settings.php
- Same as text-settings.php but for upload squids
- common-acls.conf
- ACL directives used by both text and upload frontends. Use this to block clients from all access.
- upload-acls.conf
- ACL directives used by upload frontends. Use this for e.g. image referrer regex blocks.
- text-acls.conf
- ACL directives used by text frontends. Use this for e.g. remote loader IP blocks.
- Configuration.php
- Contains most of the generator code
- generate.php
- The script that the makefile runs
The configs are under version control using git.
The deployment script has lots of options. Run it with no arguments to get a summary.
Changing configuration
Note: remember to ssh to fenari with agent forwarding (-A).
# cd /home/w/conf/squid
Edit *-settings.php or *-acls.php
# make
To see the changes in the generated configuration vs what should be already deployed, run:
$ diff -ru deployed/ generated/
If these changes are OK, you can deploy them, either to all servers at once or to a subset, and either quickly or slowly. See ./deploy -h for all possible options.
# ./deploy all
Using this invocation, the script will copy the newly generated config files into the deployed/ directory, rsync them to the puppetmaster, and then scp them to each server and reload the squid process(es).
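For example, to try a change on a single server first, as in the deployment transcript above:

# ./deploy sq66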
# git commit -m "A meaningful commit message"
You should always commit your changes to git to allow for history tracking and rollback.
Current problems
None? :-)
Monitoring
You can get some nice stats about the squids at http://noc.wikimedia.org/cgi-bin/cachemgr.cgi (user name root; the password is in the squid configuration file). Each squid is listed twice in the drop-down, once for the frontend instance and once for the backend. The backend's Peer Cache stats are especially handy.
Debugging
To see HTTP requests sent from Squids to their backend, install ngrep and run e.g.:
# ngrep -W byline port 80 and dst host ms4.wikimedia.org
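If ngrep is not installed, tcpdump can show roughly the same thing (same example host):

# tcpdump -A -s0 port 80 and dst host ms4.wikimedia.org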
HowTo
Edit ACLs
In /home/w/conf/squid:
- edit text-acls.conf
- run make
- run ./deploy all
Purge a URL
On terbium, run:
echo 'https://example.org/foo?x=y' | mwscript purgeList.php
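Since purgeList.php reads URLs from standard input, a whole list can be purged the same way (the file name is an example):

cat urls.txt | mwscript purgeList.php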
See also
- MediaWiki caching -- some cache headers explained
- Multicast HTCP purging -- new method of cache purging
- Squid logging
- Squid log format
- http://wiki.squid-cache.org/SquidFaq/
- http://httpd.apache.org/docs-2.2/mod/mod_log_config.html