User:Bhartshorne/ops meeting notes 2011-08-24

switching masters
  • update db.php to mark the cluster read only (near the bottom of the file)
  • deploy the new db.php
  • read Switch_master
  • script in /home/w/src/mediawiki/tools/switch-master (rough flow sketched below)
    • only works when the master is up
    • assumes ~/.my.cnf contains the root mysql password
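
the flow, roughly: mark the cluster read-only in db.php, let the candidate catch up, promote it, repoint the slaves. a minimal sketch in python (NOT the real switch-master script; hostnames are placeholders, and the mysql client picks up credentials from ~/.my.cnf as the real script expects):

    """Rough sketch of a master rotation; not the real switch-master script."""
    import subprocess
    import time

    OLD_MASTER = "db1001.example"   # hypothetical hostnames
    NEW_MASTER = "db1002.example"

    def mysql(host, sql):
        """Run a statement via the mysql client; credentials come from ~/.my.cnf."""
        out = subprocess.run(["mysql", "-h", host, "-e", sql],
                             capture_output=True, text=True, check=True)
        return out.stdout

    # 1. stop writes on the old master (the cluster is already read-only in db.php)
    mysql(OLD_MASTER, "SET GLOBAL read_only = 1;")

    # 2. wait for the candidate to replay everything it has from the old master
    while "Seconds_Behind_Master: 0" not in mysql(NEW_MASTER, "SHOW SLAVE STATUS\\G"):
        time.sleep(1)

    # 3. promote the candidate: stop replication and allow writes
    mysql(NEW_MASTER, "STOP SLAVE; RESET SLAVE; SET GLOBAL read_only = 0;")

    # remaining slaves would then be re-pointed with CHANGE MASTER TO, and db.php
    # updated to name the new master and lift the read-only flag.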


when a master crashes, we never wait for crash recovery. we always rotate in a new master.

dns

see PowerDNS and DNS

pdns has 3 backends - order is pipe, geo, bind

  • bind
    • in svn, yadda yadda.
  • geo
    • ip-map - pulled from an external source on the net. maps IP ranges to countries; each range resolves to a 127.x.x.x address whose last octet is the ISO country code.
    • geo-maps - maps that ISO code (from ip-map) to the localized name to hand back.
    • resolution is based on the source IP of the DNS resolver querying our authoritative server (see the sketch after this list)
    • powerdns/scenarios/*
      • there are three scenario files: one for normal operation and one for each datacenter being down. the active one is selected by symlinking to the right file.
  • pipe
    • if the query's source address is in a select list of participants, the pipe backend returns an IPv6 (AAAA) response for upload.esams.wikimedia.org in addition to the IPv4 response.
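
the geo lookup chain (resolver source IP -> ip-map entry -> country code -> geo-maps -> localized name) can be modeled like this. the file formats, codes, and hostnames below are made up for illustration; only the flow comes from the notes:

    """Toy model of the pdns geo backend lookup path; formats and values are invented."""
    import ipaddress

    # ip-map: IP range -> 127.x.x.x marker whose last octet encodes the country
    IP_MAP = [
        (ipaddress.ip_network("192.0.2.0/24"), "127.0.0.31"),     # 31: made-up code
        (ipaddress.ip_network("198.51.100.0/24"), "127.0.0.76"),  # 76: made-up code
    ]

    # geo-maps: country code -> the localized name to hand back
    GEO_MAPS = {
        31: "text.esams.wikimedia.org",
        76: "text.pmtpa.wikimedia.org",
    }
    DEFAULT = "text.pmtpa.wikimedia.org"

    def resolve(resolver_ip):
        """Pick a target based on the resolver's source address - all we ever see."""
        addr = ipaddress.ip_address(resolver_ip)
        for network, marker in IP_MAP:
            if addr in network:
                country_code = int(marker.rsplit(".", 1)[1])  # last octet
                return GEO_MAPS.get(country_code, DEFAULT)
        return DEFAULT

    print(resolve("192.0.2.53"))   # -> text.esams.wikimedia.org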

caching

pmtpa

  • no ICP - squids aren't peers. cache affinity is by URL hashing.
  • frontend squid and backend squid (coresident on the same host)
  • frontend
    • has a 100MB in-memory cache. this is duplicated across all the frontend squids (i.e. the same hot objects end up in all of them)
    • uses CARP to hash the URL and pass the request to a specific backend squid (see the sketch after this list)
    • serves about 50% of our page load
  • backend
    • disk-backed cache
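
the CARP idea: every frontend computes the same deterministic score for each (URL, backend) pair and picks the winner, so a given URL always lands on the same backend cache. a simplified sketch (real squid CARP uses a specific 32-bit hash plus per-member load factors; the hostnames are made up):

    """Simplified CARP-style URL -> backend selection."""
    import hashlib

    BACKENDS = ["sq31.pmtpa", "sq32.pmtpa", "sq33.pmtpa"]   # hypothetical backends

    def carp_backend(url, backends=BACKENDS):
        """Same score everywhere, so every frontend agrees on the backend for a URL."""
        def score(backend):
            return int(hashlib.md5((backend + url).encode()).hexdigest(), 16)
        return max(backends, key=score)

    print(carp_backend("http://en.wikipedia.org/wiki/Main_Page"))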

esams

  • same frontend setup as pmtpa
  • backend misses go to a specific pmtpa backend (chosen with the same hashing algorithm) rather than going through pmtpa's frontends.

different services

  • text: actual wiki text
  • upload: static images
  • bits: javascript, css - things that don't change
    • uses varnish, not squid
    • entire dataset fits in memory
    • also hosts geoiplookup.wikimedia.org

API

  • api requests use the text hostname, so they share the text frontends
  • the frontends hash all API URLs to a different set of squid backends
    • the backend squids in the text cluster are split: some serve text, some serve API (see the sketch after this list)
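
a toy model of that split; the pool names and the path test are illustrative, and within the chosen pool the URL would then be CARP-hashed to one backend as in the earlier sketch:

    """Sketch of routing API vs. regular page requests to different backend pools."""
    TEXT_POOL = ["sq31.pmtpa", "sq32.pmtpa"]   # hypothetical backend names
    API_POOL = ["sq59.pmtpa", "sq60.pmtpa"]    # hypothetical backend names

    def pick_pool(path):
        # api.php requests go to the API backends, everything else to text
        return API_POOL if path.startswith("/w/api.php") else TEXT_POOL

    print(pick_pool("/w/api.php?action=query&titles=Main_Page"))  # -> API pool
    print(pick_pool("/wiki/Main_Page"))                           # -> text pool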

mobile

  • separate cluster, all running varnish 3

cache expiration

  • we don't rely on expiration times. we want to expire the page when it changes, not after a timeout
  • htcp is similar to icp (squid's peering protocol) - udp-based cache purging.
    • works with multicast
  • varnish - a daemon also listens for the same packets and obeys the same purge messages
  • thumbnail purging is different
    • the nginx servers run a daemon that listens for htcp (see the listener sketch after this list)
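
the transport side is just multicast UDP. a minimal listener sketch, not an HTCP implementation - a real purge daemon (and varnish's listener) decodes the HTCP CLR message to get the URL to purge; the group and port below are placeholders:

    """Minimal multicast UDP listener; real daemons parse HTCP CLR packets here."""
    import socket
    import struct

    GROUP = "239.128.0.112"   # placeholder multicast group
    PORT = 4827               # placeholder port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # join the multicast group on all interfaces
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, sender = sock.recvfrom(65535)
        # a real listener would decode the HTCP header and CLR payload, then
        # purge the contained URL from its local cache
        print("%s: %d byte purge packet" % (sender[0], len(data)))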

Purging

  • htcp, etc.
  • you can also add a URL parameter to a page request to convince mediawiki to purge the page (?action=purge - see the example below)
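
a one-off purge of a single page by URL looks something like this (standard library only; whether an anonymous purge needs a POST or shows a confirmation form depends on the mediawiki version and config):

    """Purge a single page via the purge action."""
    import urllib.request

    url = "https://en.wikipedia.org/wiki/Main_Page?action=purge"
    req = urllib.request.Request(url, method="POST")   # POST rather than GET for anonymous purges
    print(urllib.request.urlopen(req).status)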

Mobile

  • m.wiki is a collection of ruby (not on rails) servers (fronted by LVS).
  • ruby forwards to the frontend squids
  • mediawiki has a 'MobileFrontend' extension
  • the normal squids have an ACL that does some device detection and redirects you to m.wiki if you match (modeled in the sketch below).
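
the redirect logic, modeled in python (the real check is squid configuration, not application code; the UA pattern and hostnames are illustrative):

    """Toy version of the squid ACL: redirect mobile user agents to m.wiki."""
    import re

    MOBILE_UA = re.compile(r"iPhone|Android|BlackBerry|Opera Mini", re.I)

    def maybe_redirect(host, path, user_agent):
        """Return a redirect target for mobile clients, or None to serve normally."""
        if MOBILE_UA.search(user_agent) and ".m." not in host:
            return "http://" + host.replace(".", ".m.", 1) + path
        return None

    print(maybe_redirect("en.wikipedia.org", "/wiki/Main_Page",
                         "Mozilla/5.0 (iPhone; ...)"))  # -> http://en.m.wikipedia.org/wiki/Main_Page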

New system:

  • mobile starts at varnish boxes
    • similar frontend / backend setup to the squids
    • the daemon processing htcp only purges the backend; the frontend has a 300s cache timeout
  • the varnish boxes detect what mobile device you're running and set an X-Mobile header (in VCL; modeled in the sketch after this list)
  • goes straight to the app servers (does not go through the squids)
  • the URL coming into mobile is en.m.wiki.../normal/path. that's translated to the normal URL in varnish (not sure in which layer) so the app servers get the same URL for both mobile and non-mobile requests
    • can differentiate by the x-mobile header
  • the app servers set no-cache if the request is coming from squid (legacy behavior)
  • css/javascript/etc all comes from bits, same as the regular site.
    • the MobileFrontend extension uses ResourceLoader to pull in the appropriate js/css files for mobile as part of the page
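
the real logic lives in VCL on the mobile varnish boxes; this models the two transformations described above - flag mobile devices with an X-Mobile header and strip the .m. from the hostname so the app servers see one URL for both mobile and desktop (the UA pattern is illustrative):

    """Model of the mobile varnish request normalization."""
    import re

    MOBILE_UA = re.compile(r"iPhone|Android|BlackBerry|Opera Mini", re.I)

    def normalize(host, path, user_agent):
        headers = {}
        # device detection -> X-Mobile header (done in VCL in production)
        headers["X-Mobile"] = "true" if MOBILE_UA.search(user_agent) else "false"
        # en.m.wikipedia.org/wiki/Foo -> en.wikipedia.org/wiki/Foo, so the app
        # servers get an identical URL for mobile and desktop requests
        return host.replace(".m.", ".", 1), path, headers

    print(normalize("en.m.wikipedia.org", "/wiki/Main_Page",
                    "Mozilla/5.0 (iPhone; ...)"))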

swift

questions for russ:

  • does the SwiftMedia extension support chunked uploads?
  • can we split out thumbs and originals into separate containers? it's been useful in the past
    • we already split out per project...
  • stashed media?
  • archived / deleted files?
  • when migrating files into swift (for the initial deploy) can we keep the current timestamp?
    • this is important, but we can live without it if it's a real PITA