Swift/Deploy Plan - Originals Part 2

From Wikitech

Last steps for move from ms7 (media server for originals) to swift for reads

what we need to do to switch upload/foo originals to read from swift

Blacklist approach

1. block authenticated requests to swift at squid:

I hate blacklists instead of whitelists, but this will probably do the job:
  • reject swift authenticated requests. probably should do some header checking too...
    acl swift_auth urlpath_regex ^/(auth|v[^/]+/AUTH).*
    http_access deny swift_auth
    might be better to use the same pattern as in rewrite.py (which specifically catches 32 to 36 hex chars or hyphens):
        # If it already has AUTH, presume that it's good. #07. fixes bug 33620
        hasauth = re.search('/AUTH_[0-9a-fA-F-]{32,36}', req.path)
        if req.path.startswith('/auth') or hasauth:
            return self.app(env, start_response)
  • why? there's no way ^/v[^/]+/AUTH.* is going to match any files. I don't like encoding the same logic over and over across config and systems. What if we change the length of the token at some point? will we remember to change squid.conf too?
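
To make the trade-off in this exchange concrete, here is a minimal sketch that runs both patterns over a few paths using Python's re module. The sample paths and the two helper function names are made up for illustration; only the two regexes come from the notes above. It shows that the broad squid pattern also catches AUTH-style paths that rewrite.py would not treat as authenticated.

```python
import re

# Broad pattern from the proposed squid ACL (applied to the URL path).
SQUID_AUTH = re.compile(r'^/(auth|v[^/]+/AUTH).*')
# Narrower pattern quoted from rewrite.py.
REWRITE_AUTH = re.compile(r'/AUTH_[0-9a-fA-F-]{32,36}')


def squid_would_deny(path):
    """True if the broad squid ACL pattern matches this URL path."""
    return SQUID_AUTH.search(path) is not None


def rewrite_passes_through(path):
    """True if the rewrite.py check would hand the request straight to swift."""
    return path.startswith('/auth') or REWRITE_AUTH.search(path) is not None


# Hypothetical sample paths, for illustration only.
SAMPLES = [
    '/auth/v1.0',
    '/v1/AUTH_' + 'a' * 32 + '/monitoring/pybaltestfile.txt',
    '/wikipedia/commons/a/ab/Example.jpg',
    '/v1/AUTH_short/container/object',
]

for p in SAMPLES:
    print(p, squid_would_deny(p), rewrite_passes_through(p))
```

The last sample is the interesting one: the squid ACL denies it even though rewrite.py would not recognise it as authenticated, so the broad pattern errs on the side of blocking and does not depend on the token length.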

2. Auth URLs are not enough to protect swift. We should probably filter out X-Authenticate headers as well as non-GET/HEAD methods.

Can't Swift protect itself based on IP ranges?
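
As a rough illustration of the filtering rule in item 2, the sketch below expresses it as a plain Python predicate. It is not squid config and not part of rewrite.py; the function name is hypothetical, and treating every header whose name starts with "x-auth" as an auth header is an assumption made for this example.

```python
# Only these methods should ever reach swift through the caches.
ALLOWED_METHODS = {'GET', 'HEAD'}

# Assumption: any header whose name starts with this prefix counts as an
# authentication header and should cause the request to be rejected.
AUTH_HEADER_PREFIXES = ('x-auth',)


def should_reject(method, headers):
    """Return True if the request should be blocked before reaching swift."""
    if method.upper() not in ALLOWED_METHODS:
        return True
    return any(name.lower().startswith(AUTH_HEADER_PREFIXES) for name in headers)


print(should_reject('GET', {'Host': 'upload.wikimedia.org'}))
print(should_reject('PUT', {'Host': 'upload.wikimedia.org'}))
print(should_reject('GET', {'X-Auth-Token': 'dummy'}))
```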

3. check swift acls so that private wikis' "public" containers are not world readable

  • eg office wiki is set correctly
    • what's the full list of "private" wikis?
  • method of testing:
    • swift -A http://127.0.0.1/auth/v1.0 -U mw:thumb -K $pass stat wikipedia-office-local-public
    • shows blank Read ACL: instead of mw:thumb,.r:*
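
If the list of private wikis turns out to be long, the check above could be scripted. The sketch below only parses text that has already been captured from the stat command; the sample strings are hypothetical stand-ins for real output, and the function names are invented for this example.

```python
def read_acl_from_stat(stat_output):
    """Return the value of the 'Read ACL' line, or None if there is no such line."""
    for line in stat_output.splitlines():
        key, sep, value = line.partition(':')
        if sep and key.strip() == 'Read ACL':
            return value.strip()
    return None


def is_world_readable(read_acl):
    """True if the ACL string contains the '.r:*' (anyone may read) entry."""
    if not read_acl:
        return False
    return any(entry.strip() == '.r:*' for entry in read_acl.split(','))


# Hypothetical captured output for a private and a public container.
PRIVATE_SAMPLE = "Container: wikipedia-office-local-public\n Read ACL:\nWrite ACL: mw:thumb\n"
PUBLIC_SAMPLE = "Container: some-public-container\n Read ACL: mw:thumb,.r:*\n"

print(is_world_readable(read_acl_from_stat(PRIVATE_SAMPLE)))
print(is_world_readable(read_acl_from_stat(PUBLIC_SAMPLE)))
```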

4. upload URLs to swift:

-> so use urlpath_regex instead of url_regex? (I think it's still upload; I'm not sure why tcpdump didn't give me the hostname. The interesting part is that the URL is anchored at math instead of project/language/math.)
proxy vs non-proxy requests? yes, but who doesn't use the proxy-style url when asking an image scaler a question? mediawiki?
  • exception: graphs and timeline extensions
  • all of these caught by:
    +# math extension still requires NFS.  send these to ms7 until we can fix that.
    +acl ms7_math          urlpath_regex       ^/[^/]+/[^/]+/(graphs|math|timeline)/.*
    +cache_peer_access 10.0.0.246    allow ms7_math
    +cache_peer_access 10.0.0.246    deny all
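
A quick way to sanity-check which paths the ms7_math pattern catches is to run it through Python's re module, as sketched below. The sample paths and the function name are hypothetical. Note the third sample: a URL anchored at math rather than project/language/math, as mentioned above, is not matched by this pattern.

```python
import re

# Pattern copied from the ms7_math ACL above.
MS7_PATHS = re.compile(r'^/[^/]+/[^/]+/(graphs|math|timeline)/.*')


def goes_to_ms7(path):
    """True if the ACL pattern would route this URL path to ms7."""
    return MS7_PATHS.search(path) is not None


# Hypothetical sample paths, for illustration only.
SAMPLES = [
    '/wikipedia/en/math/a/b/c/abc.png',
    '/wikipedia/en/timeline/abc.png',
    '/math/a/b/c/abc.png',
    '/wikipedia/commons/thumb/a/ab/Foo.jpg/120px-Foo.jpg',
]

for p in SAMPLES:
    print(p, goes_to_ms7(p))
```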

5. Other stuff that's on ms7 that swift doesn't (yet?) have (this list comes from ls /mnt/upload6/):

See Cruft on ms7 for this list with annotations, and make changes there.

  • dirs
  • math (as mentioned above)
  • ext-dist
  • jars
  • portal
  • private
  • skins
  • scripts
  • sync-from-home (aka scripts) (this can go)
  • lost-image-thumb-backup
  • files
  • pybaltestfile.txt !!! <-- swift does have this at monitoring/pybal.txt; see lvs.pp for full URL but rewrite.py will probably have to be modified to serve this file directly
  • robots.txt
  • index.html
  • favicon.ico
  • x1
  • mime.php

overall approach:

  • set swift as a default target and blacklist things that shouldn't get there?
    • must blacklist auth, things that haven't moved to swift yet
  • whitelist each thing we transition from ms7 to swift leaving the default on ms7?
    • then watch ms7's traffic and see when stuff stops coming in

Whitelist (or combination) approach:

still blacklist auth attempts, just cuz we can

  • acl swift_auth1 urlpath_regex ^/auth
  • acl swift_auth2 urlpath_regex v[0-9]+/AUTH_.*
  • acl swift_auth3 urlpath_regex AUTH_[0-9a-fA-F-]{32,36}
    needs testing, does {digits} work?
  • http_access deny swift_auth1
  • http_access deny swift_auth2
  • http_access deny swift_auth3
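
The three patterns can be tried out offline with a small table-driven check like the sketch below. This only exercises the expressions under Python's re engine; squid uses its own regex library, so it does not settle the question of whether {32,36} works in squid.conf. The sample paths and the function name are made up for illustration.

```python
import re

# The three ACL patterns from the list above, in the same order.
ACLS = [
    ('swift_auth1', re.compile(r'^/auth')),
    ('swift_auth2', re.compile(r'v[0-9]+/AUTH_.*')),
    ('swift_auth3', re.compile(r'AUTH_[0-9a-fA-F-]{32,36}')),
]


def matching_acls(path):
    """Return the names of all ACLs whose pattern matches the URL path."""
    return [name for name, pattern in ACLS if pattern.search(path)]


# Hypothetical sample paths, for illustration only.
SAMPLES = [
    '/auth/v1.0',
    '/v1/AUTH_' + 'a' * 32 + '/container/object',
    '/v1/AUTH_short/container/object',
    '/wikipedia/commons/a/ab/Example.jpg',
]

for p in SAMPLES:
    print(p, matching_acls(p))
```

Any path that returns a non-empty list would be denied by one of the http_access rules.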

existing squid acl for thumbs:

squid acl for originals: (Aaron's suggestion)

Prep first:

  • test that non-HEAD/GET requests to upload are rejected. we think so based on front end config but best to try some requests
  • see if swift is really missing a pile of thumbs (how?) -- skipping and hoping

How to test on a single squid

  • Want to take it out of front and backend service.
    • Out of front end: enabled = false in pybal config
      why frontend? it shouldn't matter. backend is what we need
      indeed, frontend doesn't need depooling
      if the test squid front end is active, it will have itself listed in the backend list I think, since when we deploy the files generated without the test squid in the conf, its own file isn't regenerated.
    • Out of back end: ?
      (after irc discussion) remove squid from config file, generate, edit the front end config file for the test squid to remove itself from the back end list, push this change to *all* (so esams gets the update) (and to the specific host manually maybe?) note if you don't fix it on the puppet master, puppet will overwrite your edited file on the test squid frontend :-/
      revert the removal, make back end config changes for our test, generate
      after generating squid configs, deploy only to that host, 'cache' as type

No lab project we can test squid confs for swift in, right?

Do these later (not during this window)?

squid acl for math (once it lives on swift; requires rewrite.py changes)

squid acl for ext-dist: (requires rewrite.py changes)

etc...

Tests (and expected results)

  • random /math/: MISS, 200 OK, Sun-Java-System-Web-Server
  • random thumb: MISS, 200 OK, swift (X-Object-Meta-Sha1base36)
  • random orig: MISS, 200 OK, swift
  • random /archive/: MISS, 200 OK, swift
  • request using swift syntax to monitoring container/file MISS (403 from squid)
    curl -v -H "User-Agent: benfoo" -H "Host: upload.wikimedia.org" http://sq51.wikimedia.org:3128/v1/AUTH_XXXX/monitoring/pybaltestfile.txt
  • OPTIONS on backend: 405 method not allowed (presumably from swift); OPTIONS on frontend 403 forbidden
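
When running these tests by hand, it can help to classify each response by its headers. The sketch below guesses the backend from the two markers mentioned in the list: a Sun-Java-System-Web-Server Server header for ms7 and an X-Object-Meta-Sha1base36 header for swift. Treating those two markers as sufficient is an assumption; responses carrying neither are reported as unknown and not guessed at. The function name and the sample header dicts are hypothetical.

```python
def guess_backend(headers):
    """Guess which backend served a response from its headers (dict of name -> value)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Assumption: ms7 identifies itself through the Server header.
    if 'sun-java-system-web-server' in lowered.get('server', '').lower():
        return 'ms7'
    # Assumption: only swift-served objects carry this metadata header.
    if 'x-object-meta-sha1base36' in lowered:
        return 'swift'
    return 'unknown'


# Hypothetical response headers, for illustration only.
print(guess_backend({'Server': 'Sun-Java-System-Web-Server/7.0'}))
print(guess_backend({'X-Object-Meta-Sha1base36': 'dummyvalue'}))
print(guess_backend({'Content-Type': 'image/png'}))
```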

What happened

Spike in load on image scalers, much iowait, small load increase on ms5; the scalers eventually became unresponsive. We later found NFS server timeout messages in the logs on the scalers. After the revert this situation continued for a while until eventually load on the scalers dropped sharply, at the same time that load on ms5 returned to normal.

During the same interval, we saw some HTTP requests to ms5 for thumbs and images, about 5 GET requests a second. Note that this is nothing compared to the traffic it used to handle, about 40/sec.

However... ms5 is 100% full (there's about 101 GB free). We already know from experience that the more thumbs are in these directories, the slower it gets, and that we will eventually see NFS timeouts. I don't know if we've ever run this close to the edge.