Swift/Open Issues Aug - Sept 2012

From Wikitech

Open Issues after upgrade and media originals reads deployment

  • move other things off of ms7 one at a time
  • Finish building eqiad cluster
    • ms-be1004, ms-be1005 had hardware issues; RobH can give an update
  • Pending MW bugs (Aaron) (could someone file these in BZ?)
    • multiple HEADs (7?) for the same file for moves/deletes
      patch pending to reduce this to ~5
      https://gerrit.wikimedia.org/r/#/c/21485/
      more HEAD requests will go away when we stop doing multibackend write (no more MW consistency check across file backends)
    • HEAD/GET for a thumb instead of just a GET
    • excessively long filenames can fail on thumb requests (MW has a 255-char limit, but 255 Unicode chars double-URL-encoded come to more than 1024 chars, which swift rejects)
      https://bugzilla.wikimedia.org/show_bug.cgi?id=39697
  • statsd
    • updated packages by swiftstack
    • verify everything works in labs and/or in eqiad
    • update the ganglia views to use the new statistics
  • start running swift-recon, incorporate into stats and/or monitoring
  • eqiad container sync
  • hardware problems
    • too many of them (see RT), escalate to Dell (in progress)
  • upgrade to precise
    • eqiad already is
  • redo zones in pmtpa so that a zone represents a rack not a server
    • to move a host to a new zone, remove all of its devices from the ring and re-add them to the ring in the new zone. Format all the disks on the moved host before the new ring takes effect. Move only one host per week, preferably one every other week.
    • Look up hosts in racktables to determine what zone they should be in.
    • move ms-be12 to zone 14
    • move ms-be7 to zone 5
    • move ms-be8 to zone 5
    • move ms-be6 to zone 8
    • ms-be6, 7 and 8 are offline due to hardware problems, so this has seen progress accidentally :)
  • hook up disk failure detection to nagios in a useful way so that we are alerted when a disk needs to be swapped out (rather than having to proactively check).
  • tune the concurrency setting of rsync or of the object replicator (not yet clear which one); there are lots of connection errors in the logs
  • investigate 507s in swift logs
    • maybe correlated with dead disks?
  • cluster in esams and retirement of ms6?
  • (eventually) turn off writes on ms7 and reclaim it for other uses
  • average proxy query duration has risen since moving originals to swift. The average is driven up by a small number of very slow transactions (the 90th and 50th percentiles are not as affected). The object server logs a fast transfer; only the proxy-server logs a long transaction time. What is the proxy server doing that makes it slow? Is it a slow client read? Investigate (the extra statsd timers will be useful here).
  • Originals are typically larger than thumbs, so they typically take more time to transmit to clients, and the latency metric counts time to last packet (rather than first packet). Hence this might be explainable and perfectly OK (i.e. low on my todo list right now). Faidon 00:06, 4 September 2012 (UTC)
  • example1, object server took 0.0234s, proxy-server took 7.161s:
Aug 28 20:53:52 10.0.6.202 object-server xx.xx.xx.xx - - [28/Aug/2012:20:53:52 +0000] "GET /sdl1/37405/AUTH_43651bxxxxxxxxxdfe/wikipedia-commons-local-public.4e/4/4e/Martigny,_ville_romaine_et_moderne,_vestiges_de_canalisations_romaines.jpg" 200 2582570 "http://www.google.it/search?hlxxxxxxxcs" "tx14xxxxxxxxe6a1b8" "Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3" 0.0234
Aug 28 20:53:59 10.0.6.214 proxy-server xx.xx.xx.xx xx.xx.xx.xx 28/Aug/2012/20/53/59 GET /v1/AUTH_436xxxxxxxxx8dfe/wikipedia-commons-local-public.4e/4/4e/Martigny%252C_ville_romaine_et_moderne%252C_vestiges_de_canalisations_romaines.jpg HTTP/1.0 200 http%3A//www.google.it/search%3Fhl%3Dixxxxx;bbbbfCLcs Mozilla/5.0%20%28iPad%3B%20CPU%20OS%205_1_1%20like%20Mac%20OS%20X%29%20AppleWebKit/534.46%20%28KHTML%2C%20like%20Gecko%29%20Version/5.1%20Mobile/9B206%20Safari/7534.48.3 - - 2582570 - tx14xxxxxxa1b8 - 7.1610 -
  • example2, object server took 0.0529s, proxy-server took 11.4569s:
Aug 28 20:53:58 10.0.6.204 object-server xx.xx.xx.xx - - [28/Aug/2012:20:53:58 +0000] "GET /sde1/27370/AUTH_4365xxxxxxxfe/wikipedia-commons-local-public.34/3/34/Martigny,_ville_romaine_et_moderne,_Martigny-Bourg.jpg" 200 2985520 "http://www.google.it/search?hlxxxxxxcs" "txf53fxxxxxxxx973ac" "Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3" 0.0529
Aug 28 20:54:09 10.0.6.214 proxy-server xx.xx.xx.xx xx.xx.xx.xx 28/Aug/2012/20/54/09 GET /v1/AUTH_43xxxxxxxdfe/wikipedia-commons-local-public.34/3/34/Martigny%252C_ville_romaine_et_moderne%252C_Martigny-Bourg.jpg HTTP/1.0 200 http%3A//www.google.it/search%3xxxxxxxLcs Mozilla/5.0%20%28iPad%3B%20CPU%20OS%205_1_1%20like%20Mac%20OS%20X%29%20AppleWebKit/534.46%20%28KHTML%2C%20like%20Gecko%29%20Version/5.1%20Mobile/9B206%20Safari/7534.48.3 - - 2985520 - txf53fxxxxxxxxxc973ac - 11.4569 -
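
One quick way to test the slow-client hypothesis is to divide the bytes sent by the proxy-server duration in the two examples above. The sketch below does only that arithmetic on the figures copied from the log lines; the function name is made up for illustration. It works out to roughly 360 kB/s and 260 kB/s, which would be plausible for a mobile client, though this alone does not rule out a problem inside the proxy.

```python
def effective_throughput_kb_s(bytes_sent: int, duration_s: float) -> float:
    """Average transfer rate in kB/s (1 kB = 1000 bytes) over the whole request."""
    return bytes_sent / duration_s / 1000


# (bytes sent, proxy-server seconds, object-server seconds), copied from the log excerpts
examples = {
    "example1": (2582570, 7.1610, 0.0234),
    "example2": (2985520, 11.4569, 0.0529),
}

for name, (size, proxy_s, object_s) in examples.items():
    rate = effective_throughput_kb_s(size, proxy_s)
    gap = proxy_s - object_s  # time not accounted for by the object server
    print(f"{name}: {rate:.0f} kB/s as seen by the proxy, {gap:.2f}s beyond the object server")
```

If the same ratio holds for most of the slow transactions, the rising average is probably just larger objects going to slow clients.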

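The long-filename item above comes down to simple arithmetic: a non-ASCII character that takes 3 bytes in UTF-8 becomes 9 characters after one round of URL encoding and 15 after a second round (the proxy log lines show the same effect on a comma, logged as `%252C`). The sketch below is only an illustration, using the 255 and 1024 limits quoted in that item and a made-up filename of 255 CJK characters.

```python
from urllib.parse import quote

MW_MAX_CHARS = 255      # MediaWiki filename limit quoted in the list above
SWIFT_MAX_CHARS = 1024  # length beyond which swift fails, as quoted in the list above


def double_encoded_length(name: str) -> int:
    """Length of a filename after being URL-encoded twice."""
    once = quote(name, safe="")
    twice = quote(once, safe="")
    return len(twice)


ascii_name = "a" * MW_MAX_CHARS  # unreserved characters are not escaped
cjk_name = "日" * MW_MAX_CHARS   # hypothetical worst-case-style filename

for label, name in (("ascii", ascii_name), ("cjk", cjk_name)):
    length = double_encoded_length(name)
    print(label, length, "too long" if length > SWIFT_MAX_CHARS else "ok")
```

With these inputs, any filename longer than about 68 such characters would already exceed 1024 once double-encoded.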

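For the zone moves listed above, the ring changes could be prepared as a list of `swift-ring-builder` commands and reviewed before anything is run. This is a rough sketch only: the IP address, port, device names, old zone and weight are all placeholders, and the ordering (remove, rebalance, add, rebalance) is an assumption that should be checked against the installed swift version and its min_part_hours setting. The script only prints commands; it does not touch a ring.

```python
def zone_move_commands(builder, ip, port, devices, old_zone, new_zone, weight):
    """Build the swift-ring-builder commands to move one host to a new zone.

    Nothing is executed; the caller gets a list of command strings to review.
    """
    cmds = []
    # Take every device of the host out of the ring in its old zone.
    for dev in devices:
        cmds.append(f"swift-ring-builder {builder} remove z{old_zone}-{ip}:{port}/{dev}")
    cmds.append(f"swift-ring-builder {builder} rebalance")
    # Re-add the same devices under the new zone (assumed to happen after the disks are formatted).
    for dev in devices:
        cmds.append(f"swift-ring-builder {builder} add z{new_zone}-{ip}:{port}/{dev} {weight}")
    cmds.append(f"swift-ring-builder {builder} rebalance")
    return cmds


# Placeholder values for illustration; look up the real ones in racktables and the ring.
for cmd in zone_move_commands(
    builder="object.builder",
    ip="10.0.0.1",           # hypothetical address
    port=6000,               # assumed object-server port
    devices=["sdb1", "sdc1"],
    old_zone=12,             # hypothetical current zone
    new_zone=14,
    weight=100,              # hypothetical weight
):
    print(cmd)
```

The same would need repeating for the container and account builders, one host at a time as the list says.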
Pointers