User:Bhartshorne/swift tasks 2012-08-13

to complete before 8/28

  • DONE move mediawiki reading originals to swift (aaron)
    • the deploy to test wiki on monday worked.
  • DONE updated squid and swift/ to allow reads for originals (scheduled for monday 8/20)
    • squid change is acl work similar to how thumbnails got moved
    • rewrite does not need changes to accept non-thumbnails and get to the right bucket
  • finish building eqiad cluster
    • ms-be1004 is waiting on a replacement SSD; ETA friday 8/17
    • ms-be1005 doesn't see any of its spinning disks. RobH to investigate
    • it's ok to continue building the cluster without those two hosts.
  • DONE upgrade to 1.5.0 (with ganglia statsd stuff disabled)
    • test in labs (lucid)
      • done. tested fetching existent and nonexistent thumbs. tested with mismatched proxies and storage servers.
    • test on eqiad (precise)
      • tested mixed cluster upgraded by hand. tested container creation, thumb creation, thumb fetching, lost object recovery.
      • need to test puppet rules (scheduled monday)
    • test mediawiki auth - Jan claims MW fails to auth against 1.4.4+. replicate his test, find and fix the problem (if replicable) (aaron)

to start before 8/28

  • sync content
    • test between eqiad-prod cluster and ??? (eqiad-test? labs?)
  • redo zones in pmtpa
  • audit and replace disks across all backends
    • rt-3282 and rt-3432

to do in sept

  • improve reaction-based documentation (instead of feature-based documentation)
    • what to do when a host fails; what to do when a nagios alert triggers (for each nagios alert); etc.
  • improve dead disk detection methods, automate alerting and replacing
    • installed and configured swift-drive-audit to find them.
    • how to hook into nagios?
  • set up swift-recon
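
One possible answer to "how to hook into nagios?" above: swift-drive-audit unmounts drives it finds errors on, so a nagios check could simply compare the devices that should be mounted against the ones that are. The sketch below is only an illustration under assumptions. The /srv/node mount root, the device names passed as arguments, and the critical threshold are all hypothetical; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL) are the standard nagios plugin convention.

```python
import os

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_mounts(expected, mounted, crit_threshold=2):
    """Compare expected device names against mounted ones.

    Returns (exit_code, message). crit_threshold is an assumed policy:
    this many missing devices (or more) is CRITICAL, fewer is WARNING.
    """
    missing = sorted(set(expected) - set(mounted))
    if not missing:
        return OK, "OK: all %d devices mounted" % len(expected)
    code = CRITICAL if len(missing) >= crit_threshold else WARNING
    label = "CRITICAL" if code == CRITICAL else "WARNING"
    return code, "%s: unmounted devices: %s" % (label, ", ".join(missing))


def mounted_under(root, mounts_file="/proc/mounts"):
    """List names of filesystems mounted directly under root (Linux only)."""
    names = []
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            # fields[1] is the mount point, e.g. /srv/node/sdb1
            if len(fields) >= 2 and os.path.dirname(fields[1]) == root:
                names.append(os.path.basename(fields[1]))
    return names


def main(argv, root="/srv/node"):
    """argv is the list of expected device names; root is an assumption."""
    code, msg = check_mounts(argv, mounted_under(root))
    print(msg)
    return code

# A wrapper script would end with: sys.exit(main(sys.argv[1:]))
```

Keeping check_mounts free of I/O means the alerting policy can be tested without touching a real backend.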

to do Sometime(tm)

  • enable 1.5 statsd ganglia stuff
    • disable ganglia-logtailer
    • disable local logging?
    • update ganglia view for new metrics
  • document how to switch from pmtpa to eqiad
    • container synchronization is an eventually consistent thing; how to synchronize the change?
  • have 2 users that interact with containers - one that can create / destroy containers and the other that can't
    • talk to aaron for more detail
  • upgrade pmtpa cluster from lucid to precise
  • add SSDs into ms-be1-5 to get hardware parity with the rest of the ms-be servers (and get the OS and local logs onto SSD instead of sharing with the object store)
  • LVS currently has the same monitoring URL for both pmtpa and eqiad but the URL includes the account ID, which is different between the two clusters. Split the LVS config (lvs.pp line 663ish) per cluster so each one can have its own monitoring URL.
  • replace all the swift C2100 hardware with something that doesn't have hardware failures left and right
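
For the LVS monitoring URL item above, the shape of the fix is to key the URL on the cluster instead of sharing one string. The real change belongs in lvs.pp; the Python below only illustrates the data layout. Hostnames, account IDs, container and object names are all placeholders, and the only thing taken as given is the swift path layout /v1/<account>/<container>/<object>.

```python
# Hypothetical per-cluster settings; hosts and account IDs are placeholders,
# not the real values for either cluster.
CLUSTERS = {
    "pmtpa": {"host": "ms-fe.pmtpa.example", "account": "AUTH_pmtpa-account-id"},
    "eqiad": {"host": "ms-fe.eqiad.example", "account": "AUTH_eqiad-account-id"},
}


def monitor_url(cluster, container="monitoring", obj="backend"):
    """Build the health-check URL for one cluster.

    The container and object names are assumed; the account ID comes from
    the per-cluster entry, which is the part that differs between sites.
    """
    cfg = CLUSTERS[cluster]
    return "http://%s/v1/%s/%s/%s" % (cfg["host"], cfg["account"], container, obj)


def all_monitor_urls():
    """Map each cluster name to its own monitoring URL."""
    return dict((name, monitor_url(name)) for name in sorted(CLUSTERS))
```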