Swift/Deploy Plan - Upgrade from 1.4.3 to 1.5

From Wikitech

Notes for Swift upgrade from 1.4.3 to 1.5

See:

Upgrade steps

Prep work

  • stop puppet on the production cluster
    /etc/init.d/puppet stop
  • edit root's crontab so that cron does not restart puppet
  • put the new swift packages in the lucid and precise debian repositories on brewster
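
The first two prep steps could be scripted roughly as below. This is only a sketch: it assumes the cron entry that restarts puppet can be recognised by the string "puppet" (check with crontab -l first), and the helper name disable_puppet_cron_lines is made up for illustration.

```shell
#!/bin/sh
# Comment out every active crontab line that mentions puppet.
# Reads a crontab on stdin, writes the edited crontab to stdout.
# Assumption: the entry that restarts puppet contains the word "puppet".
disable_puppet_cron_lines() {
    sed -e '/puppet/ s/^[[:space:]]*[^#[:space:]]/#&/'
}

# Usage on each production host, as root:
#   /etc/init.d/puppet stop
#   crontab -l | disable_puppet_cron_lines | crontab -
```

Lines that are already commented out are left alone, so the filter can be run more than once. Remember to undo the change when the upgrade is finished.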

Test puppet changes

  • merge gerrit 18264
  • run puppet on a not-yet-upgraded proxy and storage node in the eqiad-prod cluster (ms-fe100* and ms-be10**)

Proxy servers

  • pull two proxy servers out of rotation and confirm that the remaining two handle the load
    edit fenari:/home/w/conf/pybal/pmtpa/swift; set two to 'False'
  • upgrade the two
    stop the proxy service (swift-init all stop)
    run puppet
    swift-init all start
  • test: request existing thumbs, nonexistent thumbs, and nonsense URLs; run some of the commands from the swift howto page on wikitech
  • when everything looks good and the upgraded proxies are handling their share of traffic, take the remaining two proxies out of rotation
  • wait a day or so, see if anything odd shows up (check logs, village pump, mailing lists, etc)
  • if it looks bad, roll back: move traffic off the upgraded proxy servers and use just the two that weren't upgraded
  • if it looks good, ... ? do a backend? do the rest of the proxies?
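
One way to do the URL tests is to compare the HTTP status an upgraded proxy returns with what a not-yet-upgraded proxy returns for the same path. The sketch below assumes curl is available; the host names and paths in the usage comments are placeholders, and both helper names are hypothetical.

```shell
#!/bin/sh
# Fetch a URL and print only the HTTP status code.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Compare an observed status with the expected one and report the result.
# Returns non-zero on mismatch so it can gate later steps.
check_status() {
    label=$1 expected=$2 actual=$3
    if [ "$actual" = "$expected" ]; then
        echo "OK   $label ($actual)"
    else
        echo "FAIL $label (expected $expected, got $actual)"
        return 1
    fi
}

# Usage (OLD_PROXY, NEW_PROXY and the path are placeholders):
#   path=/PATH/TO/SOME_THUMB
#   check_status "$path" "$(http_status "http://OLD_PROXY$path")" \
#                        "$(http_status "http://NEW_PROXY$path")"
```

Using the old proxy as the reference avoids hard-coding which status a missing thumb or a nonsense URL should produce.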

Backend hosts

  • stop the service on one host, upgrade, start
    swift-init all stop
    run puppet
    swift-init all start
  • do some tests and wait around a little while, make sure nothing odd crops up
  • do the rest one at a time
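
The per-host sequence above can be wrapped in a small helper so each backend gets exactly the same steps. This is a sketch under assumptions: the puppet command (PUPPET_CMD) is a guess and should be replaced with however puppet is normally run by hand, the helper name is hypothetical, and it defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default) or run the upgrade sequence on one host.
# PUPPET_CMD is an assumption; substitute the local way of running puppet.
DRY_RUN=${DRY_RUN:-1}
PUPPET_CMD=${PUPPET_CMD:-"puppetd --test"}

upgrade_host() {
    host=$1
    for cmd in "swift-init all stop" "$PUPPET_CMD" "swift-init all start"; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "ssh $host $cmd"
        else
            # Puppet may exit non-zero even after a successful run, so the
            # status is reported for the operator instead of being fatal.
            ssh "$host" "$cmd"
            echo "[$host] '$cmd' exited with status $?"
        fi
    done
}

# Usage, one host at a time with a manual pause for testing:
#   for h in HOST1 HOST2; do
#       upgrade_host "$h"
#       printf 'tests ok on %s? press enter to continue ' "$h"; read answer
#   done
```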

Note that this covers the upgrade but not the enabling of statsd; that's a separate step to be done later.

Statsd

Swift 1.5+ has statsd metrics integrated in the code for more detailed monitoring. We will feed those metrics into ganglia using a statsd->ganglia bridge that runs on each host (i.e. Swift reports to a statsd listener on localhost, which then forwards the metrics to ganglia).

Config

For each service (proxy, object, container, account), enable the commented-out statsd lines in its config file.
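
As a rough illustration of what "enabling" means, the filter below uncomments lines of the form "# log_statsd_... = ...". It assumes the commented-out lines use Swift's log_statsd_* option names (log_statsd_host, log_statsd_port and so on); the helper name is hypothetical. Since the configs are managed by puppet, the real change belongs in the puppet templates, and this is only useful for previewing the result.

```shell
#!/bin/sh
# Uncomment "# log_statsd_* = ..." lines; all other lines pass through as-is.
# Reads a config on stdin and writes the result to stdout.
enable_statsd_lines() {
    sed -e 's/^#[[:space:]]*\(log_statsd_[a-z_]*[[:space:]]*=\)/\1/'
}

# Usage: preview the change for one service without touching the live file.
#   enable_statsd_lines < /etc/swift/proxy-server.conf > /tmp/proxy-server.conf.new
#   diff /etc/swift/proxy-server.conf /tmp/proxy-server.conf.new
```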

statsd -> ganglia bridge

  • teach puppet to start the bridge and make sure it's running (the puppet changes for this don't yet exist)
    example (needs confirmation) statsd line: pystatsd-server -n 127.0.0.1 -r ganglia --ganglia-host 10.4.0.79 --ganglia-port 21146 --flush-interval 15 -d --ganglia-spoof-host 10.4.0.107:swift-be1 --ganglia-counter-group swift_counters
  • test in labs and on the eqiad test cluster
    labs uses unicast ganglia, while production uses multicast, so testing with multicast is necessary
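
When testing the bridge, it can help to send a hand-made metric and watch for it in ganglia. The sketch below formats a datagram in the plain statsd wire format (name:value|type). It assumes the bridge listens on UDP port 8125 on localhost, which is the usual statsd default but is not confirmed for this setup; the metric name and helper name are made up.

```shell
#!/bin/sh
# Format one statsd datagram: <name>:<value>|<type>
# (type is "c" for a counter, "ms" for a timer).
statsd_line() {
    printf '%s:%s|%s' "$1" "$2" "$3"
}

# Usage: send a test counter to the local bridge. This needs a netcat that
# supports -u (UDP) and -w (timeout); port 8125 is an assumption.
#   statsd_line swift.test.ping 1 c | nc -u -w 1 127.0.0.1 8125
```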

Caution - in my testing, there was a bug where, after reporting some metrics, the host would show as down in the ganglia UI. I don't know what's causing this, but it is the critical bug to fix before this can be used in production.

Container Synchronization

This process will happen after the 1.5 upgrade but is not part of the upgrade process. It should happen as soon as is reasonable once the cluster is upgraded and stable.

See http://docs.openstack.org/developer/swift/overview_container_sync.html for details on how sync works.

things to do:

  • set up synchronization in two labs swift clusters
  • determine (empirically?) a reasonable interval for the sync processes (default is 5 minutes)
  • determine a method for newly created containers to get synchronization set up
    new containers are created when we have a new language wiki created
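
For the labs setup and for newly created containers, sync is configured per container with the swift client's post command, using -t (sync-to URL) and -k (sync key). The helper below only prints the command so it can be reviewed first. The helper name, the peer URL, and the key are placeholders, and auth options are deliberately left out.

```shell
#!/bin/sh
# Print the `swift post` command that points one container at its peer
# in the other cluster. Auth options (-A/-U/-K) are omitted; add whatever
# the cluster needs. peer_url is the account URL on the remote cluster.
sync_command() {
    container=$1 peer_url=$2 key=$3
    printf "swift post -t '%s/%s' -k '%s' %s\n" \
        "$peer_url" "$container" "$key" "$container"
}

# Usage (PEER_ACCOUNT_URL and SYNC_KEY are placeholders):
#   sync_command NEW_CONTAINER PEER_ACCOUNT_URL SYNC_KEY
# Run the matching command on the other cluster for two-way sync.
```

Something like this could be hooked into whatever creates the containers for a new language wiki, which would cover the "newly created containers" item above.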

questions:

  • what happens if swauth tokens are replicated?
  • how much load will the replication generate?
    especially for the initial sync
    can we throttle the initial sync?
    if not, maybe we should do the initial sync ourselves, then enable synchronization on already mostly-synced containers?