Swift/Deploy Plan - Upgrade from 1.4.3 to 1.5

Notes for Swift upgrade from 1.4.3 to 1.5

See:

Upgrade steps

Prep work

stop puppet on the production cluster
/etc/init.d/puppet stop
edit root's crontab to prevent it from getting restarted
put the new swift packages in the lucid and precise debian repositories on brewster

Test puppet changes

merge gerrit 18264
run puppet on a not-yet-upgraded proxy and storage node in the eqiad-prod cluster (ms-fe100* and ms-be10**)

Proxy servers

pull out two proxy servers, watch to make sure the remaining two handle the load
edit fenari:/home/w/conf/pybal/pmtpa/swift; set two to 'False'
upgrade the two
stop the proxy service (swift-init all stop)

run puppet

swift-init all start
test: request existing, nonexisting thumbs, nonsense urls; do some commands from the swift howto page on wikitech
when everything looks good and they are handling their share of traffic, take the remaining two proxies out
wait a day or so, see if anything odd shows up (check logs, village pump, mailing lists, etc)
if it looks bad, roll back = move traffic off the upgraded proxy servers and use just the two that weren't upgraded
if it looks good, ... ? do a backend? do the rest of the proxies?
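The test step above (existing thumbs, nonexisting thumbs, nonsense URLs) could be scripted along these lines. This is only a sketch: the paths and the acceptable status codes in EXAMPLE_CASES are placeholders, not known values for this cluster, and the function names are made up for illustration. Adjust the expected codes to whatever the not-yet-upgraded proxies return for the same requests.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, FrozenSet, List, Tuple

# Each case: (label, path, acceptable HTTP status codes).
# Paths and expected codes are placeholders (assumptions), to be replaced
# with real URLs and with the codes the old proxies return for them.
Case = Tuple[str, str, FrozenSet[int]]

EXAMPLE_CASES: List[Case] = [
    ("existing thumb", "/EXISTING_THUMB_PATH", frozenset({200})),
    ("nonexistent thumb", "/MISSING_THUMB_PATH", frozenset({404})),
    ("nonsense url", "/nonsense/url", frozenset({400, 404})),
]


def run_smoke_test(fetch_status: Callable[[str], int],
                   cases: List[Case]) -> Dict[str, bool]:
    """Return {label: passed}, where fetch_status(path) gives an HTTP code."""
    results = {}
    for label, path, acceptable in cases:
        try:
            status = fetch_status(path)
        except Exception:
            # A connection error counts as a failure for that case.
            results[label] = False
            continue
        results[label] = status in acceptable
    return results


def http_fetcher(base_url: str, timeout: float = 5.0) -> Callable[[str], int]:
    """Build a fetch_status function that issues real GETs against base_url."""
    def fetch(path: str) -> int:
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            # urllib raises on 4xx/5xx; the status code is still what we want.
            return err.code
    return fetch
```

Running run_smoke_test(http_fetcher("http://PROXY_HOST"), EXAMPLE_CASES) against an upgraded and a non-upgraded proxy and comparing the two result dicts gives a quick sanity check before putting the host back into rotation.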

Backend hosts

stop the service on one host, upgrade, start
swift-init all stop

run puppet

swift-init all start
do some tests and wait around a little while, make sure nothing odd crops up
do the rest one at a time
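The one-at-a-time procedure above can be sketched as a small loop that stops as soon as a host fails its post-upgrade check. The upgrade and healthy callables are hypothetical hooks: in practice upgrade would wrap the stop / run puppet / start sequence and healthy would wrap the manual tests and the waiting period.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(hosts: Iterable[str],
                    upgrade: Callable[[str], None],
                    healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade hosts one at a time; stop at the first failed check.

    Returns the hosts that were upgraded and passed the check.
    Raises RuntimeError if a host fails, leaving later hosts untouched.
    """
    done: List[str] = []
    for host in hosts:
        # e.g. swift-init all stop; run puppet; swift-init all start
        upgrade(host)
        # e.g. the tests from this plan, after waiting a little while
        if not healthy(host):
            raise RuntimeError(
                f"{host} failed its post-upgrade check; "
                f"hosts already upgraded: {done}"
            )
        done.append(host)
    return done
```

Stopping on the first failure keeps the number of hosts running the new version as small as possible while the problem is investigated.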

Note that this covers the upgrade but not enabling statsd; that's a separate step to be done later

Statsd

swift 1.5+ has statsd metrics integrated in the code for more detailed monitoring. We will feed those metrics into ganglia using a statsd->ganglia bridge that runs on each host. (i.e. statsd reports to localhost which then gets shunted off into ganglia-land.)
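For testing the bridge independently of swift, it helps to be able to hand-craft a metric. StatsD metrics are plain-text UDP datagrams of the form name:value|type. The sketch below formats a counter and a timer and sends them to localhost; port 8125 is the conventional statsd port and is an assumption here, as are the function names.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a StatsD counter as '<name>:<value>|c'."""
    return f"{name}:{value}|c".encode("ascii")


def format_timer(name: str, millis: float) -> bytes:
    """Encode a StatsD timing as '<name>:<milliseconds>|ms'."""
    return f"{name}:{millis}|ms".encode("ascii")


def send_metric(payload: bytes,
                host: str = "127.0.0.1",
                port: int = 8125) -> None:
    """Send one metric over UDP (fire and forget, no reply expected).

    The default port is the usual statsd port, not a confirmed value
    for this setup.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

For example, send_metric(format_counter("test.bridge.hits")) should make a test counter show up on the ganglia side if the bridge is listening on that port.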

Config

for each service (proxy, object, container, account), enable the commented out statsd lines
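If the config files are edited by script rather than by hand or through puppet templates, something like the following would do it. It assumes the statsd options appear as commented-out "# log_statsd_... = ..." lines, as they do in Swift's sample configs (log_statsd_host, log_statsd_port, log_statsd_default_sample_rate, log_statsd_metric_prefix); the function name is made up for illustration.

```python
import re

# Matches a commented-out option such as "# log_statsd_host = localhost".
_COMMENTED_STATSD = re.compile(r"^(\s*)#\s*(log_statsd_\w+\s*=.*)$")


def enable_statsd_lines(conf_text: str) -> str:
    """Uncomment every '# log_statsd_* = ...' line; leave other lines alone."""
    out = []
    for line in conf_text.splitlines():
        match = _COMMENTED_STATSD.match(line)
        if match:
            # Keep the original indentation, drop the comment marker.
            out.append(match.group(1) + match.group(2))
        else:
            out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if conf_text.endswith("\n") else "")
```

The values on the uncommented lines are kept as they are, so the host and port still need to be checked against wherever the bridge listens.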

statsd -> ganglia bridge

teach puppet to start the bridge and make sure it's running (the puppet changes for this don't yet exist)
example (needs confirmation) statsd line: pystatsd-server -n 127.0.0.1 -r ganglia --ganglia-host 10.4.0.79 --ganglia-port 21146 --flush-interval 15 -d --ganglia-spoof-host 10.4.0.107:swift-be1 --ganglia-counter-group swift_counters
test in labs and on the eqiad test cluster
labs uses unicast ganglia, production uses multicast. testing multicast is necessary

Caution - in my testing, there was a bug where, after reporting some metrics, the host would show as down in the ganglia UI. I don't know what's causing this, but it is the critical bug to fix before this can be used in production.

Container Synchronization

This process will happen after the 1.5 upgrade but is not part of the upgrade process. It should happen as soon as is reasonable once the cluster is upgraded and stable.

See http://docs.openstack.org/developer/swift/overview_container_sync.html for details on how sync works.

things to do:

set up synchronization in two labs swift clusters
determine (empirically?) a reasonable interval for the sync processes (default is 5 minutes)
determine a method for newly created containers to get synchronization set up
new containers are created when we have a new language wiki created
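Whatever method ends up creating sync for new containers, it comes down to POSTing two headers to each container: X-Container-Sync-To (the URL of the container on the other cluster) and X-Container-Sync-Key (a shared secret). A minimal sketch of building those headers for a two-way pair follows. The function names and the cluster_a / cluster_b labels are hypothetical, and it assumes the same account and container name on both clusters, which may not hold here.

```python
from typing import Dict
from urllib.parse import quote


def container_sync_headers(remote_base: str, account: str,
                           container: str, key: str) -> Dict[str, str]:
    """Build the POST headers that point a container at its remote twin."""
    sync_to = f"{remote_base.rstrip('/')}/v1/{quote(account)}/{quote(container)}"
    return {
        "X-Container-Sync-To": sync_to,
        "X-Container-Sync-Key": key,
    }


def paired_sync_headers(base_a: str, base_b: str, account: str,
                        container: str, key: str) -> Dict[str, Dict[str, str]]:
    """Headers for two-way sync: each cluster's container targets the other."""
    return {
        # POST these to the container on cluster A (it syncs to B).
        "cluster_a": container_sync_headers(base_b, account, container, key),
        # POST these to the container on cluster B (it syncs to A).
        "cluster_b": container_sync_headers(base_a, account, container, key),
    }
```

A hook in the new-wiki creation process could call paired_sync_headers once per newly created container and issue the two POSTs; both sides must use the same key for the sync to be accepted.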

questions:

what happens if swauth tokens are replicated?
how much load will the replication generate?
especially for the initial sync

can we throttle the initial sync?
if not, maybe we should do the initial sync ourselves, then enable synchronization on already mostly-synced containers?