Swift/Deploy Plan - Upgrade from 1.4.3 to 1.5
Notes for Swift upgrade from 1.4.3 to 1.5
- stop puppet on the production cluster
- /etc/init.d/puppet stop
- edit root's crontab to prevent it from getting restarted
- put the new swift packages in the lucid and precise debian repositories on brewster
Test puppet changes
- merge gerrit 18264
- run puppet on a not-yet-upgraded proxy and storage node in the eqiad-prod cluster (ms-fe100* and ms-be10**)
- pull out two proxy servers, watch to make sure the remaining two handle the load
- edit fenari:/home/w/conf/pybal/pmtpa/swift; set two to 'False'
- upgrade the two
- stop the proxy service (swift-init all stop)
- run puppet
- swift-init all start
- test: request existing, nonexisting thumbs, nonsense urls; do some commands from the swift howto page on wikitech
- when everything looks good and they are handling their share of traffic, take the remaining two proxies out
- wait a day or so, see if anything odd shows up (check logs, village pump, mailing lists, etc)
- if it looks bad, roll back = move traffic off the upgraded proxy servers and use just the two that weren't upgraded
- if it looks good, ... ? do a backend? do the rest of the proxies?
- stop the service on one host, upgrade, start
- swift-init all stop
- run puppet
- swift-init all start
- do some tests and wait around a little while, make sure nothing odd crops up
- do the rest one at a time
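The proxy test step above (existing thumb, nonexisting thumb, nonsense URL) could be scripted so that the same checks are repeated after each host is upgraded. What follows is only a minimal sketch in Python: the hostname, paths, and expected status codes in EXAMPLE_CASES are placeholders and not taken from the real cluster, so they need to be replaced with known-good values before this means anything.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a GET of url.

    4xx/5xx responses are returned as codes rather than raised.
    Connection failures (urllib.error.URLError) are left to propagate,
    since a proxy that cannot be reached at all is a different problem.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def run_checks(cases, fetch=fetch_status):
    """cases is a list of (label, url, expected_status) tuples.

    Returns a list of (label, url, expected, got) for every mismatch;
    an empty list means every check matched.
    """
    failures = []
    for label, url, expected in cases:
        got = fetch(url)
        if got != expected:
            failures.append((label, url, expected, got))
    return failures


# Placeholder cases only: host, paths and expected codes are assumptions.
EXAMPLE_CASES = [
    ("existing thumb", "http://proxy.example/path/to/existing-thumb.jpg", 200),
    ("nonexisting thumb", "http://proxy.example/path/to/missing-thumb.jpg", 404),
    ("nonsense url", "http://proxy.example/this/is/nonsense", 404),
]
```

Calling run_checks(EXAMPLE_CASES) against a just-upgraded proxy and comparing the result with the same call against a not-yet-upgraded one gives a quick before/after comparison. The fetch argument is there so the logic can be exercised without touching the network.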
Note that this covers the upgrade but not enabling statsd; that's a separate step to be done later.
swift 1.5+ has statsd metrics integrated in the code for more detailed monitoring. We will feed those metrics into ganglia using a statsd->ganglia bridge that runs on each host. (i.e. statsd reports to localhost which then gets shunted off into ganglia-land.)
for each service (proxy, object, container, account), enable the commented out statsd lines
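Once the statsd lines have been uncommented, a small check could confirm that none of the four services was missed. This is a sketch under assumptions: the option names log_statsd_host and log_statsd_port, their location in the [DEFAULT] section, and the /etc/swift paths are what the Swift sample configs suggest, and should be confirmed against the config files that the 1.5 packages and puppet actually install.

```python
import configparser

# Assumed config locations; adjust to match what puppet deploys.
SERVICE_CONFIGS = {
    "proxy": "/etc/swift/proxy-server.conf",
    "object": "/etc/swift/object-server.conf",
    "container": "/etc/swift/container-server.conf",
    "account": "/etc/swift/account-server.conf",
}

# Assumed option names (from the Swift sample configs).
REQUIRED_OPTIONS = ("log_statsd_host", "log_statsd_port")


def missing_in_text(text):
    """Return the required statsd options that are not set in [DEFAULT].

    Lines starting with '#' are treated as comments by configparser,
    so a still-commented-out option counts as missing.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    defaults = parser.defaults()
    return [opt for opt in REQUIRED_OPTIONS if opt not in defaults]


def check_all(configs=SERVICE_CONFIGS):
    """Map each service name to its list of missing statsd options."""
    result = {}
    for name, path in configs.items():
        with open(path) as handle:
            result[name] = missing_in_text(handle.read())
    return result
```

check_all() returns a dict such as {"proxy": [], "object": ["log_statsd_host"], ...}; any non-empty list points at a service whose statsd lines are still commented out.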
statsd -> ganglia bridge
- teach puppet to start the bridge and make sure it's running (the puppet changes for this don't yet exist)
- example (needs confirmation) statsd line: pystatsd-server -n 127.0.0.1 -r ganglia --ganglia-host 10.4.0.79 --ganglia-port 21146 --flush-interval 15 -d --ganglia-spoof-host 10.4.0.107:swift-be1 --ganglia-counter-group swift_counters
- test in labs and on the eqiad test cluster
- labs uses unicast ganglia, production uses multicast. testing multicast is necessary
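While testing the bridge, it may help to push a known metric at it by hand and then look for that metric in ganglia. A minimal sketch, assuming the bridge listens for statsd datagrams on UDP port 8125 on localhost (the example command line above does not set a port, so this needs confirming) and using a made-up metric name:

```python
import socket


def format_counter(name, value=1):
    """Build a statsd counter datagram in the form name:value|c."""
    return "{}:{}|c".format(name, value).encode("ascii")


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Send one counter increment over UDP and return the bytes sent.

    statsd is fire-and-forget: there is no reply, so success here only
    means the datagram left this process, not that the bridge saw it.
    """
    payload = format_counter(name, value)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload
```

For example, send_counter("swift.bridge_test") sends a single increment of a hypothetical swift.bridge_test counter. If it shows up in ganglia after the flush interval, the path from localhost to ganglia works for at least that metric type.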
Caution - in my testing, there was a bug where, after reporting some metrics, the host would show as down in the ganglia UI. I don't know what's causing this, but it is the critical bug to fix before this can be used in production.
Setting up container synchronization will happen after the 1.5 upgrade but is not part of the upgrade process. It should happen as soon as is reasonable once the cluster is upgraded and stable.
See http://docs.openstack.org/developer/swift/overview_container_sync.html for details on how sync works.
things to do:
- set up synchronization in two labs swift clusters
- determine (empirically?) a reasonable interval for the sync processes (default is 5 minutes)
- determine a method for newly created containers to get synchronization set up
- new containers are created when we have a new language wiki created
- what happens if swauth tokens are replicated?
- how much load will the replication generate?
- especially for the initial sync
- can we throttle the initial sync?
- if not, maybe we should do the initial sync ourselves, then enable synchronization on already mostly-synced containers?
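For the question of how newly created containers get synchronization set up, one option is a small helper run whenever a container is created. Container sync is configured by POSTing the X-Container-Sync-To and X-Container-Sync-Key headers to the container. The sketch below only builds the requests and does not send them. The storage URLs, tokens, and shared key are placeholders, and whether sync should run in one direction or both is not decided above, so the two-way helper is just one possibility.

```python
import urllib.parse
import urllib.request


def sync_request(storage_url, container, remote_storage_url, sync_key, token):
    """Build (not send) the POST that points a container at its remote twin.

    storage_url and remote_storage_url are account-level URLs of the form
    http://host/v1/ACCOUNT; the container name is appended to both.
    """
    name = urllib.parse.quote(container)
    headers = {
        "X-Auth-Token": token,
        "X-Container-Sync-To": "{}/{}".format(remote_storage_url.rstrip("/"), name),
        "X-Container-Sync-Key": sync_key,
    }
    url = "{}/{}".format(storage_url.rstrip("/"), name)
    return urllib.request.Request(url, headers=headers, method="POST")


def two_way_sync_requests(container, url_a, token_a, url_b, token_b, sync_key):
    """Requests enabling sync A->B and B->A for one container name."""
    return [
        sync_request(url_a, container, url_b, sync_key, token_a),
        sync_request(url_b, container, url_a, sync_key, token_b),
    ]
```

Each returned request can be passed to urllib.request.urlopen once the values are real. Note that the container servers also need the remote host listed in allowed_sync_hosts before they will sync to it; see the container sync overview linked above.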