Swift/How To

General Prep

Nearly all of these commands are best executed from a swift proxy and stats reporter host (e.g. ms-fe1005.eqiad.wmnet or ms-fe2005.codfw.wmnet) and require either the master password or an account password. Both the master password (super_admin_key) and the specific users' passwords we have at Wikimedia are accessible in the swift proxy config file /etc/swift/proxy-server.conf or in the private puppet repository.

All of these tasks are explained in more detail and with far more context in the official swift documentation. This page is a deliberately restricted version of that information, aimed at the tasks we need to carry out in the Wikimedia cluster. For this reason many options and caveats have been left out, and details such as the authentication type are assumed to match our installation. It may or may not be useful for a wider audience.

Set up an entire swift cluster

This is documented elsewhere: Swift/Setup_New_Swift_Cluster

Individual Commands - interacting with Swift

Impersonating a specific swift account

Some swift proxy servers (e.g. ms-fe2005 / ms-fe1005) have an extra stats reporter role, which records account credentials in /etc/swift/account_*.

You can do . /etc/swift/account_AUTH_dispersion.env to read the dispersion account's authentication URL (ST_AUTH), username (ST_USER) and key (ST_KEY) into your environment, which the swift binary will then use to authenticate your commands, similar to nova/designate/openstack used in labs administration.
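Before running further commands it can help to confirm that all three variables actually made it into the shell. The following is a minimal sketch; check_swift_env is a hypothetical helper name and is not part of the swift client.

```shell
# check_swift_env is a hypothetical helper, not part of the swift client.
# It reports which of ST_AUTH, ST_USER and ST_KEY are missing from the shell.
check_swift_env() {
  missing=""
  for name in ST_AUTH ST_USER ST_KEY; do
    # Indirect lookup of the variable named by $name (POSIX-compatible).
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
  echo "swift credentials loaded for $ST_USER"
}
```

Typical use would be . /etc/swift/account_AUTH_dispersion.env && check_swift_env, which prints the loaded user name or lists the variables that are still unset.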

Create a container

You create a container by POSTing to it. You modify a container by POSTing to an existing container. Only users with admin rights (aka users in the .admin group) are allowed to create or modify containers.

Run the following commands on any host with the swift binaries installed (any host in the swift cluster or iron)

  • create a container with default permissions (r/w by owner and nobody else)
    • swift post container-name
  • create a container with global read permissions
    • swift post -r '.r:*' container-name
  • The WikimediaMaintenance extension's filebackend/setZoneAccess.php file creates most wiki-specific containers, and SwiftFilebackend gives its own user read and write privileges along with global read for public containers.

List containers and contents

It's easiest to do all listing from a frontend host on the cluster you wish to list.

list of all containers

  • ask for a listing of all containers: swift list

list the contents of one container

  • ask for a listing of the container: swift list wikipedia-commons-local-thumb.a2

list specific objects within a container

example: look for all thumbnails for the file Little_kitten_.jpg

  • start from a URL for a thumbnail (if you are at the original File: page, 'view image' on the existing thumbnail)
  • Pull out the project, "language", thumb, and shard to form the correct container and add -local into the middle
    • eg wikipedia-commons-local-thumb.a2
    • Note - only some containers are sharded: grep shard /etc/swift/proxy-server.conf to find out if your container should be sharded
    • unsharded containers leave off the shard eg wikipedia-commons-local-thumb
  • ask swift for a listing of the correct container with the --prefix option (it must come before the container name)
    • swift list --prefix a/a2/Little_kit wikipedia-commons-local-thumb.a2
    • note that --prefix is matched against the beginning of the object name within the container; it doesn't have to be a complete name.
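The container and prefix derivation described above can be sketched as a small shell function. This is an illustration only: thumb_location is a hypothetical name, the URL layout in the comment is an assumption about how thumbnail URLs are built, and the function assumes a sharded container (for unsharded containers, drop the trailing .xy).

```shell
# thumb_location is a hypothetical helper. It assumes thumbnail URLs look like
#   https://<host>/<project>/<lang>/thumb/<x>/<xy>/<file>/<size>px-<file>
# and that the container is sharded by <xy>.
thumb_location() (
  set -f                 # no globbing while the path is being split
  IFS=/
  set -- ${1#*://*/}     # drop scheme and host, split the rest on "/"
  # $1=project $2=lang $3=thumb $4=x $5=xy $6=file
  printf '%s-%s-local-%s.%s %s/%s/%s\n' "$1" "$2" "$3" "$5" "$4" "$5" "$6"
)
```

The function prints the container name followed by a prefix, so the two words can be passed to swift list as the container and the --prefix value respectively.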

Show specific info about a container or object

Note - these instructions will only show containers or objects the account has permission to see.

  • log into a swift frontend host on the cluster you want to use, set ST_AUTH/ST_USER/ST_KEY
  • ask for statistics about all containers: swift stat
  • ask for statistics about the container: swift stat wikipedia-commons-local-thumb.a2
  • ask for statistics about an object in a container: swift stat wikipedia-commons-local-thumb.a2 a/a2/Little_kitten_.jpg/300px-Little_kitten_.jpg

Delete a container or object

Note - THIS IS DANGEROUS; it's easy to delete a container instead of an object by hitting return at the wrong time!

Deleting uses the same syntax as 'stat'. I recommend running stat on an object first to get the command right, then using history substitution (^stat^delete^ in bash) to swap stat for delete.

  • log into a swift frontend host on the cluster you want to use, set ST_AUTH/ST_USER/ST_KEY
  • run swift stat on the object you want to delete
    • swift stat wikipedia-de-local-thumb f/f5/Wasserhose_1884.jpg/800px-Wasserhose_1884.jpg
  • swap stat for delete in the same command.
    • ^stat^delete^

When you call delete for a container it will first delete all objects within the container and then delete the container itself.
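Because a delete with only a container name removes the whole container, one possible safeguard is a wrapper that refuses to produce a delete command unless both a container and an object are given. The sketch below is hypothetical (swift_delete_object_cmd is not an existing tool) and only prints the command for review; it does not run it.

```shell
# swift_delete_object_cmd is a hypothetical guard. It only prints the command,
# and fails unless both a container and an object path are supplied.
swift_delete_object_cmd() {
  if [ "$#" -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: swift_delete_object_cmd CONTAINER OBJECT" >&2
    return 1
  fi
  printf 'swift delete %s %s\n' "$1" "$2"
}
```

The printed line can be checked against the earlier swift stat invocation before it is pasted and executed.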

Setup temp url key on an account

MediaWiki makes use of a temporary url key to download files; the key must be set on the mw:media account. On swift machines that report statistics you can find several .env files to "su" to each account, e.g.

 source /etc/swift/account_AUTH_mw.env
 swift post -m 'Temp-URL-Key:<your key>'

Individual Commands - Managing Swift

Show current swift ring layout

There are three rings in Swift: account, object, and container. The swift-ring-builder command with a builder file will list the current state of the ring.

  1. swift-ring-builder /etc/swift/account.builder
  2. swift-ring-builder /etc/swift/container.builder
  3. swift-ring-builder /etc/swift/object.builder
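To compare the three rings at a glance, the summary line of each listing can be reduced to its device count and balance. This is a minimal sketch: ring_balance is a hypothetical helper, and it assumes the summary line has the form "N partitions, N replicas, N zones, N devices, N balance" as in the sample output on this page.

```shell
# ring_balance is a hypothetical helper: it reads swift-ring-builder output on
# stdin and prints the device count and balance taken from the summary line.
ring_balance() {
  awk '/partitions,.*balance/ {
         for (i = 2; i <= NF; i++) {
           w = $i; sub(/,$/, "", w)          # strip a trailing comma
           if (w == "devices") d = $(i - 1)  # the number precedes its label
           if (w == "balance") b = $(i - 1)
         }
         print "devices=" d " balance=" b
       }'
}
```

For example, swift-ring-builder /etc/swift/account.builder | ring_balance, repeated for the container and object builder files.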

Rebalance the rings

You only have to rebalance the rings after you have made a change to them. If there are no changes pending, the attempt to rebalance will fail with the error message "Cowardly refusing to save rebalance as it did not change at least 1%."

To rebalance the rings you run the actual rebalance on a copy of the ring files then distribute the rings to the rest of the cluster (via puppet).

The canonical copy of the rings is kept in operations/software/swift-ring.git with instructions on how to make changes and send them for review. After a change has been reviewed and merged it can be deployed (i.e. pushed to the puppet master)

Add a proxy node to the cluster

  • Update site.pp in puppet to make the new proxy match existing proxies in that cluster
    • likely you'll include role::swift::xxx-yyy::proxy
    • maybe some ganglia-related stuff
  • Update the xxx-yyy config section in role/swift.pp
    • add the new server to the list of memcached_servers
  • Run puppet on the host twice, reboot, and run puppet again
  • Test the host
  • Add the new proxy to the load balancer (full details) if it's a load balanced cluster

Remove a failed proxy node from the cluster

  • Take the failed node out of the load balancer if necessary
  • Update the puppet configuration for the cluster
    • remove the failed node from the memcached list in the role/swift.pp in the cluster config

Add a storage node to the cluster

Start by doing the normal setup with a few tweaks, paying attention to the desired swift server layout.

Puppet will take care of all disks that consist of a single partition used for data, so you should pass it all non-OS disks. You may have to create partitions on the OS disk for swift storage. The following is what I ran on ms-be1 (where the bios is on sda1 and sdb1, the OS partition is raided across 120GB partitions on sda2 and sdb2, and sda3 and sdb3 are swap):

 # parted
 ) help
 ) print free
 ) mkpart swift-sda4 121GB 2000GB
 ) select /dev/sdb
 ) print free
 ) mkpart swift-sdb4 121GB 2000GB
 ) quit
 # mkfs -t xfs -i 512 -L swift-sda4 /dev/sda4
 # mkfs -t xfs -i 512 -L swift-sdb4 /dev/sdb4
 # mkdir /srv/swift-storage/sd{a,b}4
 # chown -R swift:swift /srv/swift-storage/sd{a,b}4
 # vi /etc/fstab # <-- add in a line for sda4 and sdb4 with the same xfs options as the rest
 # mount -a
 # reboot # just for good measure

After Puppet has finished setting up Swift and all device partitions are mounted successfully, add them to the rings. (Since the two partitions on sda and sdb are slightly smaller than the rest, they should get an appropriately smaller weight, eg 95 instead of 100.)

Add a device (drive) to a ring

Select the following values:

  • zone : each rack is its own zone; all servers within a rack and all drives within a server should be the same zone
    • list all the drives to see what zones are in use with swift-ring-builder /etc/swift/account.builder (see above)
  • ip - ip of the storage node
  • dev - the short name of the partition - eg 'sdc1'
  • weight - how big the partition is in gigabytes (powers of 10, not 2) (e.g. 2TB -> 2000)

Note: see #Rebalance_the_rings for how to obtain a copy of the rings.

    swift-ring-builder account.builder add z${zone}-${ip}:6002/${dev} $weight
    swift-ring-builder container.builder add z${zone}-${ip}:6001/${dev} $weight
    swift-ring-builder object.builder add z${zone}-${ip}:6000/${dev} $weight

Example, to add device /dev/sda4 on ms-be5:

   swift-ring-builder account.builder add z5- 100
   swift-ring-builder container.builder add z5- 100
   swift-ring-builder object.builder add z5- 100

After you're done, you must rebalance the three rings and push them out to the rest of the cluster.
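The three add commands differ only in ring name and port, so they can be generated from one set of values. The sketch below is an illustration under assumptions: both function names are hypothetical, the functions only print and do not touch the builder files, and the address used in any test run would be a placeholder.

```shell
# Both helpers are hypothetical and only print; nothing is changed in the rings.

# Weight in gigabytes (powers of 10, not 2) from a partition size in bytes.
weight_from_bytes() {
  echo $(( $1 / 1000000000 ))
}

# Print the add command for each ring, using the ports from the template above:
# account 6002, container 6001, object 6000.
ring_add_commands() {
  zone=$1 ip=$2 dev=$3 weight=$4
  for pair in account:6002 container:6001 object:6000; do
    ring=${pair%%:*}
    port=${pair##*:}
    echo "swift-ring-builder ${ring}.builder add z${zone}-${ip}:${port}/${dev} ${weight}"
  done
}
```

Reviewing the printed commands before running them in the ring checkout keeps the zone, address and device name consistent across the three rings.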

Remove a failed storage node from the cluster

Remove each of the devices on the failed node from the rings, rebalance, and distribute the new ring files.

Remove (fail out) a drive from a ring

There are two conditions in which you will want to remove a device from service

  • when the device is dead or the host is down and unreachable
  • when it's still working but you want to decommission it or pull it out for service

For the former, you just remove the device; for the latter, you can nicely pull data off the device before shutting it off by changing the device weight first.

remove failed devices

The command to remove a device is swift-ring-builder /etc/swift/<ring>.builder remove d###. Here's the sequence:

  • find the IDs of the devices you want to remove. You're looking for the 'id' using the IP address and name as your keys. You should verify that the ID is the same across all three rings; I'm only showing one ring here for the example.
root@ms-fe2:~# swift-ring-builder /etc/swift/account.builder
/etc/swift/account.builder, build version 192
65536 partitions, 3 replicas, 5 zones, 161 devices, 0.10 balance
The minimum number of hours before a partition can be reassigned is 3
Devices:    id  zone      ip address  port      name weight partitions balance meta
             0     1  6002      sda1  25.00        844    0.02
             1     1  6002     sdaa1  25.00        844    0.02
             2     1  6002     sdab1  25.00        844    0.02
             3     1  6002     sdad1  25.00        844    0.02
             4     1  6002     sdae1  25.00        844    0.02
             5     1  6002     sdaf1  25.00        844    0.02
             etc. etc. etc.
  • remove them (in this example I'm removing an entire host; you can remove only a single drive if necessary.) Note that in our environment, account and container device IDs often (but not always) match and object device IDs are different. You should check each ring individually.
cp -a /etc/swift ~; cd ~/swift;
for i in {150..161}; do
  swift-ring-builder account.builder remove d$i
done
# repeat for container.builder and object.builder, using the IDs found in each ring
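Since device IDs have to be looked up per ring, a small filter over the builder listing can collect the IDs for one host. This is a sketch only: device_ids_for_ip is a hypothetical helper, and it assumes the column order shown in the listing above (id, zone, ip address, port, name, ...), which may differ between swift versions.

```shell
# device_ids_for_ip is a hypothetical helper. It assumes the column order
# id, zone, ip address, port, name, ... and reads builder output on stdin.
device_ids_for_ip() {
  # Only rows that start with a numeric id and whose third column is the IP.
  awk -v ip="$1" '$1 ~ /^[0-9]+$/ && $3 == ip { print $1 }'
}
```

It would be run once per ring, for example swift-ring-builder account.builder | device_ids_for_ip followed by the storage node's address, and the resulting IDs compared before any remove command is issued.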

remove working devices for maintenance

To remove a device for maintenance, you set the weight on the device to 0, rebalance, wait a while (a day or two), then do your maintenance. The examples here assume you're removing all the devices on a node. Note that I'm only checking one of the three rings but taking action on all three. To be completely sure we should check all three rings but by policy we keep all three rings the same.

  • find the IDs for the devices you want to remove (in this example, I'm pulling out ms-be5)
 root@ms-fe1:/etc/swift# swift-ring-builder /etc/swift/account.builder search
 Devices:    id  zone      ip address  port      name weight partitions balance meta
            186     8  6002      sda4  95.00       1993  -12.24
            187     8  6002      sdb4  95.00       1993  -12.24
            188     8  6002      sdc1 100.00       2098  -12.23
            189     8  6002      sdd1 100.00       2097  -12.27
            190     8  6002      sde1 100.00       2097  -12.27
            191     8  6002      sdf1 100.00       2097  -12.27
            192     8  6002      sdg1 100.00       2097  -12.27
            193     8  6002      sdh1 100.00       2097  -12.27
            194     8  6002      sdi1 100.00       2097  -12.27
            195     8  6002      sdj1 100.00       2097  -12.27
            196     8  6002      sdk1 100.00       2097  -12.27
            197     8  6002      sdl1 100.00       2097  -12.27
  • set their weight to 0

cd [your swift-ring.git checkout]/[swift instance] (e.g. eqiad-prod)

 for id in {186..197}; do 
   for ring in account object container ; do 
     swift-ring-builder ${ring}.builder set_weight d${id} 0
   done
 done
Alternatively you can, for a given ring,
 swift-ring-builder ${ring}.builder set_weight 0
It will prompt you with a list of the devices that will be affected and give you a chance to confirm or cancel.
  • check what you've done
 git diff -w

Replacing a disk without touching the rings

If the time span for replacement is short enough the failed disk can be left unmounted and swapped with a working one. After successful replacement it should be added back to the raid controller and the raid cache discarded:

 megacli -GetPreservedCacheList -a0
 megacli -DiscardPreservedCache -L'disk_number' -a0
 megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0

Change info for devices in a ring

  • first use the search subcommand to find the devices you want to update. Example: looking for the devices in the container ring that are listed with port 6002, which is wrong (the container ring uses port 6001):
 root@ms-be11:~/swift-rings/swift# swift-ring-builder container.builder search z8-
 Devices:    id  zone      ip address  port      name weight partitions balance meta
            202     8  6002      sdn3 100.00      13108  -33.33 
            203     8  6002      sdm3 100.00      13108  -33.33 
  • next, if it showed you the right devices, use the set_info subcommand to replace the incorrect info. In this example, the port number is wrong, so we update it as follows:
 root@ms-be11:~/swift-rings/swift# swift-ring-builder container.builder set_info z8-
 Matched more than one device:
 Are you sure you want to update the info for these 2 devices? (y/N) y
 Device d202z8-"" is now d202z8-""
 Device d203z8-"" is now d203z8-""
  • check your work:
 root@ms-be11:~/swift-rings/swift# swift-ring-builder container.builder 
 container.builder, build version 809
 65536 partitions, 3 replicas, 4 zones, 10 devices, 33.33 balance
 The minimum number of hours before a partition can be reassigned is 3
 Devices:    id  zone      ip address  port      name weight partitions balance meta
            186     8  6001      sda3 100.00      19660   -0.00 
            187     8  6001      sdb3 100.00      19660   -0.00 
            194    12  6001      sda3 100.00      21845   11.11 
            195    12  6001      sdb3 100.00      21845   11.11 
            198    14  6001      sda3 100.00      21846   11.11 
            199    14  6001      sdb3 100.00      21846   11.11 
            200    15  6001      sda3 100.00      21845   11.11 
            201    15  6001      sdb3 100.00      21845   11.11 
            202     8  6001      sdn3 100.00      13108  -33.33 
            203     8  6001      sdm3 100.00      13108  -33.33 
  • and now write the rings (you don't rebalance them, because you don't actually change partitioning):
 root@ms-be11:~/swift-rings/swift# swift-ring-builder container.builder write_ring

Nuke a swift cluster

Only do this on test clusters: it is unrecoverable and destroys all the data in the cluster.

  • on all servers:
    • stop all services: swift-init all stop
    • remove all ring data: rm /etc/swift/*.{builder,ring.gz}
  • on the storage nodes:
    • remove all storage content: for i in /srv/swift-storage/sd*; do rm -r $i/*& done (or just reformat the drives - faster)

The swift cluster is now destroyed. To rebuild, follow the instructions in Swift/Setup_New_Swift_Cluster

Misc operations

Repair xfs free blocks counter corruption

As found in https://phabricator.wikimedia.org/T199198, newer xfs settings lead to the filesystem mis-counting free blocks, which in turn makes df return bogus numbers. When that happens the cure is to unmount the affected filesystem and run xfs_repair on it. Once that has been done, no reoccurrence has been observed.

 # downtime the host in icinga first, on icinga.wikimedia.org:
 sudo icinga-downtime -r 'xfs repair' -d 16000 -h $short_hostname
 # on the swift server
 puppet agent --disable "repairing xfs"
 systemctl stop swift-object*
 systemctl stop rsync
 to_repair=$(df -h | awk '/ - \/srv\/swift/ { print $1 }')
 # paste this on #wikimedia-operations for audit purposes
 echo '!'"log repair $to_repair on $HOSTNAME - T199198"
 # run in a screen, repair all filesystems misreporting disk space
 for dev in $to_repair; do umount $dev ; echo "repairing $dev" ;  xfs_repair $dev ; done && mount -a && puppet agent --enable && puppet agent --test
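The awk filter used to build to_repair can be tried against saved df output before touching a host. The sketch below wraps the same pattern in a hypothetical function, misreporting_devices, that reads df -h output on stdin; it assumes an affected filesystem shows "-" in the column immediately before a /srv/swift mount point.

```shell
# misreporting_devices is a hypothetical wrapper around the awk filter above.
# It reads df -h output on stdin and prints the device of every row that has
# "-" directly before a /srv/swift mount point.
misreporting_devices() {
  awk '/ - \/srv\/swift/ { print $1 }'
}
```

Running df -h | misreporting_devices on the swift server should list the same devices that end up in to_repair.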