Two questions about the nagios checks noted for ceph: I assume those are icinga checks? And where do I find them? Going to icinga and searching for the host doesn't show me anything ceph-related.
A question about handling disk replacements: I assume we would stop the osd (if it's not already stopped), replace the disk, and start the osd, presumably figuring out which osd is in trouble by checking output from e.g. ceph osd dump?
For the nagios checks, see manifests/role/ceph.pp, plus the normal LVS check. I have one more important one pending (the ceph health check) that needs a bit more work.
The disk replacement part needs a bit more documentation. I've written a shell script called "ceph-add-disk" that automates all of the steps and has worked in practice, but I'd like to enhance it a bit and document it better. I think I'll just wait for a failed disk before doing so.
The odd number of monitors is not a requirement but a recommendation. The reason is that an even number lowers availability rather than increasing it, since you need a strict majority (more than 50%) of the monitors running. In the case of 6 mons, you need 4 to establish quorum, which means that you can sustain up to 2 failed monitors; this is the same number of failures that you can sustain with 5 monitors, so the extra monitor doesn't bring you much and instead increases the likelihood of a failure. faidon (talk) 15:40, 24 June 2013 (UTC)
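The quorum arithmetic above can be sketched in a few lines of Python. This is only an illustration of the strict-majority rule described in the reply; the function names quorum_size and tolerated_failures are made up for this example and are not part of any ceph tooling.

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a strict majority still remains."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Compare odd and even monitor counts side by side.
    for n in range(3, 8):
        print(f"{n} mons: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

For 5 and 6 monitors this gives quorum sizes of 3 and 4 respectively, with 2 tolerated failures in both cases, which matches the numbers quoted above.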