the current working set is ~1TB
- The archived/ hierarchy is 346 GB of that space, and it could live on spinning disks.
- true, it is going to grow to the tune of ~1MB/metric, so it depends on how much growth we want to plan for
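To put numbers on the growth question, here is a rough sketch of how the per-metric size follows from the retention config. The byte sizes are whisper's on-disk layout (16-byte header, 12 bytes per archive, 12 bytes per datapoint); the retention scheme and the number of new metrics are made-up assumptions, not our actual configuration.

```python
# Whisper on-disk layout: 16-byte file header, 12 bytes of metadata per
# archive, and 12 bytes per datapoint (4-byte timestamp + 8-byte double).
HEADER_BYTES = 16
ARCHIVE_INFO_BYTES = 12
POINT_BYTES = 12


def whisper_file_bytes(archives):
    """Size of one whisper file, given (seconds_per_point, retention_seconds) pairs."""
    points = sum(retention // step for step, retention in archives)
    return HEADER_BYTES + ARCHIVE_INFO_BYTES * len(archives) + POINT_BYTES * points


# Hypothetical retention: 1m for 7d, 5m for 30d, 15m for 1y (not our real config).
DAY = 86400
EXAMPLE_ARCHIVES = [(60, 7 * DAY), (300, 30 * DAY), (900, 365 * DAY)]

per_metric = whisper_file_bytes(EXAMPLE_ARCHIVES)
new_metrics = 100_000  # hypothetical growth to plan for
print(f"{per_metric / 1e6:.2f} MB per metric")
print(f"{per_metric * new_metrics / 1e9:.1f} GB for {new_metrics} new metrics")
```

Plugging in the real storage-schemas entries and an expected metric count would give a planning figure for disk space.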
Another option would be to use dm-cache (and kernel >= 3.9) and an SSD to complement the raid controller cache.
- I didn't know about this. How much of a difference would this make?
- good question, I don't know for sure either, or whether there would be any gains at all, but possibly yes. I've clarified the comment
performance: change statsd daemon
- We could. <https://github.com/armon/statsite> looks pretty good.
- Have you looked at what it's doing, using perf or Python's cProfile module? There could be some low-hanging fruit.
- You could also try running txStatsD on PyPy, which is already in apt. But you'd likely need to rebuild Twisted.
- yep, I'll take a look at statsite for example. I haven't looked in depth into what txStatsD is doing, but I doubt we could squeeze out significant gains (e.g. 7/8x)
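For the cProfile suggestion, a minimal sketch of what a profiling run could look like. `parse_metric` and `handle_batch` are hypothetical stand-ins, not txStatsD code; the real target would be whatever function txStatsD uses to process incoming packets.

```python
import cProfile
import io
import pstats


def parse_metric(line):
    """Hypothetical stand-in for a statsd line parser, e.g. 'foo.bar:1|c'."""
    name, rest = line.split(":", 1)
    value, kind = rest.split("|")[:2]
    return name, float(value), kind


def handle_batch(lines):
    """Hypothetical stand-in for the daemon's per-packet work."""
    return [parse_metric(line) for line in lines]


def profile_batch(lines, top=5):
    """Run handle_batch under cProfile; return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    handle_batch(lines)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    sample = ["app.requests:1|c"] * 100_000
    print(profile_batch(sample))
```

Sorting by cumulative time shows which call paths dominate, which should tell us whether the cost is in parsing, aggregation, or the network layer.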
availability: statsd+graphite clustered setup behind LVS (potentially painful operations-wise)
- Agreed; I don't think it's a great solution.
performance+availability: switch graphite backend
There are solutions to switch the graphite backend away from whisper but keep a compatible http interface.
- I think this is the better alternative. We should use a data store that was designed to be elastic, durable and fault-tolerant rather than cobble one together from bric-a-brac. It's possible to MacGyver scalability out of Carbon but it requires an insane amount of glue.
One of the alternative graphite backends is influxdb
- It's way too new. Distributed data stores are hard to get right.
another option is cassandra for storage, via cyanite using graphite-cyanite
- Cassandra would be nice. Cyanite looks less mature than I had hoped. Are people using it?
- no idea offhand, there is some activity on the GitHub issues page though
performance+availability: opentsdb backend
- This looks like the most mature option. If we go with it, we should probably consider adopting its data and query model. I'm pretty sure someone in the ops team used it before, but now I can't remember who. It may have been you, or perhaps Chase. It'd be interesting to hear what this person thinks.
- I have used it before, maybe others have too? I like its model better, but its post-retrieval processing capabilities are not as advanced as graphite's
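To illustrate what adopting the OpenTSDB data model might involve, here is a sketch that turns a Graphite plaintext line (`path value timestamp`) into an OpenTSDB `put` line (`put metric timestamp value tagk=tagv ...`). The rule for deciding which path components become tags is purely a hypothetical convention; we would have to design the real mapping per metric family.

```python
def graphite_to_opentsdb(line, tag_keys):
    """Convert 'a.b.c value timestamp' into an OpenTSDB 'put' line.

    tag_keys names the leading path components that become tags; the
    remaining components form the metric name. This mapping is a
    hypothetical convention, not something either project defines.
    """
    if not tag_keys:
        raise ValueError("OpenTSDB requires at least one tag per datapoint")
    path, value, timestamp = line.split()
    parts = path.split(".")
    if len(parts) <= len(tag_keys):
        raise ValueError(f"path {path!r} is too short for tags {tag_keys}")
    tags = " ".join(f"{key}={val}" for key, val in zip(tag_keys, parts))
    metric = ".".join(parts[len(tag_keys):])
    return f"put {metric} {timestamp} {value} {tags}"


if __name__ == "__main__":
    # Hypothetical metric path; prints:
    # put cpu.user 1400000000 42.5 group=servers host=web01
    print(graphite_to_opentsdb("servers.web01.cpu.user 42.5 1400000000", ("group", "host")))
```

The point is that hierarchy positions in Graphite paths become explicit tags, so queries can aggregate across hosts without wildcard path matching.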
Two things you haven't brought up:
- Revisiting the aggregation/retention configuration and making sure it is sensible.
- Relieving load on txStatsD by sending it fewer metrics:
  - By changing the metric sampling factor, for applications that allow that to be configured (Swift does, IIRC).
  - By temporarily disabling statsd reporting for applications that don't have a credible operational need for metrics.
- indeed, I'm adding those too
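For reference, a sketch of how the sampling factor reduces load while keeping counts roughly accurate: the client sends only a fraction of events, tagged with `@rate` in the statsd wire format, and the server scales each one up by `1/rate`. The function names and the metric name are hypothetical, and this is not how txStatsD or Swift implement it internally.

```python
import random


def maybe_send(name, sample_rate, rng=random.random):
    """Client side: emit a counter increment with probability sample_rate."""
    if rng() < sample_rate:
        return f"{name}:1|c|@{sample_rate}"
    return None  # event dropped, nothing goes on the wire


def scaled_count(packet):
    """Server side: scale a sampled counter back up by 1/sample_rate."""
    _, rest = packet.split(":", 1)
    fields = rest.split("|")
    value = float(fields[0])
    rate = float(fields[2][1:]) if len(fields) > 2 else 1.0
    return value / rate


if __name__ == "__main__":
    events = 100_000
    sent = [maybe_send("swift.proxy.requests", 0.1) for _ in range(events)]
    packets = [p for p in sent if p is not None]
    estimate = sum(scaled_count(p) for p in packets)
    print(f"sent {len(packets)} packets, estimated {estimate:.0f} of {events} events")
```

With a rate of 0.1 the daemon sees roughly a tenth of the packets, at the cost of some statistical noise on low-volume counters.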
Thanks for the write-up, and for taking on this problem.