
Methodology flaws

I found a few flaws in your figures and methodology. They are not that important for Kafka, but since you're about to use those to plan for Hadoop DataNodes I think it's important to account for these now.

More specifically:

  • 720xd have 12 external bays and 2 internal 2.5" bays. You can use the internal bays for small disks for /, so you can use all of the external bays for data. That's 12 disks instead of 10 per server, which is significant. Even in the 720xd models that don't have the internal bays, you can just partition two of the twelve disks to have one small partition for /; we don't need two completely separate disks for that, which would be a waste. FWIW, that's what we did with ms-be in eqiad & esams, respectively.
  • There is no mention in this document at all of IOPS. Even when throughput is mentioned, it's all assumed to be "155MB/s", i.e. pure sequential traffic in ideal conditions (start of the platter). Those same disks under a random workload might not achieve even 1/10th of that. A disk's throughput isn't just its max sustained transfer rate; it's a lot more complicated than that. Moreover, even if the write workload is sequential, Kafka may fsync() often to guarantee the integrity of the data and hence increase IOPS and disk strain (databases usually have similar issues).
  • Similarly, there's no mention anywhere of RAID controllers & BBUs. First of all, there are problems with the H310s under multiple disks/high load that were a surprise to us in previous endeavours (Swift), so you should read the threads from back then and mention that. Second, the amount of battery-backed cache memory a controller has (e.g. H710 vs. H710P) can make a big difference for IOPS and should definitely be weighed here.
  • This probably isn't important for our Kafka brokers, but you'd also need to make the case for why we need separate 720xds rather than H810s + disk shelves, which add capacity and throughput, and how that weighs against cost. My instinct tells me that separate servers are better, but I'd like to see arguments for it specifically for the Kafka brokers & Hadoop DataNodes.
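To illustrate the sequential-vs-random gap the second bullet describes, a quick back-of-the-envelope in Python. The 155 MB/s figure is the one quoted in the document; the 80 random IOPS and 64 KB average request size are assumptions typical of a 7200 rpm SATA disk, not measurements:

```python
# Back-of-the-envelope: why IOPS dominate for random workloads.
# Assumed figures (illustrative, not measured from our hardware).

SEQ_MBPS = 155     # ideal sequential throughput, start of platter
RANDOM_IOPS = 80   # typical 7200 rpm SATA random IOPS (assumption)
IO_SIZE_KB = 64    # assumed average request size

# Effective throughput when every request costs a seek:
random_mbps = RANDOM_IOPS * IO_SIZE_KB / 1024.0

print(f"sequential: {SEQ_MBPS} MB/s")
print(f"random    : {random_mbps:.1f} MB/s "
      f"({random_mbps / SEQ_MBPS:.0%} of sequential)")
```

Under these assumptions the random case delivers about 5 MB/s, i.e. roughly 1/30th of the headline sequential number, which is why planning on "155MB/s" alone is optimistic.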


720xd have 12 external bays and 2 internal 2.5" bays.

720xd have 12 external bays and 2 internal 2.5" bays. You can use the internal bays for small disks for /, so you can use all of the external bays for data.

Awesome! I didn't know they had internal drives. Using the internal ones makes lots of sense. We should put these in the existing Kafka brokers (and Hadoop Datanodes) too.

RAID controllers and BBUs

Similarly, there's no mention anywhere of RAID controllers & BBUs.

I didn't consider this as an option, because of several notes from LinkedIn saying that JBOD works fine, but RAID has given them occasional issues.


RAID can potentially do better at balancing load between disks (although it doesn't always seem to) because it balances load at a lower level. The primary downside of RAID is that it is usually a big performance hit for write throughput and reduces the available disk space.

Another potential benefit of RAID is the ability to tolerate disk failures. However our experience has been that rebuilding the RAID array is so I/O intensive that it effectively disables the server, so this does not provide much real availability improvement.


If you lose one drive in a JBOD setup you will just re-replicate the data on that disk. It is similar to what you would do during RAID repair except that instead of having the data coming 100% from the mirror drives the load will be spread over the rest of the cluster.

The real downside of RAID is that if you lose one drive the machine with the dead drive will be down until the drive is replaced/fixed. RAID allows you to continue using that machine even with the dead drive, although we have seen that there can be issues with this in practice.
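To make the rebuild-vs-re-replication tradeoff concrete, here's a rough sketch. All figures are assumptions for illustration (a 2 TB failed disk, ~100 MB/s sustained recovery rate per disk, re-replication spread over 20 source disks with writes likewise spread across brokers), not measurements from our cluster:

```python
# Back-of-the-envelope recovery-time comparison after one disk failure.
# All figures below are illustrative assumptions, not measurements.

LOST_TB = 2.0          # data that was on the failed disk
PER_DISK_MBPS = 100    # sustained recovery throughput per disk
SOURCE_DISKS = 20      # cluster disks sharing the re-replication load

lost_mb = LOST_TB * 1024 * 1024

# RAID rebuild: everything is read from the local mirror and written
# to the replacement disk, so a single disk pair bounds the rate.
raid_rebuild_hours = lost_mb / PER_DISK_MBPS / 3600

# JBOD re-replication: reads and writes are spread over the cluster,
# so the aggregate rate scales with the participating disks.
jbod_rereplication_hours = lost_mb / (PER_DISK_MBPS * SOURCE_DISKS) / 3600

print(f"RAID rebuild       : {raid_rebuild_hours:.1f} h on one disk pair")
print(f"JBOD re-replication: {jbod_rereplication_hours:.2f} h spread out")
```

The same amount of data moves either way; the difference is that the RAID rebuild concentrates hours of I/O on one machine, while re-replication dilutes it across the cluster.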

I agree with this post -- at both Yahoo and my other employer, RAID was used on the datanodes for this reason. And yes, sometimes this caused problems as nodes slowed down during rebuilds and had other problems. I'm willing to go with JBOD initially to get more storage. I also don't understand how we service machines with issues -- do we have people onsite at datacenters who can swap out disks? The main problem here is that I won't actually have to fix the machines when they break. Tnegrin (talk) 05:54, 24 April 2014 (UTC)

However, LinkedIn does claim that they are using RAID 10 in production:

We have 8x7200 rpm SATA drives in a RAID 10 array. In general this is the performance bottleneck, and more disks is better. Depending on how you configure flush behavior you may or may not benefit from more expensive disks (if you flush often then higher RPM SAS drives may be better).
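For comparison's sake, a quick sketch of what RAID 10 costs in usable capacity and ideal sequential write bandwidth versus JBOD on the same spindles. The disk count matches the 12-bay 720xd discussed above; the disk size and per-disk throughput are illustrative assumptions:

```python
# Rough comparison of usable capacity and ideal sequential write
# bandwidth: 12 data disks as JBOD vs RAID 10.
# Disk size and per-disk throughput are illustrative assumptions.

DISKS = 12
DISK_TB = 2.0
SEQ_MBPS = 155

jbod_tb = DISKS * DISK_TB                 # every spindle holds unique data
raid10_tb = DISKS * DISK_TB / 2           # mirroring halves usable capacity

jbod_write_mbps = DISKS * SEQ_MBPS        # ideal, fully parallel writes
raid10_write_mbps = DISKS * SEQ_MBPS / 2  # each write lands on two disks

print(f"usable capacity : JBOD {jbod_tb} TB vs RAID 10 {raid10_tb} TB")
print(f"ideal write rate: JBOD {jbod_write_mbps} MB/s "
      f"vs RAID 10 {raid10_write_mbps} MB/s")
```

This is the "big performance hit for write throughput and reduces the available disk space" tradeoff from the quote above, in numbers: under these assumptions RAID 10 gives up half of both, in exchange for surviving disk failures without cluster-level re-replication.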


I added more to the [[Analytics/Kraken/Kafka/Capacity#Throughput|Disk Throughput]] section, mostly about current rates and disk seeks caused by data not in page cache. I don't think that this is entirely what you are looking for, but it's a start for now.

Re: your comment about fsync(). This is configurable in Kafka, either by time or by number of messages. We currently flush once a second, but there is no reason we couldn't flush less often. A quick look at iostat shows that each disk gets around 20 write requests/second, and each disk's util% is around 20%. I'm a little green in this area and not sure which of these stats is more important to optimize for in this case; e.g. should I be more concerned with merged requests or with actual write requests to the disk? What should I try to tune the flush rate to? I'd appreciate an IRC brain-bounce chat about this at some point soon. :)
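For reference, those two flush knobs live in the broker config. A hedged sketch (the key names below are from the Kafka broker documentation and may differ by version; the values are purely illustrative, not a recommendation):

```
# Flush to disk after this many messages or after this much time,
# whichever comes first (illustrative values only).
log.flush.interval.messages=10000
log.flush.interval.ms=1000
```

Raising either value trades durability-on-crash for fewer fsync()s and lower IOPS per disk, which is exactly the tradeoff being discussed above.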