Kafka main raid performance testing 2019
In Q4 2019 we purchased 5 new 8xSSD Kafka-main hosts per site (codfw and eqiad).
Here are some performance comparisons between possible RAID-10 based storage layouts.
The fio utility was used for synthetic load testing with a workload in the same ballpark as Kafka's. Three tests were used: write, random read, and a combined write plus random read running at the same time.
# kafka-main-testing.fio
[global]
# synchronous direct I/O (page cache bypassed) from 24 concurrent jobs;
# 1G of I/O per job, with the first 10 seconds of each test excluded from the stats
ioengine=sync
direct=1
invalidate=1
ramp_time=10
size=1G
iodepth=1
per_job_logs=0
numjobs=24
group_reporting=1

# test 1: sequential writes, mixed 1k-16k block sizes
[write]
stonewall
bsrange=1k-16k
rw=write
write_bw_log=write.results
write_iops_log=write.results
write_lat_log=write.results

# test 2: random reads
[randread]
stonewall
bsrange=1k-16k
rw=randread
write_bw_log=randrw.results
write_iops_log=randrw.results
write_lat_log=randrw.results

# test 3: sequential writes with random reads running concurrently
# (randread-simul has no stonewall, so it runs at the same time as this job)
[write+randread]
stonewall
bsrange=1k-16k
rw=write
write_bw_log=write.results
write_iops_log=write.results
write_lat_log=write.results

[randread-simul]
bsrange=1k-16k
rw=randread
write_bw_log=simul.results
write_iops_log=simul.results
write_lat_log=simul.results
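Since the job file does not set a directory= option, fio creates its test files in the current working directory, so each run is presumably launched from a mount point on the array under test. A minimal invocation sketch (mount point is a placeholder):
cd /srv/kafka-test
fio kafka-main-testing.fio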
fio's output is quite verbose, so I've boiled it down below, focusing on bandwidth and iops.
Initially, I tested various raid10 chunk sizes and I/O schedulers using Linux md raid. After narrowing the scope, the highest-performing combination is compared against a hardware raid layout with the same attributes.
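For reference, assembling one of these combinations looks roughly like the sketch below. This is not the exact provisioning used on these hosts: the md device name, member disk names, and mount point are placeholders, and only the chunk size and scheduler change between the layouts tested.
# create an 8-disk raid10 array with the chunk size under test (512K shown)
mdadm --create /dev/md2 --level=10 --chunk=512 --raid-devices=8 /dev/sd[c-j]
# the io scheduler is set on the member disks, not on the md device itself
for disk in c d e f g h i j; do echo noop > /sys/block/sd${disk}/queue/scheduler; done
# format and mount the array for testing
mkfs.ext4 /dev/md2
mount /dev/md2 /srv/kafka-test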
raid10 linux md 256K chunk size
# raid10 ext4 linux md chunk size 256K deadline scheduler
write: (groupid=0, jobs=24): err= 0: pid=193083: Thu May 30 20:04:42 2019
write: io=15206MB, bw=986336KB/s, iops=130369, runt= 15787msec
randread: (groupid=1, jobs=24): err= 0: pid=193107: Thu May 30 20:04:42 2019
read : io=17983MB, bw=582986KB/s, iops=110751, runt= 31586msec
write+randread: (groupid=2, jobs=48): err= 0: pid=193132: Thu May 30 20:04:42 2019
read : io=21325MB, bw=453032KB/s, iops=83063, runt= 48202msec
write: io=17688MB, bw=821532KB/s, iops=108659, runt= 22047msec
# raid10 ext4 linux md chunk size 256K cfq scheduler
write: (groupid=0, jobs=24): err= 0: pid=192904: Thu May 30 18:09:48 2019
write: io=19244MB, bw=607033KB/s, iops=80251, runt= 32462msec
randread: (groupid=1, jobs=24): err= 0: pid=192928: Thu May 30 18:09:48 2019
read : io=18903MB, bw=473222KB/s, iops=89029, runt= 40905msec
write+randread: (groupid=2, jobs=48): err= 0: pid=192953: Thu May 30 18:09:48 2019
read : io=21374MB, bw=320753KB/s, iops=58780, runt= 68237msec
write: io=21252MB, bw=401129KB/s, iops=53060, runt= 54252msec
# raid10 ext4 linux md chunk size 256K noop scheduler
write: (groupid=0, jobs=24): err= 0: pid=192746: Thu May 30 17:25:47 2019
write: io=19617MB, bw=520280KB/s, iops=68785, runt= 38610msec
randread: (groupid=1, jobs=24): err= 0: pid=192770: Thu May 30 17:25:47 2019
read : io=17979MB, bw=580913KB/s, iops=110359, runt= 31693msec
write+randread: (groupid=2, jobs=48): err= 0: pid=192794: Thu May 30 17:25:47 2019
read : io=22232MB, bw=335443KB/s, iops=60914, runt= 67868msec
write: io=20277MB, bw=462938KB/s, iops=61239, runt= 44851msec
raid10 linux md 512K chunk size
# raid10 ext4 linux md chunk size 512K deadline scheduler
write: (groupid=0, jobs=24): err= 0: pid=126114: Thu May 30 02:07:12 2019
write: io=17424MB, bw=977625KB/s, iops=129228, runt= 18251msec
randread: (groupid=1, jobs=24): err= 0: pid=126139: Thu May 30 02:07:12 2019
read : io=18042MB, bw=572255KB/s, iops=108639, runt= 32284msec
write+randread: (groupid=2, jobs=48): err= 0: pid=126163: Thu May 30 02:07:12 2019
read : io=22113MB, bw=327702KB/s, iops=59583, runt= 69098msec
write: io=20802MB, bw=460898KB/s, iops=60965, runt= 46216msec
# raid10 ext4 linux md chunk size 512K cfq scheduler
write: (groupid=0, jobs=24): err= 0: pid=126269: Thu May 30 02:20:50 2019
write: io=19770MB, bw=551746KB/s, iops=72949, runt= 36692msec
randread: (groupid=1, jobs=24): err= 0: pid=126294: Thu May 30 02:20:50 2019
read : io=18854MB, bw=467789KB/s, iops=88056, runt= 41271msec
write+randread: (groupid=2, jobs=48): err= 0: pid=126319: Thu May 30 02:20:50 2019
read : io=21365MB, bw=323074KB/s, iops=59211, runt= 67719msec
write: io=21044MB, bw=408179KB/s, iops=53992, runt= 52793msec
# raid10 ext4 linux md chunk size 512K noop scheduler
write: (groupid=0, jobs=24): err= 0: pid=126414: Thu May 30 02:24:34 2019
write: io=17885MB, bw=993139KB/s, iops=131284, runt= 18441msec
randread: (groupid=1, jobs=24): err= 0: pid=126438: Thu May 30 02:24:34 2019
read : io=18013MB, bw=574555KB/s, iops=109113, runt= 32104msec
write+randread: (groupid=2, jobs=48): err= 0: pid=126463: Thu May 30 02:24:34 2019
read : io=21305MB, bw=443088KB/s, iops=81257, runt= 49238msec
write: io=19089MB, bw=822288KB/s, iops=108765, runt= 23772msec
raid10 linux md 1024K chunk size
# raid10 ext4 linux md chunk size 1024K deadline scheduler
write: (groupid=0, jobs=24): err= 0: pid=184316: Thu May 30 14:07:49 2019
write: io=21329MB, bw=510904KB/s, iops=67551, runt= 42749msec
randread: (groupid=1, jobs=24): err= 0: pid=184341: Thu May 30 14:07:49 2019
read : io=17997MB, bw=562859KB/s, iops=106909, runt= 32742msec
write+randread: (groupid=2, jobs=48): err= 0: pid=184365: Thu May 30 14:07:49 2019
read : io=21557MB, bw=278354KB/s, iops=50911, runt= 79302msec
write: io=22420MB, bw=346021KB/s, iops=45773, runt= 66350msec
# raid10 ext4 linux md chunk size 1024K cfq scheduler
write: (groupid=0, jobs=24): err= 0: pid=184471: Thu May 30 14:18:59 2019
write: io=23296MB, bw=372577KB/s, iops=49262, runt= 64027msec
randread: (groupid=1, jobs=24): err= 0: pid=184496: Thu May 30 14:18:59 2019
read : io=18871MB, bw=467244KB/s, iops=87936, runt= 41358msec
write+randread: (groupid=2, jobs=48): err= 0: pid=184523: Thu May 30 14:18:59 2019
read : io=21433MB, bw=282170KB/s, iops=51676, runt= 77782msec
write: io=21886MB, bw=318494KB/s, iops=42131, runt= 70366msec
# raid10 ext4 linux md chunk size 1024K noop scheduler
write: (groupid=0, jobs=24): err= 0: pid=184617: Thu May 30 14:32:53 2019
write: io=23050MB, bw=510122KB/s, iops=67450, runt= 46269msec
randread: (groupid=1, jobs=24): err= 0: pid=184641: Thu May 30 14:32:53 2019
read : io=17958MB, bw=562896KB/s, iops=106961, runt= 32669msec
write+randread: (groupid=2, jobs=48): err= 0: pid=184666: Thu May 30 14:32:53 2019
read : io=21398MB, bw=341749KB/s, iops=62611, runt= 64115msec
write: io=23056MB, bw=473598KB/s, iops=62656, runt= 49850msec
raid10 linux md 4096K chunk size
# raid10 ext4 linux md chunk size 4096K deadline scheduler
write: (groupid=0, jobs=24): err= 0: pid=132515: Thu May 30 04:27:33 2019
write: io=22398MB, bw=194475KB/s, iops=25712, runt=117934msec
randread: (groupid=1, jobs=24): err= 0: pid=132540: Thu May 30 04:27:33 2019
read : io=18047MB, bw=572786KB/s, iops=108737, runt= 32263msec
write+randread: (groupid=2, jobs=48): err= 0: pid=132564: Thu May 30 04:27:33 2019
read : io=21163MB, bw=283403KB/s, iops=52050, runt= 76467msec
write: io=22667MB, bw=186081KB/s, iops=24617, runt=124735msec
# raid10 ext4 linux md chunk size 4096K cfq scheduler
write: (groupid=0, jobs=24): err= 0: pid=132371: Thu May 30 04:20:29 2019
write: io=22600MB, bw=183568KB/s, iops=24271, runt=126070msec
randread: (groupid=1, jobs=24): err= 0: pid=132399: Thu May 30 04:20:29 2019
read : io=18812MB, bw=473933KB/s, iops=89249, runt= 40647msec
write+randread: (groupid=2, jobs=48): err= 0: pid=132423: Thu May 30 04:20:29 2019
read : io=21699MB, bw=259523KB/s, iops=47397, runt= 85618msec
write: io=23368MB, bw=224384KB/s, iops=29684, runt=106642msec
# raid10 ext4 linux md chunk size 4096K noop scheduler
write: (groupid=0, jobs=24): err= 0: pid=132219: Thu May 30 04:13:01 2019
write: io=21247MB, bw=406205KB/s, iops=53708, runt= 53561msec
randread: (groupid=1, jobs=24): err= 0: pid=132243: Thu May 30 04:13:01 2019
read : io=18033MB, bw=579558KB/s, iops=110039, runt= 31862msec
write+randread: (groupid=2, jobs=48): err= 0: pid=132267: Thu May 30 04:13:01 2019
read : io=21398MB, bw=292549KB/s, iops=53598, runt= 74900msec
write: io=22242MB, bw=380829KB/s, iops=50379, runt= 59807msec
raid10 linux md 512K chunk size with TRIM
# raid10 ext4 linux md chunk size 512K noop scheduler with TRIM enabled via ext4 discard option
write: (groupid=0, jobs=24): err= 0: pid=126565: Thu May 30 02:31:26 2019
write: io=18843MB, bw=937654KB/s, iops=123962, runt= 20578msec
randread: (groupid=1, jobs=24): err= 0: pid=126590: Thu May 30 02:31:26 2019
read : io=17928MB, bw=573414KB/s, iops=108991, runt= 32015msec
write+randread: (groupid=2, jobs=48): err= 0: pid=126615: Thu May 30 02:31:26 2019
read : io=21472MB, bw=437143KB/s, iops=80024, runt= 50298msec
write: io=18078MB, bw=787586KB/s, iops=104178, runt= 23505msec
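For the TRIM run above, the discard behaviour comes from the ext4 discard mount option rather than anything at the raid layer; roughly (device and mount point as in the earlier sketch):
mount -o discard /dev/md2 /srv/kafka-test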
raid10 512K stripe size hardware vs software
# raid10 ext4 hwraid 512K stripe noop scheduler
write: (groupid=0, jobs=24): err= 0: pid=2833: Fri May 31 04:28:53 2019
write: io=17665MB, bw=705690KB/s, iops=93283, runt= 25633msec
randread: (groupid=1, jobs=24): err= 0: pid=2857: Fri May 31 04:28:53 2019
read : io=18794MB, bw=491197KB/s, iops=92519, runt= 39179msec
write+randread: (groupid=2, jobs=48): err= 0: pid=2881: Fri May 31 04:28:53 2019
read : io=21997MB, bw=318370KB/s, iops=57959, runt= 70750msec
write: io=19910MB, bw=478322KB/s, iops=63275, runt= 42624msec
# raid10 ext4 linux md 512K chunk noop scheduler
write: (groupid=0, jobs=24): err= 0: pid=126414: Thu May 30 02:24:34 2019
write: io=17885MB, bw=993139KB/s, iops=131284, runt= 18441msec
randread: (groupid=1, jobs=24): err= 0: pid=126438: Thu May 30 02:24:34 2019
read : io=18013MB, bw=574555KB/s, iops=109113, runt= 32104msec
write+randread: (groupid=2, jobs=48): err= 0: pid=126463: Thu May 30 02:24:34 2019
read : io=21305MB, bw=443088KB/s, iops=81257, runt= 49238msec
write: io=19089MB, bw=822288KB/s, iops=108765, runt= 23772msec
Conclusions
Of these combinations, linux md raid at 256K and 512K chunk sizes (with the deadline and noop schedulers respectively) stood out as offering the highest iops.
- 512K chunk md raid10 with noop scheduler
  - 131K iops write
  - 109K iops randread
  - 81K/108K iops simultaneous randread/write
- 256K chunk md raid10 with deadline scheduler
  - 130K iops write
  - 110K iops randread
  - 83K/108K iops simultaneous randread/write
When comparing hardware and software raid using the 512K chunk/stripe with noop scheduler, hardware performed significantly worse.
- 512K stripe hw raid10 with noop scheduler
  - 93K iops write
  - 92K iops randread
  - 57K/63K iops simultaneous randread/write
- 512K chunk md raid10 with noop scheduler
  - 131K iops write
  - 109K iops randread
  - 81K/108K iops simultaneous randread/write
Enabling the TRIM/discard feature on the 512K md raid10, at the time of this writing, appears to cause a modest performance hit (roughly 5% fewer write iops).
- 512K chunk md raid10 with noop scheduler, ext4
  - 131K iops write
  - 109K iops randread
  - 81K/108K iops simultaneous randread/write
- 512K chunk md raid10 with noop scheduler, ext4, TRIM enabled via the discard option
  - 123K iops write
  - 108K iops randread
  - 80K/104K iops simultaneous randread/write
Taking the above into account, 512K chunk md raid10 with ext4, the noop scheduler, and TRIM disabled looks to be a reasonable path forward. The larger chunk size (as opposed to 256K) should be beneficial for sequential workloads, while still offering good random read performance.