We'd like to be able to scale beyond that for future capacity as well as for future uses of Kafka. I will use 500,000 messages/second, or 100 MB/second as a round number to shoot for. If this is too high, we can re-estimate with 300,000 messages/second. (Note that 500,000 messages/second is about 100 MB/second of webrequest logs. Other uses of Kafka will likely have different message sizes.)
Current Eqiad Setup
We are running 2 Kafka brokers in eqiad. Each are Dell PowerEdge R720s with 12 cores @ 2.00GHz, 48G RAM, 12 x 2TB 7200RPM SAS disks (SEAGATE ST32000645SS). 2 of the disks are used as RAID 1 OS partitions. The remaining 10 disks are JBOD, each used as a Kafka data partition. Because we have 10 disks, we allocate 10 partitions per topic with a random partitioner. This allows Kafka to balance writes evenly across all 10 disks. Replication factor is 2, which means that each broker sees the same data. Half of the data is produced directly to each broker, and the other half is replicated between the brokers.
We started sending mobile webrequest logs to Kafka back in December, which amounted to max 6000 messages/second 800 KB/second. In late January, we added bits webrequest logs, which increased this rate to max 50,000 messages/second and 9 MB/second.
Note: the bump in early February was caused by a broker being offline. The remaining broker took over as leader for all partitions during that time.
Kafka relies heavily on OS pagecache. The actual JVM process does not need much memory. I have not tuned the JVM memory beyond the defaults (-Xms256mb -Xmx1024Gb). The Kafka JVM process has consistently hovered at a max heap memory usage of around 500 MB, independent of the amount of data being sent. Total memory used -/+ buffers/cache has remained around 4.5 GB independent of the amount of data being sent. Since Kafka writes all of its logs to disk, it allows the OS to fill up available memory with pagecache. The current brokers have 48 GB RAM. Cached memory usage has hovered around 45 GB independent of the amount of data being sent.
From Kafka Operations Documentation: "You need sufficient memory to buffer active readers and writers. You can do a back-of-the-envelope estimate of memory needs by assuming you want to be able to buffer for 30 seconds and compute your memory need as write_throughput*30"
If we want to be able to scale to 100 MB/second, we'd need a minimum of 3 GB pagecache available. LinkedIn says they are running their brokers with 24 GB. I'd like to make this our minimum memory broker requirement as well. We have seen one occasion where during a a strange zookeeper connection problem where a broker has used more memory than 4.5 GB.
Broker CPU load does naturally increase with more traffic. At the mobile 6000 messages/second rate, both Kafka brokers hovered at around 4% CPU utilization. Since turning on bits in late January, CPU utilization is now hovering at around 12%. A quick failover test indicates that CPU load does not signifigantly change when one broker is the leader for all partitions (instead of the usual half). This makes sense, since (in our current setup) both brokers must ultimately write the same data to disk anyway.
I am not sure how CPU load will scale as we add more traffic. I do not think that the two datapoints (4% and 12%) are enough to extrapolate. If we should start overloading CPUs, we will have to scale by adding more brokers than we have configured replicas.
"In general, disk throughput is the bottleneck". Our current brokers have 10x 2TB disks. According to this Seagate datasheet, these disks have a max sustained transfer rate of 155 MB/second. At our target of 100 MB/second total of Kafka data spread over 10 disks, we need to be able to write 10 MB/second per disk.
As of 2014-04, at peak time a Kafka brokers are ingesting about 16 MB/second total at peak load time. (In 2014-04, more data to Kafka than when this document was originally compiled.) With 2 brokers and replication to both brokers for every partition, each disk is writing about 1.9 MB/second. Camus is currently consuming all data from Kafka every 10 minutes. By only needing to read the 10 most recent minutes of data, most if not all of the data Camus needs to consume should be able to fit in our current page cache of 45 GB. During regular Camus runs, the peak rate for a single disk is around 350 KB/second.
As long as we are able to fit all data needed for regular Camus imports in each Broker's page cache, we will keep disk seeks for reads to a minimum. Kafka writes data sequentially, which should allow for high disk throughput.
With the current (2014-04) Broker/disk configuration at 16 MB/second, each disk hovers around 20% utilization. webrequest_text was added to Kafka on 2014-04-22, which almost double the amount of data we were previously producing. CPU iowait doubled when webrequest_text was added.
These disks are currently formatted ext3 (we plan on reformatting these to ext4 soon) and report 1.8 TB of usable space. That's 18TB. The 7 day Kafka log buffer at 50,000 messages/second is currently 4.2 TB (snappy compressed JSON logs).
The eqiad Kafka cluster will aggregate all datacenter traffic. Webrequest log traffic maxes at around 205,000 messages/second. Scaling that up based on current space usage in Kafka is 18TB (205K/50K * 4.2TB), and we’ve got only 18TB right now. If possible, I'd like to install 2 cheap 2.5" drives in the internal bays, and use all 12 disks for Kafka. If this is not possible, then I will reinstall these nodes with a small / partition on the first 2 drives, which will allow us to still utilize most of the space on those drives. That would bring us close to 24TB of usable storage, which should be plenty for the time being.
This section is a work in progress at this time. See ops mailing list discussion titled 'Kafka broker requirements'.
- +1 broker identical to what we have now: Dell PowerEdge R720s with 12 cores @ 2.00GHz, 48G RAM, 12 x 2TB 7200RPM SAS disks (SEAGATE ST32000645SS), + 2 x 2.5" drives for / (if possible).
- This is more for redundancy than for scaling purposes. It will allow us to take one broker offline for maintenance without introducing a temporary SPOF.
New (primary) Datacenter
If we are aiming to set up a fully redundant Kafka cluster in the new datacenter, this cluster should have identical (or better) specs as eqiad's. That would be:
- +3 Dell PowerEdge R720s with 12 cores @ 2.00GHz, 48G RAM, 12 x 2TB 7200RPM SAS disks (SEAGATE ST32000645SS)
Since we may have to worry about disk space in eqiad soon, it would be nice if we could increase the storage capacity of these new nodes somehow. Are there 4TB 7200RPM SAS drives we could use? :)
We are still determining if we need to host Kafka clusters in remote datacenters for log aggregation there. Doing so would be the recommended Kafka setup, although producing across DCs seems like it should be possible.
If we are going to maintain Kafka clusters in other DCs, we should have 3 brokers for redundancy purposes. Since Kafka clusters in other DCs will not need to handle as much data as eqiad, we can relax the requirements. I'd prefer if we could keep the number of disks identical, but we can do with less storage, less memory, and probably fewer cores.
- +3 Kafka broker servers in each remote Datacenter.
- 24 GB RAM
- 12 disks preferred.