Analytics/Archive/Hadoop Logging - Solutions Recommendation

From Wikitech

Our recommendation is Apache Kafka, a distributed pub-sub messaging system designed for throughput. We evaluated about a dozen[1] best-of-breed systems drawn from the domains of distributed log collection, CEP / stream processing, and real-time messaging systems. While these systems offer surprisingly similar features, they differ substantially in implementation, and each is specialized to a particular work profile (a more thorough technical discussion is available as an appendix).

Kafka stands out because it is specialized for throughput and explicitly distributed in all tiers of its architecture. Interestingly, it is also concerned enough with resource conservation[2] to offer sensible tradeoffs that loosen guarantees in exchange for performance — something that may not strike Facebook or Google as an important feature in the systems they design. Constraints breed creativity.

In addition, Kafka has several perks of particular interest to Operations readers. First, while it is written in Scala, it ships with a native C++ producer library that can be embedded in a module for our cache servers, obviating the need to run the JVM on those servers. Second, producers can be configured to batch requests to optimize network traffic, but they do not create a persistent local log, which would require additional maintenance. Third, Kafka's I/O and memory usage are left up to the OS rather than the JVM[3].

Kafka was written by LinkedIn and is now an Apache project. In production at LinkedIn, approximately 10,000 producers are handled by eight Kafka servers per datacenter. These clusters consolidate their streams into a single analytics datacenter, which Kafka supports out of the box via a simple mirroring configuration.

These features are a very apt fit for our intended use cases; even those we don't intend to use — such as sharding and routing by "topic" categories — are interesting and might prove useful in the future as we expand our goals.

The rest of this document dives into these topics in greater detail.

The Firehose

Distributed log aggregation is a non-trivial problem for us because of The Firehose. At present, the entire request stream is estimated to be ~150k requests/second, handled by 635 servers, which generate around 117 GB/hour (uncompressed), or about 2.75 TB/day. The load is geographically distributed, requiring that disparate streams be aggregated. As with all distributed programming challenges, we must expect connections to experience periods of high variance, and participants to randomly fail or disappear.
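As a rough sanity check on these figures, the arithmetic can be sketched as below. This is a back-of-the-envelope calculation, not a measurement. It assumes binary units (1 GB = 1024^3 bytes), which is the reading under which 117 GB/hour lines up with roughly 2.75 TB/day; the function and key names are illustrative only.

```python
# Figures quoted in the text above.
REQUESTS_PER_SECOND = 150_000
GB_PER_HOUR = 117
SERVERS = 635


def firehose_summary(req_per_s=REQUESTS_PER_SECOND,
                     gb_per_hour=GB_PER_HOUR,
                     servers=SERVERS):
    """Derive daily volume and per-request averages from the headline numbers."""
    # Assumption: "GB" means binary gigabytes (1024**3 bytes).
    bytes_per_second = gb_per_hour * 1024**3 / 3600
    return {
        "tib_per_day": gb_per_hour * 24 / 1024,
        "mib_per_second": bytes_per_second / 1024**2,
        "bytes_per_request": bytes_per_second / req_per_s,
        "requests_per_server_per_second": req_per_s / servers,
    }


if __name__ == "__main__":
    for name, value in firehose_summary().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the stream averages a little over 230 bytes of log data per request, which gives a sense of the message size any candidate system has to handle efficiently.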


With this clearer view of the problem, here are what we see as the requirements:

  • Performance: high throughput so as to handle the firehose.
  • Horizontal Scaling: clustered and distributed, providing fault tolerance and automated recovery, transport reliability, and durability for received messages.
  • Producer Client Performance & Stability: agent running on production servers must not impact stability of user-facing servers.
  • Battle Tested: must be in production, handling Big Data somewhere else.
  • Simplicity: low maintenance cost, amenable to automation.
  • Independence: not tied to any producer platform or storage method (e.g., not specialized for HDFS).


Apache Kafka is a distributed pub-sub messaging system designed for throughput. The Kafka homepage contains a concise statement of its virtues:

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
  • Support for parallel data load into Hadoop.
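To illustrate what "per-partition ordering semantics" means in practice, here is a minimal in-memory sketch. It is not Kafka's partitioner or storage format: the CRC32 hash, the class name `PartitionedTopic`, and the host name `cp1001` are all assumptions made for the example. The point is only that messages sharing a key land in the same partition, and each partition preserves arrival order.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in hash, chosen because it
    is stable across runs (unlike Python's built-in hash() for strings)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedTopic:
    """Toy topic: a fixed number of append-only lists, one per partition."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, payload: bytes):
        """Append a payload and return (partition, offset within partition)."""
        p = partition_for(key, self.num_partitions)
        self.partitions[p].append(payload)
        return p, len(self.partitions[p]) - 1


if __name__ == "__main__":
    topic = PartitionedTopic(num_partitions=4)
    for i in range(3):
        # "cp1001" is a hypothetical producer host name used as the key.
        print(topic.append("cp1001", f"request-{i}".encode()))
```

Ordering is guaranteed only within a partition; there is no global order across partitions, which is what allows consumption to be spread over a cluster of consumers.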


The project was developed by LinkedIn for its activity stream, and forms the backbone of its data processing pipeline. That is to say: several years ago, LinkedIn needed to solve exactly the problem we're looking at, and found all the other solutions lacking. Others didn't put emphasis on throughput or resource efficiency, or they had awkward logfile-based semantics. So they wrote Kafka, and now it is used by dozens of companies (including awesome places like Tumblr and AddThis) who found themselves in the same place.

Kafka has an active community and a solid roadmap. The team is small (about 5 core committers, it seems) but well-organized and productive. The authors also published several research papers on the system when it was first open sourced.


Kafka is a publish-subscribe message passing system. This section attempts to provide relevant architectural details without getting into technical detail for its own sake.

The elemental unit of transmission in Kafka is a single message -- it has no concept of log files when interacting with producers, nor databases when sending data to consumers. This eliminates a number of awkward problems in serializing or streaming data, and decouples the concept of a producer from its machine and application. All messages are categorized with a "topic", which is used for routing messages to consumers. For the most part, we can ignore the details of this for our discussion.

The agents in a Kafka system are split into three roles:

  • A Producer originates messages; in other systems this may be called a "source" or "client". Each application that generates log data of analytic interest will be acting as a producer -- thus each Squid, Apache, and/or Varnish instance is a candidate.
  • A Broker (with "Kafka Server" used interchangeably) receives messages from producers, handles message persistence, and routes them to the appropriate topic queues.
  • A Consumer processes messages from a topic queue. Though much of Kafka's architecture concerns the relationship between consumers, message consumption, and brokers, we'll be eliding that detail as nearly all consumers will live in the analytics cluster.
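The three roles above can be modelled in a few lines. The following is a toy, single-process sketch meant only to show how the responsibilities divide; it is not Kafka's API, and every class and method name here is hypothetical. It assumes a pull model in which each consumer tracks its own read offset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    topic: str      # routing category
    payload: bytes  # opaque message body


class Broker:
    """Receives messages, persists them (here: in memory), serves reads."""

    def __init__(self):
        self._logs = defaultdict(list)  # topic -> append-only list of payloads

    def receive(self, message: Message) -> int:
        log = self._logs[message.topic]
        log.append(message.payload)
        return len(log) - 1  # offset of the stored message

    def fetch(self, topic: str, offset: int, max_messages: int = 100):
        return self._logs.get(topic, [])[offset:offset + max_messages]


class Producer:
    """Originates messages; knows only how to reach the broker."""

    def __init__(self, broker: Broker):
        self._broker = broker

    def send(self, topic: str, payload: bytes) -> None:
        self._broker.receive(Message(topic, payload))


class Consumer:
    """Reads one topic, remembering how far it has read."""

    def __init__(self, broker: Broker, topic: str):
        self._broker = broker
        self._topic = topic
        self.offset = 0

    def poll(self):
        batch = self._broker.fetch(self._topic, self.offset)
        self.offset += len(batch)
        return batch
```

Because the consumer keeps its own offset, a slow consumer falls behind without blocking the producer; the broker simply retains the messages until they are read or cleaned up.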

Agent membership and cluster state are coordinated via ZooKeeper, a centralized service that provides functionality common to distributed applications: group consensus, membership and presence, distributed synchronization, naming, and cluster configuration management. It is distributed and highly reliable.

This separation of concerns has several beneficial effects:

  • The publish-subscribe model means that producers only need to know how to connect to the broker cluster. The cluster itself can determine the optimal distribution of producers to brokers, conduct load-balancing, and mediate failover. This simplifies all configuration to the point that one uniform config can be used for all hosts in a producer cluster.
  • Producers are intended to be thin, dumb clients: no logs are stored on disk before being sent to the broker, no acks are sent, and there are no retries. To mitigate the risk of messages being lost to network difficulties, Kafka uses TCP as the transport and layers its own transmission format on top.
  • Kafka's message format includes compression (gzip or snappy) and the ability to recursively embed messages, enabling very efficient network transmission. The buffer window and/or size can be controlled via configuration.
  • The clear separation of concerns allows for easy multi-datacenter mirroring: a consumer from one cluster can be a producer for another. Kafka ships with this functionality built in.
  • As messages are ordered by arrival timestamp as determined by the broker, there is no need for a single master anywhere in Kafka; all brokers are peers.
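The batching and compression behaviour described above can be sketched as a small buffer that flushes on a size or age threshold. This is an illustration of the idea rather than Kafka's producer implementation: the class name, parameter names, and newline framing are assumptions, and real Kafka uses its own message-set format rather than newline-joined payloads.

```python
import gzip
import time


class BatchingBuffer:
    """Accumulate payloads in memory and hand them to `send` as one
    gzip-compressed batch. Nothing is written to disk, and a failed send
    is not retried, mirroring the "thin, dumb client" description."""

    def __init__(self, send, max_messages=200, max_age_seconds=1.0,
                 clock=time.monotonic):
        self._send = send                # callable taking one bytes argument
        self._max_messages = max_messages
        self._max_age = max_age_seconds
        self._clock = clock              # injectable for testing
        self._pending = []
        self._oldest = None              # arrival time of first pending payload

    def add(self, payload: bytes) -> None:
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(payload)
        too_many = len(self._pending) >= self._max_messages
        too_old = self._clock() - self._oldest >= self._max_age
        # Note: age is only checked when a new payload arrives; a real client
        # would also flush from a timer.
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        # Assumption: payloads contain no newlines, so "\n" is a safe separator.
        body = b"\n".join(self._pending)
        self._pending = []
        self._oldest = None
        self._send(gzip.compress(body))
```

A usage example: `BatchingBuffer(sock.sendall, max_messages=500)` would send one compressed frame per 500 log lines over an already-connected TCP socket `sock` (hypothetical here). Larger batches compress better and cost fewer round trips, at the price of losing more messages if the process dies before a flush.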


The operational profile, maintenance needs, and configuration of any logging solution are of particular concern to us in making this recommendation.

First, and most critically, the architecture should not in any way imperil production stability. Similarly, if less directly, it should perform well enough that no network or memory starvation occurs, which could reduce user-facing service to an unacceptable level. In addition to these essential requirements, we prefer systems built with simplicity and automation in mind. The less code and the less configuration, the better.

  • Centralized broker configuration via ZooKeeper.
  • Publish-subscribe (via ZooKeeper) for cluster membership and role determination.
  • Automatic producer load-balancing for the broker cluster.
  • Policies for time-based log cleanup on brokers.
  • Excellent built-in monitoring via JMX.
  • I/O reduced by using sendfile() to avoid extra copying between memory spaces[4].
  • Messages stored as real files; lets the OS handle caching[5].
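As a sketch of what a time-based cleanup policy amounts to, the function below splits a broker's log segments into those to keep and those past the retention window. It is a simplified model, not Kafka's cleaner: the `Segment` type, its fields, and the file names are hypothetical, and it assumes whole segment files are the unit of deletion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str             # hypothetical segment file name
    last_modified: float  # seconds since the epoch


def split_by_retention(segments, now: float, retention_hours: float):
    """Return (keep, expired) given a retention window in hours."""
    cutoff = now - retention_hours * 3600
    keep = [s for s in segments if s.last_modified >= cutoff]
    expired = [s for s in segments if s.last_modified < cutoff]
    return keep, expired


if __name__ == "__main__":
    now = 1_000_000.0
    segments = [
        Segment("segment-old.log", now - 200 * 3600),
        Segment("segment-recent.log", now - 2 * 3600),
    ]
    keep, expired = split_by_retention(segments, now, retention_hours=168)
    print("keep:", [s.name for s in keep])
    print("delete:", [s.name for s in expired])
```

Because cleanup depends only on age, the disk space a broker needs is roughly its ingest rate multiplied by the retention window, which makes capacity planning straightforward.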

Recovery Testing

Resilience to failure is perhaps the most important quality of a distributed system. As more machines participate, the expectation of a random, uncontrollable failure increases. We did not want to get bogged down benchmarking under likely-unrealistic conditions, but we definitely wanted some idea of Kafka's behavior when random things die. So we conducted three simple tests[6].

First, we failed a sub-quorum number of ZooKeeper nodes. Kafka uses ZooKeeper for cluster membership and presence notifications, as well as distributed configuration management. Losing ZooKeeper nodes should never impact performance, and should only impact system stability if a majority fail. This test was pro forma: because ZooKeeper is used by so many distributed systems, we expected to see no impact whatsoever, and were pleased to see that result. There was no interruption of service, no loss of data, and no impact on performance.

Second, we failed a majority of ZooKeeper nodes. As ZooKeeper is not on the critical path of message production, routing, or consumption -- it is used only when configuration or group membership changes -- we did not expect to see any impact on message services, and we saw none. The only effect, also expected, was that new nodes were not able to join the cluster while ZooKeeper was unavailable.
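The distinction between the first two tests comes down to majority arithmetic. The sketch below shows it for an ensemble of arbitrary size; the ensemble sizes used in the example are illustrative and are not the sizes used in our tests, which are not recorded here.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, failed: int) -> bool:
    """True if the surviving nodes still form a majority."""
    return ensemble_size - failed >= quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "nodes: quorum", quorum_size(n),
              "tolerates", tolerated_failures(n), "failure(s)")
```

Note that a four-node ensemble tolerates no more failures than a three-node one, which is why odd ensemble sizes are the usual choice.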

Finally, we wanted to test Kafka failover. We set up a cluster with two Kafka brokers (A and B), a single producer, and a single consumer. We configured the producer to continuously send messages, and connected it to the cluster, observing that it was handed off to broker A. After getting baseline readings, we failed broker A. Failover to B occurred in less than one second; during that window, approximately 12% of messages sent by the producer (a few hundred) were lost. We also restarted A, and then failed B, with identical results.

These results are encouraging. In production, failover would obviously affect throughput, so it's important for us to have headroom in both bandwidth and storage to absorb the spillover with a smaller cluster during failure periods. Our estimates indicate a single Kafka server should be able to handle the logging load of a single datacenter, so a pair of brokers should provide both failover capability and sufficient headroom.
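The headroom argument can be made concrete with a small calculation. The sketch below assumes load is spread evenly over the surviving brokers and that every broker has the same capacity; the numbers in the example are hypothetical placeholders, not measured figures.

```python
def utilization_after_failures(load: float, per_broker_capacity: float,
                               brokers: int, failed: int) -> float:
    """Fraction of surviving capacity consumed by `load`.

    `load` and `per_broker_capacity` must use the same unit
    (for example MB/s). A result above 1.0 means the cluster is overloaded.
    """
    survivors = brokers - failed
    if survivors <= 0:
        raise ValueError("no brokers left to carry the load")
    return load / (survivors * per_broker_capacity)


if __name__ == "__main__":
    # Hypothetical: a datacenter producing 80 units of load,
    # brokers rated at 100 units each.
    print(utilization_after_failures(80, 100, brokers=2, failed=0))
    print(utilization_after_failures(80, 100, brokers=2, failed=1))
```

If a single broker can carry a datacenter's full load, a pair stays below full utilization even with one broker down, which is the property the recommendation relies on.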

Case Study: Kafka at LinkedIn

LinkedIn has published quite a bit about their data pipeline and its performance. A summary:

  • Writes: 10 billion msgs/day
    • peak 172k msgs/sec
    • peak ingress 31 MB/sec
    • ~10,000 producers/colo
    • 8 brokers/colo
  • Reads: 55 billion msgs/day
    • peak egress 171 MB/sec
    • 32 consumers
  • Average latency ~18sec (end-to-end) to reach all consumers
    • worst-case at peak only 30sec
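A few derived figures help put these numbers in context. The calculation below uses only the values listed above; it assumes decimal megabytes and that the peak message rate and peak ingress bandwidth coincide, neither of which is confirmed by the published figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values from the summary above.
WRITES_PER_DAY = 10e9
READS_PER_DAY = 55e9
PEAK_WRITES_PER_SEC = 172_000
PEAK_INGRESS_MB_PER_SEC = 31
PEAK_EGRESS_MB_PER_SEC = 171


def linkedin_derived():
    """Compute averages and ratios implied by the published figures."""
    avg_writes_per_sec = WRITES_PER_DAY / SECONDS_PER_DAY
    return {
        "avg_writes_per_sec": avg_writes_per_sec,
        "peak_to_avg_writes": PEAK_WRITES_PER_SEC / avg_writes_per_sec,
        # Each written message is read this many times on average.
        "read_fan_out": READS_PER_DAY / WRITES_PER_DAY,
        "egress_to_ingress": PEAK_EGRESS_MB_PER_SEC / PEAK_INGRESS_MB_PER_SEC,
        # Assumption: 1 MB = 10**6 bytes, and both peaks occur together.
        "bytes_per_message_at_peak":
            PEAK_INGRESS_MB_PER_SEC * 10**6 / PEAK_WRITES_PER_SEC,
    }


if __name__ == "__main__":
    for name, value in linkedin_derived().items():
        print(f"{name}: {value:.2f}")
```

The read fan-out (5.5) and the egress-to-ingress bandwidth ratio (about 5.5) agree closely, which suggests the figures are internally consistent. The peak write rate of 172k msgs/sec is also in the same range as our own ~150k requests/second.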

Rejected Alternatives

For completeness, this section briefly compares the alternatives we ultimately rejected in choosing Kafka. We've also written a substantially more detailed review of these systems, and a feature comparison spreadsheet.

We've pulled out the major contenders into their own sections, listing the deciding factors against each. Finally, some of the less-mature or ill-suited systems are listed at the end.


Scribe is a server for aggregating log data streamed in real time from a large number of servers. It is designed to be scalable, extensible without client-side modification, and robust to failure of the network or any specific machine. The implementation is a Thrift service using the non-blocking C++ server. It was written by Facebook, where it ran on thousands of machines and reliably delivered tens of billions of messages a day until they replaced it with a Java version of the same system (Calligraphus).



Flume is a distributed logging service that specializes in reliably getting stream and log data into HDFS. Though it was built by Cloudera (and remains a core component of CDH, the Cloudera Distribution of Hadoop), it is now a top-level Apache project.


  • Specialized for HDFS import, making it difficult to use with other systems
  • Producer library is Java: would require running JVM on production servers
  • Push-only, meaning that if the sink doesn't keep up with the sources, the producer client can block
  • Infected with the Hadoop community's inability to number releases in a coherent manner -- CDH3 supports 0.9.x and 1.2.0, but CDH4 only supports 1.1.x?


These systems were eliminated quickly, as they are less-mature or ill-suited to our needs.

  • Kestrel: no sharding/clustering; no pub-sub
  • Fluentd: no sharding/clustering; no pub-sub
  • RabbitMQ: nobody uses it for this / awkward for log aggregation
  • Chukwa: no, we are not running HDFS on the Squids


  1. Logging Solutions Comparison Spreadsheet
  2. Kafka Design Document - Major Design Elements
  3. Kafka Design Document - Message Persistence and Caching
  4. Building LinkedIn’s Real-time Activity Data Pipeline
  5. Why Kafka is better than some in memory messages queues
  6. Kafka Failover Tests - Analytics Mailing List Thread