Data Engineering/Evaluations/Event Platform/Stream Processing/Framework Evaluation

In 2022, the Event Platform Value Stream, did an evaluation of Stream Processing frameworks for use with WMF's Event Platform. This page contains the results of that evaluation.

Note: The analysis for Flink and Kafka streams has been supplemented from the evaluation conducted by the Search team^[1]. Knative Eventing evaluation details can be found in Phabricator^[1].

Features we need now (2022)


Capability	Flink	Spark Structured Streaming	Kafka Streams	Knative Eventing
Transform single streams	✅	✅	✅	✅ doc
Database Connectors	✅ Referred to as Sinks		❌ Referred to as Sink Connectors. These are great,most implementations are not open source, however the Connector framework is.)	❌
Multi-dc	❓Yes, but manual	❓Yes, but manual	❓Possible for streaming only if using Kafka stretched cluster, otherwise manual	❓Yes, but manual
Failover (cluster recoverability / HA)	✅ Flink can cooperate with Zookeeper (standalone deployment) Hadoop YARN + zookeeper handles HA for YARN deployments (have to read up more on that)	✅	✅ Kafka Streams uses Kafka for HA.	✅ via k8s, but very simple
Deployment options	Flink is a Cluster that can run 1 or more jobs at a time. Several options: Standalone deployment Hadoop Yarn Kubernetes Native (experimental) Some features missing Still deployable as standalone	Standalone deployment Hadoop Yarn Kubernetes	Kafka Streams API is a Java library, it still requires application deployment (standalone)	k8s CRDs. Applications are HTTP POST endpoints.
Monitoring / Observability	✅ Metrics ✅ Web UI with monitoring API available for statistics and job/cluster control	✅	✅ Metrics ❌ Web UI with monitoring: Only available through Confluent Control Center, which is a paid feature of a Confluent platform	✅ Metrics (and event tracing!) ❌ Web UI

Things we might want in the future


Capability	Flink	Spark Structured Streaming	Kafka Streams	Knative
Combine Streams	✅ doc	✅ doc	✅ doc	❌
Bootstrapping / Data Initialization (first run or lambda arch)	✅ Kafka consumer can be started from any record, we can populate from a dump and start initial updater run with specified e.g. timestamp (doc)	✅	✅ (but inflexible) Kafka is the only source of data, so all data must be in kafka for bootstrapping. If we only need current state, compacted topics will help.	❌ (no state)
Replay events / state Snapshotting /Checkpointing	✅ Supported directly in Flink requires persistent data source (e.g Kafka) and snapshotting storage (like HDFS, Ceph, NFS, etc.). Guaranteed to be in a consistent state. Disabled by default. doc Is used for save restarts/updates	✅ Supported for HDFS compatible filesystems.	❌ Kafka offset can be reset for stream replay Local state is backed by compacted changelog topics, which are used for state failover No historical state snapshots, would have to replay old stream to regenerate old state	❌ although using KafkaBroker, consumer offsets could be manually reset.
Async calls to MW API or other internal API's	✅ doc	❌ link	❌	✅ doc
Event time reordering (+ subordering based on revision id for given item)	✅ Supports event time characteristics, using watermarking (doc). Features like windowing use that. There are way to implement custom timestamp and watermark extractors, but subordering should be easy enough within the window	✅	✅	❌
Scale to 100?	✅	✅	✅	✅ (autoscale!)
State storage (for diffs, etc.)	Store local state (inMem / FS / RocksDB) If RocksDB is used it can provide incremental checkpointing Async I/O API (doc) - data enrichment (revisions?)	HDFS (compatible) local RocksDB	Local state (inMem / RocksDB) It’s possible to access external databases for data enrichment, but it requires some tinkering (Message enrichment with Kafka Streams)	❌
Job UI	✅ cluster UI	✅ cluster UI	❌	❌

Other considerations


Capability	Flink	Spark Structured Streaming	Kafka Streams	Knative
Languages supported	Java/Scala Limited Python (doc)	Java/Scala Python R	Java	Any
Community Support	✅ Active questions/answers on Stack Overflow Mailing lists Active JIRA tickets	✅	✅ Active questions/answers on Confluent site Apache Kafka community very active	✅
Complexity to Administer	High	Medium	Low	Medium
Conceptual Learning curve	High	Medium	Medium	Medium
Difficulty of app development (once learned)	Difficult	Medium	Medium	Easy
Path to WMF production (as of 2022-11)	✅ One Flink based process in prod k8s now	✅, DE is working on Spark in k8s, but not necessarily for production apps.	✅ Standalone application, can deploy on k8s easily	❌ Needs newer version of k8s, requires upgrade of Wikikube or new multi DC k8s cluster

General Remarks

It would be nice if we chose the same technology for Data Connectors, Stream Processors, Batch Processors, and event driven feature service development, but there is no requirement to do so. We are focusing on platform level choices, so we are likely to favor technology that allows us to implement a more comprehensive data platform rather than ease of use for event driven feature services.

Multi DC

There is no built in support for 'multi DC' (AKA multi region) in any of these frameworks, and as such, the multi-DC-ness is an application level concern. Generally, the more state that the stream processor is responsible for, the more difficult it is to architect. For simple event transformation/enrichment processors, the only state we will need is in the Kafka consumer offsets. Both Flink and Spark Streaming handle management of offsets themselves, and do no rely on Kafka's built in consumer offset API.

Flink

A general purpose stream and batch processing framework and scheduler, supporting any input and output data systems (not just Kafka ).

Java API is somewhat limited, because of type erasure (doc). Because of this, Scala seems a better choice.
Testing API enables both stateless and stateful testing. Same with timely UDFs (user defined functions) (doc)
There is a script to launch Scala Flink REPL, seems useful
There are few different levels of API here, ranging from SQL analytics to low level stateful stream processing (1.10 Documentation: Dataflow Programming Model)

Spark Structured Streaming

Another general purpose stream and batch processing framework and scheduler. Deployment models are similar to Flink.

Very few downstream data store connectors, only has Kafka and 'file'. Easy to write to HDFS and S3, but not e.g. Cassanrda. Would need to implement custom sinks.
- Possible that sinks are streaming specific; it may not be possible to use a custom streaming sink with a batch DataFrame.
Easier to learn and use than Flink.
Poorer streaming performance (latency) than Flink. Spark is much better for ad hoc and analytics workloads, when working with a big data lake (HDFS, S3, Hive, etc.).

Kafka Streams

A library for developing stream applications using Kafka.

It focuses more heavily on SQL-like - called KSQL- approach, when it comes to data mangling
It looks cool for simple operations on Kafka topics, but the philosophy here is to augment existing applications (Kafka Streams API is a library) with a dash of data processing, rather than create standalone processing applications. They say so basically in the first, introductory video (1. Intro to Streams | Apache Kafka® Streams API)
It’s difficult to find code examples in their documentation - Apache Flink’s is much better in that regard.

Knative Eventing

'Event routing' k8s primitives to trigger service API calls based on events.

NOT a comprehensive stream processing framework.
More focused on abstraction of event streams, so application developers only have to develop HTTP request based services.
Has Kafka integration, but within the Eventing system fully abstracts this away. Eventing KafkaBroker uses a single topic, and filters all events to fowards requests to subscribing services.
Uses CloudEvents as the event data envelope.
Looks very nice if you are building a self contained event driven application. Not so great for any kind of CEP.
Requires newer versions of kubernetes that we currently have at WMF (as of 2022-05).

Use Case Considerations

These technologies are on a spectrum of more complex and featurful, too simple and less featurful, with Flink being the the most complex and Knative Eventing the simplest.

Given the use cases we are considering, at times we will need a complex stream processors (e.g. WDQS updater, diff calculation, database connectors), and at others, a simpler and language agnostic event driven application framework (event enrichment, change prop, job queue). We'd like to make a 'platform' that makes it easy for engineers to build stream based applications. Sometimes those applications will be about complex data state sourcing and transformation problems, and other times they will be for triggering actions based on events. Attempting to support those different types of uses with the same technology may not be the right decision.

We should keep this in mind, and try to place incoming use cases into one of 2 categories: simple event driven applications, and complex stream processing.

Decision Record

Kafka Streams is easier than Flink for developing event driven applications, but less flexible than Knative Eventing, and less powerful and featureful than Flink. We eliminate it from further consideration based on this.

Spark streaming is a little easier to use than Flink if you assume you will always be operating around a Data Lake. However, its lack of built in Sink connectors is a huge downside, and from a brief attempt, implementing a streaming enrichment was equivalent to Flink, including all of the usual java dependency conflicts.

For the initial capabilities development and experiment, we choose Flink. This will allow us to get our hands dirty and investigate how we can use it to build platform capabilities to support our initial use cases, while considering future ones.

In the future, we may want to also support something like Knative Eventing for simple event driven feature products.

Supplementary Research

During analysis of the various solutions, specifically how each of them work in a multi-dc environment, Kafka Stretch was discovered as a potential solution to allow a single Kafka cluster to span multiple dc's.

Details of this additional evaluation can be found at phab:T307944

↑ ^1.0 ^1.1 https://docs.google.com/document/d/1NWYnbvktbMxdsztOd6h_aGUDMWdPEf8QMbor3ATd0Wo/edit?usp=sharing

[:0-1] 1.0 ^1.1 https://docs.google.com/document/d/1NWYnbvktbMxdsztOd6h_aGUDMWdPEf8QMbor3ATd0Wo/edit?usp=sharing

[1]