Data Engineering/Evaluations/Event Platform/Stream Processing/Framework Evaluation
Features we need now (2022)
|Capability||Flink||Spark Structured Streaming||Kafka Streams||Knative Eventing|
|Transform single streams||✅||✅||✅||✅ doc|
|Multi-dc||❓Yes, but manual||❓Yes, but manual||❓Possible for streaming only if using Kafka stretched cluster, otherwise manual||❓Yes, but manual|
|Failover (cluster recoverability / HA)||✅
|Deployment options||Flink is a Cluster that can run 1 or more jobs at a time. Several options:
|Monitoring / Observability||
Things we might want in the future
|Capability||Flink||Spark Structured Streaming||Kafka Streams||Knative|
|Combine Streams||✅ doc||✅ doc||✅ doc||❌|
|Bootstrapping / Data Initialization (first run or lambda arch)||✅
||✅||✅ (but inflexible)
||❌ (no state)|
|Replay events / state Snapshotting /Checkpointing||✅
although using KafkaBroker, consumer offsets could be manually reset.
|Async calls to MW API or other internal API's||✅||❌ link||❌||✅ doc|
|Event time reordering (+ subordering based on revision id for given item)||✅
|Scale to 100?||✅||✅||✅||✅ (autoscale!)|
|State storage (for diffs, etc.)||
|Job UI||✅ cluster UI||✅ cluster UI||❌||❌|
|Capability||Flink||Spark Structured Streaming||Kafka Streams||Knative|
|Complexity to Administer||High||Medium||Low||Medium|
|Conceptual Learning curve||High||Medium||Medium||Medium|
|Difficulty of app development (once learned)||Difficult||Medium||Medium||Easy|
|Path to WMF production (as of 2022-11)||✅
||✅, DE is working on Spark in k8s, but not necessarily for production apps.||✅
Needs newer version of k8s, requires upgrade of Wikikube or new multi DC k8s cluster
It would be nice if we chose the same technology for Data Connectors, Stream Processors, Batch Processors, and event driven feature service development, but there is no requirement to do so. We are focusing on platform level choices, so we are likely to favor technology that allows us to implement a more comprehensive data platform rather than ease of use for event driven feature services.
There is no built in support for 'multi DC' (AKA multi region) in any of these frameworks, and as such, the multi-DC-ness is an application level concern. Generally, the more state that the stream processor is responsible for, the more difficult it is to architect. For simple event transformation/enrichment processors, the only state we will need is in the Kafka consumer offsets. Both Flink and Spark Streaming handle management of offsets themselves, and do no rely on Kafka's built in consumer offset API.
See also: Multi DC for Streaming Apps
Multi DC streaming would be much more easier to accomplish by default if Apache Kafka had support for multi region clusters. There is support for this in the Confluent Enterprise edition. We will be experimenting with this idea as part of the Kafka stretch cluster.
Automatic multi DC failover might be possible with Kafka streams (or any system that uses Kafka offset API) in a Kafka stretch cluster. This would look like a single streaming app deployed across both DCs. However, because Flink and Spark Streaming manage Kafka offsets themselves, this is not possible. Kafka stretch may still help in an active/passive deployment, as offsets stored in their state checkpoints will work in both DCs.
A general purpose stream and batch processing framework and scheduler, supporting any input and output data systems (not just Kafka ).
- Java API is somewhat limited, because of type erasure (doc). Because of this, Scala seems a better choice.
- Testing API enables both stateless and stateful testing. Same with timely UDFs (user defined functions) (doc)
- There is a script to launch Scala Flink REPL, seems useful
- There are few different levels of API here, ranging from SQL analytics to low level stateful stream processing (1.10 Documentation: Dataflow Programming Model)
Spark Structured Streaming
Another general purpose stream and batch processing framework and scheduler. Deployment models are similar to Flink.
- Very few downstream data store connectors, only has Kafka and 'file'. Easy to write to HDFS and S3, but not e.g. Cassanrda. Would need to implement custom sinks.
- Possible that sinks are streaming specific; it may not be possible to use a custom streaming sink with a batch DataFrame.
- Easier to learn and use than Flink.
- Poorer streaming performance (latency) than Flink. Spark is much better for ad hoc and analytics workloads, when working with a big data lake (HDFS, S3, Hive, etc.).
A library for developing stream applications using Kafka.
- It focuses more heavily on SQL-like - called KSQL- approach, when it comes to data mangling
- It looks cool for simple operations on Kafka topics, but the philosophy here is to augment existing applications (Kafka Streams API is a library) with a dash of data processing, rather than create standalone processing applications. They say so basically in the first, introductory video (1. Intro to Streams | Apache Kafka® Streams API)
- It’s difficult to find code examples in their documentation - Apache Flink’s is much better in that regard.
'Event routing' k8s primitives to trigger service API calls based on events.
- NOT a comprehensive stream processing framework.
- More focused on abstraction of event streams, so application developers only have to develop HTTP request based services.
- Has Kafka integration, but within the Eventing system fully abstracts this away. Eventing KafkaBroker uses a single topic, and filters all events to fowards requests to subscribing services.
- Uses CloudEvents as the event data envelope.
- Looks very nice if you are building a self contained event driven application. Not so great for any kind of CEP.
- Requires newer versions of kubernetes that we currently have at WMF (as of 2022-05).
Use Case Considerations
These technologies are on a spectrum of more complex and featurful, too simple and less featurful, with Flink being the the most complex and Knative Eventing the simplest.
Given the use cases we are considering, at times we will need a complex stream processors (e.g. WDQS updater, diff calculation, database connectors), and at others, a simpler and language agnostic event driven application framework (event enrichment, change prop, job queue). We'd like to make a 'platform' that makes it easy for engineers to build stream based applications. Sometimes those applications will be about complex data state sourcing and transformation problems, and other times they will be for triggering actions based on events. Attempting to support those different types of uses with the same technology may not be the right decision.
We should keep this in mind, and try to place incoming use cases into one of 2 categories: simple event driven applications, and complex stream processing.
Kafka Streams is easier than Flink for developing event driven applications, but less flexible than Knative Eventing, and less powerful and featureful than Flink. We eliminate it from further consideration based on this.
Spark streaming is a little easier to use than Flink if you assume you will always be operating around a Data Lake. However, its lack of built in Sink connectors is a huge downside, and from a brief attempt, implementing a streaming enrichment was equivalent to Flink, including all of the usual java dependency conflicts.
For the initial capabilities development and experiment, we choose Flink. This will allow us to get our hands dirty and investigate how we can use it to build platform capabilities to support our initial use cases, while considering future ones.
In the future, we may want to also support something like Knative Eventing for simple event driven feature products.
During analysis of the various solutions, specifically how each of them work in a multi-dc environment, Kafka Stretch was discovered as a potential solution to allow a single Kafka cluster to span multiple dc's.
Details of this additional evaluation can be found at phab:T307944