Data Engineering/Evaluations/Dumps

In the course of planning a new architecture for dumps.wikimedia.org, we are evaluating a number of technologies. This document outlines the decisions we are making.

Architecture

The XML/Content Dumps were considered the largest and most complex pipeline among the content available at dumps.wikimedia.org, so the initial design work focused on them. Both potential solutions depend on events published to Kafka by EventBus via our EventPlatform. We could consume those events in near-real time, updating dumps and other similar data products continuously, or we could run a set of batch jobs that process the events in hourly buckets.

Near-Real Time

In our environment, we have two options for stream processing: wp:Apache_Flink, which is used by the search team and is being built into event enrichment pipelines for general use, and Spark Streaming, a component of wp:Apache_Spark that we get for free because we already run Spark on our cluster.

Flink

One proof of concept job was built in Flink SQL and another was packaged as a jar that can be submitted to a Flink cluster. The Flink SQL job mostly worked, but was hard to debug at the time we tried it: inserts and selects would hang, for example, with no obvious error message. The KafkaToIceberg job proved difficult to package correctly for our environment and was abandoned. Good lessons were learned in the art of Maven dependency management, especially the importance of learning and running:

mvn clean verify            # run the full build and all of its checks
mvn sortpom:sort            # keep pom.xml sorted and consistent
mvn duplicate-finder:check  # detect duplicate classes and resources on the classpath

However, with conflicts so difficult to resolve, we decided against Flink for this use case and moved on.
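
For reference, the general shape of such a streaming Kafka-to-Iceberg job can be sketched with PyFlink's Table API. This is an illustrative sketch only, not the actual proof of concept: the topic, fields, broker address, group id, and Iceberg catalog and table names are assumptions, and the Kafka and Iceberg connector jars are assumed to be on the Flink classpath.

from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming table environment; connector jars must already be on the classpath.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table over the revision-create topic (schema trimmed to a few fields).
t_env.execute_sql("""
    CREATE TABLE revision_create (
        rev_id BIGINT,
        page_id BIGINT,
        rev_timestamp STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'eqiad.mediawiki.revision-create',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'dumps-flink-poc',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Continuous insert into a pre-existing Iceberg sink table, registered via an Iceberg catalog.
t_env.execute_sql("""
    INSERT INTO iceberg_catalog.wmf_dumps.revision_metadata
    SELECT rev_id, page_id, rev_timestamp
    FROM revision_create
""").wait()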

Spark Streaming

A proof of concept job was built with Spark Streaming with relative ease. The main difficulty here, beyond what is documented at User:Milimetric/Learning_iceberg, was the trick of materializing the timestamps of each micro-batch. With the timestamps materialized, the optimizer can see the time range of the batch and know, based on the join condition, that a partition lookup can be used instead of a full table scan (which is what was happening before the hack). It seems that in more recent versions of Iceberg this has been improved.
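
To illustrate the workaround, here is a minimal Spark Structured Streaming sketch in PySpark. It is not the actual proof of concept: the event schema, broker address, topic, checkpoint path, and the target table wmf_dumps.revision_metadata are illustrative, and it assumes the target Iceberg table is partitioned by revision timestamp (and that a matching target row carries the same timestamp as the incoming one). The idea is to compute each micro-batch's timestamp bounds and inject them into the MERGE join condition as literals, so Iceberg can prune partitions.

from pyspark.sql import SparkSession, functions as F, types as T

# Requires the spark-sql-kafka package plus the Iceberg runtime and SQL extensions.
spark = SparkSession.builder.appName("dumps-streaming-poc").getOrCreate()

# Only a few illustrative fields of the revision-create event schema.
schema = T.StructType([
    T.StructField("rev_id", T.LongType()),
    T.StructField("page_id", T.LongType()),
    T.StructField("rev_timestamp", T.StringType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("subscribe", "eqiad.mediawiki.revision-create")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

def write_batch(batch_df, batch_id):
    # Materialize the batch's timestamp bounds as literals so the MERGE join can be
    # pruned to a few partitions of the target table instead of a full scan.
    bounds = batch_df.agg(
        F.min("rev_timestamp").alias("lo"),
        F.max("rev_timestamp").alias("hi"),
    ).collect()[0]
    if bounds.lo is None:
        return  # empty micro-batch

    batch_df.createOrReplaceTempView("incoming")
    batch_df.sparkSession.sql(f"""
        MERGE INTO wmf_dumps.revision_metadata t
        USING incoming s
        ON t.rev_id = s.rev_id
           AND t.rev_timestamp BETWEEN '{bounds.lo}' AND '{bounds.hi}'
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

query = (
    events.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/dumps-streaming-poc")  # placeholder
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()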

Spark Streaming was easier to write and it worked, but it lacks job management features like a UI to see the status of and interact with running jobs. It's basically just a long-lived process, and common practice is to write your own management hooks into the job itself (see [1]).
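
As an illustration of what such a hand-rolled hook could look like (not necessarily what [1] describes), here is a small driver-side management loop, assuming query is the StreamingQuery returned by writeStream.start() in the sketch above; the stop-file convention is made up for the example.

import json
import os
import time

# Hypothetical convention: an operator touches this file to request a clean shutdown.
STOP_FILE = "/tmp/dumps-streaming-poc.stop"

def manage(query, poll_seconds=60):
    """Poll a running StreamingQuery, log its progress, and stop it gracefully on request."""
    while query.isActive:
        progress = query.lastProgress  # dict with batchId, input rate, durations, etc.
        if progress:
            print(json.dumps({
                "batchId": progress.get("batchId"),
                "inputRowsPerSecond": progress.get("inputRowsPerSecond"),
                "durationMs": progress.get("durationMs"),
            }))
        if os.path.exists(STOP_FILE):
            query.stop()
            break
        time.sleep(poll_seconds)
    query.awaitTermination()

# manage(query)  # call this instead of query.awaitTermination() directly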

The performance of micro-batches was good: for most one-minute batches of data from the revision-create topic, it took 5 seconds to insert the metadata into the target table. We did realize that if the job failed over a long weekend and we had days of data to catch up on, we would need a proper batch counterpart, because running micro-batches over that much data would incur heavy performance overhead.

For these reasons, we put the Spark Streaming experiment aside and decided to focus on a plain Spark job for just the XML/SQL Dumps use case. For use cases needing real-time replication, we will revisit our options.

Decision

Near-real-time processing is not necessary for the particular needs of the Dumps use cases. The proof of concept jobs and the lessons learned should be useful for other use cases that do need it.

Batch

Spark

Our standard tool for batch processing at WMF is Spark. We have been using it for years and it is a great fit for this use case. We did not evaluate other batch processing technologies at this time.
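
To give a sense of the shape of that job, here is a minimal hourly batch sketch in PySpark. Table names, fields, and the timestamp format are illustrative, reusing the hypothetical wmf_dumps.revision_metadata target from the sketches above and assuming refined events land in an hourly-partitioned table event.mediawiki_revision_create.

import sys
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

# Requires the Iceberg runtime and SQL extensions on the Spark session.
spark = SparkSession.builder.appName("dumps-batch-hourly").getOrCreate()

# The hour to process is passed in by the scheduler, e.g.: 2024 6 1 14
year, month, day, hour = (int(arg) for arg in sys.argv[1:5])

# Bound the target side of the join as well, same idea as in the streaming sketch,
# so the MERGE prunes partitions instead of scanning the whole table.
start = datetime(year, month, day, hour)
end = start + timedelta(hours=1)
fmt = "%Y-%m-%dT%H:%M:%SZ"  # assumed timestamp format, for illustration only

spark.sql(f"""
    MERGE INTO wmf_dumps.revision_metadata t
    USING (
        SELECT rev_id, page_id, rev_timestamp
        FROM event.mediawiki_revision_create
        WHERE year = {year} AND month = {month} AND day = {day} AND hour = {hour}
    ) s
    ON t.rev_id = s.rev_id
       AND t.rev_timestamp >= '{start.strftime(fmt)}'
       AND t.rev_timestamp < '{end.strftime(fmt)}'
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

Catching up after an outage then reduces to re-running this job from the scheduler for each missed hour.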