Data Platform/Systems/Hadoop Event Ingestion Lifecycle

High level overview of how an event makes it into the analytics-hadoop cluster.

Event Ingestion Lifecycle

(There is a recording of a talk describing this system (as well as EventLogging))

Producer code (EventBus, EventLogging, Flink, etc. etc.)

EventGate (optional)
Kafka
HDFS raw (via Gobblin MR job)
Hive event database (via Refine Spark job)
Hive event_sanitized database (via RefineSanitized Spark job) (optional)
unsanitized Hive event table data deleted (after 90 days)

Canary Events

Why?

Most event streams in WMF Kafka are multi-datacenter. You can learn more about how we run multi-DC Kafka clusters.

An Event Platform stream can be made up of many individual topics in Kafka. In the multi DC case, usually a single stream is made up of its prefixed Kafka topics. E.g. my_stream would be made up of eqiad.my_stream and codfw.my_stream Kafka topics.

In Hive (via Refine) we ingest these Kafka topics into tables in the event database named after the stream. The datacenter Kafka topic prefixes are added as a datacenter Hive partition key.

Downstream jobs that use these Hive event tables need to know when an hourly partition has completed.

Problem

Problem: If there is no data in one of a stream's composite Kafka topics for a given hour, how can we differentiate between the following scenarios?

There is a problem somewhere with in the ingestion pipeline, and data failed to be ingested.
There really was no data produced to the topic during this hour.

Solution

We solve this problem by producing artificial 'canary' events into every stream. As long as we successfully produce at least one canary event into all of a stream's composite topics every hour, we can be sure that there should be at least one event in every topic every hour. If there are no events for a topic in an hour, then something is broken. Downstream jobs can hope that the hourly Hive partition will be ingested later, and wait until it is.

ProduceCanaryEvents

The ProduceCanaryEvents job runs as a is scheduled by Airflow several times an hour. It …

discovers all event streams via EventStreamConfig (that have canary_events_enabled)
uses the stream's schema examples to create an artificial canary event for the stream (with meta.domain set to 'canary'),
POSTs these events to the destination_event_service (eventgate) in each datacenter.

This means that all streams will have artificial canary events in them, and consumers must intentionally filter these events out. (e.g. where meta.domain != 'canary')