Analytics/Archive/EventLogging pipeline

This page contains historical information. It may be outdated or unreliable.

Value Proposition

For the Analytics team who maintain the EventLogging platform, the new EventLogging Pipeline is an architecture and implementation that leverages our big data infrastructure yielding capacity and reliability. Unlike the current implementation, the new architecture will make it easier to backfill data and handle 10x more data.

For product managers and analysts, the new EventLogging Pipeline will appear more reliable (because of capacity and ease of backfilling) unlike the current implementation where events are dropped or backfilling is slow.

Objective

Leverage existing infrastructure (Kafka & Hadoop) to scale the throughput capacity of EventLogging.

Key Result

Current maximum throughput is roughly 500 events / second
Handle at least 5000 events / second

Requirements

Preserve teams ability to query events via SQL over a small amount of data (depending on throughput)
Preserve all existing dashboards
Teams can test their instrumentation in Beta Cluster
Teams can join their data with mediawiki databases

Design

Given the complexity of the project, having multiple phases seems reasonable:

Phase 1

After that phase, EventLogging still works full scale on MySQL, but also provides data to Hadoop for us to check everything works fine.

Use Kafka as a new transportation layer for event messages between forwarders - processors - consumers processes (no data loss when processes crash).
Have processor publish into schema_version Kafka topics to ease consuming coherent batches
Rework MySQL consumers to take advantage of schema_version batches (optional values with default still to be fixed).
Create a new Camus job to import Kafka schema_version topics to HDFS (one shot job to initialize HDFS with existing EventLogging and mediawiki MySQL data using Sqoop)
Use graphite to chart per topic throughput and validation/error.

Phase 2

After that phase, EventLogging becomes available for customers on Hadoop, and MySQL gets gently relieved of it services.

Automate hive table creation based on schema_version definition.
Automate data import from mediawiki into a single HDFS store (easier to query, using Sqoop and oozie?)
Ask customers to move to Hive (Hue + JDBC? file copy? RESTBase instance?)
Truncate old MySQL data.

Phase 3

At the end of that phase, we have an easily scalable, fault-tolerant system.

Rework stream processing (python forwarder, processor, consumer) with Spark Streaming.
Anything else ?

Roadmap & Timeline

Work is being done as part of our ongoing priority for operational excellence. Resources dedicated to this priority will take on the work, not extra resources or projects are needed.