Data Engineering/Systems/EventLogging/Data representations

From Wikitech

This page gives an overview of the various representations of EventLogging data available on the WMF production cluster, and the expectations around each representation.

Hadoop & Hive

EventLogging analytics data is imported from Kafka into Hadoop as raw JSON, and then 'refined' into Parquet-backed Hive tables. These tables are in the Hive event and event_sanitized databases. The refined data is stored in HDFS in the hdfs:///wmf/data/event directory, and the sanitized data is stored under hdfs:///wmf/data/event_sanitized.

See: Analytics/Systems/EventLogging#Hadoop_.26_Hive for info on how to access this data.
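For illustration, a minimal sketch of how the HDFS directories above might map to per-schema partition paths. The hour-level Hive partition layout (year=/month=/day=/hour=) and the lowercased schema directory name are assumptions based on common conventions, not documented here:

```python
def refined_partition_path(schema, year, month, day, hour, sanitized=False):
    """Build a plausible HDFS path for one partition of a refined
    EventLogging table. The partition layout is an assumption."""
    base = "/wmf/data/event_sanitized" if sanitized else "/wmf/data/event"
    return (f"{base}/{schema.lower()}/"
            f"year={year}/month={month}/day={day}/hour={hour}")

# e.g. refined_partition_path("NavigationTiming", 2023, 5, 1, 0)
```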

'all-events' JSON log files

Use this data source only to debug issues around ingestion into the m2 database (data ingested only into Hadoop does not go through these files).

Entries are JSON objects.

Only validated events get written.

In case of bugs, historic data does not get fixed.

These files are available at:

  • stats1004:/srv/log/eventlogging/archive/all-events.log-$DATE.gz
  • stats1005:/srv/log/eventlogging/archive/all-events.log-$DATE.gz
  • eventlog1002:/var/log/eventlogging/...
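Since entries are one JSON object per line and only validated events are written, filtering the archive by schema is straightforward. A hedged sketch: the top-level "schema" field name is an assumption about the event envelope, and in practice the lines would come from gzip.open on one of the archive files above:

```python
import json

def events_for_schema(lines, schema):
    """Yield parsed events from all-events.log lines that match a schema.

    Assumes each line is a single JSON object with a top-level
    'schema' field (the field name is an assumption).
    """
    for line in lines:
        event = json.loads(line)
        if event.get("schema") == schema:
            yield event
```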

Raw 'client' side log files

Use this data source only to debug issues around ingestion into the m2 database.

Entries are parameters to the /beacon/event HTTP request. They are not decoded at all.

In case of bugs, historic data does not get fixed, nor are hot-fixes guaranteed to reach those files.

These files are available at:

  • stats1004:/srv/log/eventlogging/archive/client-side-events.log-$DATE.gz
  • stats1005:/srv/log/eventlogging/archive/client-side-events.log-$DATE.gz
  • eventlog1002:/var/log/eventlogging/...
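Because these entries are the undecoded parameters of /beacon/event requests, debugging usually starts by percent-decoding them. A minimal sketch, assuming the payload is a single URL-encoded JSON object carried in the query string (the exact encoding is an assumption):

```python
import json
from urllib.parse import unquote, urlparse

def decode_beacon(url):
    """Decode the EventLogging payload from a /beacon/event request URL.

    Assumes the query string is one percent-encoded JSON object.
    """
    query = urlparse(url).query
    return json.loads(unquote(query))
```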

Kafka

EventLogging now feeds the following topics in Kafka:

  • eventlogging-valid-mixed: This topic exists for ingestion into MariaDB and contains most of the live EventLogging analytics data. Some schemas are blacklisted.
  • eventlogging_<schemaName>: All events from the specified schema. Each schema has its own dedicated topic.
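The per-schema naming makes it easy to subscribe to exactly one schema's events. A hedged sketch: the JSON message encoding is an assumption, and the commented-out consumer uses the third-party kafka-python library with a placeholder broker address:

```python
import json

def schema_topic(schema_name):
    """Topic name for a single schema's events, per the naming above."""
    return f"eventlogging_{schema_name}"

def deserialize(raw_bytes):
    """Decode a Kafka message value; JSON-encoded events are an assumption."""
    return json.loads(raw_bytes.decode("utf-8"))

# Hypothetical consumer (kafka-python; broker address is a placeholder):
# from kafka import KafkaConsumer
# consumer = KafkaConsumer(schema_topic("NavigationTiming"),
#                          bootstrap_servers="kafka1001:9092",
#                          value_deserializer=deserialize)
```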

Varnish pipeline

Because EventLogging data is extracted at the bits caches, and the payload is encoded in the request URL, EventLogging data is available in all log targets fed by the caches.

In case of bugs, historic data does not get fixed, nor are hot-fixes guaranteed to reach this pipeline.