Analytics/Systems/EventLogging/Data representations

From Wikitech

This page gives an overview of the various representations of EventLogging data available on the WMF production cluster, and the expectations around each representation.

Hadoop & Hive

EventLogging analytics data is imported from Kafka into Hadoop as raw JSON, and then 'refined' into Parquet-backed Hive tables. These tables live in the Hive event database, and the refined data is stored in HDFS under the hdfs:///wmf/data/event directory.

See: Analytics/Systems/EventLogging#Hadoop_.26_Hive for info on how to access this data.
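As a rough illustration of where one hourly slice of refined data lands, the sketch below builds a partition path under hdfs:///wmf/data/event. The year=/month=/day=/hour= partition layout and the lowercased table name are assumptions about the refine output; verify with `hdfs dfs -ls /wmf/data/event` before relying on them.

```python
# Sketch: construct the HDFS directory of one hourly partition of
# refined EventLogging data. The partition layout (year=/month=/day=/hour=)
# is an assumption, not taken from this page.

def refined_partition_path(table, year, month, day, hour):
    """Return the assumed HDFS directory for one hourly partition."""
    return (f"hdfs:///wmf/data/event/{table}"
            f"/year={year}/month={month}/day={day}/hour={hour}")

print(refined_partition_path("navigationtiming", 2017, 6, 1, 0))
```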

MySQL / MariaDB database on m2

For most consumers, this database is the most convenient place to read EventLogging data from.

The data is available as the log database on m2 replicas, such as analytics-store.eqiad.wmnet. You can access the analytics-store database host from a machine such as stat1006, as explained at Analytics/Data access#Stats machines.

Only validated events enter the database.
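In the log database, each schema gets its own table named after the schema and its revision. The sketch below builds such a table name and a query string against it; the SchemaName_revisionId naming convention, the hypothetical revision id, and the commented pymysql connection details are assumptions for illustration.

```python
# Sketch: address an EventLogging table on the MariaDB 'log' database.
# Tables are assumed to follow the <SchemaName>_<revisionId> convention.

def eventlogging_table(schema_name, revision_id):
    """Return the assumed MariaDB table name for one schema revision."""
    return f"{schema_name}_{revision_id}"

# 12345678 is a hypothetical revision id, not a real one.
query = (f"SELECT COUNT(*) FROM log.{eventlogging_table('NavigationTiming', 12345678)} "
         "WHERE timestamp >= '20170601000000'")
print(query)

# On a stats machine, something like this would run it (illustrative,
# not executed here):
# import pymysql
# conn = pymysql.connect(host='analytics-store.eqiad.wmnet',
#                        read_default_file='~/.my.cnf', db='log')
```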

'all-events' JSON log files

Use this data source only to debug issues around ingestion into the m2 database.

Entries are JSON objects.

Only validated events get written.

If bugs occur, historical data is not fixed retroactively.

Those files are available as:

  • stats1004:/srv/eventlogging/archive/all-events.log-$DATE.gz
  • stats1005:/srv/eventlogging/archive/all-events.log-$DATE.gz
  • eventlog1001:/var/log/eventlogging/...
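Since each line of an all-events file is one JSON object, a quick debugging pass can be done with the standard library alone. The sketch below counts events per schema in a gzipped log; the in-memory sample data and the 'schema' field of the event capsule are assumptions for illustration.

```python
# Sketch: scan a gzipped all-events log and count events per schema.
# Each line is assumed to be one JSON object carrying a 'schema' field.
import gzip
import io
import json
from collections import Counter

def count_schemas(fileobj):
    """Count events per schema in an iterable of JSON lines."""
    counts = Counter()
    for line in fileobj:
        event = json.loads(line)
        counts[event.get("schema")] += 1
    return counts

# Tiny in-memory stand-in for all-events.log-$DATE.gz:
raw = (b'{"schema": "NavigationTiming", "event": {}}\n'
       b'{"schema": "Edit", "event": {}}\n')
buf = io.BytesIO(gzip.compress(raw))
with gzip.open(buf, "rt") as f:
    print(count_schemas(f))
```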

Raw 'client' side log files

Use this data source only to debug issues around ingestion into the m2 database.

Entries are parameters to the /beacon/event HTTP request. They are not decoded at all.
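Because the entries are undecoded request parameters, inspecting one means reversing the URL encoding yourself. The sketch below assumes the payload is a percent-encoded JSON object in the query string with a trailing ';', which is an assumption about the log format rather than something this page specifies.

```python
# Sketch: decode one raw /beacon/event payload. The '?<urlencoded json>;'
# shape is an assumption about these client-side log entries.
import json
from urllib.parse import unquote

def decode_beacon_qs(query_string):
    """Turn an assumed '?%7B...%7D;' query string into a Python dict."""
    payload = query_string.lstrip("?").rstrip(";")
    return json.loads(unquote(payload))

example = "?%7B%22schema%22%3A%22Edit%22%2C%22event%22%3A%7B%7D%7D;"
print(decode_beacon_qs(example))
```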

If bugs occur, historical data is not fixed retroactively, and hot-fixes are not guaranteed to reach these files either.

Those files are available as:

  • stats1004:/srv/eventlogging/archive/client-side-events.log-$DATE.gz
  • stats1005:/srv/eventlogging/archive/client-side-events.log-$DATE.gz
  • eventlog1001:/var/log/eventlogging/...

Kafka

EventLogging now feeds the following topics in Kafka:

  • eventlogging-valid-mixed: This topic exists for ingestion into MariaDB and contains most of the live EventLogging analytics data. Some schemas are blacklisted.
  • eventlogging_<schemaName>: All events from the specified schema. Each schema has its own dedicated topic.
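The per-schema topic names follow directly from the convention above. The sketch below derives a topic name from a schema name; the commented kafka-python consumer and the broker hostname are illustrative assumptions, not taken from this page.

```python
# Sketch: map a schema name to its Kafka topic, following the
# eventlogging_<schemaName> convention described above.

def topic_for_schema(schema_name):
    """Return the per-schema EventLogging topic name."""
    return f"eventlogging_{schema_name}"

print(topic_for_schema("NavigationTiming"))

# Consuming it could look roughly like this (kafka-python; the broker
# hostname is a placeholder assumption, not executed here):
# from kafka import KafkaConsumer
# consumer = KafkaConsumer(topic_for_schema("NavigationTiming"),
#                          bootstrap_servers="kafka-broker.example:9092")
```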

Varnish pipeline

Since EventLogging data is extracted at the bits caches and the EventLogging payload is encoded in the request URL, EventLogging data is available in all log targets fed by the caches.

If bugs occur, historical data is not fixed retroactively, and hot-fixes are not guaranteed to reach this pipeline either.