Data Platform/Systems/EventLogging/Data representations
This page gives an overview over the various representations of EventLogging data available on the WMF production cluster, and expectations around those representations.
Hadoop & Hive
EventLogging analytics data is imported from Kafka into Hadoop as raw JSON and then 'refined' into Parquet-backed Hive tables. These tables live in the Hive event and event_sanitized databases. The refined data is stored in HDFS under hdfs:///wmf/data/event, and the sanitized data under hdfs:///wmf/data/event_sanitized.
See: Analytics/Systems/EventLogging#Hadoop_.26_Hive for info on how to access this data.
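For example, the refined tables can be queried with Spark SQL. The sketch below is a minimal example only; the table name (event.navigationtiming), the selected columns, and the partition values are assumptions and would need to be adapted to the schema you are investigating.

```python
# Minimal sketch: query a refined EventLogging table with PySpark.
# Assumes a Spark session can be created on an analytics client;
# table name, columns, and partition values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eventlogging-example").getOrCreate()

# Refined tables are partitioned by year/month/day/hour; filtering on the
# partition columns avoids scanning the whole dataset.
df = spark.sql("""
    SELECT event, wiki, dt
    FROM event.navigationtiming
    WHERE year = 2024 AND month = 1 AND day = 1
    LIMIT 10
""")
df.show(truncate=False)
```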
'all-events' JSON log files
Use this data source only to debug issues around ingestion into the m2 database (data ingested only into Hadoop does not go through these files).
Entries are JSON objects.
Only validated events get written.
In case of bugs, historic data does not get fixed.
Those files are available as:
stats1004:/srv/log/eventlogging/archive/all-events.log-$DATE.gz
stats1005:/srv/log/eventlogging/archive/all-events.log-$DATE.gz
eventlog1002:/var/log/eventlogging/...
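These archives are gzipped files with one JSON object per line, so they can be inspected directly. A minimal sketch follows; the file name, schema name, and field names are examples, not guaranteed values.

```python
# Minimal sketch: scan an all-events archive for events of one schema.
# Path, schema name, and field names are placeholders.
import gzip
import json

path = "/srv/log/eventlogging/archive/all-events.log-20240101.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        event = json.loads(line)          # each line is one JSON event
        if event.get("schema") == "NavigationTiming":
            print(event.get("uuid"), event.get("dt"))
```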
Raw 'client' side log files
Use this data source only to debug issues around ingestion into the m2 database.
Entries are the parameters to the /beacon/event HTTP request. They are not decoded at all.
In case of bugs, historic data does not get fixed, nor do hot-fixes reach these files.
Those files are available as:
stats1004:/srv/log/eventlogging/archive/client-side-events.log-$DATE.gz
stats1005:/srv/log/eventlogging/archive/client-side-events.log-$DATE.gz
eventlog1002:/var/log/eventlogging/...
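Because the entries are raw, URL-encoded /beacon/event parameters, debugging usually starts by percent-decoding them. A minimal sketch, assuming the encoded payload has already been pulled out of a log line (the sample string is illustrative only):

```python
# Minimal sketch: decode one raw client-side entry.
# The sample value is illustrative; real entries are URL-encoded
# /beacon/event query strings and may carry extra request metadata.
import json
from urllib.parse import unquote

raw = "%7B%22schema%22%3A%22NavigationTiming%22%2C%22revision%22%3A1%7D"

decoded = unquote(raw)        # percent-decode the payload
event = json.loads(decoded)   # the payload itself is JSON
print(event["schema"])
```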
Kafka
EventLogging now feeds the following topics in Kafka:
- eventlogging-valid-mixed: This topic exists for ingestion into MariaDB and contains most of the live EventLogging analytics data. Some schemas are blacklisted.
- eventlogging_<schemaName>: All events from the specified schema. Each schema has its own dedicated topic (see the consumer sketch below).
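As a way to peek at one of the per-schema topics, the sketch below uses the kafka-python client. The broker address, topic name, and consumer group are assumptions and would need to match the production Kafka configuration.

```python
# Minimal sketch: consume a per-schema EventLogging topic with kafka-python.
# Broker address, topic name, and group id are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "eventlogging_NavigationTiming",
    bootstrap_servers=["kafka-broker.example:9092"],
    group_id="eventlogging-debug-example",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:
    event = message.value
    print(event.get("schema"), event.get("dt"))
```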
Varnish pipeline
Since EventLogging data is extracted at the bits caches and the payload is encoded in the request URL, EventLogging data is available in all log targets fed from those caches.
In case of bugs, historic data does not get fixed, nor do hot-fixes reach this pipeline.