Event Platform/EventLogging legacy

EventLogging was Wikimedia's original analytics-focused event data system. It used Draft-3 JSONSchemas stored on meta.wikimedia.org to validate incoming events.

In Q2 and Q3 of FY2020-2021, the Analytics/Data Engineering team is collaborating with the Product/Data Infrastructure team to migrate these now 'legacy' EventLogging event streams to Event Platform components. This page documents these changes and what they mean for the owners of legacy EventLogging event data.

The Phabricator task tracking this migration is https://phabricator.wikimedia.org/T259163.

Changes

Data

Legacy EventLogging data in Hive is 100% compatible with Event Platform. You shouldn't notice any changes to existing data fields in Hive.

Client IP addresses are no longer collected by default

This means that event data in Hive will not be geocoded. If your instrumentation relies on either client IPs or geocoded data, the client_ip field needs to be manually included in the migrated schema. During the migration, an engineer will contact the legacy EventLogging schema owner to see if they need this data.

timestamp field semantics

The now-deprecated eventlogging backend previously collected only a dt field, which was the time at which the backend received the event. Since the EventLogging extension only sends batches of events every 30 seconds, this timestamp could be up to 30 seconds later than the time the event actually happened.

Once a legacy EventLogging stream has been migrated to Event Platform, it will have the following timestamp fields:

  • dt - server side receive time.
  • meta.dt - server side receive time (unless the client explicitly sets this field). This field is used for Hive hourly partitioning.
  • client_dt - client side event timestamp. Since this can be set arbitrarily by clients, there is no restriction on what this value might be. Usually it should be the time at which the event happened, but a misbehaving client could set this to anything, including timestamps in the future.
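
Because client_dt can hold any value, analyses that rely on it may want a sanity check against the server side receive time. The following is a minimal sketch, not part of the migration tooling: the sample event values, the function names, and the one hour threshold are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp such as '2020-11-05T12:00:20Z'."""
    # datetime.fromisoformat() only accepts a trailing 'Z' from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def client_dt_is_plausible(event: dict,
                           max_skew: timedelta = timedelta(hours=1)) -> bool:
    """Return True if client_dt is within max_skew of meta.dt.

    max_skew is a hypothetical threshold, not a value defined by Event Platform.
    """
    received = parse_ts(event["meta"]["dt"])
    client = parse_ts(event["client_dt"])
    return abs(received - client) <= max_skew


# Hypothetical migrated legacy event: dt and meta.dt are server side receive
# times, client_dt is whatever the client reported.
event = {
    "dt": "2020-11-05T12:00:20Z",
    "meta": {"dt": "2020-11-05T12:00:20Z"},
    "client_dt": "2020-11-05T12:00:03Z",
}

# A misbehaving client reporting a timestamp far in the future.
future_event = dict(event, client_dt="2031-01-01T00:00:00Z")
```

Here client_dt_is_plausible(event) holds, while future_event fails the check and could be filtered out or fall back to dt.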

NOTE: The timestamp field semantics are different from those for non-legacy Event Platform events. In non-legacy events, dt is the client side event timestamp, and meta.dt is the server side receive timestamp. (As of 2020-11, this is still TODO for EventBus-based streams.)

System

Schema Location

The main visible change here is the schema location. Schemas are no longer stored on meta.wikimedia.org. Instead, they are stored in the schemas/event/secondary repository. Migrated legacy EventLogging schemas are in the jsonschema/analytics/legacy directory.

The schemas will look slightly different from what you are used to seeing on meta.wikimedia.org. The old eventlogging backend system wrapped all on-wiki schemas with the EventCapsule schema, and Event Platform has some required fields of its own. Both the EventCapsule fields and the required Event Platform fields are now included directly in migrated schemas.

If you need to make schema changes, you will now do so in the schemas/event/secondary repository. You can read more about how to do this at Event_Platform/Schemas#Modifying_schemas and Event_Platform/Instrumentation_How_To#Evolving.

Automatically augmented event data

See: Event_Platform/Schemas/Guidelines#Automatically_populated_fields.

Backend

Events are now POSTed to an EventGate instance instead of being sent as a URL-encoded GET query parameter. This means that browser clients that don't support JavaScript will not be able to send events.
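
To illustrate the difference between the two transports, here is a rough sketch that only builds the requests and does not send anything. The URLs, the event fields, and the exact request body shape are assumptions made for this example and are not taken from the EventLogging or EventGate documentation.

```python
import json
from urllib.parse import quote

# Hypothetical event; real events follow their schema in the schema repository.
event = {"schema": "ExampleSchema", "event": {"action": "click"}}


def legacy_get_url(base_url: str, event: dict) -> str:
    """Legacy style: the whole event is serialised to JSON and URL-encoded
    into the query string of a GET request."""
    return base_url + "?" + quote(json.dumps(event, separators=(",", ":")))


def eventgate_post_request(url: str, events: list) -> dict:
    """New style: events travel as a JSON body of a POST request.

    Returns a plain description of the request rather than sending it.
    """
    return {
        "method": "POST",
        "url": url,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(events).encode("utf-8"),
    }


# Placeholder hosts and paths, not real endpoints.
get_url = legacy_get_url("https://example.org/legacy-endpoint", event)
post_request = eventgate_post_request("https://example.org/eventgate-endpoint", [event])
```

With a GET, the event lives entirely in the URL; with a POST, the client needs a way to send a JSON request body, which in a browser requires JavaScript.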

Frontend

The main producer of this legacy data is the MediaWiki EventLogging extension. This extension has been modified so that a config switch makes it produce the legacy data to EventGate. The engineers doing this migration will not modify any frontend instrumentation code. If you'd like your instrumentation to move fully to Event Platform, you'll need to create new schemas and instrumentation code that calls the mw.eventLog.submit() function rather than the now-deprecated mw.eventLog.logEvent() function. However, this will result in a brand new event stream and Hive table. This is not required for any legacy EventLogging event stream, but it is nice if you want to do it. :)