Event Platform/EventLogging legacy

EventLogging was Wikimedia's original analytics-focused event data system. It used Draft-3 JSONSchemas on meta.wikimedia.org to validate incoming events.

Differences from EventLogging the legacy backend

The EventLogging extension was originally built as an all-in-one system to capture MediaWiki analytics events. It managed schemas, client-side event submission, server-side event validation, and server-side event ingestion (into e.g. MySQL). The Event Platform program was conceived to unify event collection for production and analytics events. EventLogging's tier 2 and analytics focus, together with its breadth, were not suitable to support this unification. Many of the features of WMF's Event Platform are the same as those of the legacy EventLogging system, but they are more modular and scalable. From an instrumentation-only perspective, it may not be clear why things have to be different, but there are good engineering reasons for all of these changes.

The EventLogging extension has been repurposed as a MediaWiki instrumentation event producer library only. On-wiki schemas and backend validation are no longer supported by EventLogging.

Schema repositories

  • EventLogging legacy: EventLogging schemas were stored as centralized wiki pages on metawiki, and all environments (development, beta, production, etc.) had to use this same schema repository.
  • Event Platform: Event Platform schemas are in decentralized git repositories. (Analytics instrumentation schemas are in the schemas/event/secondary repository. Schema repositories are also readable at https://schema.wikimedia.org/#!/ )

Streams, not schemas

  • EventLogging legacy: EventLogging schemas were single use. Each schema corresponded to only one instrumentation, and eventually only one downstream SQL table.
  • Event Platform: Event Platform schemas are like data types for a dataset. A realtime event data set is called an 'event stream' (or just 'stream' for shorthand). Each stream must specify its schema, and a schema may be used by multiple streams.

Schema versions

  • EventLogging legacy: EventLogging schema versions were wiki page revisions. Each event specified its schema name and revision.
  • Event Platform: Event Platform schemas are semantically versioned, and each event declares its schema and version in a $schema URI.

Schema compatibility

  • EventLogging legacy: Each EventLogging schema revision could change the schema in any way, which led to backwards incompatible changes.
  • Event Platform: Event Platform schema versions must be backwards compatible; i.e. only adding new optional fields is allowed.

Stream config

  • EventLogging legacy: None. Changes to the way events were emitted (like sampling rate) required a code deployment.
  • Event Platform: Streams are declared and configured in mediawiki-config and can be modified via a Backport window deployment.
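The schema compatibility rule above (a new version may only add optional fields) can be illustrated with a small check. This is a minimal Python sketch under stated assumptions, not how jsonschema-tools or Event Platform actually enforce the rule: the function name is hypothetical, and it only inspects the top-level "properties" and "required" keys of a schema.

```python
def is_backwards_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Return True if new_schema only adds optional top-level fields.

    Simplified illustration: nested objects are compared as a whole,
    so any change inside an existing field counts as incompatible.
    """
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})

    # Every existing field must still be present with the same definition.
    for name, definition in old_props.items():
        if new_props.get(name) != definition:
            return False

    # Added fields must be optional, so the set of required fields
    # has to stay exactly the same.
    old_required = set(old_schema.get("required", []))
    new_required = set(new_schema.get("required", []))
    return new_required == old_required


# Hypothetical schema versions, for illustration only.
v1 = {"properties": {"action": {"type": "string"}}, "required": ["action"]}
v2_ok = {
    "properties": {"action": {"type": "string"}, "label": {"type": "string"}},
    "required": ["action"],
}
v2_bad = {
    "properties": {"action": {"type": "integer"}},
    "required": ["action"],
}

print(is_backwards_compatible(v1, v2_ok))   # adds an optional field
print(is_backwards_compatible(v1, v2_bad))  # changes an existing field's type
```

The first call describes an allowed change, the second a change that would have been possible with wiki page revisions but is rejected under the Event Platform rule.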

How do I ... ?

If you are used to the old EventLogging system with metawiki schemas, the new system probably feels a little unfamiliar. There's plenty of documentation around Event Platform, but sometimes you just want to get things done. How do I ... ?

find schemas

'Instrumentation' schemas are stored in the schemas/event/secondary git repository in Gerrit. Instrumentation-specific ones live in the jsonschema/analytics directory.

You can browse these on GitHub or at schema.wikimedia.org. The latter is also a simple HTTP API that serves the directory of schema files.

edit schemas

Schemas are now stored in git repositories, just like other code. WMF (as of 2021-03) uses Gerrit for code review and for hosting git repositories. If you are new to Gerrit and/or git, you can learn more at https://www.mediawiki.org/wiki/Gerrit.

Schemas are now semantically versioned. To ease the task of creating new versions, we use a library called jsonschema-tools to help automate some tedious schema editing tasks. For the most part, you shouldn't have to worry about this. To edit a schema:

Once merged, your schema will be automatically deployed to schema.wikimedia.org.

See also Event Platform/Schemas

create new schemas

Create a current.yaml file in a directory path that matches the schema's title. E.g. analytics/cool_button_click should live at jsonschema/analytics/cool_button_click/current.yaml.
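The path rule above can be written down as a one-line mapping. The following Python sketch only restates the convention from the example; the function name is hypothetical and nothing here is part of jsonschema-tools.

```python
from pathlib import PurePosixPath


def schema_file_path(title: str) -> PurePosixPath:
    """Map a schema title to the location of its current.yaml file."""
    # The directory path mirrors the schema title under jsonschema/.
    return PurePosixPath("jsonschema") / title.strip("/") / "current.yaml"


print(schema_file_path("analytics/cool_button_click"))
# jsonschema/analytics/cool_button_click/current.yaml
```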

More detailed instructions are available at Event_Platform/Instrumentation_How_To#Creating_a_new_schema.

TODO: Link to guidelines for creating new instrumentation schemas when they exist.

produce event data

To produce data to a stream, you must

	"attributes": {
		"EventLogging": {
			"Schemas": {
				"LegacySchema": "/analytics/legacy/legacyschema/1.0.0"
			}
		}
	}
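The value in the fragment above has the shape of a schema URI that ends in a semantic version. As a rough illustration, the Python sketch below splits such a URI into a schema title and a version. It assumes the URI is always the title followed by a MAJOR.MINOR.PATCH version; the class and function names are hypothetical and not part of EventLogging.

```python
from typing import NamedTuple


class SchemaRef(NamedTuple):
    title: str    # e.g. "analytics/legacy/legacyschema"
    version: str  # e.g. "1.0.0"


def parse_schema_uri(uri: str) -> SchemaRef:
    """Split '/<title>/<major>.<minor>.<patch>' into title and version."""
    title, _, version = uri.strip("/").rpartition("/")
    parts = version.split(".")
    if not title or len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a versioned schema URI: {uri!r}")
    return SchemaRef(title, version)


# Mirrors the mapping shown in the fragment above.
schemas = {"LegacySchema": "/analytics/legacy/legacyschema/1.0.0"}

for name, uri in schemas.items():
    ref = parse_schema_uri(uri)
    print(name, ref.title, ref.version)
```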

query my data

No changes here. Event data is available in Hive in the event database. However, the tables are no longer named after the schema; they are named after the stream. See also Event_Platform/Instrumentation_How_To#Viewing_and_querying_events.

Migration to Event Platform

In FY2020-2021, the Analytics/Data Engineering team is collaborating with the Product/Data Infrastructure team to migrate these now 'legacy' EventLogging event streams to Event Platform components. This page will document these changes, and what this means for the owners of the legacy EventLogging event data.

The Phabricator task tracking this migration is https://phabricator.wikimedia.org/T259163.

Changes

Data

Legacy EventLogging data in Hive is 100% compatible with Event Platform. You shouldn't notice any changes to existing data fields in Hive.

Client IP addresses are no longer collected by default

This means that event data in Hive will not be geocoded. If your instrumentation relies on either client IPs or geocoded data, we need to manually include the client_ip field in the migrated schema. During the migration, an engineer will contact the legacy EventLogging schema owner to see if they need this data.

timestamp field semantics

The now-deprecated eventlogging backend previously collected only a dt field, which was the time at which the backend received the event. Since the EventLogging extension only sends batches of events every 30 seconds, this timestamp could be up to 30 seconds after the event actually happened.

Once a legacy EventLogging stream has been migrated to Event Platform, it will have the following timestamp fields:

  • dt - server side receive time.
  • meta.dt - server side receive time (unless the client explicitly sets this field). This field is used for Hive hourly partitioning.
  • client_dt - client side event timestamp. Since this can be set arbitrarily by clients, there is no restriction on what this value might be. Usually it should be the time at which the event happened, but a misbehaving client could set this to anything, including timestamps in the future.

NOTE: The timestamp field semantics differ from those of non-legacy Event Platform events. In non-legacy events, dt is the client side event timestamp, and meta.dt is the server side receive timestamp. (As of 2020-11, this is still TODO for EventBus based streams.)
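Because the field that holds the client-side time differs between legacy and non-legacy events, code that reads both kinds of data has to pick the right one. The Python sketch below is one possible way to do that, not an official recipe. The function names, the example event, and the five-minute tolerance for client clock skew are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical tolerance: client timestamps later than the receive time
# by more than this are treated as untrustworthy.
MAX_CLIENT_SKEW = timedelta(minutes=5)


def parse_dt(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def event_time(event: dict, legacy: bool) -> datetime:
    """Best guess for when the event happened.

    Legacy events carry the client time in client_dt; non-legacy events
    carry it in dt. meta.dt is used as the server side receive time
    (it can be client-set, so it is not fully trustworthy either).
    """
    received = parse_dt(event["meta"]["dt"])
    client_value = event.get("client_dt") if legacy else event.get("dt")
    if client_value is None:
        return received

    happened = parse_dt(client_value)
    # A misbehaving client may send a timestamp in the future; fall back
    # to the receive time in that case.
    if happened > received + MAX_CLIENT_SKEW:
        return received
    return happened


legacy_event = {
    "dt": "2020-11-01T12:00:30Z",
    "client_dt": "2020-11-01T12:00:05Z",
    "meta": {"dt": "2020-11-01T12:00:30Z"},
}
print(event_time(legacy_event, legacy=True).isoformat())
```

Falling back to the receive time is only one policy. Dropping or flagging such events may be more appropriate, depending on the analysis.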

System

Schema Location

The main visible change here is the schema location. Schemas are no longer stored on meta.wikimedia.org. Instead, they are stored in the schemas/event/secondary repository. Migrated legacy EventLogging schemas are in the jsonschema/analytics/legacy directory.

The schemas will look slightly different from what you are used to seeing on meta.wikimedia.org. The old eventlogging backend system wrapped all on-wiki schemas with the EventCapsule schema. Event Platform has some required fields. The EventCapsule fields and the required Event Platform fields are now included directly in migrated schemas.

If you need to make schema changes, you will now do so in the schemas/event/secondary repository. You can read more about how to do this at Event_Platform/Schemas#Modifying_schemas and Event_Platform/Instrumentation_How_To#Evolving.

Automatically augmented event data

See: Event_Platform/Schemas/Guidelines#Automatically_populated_fields.

Backend

Events are now POSTed to an EventGate instance instead of being sent as a URL-encoded GET query parameter. This means that browser clients that don't support JavaScript will not be able to send events.
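The difference between the two transports can be sketched as follows. This Python example only builds the two kinds of request and does not send anything. Both URLs are hypothetical placeholders rather than real Wikimedia endpoints, and the event contents and JSON body layout are assumptions for illustration, not the documented EventGate API.

```python
import json
from urllib.parse import quote, unquote
from urllib.request import Request

# Hypothetical endpoints, used only to show the shape of each request.
LEGACY_BEACON_URL = "https://legacy.example.org/beacon/event"
EVENTGATE_URL = "https://eventgate.example.org/v1/events"


def legacy_get_url(event: dict) -> str:
    """Legacy style: the whole event is URL-encoded into the query string."""
    return LEGACY_BEACON_URL + "?" + quote(json.dumps(event))


def eventgate_post_request(event: dict) -> Request:
    """Event Platform style: the event travels as a JSON POST body."""
    body = json.dumps(event).encode("utf-8")
    return Request(
        EVENTGATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical event; only the $schema value is taken from this page.
event = {
    "$schema": "/analytics/legacy/legacyschema/1.0.0",
    "event": {"action": "click"},
}

url = legacy_get_url(event)
print(url)
print(json.loads(unquote(url.split("?", 1)[1])) == event)

request = eventgate_post_request(event)
print(request.get_method(), request.full_url)
```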

Frontend

The main producer of this legacy data is the MediaWiki EventLogging extension. This extension has been modified to be able to produce the legacy data to EventGate via a config switch. The engineers doing this migration will not modify any frontend instrumentation code. If you'd like your instrumentation to fully move to Event Platform, you'll need to create new schemas and instrumentation code that calls the mw.eventLog.submit() function, rather than the now deprecated mw.eventLog.logEvent() function. However, this will result in a totally new event stream and Hive table, i.e. a brand new instrumentation stream. This is not required of any legacy EventLogging event streams, but it is nice if you want to do this. :)