Event Platform/EventLogging legacy

From Wikitech

EventLogging was Wikimedia's original analytics-focused event data system. It used Draft-3 JSONSchemas on meta.wikimedia.org to validate incoming events.

Differences from EventLogging the legacy backend

The EventLogging extension was originally built as an all-in-one system to capture MediaWiki analytics events. It managed schemas, client-side event submission, server-side event validation, and server-side event ingestion (into e.g. MySQL). The Event Platform program was conceived to unify event collection for production and analytics events. EventLogging's tier 2, analytics-focused design and breadth were not suitable to support this unification. Many features of WMF's Event Platform are the same as in the legacy EventLogging system, but they are more modular and scalable. From an instrumentation-only perspective, it may not be clear why things have to be different, but there are good engineering reasons for all of these changes.

The EventLogging extension has been repurposed as a MediaWiki instrumentation event producer library only. On-wiki schemas and backend validation are no longer supported by EventLogging.

Schema repositories
  • EventLogging legacy: EventLogging schemas were stored as centralized wiki pages on metawiki, and all environments (development, beta, production, etc.) had to use this same schema repository.
  • Event Platform: Event Platform schemas are in decentralized git repositories. (Analytics instrumentation schemas are in the schemas/event/secondary repository. Schema repositories are also readable at https://schema.wikimedia.org/#!/ )

Streams, not schemas
  • EventLogging legacy: EventLogging schemas were single use. Each schema corresponded to only one instrumentation, and eventually only one downstream SQL table.
  • Event Platform: Event Platform schemas are like data types for a dataset. A realtime event data set is called an 'event stream' (or just 'stream' for shorthand). Each stream must specify its schema, and a schema may be used by multiple streams.

Schema versions
  • EventLogging legacy: EventLogging schema versions were wiki page revisions. Each event specified its schema name and revision.
  • Event Platform: Event Platform schemas are semantically versioned, and each event declares its schema and version in a $schema URI.

Schema compatibility
  • EventLogging legacy: Each EventLogging schema revision could change the schema in any way, which led to backwards incompatible changes.
  • Event Platform: Event Platform schema versions must be backwards compatible; i.e. only adding new optional fields is allowed.

Stream config
  • EventLogging legacy: None. Changes to the way events were emitted (like sampling rate) required a code deployment.
  • Event Platform: Streams are declared and configured in mediawiki-config and can be modified via a Backport window deployment.
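The schema compatibility rule above ("only adding new optional fields is allowed") can be illustrated with a small check. This is a minimal sketch, not the validation that Event Platform tooling actually performs: the function name and the example schemas are hypothetical, and it only compares top-level properties and the required list.

```python
def only_adds_optional_fields(old_schema: dict, new_schema: dict) -> bool:
    """Return True if new_schema differs from old_schema only by added optional fields.

    Illustrative only: inspects top-level "properties" and "required",
    and does not recurse into nested objects.
    """
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})

    # Every existing field must still be present with an unchanged definition.
    for name, definition in old_props.items():
        if new_props.get(name) != definition:
            return False

    # The required list must not change, so any added field is optional.
    old_required = set(old_schema.get("required", []))
    new_required = set(new_schema.get("required", []))
    return new_required == old_required


# Hypothetical schema versions for a button click instrumentation.
v1_0_0 = {
    "properties": {"action": {"type": "string"}},
    "required": ["action"],
}
v1_1_0 = {
    "properties": {"action": {"type": "string"}, "label": {"type": "string"}},
    "required": ["action"],
}

compatible = only_adds_optional_fields(v1_0_0, v1_1_0)
```

Under these assumptions, adding the optional label field passes the check, while changing the type of action or making label required would not.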

FAQ

If you are used to the old EventLogging system with metawiki schemas, the new system probably feels a little unfamiliar. There's plenty of documentation around Event Platform, but sometimes you just want to get things done.

How do I find schemas?

'Instrumentation' schemas are stored in the schemas/event/secondary git repository in Gerrit. Instrumentation-specific ones live in the jsonschema/analytics directory.

You can browse these on GitHub or at schema.wikimedia.org. schema.wikimedia.org is also a simple HTTP API serving the directory of schema files.

How do I edit schemas?

Schemas are now stored in git repositories, just like other code. WMF (as of 2021-03) uses Gerrit for code review and for hosting git repositories. If you are new to Gerrit and/or git, you can learn more at https://www.mediawiki.org/wiki/Gerrit.

Schemas are now semantically versioned. To ease the task of creating new versions, we use a library called jsonschema-tools to help automate some tedious schema editing tasks. For the most part, you shouldn't have to worry about this. To edit a schema:

Once merged, your schema will be automatically deployed to schema.wikimedia.org.

See also Event Platform/Schemas

How do I create new schemas?

Create a current.yaml file in a directory path that matches the schema's title. E.g. analytics/cool_button_click should live at jsonschema/analytics/cool_button_click/current.yaml.
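The title-to-path convention described above can be sketched as a small helper. The function name is hypothetical; the jsonschema root directory and the current.yaml file name come from the example in the text.

```python
from pathlib import PurePosixPath


def current_schema_path(title: str, root: str = "jsonschema") -> str:
    """Map a schema title to the path of its current.yaml in the repository."""
    # Strip slashes so a title like "/analytics/foo" is not treated as an
    # absolute path that would replace the root directory.
    return str(PurePosixPath(root) / title.strip("/") / "current.yaml")


path = current_schema_path("analytics/cool_button_click")
```

Here path is jsonschema/analytics/cool_button_click/current.yaml, matching the example above.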

More detailed instructions are available at Event_Platform/Instrumentation_How_To#Creating_a_new_schema.

How do I produce event data?

IMPORTANT: streams, not schemas. A significant difference in Event Platform is that schemas are no longer mapped one to one with a dataset. A schema is more like a data type than a record. It describes the shape of data. Many different datasets might have the same shape, so a schema can be reused for different streams. A stream is just an in-flight dataset. It is a continuous series of events, and each event in a stream conforms to a specific schema.
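To make the stream and schema relationship concrete, here is a minimal sketch in which two streams reuse one schema. The stream names, the STREAMS mapping, and the helper functions are hypothetical illustrations and do not reproduce the real stream configuration format.

```python
# Hypothetical stream declarations: two different streams share one schema.
STREAMS = {
    "cool_button.desktop_click": {"schema_title": "analytics/cool_button_click"},
    "cool_button.mobile_click": {"schema_title": "analytics/cool_button_click"},
}


def schema_title_of(schema_uri: str) -> str:
    """Drop the trailing version from a $schema URI, leaving the schema title."""
    return schema_uri.strip("/").rsplit("/", 1)[0]


def event_fits_stream(event: dict, stream_name: str) -> bool:
    """Check that the event's schema is the one declared for the stream."""
    stream = STREAMS.get(stream_name)
    if stream is None:
        # Undeclared streams accept nothing in this sketch.
        return False
    return schema_title_of(event["$schema"]) == stream["schema_title"]


click = {"$schema": "/analytics/cool_button_click/1.0.0", "action": "click"}
fits_desktop = event_fits_stream(click, "cool_button.desktop_click")
fits_mobile = event_fits_stream(click, "cool_button.mobile_click")
```

The same event shape is valid for both streams, but each stream is still a separate dataset.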

To produce data to a stream, you must

	"attributes": {
		"EventLogging": {
			"Schemas": {
				"LegacySchema": "/analytics/legacy/legacyschema/1.0.0"
			}
		}
	}
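As an illustration of how the mapping in the fragment above could be read, the following sketch looks up the schema URI for a legacy schema name and splits it into a title and a semantic version. The helper names are hypothetical, and the embedded JSON simply wraps the fragment above in an enclosing object so that it parses on its own.

```python
import json

# The fragment above, wrapped in an object so it is valid standalone JSON.
EXTENSION_JSON = """
{
    "attributes": {
        "EventLogging": {
            "Schemas": {
                "LegacySchema": "/analytics/legacy/legacyschema/1.0.0"
            }
        }
    }
}
"""


def schema_uri_for(extension_json: str, schema_name: str) -> str:
    """Return the schema URI registered for a legacy schema name."""
    config = json.loads(extension_json)
    return config["attributes"]["EventLogging"]["Schemas"][schema_name]


def split_schema_uri(uri: str):
    """Split '/title/parts/1.0.0' into ('title/parts', (1, 0, 0))."""
    title, version = uri.strip("/").rsplit("/", 1)
    major, minor, patch = (int(part) for part in version.split("."))
    return title, (major, minor, patch)


uri = schema_uri_for(EXTENSION_JSON, "LegacySchema")
title, version = split_schema_uri(uri)
```

With the fragment as given, title is analytics/legacy/legacyschema and version is (1, 0, 0).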

How do I query my data?

No changes here. Event data is available in Hive in the event database. However, the tables are no longer named after the schema; they are named after the stream. See also Event_Platform/Instrumentation_How_To#Viewing_and_querying_events.

Migration to Event Platform

In FY2020-2021, the Analytics/Data Engineering team is collaborating with the Product/Data Infrastructure team to migrate these now 'legacy' EventLogging event streams to Event Platform components.

The Phabricator task tracking this migration is https://phabricator.wikimedia.org/T259163.


What does Event Platform 'migration' mean?

This refers specifically to the process of moving legacy EventLogging schemas off of meta.wikimedia.org and having clients POST events to EventGate. Completing this migration will allow us to decommission the SPOF EventLogging backend service, a brittle Hive ingestion pipeline, and reliance on metawiki for schema distribution.

The migration should be mostly transparent to you, unless you need to make schema changes. Read on for details.

How does this relate to the Metrics Platform?

Metrics Platform will provide an abstraction on top of Event Platform components that will standardize the way product teams build instrumentations and collect metrics on product usage. When ready, the Product Data Infrastructure team will want to slowly re-instrument products to use Metrics Platform client libraries and schemas.

This legacy EventLogging -> Event Platform migration is separate from that. The process for Metrics Platform reinstrumentation will be the same whether or not a schema is 'legacy' or a new Event Platform based schema.

What is the Metrics Platform reinstrumentation plan?

Once the Metrics Platform is released, PDI will focus on the re-instrumentation process.

Is my schema legacy or not?

If your schema is or was ever stored on meta.wikimedia.org, it is a legacy schema.

Is my schema migrated or not?

If your schema is editprotected on meta.wikimedia.org, or if it exists in the schemas/event/secondary repository in jsonschema/analytics/legacy, it has been migrated to Event Platform.


Changes

Data

Legacy EventLogging data in Hive is 100% compatible with Event Platform. You shouldn't notice any changes to existing data fields in Hive.

Client IP addresses are no longer collected by default

This means that event data in Hive will not be geocoded. If your instrumentation relies on either client IPs or geocoded data, we need to manually include the client_ip field in the migrated schema. During the migration, an engineer will contact the legacy EventLogging schema owner to see if they need this data.

timestamp field semantics

The now deprecated eventlogging backend previously collected only a dt field, which was the time at which the backend received the event. Since the EventLogging extension only sends batches of events every 30 seconds, this timestamp could be up to about 30 seconds later than the time the event actually happened.

Once a legacy EventLogging stream has been migrated to Event Platform, it will have the following timestamp fields:

  • dt - server side receive time.
  • meta.dt - server side receive time (unless the client explicitly sets this field). This field is used for Hive hourly partitioning.
  • client_dt - client side event timestamp. Since this can be set arbitrarily by clients, there is no restriction on what this value might be. Usually it should be the time at which the event happened, but a misbehaving client could set this to anything, including timestamps in the future.

NOTE: The timestamp field semantics are different from those for non-legacy Event Platform events. In non-legacy events, dt is the client side event timestamp, and meta.dt is the server side receive timestamp. (As of 2020-11, this is still TODO for EventBus based streams.)
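Because the same field name means different things in legacy and non-legacy events, code that reads timestamps has to know which kind of event it is handling. The sketch below encodes the semantics described above. The function names and the example event are hypothetical, and the ISO 8601 UTC string format is an assumption about how the timestamps are serialized.

```python
from datetime import datetime, timezone


def parse_dt(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2020-11-05T12:00:05Z' (assumed format)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def event_time(event: dict, legacy: bool):
    """Client side time at which the event happened, if the event carries one."""
    # Migrated legacy events use client_dt; non-legacy events use dt.
    value = event.get("client_dt" if legacy else "dt")
    return parse_dt(value) if value else None


def receive_time(event: dict, legacy: bool) -> datetime:
    """Server side receive time."""
    # Migrated legacy events use dt; non-legacy events use meta.dt.
    return parse_dt(event["dt"] if legacy else event["meta"]["dt"])


# Hypothetical migrated legacy event.
legacy_event = {
    "dt": "2020-11-05T12:00:30Z",
    "client_dt": "2020-11-05T12:00:05Z",
    "meta": {"dt": "2020-11-05T12:00:30Z"},
}

lag = receive_time(legacy_event, legacy=True) - event_time(legacy_event, legacy=True)
```

Since client_dt is set by the client, the computed lag can be negative or implausibly large for a misbehaving client, so treat it as a hint rather than a measurement.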

System

Schema Location

The main visible change here is the schema location. Schemas are no longer stored on meta.wikimedia.org. Instead, they are stored in the schemas/event/secondary repository. Migrated legacy EventLogging schemas are in the jsonschema/analytics/legacy directory.

The schemas will look slightly different from what you are used to seeing on meta.wikimedia.org. The old eventlogging backend system wrapped all on-wiki schemas with the EventCapsule schema. Event Platform has some required fields. The EventCapsule and the required Event Platform fields are now included directly in migrated schemas.

If you need to make schema changes, you will now do so in the schemas/event/secondary repository. You can read more about how to do this at Event_Platform/Schemas#Modifying_schemas and Event_Platform/Instrumentation_How_To#Evolving.

Automatically augmented event data

See: Event_Platform/Schemas/Guidelines#Automatically_populated_fields.

Backend

Events are now POSTed to an EventGate instance instead of using a URL encoded GET query parameter. This means that browser clients that don't support JavaScript will not be able to send events.
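The difference between the two transports can be sketched as follows. This only builds the requests and does not send anything. Both URLs are placeholders, the event fields are illustrative, and the exact query string and request body formats used by the real services are assumptions.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def legacy_get_url(base_url: str, event: dict) -> str:
    """Old style: the whole event is URL encoded JSON in the query string."""
    return base_url + "?" + quote(json.dumps(event, separators=(",", ":")))


def eventgate_post_request(url: str, events: list) -> Request:
    """New style: events are sent as a JSON body in an HTTP POST."""
    body = json.dumps(events).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative event and placeholder endpoints (not the production hosts).
event = {"schema": "LegacySchema", "event": {"action": "click"}}
get_url = legacy_get_url("https://example.org/beacon/event", event)
post_request = eventgate_post_request("https://eventgate.example.org/v1/events", [event])
```

A GET URL like the first one can be requested without scripting, while the POST needs client code to construct and send the body, which is why clients without JavaScript can no longer send events.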

Frontend

The main producer of this legacy data is the MediaWiki EventLogging extension. This extension has been modified to be able to produce the legacy data to EventGate via a config switch. The engineers doing this migration will not modify any frontend instrumentation code. If you'd like your instrumentation to fully move to Event Platform, you'll need to create new schemas and instrumentation code that calls the mw.eventLog.submit() function, rather than the now deprecated mw.eventLog.logEvent() function. However, this will result in a totally new event stream and Hive table, i.e. a brand new instrumentation stream. This is not required of any legacy EventLogging event streams, but it is nice if you want to do this. :)

Other questions?

If you are still confused or have more questions...then we need to know and do a better job at documentation! Please reach out to Andrew Otto (IRC: ottomata, email: otto@wikimedia.org) and Marcel Forns (IRC: mforns, email: mforns@wikimedia.org) with any questions. We're happy to answer and will use your questions to help make this documentation better.