Event Platform/Schemas

From Wikitech

Motivation and Overview

Event Schemas are essential for an Event Streaming Platform. They allow disparate continuously changing producers and consumers to reliably communicate with each other. By explicitly declaring the shape of data, schemas ease integration between various systems.

Schemas should be readily available to any producer or consumer code that might need them. Schemas are needed to validate data, but they can also be used to automate data integration tasks, e.g. automatically creating the SQL tables into which events will be imported. Access to those schemas should be reliable, and the schemas themselves should be immutable for any given deployed service.

WMF uses JSON as our preferred in-flight data serialization format, and as such we have chosen* to use JSONSchema for our event schemas. Schema evolution is necessary to be able to reliably upgrade producer and consumer code, but unfortunately, JSONSchema does not have any built-in features for schema evolution. Therefore, each change (even a small one) requires the creation of a totally separate JSONSchema file.

WMF has chosen to distribute schemas using Git. This allows us to do development, CI, versioning and deployment for schemas the same way we do any code project. However, even though we use Git, we do not rely on Git history for schema versioning. Each schema version is an explicit static file in the schema repository. For more background, see RFC: Modern Event Platform: Schema Registry.

To make development of many schema version files in Git easier, WMF has developed the jsonschema-tools library. This tooling makes it easier for developers to design and evolve schemas dynamically while allowing production services to use static and immutable versions of those schemas.

jsonschema-tools will be used in the rest of this documentation to set up and develop schemas in a Git schema repository. Please skim the jsonschema-tools README before proceeding.

jsonschema-tools is a NodeJS module, so you'll need a recent (Node 10 or greater) version of NodeJS and npm installed. You can get NodeJS and npm at nodejs.org. Once installed, cd to the schema repository and run npm install. Heads-up: the full path to the directory cannot contain spaces. For example, ~/Documents/analytics\ engineering/event\ schemas/primary is likely to yield errors, but ~/Documents/analytics-engineering/event-schemas/primary would be fine.

*There are plenty of other schema technologies out there (Avro, Thrift, etc.), but JSON and JSONSchema fit our use cases better than any of those. For more information about how JSONSchema was chosen, see RFC: Modern Event Platform - Choose Schema Tech and https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/.

Event Schema Design Rules and Conventions

Event Platform/Schemas/Guidelines

Schema Repositories

A schema repository is a Git repository with a hierarchy of versioned JSONSchema files, with a file layout something like:

jsonschema
└── analytics
    ├── button
    │   ├── click
    │   │   ├── 1.0.0 -> 1.0.0.yaml
    │   │   ├── 1.0.0.yaml
    │   │   ├── current.yaml
    │   │   └── latest -> 1.0.0
    │   └── release
    │       ├── 1.0.0 -> 1.0.0.yaml
    │       ├── 1.0.0.yaml
    │       ├── 1.0.1 -> 1.0.1.yaml
    │       ├── 1.0.1.yaml
    │       ├── current.yaml
    │       └── latest -> 1.0.1
    └── page_preview
        └── visibility_change
            ├── 1.0.0 -> 1.0.0.yaml
            ├── 1.0.0.yaml
            ├── 2.0.0 -> 2.0.0.yaml
            ├── 2.0.0.yaml
            ├── current.yaml
            └── latest -> 2.0.0

JSONSchema has title and $id fields that we use to associate event data with a schema, as well as for semantically versioning schemas. The actual hierarchy layout shown here is arbitrary, but each schema's title and $id must match the layout in a specific way. More on this below.

Note the 'current.yaml' files. These files represent the current working version of the schema. The current schemas are never themselves used as a schema for validation or data integration. Instead, they are 'materialized' by jsonschema-tools into static versioned schema files. These versioned schema files are the canonical schemas used by event processing systems.

Hierarchy Rules

Each schema's title should match its relative path in the schema repository. E.g. all schema version files in namespace1/entity1/verbB should have title: namespace1/entity1/verbB. Each schema's $id field should be set to the path (starting with /) and (extensionless) version. E.g. namespace1/entity1/verbB/1.0.1.yaml should have $id: /namespace1/entity1/verbB/1.0.1.

This layout, combined with the title and $id, allows event data to point specifically to its schema via a relative URI. By semantically versioning schema files, jsonschema-tools is able to associate schemas with the same title and enforce backwards compatibility. The relative and versioned $id URIs can also be used as JSON $ref links and with JSON Pointers. More on this below as well.
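
The hierarchy rules above can be illustrated with a short Python sketch. It is not part of jsonschema-tools; the function name expected_title_and_id and the base directory default are assumptions made for this example only.

```python
from pathlib import PurePosixPath


def expected_title_and_id(schema_file: str, base: str = "jsonschema"):
    """Derive the title and $id a versioned schema file should declare.

    schema_file is a path such as
    'jsonschema/namespace1/entity1/verbB/1.0.1.yaml'.
    """
    relative = PurePosixPath(schema_file).relative_to(base)
    title = str(relative.parent)   # directory path below the base directory
    version = relative.stem        # file name without its final extension
    schema_id = f"/{title}/{version}"
    return title, schema_id


title, schema_id = expected_title_and_id(
    "jsonschema/namespace1/entity1/verbB/1.0.1.yaml"
)
print(title)      # namespace1/entity1/verbB
print(schema_id)  # /namespace1/entity1/verbB/1.0.1
```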

Creating a new schema repository

Most likely you will already be working with a schema repository. If so, skip to Creating a new schema or Modifying schemas.

jsonschema-tools is a NodeJS library and CLI for managing JSONSchema Git repositories. To create a new schema repository, you'll create a package.json file, install and configure jsonschema-tools, and set up jsonschema-tools tests.

mkdir my_schema_repository
cd my_schema_repository
git init .

# Our schemas will go in the jsonschema/ directory
mkdir jsonschema

# Create a configuration file for jsonschema-tools.
echo -e 'schemaBasePath: ./jsonschema/\nlogLevel: info' > .jsonschema-tools.yaml

# Create a package.json file.  (Modify this as desired.)
echo '
{
  "name": "my_schema_repository",
  "scripts": {
    "test": "mocha test/jsonschema",
    "postinstall": "$(npm bin)/jsonschema-tools install-git-hook"
  },
  "devDependencies": {
    "@wikimedia/jsonschema-tools": "^0.6.0",
    "mocha": "^6.2.0"
  }
}
' > package.json

# Install jsonschema-tools.  The npm postinstall script will install a git
# pre-commit hook to auto materialize versioned schema files when current
# schema files are modified.
npm install .

# Install jsonschema-tools tests.
mkdir -p test/jsonschema
echo "
'use strict';
require('@wikimedia/jsonschema-tools').tests.all({ logLevel: 'info' });
" > test/jsonschema/repository.test.js

# Create the first git commit.
echo 'node_modules**' >> .gitignore
git add .
git commit -m 'New schema repository'

Creating a new schema

Once we are working in a repository with jsonschema-tools, we can create new schemas. By 'new schema', we mean a brand new schema lineage, not just a new schema version. To create a new schema, we need to first decide on its title (and hierarchy), create the directory structure, write a new current.yaml schema file, and materialize the schema. For this example, we'll create a new event schema that represents a MediaWiki UI button click.

NOTE: since we will be writing JSONSchema, you should probably know how to do that. See this tutorial and reference for help working with JSONSchema.

mkdir -p jsonschema/mediawiki/desktop/button/click

Open jsonschema/mediawiki/desktop/button/click/current.yaml. We'll build this up piece by piece and explain each part.

Schema meta data

First we need some schema meta data that describe and identify the schema. Note that this schema meta data is not describing any aspect of your event data.

# This is the title of the schema.
# It should match the relative path to this file's parent directory.
title: mediawiki/desktop/button/click

# Document what the schema represents.
description: Mediawiki desktop web button clicked

# The $id uniquely identifies this schema.  It should be a versioned (and extensionless) URI.
$id: /mediawiki/desktop/button/click/1.0.0

# This is the meta-schema of this schema.  This should probably always be the same
# for every schema, and should point to the main JSONSchema meta-schema at json-schema.org.
$schema: https://json-schema.org/draft-07/schema#


Event fields

...continuing on to event data fields. Your event should be a JSON object with each field explicitly declared here.

type: object
additionalProperties: false
properties:

Event meta data

In addition to the $schema field, WMF has defined common 'meta' fields for event data. These common fields allow us to have some consistency across all event data.

$schema

Each event needs to identify its schema. Right now we are just writing the schema, but later on your code will produce JSON event data that conforms to this schema. We need to be able to look up the schema for any given event just from the event data itself. To do this, we re-use the JSONSchema $schema field in the event properties.

  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path.  e.g. /schema_name/1.0.0 is acceptable. This should match
      the schema's $id field.

meta.dt

Every event happens at a certain date-time. That event time is stored in the meta.dt field as an ISO-8601 UTC datetime string, e.g. '2020-07-01T00:00:00Z'.

NOTE: meta.dt will be used as the Kafka timestamp as well as for Hive hourly partitioning. This field is required in schemas, but if your producer does not set it, EventGate will fill it in with the timestamp at which it receives the event. If you don't have strict control over your event producers (e.g. remote browser clients), you might want to allow EventGate to fill in this field so that you don't end up with incorrect timestamps.
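
For producers that do set meta.dt themselves, a minimal Python sketch of producing such a string follows. The helper name event_dt is hypothetical, and the sketch assumes a timezone-aware datetime is passed in.

```python
from datetime import datetime, timezone


def event_dt(moment=None):
    """Format a timezone-aware datetime as an ISO-8601 UTC string for meta.dt."""
    moment = moment or datetime.now(timezone.utc)
    # Convert to UTC first, then emit the 'Z' suffix for UTC.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(event_dt(datetime(2020, 7, 1, tzinfo=timezone.utc)))  # 2020-07-01T00:00:00Z
```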


meta.stream

Every event should belong to a named dataset. While events are in flight, this dataset is called a stream of events. Each event needs to specify which stream it belongs to, which it does using the meta.stream field. A single schema can be shared by several streams: for example, the resource_change schema is re-used in the `mediawiki.resource_change`, `transcludes.resource_change`, `change-prop.retry.resource_change`, etc. streams. Likewise, you might design a single generic button_clicked schema for all button clicks, but keep the different types of button click events in different streams. meta.stream is used for routing incoming events to specific streams and downstream 'datasets'. Each distinct meta.stream will correspond with certain Kafka topics and a Hive table. In most cases, the Kafka topic will be the stream name prefixed with the datacenter name where the event was received.
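
As a rough illustration of the stream-to-topic mapping described above, here is a small Python sketch. The function name kafka_topic, the '.' separator, and the example datacenter name are assumptions for illustration; the actual routing is done by the event platform, not by producer code.

```python
def kafka_topic(stream: str, datacenter: str) -> str:
    """Build a topic name by prefixing the stream name with the datacenter.

    The '.' separator is an assumption made for this sketch.
    """
    return f"{datacenter}.{stream}"


# 'eqiad' is used here only as an example datacenter name.
print(kafka_topic("mediawiki.desktop.button-click", "eqiad"))
# eqiad.mediawiki.desktop.button-click
```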

There are a few more common and optional meta fields that WMF defines, but we don't need to explain them all here. For now we will write out just these two example meta fields. Later we will show how to include the event meta schema using $ref.

  ### Meta data object.  All event schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        # Whenever a format is used on a field, we require that maxLength is also set.
        # See https://github.com/epoberezkin/ajv#security-risks-of-trusted-schemas
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - dt
      - stream

Event data fields

Finally we can add any fields that we really want our event to have.

  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked

The new schema

Here is the new schema we just wrote:

title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
properties:
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path.  e.g. /schema_name/1.0.0 is acceptable. This often will
      (and should) match the schema's $id field.
  ### Meta data object.  All event schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - dt
      - stream
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked

examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.0.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser"}

Note the examples. This is optional, but can be nice if you want to give schema readers an example of what you expect event data to look like. Notice how the event's $schema matches exactly the schema's $id.
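
The relationship between an event and its schema can be sketched in a few lines of Python. This is not a JSONSchema validator and not how EventGate validates events; check_event is a hypothetical helper that only checks the $schema / $id match and the required meta fields from the schema above.

```python
schema = {
    "$id": "/mediawiki/desktop/button/click/1.0.0",
    "properties": {"meta": {"required": ["dt", "stream"]}},
}

event = {
    "$schema": "/mediawiki/desktop/button/click/1.0.0",
    "meta": {
        "dt": "2019-01-01T00:00:00Z",
        "stream": "mediawiki.desktop.button-click",
    },
    "button_name": "Edit source",
    "page_title": "Delayed-choice quantum eraser",
}


def check_event(event: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # The event's $schema must be exactly the schema's $id.
    if event.get("$schema") != schema["$id"]:
        problems.append("$schema does not match the schema's $id")
    # Every required meta field must be present in the event.
    meta = event.get("meta", {})
    for field in schema["properties"]["meta"]["required"]:
        if field not in meta:
            problems.append(f"meta.{field} is missing")
    return problems


print(check_event(event, schema))  # []
```

For real validation, use a full JSONSchema validator against the materialized schema file.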

Materializing the schema

jsonschema-tools calls the process of dereferencing, merging and generating the static versioned files 'materializing'. So far, we've saved our new schema as ./jsonschema/mediawiki/desktop/button/click/current.yaml. current.yaml will be the 'current working copy' of a schema. It can contain $ref URI pointers (more on this below). Any changes we make to schemas should always be done on their current.yaml files. We'll use jsonschema-tools to materialize current.yaml into a statically versioned schema file.

When we set up our schema repository, we installed a Git pre-commit hook to auto-materialize schemas. So, if we add and commit the new current.yaml, the hook materializes it and prints output like the following:

git add ./jsonschema/mediawiki/desktop/button/click/current.yaml
git commit -m 'Created mediawiki/desktop/button/click 1.0.0'

[2019-09-19 16:24:53.057 +0000]: Looking for modified current.yaml schema files in ./jsonschema/
[2019-09-19 16:24:53.093 +0000]: Materializing /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/current.yaml...
[2019-09-19 16:24:53.097 +0000]: Dereferencing schema with $id /mediawiki/desktop/button/click/1.0.0 using schema base URIs ./jsonschema/
[2019-09-19 16:24:53.120 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0.yaml.
[2019-09-19 16:24:53.121 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0.json.
[2019-09-19 16:24:53.122 +0000]: Created extensionless symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0 -> 1.0.0.yaml.
[2019-09-19 16:24:53.123 +0000]: New schema files have been materialized. Adding them to git: /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0.json
[master ac7b60d] Created mediawiki/desktop/button/click 1.0.0
 Date: Thu Sep 19 10:52:26 2019 -0400
 5 files changed, 130 insertions(+)
 create mode 120000 jsonschema/mediawiki/desktop/button/click/1.0.0
 create mode 100644 jsonschema/mediawiki/desktop/button/click/1.0.0.json
 create mode 100644 jsonschema/mediawiki/desktop/button/click/1.0.0.yaml
 create mode 100644 jsonschema/mediawiki/desktop/button/click/current.yaml

jsonschema-tools will notice any newly modified current.yaml files and materialize them on git commit. The version to materialize is taken from the value of $id in current.yaml. By default, both yaml and json files are materialized, and the versioned extensionless symlink points to the versioned yaml file.
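
To make the naming concrete, the following Python sketch lists the files that the default behavior described above would produce for a given $id. The helper materialized_paths is hypothetical and only mirrors the file names seen in the log output; it does not call jsonschema-tools.

```python
def materialized_paths(schema_id: str, base: str = "./jsonschema"):
    """List the files expected after materializing a schema with this $id."""
    # Split '/mediawiki/desktop/button/click/1.0.0' into directory and version.
    directory, version = schema_id.rsplit("/", 1)
    prefix = f"{base}{directory}/{version}"
    return {
        "yaml": f"{prefix}.yaml",
        "json": f"{prefix}.json",
        # Extensionless symlink path and the file name it points to.
        "symlink": (prefix, f"{version}.yaml"),
    }


paths = materialized_paths("/mediawiki/desktop/button/click/1.0.0")
print(paths["yaml"])     # ./jsonschema/mediawiki/desktop/button/click/1.0.0.yaml
print(paths["symlink"])  # ('./jsonschema/mediawiki/desktop/button/click/1.0.0', '1.0.0.yaml')
```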

Alternatively you can manually materialize a schema using the jsonschema-tools CLI. See $(npm bin)/jsonschema-tools --help for more information.

Modifying schemas

Versioned schemas should be (mostly) immutable. Once committed and merged, they may be used by many active producers and consumers. An existing version should not be changed (if you think you need to do so, get in touch with the Analytics or Core Platform Engineering teams). Instead, to modify a schema, create a new backwards compatible version.

Let's add a user_id to our event data. Edit jsonschema/mediawiki/desktop/button/click/current.yaml and add the following at the bottom of the schema.

# ...
  user_id:
    type: string
    description: ID of the user

# Add a user_id to our examples field too. The example's $schema should match
# the new $id (set below), and user_id must be a string to match the schema.
examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.1.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser", "user_id": "123"}

Since we've changed the schema, we MUST manually change the version in the schema's $id field. According to semantic versioning, our addition of the user_id field should be a minor version increment. So change $id to:

$id: /mediawiki/desktop/button/click/1.1.0
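
The version bump itself is a manual edit, but the semantic versioning arithmetic can be sketched in Python as follows. bump_id is a hypothetical helper for illustration, not a jsonschema-tools command.

```python
def bump_id(schema_id: str, part: str = "minor") -> str:
    """Return schema_id with its semantic version incremented."""
    path, version = schema_id.rsplit("/", 1)
    major, minor, patch = (int(n) for n in version.split("."))
    if part == "major":
        # Backwards-incompatible change: reset minor and patch.
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        # Backwards-compatible addition, such as a new optional field.
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown version part: {part}")
    return f"{path}/{major}.{minor}.{patch}"


print(bump_id("/mediawiki/desktop/button/click/1.0.0"))
# /mediawiki/desktop/button/click/1.1.0
```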

Since we've changed the version, jsonschema-tools will materialize new 1.1.0 version files on git commit:

git add ./jsonschema/mediawiki/desktop/button/click/current.yaml
git commit -m 'Added user_id and created mediawiki/desktop/button/click 1.1.0'

[2019-09-19 16:24:53.057 +0000]: Looking for modified current.yaml schema files in ./jsonschema/
[2019-09-19 16:24:53.093 +0000]: Materializing /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/current.yaml...
[2019-09-19 16:24:53.097 +0000]: Dereferencing schema with $id /mediawiki/desktop/button/click/1.1.0 using schema base URIs ./jsonschema/
[2019-09-19 16:24:53.120 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml.
[2019-09-19 16:24:53.121 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json.
[2019-09-19 16:24:53.122 +0000]: Created extensionless symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0 -> 1.1.0.yaml.
[2019-09-19 16:24:53.123 +0000]: New schema files have been materialized. Adding them to git: /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json
[master 827d1a6] Added user_id and created mediawiki/desktop/button/click 1.1.0
 4 files changed, 106 insertions(+), 2 deletions(-)
 create mode 120000 jsonschema/mediawiki/desktop/button/click/1.1.0
 create mode 100644 jsonschema/mediawiki/desktop/button/click/1.1.0.json
 create mode 100644 jsonschema/mediawiki/desktop/button/click/1.1.0.yaml

Including sub schemas

When materializing schemas, jsonschema-tools will dereference any $ref pointers and merge any allOf it finds. This allows us to DRY up common subschemas to avoid copy/paste bugs. It may also potentially allow us to standardize common fields (e.g. page_title) into a data dictionary reference (TBD).

For WMF, all event schemas should have a $schema event field, as well as use a common event meta sub object. The Wikimedia common schema is in the primary schema repository (https://schema.wikimedia.org/#!/primary/jsonschema) at /fragment/common.

In our example schema repository, assume we have a common schema at jsonschema/fragment/common/1.0.0 as:

title: fragment/common
description: Common schema fields for all WMF schemas
$id: /fragment/common/1.0.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
properties:
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be a short
      URI containing only the name and revision at the end of the URI path. 
      e.g. /schema_name/1.0.0 is acceptable. This should match
      the schema's $id field.
  meta:
    type: object
    required:
      - dt
      - stream
    properties:
      uri:
        type: string
        format: uri-reference
        maxLength: 8192
        description: Unique URI identifying the event / resource
      request_id:
        type: string
        description: Unique ID of the request that caused the event
      id:
        type: string
        maxLength: 36
        description: Unique ID of this event
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: 'Time stamp of the event, in ISO-8601 format'
      domain:
        type: string
        description: Domain the event pertains to
        minLength: 1
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
        minLength: 1
required:
  - $schema
  - meta

We want to include this schema (including its required properties) in our button/click example schema. Let's make a new version of this schema and include it using $ref. Edit jsonschema/mediawiki/desktop/button/click/current.yaml to:

title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.2.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
- $ref: /fragment/common/1.0.0
- properties:
    button_name:
      type: string
      description: Name of the button that was clicked
    page_title:
      type: string
      description: Page the button appeared on when clicked
    user_id:
      type: string
      description: ID of the user

examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.2.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click", "id": "12345678-1234-5678-1234-567812345678"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser", "user_id": "123"}

Notice that we've bumped the version number in $id again to 1.2.0. Commit and materialize this new schema.

git add ./jsonschema/mediawiki/desktop/button/click/current.yaml
git commit -m 'Using $ref to common in new version mediawiki/desktop/button/click 1.2.0'
...

The newly materialized ./jsonschema/mediawiki/desktop/button/click/1.2.0.yaml has both our schema and the included common schema:

title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.2.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
examples:
  - $schema: /mediawiki/desktop/button/click/1.2.0
    meta:
      dt: '2019-01-01T00:00:00Z'
      stream: mediawiki.desktop.button-click
      id: 12345678-1234-5678-1234-567812345678
    button_name: Edit source
    page_title: Delayed-choice quantum eraser
    user_id: '123'
required:
  - $schema
  - meta
properties:
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be a short
      URI containing only the name and revision at the end of the URI path. e.g.
      /schema_name/1.0.0 is acceptable. This should match the schema's $id
      field.
  meta:
    type: object
    required:
      - dt
      - stream
    properties:
      uri:
        type: string
        format: uri-reference
        maxLength: 8192
        description: Unique URI identifying the event / resource
      request_id:
        type: string
        description: Unique ID of the request that caused the event
      id:
        type: string
        maxLength: 36
        description: Unique ID of this event
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: 'Time stamp of the event, in ISO-8601 format'
      domain:
        type: string
        description: Domain the event pertains to
        minLength: 1
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
        minLength: 1
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
  user_id:
    type: string
    description: ID of the user

How this works

When jsonschema-tools encounters a $ref, it will attempt to resolve it and then replace it with the resolved content.
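
The following Python sketch shows the general idea of dereferencing a $ref and merging allOf. It is a simplified illustration under stated assumptions, not jsonschema-tools' actual implementation: the names deep_merge and materialize are hypothetical, the $ref is looked up in an in-memory dict rather than loaded from a URI, and only properties and required lists are merged.

```python
import copy

# Identifying fields that the including schema keeps for itself.
IDENTITY_KEYS = ("title", "description", "$id", "$schema")


def deep_merge(target: dict, source: dict) -> dict:
    """Merge source into target: union 'required' lists, recurse into dicts."""
    for key, value in source.items():
        if key == "required" and isinstance(value, list):
            merged = list(target.get("required", []))
            merged += [name for name in value if name not in merged]
            target["required"] = merged
        elif isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_merge(target[key], value)
        else:
            target[key] = copy.deepcopy(value)
    return target


def materialize(schema: dict, registry: dict) -> dict:
    """Replace each $ref in allOf with its content, then merge allOf away."""
    result = {k: copy.deepcopy(v) for k, v in schema.items() if k != "allOf"}
    for part in schema.get("allOf", []):
        if "$ref" in part:
            part = registry[part["$ref"]]  # resolve the reference
        part = {k: v for k, v in part.items() if k not in IDENTITY_KEYS}
        deep_merge(result, part)
    return result


registry = {
    "/fragment/common/1.0.0": {
        "title": "fragment/common",
        "$id": "/fragment/common/1.0.0",
        "type": "object",
        "properties": {"$schema": {"type": "string"}, "meta": {"type": "object"}},
        "required": ["$schema", "meta"],
    }
}
schema = {
    "title": "mediawiki/desktop/button/click",
    "$id": "/mediawiki/desktop/button/click/1.2.0",
    "type": "object",
    "allOf": [
        {"$ref": "/fragment/common/1.0.0"},
        {"properties": {"button_name": {"type": "string"}}},
    ],
}
merged = materialize(schema, registry)
print(sorted(merged["properties"]))  # ['$schema', 'button_name', 'meta']
print(merged["required"])            # ['$schema', 'meta']
```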

Absolute $ref

If the $ref starts with a URI scheme (e.g. http://, https:// or file://), jsonschema-tools will attempt to load it as is. For example, $ref: https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/common/1.0.0 will load the content at that URL.

Relative to baseSchemaUris

jsonschema-tools can be configured (in .jsonschema-tools.yaml) with multiple baseSchemaUris, the default of which is just the schemaBasePath (in our case, ./jsonschema). When a $ref starts with a slash (/), jsonschema-tools will iterate through each of the configured baseSchemaUris, prepend the base URI to the $ref value, and attempt to resolve it. If you set baseSchemaUris: [./jsonschema, https://schema.wikimedia.org/repositories/primary/jsonschema/], jsonschema-tools will look for your $ref path in both of those locations.
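
The lookup order described above can be sketched in Python as follows. candidate_uris is a hypothetical helper that only builds the list of locations to try; it does not fetch anything, and the way jsonschema-tools joins the base URI and the $ref may differ in detail.

```python
def candidate_uris(ref: str, base_schema_uris: list) -> list:
    """List the locations to try, in order, for a $ref."""
    if "://" in ref:
        # Absolute $ref: load it as is.
        return [ref]
    if ref.startswith("/"):
        # Relative $ref: prepend each configured base URI in turn.
        return [base.rstrip("/") + ref for base in base_schema_uris]
    raise ValueError(f"unsupported $ref in this sketch: {ref}")


bases = [
    "./jsonschema",
    "https://schema.wikimedia.org/repositories/primary/jsonschema/",
]
for uri in candidate_uris("/fragment/common/1.0.0", bases):
    print(uri)
# ./jsonschema/fragment/common/1.0.0
# https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/common/1.0.0
```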

Testing schemas

jsonschema-tools comes with a series of tests that ensure your schema repository is nice and clean. We showed how to install these tests in the section above, Creating a new schema repository. These are mocha tests, so all we need to do is run npm test. These tests will ensure that your schema repository structure is correct, that your schemas have required fields, and that schema versions are backwards compatible.
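
To give a feel for what a backwards compatibility check involves, here is a minimal Python sketch. The rules it applies (no field removed, no field type changed) are assumptions chosen for illustration; the actual checks performed by the jsonschema-tools tests may be stricter or different, and incompatible_changes is a hypothetical helper.

```python
def incompatible_changes(old: dict, new: dict, path: str = "") -> list:
    """List fields of the old schema that were removed or retyped in the new one."""
    problems = []
    new_props = new.get("properties", {})
    for name, old_field in old.get("properties", {}).items():
        field_path = f"{path}{name}"
        new_field = new_props.get(name)
        if new_field is None:
            problems.append(f"{field_path}: field removed")
        elif new_field.get("type") != old_field.get("type"):
            problems.append(f"{field_path}: type changed")
        elif old_field.get("type") == "object":
            # Recurse into nested objects such as meta.
            problems += incompatible_changes(old_field, new_field, f"{field_path}.")
    return problems


v1_0_0 = {"properties": {"button_name": {"type": "string"}}}
v1_1_0 = {"properties": {"button_name": {"type": "string"},
                         "user_id": {"type": "string"}}}
broken = {"properties": {"button_name": {"type": "integer"}}}

print(incompatible_changes(v1_0_0, v1_1_0))  # []
print(incompatible_changes(v1_0_0, broken))  # ['button_name: type changed']
```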