Analytics/Systems/EventLogging/Schema Guidelines

The Analytics Engineering team is considering establishing some schema guidelines that would make ingestion into Data Analysis tools easier. The current idea is that we can automatically load conforming schemas into Druid and make them available for analysis in Turnilo (formerly Pivot) or Superset, but the particular technologies used aren't too important. This is just a draft set of guidelines, started in 2017 as part of a collaboration with the Mobile Apps team to see how these would work in practice.

EventLogging Druid ingestion is manually configured. If you want some specific EventLogging data to make it into Druid, ask the analytics team for help.

NOTE: these guidelines are subject to change as part of the Modern Event Platform Schema Guidelines component. The basic ideas will remain the same, but some naming conventions may change. In the future, we'd like to improve these conventions to make automated Druid ingestion possible.

See also Modern Event Platform Schema Guidelines.

Guidelines

While events are originally sent as JSON, they need to be persisted to a SQL datastore, so some restrictions apply.

NEVER MAKE BACKWARDS INCOMPATIBLE CHANGES. This means that (almost) the ONLY change you can make to a schema is to add new optional fields.

There is no such thing as 'renaming' a field if the old field name exists in old data. You can only add new optional fields. The latest version of every schema should successfully validate every event of that schema ever produced.


Schema setup

  • The event date-time should always be stored in a field called dt, in ISO 8601 format in the UTC +00:00 timezone.
  • Other date-time fields should also be in ISO 8601 format, in fields suffixed with '_dt', e.g. session_start_dt.
  • If you must use an integer Unix epoch timestamp, send it in millisecond precision in the UTC +00:00 timezone and suffix the field name with '_ts', e.g. session_start_ts. (A combined sketch covering these naming and typing rules appears after this list.)
  • The schema should be as flat as possible. Don't send complex objects in a single field; flatten them out, as in the example schema further below. Complex events cause more work and possible confusion down the line during analysis.
  • Do not remove fields when making changes to the schema in the future. Restricting schema changes to only adding fields keeps the schema (and event code) backwards compatible and doesn't break queries; otherwise, queries would need to be revised every time the schema changed.
  • All arrays must specify the items type. E.g.
 
"array_field": {
  "type": "array",
  "items": {
    "type": "string"
  }
}
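
For example, an event validating against this array field fragment might then carry a value like the following (values made up for illustration):

{
  "array_field": ["first value", "second value"]
}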
  • Types should never change. This is tricky with JSON, as both decimals and integers are valid numbers. If you want integers, please use the integer type. If you want decimals, use the number type, but you'll need to make sure that the values ALWAYS have a decimal point in them: 2.0 is a valid float value, while 2 is not. You'll run into trouble if your data has both of these formats for the same field.
  • If the schema has any fields that measure time elapsed, use milliseconds as the time unit.
  • Union types are not supported. That means that you cannot send null for fields that are of type string. If you want an optional field, make it not required and just don't set it in your data.
  • EventLogging + Hive now supports map types. They are specified in JSONSchema like:
 
"map_field": {
  "type": "object",
  "additionalProperties": {
    "type": "string" // or whatever type your values are.
  }
}
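
An event with such a map field could then carry arbitrary string-to-string pairs, for example (the key names here are made up for illustration):

{
  "map_field": {
    "source": "main_menu",
    "variant": "b"
  }
}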
  • If there are fields (or sets of fields) that are mutually exclusive, they should be "named" distinctively. Example:

If events look like this:

 
{
  user: 'TheWikiEditor',
  editorship: {is_active:'yes', edit_count:'1000'}
}

Do not use a marker like the following one to indicate "absence" of data:

{
  user: 'TheBadWikiEditor',
  editorship: {status:'blocked'}
}

Rather, use an explicit field like:

{
  user: 'TheBadWikiEditor',
  blocked: 'true' //mutually exclusive with being able to retrieve editorship status
}
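
Putting several of the rules above together (timestamp naming, integer vs. number types, flatness, and optional fields), a conforming schema fragment might look like the following sketch. Apart from dt, session_start_dt and session_start_ts, the field names are invented for illustration:

{
  "type": "object",
  "properties": {
    "dt": {
      "type": "string",
      "description": "Event date-time, ISO 8601, UTC"
    },
    "session_start_dt": {
      "type": "string",
      "description": "Another date-time field, ISO 8601, note the _dt suffix"
    },
    "session_start_ts": {
      "type": "integer",
      "description": "Unix epoch timestamp in milliseconds, note the _ts suffix"
    },
    "edit_count": {
      "type": "integer",
      "description": "Whole numbers only"
    },
    "load_time_ratio": {
      "type": "number",
      "description": "Decimals: always send a decimal point, e.g. 2.0 rather than 2"
    },
    "referrer": {
      "type": "string",
      "description": "Optional: not listed under required; simply left unset when unknown, never null"
    }
  },
  "required": ["dt", "session_start_dt", "session_start_ts", "edit_count", "load_time_ratio"]
}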

Ingestion into Druid

  • All fields are "dimensions" by default. Fields that should be measures in Druid should be configured as such in the EventLoggingToDruid configs (see the section below explaining what dimensions are).

Example Schema Conforming to Guideline

Let's say we wanted to understand feature usage for a mobile app. We might have a schema that looks like this:

{
    dt:                      'ISO 8601 formatted timestamp, e.g. 2015-12-20T09:10:56Z',
    app_platform:            'Platform, Operating System',
    app_platform_version:    'Version of the OS',
    app_version:             'The version of the app',
    feature_category:        'A feature category, to allow analyzing groups of features',
    feature:                 'The name of the feature',

    time_since_last_action:  'In milliseconds, how much time passed between the last user action and the user engaging this feature',
    time_spent:              'In milliseconds, how much time the user spent using the feature'
}
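
A JSONSchema rendering of those fields might look roughly like the sketch below (descriptions shortened; which fields are marked required is left out here and would be up to the schema owner):

{
  "type": "object",
  "properties": {
    "dt":                     { "type": "string",  "description": "ISO 8601 timestamp, UTC" },
    "app_platform":           { "type": "string",  "description": "Platform, operating system" },
    "app_platform_version":   { "type": "string",  "description": "Version of the OS" },
    "app_version":            { "type": "string",  "description": "The version of the app" },
    "feature_category":       { "type": "string",  "description": "A feature category, to allow analyzing groups of features" },
    "feature":                { "type": "string",  "description": "The name of the feature" },
    "time_since_last_action": { "type": "integer", "description": "Milliseconds since the last user action" },
    "time_spent":             { "type": "integer", "description": "Milliseconds spent using the feature" }
  }
}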

Dimension

A dimension, in data modeling, is a construct that categorizes data. In the Druid sense, we're usually talking about degenerate dimensions, which are basically like labels for your data. Examples are: country, project, agent type, app version, browser, etc.

Regarding Dimensions with "many" Numeric Values

Druid is not the best tool for manipulating numeric data. Druid excels at manipulating cubes of "dimensions", each of which has low cardinality (the number of distinct values a dimension can take), e.g. "pageviews per country for Chrome across all Wikimedia projects". Numeric values that are not bucketed have "infinite" cardinality and thus are not well suited for Druid ingestion as dimensions. When they are ingested as "measures", please be aware that Druid supports a limited set of aggregations: http://druid.io/docs/latest/querying/aggregations
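
As an illustration (this is not the actual EventLoggingToDruid configuration format, just a way to picture the split), the fields of the example schema above might be classified like this:

{
  // low-cardinality, label-like fields: good Druid dimensions
  "app_platform":           "dimension",
  "app_platform_version":   "dimension",
  "app_version":            "dimension",
  "feature_category":       "dimension",
  "feature":                "dimension",

  // raw millisecond values: effectively infinite cardinality,
  // so ingest them as measures or as bucketed dimensions (see next section)
  "time_since_last_action": "measure or bucketed dimension",
  "time_spent":             "measure or bucketed dimension"
}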

Bucketing time measures

Fields that contain time elapsed values cannot be treated as metrics (measures) in Druid. However, our EventLogging Druid ingestion pipeline can bucket time elapsed fields, transforming them into 'ingestible' dimensions with values like "100ms-1sec", "10sec-1min", etc. So you can still consider having time elapsed fields in your schema.
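
As a sketch, here is how a raw time elapsed value might relate to its bucketed counterpart; the bucketed field name (time_spent_buckets), the feature value, and the bucket boundaries are made up for illustration, the real ones come from the ingestion configuration:

// raw event, as sent by the client
{
  "feature":    "some_feature",
  "time_spent": 350            // milliseconds
}

// the same event as a Druid row, after bucketing
{
  "feature":            "some_feature",
  "time_spent_buckets": "100ms-1sec"
}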

See also

Real life examples: