Analytics/Systems/EventLogging/Schema Guidelines

The Analytics Engineering team is considering establishing schema guidelines that would make ingestion into data analysis tools easier. The current idea is that we could automatically load conforming schemas into Druid and make them available for analysis in Turnilo (formerly Pivot), but the particular technologies used aren't too important. This is just a draft set of guidelines, started in 2017 as part of a collaboration with the Mobile Apps team to see how they would work in practice.

Guidelines

  • Time should always be stored in a field called dt, in ISO 8601 format.
  • The schema should be flat. Don't send complex objects in a single field; flatten them out and send a flat event, as in the example schema below. Complex events cause more work down the line during analysis.
  • All fields are "dimensions" by default unless they are prefixed with measure_ (see the section below explaining what dimensions are) [To be confirmed!].
  • Fields prefixed with measure_ are considered "measures" or "metrics", by which we mean numbers that can be summed or averaged [To be confirmed!].
  • Do not remove fields when making changes to the schema in the future. Restricting schema changes to only adding fields keeps events backwards compatible and doesn't break existing queries.
  • Types should never change. This is tricky with JSON, as both decimals and integers are valid numbers. If you want integers, use the integer type. If you want decimals, use the number type, but make sure the values ALWAYS have a decimal point in them: 2.0 is unambiguously a decimal, while 2 will be read as an integer. You'll run into trouble if the same field mixes both formats (see the sketch after this list).
  • If the schema has any fields that measure time elapsed, use milliseconds as the time unit.
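
As an illustration of the flattening and number-formatting guidelines, here is a minimal Python sketch. The flatten_event helper and the nested event below are hypothetical, not part of EventLogging; they just show the shape of the transformation.

import json

def flatten_event(event, parent_key='', sep='_'):
    """Recursively flatten a nested event into a single-level dict."""
    flat = {}
    for key, value in event.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_event(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

nested = {
    'dt': '2015-12-20T09:10:56Z',
    'app': {'platform': 'Android', 'version': '2.1.0'},
    'measure_time_spent': 3200,
}

flat = flatten_event(nested)
# {'dt': ..., 'app_platform': 'Android', 'app_version': '2.1.0',
#  'measure_time_spent': 3200}

# For decimal ("number") fields, force a decimal point so the type never
# flips between integer and float across events:
flat['measure_load_time'] = float(42)  # serializes as 42.0, not 42
print(json.dumps(flat))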

Example Schema Conforming to the Guidelines

Let's say we wanted to understand feature usage for a mobile app. We might have a schema that looks like this:

{
    dt:                             'ISO 8601 formatted timestamp, e.g. 2015-12-20T09:10:56Z',

    app_platform:                   'Platform, Operating System',
    app_platform_version:           'Version of the OS',
    app_version:                    'The version of the app',
    feature_category:               'A feature category, to allow analyzing groups of features',
    feature:                        'The name of the feature',

    measure_time_since_last_action: 'In milliseconds, how much time passed between the last user action and the user engaging this feature',
    measure_time_spent:             'In milliseconds, how much time did the user spend using the feature'
}

In this example, we could build an ingestion spec for Druid that treats app_platform, app_platform_version, app_version, feature_category, and feature as dimensions. It would consider measure_time_since_last_action and measure_time_spent as metrics. This is how Druid ingestion works: http://druid.io/docs/latest/ingestion/. Note that if a schema didn't conform to these guidelines, we could still write an ingestion spec for it by hand; these guidelines are meant to facilitate automatic ingestion.
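
To make the automatic-ingestion idea concrete, here is a minimal Python sketch of how the measure_ naming convention could drive spec generation. The build_ingestion_fragment helper is hypothetical and the output is only an abbreviated fragment, not a complete Druid ingestion spec; the choice of doubleSum as the aggregator is an assumption for illustration.

MEASURE_PREFIX = 'measure_'

def build_ingestion_fragment(field_names):
    """Split fields into dimensions and metrics by naming convention."""
    dimensions = [f for f in field_names
                  if f != 'dt' and not f.startswith(MEASURE_PREFIX)]
    metrics = [{'type': 'doubleSum', 'name': f, 'fieldName': f}
               for f in field_names if f.startswith(MEASURE_PREFIX)]
    return {
        'timestampSpec': {'column': 'dt', 'format': 'iso'},
        'dimensionsSpec': {'dimensions': dimensions},
        'metricsSpec': metrics,
    }

fields = ['dt', 'app_platform', 'app_platform_version', 'app_version',
          'feature_category', 'feature',
          'measure_time_since_last_action', 'measure_time_spent']
print(build_ingestion_fragment(fields))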

Dimension

In data modeling, a dimension is a construct that categorizes data. In the Druid sense, we're usually talking about degenerate dimensions, which are basically labels for your data. Examples: country, project, agent type, app version, browser, etc.

Regarding numeric values

Druid is not the best tool for manipulating numeric data. Druid excels at manipulating cubes of "dimensions", each of which has low cardinality (the number of distinct values a dimension can take), as in: "pageviews per country for Chrome across all Wikimedia projects". Numeric values that are not bucketed have "infinite" cardinality and are thus not well suited for Druid ingestion as dimensions. When they are ingested as "measures", please be aware that Druid supports a limited set of aggregations: http://druid.io/docs/latest/querying/aggregations
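
As an illustration of that limitation, there is for example no native "avg" aggregation; an average is typically derived at query time from a sum and a count. A minimal sketch, written as Python dicts in the shape of a Druid query spec, using the hypothetical field names from the example schema above:

# Two supported aggregations: a row count and a sum over the measure.
aggregations = [
    {'type': 'count', 'name': 'event_count'},
    {'type': 'doubleSum', 'name': 'time_spent_sum',
     'fieldName': 'measure_time_spent'},
]

# The average is computed afterwards as sum / count.
post_aggregation = {
    'type': 'arithmetic', 'name': 'avg_time_spent', 'fn': '/',
    'fields': [
        {'type': 'fieldAccess', 'fieldName': 'time_spent_sum'},
        {'type': 'fieldAccess', 'fieldName': 'event_count'},
    ],
}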

Bucketing time measures

A known problem with Druid is that it is currently unable to compute percentiles at scale. Thus, fields that contain time-elapsed values cannot be treated as metrics (measures). However, our Druid ingestion pipeline can bucket time-elapsed fields, transforming the field into an ingestible dimension with values like "100ms-1sec", "10sec-1min", etc. So you can still consider having time-elapsed fields in your schema (see the sketch below).
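
For illustration, here is a minimal Python sketch of this kind of bucketing. The boundaries and labels below are assumptions, not the ones the ingestion pipeline actually uses.

# Hypothetical bucketing of a milliseconds value into a low-cardinality
# dimension; the bucket boundaries are illustrative only.
BUCKETS = [
    (100,    '0-100ms'),
    (1_000,  '100ms-1sec'),
    (10_000, '1sec-10sec'),
    (60_000, '10sec-1min'),
]

def bucket_time_elapsed(ms):
    """Map a time-elapsed value in milliseconds to a bucket label."""
    for upper_bound, label in BUCKETS:
        if ms < upper_bound:
            return label
    return '1min+'

print(bucket_time_elapsed(450))      # '100ms-1sec'
print(bucket_time_elapsed(725_000))  # '1min+'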

See also

Real-life examples: