Analytics/Systems/EventLogging/Schema Guidelines

From Wikitech

This is a Draft

The Analytics team is considering establishing some schema guidelines that would make ingestion into Data Analysis tools easier. The current idea is that we can automatically load conforming schemas into Druid and make them available for analysis in Pivot, but the particular technologies used aren't too important. This is just a draft set of guidelines and we're currently working with the Mobile Apps team to see how these would work in practice.

Guidelines

  • Time should always be stored in a field called dt, in ISO 8601 format
  • The schema should be flat. Don't send complex objects in a single field; flatten them out and send an event as shown below. Complex events cause more work down the line during analysis.
  • All fields are "dimensions" by default unless they are prefixed with measure_ (see section below explaining what dimensions are).
  • Fields prefixed with measure_ are considered "measures" or "metrics", by which we mean they are numbers that can be summed up or averaged.
  • Do not remove fields when making changes to the schema in the future. Restricting schema changes to only adding fields keeps the events backwards compatible and doesn't break queries.
  • Types should never change. This is tricky with JSON, because integers and decimals are both just numbers. If you want integers, use the integer type. If you want decimals, use the number type and make sure the values ALWAYS contain a decimal point: 2.0 will be read as a decimal, but 2 may be read as an integer. You'll run into trouble if your data has both of these formats for the same field.
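
As a rough illustration of the flattening and type-stability guidelines above, here is a minimal Python sketch. The helper names (flatten, prepare_event), the underscore separator, and the sample event are assumptions made for this example and are not part of EventLogging.

```python
import json
from datetime import datetime, timezone


def flatten(obj, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def prepare_event(raw, decimal_fields=()):
    """Return a flat event with a dt field and stable decimal types."""
    event = flatten(raw)
    # Add dt (UTC, ISO 8601) only if the caller did not supply one.
    event.setdefault("dt", datetime.now(timezone.utc).isoformat(timespec="seconds"))
    # Coerce decimal fields to float so they always serialize with a decimal point.
    for field in decimal_fields:
        event[field] = float(event[field])
    return event


# Hypothetical raw event with a nested object and an integer-looking measure.
raw = {
    "dt": "2015-12-20T09:10:56Z",
    "app": {"platform": "Android", "version": "2.1.0"},
    "feature": "search",
    "measure_time_spent": 2,
}
event = prepare_event(raw, decimal_fields=["measure_time_spent"])
print(json.dumps(event))
```

The nested app object becomes app_platform and app_version, and measure_time_spent is serialized as 2.0 rather than 2, so the field keeps one type across events.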

Example Schema Conforming to the Guidelines

Let's say we wanted to understand feature usage for a mobile app. We might have a schema that looks like this:

{
    dt:                             'ISO 8601 formatted timestamp, e.g. 2015-12-20T09:10:56Z',

    app_platform:                   'Platform, Operating System',
    app_platform_version:           'Version of the OS',
    app_version:                    'The version of the app',
    feature_category:               'A feature category, to allow analyzing groups of features',
    feature:                        'The name of the feature',

    measure_time_since_last_action: 'In seconds, how much time passed since the last user action and until they engaged this feature',
    measure_time_spent:             'In seconds, how much time did the user spend using the feature'
}

In this example, we could build an ingestion spec for Druid that treats app_platform, app_platform_version, app_version, feature_category, and feature as dimensions, and measure_time_since_last_action and measure_time_spent as metrics. This is how Druid ingestion works: http://druid.io/docs/latest/ingestion/. Note that if a schema didn't conform to these guidelines, we could still write an ingestion spec for it manually; the guidelines are only meant to facilitate automatic ingestion.
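
The prefix convention makes this split easy to automate. The sketch below is one possible way to classify field names; classify_fields and the shape of the returned dictionary are hypothetical and are not an actual Druid ingestion spec. Treating dt as the timestamp column rather than as a dimension is also an assumption, chosen to match the list of dimensions given above.

```python
MEASURE_PREFIX = "measure_"
TIME_FIELD = "dt"


def classify_fields(field_names):
    """Split schema field names into dimensions and measures by prefix."""
    dimensions, measures = [], []
    for name in field_names:
        if name == TIME_FIELD:
            continue  # assumption: dt is handled separately as the timestamp
        if name.startswith(MEASURE_PREFIX):
            measures.append(name)
        else:
            dimensions.append(name)
    return {"timestamp": TIME_FIELD, "dimensions": dimensions, "measures": measures}


# Field names taken from the example schema.
fields = [
    "dt",
    "app_platform",
    "app_platform_version",
    "app_version",
    "feature_category",
    "feature",
    "measure_time_since_last_action",
    "measure_time_spent",
]
result = classify_fields(fields)
print(result["dimensions"])
print(result["measures"])
```

A real ingestion spec would need more than this (for example, which aggregation to apply to each measure), so the result is best read as an intermediate step.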

Dimension

A dimension, in data modeling, is a construct that categorizes data. In the Druid sense, we're usually talking about degenerate dimensions, which are essentially labels for your data. Examples are: country, project, agent type, app version, browser, etc.
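
To make the distinction concrete: dimensions are what you group by, and measures are what you sum or average within each group. The following Python sketch uses made-up sample events and a hypothetical sum_by helper; it only illustrates the idea and does not reflect how Druid computes aggregates.

```python
from collections import defaultdict

# Illustrative events only; the values are invented for this example.
events = [
    {"app_platform": "Android", "feature": "search", "measure_time_spent": 12.0},
    {"app_platform": "iOS", "feature": "search", "measure_time_spent": 7.5},
    {"app_platform": "Android", "feature": "share", "measure_time_spent": 3.0},
]


def sum_by(events, dimension, measure):
    """Sum a measure over all events, grouped by the value of one dimension."""
    totals = defaultdict(float)
    for event in events:
        totals[event[dimension]] += event[measure]
    return dict(totals)


totals = sum_by(events, "app_platform", "measure_time_spent")
print(totals)  # {'Android': 15.0, 'iOS': 7.5}
```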

See also

Real life examples: