Data Platform/Systems/EventLogging/Schema Guidelines
This documentation is outdated. See Event_Platform/Schemas/Guidelines.
The Analytics Engineering team is considering establishing some schema guidelines that would make ingestion into Data Analysis tools easier. The current idea is that we can automatically load conforming schemas into Druid and make them available for analysis in Turnilo (formerly Pivot) or Superset, but the particular technologies used aren't too important. This is just a draft set of guidelines, started in 2017 as part of a collaboration with the Mobile Apps team to see how these would work in practice.
EventLogging Druid ingestion is manually configured. If you want some specific EventLogging data to make it into Druid, ask the analytics team for help.
NOTE: these guidelines are subject to change as part of the Modern Event Platform Schema Guidelines component. The basic ideas will remain the same, but some naming conventions may change. In the future, we'd like to improve these conventions to make automated Druid ingestion possible.
See also Modern Event Platform Schema Guidelines.
Guidelines
While events are originally sent as JSON, they need to be persisted to a SQL datastore, so some restrictions apply.
NEVER MAKE BACKWARDS INCOMPATIBLE CHANGES. This means that (almost) the ONLY change you can make to a schema is to add new optional fields.
There is no such thing as 'renaming' a field if the old field name exists in old data. You can only add new optional fields. The latest version of every schema should validate against every event of that schema ever produced.
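As a minimal sketch of what a safe change looks like (the non-dt field names here are hypothetical, used only for illustration), the only evolution is adding a new optional property; old events that omit it still validate against the latest version:
{
    "type": "object",
    "properties": {
        "dt": { "type": "string" },
        "action": { "type": "string" },
        "action_source": { "type": "string" } // hypothetical optional field added in a later revision; old events simply omit it
    },
    "required": ["dt", "action"]
}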
Schema set up
- Event date time should always be stored in a field called dt, in ISO 8601 format in UTC +00:00 timezone.
- Other date time fields should be in ISO 8601 format in fields suffixed with '_dt', e.g. session_start_dt
- If you must use an integer unix epoch timestamp, send it in millisecond precision in UTC +00:00 timezone and name the field with the suffix '_ts', e.g. session_start_ts (see the sketch at the end of this list).
- The schema should be as flat as possible. Don't send complex objects in a single field; flatten them out and send an event as shown below. Complex events cause more work and possible confusion down the line during analysis.
- Do not remove fields when making changes to the schema in the future. Restricting schema changes to only adding fields keeps the schema (and event code) backwards compatible and doesn't break queries; otherwise, queries would need to be revised every time the schema changed.
- All arrays must specify the items type. E.g.
"array_field": {
"type": "array",
"items": {
"type": "string"
}
}
- Types should never change. This is tricky with JSON, as both decimals and integers are valid numbers. If you want integers, please use the integer type. If you want decimals, use the number type, but you'll need to make sure that the values ALWAYS have a decimal point in them: 2.0 is a valid float number, 2 is not. You'll run into trouble if your data has both of these formats for the same field.
- If the schema has any fields that measure time elapsed, use milliseconds as the time unit.
- Union types are not supported. That means you cannot send null for fields whose type is string. If you want an optional field, make it not required and just don't set it in your data (see the sketch at the end of this list).
- EventLogging + Hive now supports map types. They are specified in JSONSchema like:
"map_field": {
"type": "object",
"additionalProperties": {
"type": "string" // or whatever type your values are.
}
}
- If there are fields (or sets of fields) that are mutually exclusive, they should be named distinctively. For example, if events look like this:
{
    user: 'TheWikiEditor',
    editorship: {is_active: 'yes', edit_count: '1000'}
}
do not use a marker value like the following to indicate the "absence" of data:
{
    user: 'TheBadWikiEditor',
    editorship: {status: 'blocked'}
}
Rather, use an explicit field:
{
    user: 'TheBadWikiEditor',
    blocked: 'true' // mutually exclusive with being able to retrieve editorship status
}
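To pull the conventions above together, here is a minimal JSONSchema sketch (all field names are hypothetical and chosen only to illustrate the timestamp naming, typing and optional-field rules; the required list assumes draft-04-style JSONSchema):
{
    "type": "object",
    "properties": {
        "dt": {
            "type": "string",
            "description": "Event date time, ISO 8601, UTC"
        },
        "session_start_dt": {
            "type": "string",
            "description": "Another date time field, ISO 8601, UTC, suffixed with _dt"
        },
        "session_start_ts": {
            "type": "integer",
            "description": "Unix epoch timestamp in milliseconds, UTC, suffixed with _ts"
        },
        "sampling_rate": {
            "type": "number",
            "description": "Decimal value, always sent with a decimal point, e.g. 0.5"
        },
        "edit_count": {
            "type": "integer",
            "description": "Optional field: not listed in required, omitted rather than sent as null"
        }
    },
    "required": ["dt"]
}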
Ingestion into Druid
- All fields are "dimensions" by default. Fields that should be measures in Druid should be configured as such in EventLoggingToDruid configs (see the Dimension section below for an explanation of what dimensions are).
Example Schema Conforming to Guideline
Let's say we wanted to understand feature usage for a mobile app. We might have a schema that looks like this:
{
    dt: 'ISO 8601 formatted timestamp, e.g. 2015-12-20T09:10:56Z',
    app_platform: 'Platform, Operating System',
    app_platform_version: 'Version of the OS',
    app_version: 'The version of the app',
    feature_category: 'A feature category, to allow analyzing groups of features',
    feature: 'The name of the feature',
    time_since_last_action: 'In milliseconds, how much time passed since the last user action and until they engaged this feature',
    time_spent: 'In milliseconds, how much time did the user spend using the feature'
}
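For reference, a possible JSONSchema rendering of the example above; the field names and descriptions are taken from the example, while the types and the required list are assumptions made for illustration:
{
    "type": "object",
    "properties": {
        "dt": {
            "type": "string",
            "description": "ISO 8601 formatted timestamp, e.g. 2015-12-20T09:10:56Z"
        },
        "app_platform": {
            "type": "string",
            "description": "Platform, Operating System"
        },
        "app_platform_version": {
            "type": "string",
            "description": "Version of the OS"
        },
        "app_version": {
            "type": "string",
            "description": "The version of the app"
        },
        "feature_category": {
            "type": "string",
            "description": "A feature category, to allow analyzing groups of features"
        },
        "feature": {
            "type": "string",
            "description": "The name of the feature"
        },
        "time_since_last_action": {
            "type": "integer",
            "description": "In milliseconds, how much time passed since the last user action and until they engaged this feature"
        },
        "time_spent": {
            "type": "integer",
            "description": "In milliseconds, how much time did the user spend using the feature"
        }
    },
    "required": ["dt"]
}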
Dimension
A dimension, in data modeling, is a construct that categorizes data. In the Druid sense, we're usually talking about degenerate dimensions, which are basically labels for your data. Examples are: country, project, agent type, app version, browser, etc.
Regarding Dimensions with "many" Numeric Values
Druid is not the best tool for manipulating numeric data; Druid excels at manipulating cubes of "dimensions", each of which has low cardinality (the number of distinct values the dimension can take), e.g. "pageviews per country for Chrome across all Wikimedia projects". Numeric values that are not bucketed have "infinite" cardinality and thus are not well suited for Druid ingestion as dimensions. When they are ingested as "measures", please be aware that Druid supports a limited set of aggregations: http://druid.io/docs/latest/querying/aggregations
Bucketing time measures
Fields that contain time-elapsed values cannot be treated as metrics (measures) in Druid. However, our EventLogging Druid ingestion pipeline can bucket time-elapsed fields, transforming the field into an "ingestible" dimension with values like "100ms-1sec", "10sec-1min", etc. So you can still consider having time-elapsed fields in your schema.
See also
Real life examples:
- Ingestion of the pageviews_hourly Hadoop table into Druid: load_pageview_daily.json.template (specifying dimensions and metrics), generate_daily_druid_pageviews.hql (the daily Hive query exporting the Hadoop table)
- T202751 "Ingest data from PageIssues EventLogging schema into Druid"