Data Platform/Data modeling guidelines

Overview

The goal of this document is to establish best practices in the context of the Wikimedia project domain and its technical systems—specifically, analytics data primarily stored in the Hadoop data lake and made available through Hive and Iceberg.

There are over 300 curated datasets in the data lake, following different informal standards that have evolved organically over many years. This document will take these into consideration, identifying inconsistencies and best practices, with the intent of establishing more formal guidelines for future datasets.

Issues identified while drafting these guidelines can be found in Data Modeling Guidelines Issues.

Goals

Having data modeling guidelines aims to improve:

  • Efficiency in data handling, by reducing cognitive overload both for data producers designing data models and for consumers understanding and using the data. It also allows for consistent and streamlined workflows, including better data integration and interoperability
  • Communication and collaboration between individuals and teams
  • Data quality: consistency reduces misunderstandings and supports data accuracy and reliability
  • Compliance and privacy, by anticipating and addressing privacy and compliance concerns during the design process. Standardized models support easier reviews and audits
  • Scalability and data interoperability

Guidelines

Dataset Creation Process

Your first stop when considering creating a new dataset is the Dataset Creation Pathway flowchart. The crucial first steps there are:

  1. Check DataHub to see if the data you need is already there. If not, is there a similar table that could be updated to meet your needs?
  2. If you do need a new dataset, have you identified a Business Data Steward and Technical Steward?

After that, you can start thinking about the schema of your dataset.

As for where your data lives, if it is in the Data Lake, new datasets should be created in Iceberg, and should slot into one of the databases listed in the “Changes to database names” section. If none of those seem right, contact Data Engineering or post your question in the #data-engineering-collab Slack channel to discuss.

General Schema Design Guidelines

[De]normalization

For analytics purposes, wider, denormalized tables are more efficient. You should design your tables so that all the information your prospective users will normally need can be retrieved without joining to another table. This incurs extra storage cost, but storage is cheaper and easier to scale than CPU and memory. It also makes for more efficient queries and reduces the cognitive load on the user.
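
As a minimal sketch (the table and column names here are hypothetical, and the DDL assumes Spark SQL with Iceberg), a denormalized design might inline the page attributes that consumers usually filter on:

-- Hypothetical denormalized table: page attributes are stored inline,
-- so common queries need no joins.
CREATE TABLE wmf_example.pageview_enriched (
  dt            timestamp,  -- UTC
  page_id       bigint,
  page_title    string,     -- denormalized from the page data
  namespace_id  bigint,     -- denormalized from the page data
  wiki_id       string,     -- e.g. 'enwiki'
  view_count    bigint
)
USING iceberg;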

Some cases where you should not denormalize are when the data being joined is small, such as country, language, or wiki tables. First, the join key is often textual and carries enough information for most queries (e.g. country_code = ‘us’ or wiki_id = ‘enwiki’), eliminating the need for a join. Second, small tables make for cheap joins. (An exception to this rule is when the data is for use in a Druid data cube, which does not allow joins.)

Another case where you should avoid denormalizing is when the joined data is subject to change. This is likely rare in the data warehouse, but something to consider to avoid costly backfills.

Granularity

Granularity in datasets refers to the level of detail in each row. High-granularity datasets contain a lot of detail and consequently use more storage space. This also implies slower queries. But sometimes that level of detail is needed. Low-granularity data are generally aggregations of higher-granularity data. This can drastically reduce storage space and speed up query times.

An example of a high-granularity dataset is wmf.webrequest, which contains individual web requests and their associated attributes. It adds over 10 billion records/day.

wmf.unique_editors_by_country_monthly is an example of a low-granularity dataset. It contains only a few fields and a bit over 1000 records/month.

The appropriate granularity for your dataset is largely dependent on how the data is going to be queried. If users only need to see daily/weekly/monthly counts, a low-granularity aggregation is appropriate. If detailed data is needed for debugging or auditing purposes, then you need high-granularity—just keep an eye on how much storage you’re using and the nature of the data you’re storing (see Privacy Considerations below).

If your dataset can be reconstructed from source data, it’s better to start with lower granularity and increase the granularity as needed. If your dataset is the source data, or otherwise can’t be reconstructed, then you’re better off keeping it at the highest granularity possible, keeping in mind storage and privacy considerations.
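
For illustration, a low-granularity monthly aggregate can be derived from a high-granularity source (the table names below are hypothetical; year() and month() are standard Spark SQL/Presto functions):

SELECT
  year(dt)  AS year,
  month(dt) AS month,
  country_code,
  count(*)  AS view_count
FROM wmf_example.pageview_detail
GROUP BY year(dt), month(dt), country_code;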

Immutability

Your dataset should be designed with immutability in mind. Updating or deleting data (other than truncating, or deleting at the partition level) in an analytics data store is expensive. (There will be exceptions, but mostly for static tables where updating a row happens infrequently, e.g. canonical_data.wikis.)

What this means in practice is that each new record should carry a timestamp for when it was created, and that new data should be appended to your table.
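
In a batch pipeline this might look like the following append-only insert, a sketch with hypothetical table and column names (Spark SQL):

INSERT INTO wmf_example.page_action
SELECT
  page_id,
  action_name,
  current_timestamp() AS dt  -- record creation time; ensure the session writes UTC
FROM staging.page_action_new;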

Privacy Considerations

When considering which fields to include in your schema, special thought should be given to which might be considered personally-identifiable information (PII), either by themselves or when joining with other data.

Per the WMF Privacy Policy, any PII should be purged within 90 days. However, the best policy is to not store PII at all. If you don’t need it, don’t store it.

You should also refrain from storing any user-specific geographical data at anything finer-grained than the country level.

Release Frequency

The release frequency of your data should reflect the anticipated use cases. It doesn’t make sense to update wmf.unique_editors_by_country_monthly every hour, but there are certainly a lot of datasets that could stand to be updated more frequently. Some datasets, like wmf.mediawiki_wikitext_current, depend on periodic dumps that, due to technical limitations, currently cannot be produced more frequently. As a general rule, though, you should aim to update your dataset daily (provided it isn’t explicitly a longer-duration rollup).

Storage Format

Compressed Parquet should be used as the storage format in HDFS whenever possible. Currently Snappy compression is most widely used, but we will be investigating moving to zstd in Iceberg.
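
As a sketch of how this can be configured on an Iceberg table (the table name is hypothetical; write.parquet.compression-codec is a standard Iceberg table property):

ALTER TABLE wmf_example.request
SET TBLPROPERTIES ('write.parquet.compression-codec' = 'zstd');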

There are at least a couple of cases (mediawiki_wikitext_{current, history}) where Avro is being used instead because of memory issues when loading Parquet data in Spark. In such cases, Avro is a good fallback.

Naming and Data Type Conventions

Table and column names

Tables and columns should use lowercase snake_case. Use only the characters a-z, 0-9, and underscore (_). These conventions are shared with Event Platform’s excellent guidelines.

Following the MediaWiki standard, tables should be singular nouns, e.g. user, page, etc. Exceptions can be made for aggregate or metrics tables where the singular might be awkward or confusing.

Avoid using team, project, or specific technology names in table and column names, as those tend to change over time.

Aggregates

Tables that are aggregates should be suffixed with the period over which they were aggregated—i.e. _hourly, _daily, _monthly, _quarterly, _yearly. Note that this is not necessarily the same as the update interval—e.g. edits_hourly is currently updated monthly. You should be sure to document the update frequency of your dataset in DataHub.

per vs. by

_per_ should be used instead of _by_ in a table name when the data on the left side of the _per_ is calculated for each member of (i.e. grouped by) the right side—e.g. unique_devices_per_domain.

IDs

ID fields should be numeric (bigint), referencing the primary key of another table. E.g. user_id in one table implies the existence of a users table with a matching column.

ID fields should have an _id suffix. Do not use just id as a column name—use the same name as you would if it were a foreign key in another table. E.g. user_id in the users table. For one, this adds clarity to complex queries. It also allows the use of USING to simplify queries:

SELECT * 
FROM table_1
JOIN table_2
  ON table_1.key_A = table_2.key_A
  AND table_1.key_B = table_2.key_B

becomes:

SELECT *
FROM table_1
JOIN table_2
USING (key_A, key_B)

(example taken from https://prestodb.io/docs/current/sql/select.html#using)

It is permissible for an ID field to be non-numeric, so long as it maps to a (preferably primary) key in another table. However, non-numeric IDs that aren’t foreign key references, such as UUIDs or device IDs, should not use _id. For UUIDs, use _uuid. For other types of unique IDs, use _uid.

There are many examples where existing schemas do not follow these ID rules. However, going forward, newly designed schemas and fields should comply.

Timestamps

For the most part, WMF has standardized on ISO-8601 (e.g. “2024-03-13T09:10:56Z”), with UTC (“Z”) as the timezone. Again, the Event Platform has excellent guidelines for datetimes/timestamps, which should be adopted here as well.

dt should be used as a record timestamp field. Other timestamp fields should use the _dt suffix. E.g. start_dt, end_dt.

Timestamps should be stored using the timestamp data type, in UTC, rather than as a varchar. This makes them much easier to work with, avoiding the need for cast(from_iso8601_timestamp(dt) as timestamp). Additionally, string timestamp comparisons can be problematic. Finally, timestamps are stored in most systems as 64-bit integers, so they take up significantly less space than a string representation. Relevant ticket: T278467: Use Hive/Spark timestamps in Refined event data

Sub-second timestamps can be expressed using decimal seconds, e.g. “2024-03-13T09:10:56.123Z”.

A note on timezones: Some data storage systems, including Postgres and Iceberg, have separate timestamp data types for timestamps with and without a timezone. In Iceberg, timestamp has no timezone, and timestamptz does. To avoid any confusion, especially when moving data between systems, always use the timezone-less timestamp data type and be sure all times are UTC.
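
A minimal sketch of the recommended declaration (hypothetical table and columns; Spark SQL DDL for Iceberg):

CREATE TABLE wmf_example.session (
  session_uid  string,
  start_dt     timestamp,  -- timezone-less; values must be written as UTC
  end_dt       timestamp
)
USING iceberg;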

Dates

Dates should be stored with the date data type, and suffixed with _date.

Durations

Fields that represent a duration of time (e.g. first_paint_time, time_firstbyte) should include their units (first_paint_time_ms, time_firstbyte_sec). Ideally, these should be expressed in integer milliseconds and suffixed with _ms. See also Numeric Values below.

Units         Suffix
seconds       _sec
milliseconds  _ms
microseconds  _us
nanoseconds   _ns

Year/Month/Day/Hour Partition Columns

Many of our Hive tables have year, month, and sometimes day and hour columns for partitioning. For tables that also have a dt (or other timestamp/date column), this is redundant, but often necessary for Hive partitioning. Iceberg, however, supports hidden partitioning, where partition values can be derived from a timestamp or date column. So these extraneous year/month/day/hour columns should be dropped when moving to Iceberg.
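
For example (hypothetical table; Spark SQL DDL for Iceberg), a daily partition can be derived from dt with a partition transform, with no separate year/month/day/hour columns:

CREATE TABLE wmf_example.request (
  dt   timestamp,  -- UTC
  uri  string
)
USING iceberg
PARTITIONED BY (days(dt));  -- hidden partitioning, derived from dt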

Months

month should always be an integer in the range 1-12. There are a number of tables, mostly monthly aggregates, that have a month column that is a varchar in YYYY-MM format. This should be deprecated and replaced with separate year and month integer columns. (Note that if there is also a date column in the table, year/month should be dropped entirely when the table is moved to Iceberg; see “Year/Month/Day/Hour Partition Columns” above.)

If it is necessary to have a YYYY-MM string in the table, the column should be renamed to year_month to disambiguate.
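
A migration sketch, assuming a legacy table with a YYYY-MM varchar column (all names hypothetical):

SELECT
  cast(substr(year_month, 1, 4) AS int) AS year,   -- '2024-03' -> 2024
  cast(substr(year_month, 6, 2) AS int) AS month   -- '2024-03' -> 3
FROM legacy.monthly_aggregate;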

Numeric Values

In general, columns with numeric values in indeterminate units should specify those units. Additionally, values that are counts should be suffixed with _count using the singular form of the root word(s).

Instead of...       ...use
visible_length      visible_length_px
time_firstbyte      time_firstbyte_sec
response_size       response_size_bytes
views               view_count

Booleans

Boolean fields should be prefixed with is_ or has_ (e.g. is_visible, has_cookies). It is acceptable to have is/has in the middle of the name (e.g. user_is_anonymous).

Integers

bigint (64-bit integers) should be the default for any integer columns, with the exception of columns for which values over ~2 billion would be invalid, such as year, month, day, or hour. In this case, integer is acceptable.

Floats and Doubles

Floats should generally be avoided, particularly in any context where accuracy is paramount—especially when representing money. A classic example of why:

>>> print(0.1 + 0.2)
0.30000000000000004

Try to represent your value as an integer if possible. If you really need a decimal place, use decimal. If that doesn’t work for whatever reason, at least use double (vs. float).
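
For example (a sketch with a hypothetical table; the decimal precision and scale are chosen for illustration):

-- decimal stores exact values, avoiding float rounding errors for money.
CREATE TABLE wmf_example.donation (
  donation_id  bigint,
  amount_usd   decimal(12, 2)
)
USING iceberg;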

Countries

Countries should be identified by their two-letter ISO-3166-1 alpha-2 code, under the field name country_code. If the full country name (or at least the most common English variant) is needed, it can be obtained by joining canonical_data.countries.
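
A minimal join sketch (the dataset name is hypothetical, and the canonical_data.countries field names used here, iso_code and name, should be verified against the actual schema):

SELECT t.country_code, c.name
FROM wmf_example.my_dataset t
JOIN canonical_data.countries c
  ON t.country_code = c.iso_code;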

Languages

Human languages are identified using the ISO 639 language code, and should use the field name language_code. These should map to the language_code field in canonical_data.wikis. Note that there are some non-standard language codes in use.

There are a few tables with a language or language_name field containing the full language name (e.g. wmf.edit_hourly, canonical_data.wikis). A languages reference table in canonical_data is recommended.

WMF-Specific Conventions

wiki vs. wiki_id vs. wiki_db vs. project

wiki should be used for any external-user-facing data, and should contain the full domain name, e.g. “en.wikipedia.org” or “fr.wiktionary.org”. This should map to canonical_data.wikis.domain_name.

wiki_id should be used for internal references to a particular MediaWiki database, as defined in canonical_data.wikis.database_code. wiki_db is commonly used for this now, but the Event Platform has decided to standardize on wiki_id, which is also used in the MW codebase.

project is used widely and is usually a synonym of wiki (in Hive tables, project is the domain name of the wiki without the top-level domain (TLD), e.g. “en.wikipedia” or “fr.wiktionary”). However, it has come to have different meanings in different contexts, and as such should be deprecated.

Neil Shah-Quinn had a great summary of wiki vs. project:

Actually, I would say that "wiki" and "project" aren't necessarily the same.

"Wiki" is pretty clear. en.wikipedia.org is one wiki; fr.wikipedia.org is another.

"Project class"/"Project family" are also pretty clear. Wikipedia is one; Wiktionary is another. It feels a little weird to call, say, "Wikidata" a project family since it's just one wiki, but I don't think it's actually a problem for anyone.

But "project" is a problem. For some people, "project" is a synonym for wiki, but [for] other people Wikipedia is a project and English Wikipedia is, say, a "subproject" or "language edition". For example, https://meta.wikimedia.org/wiki/Complete_list_of_Wikimedia_projects takes that second approach. And of course, "project" can also mean lots of other things including "campaign" or "WikiProject"

So, I guess I'd say use "wiki" and "project family" for the main glossary entries, and then for "project", say it could mean either, or pretty much anything else, depending on the context 😅

project_family should be used to refer to the high-level projects listed in Complete list of Wikimedia projects. Currently that list is: commons, foundation, incubator, meta, sources, species, wikibooks, wikidata, wikifunctions, wikimedia, wikinews, wikipedia, wikiquote, wikisource, wikiversity, wikivoyage, wiktionary.

performer vs. actor vs. user

actor was added to MediaWiki core in 2018 to represent either a user or an IP address, in order to save space in large tables like logging and revision.

performer is used in the Event Platform to indicate the user performing the action. This was chosen because it was already in use in MediaWiki core, in LogEntry and RecentChange. It was a surprise to the designers that actor was already in use.

user is used in many tables in Hive to represent a registered or temporary user.

For non-event tables, you should use user_id to refer to a user. If the user is potentially missing (as is the case with older IP users), user_id can be NULL. actor should be reserved specifically for referring to entries in the MW actor table.

user_text vs. user_name

user_text is used in a number of tables (mediawiki_history_*, mediawiki_wikitext_*) to contain either a logged-in user’s username, or an IP address if the user is anonymous. It is not used in MW core, which has a user table with a user_name field for the username. Despite the prevalence of user_text, user_name should be used for storing usernames going forward and should be left blank if the user is not logged in or otherwise anonymous.

page vs. article

page is what is used in most places (including MW core) to refer to a MediaWiki page. There are a few places where article is used instead, although it appears to be almost exclusively in events (mobilewikiapparticlesuggestions, relatedarticles, and some properties on other events). This Design Best Practices page talks about the difference:

The word 'article' is meaningless on various wikis and the word 'page' should be used wherever possible. That said, sometimes the word page itself is too ambiguous - for example, when describing pages in the main namespace on Wikipedia, the word 'articles' would be more meaningful.

Unless you have a specific reason for distinguishing between main namespace pages (e.g. main namespace article counts), you should use page instead of article.

Glossary

Terms to be added to the DataHub Glossary.

Actor - An object and table in MW core that represents either a registered user, or IP address for non-logged-in/non-registered users, in order to save space in large tables like logging and revision.

Event - A strongly-typed and schema-ed piece of data, usually representing something happening at a definite time. E.g. 'revision create','user button click', 'page view', etc. See Event Platform for more information.

GDI - Global Data and Insights. Data gathered by a number of teams across Research & Decision Science. There are gdi datasets in Hive.

Performer - Used in the Event Platform to indicate the user performing the action.

Project Family - Refers to the high-level projects listed in Complete list of Wikimedia projects. Currently that list is: commons, foundation, incubator, meta, sources, species, wikibooks, wikidata, wikifunctions, wikimedia, wikinews, wikipedia, wikiquote, wikisource, wikiversity, wikivoyage, wiktionary.

User - Represents a registered or temporary user.
