Data Platform/Data modeling guidelines
Overview
The goal of this document is to establish best practices in the context of the Wikimedia project domain and its technical systems—specifically, analytics data primarily stored in the Hadoop data lake and made available through Hive and Iceberg.
There are over 300 curated datasets in the data lake, following different informal standards that have evolved organically over many years. This document will take these into consideration, identifying inconsistencies and best practices, with the intent of establishing more formal guidelines for future datasets.
Issues identified while drafting these guidelines can be found in Data Modeling Guidelines Issues.
Goals
Having data modeling guidelines aims to improve:
- Efficiency in data handling by reducing cognitive overload for both data producers in designing the data models, and consumers in understanding and using the data. It allows for consistent and streamlined workflows, including better data integration and interoperability
- Communication and collaboration between individuals and teams
- Data quality—consistency reduces misunderstandings and supports data accuracy and reliability
- Compliance and privacy, by anticipating privacy and compliance requirements during the design process. Standardized models support easier reviews and audits
- Scalability and data interoperability
Guidelines
Dataset Creation Process
Your first stop when considering creating a new dataset is the Dataset Creation Pathway flowchart. The crucial first steps there are:
- Check DataHub to see if the data you need is already there. If not, is there a similar table that could be updated to meet your needs?
- If you do need a new dataset, have you identified a Business Data Steward and Technical Steward?
After that, you can start thinking about the schema of your dataset.
As for where your data lives, if it is in the Data Lake, new datasets should be created in Iceberg, and should slot into one of the databases listed in the “Changes to database names” section. If none of those seem right, contact Data Engineering or post your question in the #data-engineering-collab Slack channel to discuss.
General Schema Design Guidelines
[De]normalization
For analytics purposes, wider, denormalized tables are more efficient. You should design your tables so that all the information your prospective users will normally need can be retrieved without joining another table. This requires extra storage cost, but storage is cheaper and easier to scale than CPU and memory. It also makes for more efficient queries and reduces the cognitive load on the user.
Some cases where you should not denormalize are when the data being joined is small, such as country, language, or wiki tables. For one, often the join key is textual and is sufficient information for most queries (e.g. country_code = 'us' or wiki_id = 'enwiki'), eliminating the need for a join. But also, small tables make for cheap joins. (An exception to this rule is when the data is for use in a Druid data cube, which does not allow joins.)
Another case where you should avoid denormalizing is when the joined data is subject to change. This is likely rare in the data warehouse, but something to consider to avoid costly backfills.
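As a rough illustration (the table and column names here are hypothetical, not an existing dataset), a denormalized design carries the commonly-queried attributes inline so that typical queries need no joins, while small reference data such as the country stays as a code:

-- Hypothetical wide, denormalized table: frequently-needed attributes are
-- stored inline so most queries can be answered without joins.
CREATE TABLE example_db.pageview_enriched (
    dt                 timestamp,  -- event time, UTC
    wiki_id            string,     -- e.g. 'enwiki'
    page_id            bigint,
    page_title         string,     -- denormalized to avoid a join for common queries
    country_code       string,     -- small reference data kept as a code (see above)
    user_is_anonymous  boolean,
    view_count         bigint
)
USING iceberg;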
Granularity
Granularity in datasets refers to the level of detail in each row. High-granularity datasets contain a lot of detail and consequently use more storage space. This also implies slower queries. But sometimes that level of detail is needed. Low-granularity data are generally aggregations of higher-granularity data. This can drastically reduce storage space and speed up query times.
An example of a high-granularity dataset is wmf.webrequest, which contains individual web requests and their associated attributes. It adds over 10 billion records/day.
wmf.unique_editors_by_country_monthly is an example of a low-granularity dataset. It contains only a few fields and a bit over 1000 records/month.
The appropriate granularity for your dataset is largely dependent on how the data is going to be queried. If users only need to see daily/weekly/monthly counts, a low-granularity aggregation is appropriate. If detailed data is needed for debugging or auditing purposes, then you need high-granularity—just keep an eye on how much storage you’re using and the nature of the data you’re storing (see Privacy Considerations below).
If your dataset can be reconstructed from source data, it’s better to start with lower-granularity and increase the granularity as needed. If your dataset is the source data, or otherwise can’t be reconstructed, then you’re better off keeping it at the highest-granularity of detail possible, keeping in mind storage and privacy considerations.
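For instance, a low-granularity daily rollup can be rebuilt from its high-granularity source whenever needed. This is a minimal sketch only; the table and column names are hypothetical:

-- Hypothetical daily aggregation of a high-granularity request table.
INSERT INTO example_db.request_count_daily
SELECT
    date(dt)  AS request_date,
    domain,
    count(*)  AS request_count
FROM example_db.request
WHERE date(dt) = DATE '2024-03-13'
GROUP BY date(dt), domain;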
Immutability
Your dataset should be designed with immutability in mind. Updating or deleting data (other than truncating, or deleting at the partition level) in an analytics data store is expensive. (There will be exceptions, but mostly for static tables where updating a row happens infrequently—e.g. canonical_data.wikis.)
What this means in practice is that new data should have a timestamp for when it was created, and that new data will be appended to your table.
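Concretely, a loading job appends new rows, each carrying its own creation timestamp, rather than updating rows in place. A minimal sketch, with hypothetical table and column names:

-- Append-only load: new facts are inserted with their own timestamp;
-- existing rows are never updated or deleted (except by partition-level purges).
INSERT INTO example_db.account_event
SELECT
    current_timestamp() AS dt,   -- when this record was created
    user_id,
    event_type
FROM example_db.account_event_staging;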
Privacy Considerations
When considering which fields to include in your schema, give special thought to which ones might be considered personally identifiable information (PII), either on their own or when joined with other data.
Per the WMF Privacy Policy, any PII should be purged within 90 days. However, the best policy is to not store PII at all. If you don’t need it, don’t store it.
You should also refrain from storing any user-specific geographical data at any level more detailed than country.
Some documents to refer to for more information:
- WMF Privacy Policy, notably “Personal Information” under Definitions
- Data Collection Guidelines
- Data Retention Guidelines
- Data Publication Guidelines
- Country and Territory Protection List
- Data Engineering/Systems/Event Data retention
Release Frequency
The release frequency of your data should reflect the anticipated use cases. It doesn't make sense to update wmf.unique_editors_by_country_monthly every hour, but there are certainly a lot of datasets that could stand to be updated more frequently. Some datasets, like wmf.mediawiki_wikitext_current, depend on periodic dumps that, due to technical limitations, currently cannot be produced more frequently. As a general rule, though, you should aim to update your dataset daily (provided it isn't explicitly a longer-duration rollup).
Storage Format
Compressed Parquet should be used as the storage format in HDFS whenever possible. Currently Snappy compression is most widely used, but we will be investigating moving to zstd in Iceberg.
There are at least a couple of cases (mediawiki_wikitext_{current, history}) where Avro is being used instead because of memory issues when loading Parquet data in Spark. If you run into similar issues, Avro is a good fallback.
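As a sketch of what this looks like in Iceberg (Spark SQL; the table name is illustrative, while write.format.default and write.parquet.compression-codec are standard Iceberg table properties):

-- Parquet data files with zstd compression, set via Iceberg table properties.
CREATE TABLE example_db.example_dataset (
    dt          timestamp,
    wiki_id     string,
    view_count  bigint
)
USING iceberg
TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'zstd'
);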
Naming and Data Type Conventions
Table and column names
Tables and columns should use lowercase snake_case. Use only the characters a-z, 0-9, and underscore (_). These conventions are shared with Event Platform’s excellent guidelines.
Following the MediaWiki standard, tables should be singular nouns, e.g. user, page, etc. Exceptions can be made for aggregate or metrics tables where the singular might be awkward or confusing.
Avoid using team, project, or specific technology names in table and column names, as those tend to change over time.
Aggregates
Tables that are aggregates should be suffixed with the period over which they were aggregated—i.e. _hourly, _daily, _monthly, _quarterly, _yearly. Note that this is not necessarily the same as the update interval—e.g. edits_hourly is currently updated monthly. Be sure to document the update frequency of your dataset in DataHub.
per vs. by
_per_ should be used instead of _by_ in a table name when the data on the left side of the _per_ is calculated for each member of (i.e. grouped by) the right side—e.g. unique_devices_per_domain.
IDs
ID fields should be numeric (bigint), referencing the primary key of another table. E.g. user_id in one table implies the existence of a users table with a matching column.
ID fields should have an _id suffix. Do not use just id as a column name—use the same name as you would if it were a foreign key in another table, e.g. user_id in the users table. For one, this adds clarity to complex queries. It also allows the use of USING to simplify queries:
SELECT *
FROM table_1
JOIN table_2
ON table_1.key_A = table_2.key_A
AND table_1.key_B = table_2.key_B
becomes:
SELECT *
FROM table_1
JOIN table_2
USING (key_A, key_B)
(example taken from https://prestodb.io/docs/current/sql/select.html#using)
It is permissible for an id field to be non-numeric, so long as it maps to a [preferably primary] key in another table. However, non-numeric IDs that aren’t foreign key references, such as UUIDs or device ids, should not use _id. For UUIDs, use _uuid. For other types of unique ids, use _uid.
There are many examples where existing schemas do not follow these ID rules. However, going forward, newly designed schemas and fields should comply.
Timestamps
For the most part, WMF has standardized on ISO-8601 (e.g. “2024-03-13T09:10:56Z”), with UTC (“Z”) as the timezone. Again, the Event Platform has excellent guidelines for datetimes/timestamps, which should be adopted here as well.
dt should be used as a record timestamp field. Other timestamp fields should use the _dt suffix, e.g. start_dt, end_dt.
Timestamps should use the timestamp data type, in UTC, rather than being stored as a varchar. This makes them much easier to work with vs. having to do cast(from_iso8601_timestamp(dt) as timestamp). Additionally, string timestamp comparisons can be problematic. Finally, timestamps are stored in most systems as 64-bit integers, so they take up significantly less space than a string representation. Relevant ticket: T278467: Use Hive/Spark timestamps in Refined event data
Sub-second timestamps can be expressed using decimal seconds, e.g. “2024-03-13T09:10:56.123Z”.
A note on timezones: Some data storage systems—including Postgres and Iceberg—have separate timestamp data types for timestamps with and without a timezone. In Iceberg, timestamp has no timezone, and timestamptz does. To avoid any confusion, especially when moving data between systems, always use the timezone-less timestamp data type and be sure all times are UTC.
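With a native, timezone-less timestamp column holding UTC values, range filters need no casting or string manipulation. A small sketch with hypothetical names:

-- Filtering on a native timestamp column; no casting from strings needed.
SELECT count(*) AS event_count
FROM example_db.example_event
WHERE dt >= TIMESTAMP '2024-03-01 00:00:00'
  AND dt <  TIMESTAMP '2024-04-01 00:00:00';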
Dates
Dates should be stored with the date data type, and suffixed with _date.
Durations
Fields that represent a duration of time (e.g. first_paint_time, time_firstbyte) should include their units (first_paint_time_ms, time_firstbyte_sec). Ideally, these should be expressed in integer milliseconds and suffixed with _ms. See also Numeric Values below.
Units | Suffix |
---|---|
seconds | _sec |
milliseconds | _ms |
microseconds | _us |
nanoseconds | _ns |
Year/Month/Day/Hour Partition Columns
Many of our Hive tables have year, month, and sometimes day and hour columns for partitioning. For tables that also have a dt (or other timestamp/date) column, this is redundant, but often necessary for Hive partitioning. Iceberg, however, supports hidden partitioning, where partition values can be derived from a timestamp or date column. So these extraneous year/month/day/hour columns should be dropped when moving to Iceberg.
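For example, an Iceberg table can be partitioned by a transform of its dt column, with no separate partition columns in the schema (Spark SQL sketch; the table name is hypothetical):

-- Iceberg hidden partitioning: daily partitions derived from dt.
CREATE TABLE example_db.example_event (
    dt       timestamp,
    wiki_id  string,
    user_id  bigint
)
USING iceberg
PARTITIONED BY (days(dt));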
Months
month should always be an integer in the range 1-12. There are a number of tables, mostly monthly aggregates, that have a month column that is a varchar in YYYY-MM format. This should be deprecated and replaced with separate year and month integer columns. (Note that if there is also a date column in the table, year/month should be dropped entirely when the table is moved to Iceberg; see “Year/Month/Day/Hour Partition Columns” above.)
If it is necessary to have a YYYY-MM string in the table, the column should be renamed to year_month to disambiguate.
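If these columns are derived from a date (or timestamp) column, the integer values, and the YYYY-MM string where it is genuinely required, are easy to compute (Spark SQL sketch; table and column names are hypothetical):

-- Integer year/month columns, plus a year_month string only if one is required.
SELECT
    year(snapshot_date)                    AS year,        -- e.g. 2024
    month(snapshot_date)                   AS month,       -- 1-12
    date_format(snapshot_date, 'yyyy-MM')  AS year_month   -- only when a string is truly needed
FROM example_db.example_monthly;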
Numeric Values
In general, columns with numeric values in indeterminate units should specify those units. Additionally, values that are counts should be suffixed with _count using the singular form of the root word(s).
Instead of... | ...use |
---|---|
visible_length | visible_length_px |
time_firstbyte | time_firstbyte_sec |
response_size | response_size_bytes |
views | view_count |
Booleans
Boolean fields should be prefixed with is_ or has_ (e.g. is_visible, has_cookies). It is acceptable to have is/has in the middle of the name (e.g. user_is_anonymous).
Integers
bigint (64-bit integers) should be the default for any integer columns, with the exception of columns for which values over ~2 billion would be invalid, such as year, month, day, or hour. In this case, integer is acceptable.
Floats and Doubles
Floats should generally be avoided, particularly in any context where accuracy is paramount—especially when representing money. A classic example of why:
>>> print(0.1 + 0.2)
0.30000000000000004
Try to represent your value as an integer if possible. If you really need a decimal place, use decimal. If that doesn’t work for whatever reason, at least use double (vs. float).
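For example (hypothetical table, shown only to illustrate the type choices), a monetary amount can be stored as an exact decimal, or as an integer count of the smallest unit:

-- Exact types for money: decimal, or an integer number of cents.
CREATE TABLE example_db.donation (
    dt                timestamp,
    amount_usd        decimal(12, 2),  -- exact to the cent
    amount_usd_cents  bigint           -- alternative: integer smallest-unit representation
)
USING iceberg;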
Countries
Countries should be identified by their two-letter ISO-3166-1 alpha-2 code, under the field name country_code. If the full country name (or at least the most common English variant) is needed, it can be obtained by joining canonical_data.countries.
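A minimal sketch of that join follows; the example_db.example_metric table is hypothetical, and the canonical_data.countries column names (iso_code, name) are assumptions here, so check the actual schema in DataHub:

-- Resolving country_code to a full country name via canonical_data.countries.
SELECT
    m.country_code,
    c.name             AS country_name,    -- assumed column name
    sum(m.view_count)  AS view_count
FROM example_db.example_metric m
JOIN canonical_data.countries c
    ON c.iso_code = m.country_code         -- assumed column name
GROUP BY m.country_code, c.name;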
Languages
Human languages are identified using the ISO 639 language code, and should use the field name language_code. These should map to the language_code field in canonical_data.wikis. Note that there are some non-standard language codes in use.
There are a few tables with a language or language_name field containing the full language name (e.g. wmf.edit_hourly, canonical_data.wikis). A languages reference table in canonical_data is recommended.
WMF-Specific Conventions
wiki vs. wiki_id vs. wiki_db vs. project
wiki should be used for any external-user facing data, and should contain the full domain name, e.g. “en.wikipedia.org” or “fr.wiktionary.org”. This should map to canonical_data.wikis.domain_name.
wiki_id should be used for internal references to a particular MediaWiki database, as defined in canonical_data.wikis.database_code. wiki_db is commonly used for this now, but the Event Platform has decided to standardize on wiki_id, which is also used in the MW codebase.
project is used widely and is usually a synonym of wiki (in Hive tables, project is the domain name of the wiki without the top-level domain (TLD), e.g. “en.wikipedia” or “fr.wiktionary”). However, it has come to have different meanings in different contexts, and as such should be deprecated.
Neil Shah-Quinn had a great summary of wiki vs. project:
Actually, I would say that "wiki" and "project" aren't necessarily the same.
"Wiki" is pretty clear. en.wikipedia.org is one wiki; fr.wikipedia.org is another.
"Project class"/"Project family" are also pretty clear. Wikipedia is one; Wiktionary is another. It feels a little weird to call, say, "Wikidata" a project family since it's just one wiki, but I don't think it's actually a problem for anyone.
But "project" is a problem. For some people, "project" is a synonym for wiki, but [for] other people Wikipedia is a project and English Wikipedia is, say, a "subproject" or "language edition". For example, https://meta.wikimedia.org/wiki/Complete_list_of_Wikimedia_projects takes that second approach. And of course, "project" can also mean lots of other things including "campaign" or "WikiProject"
So, I guess I'd say use "wiki" and "project family" for the main glossary entries, and then for "project", say it could mean either, or pretty much anything else, depending on the context 😅
project_family should be used to refer to the high-level projects listed in Complete list of Wikimedia projects. Currently that list is: commons, foundation, incubator, meta, sources, species, wikibooks, wikidata, wikifunctions, wikimedia, wikinews, wikipedia, wikiquote, wikisource, wikiversity, wikivoyage, wiktionary.
performer vs. actor vs. user
actor was added to MediaWiki core in 2018 to represent either a user or an IP address, in order to save space in large tables like logging and revision.
performer is used in the Event Platform to indicate the user performing the action. This was chosen because it was already in use in MediaWiki core, in LogEntry and RecentChange. It was a surprise to the designers that actor was already in use.
user is used in many tables in Hive to represent a registered or temporary user.
For non-event tables, you should use user_id to refer to a user. If the user is potentially missing (as is the case with older IP users), user_id can be NULL. actor should be reserved specifically for referring to entries in the MW actor table.
user_text vs. user_name
user_text is used in a number of tables (mediawiki_history_*, mediawiki_wikitext_*) to contain either a logged-in user’s username, or an IP address if the user is anonymous. It is not used in MW core, which has a user table with a user_name field for the username. Despite the prevalence of user_text, user_name should be used for storing usernames going forward, and should be left blank if the user is not logged in or otherwise anonymous.
page vs. article
page is what is used in most places (including MW core) to refer to a MediaWiki page. There are a few places where article is used instead, although it appears to be almost exclusively in events (mobilewikiapparticlesuggestions, relatedarticles, and some properties on other events). This Design Best Practices page talks about the difference:
The word 'article' is meaningless on various wikis and the word 'page' should be used wherever possible. That said in sometimes the word page itself is too ambiguous - for example when describing pages in the main namespace on Wikipedia the word 'articles' would be more meaningful.
Unless you have a specific reason for distinguishing between main namespace pages (e.g. main namespace article counts), you should use page instead of article.
Glossary
Terms to be added to the DataHub Glossary.
Actor - An object and table in MW core that represents either a registered user, or IP address for non-logged-in/non-registered users, in order to save space in large tables like logging and revision.
Event - A strongly-typed and schema-ed piece of data, usually representing something happening at a definite time. E.g. 'revision create', 'user button click', 'page view', etc. See Event Platform for more information.
GDI - Global Data and Insights. Data gathered by a number of teams across Research & Decision Science. There are gdi datasets in Hive.
Performer - Used in the Event Platform to indicate the user performing the action.
Project Family - Refers to the high-level projects listed in Complete list of Wikimedia projects. Currently that list is: commons, foundation, incubator, meta, sources, species, wikibooks, wikidata, wikifunctions, wikimedia, wikinews, wikipedia, wikiquote, wikisource, wikiversity, wikivoyage, wiktionary.
User - Represents a registered or temporary user.
Resources
Some of the documents that were referred to in the creation of this document are:
- WMF Privacy Policy, notably “Personal Information” under Definitions
- Data Collection Guidelines [draft]
- Data Retention Guidelines
- Data Publication Guidelines
- Country and Territory Protection List
- Data Engineering/Systems/Event Data retention
- Data Asset Life Cycle Management Process
- Metrics Platform - Wikitech
- Event Platform/Schemas - Wikitech
- Manual:Database layout - MediaWiki
- User:Isaac (WMF)/Analysis gotchas - Meta
- User:Isaac (WMF)/Comparing Wikipedia language editions - Meta
- Analytics/Systems/Cluster/Iceberg
- Data Modeling Guidelines Issues