Data Platform/Data Lake/Edits/Edit hourly
Appearance
The wmf.edit_hourly table (available in Hive) contains edit data, aggregated hourly over a predefined set of dimensions. It is stored in the Parquet columnar file format and partitioned by snapshot (as it is computed from MediaWiki history). The data goes back to 2001.
Current schema
# Main fields information # col_name data_type comment ts string Timestamp string truncated to the hour. Format: "YYYY-MM-DD HH:00:00.0". project string Project name, i.e.: "en.wikipedia". user_is_anonymous boolean Whether user is anonymous or not. user_is_bot boolean Whether user is bot or not. user_is_administrator boolean Whether user is administrator or not. user_groups array<string> User groups array. user_tenure_bucket string Bucketed time between user creation and edit (Under 1 day, 1 to 7 days, 7 to 30 days, ..., Over 10 years, Undefined). namespace_is_content boolean Whether the namespace is of type content or not. namespace_is_talk boolean Whether the namespace is of type talk or not. namespace_name string Namespace name (Main, Talk, User, User talk, etc.). See: oozie/edit/hourly/edit_hourly.hql. namespace_id int Namespace id. creates_new_page boolean Whether the edit was the first of a page (page creation). is_deleted boolean Whether the edit has been deleted. is_reverted boolean Whether the edit has been reverted. user_edit_count_bucket string Authors edit count bucket (1-4, 5-99, 100-999, 1000-9999, 10000+). platform string Access method (iOS, Android, Mobile web, Other). interface string Editing interface (VisualEditor, 2017 wikitext editor, Switched from VisualEditor to wikitext editor, Other). revision_tags array<string> Revision tags (change tags) array. edit_count bigint Number of edits belonging to this hourly bucket (for the given dimension value set). text_bytes_diff bigint Number of bytes added minus number of bytes removed belonging to this hourly bucket (for the given dimension value set). snapshot string MediaWiki history snapshot (YYYY-MM format). # Partition information # col_name data_type comment snapshot string MediaWiki history snapshot (YYYY-MM format).
Notice the snapshot field is a Hive partition, and an explicit mapping to the corresponding data directory in HDFS. You must include at least one partition predicate in the where clause of your queries (like snapshot='2019-01'). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns you results. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
Technical details
- The ts field is truncated to the hour, so that hourly aggregation can take place.
- The project indicates the wiki in the same format as pageview_hourly, but it is actually calculated from mediawiki_history's wiki_db field, so it might have some minor inconsistencies with other data sets that calculate it from i.e. a webrequest host.
- The field user_is_bot is true if the editors in the corresponding bucket belong to the user group bot.
- All dimensions contain the value for the time of the corresponding edit. For example: user_is_administrator=true means the user was an administrator at the time the corresponding edit happened (even if they are no more). There are 2 exceptions though: is_deleted and is_reverted, that contain the value for the present time always.
Changes and known problems
Date from | Task | Details |
---|---|---|
2019-05-15 | task T220092 | Initial data set. |