Analytics/Data Lake/Edits/Edit hourly

From Wikitech

The wmf.edit_hourly table (available in Hive) contains edit data, aggregated hourly over a predefined set of dimensions. It is stored in the Parquet columnar file format and partitioned by snapshot (as it is computed from MediaWiki history). The data goes back to 2001.

Current schema

# Main fields information
# col_name              data_type               comment
ts                      string                  Timestamp string truncated to the hour. Format: "YYYY-MM-DD HH:00:00.0".
project                 string                  Project name, i.e.: "en.wikipedia".
user_is_anonymous       boolean                 Whether user is anonymous or not.
user_is_bot             boolean                 Whether user is bot or not.
user_is_administrator   boolean                 Whether user is administrator or not.
user_groups             array<string>           User groups array.  
user_tenure_bucket      string                  Bucketed time between user creation and edit (Under 1 day, 1 to 7 days, 7 to 30 days, ..., Over 10 years, Undefined).
namespace_is_content    boolean                 Whether the namespace is of type content or not.
namespace_is_talk       boolean                 Whether the namespace is of type talk or not.
namespace_name          string                  Namespace name (Main, Talk, User, User talk, etc.). See: oozie/edit/hourly/edit_hourly.hql.
namespace_id            int                     Namespace id.       
creates_new_page        boolean                 Whether the edit was the first of a page (page creation).
is_deleted              boolean                 Whether the edit has been deleted.
is_reverted             boolean                 Whether the edit has been reverted.
user_edit_count_bucket  string                  Authors edit count bucket (1-4, 5-99, 100-999, 1000-9999, 10000+).
platform                string                  Access method (iOS, Android, Mobile web, Other).
interface               string                  Editing interface (VisualEditor, 2017 wikitext editor, Switched from VisualEditor to wikitext editor, Other).
revision_tags           array<string>           Revision tags (change tags) array.
edit_count              bigint                  Number of edits belonging to this hourly bucket (for the given dimension value set).
text_bytes_diff         bigint                  Number of bytes added minus number of bytes removed belonging to this hourly bucket (for the given dimension value set).
snapshot                string                  MediaWiki history snapshot (YYYY-MM format).

# Partition information          
# col_name              data_type               comment             
snapshot                string                  MediaWiki history snapshot (YYYY-MM format).

Notice the snapshot field is a Hive partition, and an explicit mapping to the corresponding data directory in HDFS. You must include at least one partition predicate in the where clause of your queries (like snapshot='2019-01'). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns you results. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.

Technical details

  • The ts field is truncated to the hour, so that hourly aggregation can take place.
  • The project indicates the wiki in the same format as pageview_hourly, but it is actually calculated from mediawiki_history's wiki_db field, so it might have some minor inconsistencies with other data sets that calculate it from i.e. a webrequest host.
  • The field user_is_bot is true if the editors in the corresponding bucket belong to the user group bot.
  • All dimensions contain the value for the time of the corresponding edit. For example: user_is_administrator=true means the user was an administrator at the time the corresponding edit happened (even if they are no more). There are 2 exceptions though: is_deleted and is_reverted, that contain the value for the present time always.

Changes and known problems

Date from Task Details
2019-05-15 task T220092 Initial data set.

See also