Analytics/Data Lake/Edits/Mediawiki user history

From Wikitech

This page describes the data set that stores the user history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_user_history. For more detail on the purpose of this data set, read Analytics/Data Lake/Page and user history reconstruction. If you don't know how to access this data set, see Analytics/Data access.

Schema

col_name data_type comment
wiki_db string enwiki, dewiki, eswiktionary, etc.
user_id bigint ID of the user, as in the user table.
user_text_historical string Historical user name.
user_text string User name as of today.
user_groups_historical array<string> Historical user groups.
user_groups array<string> User groups as of today.
user_blocks_historical array<string> Historical user blocks.
user_blocks array<string> User blocks as of today.
user_registration_timestamp string When the user account was registered, in YYYYMMDDHHmmss format.
created_by_self boolean Whether the user created their own account.
created_by_system boolean Whether the user account was created by MediaWiki itself (e.g. CentralAuth).
created_by_peer boolean Whether the user account was created by another user.
anonymous boolean Whether the user is not registered.
is_bot_by_name boolean Whether the user's name matches patterns we use to identify bots.
start_timestamp string Timestamp from which this state applies (inclusive).
end_timestamp string Timestamp until which this state applies (exclusive).
caused_by_event_type string Event that caused this state (create: account was created; rename: account was renamed; altergroups: user's group memberships changed; or alterblocks: user's block status changed).
caused_by_user_id bigint ID of the user that caused this state.
caused_by_block_expiration string Block expiration timestamp (YYYYMMDDhhmmss), if the block has an expiry set. "indefinite" for indefinite blocks.
inferred_from string If non-NULL, indicates that some of this state's fields have been inferred after an inconsistency in the source data.
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
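The start_timestamp/end_timestamp pair makes it possible to look up a user's state at any point in time: because the timestamps are YYYYMMDDHHmmss strings, plain string comparison matches chronological order. A sketch of such a lookup (the user ID, date, and snapshot value are illustrative; the assumption that an open-ended current state has a NULL end_timestamp is not confirmed by this page):

```sql
-- Hypothetical example: the state of user 12345 on English Wikipedia
-- as of 2017-06-01. String comparison works because timestamps are
-- fixed-width YYYYMMDDHHmmss strings.
SELECT user_text_historical,
       user_groups_historical,
       user_blocks_historical
FROM wmf.mediawiki_user_history
WHERE snapshot = '2017-06'                      -- illustrative snapshot
  AND wiki_db = 'enwiki'
  AND user_id = 12345
  AND start_timestamp <= '20170601000000'
  AND (end_timestamp > '20170601000000' OR end_timestamp IS NULL);
```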

Note the snapshot field: it is a Hive partition that maps explicitly to snapshot folders in HDFS. Since each snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the WHERE clause of your queries.
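For instance, a query aggregating over the whole table should still pin a single snapshot, otherwise Hive scans every snapshot partition and counts the same rows multiple times. A sketch (the snapshot value is illustrative):

```sql
-- Hypothetical example: self-created accounts per wiki, in one snapshot.
SELECT wiki_db,
       COUNT(DISTINCT user_id) AS self_created_users
FROM wmf.mediawiki_user_history
WHERE snapshot = '2017-06'          -- always restrict to one snapshot
  AND caused_by_event_type = 'create'
  AND created_by_self
GROUP BY wiki_db;
```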

Changes and known problems

Date Schema version Details Phab task

2017-11 For pairs of fields that give current and historical versions of a value, rename the fields so that _historical is appended to the historical field rather than _latest to the current one.
2016-10-06 n/a The dataset contains data for simplewiki and enwiki until September 2016. The automatic updates to the table still need to be productionized, and all wikis imported.
2017-03-01 n/a Add the snapshot partition, allowing multiple versions of the user history to be kept. Data starts flowing regularly (every month) from labs.