This page describes the data set that stores the user history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_user_history. For more detail on the purpose of this data set, see Analytics/Data Lake/Page and user history reconstruction. If you don't know how to access this data set, see Analytics/Data access.


| col_name | data_type | comment |
| --- | --- | --- |
| wiki_db | string | enwiki, dewiki, eswiktionary, etc. |
| user_id | bigint | ID of the user, as in the user table. |
| user_text_historical | string | Historical user name. |
| user_text | string | User name as of today. |
| user_groups_historical | array<string> | Historical user groups. |
| user_groups | array<string> | User groups as of today. |
| user_blocks_historical | array<string> | Historical user blocks. |
| user_blocks | array<string> | User blocks as of today. |
| user_registration_timestamp | string | When the user account was registered, in YYYYMMDDHHmmss format. |
| created_by_self | boolean | Whether the user created their own account. |
| created_by_system | boolean | Whether the user account was created by MediaWiki (e.g. CentralAuth). |
| created_by_peer | boolean | Whether the user account was created by another user. |
| anonymous | boolean | Whether the user is not registered. |
| is_bot_by_name | boolean | Whether the user's name matches patterns we use to identify bots. |
| start_timestamp | string | Timestamp from which this state applies (inclusive). |
| end_timestamp | string | Timestamp until which this state applies (exclusive). |
| caused_by_event_type | string | Event that caused this state (create: account was created; rename: account was renamed; altergroups: user's group memberships changed; or alterblocks: user's block status changed). |
| caused_by_user_id | bigint | ID of the user that caused this state. |
| caused_by_block_expiration | string | Block expiration timestamp (YYYYMMDDHHmmss), if the block has an expiry set; "indefinite" for indefinite blocks. |
| inferred_from | string | If non-NULL, indicates that some of this state's fields have been inferred after an inconsistency in the source data. |
| snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports). |
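
Each row describes one state of a user, valid from start_timestamp (inclusive) to end_timestamp (exclusive). The following Python sketch shows one way to test whether a state covers a given instant. It assumes that start_timestamp and end_timestamp use the same YYYYMMDDHHmmss format documented for user_registration_timestamp, and that a missing bound means the state is open-ended on that side; neither assumption is stated on this page. The helper names parse_ts and state_applies are hypothetical.

```python
from datetime import datetime
from typing import Optional

# Assumed format: YYYYMMDDHHmmss, as documented for user_registration_timestamp.
TS_FORMAT = "%Y%m%d%H%M%S"


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse a YYYYMMDDHHmmss string; None is passed through unchanged."""
    if value is None:
        return None
    return datetime.strptime(value, TS_FORMAT)


def state_applies(start: Optional[str], end: Optional[str], at: str) -> bool:
    """Return True if `at` falls in [start, end): start inclusive, end exclusive.

    Assumption: a missing bound (None) is treated as unbounded on that side.
    """
    moment = parse_ts(at)
    start_dt, end_dt = parse_ts(start), parse_ts(end)
    if start_dt is not None and moment < start_dt:
        return False
    if end_dt is not None and moment >= end_dt:
        return False
    return True
```

With these helpers, a state starting at 20160101000000 and ending at 20160201000000 applies at its start timestamp but no longer applies at its end timestamp.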

Note the snapshot field: it is a Hive partition and maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the WHERE clause of your queries.
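
As an illustration of the snapshot predicate, here is a minimal Python sketch that builds a HiveQL query string restricted to a single snapshot and a single wiki. The function name user_history_query, the validation patterns, and the example values 2017-11 and simplewiki are assumptions chosen for illustration; only the table and column names come from this page. The sketch only builds the query text and does not connect to Hive.

```python
import re

TABLE = "wmf.mediawiki_user_history"  # table name as given on this page
SNAPSHOT_RE = re.compile(r"\d{4}-\d{2}")  # YYYY-MM, as used for regular imports
WIKI_DB_RE = re.compile(r"[a-z0-9_]+")  # hypothetical sanity check for wiki_db values


def user_history_query(snapshot: str, wiki_db: str, limit: int = 10) -> str:
    """Return a HiveQL query limited to one snapshot partition and one wiki."""
    if not SNAPSHOT_RE.fullmatch(snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    if not WIKI_DB_RE.fullmatch(wiki_db):
        raise ValueError(f"unexpected wiki_db value: {wiki_db!r}")
    return (
        "SELECT user_id, user_text, start_timestamp, end_timestamp,\n"
        "       caused_by_event_type\n"
        f"FROM {TABLE}\n"
        f"WHERE snapshot = '{snapshot}'\n"  # partition predicate: reads one snapshot only
        f"  AND wiki_db = '{wiki_db}'\n"
        f"LIMIT {int(limit)}"
    )


# Example values only; use a snapshot that actually exists in the table.
print(user_history_query("2017-11", "simplewiki"))
```

Without the snapshot predicate, the same query would scan every snapshot and return each state once per snapshot in which it appears.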

Changes and known problems

| Date | Schema version | Details | Phab |
| --- | --- | --- | --- |
| 2017-11 | | For pairs of fields that give current and historical versions of a value, rename the fields so that _historical is appended to the historical field rather than _latest to the current one. | |
| 2016-10-06 | n/a | The dataset contains data for simplewiki and enwiki until September 2016. Automatic updates to the table still need to be productionized, and all wikis still need to be imported. | |
| 2017-03-01 | n/a | Add the snapshot partition, which allows keeping multiple versions of the user history. Data starts to flow regularly (every month) from labs. | |