Analytics/Data Lake/Edits/Mediawiki user history

From Wikitech

This page describes the data set that stores the user history of WMF's wikis. It lives in Analytic's Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_user_history. For more detail of the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. Also visit Analytics/Data access if you don't know how to access this data set.

Schema

col_name data_type comment
wiki_db string enwiki, dewiki, eswiktionary, etc.
user_id bigint ID of the user, as in the user table.
user_text_historical string Historical user name.
user_text string User name as of today.
user_groups_historical array<string> Historical user groups.
user_groups array<string> User groups as of today.
user_blocks_historical array<string> Historical user blocks.
user_blocks array<string> User blocks as of today.
is_bot_by_historical array<string> Historical bot information of the user that caused the event, can contain values name or group
is_bot_by array<string> Bot information of the user that caused the event, can contain values name or group
user_registration_timestamp string When the user account was registered (from user table)
user_creation_timestamp string When the user account was created (from logging table)
user_first_edit_timestamp string When the user made its first edit
created_by_self boolean Whether the user created their own account
created_by_system boolean Whether the user account was created by mediawiki (eg. centralauth)
created_by_peer boolean Whether the user account was created by another user
anonymous boolean Whether the user is not registered
start_timestamp string Timestamp from where this state applies (inclusive).
end_timestamp string Timestamp to where this state applies (exclusive).
caused_by_event_type string Event that caused this state (create: account was created; rename: account was renamed; altergroups: user's group memberships changed; or alterblocks: user's block status changed).
caused_by_user_id bigint ID from the user that caused this state.
caused_by_user_text string Name of the user that caused this state
caused_by_anonymous_user boolean Whether the user that caused this state was anonymous
caused_by_block_expiration string Block expiration timestamp (YYYYMMDDhhmmss), if the block has an expiry set. "indefinite" for indefinite blocks.
inferred_from string If non-NULL, indicates that some of this state's fields have been inferred after an inconsistency in the source data.
source_log_id bigint ID of the logging table row that caused this state
source_log_comment string Comment of the logging table row that caused this state
source_log_params map<string,string> Parameters of the logging table row that caused this state, parsed as a map
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)

Note the snapshot field: It is a Hive partitions. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the snapshot date, you should always specify a snapshot partition predicate in the where clause of your queries.

Changes and known problems

Snapshot

or Date

Details Phab

Task

2019-07 Schema change: Addition of caused_by_anonymous_user which is just set when the user that caused this state was anonymous. task T221825
2019-04 Schema changes: Addition of is_bot_by_historical and is_bot_by, user_creation_timestamp and user_first_edit_timestamp, and source_log_id, source_log_comment, source_log_params. The user registration is the one stored in the user table, the user creation one is retrieved from the logging table (user creation event), and the first-edit is the date of the user first edit, whether deleted or not.

Having caused_by_user_text in addition to caused_by_user_id, and the better creation information has allowed an improvement of linking between user-states and page/revision states, providing more accurate historical information (see task T218463).

task T221824
2017-11 For pairs of fields that give current and historical versions of a value, rename the fields so that _historical is appended to the historical field rather than _latest to the current one.
2016/10/06 The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis.
2017/03/01 Add the snapshot partition, allowing to keep multiple versions of the user history. Data starts to flow regularly (every month) from labs.