Analytics/Data Lake/Edits/Mediawiki user history

From Wikitech

This page describes the data set that stores the user history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_user_history. For more detail on the purpose of this data set, read Analytics/Data Lake/Page and user history reconstruction. If you don't know how to access this data set, see Analytics/Data access.

Schema

col_name data_type comment
wiki_db string enwiki, dewiki, eswiktionary, etc.
user_id bigint ID of the user, as in the user table.
user_text_historical string Historical user name.
user_text string User name as of today.
user_groups_historical array<string> Historical user groups.
user_groups array<string> User groups as of today.
user_blocks_historical array<string> Historical user blocks.
user_blocks array<string> User blocks as of today.
user_registration_timestamp string When the user account was registered, in YYYYMMDDHHmmss format.
created_by_self boolean Whether the user created their own account.
created_by_system boolean Whether the user account was created by MediaWiki itself (e.g. CentralAuth).
created_by_peer boolean Whether the user account was created by another user.
anonymous boolean Whether the user is not registered.
is_bot_by_name boolean Whether the user's name matches patterns we use to identify bots.
start_timestamp string Timestamp from which this state applies (inclusive).
end_timestamp string Timestamp until which this state applies (exclusive).
caused_by_event_type string Event that caused this state (create: account was created; rename: account was renamed; altergroups: user's group memberships changed; or alterblocks: user's block status changed).
caused_by_user_id bigint ID of the user that caused this state.
caused_by_block_expiration string Block expiration timestamp (YYYYMMDDhhmmss), if the block has an expiry set. "indefinite" for indefinite blocks.
inferred_from string If non-NULL, indicates that some of this state's fields have been inferred after an inconsistency in the source data.
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
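The start_timestamp/end_timestamp pair makes it possible to look up a user's state at any point in time: because the timestamps are YYYYMMDDHHmmss strings, plain string comparison matches chronological order. A sketch of such a lookup (the user ID, date, and snapshot value are illustrative; the assumption that an open-ended current state has a NULL end_timestamp is not confirmed by this page):

```sql
-- Hypothetical example: the state of user 12345 on English Wikipedia
-- as of 2017-06-01. String comparison works because timestamps are
-- fixed-width YYYYMMDDHHmmss strings.
SELECT user_text_historical,
       user_groups_historical,
       user_blocks_historical
FROM wmf.mediawiki_user_history
WHERE snapshot = '2017-06'                      -- illustrative snapshot
  AND wiki_db = 'enwiki'
  AND user_id = 12345
  AND start_timestamp <= '20170601000000'
  AND (end_timestamp > '20170601000000' OR end_timestamp IS NULL);
```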

Note the snapshot field: it is a Hive partition that maps explicitly to snapshot folders in HDFS. Since each snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the WHERE clause of your queries.
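For instance, a query aggregating over the whole table should still pin a single snapshot, otherwise Hive scans every snapshot partition and counts the same rows multiple times. A sketch (the snapshot value is illustrative):

```sql
-- Hypothetical example: self-created accounts per wiki, in one snapshot.
SELECT wiki_db,
       COUNT(DISTINCT user_id) AS self_created_users
FROM wmf.mediawiki_user_history
WHERE snapshot = '2017-06'          -- always restrict to one snapshot
  AND caused_by_event_type = 'create'
  AND created_by_self
GROUP BY wiki_db;
```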

Changes and known problems

Date Schema version Details Phab task

2017-11 For pairs of fields that give current and historical versions of a value, rename the fields so that _historical is appended to the historical field rather than _latest to the current one.
2016-10-06 n/a The dataset contains data for simplewiki and enwiki until September 2016. The automatic updates to the table still need to be productionized, and all wikis imported.
2017-03-01 n/a Add the snapshot partition, allowing multiple versions of the user history to be kept. Data starts flowing regularly (every month) from labs.