Data Platform/Data Lake/Edits/Mediawiki user history
This page describes the data set that stores the user history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive external table wmf.mediawiki_user_history. For more detail on the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. Also visit Analytics/Data access if you don't know how to access this data set.
Schema
col_name | data_type | comment |
---|---|---|
wiki_db | string | enwiki, dewiki, eswiktionary, etc. |
user_id | bigint | ID of the user, as in the user table. |
user_text_historical | string | Historical user name. |
user_text | string | User name as of today. |
user_groups_historical | array<string> | Historical user groups. |
user_groups | array<string> | User groups as of today. |
user_blocks_historical | array<string> | Historical user blocks. |
user_blocks | array<string> | User blocks as of today. |
is_bot_by_historical | array<string> | Historical bot information of the user that caused the event; can contain the values name or group. |
is_bot_by | array<string> | Bot information of the user that caused the event; can contain the values name or group. |
user_registration_timestamp | string | When the user account was registered (from user table) |
user_creation_timestamp | string | When the user account was created (from logging table) |
user_first_edit_timestamp | string | When the user made their first edit |
created_by_self | boolean | Whether the user created their own account |
created_by_system | boolean | Whether the user account was created by MediaWiki (e.g. CentralAuth) |
created_by_peer | boolean | Whether the user account was created by another user |
anonymous | boolean | Whether the user is not registered |
start_timestamp | string | Timestamp from where this state applies (inclusive). |
end_timestamp | string | Timestamp to where this state applies (exclusive). |
caused_by_event_type | string | Event that caused this state (create: account was created; rename: account was renamed; altergroups: user's group memberships changed; or alterblocks: user's block status changed). |
caused_by_user_id | bigint | ID of the user that caused this state. |
caused_by_user_text | string | Name of the user that caused this state |
caused_by_anonymous_user | boolean | Whether the user that caused this state was anonymous |
caused_by_block_expiration | string | Block expiration timestamp (YYYYMMDDhhmmss), if the block has an expiry set. "indefinite" for indefinite blocks. |
inferred_from | string | If non-NULL, indicates that some of this state's fields have been inferred after an inconsistency in the source data. |
source_log_id | bigint | ID of the logging table row that caused this state |
source_log_comment | string | Comment of the logging table row that caused this state |
source_log_params | map<string,string> | Parameters of the logging table row that caused this state, parsed as a map |
snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
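The start_timestamp (inclusive) and end_timestamp (exclusive) columns mean that each row describes one state of a user over a half-open time interval. The following is a minimal Python sketch of a point-in-time lookup over rows shaped like this table. The sample rows, the `state_at` helper, and the timestamp format are hypothetical, and the sketch assumes that timestamps are strings that sort chronologically and that the current state has an empty (None) end_timestamp.

```python
from typing import Optional

# Hypothetical rows with a subset of the table's columns.
# Assumptions: timestamps sort chronologically as strings, and the
# still-current state has end_timestamp = None.
states = [
    {"wiki_db": "enwiki", "user_id": 42, "user_text_historical": "OldName",
     "start_timestamp": "2015-01-01 00:00:00",
     "end_timestamp": "2016-06-01 00:00:00",
     "caused_by_event_type": "create"},
    {"wiki_db": "enwiki", "user_id": 42, "user_text_historical": "NewName",
     "start_timestamp": "2016-06-01 00:00:00",
     "end_timestamp": None,
     "caused_by_event_type": "rename"},
]


def state_at(rows, wiki_db: str, user_id: int, ts: str) -> Optional[dict]:
    """Return the state of a user that applies at timestamp ts, if any."""
    for row in rows:
        if row["wiki_db"] != wiki_db or row["user_id"] != user_id:
            continue
        # start_timestamp is inclusive, end_timestamp is exclusive.
        started = row["start_timestamp"] <= ts
        not_ended = row["end_timestamp"] is None or ts < row["end_timestamp"]
        if started and not_ended:
            return row
    return None
```

Because the interval is half-open, a lookup at exactly "2016-06-01 00:00:00" in this sample falls into the second state (the rename), not the first.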
Note the snapshot field: it is a Hive partition and maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the where clause of your queries.
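The effect of omitting the snapshot predicate can be illustrated with a small, self-contained Python sketch. SQLite stands in for Hive here purely so that the example is runnable; the table contents and the snapshot values "2019-06" and "2019-07" are made up, and only a few of the real columns are included.

```python
import sqlite3

# In-memory stand-in for the Hive table (assumption: a reduced set of columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mediawiki_user_history ("
    "wiki_db TEXT, user_id INTEGER, caused_by_event_type TEXT, snapshot TEXT)"
)
rows = [
    # Hypothetical 2019-06 snapshot: full history up to that date.
    ("enwiki", 1, "create", "2019-06"),
    ("enwiki", 2, "create", "2019-06"),
    # Hypothetical 2019-07 snapshot: repeats the earlier history, plus one new event.
    ("enwiki", 1, "create", "2019-07"),
    ("enwiki", 2, "create", "2019-07"),
    ("enwiki", 3, "create", "2019-07"),
]
conn.executemany("INSERT INTO mediawiki_user_history VALUES (?, ?, ?, ?)", rows)

count_sql = (
    "SELECT COUNT(*) FROM mediawiki_user_history "
    "WHERE wiki_db = 'enwiki' AND caused_by_event_type = 'create'"
)

# Without a snapshot predicate, events are counted once per snapshot.
unfiltered = conn.execute(count_sql).fetchone()[0]

# With the predicate, only one snapshot's copy of the history is read.
filtered = conn.execute(count_sql + " AND snapshot = ?", ("2019-07",)).fetchone()[0]
```

In this toy data the unfiltered count mixes both snapshots, while the filtered count reflects only the account creations present in the chosen snapshot. On the real table the same kind of predicate (snapshot = 'YYYY-MM') also limits the query to a single partition.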
Changes and known problems
Snapshot or Date | Details | Phab Task |
---|---|---|
2019-07 | Schema change: Addition of caused_by_anonymous_user, which is set only when the user that caused this state was anonymous. | task T221825 |
2019-04 | Schema changes: Addition of is_bot_by_historical and is_bot_by, user_creation_timestamp and user_first_edit_timestamp, and source_log_id, source_log_comment, source_log_params. The user registration timestamp is the one stored in the user table, the user creation timestamp is retrieved from the logging table (user creation event), and the first-edit timestamp is the date of the user's first edit, whether deleted or not. | task T221824 |
2017-11 | For pairs of fields that give current and historical versions of a value, rename the fields so that _historical is appended to the historical field rather than _latest to the current one. | |
2016/10/06 | The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis. | |
2017/03/01 | Add the snapshot partition, which allows keeping multiple versions of the user history. Data starts to flow regularly (every month) from labs. | |