Analytics/Data Lake/Edits/Mediawiki user history
This page describes the data set that stores the user history of WMF's wikis. It lives in Analytic's Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_user_history
. For more detail of the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. Also visit Analytics/Data access if you don't know how to access this data set.
Schema
col_name | data_type | comment |
---|---|---|
wiki_db | string | enwiki, dewiki, eswiktionary, etc. |
user_id | bigint | ID of the user, as in the user table. |
user_text_historical | string | Historical user name. |
user_text | string | User name as of today. |
user_groups_historical | array<string> | Historical user groups. |
user_groups | array<string> | User groups as of today. |
user_blocks_historical | array<string> | Historical user blocks. |
user_blocks | array<string> | User blocks as of today. |
is_bot_by_historical | array<string> | Historical bot information of the user that caused the event, can contain values name or group |
is_bot_by | array<string> | Bot information of the user that caused the event, can contain values name or group |
user_registration_timestamp | string | When the user account was registered (from user table) |
user_creation_timestamp | string | When the user account was created (from logging table) |
user_first_edit_timestamp | string | When the user made its first edit |
created_by_self | boolean | Whether the user created their own account |
created_by_system | boolean | Whether the user account was created by mediawiki (eg. centralauth) |
created_by_peer | boolean | Whether the user account was created by another user |
anonymous | boolean | Whether the user is not registered |
start_timestamp | string | Timestamp from where this state applies (inclusive). |
end_timestamp | string | Timestamp to where this state applies (exclusive). |
caused_by_event_type | string | Event that caused this state (create: account was created; rename: account was renamed; altergroups: user's group memberships changed; or alterblocks: user's block status changed). |
caused_by_user_id | bigint | ID from the user that caused this state. |
caused_by_user_text | string | Name of the user that caused this state |
caused_by_anonymous_user | boolean | Whether the user that caused this state was anonymous |
caused_by_block_expiration | string | Block expiration timestamp (YYYYMMDDhhmmss), if the block has an expiry set. "indefinite" for indefinite blocks. |
inferred_from | string | If non-NULL, indicates that some of this state's fields have been inferred after an inconsistency in the source data. |
source_log_id | bigint | ID of the logging table row that caused this state |
source_log_comment | string | Comment of the logging table row that caused this state |
source_log_params | map<string,string> | Parameters of the logging table row that caused this state, parsed as a map |
snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
Note the snapshot
field: It is a Hive partitions. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the snapshot date, you should always specify a snapshot partition predicate in the where
clause of your queries.
Changes and known problems
Snapshot
or Date |
Details | Phab
Task |
---|---|---|
2019-07 | Schema change: Addition of caused_by_anonymous_user which is just set when the user that caused this state was anonymous.
|
task T221825 |
2019-04 | Schema changes: Addition of is_bot_by_historical and is_bot_by , user_creation_timestamp and user_first_edit_timestamp , and source_log_id , source_log_comment , source_log_params . The user registration is the one stored in the user table, the user creation one is retrieved from the logging table (user creation event), and the first-edit is the date of the user first edit, whether deleted or not.
Having |
task T221824 |
2017-11 | For pairs of fields that give current and historical versions of a value, rename the fields so that _historical is appended to the historical field rather than _latest to the current one.
|
|
2016/10/06 | The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis. | |
2017/03/01 | Add the snapshot partition, allowing to keep multiple versions of the user history. Data starts to flow regularly (every month) from labs.
|