Analytics/Data Lake/Edits

This page links to detailed information about Edits datasets in the Data Lake.

To access this data, log into stat1005.eqiad.wmnet and start hive. From the Hive prompt, run use wmf; and then query the tables described below.
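
For example, a session could look like the following sketch (the host and database names come from the paragraph above; access setups change over time, so yours may differ):

  ssh stat1005.eqiad.wmnet
  hive
  hive> use wmf;
  hive> show tables;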

Unlike the traffic datasets, these datasets are not continuously updated. Instead, they are refreshed periodically by fully re-importing and rebuilding them, which produces a new snapshot.

This snapshot notion is key when querying the Edits datasets, since including multiple snapshots doesn't make sense for most queries. As of 2017-04, snapshots are provided monthly. When we import, we grab all the data available from all tables except the revision table, which we filter by rev_timestamp <= <<snapshot-date>>. If the snapshot is a little late because of processing problems, then by the time it finishes it may contain more data in tables like logging, archive, etc. This extra data should not affect history reconstruction, because we base everything on revisions, but it will affect any queries you run on those tables separately.
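
In practice this means every query should pin a single snapshot. A minimal sketch against the revision table (the snapshot and wiki_db partition names, the exact Hive table name, and the snapshot value are assumptions; check SHOW PARTITIONS to see what is actually available):

  -- Count revisions for one wiki in a single snapshot.
  -- Partition names and values are assumptions; verify with SHOW PARTITIONS.
  SELECT COUNT(*) AS revision_count
  FROM revision
  WHERE snapshot = '2017-04'
    AND wiki_db = 'enwiki';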

Datasets

Mediawiki raw data

These are copies of the MediaWiki MySQL tables (see the example query after the list):

  • archive
  • ipblocks
  • logging
  • page
  • pagelinks
  • redirect
  • revision
  • user
  • user_groups
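
Because these copies mirror the MediaWiki schema, they can be joined the same way as in a MediaWiki database. A sketch (column names follow the MediaWiki schema; the snapshot and wiki_db partition columns and the exact Hive table names are assumptions):

  -- Top 10 most-edited main-namespace pages for one wiki and one snapshot.
  SELECT p.page_title,
         COUNT(*) AS edits
  FROM revision r
  JOIN page p
    ON r.rev_page = p.page_id
   AND r.wiki_db = p.wiki_db
   AND r.snapshot = p.snapshot
  WHERE r.snapshot = '2017-04'
    AND r.wiki_db = 'enwiki'
    AND p.page_namespace = 0
  GROUP BY p.page_title
  ORDER BY edits DESC
  LIMIT 10;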

Processed Data

  • Mediawiki user history -- Dataset providing reconstructed history events of MediaWiki users
  • Mediawiki page history -- Dataset providing reconstructed history events of MediaWiki pages
  • Mediawiki fully denormalized history -- Fully denormalized dataset containing user, page and revision processed data (see the example query after this list)
  • Metrics -- Dataset providing precomputed metrics over edits data (e.g. monthly new registered users or daily edits by anonymous users)
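
As a sketch against the fully denormalized history (the Hive table name mediawiki_history, the partition columns, the field names, and the timestamp format are assumptions based on how the dataset is described; check DESCRIBE on the actual table):

  -- Page creations per month for one wiki in a single snapshot.
  SELECT SUBSTR(event_timestamp, 1, 7) AS month,   -- assumes 'YYYY-MM-...' timestamps
         COUNT(*) AS page_creations
  FROM mediawiki_history
  WHERE snapshot = '2017-04'
    AND wiki_db = 'enwiki'
    AND event_entity = 'page'
    AND event_type = 'create'
  GROUP BY SUBSTR(event_timestamp, 1, 7)
  ORDER BY month;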

Other Data


For an explanation of how this data is processed, see the documentation at Analytics/Systems/Wikistats

Limitations of the historical datasets

Users of this data should be aware that the reconstruction process is not perfect. The resulting data is not 100% complete throughout all of wiki history: in some specific slices of the data set, some fields may be missing (null) or approximated (inferred values).
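
If a field matters for your analysis, it can be worth measuring how often it is actually populated in the slice you care about. A sketch (the table and field names are assumptions chosen for illustration):

  -- Fraction of revision events with no user id, for one wiki and one snapshot.
  SELECT SUM(CASE WHEN event_user_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_ratio
  FROM mediawiki_history
  WHERE snapshot = '2017-04'
    AND wiki_db = 'enwiki'
    AND event_entity = 'revision';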

Why?

  • MediaWiki databases are not meant to store history (revisions, yes, of course; but not user history or page history). They hold part of the history in the logging table, but it is incomplete and formatted in many different ways depending on the software version. This makes the reconstruction of MediaWiki history a really complex task. Sometimes the data simply is not there and cannot be reconstructed.
  • The data is very large. The reconstruction algorithm needs to reprocess the whole database(s) from the beginning of time at every run, because MediaWiki constantly updates old records of the logging table. This poses hard performance challenges for the reconstruction job and makes the code much more complex. We need to balance the complexity of the job against data quality: at some point, a lot of complexity would have to be added to "maybe" improve quality for a small percentage of the data. For example, if only 0.5% of pages have field X missing and getting the information to fix it would make reconstruction twice as complex, the field will not be corrected but rather documented as not present. This is a balance of requirements, so please let us know if we are missing something important.

How much/Which data is missing?

After vetting the data for some time, we estimated that the recoverable data we did not manage to recover represents less than 1%. We also saw that this data corresponds mostly to the earlier years of reconstructed history (2007-2009), and is especially related to deleted pages. We do not yet have an in-depth analysis of the completeness of the data; it is in our backlog, see: https://phabricator.wikimedia.org/T155507

Will there be improvements in the future to correct this missing data?

Yes, if we know that the improvement will have enough benefit. The task mentioned above would help in measuring that.

Examples

History of deleted pages that are (re)created: correctly identifying a page as deleted and then recreated might be straightforward for small sets of pages. It can also be simpler when "recreated" does not mean the page was undeleted by an administrator. As mentioned above, however, the way MediaWiki logs data changes over time, which further complicates the identification process, particularly at the scale of "across all wikis". You might therefore find pages that were recreated with the same page ID, namespace, and title, and whose creation and deletion timestamps in the history table appear to be incorrect. If you want to run analysis on those kinds of cases, further narrowing the dataset (e.g. by time) might allow you to process them correctly, as in the sketch below.
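
For instance, a sketch that narrows page events to a single title and time range so the sequence of create/delete events can be inspected by hand (the table name, field names, event types, and timestamp format are assumptions; the page title is hypothetical):

  -- Page-level events for one title after a given date, in one wiki and snapshot,
  -- ordered so that suspicious create/delete sequences stand out.
  SELECT event_type,
         event_timestamp,
         page_id
  FROM mediawiki_history
  WHERE snapshot = '2017-04'
    AND wiki_db = 'enwiki'
    AND event_entity = 'page'
    AND page_title = 'Some_example_title'   -- hypothetical title, for illustration
    AND event_timestamp >= '2008-01-01'     -- assumes 'YYYY-MM-DD...' string timestamps
  ORDER BY event_timestamp;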

Access

Some of the data above is made public through different systems (see the Analytics main page), but any data on the Data Lake is private by default. For access details, see Analytics/Data access