Analytics/Cluster/Edit data loading


This page describes the first step of the edit history reconstruction pipeline: the loading of MediaWiki data into Hadoop. This is done using a couple of scripts stored in the analytics-refinery-source repository. The subsequent steps of the pipeline then process that data within the cluster and generate the desired output.

Sqooping data

The main script uses Apache Sqoop to import a set of MediaWiki tables from the publicly available database replicas and from the production replicas into the Analytics Hadoop cluster. You can find it in the analytics-refinery GitHub repository. The MediaWiki tables imported from the public replicas are: archive, category, categorylinks, change_tag, change_tag_def, content, content_models, externallinks, image, imagelinks, ipblocks, ipblocks_restrictions, iwlinks, langlinks, logging, page, pagelinks, page_props, page_restrictions, redirect, revision, slots, slot_roles, templatelinks, user, user_groups, user_properties, wbc_entity_usage. The tables imported from the production replicas are: actor, comment, watchlist. Finally, some special-case Sqoop jobs import the cu_changes and discussiontools_subscription tables from the production replicas.
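To illustrate what such an import looks like, here is a minimal sketch of a per-wiki, per-table Sqoop invocation wrapped in Python. The host name, password file path, target directory layout and mapper count are placeholders for illustration only; the real script in analytics-refinery-source drives Sqoop with its own configuration and options.

```python
import subprocess

# Placeholder values for illustration; the real script reads these from
# configuration and from the wiki grouping files.
ANALYTICS_REPLICA = "analytics-replica.example.wmnet:3306"
PASSWORD_FILE = "/user/hdfs/mysql-password.txt"


def sqoop_table(wiki_db, table, snapshot, mappers=4):
    """Build and run a single Sqoop import for one table of one wiki."""
    target_dir = (
        f"/wmf/data/raw/mediawiki/tables/{table}/"
        f"snapshot={snapshot}/wiki_db={wiki_db}"
    )
    cmd = [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{ANALYTICS_REPLICA}/{wiki_db}",
        "--username", "research",
        "--password-file", PASSWORD_FILE,
        "--table", table,
        "--target-dir", target_dir,
        "--as-avrodatafile",      # Avro keeps the table schema alongside the data
        "--num-mappers", str(mappers),
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    sqoop_table("enwiki", "revision", "2024-01")
```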

In addition to that, another table is created in Hadoop: namespace_mapping. It contains the localized namespaces for every wiki (see the namespace mapping script below). This sqooping process runs at the beginning of every month, and each run imports the full data since the beginning of time. It is deliberately non-incremental because MediaWiki data (revision, archive and logging) can be altered retroactively, including records created in the past.
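The practical consequence of the non-incremental design is that each monthly run produces a complete new snapshot rather than appending to the previous one. The sketch below shows one possible way to label and lay out such snapshots; the directory convention and helper names are assumptions for illustration, not the exact layout used in production.

```python
from datetime import date


def previous_month_snapshot(today=None):
    """Return the snapshot label (YYYY-MM) for the month that just ended.

    A run at the beginning of 2024-02 covers everything up to the end of
    January and labels the full import 'snapshot=2024-01'.
    """
    today = today or date.today()
    if today.month > 1:
        year, month = today.year, today.month - 1
    else:
        year, month = today.year - 1, 12
    return f"{year:04d}-{month:02d}"


def raw_table_path(table, wiki_db, snapshot):
    # Hypothetical HDFS layout: every monthly run writes a complete new
    # snapshot partition instead of updating older ones.
    return f"/wmf/data/raw/mediawiki/tables/{table}/snapshot={snapshot}/wiki_db={wiki_db}"


print(raw_table_path("revision", "enwiki", previous_month_snapshot(date(2024, 2, 1))))
# /wmf/data/raw/mediawiki/tables/revision/snapshot=2024-01/wiki_db=enwiki
```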

Sqooping wikis in groups

As in the other steps of the pipeline, this script has faced performance challenges related to the size and nature of the data. Sqoop would crash when trying to import all wikis at once, but running a separate job for each of the ~800 wikis would be too slow and error prone. The solution we went for is to group the wikis into clusters that Sqoop can process in parallel. The groups were determined by studying the sizes of all wikis. You can see a diagram of the partitions on GitHub.
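The following sketch illustrates the idea of size-based grouping with parallel execution across groups. The greedy packing, the size numbers and the worker model are assumptions for illustration; the actual groups were curated from measured wiki sizes and are defined in the refinery configuration.

```python
from concurrent.futures import ThreadPoolExecutor


def group_wikis_by_size(wiki_sizes, max_group_weight):
    """Greedily pack wikis into groups whose total size stays under a cap.

    wiki_sizes: dict of wiki_db -> approximate size (e.g. revision count).
    """
    groups, current, weight = [], [], 0
    for wiki, size in sorted(wiki_sizes.items(), key=lambda kv: -kv[1]):
        if current and weight + size > max_group_weight:
            groups.append(current)
            current, weight = [], 0
        current.append(wiki)
        weight += size
    if current:
        groups.append(current)
    return groups


def sqoop_group(group, snapshot):
    # Placeholder for the per-wiki Sqoop imports of one group; wikis within
    # a group are imported sequentially.
    for wiki in group:
        print(f"sqooping {wiki} for snapshot {snapshot}")


if __name__ == "__main__":
    sizes = {"enwiki": 1000, "dewiki": 300, "frwiki": 250, "eowiki": 5, "lnwiki": 1}
    groups = group_wikis_by_size(sizes, max_group_weight=600)
    # Groups run in parallel; each group is small enough for Sqoop to handle.
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        for g in groups:
            pool.submit(sqoop_group, g, "2024-01")
```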

Namespace mapping script

Another important part of this process is the generation of the namespace mapping table. This table holds the relation between namespace names and namespace IDs for all wikis. Note that many wikis have (or have had at some point in time) their own localized versions of the namespace names, like "Benutzer" (German) instead of "User". The table translates all versions (localized and canonical) of namespace names into their namespace IDs, which helps further steps in the pipeline normalize and reconstruct the editing data. You can find the namespace mapping script on GitHub.
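As a rough idea of what such a mapping involves, the sketch below fetches the localized namespace names and aliases of a single wiki through the MediaWiki siteinfo API and maps each of them to its namespace ID. This is only an illustration of the concept; the actual script iterates over all wikis and writes the result into the namespace_mapping table in Hadoop, and may gather the data differently.

```python
import requests


def fetch_namespace_map(api_url):
    """Map localized and canonical namespace names of one wiki to namespace IDs."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces|namespacealiases",
        "format": "json",
        "formatversion": "2",
    }
    data = requests.get(api_url, params=params).json()["query"]

    mapping = {}
    for ns in data["namespaces"].values():
        mapping[ns["name"]] = ns["id"]            # localized name, e.g. "Benutzer"
        if ns.get("canonical"):
            mapping[ns["canonical"]] = ns["id"]   # canonical name, e.g. "User"
    for alias in data.get("namespacealiases", []):
        mapping[alias["alias"]] = alias["id"]     # additional localized aliases
    return mapping


if __name__ == "__main__":
    dewiki = fetch_namespace_map("https://de.wikipedia.org/w/api.php")
    print(dewiki.get("Benutzer"), dewiki.get("User"))  # both resolve to 2
```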