Analytics/Systems/Data Lake/Edits/Pipeline/Data loading


This page describes the first step of the edit history reconstruction pipeline: the loading of MediaWiki data into Hadoop. This is done with a couple of scripts stored in the analytics-refinery-source repository. Subsequent steps in the pipeline then process that data within the cluster and generate the desired output.

Main sqooping script

The main script uses Apache Sqoop to import a set of MediaWiki tables into Analytics' Hadoop cluster. You can find it on GitHub here. The MediaWiki tables imported are: archive, ipblocks, logging, page, revision, user_groups and user. In addition, another table is created in Hadoop: namespace_mapping (see namespace mapping script). This sqooping process runs at the beginning of every month, and each run imports the data from the beginning of time. It is deliberately non-incremental because MediaWiki data (revision, archive and logging) can be altered retroactively, so records created in the past may change.
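For illustration, here is a minimal sketch of what a Sqoop import of one table for one wiki could look like, wrapped in Python. The hostname, credentials file, HDFS paths and mapper count below are placeholders rather than the values used in production, and the real script builds its commands differently.

import subprocess

def sqoop_table(wiki_db, table, split_column, target_base):
    """Build and run a Sqoop import for a single MediaWiki table of one wiki.

    Hostname, credentials file and HDFS paths are placeholders,
    not the values used by the production script.
    """
    command = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://analytics-replica.example/" + wiki_db,
        "--username", "research",
        "--password-file", "/user/analytics/mysql-password.txt",
        "--table", table,
        "--split-by", split_column,   # column Sqoop uses to split work among mappers
        "--num-mappers", "4",
        "--as-avrodatafile",          # write the raw data as Avro files
        "--target-dir", target_base + "/" + table + "/wiki_db=" + wiki_db,
    ]
    subprocess.run(command, check=True)  # raise if the import fails

# Example: full (non-incremental) import of enwiki's revision table.
sqoop_table("enwiki", "revision", "rev_id", "/wmf/data/raw/mediawiki/tables")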

Sqooping wikis in groups

As in the other steps of the pipeline, the size and nature of the data have posed performance challenges in this script as well. Sqoop would crash when trying to import all wikis at once, while launching a separate job for each of the ~800 wikis would be too slow and error-prone. The solution we went for is to group the wikis into clusters that Sqoop can process in parallel. The groups were determined by studying the sizes of all wikis. You can see a diagram of the partitions on GitHub here.
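As a rough sketch of the grouping idea, the snippet below greedily assigns wikis to a fixed number of size-balanced groups and then runs the groups in parallel, reusing the hypothetical sqoop_table helper from the sketch above. The wiki sizes and group count are made up; the real partitioning was derived from measured wiki sizes.

from concurrent.futures import ThreadPoolExecutor

def partition_wikis(wiki_sizes, num_groups):
    """Greedily assign wikis to groups so the total size per group stays balanced."""
    groups = [[] for _ in range(num_groups)]
    totals = [0] * num_groups
    # Largest wikis first; each goes into the currently lightest group.
    for wiki, size in sorted(wiki_sizes.items(), key=lambda kv: -kv[1]):
        i = totals.index(min(totals))
        groups[i].append(wiki)
        totals[i] += size
    return groups

def sqoop_group(group):
    # Wikis within a group are sqooped sequentially (sqoop_table as sketched above).
    for wiki in group:
        for table, split_column in (("page", "page_id"), ("revision", "rev_id")):
            sqoop_table(wiki, table, split_column, "/wmf/data/raw/mediawiki/tables")

# Made-up sizes; the real grouping was derived from measured wiki sizes.
sizes = {"enwiki": 900_000_000, "dewiki": 200_000_000, "eowiki": 4_000_000}
groups = partition_wikis(sizes, num_groups=2)

# Run the groups in parallel, one thread per group.
with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    list(pool.map(sqoop_group, groups))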

Namespace mapping script

An important part of this process is also the generation of the namespace mapping table. This table holds a relation between namespace names and namespace ids for all wikis. Note that many wikis have (or have had at some point in time) their own localized versions of the namespace names, like "Benutzer" (German) instead of "User". This table translates all versions (localized and standard) of namespace names into their namespace ids, which helps later steps in the pipeline normalize and reconstruct the editing data. You can find the namespace mapping script on GitHub here.
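The sketch below shows one way such a mapping could be assembled, by asking each wiki's public MediaWiki API (meta=siteinfo) for its namespace names and aliases and emitting (wiki_db, namespace_name, namespace_id) rows. This only captures the names the API currently reports; the actual script also has to account for names used in the past and may read from a different source entirely.

import requests

def namespace_rows(wiki_db, api_url):
    """Return (wiki_db, namespace_name, namespace_id) rows for one wiki.

    api_url is a placeholder; the real script and its data source may differ.
    """
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces|namespacealiases",
        "format": "json",
    }
    data = requests.get(api_url, params=params).json()["query"]

    rows = []
    for ns in data["namespaces"].values():
        ns_id = ns["id"]
        # Localized name ("*") and canonical English name, when present.
        for name in {ns.get("*", ""), ns.get("canonical", "")}:
            rows.append((wiki_db, name, ns_id))
    # Aliases cover alternative localized names that can still appear in titles.
    for alias in data.get("namespacealiases", []):
        rows.append((wiki_db, alias["*"], alias["id"]))
    return rows

# Example: German Wikipedia, where namespace 2 maps from both "Benutzer" and "User".
print(namespace_rows("dewiki", "https://de.wikipedia.org/w/api.php")[:5])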