Data Platform/Data Lake/Public Data Lake
The data discussed in this document includes edits, corrections, uploads, editor history, and pageviews. We will refer to this usage data as `analytics data` in the rest of the document.
Originally, analytics data was made available as aggregated views in Wikistats static HTML reports and via big file dumps (pageview files and edit XML-dumps). Judging from academic usage, releasing this data publicly has been valuable: a substantial number of articles have been published that mention Wikipedia pageviews or Wikipedia edits. The WMF has also made application data available through the database replicas on the cloud platform; this data is “raw” (not aggregated), and while it is heavily used by the community today, it is not in a form friendly to analytics computations.
Value Proposition
The WMF-Analytics team envisions a new system available in WMF-Cloud that would facilitate querying wiki-project edit/user/page data for analytics results.
For instance, the system would provide answers to queries such as: “How many registered users edit at least 3 different wikis?” or “How many reverts have been made on a specific page since its creation?”
These two examples were chosen on purpose because they are difficult to solve with the current MediaWiki database replicas available in WMF-Cloud: the first because querying across wikis is complicated (each wiki being its own database), and the second because computing reverts is computationally complex and expensive.
The system we have in mind will provide multi-wiki querying by default (all data in a single table, with a wiki_db field differentiating rows from different projects), and the data will be augmented with “interesting” fields (for instance the edit count of the user at the time the edit was made) precomputed on the Analytics cluster at the data-preparation stage.
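As an illustration, the two example questions above could then be written as single queries against such a table. The sketch below is indicative only: the table name (mediawiki_history) and column names (wiki_db, event_entity, event_type, event_user_text, event_user_is_anonymous, page_id, revision_is_identity_revert) mirror the current Analytics-cluster schema and are assumptions about how the public table would be exposed; the wiki and page values are placeholders.

```sql
-- Sketch only: table/column names and filter values are illustrative assumptions.

-- "How many registered users edit at least 3 different wikis?"
-- Cross-wiki grouping uses the user name, since local user ids differ per wiki.
SELECT COUNT(*) AS multi_wiki_editors
FROM (
    SELECT event_user_text
    FROM mediawiki_history
    WHERE event_entity = 'revision'
      AND event_type = 'create'
      AND NOT event_user_is_anonymous
    GROUP BY event_user_text
    HAVING COUNT(DISTINCT wiki_db) >= 3
) AS multi_wiki_users;

-- "How many reverts have been made on a specific page since its creation?"
-- Uses the precomputed revert flag instead of recomputing reverts at query time.
SELECT COUNT(*) AS reverts
FROM mediawiki_history
WHERE wiki_db = 'enwiki'        -- placeholder wiki
  AND page_id = 12345           -- placeholder page id
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND revision_is_identity_revert;
```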
This system will allow users (tools and researchers) to gather more information, more easily, about our projects.
Technical solution
The dataset and dataflow
Made available on the WMF-Analytics cluster a bit less than two years ago, the mediawiki-history dataset contains most user, page, and revision information from all public wiki projects. It is built from data coming from the WMF-Cloud database replicas, meaning it contains no private data. Its construction also embeds the so-called “interesting” precomputed fields, such as reverting/reverted information, by-user and by-page revision counts for each revision, and the historical names of users and pages at the time of each revision (when available).
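For example, these precomputed and historical fields can be read directly at the revision level, with no joins or recomputation. A minimal sketch, again assuming the public table keeps the Analytics-cluster column names (event_user_revision_count, event_user_text_historical, page_title_historical, revision_is_identity_reverted):

```sql
-- Sketch only: column names mirror the Analytics-cluster mediawiki_history schema.
SELECT event_timestamp,
       event_user_text_historical,    -- user name as it was at the time of the edit
       event_user_revision_count,     -- the user's edit count at the time of the edit
       page_title_historical,         -- page title as it was at the time of the edit
       revision_is_identity_reverted  -- whether this edit was later reverted
FROM mediawiki_history
WHERE wiki_db = 'frwiki'              -- placeholder wiki
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND page_id = 6789                  -- placeholder page id
ORDER BY event_timestamp;
```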
The dataset is updated on a monthly basis following a complex process (see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster#Edit_data). Its current size is about 800GB, stored in Parquet, an analytics-optimized file format. We plan to make this data available from the WMF-Analytics cluster to the WMF-Cloud system (and any other users) for download from the publicly accessible dataset endpoints (currently serving XML-dumps and analytics pageview data). We plan to have a parent folder containing monthly-snapshot (“YYYY-MM” format) sub-folders, each containing the files for the given snapshot. We will keep up to 3 historical versions available online in parallel.
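On the WMF-Cloud side, one plausible way to plug the downloaded snapshots into the query engine is to register them in the Hive metastore as an external table partitioned by snapshot, so that each monthly “YYYY-MM” sub-folder becomes one partition. The sketch below is an assumption about that setup, not a final design; the table name, column list and HDFS location are placeholders.

```sql
-- Sketch only: schema, location and partitioning are assumptions about how the
-- monthly snapshot folders could be registered for Presto via the Hive metastore.
CREATE TABLE IF NOT EXISTS wmf.mediawiki_history (
    wiki_db         VARCHAR,
    event_entity    VARCHAR,
    event_type      VARCHAR,
    event_timestamp TIMESTAMP,
    -- ... remaining mediawiki_history columns ...
    snapshot        VARCHAR     -- 'YYYY-MM', one value per monthly sub-folder
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['snapshot'],
    external_location = 'hdfs://cloud-analytics/mediawiki_history'  -- placeholder path
);
```

Each new monthly sub-folder would then be registered as a partition in the metastore (and old ones dropped as snapshots rotate out), so that Presto only ever sees the up-to-3 versions kept online.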
The query engine and user access
There are multiple analytics-oriented query engines available in the technological space today. We decided to use Presto, an open-source project from Facebook (see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto for a more detailed rationale for this choice). The Presto engine is built to provide fast results to multiple users querying large volumes of data. It takes advantage of parallelizing computation over multiple machines, and provides better performance when used in conjunction with the Parquet file format used to store the dataset.
For more details on SQL engines, see the related appendix page.
In terms of accessibility, we aim to provide Presto access to any registered WMF-Cloud user, first through shell access (ssh and the Presto CLI), and later through visual querying tools such as Quarry.
WMF-Cloud architecture of SQL engine
Two solutions are possible, considering there are 5 machines to be made accessible in the WMF-Cloud infrastructure. The first is to install Hadoop and Presto on bare metal and configure the network so that those machines are part of the WMF-Cloud network. The second is to make the machines part of the WMF-Cloud virtualization pool, in a dedicated project (or even in the analytics project). The latter solution has been picked, as it allows for more flexibility and keeps the WMF-Cloud infrastructure coherent. In addition to the 5 machines for Presto, 3 smaller VMs have been requested to handle edge systems: the HDFS NameNode, the Hive metastore, and the Presto coordinator.