Page and user history reconstruction algorithm and optimizations
This page describes the history reconstruction algorithm, along with the performance challenges the Analytics team faced while developing it and the solutions they found.
The algorithm is an Apache Spark job written in Scala that runs on the Analytics team's Hadoop cluster (~30 nodes). It reads the data from the MediaWiki databases that has been imported into Hadoop via the data loading process, processes it, and outputs the results into a couple of tables in Hadoop. The algorithm is not incremental: it processes the whole history every time it runs. An important detail is that Spark uses RDDs (Resilient Distributed Datasets) to process big data, and this affects the shape of the final algorithm.
The reconstruction algorithm (both for user and for page history) is, at its core, a workaround for the fact that the MediaWiki logging table stores neither page IDs nor user IDs in its events. This means that the events affecting a single user or page cannot be linked together using a unique ID; they can only be linked using user names and page titles. But those names and titles can change over time, and two different users can hold the same name at different points in time. So user/page events cannot simply be grouped by name/title: they need to be traced back in time, starting from the current state of the database.
The following gif tries to explain how the algorithm works. At the top there is a list of events coming from the logging table, sorted by timestamp (t1, t2, ...); at the bottom there is the current state of the user table, on top of which the algorithm reconstructs the history provided by the events above. Note that [t1-t2] means "from timestamp t1 to timestamp t2", and [?-t3] means "from some unknown point in the past until timestamp t3".
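For illustration, here is a minimal Python sketch of the core idea in the gif: tracing one user's rename chain backwards from the current state. The production job is Scala/Spark; the names, timestamps, and function below are hypothetical.

```python
def reconstruct(current_name, events):
    """Trace one user's rename chain backwards from the current state.

    events: list of (timestamp, old_name, new_name) rename events, any order.
    Returns a list of (name, start_ts, end_ts) states, oldest first;
    start_ts is None for the unknown beginning ("[?-t]" in the gif).
    """
    # Index rename events by the name they produced.
    by_new_name = {}
    for ts, old, new in events:
        by_new_name.setdefault(new, []).append((ts, old))

    states = []
    name, end_ts = current_name, None  # end_ts None means "until now"
    while name in by_new_name and by_new_name[name]:
        # Walk backwards: take the latest rename that produced this name.
        ts, old = max(by_new_name[name])
        by_new_name[name].remove((ts, old))
        states.append((name, ts, end_ts))
        name, end_ts = old, ts
    states.append((name, None, end_ts))  # unknown start of the chain
    return list(reversed(states))
```

Starting from the current name and repeatedly asking "which rename produced this name?" is what lets the algorithm link states that share no ID.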
The page history is reconstructed in a very similar way, the main difference being that the page title always goes together with the page namespace, so both must be considered when joining events with history states. For simplicity, the example shows only rename events, but in reality there are many other event types, such as deletions, restores, blocks and rights changes. However, those are simpler than renames in terms of managing history.
The history reconstruction algorithm needs several tables from the MediaWiki databases: archive, logging, page, revision and user. And it needs them for each of the ~800 databases that exist. In total, this data exceeds 200 GB in compressed Avro format. It's not huge data, but the algorithm needs to apply several complex joins that become a serious computational challenge.
Here are the solutions we implemented to overcome this problem.
1. Fixed-point algorithm
The algorithm shown in the gif example corresponds to the first version we implemented. It was a fixed-point algorithm: it would join lots of events to the reconstructed history in a single step, and then repeat the whole thing and join again. This would go on until all (joinable) events were consumed, hence the name fixed-point. However, some pages and users have a very long history in MediaWiki, with thousands of events. This made the algorithm take too many steps and spend too long initializing and setting up the RDDs at every step. In addition, the parallel nature of the fixed-point approach revealed several correctness issues that affected a considerable portion of events.
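As a rough illustration, the loop below sketches the fixed-point structure in simplified Python (the real implementation used Spark RDD joins across many wikis; the rename-only join step here is a hypothetical stand-in). Note how a long rename chain forces one full step per link, which is why entities with thousands of events made this version slow:

```python
def join_step(known_names, events):
    """One parallel-style step: consume every rename whose resulting
    name is already known, learning the older name it came from."""
    known = set(known_names)
    remaining = []
    for ts, old_name, new_name in events:
        if new_name in known:
            known.add(old_name)
        else:
            remaining.append((ts, old_name, new_name))
    return known, remaining

def fixed_point(current_names, events):
    """Repeat the join step until no more events can be consumed."""
    known = set(current_names)
    while events:
        known, remaining = join_step(known, events)
        if len(remaining) == len(events):
            break  # fixed point reached: nothing was consumed this step
        events = remaining  # a chain of n renames needs n steps
    return known
```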
2. "Backwards time-lining"
The solution we came up with was to process one event at a time while going backwards in time. This approach does not use RDDs to process the data, so the performance problems associated with users/pages with long histories disappear; and because it no longer joins lots of asynchronous events in the same step, it also avoids the correctness issues. It worked like a charm for small wikis like simplewiki, but since all the data needs to fit in memory, it did not work for enwiki and other large wikis.
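A minimal Python sketch of the idea, under the simplifying assumptions that events are plain renames and that each name has a single live holder at any given time (the real Scala code handles many more event types):

```python
def backwards_timeline(current_names, events):
    """Fold rename events into the current state, newest first.

    current_names: set of names present in the current user table.
    events: iterable of (timestamp, old_name, new_name) rename events.
    Returns per-name lifetimes as {name: [(start_ts, end_ts), ...]},
    where None stands for "unknown start" or "still current".
    """
    open_states = {name: None for name in current_names}  # name -> end_ts
    lifetimes = {}
    # One event at a time, going backwards in time.
    for ts, old, new in sorted(events, reverse=True):
        if new in open_states:
            end_ts = open_states.pop(new)
            lifetimes.setdefault(new, []).append((ts, end_ts))
            open_states[old] = ts  # the older name was live until this rename
    for name, end_ts in open_states.items():
        lifetimes.setdefault(name, []).append((None, end_ts))  # [?-end]
    return lifetimes
```

Because the whole state lives in plain in-memory dictionaries rather than RDDs, each event is a cheap lookup, but the entire wiki's data has to fit on one node, which is exactly the limitation described above.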
3. Subgraph partitioning
So the final solution was to partition the data into pieces before applying the "backwards time-lining", so that each piece would fit in memory and could be processed in parallel on a separate node. The partitioning that turned out best was splitting by subgraphs. Picture the MediaWiki page universe as a graph where every page title ever used is a vertex and every page move (rename) event is an edge connecting two of those vertices. Naturally, there are some disconnected subgraphs that can be processed independently without affecting the other subgraphs. The solution is to apply this subgraph partitioning as a pre-processing stage, and then run "backwards time-lining" over each of the disconnected subgraphs.
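A possible sketch of the partitioning step in Python, using a union-find over the rename edges to group events by connected subgraph (the production Spark/Scala implementation may differ):

```python
def partition_by_subgraph(rename_events):
    """Group rename events into connected components.

    rename_events: iterable of (timestamp, old_title, new_title).
    Returns {component_root: [events...]}, one bucket per subgraph;
    each bucket can then be timelined independently on its own node.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    events = list(rename_events)
    # Every rename connects the two titles it involves.
    for _, old_title, new_title in events:
        union(old_title, new_title)
    # Bucket events by the component of their titles.
    buckets = {}
    for event in events:
        buckets.setdefault(find(event[1]), []).append(event)
    return buckets
```

Since no rename edge ever crosses two buckets, processing one bucket can never need a state held in another, which is what makes the per-subgraph runs safe to execute in parallel.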