Analytics/Systems/Data Lake/Edits/Pipeline/Serving layer

This is the last step of the edit history reconstruction pipeline. It consists of a back-end storage service optimized for fast analytical queries, and a front-end UI connected to it that permits simple slice-and-dice operations on the data.

Note: This part of the project is still very much WIP and may change as we productionize the data processing and decide which serving layer and UI are best.

Publicly available

We have not yet invested time in choosing the datastore that will serve the data for public consumption. It is nonetheless at the top of our priority list.

Internal to WMF

Back end

At the moment we're using Druid, a column-oriented datastore, which has been giving nice results performance-wise for recent data (the last 2 years). The Druid cluster currently has 3 boxes, druid100[1-6].eqiad.wmnet. It supports a considerable number of non-trivial queries per second; see this load test. A nice advantage of Druid is that it has an awesome dedicated UI called Pivot that works out of the box.
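
To give an idea of what an analytical query against a Druid back end can look like, here is a minimal sketch that builds a native Druid timeseries query and posts it to a broker over HTTP. The data source name, the broker URL and the helper function names are assumptions made up for illustration; they are not the actual names used on the cluster.

```python
import json
import urllib.request

# Hypothetical placeholders: not the real data source or broker address.
DATASOURCE = "edit_history_example"
BROKER_URL = "http://localhost:8082/druid/v2/"


def monthly_rows_query(start, end, datasource=DATASOURCE):
    """Build a Druid native timeseries query counting rows per month."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "month",
        # Druid intervals are ISO-8601 "start/end" strings.
        "intervals": [f"{start}/{end}"],
        "aggregations": [{"type": "count", "name": "rows"}],
    }


def run_query(query, broker_url=BROKER_URL):
    """POST the query as JSON to a Druid broker and return the decoded result."""
    request = urllib.request.Request(
        broker_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the query; call run_query() against a reachable broker.
    print(json.dumps(monthly_rows_query("2015-01-01", "2017-01-01"), indent=2))
```

Restricting the interval to recent data, as in the example, matches the range for which the cluster has been performing well.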

We are also considering and testing other datastores, namely ClickHouse and Presto, which could provide good performance and allow more flexibility than Druid. They would, however, have no connector to Pivot, though we could try to write our own.

Front end

The chosen UI is Pivot, a simple but powerful slice-and-dice UI for data analysis. If you have LDAP access, you can try it here. Keep in mind that it is still WIP and its contents may change, especially the edit history data sets.
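
As a rough illustration of what a slice-and-dice operation amounts to on the back end, the sketch below builds a native Druid topN query: the "slice" is a filter on one dimension value, and the "dice" is a split by another dimension. This is not a description of the queries Pivot actually issues; the function name, data source and dimension names are hypothetical.

```python
import json


def slice_and_dice_query(datasource, interval, slice_dim, slice_value,
                         split_dim, limit=5):
    """Build a Druid topN query: filter on one dimension, split by another."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        # Slice: keep only rows where slice_dim equals slice_value.
        "filter": {
            "type": "selector",
            "dimension": slice_dim,
            "value": slice_value,
        },
        # Dice: group the remaining rows by split_dim, ranked by row count.
        "dimension": split_dim,
        "metric": "rows",
        "threshold": limit,
        "aggregations": [{"type": "count", "name": "rows"}],
    }


if __name__ == "__main__":
    # All names below are made up for the example.
    query = slice_and_dice_query(
        datasource="edit_history_example",
        interval="2016-01-01/2017-01-01",
        slice_dim="wiki",
        slice_value="example_wiki",
        split_dim="user_type",
    )
    print(json.dumps(query, indent=2))
```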