Analytics/Cluster/Edit serving layer

This is the last step of the edit history reconstruction pipeline. It consists of a back-end storage service optimized for fast analytical queries and a front-end UI connected to it that permits simple slice-and-dice operations on the data.

Publicly available

Internal to WMF

Back end

At the moment we're using Druid, a column-oriented datastore that has been giving good performance for recent data (the last 2 years). The Druid cluster currently has 3 boxes, druid100[1-6].eqiad.wmnet. It supports a considerable number of non-trivial queries per second; see this load test. A nice advantage of Druid is that it has an awesome dedicated UI called Pivot that works out of the box.
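To give an idea of what an analytical query against Druid looks like, here is a minimal sketch that builds a native Druid timeseries query as JSON. The data source name `edit_history` and the metric field `events` are hypothetical placeholders, not the names used on the cluster. In practice such a payload would be POSTed to a Druid broker's `/druid/v2/` endpoint; the sketch only builds it.

```python
import json


def build_timeseries_query(data_source, start, end, metric_field):
    """Build a Druid native timeseries query as a plain dict.

    data_source and metric_field are assumptions for illustration;
    substitute the real datasource and metric names.
    """
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "day",
        # Druid intervals are ISO-8601 "start/end" strings.
        "intervals": [f"{start}/{end}"],
        "aggregations": [
            # Sum a pre-aggregated count column into a metric named "total".
            {"type": "longSum", "name": "total", "fieldName": metric_field},
        ],
    }


# Hypothetical names: "edit_history" and "events" are placeholders.
query = build_timeseries_query("edit_history", "2017-01-01", "2017-02-01", "events")
payload = json.dumps(query)
```

Keeping the query as a plain dict makes it easy to inspect or log before it is sent anywhere.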

We are also considering and testing other datastores, namely ClickHouse and Presto, which could provide good performance and allow more flexibility than Druid. They would, however, have no connector to Turnilo; we could try to write our own.

Front end

We have two tools connected to the Druid datastore:

  • Turnilo, a simple but powerful slice-and-dice UI for data analysis (a rewritten version of Pivot, which is no longer open source). If you have LDAP access, you can try it at turnilo.wikimedia.org.
  • Superset, a less simple but more feature-rich tool for BI and dashboarding on top of Druid. If you have LDAP access, you can try it at superset.wikimedia.org.
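Conceptually, a slice-and-dice operation amounts to filtering on one dimension and splitting the result by another. The sketch below expresses that as a Druid native topN query. It is an illustration only, not the query either tool actually generates, and the data source, dimension and metric names are hypothetical.

```python
def build_topn_query(data_source, interval, filter_dimension, filter_value,
                     split_dimension, metric_field, limit=10):
    """Build a Druid native topN query: filter (slice), then split (dice)."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        # Slice: keep only rows where filter_dimension == filter_value.
        "filter": {
            "type": "selector",
            "dimension": filter_dimension,
            "value": filter_value,
        },
        # Dice: break the remaining rows down by split_dimension.
        "dimension": split_dimension,
        "metric": "total",
        "threshold": limit,
        "aggregations": [
            {"type": "longSum", "name": "total", "fieldName": metric_field},
        ],
    }


# All names below are placeholders chosen for illustration.
topn = build_topn_query(
    data_source="edit_history",
    interval="2017-01-01/2017-02-01",
    filter_dimension="user_type",
    filter_value="anonymous",
    split_dimension="wiki",
    metric_field="events",
    limit=5,
)
```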