Analytics/Data Lake/Traffic/referrer daily/Dashboard

The referrer_daily dataset has many facets, and there are many ways in which someone might want to aggregate or split the data. Turnilo is currently the best solution at Wikimedia for visualizing data with these properties, but our Turnilo instance is designed for private datasets, so a public instance was created in order to share this dataset more broadly. Some technical details are given below, and the dashboard can be found at:

Turnilo instance

The Turnilo dashboard is hosted by the Wikimedia Research team on a Cloud VPS instance. The code for setting up the instance can be found here:

Data backend

Turnilo generally depends on a Druid database backend, but it also supports flat files in JSON, TSV, and CSV formats. For simplicity, and given the relatively small size of the dataset, this instance uses a TSV flat-file backend. This requires restarting the Turnilo instance each time the data is updated (daily), but startup is quick, and for now this is considered simpler than building a public Druid database endpoint.


Until a public Druid instance and a more streamlined workflow are developed, updates are handled by a series of daily scripts run via crontab:

  • Export a new TSV with yesterday's data from HDFS
  • Reformat the data to match Turnilo's expected format and append it to a single flat file (see below)
  • Update the flat file on the Analytics server
  • Download the new file on the Turnilo instance and restart the dashboard with the updated data backend
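
As an illustration of the append step in the list above, here is a minimal Python sketch. It is not the script that runs in production: the function name append_daily_tsv, the file paths, and the date column name "day" are all assumptions made for the example. It appends the rows of one daily TSV export to the cumulative flat file, writes the header only once, and skips any day that is already present so that re-running the cron job does not duplicate data.

```python
import csv
from pathlib import Path


def append_daily_tsv(daily_path, combined_path, date_column="day"):
    """Append rows from a daily TSV export to the cumulative flat file.

    Hypothetical helper: the column name "day" and both paths are
    assumptions, not the names used by the real pipeline.
    Returns the number of rows appended.
    """
    daily_path, combined_path = Path(daily_path), Path(combined_path)

    # Read the daily export, including its header row.
    with daily_path.open(newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        fieldnames = reader.fieldnames
        rows = list(reader)
    if not rows:
        return 0

    # Collect the dates already in the combined file so that a re-run
    # of the job does not append the same day twice.
    has_data = combined_path.exists() and combined_path.stat().st_size > 0
    existing_dates = set()
    if has_data:
        with combined_path.open(newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                existing_dates.add(row[date_column])

    new_rows = [r for r in rows if r[date_column] not in existing_dates]

    # Append the new rows; write the header only for a new or empty file.
    with combined_path.open("a", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        if not has_data:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
```

A daily cron job could call this once per export, for example append_daily_tsv("referrer_daily_new.tsv", "referrer_daily_all.tsv") (both file names are placeholders), before the combined file is copied to the Turnilo instance.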
