Analytics/Data Lake/Traffic/referrer daily/Dashboard

The referrer_daily dataset has many facets, and there are many ways in which someone might want to aggregate or split the data. Turnilo is currently the best solution at Wikimedia for visualizing data with these properties, but the existing Turnilo instance is designed for private datasets, so a public instance was created in order to share this dataset more broadly. Some technical details are given below, and the dashboard can be found at: https://wiki-search-referrals.wmcloud.org

Turnilo instance

The Turnilo dashboard is hosted on a Cloud VPS instance by the Wikimedia Research team. The code for setting up the instance can be found here: https://github.com/wikimedia/research-api-endpoint-template/tree/turnilo-druid

Data backend

Turnilo depends on a Druid database backend to scale up effectively. Initially a TSV backend was used, but this quickly became too slow as the dataset grew.

Updating

Until a more streamlined workflow is developed, updates are handled by a chain of scripts run daily via crontab:

  • Export new TSV with yesterday's data from HDFS
  • Reformat data to match Turnilo's expected format and append to single flat-file (see below)
  • Update flat-file on Analytics server
  • Download new file on Turnilo instance and restart dashboard with updated data backend
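
The reformat-and-append step could look roughly like the sketch below. It is only an illustration, not the actual script used for the dashboard: the function name append_daily_export, the added "time" column, its ISO 8601 timestamp format, and the example column names are all assumptions, and the real export and the format Turnilo expects may differ.

```python
import csv
from datetime import date
from pathlib import Path


def append_daily_export(export_path: Path, flat_path: Path, day: date) -> int:
    """Append one day's TSV export to the flat file and return the rows written.

    Hypothetical helper: assumes the export has a header row and no
    "time" column of its own.
    """
    # Assumed timestamp format: midnight UTC of the exported day.
    timestamp = f"{day.isoformat()}T00:00:00Z"

    with export_path.open(newline="") as src:
        rows = list(csv.DictReader(src, delimiter="\t"))
    if not rows:
        return 0

    # Put the time column first, followed by the export's own columns.
    fieldnames = ["time"] + list(rows[0].keys())

    # Only write the header when the flat file is new or empty.
    write_header = not flat_path.exists() or flat_path.stat().st_size == 0

    with flat_path.open("a", newline="") as dst:
        writer = csv.DictWriter(
            dst, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        if write_header:
            writer.writeheader()
        for row in rows:
            writer.writerow({"time": timestamp, **row})
    return len(rows)
```

A daily cron job could call this once per export, for example append_daily_export(Path("export.tsv"), Path("referrals.tsv"), date(2021, 1, 1)), where both file names are placeholders. Note that the sketch does not guard against appending the same day twice, so a rerun after a failed job would need a duplicate check.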
