User:Triciaburmeister/Sandbox/Data platform/Publish data


This page describes internal tools for publishing dashboards, visualizations, and reports based on Wikimedia private data. For publicly accessible resources and data, see meta:Research:Data.

Before you start

This page assumes you have already identified datasets relevant to your analysis and that you can access and query that data using internal analysis tools.

Generate reports and dashboards

Internal data reporting tools

Turnilo is a web interface for exploring data stored in Druid. It is a quick, simple option for self-service analysis, but technical limitations make its results slightly less accurate and precise than Superset's.

Go to Turnilo: turnilo.wikimedia.org

Superset is a web interface for data visualization and exploration. Like Turnilo, it provides access to Druid tables, but it also has access to data in Hive (and elsewhere) via Presto, and it offers more advanced slicing-and-dicing options.

Go to Superset

Matomo is a small-scale web analytics platform, mostly used for Wikimedia microsites (roughly 10,000 requests per day or fewer).

Go to Matomo

Public dashboards and reporting tools

These dashboards let external users view and explore Wikimedia metrics and datasets. Some of them may also offer private dashboard views of internal data.

analytics.wikimedia.org

analytics.wikimedia.org is a static site that serves WMF analytics dashboards and data downloads.

Dashiki is a dashboarding tool that lets users declare dashboards by using configuration pages on a wiki.

Grafana is a frontend for creating queries and storing dashboards using data from Graphite and other data sources. WMF uses it to publicly share metrics about Wikimedia website use and performance.

Publish and maintain data

Policies

This is a quick-reference list of policies. The same policies are also linked in the next section, organized by the step of the data lifecycle they apply to.

Procedures and how-to guides

See Data lifecycle management process for an overview of the full process.

Publish and document data
Maintain and monitor data

Policies:

Procedures:

Archive or deprecate data
  • TODO: Do we have any such documentation?

Next steps

  • To start collecting new data, or to define new workflows that compute a dataset, transform some data, or upload a dataset to a data store: visit Collect data.

TODO: I'd like the details below to be easily accessible from the "Data infrastructure" section of the landing page, but maybe that doesn't make sense, and the information should be highlighted as part of each component/dataset?