User:Triciaburmeister/Sandbox/Data platform/Publish data
This page is currently a draft. More information and discussion about changes to this draft can be found on the talk page.
This page describes internal tools for publishing dashboards, visualizations, and reports based on Wikimedia private data. For publicly-accessible resources and data, see meta:Research:Data.
Before you start
This page assumes you have already identified datasets relevant for your analysis, and that you can access and query that data using internal analysis tools.
Generate reports and dashboards
Internal data reporting tools
Turnilo is a web interface for exploring data stored in Druid. Turnilo is a quick and simple solution for self-service analysis tasks, but it has some technical limitations that make it slightly less accurate and precise than Superset.
Go to Turnilo: turnilo.wikimedia.org
Superset is a web interface for data visualization and exploration. Like Turnilo, it provides access to Druid tables, but it also has access to data in Hive (and elsewhere) via Presto, and it offers more advanced slicing-and-dicing options.
Matomo is a small-scale web analytics platform, mostly used for Wikimedia microsites (roughly 10,000 requests per day or fewer).
Public dashboards and reporting tools
These dashboards provide options for external users to view and explore Wikimedia metrics and datasets. Some of them may also offer private dashboard views of internal data.
analytics.wikimedia.org is a static site that serves WMF analytics dashboards and data downloads.
- Site documentation
- Web publication: Process for publishing ad-hoc, low-risk datasets, notebooks, or other research products on the site
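Files published through the web-publication process become available under the site's `published` directory. As a minimal sketch of how a published file's public URL can be constructed (the dataset path below is hypothetical; substitute a real one):

```python
# Sketch: build the public URL for a file served from
# analytics.wikimedia.org/published. The relative path used in the
# example is hypothetical -- substitute the path of a real dataset.
from urllib.parse import urljoin

BASE = "https://analytics.wikimedia.org/published/"

def published_url(relative_path: str) -> str:
    """Return the public URL for a file under the published directory."""
    return urljoin(BASE, relative_path.lstrip("/"))

print(published_url("datasets/example/my_report.csv"))
# https://analytics.wikimedia.org/published/datasets/example/my_report.csv
```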
Dashiki is a dashboarding tool that lets users declare dashboards by using configuration pages on a wiki.
- Dashiki dashboard tutorial
- Example dashboards:
- Pageviews (public)
- Browser statistics (public)
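The declare-by-config idea above can be illustrated with a small sketch: a dashboard is defined as structured data on a wiki page, and the tool reads that page to render the dashboard. The field names here are illustrative only, not the actual Dashiki schema:

```python
# Illustrative only: a minimal "dashboard declared as wiki config" example.
# Real Dashiki configs are pages on a wiki with their own schema; the keys
# below (title, layout, metrics) are assumptions made up for this sketch.
import json

CONFIG_PAGE_TEXT = """
{
    "title": "Browser statistics",
    "layout": "tabs",
    "metrics": ["pageviews", "unique-devices"]
}
"""

def load_dashboard_config(page_text: str) -> dict:
    """Parse a config page's JSON body into a dashboard definition."""
    return json.loads(page_text)

config = load_dashboard_config(CONFIG_PAGE_TEXT)
print(config["title"])  # Browser statistics
```

Because the dashboard definition lives on a wiki page, anyone with wiki edit rights can change the dashboard without deploying code.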
Grafana is a frontend for creating queries and storing dashboards using data from Graphite and other datasources. WMF uses it to publicly share metrics about Wikimedia website use and performance.
- Technical docs: https://wikitech.wikimedia.org/wiki/Grafana
- Community docs: https://meta.wikimedia.org/wiki/Grafana.wikimedia.org
- Dashboard access: https://grafana.wikimedia.org
Publish and maintain data
Policies
This is a quick-reference list of policies; the same policies are also linked in the next section, organized by the data-lifecycle step to which they correspond.
- Data Access
- WMF Privacy Policy
- Data Retention
- Data Publication
- Country and Territory Protection List
Procedures and how-to guides
See Data lifecycle management process for an overview of the full process.
Procedures:
- Formal open data release process
- Web publication: Process for publishing ad-hoc, low-risk datasets, notebooks, or other research products on the web in the analytics.wikimedia.org/published directory
- Dashboarding guidelines
- Reporting guidelines
- DataHub and dataset documentation guide
Procedures:
- Data Incident management
- Data Issue reporting
- Event sanitization: Process for retaining event data in Hive beyond the 90-day retention period.
- TODO: Do we have any such documentation?
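The event-sanitization step above can be pictured as an allowlist filter: fields explicitly named in an allowlist are retained past the retention window, and everything else is dropped. This is a simplified sketch with hypothetical field names; the real pipeline supports additional actions (such as hashing) beyond keep/drop:

```python
# Simplified sketch of allowlist-based event sanitization: only fields
# explicitly listed are retained; all other fields are dropped. The field
# names below are hypothetical, and the real process supports more
# per-field actions than keep/drop.
ALLOWLIST = {"wiki", "event_name", "timestamp"}

def sanitize(event: dict, allowlist: set) -> dict:
    """Return a copy of the event containing only allowlisted fields."""
    return {k: v for k, v in event.items() if k in allowlist}

raw = {
    "wiki": "enwiki",
    "event_name": "edit",
    "timestamp": "2024-01-01T00:00:00Z",
    "user_ip": "203.0.113.5",  # sensitive field: removed by sanitization
}
print(sanitize(raw, ALLOWLIST))
```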
Next steps
- To start collecting new data, or to define new workflows that compute a dataset, transform some data, or upload a dataset to a data store: visit Collect data.
TODO: I'd like the details below to be easily accessible from the "Data infrastructure" section of the landing page, but maybe that doesn't make sense, and the information should be highlighted as part of each component/dataset?
- To see more technical details of how WMF data pipelines and systems implement our data retention and sanitization policies:
- Data Engineering/Systems/Event Data retention
- Analytics/Cluster/Data_deletion_and_sanitization
- Data Engineering/Systems/EventLogging/Publishing
- Data Engineering/Systems/EventLogging/Sanitization vs Aggregation
- Data Engineering/Systems/EventLogging/Sensitive Fields
- Data Engineering/Systems/EventLogging/User agent sanitization