Analytics/Web publication

From Wikitech

This page describes how to make safe, non-identifying datasets, notebooks, or other research products public on the web in the analytics.wikimedia.org/published directory. For guidelines on how to formally release an open dataset (with metadata and persistent identifiers), please refer to Data releases. For regular, structured, and maintained datasets, please see Analytics#Datasets.

If you're looking for data here, some of it may not be maintained or documented. If possible, please reach out to the authors of the data for help, or to Data Engineering/Team. If you're publishing data here, there are some guidelines in the README on the server.

Instructions

  1. Double-check that the dataset or notebook you want to publish is safe and non-identifying.
  2. Decide where you want to publish it. There are separate folders for notebooks and datasets; within those, browse the existing subfolders and decide where your file fits. For example, if you have my-data-2020-01.tsv, you may want to publish it as datasets/one-off/my-data/my-data-2020-01.tsv. Please try to use names that complete strangers viewing the website will understand!
  3. Make sure you're on one of the Analytics clients (AKA stat boxes).
  4. Copy your file to the corresponding location within the /srv/published/ folder on that machine, or in the /wmf/data/published/ folder on HDFS. Create the intermediate folders if necessary. If you're using Analytics/Systems/Jupyter, for security reasons you will not be able to access this location from the terminal in your browser. You'll need to SSH directly into the server and move the file using the command line.
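As a rough illustration of steps 2 and 4, the copy could be scripted along the following lines. This is only a sketch: the helper name publish_file is hypothetical, the top-level folder names datasets and notebooks are assumed from the description in step 2, and the script is assumed to run on a stat box over SSH where /srv/published/ is writable.

```python
import shutil
from pathlib import Path

# Root of the published tree on a stat box, as named in step 4.
PUBLISHED_ROOT = Path("/srv/published")

# Assumption: the two top-level folders are called "datasets" and "notebooks".
ALLOWED_TOP_LEVEL = ("datasets", "notebooks")


def publish_file(source, relative_dest, published_root=PUBLISHED_ROOT):
    """Copy `source` to `published_root / relative_dest` (hypothetical helper).

    Intermediate folders are created if they do not exist yet.
    Returns the destination path.
    """
    relative_dest = Path(relative_dest)
    # Refuse paths that could escape the published tree.
    if relative_dest.is_absolute() or ".." in relative_dest.parts:
        raise ValueError("destination must be a relative path without '..'")
    if not relative_dest.parts or relative_dest.parts[0] not in ALLOWED_TOP_LEVEL:
        raise ValueError("destination must start with 'datasets' or 'notebooks'")

    destination = Path(published_root) / relative_dest
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(source, destination)
    return destination


if __name__ == "__main__":
    # Example from step 2; run only after confirming the data is safe to publish.
    print(publish_file(
        "my-data-2020-01.tsv",
        "datasets/one-off/my-data/my-data-2020-01.tsv",
    ))
```

The checks on the destination path are a design choice of this sketch, not a requirement stated in the README; they only guard against accidentally writing outside the published tree.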

Once you do this, a script that runs every 15 minutes will sync the file to the website automatically. If you want to sync immediately, you can run the published-sync command manually.
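To work out where a file should appear once the sync has run, a small helper like the one below may be handy. It is a sketch under two assumptions that this page does not spell out: that the site is served over HTTPS, and that a path under /srv/published/ maps one-to-one onto the same path under analytics.wikimedia.org/published. The function name public_url is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# Local root named on this page.
PUBLISHED_ROOT = PurePosixPath("/srv/published")

# Assumption: HTTPS, and the local tree is mirrored as-is under /published.
BASE_URL = "https://analytics.wikimedia.org/published"


def public_url(local_path):
    """Return the expected public URL for a file under /srv/published/.

    Raises ValueError if `local_path` is not inside the published tree.
    """
    relative = PurePosixPath(local_path).relative_to(PUBLISHED_ROOT)
    # Percent-encode characters such as spaces; "/" is left untouched.
    return BASE_URL + "/" + quote(relative.as_posix())


if __name__ == "__main__":
    print(public_url(
        "/srv/published/datasets/one-off/my-data/my-data-2020-01.tsv"
    ))
```

Keep in mind that the URL may not resolve until the next scheduled sync, which can take up to 15 minutes unless the sync is run manually.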