SWAP

SWAP (Simple Wikimedia Analytics Platform, previously known as PAWS Internal; the renaming is still a work in progress) is an internal web-based interface at the Wikimedia Foundation for analyzing non-public data from sources such as Hive or EventLogging, using Jupyter-based notebooks as a service. It is similar to the public PAWS infrastructure that runs on Wikimedia Tool Labs.

For an introduction to notebooks and Jupyter, see PAWS/Introduction.

Background

Why Internal Notebooks as a service?

The need for an open notebooks infrastructure is obvious - we have a lot of tools and bots that can leverage the infrastructure, research to be done and shared on public data sources, and a lot of other awesome things described on the PAWS/Tools page. What would be the point of an NDA-only equivalent?

  1. Access to Analytics data: The WMF analytics infrastructure houses plenty of rich data sources - about webrequests, pageviews, unique devices, browsers, EventLogging data and more. The team works hard to get data sources aggregated and exposed publicly. However, there is a gap between the rate at which data sources can be made public and the demand for data. There is also the problem that some data cannot be public. The immediate response that comes to mind is that anyone who is part of the NDA group can request access to our stat boxes and query away. This is easier said than done - access requests are tedious, and once you have access, you need to learn to use SSH, command-line interfaces, etc. Shouldn't engineers and analysts already know this? That may be the current state of things, but there is no real need to deal with this "accidental complexity" and drudgery in order to be a good engineer or analyst. At this point we have 30 or so active users of our Hadoop infrastructure, but many more people in the organization would leverage the available data if they didn't have to pay that tax to get to it.
  2. Ease of manipulating and visualizing data: Often, folks are interested in looking at the data and plotting simple graphs to see trends. Doing this now is tedious. There is sometimes a real need to access data across MySQL and Hadoop stores, and no good way to work on both at the same time without a lot of grunt work to prepare datasets. Notebooks with good connectors for talking to different data stores and for programmatically manipulating and visualizing data would go a long way toward making this easy (see the sketch after this list).
  3. Easy discovery of data sources
  4. Enables more research and analysis: Not only does having this interface ease research and analysis on our private data, it empowers everyone to ask questions and answer them. It removes artificial barriers that exist currently, and lets everyone - including folks who are not from technical backgrounds - answer interesting questions.
  5. Recurrent reports: It would be really easy to have cron jobs run that periodically regenerate reports - monthly reader and editor metrics can be both generated and published automatically (see the crontab sketch after this list)!
  6. Publish more: Even if the data sources are internal, the research done on them can be published externally. It gives us an opportunity to publish rich versions of our research - along with the thought process that went into those analyses. It also enables generating aggregated versions of data that can be released publicly (being extremely careful about sensitive data, of course) and publishing them along with the notebooks.
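
As a sketch of the cross-store analysis described in item 2 above, a notebook could query Hive and MySQL in the same session and plot the results inline. This assumes the pyhive and pymysql client libraries are available on the notebook server; the host names, the EventLogging table, and the queries are illustrative placeholders, not actual service endpoints:

%matplotlib inline
import pandas as pd
import pymysql
from pyhive import hive

# Aggregate pageviews from the Hive store (host name is a placeholder).
hive_conn = hive.Connection(host='hive-server.example.wmnet', port=10000)
pageviews = pd.read_sql(
    "SELECT day, SUM(view_count) AS views "
    "FROM wmf.pageview_hourly "
    "WHERE year = 2017 AND month = 1 "
    "GROUP BY day", hive_conn)

# Pull EventLogging rows from MySQL in the same session
# (host and table names are placeholders).
mysql_conn = pymysql.connect(host='mysql-store.example.wmnet',
                             database='log',
                             read_default_file='~/.my.cnf')
events = pd.read_sql("SELECT * FROM ExampleSchema_12345 LIMIT 1000", mysql_conn)

# Manipulate and visualize inline - no manual dataset preparation needed.
pageviews.plot(x='day', y='views')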
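
Similarly, as a sketch of the recurrent reports in item 5, a single crontab entry could re-execute a notebook on a monthly schedule with nbconvert (the notebook path is a placeholder, and this assumes jupyter is on cron's PATH):

0 6 1 * * jupyter nbconvert --to notebook --execute --inplace $HOME/reports/monthly_metrics.ipynb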

Plan for SWAP (previously PAWS Internal)

  1. Build out the configuration management necessary to run Jupyter notebooks as a service
  2. Work on APIs for talking to MySQL and Hive (good support for this exists - we have to ensure it works with our datastores, handles fair scheduling of jobs, etc., and any contributions to the APIs will go to Jupyter upstream)
  3. Work on a good publishing standard for sharing notebooks (Jupyter upstream)

Future plans

  1. Forking notebooks and building on top of them
  2. Spark integration
  3. Kafka integration?

Infrastructure

This will be built on production hardware with 4 machines to ensure a highly available and stable service, with both intra- and cross-datacenter replication. The Analytics Hadoop cluster is not cross-DC at this point, but we will have to support cross-DC operation if the MySQL datastores that house analytics data (mostly on the m4 shard) are cross-DC.

Access to this service is restricted to WMF-NDA only (using LDAP-based authentication).

Usage

Access

You will need production access with SSH configured (see also the Discovery team's notes). Ask for the "researchers", "analytics-privatedata-users", or "statistics-privatedata-users" group; SWAP piggybacks on the data access rules for the Analytics cluster, and any of these three groups should work.
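
Since notebook1001.eqiad.wmnet is a production host, connections typically pass through a bastion. A minimal ~/.ssh/config stanza might look like the following sketch - the bastion host and username are placeholders, and ProxyJump requires OpenSSH 7.3 or later (older clients can use ProxyCommand instead):

Host *.eqiad.wmnet
    User your-shell-username
    ProxyJump bastion-host.wikimedia.org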

To access SWAP, enter the following in a terminal (to open an SSH tunnel):

ssh -N notebook1001.eqiad.wmnet -L 8000:127.0.0.1:8000
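
Here, -N holds the tunnel open without running a remote shell, and -L 8000:127.0.0.1:8000 forwards local port 8000 to port 8000 on the notebook host, where the Jupyter server listens.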

Then open http://localhost:8000 in your browser and log in with your LDAP (wikitech) credentials.

Sharing Notebooks

There is currently no functionality to view (Phab:T156980) or share (Phab:T156934) other users' notebooks in real time, but it is possible to copy notebooks and files directly on the server by clicking 'New' -> 'Terminal' (in the root folder in the browser window) and using the cp command, as in the example below.
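
For instance, copying a colleague's notebook into your own home directory might look like this (the username and file names are placeholders; the actual home directory layout may differ):

cp /home/otheruser/pageview_analysis.ipynb ~/pageview_analysis_copy.ipynb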

See also

Notes on how to use the prototype: