
SWAP (the Simple Wikimedia Analytics Platform, previously known as PAWS Internal) is a Jupyter notebook service for analyzing non-public data from sources like the Analytics Data Lake. To access it, you need production data access.

It is similar to the public PAWS infrastructure that lives on the Wikimedia Cloud, but uses completely different infrastructure and configuration. For an introduction to notebooks and Jupyter, see PAWS/Introduction.


Why Internal Notebooks as a service?

The need for an open notebooks infrastructure is obvious - we have many tools and bots that can leverage it, research to be done and shared on public data sources, and many other possibilities described on the PAWS/Tools page. So what would be the point of an NDA-only equivalent?

  1. Access to Analytics data: The WMF analytics infrastructure houses plenty of rich data sources - webrequests, pageviews, unique devices, browser statistics, EventLogging data and more. The team works hard to aggregate these data sources and expose them publicly, but there is a gap between the rate at which data can be made public and the demand for it, and some data cannot be made public at all. The obvious response is that anyone in the NDA group can request access to our stat boxes and query away. This is easier said than done - access requests are tedious, and once you have access you still need to learn SSH, command-line interfaces and so on. Shouldn't engineers and analysts already know this? That may be the current state of things, but there is no real need to deal with this "accidental complexity" and drudgery in order to be a good engineer or analyst. At this point we have around 30 active users of our Hadoop infrastructure, but many more people in the organization who would leverage the available data if they didn't have to pay that tax to get to it.
  2. Ease of manipulating and visualizing data: Often, folks are interested in looking at the data and plotting simple graphs to see trends. Doing this now would be tedious. There is a real need to access data across MySQL and Hadoop stores sometimes and no good way to work on them at the same time without a lot of grunt work to prepare datasets. Notebooks with good connectors to talk to different data stores and programmatically manipulate and visualize the data would go a long way in making this easy.
  3. Easy discovery of data sources
  4. Enables more research and analysis: Not only does having this interface ease research and analysis on our private data, it empowers everyone to ask questions and answer them. It removes artificial barriers that exist currently, and lets everyone - including folks who are not from technical backgrounds - answer interesting questions.
  5. Recurrent reports: It would be really easy to have cron jobs that periodically regenerate reports - monthly reader and editor metrics can be generated and published automatically!
  6. Publish more: Even if the data sources are internal, the research done on them can be published externally. It gives us an opportunity to publish rich versions of our research - along with the thought process that went into those analyses. It also enables generating aggregated versions of data that can be released publicly (being extremely careful about sensitive data, of course), and publishing them along with the notebooks.

Plan for SWAP (previously PAWS Internal)

  1. Build out the configuration management necessary to run Jupyter notebooks as a service
  2. Work on APIs for talking to MySQL and Hive (Good support for this exists - we have to ensure it works with our datastores, fair scheduling of jobs etc, and any contributions to the APIs will be to Jupyter upstream)
  3. Work on a good publishing standard for sharing notebooks (Jupyter upstream)

Future plans

  1. Forking notebooks and building on top of them
  2. Spark integration
  3. Kafka integration?



You will need production access (ask for the "researchers", "analytics-privatedata-users" or "statistics-privatedata-users" group - SWAP piggybacks on the data access rules for the Analytics cluster, so any of these three groups will work), with SSH configured (see also the Discovery team's notes).

To access SWAP, enter the following in a terminal (to open an SSH tunnel):

ssh -N notebook1003.eqiad.wmnet -L 8000:127.0.0.1:8000 # or notebook1004

Then open http://localhost:8000 in your browser and log in with your CLI username and LDAP (wikitech) password.
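Before opening the browser, you can sanity-check that the tunnel is actually listening. This is a minimal sketch (not part of the official setup); it only probes the local port that the -L flag opened:

```python
# Check whether anything is listening on the local end of the SSH tunnel.
import socket

def tunnel_is_open(port=8000, host="localhost", timeout=2):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("tunnel up:", tunnel_is_open())
```

If this prints False, the ssh command above is not running (or has exited); restart it in another terminal.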


Here's the fingerprint of notebook1003.eqiad.wmnet:

$ ssh-keyscan -t ecdsa notebook1003.eqiad.wmnet 2>/dev/null | awk '{print $3}' | base64 -d | sha256sum -b | awk '{print $1}' | xxd -r -p | base64
igDsRNg97RX38qNsbNvoJXWD8Lh6bBtfegcjAkRyqi0=

More at SSH Fingerprints/notebook1003.eqiad.wmnet.
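The shell pipeline above (base64-decode the key, SHA-256 it, re-encode as base64) can also be expressed in Python, which is handy if xxd is unavailable. This is a sketch; the function name is ours:

```python
import base64
import hashlib

def key_fingerprint(key_b64):
    """base64(sha256(raw key)) for a base64-encoded SSH host key,
    matching the output of the ssh-keyscan pipeline above."""
    raw = base64.b64decode(key_b64)
    return base64.b64encode(hashlib.sha256(raw).digest()).decode("ascii")

# Feed it the third field of `ssh-keyscan -t ecdsa notebook1003.eqiad.wmnet`
# and compare the result with the published fingerprint.
```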


JupyterLab is the next generation of the Jupyter Notebook interface, currently in beta. The beta preview is installed and usable, but is not the default. To use it, navigate to http://localhost:8000/user/<username>/lab (replace <username> with your username).

Querying data

If you have permissions to access analytics datasets in production, you should be able to do so from SWAP as well.


The sql_magic extension works with Hive (via MapReduce), Hive via Spark, and other SQL engines.

with Hive (MapReduce):
%load_ext sql_magic

from pyhive import hive
hive_conn = hive.connect('analytics1003.eqiad.wmnet', 10000)  # avoid shadowing the hive module

%config SQL.conn_name = 'hive_conn'

%%read_sql df1
SELECT page_title FROM wmf.pageview_hourly WHERE year=2017 and month=1 and day=1 and hour=0 LIMIT 10
Query started at 07:56:21 PM UTC; Query executed in 0.01 m

0	User:
1	Special:Log/!_!_!_!_!_!_!_!_!_!_!
2	User:Akhil_0950
3	User:
4	User:Daniel
5	5_рашәара
6	Ажьырныҳәа_5
7	Алахәыла:ChuispastonBot
8	Алахәыла_ахцәажәара:Oshwah
9	Алахәыла_ахцәажәара:Untifler
with Hive via Spark:
%load_ext sql_magic

import os
os.environ['SPARK_HOME'] = '/usr/lib/spark2'
import findspark
findspark.init()  # put the Spark at SPARK_HOME on the Python path
import pyspark
import pyspark.sql
conf = pyspark.SparkConf()  # Use master yarn here if you are going to query large datasets.
sc = pyspark.SparkContext(conf=conf)
spark_hive = pyspark.sql.HiveContext(sc)

%config SQL.conn_name = 'spark_hive'

%%read_sql df2
SELECT page_title FROM wmf.pageview_hourly WHERE year=2017 and month=1 and day=1 and hour=0 LIMIT 10
Query started at 08:04:48 PM UTC; Query executed in 0.12 m

0	User:
1	Special:Log/!_!_!_!_!_!_!_!_!_!_!
2	User:Akhil_0950
3	User:
4	User:Daniel
5	5_рашәара
6	Ажьырныҳәа_5
7	Алахәыла:ChuispastonBot
8	Алахәыла_ахцәажәара:Oshwah
9	Алахәыла_ахцәажәара:Untifler


See: https://gist.github.com/madhuvishy/d349c472de1279568534e4fb2b5bf505


Both pyhive and impyla are installed in all user virtualenvs by default. When connecting to Hive, use host analytics1003.eqiad.wmnet on port 10000.

pyhive example:

from pyhive import hive
cursor = hive.connect('analytics1003.eqiad.wmnet', 10000).cursor()
cursor.execute('SELECT page_title FROM wmf.pageview_hourly WHERE year=2017 and month=1 and day=1 and hour=0 LIMIT 10')
cursor.description
[('page_title', 'STRING_TYPE', None, None, None, None, True)]
# cursor.fetchall() returns the rows themselves

Problems have been reported with getting impyla to work; the following workaround may help:

# cf. https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Python-Error-TSaslClientTransport-object-has-no-attribute-trans/td-p/58033
!pip uninstall -y thrift
!pip uninstall -y impyla
!pip install thrift==0.9.3
!pip install impyla==0.13.8
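Once the workaround is in place, impyla speaks the standard DB-API, so querying looks much like the pyhive example. This is a hedged sketch: the helper name is ours, and auth_mechanism='PLAIN' is an assumption about how HiveServer2 is configured here; the import is done lazily so the function can be defined even before impyla is (re)installed.

```python
def run_hive_query(sql, host='analytics1003.eqiad.wmnet', port=10000):
    """Run a query through impyla's DB-API interface and return all rows."""
    # Lazy import: only needed when the function is actually called.
    from impala.dbapi import connect
    conn = connect(host=host, port=port, auth_mechanism='PLAIN')
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()
```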


In addition to using Spark from the Jupyter Terminal like any other terminal (see also Analytics/Systems/Cluster/Spark), Spark can now be used in notebooks with the following setup:

import os
# Set this from Python; a shell `!export` runs in a subshell and would not persist into the kernel.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master yarn --deploy-mode client'
os.environ['SPARK_HOME'] = '/usr/lib/spark2'
import findspark
findspark.init()
import pyspark
import pyspark.sql
conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)
sqlContext = pyspark.sql.HiveContext(sc)

In subsequent cells, you can use the sqlContext variable to query Spark.
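For example, a small wrapper around that variable shows the typical sql → limit → collect round trip (the helper and its limit cap are ours, not part of SWAP):

```python
def spark_sql_rows(sql_context, query, limit=1000):
    """Run `query` through a HiveContext/SQLContext and collect at most `limit` rows."""
    return sql_context.sql(query).limit(limit).collect()

# e.g.
# spark_sql_rows(sqlContext,
#                "SELECT page_title FROM wmf.pageview_hourly "
#                "WHERE year=2017 AND month=1 AND day=1 AND hour=0")
```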

Sharing Notebooks

There is currently no functionality to view (Phab:T156980) or share (Phab:T156934) other users' notebooks in real-time, but it is possible to copy notebooks and files directly on the server by clicking 'New' -> 'Terminal' (in the root folder in the browser window) and using the cp command.
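The same copy can be done from a notebook cell instead of the terminal, using Python's standard library (the paths below are hypothetical examples, not real SWAP home directories):

```python
import shutil

def copy_notebook(src, dst):
    """Copy a notebook file, preserving timestamps; returns the destination path."""
    return shutil.copy2(src, dst)

# e.g. copy_notebook('/home/otheruser/analysis.ipynb', 'analysis-copy.ipynb')
```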

Python virtual environment

All your Python notebooks live inside a virtual environment automatically created at ~/venv. To activate it directly from a terminal, SSH into the SWAP server and run source ~/venv/bin/activate.

See also

Notes on how to use the prototype:


My Python kernel will not start

Your IPython configuration may be broken. Try deleting your ~/.ipython directory (you'll lose any configurations you've made or extensions you've installed, but it won't affect your notebooks, files, or Python packages).



JupyterHub is built from the analytics/deploy/jupyterhub repository and installed via the jupyterhub Puppet module. There are a few steps to updating the deployment.

  1. To upgrade or add new packages, edit the frozen-requirements.txt file to specify exactly which packages you want.
  2. On a non-production machine (in Cloud VPS, or MediaWiki-Vagrant), run the build_wheels.sh script. This creates frozen wheels in the artifacts/ directory. Commit and merge the changes.
  3. To deploy, git pull in /srv/jupyterhub/deploy on each of the notebook servers, then run the create_virtualenv.sh script. This builds a new virtualenv from the updated frozen wheel artifacts.
  4. Run service jupyterhub restart to have JupyterHub run from the newly built /srv/jupyterhub/venv.

User virtualenvs

Upon first login, each user will automatically have a new python3 virtualenv created at $HOME/venv. The users themselves can pip install packages into this virtualenv as part of regular Jupyter notebook usage. If you need to update the automatically installed packages in user virtualenvs that have already been created, you'll have to do so manually.

Updating user virtualenvs

If you upgrade JupyterHub or any of the packages listed in frozen-requirements.txt, you might want to upgrade the installed versions of these packages in each user's virtualenv too. To do so, you want to rerun the pip install command that was used during the virtualenv creation. As of 2018-03, this was pip install --upgrade --no-index --ignore-installed --find-links=/srv/jupyterhub/deploy/artifacts/stretch/wheels --requirement=/srv/jupyterhub/deploy/frozen-requirements.txt. To do this for all users:

wheels_path=/srv/jupyterhub/deploy/artifacts/stretch/wheels
for u in $(getent passwd | awk -F ':' '{print $1}'); do
    # Each user's virtualenv lives at $HOME/venv
    venv="$(getent passwd "$u" | awk -F ':' '{print $6}')/venv"
    if [ -d "$venv" ]; then
        echo "Updating $venv"
        sudo -u "$u" "$venv/bin/pip" install --upgrade --no-index --ignore-installed --find-links=$wheels_path --requirement=/srv/jupyterhub/deploy/frozen-requirements.txt
    fi
done