From Wikitech

The analytics clients include a hosted version of Jupyter, allowing us to analyze our internal data with notebooks. To access it, you need production data access and to follow the Kerberos procedure explained below.

This service was previously called SWAP. It is similar to the public PAWS infrastructure that lives on the Wikimedia Cloud, but uses completely different infrastructure and configuration.

Access and infrastructure

You will need production shell access in the analytics-privatedata-users group, with your SSH configured correctly (see also the Discovery team's notes).

You'll also need wmf or nda LDAP access. For more details, see Analytics/Data access#LDAP access.

Once you have this access, you need to open an SSH tunnel to one of the analytics clients. You can do this with the following terminal command: ssh -N stat1005.eqiad.wmnet -L 8000:, replacing stat1005 with the name of any other client if you prefer.

Then, open localhost:8000 in your browser and log in with your developer shell username and password.


Once you've logged in, if you want to access data from Hadoop, you will need to authenticate with Kerberos. Click New -> Terminal (at the bottom in the Other section). In the new terminal, type kinit. Once the password prompt appears, type in your Kerberos password.

Note: having a valid Kerberos ticket for your username on the notebook machine (notebook1003, for instance) from an SSH session does not give the notebook access. You need to remove that ticket using kdestroy and then create a new one in the notebook session as described above. The issue seems related to the Python virtual environments used to run the notebook not having access to non-virtualenv settings.

If you receive Kerberos errors from Jupyter notebooks, try the following:

  1. Login to the notebook server directly via SSH and run kdestroy.
  2. Open a tunnel and port forward to the notebook server and login to Jupyter.
  3. In Jupyter open the inbuilt terminal.
  4. Run kinit from the inbuilt terminal.
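As a terminal sketch of those steps (klist is not mentioned above, but it is the standard way to confirm the result):

```shell
# Step 1, in an SSH session on the notebook server:
kdestroy

# Step 4, in the Jupyter inbuilt terminal:
kinit    # prompts for your Kerberos password
klist    # optional: confirm the new ticket is present
```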


JupyterLab is the new Jupyter notebook interface, currently in beta. It is installed and usable, but is not the default. To use it, navigate to http://localhost:8000/user/<username>/lab (replace <username> with your username).

Known issues

If your browser session disconnects from the kernel on the server (for example, if your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel. However, no further display output for that work (like print() calls to log progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).


JupyterLab is being updated frequently, and it's possible to update your own version if the preinstalled version is old.

  1. Launch a Jupyter terminal and update the package by running pip install --upgrade jupyterlab.
  2. Restart your Jupyter server: go to the classic interface (/user/YOUR-USERNAME/tree), click on "control panel" in the top right, click "stop my server", and then click "my server" on the resulting page.

Querying data

If you have permissions to access analytics datasets in production, you should be able to do so from these Jupyter notebooks as well.

The easiest way: wmfdata

The Product Analytics team maintains a couple of software packages to make accessing data from the analytics clients as easy as possible by hard-coding all the setup and configuration:

For Python, there is wmfdata-python. It can access data through MariaDB, Hive, and Spark and has a number of other useful functions, like creating custom Spark sessions.

For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.
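As a sketch of what this looks like in Python — the hive.run call and the example query are assumptions about the wmfdata-python API, which may differ by version:

```python
# Hypothetical wmfdata-python usage: run a Hive query and get a pandas DataFrame.
from wmfdata import hive

pageviews = hive.run("""
    SELECT access_method, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2020 AND month = 1 AND day = 1
    GROUP BY access_method
""")
```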


A SparkSession named spark is automatically created for you when using a PySpark notebook (the menu for selecting the notebook type is in the upper right corner).

import pandas as pd
# needed to show matplotlib plots inline
%matplotlib inline
import matplotlib.pyplot as plt

# Example values for the query parameters (chosen for illustration):
page_title = 'Main_Page'
year, month, day = 2019, 1, 1

query = """
    select access_method, count(*) as c
    from wmf.webrequest
    where is_pageview = true and agent_type = 'user'
        and year = {1} and month = {2} and day = {3}
        and hour in (1, 2, 3)
        and pageview_info['page_title'] = '{0}'
    group by access_method
    order by c desc
    limit 4
""".format(page_title, year, month, day)

df = spark.sql(query)

pandas_df = df.toPandas()'access_method', y='c', rot=0)


With Hive via Impyla

We need specific versions of Impyla and some dependencies to work with Hive and Kerberos. (PyHive seems to need an unreleased version of thrift_sasl to work with Kerberos and Python 3.)

See also this suggested workaround for 'TSocket' object has no attribute 'isOpen' errors.

# Make sure we have compatible versions of the things we need
!pip install  impyla==0.15.0 sasl==0.2.1 thrift_sasl==0.2.1 thriftpy==0.3.9 thriftpy2==0.4.0 numpy==1.12.1

from impala.dbapi import connect
from impala.util import as_pandas

cursor = connect(host="an-coord1001.eqiad.wmnet", port=10000, auth_mechanism='GSSAPI', kerberos_service_name="hive").cursor()
cursor.execute('SELECT page_title, revision_text FROM wmf.mediawiki_wikitext_history WHERE snapshot="2019-10" AND wiki_db="testwiki" and revision_text_bytes > 1000 and revision_text_bytes < 10000 limit 10')

df = as_pandas(cursor)

	page_title	revision_text
0	User:Rutilant/New.js	\n// Only add edit count button on user pages\...
1	User:Masumrezarock100/test.js	//A part of [[User:Masumrezarock100/mobilemore...
2	Victor Vlad Delamarina	* <s>[[:commons:category:Victor Vlad Delamarin...
3	Template:Old AfD multi	{| class="messagebox standard-talk" style="tex...

There seems to be an issue with queries returning a large amount of data, which fail with:

TTransportException: TTransportException(type=0, message=b'Error in sasl_decode (-1) SASL(-1): generic failure: Unable to find a callback: 32775')

This seems to be an Impyla problem, since behind the scenes it uses the sasl PyPI package (a wrapper around the Cyrus SASL C library), originally created by Cloudera but no longer maintained. There is still no fix for this issue, so we suggest moving to PySpark (which seems to work fine).
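For instance, the Impyla query above can be run through Spark instead, using the spark variable that PySpark notebooks provide automatically (a sketch, not a verified workaround for every case):

```python
# Same query as the Impyla example, run through Spark's Hive support instead;
# `spark` is the SparkSession predefined in PySpark notebooks.
df = spark.sql("""
    SELECT page_title, revision_text
    FROM wmf.mediawiki_wikitext_history
    WHERE snapshot = '2019-10' AND wiki_db = 'testwiki'
      AND revision_text_bytes > 1000 AND revision_text_bytes < 10000
    LIMIT 10
""").toPandas()
```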

With Hive via Spark and sql_magic

%load_ext sql_magic

import findspark, os
os.environ['SPARK_HOME'] = '/usr/lib/spark2'
findspark.init()
import pyspark
import pyspark.sql
conf = pyspark.SparkConf()  # Use master yarn here if you are going to query large datasets.
sc = pyspark.SparkContext(conf=conf)
spark_hive = pyspark.sql.HiveContext(sc)

%config SQL.conn_name = 'spark_hive'

%%read_sql df2
SELECT page_title FROM wmf.pageview_hourly WHERE year=2017 and month=1 and day=1 and hour=0 LIMIT 10
Query started at 08:04:48 PM UTC; Query executed in 0.12 m

0	User:
1	Special:Log/!_!_!_!_!_!_!_!_!_!_!
2	User:Akhil_0950
3	User:
4	User:Daniel
5	5_рашәара
6	Ажьырныҳәа_5
7	Алахәыла:ChuispastonBot
8	Алахәыла_ахцәажәара:Oshwah
9	Алахәыла_ахцәажәара:Untifler




Spark notebooks are available via Apache Toree and IPython (for PySpark). Kernels are available for both local and YARN-based Spark notebooks in Scala, Python, SQL, and R.

Note that once you open a Python Spark notebook, you don't need to create a SparkSession or SparkContext; one is automatically provided as the variable spark.

Note the 'large' YARN Spark notebook options. These launch Spark in YARN with --executor-memory 4g and a memory overhead of 2g. See also this ticket for more info. If you need to launch Spark with different settings, refer to Custom Spark Kernels below.

Also, pyspark != Python 3! Your Python 3 Notebook will run out of your local virtualenv on the Jupyter Notebook host node, and as such you can !pip install things there. However, Spark is expected to run distributed, and your local virtualenv will not be available on remote worker nodes. If you need python packages installed to work with pyspark, you'll need to submit a Phabricator request for them.

Spark with Brunel

Brunel is a visualization library that works well with Spark and Scala in a Jupyter Notebook. We deploy a Brunel jar with Jupyter. You just need to add it as a magic jar:

%AddJar -magic file:///srv/jupyterhub/deploy/spark-kernel-brunel-all-2.6.jar

import org.apache.spark.sql.DataFrame
val seq = (0 until 3).map(i => (i, i)).toSeq
val df = spark.sqlContext.createDataFrame(seq)

%%brunel data('df') x(_1) y(_2) bar  style("fill:red") filter(_1:2) :: width=300, height=300

See for more Brunel examples.

Custom Spark Notebook Kernels

Jupyter Notebook kernels are configured in JSON files. The default ones are installed in /usr/local/share/jupyter/kernels, but you can create your own custom kernel spec files in your home directory in ~/.local/share/jupyter/kernels.

There are a few reasons why you might want to run a custom kernel. For example, Jupyter does not by default know how to read Avro files; doing so requires a set of extra dependency jars passed to the environment when it starts, and the way to do that is to set up a kernel that passes them.

There are two ways to launch Spark with custom kernel settings. You can either create a new kernel spec that will launch Spark with your custom settings, OR you can run a plain old python notebook and instantiate a SparkSession directly.

Permanent custom Spark Notebook kernels

This method will allow you to select and start the custom notebook in the JupyterHub UI.

The Spark kernels included in our Jupyter servers all have hardcoded Spark options. There isn't a good way to make a Jupyter Notebook prompt the user for settings before the Notebook is launched, so we have to hardcode the options given to the Spark shell. If you need a Spark Notebook (or any kind of Notebook) with custom settings, you'll need to create a new kernelspec in your user's Jupyter kernels directory. The easiest way to do this is to install a new kernelspec from an existing one, and then edit the kernel.json file.

# Activate your Jupyter virtualenv (if it isn't already activated):
[@notebook1004:/home/otto] $ . ./venv/bin/activate

# Use jupyter kernelspec install to copy a global kernelspec into your user kernel directory, changing the kernel name on the way.
[@notebook1004:/home/otto] [venv] $ jupyter kernelspec install --user --name 'spark_yarn_pyspark_otto1' /usr/local/share/jupyter/kernels/spark_yarn_pyspark

# Edit the kernel.json file to change your settings.  Here, we change the display name and the --executor-memory:
[@notebook1004:/home/otto] [venv] $ vim ~/.local/share/jupyter/kernels/spark_yarn_pyspark_otto1/kernel.json
  "argv": [
  "language": "python",
  "display_name": "PySpark - YARN - 16g Executor (otto custom)",
  "env": {
    "PYSPARK_PYTHON": "/usr/bin/python3",
    "SPARK_HOME": "/usr/lib/spark2",
    "PYTHONPATH": "/usr/lib/spark2/python/lib/",
    "PYTHONSTARTUP": "/usr/lib/spark2/python/pyspark/",
    "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell --conf spark.dynamicAllocation.maxExecutors=32 --executor-memory 16g --executor-cores 4 --driver-memory 4g"

Once done, refresh Jupyter in your browser, and you should see the newly created Notebook Kernel show up for use.

Launching as SparkSession in a Python Notebook

For this method, you will run the following in a regular Python notebook (not a Spark specific notebook kernel). This allows you to specify the spark session configuration directly, rather than relying on a static notebook kernel spec config file.

Note: When used in Yarn mode, this example starts a big spark application in the cluster (~30% of resources).

import findspark
findspark.init('/usr/lib/spark2')
from pyspark.sql import SparkSession
import os

master = 'yarn'  # or 'local'
app_name = 'my_app_name'

builder = (
    SparkSession.builder
    .appName(app_name)
    .config(
        'spark.driver.extraJavaOptions',
        ' '.join('-D{}={}'.format(k, v) for k, v in {
            'http.proxyHost': 'webproxy.eqiad.wmnet',
            'http.proxyPort': '8080',
            'https.proxyHost': 'webproxy.eqiad.wmnet',
            'https.proxyPort': '8080',
        }.items()))
    .config('spark.jars.packages', 'graphframes:graphframes:0.6.0-spark2.3-s_2.11')
)
if master == 'yarn':
    # venv.zip is an example name for the zipped virtualenv created in the
    # "PySpark in YARN with python dependencies" section; '#venv' unpacks it
    # as 'venv' on each executor.
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--archives ~/venv.zip#venv pyspark-shell'
    os.environ['PYSPARK_PYTHON'] = 'venv/bin/python'
    builder = (
        builder
        .master('yarn')
        .config('spark.sql.shuffle.partitions', 512)
        .config('spark.dynamicAllocation.maxExecutors', 128)
        .config('spark.executor.memory', '8g')
        .config('spark.executor.cores', 4)
        .config('spark.driver.memory', '4g')
    )
elif master == 'local':
    builder = (
        builder
        .master('local[12]')
        .config('spark.driver.memory', '8g')
    )
else:
    raise Exception('master must be yarn or local')

spark = builder.getOrCreate()
Setting the foundations

Breaking this into a few pieces, first the findspark package must be installed to your Jupyter server.

!pip install findspark

findspark can then be called to bring the system Spark 2.x libraries into your environment:

import findspark
findspark.init('/usr/lib/spark2')

from pyspark.sql import SparkSession
import os

The SparkSession builder is then used to give the spark app a name and specify some useful defaults. In this example we provide appropriate proxies to the spark driver to download the graphframes package. Other spark dependencies can be installed in a similar fashion. For python library dependencies, continue reading.

Note: In this example the started spark-yarn application uses a medium amount of the cluster resources (~15%)

builder = (
    SparkSession.builder
    .appName('example spark application')
    .config(
        'spark.driver.extraJavaOptions',
        ' '.join('-D{}={}'.format(k, v) for k, v in {
            'http.proxyHost': 'webproxy.eqiad.wmnet',
            'http.proxyPort': '8080',
            'https.proxyHost': 'webproxy.eqiad.wmnet',
            'https.proxyPort': '8080',
        }.items()))
    .config('spark.jars.packages', 'graphframes:graphframes:0.6.0-spark2.3-s_2.11')
    .config('spark.sql.shuffle.partitions', 256)
    .config('spark.dynamicAllocation.maxExecutors', 64)
    .config('spark.executor.memory', '8g')
    .config('spark.executor.cores', 4)
    .config('spark.driver.memory', '2g')
)
Application Master

Spark can be run either in local mode, where all execution happens on the machine the notebook is started from, or in YARN mode, which is fully distributed.

Local mode is especially convenient for initial notebook development. It is typically more responsive than cluster mode, and allows prototyping code against small samples of the data before scaling up to full-month or multi-month datasets. Dependencies are also easy to handle in local mode: nothing special has to be done, since anything installed with !pip install will be available in all executor processes. In local mode, the number of Java threads can be specified using brackets, such as local[12]. It is recommended to provide more memory than the default when running in local mode; otherwise, a dozen Java executor threads might fight over a couple GB of memory.

Note: Using spark in local mode implies some knowledge of the resources of the local machine it runs on (number of CPU cores and amount of memory). Please don't use the full capacity of local machines! Some other users might be working with it as well :)

    .master('local[12]')
    .config('spark.driver.memory', '8g')

Local mode has its limits. When more compute is necessary, set the master to yarn and specify limits on the amount of compute to use. Spark will execute one task per core, so 64 executors with 4 cores each give 256 parallel tasks, with 64 * 8G = 512G of total memory to work on the notebook (this represents ~15% of the whole cluster). Finally, the Spark driver needs some extra memory: 2g instead of the default 1g. Since the driver is responsible for collecting computed data onto the notebook (if needed) as well as managing all the application's tasks, the extra memory should prevent it from crashing too quickly.

    .master('yarn')
    .config('spark.dynamicAllocation.maxExecutors', 64)
    .config('spark.executor.memory', '8g')
    .config('spark.executor.cores', 4)
    .config('spark.driver.memory', '2g')
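The sizing arithmetic in the paragraph above can be checked directly:

```python
# Sanity-check the YARN sizing: tasks = executors * cores, memory = executors * memory each.
executors = 64
cores_per_executor = 4
executor_memory_gb = 8

parallel_tasks = executors * cores_per_executor
total_memory_gb = executors * executor_memory_gb
print(parallel_tasks, total_memory_gb)  # 256 512
```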

PySpark in YARN with python dependencies

PySpark in YARN mode with dependencies can be managed by shipping a virtualenv containing all the dependencies to each executor. There are two options for packaging up the virtualenv. The simplest is to use a notebook bang command to zip up the current virtualenv. For more advanced usage see venv-pack.

!cd venv; zip -qur ~/venv.zip .
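Alternatively, with venv-pack (the output filename is an example; -p points at the environment to pack and -o names the archive):

```shell
# Package the notebook virtualenv with venv-pack instead of zip
pip install venv-pack
venv-pack -p ~/venv -o ~/venv.tar.gz
```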

Two environment variables work together to tell spark to ship the virtualenv to the executors and run workers from the included python executable. First is PYSPARK_SUBMIT_ARGS which must be provided an --archives parameter. This parameter is a comma separated list of file paths. Each path can be suffixed with #name to decompress the file into the working directory of the executor with the specified name. The final segment of PYSPARK_SUBMIT_ARGS must always invoke pyspark-shell.

os.environ['PYSPARK_SUBMIT_ARGS'] = '--archives ~/venv.zip#venv pyspark-shell'

In PYSPARK_SUBMIT_ARGS we instructed spark to decompress a virtualenv into the executor working directory. In the next environment variable, PYSPARK_PYTHON, we instruct spark to start executors using python provided in that virtualenv.

os.environ['PYSPARK_PYTHON'] = 'venv/bin/python'
Ready to go

After the builder is fully configured spark can be started and utilized as normal.

spark = builder.getOrCreate()
Other working examples

An example notebook has been put together demonstrating loading the ELMo package from tensorflow hub and running it across many executors in the yarn cluster.

Custom Python kernels

Custom python kernels through virtual environments are not supported in our Jupyter service. However you can still use them by manually running your own notebooks and connecting them to a spark job. See Analytics/Systems/Cluster/Spark.

Miscellaneous use

Sharing notebooks

There is currently no direct functionality to view (Phab:T156980) or share (Phab:T156934) other users' notebooks in real time, but it is possible to copy notebooks and files directly on the server by clicking 'New' -> 'Terminal' (in the root folder in the browser window) and using the cp command.


It's also possible to track your notebooks in Git and push them to GitHub, which will display them fully rendered on its website. If you want to do this, you should connect to GitHub using HTTPS. The SRE recommends not using SSH because of the risk that other users could access your SSH keys (which, if combined with a production SSH key reused for GitHub, could result in a serious security breach).

With HTTPS, by default you'll have to type in your GitHub username and password every time you push. You can avoid this by adding the following (from this Superuser answer) to ~/.gitconfig:

[url ""]
    insteadOf =

    helper = cache --timeout=28800

This will automatically apply your GitHub user name to any HTTPS access, and then cache the password you enter for 8 hours (28 800 seconds).

You can also set up a personal access token to use for authentication over HTTPS. This will allow you to not enter your GitHub password every time. These tokens can also be limited in what access they have, e.g. they can be set to only be able to modify repositories.

HTML files

You can also export your notebook as an HTML file, using File > Download as... in the JupyterLab interface or the jupyter nbconvert --to html command.
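From a terminal, the command form looks like this (the notebook filename is an example):

```shell
# Convert a notebook to a standalone HTML file next to the original
jupyter nbconvert --to html my_analysis.ipynb
```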

If you want to make the HTML file public on the web, you can use the web publication workflow.

Python virtual environment

All your Python notebooks will live inside a virtual environment automatically created in ~/venv. If you want to enter it directly from the terminal on your computer, SSH into the analytics client and type source ~/venv/bin/activate.

HTTP requests

Allow HTTP requests (for example, to enable your notebook to clone a repo) by adding export lines to your .bash_profile in the notebook's terminal.
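The exact export lines aren't spelled out here; based on the webproxy host and port used for the Spark proxy configuration elsewhere on this page, they are presumably something like:

```shell
# Assumed proxy settings (same host/port as the Spark driver proxy config above)
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
```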

Sending emails from within a notebook

To send out an email from a Python notebook (e.g. as a notification that a long-running query or calculation has completed), one can use the following code:

# cf. :
notebookservername = !hostname
notebookserverdomain =  notebookservername[0]+'.eqiad.wmnet'
username = !whoami

def send_email(subject, body, to_email = username[0]+'', from_email = username[0]+'@'+notebookserverdomain):
    import smtplib
    smtp = smtplib.SMTP("localhost")
    message = """From: <{}>
To: <{}>
Subject: {}

{}""".format(from_email, to_email, subject, body)
    smtp.sendmail(from_email, [to_email], message)

# example uses: 
# send_email('Jupyter notebook ready (n/t)', '')
# send_email('Jupyter email test', 'test body', '', '')

(Invoking the standard mail client via the shell, i.e. !mailx or !heirloom-mailx, fails for some reason, see phab:T168103.)

I want python 3+

Using R in Python through rpy2

rpy2 enables access to R in Python, and is particularly useful in Jupyter notebooks in that you can for instance use ggplot2 for your visualizations. As of March 2020, installing rpy2 on the Jupyter servers is not necessarily straightforward as the latest version of it is incompatible. Using a version lower than 3.1 appears to work, e.g.

  pip install --upgrade "rpy2 < 3.1"
  pip install tzlocal

The second call is needed because this version of rpy2 fails to install the required "tzlocal" package. Once installed, you should be able to load the extension with the following code in a notebook:

   %load_ext rpy2.ipython

And once that is done, use the %R and %%R magic words to execute R code in the notebook.
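For example, a cell can pass a pandas DataFrame into R and plot it with ggplot2 (a sketch; pandas_df and its access_method/c columns are assumed from the earlier PySpark example, and -i is rpy2's input-variable flag):

```
%%R -i pandas_df
library(ggplot2)
ggplot(pandas_df, aes(x = access_method, y = c)) + geom_col()
```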

See also

Notes on how to use the prototype:


Troubleshooting

My Python kernel will not start

Your IPython configuration may be broken. Try deleting your ~/.ipython directory (you'll lose any configurations you've made or extensions you've installed, but it won't affect your notebooks, files, or Python packages).

My kernel restarts when I run a large query

It may be that the notebook server ran out of memory and the operating system's out of memory killer decided to kill your kernel to cope with the situation. You won't get any notification that this has happened other than the notebook restarting, but you can assess the state of the memory on the notebook server by checking its host overview dashboard in Grafana (notebook1003 or notebook1004) or using the command line to see which processes are using the most (with ps aux --sort -rss | head or similar).



Administration

JupyterHub is built and installed from the analytics/deploy/jupyterhub repository and the jupyterhub Puppet module. There are a few steps to updating the deployment.

To upgrade or add new packages, you should edit the frozen-requirements.txt file to specify exactly which packages you want. Then, on a non production machine (in Cloud VPS, or mediawiki vagrant), run the script. This will create frozen wheels in the artifacts/ directory. Commit and merge the changes. To deploy these, you should git pull in /srv/jupyterhub/deploy on each of the notebook servers, and then run the script. This will build a new virtualenv from the updated frozen wheel artifacts. You can then service jupyterhub restart to have JupyterHub run from the newly built /srv/jupyterhub/venv.

Spark Integration

Spark integration is handled by global custom kernels installed into /usr/local/share/jupyter/kernels. pyspark kernels are a custom iPython kernel that loads pyspark. All other Spark kernels use Apache Toree.

These kernels are installed by the script that should be run during deployment. If you need to update them, you should modify the kernel.json files in the jupyterhub-deploy repository.

User virtualenvs

Upon first login, each user will automatically have a new python3 virtualenv created at $HOME/venv. The users themselves can pip install packages into this virtualenv as part of regular Jupyter notebook usage. If you need to update the automatically installed packages in user virtualenvs that have already been created, you'll have to do so manually.

Updating user virtualenvs

If you upgrade JupyterHub or any of the packages listed in frozen-requirements.txt, you might want to upgrade the installed versions of these packages in each user's virtualenv too. To do so, you want to rerun the pip install command that was used during the virtualenv creation. As of 2018-03, this was pip install --upgrade --no-index --ignore-installed --find-links=/srv/jupyterhub/deploy/artifacts/stretch/wheels --requirement=/srv/jupyterhub/deploy/frozen-requirements-stretch.txt. To do this for all users:

cd /srv/jupyterhub/deploy
wheels_path=/srv/jupyterhub/deploy/artifacts/stretch/wheels
for u in $(getent passwd | awk -F ':' '{print $1}'); do
    venv=/home/$u/venv
    if [ -d $venv ]; then
        echo "Updating $venv"
        sudo -H -u $u $venv/bin/pip install --upgrade --no-index --force-reinstall --find-links=$wheels_path --requirement=/srv/jupyterhub/deploy/frozen-requirements-stretch.txt
    fi
done

Note: on machines running Debian Buster (e.g. stat1005, stat1008) replace stretch with buster in the lines above.

Resetting user virtualenvs

Sometimes someone may want to totally recreate their Jupyter virtualenv from scratch. This can be done by the user themselves! The steps are as follows:

# 1. Stop your Jupyter Notebook server from the JupyterHub UI.

# 2. Move your old venv out of the way (or just delete it)
mv $HOME/venv $HOME/venv-old-$(date +%s)

# 3. create a new empty venv
python3 -m venv --system-site-packages $HOME/venv

# 4. Reinstall the jupyter venv
cd /srv/jupyterhub/deploy

# Instead of manually changing stretch->buster on stat1005 & stat1008:
source /etc/os-release
$HOME/venv/bin/pip install --upgrade --no-index --force-reinstall --find-links=/srv/jupyterhub/deploy/artifacts/$VERSION_CODENAME/wheels --requirement=/srv/jupyterhub/deploy/frozen-requirements-$VERSION_CODENAME.txt

# 5. Login to JupyterHub and restart your Jupyter Notebook server.

Machine ran out of space

As noted in known issues, files deleted through the Jupyter UI are moved to .local/share/Trash in the user's home directory. If users report that they can't save their notebooks because the machine has run out of space, chances are the trash directories need to be emptied.
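A sketch for finding the offending directories (paths from the note above; run with appropriate privileges on the affected host):

```shell
# Show the largest per-user Jupyter trash directories, biggest last
du -sh /home/*/.local/share/Trash 2>/dev/null | sort -h | tail
```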


Newpyter

These docs will be incorporated into the main documentation above once the new conda-based JupyterHub has replaced all of the virtualenv-based JupyterHubs.

We are in the process of migrating away from single-virtualenv-based JupyterHub and Jupyter installs to our installation of Anaconda with stacked user conda environments.

The new JupyterHub setup is hosted on port 8880, so open an SSH tunnel to a stat box (just stat1008 at the moment) like:

 ssh -N stat1008.eqiad.wmnet -L 8880:

Navigate to http://localhost:8880.

When you log in and start your server, you should be prompted with a drop down list of conda environments to choose from. The base anaconda-wmf environment is read only, meaning you cannot install packages there. To create a new conda environment, select 'Create and use new stacked conda environment...'. This will run conda-create-stacked to create a new conda environment in your ~/.conda/envs directory, as well as activate that environment for use with your Jupyter Notebook Server.

If you already have a conda environment created, you should be able to select it from the drop down.

Once you've launched your Jupyter Notebook Server with a conda environment, you should be able to conda and pip install into that conda environment at will.

Creating and destroying conda environments is cheap, so feel free to create new ones as often as you like. If you want to launch Jupyter with a different conda environment, you need to stop your Jupyter server from the JupyterHub interface and start a new one, selecting the desired conda environment.

Conda stacked envs

See: Analytics/Systems/Anaconda#stacked_conda_environments

Newpyter TODO

  • Yarn Spawners
  • Remove custom spark and other jupyterhub kernels
  • Support more language kernels by default: Scala (via Almond?).