Analytics/Systems/Jupyter

The analytics clients include a hosted version of JupyterHub, allowing access to internal data with Jupyter Notebooks.


Overview

JupyterHub is a multi-tenant Jupyter Notebook Server launcher. It runs on each of the analytics clients (AKA stat boxes). Users open an SSH tunnel to the JupyterHub service, log in through a browser, and choose or create a Conda environment from which to run their Jupyter Notebook Server.

Access

To access JupyterHub, you need shell (SSH) access to the analytics clients and an LDAP account; to query data in Hadoop, you will also need Kerberos credentials (see below).

Once you have this access, open an SSH tunnel to one of the analytics clients, e.g.

 ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880

replacing stat1005 with the hostname of any other analytics client (AKA stat box) if you prefer.

Then, open localhost:8880 in your browser and log in with your shell username and LDAP password. You'll be prompted to select or create a Conda environment. See the section on Conda environments below.

Note that this will give you access to your Jupyter Notebook Server on the chosen analytics client host only. Notebooks and files are saved to your home directory on that host. If you need shared access to files, consider putting those files in HDFS.
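
For example, a file can be copied from your home directory on the stat box into HDFS from a Python notebook. The following is a rough sketch using the hdfs CLI via subprocess; the paths are illustrative, and you'll need a Kerberos ticket first, as described in the next section.

import subprocess

# Copy a local file into your HDFS home directory (requires a valid Kerberos ticket).
# Replace "your-shell-username" with your actual shell username.
subprocess.run(
    ["hdfs", "dfs", "-put", "results.csv", "/user/your-shell-username/results.csv"],
    check=True,
)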

Authenticate to Hadoop via Kerberos

Once you've logged in, if you want to access data from Hadoop, you will need to authenticate with Kerberos.

This can be done either in a regular SSH terminal session or in a Jupyter Terminal.

In a terminal session, just type

 kinit

You'll be prompted for your Kerberos password.

Querying Analytics Cluster Datasets

The Product Analytics team maintains software packages that make accessing data from the analytics clients as easy as possible by handling all the setup and configuration for you.

For Python, there is wmfdata-python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating SparkSessions (see below).
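
For example, a Hive query can be run and returned as a Pandas DataFrame. This is a minimal sketch assuming a recent release of wmfdata-python; exact function names can vary between versions.

import wmfdata

# Run a Hive query and return the result as a Pandas DataFrame.
edits = wmfdata.hive.run("""
    SELECT wiki_db, COUNT(*) AS edit_count
    FROM wmf.mediawiki_history
    WHERE snapshot = '2021-01'
      AND event_entity = 'revision'
    GROUP BY wiki_db
    ORDER BY edit_count DESC
    LIMIT 10
""")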

For more advanced usage, see Analytics/Systems/Jupyter/Tips#Custom_PySpark_Notebook_Kernels

For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.

For Scala-Spark and Spark-SQL, you need to install your own kernels in your Conda environment and use Apache Toree (see below). NOTE: Toree is a relatively inactive project.

PySpark and wmfdata

It is possible to create and use a custom Jupyter Notebook kernel to instantiate a PySpark session. However, a predefined kernel has to fix all of its Spark options in advance, making it impossible to customize the SparkSession to your needs. Instead, it is recommended to use a regular Python notebook and use either wmfdata-python or the findspark package to instantiate your Python SparkSession.
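
The following is a minimal sketch of the findspark approach; the settings are illustrative, and wmfdata's session helpers (shown further below) are usually simpler.

import findspark
findspark.init("/usr/lib/spark2")  # SPARK_HOME on the analytics clients

from pyspark.sql import SparkSession

# Build a SparkSession with whatever options you need.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("my-notebook")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)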

wmfdata has a simplified Spark run function that allows you to quickly run SQL to access data in Hive via Spark as a Pandas DataFrame.

import wmfdata

pandas_df = wmfdata.spark.run(
    """
    SELECT meta.domain, count(*)
    FROM event.mediawiki_page_create
    WHERE year=2021 AND month=1 AND day=1 and hour=0
    GROUP BY meta.domain
    ORDER BY count(*) DESC
    LIMIT 10
    """
)

                  domain  count(1)
0  commons.wikimedia.org      1491
1       en.wikipedia.org       308
2      mg.wiktionary.org       209
3       www.wikidata.org       126
4       fr.wikipedia.org       117
5      uk.wikisource.org        87
6       ar.wikipedia.org        61
7       ur.wikipedia.org        56
8       pl.wikipedia.org        36
9      hyw.wikipedia.org        32

The run function should only be used with smallish result sets, as it pulls all results into memory in the Jupyter Notebook server.

You can also just instantiate a SparkSession and use it directly.

import wmfdata

# Get a predefined and preconfigured SparkSession type using get_session.
spark = wmfdata.spark.get_session(type='yarn-large')

# Or get a totally customizable SparkSession using get_spark_session.
spark = wmfdata.spark.get_spark_session(
    master='yarn',
    spark_config={
        'spark.executor.memory': '4g'
    }
)

# If you have locally installed dependencies that you need on remote YARN Spark executors,
# wmfdata.spark.get_session and wmfdata.spark.get_custom_session
# have a ship_python_env option, which will automatically
# pack and ship your current conda environment to the remote executors
# and cause them to use its Python interpreter and dependencies.
spark = wmfdata.spark.get_spark_session(
    master='yarn',
    ship_python_env=True
)

By default, YARN-based SparkSessions used by run will time out after 5 minutes of inactivity.

Also note that you can only have one active SparkSession instance per notebook at a time. The provided parameters are only applied the first time the SparkSession is instantiated; subsequent calls with different configuration parameters will not modify the SparkSession unless it is first stopped (or has timed out).
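
For example, to apply different configuration you can stop the existing session and create a new one. This is a sketch reusing the get_spark_session call from the example above.

# Stop the current session so that new configuration takes effect.
spark.stop()

spark = wmfdata.spark.get_spark_session(
    master='yarn',
    spark_config={
        'spark.executor.memory': '8g'
    }
)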

Scala-Spark or Spark-SQL using Toree

To use either Scala-Spark or Spark-SQL notebooks you need to have Apache Toree available in your conda environment.

An easy way to do so is to install it via the notebook terminal interface: in your notebook interface, click New -> Terminal, and in the terminal run pip install toree. And that's it :)

Now you can create a Jupyter kernel that uses Toree as a gateway between the notebook and a Spark session running on the cluster (note: the Spark session is managed by Toree, so there is no need to create it manually).

To create both Scala-Spark and Spark-SQL kernels, run the following in your notebook terminal:

NOTE: Please change the kernel name and the Spark options as you see fit - you can find the default wmfdata Spark parameters on this github page.

jupyter toree install \
    --user \
    --spark_home="/usr/lib/spark2/" \
    --interpreters=Scala,SQL \
    --kernel_name="Scala Spark" \
    --spark_opts="--master yarn --driver-memory 2G --executor-memory 8G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64 --conf spark.sql.shuffle.partitions=256"

Conda environments

Your Jupyter Notebook Server is run out of a Conda environment which is 'stacked' on top of a read-only distribution of Anaconda named anaconda-wmf. anaconda-wmf has a large list of packages already installed, and these packages are also installed on all Hadoop worker nodes.

After logging into JupyterHub, when you start your Jupyter Notebook Server it is launched out of your Conda environment stacked on anaconda-wmf. This means that the packages in anaconda-wmf are available to import in your Python notebooks. If you need different or newer versions of packages, you can conda (preferred) or pip install them into your active Conda environment, and they will be imported from there.
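
To see which of the stacked environments a package is being imported from, you can check its file path. This is a quick sketch; pandas is just an example package.

import sys
import pandas

print(sys.prefix)       # your active Conda environment
print(pandas.__file__)  # resolves to anaconda-wmf unless you've installed your own copy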

Using multiple Conda environments

You can create as many Conda environments as you might need, but you can only run one Jupyter Notebook Server at a time. This means that you can only use one Conda environment in a Jupyter Notebook Server at a time. To use a Jupyter Notebook Server with a different Conda environment, you can stop your Jupyter Notebook Server from the JupyterHub Control Panel, and start a new server and select a different Conda environment for it to use.

These Conda environments may also be used outside of Jupyter on the CLI.

See Analytics/Systems/Anaconda for more information.

Troubleshooting

pip fails to install a newer version of a package

If you use pip to install a package into your conda environment, and that package already exists in the base anaconda-wmf environment, you might get an error like:

  Attempting uninstall: wmfdata
    Found existing installation: wmfdata 1.0.4
    Uninstalling wmfdata-1.0.4:
ERROR: Could not install packages due to an EnvironmentError: [Errno 30] Read-only file system: 'WHEEL'

To work around this, tell pip to --ignore-installed when running pip install, like:

pip install --ignore-installed --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release

See also: Analytics/Systems/Anaconda#Installing_packages_into_your_user_conda_environment

Trouble installing R packages

See Analytics/Systems/Anaconda#R_support

Browser disconnects

If your browser session disconnects from the kernel on the server (for example, if your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel. However, no further display output from that work (like print() calls used to log progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).
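
One possible workaround for long-running work is to write progress to a file in your home directory instead of relying on print() output, so it survives a disconnect. This is a sketch; the filename is illustrative.

import logging

# Log progress to a file on the notebook server rather than to notebook output.
logging.basicConfig(
    filename="long_job_progress.log",
    level=logging.INFO,
    format="%(asctime)s %(message)s",
)

for chunk in range(10):
    # ... do a chunk of work ...
    logging.info("finished chunk %d", chunk)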

My Python notebook will not start

Your IPython configuration may be broken. Try deleting your ~/.ipython directory (you'll lose any configurations you've made or extensions you've installed, but it won't affect your notebooks, files, or Python packages).

My kernel restarts when I run a large query

It may be that your Jupyter Notebook Server ran out of memory and the operating system's out-of-memory killer decided to kill your kernel to cope with the situation. You won't get any notification that this has happened other than the notebook restarting. You can assess the state of the memory on the notebook server by checking its host overview dashboard in Grafana (host-overview dashboard) or by using the command line to see which processes are using the most memory (with ps aux --sort -rss | head or similar).
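
You can also check memory from inside a notebook. This is a sketch assuming the psutil package is available in your environment.

import psutil

# Report total and currently available memory on the notebook server.
mem = psutil.virtual_memory()
print(f"total: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")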

Viewing Jupyter Notebook Server logs

JupyterHub logs are viewable by normal users in Kibana.

A dashboard has been created named JupyterHub and this is also linked from the Home Dashboard.

At present the logs are not split per user, but we are working to make this possible.

They are no longer written by default to /var/log/syslog but they are retained on the host in the systemd journal.

You might need to see JupyterHub logs to troubleshoot login issues or resource issues affecting the cluster.

An individual user's notebook server log can be examined with the following command:

sudo journalctl -f -u jupyter-$USERNAME-singleuser.service

Viewing JupyterHub logs

TODO: Make this work for regular users!

You might need to see JupyterHub logs to troubleshoot login issues:

 sudo journalctl -f -u jupyterhub

Tips

Analytics/Systems/Jupyter/Tips

Administration

Analytics/Systems/Jupyter/Administration