Analytics/Systems/Jupyter
The analytics clients include a hosted version of JupyterHub, allowing access to internal data with Jupyter Notebooks.
Overview
JupyterHub is a multi-tenant Jupyter Notebook Server launcher. It runs on each of the analytics clients (AKA stat boxes). Users open an SSH tunnel to the JupyterHub service, open a browser, log in, and choose or create a Conda environment from which to run their Jupyter Notebook Server.
Access
To access JupyterHub, you need:
- Production data access in the analytics-privatedata-users POSIX group with Kerberos.
- Your SSH configured correctly.
- You'll also need to be in the wmf or nda LDAP groups.
Once you have this access, open an SSH tunnel to one of the analytics clients, e.g.
ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880
replacing stat1005 with the hostname of any other analytics client (AKA stat box) if you prefer.
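If your SSH is not yet configured for the analytics hosts, a minimal ~/.ssh/config sketch looks something like the following; the username and bastion hostname here are placeholders, so follow the production SSH access documentation for the exact values that apply to you.
Host *.eqiad.wmnet
    User your-shell-username
    ProxyJump your-bastion-host.wikimedia.org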
Then, open localhost:8880 in your browser and log in with your shell username and LDAP password. You'll be prompted to select or create a Conda environment. See the section on Conda environments below.
Note that this will give you access to your Jupyter Notebook Server on the chosen analytics client host only. Notebooks and files are saved to your home directory on that host. If you need shared access to files, consider putting those files in HDFS.
Authenticate to Hadoop via Kerberos
Once you've logged in, if you want to access data from Hadoop, you will need to authenticate with Kerberos.
This can be done either in a terminal SSH session or in a Jupyter terminal.
In a terminal session, just type
kinit
You'll be prompted for your Kerberos password.
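To confirm that you now have a valid ticket, you can run klist (a standard Kerberos command):
klist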
Querying Analytics Cluster Datasets
The Product Analytics team maintains software packages to make accessing data from the analytics clients as easy as possible by hard-coding all the setup and configuration.
For Python, there is wmfdata-python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating SparkSessions (see below).
For more advanced usage, see Analytics/Systems/Jupyter/Tips#Custom_PySpark_Notebook_Kernels
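As a minimal sketch of the non-Spark backends, assuming the hive and presto submodules each expose a run function that takes a SQL string and returns a Pandas DataFrame (check the wmfdata-python documentation for the exact API):
import wmfdata
# Query Hive directly; the query reuses the table from the Spark example below.
df = wmfdata.hive.run("""
    SELECT meta.domain, COUNT(*) AS page_creates
    FROM event.mediawiki_page_create
    WHERE year=2022 AND month=1 AND day=1 AND hour=0
    GROUP BY meta.domain
    LIMIT 10
""")
# The presto submodule can be used the same way, e.g. wmfdata.presto.run(...)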
For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.
For Scala-Spark and Spark-SQL, you need to install your own kernels in your Conda environment and use Apache Toree (see below). NOTE: Toree is a relatively inactive project.
PySpark and wmfdata
It is possible to create and use a custom Jupyter Notebook kernel to instantiate a PySpark session. However, predefined kernels must specify all possible options to Spark, making it impossible to customize a SparkSession to your needs. Instead, it is recommended to use a regular Python Notebook, and use either wmfdata-python or the findspark package to instantiate your Python SparkSession.
wmfdata has a simplified Spark run function that allows you to quickly run SQL to access data in Hive via Spark as a Pandas DataFrame.
import wmfdata
pandas_df = wmfdata.spark.run(
"""
SELECT meta.domain, count(*)
FROM event.mediawiki_page_create
WHERE year=2022 AND month=1 AND day=1 and hour=0
GROUP BY meta.domain
ORDER BY count(*) DESC
LIMIT 10
"""
)
print(pandas_df)
domain count(1)
0 commons.wikimedia.org 1491
1 en.wikipedia.org 308
2 mg.wiktionary.org 209
3 www.wikidata.org 126
4 fr.wikipedia.org 117
5 uk.wikisource.org 87
6 ar.wikipedia.org 61
7 ur.wikipedia.org 56
8 pl.wikipedia.org 36
9 hyw.wikipedia.org 32
The run function should only be used with smallish result sets, as it pulls all results into memory in the Jupyter Notebook Server.
You can also just instantiate a SparkSession and use it directly.
import wmfdata
# Get a predefined and preconfigured SparkSession type using get_session.
spark = wmfdata.spark.get_session(type='yarn-regular')
# Or get a totally customizable SparkSession using get_spark_session.
spark = wmfdata.spark.get_spark_session(
master='yarn',
spark_config={
'spark.executor.memory': '4g'
}
)
# If you have locally installed dependencies that you need on remote YARN Spark executors,
# wmfdata.spark.get_session and wmfdata.spark.get_custom_session
# have a ship_python_env option, which will automatically
# pack and ship your current conda environment to the remote executors,
# and cause them to use the Python and dependencies from it.
spark = wmfdata.spark.get_spark_session(
master='yarn',
ship_python_env=True
)
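Once you have a SparkSession, you can use it via the normal PySpark API; for example (reusing the table from the earlier example):
spark.sql("""
    SELECT meta.domain, COUNT(*) AS page_creates
    FROM event.mediawiki_page_create
    WHERE year=2022 AND month=1 AND day=1 AND hour=0
    GROUP BY meta.domain
    ORDER BY page_creates DESC
    LIMIT 10
""").show()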
By default, YARN-based SparkSessions used by run will time out after 5 minutes of inactivity.
Also note that you can only have one active SparkSession instance per notebook at a time. Provided parameters will only be applied the first time the SparkSession is instantiated. Subsequent calls with different configuration parameters will not result in a modified SparkSession, unless the SparkSession is first stopped (or has timed out).
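For example, to change the executor memory of an existing session, stop it first and then create a new one (a sketch reusing the wmfdata call shown above):
spark.stop()
spark = wmfdata.spark.get_spark_session(
    master='yarn',
    spark_config={
        'spark.executor.memory': '8g'
    }
)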
Scala-Spark or Spark-SQL using Toree
To use either Scala-Spark or Spark-SQL notebooks, you need to have Apache Toree available in your Conda environment.
An easy way to do so is to install it via the notebook terminal interface: in your notebook interface, click New -> Terminal, and in the terminal run pip install toree. And that's it :)
Now you can create a Jupyter kernel using Toree as a gateway between the notebook and a Spark session running on the cluster (note: the Spark session is managed by Toree, so there is no need to create it manually).
To create both Scala-Spark and Spark-SQL kernels, run the following in your notebook terminal:
NOTE: Please change the kernel name and the Spark options as you see fit; you can find the default wmfdata Spark parameters on this GitHub page.
jupyter toree install \
--user \
--spark_home="/usr/lib/spark2/" \
--interpreters=Scala,SQL \
--kernel_name="Scala Spark" \
--spark_opts="--master yarn --driver-memory 2G --executor-memory 8G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64 --conf spark.sql.shuffle.partitions=256"
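Once the install completes, you can verify that the new kernels were registered with the standard Jupyter command:
jupyter kernelspec list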
Conda environments
Your Jupyter Notebook Server is run out of a Conda environment which is 'stacked' on top of a read-only distribution of Anaconda, named anaconda-wmf. anaconda-wmf has a large list of packages already installed, and these packages are installed on all Hadoop worker nodes.
After logging into JupyterHub, when you start Jupyter Notebook Server it is launched out of your Conda environment stacked on anaconda-wmf. This means that the packages in anaconda-wmf are available to import in your python notebooks. If you need different or newer versions of packages, you can conda (preferred) or pip install them into your active Conda environment, and they will be imported from there.
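For example, from a Jupyter terminal with your environment active (the package name here is only a placeholder):
conda install some-package
or, if it is not available from Conda:
pip install some-package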
Using multiple Conda environments
You can create as many Conda environments as you might need, but you can only run one Jupyter Notebook Server at a time. This means that you can only use one Conda environment in a Jupyter Notebook Server at a time. To use a Jupyter Notebook Server with a different Conda environment, you can stop your Jupyter Notebook Server from the JupyterHub Control Panel, and start a new server and select a different Conda environment for it to use.
These Conda environments may also be used outside of Jupyter on the CLI.
See Analytics/Systems/Anaconda for more information.
Troubleshooting
pip fails to install a newer version of a package
If you use pip to install a package into your conda environment that already exists in the base anaconda-wmf environment, you might get an error like:
Attempting uninstall: wmfdata
Found existing installation: wmfdata 1.0.4
Uninstalling wmfdata-1.0.4:
ERROR: Could not install packages due to an EnvironmentError: [Errno 30] Read-only file system: 'WHEEL'
To work around this, tell pip to --ignore-installed when running pip install, like:
pip install --ignore-installed --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release
See also: Analytics/Systems/Anaconda#Installing_packages_into_your_user_conda_environment
Trouble installing R packages
See Analytics/Systems/Anaconda#R_support
Browser disconnects
If your browser session disconnects from the kernel on the server (if, for example, your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel, but no further display output for that work (like print() commands to log progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).
My Python notebook will not start
Your IPython configuration may be broken. Try deleting your ~/.ipython directory (you'll lose any configurations you've made or extensions you've installed, but it won't affect your notebooks, files, or Python packages).
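For example, renaming the directory rather than deleting it keeps a backup you can restore later:
mv ~/.ipython ~/.ipython.bak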
My kernel restarts when I run a large query
It may be that your Jupyter Notebook Server ran out of memory and the operating system's out-of-memory killer decided to kill your kernel to cope with the situation. You won't get any notification that this has happened other than the notebook restarting, but you can assess the state of the memory on the notebook server by checking its host overview dashboard in Grafana (host-overview dashboard) or by using the command line to see which processes are using the most memory (with ps aux --sort -rss | head or similar).
Viewing Jupyter Notebook Server logs
JupyterHub logs are viewable by normal users in Kibana.
A dashboard has been created named JupyterHub and this is also linked from the Home Dashboard.
At present the logs are not split per user, but we are working to make this possible.
They are no longer written by default to /var/log/syslog, but they are retained on the host in the systemd journal.
You might need to see JupyterHub logs to troubleshoot login issues or resource issues affecting the cluster.
An individual user's notebook server log can be examined with the following command:
sudo journalctl -f -u jupyter-$USERNAME-singleuser.service
Viewing JupyterHub logs
TODO: Make this work for regular users!
You might need to see JupyterHub logs to troubleshoot login issues:
sudo journalctl -f -u jupyterhub
Tips
Analytics/Systems/Jupyter/Tips