Data Engineering/Systems/Jupyter

The interface you'll see once you sign into our Jupyter service

The analytics clients include a hosted version of JupyterHub, allowing easy analysis of internal data using Jupyter notebooks.

Access

Prerequisites

To access Jupyter, you need:

SSH tunnel

Once you have this access, first open an SSH tunnel to one of the analytics clients by running the following command in your computer's terminal:

$ ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880

You can replace stat1005 with the name of another analytics client if you prefer.
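
If you switch between analytics clients often, the tunnel command can be assembled programmatically. The following is a minimal sketch rather than an official tool: the function name `tunnel_command` is hypothetical, and it only rebuilds the `ssh` invocation shown above for whichever host you pass in.

```python
import shlex


def tunnel_command(host="stat1005", port=8880):
    """Build the ssh argument list for a tunnel to an analytics client.

    `host` is the short name of the analytics client; the eqiad.wmnet
    suffix and the port are taken from the command shown above.
    """
    return [
        "ssh", "-N", f"{host}.eqiad.wmnet",
        "-L", f"{port}:127.0.0.1:{port}",
    ]


# Print the command so it can be pasted into a local terminal.
print(shlex.join(tunnel_command()))
```

Returning a list rather than a string means the same value could be passed directly to `subprocess.run` on your own computer without any shell quoting concerns.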

Note that your Jupyter notebooks and files will be stored on the chosen analytics client only. If you want to move to another server, you will have to copy your files using Rsync. If you need shared access to files, consider putting those files in HDFS.

Logging in

Then, open localhost:8880 in your browser and log in with your developer account. Use your shell username rather than your wiki username (e.g. nshahquinn-wmf, not Neil Shah-Quinn (WMF)).

You'll be prompted to select or create a Conda environment. See the section on Conda environments below.

Authenticating via Kerberos

Once you've logged in, if you want to access data from Hadoop, you will need to authenticate with Kerberos. Open a new terminal. Type kinit on the command line and enter your Kerberos password at the prompt.
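
Before starting a long Hadoop query from a notebook, it can help to check whether you still hold a valid Kerberos ticket. The sketch below is one possible approach, not part of any Wikimedia tooling: the helper name `has_kerberos_ticket` is hypothetical, and it assumes the `klist` command is the MIT Kerberos one, whose `-s` flag prints nothing and reports the ticket state through its exit status.

```python
import subprocess


def has_kerberos_ticket(run=subprocess.run):
    """Return True if `klist -s` reports a readable, unexpired ticket cache.

    `run` is injectable so the check can be exercised without Kerberos.
    """
    try:
        result = run(["klist", "-s"], check=False)
    except FileNotFoundError:
        # klist is not installed on this machine.
        return False
    return result.returncode == 0


if not has_kerberos_ticket():
    print("No valid Kerberos ticket found; run kinit in a terminal.")
```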

Querying data

The Data Engineering and Product Analytics teams maintain software packages to make accessing data from the analytics clients as easy as possible by hard-coding all the setup and configuration.

In Python

For Python, there is Wmfdata-Python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating Spark sessions. For details, see the repository and particularly the quickstart notebook.

In R

For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.

Conda environments

Main article: Data Engineering/Systems/Conda

Jupyter is set up to use isolated environments managed by Conda. When you first log in, you will be prompted to "Create and use new cloned conda environment".

You can create as many Conda environments as you need, but you can only run one at a time in Jupyter. To change environments:

  1. Navigate to the JupyterHub control panel by selecting File → Hub Control Panel in the Jupyter interface or navigating to localhost:8880/hub/home
  2. Select "Stop My Server"
  3. Select "Start My Server"

You will now see a dropdown allowing you to choose an existing environment or create a new one:

The option to choose a Conda environment when starting a JupyterHub server

Installing packages

If you need different or newer versions of packages, run conda install {{package}} or conda update {{package}} in the terminal. For more information, see Data Engineering/Systems/Conda#Installing packages.

Sharing notebooks

There is currently no built-in functionality for internal sharing of notebooks (phab:T156934), but there are several different workarounds available:

Copying files on the analytics clients

It's possible to copy notebooks and files directly on the server by clicking 'New' -> 'Terminal' (in the root folder in the browser window) and using the cp command. Note that you may have to change the file permissions using the chmod command to give the other user read access to the files.
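
The same copy-and-chmod step can be done from a notebook cell. This is a minimal sketch under stated assumptions: the function name `share_copy` is hypothetical, and the destination is assumed to sit in a directory the other user can already enter, since read permission on the file alone is not enough if they cannot reach it.

```python
import os
import shutil
import stat


def share_copy(source, destination):
    """Copy a file and make the copy readable by group and others."""
    copied = shutil.copy(source, destination)
    mode = os.stat(copied).st_mode
    # Add read permission for group and others, keeping existing bits.
    os.chmod(copied, mode | stat.S_IRGRP | stat.S_IROTH)
    return copied
```

For example, `share_copy("analysis.ipynb", "/some/shared/dir/analysis.ipynb")` would leave a world-readable copy at the second path; both paths here are placeholders.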

GitHub or GitLab

It's also possible to track your notebooks in Git and push them to either GitHub or our GitLab, both of which will display them fully rendered on their websites. Generally, this requires making the notebook public, but it's also possible to request a private GitLab repo if necessary.

In either case, you will need to connect using HTTPS rather than SSH (SRE considers SSH from the analytics clients a security risk because of the possibility that other users could access your SSH keys). To do this, you will need to set up a personal access token (GitLab docs, GitHub docs), which you will use in place of a password when using Git on the command line.

By default, you'll have to enter your username and password every time you push. You can avoid this by adding the following to ~/.gitconfig:

# Automatically add username to GitHub URLs
[url "https://{{username}}@github.com"]
    insteadOf = https://github.com

# Automatically add username to Wikimedia GitLab URLs
[url "https://{{username}}@gitlab.wikimedia.org"]
    insteadOf = https://gitlab.wikimedia.org

# Cache access tokens for 8 hours after entry
[credential]
    helper = cache --timeout=28800

HTML files

You can also export your notebook as an HTML file, using File > Download as... in the JupyterLab interface or the jupyter nbconvert --to html command.

If you want to make the HTML file public on the web, you can use the web publication workflow.

Nbviewer

"Open raw" button

Nbviewer is a useful tool for sharing notebooks, especially notebooks with interactive or HTML elements, which may not render well on GitHub or GitLab. For GitHub, Nbviewer works with the blob version of a notebook's URL. For gitlab.wikimedia.org, it can only read the raw version: either change blob to raw in the URL, or open the raw version (from the top bar, as shown in the image) and copy its URL.
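
The blob-to-raw change can also be scripted. The sketch below assumes the usual GitLab URL layout, in which the file view lives under `/-/blob/` and the raw file under `/-/raw/`. The function name and the example repository path are hypothetical.

```python
def gitlab_blob_to_raw(url):
    """Rewrite a GitLab blob URL into the corresponding raw URL."""
    blob_marker = "/-/blob/"
    if blob_marker not in url:
        raise ValueError(f"not a GitLab blob URL: {url}")
    # Replace only the first occurrence, in case a file path repeats it.
    return url.replace(blob_marker, "/-/raw/", 1)


# Placeholder repository path, for illustration only.
example = (
    "https://gitlab.wikimedia.org/example-group/example-repo"
    "/-/blob/main/analysis.ipynb"
)
print(gitlab_blob_to_raw(example))
```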

Troubleshooting

Trouble installing R packages

See Data Engineering/Systems/Conda#R support.

Browser disconnects

If your browser session disconnects from the kernel on the server (for example, because your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel. However, no further display output for that work (like print() calls to log progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).

Notebook is unresponsive, or kernel restarts when running a large query

Your Jupyter notebook server may have run out of memory, causing the operating system's out-of-memory killer to kill your kernel. You won't get any notification that this has happened other than the notebook becoming unresponsive or restarting. You can check the memory state of the notebook server on its host overview dashboard in Grafana (host-overview dashboard), or use the command line to see which processes are using the most memory (with ps aux --sort -rss | head or similar).
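
To see how much memory the current kernel itself has used, you can ask the operating system from inside the notebook. This is a minimal sketch using Python's standard `resource` module. The helper name `peak_memory_mib` is hypothetical, and the unit conversion assumes Linux, where `ru_maxrss` is reported in kibibytes.

```python
import resource


def peak_memory_mib():
    """Return this process's peak resident set size in MiB (Linux units)."""
    # On Linux, ru_maxrss is reported in kibibytes.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kib / 1024


print(f"Peak kernel memory so far: {peak_memory_mib():.0f} MiB")
```

Running this before and after loading a large query result gives a rough idea of how close the kernel is getting to the host's limits.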

Viewing Jupyter Notebook Server logs

JupyterHub logs are viewable by normal users in Kibana.

A dashboard named JupyterHub has been created, and it is also linked from the Home Dashboard.

At present the logs are not split per user, but we are working to make this possible.

They are no longer written by default to /var/log/syslog but they are retained on the host in the systemd journal.

You might need to see JupyterHub logs to troubleshoot login issues or resource issues affecting the cluster.

An individual user's notebook server log can be examined with the following command:

sudo journalctl -f -u jupyter-$USERNAME-singleuser.service

Viewing JupyterHub logs

TODO: Make this work for regular users!

You might need to see JupyterHub logs to troubleshoot login issues:

sudo journalctl -f -u jupyterhub

"An error occurred while trying to connect to the Java server"

If you see an error like this:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:43881)
Traceback (most recent call last):
  File "/usr/lib/spark3/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 977, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

Try each of these options, one at a time:

  1. restart your notebook kernel (Menu -> Kernel -> Restart kernel)
  2. restart your JupyterHub server (follow the steps for changing environments, but use the same environment)
  3. create and use a brand new environment (follow the steps given previously, but select "create and use new cloned conda environment...")

Tips

Sending emails from within a notebook

To send out an email from a Python notebook (e.g. as a notification that a long-running query or calculation has completed), you can use the following code:

# The "!" lines are IPython shell magics, so this only works inside a notebook
hostname = !hostname
server = hostname[0] + '.eqiad.wmnet'

whoami = !whoami
user = whoami[0]

from email.message import EmailMessage
import smtplib

def send_email(
    subject,
    body,
    to_email=user+'@wikimedia.org',
    from_email=user+'@'+server
):
    message = EmailMessage()
    message.set_content(body)

    message['From'] = from_email
    message['To'] = to_email
    message['Subject'] = subject

    # Send through the local mail server and close the connection afterwards
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(message)

(Invoking the standard mail client via the shell, i.e. !mailx, fails for some reason. See phab:T168103.)

Administration

Analytics/Systems/Jupyter/Administration