Data Engineering/Systems/Jupyter

From Wikitech
The interface you'll see once you sign into our Jupyter service

The analytics clients include a hosted version of JupyterHub, allowing easy analysis of internal data using Jupyter notebooks.

Access

Prerequisites

To access Jupyter, you need:

SSH tunnel

Once you have this access, first open an SSH tunnel to one of the analytics clients by running the following command in your computer's terminal:

$ ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880

You can replace stat1005 with the name of another analytics client if you prefer.
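If you switch between clients often, the command above can be generated rather than retyped. The following is a minimal sketch, not part of any official tooling: the function name `build_tunnel_command` and its `local_port` parameter are assumptions made for illustration, and only the default host and port 8880 come from the command shown above.

```python
import shlex

def build_tunnel_command(host="stat1005.eqiad.wmnet", local_port=8880, remote_port=8880):
    """Return the ssh argument list that forwards local_port to remote_port on host."""
    # -N tells ssh not to run a remote command; -L sets up the local port forward.
    return ["ssh", "-N", host, "-L", f"{local_port}:127.0.0.1:{remote_port}"]

# Print the command so it can be pasted into a terminal.
print(shlex.join(build_tunnel_command()))
```

If you pass a different `local_port` (for example because 8880 is already in use on your computer), open that port in your browser instead of 8880.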

Note that your Jupyter notebooks and files will be stored only on the analytics client you choose. If you want to move to another server, you will have to copy your files using Rsync. If you need shared access to files, consider putting them in HDFS.

Logging in

Then, open localhost:8880 in your browser and log in with your developer account. Use your shell username rather than your wiki username (e.g. nshahquinn-wmf, not Neil Shah-Quinn (WMF)).

You'll be prompted to select or create a Conda environment. See the section on Conda environments below.

Authenticating via Kerberos

Once you've logged in, if you want to access data from Hadoop, you will need to authenticate with Kerberos. Open a new terminal. Type kinit on the command line and enter your Kerberos password at the prompt.
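Before starting a long Hadoop query from a notebook, it can help to confirm that a ticket is already in place. The sketch below assumes the standard `klist` client is on the path and uses its `-s` flag, which prints nothing and reports through the exit status whether a readable, unexpired credentials cache exists. The function name `has_valid_kerberos_ticket` is hypothetical.

```python
import shutil
import subprocess

def has_valid_kerberos_ticket():
    """Return True if `klist -s` finds a readable, unexpired credentials cache."""
    if shutil.which("klist") is None:
        # Kerberos client tools are not installed where this code is running.
        return False
    # `klist -s` is silent; exit status 0 means a usable ticket exists.
    return subprocess.run(["klist", "-s"]).returncode == 0

if not has_valid_kerberos_ticket():
    print("No valid Kerberos ticket found; run kinit in a terminal first.")
```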

Querying data

The Data Engineering and Product Analytics teams maintain software packages to make accessing data from the analytics clients as easy as possible by hard-coding all the setup and configuration.

In Python

For Python, there is Wmfdata-Python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating Spark sessions. For details, see the repository and particularly the quickstart notebook.

In R

For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.

Conda environments

Main article: Data Engineering/Systems/Conda

Jupyter is set up to use isolated environments managed by Conda. When you first log in, you will be prompted to "Create and use new cloned conda environment".

You can create as many Conda environments as you need, but you can only run one at a time in Jupyter. To change environments:

  1. Navigate to the JupyterHub control panel by selecting File → Hub Control Panel in the Jupyter interface or navigating to localhost:8880/hub/home
  2. Select "Stop My Server"
  3. Select "Start My Server"

You will now see a dropdown allowing you to choose an existing environment or create a new one:

The option to choose a Conda environment when starting a JupyterHub server

Installing packages

If you need different or newer versions of packages, run conda install {{package}} or conda update {{package}} in the terminal. For more information, see Data Engineering/Systems/Conda#Installing packages.
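To see which version of a package the current environment already has before deciding whether to update, you can ask Python directly from a notebook cell. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` and the package names in the loop are only examples, and it assumes the package was installed with standard Python metadata.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Example package names; replace them with the ones you care about.
for name in ("pandas", "wmfdata"):
    print(name, installed_version(name) or "not installed")
```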

Troubleshooting

Trouble installing R packages

See Data Engineering/Systems/Conda#R support.

Browser disconnects

If your browser session disconnects from the kernel on the server (if, for example, your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel, but no further display output for that work (like print() commands to log progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).
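One possible workaround is to write progress messages to a file instead of relying on print(), so they can be read after reconnecting. The sketch below uses Python's standard `logging` module. The helper `make_progress_logger`, the logger name, and the log file location are assumptions chosen for illustration.

```python
import logging
import tempfile
from pathlib import Path

def make_progress_logger(log_path, name="notebook_progress"):
    """Return a logger that appends timestamped messages to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the notebook's own output
    if not logger.handlers:
        # Attach the file handler only once, even if the cell is re-run.
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger

# Hypothetical location; any path you can write to on the client works.
log_path = Path(tempfile.gettempdir()) / "long_job_progress.log"
logger = make_progress_logger(log_path)

for step in range(3):
    # ... long-running work for this step would go here ...
    logger.info("finished step %d of 3", step + 1)
```

After reconnecting, the file can be read from a terminal or another notebook cell to see how far the job got.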

Notebook is unresponsive, or kernel restarts when running a large query

Your Jupyter notebook server may have run out of memory, causing the operating system's out-of-memory killer to kill your kernel. You won't get any notification that this has happened other than the notebook becoming unresponsive or restarting. You can check memory usage on the notebook server either in its host overview dashboard in Grafana or from the command line, by listing the processes that use the most memory (with ps aux --sort -rss | head or similar).
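You can also check from inside a notebook how much memory the kernel itself has used. This is a rough sketch using the standard library's `resource` module (Unix only). The function name `peak_memory_mib` is hypothetical, and the value covers only the kernel process, not Spark executors or other processes on the host.

```python
import resource
import sys

def peak_memory_mib():
    """Return the peak resident memory of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes, macOS in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor

print(f"Peak kernel memory so far: {peak_memory_mib():.0f} MiB")
```

Running this before and after loading a large query result gives a rough idea of how close the kernel is to the limits of the host.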

Viewing Jupyter Notebook Server logs

JupyterHub logs are viewable by normal users in Kibana.

A dashboard named JupyterHub has been created; it is also linked from the Home Dashboard.

At present the logs are not split per user, but we are working to make this possible.

They are no longer written by default to /var/log/syslog but they are retained on the host in the systemd journal.

You might need to see JupyterHub logs to troubleshoot login issues or resource issues affecting the cluster.

An individual user's notebook server log can be examined with the following command:

sudo journalctl -f -u jupyter-$USERNAME-singleuser.service

Viewing JupyterHub logs

TODO: Make this work for regular users!

You might need to see JupyterHub logs to troubleshoot login issues:

sudo journalctl -f -u jupyterhub

"An error occurred while trying to connect to the Java server"

If you see an error like this:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:43881)
Traceback (most recent call last):
  File "/usr/lib/spark3/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 977, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

Try the following steps:

  1. Restart your notebook kernel (Menu → Kernel → Restart kernel).
  2. Restart your JupyterHub server (follow the steps for changing environments, but use the same environment).
  3. Create and use a brand new environment (follow the steps for changing environments, but select "Create and use new cloned conda environment").

Tips

Analytics/Systems/Jupyter/Tips

Administration

Analytics/Systems/Jupyter/Administration