Data Platform/Systems/Jupyter
The analytics clients include a hosted version of JupyterHub, allowing easy analysis of internal data using Jupyter notebooks.
Access
Prerequisites
To access Jupyter, you need:
- Production data access, as a member of the analytics-privatedata-users POSIX group
- A correctly configured SSH setup
- Kerberos credentials
- Membership in the wmf or nda LDAP group
SSH tunnel
Once you have this access, open an SSH tunnel to one of the analytics clients. There are two main ways to do this. We'll assume you want to connect to stat1008, but you can connect to another client instead by changing the name in the command.
The first option is using the standard SSH command:
$ ssh -N stat1008.eqiad.wmnet -L 8880:127.0.0.1:8880
The second option is to modify your SSH configuration file to automatically open a tunnel whenever you connect to a client:
Match host=!*.*,stat10*
SessionType none
HostName %h.eqiad.wmnet
LocalForward 8880 127.0.0.1:8880
With that added, you can simply use the following command:[1]
$ ssh stat1008
Note that your Jupyter notebooks and files will be stored only on the chosen analytics client. If you want to move to another server, you will have to copy your files using rsync. If you need shared access to files, consider putting them in HDFS.
Logging in
Then, open localhost:8880 in your browser and log in with your developer account. Use your shell username rather than your wiki username (e.g. nshahquinn-wmf, not Neil Shah-Quinn (WMF)).
You'll be prompted to select or create a Conda environment. See the section on Conda environments below.
Authenticating via Kerberos
Once you've logged in, if you want to access data from Hadoop, you will need to authenticate with Kerberos. Open a new terminal in JupyterLab (not a separate SSH session), type kinit on the command line, and enter your Kerberos password at the prompt.
Querying data
To make it easier to access data from the analytics clients, use the following software packages, which take care of much of the setup and configuration for you.
In Python
For Python, there is Wmfdata-Python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating Spark sessions. For details, see the repository and particularly the quickstart notebook.
In R
For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.
Conda environments
Jupyter is set up to use isolated environments managed by Conda. When you first log in, you will be prompted to "Create and use new cloned conda environment".
You can create as many Conda environments as you need, but you can only run one at a time in Jupyter. To change environments:
- Navigate to the JupyterHub control panel by selecting File → Hub Control Panel in the Jupyter interface or navigating to localhost:8880/hub/home
- Select "Stop My Server"
- Select "Start My Server"
You will now see a dropdown allowing you to choose an existing environment or create a new one.
Installing packages
If you need different or newer versions of packages, run conda install <package> or conda update <package> in the terminal. For more information, see Data Platform/Systems/Conda#Installing packages.
Sharing Notebooks
Important: Before sharing the notebook in any way, please consult and follow the Data Publication Guidelines.
There is currently no built-in functionality for internal sharing of notebooks (phab:T156934), but there are several different workarounds available:
Copying files on the analytics clients
It's possible to copy notebooks and files directly on the server by opening a terminal ('New' → 'Terminal' in the root folder of the browser window) and using the cp command. Note that you may have to change the file permissions with the chmod command to give the other user read access to the files.
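As a concrete sketch of the cp and chmod steps (using a throwaway directory in /tmp rather than a real home directory; the file names are illustrative):

```shell
# Stand-in for a notebook you want to share
mkdir -p /tmp/nb-share-demo
echo '{"cells": []}' > /tmp/nb-share-demo/analysis.ipynb

# Copy the notebook, then give other users read access to the copy
cp /tmp/nb-share-demo/analysis.ipynb /tmp/nb-share-demo/analysis-copy.ipynb
chmod o+r /tmp/nb-share-demo/analysis-copy.ipynb

# The mode string should now show other-read access (e.g. end in 'r--')
ls -l /tmp/nb-share-demo/analysis-copy.ipynb
```

On the stat clients, you would copy into a location the other user can reach and run chmod there instead.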
GitHub or GitLab
It's also possible to track your notebooks in Git and push them to either GitHub or our GitLab, both of which will display them fully rendered on their websites. Generally, this requires making the notebook public, but it's also possible to request a private GitLab repository if necessary.
In either case, you will need to connect using HTTPS rather than SSH (SRE considers SSH from the analytics clients a security risk because of the possibility that other users could access your SSH keys). To do this, you will need to set up a personal access token (GitLab docs, GitHub docs), which you will use in place of a password when using Git on the command line.
By default, you'll have to enter your username and access token every time you push. You can avoid this by adding the following to ~/.gitconfig:
# Automatically add username to GitHub URLs
[url "https://{{username}}@github.com"]
insteadOf = https://github.com
# Automatically add username to Wikimedia GitLab URLs
[url "https://{{username}}@gitlab.wikimedia.org"]
insteadOf = https://gitlab.wikimedia.org
# Cache access tokens for 8 hours after entry
[credential]
helper = cache --timeout=28800
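The same settings can also be written with git config commands rather than by editing the file by hand. This sketch writes to a throwaway config file and uses a placeholder username (exampleuser); on the stat clients you would drop --file so the settings land in ~/.gitconfig:

```shell
CFG=/tmp/demo-gitconfig

# Automatically add the username to GitHub and Wikimedia GitLab URLs
git config --file "$CFG" url."https://exampleuser@github.com".insteadOf https://github.com
git config --file "$CFG" url."https://exampleuser@gitlab.wikimedia.org".insteadOf https://gitlab.wikimedia.org

# Cache access tokens for 8 hours (28800 seconds) after entry
git config --file "$CFG" credential.helper "cache --timeout=28800"

# Show the resulting settings
git config --file "$CFG" --list
```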
HTML files
You can also export your notebook as an HTML file, using File → Download as... in the JupyterLab interface or the jupyter nbconvert --to html command.
If you want to make the HTML file public on the web, you can use the web publication workflow.
Quarto
To convert a Jupyter notebook to an HTML file with Quarto on the analytics cluster:
- Install Quarto in your conda-analytics environment with conda install quarto (which will install Quarto from conda-forge)
- Restart your server (File → Hub Control Panel → Stop My Server → Start My Server → select the relevant environment); this runs etc/conda/activate.d/quarto.sh, which sets the environment variables necessary for the quarto command to work
Then you can run quarto render some_notebook.ipynb --to html, which will generate:
- An HTML file with the same name as the notebook but an .html extension (e.g. some_notebook.html)
- A folder of JavaScript and CSS files with the same name as the notebook but _files appended (e.g. some_notebook_files); this folder is necessary for the HTML file to display and function correctly
- Alternatively, set the embed-resources: true option to produce a standalone (self-contained) HTML file.[2]
When you publish the files (e.g. following the web publication workflow), make sure you publish both the HTML file and the folder of supporting files.
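One place to set this option (a sketch following Quarto's front-matter conventions; the title is illustrative) is a raw cell at the top of the notebook containing YAML front matter:

```yaml
---
title: "Some notebook"
format:
  html:
    embed-resources: true
---
```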
Nbviewer
Nbviewer is a useful tool for sharing notebooks, especially ones with interactive or HTML elements, which may not render well on GitHub or GitLab. For GitHub, the tool works fine with the blob version of a notebook; for gitlab.wikimedia.org, however, it can only read the raw version. You can either change blob to raw in the URL, or open the raw version from the top bar of the file view and copy the URL.
Troubleshooting
Spawn failure after creating a new environment
If you attempt to start a Jupyter server using the "create and use a new Conda environment" option, but the start-up fails, try deleting your .conda/pkgs directory (task T380477).
Trouble installing R packages
See Data Platform/Systems/Conda#R support.
Browser disconnects
If your browser session disconnects from the kernel on the server (if, for example, your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel. However, no further display output from that work (like print() calls logging progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).
Notebook is unresponsive, or kernel restarts when running a large query
It may be that your Jupyter notebook server ran out of memory and the operating system's out-of-memory killer decided to kill your kernel to cope. You won't get any notification that this has happened other than the notebook becoming unresponsive or the kernel restarting. You can assess the state of the memory on the notebook server by checking its host overview dashboard in Grafana or by using the command line to see which processes are using the most memory (with ps aux --sort -rss | head or similar).
Sometimes trying to access the main interface at http://localhost:8880 will throw an HTTP 500 error. In these cases it may be possible to visit http://localhost:8880/hub/home and stop the server.
Viewing Jupyter Notebook Server logs
JupyterHub logs are viewable by normal users in Kibana.
A dashboard has been created named JupyterHub and this is also linked from the Home Dashboard.
At present the logs are not split per user, but we are working to make this possible.
They are no longer written by default to /var/log/syslog, but they are retained on the host in the systemd journal.
You might need to see these logs to troubleshoot login issues or resource issues affecting the cluster.
An individual user's notebook server log can be examined with the following command:
sudo journalctl -f -u jupyter-$USERNAME-singleuser.service
Viewing JupyterHub logs
TODO: Make this work for regular users!
You might need to see JupyterHub logs to troubleshoot login issues:
sudo journalctl -f -u jupyterhub
"An error occurred while trying to connect to the Java server"
If you see an error like this:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:43881)
Traceback (most recent call last):
  File "/usr/lib/spark3/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 977, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque
Try each of these options, one at a time:
- restart your notebook kernel (Kernel → Restart Kernel in the menu)
- restart your JupyterHub server (follow the steps for changing environments, but use the same environment)
- create and use a brand-new environment (follow the steps given previously, but select "Create and use new cloned conda environment")
Tips
Sending emails from within a notebook
To send out an email from a Python notebook (e.g. as a notification that a long-running query or calculation has completed), you can use the following code:
import getpass
import smtplib
import socket
from email.message import EmailMessage

# Build the default addresses from the local hostname and your shell username
server = socket.gethostname() + '.eqiad.wmnet'
user = getpass.getuser()

def send_email(
    subject,
    body,
    to_email=user + '@wikimedia.org',
    from_email=user + '@' + server
):
    message = EmailMessage()
    message.set_content(body)
    message['From'] = from_email
    message['To'] = to_email
    message['Subject'] = subject
    with smtplib.SMTP('localhost') as smtp:
        smtp.send_message(message)
(Invoking the standard mail client via the shell, i.e. !mailx, fails for some reason. See phab:T168103.)
Shell shortcuts
To make it easier to access the analytics clients, you can add entries like
Host stat11
    HostName stat1011.eqiad.wmnet
to ~/.ssh/config, so that connecting to stat1011.eqiad.wmnet is as easy as ssh stat11.
You can also make it easier to open SSH tunnels without remembering the full command. For example, if you are using Z shell, you can add a tunnel function
tunnel() {
    ssh -N "$1" -L 8880:127.0.0.1:8880
}
to ~/.zshrc so that opening a tunnel to stat1011.eqiad.wmnet (assuming you added the appropriate entries to your SSH config) is as easy as tunnel stat11.
Administration
Data_Platform/Systems/Jupyter/Administration
References
- ↑ If you later want to open a plain interactive SSH session with one of the analytics clients, you can still do so by using its full name: ssh stat1008.eqiad.wmnet.
- ↑ https://quarto.org/docs/output-formats/html-basics.html#self-contained