Data Engineering/Systems/Jupyter/Administration

From Wikitech

System Overview

We have deprecated anaconda-wmf and are transitioning to conda-analytics. In the interim, the sections below describe how the Jupyter installation works under each of them.

anaconda-wmf

anaconda-wmf is a custom Debian package of Anaconda that includes additional packages useful for analytics at WMF. anaconda-wmf is installed on all analytics clients (AKA stat boxes) and worker nodes. It installs to /usr/lib/anaconda-wmf. See the Analytics/Systems/Anaconda documentation for how this works.

anaconda-wmf includes Python and the other packages we need to run JupyterHub. JupyterHub itself is configured and set up by Puppet. Users can open an ssh tunnel to an analytics client node and access JupyterHub over HTTP. JupyterHub is configured to authenticate users via LDAP, and it also restricts access to a few POSIX groups. It is also configured to work with anaconda-wmf and 'stacked' Conda environments via a custom CondaEnvProfilesSpawner, which can create and activate new user Conda environments. After authentication, the user is prompted with a list of Conda environments to use. Their Jupyter Notebook Server process is then launched by a SystemdSpawner, which runs the jupyterhub-singleuser command out of the user's Conda environment, e.g. /home/otto/.conda/envs/2020-12-13T19.40.09_otto/bin/jupyterhub-singleuser.
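As a rough illustration of the ssh tunnel step, the sketch below builds the argument list for an ssh port forward to an analytics client node. The host name stat.example.org and port 8880 are hypothetical placeholders, not values taken from this page; substitute the host and port of the JupyterHub instance you are actually connecting to.

```python
from typing import List


def ssh_tunnel_argv(host: str, local_port: int = 8880, remote_port: int = 8880) -> List[str]:
    """Build an ssh command line that forwards a local port to JupyterHub.

    ASSUMPTION: the default port 8880 is a placeholder for illustration only.
    -N tells ssh not to run a remote command; -L sets up the local forward.
    """
    forward = f"{local_port}:127.0.0.1:{remote_port}"
    return ["ssh", "-N", "-L", forward, host]


if __name__ == "__main__":
    # Hypothetical host name; print the command rather than running it.
    print(" ".join(ssh_tunnel_argv("stat.example.org")))
```

With a tunnel like this open, JupyterHub would be reachable in a local browser on the chosen local port over HTTP.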

Note that JupyterHub runs out of /usr/lib/anaconda-wmf, while the user's Jupyter Notebook Server (to which JupyterHub proxies) runs out of the user's selected Conda environment. Because the two are separate, the user's Conda environment is ephemeral: it can be discarded at will by the user or, if really needed, by an administrator. It also means we can upgrade anaconda-wmf, and users can install whatever packages they need into their own Conda environments.

conda-analytics

conda-analytics is a custom Debian package of Miniconda that includes additional packages useful for analytics at WMF. conda-analytics is installed on all analytics clients (AKA stat boxes) and worker nodes. It installs to /opt/conda-analytics.

conda-analytics includes Python and the other packages we need to run JupyterHub. JupyterHub itself is configured and set up by Puppet. Users can open an ssh tunnel to an analytics client node and access JupyterHub over HTTP. JupyterHub is configured to authenticate users via LDAP, and it also restricts access to a few POSIX groups. It is also configured to work with Conda environments cloned from conda-analytics via a custom CondaEnvProfilesSpawner, which can create and activate new user Conda environments. After authentication, the user is prompted with a list of Conda environments to use. Their Jupyter Notebook Server process is then launched by a SystemdSpawner, which runs the jupyterhub-singleuser command out of the user's Conda environment, e.g. /home/otto/.conda/envs/2020-12-13T19.40.09_otto/bin/jupyterhub-singleuser.
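To make the path layout above concrete, here is a minimal sketch that derives an environment name and the matching jupyterhub-singleuser path. The timestamp_username naming pattern is inferred from the single example path on this page, and the helper names new_env_name and singleuser_path are hypothetical; this is not the actual CondaEnvProfilesSpawner code.

```python
from datetime import datetime
from pathlib import PurePosixPath


def new_env_name(username: str, now: datetime) -> str:
    """Return a name such as 2020-12-13T19.40.09_otto.

    ASSUMPTION: the format is inferred from the example path in the text.
    """
    return f"{now.strftime('%Y-%m-%dT%H.%M.%S')}_{username}"


def singleuser_path(home: str, env_name: str) -> PurePosixPath:
    """Return the jupyterhub-singleuser executable inside a user's Conda environment."""
    return PurePosixPath(home) / ".conda" / "envs" / env_name / "bin" / "jupyterhub-singleuser"


if __name__ == "__main__":
    name = new_env_name("otto", datetime(2020, 12, 13, 19, 40, 9))
    print(singleuser_path("/home/otto", name))
```

The point of the sketch is only that the notebook server binary lives inside the user's own environment under ~/.conda/envs, not inside the system-wide installation.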

Note that JupyterHub runs out of /opt/conda-analytics, while the user's Jupyter Notebook Server (to which JupyterHub proxies) runs out of the user's selected Conda environment. Because the two are separate, the user's Conda environment is ephemeral: it can be discarded at will by the user or, if really needed, by an administrator. It also means we can upgrade conda-analytics, and users can install whatever packages they need into their own Conda environments.
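Since user environments are disposable, discarding one amounts to removing its directory. The sketch below shows one cautious way this could be done, assuming environments live under ~/.conda/envs as in the example path above. The function discard_env is a hypothetical helper and not an existing administration tool; in practice, conda's own environment removal command may be the better choice.

```python
import shutil
from pathlib import Path


def discard_env(home: Path, env_name: str) -> bool:
    """Delete one Conda environment under <home>/.conda/envs.

    Returns True if the environment was removed, False if the name does not
    refer to a directory directly inside the envs root.
    """
    envs_root = (home / ".conda" / "envs").resolve()
    target = (envs_root / env_name).resolve()
    # Refuse names that escape the envs root (e.g. "../..") or do not exist.
    if target.parent != envs_root or not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

The parent check is a deliberate design choice: an administrator running this on someone else's home directory should never be able to delete anything outside that user's envs directory by passing a malformed name.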