Analytics/Systems/Jupyter
The analytics clients include a hosted version of JupyterHub, allowing access to internal data with Jupyter Notebooks.
Overview
JupyterHub is a multi-tenant Jupyter Notebook Server launcher. It runs on each of the analytics clients (AKA stat boxes). Users open an SSH tunnel to the JupyterHub service, open a browser, log in, and choose or create a Conda environment from which to run their Jupyter Notebook Server.
Access
To access JupyterHub, you need:
- Production data access in the analytics-privatedata-users POSIX group with Kerberos.
- Your SSH configured correctly.
- You'll also need to be in the wmf or nda LDAP groups.
Once you have this access, open an SSH tunnel to one of the analytics clients, e.g.
ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880
replacing stat1005 with the hostname of any other analytics client (AKA stat box) if you prefer.
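If your SSH is not yet configured for the analytics hosts, a minimal ~/.ssh/config sketch looks something like the following; the username and bastion hostname here are placeholders, so follow the production SSH access documentation for the exact values that apply to you.
Host *.eqiad.wmnet
    User your-shell-username
    ProxyJump your-bastion-host.wikimedia.org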
Then, open localhost:8880 in your browser and log in with your shell username and LDAP password. You'll be prompted to select or create a Conda environment. See the section on Conda environments below.
Note that this will give you access to your Jupyter Notebook Server on the chosen analytics client host only. Notebooks and files are saved to your home directory on that host. If you need shared access to files, consider putting those files in HDFS.
Authenticate to Hadoop via Kerberos
Once you've logged in, if you want to access data from Hadoop, you will need to authenticate with Kerberos.
This can be done either in a terminal SSH session or in a Jupyter terminal.
In a terminal session, just type
kinit
You'll be prompted for your Kerberos password.
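To confirm that you now have a valid ticket, you can run klist (a standard Kerberos command):
klist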
Querying Analytics Cluster Datasets
The Product Analytics team maintains software packages to make accessing data from the analytics clients as easy as possible by hard-coding all the setup and configuration.
For Python, there is wmfdata-python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating SparkSessions (see below).
For more advanced usage, see Analytics/Systems/Jupyter/Tips#Custom_PySpark_Notebook_Kernels
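As a minimal sketch of the non-Spark backends, assuming the hive and presto submodules each expose a run function that takes a SQL string and returns a Pandas DataFrame (check the wmfdata-python documentation for the exact API):
import wmfdata
# Query Hive directly; the query reuses the table from the Spark example below.
df = wmfdata.hive.run("""
    SELECT meta.domain, COUNT(*) AS page_creates
    FROM event.mediawiki_page_create
    WHERE year=2022 AND month=1 AND day=1 AND hour=0
    GROUP BY meta.domain
    LIMIT 10
""")
# The presto submodule can be used the same way, e.g. wmfdata.presto.run(...)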
For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.
For Scala-Spark and Spark-SQL, you need to install your own kernels in your Conda environment and use Apache Toree (see below). NOTE: Toree is a relatively inactive project.
PySpark and wmfdata
It is possible to create and use a custom Jupyter Notebook kernel to instantiate a PySpark session. However, predefined kernels must specify all possible options to Spark, making it impossible to customize a SparkSession to your needs. Instead, it is recommended to use a regular Python Notebook, and use either wmfdata-python or the findspark package to instantiate your Python SparkSession.
wmfdata has a simplified Spark run function that allows you to quickly run SQL to access data in Hive via Spark as a Pandas DataFrame.
import wmfdata
pandas_df = wmfdata.spark.run(
"""
SELECT meta.domain, count(*)
FROM event.mediawiki_page_create
WHERE year=2022 AND month=1 AND day=1 and hour=0
GROUP BY meta.domain
ORDER BY count(*) DESC
LIMIT 10
"""
)
print(pandas_df)
domain count(1)
0 commons.wikimedia.org 1491
1 en.wikipedia.org 308
2 mg.wiktionary.org 209
3 www.wikidata.org 126
4 fr.wikipedia.org 117
5 uk.wikisource.org 87
6 ar.wikipedia.org 61
7 ur.wikipedia.org 56
8 pl.wikipedia.org 36
9 hyw.wikipedia.org 32
The run function should only be used with smallish result sets, as it pulls all results into memory in the Jupyter Notebook Server.
You can also just instantiate a SparkSession and use it directly.
import wmfdata
# Get a predefined and preconfigured SparkSession type using get_session.
spark = wmfdata.spark.get_session(type='yarn-regular')
# Or get a totally customizable SparkSession using get_spark_session.
spark = wmfdata.spark.get_spark_session(
master='yarn',
spark_config={
'spark.executor.memory': '4g'
}
)
# If you have locally installed dependencies that you need on remote YARN Spark executors,
# wmfdata.spark.get_session and wmfdata.spark.get_custom_session
# have a ship_python_env option, which will automatically
# pack and ship your current conda environment to the remote executors,
# and cause them to use the Python and dependencies from it.
spark = wmfdata.spark.get_spark_session(
master='yarn',
ship_python_env=True
)
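Once you have a SparkSession, you can use it via the normal PySpark API; for example (reusing the table from the earlier example):
spark.sql("""
    SELECT meta.domain, COUNT(*) AS page_creates
    FROM event.mediawiki_page_create
    WHERE year=2022 AND month=1 AND day=1 AND hour=0
    GROUP BY meta.domain
    ORDER BY page_creates DESC
    LIMIT 10
""").show()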
By default, YARN-based SparkSessions used by run will time out after 5 minutes of inactivity.
Also note that you can only have one active SparkSession instance per notebook at a time. Provided parameters will only be applied the first time the SparkSession is instantiated. Subsequent calls with different configuration parameters will not result in a modified SparkSession, unless the SparkSession is first stopped (or has timed out).
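For example, to change the executor memory of an existing session, stop it first and then create a new one (a sketch reusing the wmfdata call shown above):
spark.stop()
spark = wmfdata.spark.get_spark_session(
    master='yarn',
    spark_config={
        'spark.executor.memory': '8g'
    }
)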
Scala-Spark or Spark-SQL using Toree
To use either Scala-Spark or Spark-SQL notebooks, you need to have Apache Toree available in your Conda environment.
An easy way to do so is to install it via the notebook terminal interface: in your notebook interface, click New -> Terminal, and in the terminal run pip install toree. And that's it :)
Now you can create a Jupyter kernel using Toree as a gateway between the notebook and a Spark session running on the cluster (note: the Spark session is managed by Toree, so there is no need to create it manually).
To create both Scala-Spark and Spark-SQL kernels, run the following in your notebook terminal:
NOTE: Please change the kernel name and the Spark options as you see fit; you can find the default wmfdata Spark parameters on this GitHub page.
jupyter toree install \
--user \
--spark_home="/usr/lib/spark2/" \
--interpreters=Scala,SQL \
--kernel_name="Scala Spark" \
--spark_opts="--master yarn --driver-memory 2G --executor-memory 8G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64 --conf spark.sql.shuffle.partitions=256"
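Once the install completes, you can verify that the new kernels were registered with the standard Jupyter command:
jupyter kernelspec list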
Conda environments
Your Jupyter Notebook Server is run out of a Conda environment which is 'stacked' on top of a read-only distribution of Anaconda, named anaconda-wmf. anaconda-wmf has a large list of packages already installed, and these packages are installed on all Hadoop worker nodes.
After logging into JupyterHub, when you start Jupyter Notebook Server it is launched out of your Conda environment stacked on anaconda-wmf. This means that the packages in anaconda-wmf are available to import in your python notebooks. If you need different or newer versions of packages, you can conda (preferred) or pip install them into your active Conda environment, and they will be imported from there.
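For example, from a Jupyter terminal with your environment active (the package name here is only a placeholder):
conda install some-package
or, if it is not available from Conda:
pip install some-package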
Using multiple Conda environments
You can create as many Conda environments as you might need, but you can only run one Jupyter Notebook Server at a time. This means that you can only use one Conda environment in a Jupyter Notebook Server at a time. To use a Jupyter Notebook Server with a different Conda environment, you can stop your Jupyter Notebook Server from the JupyterHub Control Panel, and start a new server and select a different Conda environment for it to use.
These Conda environments may also be used outside of Jupyter on the CLI.
See Analytics/Systems/Anaconda for more information.
Troubleshooting
pip fails to install a newer version of a package
If you use pip to install a package into your conda environment that already exists in the base anaconda-wmf environment, you might get an error like:
Attempting uninstall: wmfdata
Found existing installation: wmfdata 1.0.4
Uninstalling wmfdata-1.0.4:
ERROR: Could not install packages due to an EnvironmentError: [Errno 30] Read-only file system: 'WHEEL'
To work around this, tell pip to --ignore-installed when running pip install, like:
pip install --ignore-installed --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release
See also: Analytics/Systems/Anaconda#Installing_packages_into_your_user_conda_environment
Trouble installing R packages
See Analytics/Systems/Anaconda#R_support
Browser disconnects
If your browser session disconnects from the kernel on the server (if, for example, your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel, but no further display output for that work (like print() commands to log progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).
My Python notebook will not start
Your IPython configuration may be broken. Try deleting your ~/.ipython directory (you'll lose any configurations you've made or extensions you've installed, but it won't affect your notebooks, files, or Python packages).
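For example, renaming the directory rather than deleting it keeps a backup you can restore later:
mv ~/.ipython ~/.ipython.bak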
My kernel restarts when I run a large query
It may be that your Jupyter Notebook Server ran out of memory and the operating system's out-of-memory killer decided to kill your kernel to cope with the situation. You won't get any notification that this has happened other than the notebook restarting, but you can assess the state of the memory on the notebook server by checking its host overview dashboard in Grafana (host-overview dashboard) or by using the command line to see which processes are using the most memory (with ps aux --sort -rss | head or similar).
Viewing Jupyter Notebook Server logs
JupyterHub logs are viewable by normal users in Kibana.
A dashboard has been created named JupyterHub and this is also linked from the Home Dashboard.
At present the logs are not split per user, but we are working to make this possible.
They are no longer written by default to /var/log/syslog, but they are retained on the host in the systemd journal.
You might need to see JupyterHub logs to troubleshoot login issues or resource issues affecting the cluster.
An individual user's notebook server log can be examined with the following command:
sudo journalctl -f -u jupyter-$USERNAME-singleuser.service
Viewing JupyterHub logs
TODO: Make this work for regular users!
You might need to see JupyterHub logs to troubleshoot login issues:
sudo journalctl -f -u jupyterhub
Tips
Analytics/Systems/Jupyter/Tips