SWAP

SWAP (the Simple Wikimedia Analytics Platform, previously known as PAWS Internal) is a Jupyter notebook service for analyzing our private data sources. To access it, you need production data access.

It is similar to the public PAWS infrastructure that lives on the Wikimedia Cloud, but uses completely different infrastructure and configuration. For an introduction to notebooks and why they're important, see PAWS/Introduction.

Access and infrastructure

You will need production shell access (ask for the "researchers", "analytics-privatedata-users", or "statistics-privatedata-users" group; SWAP piggybacks on the Analytics cluster's data access rules, and any of these three groups should work), with your SSH configured correctly (see also the Discovery team's notes).

You'll also need your developer account to be added to an LDAP group, either wmf or nda. For WMF employees, this should be done when production shell access is granted (see these notes).

Once you have this access, you can connect to one of the two notebook hosts: notebook1003 or notebook1004. They're identical, but you may want to pick the one where less memory is being used. To see this, you can check the host overview dashboard in Grafana (notebook1003 or notebook1004).

To connect, open an SSH tunnel by running one of the following commands in a terminal:

ssh -N notebook1003.eqiad.wmnet -L 8000:127.0.0.1:8000
ssh -N notebook1004.eqiad.wmnet -L 8000:127.0.0.1:8000

Then, open http://localhost:8000 in your browser and log in with your developer account (using your shell username rather than your Wikitech username).

JupyterLab

JupyterLab is the new Jupyter notebook interface, currently in beta. It is installed and usable, but it is not the default. To use it, navigate to http://localhost:8000/user/<username>/lab (replace <username> with your username).

Known issues

If your browser session disconnects from the kernel on the server (for example, because your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel. However, no further display output from that work (such as print() calls used to log progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).

If you delete files through the Jupyter UI (as opposed to using the rm command in a terminal or SSH session), they are moved to a hidden trash directory. There is no way to empty that directory, list its contents, or check its size through the Jupyter UI, so use the following commands to check its disk usage and empty it:

# Disk usage:
du -hs ~/.local/share/Trash

# Permanently delete the trashed files:
rm -rf ~/.local/share/Trash/*

Updating

JupyterLab is being updated frequently, and if the installed version on SWAP is lagging behind, it's possible to update your own version.

  1. Launch a Jupyter terminal and update the package by running pip install --upgrade jupyterlab.
  2. Restart your Jupyter server: go to the classic interface (/user/YOUR-USERNAME/tree), click on "control panel" in the top right, click "stop my server", and then click "my server" on the resulting page.

Querying data

If you have permissions to access analytics datasets in production, you should be able to do so from SWAP as well.

sql_magic

The sql_magic extension is a convenient option: it works with Hive (via MapReduce), Hive via Spark, and other SQL engines.

With Hive (MapReduce):

!pip install sql_magic
%load_ext sql_magic

from pyhive import hive
hive_conn = hive.connect('an-coord1001.eqiad.wmnet', 10000)  # HiveServer2 host and port

%config SQL.conn_name = 'hive_conn'


%%read_sql revision_text_1k
SELECT page_title, revision_text FROM wmf.mediawiki_wikitext_history WHERE snapshot="2019-07" AND wiki_db="testwiki" AND revision_text_bytes > 1000 LIMIT 10

Query started at 03:03:46 PM UTC; Query executed in 0.02 m

    page_title  revision_text
0   User:Rutilant/New.js    \n// Only add edit count button on user pages\...
1   MediaWiki:Gadget-dropdown-menus.js  // <nowiki>\n/********************************...
2   Wikipedia Signpost  <noinclude>{{pp-semi-indef}}{{pp-move-indef}}<...
3   User:Wiki ViewStats/Template/Pie Chart/Slice    <includeonly><div class="transborder" style="p...
4   Wikipedia Signpost  <!---Any changes to this page should be accomp...
5   User:Meetup     {{Communication}}{{Meetup}}\nMeetings of users...
6   User:Absconditus/cps.js     /*cps (Боевой Патрульный Самокат)  автор (пар...
7   Module:Pagetype     ----------------------------------------------...
8   Template:4x4 type square/T384   <noinclude>\n[[category:4x4 type square templa...
9   User:Daedalus969/Progress Bar   <includeonly><div style="position:relative"><d...

With Hive via Spark:

%load_ext sql_magic

import findspark, os
os.environ['SPARK_HOME'] = '/usr/lib/spark2';
findspark.init()
import pyspark
import pyspark.sql
conf = pyspark.SparkConf()  # Use master yarn here if you are going to query large datasets.
sc = pyspark.SparkContext(conf=conf)
spark_hive = pyspark.sql.HiveContext(sc)

%config SQL.conn_name = 'spark_hive'

%%read_sql df2
SELECT page_title FROM wmf.pageview_hourly WHERE year=2017 AND month=1 AND day=1 AND hour=0 LIMIT 10
Query started at 08:04:48 PM UTC; Query executed in 0.12 m

0	User:64.255.164.10
1	Special:Log/!_!_!_!_!_!_!_!_!_!_!
2	User:Akhil_0950
3	User:82.52.37.150
4	User:Daniel
5	5_рашәара
6	Ажьырныҳәа_5
7	Алахәыла:ChuispastonBot
8	Алахәыла_ахцәажәара:Oshwah
9	Алахәыла_ахцәажәара:Untifler

MySQL

See: https://gist.github.com/madhuvishy/d349c472de1279568534e4fb2b5bf505
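
In outline, the approach in that gist looks something like the following pymysql sketch. The host name, credentials file path, and database below are assumptions for illustration only; defer to the gist for the current values.

import pymysql

# Assumed host and credentials file; check the gist above for the current values.
conn = pymysql.connect(host='analytics-store.eqiad.wmnet',
                       read_default_file='/etc/mysql/conf.d/research-client.cnf',
                       db='enwiki',
                       charset='utf8')

with conn.cursor() as cursor:
    cursor.execute('SELECT page_title FROM page LIMIT 10')
    for (page_title,) in cursor.fetchall():
        print(page_title)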

Hive

Both pyhive and impyla are installed in all user virtualenvs by default. When connecting to Hive, use host an-coord1001.eqiad.wmnet on port 10000.

pyhive example:

from pyhive import hive  # NOTE: this import currently fails; the cause hasn't been investigated yet
cursor = hive.connect('an-coord1001.eqiad.wmnet', 10000).cursor()
cursor.execute('SELECT page_title FROM wmf.pageview_hourly WHERE year=2017 and month=1 and day=1 and hour=0 LIMIT 10')
cursor.description
[('page_title', 'STRING_TYPE', None, None, None, None, True)]
cursor.fetchall()
# ...

Problems have been reported with getting impyla to work. The following workaround may help:

# cf. https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Python-Error-TSaslClientTransport-object-has-no-attribute-trans/td-p/58033
!pip uninstall -y thrift
!pip uninstall -y impyla
!pip uninstall -y sasl
!pip install thrift-sasl==0.2.1
!pip install thrift==0.9.3
!pip install impyla==0.13.8
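
Once the workaround is applied, a minimal impyla sketch along these lines should work; the auth_mechanism value is an assumption and may need adjusting for your connection:

from impala.dbapi import connect

# Assumed connection settings; 'PLAIN' SASL authentication is a guess that may need changing.
conn = connect(host='an-coord1001.eqiad.wmnet', port=10000, auth_mechanism='PLAIN')
cursor = conn.cursor()
cursor.execute('SELECT page_title FROM wmf.pageview_hourly '
               'WHERE year=2017 AND month=1 AND day=1 AND hour=0 LIMIT 10')
print(cursor.fetchall())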

Spark

Spark notebooks are available via Apache Toree and IPython (for PySpark). Kernels are provided for local and YARN-based Spark notebooks in Scala, Python, SQL, and R.

Note that once you open a Python Spark notebook, you don't need to create a SparkSession or SparkContext; one is automatically provided as the variable spark.
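
For example, you can query Hive directly through that session. A minimal sketch, using the same table as the examples above:

# `spark` is the SparkSession that the PySpark kernels provide automatically.
df = spark.sql("""
    SELECT page_title
    FROM wmf.pageview_hourly
    WHERE year=2017 AND month=1 AND day=1 AND hour=0
    LIMIT 10
""")
df.show()                  # print the rows
pandas_df = df.toPandas()  # or pull them into a pandas DataFrame for local analysis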

Note the 'large' YARN Spark notebook options. These launch Spark in YARN with --executor-memory 4g and a memoryOverhead of 2g; see this ticket for more info. If you need to launch Spark with different settings, refer to Custom Spark Kernels below.

Also, pyspark != Python 3! Your Python 3 notebook runs out of your local virtualenv on the Jupyter notebook host, so you can !pip install things there. However, Spark is expected to run distributed, and your local virtualenv will not be available on the remote worker nodes. If you need Python packages installed to work with pyspark, you'll need to submit a Phabricator request for them.

Spark with Brunel

Brunel is a visualization library that works well with Spark and Scala in a Jupyter Notebook. We deploy a Brunel jar with SWAP. You just need to add it as a magic jar:

%AddJar -magic file:///srv/jupyterhub/deploy/spark-kernel-brunel-all-2.6.jar

import org.apache.spark.sql.DataFrame
val seq = (0 until 3).map(i => (i, i)).toSeq
val df = spark.sqlContext.createDataFrame(seq)

%%brunel data('df') x(_1) y(_2) bar  style("fill:red") filter(_1:2) :: width=300, height=300
[Screenshot: the resulting Brunel bar chart]

See https://github.com/Brunel-Visualization/Brunel/tree/master/spark-kernel/examples for more Brunel examples.

Custom Spark Kernels

The Spark kernels that ship with SWAP all have hardcoded Spark options. There isn't a good way to make a Jupyter Notebook prompt the user for settings before the Notebook is launched, so we have to hardcode the options given to the Spark shell. If you need a Spark Notebook (or any kind of Notebook) with custom settings, you'll need to create a new kernelspec in your user's Jupyter kernels directory. The easiest way to do this is to install a new kernelspec from an existing one, and then edit the kernel.json file.

# Activate your Jupyter virtualenv (if it isn't already activated):
[@notebook1004:/home/otto] $ . ./venv/bin/activate

# Use jupyter kernelspec install to copy a global kernelspec into your user kernel directory, changing the kernel name on the way.
[@notebook1004:/home/otto] [venv] $ jupyter kernelspec install --user --name 'spark_yarn_pyspark_otto1' /usr/local/share/jupyter/kernels/spark_yarn_pyspark

# Edit the kernel.json file to change your settings.  Here, we change the display name and the --executor-memory:
[@notebook1004:/home/otto] [venv] $ vim ~/.local/share/jupyter/kernels/spark_yarn_pyspark_otto1/kernel.json
{
  "argv": [
    "/usr/bin/python3",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "language": "python",
  "display_name": "PySpark - YARN - 8g Executor (otto custom)",
  "env": {
    "PYSPARK_PYTHON": "/usr/bin/python3",
    "SPARK_HOME": "/usr/lib/spark2",
    "PYTHONPATH": "/usr/lib/spark2/python/lib/py4j-src.zip:/usr/lib/spark2/python",
    "PYTHONSTARTUP": "/usr/lib/spark2/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell --conf spark.dynamicAllocation.maxExecutors=128 --executor-memory 8g"
  }
}

Once done, refresh Jupyter in your browser, and you should see the newly created Notebook Kernel show up for use.

Custom Python kernels

Custom Python kernels through virtual environments are not supported in SWAP. However, you can still use them by manually running your own notebooks and connecting them to a Spark job. See Analytics/Systems/Cluster/Spark.

Miscellaneous use

Sharing notebooks

There is currently no built-in functionality to view (Phab:T156980) or share (Phab:T156934) other users' notebooks in real time, but it is possible to copy notebooks and files directly on the server by clicking 'New' -> 'Terminal' (in the root folder in the browser window) and using the cp command.

GitHub

It's also possible to track your notebooks in Git and push them to GitHub, which will display them fully rendered on its website. If you want to do this, you should connect to GitHub using HTTPS; the SRE team recommends against using SSH because of the risk that other users could access your SSH keys (which, if combined with a production SSH key reused for GitHub, could result in a serious security breach).

With HTTPS, by default you'll have to type in your GitHub username and password every time you push. You can avoid this by adding the following (from this Superuser answer) to ~/.gitconfig:

[url "https://YOURUSERNAME@github.com"]
    insteadOf = https://github.com

[credential]
    helper = cache --timeout=28800

This will automatically apply your GitHub username to any HTTPS access and cache the password you enter for 8 hours (28,800 seconds).

HTML files

You can also export your notebook as an HTML file and publish it at analytics.wikimedia.org/datasets by placing it in the /srv/published-datasets folder.

You will need to export the notebook from within Jupyter (using File > Export Notebook As... in the JupyterLab interface or the jupyter nbconvert --to html command), but for security reasons you will not be able to move it to /srv/published-datasets from within Jupyter. Instead, SSH directly into the notebook host and move the notebook using the command line.

/srv/published-datasets is automatically synced to the website every 15 minutes, but you can run the sync manually by running the published-datasets-sync command.

Python virtual environment

All your Python notebooks will live inside a virtual environment automatically created in ~/venv. If you want to enter it directly from the terminal on your computer, SSH into the SWAP server and type source ~/venv/bin/activate.

HTTP requests

Allow HTTP requests (for example, so your notebook can clone a repository) by adding the appropriate proxy export lines to your .bash_profile in the notebook's terminal.
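
Alternatively, here is a sketch of setting the proxy for a single notebook session from Python; the proxy host and port are assumptions, so check the current HTTP proxy documentation before relying on them:

import os
import urllib.request

# Assumed production web proxy address; verify the host and port before relying on this.
os.environ['http_proxy'] = 'http://webproxy.eqiad.wmnet:8080'
os.environ['https_proxy'] = 'http://webproxy.eqiad.wmnet:8080'

# urllib picks the proxy settings up from the environment by default.
print(urllib.request.urlopen('https://www.wikimedia.org').status)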

Sending emails from within a notebook

To send out an email from a Python notebook (e.g. as a notification that a long-running query or calculation has completed), one can use the following code:

In[1]:
# cf. https://phabricator.wikimedia.org/T168103#4635031 :
notebookservername = !hostname
notebookserverdomain =  notebookservername[0]+'.eqiad.wmnet'
username = !whoami

def send_email(subject, body, to_email = username[0]+'@wikimedia.org', from_email = username[0]+'@'+notebookserverdomain):
    import smtplib
    smtp = smtplib.SMTP("localhost")
    message = """From: <{}>
To: <{}>
Subject: {}

{}
""".format(from_email, to_email, subject, body)
    smtp.sendmail(from_email, [to_email], message)

# example uses: 
# send_email('SWAP notebook ready (n/t)', '')
# send_email('SWAP email test', 'test body', 'yourname@wikimedia.org', 'yourname@wikimedia.org')

(Invoking the standard mail client via the shell, i.e. !mailx or !heirloom-mailx, fails for some reason, see phab:T168103.)

I want Python 3+

https://phabricator.wikimedia.org/T212591

See also

Notes on how to use the prototype:

Troubleshooting

My Python kernel will not start

Your IPython configuration may be broken. Try deleting your ~/.ipython directory (you'll lose any configurations you've made or extensions you've installed, but it won't affect your notebooks, files, or Python packages).

My kernel restarts when I run a large query

It may be that the notebook server ran out of memory and the operating system's out-of-memory killer decided to kill your kernel to cope with the situation. You won't get any notification that this has happened other than the notebook restarting. You can assess the state of the server's memory on its host overview dashboard in Grafana (notebook1003 or notebook1004), or on the command line by seeing which processes are using the most memory (with ps aux --sort -rss | head or similar).

Administration

Deployment

JupyterHub is built and installed from the analytics/deploy/jupyterhub repository and the jupyterhub Puppet module. Updating the deployment takes a few steps:

  1. To upgrade or add new packages, edit the frozen-requirements.txt file to specify exactly which packages you want.
  2. On a non-production machine (in Cloud VPS or MediaWiki-Vagrant), run the build_wheels.sh script. This will create frozen wheels in the artifacts/ directory. Commit and merge the changes.
  3. To deploy the changes, git pull in /srv/jupyterhub/deploy on each of the notebook servers and run the create_virtualenv.sh script, which builds a new virtualenv from the updated frozen wheel artifacts.
  4. Run service jupyterhub restart to have JupyterHub run from the newly built /srv/jupyterhub/venv.

Spark Integration

Spark integration is handled by global custom kernels installed into /usr/local/share/jupyter/kernels. The pyspark kernels are custom IPython kernels that load pyspark; all other Spark kernels use Apache Toree.

These kernels are installed by the create_virtualenv.sh script that should be run during deployment. If you need to update them, modify the kernel.json files in the jupyterhub-deploy repository.

User virtualenvs

Upon first login, each user will automatically have a new python3 virtualenv created at $HOME/venv. The users themselves can pip install packages into this virtualenv as part of regular Jupyter notebook usage. If you need to update the automatically installed packages in user virtualenvs that have already been created, you'll have to do so manually.

Updating user virtualenvs

If you upgrade JupyterHub or any of the packages listed in frozen-requirements.txt, you might want to upgrade the installed versions of these packages in each user's virtualenv too. To do so, you want to rerun the pip install command that was used during the virtualenv creation. As of 2018-03, this was pip install --upgrade --no-index --ignore-installed --find-links=/srv/jupyterhub/deploy/artifacts/stretch/wheels --requirement=/srv/jupyterhub/deploy/frozen-requirements.txt. To do this for all users:

cd /srv/jupyterhub/deploy
wheels_path=/srv/jupyterhub/deploy/artifacts/stretch/wheels
for u in $(getent passwd | awk -F ':' '{print $1}'); do
    venv=/home/$u/venv
    if [ -d $venv ]; then
        echo "Updating $venv"
        sudo -H -u $u $venv/bin/pip install --upgrade --no-index --force-reinstall --find-links=$wheels_path --requirement=/srv/jupyterhub/deploy/frozen-requirements.txt
    fi
done

Resetting user virtualenvs

Sometimes someone may want to totally recreate their SWAP virtualenv from scratch. This can be done by the user themselves! The steps are as follows:

# 1. Stop your Jupyter Notebook server from the JupyterHub UI.

# 2. Move your old venv out of the way (or just delete it)
mv $HOME/venv $HOME/venv-old-$(date +%s)

# 3. create a new empty venv
python3 -m venv --system-site-packages $HOME/venv

# 4. Reinstall the jupyter venv
cd /srv/jupyterhub/deploy
$HOME/venv/bin/pip install --upgrade --no-index --force-reinstall --find-links=/srv/jupyterhub/deploy/artifacts/stretch/wheels --requirement=/srv/jupyterhub/deploy/frozen-requirements.txt

# 5. Login to JupyterHub and start your Jupyter Notebook server.

Machine ran out of space

As noted in the known issues section, files deleted through the Jupyter UI are moved to .local/share/Trash in the user's home directory. If users are reporting that they can't save their notebooks because the machine has run out of space, chances are the trash directories need to be emptied.