PAWS/Admin

Tracked in Phabricator
Task T211096

Introduction

PAWS is a Jupyterhub deployment that runs in the PAWS Cloud VPS project. The main Jupyterhub login is accessible at https://hub-paws.wmcloud.org/hub/login; it is a public service that users authenticate to via Wikimedia OAuth. More end-user info is at PAWS. The PAWS code can be found on GitHub. Besides a simple Jupyterhub deployment, PAWS also contains easy access methods for the wiki replicas, the wikis themselves via the OAuth grant and pywikibot.

Kubernetes cluster

Deployment

The PAWS repository is at https://github.com/toolforge/paws. It should be cloned to a paws bastion system (bastion.paws.eqiad1.wikimedia.cloud). Then the git-crypt key needs to be used to unlock the secrets.yaml file. See one of the PAWS admins if you should have access to this key.
PAWS is built via GitHub Actions triggered by a PR. GitHub Actions will also update the values.yaml to match any new container that is built.
Once everything is checked out and unlocked, run:

bash deploy.sh <eqiad1|codfw1dev>
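The clone, unlock and deploy steps above could be wrapped roughly as follows. This is a minimal sketch, not the project's own tooling: the function names valid_datacenter and paws_deploy are hypothetical, and it assumes the git-crypt key was handed over as a symmetric key file, which git-crypt unlock accepts as an argument.

```shell
# Hypothetical helper: accept only the two deployments named above.
valid_datacenter() {
  case "$1" in
    eqiad1|codfw1dev) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: clone, unlock secrets, deploy.
# $1 = datacenter, $2 = absolute path to the git-crypt key file
# (assumed to be a symmetric key file).
paws_deploy() {
  if ! valid_datacenter "$1"; then
    echo "usage: paws_deploy <eqiad1|codfw1dev> <git-crypt key file>" >&2
    return 2
  fi
  if [ ! -f "$2" ]; then
    echo "git-crypt key file not found: $2" >&2
    return 3
  fi
  git clone https://github.com/toolforge/paws paws || return 1
  # Subshell so the caller's working directory is left unchanged.
  (
    cd paws || exit 1
    git-crypt unlock "$2" || exit 1
    bash deploy.sh "$1"
  )
}
```

The argument checks run before anything touches the network, so a mistyped datacenter name fails fast instead of half-way through a deploy.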

Blue Green Deployment

Make a branch for the upgrade in git and modify the tofu configuration by copying the current version file, for example:

cp 123.tf 124.tf

Edit the new file so that the names of the tofu and openstack resources follow the new version:

-resource "openstack_containerinfra_clustertemplate_v1" "template_123" {
-  name                  = "paws${var.name[var.datacenter]}-123"
+resource "openstack_containerinfra_clustertemplate_v1" "template_124" {
+  name                  = "paws${var.name[var.datacenter]}-124"
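The copy and rename can be done in one pass. The sketch below is an illustration only: bump_tofu_version is a hypothetical helper, and it assumes the version number appears in the file only as a suffix after "_" or "-" (as in template_123 and ...-123 above). The result should still be reviewed with diff before deploying.

```shell
# Hypothetical helper: write NEW.tf as a copy of OLD.tf with the version
# suffix bumped wherever it follows "_" or "-" (e.g. template_123, ...-123).
# $1 = old version, $2 = new version; run in the directory holding the files.
bump_tofu_version() {
  old="$1"
  new="$2"
  [ -f "${old}.tf" ] || { echo "missing ${old}.tf" >&2; return 1; }
  [ ! -e "${new}.tf" ] || { echo "${new}.tf already exists" >&2; return 1; }
  # First expression: version followed by a non-digit character.
  # Second expression: version at the very end of a line.
  sed -e "s/\([_-]\)${old}\([^0-9]\)/\1${new}\2/g" \
      -e "s/\([_-]\)${old}\$/\1${new}/" \
      "${old}.tf" > "${new}.tf"
}
```

For the example above this would be called as bump_tofu_version 123 124, followed by diff 123.tf 124.tf to confirm that only the expected names changed.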

Edit the original copy (123.tf in this example) and remove the following resource:

resource "local_file" "kube_config"

We only want one of those.

From bastion.paws.eqiad1.wikimedia.cloud, in a paws repo directory, run:

bash deploy.sh <eqiad1|codfw1dev>

You should see mostly creates, and one replace (the kube_config).

Update the web proxies in Horizon (DNS > Web Proxies): point hub-paws and public-paws to the first node of the new cluster.

At this point the new cluster is running and available to the world. Log in and make sure everything is working as expected. If the new cluster does not appear to be working, revert the web proxies to the previous cluster. When you are satisfied, merge the PR.
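A quick smoke test of the two public endpoints can complement the manual login. This is a sketch under assumptions: check_paws_endpoints is a hypothetical helper, and it assumes both URLs answer with HTTP 200 when healthy. It only shows what the web proxies currently serve and does not replace logging in and starting a notebook.

```shell
# Hypothetical helper: print the final HTTP status of each public endpoint.
# Returns non-zero if any endpoint does not answer with 200 (assumed healthy).
check_paws_endpoints() {
  failed=0
  for url in \
    https://hub-paws.wmcloud.org/hub/login \
    https://public-paws.wmcloud.org/
  do
    # -L follows redirects; %{http_code} is the status of the last response.
    code=$(curl -s -L -o /dev/null --max-time 20 -w '%{http_code}' "$url") || true
    echo "$code $url"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}
```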

After the PR is merged and the new cluster appears fine, perhaps a few days later, the old cluster can be removed. Following the above example this would be performed by opening a new branch and removing the old magnum definition:

rm 123.tf

From bastion.paws.eqiad1.wikimedia.cloud, in a paws repo directory, run:

git pull ; git checkout <your branch>
bash deploy.sh <eqiad1|codfw1dev>

At this point you should see only the old cluster template and the old cluster being destroyed. Nothing should be created.

Merge the PR.

PAWS updates

A blue green deployment will always work, but is not always necessary. In particular, when updating the singleuser image, one can run helm directly from the root of the git repo:

helm upgrade paws --namespace prod ./paws -f paws/secrets.yaml -f paws/production.yaml --timeout=50m
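Since the helm command reads paws/secrets.yaml, it can help to confirm the file has been unlocked first. The sketch below is illustrative: both function names are hypothetical, and it relies on the assumption that git-crypt encrypted files begin with a NUL byte followed by the string GITCRYPT.

```shell
# Assumption: git-crypt stores encrypted files with a header of a NUL byte
# followed by "GITCRYPT"; a file that still starts that way is not unlocked.
is_git_crypt_locked() {
  [ "$(head -c 9 "$1" | tr -d '\000')" = "GITCRYPT" ]
}

# Hypothetical wrapper around the helm command above; run from the repo root.
paws_helm_upgrade() {
  if is_git_crypt_locked paws/secrets.yaml; then
    echo "paws/secrets.yaml is still encrypted; unlock it first" >&2
    return 1
  fi
  helm upgrade paws --namespace prod ./paws \
    -f paws/secrets.yaml -f paws/production.yaml --timeout=50m
}
```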

TODO: update chart version on chart changes, making updates run identically to deploys.

Tracked in Phabricator
Task T365725

Troubleshooting

Magnum relies on some containers from Dockerhub. Dockerhub rate-limits anonymous pulls after 100 in a six-hour window. If your containers are not deploying and complain of taint problems, check the kube-system containers:

kubectl get all -n kube-system
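To narrow that listing down to the pods that need attention, something like the following can be used. It is a sketch: unhealthy_system_pods is a hypothetical helper, and it assumes the default kubectl get pods column layout, where STATUS is the third column.

```shell
# Hypothetical helper: print kube-system pods whose STATUS column is neither
# Running nor Completed (e.g. CrashLoopBackOff, ImagePullBackOff, Pending).
unhealthy_system_pods() {
  kubectl get pods -n kube-system --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

An empty result means every pod in kube-system reports Running or Completed; it does not by itself prove the pods are ready.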

If containers are crash looping, you will likely have to add a docker credential to them (you can check using kubectl describe pod <pod name> -n kube-system):

docker login
kubectl create secret generic regcred --from-file=.dockerconfigjson=<path to your docker/config.json> --type=kubernetes.io/dockerconfigjson -n kube-system

Add the following under spec.template.spec, using the edit commands below:

      imagePullSecrets:
      - name: regcred

kubectl edit -n kube-system daemonset.apps/openstack-cloud-controller-manager
kubectl edit -n kube-system deployment.apps/kubernetes-dashboard
kubectl edit -n kube-system deployment.apps/dashboard-metrics-scraper
kubectl edit -n kube-system daemonset.apps/k8s-keystone-auth
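As a non-interactive alternative to the four edits, the same change can be applied with kubectl patch. This is a sketch rather than the documented procedure: add_regcred_pull_secret is a hypothetical helper, and it assumes the secret is named regcred as above. Patching the pod template triggers a rollout of each workload.

```shell
# Hypothetical helper: add the regcred pull secret to the four workloads
# listed above, using a strategic merge patch instead of an interactive edit.
add_regcred_pull_secret() {
  patch='{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
  for target in \
    "daemonset openstack-cloud-controller-manager" \
    "deployment kubernetes-dashboard" \
    "deployment dashboard-metrics-scraper" \
    "daemonset k8s-keystone-auth"
  do
    kind=${target% *}   # text before the space
    name=${target#* }   # text after the space
    kubectl patch "$kind" "$name" -n kube-system --type=strategic -p "$patch" || return 1
  done
}
```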

Upgrading k8s

Upgrading the cluster should be performed the same way as deploying the cluster.

Upgrading z2jh

Upgrading Zero to JupyterHub (z2jh) is done the same way as other deploys: use deploy.sh. The paws-hub image needs to be updated to match the image referenced in the z2jh chart.

Architecture

The core of PAWS runs on OpenStack Magnum, that is, Kubernetes as a service (k8saas). In concept it should be runnable on any k8s cluster, so long as it has access to NFS and the replicas.

Helm

Helm 3 is used to deploy kubernetes applications on the cluster. As this is helm 3, there is no tiller and RBAC affects what you can do.

Modify worker count

Edit the tofu configuration in the code to add or remove workers as desired. Run the deploy above.

General notes

To see the status of the k8s control plane pods (running coredns, kube-proxy, etcd, kube-apiserver, kube-controller-manager), run kubectl --namespace=kube-system get pod -o wide.
Prometheus stats and metrics-server are deployed in the metrics namespace during cluster build via kubectl apply -f $yaml-file, just like in the Toolforge deploy documentation.
Because of pod security policies in place, all init containers have been removed from the paws-project version of things. Privileged containers cannot be run inside the prod namespace.

Jupyterhub deployment

Jupyterhub & PAWS Components

Jupyterhub is a set of systems deployed together that provide Jupyter notebook servers per user. The three main subsystems for Jupyterhub are the Hub, the Proxy, and the Single-User Notebook Server. A good overview of these systems is available at http://jupyterhub.readthedocs.io/en/latest/reference/technical-overview.html.

PAWS is a Jupyterhub deployment (Hub, Proxy, Single-User Notebook Server) with some added bells and whistles. Some additional PAWS-specific pods in our deployment are:

nbserve and render: nbserve is an nginx proxy that runs in the cluster at https://public-paws.wmcloud.org and handles URL rewriting for public URLs, mapping numerical IDs to wiki usernames (so we can have URLs like https://public-paws.wmcloud.org/User:BDavis_(WMF)/pip-colorama.ipynb). render handles the actual rendering of the ipynb notebook as a static page. Both images are essential to how the publishing of PAWS notebooks works.

PAWS also includes customized versions of some Jupyterhub images:

singleuser: Since this is the environment for end users, there is a fair bit going on here. Our image is a replacement of the upstream one. We set the correct UID and directory. We install the jupyterhub/lab code directly from pip, along with PyWikiBot; ipynb-paws, a small library that allows importing a notebook like a python package, along the lines of import paws.$username.$notebooks_name; and code from https://github.com/toolforge/nbpawspublic to add a public link button. There are other customizations because this is a great surface for doing them. The general goal is to get a notebook up and running for use on wikis as fast as possible.
paws-hub: We build upon the upstream Jupyterhub hub image just a touch, adding bits that respect more of the UID settings and adding in a custom culling script. The code for doing OAuth is actually inserted in the helm chart instead.

The other custom image is a deploy-hook, which is undergoing some renovations before it is redeployed in the cluster.

Common administrative actions

Some common administrative actions.

Deleting user data in case of spam or credential leaks

If a notebook or file hosted on PAWS needs an admin to remove it immediately (vs. asking a user to delete it), you can access all user data via the NFS share that is mounted locally on all k8s nodes.

SSH to the nfs node paws-nfs-1.paws.eqiad1.wikimedia.cloud.
Become root with sudo su - tools.paws
cd /srv/paws/project/paws/userhomes (this is the top level of user homes and PAWS public pages).
cd $wiki_user-id, where $wiki_user-id is the numeric id of the user, not the text username.
Remove the offending file with rm as needed.
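Because the directory name is the numeric user id, a small guard can prevent operating on the wrong path. This is a sketch: paws_user_home is a hypothetical helper that only builds the path from the steps above and rejects anything that is not purely digits.

```shell
# Hypothetical helper: print the home directory for a numeric wiki user id.
# Rejects anything that is not purely digits, so a text username or a stray
# "../" can never be turned into a path.
paws_user_home() {
  case "$1" in
    ''|*[!0-9]*)
      echo "expected a numeric user id, got: $1" >&2
      return 1
      ;;
  esac
  echo "/srv/paws/project/paws/userhomes/$1"
}
```

It could be used as ls -la "$(paws_user_home <numeric id>)" to inspect the directory before removing anything with rm.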

Stop a running workload in PAWS

Useful if you want to stop a crypto miner or similar.

You need to be an admin inside PAWS.

Log in to PAWS, likely https://hub-paws.wmcloud.org/hub/home
Click the Admin button in the top menu. If you don't have the button, you aren't an admin.
Search in the list for the workload you want to stop
Click the Stop server button

Bonus points if you check the user against https://meta.wikimedia.org/wiki/Special:CentralAuth for additional hints as to whether the user is a bad actor.

Prevent a user from using PAWS

As of this writing, the only method we know about is to ask a Steward to globally block the user, which breaks the OAuth login that PAWS uses.

TODO: link is probably: https://meta.wikimedia.org/wiki/Steward_requests/Global