
Portal:Toolforge/Admin

From Wikitech


Toolforge is a Platform as a Service, offering easy to use bot, web, and compute resources for the Wikimedia community.

Figure: Toolforge system overview with high level components

The main user interfaces are:

toolsadmin (Striker)
A web interface where users register for access, create and delete tools, and manage their tools' metadata and access permissions.
Bastion + toolforge cli
SSH access to the Toolforge network, and a client to manage tool resources (continuous jobs, scheduled jobs, webservices, deployments, builds, ...).
Toolforge API
An API to manage Toolforge resources.

For more details on user-facing features, see the user documentation. For details on administration and infrastructure, see the admin docs.

All pieces of Toolforge are deployed inside a Cloud VPS project (or tenant) called tools. The staging/development project is called toolsbeta.

Components

Documentation of backend components and admin procedures for Toolforge. See Help:Toolforge for user facing documentation about actually using Toolforge to run your bots and webservices.

Deploying a component

See the docs in gitlab for a list of components deployed in kubernetes and details on their deployment process.

List of components

APIs

Toolforge is moving towards an API-oriented model where client tools (such as those installed on bastions) contact the Toolforge API to make changes instead of making them directly.

See the user docs also.

Access to the toolforge API

The APIs are presented as a single aggregated endpoint through the API Gateway.

The base endpoint is https://api.svc.[project].eqiad1.wikimedia.cloud:30003. Services are routed with subpaths, for example /jobs for the Jobs API.

For authentication we currently use client certificates issued by the Kubernetes cluster internal CA via maintain-kubeusers. This will change in the future as we evolve how the APIs are accessed and used.
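As a minimal sketch of how the base endpoint and subpaths fit together, the function below builds a service URL from a project name and a subpath. The function name toolforge_api_url is hypothetical, and the certificate paths in the commented curl call are assumptions rather than something this page confirms.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the API Gateway URL for a project and a service subpath.
toolforge_api_url() {
    local project="$1"   # e.g. tools or toolsbeta
    local service="$2"   # e.g. jobs (a leading slash is tolerated)
    printf 'https://api.svc.%s.eqiad1.wikimedia.cloud:30003/%s\n' "$project" "${service#/}"
}

# Possible call with client-certificate authentication (paths are assumptions):
# curl --cert "$HOME/.toolskube/client.crt" --key "$HOME/.toolskube/client.key" \
#      "$(toolforge_api_url tools jobs)"
```

Keeping the project as a parameter makes it easy to point the same call at toolsbeta before trying it on tools.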

Administrative tasks

Admin permissions

Performing admin procedures requires having admin permissions on Toolforge. There is not a single "admin" flag, but a set of interrelated permissions you can be granted. These are described in detail in the page Toolforge roots and Toolforge admins.

Failover

Tools should be able to survive the failure of any one virt* node. Some items may need manual failover:

Static webserver

This is a stateless simple nginx http server. Simply switch the floating IP from one tools-static-* to the other to switch over. Recovery is also equally trivial - just bring the machine back up and make sure puppet is ok.

Checker service

This is the service that Icinga hits to check status of several services. It's totally stateless.

See Portal:Toolforge/Admin/Toolschecker

Prometheus

See Portal:Toolforge/Admin/Prometheus#Failover.

tools-service

Service nodes run the Toolforge internal aptly service, to serve .deb packages as a repository for all the other nodes.

Command orchestration

Toolforge and Toolsbeta both have a local cumin server (cloudcumin* and {tools,toolsbeta}-cumin*).

Other tasks

Logging in as root

For normal login root access see Toolforge roots and Toolforge admins.

If the normal login does not work, for example due to an LDAP failure, administrators can also log in directly as root. To prepare for that occasion, generate a separate key with ssh-keygen and add an entry to the passwords::root::extra_keys hash in Horizon's 'Project Puppet' section, with your shell username as key and your public key as value. Wait a Puppet cycle for your key to be added to the root accounts, then add to your ~/.ssh/config:

# Use different identity for Tools root.
Match host *.tools.eqiad1.wikimedia.cloud user root
     IdentityFile ~/.ssh/your_secret_root_key

The code that reads passwords::root::extra_keys is in labs/private:modules/passwords/manifests/init.pp.

Disabling all ssh logins except root

Useful for dealing with security critical situations. Just touch /etc/nologin and PAM will prevent any and all non-root logins.
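A small sketch of toggling the flag file, assuming the standard pam_nologin behaviour of showing the file's contents to users whose login is refused. The function name set_nologin and the message text are hypothetical. The optional path argument exists only so the function can be tried against a scratch file first; writing to /etc/nologin itself requires root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create or remove the nologin flag file.
set_nologin() {
    local state="$1"
    local flag="${2:-/etc/nologin}"
    case "$state" in
        on)  echo "Logins are temporarily disabled by the Toolforge admins." > "$flag" ;;
        off) rm -f -- "$flag" ;;
        *)   echo "usage: set_nologin on|off [path]" >&2; return 2 ;;
    esac
}

# As root on the affected host:
# set_nologin on     # block non-root logins
# set_nologin off    # allow logins again
```

Remember to remove the file once the incident is over, since nothing re-enables logins automatically.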

Complaints of bastion being slow

Users are increasingly noticing slowness on tools-login due to either CPU or IOPS exhaustion caused by people running processes there instead of on Kubernetes. Here are some tips for finding the processes in need of killing:

  • Look for IOPS hogs
    • $ iotop
  • Look for abnormal processes:
    • $ ps axo user:32,pid,cmd | grep -Ev "^($USER|root|daemon|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data)" | grep -ivE 'screen|tmux|-bash|mosh-server|sshd:|/bin/bash|/bin/zsh'
    • If you see pyb.py, kill it with extreme prejudice.
  • If the rogue job is running as a tool, !log something like: !log tools.$TOOL Killed $PROC process running on tools-bastion-NN. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework for instructions on running jobs on Kubernetes.
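To complement the tips above, a rough per-user CPU summary can help narrow down which account is loading the bastion. This is only a sketch: the function name cpu_by_user is hypothetical, and the commented ps invocation assumes the procps version of ps.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum %CPU per user from "user pcpu" lines on stdin,
# printing the highest consumers first.
cpu_by_user() {
    awk '{ cpu[$1] += $2 } END { for (u in cpu) printf "%.1f %s\n", cpu[u], u }' |
        sort -rn
}

# On a bastion this could be fed with:
# ps axo user:32,pcpu --no-headers | cpu_by_user | head
```

The summary only shows where to look; the ps filter above is still needed to identify the individual processes.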

Webserver statistics

To get a look at webserver statistics, goaccess is installed on the webproxies. Usage:

goaccess --date-format="%d/%b/%Y" --log-format='%h - - [%d:%t %^] "%r" %s %b "%R" "%u"' -q -f/var/log/nginx/access.log

Interactive key bindings are documented on the man page. HTML output is supported by piping to a file. Note that nginx logs are rotated (twice?) daily, so there is only very recent data available.

Banning an IP from Toolforge

On Hiera:Tools, add the IP to the list of dynamicproxy::banned_ips, then force a puppet run on the webproxies. Add a note to Help:Toolforge/Banned explaining why. The user will get a message like [1].

Deploying the main web page

This website (plus the 403/500/503 error pages) is hosted under tools.admin. To deploy:

$ become admin
$ cd tool-admin-web
$ git pull

Regenerate replica.my.cnf

This requires access to the cloudcontrol host which is running maintain-dbusers, and can be done as follows:

$ ssh cloudcontrolXXXX.eqiad.wmnet
$ sudo /usr/local/sbin/maintain-dbusers delete tools.${NAME} --account-type=tool
# or
$ sudo /usr/local/sbin/maintain-dbusers delete ${USERNAME} --account-type=user

Once the account has been deleted, the maintain-dbusers service will automatically recreate the user account.
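Since recreation happens asynchronously, it can be handy to wait for the new credentials file rather than checking by hand. The sketch below polls for a non-empty file. The function name wait_for_file is hypothetical, and the path in the usage comment assumes the credentials reappear as replica.my.cnf in the tool's home directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until a file exists and is non-empty, or give up.
wait_for_file() {
    local path="$1"
    local timeout="${2:-600}"   # total seconds to wait
    local interval="${3:-10}"   # seconds between checks
    local waited=0
    while [ ! -s "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $path" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "$path is present"
}

# Assumed location for a tool account:
# wait_for_file "/data/project/${NAME}/replica.my.cnf"
```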

Debugging bad MariaDB credentials

Sometimes things go wrong and a user's replica.my.cnf credentials don't propagate everywhere. You can check the status on various servers to try to narrow down what went wrong.

The database credentials needed are in /etc/dbusers.yaml on the cloudcontrol host running maintain-dbusers.

$ ssh cloudcontrolXXXX.eqiad.wmnet

$ sudo cat /etc/dbusers.yaml
# look for the accounts-backend['password'] for the m5-master connections (user: labsdbaccounts)
# look for the labsdbs['password'] for the other connections (user: labsdbadmin)

$ CHECK_UID=u12345  # User id to check for
# Check if the user is in our meta datastore
$ mariadb -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM account WHERE mysql_username='${CHECK_UID}'\G"

# Check if all the accounts are created in the labsdb boxes from meta datastore.
$ ACCT_ID=.... # Account_id is foreign key (id from account table)
$ mariadb -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM labsdbaccounts.account_host WHERE account_id=${ACCT_ID}\G"

# Check the actual labsdbs if needed
# (double quotes around the statement so the shell expands ${CHECK_UID})
$ mariadb -h clouddbXXXX.eqiad.wmnet -u labsdbadmin -p -e "SELECT User, Password FROM mysql.user WHERE User LIKE '${CHECK_UID}';"

# Resynchronize account state on the replicas by finding missing GRANTS on each db server
$ sudo maintain-dbusers harvest-replicas

See phab:T183644 for an example of fixing a failure of automatic credential creation, caused by an old LDAP user becoming a Toolforge member while having an untracked user account on toolsdb.

Regenerate kubernetes credentials for tools (.kube/config)

With admin credentials (root on a control plane node will do), run kubectl -n tool-<toolname> delete cm maintain-kubeusers-<toolname>; it should get regenerated within minutes.
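When several tools need their credentials regenerated, the command can be generated per tool and reviewed before anything is deleted. This is a sketch only, and the function name kubeusers_reset_cmds is hypothetical. It prints the commands and does not run them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the delete command for each tool name given.
kubeusers_reset_cmds() {
    local tool
    for tool in "$@"; do
        printf 'kubectl -n tool-%s delete cm maintain-kubeusers-%s\n' "$tool" "$tool"
    done
}

# Review the output first, then run it on a host with admin credentials:
# kubeusers_reset_cmds mytool othertool
# kubeusers_reset_cmds mytool othertool | sh
```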

Adding K8S Components

See Portal:Toolforge/Admin/Kubernetes#Building new nodes

Deleting a tool

For batch or CLI deletion of tools, use the 'mark_tool' command on a cloudcontrol node:

The awful truth about tool deletion
andrew@cloudcontrol1003:~$ sudo mark_tool
usage: mark_tool [-h] [--ldap-user LDAP_USER] [--ldap-password LDAP_PASSWORD]
                 [--ldap-base-dn LDAP_BASE_DN] [--project PROJECT] [--disable]
                 [--delete] [--enable]
                 tool
mark_tool: error: the following arguments are required: tool

Maintainers can mark their tools for deletion using the "Disable tool" button on the tool's detail page on https://toolsadmin.wikimedia.org/. In either case, the immediate effect of disabling a tool is to stop any running jobs, prevent users from logging in as that tool, and schedule archiving and deletion for 40 days in the future.

A tool can be restored within 40 days of being disabled.
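To work out when a disabled tool becomes due for archiving and deletion, the 40-day window can be computed from the date it was disabled. This is a sketch that assumes GNU date, and the function name deletion_date is hypothetical. The actual schedule is whatever the deletion tooling enforces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the date a tool was disabled (YYYY-MM-DD),
# print the date 40 days later. Requires GNU date.
deletion_date() {
    date -u -d "$1 +40 days" +%F
}

# deletion_date 2024-01-01   # prints 2024-02-10
```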

Tool archives are stored on the tools NFS server, currently tools-nfs-2.tools.eqiad1.wikimedia.cloud:

root@labstore1004:/srv/disable-tool# ls -ltrah /srv/tools/archivedtools/
total 1.8G
drwxr-xr-x 5 root root 4.0K Jun 21 19:37 ..
-rw-r--r-- 1 root root 102K Jul 22 22:15 andrewtesttooltwo
-rw-r--r-- 1 root root   45 Oct 13 00:47 andrewtesttooltwo.tgz
-rw-r--r-- 1 root root 8.3M Oct 13 03:20 mediaplaycounts.tgz
-rw-r--r-- 1 root root 1.8G Oct 13 04:01 projanalysis.tgz
-rw-r--r-- 1 root root 1.3M Oct 13 21:05 reportsbot.tgz
drwxr-xr-x 2 root root 4.0K Oct 13 21:10 .
-rw-r--r-- 1 root root 719K Oct 13 21:10 wsm.tgz
-rw-r--r-- 1 root root 4.8K Oct 13 21:20 andrewtesttoolfour.tgz

The actual deletion process is shockingly complicated. A tool will only be archived and deleted if all of the prior steps succeed, but disabling of a tool should be a sure thing.

SSL certificates

See Portal:Toolforge/Admin/SSL certificates.

Granting a tool write access to Elasticsearch

  • Generate a random password and the mkpasswd crypt entry for it using the script new-es-password.sh. It must be run on a host with the `mkpasswd` command installed (mkpasswd is part of the whois Debian package).
$ ./new-es-password.sh tools.example
tools.example elasticsearch.ini
----
[elasticsearch]
user=tools.example 
password=A3rJqgFKxa/x4NlnIhmw2cXcV92it/Zv0Yt+a7yhxCw=
----

tools.example puppet master private (hieradata/labs/tools/common.yaml)
----
profile::toolforge::elasticsearch::haproxy::elastic_users:
  - name: 'tools.example'
    password: '$6$FYwP3wxT4K7O9EE$OA3P5972NWJVG/WUnD240sal34/dsNabbcawItevMYO9uoR.fJBrjSABex0EDW0wlkWHID1Tf4oJoiNvYFGmy/'
$ ssh tools-puppetserver-01.tools.eqiad1.wikimedia.cloud
$ sudo -i
# cd /srv/git/labs/private
# vim hieradata/labs/tools/common.yaml
... merge the hiera data with the existing key...
:wq
# git add hieradata/labs/tools/common.yaml
# git commit -m "[local] Elasticsearch credentials for $TOOL"
  • Force a puppet run on tools-elastic nodes using Cumin
cloudcumin1001.eqiad.wmnet:~$ sudo cumin "O{project:tools name:.*elastic.*}" "run-puppet-agent"
  • Make the credentials available to the tool as envvars:
$ ssh dev.toolforge.org
$ sudo -i become example-tool
$ toolforge envvars create TOOL_ELASTICSEARCH_USER
Enter the value of your envvar (Hit Ctrl+C to cancel): <insert user>
$ toolforge envvars create TOOL_ELASTICSEARCH_PASSWORD
Enter the value of your envvar (Hit Ctrl+C to cancel): <insert password>

Note: An older procedure placed the credentials in /data/project/$TOOL/.elasticsearch.ini instead.

  • Resolve the ticket!
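For orientation, the first step of the procedure above boils down to two operations: generating a random password and deriving its crypt entry. The sketch below is not the real new-es-password.sh. The function names are hypothetical, and the choice of 32 random bytes and SHA-512 crypt is an assumption inferred from the shape of the sample output (a 44-character base64 password and a $6$ hash).

```shell
#!/usr/bin/env bash
# Sketch only: NOT the real new-es-password.sh.

# Hypothetical helper: 32 random bytes, base64-encoded (44 characters).
new_password() {
    head -c 32 /dev/urandom | base64
}

# Hypothetical helper: SHA-512 crypt ($6$...) of a password read from stdin,
# which keeps the password out of the process list. Needs mkpasswd (whois package).
crypt_entry() {
    mkpasswd --method=sha-512 --stdin
}

# pw="$(new_password)"
# printf '%s\n' "$pw" | crypt_entry
```

The plain password goes to the tool (as an envvar), while only the crypt entry goes into the private hiera data.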

Creating a new Docker image (e.g. for new versions of Node.js)

See Portal:Toolforge/Admin/Kubernetes#Docker Images

Users and community

Some information about how to manage users and general community and their relationship with Toolforge.

Project membership request approval

User access requests show up in https://toolsadmin.wikimedia.org/tools/membership/

Some guidelines for account approvals, based on advice from scfc:

  1. If the request contains any defamatory or abusive information as part of the username(s), reason, or comments → mark as Declined and check the "Suppress this request (hide from non-admin users)" checkbox.
    • You should also block the user on Wikitech and consider contacting a Steward for wider review of the SUL account.
  2. If the user name "looks" like a bot or someone else who could not consent to the Terms of use and Rules → mark as Declined.
  3. Check the status of the associated SUL account. If the user is banned on one or more wikis → mark as Declined.
  4. If the stated purpose is "tangible" ("I want to move my bot x to Toolforge", "I want to build a web app that does y", etc.) → mark as Approved.
    • If you know that someone else has been working on the same problem, add a message explaining who the user should contact or where they might find more information.
  5. If the stated purpose is "abstract" ("research", "experimentation", etc.) and there is a hackathon ongoing or planned, the user has a non-throw-away mail address, the user has created a user page with coherent information about themselves or linked a SUL account of good standing, etc. → mark as Approved.
  6. Otherwise add a comment asking for clarification of their reason for use and mark as Feedback needed. The request is not really "denied", but more (indefinitely) "delayed".

Requests left in Feedback needed for more than 30 days should usually be declined with a message like "Feel free to apply again later with more complete information."

Quota management

Toolforge quotas are managed via maintain-kubeusers.

Other

How do Toolforge web services actually work?

See Portal:Toolforge/Admin/Kubernetes#Ingress

What makes a root/Giving root access

See Toolforge roots and Toolforge admins

Useful administrative tools

These tools offer useful information about Toolforge itself:

  • ToolsDB - Statistics about tables owned by tools
  • k8s-stats - examine what our tools are doing
  • OpenStack Browser - examine projects, instances, web proxies, and Puppet config

Brainstorming

Sub pages