Portal:Toolforge/Admin
Toolforge is a Platform as a Service, offering easy-to-use bot, web, and compute resources for the Wikimedia community.

The main user interfaces are:
- toolsadmin (Striker)
  - A web interface where users register for access, create and delete tools, and manage the metadata and access permissions of their tools.
- Bastion + toolforge cli
  - SSH access to the toolforge network, and a client to manage tool resources (continuous jobs, scheduled jobs, webservices, deployments, builds, ...).
- Toolforge API
  - API to manage toolforge resources.
For more details on user-facing features see the user documentation. For details on administration and infrastructure, see the admin docs.
All pieces of Toolforge are deployed inside a Cloud VPS project (or tenant) called tools. The staging/development project is called toolsbeta.
Components
Documentation of backend components and admin procedures for Toolforge. See Help:Toolforge for user facing documentation about actually using Toolforge to run your bots and webservices.
Deploying a component
See the docs in gitlab for a list of components deployed in kubernetes and details on their deployment process.
List of components
- API Gateway: Point of entry for all APIs, does routing, some authentication and other common services. See Portal:Toolforge/Admin/API Gateway
- Jobs Service: Executes workloads in the kubernetes cluster. See Portal:Toolforge/Admin/Jobs Service
- Envvars Service: Manages secrets and environment variables for the jobs. See Portal:Toolforge/Admin/Envvars Service
- Build Service: Creates runtime environments (container images) for the jobs. See Portal:Toolforge/Admin/Build Service
- Logs Service: Gathers and exposes the logs from the jobs. See Portal:Toolforge/Admin/Logs Service
- Components service: Manages tool-wide config and orchestrates deployments for jobs (build + run of the job). See Portal:Toolforge/Admin/Component Service
- Tools-mail / Exim: Used to send emails from tools and other toolforge components (ex. when a job fails). See Portal:Toolforge/Admin/Exim and Portal:Cloud_VPS/Admin/Email#Operations
- Checker Service: This is the service that Prometheus hits to check status of several services. It's totally stateless. See Portal:Toolforge/Admin/Toolschecker
- Shared Redis: For users. See Portal:Toolforge/Admin/Redis.
- Shared Elasticsearch: For users. No dedicated page yet, we have some dashboards. See also Help:Toolforge/Elasticsearch, the end-user documentation.
- Prometheus: Infrastructure monitoring. See Portal:Toolforge/Admin/Prometheus.
- Apt repository: Infrastructure packages. See Portal:Toolforge/Admin/Apt repository
- Striker/Toolforge UI/toolsadmin: Tool management and account creation. See Portal:Toolforge/Admin/Striker
- LDAP: Account and tool group database. See LDAP.
- ToolsDB: See Portal:Toolforge/Admin/ToolsDB
- Kubernetes infrastructure: See Portal:Toolforge/Admin/Kubernetes
- Harbor: Container image and helm chart repository. See Portal:Toolforge/Admin/Harbor.
- maintain-harbor: Synchronizes harbor with ldap, does certain cleanups, and other chores. See Portal:Toolforge/Admin/Harbor/maintain-harbor.
- maintain-kubeusers: Synchronizes kubernetes with ldap, creates tool and user namespaces, k8s certificates and other per-user/tool objects. See maintain-kubeusers.
- maintain-dbusers: Synchronizes replicas and toolsdb with ldap, creating accounts and passwords and populating them as envvars and replica.cnf files. See maintain-dbusers. Note that it also does this for PAWS.
- replica_cnf: Webservice deployed on the NFS servers, used by maintain-dbusers to manage the replica.cnf files in user and tool homes.
- Fourohfour: This is a special tool that handles the 404 Not Found pages for tool webservices, suggesting other tools if not found or contact with the owners if just not set up. See Tool:Fourohfour.
- K8s status: Tool to explore kubernetes details publicly. See the toolsadmin page.
- Replag: Tool to expose the replication lag in wikireplicas. See the main website.
APIs
Toolforge is moving towards an API-oriented model where client tools (such as those installed on bastions) contact the Toolforge API to make changes instead of making them directly.
See the user docs also.
Access to the toolforge API
The APIs are presented as a single aggregated endpoint through the API Gateway.
The base endpoint is https://api.svc.[project].eqiad1.wikimedia.cloud:30003. Services are routed with subpaths, for example /jobs for the Jobs API.
For authentication we currently use client certificates issued by the Kubernetes cluster internal CA via maintain-kubeusers. This will change in the future as we evolve how the APIs are accessed and used.
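As a rough illustration of the routing scheme described above, the following sketch builds a service URL from a project name and a subpath. The helper name toolforge_api_url is hypothetical, and the certificate and key paths in the usage comment are assumptions that may not match your setup.

```shell
# Build the API Gateway URL for a Cloud VPS project and a service subpath.
# toolforge_api_url is a hypothetical helper, not a Toolforge command.
toolforge_api_url() {
    project="$1"
    subpath="${2#/}"   # drop a leading slash so "/jobs" and "jobs" both work
    printf 'https://api.svc.%s.eqiad1.wikimedia.cloud:30003/%s\n' "$project" "$subpath"
}

# Example call with a client certificate (both file paths are assumptions):
#   curl --cert "$HOME/.toolskube/client.crt" --key "$HOME/.toolskube/client.key" \
#        "$(toolforge_api_url tools jobs)"
```

Keeping the URL construction in one place makes it easy to point the same call at toolsbeta instead of tools.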
Administrative tasks
Admin permissions
Performing admin procedures requires having admin permissions on Toolforge. There is not a single "admin" flag, but a set of interrelated permissions you can be granted. These are described in detail in the page Toolforge roots and Toolforge admins.
Failover
Tools should be able to survive the failure of any one virt* node. Some items may need manual failover.
Static webserver
This is a stateless simple nginx http server. Simply switch the floating IP from one tools-static-* to the other to switch over. Recovery is also equally trivial - just bring the machine back up and make sure puppet is ok.
Checker service
This is the service that Icinga hits to check status of several services. It's totally stateless.
See Portal:Toolforge/Admin/Toolschecker
Prometheus
See Portal:Toolforge/Admin/Prometheus#Failover.
tools-service
Service nodes run the Toolforge internal aptly service, to serve .deb packages as a repository for all the other nodes.
Command orchestration
Toolforge and Toolsbeta both have a local cumin server (cloudcumin* and {tools,toolsbeta}-cumin*).
Other tasks
Logging in as root
For normal login root access see Toolforge roots and Toolforge admins.
In case the normal login does not work for example due to an LDAP failure, administrators can also directly log in as root. To prepare for that occasion, generate a separate key with ssh-keygen, add an entry to the passwords::root::extra_keys hash in Horizon's 'Project Puppet' section with your shell username as key and your public key as value and wait a Puppet cycle to have your key added to the root accounts. Add to your ~/.ssh/config:
# Use different identity for Tools root.
Match host *.tools.eqiad1.wikimedia.cloud user root
IdentityFile ~/.ssh/your_secret_root_key
The code that reads passwords::root::extra_keys is in labs/private:modules/passwords/manifests/init.pp.
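The preparation steps above could be sketched as follows. The helper root_key_hiera_entry is hypothetical, the ed25519 key type is an assumption, and the exact YAML layout expected in the 'Project Puppet' section is assumed from the description (shell username as key, public key as value), so compare the output with existing entries before saving it.

```shell
# Print a hiera snippet for passwords::root::extra_keys.
# Arguments: shell username, path to the PUBLIC key file.
# root_key_hiera_entry is a hypothetical helper; the YAML layout is an assumption.
root_key_hiera_entry() {
    user="$1"
    pubkey_file="$2"
    printf 'passwords::root::extra_keys:\n'
    printf '  %s: %s\n' "$user" "$(cat "$pubkey_file")"
}

# Example (the key path matches the ssh config above; the key type is an assumption):
#   ssh-keygen -t ed25519 -f ~/.ssh/your_secret_root_key
#   root_key_hiera_entry "$USER" ~/.ssh/your_secret_root_key.pub
```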
Disabling all ssh logins except root
Useful for dealing with security critical situations. Just touch /etc/nologin and PAM will prevent any and all non-root logins.
Complaints of bastion being slow
Users are increasingly noticing slowness on tools-login due to either CPU or IOPS exhaustion caused by people running processes there instead of on Kubernetes. Here are some tips for finding the processes in need of killing:
- Look for IOPS hogs
$ iotop
- Look for abnormal processes:
$ ps axo user:32,pid,cmd | grep -Ev "^($USER|root|daemon|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data)" | grep -ivE 'screen|tmux|-bash|mosh-server|sshd:|/bin/bash|/bin/zsh'
- If you see pyb.py, kill with extreme prejudice.
- If the rogue job is running as a tool, !log something like:
!log tools.$TOOL Killed $PROC process running on tools-bastion-NN. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework for instructions on running jobs on Kubernetes.
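Since CPU exhaustion is one of the two causes named above, a per-user CPU summary can complement iotop and the process listing. This is a minimal sketch; cpu_by_user is a hypothetical helper, not an installed tool.

```shell
# Sum %CPU per user from "user pcpu" lines on stdin, highest total first.
# cpu_by_user is a hypothetical helper.
cpu_by_user() {
    awk '{ cpu[$1] += $2 } END { for (u in cpu) printf "%s %.1f\n", u, cpu[u] }' \
        | sort -k2,2nr
}

# Example on a bastion (procps ps):
#   ps axo user:32,pcpu --no-headers | cpu_by_user | head
```

The summary only shows who is using CPU at that moment, so it is a starting point for the process listing rather than a replacement for it.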
Webserver statistics
To get a look at webserver statistics, goaccess is installed on the webproxies. Usage:
goaccess --date-format="%d/%b/%Y" --log-format='%h - - [%d:%t %^] "%r" %s %b "%R" "%u"' -q -f/var/log/nginx/access.log
Interactive key bindings are documented on the man page. HTML output is supported by piping to a file. Note that nginx logs are rotated (twice?) daily, so there is only very recent data available.
Banning an IP from tool labs
On Hiera:Tools, add the IP to the list of dynamicproxy::banned_ips, then force a puppet run on the webproxies. Add a note to Help:Toolforge/Banned explaining why. The user will get a message like [1].
Deploying the main web page
This website (plus the 403/500/503 error pages) is hosted under tools.admin. To deploy:
$ become admin
$ cd tool-admin-web
$ git pull
Regenerate replica.my.cnf
This requires access to the cloudcontrol host which is running maintain-dbusers, and can be done as follows:
$ ssh cloudcontrolXXXX.eqiad.wmnet
$ sudo /usr/local/sbin/maintain-dbusers delete tools.${NAME} --account-type=tool
# or
$ sudo /usr/local/sbin/maintain-dbusers delete ${USERNAME} --account-type=user
Once the account has been deleted, the maintain-dbusers service will automatically recreate the user account.
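To avoid mixing up the two invocations, a small dry-run helper could print the matching command for a given account name. This is a sketch only: dbusers_delete_cmd is hypothetical, and the rule that names starting with "tools." are tool accounts is an assumption taken from the examples above (it would not cover other projects).

```shell
# Print (do not run) the maintain-dbusers command for an account name.
# Assumption: names starting with "tools." are tool accounts, anything else is a user.
dbusers_delete_cmd() {
    case "$1" in
        tools.*) account_type=tool ;;
        *)       account_type=user ;;
    esac
    printf 'sudo /usr/local/sbin/maintain-dbusers delete %s --account-type=%s\n' \
        "$1" "$account_type"
}

# Example:
#   dbusers_delete_cmd tools.example
#   dbusers_delete_cmd someuser
```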
Debugging bad MariaDB credentials
Sometimes things go wrong and a user's replica.my.cnf credentials don't propagate everywhere. You can check the status on various servers to try and narrow down what went wrong.
The database credentials needed are in /etc/dbusers.yaml on the cloudcontrol host running maintain-dbusers.
$ ssh cloudcontrolXXXX.eqiad.wmnet
$ sudo cat /etc/dbusers.yaml
# look for the accounts-backend['password'] for the m5-master connections (user: labsdbaccounts)
# look for the labsdbs['password'] for the other connections (user: labsdbadmin)
$ CHECK_UID=u12345 # User id to check for
# Check if the user is in our meta datastore
$ mariadb -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM account WHERE mysql_username='${CHECK_UID}'\G"
# Check if all the accounts are created in the labsdb boxes from meta datastore.
$ ACCT_ID=.... # Account_id is foreign key (id from account table)
$ mariadb -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM labsdbaccounts.account_host WHERE account_id=${ACCT_ID}\G"
# Check the actual labsdbs if needed
# (the query is double-quoted so that the shell expands ${CHECK_UID})
$ mariadb -h clouddbXXXX.eqiad.wmnet -u labsdbadmin -p -e "SELECT User, Password FROM mysql.user WHERE User LIKE '${CHECK_UID}';"
# Resynchronize account state on the replicas by finding missing GRANTS on each db server
$ sudo maintain-dbusers harvest-replicas
See phab:T183644 for an example of fixing automatic credential creation caused when an old LDAP user becomes a Toolforge member and has an untracked user account on toolsdb.
Regenerate kubernetes credentials for tools (.kube/config)
With admin credentials (root on a control plane node will do), run kubectl -n tool-<toolname> delete cm maintain-kubeusers-<toolname>; it should get regenerated within minutes.
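After deleting the ConfigMap, one way to confirm that it has been regenerated is to poll for it. The following is a minimal sketch; wait_for_kubeusers_cm is a hypothetical helper, and the default retry count and delay are arbitrary assumptions.

```shell
# Poll until the maintain-kubeusers ConfigMap for a tool exists again.
# Arguments: tool name, max attempts (default 20), seconds between attempts (default 30).
# wait_for_kubeusers_cm is a hypothetical helper; the defaults are arbitrary.
wait_for_kubeusers_cm() {
    tool="$1"
    attempts="${2:-20}"
    delay="${3:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if kubectl -n "tool-$tool" get cm "maintain-kubeusers-$tool" >/dev/null 2>&1; then
            return 0   # ConfigMap is present
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1           # gave up after the configured number of attempts
}

# Example, run right after the delete command above:
#   wait_for_kubeusers_cm <toolname> && echo "credentials regenerated"
```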
Adding K8S Components
See Portal:Toolforge/Admin/Kubernetes#Building new nodes
Deleting a tool
For batch or CLI deletion of tools, use the 'mark_tool' command on a cloudcontrol node:

andrew@cloudcontrol1003:~$ sudo mark_tool
usage: mark_tool [-h] [--ldap-user LDAP_USER] [--ldap-password LDAP_PASSWORD]
[--ldap-base-dn LDAP_BASE_DN] [--project PROJECT] [--disable]
[--delete] [--enable]
tool
mark_tool: error: the following arguments are required: tool
Maintainers can mark their tools for deletion using the "Disable tool" button on the tool's detail page on https://toolsadmin.wikimedia.org/. In either case, the immediate effect of disabling a tool is to stop any running jobs, prevent users from logging in as that tool, and schedule archiving and deletion for 40 days in the future.
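To tell a maintainer when archiving and deletion are due, the 40-day window can be computed from the disable date. This sketch assumes GNU date; tool_deletion_date is a hypothetical helper, and the mark_tool call in the comment is inferred from the usage output above rather than from a documented example.

```shell
# Print the date 40 days after a disable date given as YYYY-MM-DD.
# Requires GNU date; tool_deletion_date is a hypothetical helper.
tool_deletion_date() {
    date -u -d "$1 +40 days" +%F
}

# Example (the mark_tool invocation is an assumption based on its usage output):
#   sudo mark_tool --disable <toolname>
#   tool_deletion_date "$(date -u +%F)"
```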

Tool archives are stored on the tools NFS server, currently tools-nfs-2.tools.eqiad1.wikimedia.cloud:
root@labstore1004:/srv/disable-tool# ls -ltrah /srv/tools/archivedtools/
total 1.8G
drwxr-xr-x 5 root root 4.0K Jun 21 19:37 ..
-rw-r--r-- 1 root root 102K Jul 22 22:15 andrewtesttooltwo
-rw-r--r-- 1 root root 45 Oct 13 00:47 andrewtesttooltwo.tgz
-rw-r--r-- 1 root root 8.3M Oct 13 03:20 mediaplaycounts.tgz
-rw-r--r-- 1 root root 1.8G Oct 13 04:01 projanalysis.tgz
-rw-r--r-- 1 root root 1.3M Oct 13 21:05 reportsbot.tgz
drwxr-xr-x 2 root root 4.0K Oct 13 21:10 .
-rw-r--r-- 1 root root 719K Oct 13 21:10 wsm.tgz
-rw-r--r-- 1 root root 4.8K Oct 13 21:20 andrewtesttoolfour.tgz
The actual deletion process is shockingly complicated. A tool will only be archived and deleted if all of the prior steps succeed, but disabling of a tool should be a sure thing.
SSL certificates
See Portal:Toolforge/Admin/SSL certificates.
Granting a tool write access to Elasticsearch
- Generate a random password and the mkpasswd crypt entry for it using the script new-es-password.sh. (This must be run on a host with the `mkpasswd` command installed; mkpasswd is part of the whois Debian package.)
$ ./new-es-password.sh tools.example
tools.example elasticsearch.ini
----
[elasticsearch]
user=tools.example
password=A3rJqgFKxa/x4NlnIhmw2cXcV92it/Zv0Yt+a7yhxCw=
----
tools.example puppet master private (hieradata/labs/tools/common.yaml)
----
profile::toolforge::elasticsearch::haproxy::elastic_users:
  - name: 'tools.example'
    password: '$6$FYwP3wxT4K7O9EE$OA3P5972NWJVG/WUnD240sal34/dsNabbcawItevMYO9uoR.fJBrjSABex0EDW0wlkWHID1Tf4oJoiNvYFGmy/'
- Add the private SHA512 hash to the tools puppetserver:
$ ssh tools-puppetserver-01.tools.eqiad1.wikimedia.cloud
$ sudo -i
# cd /srv/git/labs/private
# vim hieradata/labs/tools/common.yaml
... merge the hiera data with the existing key...
:wq
# git add hieradata/labs/tools/common.yaml
# git commit -m "[local] Elasticsearch credentials for $TOOL"
- Force a puppet run on tools-elastic nodes using Cumin
cloudcumin1001.eqiad.wmnet:~$ sudo cumin "O{project:tools name:.*elastic.*}" "run-puppet-agent"
- Make the credentials available to the tool as envvars:
$ ssh dev.toolforge.org
$ sudo -i become example-tool
$ toolforge envvars create TOOL_ELASTICSEARCH_USER
Enter the value of your envvar (Hit Ctrl+C to cancel): <insert user>
$ toolforge envvars create TOOL_ELASTICSEARCH_PASSWORD
Enter the value of your envvar (Hit Ctrl+C to cancel): <insert password>
Note: An older procedure placed the credentials in /data/project/$TOOL/.elasticsearch.ini instead.
- Resolve the ticket!
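The first step above relies on new-es-password.sh, whose contents are not shown here. As a rough idea of what such a script might do, the sketch below generates a random password of the same shape as the example output and hashes it with mkpasswd. Both function names are hypothetical, and this is not the actual script.

```shell
# Generate a random 32-byte password, base64-encoded (44 characters).
# new_password is a hypothetical helper, not part of new-es-password.sh.
new_password() {
    head -c 32 /dev/urandom | base64
}

# Produce a SHA-512 crypt hash ($6$...) for a password.
# Needs mkpasswd from the whois Debian package.
crypt_hash() {
    mkpasswd -m sha-512 "$1"
}

# Example:
#   pw="$(new_password)"
#   crypt_hash "$pw"
```

The plain password goes to the tool as an envvar, while only the crypt hash is stored in the private hiera data.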
Creating a new Docker image (e.g. for new versions of Node.js)
See Portal:Toolforge/Admin/Kubernetes#Docker Images
Users and community
Some information about how to manage users and general community and their relationship with Toolforge.
Project membership request approval
User access requests show up in https://toolsadmin.wikimedia.org/tools/membership/
Some guidelines for account approvals, based on advice from scfc:
- If the request contains any defamatory or abusive information as part of the username(s), reason, or comments → mark as Declined and check the "Suppress this request (hide from non-admin users)" checkbox.
- You should also block the user on Wikitech and consider contacting a Steward for wider review of the SUL account.
- If the user name "looks" like a bot or someone else who could not consent to the Terms of use and Rules → mark as Declined.
- Check the status of the associated SUL account. If the user is banned on one or more wikis → mark as Declined.
- If the stated purpose is "tangible" ("I want to move my bot x to Toolforge", "I want to build a web app that does y", etc.) → mark as Approved.
- If you know that someone else has been working on the same problem, add a message explaining who the user should contact or where they might find more information.
- If the stated purpose is "abstract" ("research", "experimentation", etc.) and there is a hackathon ongoing or planned, the user has a non-throw-away mail address, the user has created a user page with coherent information about themselves or linked a SUL account of good standing, etc. → mark as Approved.
- Otherwise add a comment asking for clarification of their reason for use and mark as Feedback needed. The request is not really "denied", but more (indefinitely) "delayed".
Requests left in Feedback needed for more than 30 days should usually be declined with a message like "Feel free to apply again later with more complete information."
Quota management
Toolforge quotas are managed via maintain-kubeusers.
- Have the user open a phabricator ticket, for the papertrail. See also Help:Toolforge/Kubernetes#Quotas_and_Resources
- Send a patch for maintain-kubeusers, have it reviewed and merged: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/maintain-kubeusers/values/tools.yaml
- Deploy in the cluster, using the deploy cookbook. See Portal:Cloud_VPS/Admin/Cookbooks
Other
How do Toolforge web services actually work?
See Portal:Toolforge/Admin/Kubernetes#Ingress
What makes a root/Giving root access
See Toolforge roots and Toolforge admins
Useful administrative tools
These tools offer useful information about Toolforge itself:
- ToolsDB - Statistics about tables owned by tools
- k8s-stats - examine what our tools are doing
- OpenStack Browser - examine projects, instances, web proxies, and Puppet config
Brainstorming
Sub pages
- API Gateway
- Apt repository
- Build Service
- Components Service
- Envvars Service
- Exim
- Harbor
- Harbor/maintain-harbor
- Infrastructure tools
- Jobs Service
- Kubernetes
- Kubernetes/2020 Kubernetes cluster rebuild plan notes
- Kubernetes/Certificates
- Kubernetes/Docker-registry
- Kubernetes/Etcd (deprecated)
- Kubernetes/Labels
- Kubernetes/Networking and ingress
- Kubernetes/New cluster
- Kubernetes/Pod tracing
- Kubernetes/RBAC and Pod security
- Kubernetes/RBAC and Pod security/PSP migration
- Kubernetes/Upgrading Kubernetes
- Kubernetes/Upgrading Kubernetes/1.21 to 1.22 notes
- Kubernetes/Upgrading Kubernetes/1.22 to 1.23 notes
- Kubernetes/Upgrading Kubernetes/1.24 to 1.25 notes
- Kubernetes/Upgrading Kubernetes/1.25 to 1.26 notes
- Kubernetes/Upgrading Kubernetes/1.26 to 1.27 notes
- Kubernetes/Upgrading Kubernetes/1.27 to 1.28 notes
- Kubernetes/foxtrot-ldap
- Kubernetes/lima-kilo
- Legacy redirector for webservices
- Logging
- Logs Service
- Maintenance
- Monthly meeting
- Monthly meeting/2022-11-15
- Monthly meeting/2022-12-13
- Monthly meeting/2023-01-31
- Monthly meeting/2023-02-21
- Monthly meeting/2023-04-04
- Monthly meeting/2023-05-02
- Monthly meeting/2023-06-06
- Monthly meeting/2023-07-11
- Monthly meeting/2023-09-05
- Monthly meeting/2023-10-03
- Monthly meeting/2023-11-07
- Monthly meeting/2023-12-19
- Monthly meeting/2024-01-16
- Monthly meeting/2024-02-13
- Monthly meeting/2024-03-12
- Monthly meeting/2024-04-09
- Monthly meeting/2024-05-14
- Monthly meeting/2024-06-25
- Monthly meeting/2024-07-09
- Monthly meeting/2024-09-03
- Monthly meeting/2024-10-01
- Monthly meeting/2024-11-05
- Monthly meeting/2025-01-14
- Monthly meeting/2025-02-18
- Monthly meeting/2025-03-11
- Monthly meeting/2025-04-15
- Monthly meeting/2025-05-20
- Monthly meeting/2025-06-17
- Monthly meeting/2025-07-15
- Monthly meeting/2025-08-19
- Monthly meeting/2025-09-16
- Monthly meeting/2025-10-22
- Monthly meeting/2025-11-18
- Monthly meeting/2025-12-16
- Monthly meeting/2026-01-20
- Monthly meeting/2026-02-24
- Packaging
- Prometheus
- Pywikibot image
- Redis
- Runbooks
- Runbooks/BuildsApiDown
- Runbooks/BuildsApiUpMetricUnknown
- Runbooks/ComponentsApiDown
- Runbooks/ComponentsApiUpMetricUnknown
- Runbooks/EnvvarsAdmissionDown
- Runbooks/EnvvarsApiDown
- Runbooks/EnvvarsApiUpMetricUnknown
- Runbooks/HarborComponentDown
- Runbooks/HarborDown
- Runbooks/HarborProbeUnknown
- Runbooks/IstioGatewayPodMisplaced
- Runbooks/JobsApiDown
- Runbooks/JobsApiUpMetricUnknown
- Runbooks/JobsEmailerDown
- Runbooks/JobsEmailerNoEmails
- Runbooks/JobsEmailerUpMetricUnknown
- Runbooks/Kyverno
- Runbooks/MaintainDBUsersDown
- Runbooks/MaintainDBUsersManyErrors
- Runbooks/MaintainDBUsersStuck
- Runbooks/MaintainKubeusersDown
- Runbooks/PrometheusK8sCertExpirySoon
- Runbooks/Redis
- Runbooks/TektonDown
- Runbooks/TektonUpMetricUnknown
- Runbooks/ToolforgeKubernetesCapacity
- Runbooks/ToolforgeKubernetesHAproxyServerDown
- Runbooks/ToolforgeKubernetesHAproxyUnknown
- Runbooks/ToolforgeKubernetesNodeNotReady
- Runbooks/ToolforgeKubernetesWorkerDiskFull
- Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses
- Runbooks/ToolforgeToolviewsFailed
- Runbooks/ToolforgeToolviewsStale
- Runbooks/ToolforgeWebConnectionLimit
- Runbooks/ToolforgeWebHighErrorRate
- Runbooks/Toolforge Kyverno low policy resources
- Runbooks/Toolforge Kyverno no policy resources
- Runbooks/Toolforge Kyverno unknown state
- Runbooks/ToolsDBAlmostFull
- Runbooks/ToolsDBReplication
- Runbooks/ToolsNFSDown
- Runbooks/ToolsNfsAlmostFull
- Runbooks/ToolsToolsDBWritableState
- Runbooks/k8s-haproxy
- SSL certificates
- Striker
- Striker/Build
- Striker/Deployments
- Striker/FAQ
- System overview
- Toolforge-sync-meeting
- Toolforge roots and Toolforge admins
- ToolsDB
- Toolsbeta
- Toolschecker
- emergency guides
- emergency guides/irc bot deployment
- emergency guides/single tool webservice
- emergency guides/toolforge down notification
- local packages
- puppet refactor
- replagstats
- tofu-provisioning
- toolhistory
