
Toolforge documentation for admins.


Tools should be able to survive the failure of any one virt* node. Some items may need manual failover.


There are two webproxies, tools-proxy-01 and tools-proxy-02. They run on different virt hosts and act as 'hot spares': you can switch between them without any downtime. Webservices register themselves with the active proxy (specified by the hiera setting active_proxy), and this information is stored in redis. The proxying information is also replicated to the standby proxy via simple redis replication. When the proxies are switched, new webservice starts will fail for a while until puppet has run on all the web nodes and the proxies, but current HTTP traffic will continue to be served.

Executing a failover between Tools Proxy instances:

1. Switch the floating IP for tools.wmflabs.org from one proxy to the other (if tools-proxy-01 is the currently active one, switch to tools-proxy-02, and vice versa). The easiest way to verify that the routing is correct (other than just hitting tools.wmflabs.org) is to tail /var/log/nginx/access.log on the proxy machines. This is an invasive operation.

   # Find the instance that currently holds the floating IP:
   OS_TENANT_NAME=tools openstack server list | grep -C 0 '<floating IP>' | awk '{print $4, $2}'
   OS_TENANT_NAME=tools openstack ip floating remove <floating IP> <current instance UUID>
   OS_TENANT_NAME=tools openstack ip floating add <floating IP> <intended instance UUID>

If this is a temporary failover (e.g. in order to reboot the primary proxy) then this is all that's needed -- as soon as the IP is assigned back to the primary then service will resume as usual. If, on the other hand, the original proxy is going to be down for a few minutes, continue with the following steps to make the switch official.

2. Use hiera to set the active proxy host (toollabs::active_proxy_host) to the hostname (not fqdn) of the newly active proxy. For now this is at https://wikitech.wikimedia.org/wiki/Hiera:Tools. This updates /etc/active-proxy and tells each of the tools-proxy servers which one is the active replication master. There is a period of time during which the values on individual Tools project instances are not consistent, so changes made while the switch is in progress are subject to a race condition. The safest approach is to stop redis on the original primary instance while the hiera value is updated, to prevent lost changes.

3. Run puppet on the DNS recursor hosts (labservices1001 and labservices1002). This is required for internal hits to tools.wmflabs.org to resolve. See https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/labs/dnsrecursor.pp;32dac026174c4eda1383713fda818a6a30c5e981$54

4. Force a puppet run on all the proxies and webgrid nodes

   clush -f 10 -g exec-trusty 'sudo puppet agent --test'
   clush -g webgrid 'sudo puppet agent --test'
   clush -g all 'sudo puppet agent --test'

5. Ensure puppet is running on tools-proxy-01 and tools-proxy-02 and the replication roles have switched.

6. Test tools.wmflabs.org
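
For the first step of the failover, the instance that currently holds the floating IP can be picked out of the server list by exact address match rather than by substring. The following is a minimal sketch and not part of the documented procedure: find_floating_ip_holder is a hypothetical helper, the sample table uses made-up IDs and documentation-range addresses, and the column layout is assumed to be the one that the awk '{print $4, $2}' filter in that step expects.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `openstack server list` table output on stdin and
# print "<name> <id>" for the instance whose row contains the given address
# as a whole token (so 192.0.2.1 does not match 192.0.2.10).
find_floating_ip_holder() {
    local ip="$1"
    awk -v ip="$ip" '
        {
            # Split the row on the separators used in the table so that
            # addresses are compared as complete tokens.
            n = split($0, tok, "[ ,;=|]+")
            for (i = 1; i <= n; i++) {
                if (tok[i] == ip) { print $4, $2; break }
            }
        }'
}

# Made-up sample in the assumed table layout (IDs and addresses are not real).
SAMPLE_SERVER_LIST='| 1111-aaaa | tools-proxy-01 | ACTIVE | public=10.0.0.5, 192.0.2.10 |
| 2222-bbbb | tools-proxy-02 | ACTIVE | public=10.0.0.6 |'

printf '%s\n' "$SAMPLE_SERVER_LIST" | find_floating_ip_holder 192.0.2.10
```

In real use the helper would be fed from the live listing, for example OS_TENANT_NAME=tools openstack server list | find_floating_ip_holder <floating IP>.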

Doing potentially dangerous puppet merges

Since we have two instances, it is easy to verify that a puppet merge doesn't break anything.

  1. Disable puppet on the 'active' instance (you can find this from hiera and by tailing /var/log/nginx/access.log)
  2. Run puppet on the other instance
  3. Check to make sure everything is ok. curl -H "Host: tools.wmflabs.org" localhost is a simple smoketest
  4. If ok, then enable puppet on the 'active' instance, and test that too to make sure
  5. Celebrate!

Recovering a failed proxy

When a proxy fails, it should be brought back and recovered so that it can be the new hot spare. This can be a post fire-fighting operation. The process for that is:

  1. Bring the machine back up (this implies that whatever hardware issue caused the machine to be down has been fixed)
  2. Run puppet (this will start up replication from the current active master)

Checking redis registration status

$ redis-cli hgetall prefix:dplbot

$ grep /var/lib/redis/tools-proxy-01-6379.aof -e 'dplbot' -C 5

Static webserver

This is a stateless simple nginx http server. Simply switch the floating IP from tools-static-10 to tools-static-11 (or vice versa) to switch over. Recovery is also equally trivial - just bring the machine back up and make sure puppet is ok.

Checker service

This is the service that catchpoint (our external monitoring service) hits to check status of several services. It's totally stateless, so just switching the public IP from tools-checker-01 to -02 (or vice versa) should be fine. Same procedure as static webserver.

GridEngine Master

The gridengine scheduler/dispatcher runs on tools-grid-master, and manages dispatching jobs to execution nodes and reporting. The active master writes its name to /var/lib/gridengine/default/common/act_qmaster, where all end-user tools pick it up. tools-grid-master normally serves in this role, but tools-grid-shadow can also be started manually as the master, with service gridengine-master start on the shadow master, iff there are currently no active masters.

Note that puppet is configured to start the master at every run on the designated master; this probably needs to be disabled there if one intends to use the shadow master as primary.


Every 30s, the master touches the file /var/spool/gridengine/qmaster/heartbeat. On tools-grid-shadow there is a shadow master that watches this file for staleness, and will fire up a new master on itself if the file has been stale for too long (currently set at 10m). This only works if the running master crashed or was killed uncleanly (including the server hosting it crashing), because a clean shutdown creates a lockfile forbidding shadows from starting a master (as would be expected in the case of willfully stopped masters).

If the shadow does start a master, it changes act_qmaster to point to itself, redirecting all userland tools. This move is unidirectional; once the original master is ready to take over again, the gridengine-master on tools-grid-shadow needs to be shut down manually, and the one on tools-grid-master started (this is necessary to prevent flapping, or split brain, if tools-grid-master only failed temporarily). This is simply done with service gridengine-master {stop/start}.
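
When debugging a failover it can help to check by hand whether the heartbeat file looks stale. The sketch below is an illustration only: heartbeat_is_stale is a hypothetical helper, and it assumes that staleness can be judged purely from the file's modification time, as the description above suggests. The shadow master's real logic may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if FILE was last modified more than
# MAX_AGE_MINUTES ago (default 10), or if FILE does not exist at all.
heartbeat_is_stale() {
    local file="$1" max_age_minutes="${2:-10}"
    # A missing heartbeat file is treated as stale.
    [ -e "$file" ] || return 0
    # find prints the path only when the modification time is old enough.
    [ -n "$(find "$file" -maxdepth 0 -mmin +"$max_age_minutes")" ]
}
```

A typical call would be heartbeat_is_stale /var/spool/gridengine/qmaster/heartbeat 10 && echo "heartbeat is stale".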


Redis runs on two instances - tools-redis-1001 and -1002, and the currently active master is set via hiera on toollabs::active_redis (defaults to tools-redis-1001). The other is set to be a slave of the master. Switching over can be done by:

  1. Switchover on hiera, set toollabs::active_redis to the hostname (not fqdn) of the up host
  2. Force a puppet run on the redis hosts
  3. Restart redis on the redis hosts. This resets current connections and makes the master and the slave recognise their new roles
  4. Set the IP address for 'tools-redis.tools.eqiad.wmflabs' and 'tools-redis.eqiad.wmflabs' in hieradata/common/dnsrecursor/labsaliaser.yaml to point to the IP of the new master. This needs a puppet merge + run on the DNS hosts (labservices1001 and holmium as of now). Eventually we'd like to move this step to Horizon...


These are services that run off service manifests for each tool - currently just the webservicemonitor service. They are in warm standby and require manual switchover. tools-services-01 and tools-services-02 both have the exact same code running, but only one of them is 'active' at a time. Which one is determined by the puppet role param role::labs::tools::services::active_host. Set that via hiera to the fqdn of the host that should be 'active' and run puppet on all the services hosts. This will start the services on the active host and stop them on the others. Since services should not have any internal state, they can be run from any host without having to switch back compulsorily.

Bigbrother also runs on this host, via upstart. The log file can be found in /var/log/upstart/bigbrother.log.

Command orchestration

We have a clush installation set up so that admins can execute arbitrary commands on groups of instances at the same time. It has a master (currently tools-puppetmaster-02), which has the role role::toollabs::clush::master set. Nodes are classified via prefix-matching - this logic needs to be kept up to date. You can find the mapping in modules/role/files/toollabs/clush/tools-clush-generator.

Example commands:

  • List all host groups: nodeset -l
  • Show hosts in host group: nodeset -e @redis
  • Run command on all hosts in group: clush -w @redis -b "cat /proc/cpuinfo | tail"
    • -w selects the hosts or host group (prepend @). Alternatively, use -g redis
    • -b collects all the output, deduplicates before displaying it.
    • e.g. list all processes connected to freenode from exec hosts: clush -w @exec -b "sudo netstat -atp | grep freenode"
    • e.g. list all weblinkchecker processes: clush -w @exec -b "ps axw o user:20,pid:8,%cpu:8,cmd | grep weblink | grep -v clush"

For more information on clush's amazing features, read the docs!

Administrative tasks

Logging in as root

In case the normal login does not work, for example due to an LDAP failure, administrators can also log in directly as root. To prepare for that occasion, generate a separate key with ssh-keygen, add an entry to the passwords::root::extra_keys hash in Hiera:Tools with your shell username as the key and your public key as the value, and wait a Puppet cycle for your key to be added to the root accounts. Then add to your ~/.ssh/config:

# Use different identity for Tools root.
Match host *.tools.eqiad.wmflabs user root
     IdentityFile ~/.ssh/your_secret_root_key

The code that reads passwords::root::extra_keys is in labs/private:modules/passwords/manifests/init.pp.

Disabling all ssh logins except root

Useful for dealing with security critical situations. Just touch /etc/nologin and PAM will prevent any and all non-root logins.

SGE resources


List of handy commands

Most commands take -xml as a parameter to enable xml output. This is useful when lines get cut off.


  • list queues on given host: qhost -q -h <hostname>
  • list jobs on given host: qhost -j -h <hostname>
  • list all queues: qstat -f
  • qmaster log file: tail -f /data/project/.system/gridengine/spool/qmaster/messages


  • modify host group config: qconf -mhgrp \@general
  • print host group config: qconf -shgrp \@general
  • modify queue config: qconf -mq queuename
  • print queue config: qconf -sq continuous
  • enable a queue: qmod -e 'queue@node_name'
  • disable a queue: qmod -d 'queue@node_name'

  • add host as exec host: qconf -Ae node_name
  • print exec host config: qconf -se node_name
  • remove host as exec host: qconf -de node_name

  • add host as submit host: qconf -as node_name
  • remove host as submit host: qconf -ds node_name

  • add host as admin host: qconf -ah node_name
  • remove host as admin host: qconf -dh node_name


  • retrieve information on finished job: qacct -j <jobid or jobname>
  • there are a few scripts in /home/valhallasw/accountingtools: (need to be puppetized)
    • vanaf.py makes a copy of recent entries in the accounting file
    • accounting.py contains python code to read in the accounting file
    • Usage:
      valhallasw@tools-bastion-03:~/accountingtools$ php time.php "-1 hour"
      valhallasw@tools-bastion-03:~/accountingtools$ python vanaf.py 1471465675 mylog
      Seeking to timestamp  1471465675
      valhallasw@tools-bastion-03:~/accountingtools$ grep mylog -e '6727696' | python merlijn_stdin.py
      25 1970-01-01 00:00:00 1970-01-01 00:00:00 tools-webgrid-lighttpd-1206.eqiad.wmflabs tools.ptwikis lighttpd-precise-ptwikis 6727696
      0 2016-08-17 21:01:42 2016-08-17 21:01:46 tools-webgrid-lighttpd-1207.eqiad.wmflabs tools.ptwikis lighttpd-precise-ptwikis 6727696
      Traceback (most recent call last):
        File "merlijn_stdin.py", line 4, in <module>
          line = raw_input()
      EOFError: EOF when reading a line
      • Ignore the EOFError; the relevant lines are above that. Error codes (first entry) are typically 0 (finished successfully), 19 ('before writing exit_status' = crashed?), 25 (rescheduled) or 100 ('assumedly after job' = lost job?). I'm not entirely sure about the codes when the job stops because of an error.
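
When reading many accounting lines, the codes listed above can be turned into labels with a small lookup. This is only a sketch: describe_error_code is a hypothetical helper, and the labels repeat the tentative meanings given above, which remain guesses rather than authoritative definitions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the first field of an accounting line to the
# tentative meaning noted above. The question marks are kept on purpose.
describe_error_code() {
    case "$1" in
        0)   echo "finished successfully" ;;
        19)  echo "before writing exit_status (crashed?)" ;;
        25)  echo "rescheduled" ;;
        100) echo "assumedly after job (lost job?)" ;;
        *)   echo "unknown code $1" ;;
    esac
}

describe_error_code 25
```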

Creating a new node

Clearing error state

Sometimes due to various hiccups (like LDAP or DNS malfunction), grid jobs can move to an Error state from which they will not come out without explicit user action. Once you have ascertained the cause of the Error state and fixed it, you can clear all the error state jobs with:

qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 qmod -cj
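
The grep in the pipeline above matches "Eqw" anywhere on the line, so a job whose name happens to contain that string would also be selected. A stricter variant, sketched here under the assumption that the job state is the fifth whitespace-separated column of the qstat output, compares that column only. eqw_job_ids is a hypothetical helper and the sample listing is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `qstat -u '*'` output on stdin and print the IDs
# of jobs whose state column (assumed to be field 5) is exactly "Eqw".
eqw_job_ids() {
    awk '$5 == "Eqw" { print $1 }'
}

# Made-up sample in the assumed column layout; the first job only has "Eqw"
# in its name and is running, so it must not be selected.
SAMPLE_QSTAT='job-ID  prior   name       user         state submit/start at     queue          slots
--------------------------------------------------------------------------------------
 123456 0.30000 Eqwreport  tools.foo    r     08/17/2016 21:01:42 task@exec-host     1
 123457 0.00000 cron-job   tools.bar    Eqw   08/17/2016 20:00:00                    1'

printf '%s\n' "$SAMPLE_QSTAT" | eqw_job_ids
```

On the grid this would be used as qstat -u '*' | eqw_job_ids | xargs -L1 qmod -cj.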

You also need to clear all the queues that have gone into error state. Failing to do so prevents jobs from being scheduled on those queues. You can clear all error states on queues with:

qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//'  | xargs qmod -cq

Draining a node of Jobs

  1. Disable the queues on the node with qmod -d '*@node_name'
  2. Reschedule continuous jobs running on the node (see below)
  3. Wait for non-restartable jobs to drain (if you want to be nice!) or qdel them
  4. Once whatever needed to be done is done, re-enable the node with qmod -e '*@node_name'

There is no simple way to delete or reschedule jobs on a single host, but the following snippet produces a list of job IDs that can be passed on the command line:

$(qhost -j -h node_name | awk '{ print $1; }' | grep -E '^[0-9]')

which makes for reasonable arguments to qdel or qmod -rj.

Decommission a node

In real life, you just do this with exec-manage depool <fqdn>. What follows are the detailed steps that are handled by that script.

  1. Drain the node (see above!). Give the non-restartable jobs some time to finish (maybe even a day if you are feeling generous?).
  2. Remove node from hostgroups it is present in, if any. You can check / remove with a qconf -mhgrp @general or qconf -mhgrp @webgrid on any admin host. This will open up the list in a text editor, where you can carefully delete the name of the host and save. Be careful to keep the line continuations going.
  3. Remove the node from any queues it might be included directly in. Look at qconf -sql for list of queues, and then qconf -mq <queue-name> to see list of hosts in it. Note that this seems to be mostly required only for webgrid hosts (yay consistency!)
  4. Remove the node from gridengine with qconf -de <fqdn>. Also do qconf -de <hostname>.eqiad.wmflabs for completeness, since some of the older nodes do not use fqdn.
  5. If the node is a webgrid node, also remove it from being a submit host, with qconf -ds <fqdn>. Also do qconf -ds <hostname>.eqiad.wmflabs for completeness, since some of the older nodes do not use fqdn.
  6. Double check that you got rid of the node(s) from grid config by checking the output of sudo qconf -sel. (See phab:T149634 for what can happen.)
  7. Check if there's a host alias for the node in operations/puppet, in modules/toollabs/files/host_aliases. If there is, get rid of it.
  8. Wait for a while, then delete the VM!

Local package management

Local packages are provided by an aptly repository on tools-services-01.

On tools-services-01, you can manipulate the package database with various commands; cf. aptly(1). Afterwards, you need to publish the database to the file Packages with (for the trusty-tools repository) aptly publish --skip-signing update trusty-tools. To use the packages on the clients, you need to wait 30 minutes or run apt-get update there. In general, you should never just delete packages, but move them to ~tools.admin/archived-packages.

You can always see where a package is (would be) coming from with apt-cache showpkg $package.

Local package policy

Package repositories

  • We only install packages from trustworthy repositories.
    • OK are
      • The official Debian and Ubuntu repositories, and
      • Self-built packages (apt.wikimedia.org and aptly)
    • Not OK are:
      • PPAs
      • other 3rd party repositories

Packagers effectively get root on our systems, as they could add a rootkit to the package, or upload an unsafe sshd version, and apt-get will happily install it.

Hardship clause: in extraordinary cases, and for 'grandfathered in' packages, we can deviate from this policy, as long as security and maintainability are kept in mind.


We assume that whatever is good for production is also OK for Toolforge.


We manage the aptly repository ourselves.

  • Packages in aptly need to be built by Toolforge admins
    • we cannot import .deb files from untrusted 3rd party sources
  • Package source files need to come from a trusted source
    • a source file from a trusted source (i.e. backports), or
    • we build the debian source files ourselves
    • we cannot build .dsc files from untrusted 3rd party sources
  • Packages need to be easy to update and build
    • cowbuilder/pdebuild OK
    • fpm is OK
    • See Deploy new jobutils package for an example walk through of building and adding packages to aptly.
  • We only package if strictly necessary
    • infrastructure packages
    • packages that should be available for effective development (e.g. composer or sbt)
    • not: python-*, lib*-perl, ..., which should just be installed with the available platform-specific package managers
  • For each package, it should be clear who is responsible for keeping it up to date
    • for infrastructure packages, this should be one of the paid staffers

A list of locally maintained packages can be found under /local packages.

Building packages

/data/project/dpkg is used as storage for building packages. The 'how' of package building has not been completely thought out yet, but there are two basic routes:

Deploy new jobutils package

The jobutils package provides the job, jsub, jstop, jstart, and qcronsub commands used on the bastions to submit new jobs to the grid. It is built as a deb from sources in the labs/toollabs.git repo and distributed to tools hosts via the aptly package repository hosted on tools-services-01.tools.eqiad.wmflabs.

Yuvi's method (debuild)

  1. ssh to tools-dev.wmflabs.org
  2. cd into your clone of the toollabs repo
  3. check out the branch master
  4. git pull
  5. debuild -uc -us -I
  6. ssh to tools-services-01.eqiad.wmflabs
  7. Add the packages to aptly
    1. sudo aptly repo add trusty-tools jobutils_$VERSION_all.deb
    2. sudo aptly publish --skip-signing update trusty-tools
  8. Back up the repositories with sudo rsync --chmod 440 --chown root:"${INSTANCEPROJECT}".admin -ilrt /srv/packages/ /data/project/.system/aptly/"$(hostname -f)" (-n for dry run)
  9. Run apt-get update and puppet agent --test --verbose on the bastions to have them pull in the new package

Bd808's method (pdebuild)

  1. ssh tools-package-builder-01.tools.eqiad.wmflabs
  2. cd /srv/src/toollabs
  3. git checkout master
  4. git pull
  5. DIST=trusty pdebuild
  6. export VERSION=$(dpkg-parsechangelog --show-field version)
  7. echo $VERSION
  8. cp /srv/pbuilder/result/trusty-amd64/jobutils_${VERSION}_all.deb ~
  9. ssh tools-services-01.tools.eqiad.wmflabs
  10. sudo aptly repo add trusty-tools jobutils_${VERSION}_all.deb
  11. sudo aptly publish --skip-signing update trusty-tools
  12. Back up the repositories with sudo rsync --chmod 440 --chown root:"${INSTANCEPROJECT}".admin -ilrt /srv/packages/ /data/project/.system/aptly/"$(hostname -f)" (-n for dry run)
  13. Run sudo -i apt-get update && sudo -i apt-get install jobutils on tools-bastion-02
  14. Test jsub on tools-bastion-02 to make sure the package works.
  15. Update the package on the rest of the cluster:
    $ ssh tools-puppetmaster-02.tools.eqiad.wmflabs
    $ clush -w @all -b '/usr/bin/dpkg -s jobutils &>/dev/null && sudo apt-get update -qq && sudo /usr/bin/env DEBIAN_FRONTEND=noninteractive apt-get install -q -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" jobutils'

Deploy new toollabs-webservice package

Follow the same basic steps as in "Deploy new jobutils package", but also rebuild all of the Docker images used for Kubernetes webservices.

Webserver statistics

To get a look at webserver statistics, goaccess is installed on the webproxies. Usage:

goaccess --date-format="%d/%b/%Y" --log-format='%h - - [%d:%t %^] "%r" %s %b "%R" "%u"' -q -f/var/log/nginx/access.log

Interactive key bindings are documented on the man page. HTML output is supported by piping to a file. Note that nginx logs are rotated (twice?) daily, so there is only very recent data available.

Restarting all webservices

This is sometimes necessary, if the proxy entries are out of whack. Can be done with

qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{ print $1;}' | xargs -L1 qmod -rj

The qstat gives us a list of all jobs from all users under the two webgrid queues, and the qmod -rj asks gridengine to restart them. This can be run as root on tools-login.wmflabs.org

Banning an IP from tool labs

On Hiera:Tools, add the IP to the list of dynamicproxy::banned_ips, then force a puppet run on the webproxies. Add a note to Help:Toolforge/Banned explaining why. The user will get a message like [3].

Deploying the main web page

This website (plus the 403/500/503 error pages) is hosted under tools.admin. To deploy:
$ become admin
$ cd toollabs
$ git pull

Regenerate replica.my.cnf

See also: Portal:Data_Services/Admin/Labstore#maintain-dbusers

This requires access to the active labstore host, and can be done as follows (on the active labstore):

$ /usr/local/sbin/maintain-dbusers delete tools.$name

Debugging bad mysql credentials

Sometimes things go wrong and a user's replica.my.cnf credentials don't propagate everywhere. You can check the status on various servers to try and narrow down what went wrong.

The database credentials needed are in /etc/dbusers.yaml on the labstore servers.

$ CHECK_UID=u12345  # User id to check for

# Check if the user is in our meta datastore
$ mysql -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "select * from account where mysql_username='${CHECK_UID}';"

# Check if all the accounts are created in the labsdb boxes from meta datastore.
$ ACCT_ID=.... # Account_id is foreign key (id from account table)
$ mysql -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "select * from labsdbaccounts.account_host where account_id=${ACCT_ID};"

# Check the actual labsdbs if needed (double quotes outside, so that ${CHECK_UID} is expanded by the shell)
$ sudo mysql -h labsdb1001.eqiad.wmnet -u labsdbadmin -p -e "SELECT User, Password from mysql.user where User like '${CHECK_UID}';"

Regenerate kubernetes credentials (.kube/config)

  1. Delete the `.kube/config` file
  2. On the k8s master (currently tools-k8s-master-01.tools.eqiad.wmflabs), remove the line for the tool from /etc/kubernetes/tokenauth. Make a backup before you edit this file :D
  3. On the k8s master, run systemctl restart maintain-kubeusers
  4. You can run journalctl -u maintain-kubeusers -f to follow the logs. The script will only create the missing things, so you might see some error messages. This is unfortunately normal!
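
The backup and line removal in step 2 can be sketched as a small function. This is an illustration under stated assumptions, not the documented procedure: remove_token_line is a hypothetical helper, and it assumes the token file uses the Kubernetes static token layout of comma-separated fields with the user name in the second field. Check how the tool is actually named in the file before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy TOKEN_FILE to TOKEN_FILE.bak, then drop every line
# whose second comma-separated field equals USER_NAME exactly.
remove_token_line() {
    local token_file="$1" user_name="$2" tmp
    # Keep a backup next to the original before touching anything.
    cp -p "$token_file" "$token_file.bak" || return 1
    tmp="$(mktemp)" || return 1
    # Exact comparison on field 2, so "foo" does not also remove "foo2".
    awk -F, -v user="$user_name" '$2 != user' "$token_file.bak" > "$tmp" \
        || { rm -f "$tmp"; return 1; }
    # Write back in place so ownership and permissions stay unchanged.
    cat "$tmp" > "$token_file" || { rm -f "$tmp"; return 1; }
    rm -f "$tmp"
}
```

On the master this would be run as root, for example remove_token_line /etc/kubernetes/tokenauth <name of the tool as it appears in the file>.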

Adding K8S Components

See Portal:Toolforge/Admin/Kubernetes#Building_new_nodes

Deleting a tool

  1. Ask the maintainers to delete or back up code and data they no longer need.
  2. Stop running grid jobs and web services (qstat).
  3. Delete the tool at Special:NovaServiceGroup.
  4. Use list-user-databases /data/project/$TOOLNAME/replica.my.cnf to check whether the tool has any databases and if it has and they are valuable, dump them to the tool's directory and delete the databases.
  5. If the tool's directory has valuable code or data, back it up to /data/project/.system/removed_tools and notify the previous maintainers or someone else appropriate. This is just courtesy – there are no guarantees that deleted tools can be restored.
  6. Delete the tool's directory on labstore. (replica.my.cnf files are immutable and that attribute cannot be changed over NFS.)

Updating JDK8

We build and keep a jdk8 for trusty for use by tools. This will be supported until the end of 2016, at which point they should all be on k8s. Until then, admins have to keep updating the jdk8 package whenever new Java security vulnerabilities show up. This can be done the following way:

  1. Go to https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa and check if there's an update. This is set up by the same person who maintains jdk8 in Ubuntu, so the hope is that this would be good enough for us.
  2. Get the source package (.dsc file, .orig.tar.xz and .debian.tar.gz) from https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa/+packages. Download them onto a pdebuilder host (currently tools-docker-builder-03)
  3. Verify the signature on the dsc file (TODO: Add instructions!)
  4. Extract the source package, with dpkg-source -x $DSCFILE. This will extract a directory that can be used to build the actual debs.
  5. In a screen session, run DIST=trusty pdebuild
  6. Check back in a few hours, and if the build has succeeded, upload it to aptly on tools-services-01

SSL certificates

We use a bunch of different SSL certs on tools.


This is used for the tools web proxies (tools-proxy-*) and the tools static proxies (tools-static-*). The secret key files for these are kept as cherry picks in the tools puppetmaster (tools-puppetmaster-01). Updating the certs would involve just changing it centrally, running puppet on all these nodes, and restarting nginx. Note that this cert is also used for novaproxy (instances in project project-proxy).


This is used for the k8s master (tools-k8s-master-*) and the docker registry (tools-docker-registry-*). These aren't centrally managed yet, but should be!

This cert was most recently updated on 23 Mar 2017 (https://phabricator.wikimedia.org/T160187)

Process for updating:

  • Stage the new private key in the private repo on puppetmaster1001 at /srv/private/modules/secret/secrets/ssl, under a filename with a new. prefix
  • Stage new certificate in gerrit (e.g. https://gerrit.wikimedia.org/r/#/c/342254/)
  • Copy over private key from puppetmaster1001 (at /srv/private/modules/secret/secrets/ssl/new.star.wmflabs.org.key) to tools-puppetmaster-02 at /var/lib/git/labs/private/modules/secret/secrets/ssl
  • Halt puppet on affected hosts (in this case puppet agent --disable "message" on tools-k8s-master-* and tools-docker-registry-*)
  • Merge gerrit patchset (And pull latest changes with git pull -r origin production on Tools puppetmaster at /var/lib/git/operations/puppet)
  • mv new.star.tools.wmflabs.org.key to replace existing(old) star.tools.wmflabs.org.key file in private repo, git commit changes to labs/private on tools-puppetmaster
  • Verify that ownership and permissions are set correctly for star.tools.wmflabs.org.key file (-rw-r--r-- and owned by root:root or gitpuppet:root)
  • Reenable and run puppet on affected hosts and ensure affected services accept update without error
  • Restart nginx on docker registries and kube-apiserver on k8s master just to make sure they have picked up new cert.
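
The ownership and permission check in the list above can be done with a single stat call. A minimal sketch, assuming GNU coreutils stat is available on the puppetmaster; show_key_perms is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the mode string and owner:group of a file in the
# same notation as the checklist above, e.g. "-rw-r--r-- root:root".
show_key_perms() {
    stat -c '%A %U:%G' -- "$1"
}
```

Running show_key_perms star.tools.wmflabs.org.key in the ssl directory of the private repo then gives one line to compare against the expected values.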

Granting a tool write access to Elasticsearch

  • Generate a random password and the htpassword crypt entry for it using the script new-es-password.sh. (Must be run on a host with the `htpasswd` command installed. A MediaWiki-Vagrant virtual machine will work for this in a pinch.)
$ ./new-es-password.sh tools.an-example
  • Add the htpassword hash to puppet:
$ ssh tools-puppetmaster-02.tools.eqiad.wmflabs
$ cd /var/lib/git/labs/private
$ sudo -i vim /var/lib/git/labs/private/modules/secret/secrets/labs/toollabs/elasticsearch/htpasswd
... paste in htpassword crypt data ...
$ sudo git add modules/secret/secrets/labs/toollabs/elasticsearch/htpasswd
$ sudo git commit -m "[local] Elasticsearch credentials for $TOOL"
  • Force a puppet run on tools-elastic-0[123]
  • Create the credentials file in the tool's $HOME:
$ ssh tools-dev.wmflabs.org
$ sudo -i touch /data/project/$TOOL/.elasticsearch.ini
$ sudo -i chmod o-rwx /data/project/$TOOL/.elasticsearch.ini
$ sudo -i vim /data/project/$TOOL/.elasticsearch.ini
... paste in username and raw password in ini file format ...
  • Resolve the ticket!


See Portal:Toolforge/Admin/Kubernetes

Tools-mail / Exim

See Portal:Toolforge/Admin/Exim and Portal:Cloud_VPS/Admin/Exim

Emergency guides


What makes a root

Users who need to do administrative work in Toolforge need to be listed at several places:

  1. Project administrator: This allows a user to add and delete other users from the Toolforge project.
  2. sudo policy "roots": This allows a user to use sudo to become root on Toolforge instances.
  3. 'admin' tool maintainer: This allows a user to log into infrastructure instances and perform tasks as the admin tool.
  4. Gerrit groups "labs-toollabs" and "toollabs-trusted": These allow a user to +2 changes in repositories exclusive to Toolforge.

Servicegroup log

tools.admin runs /data/project/admin/bin/toolhistory, which provides an hourly snapshot of ldaplist -l servicegroup as a git repository in /data/project/admin/var/lib/git/servicegroups

HBA: How does it work?

See wikibooks:en:OpenSSH/Cookbook/Host-based_Authentication#Client_Configuration_for_Host-based_Authentication. If things don't work, check every point listed in that guide - sshd doesn't give you much output to work with.

Central syslog servers

tools-logs-01 and tools-logs-02 are central syslog servers that receive syslog data from all? tools hosts. These are stored in /srv/syslog.

maintain-kubeusers stuck

If home directories for new tools stop getting created, it is likely because the service that creates them, maintain-kubeusers on the k8s master (currently tools-k8s-master-01), has got stuck with its LDAP connection. You should be able to simply log into the master node and run sudo service maintain-kubeusers restart to get things working again.