Nova Resource:Tools/Admin

Tool Labs documentation for admins.

Failover

Tools should be able to survive the failure of any one virt* node. Some items may need manual failover.

WebProxy

There are two webproxies, tools-proxy-01 and tools-proxy-02. They are on different virt hosts and act as 'hot spares': you can switch between them without any downtime. Webservices register themselves with the active proxy (specified by the hiera setting active_proxy), and this information is stored in redis. The proxying information is also replicated to the standby proxy via simple redis replication. So when the proxies are switched, new webservice starts will fail for a while until puppet has run on all the web nodes and the proxies, but current HTTP traffic will continue to be served.

Warning: The Redis replication between proxies is currently broken (cf. phab:T152356). Additionally, gerrit:253364 needs to be documented here.
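
To check the current replication state (for example, to see whether the warning above still applies), a minimal sketch on the standby proxy:

$ redis-cli info replication
# role:slave and master_link_status:up indicate healthy replication from the active proxy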

To switch over:

  1. Switch the floating IP for tools.wmflabs.org (currently 208.80.155.131) from one proxy to the other (if tools-proxy-01 is currently active, switch to tools-proxy-02, and vice versa). You can use this link to do the switchover. The easiest way to verify that the routing is correct (other than just hitting tools.wmflabs.org) is to tail /var/log/nginx/access.log on the proxy machines.
  2. Use hiera to set the active proxy host (toollabs::active_proxy) to the hostname (not fqdn) of the newly active proxy.
  3. Force a puppet run on all the proxies and webgrid nodes (see the clush sketch after this list).
  4. Run puppet on the DNS recursor hosts (labservices1001 and labservices1002). This is required for internal hits to tools.wmflabs.org to resolve.
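
The puppet runs in step 3 can be forced host by host or with clush (a sketch; the host group names here are assumptions, see Command orchestration below for the actual groups):

$ clush -w @webproxy -b 'sudo puppet agent -t'
$ clush -w @webgrid -b 'sudo puppet agent -t'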

Doing potentially dangerous puppet merges

Since we have two instances, it's easy to verify that a puppet merge doesn't boink anything (a command sketch follows the steps below).

  1. Disable puppet on the 'active' instance (you can find this from hiera and by tailing /var/log/nginx/access.log)
  2. Run puppet on the other instance
  3. Check to make sure everything is ok. curl -H "Host: tools.wmflabs.org" localhost is a simple smoketest
  4. If ok, then enable puppet on the 'active' instance, and test that too to make sure
  5. Celebrate!
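
A command-level sketch of the steps above (run each command on the instance indicated):

# On the 'active' proxy:
$ sudo puppet agent --disable "verifying risky merge"
# On the standby proxy:
$ sudo puppet agent -t
$ curl -H "Host: tools.wmflabs.org" localhost
# If everything looks good, back on the 'active' proxy:
$ sudo puppet agent --enable
$ sudo puppet agent -t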

Recovering a failed proxy

When a proxy fails, it should be brought back and recovered so that it can be the new hot spare. This can be a post fire-fighting operation. The process for that is:

  1. Bring the machine back up (this implies that whatever hardware issue caused the machine to be down has been fixed)
  2. Run puppet (this will start up replication from the current active master)

Checking redis registration status

# On the active proxy: which backend is the tool (here dplbot) registered to?
$ redis-cli hgetall prefix:dplbot
.*
http://tools-webgrid-lighttpd-1205.tools.eqiad.wmflabs:33555

# Registration history can also be found in the redis append-only file:
$ grep -C 5 -e 'dplbot' /var/lib/redis/tools-proxy-01-6379.aof
(...)
HDEL
$13
prefix:dplbot
$2
.*
*4
(...)
HSET
$13
prefix:dplbot
$2
.*
$60
http://tools-webgrid-lighttpd-1202.tools.eqiad.wmflabs:44504

Static webserver

This is a simple, stateless nginx http server. Simply switch the floating IP from tools-web-static-01 to tools-web-static-02 (or vice versa) to switch over. Recovery is equally trivial - just bring the machine back up and make sure puppet is ok.

Checker service

This is the service that catchpoint (our external monitoring service) hits to check status of several services. It's totally stateless, so just switching the public IP from tools-checker-01 to -02 (or vice versa) should be fine (IP switch direct link). Same procedure as static webserver.

GridEngine Master

The gridengine scheduler/dispatcher runs on tools-master and manages dispatching jobs to execution nodes and reporting. The active master writes its name to /var/lib/gridengine/default/common/act_qmaster, where all end-user tools pick it up. tools-grid-master normally serves in this role, but tools-grid-shadow can also be manually started as the master (if and only if there are currently no active masters) with service gridengine-master start on the shadow master.

Note that puppet is configured to start the master at every run on the designated master, and this probably needs to be disabled there if one intends to use the shadow master as primary.
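
A sketch of manually promoting the shadow master (hostnames as described above):

# On the designated master, keep puppet from restarting gridengine-master there:
$ sudo puppet agent --disable "shadow master promoted to primary"
# On tools-grid-shadow, only if no master is currently running:
$ sudo service gridengine-master start
$ cat /var/lib/gridengine/default/common/act_qmaster    # should now name the shadow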

Redundancy

Every 30s, the master touches the file /var/spool/gridengine/qmaster/heartbeat. On tools-grid-shadow there is a shadow master that watches this file for staleness and will fire up a new master on itself if the file has not been touched for too long (currently set at 10m). This only works if the running master crashed or was killed uncleanly (including the server hosting it crashing), because a clean shutdown creates a lockfile forbidding shadows from starting a master (as would be expected in the case of willfully stopped masters).

If it does, it changes act_qmaster to point to itself, redirecting all userland tools. This move is unidirectional; once the master is ready to take over again, the gridengine-master on tools-grid-shadow needs to be shut down manually and the one on tools-master started (this is necessary to prevent flapping, or split brain, if tools-grid-master only failed temporarily). This is done simply with service gridengine-master {stop/start}.
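
A sketch of the failback sequence:

# On tools-grid-shadow:
$ sudo service gridengine-master stop
# On the regular master:
$ sudo service gridengine-master start
$ cat /var/lib/gridengine/default/common/act_qmaster    # should point back at the regular master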

Redis

Redis runs on two instances - tools-redis-1001 and -1002, and the currently active master is set via hiera on toollabs::active_redis (defaults to tools-redis-1001). The other is set to be a slave of the master. Switching over can be done by:

  1. In hiera, set toollabs::active_redis to the hostname (not fqdn) of the host that is up (the new master)
  2. Force a puppet run on the redis hosts
  3. Restart redis on the redis hosts; this resets current connections and makes each host see itself as master or slave as appropriate
  4. Set the IP address for 'tools-redis.tools.eqiad.wmflabs' and 'tools-redis.eqiad.wmflabs' in hieradata/common/dnsrecursor/labsaliaser.yaml to point to the IP of the new master. This needs a puppet merge + run on the DNS hosts (labservices1001 and holmium as of now). Eventually we'd like to move this step to Horizon... A sketch for verifying the switchover follows this list.
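
A sketch for verifying the switchover (assuming redis listens on the default port and allows local unauthenticated access):

# On the new master:
$ redis-cli info replication | grep role          # expect role:master
# On the new slave:
$ redis-cli info replication | grep -e role -e master_link_status
# From any tools instance, check that the alias resolves to the new master:
$ host tools-redis.eqiad.wmflabs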

Services

These are services that run off service manifests for each tool - currently just the webservicemonitor service. They are warm standbys requiring manual switchover. tools-services-01 and tools-services-02 both run the exact same code, but only one of them is 'active' at a time. Which one is determined by the puppet role parameter role::labs::tools::services::active_host. Set that via [[1]] to the fqdn of the host that should be 'active' and run puppet on all the services hosts. This will start the services on the active host and stop them on the other. Since these services do not have any internal state, they can be run from any host without having to switch back afterwards.

Bigbrother also runs on this host, via upstart. The log file can be found in /var/log/upstart/bigbrother.log.

Command orchestration

We have a clush setup that allows admins to execute arbitrary commands on groups of instances at the same time. It is set up with a master (currently tools-puppetmaster-02), which has the role role::toollabs::clush::master. Classification of nodes is done via prefix matching - this logic needs to be kept up to date. You can find the mapping in modules/role/files/toollabs/clush/tools-clush-generator.

Example commands:

  • List all host groups: nodeset -l
  • Show hosts in host group: nodeset -e @redis
  • Run command on all hosts in group: clush -w @redis -b "cat /proc/cpuinfo | tail"
    • -w selects the hosts or host group (prepend @). Alternatively, use -g redis
    • -b collects all the output and deduplicates it before displaying.
    • e.g. list all processes connected to freenode from exec hosts: clush -w @exec -b "sudo netstat -atp | grep freenode"
    • e.g. list all weblinkchecker processes: clush -w @exec -b "ps axw o user:20,pid:8,%cpu:8,cmd | grep weblink | grep -v clush"

For more information on clush's amazing features, read the docs!

Administrative tasks

Logging in as root

In case the normal login does not work, for example due to an LDAP failure, administrators can also log in directly as root. To prepare for that occasion, generate a separate key with ssh-keygen, add an entry to the passwords::root::extra_keys hash in Hiera:Tools with your shell username as key and your public key as value, and wait a Puppet cycle for your key to be added to the root accounts. Then add to your ~/.ssh/config:

# Use different identity for Tools root.
Match host *.tools.eqiad.wmflabs user root
     IdentityFile ~/.ssh/your_secret_root_key

The code that reads passwords::root::extra_keys is in labs/private:modules/passwords/manifests/init.pp.

Disabling all ssh logins except root

Useful for dealing with security critical situations. Just touch /etc/nologin and PAM will prevent any and all non-root logins.
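
A minimal sketch:

$ sudo touch /etc/nologin     # block all non-root logins
$ sudo rm /etc/nologin        # re-enable logins when the situation is resolved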

SGE resources

PDF manuals can be found using [2].

List of handy commands

Most commands take -xml as a parameter to enable xml output. This is useful when lines get cut off.

Queries

  • list queues on given host: qhost -q -h <hostname>
  • list jobs on given host: qhost -j -h <hostname>
  • list all queues: qstat -f
  • qmaster log file: tail -f /data/project/.system/gridengine/spool/qmaster/messages

Configuration

  • modify host group config: qconf -mhgrp \@general
  • print host group config: qconf -shgrp \@general
  • modify queue config: qconf -mq queuename
  • print queue config: qconf -sq continuous
  • enable a queue: qmod -e 'queue@node_name'
  • disable a queue: qmod -d 'queue@node_name'


  • add host as exec host: qconf -Ae node_name
  • print exec host config: qconf -se node_name
  • remove host as exec host: qconf -de node_name


  • add host as submit host: qconf -as node_name
  • remove host as submit host: qconf -ds node_name


  • add host as admin host: qconf -ah node_name
  • remove host as admin host: qconf -dh node_name

Accounting

  • retrieve information on finished job: qacct -j <jobid or jobname>
    • there are a few scripts in /home/valhallasw/accountingtools (these need to be puppetized):
    • vanaf.py makes a copy of recent entries in the accounting file
    • accounting.py contains python code to read in the accounting file
    • Usage:
      valhallasw@tools-bastion-03:~/accountingtools$ php time.php "-1 hour"
      1471465675
      valhallasw@tools-bastion-03:~/accountingtools$ python vanaf.py 1471465675 mylog
      Seeking to timestamp  1471465675
      ...
      done!
      valhallasw@tools-bastion-03:~/accountingtools$ grep mylog -e '6727696' | python merlijn_stdin.py
      25 1970-01-01 00:00:00 1970-01-01 00:00:00 tools-webgrid-lighttpd-1206.eqiad.wmflabs tools.ptwikis lighttpd-precise-ptwikis 6727696
      0 2016-08-17 21:01:42 2016-08-17 21:01:46 tools-webgrid-lighttpd-1207.eqiad.wmflabs tools.ptwikis lighttpd-precise-ptwikis 6727696
      Traceback (most recent call last):
        File "merlijn_stdin.py", line 4, in <module>
          line = raw_input()
      EOFError: EOF when reading a line
      
      • Ignore the EOFError; the relevant lines are above it. Error codes (first entry) are typically 0 (finished successfully), 19 ('before writing exit_status' = crashed?), 25 (rescheduled) or 100 ('assumedly after job' = lost job?). I'm not entirely sure about the codes when the job stops because of an error.

Creating a new node

Clearing error state

Sometimes due to various hiccups (like LDAP or DNS malfunction), grid jobs can move to an Error state from which they will not come out without explicit user action. Once you have ascertained the cause of the Error state and fixed it, you can clear all the error state jobs with:

qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 qmod -cj

You also need to clear all the queues that have gone into error state. Failing to do so prevents jobs from being scheduled on those queues. You can clear all error states on queues with:

qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//'  | xargs qmod -cq

Draining a node of Jobs

  1. Disable the queues on the node with qmod -d '*@node_name'
  2. Reschedule continuous jobs running on the node (see below)
  3. Wait for non-restartable jobs to drain (if you want to be nice!) or qdel them
  4. Once whatever needed to be done, reenable the node with qmod -e '*@node_name'

There is no simple way to delete or reschedule jobs on a single host, but the following snippet is useful to provide a list to the command line:

$(qhost -j -h node_name| awk '{ print $1; }' |egrep ^[0-9])

which makes for reasonable arguments to qdel or qmod -rj.
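
For example (a sketch; node_name is a placeholder):

# Reschedule all restartable jobs on the node:
$ qmod -rj $(qhost -j -h node_name | awk '{ print $1; }' | egrep '^[0-9]')
# Or delete them outright:
$ qdel $(qhost -j -h node_name | awk '{ print $1; }' | egrep '^[0-9]')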

Decommission a node

  1. Drain the node (see above!). Give the non-restartable jobs some time to finish (maybe even a day if you are feeling generous?).
  2. Remove node from hostgroups it is present in, if any. You can check / remove with a qconf -mhgrp @general or qconf -mhgrp @webgrid on any admin host. This will open up the list in a text editor, where you can carefully delete the name of the host and save. Be careful to keep the line continuations going.
  3. Remove the node from any queues it might be included directly in. Look at qconf -sql for list of queues, and then qconf -mq <queue-name> to see list of hosts in it. Note that this seems to be mostly required only for webgrid hosts (yay consistency!)
  4. Remove the node from gridengine with qconf -de <fqdn>. Also do qconf -de <hostname>.eqiad.wmflabs for completeness, since some of the older nodes do not use fqdn.
  5. If the node is a webgrid node, also remove it from being a submit host, with qconf -ds <fqdn>. Also do qconf -ds <hostname>.eqiad.wmflabs for completeness, since some of the older nodes do not use fqdn.
  6. Check if there's a host alias for the node in operations/puppet, in modules/toollabs/files/host_aliases. If there is, get rid of it.
  7. Wait for a while, then delete the VM!

Local package management

Local packages are provided by an aptly repository on tools-services-01.

On tools-services-01, you can manipulate the package database with various commands; cf. aptly(1). Afterwards, you need to publish the database to the file Packages with (for the trusty-tools repository) aptly publish --skip-signing update trusty-tools. To use the packages on the clients, you need to wait 30 minutes or run apt-get update. In general, you should never just delete packages, but move them to ~tools.admin/archived-packages.

You can always see where a package is (would be) coming from with apt-cache showpkg $package.
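
A sketch of adding a new package to the trusty-tools repository (the package filenames are examples):

# On tools-services-01:
$ aptly repo add trusty-tools jobutils_1.18_all.deb
$ aptly publish --skip-signing update trusty-tools
# Archive the superseded package instead of deleting it:
$ mv jobutils_1.17_all.deb ~tools.admin/archived-packages/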

Local package policy

Package repositories

  • We only install packages from trustworthy repositories.
    • OK are
      • The official Debian and Ubuntu repositories, and
      • Self-built packages (apt.wikimedia.org and aptly)
    • Not OK are:
      • PPAs
      • other 3rd party repositories

Packagers effectively get root on our systems: they could add a rootkit to a package or upload an unsafe sshd version, and apt-get would happily install it.

Hardness clause: in extraordinary cases, and for 'grandfathered in' packages, we can deviate from this policy, as long as security and maintainability are kept in mind.

apt.wikimedia.org

We assume that whatever is good for production is also OK for Tool Labs.

aptly

We manage the aptly repository ourselves.

  • Packages in aptly need to be built by tool labs admins
    • we cannot import .deb files from untrusted 3rd party sources
  • Package source files need to come from a trusted source
    • a source file from a trusted source (i.e. backports), or
    • we build the debian source files ourselves
    • we cannot build .dcs files from untrusted 3rd party sources
  • Packages need to be easy to update and build
    • cowbuilder/pdebuild OK
    • fpm is OK
    • See Deploy new jobutils package for an example walk through of building and adding packages to aptly.
  • We only package if strictly necessary
    • infrastructure packages
    • packages that should be available for effective development (e.g. composer or sbt)
    • not: python-*, lib*-perl, ..., which should just be installed with the available platform-specific package managers
  • For each package, it should be clear who is responsible for keeping it up to date
    • for infrastructure packages, this should be one of the paid staffers

A list of locally maintained packages can be found at the /local packages subpage.

Building packages

/data/project/dpkg is used as storage for building packages. The 'how' of package building has not been completely thought out yet, but there are two basic routes:

Deploy new jobutils package

See Deploy new jobutils package

Webserver statistics

To get a look at webserver statistics, goaccess is installed on the webproxies. Usage:

goaccess --date-format="%d/%b/%Y" --log-format='%h - - [%d:%t %^] "%r" %s %b "%R" "%u"' -q -f /var/log/nginx/access.log

Interactive key bindings are documented on the man page. HTML output is supported by piping the output to a file. Note that nginx logs are rotated (twice?) daily, so only very recent data is available.
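
For example, to produce an HTML report (a sketch; the output path is arbitrary):

goaccess --date-format="%d/%b/%Y" --log-format='%h - - [%d:%t %^] "%r" %s %b "%R" "%u"' -q -f /var/log/nginx/access.log > /tmp/access-report.html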

Restarting all webservices

This is sometimes necessary if the proxy entries are out of whack. It can be done with:

qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{ print $1;}' | xargs -L1 qmod -rj

The qstat gives us a list of all jobs from all users under the two webgrid queues, and the qmod -rj asks gridengine to restart them. This can be run as root on tools-login.wmflabs.org.

Banning an IP from tool labs

On Hiera:Tools, add the IP to the list of dynamicproxy::banned_ips, then force a puppet run on the webproxies. Add a note to Help:Tool Labs/Banned explaining why. The user will get a message like [3].

Deploying the main web page

This website (plus the 403/500/503 error pages) is hosted under tools.admin. To deploy:
$ become admin
$ cd toollabs
$ git pull

Regenerate replica.my.cnf

This requires access to the active labstore host, and can be done as follows (on the active labstore):

/usr/local/sbin/maintain-dbusers delete tools.$name

Regenerate kubernetes credentials (.kube/config)

  1. Delete the `.kube/config` file
  2. On the k8s master (currently tools-k8s-master-01.tools.eqiad.wmflabs), remove the line for the tool from /etc/kubernetes/tokenauth. Make a backup before you edit this file :D
  3. On the k8s master, run systemctl restart maintain-kubeusers
  4. You can run journalctl -u maintain-kubeusers -f to follow the logs. The script will only create the missing things, so you might see some error messages. This is unfortunately normal!

Adding K8S Components

See Tools Kubernetes#Building_new_nodes

Deleting a tool

  1. Ask the maintainers to delete or back up code and data they no longer need.
  2. Stop running grid jobs and web services (qstat; see the sketch after this list).
  3. Delete the tool at Special:NovaServiceGroup.
  4. Use list-user-databases /data/project/$TOOLNAME/replica.my.cnf to check whether the tool has any databases; if it has any and they are valuable, dump them to the tool's directory and delete the databases.
  5. If the tool's directory has valuable code or data, back it up to /data/project/.system/removed_tools and notify the previous maintainers or someone else appropriate. This is just courtesy – there are no guarantees that deleted tools can be restored.
  6. Delete the tool's directory on labstore. (replica.my.cnf files are immutable and that attribute cannot be changed over NFS.)
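
A sketch of the checks in steps 2 and 4 ($TOOLNAME is a placeholder for the tool's name):

$ qstat -u tools.$TOOLNAME                                     # any running grid jobs or webservices?
$ list-user-databases /data/project/$TOOLNAME/replica.my.cnf   # any databases?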

Updating JDK8

We build and keep a jdk8 package for trusty for use by tools. This will be supported until the end of 2016, at which point they should all be on k8s. Until then, admins have to keep updating the jdk8 package whenever new Java security vulnerabilities show up. This can be done the following way:

  1. Go to https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa and check if there's an update. This is set up by the same person who maintains jdk8 in Ubuntu, so the hope is that this will be good enough for us.
  2. Get the source package (.dsc file, .orig.tar.xz and .debian.tar.gz) from https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa/+packages. Download them onto a pdebuilder host (currently tools-docker-builder-03)
  3. Verify the signature on the dsc file (TODO: Add instructions!)
  4. Extract the source package, with dpkg-source -x $DSCFILE. This will extract a directory that can be used to build the actual debs.
  5. In a screen session, run DIST=trusty pdebuild (a sketch of this and the previous step follows this list)
  6. Check back in a few hours, and if the build has succeeded, upload it to aptly on tools-services-01.
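
A sketch of the extract-and-build steps (the version numbers and filenames are examples):

# On the pdebuild host, inside a screen session:
$ dpkg-source -x openjdk-8_8u131-b11-0ubuntu1.dsc
$ cd openjdk-8-8u131-b11/
$ DIST=trusty pdebuild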

Email maintenance

Currently all mail services run on tools-mail (tools-mail-01 does not serve e-mail, and should probably be removed?). A few pointers:

SSL certificates

We use a bunch of different SSL certs on tools.

star.wmflabs.org

This is used for the tools web proxies (tools-proxy-*) and the tools static proxies (tools-static-*). The secret key files for these are kept as cherry-picks on the tools puppetmaster (tools-puppetmaster-01). Updating the certs involves just changing them centrally, running puppet on all these nodes, and restarting nginx.

star.tools.wmflabs.org

This is used for the k8s master (tools-k8s-master-*) and the docker registry (tools-docker-registry-*). These aren't centrally managed yet, but should be!

Granting a tool write access to Elasticsearch

  • Generate a random password and the htpasswd crypt entry for it using the script new-es-password.sh. (This must be run on a host with the `htpasswd` command installed. A MediaWiki-Vagrant virtual machine will work for this in a pinch.)
$ ./new-es-password.sh tools.an-example
user=tools.an-example
password="aAI47zCXKltGGl5iUVs+vUYMUtRL15Y/NPu5ou0SOP0="
---------------------------------
tools.an-example:$apr1$Uc51lEHf$gl6zeKIVvZ7uiTOuD/47Z1
  • Add the htpasswd hash to puppet:
$ ssh tools-puppetmaster-02.tools.eqiad.wmflabs
$ cd /var/lib/git/labs/private
$ sudo -i vim /var/lib/git/labs/private/modules/secret/secrets/labs/toollabs/elasticsearch/htpasswd
... paste in htpasswd crypt data ...
:wq
$ sudo git add modules/secret/secrets/labs/toollabs/elasticsearch/htpasswd
$ sudo git commit -m "[local] Elasticsearch credentials for $TOOL"
  • Force a puppet run on tools-elastic-0[123]
  • Create the credentials file in the tool's $HOME:
$ ssh tools-dev.wmflabs.org
$ sudo -i touch /data/project/$TOOL/.elasticsearch.ini
$ sudo -i chmod o-rwx /data/project/$TOOL/.elasticsearch.ini
$ sudo -i vim /data/project/$TOOL/.elasticsearch.ini
... paste in username and raw password in ini file format ...
:wq
  • Resolve the ticket!

Kubernetes

See Tools_Kubernetes

Emergency guides

Other

Servicegroup log

tools.admin runs /data/project/admin/bin/toolhistory, which provides an hourly snapshot of ldaplist -l servicegroup as a git repository in /data/project/admin/var/lib/git/servicegroups.

HBA: How does it work?

wikibooks:en:OpenSSH/Cookbook/Host-based_Authentication#Client_Configuration_for_Host-based_Authentication. If things don't work, check every point listed in that guide - sshd doesn't give you much output to work with.

Central syslog servers

tools-logs-01 and tools-logs-02 are central syslog servers that receive syslog data from all? tools hosts. These are stored in /srv/syslog.

maintain-kubeusers stuck

If home directories for new tools stop getting created, it is likely because the service that creates them, maintain-kubeusers on the k8s master (currently tools-k8s-master-01), got stuck with its LDAP connection. You should be able to simply log into the master node and run sudo service maintain-kubeusers restart to get things working again.

Brainstorming