Nova Resource:Tools/SAL/Archive 2

2017-12-31

  • 02:00 bd808: Killed some pwb.py and qacct processes running on tools-bastion-03

2017-12-21

  • 17:57 bd808: PAWS: deleted hub-deployment pod stuck in crashloopbackoff
  • 17:30 bd808: PAWS: deleting hub-deployment pod. Lots of "Connection pool is full" warnings in pod logs
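
A minimal sketch of the kind of cleanup these PAWS entries describe, run from tools-paws-master-01 (the pod name is taken from the 2017-12-18 entries below; the suffix of the actual pod varies, and the prod namespace matches the other kubectl entries in this archive):

    # list pods in the prod namespace and spot the one stuck in CrashLoopBackOff
    kubectl get pods -n prod | grep -i crashloopbackoff
    # look at recent logs first (this is where the "Connection pool is full" warnings showed up)
    kubectl logs -n prod hub-deployment-1381799904-b5g5j --tail=50
    # delete the stuck pod; the deployment recreates it automatically
    kubectl delete pod -n prod hub-deployment-1381799904-b5g5j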

2017-12-19

  • 21:27 chasemp: reboot tools-paws-master-01
  • 18:38 andrewbogott: rebooting tools-paws-master-01
  • 05:07 andrewbogott: "service gridengine-master restart" on tools-grid-master

2017-12-18

  • 12:04 arturo: it seems jupyterhub tries to use a database which doesn't exist: [E 2017-12-18 11:59:49.896 JupyterHub app:904] Failed to connect to db: sqlite:///jupyterhub.sqlite
  • 11:58 arturo: The restart didn't work. I could see a lot of log lines in the hub-deployment pod with something like: 2017-12-17 04:08:17,574 WARNING Connection pool is full, discarding connection: 10.96.0.1
  • 11:51 arturo: the restart was with: kubectl get pod -o yaml hub-deployment-1381799904-b5g5j -n prod | kubectl replace --force -f -
  • 11:50 arturo: restart pod hub-deployment in paws to try to fix the 502

2017-12-15

  • 13:55 arturo: same in tools-checker-02.tools.eqiad.wmflabs
  • 13:54 arturo: same in tools-exec-1415.tools.eqiad.wmflabs
  • 13:52 arturo: running 'sudo puppet agent -t -v' in tools-webgrid-lighttpd-1416.tools.eqiad.wmflabs since it didn't update in the last clush run

2017-12-14

2017-12-13

  • 17:37 andrewbogott: upgrading puppet packages on all VMs
  • 00:59 madhuvishy: Cordon and Drain tools-worker-1016
  • 00:47 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1018-1023, 1025-1027
  • 00:34 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1011, 1013-1015, 1017
  • 00:28 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1006-1010
  • 00:11 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1002-1005
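
The "Drain + Cordon, Reboot, Uncordon" cycle above corresponds roughly to the following per-node commands run against the k8s master (a sketch only; the node name is illustrative and the drain flags are the ones logged on 2017-03-15):

    NODE=tools-worker-1016.tools.eqiad.wmflabs          # illustrative; repeated for each worker
    kubectl cordon "$NODE"                              # stop new pods from being scheduled there
    kubectl drain --delete-local-data --force "$NODE"   # evict the pods currently running on it
    ssh "$NODE" 'sudo reboot'                           # reboot the instance, then wait for it to return
    kubectl uncordon "$NODE"                            # allow scheduling on it again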

2017-12-12

  • 23:29 madhuvishy: rebooting tools-worker-1012
  • 18:50 andrewbogott: rebooting tools-worker-1001

2017-12-11

  • 19:32 bd808: git gc on tools-static-11; --aggressive was killed by system (T182604)
  • 18:07 andrewbogott: upgrading tools puppetmaster to v4
  • 17:07 bd808: git gc --aggressive on tools-static-11 (T182604)

2017-12-01

  • 15:33 chasemp: put the weird mess of untracked files on the tools puppetmaster into stash to see what breaks, as they should not be there
  • 15:30 chasemp: prometheus nfs collector on tools-bastion-03

2017-11-30

  • 23:23 bd808: Hard reboot of tools-bastion-03 via Horizon
  • 23:06 chasemp: rebooting login.tools.wmflabs.org due to overload

2017-11-20

2017-11-17

  • 21:33 valhallasw`cloud: also chmod g-w'ed those files, and sent emails to all the affected users
  • 21:17 valhallasw`cloud: chmod o-w'ed a bunch of files reported by Dispenser; writing emails to the owners about this

2017-11-16

  • 17:40 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --enable && sudo puppet agent --test && sudo unattended-upgrades -d'
  • 16:50 bd808: Force upgraded nginx on tools-elastic-*
  • 16:37 chasemp: reboot tools-checker-01
  • 15:17 chasemp: disable puppet

2017-11-15

  • 22:48 madhuvishy: Rebooted tools-paws-worker-1017
  • 15:53 chasemp: reboot bastion-03
  • 15:48 chasemp: kill tools.powow on bastion-03 for hammering IO and making bastion unusable

2017-11-07

  • 01:21 bd808: Removed all non-directory files from /home (via labstore1004 direct access)

2017-11-06

  • 18:30 bd808: Load on tools-bastion-03 down to 0.72 from 17.47 after killing a bunch of local processes that should have been running on the job grid instead

2017-11-05

  • 23:48 bd808: Cleaned up 2 huge /tmp files left by tools.croptool (~6.5G)
  • 23:44 bd808: Cleaned up 109 files owned by tools.rezabot on tools-webgrid-lighttpd-1428 with `sudo find /tmp -user tools.rezabot -exec rm {} \+`
  • 23:37 bd808: Cleaned up 955 files owned by tools.wsexport on tools-webgrid-lighttpd-1428 with `sudo find /tmp -user tools.wsexport -exec rm {} \+`

2017-11-03

  • 21:19 bd808: Deployed misctools 1.26 (T156174)

2017-11-02

  • 16:15 bd808: Restarted nslcd on tools-bastion-03

2017-11-01

  • 07:11 madhuvishy: Clear nscd cache across all projects post labsdb dns switchover T179464
  • 07:11 madhuvishy: Clear nscd cache across all projects post labsdb dns switchover

2017-10-31

  • 16:50 bd808: tools-bastion-03 (tools-login, login.tools) is overloaded

2017-10-30

  • 17:35 madhuvishy: Clear dns caches across tools hosts `sudo nscd -i hosts`
  • 16:08 arturo: repool tools-exec-1401.tools.eqiad.wmflabs
  • 15:57 arturo: depool again tools-exec-1401.tools.eqiad.wmflabs for more tests related to T179024
  • 12:47 arturo: repool tools-exec-1401
  • 11:58 arturo: depool tools-exec-1401 to test patch in T179024 --> aborrero@tools-bastion-03:~$ sudo exec-manage depool tools-exec-1401.tools.eqiad.wmflabs

2017-10-24

  • 18:09 madhuvishy: Disable puppet on tools-package-builder-01 temporarily (T178920)
  • 13:22 chasemp: start admin webservice
  • 13:22 chasemp: stop admin webservice

2017-10-23

  • 14:49 chasemp: wall message and scheduled reboot in 5m for bastion-03

2017-10-18

  • 21:36 chasemp: stop basebot -- it is going crazy and spamming email w/ failing to log to error.log. Need to figure out how to notify but it's clearly in a failure loop.
  • 14:04 chasemp: add strephit creds to elasticsearch per T178310

2017-10-12

  • 16:57 bd808: Rebuilding all Kubernetes Docker images to include toollabs-webservice 0.38
  • 16:53 bd808: Upgraded toollabs-webservice to 0.38

2017-10-06

  • 15:33 bd808: Upgrade jobutils to 1.25 (T177614)
  • 00:27 bd808: Updated misctools to 1.24

2017-10-05

  • 22:47 bd808: Updated misctools to 1.23
  • 22:42 bd808: Updated jobutils to 1.23
  • 15:46 chasemp: tools-bastion-03 has tons of local tools running long lived NFS intensive processes. I'm rebooting rather than playing whackamole.

2017-10-03

  • 19:30 bd808: `kubectl --namespace=prod delete pod --all` on tools-paws-master-01

2017-10-01

  • 21:46 madhuvishy: Cold migrating tools-clushmaster-01 from labvirt1015 to labvirt1017

2017-09-29

  • 19:49 andrewbogott: migrating tools-clushmaster-01 to labvirt1015

2017-09-25

  • 15:14 andrewbogott: rebooting tools-paws-worker-1006 since I can't access it
  • 14:57 chasemp: OS_TENANT_NAME=tools openstack server reboot 2c0cf363-c7c3-42ad-94bd-e586f2492321 (unresponsive)

2017-09-20

  • 16:52 madhuvishy: apt-get install --only-upgrade apache2; service apache2 restart on tools-puppetmaster-01

2017-09-19

  • 15:22 chasemp: tools-clushmaster-01:~$ clush -f 5 -g all 'sudo puppet agent --test'
  • 13:39 chasemp: bastion-03: someone dropped 8.6G in /tmp, which seemingly is /not/ on a temp file system
  • 13:25 chasemp: wall Bastion disk is full and needs attention and reboot in 60

2017-09-18

  • 18:02 bd808: Updated PHP5.6 images for Kubernetes (T172358)

2017-09-13

  • 15:34 bd808: Running inbound message purge via clush to @tools-exec
  • 15:15 bd808: Running outbound message purge via clush to @tools-exec
  • 13:57 bd808: apt-get install nginx-common on tools-static-1[01]
  • 13:31 bd808: static down due to apparent nginx package upgrade/config change
  • 02:10 bd808: Really disabled puppet on tools-mail
  • 01:51 bd808: Nuked all messages in the exim spool on tools-mail
  • 01:09 bd808: Removed user WiktCAPT from project
  • 00:55 bd808: Archived and then purged /var/spool/exim4/input on tools-mail
  • 00:47 bd808: Archived and then purged /var/spool/exim4/msglog on tools-mail
  • 00:43 bd808: Stopped exim on tools-mail
  • 00:43 bd808: Disabled puppet on tools-mail
  • 00:15 chasemp: forced to clean out exim queue as the filesystem used up all inodes

2017-08-31

  • 20:33 madhuvishy: Updated certs and ran puppet, restarted nginx on tools-proxy-* and tools-static-* (T174611)
  • 20:25 madhuvishy: Merging new cert https://gerrit.wikimedia.org/r/#/c/374873/ (T174611)
  • 20:24 madhuvishy: Disabling puppet on tools-proxy-* and tools-static-* for star.wmflabs.org SSL cert update (T174611)
  • 20:23 madhuvishy: Disabling puppet on tools-proxy-* and tools-static-* for star.wmflabs.org SSL cert update
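
In outline, the cert rollout above follows the usual disable/merge/apply pattern driven from the clush master (a sketch under the assumption that the affected hosts were the proxy and static instances named in the entries; the exact clush invocation is not recorded):

    NODES='tools-proxy-[01-02],tools-static-[10-11]'    # assumed host set
    # freeze puppet on the affected hosts before the cert change is merged
    clush -w "$NODES" 'sudo puppet agent --disable "star.wmflabs.org cert update T174611"'
    # ...merge https://gerrit.wikimedia.org/r/#/c/374873/ on the puppetmaster...
    # re-enable and run puppet to ship the new cert, then restart nginx to load it
    clush -w "$NODES" 'sudo puppet agent --enable && sudo puppet agent -t'
    clush -w "$NODES" 'sudo service nginx restart'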

2017-08-24

  • 19:59 bd808: restarted nslcd and nscd on tools-bastion-03
  • 19:59 bd808: restarted nslcd and nscd on tools-bastion-02

2017-08-22

  • 19:20 andrewbogott: deleted tools-puppetmaster-02, it was replaced a month ago by -01

2017-08-12

  • 18:38 chasemp: restart admin webservice

2017-08-11

2017-08-10

  • 14:59 chasemp: 'become stimmberechtigung && restart' && 'become intersect-contribs && restart'

2017-08-09

  • 17:28 chasemp: webservices restart tools.orphantalk

2017-08-03

  • 00:47 bd808: tools-bastion-03 not usably responsive to interactive commands; will reboot
  • 00:00 bd808: Restarted kube-proxy service on bastion-03

2017-08-02

  • 16:59 bd808: Force deleted 6 jobs stuck in 'dr' state

2017-07-31

  • 15:28 chasemp: remove python-keystoneclient from bastion-03

2017-07-27

  • 23:27 bd808: Killed python procs owned by sdesabbata on tools-login that were stealing all cpu/io
  • 21:16 bd808: Disabled puppet on tools-proxy-01 to test nginx proxy config changes
  • 16:27 bd808: Enabled puppet on tools-static-11
  • 16:10 bd808: Disabled puppet on tools-static-11 to test https://gerrit.wikimedia.org/r/#/c/357878

2017-07-26

  • 22:33 chasemp: hotpatching an hiera value on tools master to see effects

2017-07-20

  • 19:48 bd808: Clearing all Eqw state jobs in all queues with: qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 qmod -cj
  • 13:54 andrewbogott: upgrading apache2 on tools-puppetmaster-01
  • 04:00 chasemp: tools-webgrid-lighttpd-1402:~# service nslcd restart && service nscd restart
  • 03:57 chasemp: tools-exec-1428:~# service nslcd restart && service nscd restart
  • 03:57 bd808: Restarted cron, nscd, nslcd on tools-cron-01
  • 03:45 chasemp: tools-puppetmaster-01:~# service nslcd restart && service nscd restart
  • 03:44 chasemp: tools-puppetmaster-01:~# service nslcd restart && service nscd restart
  • 03:37 bd808: Restarted apache on tools-puppetmaster-01

2017-07-19

  • 23:52 bd808: Restarted cron on tools-cron-01; toolschecker job showing user not found errors
  • 21:19 valhallasw`cloud: Restarted nslcd on tools-bastion-03 (=tools-login); logins seem functional again.
  • 21:18 bd808: Forced puppet run and restarted nscd, nslcd on tools-bastion-02

2017-07-18

  • 19:51 andrewbogott: enabling puppet on tools-proxy-02. I don't know why it was disabled.

2017-07-17

  • 01:43 bd808: Uncordoned tools-worker-1020 after deleting pods with local storage that were filling the entire disk
  • 01:36 bd808: Depooling tools-worker-1020

2017-07-13

  • 21:59 bd808: Elasticsearch cluster upgraded to 5.3.2
  • 21:25 bd808: Upgrading ElasticSearch cluster for T164842. There will be service interruptions
  • 17:59 bd808: Puppet is disabled on tools-proxy-02 with no reason specified.
  • 17:09 bd808: Upgraded nginx-common on tools-proxy-02
  • 17:05 bd808: Upgraded nginx-common on tools-proxy-01

2017-07-12

  • 15:46 chasemp: push out puppet run across tools
  • 12:15 andrewbogott: restarting 'admin' webservice

2017-07-07

  • 18:26 bd808: Forced puppet runs on tools-redis-* for security fix

2017-07-03

  • 04:26 bd808: cdnjs on tools-static-10 is up to date
  • 03:38 bd808: cdnjs on tools-static-11 is up to date
  • 02:19 bd808: Cleaning up stuck merges for cdnjs clones on tools-static-10 and tools-static-11

2017-07-01

  • 19:40 bd808: Disabled puppet on tools-k8s-master-01 to try and fix maintain-kubeusers
  • 19:32 bd808: Restarted maintain-kubeusers on tools-k8s-master-01

2017-06-30

  • 01:33 chasemp: time for i in `cat tools-hosts`; do ssh -i ~/.ssh/labs_root_id_rsa root@$i.eqiad.wmflabs 'hostname -f; uptime; tc-setup'; done
  • 01:29 andrewbogott: rebooting tools-cron-01

2017-06-29

  • 23:01 madhuvishy: Uncordoned all k8s-workers
  • 20:50 madhuvishy: depooling, rebooting and repooling all grid exec nodes (see the sketch after this list)
  • 20:36 andrewbogott: depooling, rebooting, and repooling every lighttpd node three at a time
  • 19:55 madhuvishy: Killed liangent-php jobs and usrd-tools jobs
  • 18:00 madhuvishy: drain cordon reboot uncordon tools-worker-1015
  • 17:37 madhuvishy: drain cordon reboot uncordon tools-worker-1005 tools-worker-1007 tools-worker-1008
  • 17:22 bd808: rebooting tools-static-11
  • 17:20 andrewbogott: rebooting tools-static-10
  • 17:20 madhuvishy: drain cordon reboot uncordon tools-worker-1012 tools-worker-1003
  • 17:13 madhuvishy: drain cordon reboot uncordon tools-worker-1022, tools-worker-1009, tools-worker-1002
  • 16:27 chasemp: restart k8s components on master (madhu)
  • 16:10 chasemp: tools-flannel-etcd-01:~$ sudo service etcd restart
  • 16:04 madhuvishy: reboot tools-worker-1022 tools-worker-1009
  • 15:57 chasemp: reboot tools-docker-registry-01 for nfs
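
A sketch of the depool/reboot/repool loop referenced above. The exec-manage depool call appears verbatim in the 2017-10-30 entries; the repool subcommand, the batching, and the sleep are assumptions about how the run was scripted:

    # illustrative batch of three nodes; the real run worked through every exec node
    for node in tools-exec-1401 tools-exec-1402 tools-exec-1403; do
        host="$node.tools.eqiad.wmflabs"
        sudo exec-manage depool "$host"     # take the node out of the gridengine queues
        ssh "$host" 'sudo reboot'
        sleep 300                           # crude wait for the instance to come back up
        sudo exec-manage repool "$host"     # put it back into service (subcommand assumed)
    done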

2017-06-27

  • 21:32 andrewbogott: moving all tools nodes to new puppetmaster, tools-puppetmaster-01.tools.eqiad.wmflabs

2017-06-25

  • 15:13 madhuvishy: Restarted webservice on tools.fatameh

2017-06-24

  • 16:01 bd808: Created and provisioned elasticsearch password for tools.wmde-uca-test (T167971)

2017-06-23

  • 20:20 bd808: Reindexing various elasticsearch indexes created before we upgraded to v2.x
  • 20:19 bd808: Dropped garbage indexes in elasticsearch cluster

2017-06-22

  • 17:03 bd808: Rolled back attempt at Elasticsearch upgrade. Indices need to be rebuilt with 2.x before 5.x can be installed. T164842
  • 16:19 bd808: Backed up elasticsearch indexes to personal laptop using elasticdump in case T164842 goes horribly wrong
  • 00:12 bd808: Set ownership and permissions on $HOME/.kube for all tools (T165875)
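
One plausible shape for the $HOME/.kube fix above, as a loop over tool homes (a hedged sketch: the directory layout, ownership convention, and modes are assumptions, and the real fix may have gone through maintain-kubeusers instead):

    # assumed layout: each tool's home is /data/project/<tool>, owned by the tools.<tool> service user
    for home in /data/project/*/; do
        tool="tools.$(basename "$home")"
        [ -d "${home}.kube" ] || continue
        sudo chown -R "$tool:$tool" "${home}.kube"   # credentials owned by the tool account
        sudo chmod -R o-rwx "${home}.kube"           # and not readable by other users
    done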

2017-06-21

  • 17:43 andrewbogott: repooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 17:42 madhuvishy: Restarted webservice for openstack-browser
  • 17:36 andrewbogott: depooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 17:35 andrewbogott: repooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
  • 17:24 andrewbogott: depooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
  • 17:23 andrewbogott: repooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
  • 17:11 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
  • 17:10 andrewbogott: repooling tools-webgrid-lighttpd-1412, tools-exec-1423
  • 16:57 andrewbogott: depooling tools-webgrid-lighttpd-1412, tools-exec-1423
  • 16:53 andrewbogott: repooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
  • 16:52 andrewbogott: repooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 16:35 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 16:29 andrewbogott: depooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
  • 16:05 godog: delete pods for lolrrit-wm to force restart
  • 15:45 andrewbogott: repooling tools-exec-1422, tools-webgrid-lighttpd-1413
  • 15:41 andrewbogott: switching the proxy ip back to tools-proxy-02
  • 15:31 andrewbogott: temporarily pointing the tools-proxy IP to tools-proxy-01
  • 15:26 andrewbogott: depooling tools-exec-1422, tools-webgrid-lighttpd-1413
  • 15:12 andrewbogott: depooling tools-exec-1404, tools-exec-1434, tools-worker-1026
  • 15:10 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 14:53 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 14:52 andrewbogott: repooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
  • 14:37 andrewbogott: depooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
  • 14:32 andrewbogott: repooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
  • 14:20 andrewbogott: depooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
  • 14:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407
  • 13:56 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407

2017-06-14

  • 22:09 bd808: Restarted apache2 proc on tools-puppetmaster-02

2017-06-08

  • 18:14 madhuvishy: Also delete from /tmp on tools-webgrid-lighttpd-1411 xvfb-run.*, calibre_* and ws-*.epub
  • 18:10 madhuvishy: Delete ws-*.epub from /tmp on tools-webgrid-lighttpd-1426
  • 18:07 madhuvishy: Clean up space on /tmp on tools-webgrid-lighttpd-1426 by deleting temp files xvfb-run.* and calibre_1.25.0_tmp_* created by the wsexport tool

2017-06-07

  • 19:05 madhuvishy: Killed scp job run by user torin8 on tools-bastion-02

2017-06-06

  • 20:30 chasemp: rebooting tools-bastion-02 as unresponsive (up 76 days and lots of seemingly left behind things running)

2017-06-05

  • 23:44 bd808: Deleted tools.iabot crontab that somehow got locally installed on tools-exec-1412 on 2017-05-24T20:55Z
  • 22:15 bd808: Deleted tools.aibot crontab that somehow got locally installed on tools-exec-1436 on 2017-05-24T20:55Z
  • 19:55 andrewbogott: disabling puppet on tools-proxy-01 and -02 for a staged rollout of https://gerrit.wikimedia.org/r/#/c/350494/16

2017-06-01

  • 15:15 andrewbogott: depooling/rebooting/repooling tools-exec-1403 as part of old kernel-purge testing

2017-05-31

  • 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice v0.37 (T163355)
  • 19:24 bd808: Updating toollabs-webservice package via clush (T163355)
  • 19:16 bd808: Installed toollabs-webservice_0.37_all.deb from local file on tools-bastion-02 (T163355)
  • 16:34 andrewbogott: running 'apt-get -yq autoremove' env='{DEBIAN_FRONTEND: "noninteractive"}' on all instances with salt
  • 16:25 andrewbogott: rebooting tools-exec-1404 as part of a disk-space-saving test
  • 14:07 andrewbogott: migrating tools-exec-1409 to labvirt1009 to reduce CPU load on labvirt1006 (T165753)

2017-05-30

  • 22:32 andrewbogott: migrating tools-webgrid-lighttpd-1406, tools-exec-1410 from labvirt1006 to labvirt1009 to balance cpu usage
  • 18:15 andrewbogott: restarted robokobot virgule to free up leaked files
  • 17:36 andrewbogott: restarting excel2wiki to clean up file leaks
  • 17:36 andrewbogott: restarting idwiki-welcome in kenrick95bot to free up leaked files
  • 17:31 andrewbogott: restarting onetools to clean up file leaks
  • 17:29 andrewbogott: restarting ytcleaner webservice to clean up leaked files
  • 17:22 andrewbogott: restarting vltools to clean up leaked files
  • 17:20 madhuvishy: Uncordoned tools-worker-1006
  • 17:16 madhuvishy: Killed tool videoconvert on tools-exec-1440 in debugging labstore disk space issues
  • 17:15 madhuvishy: Drained and rebooted tools-worker-1006
  • 17:15 andrewbogott: restarted croptool to clean up stray files
  • 17:15 madhuvishy: depooled, rebooted, and repooled tools-exec-1412
  • 17:15 andrewbogott: restarted catmon tool to clean up stray files

2017-05-26

  • 20:32 bd808: Added tools-webgrid-lighttpd-14{19,2[0-8]} as submit hosts
  • 20:31 bd808: Added tools-webgrid-lighttpd-1412 and tools-webgrid-lighttpd-1413 as submit hosts
  • 20:28 bd808: sudo qconf -as tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs

2017-05-22

  • 07:49 chasemp: move ooooold shared resources into archive for later cleanup

2017-05-20

  • 09:27 madhuvishy: Truncating jerr.log for tool videoconvert since it's 967GB

2017-05-10

  • 19:11 bd808: Edited striker db record for user Stepan Grigoryev to detach SUL and Phab accounts. T164849
  • 17:47 bd808: Signed and revoked puppet certs generated when our DNS flipped out and gave hosts non-FQDN hostnames
  • 17:29 bd808: Fixed broken puppet cert on tools-package-builder-01

2017-05-04

  • 19:23 madhuvishy: Rebooting tools-grid-shadow
  • 16:21 madhuvishy: Start instance tools-grid-master.tools from horizon
  • 16:20 madhuvishy: Shut off tools-grid-master.tools instance from horizon
  • 16:16 madhuvishy: Stopped gridengine-shadow on tools-grid-shadow.tools (service gridengine-shadow stop and kill -9 individual shadowd processes)

2017-04-24

  • 15:33 bd808: Removed Gergő Tisza as a projectadmin for T163611; event done

2017-04-21

  • 22:30 bd808: Added Gergő Tisza as a projectadmin for T163611
  • 13:43 chasemp: T161898 clush -g all 'sudo puppet agent --disable "rollout nfs-mount-manager"'

2017-04-20

  • 17:15 bd808: Deleted shutdown VM tools-docker-builder-04; tools-docker-builder-05 is the new hotness
  • 17:11 bd808: kill -INT 19897 on tools-proxy-02 to stop a hung nginx child process left from the last graceful restart of nginx

2017-04-19

  • 15:10 bd808: apt-get install psmisc on tools-proxy-0[12]
  • 13:23 chasemp: stop docker on tools-proxy-01
  • 13:20 chasemp: clean up disk space on tools-proxy-01

2017-04-18

  • 20:37 bd808: Restarted bigbrother on tools-services-02
  • 04:23 bd808: Shutdown tools-docker-builder-04; will wait a bit before deleting
  • 04:04 bd808: Built and pushed new Docker images based on 82a46b4 (Refactor apt-get actions in Dockerfiles)
  • 03:42 bd808: Made tools-docker-builder-05.tools.eqiad.wmflabs the active docker build host
  • 01:01 bd808: Built instance tools-package-builder-01

2017-04-17

  • 20:41 bd808: Building tools-docker-builder-05
  • 19:35 chasemp: add reedy to sudo all perms so he can admin things
  • 17:21 andrewbogott: adding 8 more exec nodes: tools-exec-1435 through 1442

2017-04-11

  • 16:46 andrewbogott: added exec nodes tools-exec-1430, 31, 32, 33, 34.
  • 14:15 andrewbogott: emptied /srv/pbuilder to make space on tools-docker-04
  • 02:35 bd808: Restarted maintain-kubeusers on tools-k8s-master-01

2017-04-03

  • 13:48 chasemp: enable puppet on gridmaster

2017-04-01

  • 15:28 andrewbogott: added five new exec nodes, tools-exec-1425 through 1429
  • 14:26 chasemp: up nfs thresholds https://gerrit.wikimedia.org/r/#/c/345975/
  • 14:00 chasemp: disable puppet on tools-grid-master
  • 13:52 chasemp: tools-grid-master tc-setup clean
  • 13:40 chasemp: restart nscd and nslcd on tools-grid-master
  • 13:31 chasemp: reboot tools-exec-1420

2017-03-31

  • 22:25 yuvipanda: apt-get update && apt-get install kubernetes-node on tools-proxy-01 to upgrade kube-proxy systemd service unit

2017-03-30

  • 20:29 chasemp: stop grid-master temporarily & umount -fl project nfs & remount & start grid-master
  • 17:38 chasemp: reboot tools-exec-1401
  • 17:30 madhuvishy: Updating tools project hiera config to add role::labs::nfsclient::lookupcache: all via Horizon (T136712)
  • 17:29 madhuvishy: Disabled puppet across tools in prep for T136712

2017-03-27

  • 04:06 andrewbogott: erasing random log files on tools-proxy-01 to avoid filling the disk

2017-03-23

  • 20:38 andrewbogott: migrating tools-exec-1401 to labvirt1001
  • 19:56 andrewbogott: migrating tools-exec-1408 to labvirt1001
  • 19:02 andrewbogott: migrating tools-exec-1407 to labvirt1001
  • 16:37 andrewbogott: migrating tools-webgrid-lighttpd-1402 and 1407 to labvirt1001 (testing labvirt1001 and easing CPU load on labvirt1010)

2017-03-22

  • 13:48 andrewbogott: migrating tools-bastion-02 in 15 minutes

2017-03-21

  • 17:06 andrewbogott: moving tools-webgrid-lighttpd-1404 to labvirt1012 to ease pressure on labvirt1004
  • 16:19 andrewbogott: moving tools-exec-1406 to labvirt1011 to ease CPU usage on labvirt1004

2017-03-20

  • 22:47 yuvipanda: disable puppet on all k8s workers to test https://gerrit.wikimedia.org/r/#/c/343708/
  • 18:36 bd808: Applied openstack::clientlib on tools-checker-02 and forced puppet run
  • 18:03 bd808: Applied openstack::clientlib on tools-checker-01 and forced puppet run
  • 17:31 andrewbogott: migrating tools-exec-1417 to labvirt1013
  • 17:05 andrewbogott: migrating tools-webgrid-lighttpd-1410 to labvirt1012 to reduce load on labvirt1001
  • 16:42 andrewbogott: migrating tools-webgrid-generic-1404 to labvirt1011 to reduce load on labvirt1001
  • 16:13 andrewbogott: migrating tools-exec-1408 to labvirt1010 to reduce load on labvirt1001

2017-03-17

  • 17:24 andrewbogott: moving tools-webgrid-lighttpd-1416 to labvirt1013 to reduce load on labvirt1004
  • 17:15 andrewbogott: moving tools-exec-1424 to labvirt1012 to ease load on labvirt1004

2017-03-15

  • 19:21 andrewbogott: added new exec nodes: tools-exec-1421 and tools-exec-1422
  • 17:42 madhuvishy: Restarted stashbot
  • 17:29 chasemp: docker stop && rm -fR /var/lib/docker/* on worker-1001
  • 17:20 chasemp: test of logging
  • 16:11 chasemp: k8s master 'for h in `kubectl get nodes | grep worker | grep -v NotReady | grep -v Disabled | awk '{print $1}'`; do echo $h && kubectl drain --delete-local-data --force $h && sleep 10 ; done'
  • 16:08 chasemp: stop puppet on k8s master and drain nodes
  • 15:50 chasemp: (late) kill what appears to be an android emulator? unsure but it's eating all IO

2017-03-14

  • 21:24 bd808: Deleted tools-precise-dev (T160466)
  • 21:13 bd808: Removed non-existent tools-submit.eqiad.wmflabs from submit hosts list
  • 21:02 bd808: Deleted tools-exec-gift (T160461)
  • 20:45 bd808: Deleted tools-webgrid-lighttpd-12* nodes (T160442)
  • 20:29 bd808: Deleted tools-exec-12* nodes (T160457)
  • 20:27 bd808: Disassociated floating IPs from tools-exec-12* nodes (T160457)
  • 17:41 madhuvishy: Hand fix tools-puppetmaster by removing the old mariadb submodule directory
  • 17:23 madhuvishy: Remove role::toollabs::precise_reminder from tools-bastion-03
  • 15:40 bd808: Installing toollabs-webservice 0.36 across cluster using clush
  • 15:36 bd808: Upgraded toollabs-webservice to 0.36 on tools-bastion-02.tools
  • 15:25 bd808: Installing jobutils 1.21 across cluster using clush
  • 15:23 bd808: Installed jobutils 1.21 on tools-bastion-02
  • 15:03 bd808: Shutting down webservices running on Precise job grid nodes

2017-03-13

  • 21:12 valhallasw`cloud: tools-bastion-03: killed heavy unzip operation from staeiou, and heavy (inadvertent large file opening?) vim operation from steenth, as the entire server was blocked due to high i/o

2017-03-07

  • 17:59 andrewbogott: depooling, migrating tools-exec-1416 as part of ongoing labvirt1001 issues
  • 17:21 madhuvishy: tools-webgrid-lighttpd-1409 migrated to labvirt1011 and repooled
  • 16:31 madhuvishy: Depooled tools-webgrid-lighttpd-1409 for cold migrating to different labvirt

2017-03-06

  • 22:52 andrewbogott: migrating tools-webgrid-lighttpd-1411 to labvirt1011 to give labvirt1001 a break
  • 19:03 madhuvishy: Stopping webservice running on tool tree-of-life on author request
  • 18:25 yuvipanda: set complex_values slots=300,release=trusty for tools-exec-gift-trusty-01.tools.eqiad.wmflabs

2017-03-04

  • 23:47 madhuvishy: Added new k8s workers 1028, 1029

2017-02-28

  • 03:52 scfc_de: Deployed jobutils and misctools 1.20/1.20~precise+1 (T158722).

2017-02-27

  • 02:42 scfc_de: Purged misctools from instances where it was not puppetized.
  • 02:42 scfc_de: Deployed jobutils and misctools 1.19/1.19~precise+1 (T155787, T156886).

2017-02-17

  • 12:51 chasemp: create tools-exec-gift-trusty-01
  • 12:40 chasemp: create tools-exec-gift-trusty
  • 12:24 chasemp: mass apt-get clean and removal of some old .gz log files due to 30+ low space warnings

2017-02-15

  • 18:45 yuvipanda: clush a restart of nscd across all of tools
  • 00:01 bd808: Rebuilt python and python2 Docker images (T157744)

2017-02-08

  • 06:22 yuvipanda: drain tools-worker-1026 for docker upgrade
  • 05:28 yuvipanda: drain pods from tools-worker-1027.tools.eqiad.wmflabs for docker upgrade
  • 05:28 yuvipanda: disable puppet on all k8s nodes in preparation for docker upgrade

2017-02-07

  • 13:49 scfc_de: Deployed toollabs-webservice_0.33_all.deb (T156605, T156626).
  • 13:49 scfc_de: Deployed tools-manifest_0.11_all.deb.

2017-02-04

  • 02:13 yuvipanda: launch tools-worker-1027 to see if puppet works fine on first run!
  • 02:13 yuvipanda: reboot tools-worker-1026 to see if it comes up fine
  • 01:46 yuvipanda: launch tools-worker-1026

2017-02-03

  • 21:34 madhuvishy: Migrated over precise tools to trusty for user multichill (catbot, family, locator, multichill, nlwikibots, railways, wlmtrafo, wikidata-janitor)
  • 21:13 chasemp: reboot tools-bastion-03 as unresponsive

2017-02-02

  • 20:39 yuvipanda: import docker-engine 1.11.2 (currently running version) and 1.12.6 (latest version) into aptly
  • 00:06 madhuvishy: Remove user maximilianklein from tools.cite-o-meter (on request)

2017-01-30

  • 20:25 yuvipanda: sudo ln -s /usr/bin/kubectl /usr/local/bin/kubectl to temporarily fix webservice shell not working

2017-01-27

  • 19:22 chasemp: reboot tools-bastion-02 as it is having issues
  • 02:01 madhuvishy: Reenabled puppet on tools-checker-01
  • 00:29 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/

2017-01-26

  • 23:37 madhuvishy: reenabled puppet on tools-checker
  • 23:02 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/
  • 16:08 chasemp: major cleanup for stale var items on tools-exec-1221

2017-01-24

  • 18:14 andrewbogott: one last reboot of tools-mail
  • 18:00 andrewbogott: apt-get autoremove on tools-mail
  • 17:51 andrewbogott: rebooting tools-mail post upgrade
  • 17:19 andrewbogott: restarting tools-mail, beginning do-release-upgrade -d -q
  • 17:17 andrewbogott: backing up tools-mail to ~root/8c499e6e-1b79-4bb1-8f7f-72fee1f74ea5-backup on labvirt1009
  • 17:15 andrewbogott: stopping tools-mail, backing up, upgrading from precise to trusty
  • 15:49 yuvipanda: clush -g all 'sudo rm /usr/local/bin/kube*' to get rid of old kube related binaries
  • 14:42 yuvipanda: re-enable puppet on tools-proxy-01, test success on proxy-02
  • 14:37 yuvipanda: disable puppet on tools-proxy-01 (active proxy) to check deploying debianized kube-proxy on proxy-02
  • 13:52 yuvipanda: upgrading k8s on worker nodes to use debs + new k8s version
  • 13:52 yuvipanda: finished upgrading k8s + using debs
  • 12:49 yuvipanda: purge ancient kubectl, kube-apiserver, kube-controller-manager, kube-scheduler packages from tools-k8s-master-01, these were my old terrible packages

2017-01-23

  • 19:36 andrewbogott: temporarily shutting down tools-webgrid-lighttpd-1201
  • 19:35 yuvipanda: depool tools-webgrid-lighttpd-1201 for snapshotting tests
  • 17:13 chasemp: reboot tools-exec-1411 as having serious transient issues

2017-01-20

  • 15:58 yuvipanda: enabling puppet across all hosts
  • 15:36 yuvipanda: disable puppet everywhere to cherrypick patch moving base to a profile
  • 00:50 bd808: sudo qdel -f 1199218 to force delete a stuck toolschecker job

2017-01-17

2017-01-11

  • 22:09 chasemp: add Reedy to admin in tool labs (approved by bryan and chase for access to investigate specific tool abuse behavior)

2017-01-10

  • 19:05 madhuvishy: Killed 3 jobs from tools.arnaub that were causing high load on tools-exec-1411

2017-01-06

  • 19:02 bd808: Terminated deprecated instances tools-exec-121[2-6] (T154539)

2017-01-04

  • 02:43 madhuvishy: Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. T152369

2017-01-03

  • 23:56 bd808: Removed tools-exec-12[12-16] from gridengine (T154539)
  • 23:27 bd808: drained tools-exec-1216 (T154539)
  • 23:26 bd808: drained tools-exec-1215 (T154539)
  • 23:25 bd808: drained tools-exec-1214 (T154539)
  • 23:25 bd808: drained tools-exec-1213 (T154539)
  • 23:24 bd808: drained tools-exec-1212 (T154539)
  • 23:11 madhuvishy: Disabled puppet on tools-checker-01 (T152369)
  • 21:43 madhuvishy: Adding iptables rule to drop incoming connections from toolschecker on labservices1001
  • 20:51 madhuvishy: Adding iptables rule to block outgoing connections to labservices1001 on tools-checker-01
  • 20:43 madhuvishy: Silenced tools checker on icinga to test labservices1001 failure causing toolschecker to flake out T152369

2016-12-25

  • 00:28 yuvipanda: comment out cron running 'clean' script of avicbot every minute without -once
  • 00:28 yuvipanda: force delete all jobs of avicbot
  • 00:25 yuvipanda: delete all jobs of avicbot. This is 419 jobs
  • 00:20 yuvipanda: kill clean.sh process of avicbot

2016-12-19

  • 20:07 valhallasw`cloud: killed gps_exif_bot2.py (tools.gpsexif), was using 50MB/s io, lagging all of tools-bastion-03
  • 13:06 yuvipanda: run /usr/local/bin/deploy-master http://tools-docker-builder-03.tools.eqiad.wmflabs v1.3.3wmf1 on tools-k8s-master-01
  • 12:53 yuvipanda: cleaned out pbuilder from tools-docker-builder-01 to clean up

2016-12-17

  • 04:49 yuvipanda: turned on lookupcache again for bastions
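
lookupcache here is the NFS client mount option; "turning it on again" means going back from lookupcache=none to the cached behaviour. A sketch of how that looks on a bastion, reusing the umount-then-puppet pattern from the 2016-05-24 entry (the real change was puppetized, cf. the role::labs::nfsclient::lookupcache hiera key logged on 2017-03-30):

    # check whether lookupcache=none is still present in the mount options
    grep lookupcache /proc/mounts
    # unmount the share until it is gone, then let puppet remount it with the new options
    while sudo umount /home; do :; done && sudo puppet agent -t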

2016-12-15

  • 18:52 yuvipanda: reboot tools-exec-1204
  • 18:49 yuvipanda: reboot tools-webgrid-lighttpd-12[01-05]
  • 18:45 yuvipanda: reboot tools-exec-gift
  • 18:41 yuvipanda: reboot tools-exec-1217 to 1221
  • 18:30 yuvipanda: rebooted tools-exec-1212 to 1216
  • 14:55 yuvipanda: reboot tools-services-01

2016-12-14

  • 18:43 mutante: tools-bastion-03 - ran 'locale-gen ko_KR.EUC-KR' for T130532

2016-12-13

  • 20:54 chasemp: reboot bastion-03 as unresponsive

2016-12-09

  • 19:32 godog: upgrade / restart prometheus-node-exporter
  • 08:37 YuviPanda: run delete-dbusers and force replica.my.cnf creation for all tools that did not have it

2016-12-08

  • 18:48 YuviPanda: restarted toolschecker on tools-checker-01

2016-12-07

2016-12-06

  • 00:36 bd808: Updated toollabs-webservice to 0.31 on rest of cluster (T147350)

2016-12-05

  • 23:19 bd808: Updated toollabs-webservice to 0.31 on tools-bastion-02 (T147350)
  • 22:55 bd808: Updated jobutils to 1.17 on tools-mail (T147350)
  • 22:53 bd808: Updated jobutils to 1.17 on tools-precise-dev (T147350)
  • 22:53 bd808: Updated jobutils to 1.17 on tools-cron-01 (T147350)
  • 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-03 (T147350)
  • 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-02 (T147350)
  • 16:53 bd808: Terminated deprecated instances: "tools-exec-1201", "tools-exec-1202", "tools-exec-1203", "tools-exec-1205", "tools-exec-1206", "tools-exec-1207", "tools-exec-1208", "tools-exec-1209", "tools-exec-1210", "tools-exec-1211" (T151980)
  • 16:50 bd808: Released floating IPs from decommissioned tools-exec-12[01-11] instances

2016-11-30

  • 23:06 bd808: Removed tools-exec-12[01-11] from gridengine (T151980) (see the sketch after this list)
  • 22:54 bd808: Removed tools-exec-12[01-11] from @general hostgroup
  • 15:17 chasemp: restart coibot 'coibot.sh -o syslog.output -e syslog.errors -r yes'
  • 05:20 bd808: rescheduled continuous jobs on tools-exec-1210; 2 task queue jobs remain (T151980)
  • 05:18 bd808: drained tools-exec-1211 (T151980)
  • 05:14 bd808: drained tools-exec-1209 (T151980)
  • 05:13 bd808: drained tools-exec-1208 (T151980)
  • 05:12 bd808: drained tools-exec-1207 (T151980)
  • 05:10 bd808: drained tools-exec-1206 (T151980)
  • 05:07 bd808: drained tools-exec-1205 (T151980)
  • 05:04 bd808: drained tools-exec-1204 (T151980)
  • 05:00 bd808: drained tools-exec-1203 (T151980)
  • 05:00 bd808: drained tools-exec-1202 (T151980)
  • 04:58 bd808: disabled queues on tools-exec-1211 (T151980)
  • 04:58 bd808: disabled queues on tools-exec-1210 (T151980)
  • 04:58 bd808: disabled queues on tools-exec-1209 (T151980)
  • 04:57 bd808: disabled queues on tools-exec-1208 (T151980)
  • 04:57 bd808: disabled queues on tools-exec-1207 (T151980)
  • 04:57 bd808: disabled queues on tools-exec-1206 (T151980)
  • 04:56 bd808: disabled queues on tools-exec-1205 (T151980)
  • 04:56 bd808: disabled queues on tools-exec-1204 (T151980)
  • 04:56 bd808: disabled queues on tools-exec-1203 (T151980)
  • 04:55 bd808: disabled queues on tools-exec-1202 (T151980)
  • 04:52 bd808: drained tools-exec-1201 (T151980)
  • 04:48 bd808: draining tools-exec-1201
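
Per host, the decommissioning steps logged above (disable queues, drain, remove from the hostgroup and the grid) map onto gridengine commands roughly like this; a sketch only, reusing the qstat/xargs style from the 2017-07-20 entry, with the FQDN and exact flags as assumptions:

    HOST=tools-exec-1201.eqiad.wmflabs                   # repeated for each deprecated node
    qmod -d "*@$HOST"                                    # disable every queue instance on the host
    # reschedule the jobs still running there, then wait for the host to drain
    qstat -u '*' -q "*@$HOST" | tail -n +3 | awk '{print $1}' | xargs -r -L1 qmod -rj
    qconf -dattr hostgroup hostlist "$HOST" @general     # drop it from the @general hostgroup
    qconf -de "$HOST"                                    # remove the execution host definition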

2016-11-29

2016-11-22

  • 15:13 chasemp: re-add attr +i to replica.my.cnf; it seems to have gotten lost in the rsync migration

2016-11-21

  • 21:15 YuviPanda: disable puppet everywhere
  • 19:49 YuviPanda: restart all webservice jobs on gridengine to pick up logging again

2016-11-20

  • 06:51 Krenair: ran `qmod -rj lighttpd-admin` as tools.admin to try to get the main page back up, it worked briefly but then broke again

2016-11-16

  • 20:14 yuvipanda: upgrade toollabs-webservice to 0.30 on all webgrid nodes
  • 18:31 chasemp: reboot tools-exec-1404 (already depooled)
  • 18:19 chasemp: reboot tools-exec-1403
  • 17:23 chasemp: reboot tools-exec-1212 (converted via 321786 testing for recovery on boot)
  • 16:55 chasemp: clush -g all "puppet agent --disable 'trail run for changeset 321786 handling /var/lib/gridengine'"
  • 02:05 yuvipanda: rebooting tools-docker-registry-01, can't ssh in
  • 01:43 yuvipanda: cleanup old images on tools-docker-builder-03

2016-11-15

  • 19:52 chasemp: reboot tools-precise-dev
  • 05:20 yuvipanda: restart all k8s webservices too
  • 05:05 yuvipanda: restarting all webservices on gridengine
  • 03:21 chasemp: reboot tools-checker-01
  • 02:56 chasemp: reboot tools-exec-1405 to ensure noauto works (because atboot=>false is lies)
  • 02:31 chasemp: reboot tools-exec-1406

2016-11-14

  • 22:51 chasemp: shut down bastion 02 and 05 and make 03 root only
  • 19:35 madhuvishy: Stopped cron on tools-cron-01 (T146154)
  • 18:24 madhuvishy: Tools NFS is read-only. /data/project and /home across tools are ro T146154
  • 16:57 yuvipanda: stopped gridengine master
  • 16:47 yuvipanda: start restarting kubernetes webservice pods
  • 16:30 madhuvishy: Unmounted all nfs shares from tools-k8s-master-01 (sudo /usr/local/sbin/nfs-mount-manager clean) T146154
  • 16:22 yuvipanda: kill maintain-kubeusers on tools-k8s-master-01, sole process touching NFS
  • 16:22 chasemp: enable puppet and run on tools-services-01
  • 16:21 yuvipanda: restarting all webservice jobs, watching webservicewatcher logs on tools-services-02
  • 16:14 madhuvishy: Disabling puppet across tools T146154

2016-11-11

  • 20:49 madhuvishy: Dual mount of tools share complete. Puppet reenabled across tools hosts. T146154
  • 20:18 madhuvishy: Rolling out dual mount of tools share across all hosts T146154
  • 19:29 madhuvishy: Disabling puppet across tools to dual mount tools share from labstore-secondary T146154

2016-11-02

  • 18:23 yuvipanda: manually stop tools-grid-master for reboot
  • 17:42 yuvipanda: drain nodes from labvirt1012 and 13
  • 13:42 chasemp: depool tools-exec-1404 for maint

2016-11-01

  • 21:54 yuvipanda: stop gridengine-master on tools-grid-master in preparation for reboot
  • 21:34 yuvipanda: depool tools nodes on labvirt1012
  • 21:16 yuvipanda: depool things in labvirt1011
  • 20:58 yuvipanda: depool tools nodes on labvirt1010
  • 20:32 yuvipanda: depool tools things on labvirt1005 and 1009
  • 20:08 yuvipanda: depooled things on labvirt1006 and 1008
  • 19:51 yuvipanda: move tools-elastic-03 to labvirt1010, -02 already in 09
  • 19:34 yuvipanda: migrate tools-elastic-03 to labvirt1009
  • 19:10 yuvipanda: depooled tools nodes from labvirt1004 and 1007
  • 17:57 yuvipanda: depool exec nodes on labvirt1002
  • 13:27 chasemp: reboot tools-exec-1404 post depool for test

2016-10-31

  • 21:50 yuvipanda: deleted cyberbot queue with qconf -dq cyberbot
  • 21:44 yuvipanda: restarted cron on tools-cron-01

2016-10-30

  • 02:25 yuvipanda: restarted maintain-kubeusers

2016-10-29

  • 17:21 yuvipanda: depool tools-worker-1005

2016-10-28

  • 20:15 chasemp: restart prometheus service on tools-prometheus-01 to see if that wakes it up
  • 20:06 yuvipanda: restart kube-apiserver again, ran into too many open file handles
  • 15:58 Yuvi[m]: restart k8s master, seems to have run out of fds
  • 15:43 chasemp: restart toolschecker service on 01 and 02

2016-10-27

  • 21:09 godog: upgrade prometheus on tools-prometheus0[12]
  • 18:49 andrewbogott: rebooting tools-webgrid-lighttpd-1401
  • 13:51 chasemp: reboot tools-webgrid-generic-1403
  • 13:50 chasemp: reboot dockerbuilder-01

2016-10-26

  • 23:20 madhuvishy: Disabling puppet on tools proxy hosts for applying proxy health check endpoint T143638
  • 23:17 godog: upgrade prometheus on tools-prometheus-02
  • 16:52 bd808: Deployed jobutils_1.16_all.deb on tools-mail (default jsub target to trusty)
  • 16:50 bd808: Deployed jobutils_1.16_all.deb on tools-precise-dev (default jsub target to trusty)
  • 16:48 bd808: Deployed jobutils_1.16_all.deb on tools-bastion-02, tools-bastion-03, tools-cron-01 (default jsub target to trusty)

2016-10-25

2016-10-24

  • 03:45 Krenair: reset host keys for tools-puppetmaster-02 on -01, looks like it was recreated 5-6 days ago

2016-10-20

  • 16:55 yuvipanda: killed bzip2 taking 100% CPU on tools-bastion-03

2016-10-18

  • 22:56 Guest20046: flip tools-k8s-master-01 to tools-puppetmaster-02
  • 07:43 yuvipanda: move all tools webgrid nodes to tools-puppetmaster-02 too
  • 07:40 yuvipanda: complete moving all general tools exec nodes to tools-puppetmaster-02
  • 07:33 yuvipanda: restarted puppetmaster on tools-puppetmaster-01

2016-10-17

  • 14:37 chasemp: remove bdsync-deb and bdsync-deb-2, erroneously created in Tools and now defunct anyway
  • 14:05 chasemp: restart puppetmaster on tools-puppetmaster-01 (instances sticking on puppet runs for a long time)
  • 14:01 chasemp: reboot tools-exec-1215 and tools-exec-1410 as unresponsive

2016-10-14

  • 16:20 yuvipanda: repooled tools-worker-1012, seems to have recovered?!
  • 15:57 yuvipanda: drain tools-worker-1012, seems stuck

2016-10-10

  • 18:04 valhallasw`vecto: sudo service bigbrother restart @ tools-services-02

2016-10-09

  • 18:33 valhallasw`cloud: removed empty local crontabs for {yuvipanda, yuvipanda, tools.toolschecker} on {tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1204, tools-checker-01}. No other local crontabs remaining.

2016-10-05

  • 12:15 chasemp: reboot tools-webgrid-generic-1404 as locked up

2016-10-01

  • 10:03 yuvipanda: re-enable puppet on tools-checker-02

2016-09-29

  • 18:15 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs via wikitech; couldn't ssh in
  • 18:10 bd808: Investigating elasticsearch cluster issues affecting stashbot

2016-09-27

  • 08:07 chasemp: tools-bastion-03:~# chmod 640 /var/log/syslog

2016-09-25

  • 15:27 Krenair: restarted labs-logbot under tools.morebots

2016-09-21

  • 18:56 madhuvishy: Repooled tools-webgrid-lighttpd-1418 (T146212) after dns records cleanup
  • 18:42 madhuvishy: Repooled tools-webgrid-lighttpd-1416 (T146212) after dns records cleanup
  • 16:57 chasemp: reboot tools-webgrid-lighttpd-1407, tools-webgrid-lighttpd-1210, tools-webgrid-lighttpd-1414, and then tools-webgrid-lighttpd-1405 as the first 3 return

2016-09-20

  • 23:24 yuvipanda: depool tools-webgrid-lighttpd-1416 and 1418, they aren't in actual working order
  • 21:23 madhuvishy|food: Pooled new sge exec node tools-webgrid-lighttpd-1416 (T146212)
  • 21:17 madhuvishy|food: Pooled new sge exec node tools-webgrid-lighttpd-1415 (T146212)
  • 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1418 (T146212)
  • 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1416 (T146212)
  • 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1415 (T146212)
  • 17:58 andrewbogott: reboot tools-exec-1410
  • 17:54 yuvipanda: repool tools-webgrid-lighttpd-1412
  • 17:49 yuvipanda: webgrid-lighttpd-1412 hung on io (no change in nova diagnostics), rebooting
  • 17:33 yuvipanda: reboot tools-puppetmaster-01
  • 17:20 yuvipanda: reboot tools-checker-02
  • 15:42 chasemp: move floating ip from tools-checker-02 (failed) to tools-checker-01
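
The floating IP move above can be expressed with the OpenStack CLI roughly as follows (a sketch; the address is a placeholder and the actual move may have been done through Horizon or the nova client instead; the OS_TENANT_NAME=tools prefix matches the 2017-09-25 entry):

    FLOATING_IP=198.51.100.10   # placeholder address, not the real toolschecker IP
    OS_TENANT_NAME=tools openstack server remove floating ip tools-checker-02 "$FLOATING_IP"
    OS_TENANT_NAME=tools openstack server add floating ip tools-checker-01 "$FLOATING_IP"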

2016-09-13

  • 21:09 madhuvishy: Bumped proxy nginx worker_connections limit T143637
  • 21:08 madhuvishy: Reenabled puppet across proxy hosts
  • 20:44 madhuvishy: Disabling puppet across proxy hosts

2016-09-12

  • 18:33 bd808: Forcing puppet run on tools-cron-01
  • 18:31 bd808: Forcing puppet run on tools-bastion-03
  • 18:28 bd808: Forcing puppet run on tools-bastion-02
  • 18:26 bd808: Forcing puppet run on tools-precise-dev
  • 18:26 bd808: Built toollabs-webservice v0.27 package and added to aptly

2016-09-10

  • 01:06 yuvipanda: migrate tools-k8s-etcd-01 to labvirt1012; it is in a state where it is doing no io

2016-09-09

  • 19:27 yuvipanda: reboot tools-exec-1218 and 1219
  • 18:10 yuvipanda: killed massive grep running as root

2016-09-08

  • 21:49 bd808: forcing puppet runs to install toollabs-webservice_0.26_all.deb
  • 20:51 bd808: forcing puppet runs to install jobutils_1.15_all.deb

2016-09-07

  • 21:11 Krenair: brought labs/private.git up to date on tools-puppetmaster-01
  • 02:32 Krenair: ran `SULWatcher/restart_SULWatcher.sh` as `tools.stewardbots` on bastion-03 to fix T144887

2016-09-06

  • 22:14 yuvipanda: got pbuilder off tools-services-01, was taking up too much space.
  • 22:10 madhuvishy: Deleted instance tools-web-static-01 and tools-web-static-02 (T143637)
  • 21:45 yuvipanda: reboot tools-prometheus-02. nova diagnostics shows no vda activity.
  • 20:43 chasemp: drain and reboot tools-exec-1410 for testing
  • 07:32 yuvipanda: depooled tools-exec-1219 and 1218, seem to be unresponsive, causing jobs that appear to run but aren't really

2016-09-05

  • 16:27 andrewbogott: rebooting tools-cron-01 because it is hanging all over the place

2016-09-01

  • 05:19 yuvipanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck

2016-08-31

  • 20:48 madhuvishy: Reenabled puppet across tools hosts
  • 20:45 madhuvishy: Scratch migration complete on all grid exec nodes (T134896)
  • 19:36 madhuvishy: Scratch migration on all non exec/worker nodes complete (T134896)
  • 18:18 madhuvishy: Scratch migration complete for all k8s workers (T134896)
  • 17:50 madhuvishy: Reenabling puppet across tools hosts.
  • 16:55 madhuvishy: Rsync-ed over latest backup of /srv/scratch from labstore1001 to labstore1003
  • 16:50 madhuvishy: Puppet disabling complete (T134896)

2016-08-30

2016-08-29

  • 23:38 Krenair: added myself to the tools.admin service group earlier to try to figure out what was causing the outage, removed again now
  • 16:35 yuvipanda: run chmod u+x /data/project/framabot
  • 13:40 chasemp: restart jouncebot

2016-08-28

  • 05:34 bd808: After git gc on web-static-02.tools:/srv/cdnjs: /dev/mapper/vd-cdnjs--disk 61G 54G 3.3G 95% /srv
  • 05:25 bd808: sudo git gc --aggressive on tools-web-static-01.tools:/srv/cdnjs
  • 04:56 bd808: sudo git gc --aggressive on tools-web-static-02.tools:/srv/cdnjs

2016-08-26

  • 16:53 yuvipanda: migrate tools-static-02 to labvirt1001

2016-08-25

  • 18:07 yuvipanda: restart puppetmaster on tools-puppetmaster-01
  • 17:41 yuvipanda: depooled tools-webgrid-1413
  • 01:16 yuvipanda: restarted puppetmaster on tools-puppetmaster-01

2016-08-24

  • 23:03 chasemp: reboot tools-exec-1217
  • 17:25 yuvipanda: depool tools-exec-1217, it is dead/stuck/hung/io-starved

2016-08-23

2016-08-22

2016-08-20

  • 11:42 valhallasw`cloud: rebooting tools-mail (hanging)

2016-08-19

  • 14:52 chasemp: reboot 82323ee4-762e-4b1f-87a7-d7aa7afa22f6

2016-08-18

  • 20:00 yuvipanda: restarted maintain-kubeusers on tools-k8s-master-01

2016-08-15

  • 22:10 yuvipanda: depool tools-exec-1211 and 1205, seem to be out of action
  • 19:12 yuvipanda: kill unused tools-merlbot-proxy

2016-08-12

  • 20:39 yuvipanda: delete tools-webgrid-lighttpd-1415, enough webservices have moved to k8s from that queue
  • 20:37 yuvipanda: delete tools-logs-01, going to recreate with a smaller image
  • 20:36 yuvipanda: delete tools-webgrid-generic-1405, enough things have moved to k8s from that queue!
  • 20:10 yuvipanda: migration of tools-grid-master to labvirt1013 complete
  • 20:01 yuvipanda: migrating tools-grid-master (currently inactive) to labvirt1013 away from crowded 1010
  • 12:40 chasemp: tools.templatetransclusioncheck@tools-bastion-03:~$ webservice restart

2016-08-11

  • 20:13 yuvipanda: tools-grid-master finally stopped
  • 20:05 yuvipanda: disabled tools-webgrid-lighttpd-1202, is hung
  • 17:23 yuvipanda: instance being rebooted is tools-grid-master
  • 17:22 chasemp: reboot master via nova as it is stuck

2016-08-05

  • 19:29 paladox: adding tom29739 to lolrrit-wm project

2016-08-04

  • 19:09 yuvipanda: cleaned up nginx log files in tools-docker-registry-01 to fix free space warning
  • 00:19 yuvipanda: added Krenair as admin to help with T132225 and other issues.

2016-08-03

  • 22:48 yuvipanda: deleted tools-worker-1005
  • 22:08 yuvipanda: depool & delete tools-worker-1007 and 1008
  • 21:34 yuvipanda: rebooting tools-puppetmaster-01 to test a hypothesis
  • 21:10 yuvipanda: rebooting tools-puppetmaster-01 for kernel upgrade
  • 00:20 madhuvishy: Repooled nodes tools-worker 1012 and 1013 for T141126

2016-08-02

  • 22:49 yuvipanda: depooled tools-worker-1014 as well for T141126
  • 22:44 yuvipanda: depool tools-worker-1015 for T141126
  • 22:42 paladox: cherry picking 302617 onto lolrrit-wm
  • 22:41 madhuvishy: Depooling tools-worker 1012 and 1013 for T141126
  • 22:32 yuvipanda: added paladox to tools
  • 09:38 godog: bounce morebots production
  • 00:01 yuvipanda: depool tools-worker-1017 for T141126

2016-08-01

  • 23:48 madhuvishy: Repooled tools-worker-1011 and tools-worker-1018 (Yuvi) for T141126
  • 23:41 madhuvishy: Repooled tools-worker-1010 and tools-worker-1019 (Yuvi) for T141126
  • 23:21 madhuvishy: Yuvi is depooling tools-worker-1018 for T141126
  • 23:19 madhuvishy: Depooling tools-worker 1010 and 1011 for T141126
  • 23:17 madhuvishy: Yuvi depooled tools-worker-1019 for T141126
  • 23:06 madhuvishy: Added tools-worker-1022 as new k8s worker node
  • 23:06 madhuvishy: Repooled tools-worker-1009 (T141126)
  • 22:48 madhuvishy: Depooling tools-worker-1009 to prepare for T141126

2016-07-29

  • 22:04 YuviPanda: repooled tools-worker-1006
  • 21:48 YuviPanda: deleted tools-worker-1006 after depooling+draining
  • 21:45 YuviPanda: repool new tools-worker-1003 with direct-lvm docker storage backend
  • 21:30 YuviPanda: depool tools-worker-1003 to be recreated with new docker config, picking this because it's on a non-ssd host
  • 21:17 YuviPanda: depooled tools-worker-1020/21 after fixing them up
  • 20:41 YuviPanda: delete tools-worker-1001
  • 20:29 YuviPanda: depool tools-worker-1001, going to recreate with to test new puppet deploying-first-run
  • 20:26 YuviPanda: built new worker nodes tools-worker-1020 and 21 with direct-lvm storage backend
  • 17:48 YuviPanda: disable puppet on all tools k8s worker nodes

2016-07-25

  • 14:17 chasemp: nova reboot 64f01f90-c805-4a2e-9ed5-f523b909094e (grid master)

2016-07-23

  • 23:21 YuviPanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck on connecting to seaborgium preventing new tool creation
  • 01:56 YuviPanda: deploy kubernetes v1.3.3wmf1

2016-07-22

  • 17:30 YuviPanda: repool tools-worker-1018
  • 14:04 chasemp: reboot tools-worker-1015 as stuck w/ high iowait warning seconds ago. I cannot ssh in as root.

2016-07-21

  • 22:42 chasemp: reboot tools-worker-1018 as stuck T141017

2016-07-20

  • 21:27 andrewbogott: rebooting tools-k8s-etcd-01
  • 11:14 Guest9334: rebooted tools-worker-1004

2016-07-19

  • 01:06 bd808: Upgraded Elasticsearch on tools-elastic-* to 2.3.4

2016-07-18

  • 21:50 YuviPanda: force downgrade hhvm on tools-webgrid-lighttpd-1408 to fix puppet issues
  • 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm on tools-worker-1004
  • 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm
  • 21:37 YuviPanda: killed tools-pastion-01, no longer in use
  • 20:59 bd808: Disabled puppet on tools-elastic-0[123]. Elasticsearch needs to be upgraded.
  • 15:15 YuviPanda: kill 8807036 for Luke081515
  • 12:48 YuviPanda: reboot tools-flannel-etcd-03 for T140256
  • 12:41 YuviPanda: reboot tools-k8s-etcd-02 for T140256

2016-07-15

  • 10:24 yuvipanda: depool tools-exec-1402 for T138447
  • 10:24 yuvipanda: reboot tools-exec-1402 for T138447
  • 10:16 yuvipanda: depooling tools-webgrid-lighttpd-1402 and -1412 since they seem to be suffering from T138447
  • 10:08 yuvipanda: reboot tools-webgrid-lighttpd-1402 and 1412

2016-07-14

  • 23:12 bd808: Added Madhuvishy to project "roots" sudoer list
  • 22:58 bd808: Added Madhuvishy as projectadmin
  • 21:25 chasemp: change perms for tools.readmore to correct bot

2016-07-13

  • 11:40 yuvipanda: cold-migrate tools-worker-1014 off labvirt1010 to see if that improves the ksoftirqd situation
  • 11:19 yuvipanda: drained tools-worker-1004 - high ksoftirqd usage even with no load
  • 11:13 yuvipanda: depool tools-worker-1014 - unusable, totally in iowait
  • 11:13 yuvipanda: reboot tools-worker-1004, was unresponsive

2016-07-12

  • 18:07 yuvipanda: reboot tools-worker-1012, it seems to have failed LDAP connectivity :|

2016-07-08

  • 12:38 yuvipanda: starting up tools-web-static-02 again

2016-07-07

  • 12:45 yuvipanda: start deployment of k8s 1.3.0wmf4 for T139259

2016-07-06

  • 13:09 yuvipanda: associated a floating IP with tools-k8s-master-01 for T139461
  • 11:47 yuvipanda: moved tools-checker-0[12] to use tools-puppetmaster-01 as puppetmaster so they get appropriate CA for use when talking to kubernetes API

2016-07-04

  • 11:13 yuvipanda: delete tools-prometheus-01 to free up resources on labvirt1010
  • 11:11 yuvipanda: actually deleted instance tools-cron-02 to free up resources on labvirt1010 - was large and not currently used, and failover process takes a while anyway, so we can recreate if needed
  • 11:11 yuvipanda: stopped instance tools-cron-02 to free up some resources on labvirt1010

2016-07-03

  • 17:09 yuvipanda: run qstat -u '*' | grep 'dr ' | awk '{ print $1;}' | xargs -L1 qdel -f to clean out jobs stuck in dr state
  • 16:59 yuvipanda: migrate tools-web-static-02 to labvirt1011 to provide more breathing room
  • 16:56 yuvipanda: delete temp-test-trusty-package to provide more breathing room on labvirt1010
  • 13:49 yuvipanda: reboot tools-exec-1219
  • 13:37 yuvipanda: migrating tools-exec-1216 to labvirt1011
  • 13:07 yuvipanda: delete tools-bastion-01 which was shut down anyway
  • 13:04 yuvipanda: attempt to reboot tools-exec-1212

2016-06-28

  • 15:25 bd808: Signed client cert for tools-worker-1019.tools.eqiad.wmflabs on tools-puppetmaster-01.tools.eqiad.wmflabs

2016-06-21

  • 16:49 bd808: Updated jobutils to v1.14 for T138178

2016-06-17

  • 06:17 yuvipanda: forced deletion of 7033590 for dykbot for shubinator

2016-06-08

  • 20:31 yuvipanda: start tools-bastion-03, which was stuck in 'stopped' state
  • 20:31 yuvipanda: reboot tools-bastion-03

2016-05-31

  • 17:35 valhallasw`cloud: re-enabled queues on tools-exec-1407, tools-exec-1216, tools-exec-1219
  • 13:13 chasemp: reboot of tools-exec-1203 see T136495 all jobs seem gone now

2016-05-30

2016-05-29

  • 18:58 YuviPanda: deleted tools-k8s-bastion-01 for T136496
  • 14:29 valhallasw`cloud: chowned /data/project/xtools-mab-dev to root and back to stop rogue process that was writing to the directory. I'm still not sure where that process was running, but at least this seems to have solved the issue

2016-05-28

  • 21:52 valhallasw`cloud: rebooted tools-webgrid-lighttpd-1408, tools-pastion-01, tools-exec-1205
  • 21:21 valhallasw`cloud: rebooting tools-exec-1204 (T136495)

2016-05-27

  • 14:45 YuviPanda: start moving tools-bastion-03 to use tools-puppetmaster-01 as puppetmaster

2016-05-25

  • 20:15 YuviPanda: deleted tools-bastion-mtemp per chasemp
  • 19:43 YuviPanda: delete devpi instance, not currently in use
  • 19:39 YuviPanda: run sudo dpkg --configure -a on tools-worker-1007 to get it unstuck
  • 19:19 YuviPanda: deleted tools-docker-builder-01 and -02, hosed hosts that are unused
  • 17:18 YuviPanda: fixed hhvm upgrade on tools-cron-01
  • 07:19 YuviPanda: hard reboot tools-services-01, was completely stuck on /public/dumps
  • 06:06 bd808: Restarting all webservice jobs
  • 05:33 andrewbogott: rebooting tools-proxy-02

2016-05-24

  • 01:36 scfc_de: tools-cron-02: Downgraded hhvm (sudo apt-get install hhvm).
  • 01:36 scfc_de: tools-bastion-03, tools-checker-01, tools-cron-02, tools-exec-1202, tools-proxy-02, tools-redis-1001: Remounted /public/dumps read-only (while sudo umount /public/dumps; do :; done && sudo puppet agent -t).

2016-05-23

  • 19:36 YuviPanda: switched tools-checker to tools-checker-03
  • 16:33 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs
  • 13:28 chasemp: 'apt-get install hhvm -y --force-yes' across trusty hosts to handle hhvm downgrade

2016-05-20

  • 23:39 bd808: Forced puppet run on bastion-02 & bastion-05 to apply fix for T135861
  • 19:47 chasemp: tools-exec-1406 having issues rebooting

2016-05-19

  • 21:07 bd808: deployed jobutils 1.13 on bastions; now with '-l release=...' validation!
  • 15:43 YuviPanda: rebooting all tools worker instances
  • 13:12 chasemp: reboot tools-exec-1220, stuck in a state of unresponsiveness

2016-05-13

  • 00:40 YuviPanda: cleared all queues that were in error state

2016-05-12

  • 22:59 YuviPanda: restart tools-worker-1004 to attempt bringing it back up
  • 22:59 YuviPanda: deploy k8s 1.2.4wmf1 on all proxy nodes
  • 22:58 YuviPanda: deploy k8s on all worker nodes
  • 22:46 YuviPanda: deploy k8s master for 1.2.4wmf1

2016-05-10

  • 04:25 bd808: Added role::package::builder to tools-services-01

2016-05-09

  • 04:33 YuviPanda: reboot tools-worker-1004, lots of ksoftirqd stuckness despite no actual containers running

2016-05-08

  • 07:06 YuviPanda: restarted admin tool

2016-05-05

2016-04-28

  • 04:15 YuviPanda: delete half of the trusty webservice jobs
  • 04:00 YuviPanda: deleted all precise webservice jobs, waiting for webservicemonitor to bring them back up

2016-04-24

  • 12:22 YuviPanda: force deleted job 5435259 from pbbot per PeterBowman

2016-04-11

  • 14:20 andrewbogott: moving tools-bastion-mtemp to labvirt1009

2016-04-06

  • 15:20 bd808: Removed local hack for T131906 from tools-puppetmaster-01

2016-04-05

  • 21:24 bd808: Committed local hack on tools-puppetmaster-01 to get elasticsearch working again
  • 21:02 bd808: Forcing puppet runs to fix elasticsearch
  • 20:39 bd808: Elasticsearch processes down. Looks like a prod puppet change that needs tweaking for tool labs

2016-04-04

  • 19:43 YuviPanda: new bastion!
  • 19:15 chasemp: reboot tools-bastion-05

2016-03-30

  • 15:50 andrewbogott: rebooting tools-proxy-01 in hopes of clearing some bad caches

2016-03-28

  • 20:51 yuvipanda: lifted RAM quota from 900Gigs to 1TB?!
  • 20:30 chasemp: changed permissions on grant files from create-dbusers to chmod 400 and chattr +i

2016-03-27

  • 17:40 scfc_de: tools-webgrid-generic-1405, tools-webgrid-lighttpd-1411, tools-web-static-01, tools-web-static-02: "apt-get install cloud-init" and accepted changes for /etc/cloud/cloud.cfg (users: + default; cloud_config_modules: + ssh-import-id, + puppet, + chef, + salt-minion; system_info/package_mirrors/arches[i386, amd64]/search/primary: + http://%(region)s.clouds.archive.ubuntu.com/ubuntu/).

2016-03-18

  • 15:47 chasemp: had to kill stalkboten as it was logging constant errors filling logs to the tune of hundreds of gigs
  • 15:36 chasemp: cleanup huge log collection for broken bot: /srv/project/tools/project/betacommand-dev/tspywiki/irc/logs# rm -fR SpamBotLog.log\.*

2016-03-11

  • 20:57 mutante: reverted font changes - puppet runs recovering
  • 20:37 mutante: more puppet issues due to font dependencies on trusty, on it
  • 19:39 mutante: should a tools-exec server be influenced by font packages on an mw appserver?
  • 19:39 mutante: fixed puppet runs on tools-exec (gerrit 276792)

2016-03-02

  • 14:56 chasemp: qdel 3956069 and 3758653 for abusing auth

2016-02-29

  • 21:49 scfc_de: tools-exec-1218: rm -f /usr/local/lib/nagios/plugins/check_eth to work around "Got passed new contents for sum" (https://tickets.puppetlabs.com/browse/PUP-1334).
  • 21:20 scfc_de: tools-exec-1209: rm -f /var/lib/puppet/state/agent_catalog_run.lock (no Puppet process running, probably from the reboots).
  • 20:58 scfc_de: Ran "dpkg --configure -a" on all instances.
  • 13:50 scfc_de: Deployed jobutils/misctools 1.10.

2016-02-28

  • 20:08 bd808: Removed unwanted NFS mounts from tools-elastic-01.tools.eqiad.wmflabs

2016-02-26

  • 19:08 bd808: Upgraded Elasticsearch on tools-elastic-0[123] to 1.7.5

2016-02-25

  • 21:43 scfc_de: Deployed jobutils/misctools 1.9.

2016-02-24

2016-02-22

  • 15:55 andrewbogott: redirecting tools-login.wmflabs.org to tools-bastion-05

2016-02-19

  • 15:58 chasemp: re-rolled out the tools NFS shaping pilot for sanity, in anticipation of formalization
  • 09:21 _joe_: killed cluebot3 instance on tools-exec-1207, writing 20 M/s to the error log
  • 00:50 yuvipanda: failover services to services-02

2016-02-18

  • 20:37 yuvipanda: failover proxy back to tools-proxy-01
  • 19:46 chasemp: repool labvirt1003 and depool labvirt1004
  • 18:19 chasemp: draining nodes from labvirt1001

2016-02-16

  • 21:33 chasemp: reboot of bastion-1002

2016-02-12

  • 19:56 chasemp: nfs traffic shaping pilot round 2

2016-02-05

  • 22:01 chasemp: throttle some vm nfs write speeds
  • 16:49 scfc_de: find /data/project/wikidata-edits -group ssh-key-ldap-lookup -exec chgrp tools.wikidata-edits \{\} + (probably a remnant of the work on ssh-key-ldap-lookup last summer).
  • 16:45 scfc_de: Removed /data/project/test300 (uid/gid 52080; none of them resolves, no databases, just an unmodified pywikipedia clone inside).

2016-02-03

  • 03:00 YuviPanda: upgraded flannel on all hosts running it

2016-01-31

  • 20:01 scfc_de: tools-webgrid-generic-1405: Rebooted via wikitech; rebooting via "shutdown -r now" did not seem to work.
  • 18:51 bd808: tools-elastic-01.tools.eqiad.wmflabs console shows blocked tasks, possible kernel bug?
  • 18:49 bd808: tools-elastic-01.tools.eqiad.wmflabs not responsive to ssh or Elasticsearch requests; rebooting via wikitech interface
  • 13:32 hashar: restarted qamorebot

2016-01-30

  • 06:38 scfc_de: tools-webgrid-generic-1405: Rebooted for load ~ 175 and lots of processes stuck in D.

2016-01-29

  • 21:25 YuviPanda: restarted image-resize-calc manually, no service.manifest file

2016-01-28

  • 15:02 scfc_de: tools-cron-01: Rebooted via wikitech as "shutdown -r now" => "@sbin/plymouthd --mode=shutdown" => "/bin/sh -e /proc/self/fd/9" => "/bin/sh /etc/init.d/rc 6" => "/bin/sh /etc/rc6.d/S20sendsigs stop" => "sync" stuck in D. *argl*
  • 14:56 scfc_de: tools-cron-01: Rebooted due to high number of processes stuck in D and load >> 100.
  • 14:54 scfc_de: tools-cron-01: HUPped 43 processes wikitrends/refresh.sh, though a lot of all processes seem to be stuck in D, so I'll reboot this instance.
  • 14:50 scfc_de: tools-cron-01: HUPped 85 processes /usr/lib/php5/sessionclean.

2016-01-27

  • 23:07 YuviPanda: removed all members of templatetiger, added self instead, removed active shell sessions
  • 20:24 chasemp: master stop, truncate accounting log to accounting.01272016, master start
  • 19:34 chasemp: started master on grid master
  • 19:23 chasemp: stopped master
  • 19:11 YuviPanda: depooled tools-webgrid-1405 to prep for restart, lots of stuck processes
  • 18:29 valhallasw`cloud: job 2551539 is ifttt, which is also running as 2700629. Killing 2551539 .
  • 18:26 valhallasw`cloud: messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate
  • 18:24 valhallasw`cloud: 'sleep' test job also seems to work without issues
  • 18:23 valhallasw`cloud: no errors in log file, qstat works
  • 18:23 chasemp: sge master restarted after the dump and reload of the jobs db (sketched at the end of this section)
  • 18:22 valhallasw`cloud: messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016'
  • 18:20 chasemp: master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job
  • 18:19 valhallasw`cloud: dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M
  • 18:17 valhallasw`cloud: SGE Configuration successfully saved to /root/sge_maint_01272016 directory.
  • 18:14 chasemp: grid master stopped
  • 00:56 scfc_de: Deployed admin/www bde15df..12a3586.
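
Note: the maintenance above amounts to stopping the grid master, dumping the Berkeley DB job spool, reloading it, and starting the master again. A rough sketch; the db_dump invocation and the assumption that this runs from the master's BDB spool directory are not in the log, only the db_load call and the file names are:

    # Stop the master before touching its spool
    sudo service gridengine-master stop

    # Dump and reload the jobs database (Berkeley DB spooling); file name as logged above
    db_dump -f /root/sge_maint_pre_jobs_dump_01272016 sge_job
    db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job

    # Bring the master back
    sudo service gridengine-master start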

2016-01-26

  • 21:28 YuviPanda: qstat -u '*' | grep E | awk '{print $1}' | xargs -L1 qmod -cj
  • 21:16 chasemp: reboot tools-exec-1217.tools.eqiad.wmflabs

2016-01-25

  • 20:30 YuviPanda: switched over cron host to tools-cron-01, manually copied all old cron files from tools-submit to tools-cron-01
  • 19:06 chasemp: kill python merge/merge-unique.py tools-exec-1213 as it seemed to be overwhelming nfs
  • 17:07 scfc_de: Deployed admin/www at bde15df2a379c33edfb8350afd2f0c7186705a93.

2016-01-23

  • 15:49 scfc_de: Removed remnant send_puppet_failure_emails cron entries except from unreachable hosts sacrificial-kitten, tools-worker-06 and tools-worker-1003.

2016-01-21

  • 22:24 YuviPanda: deleted tools-redis-01 and -02 (are on 1001 and 1002 now)
  • 21:13 YuviPanda: repooled exec nodes on labvirt1010
  • 21:08 YuviPanda: gridengine-master started, verified shadow hasn't started
  • 21:00 YuviPanda: stop gridengine master
  • 20:51 YuviPanda: repooled exec nodes on labvirt1007 (correction to the previous message)
  • 20:51 YuviPanda: repooled exec nodes on labvirt1006
  • 20:39 YuviPanda: failover tools-static to tools-web-static-01
  • 20:38 YuviPanda: failover tools-checker to tools-checker-01
  • 20:32 YuviPanda: depooled exec nodes on 1007
  • 20:32 YuviPanda: repooled exec nodes on 1006
  • 20:14 YuviPanda: depooled all exec nodes in labvirt1006
  • 20:11 YuviPanda: repooled exec nodes on 1005
  • 19:53 YuviPanda: depooled exec nodes on labvirt1005
  • 19:49 YuviPanda: repooled exec nodes from labvirt1004
  • 19:48 YuviPanda: failed over proxy to tools-proxy-01 again
  • 19:31 YuviPanda: depooled exec nodes from labvirt1004
  • 19:29 YuviPanda: repooled exec nodes from labvirt1003
  • 19:13 YuviPanda: depooled instances on labvirt1003
  • 19:06 YuviPanda: re-enabled queues on exec nodes that were on labvirt1002
  • 19:02 YuviPanda: failed over tools proxy to tools-proxy-02
  • 18:46 YuviPanda: drained and disabled queues on all nodes on labvirt1002
  • 18:38 YuviPanda: restarted all restartable jobs on instances on labvirt1001 and deleted all non-restartable ghost jobs; these were already dead (depool/repool commands sketched below)
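
The per-node depool/repool cycle above boils down to a few qmod calls. A minimal sketch; the hostname is a placeholder:

    # Depool: disable every queue instance on the node so nothing new is scheduled there
    sudo qmod -d '*@tools-exec-1201.eqiad.wmflabs'

    # Move restartable jobs off the node (non-restartable ones have to finish or be qdel'd)
    sudo qmod -rq '*@tools-exec-1201.eqiad.wmflabs'

    # Repool once the underlying labvirt host is back
    sudo qmod -e '*@tools-exec-1201.eqiad.wmflabs'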

2016-01-12

  • 09:48 scfc_de: tools-checker-01: Removed exim paniclog (OOM).

2016-01-11

  • 22:19 valhallasw`cloud: reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_adjustment_decay_time -> 0:7:30 (see the qconf sketch at the end of this section)
  • 22:12 YuviPanda: restarted gridengine master again
  • 22:07 valhallasw`cloud: set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0
  • 22:05 valhallasw`cloud: set maxujobs back to 0, but doesn't help
  • 21:57 valhallasw`cloud: reset to 7:30
  • 21:57 valhallasw`cloud: that cleared the measure, but jobs still not starting. Ugh!
  • 21:56 valhallasw`cloud: set job_load_adjustments_decay_time = 0:0:0
  • 21:45 YuviPanda: restarted gridengine master
  • 21:43 valhallasw`cloud: qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting
  • 21:42 valhallasw`cloud: resetting to 0:7:30, as it's not having the intended effect
  • 21:41 valhallasw`cloud: currently 353 jobs in qw state
  • 21:40 valhallasw`cloud: that's load_adjustment_decay_time
  • 21:40 valhallasw`cloud: temporarily sudo qconf -msconf to 0:0:1
  • 19:59 YuviPanda: Set maxujobs (max concurrent jobs per user) on gridengine to 128
  • 17:51 YuviPanda: kill all queries running on labsdb1003
  • 17:20 YuviPanda: stopped webservice for quentinv57-tools
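
The parameters juggled above (maxujobs, job_load_adjustments, load_adjustment_decay_time) live in the gridengine scheduler configuration. A minimal sketch of the qconf calls involved, with the values taken from the entries above:

    # Show the current scheduler configuration
    qconf -ssconf

    # Edit it interactively; the entries changed above were:
    #   maxujobs                    128               (0 = unlimited concurrent jobs per user)
    #   job_load_adjustments        np_load_avg=0.50  (or NONE to disable)
    #   load_adjustment_decay_time  0:7:30
    sudo qconf -msconf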

2016-01-09

  • 21:07 valhallasw`cloud: moved tools-checker/208.80.155.229 back to tools-checker-01
  • 21:02 andrewbogott: rebooting tools-checker-01 as it is unresponsive.
  • 13:12 valhallasw`cloud: tools-worker-1002. is unresponsive. Maybe that's where the other grrrit-wm is hiding? Rebooting.

2016-01-08

2015-12-30

  • 04:06 YuviPanda: delete all webgrid jobs to start with a clean slate
  • 03:54 YuviPanda: qmod -rj all tools in the continuous queue, they are all orphaned
  • 02:39 YuviPanda: remove lbenedix and ebekebe from tools.hcclab
  • 00:40 YuviPanda: restarted master on grid-master
  • 00:40 YuviPanda: copied and cleaned out spooldb
  • 00:10 YuviPanda: reboot tools-grid-shadow
  • 00:08 YuviPanda: attempt to stop shadowd
  • 00:03 YuviPanda: attempting to start gridengine-master on tools-grid-shadow
  • 00:00 YuviPanda: kill -9'd gridengine master

2015-12-29

  • 23:31 YuviPanda: rebooting tools-grid-master
  • 23:22 YuviPanda: restart gridengine-master on tools-grid-master
  • 00:18 YuviPanda: shut down redis on tools-redis-01

2015-12-28

  • 22:34 chasemp: attempt to unmount nfs volumes on tools-redis-01 to debug but it hangs (I am on console and see root at console hang on login)
  • 22:31 YuviPanda: disable NFS on tools-redis-1001 and 1002
  • 21:32 YuviPanda: disable puppet on tools-redis-01 and -02
  • 21:27 YuviPanda: created tools-redis-1001

2015-12-23

  • 21:21 YuviPanda: deleted tools-worker-01 to -05, creating tools-worker-1001 to 1005
  • 21:19 valhallasw`cloud: tools-proxy-01: umount /home /data/project /data/scratch /public/dumps
  • 19:01 valhallasw`cloud: ah, connections that are kept open. A new incognito window is routed correctly.
  • 18:59 valhallasw`cloud: switched to -02, worked correctly, switched back. Switching back does not seem to fully work?!
  • 18:40 valhallasw`cloud: scratch that, first going to eat dinner
  • 18:38 valhallasw`cloud: dynamicproxy ban system deployed on tools-proxy-02 working correctly for localhost; switching over users there by moving the external IP.
  • 14:42 valhallasw`cloud: toollabs homepage is unhappy because tools.xtools-articleinfo is using a lot of cpu on tools-webgrid-lighttpd-1409. Checking to see what's happening there.
  • 10:46 YuviPanda: migrate tools-worker-01 to 3.19 kernel

2015-12-22

  • 18:30 YuviPanda: rescheduling all webservices
  • 18:17 YuviPanda: failed over active proxy to proxy-01
  • 18:12 YuviPanda: upgraded kernel and rebooted tools-proxy-01
  • 01:42 YuviPanda: rebooting tools-worker-08

2015-12-21

  • 18:44 YuviPanda: reboot tools-proxy-01
  • 18:31 YuviPanda: failover proxy to tools-proxy-02

2015-12-20

  • 00:00 YuviPanda: tools-worker-08 stuck again :|

2015-12-18

  • 15:16 andrewbogott: rebooting locked up host tools-exec-1409

2015-12-16

  • 23:14 andrewbogott: rebooting tools-exec-1407, unresponsive
  • 22:48 YuviPanda: run qmod -c '*' to clear error state on gridengine
  • 21:28 andrewbogott: deleted tools-docker-registry-01
  • 16:24 andrewbogott: rebooting tools-exec-1221 as it was in kernel lockup

2015-12-12

  • 10:08 YuviPanda: restarted cron on tools-submit

2015-12-10

  • 12:47 valhallasw`cloud: broke tools-proxy-02 login (for valhallasw, root still works) by restarting nslcd. Restarting; current proxy is -01.

2015-12-07

  • 13:46 Coren: The new grid masters are happy, killing the old ones (-shadow, -master)
  • 10:46 YuviPanda: restarted nscd on tools-proxy-01

2015-12-06

  • 10:29 YuviPanda: did webservice start on tool 'derivative', was missing service.manifest

2015-12-04

  • 19:33 Coren: switching master role to tools-grid-master
  • 04:42 yuvipanda: disabled puppet on tools-puppetmaster-01 because everything sucks
  • 04:09 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/256618 to tools-puppetmaster-01

2015-12-02

  • 18:29 Coren: switching gridmaster activity to tools-grid-shadow
  • 05:13 yuvipanda: increased security groups quota to 50 because why not

2015-12-01

  • 21:07 yuvipanda: added bd808 as admin
  • 21:01 andrewbogott: deleted tool/service group tools.test300

2015-11-25

  • 15:42 Coren: migrating tools-web-static-02 to labvirt1010 to free space on labvirt1002

2015-11-20

  • 22:02 Coren: tools-webgrid-lighttpd-1412 tools-webgrid-lighttpd-1413 tools-webgrid-lighttpd-1414 tools-webgrid-lighttpd-1415 done and back in rotation.
  • 21:46 Coren: tools-webgrid-lighttpd-1411 tools-webgrid-lighttpd-1211 done and back in rotation.
  • 21:30 Coren: tools-webgrid-lighttpd-1410 tools-webgrid-lighttpd-1210 done and back in rotation.
  • 21:25 Coren: tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1209 done and back in rotation.
  • 21:13 Coren: tools-webgrid-lighttpd-1408 tools-webgrid-lighttpd-1208 done and back in rotation.
  • 20:58 Coren: tools-webgrid-lighttpd-1407 tools-webgrid-lighttpd-1207 done and back in rotation.
  • 20:53 Coren: tools-webgrid-lighttpd-1406 tools-webgrid-lighttpd-1206 done and back in rotation.
  • 20:41 Coren: tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1205 tools-webgrid-generic-1405 done and back in rotation.
  • 20:28 Coren: tools-webgrid-lighttpd-1404 tools-webgrid-lighttpd-1204 tools-webgrid-generic-1404 done and back in rotation.
  • 19:49 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1203 tools-webgrid-generic-1403
  • 19:25 Coren: -lighttpd-1403 wants a restart.
  • 19:15 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1202 tools-webgrid-generic-1402
  • 18:55 Coren: Putting -lighttpd-1401 -lighttpd-1201 -generic-1401 back in rotation, disabling the others.
  • 18:24 Coren: Beginning draining web nodes; -lighttpd-1401 -lighttpd-1201 -generic-1401
  • 18:10 Coren: disabling puppet on the grid nodes listed at https://phabricator.wikimedia.org/P2337 so that the /tmp change in https://gerrit.wikimedia.org/r/#/c/252506/ do not apply early and break services

2015-11-17

  • 19:39 YuviPanda: created tools-worker-03 to be k8s worker node
  • 19:34 YuviPanda: blanked 'realm' for tools-bastion-01 to figure out what happens

2015-11-16

2015-11-03

  • 03:59 scfc_de: tools-submit, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411: Removed exim paniclog (OOM).

2015-11-02

  • 22:57 YuviPanda: pooled tools-webgrid-lighttpd-1413
  • 22:10 YuviPanda: created tools-webgrid-lighttpd-1414 and 1415
  • 22:04 YuviPanda: created tools-webgrid-lighttpd-1412 and 1413
  • 19:53 YuviPanda: drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
  • 19:50 YuviPanda: drain webgrid-lighttpd-1408 of jobs

2015-10-26

  • 20:53 YuviPanda: updated 6.9 ssh backport to all trusty hosts

2015-10-11

  • 22:54 yuvipanda: delete service.manifest for tool wikiviz to prevent it from attempting to be started. It set itself up for nodejs but didn't actually have any code

2015-10-09

2015-10-06

  • 04:35 yuvipanda: created tools-puppetmaster-02 as hot spare

2015-10-02

  • 17:30 scfc_de: tools-webgrid-lighttpd-1402: Removed exim paniclog (OOM).

2015-10-01

  • 23:38 yuvipanda: actually rebooting tools-worker-02; had actually rebooted -01 earlier #facepalm
  • 23:20 yuvipanda: rebooting tools-worker-02 to pickup new kernel
  • 23:10 yuvipanda: failed over tools-proxy-01 to -02, restarting -01 to pick up new kernel
  • 22:58 yuvipanda: rebooted tools-proxy-02 to pick up new kernel

2015-09-30

  • 07:12 yuvipanda: deleted tools-webproxy-01 and -02, running on proxy-01 and -02 now
  • 06:40 yuvipanda: migrated webproxy to tools-proxy-01

2015-09-29

  • 12:08 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).

2015-09-28

  • 15:24 Coren: rebooting tools-shadow after mount option changes.

2015-09-25

  • 16:02 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).

2015-09-24

  • 14:06 scfc_de: tools-exec-1201: Restarted grid engine exec for T109485.
  • 13:56 scfc_de: tools-master: Restarted grid engine master for T109485.

2015-09-23

2015-09-16

  • 17:33 scfc_de: Removed python-tools-webservice from precise-tools as apparently old version of tools-webservice.
  • 01:17 YuviPanda: attempting to move grrrit-wm to kubernetes
  • 01:17 YuviPanda: attempting to move to kubernetes

2015-09-15

  • 01:18 scfc_de: Added unixodbc_2.2.14p2-5_amd64.deb back to precise-tools to diagnose if it is related to T111760.

2015-09-14

  • 23:47 scfc_de: Archived unixodbc_2.2.14p2-5_amd64 from deb-precise and aptly, no reference in Puppet or Phabricator and same version as distribution.

2015-09-13

  • 20:53 scfc_de: Archived lua-json_1.3.2-1 from labsdebrepo and aptly, upgraded manually to Trusty's new 1.3.1-1ubuntu0.1~ubuntu14.04.1, restarted nginx on tools-webproxy-01 and tools-webproxy-02, checked that proxy and localhost:8081/list works.
  • 20:42 scfc_de: rm -f /etc/apt/apt.conf.d/20auto-upgrades.ucf-dist on all hosts (cf. T110055).

2015-09-11

  • 14:54 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).

2015-09-08

  • 08:05 valhallasw`cloud: Publish for local repo ./trusty-tools [all, amd64] publishes {main: [trusty-tools]} has been successfully updated.
    Publish for local repo ./precise-tools [all, amd64] publishes {main: [precise-tools]} has been successfully updated.
  • 08:04 valhallasw`cloud: added all packages in data/project/.system/deb-precise to aptly repo precise-tools
  • 08:03 valhallasw`cloud: added all packages in data/project/.system/deb-trusty to aptly repo trusty-tools
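
The aptly messages above correspond roughly to an import plus a publish refresh. The exact invocations are not logged, so the repo/distribution names below are an assumption based on the recorded output:

    # Import every package from the shared directories into the local aptly repos (assumed names)
    aptly repo add precise-tools /data/project/.system/deb-precise
    aptly repo add trusty-tools  /data/project/.system/deb-trusty

    # Refresh the published repos so the new packages become visible
    aptly publish update precise-tools
    aptly publish update trusty-tools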

2015-09-07

  • 18:49 valhallasw`cloud: ran sudo mount -o remount /data/project on tools-static-01, which also solved the issue, so skipping the reboot
  • 18:47 valhallasw`cloud: switched static webserver to tools-static-02
  • 18:45 valhallasw`cloud: weird NFS issue on tools-web-static-01. Switching over to -02 before rebooting.
  • 17:57 YuviPanda: created tools-k8s-master-01 with jessie, will be etcd and kubernetes master

2015-09-03

  • 07:09 valhallasw`cloud: and just re-running puppet solves the issue. Sigh.
  • 07:09 valhallasw`cloud: last message in puppet.log.1.gz is Error: /Stage[main]/Toollabs::Exec_environ/Package[fonts-ipafont-gothic]/ensure: change from 00303-5 to latest failed: Could not get latest version: Execution of '/usr/bin/apt-cache policy fonts-ipafont-gothic' returned 100: fonts-ipafont-gothic: (...) E: Cache is out of sync, can't x-ref a package file
  • 07:07 valhallasw`cloud: err, is empty.
  • 07:07 valhallasw`cloud: Puppet failure on tools-exec-1215 is CRITICAL 66.67% of data above the critical threshold -- but /var/log/puppet.log doesn't exist?!

2015-09-02

  • 15:01 scfc_de: Added -M option to qsub call for crontab of tools.sdbot.
  • 13:58 valhallasw`cloud: rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load
  • 13:55 valhallasw`cloud: restarted gridengine_exec on tools-exec-1403
  • 13:53 valhallasw`cloud: tools-exec-1403 does lots of locking operations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py . Rescheduled that job.
  • 13:16 YuviPanda: deleted all jobs of ralgisbot
  • 13:12 YuviPanda: suspended all jobs in ralgisbot temporarily
  • 12:57 YuviPanda: rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles

2015-09-01

  • 21:01 valhallasw`cloud: killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately.
  • 16:12 scfc_de: tools-bastion-01: Killed bot of tools.cobain.
  • 15:47 valhallasw`cloud: git reset --hard cdnjs on tools-web-static-01
  • 06:23 valhallasw`cloud: seems to have worked. SGE :(
  • 06:17 valhallasw`cloud: going to restart sge_qmaster, hoping this solves the issue :/
  • 06:08 valhallasw`cloud: e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!
  • 06:06 valhallasw`cloud: test job does not get submitted because all queues are overloaded?!
  • 06:06 valhallasw`cloud: investigating SGE issues reported on irc/email

2015-08-31

  • 23:20 scfc_de: Changed host name tools-webgrid-generic-1405 in "qconf -mq webgrid-generic" to fix the "au" state of the queue on that host.
  • 21:21 valhallasw`cloud: webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest)
  • 21:20 valhallasw`cloud: restarted webservicemonitor
  • 21:19 valhallasw`cloud: seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
  • 21:18 valhallasw`cloud: running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running
  • 21:15 valhallasw`cloud: several webservices seem to actually have not gotten back online?! what on earth is going on.
  • 21:10 valhallasw`cloud: some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again
  • 20:29 valhallasw`cloud: |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.
  • 20:25 valhallasw`cloud: ca 500 jobs @ 5s/job = approx 40 minutes
  • 20:23 valhallasw`cloud: doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh*
  • 20:21 valhallasw`cloud: now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues
  • 19:36 valhallasw`cloud: last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs
  • 19:35 valhallasw`cloud: one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?
  • 19:31 valhallasw`cloud: https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues (throttled loop sketched after this list)
  • 07:31 valhallasw`cloud: removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)
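
The throttled rescheduling described above is, roughly, a loop over the saved job ids with a pause between restarts. A sketch; the job-id file is the one named in the log, the loop itself is an assumption:

    # Reschedule each webgrid job from the saved, sorted list,
    # pausing 5 seconds between jobs so SGE is not overwhelmed.
    sort /home/valhallaw/webgrid_jobs | while read -r jobid; do
        sudo qmod -rj "$jobid"
        sleep 5
    done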

2015-08-30

  • 13:23 valhallasw`cloud: killed wikibugs-backup and grrrit-wm on tools-webproxy-01
  • 13:20 valhallasw`cloud: disabling 503 error page

2015-08-29

  • 04:09 scfc_de: Disabled queue webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs (qmod -d) because I can't ssh to it and jobs deployed there fail with "failed assumedly before job:can't get password entry for user".

2015-08-27

  • 15:00 valhallasw`cloud: killed multiple kmlexport processes on tools-webgrid-lighttpd-1401 again

2015-08-26

  • 01:10 scfc_de: Felt lucky: kill -STOP bigbrother on tools-submit, installed I00cd7a90273e0d745699855eb671710afb4e85a7 on tools-services-02 and service bigbrothermonitor start. If it goes berserk, please service bigbrothermonitor stop.

2015-08-25

  • 20:23 scfc_de: tools-webgrid-generic-1405: killall mpt-statusd.
  • 14:58 YuviPanda: pooled in two new instances for the precise exec pool
  • 14:45 YuviPanda: reboot tools-exec-1221
  • 14:26 YuviPanda: rebooting tools-exec-1220 because NFS wedge...
  • 14:18 YuviPanda: pooled in tools-webgrid-generic-1405
  • 10:16 YuviPanda: created tools-webgrid-generic-1405
  • 10:04 YuviPanda: apply exec node puppet roles to tools-exec-1220 and -1221
  • 09:59 YuviPanda: created tools-exec-1220 and -1221

2015-08-24

  • 16:37 valhallasw`cloud: more processes were started, so added a talk page message on User:Coet (who was starting the processes according to /var/log/auth.log) and using 'write coet' on tools-bastion-01
  • 16:15 valhallasw`cloud: kill -9'ing because normal killing doesn't work
  • 16:13 valhallasw`cloud: killing all processes of tools.cobain which are flooding tools-bastion-01

2015-08-20

  • 18:44 valhallasw`cloud: both are now at 3dbbc87
  • 18:43 valhallasw`cloud: running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).
  • 18:42 valhallasw`cloud: tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.
  • 18:39 valhallasw`cloud: cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files
  • 17:06 valhallasw`cloud: wait, what timezone is this?!

2015-08-19

  • 10:45 valhallasw`cloud: ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406
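
The same one-liner, unrolled for readability (identical commands, nothing new):

    # Find queue instances in the 'au' (alarm, unknown) state, take the host part of
    # the queue instance name, and start the exec daemon on each of those hosts.
    for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do
        echo "$i"
        ssh "$i" sudo service gridengine-exec start
    done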

2015-08-18

  • 15:53 scfc_de: Added valhallasw as grid manager (qconf -am valhallasw).
  • 14:42 scfc_de: tools-webgrid-lighttpd-1411: Killed mpt-statusd (T104779).
  • 13:57 valhallasw`cloud: same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.
  • 13:55 valhallasw`cloud: no, wait, that's tools-webgrid-lighttpd-1411.eqiad.wmflabs, not the actual host tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs. We should fix that dns mess as well.
  • 13:54 valhallasw`cloud: tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state.
  • 13:47 valhallasw`cloud: that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state
  • 13:46 valhallasw`cloud: starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done
  • 08:37 valhallasw`cloud: sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs, tools-webgrid-lighttpd-1406.eqiad.wmflabs, tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
  • 08:33 valhallasw`cloud: tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available)
  • 08:30 valhallasw`cloud: hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config
  • 08:21 valhallasw`cloud: still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
  • 08:20 valhallasw`cloud: sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs
  • 08:14 valhallasw`cloud: and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻
  • 08:10 valhallasw`cloud: /var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list.
  • 08:07 valhallasw`cloud: sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> root@tools-bastion-01.eqiad.wmflabs added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list
  • 08:06 valhallasw`cloud: ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so.
  • 08:04 valhallasw`cloud: installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps.
  • 08:00 valhallasw`cloud: running puppet agent -tv again
  • 07:55 valhallasw`cloud: argh. Disabling toollabs::node::web::generic again and enabling toollabs::node::web::lighttpd
  • 07:54 valhallasw`cloud: various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in
  • 07:53 valhallasw`cloud: Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory --- ran sudo touch /usr/lib/adminbot/README
  • 07:37 valhallasw`cloud: applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411
  • 07:31 valhallasw`cloud: reading puppet suggests I should qconf -ah /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs but that file is missing?
  • 07:26 valhallasw`cloud: andrewbogott built tools-webgrid-lighttpd-1411 yesterday but it's not actually added as exec host. Trying to figure out how to do that...
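
For reference, the walkthrough above (reading bottom to top) boils down to registering the new instance with gridengine under its full tools.eqiad.wmflabs name everywhere. A condensed sketch of the steps involved, per the entries above, with the FQDN written the way that eventually worked:

    # Add the instance as an execution host from its puppet-managed description file
    sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs

    # Add it to the host group the queues reference; the FQDN must match the exec host
    # entry exactly (the .eqiad.wmflabs vs .tools.eqiad.wmflabs mismatch above broke this)
    sudo qconf -mhgrp "@webgrid"

    # Re-enable the queue instances on the host once the names agree
    sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"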

2015-08-17

  • 19:00 scfc_de: tools-checker-01, tools-exec-1410, tools-exec-catscan, tools-redis-01, tools-redis-02, tools-web-static-01, tools-webgrid-lighttpd-1406, tools-webproxy-02: Remounted /public/dumps (T109261).
  • 16:17 andrewbogott: disable queues for tools-exec-1205 tools-exec-1207 tools-exec-1208 tools-exec-140 tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-exec-catscan tools-web-static-01 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1205 tools-webgrid-lighttpd-1206 tools-webgrid-lighttpd-1406 tools-webproxy-02
  • 15:33 andrewbogott: re-enabling the queue on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
  • 14:50 andrewbogott: killing remaining jobs on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01

2015-08-15

  • 05:14 andrewbogott: resumed tools-exec-gift, seems not to have been the culprit
  • 05:10 andrewbogott: suspending tools-exec-gift, just for a moment...

2015-08-14

  • 17:21 andrewbogott: disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
  • 15:20 andrewbogott: Adding back to the grid engine queue: tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
  • 14:43 andrewbogott: killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407

2015-08-13

  • 18:51 valhallasw`cloud: which was resolved by scfc earlier
  • 18:50 valhallasw`cloud: tools-exec-1201/Puppet staleness was critical due to an agent lock (Ignoring stale puppet agent lock for pid
    Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists))
  • 18:08 scfc_de: scfc@tools-exec-1201: Removed stale /var/lib/puppet/state/agent_catalog_run.lock; Puppet run was started Aug 12 15:06:08, instance was rebooted ~ 15:14.
  • 16:44 andrewbogott: disabling job queue for tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
  • 14:48 andrewbogott: and tools-webgrid-lighttpd-1408
  • 14:48 andrewbogott: rescheduling (and in some cases killing) jobs on tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405

2015-08-12

  • 16:05 andrewbogott: depooling tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1408
  • 15:20 valhallasw`cloud: re-enabling queues on restarted hosts
  • 14:41 andrewbogott: forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410

2015-08-11

  • 18:17 andrewbogott: depooling tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410 in anticipation of labvirt1001 reboot tomorrow

2015-08-04

  • 13:43 scfc_de: Fixed owner of ~tools.kasparbot/error.log (T99576).

2015-08-03

  • 19:13 andrewbogott: deleted tools-static-01

2015-08-01

  • 18:09 andrewbogott: depooling/rebooting tools-webgrid-lighttpd-1407 because it’s unable to fork
  • 16:54 scfc_de: tools-webgrid-lighttpd-1407: Removed exim paniclog (OOM).

2015-07-30

  • 15:00 andrewbogott: rebooting tools-bastion-01 aka tools-login
  • 14:46 scfc_de: tools-webgrid-lighttpd-1408, tools-webgrid-lighttpd-1409: Removed exim paniclog (OOM).
  • 02:53 scfc_de: "webservice uwsgi-python start" for blogconverter.
  • 02:40 scfc_de: qdel 545479 (hazard-bot, "release=trusty-quiet", stuck since July 9th).
  • 02:39 scfc_de: qdel 301895 (projanalysis, "release=trust", stuck since July 1st).
  • 02:38 scfc_de: tools-webgrid-generic-1401, tools-webgrid-generic-1402, tools-webgrid-generic-1403: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
  • 01:41 scfc_de: tools-webgrid-lighttpd-1406: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).

2015-07-29

  • 23:43 andrewbogott: draining, rebooting tools-webgrid-lighttpd-1408
  • 20:11 andrewbogott: rebooting tools-webgrid-lighttpd-1404
  • 19:58 scfc_de: tools-*: sudo rmdir /etc/ssh/userkeys/ubuntu{/.ssh{/authorized_keys\ {/public{/keys{/ubuntu{/.ssh,},},},},},}

2015-07-28

  • 17:49 valhallasw`cloud: Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server
  • 17:43 valhallasw`cloud: rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty
  • 17:16 valhallasw`cloud: disabled queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs"
  • 02:07 YuviPanda: removed pacct files from tools-bastion-01

2015-07-27

  • 21:27 valhallasw`cloud: turned off process accounting on tools-login while we try to find the root cause of phab:T107052:
    accton off

2015-07-19

  • 01:51 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).

2015-07-11

  • 00:01 mutante: fixing puppet runs on tools-webgrid-* via salt

2015-07-10

  • 23:59 mutante: fixing puppet runs on tools-exec via salt
  • 20:09 valhallasw`cloud: it took three of us, but adminbot is updated!

July 6

  • 09:49 valhallasw`cloud: 10:14 <jynus> s51053 is abusing his/her access to replica dbs and creating lag for other users. His/her queries are to be terminated. (= tools.jackbot / user jackpotte)

July 2

  • 17:07 valhallasw`cloud: can't login to tools-mailrelay-01., probably because puppet was disabled for too long. Deleting instance.
  • 16:12 valhallasw`cloud: I mean tools-bastion-01
  • 16:12 valhallasw`cloud: stopping puppet on tools-login and tools-mail to check for changes in deploying https://gerrit.wikimedia.org/r/#/c/205914/

June 29

  • 17:29 YuviPanda: failed over tools webproxy to tools-webproxy-02

June 21

  • 18:57 scfc_de: tools-precise-dev: apt-get purge python-ldap3 (the previous fix for "Cache has broken packages, exiting" didn't work).
  • 16:39 scfc_de: tools-precise-dev: apt-get clean ("Cache has broken packages, exiting").
  • 16:33 scfc_de: tools-submit: Removed exim4 paniclog (OOM).

June 19

  • 15:07 YuviPanda: remounting /data/scratch

June 10

  • 11:52 YuviPanda: tools-trusty be gone

June 8

  • 16:31 YuviPanda: added Nova Tools Bot as admin, for automated nova API access

June 7

  • 17:05 YuviPanda: killed sort /data/project/templatetiger/public_html/dumps/ruwiki-2015-03-24.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/ruwiki-2015-03-24.txt -T /data/project/templatetiger to rescue NFS

June 5

  • 17:44 YuviPanda: migrate tools-shadow to labvirt1002

June 2

  • 18:34 Coren: rebooting tools-webgrid-lighttpd-1406.eqiad.wmflabs
  • 16:27 YuviPanda: cleaned out /etc/hosts file on tools-shadow
  • 16:20 Coren: switching back to tools-master
  • 16:10 YuviPanda: restart nscd on tools-submit
  • 15:54 Coren: Switching names for tools-exec-1401
  • 15:43 Coren: adding the "new" exec nodes (aka, current nodes with new names)
  • 14:34 YuviPanda: turned off dnsmasq for toollabs
  • 13:54 Coren: adding new-style names for submit hosts
  • 13:53 YuviPanda: moved tools-master / shadow to designate
  • 13:52 Coren: new-style names for gridengine admin hosts added
  • 13:28 Coren: sge_shadowd started a new master as expected, after /two/ timeouts of 60s (unexpected)
  • 13:23 Coren: stracing the shadowd to see what's up; master is down as expected.
  • 13:17 Coren: killing the sge_qmaster to test failover
  • 12:56 YuviPanda: switched labs webproxies to designate, forcing puppet run and restarting nscd

May 29

  • 13:39 YuviPanda: tools-redis-01 is redis master now
  • 13:35 YuviPanda: enable puppet on all hosts, redis move-around completed
  • 13:01 YuviPanda: recreating tools-redis-01 and -02
  • 12:52 YuviPanda: disable puppet on all toollabs hosts for tools-redis update
  • 12:27 YuviPanda: created two redis instances (tools-redis-01 and tools-redis-02), beginning to set up stuff

May 28

  • 12:22 wm-bot: petrb: inserted some local IP's to hosts file
  • 12:15 wm-bot: petrb: shutting nscd off on tools-master
  • 12:14 wm-bot: petrb: test
  • 11:28 petan: syslog is full of these May 28 11:27:36 tools-master nslcd[1041]: [81823a] <group=550> error writing to client: Broken pipe
  • 11:25 petan: rebooted tools-master in order to try fix that network issues

May 27

  • 20:10 LostPanda: disabled puppet on tools-shadow too
  • 19:46 LostPanda: echo -n 'tools-master.eqiad.wmflabs' > /var/lib/gridengine/default/common/act_qmaster haaail someone?
  • 19:10 YuviPanda: reverted gridengine-common on tools-shadow to 6.2u5-4 as well, to match tools-master
  • 18:58 YuviPanda: rebooting tools-master after switchover failed and it cannot seem to do DNS

May 23

  • 19:56 scfc_de: tools-webgrid-lighttpd-1410: Removed exim4 paniclog (OOM).

May 22

  • 20:37 yuvipanda: deleted and depooled tools-exec-07

May 20

  • 20:09 yuvipanda: transient shinken puppet alerts because I tried to force puppet runs on all tools hosts but cancelled
  • 20:01 yuvipanda: enabling puppet on all hosts
  • 20:01 yuvipanda: tested new /etc/hosts on tools-bastion-01, puppet run produced no diffs, all good
  • 19:56 yuvipanda: copy cleaned up and regenerated /etc/hosts from tools-precise-dev to all toollabs hosts
  • 19:54 yuvipanda: copy cleaned up hosts file to /etc/hosts on tools-precise-dev
  • 19:54 yuvipanda: enabled puppet on tools-precise-dev
  • 19:33 yuvipanda: disabling puppet on *all* hosts for https://gerrit.wikimedia.org/r/#/c/210000/
  • 06:21 yuvipanda: killed a bunch of webservice jobs stuck in dRr state

May 19

  • 21:06 yuvipanda: failed over services to tools-services-02, -01 was refusing to start some webservices with permission denied errors for setegid
  • 20:16 yuvipanda: qdel -f for all webservice jobs that were in dr state
  • 20:12 yuvipanda: force killed croptool webservice

May 18

  • 01:36 yuvipanda: created new tools-checker-01, applying role and provisioning
  • 01:32 yuvipanda: killed tools-checker-01 instance, recreating

May 15

  • 12:06 valhallasw: killed those perl scripts; kmlexport's lighttpd is also using excessive memory (5%), so restarting that
  • 12:01 valhallasw: webgrid-lighttpd-1402 puppet failure caused by major memory usage; tools.kmlexport is running heavy perl scripts
  • 00:27 yuvipanda: cleared graphite data for /var/* mounts on tools-redis

May 14

  • 21:53 valhallasw: shut down & removed "tools-exec-08.eqiad.wmflabs" from execution host list
  • 21:11 valhallasw: forced rescheduling of (non-cont) welcome.py job (iluvatarbot, jobid 8869)
  • 03:29 yuvipanda: drained, depooled and deleted tools-exec-15

May 10

  • 22:08 yuvipanda: created tools-precise-dev instance
  • 09:28 yuvipanda: cleared and depooled tools-exec-02 and -13. only job running was deadlocked for a long, long time (week)
  • 05:47 scfc_de: tools-submit: Removed paniclog (OOM) and stopped apache2.

May 5

  • 18:50 Betacommand: helperbot (WP:AVI bot) was running logged out and its owner is MIA; Coren killed the job from 1204 and commented out the crontab

May 4

  • 21:24 yuvipanda: reboot tools-submit, was stuck

May 2

  • 10:21 yuvipanda: drained all the old webgrid nodes, pooled in all the new webgrid nodes! POTATO!
  • 10:13 yuvipanda: cleaned out webgrid jobs from tools-webgrid-03
  • 10:12 yuvipanda: pooled tools-webgrid-lighttpd-{06-10}
  • 08:56 yuvipanda: drained and deleted tools-webgrid-01
  • 07:31 yuvipanda: depooled and deleted tools-webgrid-{01,02}
  • 07:31 yuvipanda: disabled catmonitor task / cron, was heavily using an sqlite db on NFS
  • 06:56 yuvipanda: pooled tools-webgrid-generic-{01-04}
  • 03:44 yuvipanda: drained and deleted old trusty webgrid tools-webgrid-{05-07}
  • 02:13 yuvipanda: created tools-webgrid-lighttpd-12{01-05} and tools-webgrid-generic-14{01-04}
  • 01:59 yuvipanda: created tools-webgrid-lighttpd-14{01-10}
  • 01:58 yuvipanda: increased tools instance quota

May 1

  • 03:55 YuviKTM: depooled and deleted tools-exec-20
  • 03:54 YuviKTM: killed final job in tools-exec-20 (9911317), decommissioning node

April 30

  • 19:33 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
  • 19:31 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
  • 06:30 YuviKTM: added public IPs for all exec nodes so IRC tools continue to work. Removed all associated hostnames, let’s not do those
  • 06:13 YuviKTM: allocating new floating IPs for the new instances, because IRC bots need them.
  • 05:42 YuviKTM: disabled and drained tools-exec-1{1-5} of continuous jobs
  • 05:40 YuviKTM: pooled in tools-exec-121{1-9}
  • 05:39 YuviKTM: rebooted tools-exec-121{1-9} instances so they can apply gridengine-common properly
  • 05:39 YuviKTM: created new instances tools-exec-121{1-9} as precise
  • 05:39 YuviKTM: killed tools-dev, nobody still ssh’d in, no crontabs
  • 05:39 YuviKTM: depooled exec-{06-10}, rejigged jobs to newer nodes
  • 05:39 YuviKTM: delete tools-exec-10, was out of jobs
  • 04:28 YuviKTM: deleted tools-exec-09
  • 04:27 YuviKTM: depooled tools-exec-09.eqiad.wmflabs
  • 04:23 YuviKTM: repooled tools-exec-1201 is all good now
  • 04:19 YuviKTM: rejuggle jobs again in trustyland
  • 04:14 YuviKTM: repooled tools-exec-09, apt troubles fixed
  • 04:08 YuviKTM: depooled tools-exec-09, apt troubles
  • 04:04 YuviKTM: pooled tools-exec-1408 and tools-exec-1409
  • 04:00 YuviKTM: pooled tools-exec-1406 and 1407
  • 03:58 YuviKTM: pooled tools-exec-12{02-10}, forgot to put appropriate roles on 1201, fixing now
  • 03:54 YuviKTM: tools-exec-03 and -04 have been deleted a long time ago
  • 03:53 YuviKTM: depooled tools-exec-03 / 04
  • 03:31 YuviKTM: depooled and deleted tools-exec-12 had nothing on it
  • 03:28 YuviKTM: deleted tools-exec-21 to 24, one task still running on tools-exec
  • 03:24 YuviKTM: disabled and drained continuous tasks off tools-exec-20 to tools-exec-24
  • 03:18 YuviKTM: pooled tools-exec-1403, 1404
  • 03:13 YuviKTM: pooled tools-exec-1402
  • 03:07 YuviKTM: pooled tools-exec-1405
  • 03:04 YuviKTM: pooled tools-exec-1401
  • 02:53 YuviKTM: created tools-exec-14{06-10}
  • 02:14 YuviKTM: created tools-exec-14{01-05}
  • 01:09 YuviPanda: killing local copy of python-requests, there seems to be a newer version in prod

April 29

  • 19:33 valhallasw`cloud: re-created tools-mailrelay-01 with precise: Nova_Resource:I-00000bca.eqiad.wmflabs
  • 19:30 YuviPanda: set appropriate classes for recreated tools-exec-12* nodes
  • 19:28 YuviPanda: recreated tools-static-02
  • 19:11 YuviPanda: failed over tools-static to tools-static-01
  • 14:47 andrewbogott: deleting tools-exec-04
  • 14:44 Coren: -exec-04 drained; removed from queues. Rest well, old friend.
  • 14:41 Coren: disabled -exec-04 (going away)
  • 02:35 YuviPanda: set tools-exec-12{01-10} to configure as exec nodes
  • 02:27 YuviPanda: created tools-exec-12{01-10}

April 28

  • 21:41 andrewbogott: shrinking tools-master
  • 21:33 YuviPanda: failover is going to take longer than actual recompression for tools-master, so let’s just recompress. tools-shadow should take over automatically if that doesn’t work
  • 21:32 andrewbogott: shrinking tools-redis
  • 21:28 YuviPanda: attempting to failover gridengine to tools-shadow
  • 21:27 andrewbogott: shrinking tools-submit
  • 21:21 YuviPanda: backup crontabs onto NFS
  • 21:18 andrewbogott: shrinking tools-webproxy-02
  • 21:14 andrewbogott: shrinking tools-static-01
  • 21:11 andrewbogott: shrinking tools-exec-gift
  • 21:06 YuviPanda: failover tools-webproxy to tools-webproxy-01
  • 21:06 andrewbogott: stopping, shrinking and starting tools-exec-catscan
  • 21:01 YuviPanda: failover tools-static to tools-static-02
  • 20:53 andrewbogott: stopping, shrinking, restarting tools-shadow
  • 20:43 andrewbogott: stopping, shrinking, starting tools-static-02
  • 20:39 valhallasw`cloud: created tools-mailrelay-01 Nova_Resource:I-00000bac.eqiad.wmflabs
  • 20:26 YuviPanda: failed over tools-services to services-01
  • 18:11 Coren: reenabled -webgrid-generic-02
  • 18:05 Coren: reenabled -webgrid-03, -webgrid-08, -webgrid-generic-01; drained -webgrid-generic-02
  • 17:44 Coren: -webgrid-03, -webgrid-08 and -webgrid-generic-01 drained
  • 14:04 Coren: reenable -exec-11 for jobs.
  • 13:55 andrewbogott: stopping tools-exec-11 for a resize experiment

April 25

  • 01:32 YuviPanda: deleted tools-static, tools-static-01 has taken over
  • 01:02 YuviPanda: deleted tools-login, tools-bastion-01 has been running for long enough

April 24

  • 16:29 Coren: repooled -exec-02, -08, -12
  • 16:05 Coren: -exec-02, -08 and -12 draining
  • 15:54 Coren: reenabled tools-exec-07, -10 and -11 after reboot of host
  • 15:41 Coren: -exec-03 goes away for good.
  • 15:31 Coren: draining -exec-03 to ease migration
  • 13:43 Coren: draining tools-exec-07,10,11 to allow virt host reboot

April 23

  • 22:41 YuviPanda: disabled *@tools-exec-09
  • 22:40 YuviPanda: add tools-exec-09 back to @general
  • 22:38 YuviPanda: take tools-exec-09 from @general group
  • 20:53 YuviPanda: restart bigbrother
  • 20:28 YuviPanda: restarted nscd on tools-login and tools-dev
  • 20:22 valhallasw`cloud: removed 10.68.16.4 tools-webproxy tools.wmflabs.org from /etc/hosts
  • 13:17 andrewbogott: beginning migration of tools instances to labvirt100x hosts
  • 01:00 YuviPanda: good bye tools-login.eqiad.wmflabs

April 20

  • 13:38 scfc_de: tools-mail: Removed paniclog and killed superfluous exim.

April 18

  • 20:09 YuviPanda: sysctl vm.overcommit_memory=1 on tools-redis to allow it to bgsave again
  • 19:52 valhallasw`cloud: tools-redis unresponsive (T96485); rebooting
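
Context for the overcommit tweak above: redis BGSAVE forks the process, and with vm.overcommit_memory=0 the kernel may refuse the fork on a busy host, so background saves fail. A sketch; persisting the setting would normally be puppetized here, the sysctl.d file below is just an illustration:

    # Apply immediately (what was done above)
    sudo sysctl vm.overcommit_memory=1

    # Illustrative way to persist across reboots (assumed path, not from the log)
    echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/60-redis-overcommit.conf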

April 17

  • 01:48 YuviPanda: disable puppet on live webproxy (-01) to apply firewall changes to -02

April 16

  • 20:57 Coren: -webgrid-08 drained, rebooting
  • 20:46 Coren: -webgrid-03 repooled, depooling -webgrid-08
  • 20:45 Coren: -webgrid-03 drained, rebooting
  • 20:38 Coren: -webgrid-03 depooled
  • 20:38 Coren: -webgrid-02 repooled
  • 20:35 Coren: -webgrid-02 drained, rebooting
  • 20:33 Coren: -webgrid-02 depooled
  • 20:32 Coren: -webgrid-01 repooled
  • 20:06 Coren: -webgrid-01 drained, rebooting.
  • 19:56 Coren: depooling -webgrid-01 for reboot
  • 14:37 Coren: rebooting -master
  • 14:29 Coren: rebooting -mail
  • 14:22 Coren: rebooting -shadow
  • 14:22 Coren: -exec-15 repooled
  • 14:19 Coren: -exec-15 drained, rebooting.
  • 13:46 Coren: -exec-14 repooled. That's it for general exec nodes.
  • 13:44 Coren: -exec-14 drained, rebooting.

April 15

  • 21:06 Coren: -exec-10 repooled
  • 20:55 Coren: -exec-10 drained, rebooting
  • 20:49 Coren: -exec-07 repooled.
  • 20:47 Coren: -exec-07 drained, rebooting
  • 20:43 Coren: -exec-06 requeued
  • 20:41 Coren: -exec-06 drained, rebooting
  • 20:15 Coren: repool -exec-05
  • 20:10 Coren: -exec-05 drained, rebooting.
  • 19:56 Coren: -exec-04 repooled
  • 19:52 Coren: -exec-04 drained, rebooting.
  • 19:41 Coren: disabling new jobs on remaining (exec) precise instances
  • 19:32 Coren: repool -exec-02
  • 19:30 Coren: draining -exec-04
  • 19:29 Coren: -exec-02 drained, rebooting
  • 19:28 Coren: -exec-03 rebooted, requeing
  • 19:26 Coren: -exec-03 drained, rebooting
  • 18:50 Coren: dequeuing tools-exec-03 whilst waiting for -02 to drain.
  • 18:43 Coren: tools-exec-01 back sans idmap, returning to pool
  • 18:40 Coren: tools-exec-01 drained of jobs; rebooting
  • 18:39 YuviPanda: disabled puppet on running webproxy, tools-webproxy-01
  • 18:25 Coren: disabled -exec-01 and -exec-02 to new jobs.

April 14

  • 13:13 scfc_de: tools-submit: Removed exim paniclog (OOM doom).
  • 13:13 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

April 13

  • 21:11 YuviPanda: restart portgranter on all webgrid nodes

April 12

  • 10:52 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 11

  • 21:49 andrewbogott: moved /data/project/admin/toollabs to /data/project/admin/toollabsbak on tools-webproxy-01 and tools-webproxy-02 to fix permission errors
  • 02:15 YuviPanda: rebooted tools-submit, was not responding

April 10

  • 07:10 PissedPanda: take out tools-services-01 to test switchover and also to recreate as small
  • 05:20 YuviPanda: delete the tomcat node finally :D

April 9

  • 23:24 scfc_de: rm -f /puppet_{host,service}groups.cfg on all hosts (apparently a Puppet/hiera mishap last November).
  • 23:11 scfc_de: tools-webgrid-04: Rescheduled all jobs running on this instance (T95537).
  • 08:32 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

April 8

  • 13:25 scfc_de: Repaired servicegroups repository and restarted toolhistory job; was stuck at 2015-03-29T09:15:05Z (NFS?).
  • 12:01 scfc_de: Removed empty tools with no maintainers javed/javedbaker/shell.
  • 09:10 scfc_de: Removed stale proxy entries for analytalks/anno/commons-coverage/coursestats/eagleeye/hashtags/itwiki/mathbot/nasirkhanbot/rc-vikidia/wikistream.

April 7

  • 07:42 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

April 5

  • 10:11 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 4

  • 22:48 scfc_de: Removed zombie jobs (qdel 1991607,1994800,1994826,1994827,2054201,3449476,3450329,3451518,3451549,3451590,3451628,3451635,3451830,3451869,3452632,3452633,3452654,3452655,3452657,3452668,4218785,4219210,4219674,4219722,4219791,4219923,4220646).
  • 08:49 scfc_de: tools-submit: Restarted bigbrother because it didn't notice admin's .bigbrotherrc.
  • 08:49 scfc_de: Add webservice to .bigbrotherrc for admin tool.
  • 03:35 scfc_de: Deployed jobutils/misctools 1.5 (T91954).

April 3

  • 22:55 scfc_de: Removed empty cgi-bin directories.
  • 20:35 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 2

  • 20:07 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
  • 20:06 scfc_de: tools-submit: Removed exim paniclog (OOM).
  • 01:25 YuviPanda: created tools-bastion-02

April 1

  • 00:14 scfc_de: tools-webgrid-03: Rebooted, was stuck on console input when unable to mount NFS on boot (per wikitech console output).

March 31

  • 14:02 Coren: rebooting tools-submit
  • 07:07 YuviPanda: moved tools.wmflabs.org to tools-webproxy-01
  • 07:02 YuviPanda: reboot tools-webgrid-03 and tools-exec-03
  • 00:21 andrewbogott: temporarily shutting ‘toolsbeta-pam-sshd-motd-test’ down to conserve resources. It can be restarted any time.

March 30

  • 22:53 Coren: resyncing project storage with rsync
  • 22:40 Coren: reboot tools-login
  • 22:30 Coren: also bastion2
  • 22:28 Coren: reboot bastion1 so users can log in
  • 21:49 Coren: rebooting dedicated exec nodes.
  • 21:49 Coren: rebooting tools-submit
  • 17:27 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

March 29

  • 19:30 scfc_de: tools-submit: Restarted bigbrother for T90384.

March 28

  • 19:42 YuviPanda: created tools-exec-20

March 26

  • 21:24 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 25

  • 16:49 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

March 24

  • 16:03 scfc_de: tools-login: Removed exim paniclog (entries from Sunday).
  • 15:51 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 23

  • 21:23 scfc_de: tools-login, tools-dev, tools-trusty: Now actually disabled role::labs::bastion per T93661 :-).
  • 21:08 scfc_de: tools-login, tools-dev, tools-trusty: role::labs::bastion is still enabled due to T93663.
  • 20:57 scfc_de: tools-login, tools-dev, tools-trusty: Disabled role::labs::bastion per T93661.
  • 03:02 andrewbogott: wiped out atop.log on tools-dev because /var was filling up

March 22

  • 23:08 scfc_de: qconf -ah tools-bastion-01.eqiad.wmflabs
  • 23:07 scfc_de: for host in {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}.eqiad.wmflabs; do qconf -as "$host"; done
  • 23:07 yuvipanda: copied /etc/hosts into place on tools-bastion-01

March 21

  • 16:18 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

March 15

  • 22:38 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 13

  • 16:23 YuviPanda: cleaned out / on tools-trusty

March 11

  • 04:28 YuviPanda: tools-redis is back now, as trusty and hopefully slightly more fortified
  • 04:14 YuviPanda: kill tools-redis instance, upgrade to trusty while it is down anyway
  • 03:56 YuviPanda: restarted redis server, it had OOM-killed
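
When tools-redis OOMs as above, the usual first look is at how much memory the dataset uses and whether any cap is configured. A minimal sketch with stock redis-cli commands (run on the tools-redis instance):

    redis-cli info memory | grep used_memory_human   # current memory footprint of the dataset
    redis-cli config get maxmemory                   # whether a memory cap is configured at all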

March 9

  • 11:02 scfc_de: Deleted probably outdated proxy entry for tool wp-signpost and restarted webservice.
  • 10:22 scfc_de: Deleted obsolete proxy entries without webservice for tools bracketbot/herculebot/extreg-wos/pirsquared/searchsbl/translate/yifeibot.
  • 10:11 scfc_de: Restarted webservices for tools blahma/catmonitor/catscan2/contributions-summary/eagleeye/imagemapedit/jackbot/tb-dev/vcat/wikihistory/xtools-ec (cf. T91939).
  • 08:27 scfc_de: qmod -cq webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs (OOM of two jobs in the past).
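
A queue instance ends up in the error state ("E") after failures like the OOMs noted above, and stays unusable until the state is cleared. A minimal sketch of inspecting and then clearing it, using the same queue instance as the entry above:

    qstat -f -explain E -q 'webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs'   # show why the instance is flagged E
    qmod -cq 'webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs'                 # clear the error state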

March 7

  • 12:17 scfc_de: Moved obsolete packages that are installed on no instance at all from /data/project/.system/deb to ~tools.admin/archived-packages.

March 6

  • 07:46 scfc_de: Set role::labs::tools::toolwatcher for tools-login.
  • 07:43 scfc_de: Deployed jobutils/misctools 1.4.

March 2

March 1

  • 15:11 YuviPanda|brb: pooled tools-webgrid-07 into the lighty webgrid, moving some tools off -05 and -06 to relieve pressure

February 28

  • 07:51 YuviPanda: create tools-webgrid-07
  • 01:00 Coren: Set vm.overcommit_memory=0 on -webgrid-05 (also trusty)
  • 01:00 Coren: Also: that was -webgrid-05 (correcting the 00:59 entry below)
  • 00:59 Coren: set exec-06 to vm.overcommit_memory=0 for now, until the vm behaviour difference between precise and trusty can be nailed down.
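
The overcommit change above is a plain sysctl. A minimal sketch of applying it live and persisting it across reboots (the sysctl.d file name is assumed):

    sudo sysctl -w vm.overcommit_memory=0                                        # heuristic overcommit, the kernel default
    echo 'vm.overcommit_memory = 0' | sudo tee /etc/sysctl.d/60-overcommit.conf  # assumed file name; persists across reboots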

February 27

  • 17:53 YuviPanda: increased quota to 512G RAM and 256 cores
  • 15:33 Coren: Switched back to -master. I'm making a note here: great success.
  • 15:27 Coren: Gridengine master failover test part three; killing the master with -9
  • 15:20 Coren: Gridengine master failover test part deux - now with verbose logs
  • 15:10 YuviPanda: created tools-webgrid-generic-02
  • 15:10 YuviPanda: increase instance quota to 64
  • 15:10 Coren: Master restarted - test not successful.
  • 14:50 Coren: testing gridengine master failover starting now
  • 08:27 YuviPanda: restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well

February 24

  • 18:33 Coren: tools-submit not recovering well from outage, kicking it.
  • 17:58 YuviPanda: restarting *all* webgrid jobs on toollabs

February 16

  • 02:31 scfc_de: rm -f /var/log/exim4/paniclog.

February 13

  • 18:01 Coren: tools-redis is dead, long live tools-redis
  • 17:48 Coren: rebuilding tools-redis with moar ramz
  • 17:38 legoktm: redis on tools-redis is OOMing?
  • 17:26 marktraceur: restarting grrrit-wm because it's not behaving

February 1

  • 10:55 scfc_de: Submitted dummy jobs for tools ftl/limesmap/newwebtest/osm-add-tags/render/tsreports/typoscan/usersearch to get bigbrother to recognize those users and cleaned up output files afterwards.
  • 07:51 YuviPanda: cleared error state of stuck queues
  • 06:41 YuviPanda: set chmod +xw manually on /var/run/lighttpd on webgrid-05, need to investigate why it was necessary
  • 05:47 YuviPanda: completed migrating magnus' tools to trusty, more details at https://etherpad.wikimedia.org/p/tools-trusty-move
  • 05:37 YuviPanda: added tools-webgrid-06 as trusty webnode, operational now
  • 04:52 YuviPanda: migrating all of magnus’ tools, after consultation with him (https://etherpad.wikimedia.org/p/tools-trusty-move for status)
  • 04:10 YuviPanda: widar moved to trusty
  • 03:01 YuviPanda: ran salt -G 'instanceproject:tools' cmd.run 'sudo rm -rf /var/tmp/core' because disks were getting full.

January 29

  • 17:26 YuviPanda: reschedule all tomcat jobs

January 27

  • 23:27 YuviPanda: qdel -f 7662482 7661111 for Merlissimo

January 19

  • 20:51 YuviPanda: because valhallasw is nice
  • 10:34 YuviPanda: manually started tools-webgrid-generic-01
  • 09:48 YuviPanda: restarted tools-webgrid-03
  • 08:42 scfc_de: qmod -cq {continuous,mailq,task}@tools-exec-{06,10,11,15}.eqiad.wmflabs
  • 08:36 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog and killed second exim (belated SAL amendment).

January 16

  • 22:11 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.

January 15

  • 22:10 YuviPanda: created instance tools-webgrid-generic-01

January 11

  • 06:38 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.

January 8

  • 07:40 YuviPanda: increase memory limit for autolist from 4G to 7G
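
Per-job memory limits on the grid are requested at submission time, so the autolist bump above most likely corresponds to a larger -mem request. A sketch only: the jsub flags and script name are assumptions, and only the 7G figure and the tool name come from the entry:

    jsub -mem 7g -N autolist python autolist.py   # hypothetical command line for the autolist job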