Nova Resource:Tools/SAL/Archive 2

2017-12-31

  • 02:00 bd808: Killed some pwb.py and qacct processes running on tools-bastion-03

2017-12-21

  • 17:57 bd808: PAWS: deleted hub-deployment pod stuck in crashloopbackoff
  • 17:30 bd808: PAWS: deleting hub-deployment pod. Lots of "Connection pool is full" warnings in pod logs
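
A minimal sketch of the kind of cleanup these PAWS entries describe, run from tools-paws-master-01 (the pod name is taken from the 2017-12-18 entries below; the suffix of the actual pod varies, and the prod namespace matches the other kubectl entries in this archive):

    # list pods in the prod namespace and spot the one stuck in CrashLoopBackOff
    kubectl get pods -n prod | grep -i crashloopbackoff
    # look at recent logs first (this is where the "Connection pool is full" warnings showed up)
    kubectl logs -n prod hub-deployment-1381799904-b5g5j --tail=50
    # delete the stuck pod; the deployment recreates it automatically
    kubectl delete pod -n prod hub-deployment-1381799904-b5g5j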

2017-12-19

  • 21:27 chasemp: reboot tools-paws-master-01
  • 18:38 andrewbogott: rebooting tools-paws-master-01
  • 05:07 andrewbogott: "service gridengine-master restart" on tools-grid-master

2017-12-18

  • 12:04 arturo: it seems jupyterhub tries to use a database which doesn't exist: [E 2017-12-18 11:59:49.896 JupyterHub app:904] Failed to connect to db: sqlite:///jupyterhub.sqlite
  • 11:58 arturo: The restart didn't work. I could see a lot of log lines in the hub-deployment pod with something like: 2017-12-17 04:08:17,574 WARNING Connection pool is full, discarding connection: 10.96.0.1
  • 11:51 arturo: the restart was with: kubectl get pod -o yaml hub-deployment-1381799904-b5g5j -n prod | kubectl replace --force -f -
  • 11:50 arturo: restart pod hub-deployment in paws to try to fix the 502

2017-12-15

  • 13:55 arturo: same in tools-checker-02.tools.eqiad.wmflabs
  • 13:54 arturo: same in tools-exec-1415.tools.eqiad.wmflabs
  • 13:52 arturo: running 'sudo puppet agent -t -v' in tools-webgrid-lighttpd-1416.tools.eqiad.wmflabs since it didn't update in the last clush run

2017-12-14

2017-12-13

  • 17:37 andrewbogott: upgrading puppet packages on all VMs
  • 00:59 madhuvishy: Cordon and Drain tools-worker-1016
  • 00:47 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1018-1023, 1025-1027
  • 00:34 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1011, 1013-1015, 1017
  • 00:28 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1006-1010
  • 00:11 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1002-1005
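
The "Drain + Cordon, Reboot, Uncordon" cycle above corresponds roughly to the following per-node commands run against the k8s master (a sketch only; the node name is illustrative and the drain flags are the ones logged on 2017-03-15):

    NODE=tools-worker-1016.tools.eqiad.wmflabs          # illustrative; repeated for each worker
    kubectl cordon "$NODE"                              # stop new pods from being scheduled there
    kubectl drain --delete-local-data --force "$NODE"   # evict the pods currently running on it
    ssh "$NODE" 'sudo reboot'                           # reboot the instance, then wait for it to return
    kubectl uncordon "$NODE"                            # allow scheduling on it again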

2017-12-12

  • 23:29 madhuvishy: rebooting tools-worker-1012
  • 18:50 andrewbogott: rebooting tools-worker-1001

2017-12-11

  • 19:32 bd808: git gc on tools-static-11; --aggressive was killed by system (T182604)
  • 18:07 andrewbogott: upgrading tools puppetmaster to v4
  • 17:07 bd808: git gc --aggressive on tools-static-11 (T182604)

2017-12-01

  • 15:33 chasemp: put the weird mess of untracked files on the tools puppetmaster into stash to see what breaks, as they should not be there
  • 15:30 chasemp: prometheus nfs collector on tools-bastion-03

2017-11-30

  • 23:23 bd808: Hard reboot of tools-bastion-03 via Horizon
  • 23:06 chasemp: rebooting login.tools.wmflabs.org due to overload

2017-11-20

2017-11-17

  • 21:33 valhallasw`cloud: also chmod g-w'ed those files, and sent emails to all the affected users
  • 21:17 valhallasw`cloud: chmod o-w'ed a bunch of files reported by Dispenser; writing emails to the owners about this

2017-11-16

  • 17:40 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --enable && sudo puppet agent --test && sudo unattended-upgrades -d'
  • 16:50 bd808: Force upgraded nginx on tools-elastic-*
  • 16:37 chasemp: reboot tools-checker-01
  • 15:17 chasemp: disable puppet

2017-11-15

  • 22:48 madhuvishy: Rebooted tools-paws-worker-1017
  • 15:53 chasemp: reboot bastion-03
  • 15:48 chasemp: kill tools.powow on bastion-03 for hammering IO and making bastion unusable

2017-11-07

  • 01:21 bd808: Removed all non-directory files from /home (via labstore1004 direct access)

2017-11-06

  • 18:30 bd808: Load on tools-bastion-03 down to 0.72 from 17.47 after killing a bunch of local processes that should have been running on the job grid instead

2017-11-05

  • 23:48 bd808: Cleaned up 2 huge /tmp files left by tools.croptool (~6.5G)
  • 23:44 bd808: Cleaned up 109 files owned by tools.rezabot on tools-webgrid-lighttpd-1428 with `sudo find /tmp -user tools.rezabot -exec rm {} \+`
  • 23:37 bd808: Cleaned up 955 files owned by tools.wsexport on tools-webgrid-lighttpd-1428 with `sudo find /tmp -user tools.wsexport -exec rm {} \+`

2017-11-03

  • 21:19 bd808: Deployed misctools 1.26 (T156174)

2017-11-02

  • 16:15 bd808: Restarted nslcd on tools-bastion-03

2017-11-01

  • 07:11 madhuvishy: Clear nscd cache across all projects post labsdb dns switchover T179464
  • 07:11 madhuvishy: Clear nscd cache across all projects post labsdb dns switchover

2017-10-31

  • 16:50 bd808: tools-bastion-03 (tools-login, login.tools) is overloaded

2017-10-30

  • 17:35 madhuvishy: Clear dns caches across tools hosts `sudo nscd -i hosts`
  • 16:08 arturo: repool tools-exec-1401.tools.eqiad.wmflabs
  • 15:57 arturo: depool again tools-exec-1401.tools.eqiad.wmflabs for more tests related to T179024
  • 12:47 arturo: repool tools-exec-1401
  • 11:58 arturo: depool tools-exec-1401 to test patch in T179024 --> aborrero@tools-bastion-03:~$ sudo exec-manage depool tools-exec-1401.tools.eqiad.wmflabs

2017-10-24

  • 18:09 madhuvishy: Disable puppet on tools-package-builder-01 temporarily (T178920)
  • 13:22 chasemp: start admin webservice
  • 13:22 chasemp: stop admin webservice

2017-10-23

  • 14:49 chasemp: wall message and scheduled reboot in 5m for bastion-03

2017-10-18

  • 21:36 chasemp: stop basebot -- it is going crazy and spamming email w/ failing to log to error.log. Need to figure out how to notify but it's clearly in a failure loop.
  • 14:04 chasemp: add strephit creds to elasticsearch per T178310

2017-10-12

  • 16:57 bd808: Rebuilding all Kubernetes Docker images to include toollabs-webservice 0.38
  • 16:53 bd808: Upgraded toollabs-webservice to 0.38

2017-10-06

  • 15:33 bd808: Upgrade jobutils to 1.25 (T177614)
  • 00:27 bd808: Updated misctools to 1.24

2017-10-05

  • 22:47 bd808: Updated misctools to 1.23
  • 22:42 bd808: Updated jobutils to 1.23
  • 15:46 chasemp: tools-bastion-03 has tons of local tools running long lived NFS intensive processes. I'm rebooting rather than playing whackamole.

2017-10-03

  • 19:30 bd808: `kubectl --namespace=prod delete pod --all` on tools-paws-master-01

2017-10-01

  • 21:46 madhuvishy: Cold migrating tools-clushmaster-01 from labvirt1015 to labvirt1017

2017-09-29

  • 19:49 andrewbogott: migrating tools-clushmaster-01 to labvirt1015

2017-09-25

  • 15:14 andrewbogott: rebooting tools-paws-worker-1006 since I can't access it
  • 14:57 chasemp: OS_TENANT_NAME=tools openstack server reboot 2c0cf363-c7c3-42ad-94bd-e586f2492321 (unresponsive)

2017-09-20

  • 16:52 madhuvishy: apt-get install --only-upgrade apache2; service apache2 restart on tools-puppetmaster-01

2017-09-19

  • 15:22 chasemp: tools-clushmaster-01:~$ clush -f 5 -g all 'sudo puppet agent --test'
  • 13:39 chasemp: bastion-03: someone dropped 8.6G in /tmp, which seemingly is /not/ on a temp file system
  • 13:25 chasemp: wall Bastion disk is full and needs attention and reboot in 60

2017-09-18

  • 18:02 bd808: Updated PHP5.6 images for Kubernetes (T172358)

2017-09-13

  • 15:34 bd808: Running inbound message purge via clush to @tools-exec
  • 15:15 bd808: Running outbound message purge via clush to @tools-exec
  • 13:57 bd808: apt-get install nginx-common on tools-static-1[01]
  • 13:31 bd808: static down due to apparent nginx package upgrade/config change
  • 02:10 bd808: Really disabled puppet on tools-mail
  • 01:51 bd808: Nuked all messages in the exim spool on tools-mail
  • 01:09 bd808: Removed user WiktCAPT from project
  • 00:55 bd808: Archived and then purged /var/spool/exim4/input on tools-mail
  • 00:47 bd808: Archived and then purged /var/spool/exim4/msglog on tools-mail
  • 00:43 bd808: Stopped exim on tools-mail
  • 00:43 bd808: Disabled puppet on tools-mail
  • 00:15 chasemp: forced to clean out exim queue as the filesystem used up all inodes

2017-08-31

  • 20:33 madhuvishy: Updated certs and ran puppet, restarted nginx on tools-proxy-* and tools-static-* (T174611)
  • 20:25 madhuvishy: Merging new cert https://gerrit.wikimedia.org/r/#/c/374873/ (T174611)
  • 20:24 madhuvishy: Disabling puppet on tools-proxy-* and tools-static-* for star.wmflabs.org SSL cert update (T174611)
  • 20:23 madhuvishy: Disabling puppet on tools-proxy-* and tools-static-* for star.wmflabs.org SSL cert update
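
In outline, the cert rollout above follows the usual disable/merge/apply pattern driven from the clush master (a sketch under the assumption that the affected hosts were the proxy and static instances named in the entries; the exact clush invocation is not recorded):

    NODES='tools-proxy-[01-02],tools-static-[10-11]'    # assumed host set
    # freeze puppet on the affected hosts before the cert change is merged
    clush -w "$NODES" 'sudo puppet agent --disable "star.wmflabs.org cert update T174611"'
    # ...merge https://gerrit.wikimedia.org/r/#/c/374873/ on the puppetmaster...
    # re-enable and run puppet to ship the new cert, then restart nginx to load it
    clush -w "$NODES" 'sudo puppet agent --enable && sudo puppet agent -t'
    clush -w "$NODES" 'sudo service nginx restart'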

2017-08-24

  • 19:59 bd808: restarted nslcd and nscd on tools-bastion-03
  • 19:59 bd808: restarted nslcd and nscd on tools-bastion-02

2017-08-22

  • 19:20 andrewbogott: deleted tools-puppetmaster-02, it was replaced a month ago by -01

2017-08-12

  • 18:38 chasemp: restart admin webservice

2017-08-11

2017-08-10

  • 14:59 chasemp: 'become stimmberechtigung && restart' && 'become intersect-contribs && restart'

2017-08-09

  • 17:28 chasemp: webservices restart tools.orphantalk

2017-08-03

  • 00:47 bd808: tools-bastion-03 not usably responsive to interactive commands; will reboot
  • 00:00 bd808: Restarted kube-proxy service on bastion-03

2017-08-02

  • 16:59 bd808: Force deleted 6 jobs stuck in 'dr' state

2017-07-31

  • 15:28 chasemp: remove python-keystoneclient from bastion-03

2017-07-27

  • 23:27 bd808: Killed python procs owned by sdesabbata on tools-login that were stealing all cpu/io
  • 21:16 bd808: Disabled puppet on tools-proxy-01 to test nginx proxy config changes
  • 16:27 bd808: Enabled puppet on tools-static-11
  • 16:10 bd808: Disabled puppet on tools-static-11 to test https://gerrit.wikimedia.org/r/#/c/357878

2017-07-26

  • 22:33 chasemp: hotpatching an hiera value on tools master to see effects

2017-07-20

  • 19:48 bd808: Clearing all Eqw state jobs in all queues with: qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 qmod -cj
  • 13:54 andrewbogott: upgrading apache2 on tools-puppetmaster-01
  • 04:00 chasemp: tools-webgrid-lighttpd-1402:~# service nslcd restart && service nscd restart
  • 03:57 chasemp: tools-exec-1428:~# service nslcd restart && service nscd restart
  • 03:57 bd808: Restarted cron, nscd, nslcd on tools-cron-01
  • 03:45 chasemp: tools-puppetmaster-01:~# service nslcd restart && service nscd restart
  • 03:44 chasemp: tools-puppetmaster-01:~# service nslcd restart && service nscd restart
  • 03:37 bd808: Restarted apache on tools-puppetmaster-01

2017-07-19

  • 23:52 bd808: Restarted cron on tools-cron-01; toolschecker job showing user not found errors
  • 21:19 valhallasw`cloud: Restarted nslcd on tools-bastion-03 (=tools-login); logins seem functional again.
  • 21:18 bd808: Forced puppet run and restarted nscd, nslcd on tools-bastion-02

2017-07-18

  • 19:51 andrewbogott: enabling puppet on tools-proxy-02. I don't know why it was disabled.

2017-07-17

  • 01:43 bd808: Uncordoned tools-worker-1020 after deleting pods with local storage that were filling the entire disk
  • 01:36 bd808: Depooling tools-worker-1020

2017-07-13

  • 21:59 bd808: Elasticsearch cluster upgraded to 5.3.2
  • 21:25 bd808: Upgrading ElasticSearch cluster for T164842. There will be service interruptions
  • 17:59 bd808: Puppet is disabled on tools-proxy-02 with no reason specified.
  • 17:09 bd808: Upgraded nginx-common on tools-proxy-02
  • 17:05 bd808: Upgraded nginx-common on tools-proxy-01

2017-07-12

  • 15:46 chasemp: push out puppet run across tools
  • 12:15 andrewbogott: restarting 'admin' webservice

2017-07-07

  • 18:26 bd808: Forced puppet runs on tools-redis-* for security fix

2017-07-03

  • 04:26 bd808: cdnjs on tools-static-10 is up to date
  • 03:38 bd808: cdnjs on tools-static-11 is up to date
  • 02:19 bd808: Cleaning up stuck merges for cdnjs clones on tools-static-10 and tools-static-11

2017-07-01

  • 19:40 bd808: Disabled puppet on tools-k8s-master-01 to try and fix maintain-kubeusers
  • 19:32 bd808: Restarted maintain-kubeusers on tools-k8s-master-01

2017-06-30

  • 01:33 chasemp: time for i in `cat tools-hosts`; do ssh -i ~/.ssh/labs_root_id_rsa root@$i.eqiad.wmflabs 'hostname -f; uptime; tc-setup'; done
  • 01:29 andrewbogott: rebooting tools-cron-01

2017-06-29

  • 23:01 madhuvishy: Uncordoned all k8s-workers
  • 20:50 madhuvishy: depooling, rebooting and repooling all grid exec nodes (see the sketch after this list)
  • 20:36 andrewbogott: depooling, rebooting, and repooling every lighttpd node three at a time
  • 19:55 madhuvishy: Killed liangent-php jobs and usrd-tools jobs
  • 18:00 madhuvishy: drain cordon reboot uncordon tools-worker-1015
  • 17:37 madhuvishy: drain cordon reboot uncordon tools-worker-1005 tools-worker-1007 tools-worker-1008
  • 17:22 bd808: rebooting tools-static-11
  • 17:20 andrewbogott: rebooting tools-static-10
  • 17:20 madhuvishy: drain cordon reboot uncordon tools-worker-1012 tools-worker-1003
  • 17:13 madhuvishy: drain cordon reboot uncordon tools-worker-1022, tools-worker-1009, tools-worker-1002
  • 16:27 chasemp: restart k8s components on master (madhu)
  • 16:10 chasemp: tools-flannel-etcd-01:~$ sudo service etcd restart
  • 16:04 madhuvishy: reboot tools-worker-1022 tools-worker-1009
  • 15:57 chasemp: reboot tools-docker-registry-01 for nfs
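
A sketch of the depool/reboot/repool loop referenced above. The exec-manage depool call appears verbatim in the 2017-10-30 entries; the repool subcommand, the batching, and the sleep are assumptions about how the run was scripted:

    # illustrative batch of three nodes; the real run worked through every exec node
    for node in tools-exec-1401 tools-exec-1402 tools-exec-1403; do
        host="$node.tools.eqiad.wmflabs"
        sudo exec-manage depool "$host"     # take the node out of the gridengine queues
        ssh "$host" 'sudo reboot'
        sleep 300                           # crude wait for the instance to come back up
        sudo exec-manage repool "$host"     # put it back into service (subcommand assumed)
    done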

2017-06-27

  • 21:32 andrewbogott: moving all tools nodes to new puppetmaster, tools-puppetmaster-01.tools.eqiad.wmflabs

2017-06-25

  • 15:13 madhuvishy: Restarted webservice on tools.fatameh

2017-06-24

  • 16:01 bd808: Created and provisioned elasticsearch password for tools.wmde-uca-test (T167971)

2017-06-23

  • 20:20 bd808: Reindexing various elasticsearch indexes created before we upgraded to v2.x
  • 20:19 bd808: Dropped garbage indexes in elasticsearch cluster

2017-06-22

  • 17:03 bd808: Rolled back attempt at Elasticsearch upgrade. Indices need to be rebuilt with 2.x before 5.x can be installed. T164842
  • 16:19 bd808: Backed up elasticsearch indexes to personal laptop using elasticdump in case T164842 goes horribly wrong
  • 00:12 bd808: Set ownership and permissions on $HOME/.kube for all tools (T165875)
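
One plausible shape for the $HOME/.kube fix above, as a loop over tool homes (a hedged sketch: the directory layout, ownership convention, and modes are assumptions, and the real fix may have gone through maintain-kubeusers instead):

    # assumed layout: each tool's home is /data/project/<tool>, owned by the tools.<tool> service user
    for home in /data/project/*/; do
        tool="tools.$(basename "$home")"
        [ -d "${home}.kube" ] || continue
        sudo chown -R "$tool:$tool" "${home}.kube"   # credentials owned by the tool account
        sudo chmod -R o-rwx "${home}.kube"           # and not readable by other users
    done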

2017-06-21

  • 17:43 andrewbogott: repooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 17:42 madhuvishy: Restarted webservice for openstack-browser
  • 17:36 andrewbogott: depooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 17:35 andrewbogott: repooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
  • 17:24 andrewbogott: depooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
  • 17:23 andrewbogott: repooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
  • 17:11 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
  • 17:10 andrewbogott: repooling tools-webgrid-lighttpd-1412, tools-exec-1423
  • 16:57 andrewbogott: depooling tools-webgrid-lighttpd-1412, tools-exec-1423
  • 16:53 andrewbogott: repooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
  • 16:52 andrewbogott: repooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 16:35 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 16:29 andrewbogott: depooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
  • 16:05 godog: delete pods for lolrrit-wm to force restart
  • 15:45 andrewbogott: repooling tools-exec-1422, tools-webgrid-lighttpd-1413
  • 15:41 andrewbogott: switching the proxy ip back to tools-proxy-02
  • 15:31 andrewbogott: temporarily pointing the tools-proxy IP to tools-proxy-01
  • 15:26 andrewbogott: depooling tools-exec-1422, tools-webgrid-lighttpd-1413
  • 15:12 andrewbogott: depooling tools-exec-1404, tools-exec-1434, tools-worker-1026
  • 15:10 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 14:53 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 14:52 andrewbogott: repooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
  • 14:37 andrewbogott: depooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
  • 14:32 andrewbogott: repooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
  • 14:20 andrewbogott: depooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
  • 14:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407
  • 13:56 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407

2017-06-14

  • 22:09 bd808: Restarted apache2 proc on tools-puppetmaster-02

2017-06-08

  • 18:14 madhuvishy: Also delete from /tmp on tools-webgrid-lighttpd-1411 xvfb-run.*, calibre_* and ws-*.epub
  • 18:10 madhuvishy: Delete ws-*.epub from /tmp on tools-webgrid-lighttpd-1426
  • 18:07 madhuvishy: Clean up space on /tmp on tools-webgrid-lighttpd-1426 by deleting temp files xvfb-run.* and calibre_1.25.0_tmp_* created by the wsexport tool

2017-06-07

  • 19:05 madhuvishy: Killed scp job run by user torin8 on tools-bastion-02

2017-06-06

  • 20:30 chasemp: rebooting tools-bastion-02 as unresponsive (up 76 days and lots of seemingly left behind things running)

2017-06-05

  • 23:44 bd808: Deleted tools.iabot crontab that somehow got locally installed on tools-exec-1412 on 2017-05-24T20:55Z
  • 22:15 bd808: Deleted tools.aibot crontab that somehow got locally installed on tools-exec-1436 on 2017-05-24T20:55Z
  • 19:55 andrewbogott: disabling puppet on tools-proxy-01 and -02 for a staged rollout of https://gerrit.wikimedia.org/r/#/c/350494/16

2017-06-01

  • 15:15 andrewbogott: depooling/rebooting/repooling tools-exec-1403 as part of old kernel-purge testing

2017-05-31

  • 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice v0.37 (T163355)
  • 19:24 bd808: Updating toollabs-webservice package via clush (T163355)
  • 19:16 bd808: Installed toollabs-webservice_0.37_all.deb from local file on tools-bastion-02 (T163355)
  • 16:34 andrewbogott: running 'apt-get -yq autoremove' env='{DEBIAN_FRONTEND: "noninteractive"}' on all instances with salt
  • 16:25 andrewbogott: rebooting tools-exec-1404 as part of a disk-space-saving test
  • 14:07 andrewbogott: migrating tools-exec-1409 to labvirt1009 to reduce CPU load on labvirt1006 (T165753)

2017-05-30

  • 22:32 andrewbogott: migrating tools-webgrid-lighttpd-1406, tools-exec-1410 from labvirt1006 to labvirt1009 to balance cpu usage
  • 18:15 andrewbogott: restarted robokobot virgule to free up leaked files
  • 17:36 andrewbogott: restarting excel2wiki to clean up file leaks
  • 17:36 andrewbogott: restarting idwiki-welcome in kenrick95bot to free up leaked files
  • 17:31 andrewbogott: restarting onetools to clean up file leaks
  • 17:29 andrewbogott: restarting ytcleaner webservice to clean up leaked files
  • 17:22 andrewbogott: restarting vltools to clean up leaked files
  • 17:20 madhuvishy: Uncordoned tools-worker-1006
  • 17:16 madhuvishy: Killed tool videoconvert on tools-exec-1440 in debugging labstore disk space issues
  • 17:15 madhuvishy: Drained and rebooted tools-worker-1006
  • 17:15 andrewbogott: restarted croptool to clean up stray files
  • 17:15 madhuvishy: depooled, rebooted, and repooled tools-exec-1412
  • 17:15 andrewbogott: restarted catmon tool to clean up stray files

2017-05-26

  • 20:32 bd808: Added tools-webgrid-lighttpd-14{19,2[0-8]} as submit hosts
  • 20:31 bd808: Added tools-webgrid-lighttpd-1412 and tools-webgrid-lighttpd-1413 as submit hosts
  • 20:28 bd808: sudo qconf -as tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs

2017-05-22

  • 07:49 chasemp: move ooooold shared resources into archive for later cleanup

2017-05-20

  • 09:27 madhuvishy: Truncating jerr.log for tool videoconvert since it's 967GB

2017-05-10

  • 19:11 bd808: Edited striker db record for user Stepan Grigoryev to detach SUL and Phab accounts. T164849
  • 17:47 bd808: Signed and revoked puppet certs generated when our DNS flipped out and gave hosts non-FQDN hostnames
  • 17:29 bd808: Fixed broken puppet cert on tools-package-builder-01

2017-05-04

  • 19:23 madhuvishy: Rebooting tools-grid-shadow
  • 16:21 madhuvishy: Start instance tools-grid-master.tools from horizon
  • 16:20 madhuvishy: Shut off tools-grid-master.tools instance from horizon
  • 16:16 madhuvishy: Stopped gridengine-shadow on tools-grid-shadow.tools (service gridengine-shadow stop and kill -9 individual shadowd processes)

2017-04-24

  • 15:33 bd808: Removed Gergő Tisza as a projectadmin for T163611; event done

2017-04-21

  • 22:30 bd808: Added Gergő Tisza as a projectadmin for T163611
  • 13:43 chasemp: T161898 clush -g all 'sudo puppet agent --disable "rollout nfs-mount-manager"'

2017-04-20

  • 17:15 bd808: Deleted shutdown VM tools-docker-builder-04; tools-docker-builder-05 is the new hotness
  • 17:11 bd808: kill -INT 19897 on tools-proxy-02 to stop a hung nginx child process left from the last graceful restart of nginx

2017-04-19

  • 15:10 bd808: apt-get install psmisc on tools-proxy-0[12]
  • 13:23 chasemp: stop docker on tools-proxy-01
  • 13:20 chasemp: clean up disk space on tools-proxy-01

2017-04-18

  • 20:37 bd808: Restarted bigbrother on tools-services-02
  • 04:23 bd808: Shutdown tools-docker-builder-04; will wait a bit before deleting
  • 04:04 bd808: Built and pushed new Docker images based on 82a46b4 (Refactor apt-get actions in Dockerfiles)
  • 03:42 bd808: Made tools-docker-builder-05.tools.eqiad.wmflabs the active docker build host
  • 01:01 bd808: Built instance tools-package-builder-01

2017-04-17

  • 20:41 bd808: Building tools-docker-builder-05
  • 19:35 chasemp: add reedy to sudo all perms so he can admin things
  • 17:21 andrewbogott: adding 8 more exec nodes: tools-exec-1435 through 1442

2017-04-11

  • 16:46 andrewbogott: added exec nodes tools-exec-1430, 31, 32, 33, 34.
  • 14:15 andrewbogott: emptied /srv/pbuilder to make space on tools-docker-04
  • 02:35 bd808: Restarted maintain-kubeusers on tools-k8s-master-01

2017-04-03

  • 13:48 chasemp: enable puppet on gridmaster

2017-04-01

  • 15:28 andrewbogott: added five new exec nodes, tools-exec-1425 through 1429
  • 14:26 chasemp: up nfs thresholds https://gerrit.wikimedia.org/r/#/c/345975/
  • 14:00 chasemp: disable puppet on tools-grid-master
  • 13:52 chasemp: tools-grid-master tc-setup clean
  • 13:40 chasemp: restart nscd and nslcd on tools-grid-master
  • 13:31 chasemp: reboot tools-exec-1420

2017-03-31

  • 22:25 yuvipanda: apt-get update && apt-get install kubernetes-node on tools-proxy-01 to upgrade kube-proxy systemd service unit

2017-03-30

  • 20:29 chasemp: stop grid-master temporarily & umount -fl project nfs & remount & start grid-master
  • 17:38 chasemp: reboot tools-exec-1401
  • 17:30 madhuvishy: Updating tools project hiera config to add role::labs::nfsclient::lookupcache: all via Horizon (T136712)
  • 17:29 madhuvishy: Disabled puppet across tools in prep for T136712

2017-03-27

  • 04:06 andrewbogott: erasing random log files on tools-proxy-01 to avoid filling the disk

2017-03-23

  • 20:38 andrewbogott: migrating tools-exec-1401 to labvirt1001
  • 19:56 andrewbogott: migrating tools-exec-1408 to labvirt1001
  • 19:02 andrewbogott: migrating tools-exec-1407 to labvirt1001
  • 16:37 andrewbogott: migrating tools-webgrid-lighttpd-1402 and 1407 to labvirt1001 (testing labvirt1001 and easing CPU load on labvirt1010)

2017-03-22

  • 13:48 andrewbogott: migrating tools-bastion-02 in 15 minutes

2017-03-21

  • 17:06 andrewbogott: moving tools-webgrid-lighttpd-1404 to labvirt1012 to ease pressure on labvirt1004
  • 16:19 andrewbogott: moving tools-exec-1406 to labvirt1011 to ease CPU usage on labvirt1004

2017-03-20

  • 22:47 yuvipanda: disable puppet on all k8s workers to test https://gerrit.wikimedia.org/r/#/c/343708/
  • 18:36 bd808: Applied openstack::clientlib on tools-checker-02 and forced puppet run
  • 18:03 bd808: Applied openstack::clientlib on tools-checker-01 and forced puppet run
  • 17:31 andrewbogott: migrating tools-exec-1417 to labvirt1013
  • 17:05 andrewbogott: migrating tools-webgrid-lighttpd-1410 to labvirt1012 to reduce load on labvirt1001
  • 16:42 andrewbogott: migrating tools-webgrid-generic-1404 to labvirt1011 to reduce load on labvirt1001
  • 16:13 andrewbogott: migrating tools-exec-1408 to labvirt1010 to reduce load on labvirt1001

2017-03-17

  • 17:24 andrewbogott: moving tools-webgrid-lighttpd-1416 to labvirt1013 to reduce load on labvirt1004
  • 17:15 andrewbogott: moving tools-exec-1424 to labvirt1012 to ease load on labvirt1004

2017-03-15

  • 19:21 andrewbogott: added new exec nodes: tools-exec-1421 and tools-exec-1422
  • 17:42 madhuvishy: Restarted stashbot
  • 17:29 chasemp: docker stop && rm -fR /var/lib/docker/* on worker-1001
  • 17:20 chasemp: test of logging
  • 16:11 chasemp: k8s master 'for h in `kubectl get nodes | grep worker | grep -v NotReady | grep -v Disabled | awk '{print $1}'`; do echo $h && kubectl drain --delete-local-data --force $h && sleep 10 ; done'
  • 16:08 chasemp: stop puppet on k8s master and drain nodes
  • 15:50 chasemp: (late) kill what appears to be an android emulator? unsure but it's eating all IO

2017-03-14

  • 21:24 bd808: Deleted tools-precise-dev (T160466)
  • 21:13 bd808: Removed non-existent tools-submit.eqiad.wmflabs from submit hosts list
  • 21:02 bd808: Deleted tools-exec-gift (T160461)
  • 20:45 bd808: Deleted tools-webgrid-lighttpd-12* nodes (T160442)
  • 20:29 bd808: Deleted tools-exec-12* nodes (T160457)
  • 20:27 bd808: Disassociated floating IPs from tools-exec-12* nodes (T160457)
  • 17:41 madhuvishy: Hand fix tools-puppetmaster by removing the old mariadb submodule directory
  • 17:23 madhuvishy: Remove role::toollabs::precise_reminder from tools-bastion-03
  • 15:40 bd808: Installing toollabs-webservice 0.36 across cluster using clush
  • 15:36 bd808: Upgraded toollabs-webservice to 0.36 on tools-bastion-02.tools
  • 15:25 bd808: Installing jobutils 1.21 across cluster using clush
  • 15:23 bd808: Installed jobutils 1.21 on tools-bastion-02
  • 15:03 bd808: Shutting down webservices running on Precise job grid nodes

2017-03-13

  • 21:12 valhallasw`cloud: tools-bastion-03: killed heavy unzip operation from staeiou, and heavy (inadvertent large file opening?) vim operation from steenth, as the entire server was blocked due to high i/o

2017-03-07

  • 17:59 andrewbogott: depooling, migrating tools-exec-1416 as part of ongoing labvirt1001 issues
  • 17:21 madhuvishy: tools-webgrid-lighttpd-1409 migrated to labvirt1011 and repooled
  • 16:31 madhuvishy: Depooled tools-webgrid-lighttpd-1409 for cold migrating to different labvirt

2017-03-06

  • 22:52 andrewbogott: migrating tools-webgrid-lighttpd-1411 to labvirt1011 to give labvirt1001 a break
  • 19:03 madhuvishy: Stopping webservice running on tool tree-of-life on author request
  • 18:25 yuvipanda: set complex_values slots=300,release=trusty for tools-exec-gift-trusty-01.tools.eqiad.wmflabs

2017-03-04

  • 23:47 madhuvishy: Added new k8s workers 1028, 1029

2017-02-28

  • 03:52 scfc_de: Deployed jobutils and misctools 1.20/1.20~precise+1 (T158722).

2017-02-27

  • 02:42 scfc_de: Purged misctools from instances where it was not puppetized.
  • 02:42 scfc_de: Deployed jobutils and misctools 1.19/1.19~precise+1 (T155787, T156886).

2017-02-17

  • 12:51 chasemp: create tools-exec-gift-trusty-01
  • 12:40 chasemp: create tools-exec-gift-trusty
  • 12:24 chasemp: mass apt-get clean and removal of some old .gz log files due to 30+ low space warnings

2017-02-15

  • 18:45 yuvipanda: clush a restart of nscd across all of tools
  • 00:01 bd808: Rebuilt python and python2 Docker images (T157744)

2017-02-08

  • 06:22 yuvipanda: drain tools-worker-1026 for docker upgrade
  • 05:28 yuvipanda: drain pods from tools-worker-1027.tools.eqiad.wmflabs for docker upgrade
  • 05:28 yuvipanda: disable puppet on all k8s nodes in preparation for docker upgrade

2017-02-07

  • 13:49 scfc_de: Deployed toollabs-webservice_0.33_all.deb (T156605, T156626).
  • 13:49 scfc_de: Deployed tools-manifest_0.11_all.deb.

2017-02-04

  • 02:13 yuvipanda: launch tools-worker-1027 to see if puppet works fine on first run!
  • 02:13 yuvipanda: reboot tools-worker-1026 to see if it comes up fine
  • 01:46 yuvipanda: launch tools-worker-1026

2017-02-03

  • 21:34 madhuvishy: Migrated over precise tools to trusty for user multichill (catbot, family, locator, multichill, nlwikibots, railways, wlmtrafo, wikidata-janitor)
  • 21:13 chasemp: reboot tools-bastion-03 as unresponsive

2017-02-02

  • 20:39 yuvipanda: import docker-engine 1.11.2 (currently running version) and 1.12.6 (latest version) into aptly
  • 00:06 madhuvishy: Remove user maximilianklein from tools.cite-o-meter (on request)

2017-01-30

  • 20:25 yuvipanda: sudo ln -s /usr/bin/kubectl /usr/local/bin/kubectl to temporarily fix webservice shell not working

2017-01-27

  • 19:22 chasemp: reboot tools-bastion-02 as it is having issues
  • 02:01 madhuvishy: Reenabled puppet on tools-checker-01
  • 00:29 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/

2017-01-26

  • 23:37 madhuvishy: reenabled puppet on tools-checker
  • 23:02 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/
  • 16:08 chasemp: major cleanup for stale var items on tools-exec-1221

2017-01-24

  • 18:14 andrewbogott: one last reboot of tools-mail
  • 18:00 andrewbogott: apt-get autoremove on tools-mail
  • 17:51 andrewbogott: rebooting tools-mail post upgrade
  • 17:19 andrewbogott: restarting tools-mail, beginning do-release-upgrade -d -q
  • 17:17 andrewbogott: backing up tools-mail to ~root/8c499e6e-1b79-4bb1-8f7f-72fee1f74ea5-backup on labvirt1009
  • 17:15 andrewbogott: stopping tools-mail, backing up, upgrading from precise to trusty
  • 15:49 yuvipanda: clush -g all 'sudo rm /usr/local/bin/kube*' to get rid of old kube related binaries
  • 14:42 yuvipanda: re-enable puppet on tools-proxy-01, test success on proxy-02
  • 14:37 yuvipanda: disable puppet on tools-proxy-01 (active proxy) to check deploying debianized kube-proxy on proxy-02
  • 13:52 yuvipanda: upgrading k8s on worker nodes to use debs + new k8s version
  • 13:52 yuvipanda: finished upgrading k8s + using debs
  • 12:49 yuvipanda: purge ancient kubectl, kube-apiserver, kube-controller-manager, kube-scheduler packages from tools-k8s-master-01, these were my old terrible packages

2017-01-23

  • 19:36 andrewbogott: temporarily shutting down tools-webgrid-lighttpd-1201
  • 19:35 yuvipanda: depool tools-webgrid-lighttpd-1201 for snapshotting tests
  • 17:13 chasemp: reboot tools-exec-1411 as having serious transient issues

2017-01-20

  • 15:58 yuvipanda: enabling puppet across all hosts
  • 15:36 yuvipanda: disable puppet everywhere to cherrypick patch moving base to a profile
  • 00:50 bd808: sudo qdel -f 1199218 to force delete a stuck toolschecker job

2017-01-17

2017-01-11

  • 22:09 chasemp: add Reedy to admin in tool labs (approved by bryan and chase for access to investigate specific tool abuse behavior)

2017-01-10

  • 19:05 madhuvishy: Killed 3 jobs from tools.arnaub that were causing high load on tools-exec-1411

2017-01-06

  • 19:02 bd808: Terminated deprecated instances tools-exec-121[2-6] (T154539)

2017-01-04

  • 02:43 madhuvishy: Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. T152369

2017-01-03

  • 23:56 bd808: Removed tools-exec-12[12-16] from gridengine (T154539)
  • 23:27 bd808: drained tools-exec-1216 (T154539)
  • 23:26 bd808: drained tools-exec-1215 (T154539)
  • 23:25 bd808: drained tools-exec-1214 (T154539)
  • 23:25 bd808: drained tools-exec-1213 (T154539)
  • 23:24 bd808: drained tools-exec-1212 (T154539)
  • 23:11 madhuvishy: Disabled puppet on tools-checker-01 (T152369)
  • 21:43 madhuvishy: Adding iptables rule to drop incoming connections from toolschecker on labservices1001
  • 20:51 madhuvishy: Adding iptables rule to block outgoing connections to labservices1001 on tools-checker-01
  • 20:43 madhuvishy: Silenced tools checker on icinga to test labservices1001 failure causing toolschecker to flake out T152369

2016-12-25

  • 00:28 yuvipanda: comment out cron running 'clean' script of avicbot every minute without -once
  • 00:28 yuvipanda: force delete all jobs of avicbot
  • 00:25 yuvipanda: delete all jobs of avicbot. This is 419 jobs
  • 00:20 yuvipanda: kill clean.sh process of avicbot

2016-12-19

  • 20:07 valhallasw`cloud: killed gps_exif_bot2.py (tools.gpsexif), was using 50MB/s io, lagging all of tools-bastion-03
  • 13:06 yuvipanda: run /usr/local/bin/deploy-master http://tools-docker-builder-03.tools.eqiad.wmflabs v1.3.3wmf1 on tools-k8s-master-01
  • 12:53 yuvipanda: cleaned out pbuilder from tools-docker-builder-01 to clean up

2016-12-17

  • 04:49 yuvipanda: turned on lookupcache again for bastions
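
lookupcache here is the NFS client mount option; "turning it on again" means going back from lookupcache=none to the cached behaviour. A sketch of how that looks on a bastion, reusing the umount-then-puppet pattern from the 2016-05-24 entry (the real change was puppetized, cf. the role::labs::nfsclient::lookupcache hiera key logged on 2017-03-30):

    # check whether lookupcache=none is still present in the mount options
    grep lookupcache /proc/mounts
    # unmount the share until it is gone, then let puppet remount it with the new options
    while sudo umount /home; do :; done && sudo puppet agent -t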

2016-12-15

  • 18:52 yuvipanda: reboot tools-exec-1204
  • 18:49 yuvipanda: reboot tools-webgrid-lighttpd-12[01-05]
  • 18:45 yuvipanda: reboot tools-exec-gift
  • 18:41 yuvipanda: reboot tools-exec-1217 to 1221
  • 18:30 yuvipanda: rebooted tools-exec-1212 to 1216
  • 14:55 yuvipanda: reboot tools-services-01

2016-12-14

  • 18:43 mutante: tools-bastion-03 - ran 'locale-gen ko_KR.EUC-KR' for T130532

2016-12-13

  • 20:54 chasemp: reboot bastion-03 as unresponsive

2016-12-09

  • 19:32 godog: upgrade / restart prometheus-node-exporter
  • 08:37 YuviPanda: run delete-dbusers and force replica.my.cnf creation for all tools that did not have it

2016-12-08

  • 18:48 YuviPanda: restarted toolschecker on tools-checker-01

2016-12-07

2016-12-06

  • 00:36 bd808: Updated toollabs-webservice to 0.31 on rest of cluster (T147350)

2016-12-05

  • 23:19 bd808: Updated toollabs-webservice to 0.31 on tools-bastion-02 (T147350)
  • 22:55 bd808: Updated jobutils to 1.17 on tools-mail (T147350)
  • 22:53 bd808: Updated jobutils to 1.17 on tools-precise-dev (T147350)
  • 22:53 bd808: Updated jobutils to 1.17 on tools-cron-01 (T147350)
  • 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-03 (T147350)
  • 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-02 (T147350)
  • 16:53 bd808: Terminated deprecated instances: "tools-exec-1201", "tools-exec-1202", "tools-exec-1203", "tools-exec-1205", "tools-exec-1206", "tools-exec-1207", "tools-exec-1208", "tools-exec-1209", "tools-exec-1210", "tools-exec-1211" (T151980)
  • 16:50 bd808: Released floating IPs from decommissioned tools-exec-12[01-11] instances

2016-11-30

  • 23:06 bd808: Removed tools-exec-12[01-11] from gridengine (T151980) (see the sketch after this list)
  • 22:54 bd808: Removed tools-exec-12[01-11] from @general hostgroup
  • 15:17 chasemp: restart coibot 'coibot.sh -o syslog.output -e syslog.errors -r yes'
  • 05:20 bd808: rescheduled continuous jobs on tools-exec-1210; 2 task queue jobs remain (T151980)
  • 05:18 bd808: drained tools-exec-1211 (T151980)
  • 05:14 bd808: drained tools-exec-1209 (T151980)
  • 05:13 bd808: drained tools-exec-1208 (T151980)
  • 05:12 bd808: drained tools-exec-1207 (T151980)
  • 05:10 bd808: drained tools-exec-1206 (T151980)
  • 05:07 bd808: drained tools-exec-1205 (T151980)
  • 05:04 bd808: drained tools-exec-1204 (T151980)
  • 05:00 bd808: drained tools-exec-1203 (T151980)
  • 05:00 bd808: drained tools-exec-1202 (T151980)
  • 04:58 bd808: disabled queues on tools-exec-1211 (T151980)
  • 04:58 bd808: disabled queues on tools-exec-1210 (T151980)
  • 04:58 bd808: disabled queues on tools-exec-1209 (T151980)
  • 04:57 bd808: disabled queues on tools-exec-1208 (T151980)
  • 04:57 bd808: disabled queues on tools-exec-1207 (T151980)
  • 04:57 bd808: disabled queues on tools-exec-1206 (T151980)
  • 04:56 bd808: disabled queues on tools-exec-1205 (T151980)
  • 04:56 bd808: disabled queues on tools-exec-1204 (T151980)
  • 04:56 bd808: disabled queues on tools-exec-1203 (T151980)
  • 04:55 bd808: disabled queues on tools-exec-1202 (T151980)
  • 04:52 bd808: drained tools-exec-1201 (T151980)
  • 04:48 bd808: draining tools-exec-1201
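
Per host, the decommissioning steps logged above (disable queues, drain, remove from the hostgroup and the grid) map onto gridengine commands roughly like this; a sketch only, reusing the qstat/xargs style from the 2017-07-20 entry, with the FQDN and exact flags as assumptions:

    HOST=tools-exec-1201.eqiad.wmflabs                   # repeated for each deprecated node
    qmod -d "*@$HOST"                                    # disable every queue instance on the host
    # reschedule the jobs still running there, then wait for the host to drain
    qstat -u '*' -q "*@$HOST" | tail -n +3 | awk '{print $1}' | xargs -r -L1 qmod -rj
    qconf -dattr hostgroup hostlist "$HOST" @general     # drop it from the @general hostgroup
    qconf -de "$HOST"                                    # remove the execution host definition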

2016-11-29

2016-11-22

  • 15:13 chasemp: re-add attr +i to replica.my.cnf; it seems to have gotten lost in the rsync migration

2016-11-21

  • 21:15 YuviPanda: disable puppet everywhere
  • 19:49 YuviPanda: restart all webservice jobs on gridengine to pick up logging again

2016-11-20

  • 06:51 Krenair: ran `qmod -rj lighttpd-admin` as tools.admin to try to get the main page back up, it worked briefly but then broke again

2016-11-16

  • 20:14 yuvipanda: upgrade toollabs-webservice to 0.30 on all webgrid nodes
  • 18:31 chasemp: reboot tools-exec-1404 (already depooled)
  • 18:19 chasemp: reboot tools-exec-1403
  • 17:23 chasemp: reboot tools-exec-1212 (converted via 321786 testing for recovery on boot)
  • 16:55 chasemp: clush -g all "puppet agent --disable 'trail run for changeset 321786 handling /var/lib/gridengine'"
  • 02:05 yuvipanda: rebooting tools-docker-registry-01, can't ssh in
  • 01:43 yuvipanda: cleanup old images on tools-docker-builder-03

2016-11-15

  • 19:52 chasemp: reboot tools-precise-dev
  • 05:20 yuvipanda: restart all k8s webservices too
  • 05:05 yuvipanda: restarting all webservices on gridengine
  • 03:21 chasemp: reboot tools-checker-01
  • 02:56 chasemp: reboot tools-exec-1405 to ensure noauto works (because atboot=>false is lies)
  • 02:31 chasemp: reboot tools-exec-1406

2016-11-14

  • 22:51 chasemp: shut down bastion 02 and 05 and make 03 root only
  • 19:35 madhuvishy: Stopped cron on tools-cron-01 (T146154)
  • 18:24 madhuvishy: Tools NFS is read-only. /data/project and /home across tools are ro T146154
  • 16:57 yuvipanda: stopped gridengine master
  • 16:47 yuvipanda: start restarting kubernetes webservice pods
  • 16:30 madhuvishy: Unmounted all nfs shares from tools-k8s-master-01 (sudo /usr/local/sbin/nfs-mount-manager clean) T146154
  • 16:22 yuvipanda: kill maintain-kubeusers on tools-k8s-master-01, sole process touching NFS
  • 16:22 chasemp: enable puppet and run on tools-services-01
  • 16:21 yuvipanda: restarting all webservice jobs, watching webservicewatcher logs on tools-services-02
  • 16:14 madhuvishy: Disabling puppet across tools T146154

2016-11-11

  • 20:49 madhuvishy: Dual mount of tools share complete. Puppet reenabled across tools hosts. T146154
  • 20:18 madhuvishy: Rolling out dual mount of tools share across all hosts T146154
  • 19:29 madhuvishy: Disabling puppet across tools to dual mount tools share from labstore-secondary T146154

2016-11-02

  • 18:23 yuvipanda: manually stop tools-grid-master for reboot
  • 17:42 yuvipanda: drain nodes from labvirt1012 and 13
  • 13:42 chasemp: depool tools-exec-1404 for maint

2016-11-01

  • 21:54 yuvipanda: stop gridengine-master on tools-grid-master in preparation for reboot
  • 21:34 yuvipanda: depool tools nodes on labvirt1012
  • 21:16 yuvipanda: depool things in labvirt1011
  • 20:58 yuvipanda: depool tools nodes on labvirt1010
  • 20:32 yuvipanda: depool tools things on labvirt1005 and 1009
  • 20:08 yuvipanda: depooled things on labvirt1006 and 1008
  • 19:51 yuvipanda: move tools-elastic-03 to labvirt1010, -02 already in 09
  • 19:34 yuvipanda: migrate tools-elastic-03 to labvirt1009
  • 19:10 yuvipanda: depooled tools nodes from labvirt1004 and 1007
  • 17:57 yuvipanda: depool exec nodes on labvirt1002
  • 13:27 chasemp: reboot tools-exec-1404 post depool for test

2016-10-31

  • 21:50 yuvipanda: deleted cyberbot queue with qconf -dq cyberbot
  • 21:44 yuvipanda: restarted cron on tools-cron-01

2016-10-30

  • 02:25 yuvipanda: restarted maintain-kubeusers

2016-10-29

  • 17:21 yuvipanda: depool tools-worker-1005

2016-10-28

  • 20:15 chasemp: restart prometheus service on tools-prometheus-01 to see if that wakes it up
  • 20:06 yuvipanda: restart kube-apiserver again, ran into too many open file handles
  • 15:58 Yuvi[m]: restart k8s master, seems to have run out of fds
  • 15:43 chasemp: restart toolschecker service on 01 and 02

2016-10-27

  • 21:09 godog: upgrade prometheus on tools-prometheus0[12]
  • 18:49 andrewbogott: rebooting tools-webgrid-lighttpd-1401
  • 13:51 chasemp: reboot tools-webgrid-generic-1403
  • 13:50 chasemp: reboot dockerbuilder-01

2016-10-26

  • 23:20 madhuvishy: Disabling puppet on tools proxy hosts for applying proxy health check endpoint T143638
  • 23:17 godog: upgrade prometheus on tools-prometheus-02
  • 16:52 bd808: Deployed jobutils_1.16_all.deb on tools-mail (default jsub target to trusty)
  • 16:50 bd808: Deployed jobutils_1.16_all.deb on tools-precise-dev (default jsub target to trusty)
  • 16:48 bd808: Deployed jobutils_1.16_all.deb on tools-bastion-02, tools-bastion-03, tools-cron-01 (default jsub target to trusty)

2016-10-25

2016-10-24

  • 03:45 Krenair: reset host keys for tools-puppetmaster-02 on -01, looks like it was recreated 5-6 days ago

2016-10-20

  • 16:55 yuvipanda: killed bzip2 taking 100% CPU on tools-bastion-03

2016-10-18

  • 22:56 Guest20046: flip tools-k8s-master-01 to tools-puppetmaster-02
  • 07:43 yuvipanda: move all tools webgrid nodes to tools-puppetmaster-02 too
  • 07:40 yuvipanda: complete moving all general tools exec nodes to tools-puppetmaster-02
  • 07:33 yuvipanda: restarted puppetmaster on tools-puppetmaster-01

2016-10-17

  • 14:37 chasemp: remove bdsync-deb and bdsync-deb-2, erroneously created in Tools and now defunct anyway
  • 14:05 chasemp: restart puppetmaster on tools-puppetmaster-01 (instances sticking on puppet runs for a long time)
  • 14:01 chasemp: reboot tools-exec-1215 and tools-exec-1410 as unresponsive

2016-10-14

  • 16:20 yuvipanda: repooled tools-worker-1012, seems to have recovered?!
  • 15:57 yuvipanda: drain tools-worker-1012, seems stuck

2016-10-10

  • 18:04 valhallasw`vecto: sudo service bigbrother restart @ tools-services-02

2016-10-09

  • 18:33 valhallasw`cloud: removed empty local crontabs for {yuvipanda, yuvipanda, tools.toolschecker} on {tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1204, tools-checker-01}. No other local crontabs remaining.

2016-10-05

  • 12:15 chasemp: reboot tools-webgrid-generic-1404 as locked up

2016-10-01

  • 10:03 yuvipanda: re-enable puppet on tools-checker-02

2016-09-29

  • 18:15 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs via wikitech; couldn't ssh in
  • 18:10 bd808: Investigating elasticsearch cluster issues affecting stashbot

2016-09-27

  • 08:07 chasemp: tools-bastion-03:~# chmod 640 /var/log/syslog

2016-09-25

  • 15:27 Krenair: restarted labs-logbot under tools.morebots

2016-09-21

  • 18:56 madhuvishy: Repooled tools-webgrid-lighttpd-1418 (T146212) after dns records cleanup
  • 18:42 madhuvishy: Repooled tools-webgrid-lighttpd-1416 (T146212) after dns records cleanup
  • 16:57 chasemp: reboot tools-webgrid-lighttpd-1407, tools-webgrid-lighttpd-1210, tools-webgrid-lighttpd-1414, and then tools-webgrid-lighttpd-1405 as the first 3 return

2016-09-20

  • 23:24 yuvipanda: depool tools-webgrid-lighttpd-1416 and 1418, they aren't in actual working order
  • 21:23 madhuvishy|food: Pooled new sge exec node tools-webgrid-lighttpd-1416 (T146212)
  • 21:17 madhuvishy|food: Pooled new sge exec node tools-webgrid-lighttpd-1415 (T146212)
  • 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1418 (T146212)
  • 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1416 (T146212)
  • 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1415 (T146212)
  • 17:58 andrewbogott: reboot tools-exec-1410
  • 17:54 yuvipanda: repool tools-webgrid-lighttpd-1412
  • 17:49 yuvipanda: webgrid-lighttpd-1412 hung on io (no change in nova diagnostics), rebooting
  • 17:33 yuvipanda: reboot tools-puppetmaster-01
  • 17:20 yuvipanda: reboot tools-checker-02
  • 15:42 chasemp: move floating ip from tools-checker-02 (failed) to tools-checker-01
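
The floating IP move above can be expressed with the OpenStack CLI roughly as follows (a sketch; the address is a placeholder and the actual move may have been done through Horizon or the nova client instead; the OS_TENANT_NAME=tools prefix matches the 2017-09-25 entry):

    FLOATING_IP=198.51.100.10   # placeholder address, not the real toolschecker IP
    OS_TENANT_NAME=tools openstack server remove floating ip tools-checker-02 "$FLOATING_IP"
    OS_TENANT_NAME=tools openstack server add floating ip tools-checker-01 "$FLOATING_IP"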

2016-09-13

  • 21:09 madhuvishy: Bumped proxy nginx worker_connections limit T143637
  • 21:08 madhuvishy: Reenabled puppet across proxy hosts
  • 20:44 madhuvishy: Disabling puppet across proxy hosts

2016-09-12

  • 18:33 bd808: Forcing puppet run on tools-cron-01
  • 18:31 bd808: Forcing puppet run on tools-bastion-03
  • 18:28 bd808: Forcing puppet run on tools-bastion-02
  • 18:26 bd808: Forcing puppet run on tools-precise-dev
  • 18:26 bd808: Built toollabs-webservice v0.27 package and added to aptly

2016-09-10

  • 01:06 yuvipanda: migrate tools-k8s-etcd-01 to labvirt1012; it is in a state where it is doing no io

2016-09-09

  • 19:27 yuvipanda: reboot tools-exec-1218 and 1219
  • 18:10 yuvipanda: killed massive grep running as root

2016-09-08

  • 21:49 bd808: forcing puppet runs to install toollabs-webservice_0.26_all.deb
  • 20:51 bd808: forcing puppet runs to install jobutils_1.15_all.deb

2016-09-07

  • 21:11 Krenair: brought labs/private.git up to date on tools-puppetmaster-01
  • 02:32 Krenair: ran `SULWatcher/restart_SULWatcher.sh` as `tools.stewardbots` on bastion-03 to fix T144887

2016-09-06

  • 22:14 yuvipanda: got pbuilder off tools-services-01, was taking up too much space.
  • 22:10 madhuvishy: Deleted instance tools-web-static-01 and tools-web-static-02 (T143637)
  • 21:45 yuvipanda: reboot tools-prometheus-02. nova diagnostics shows no vda activity.
  • 20:43 chasemp: drain and reboot tools-exec-1410 for testing
  • 07:32 yuvipanda: depooled tools-exec-1219 and 1218, seem to be unresponsive, causing jobs that appear to run but aren't really

2016-09-05

  • 16:27 andrewbogott: rebooting tools-cron-01 because it is hanging all over the place

2016-09-01

  • 05:19 yuvipanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck

2016-08-31

  • 20:48 madhuvishy: Reenabled puppet across tools hosts
  • 20:45 madhuvishy: Scratch migration complete on all grid exec nodes (T134896)
  • 19:36 madhuvishy: Scratch migration on all non exec/worker nodes complete (T134896)
  • 18:18 madhuvishy: Scratch migration complete for all k8s workers (T134896)
  • 17:50 madhuvishy: Reenabling puppet across tools hosts.
  • 16:55 madhuvishy: Rsync-ed over latest backup of /srv/scratch from labstore1001 to labstore1003
  • 16:50 madhuvishy: Puppet disabling complete (T134896)

2016-08-30

2016-08-29

  • 23:38 Krenair: added myself to the tools.admin service group earlier to try to figure out what was causing the outage, removed again now
  • 16:35 yuvipanda: run chmod u+x /data/project/framabot
  • 13:40 chasemp: restart jouncebot

2016-08-28

  • 05:34 bd808: After git gc on web-static-02.tools:/srv/cdnjs: /dev/mapper/vd-cdnjs--disk 61G 54G 3.3G 95% /srv
  • 05:25 bd808: sudo git gc --aggressive on tools-web-static-01.tools:/srv/cdnjs
  • 04:56 bd808: sudo git gc --aggressive on tools-web-static-02.tools:/srv/cdnjs

2016-08-26

  • 16:53 yuvipanda: migrate tools-static-02 to labvirt1001

2016-08-25

  • 18:07 yuvipanda: restart puppetmaster on tools-puppetmaster-01
  • 17:41 yuvipanda: depooled tools-webgrid-1413
  • 01:16 yuvipanda: restarted puppetmaster on tools-puppetmaster-01

2016-08-24

  • 23:03 chasemp: reboot tools-exec-1217
  • 17:25 yuvipanda: depool tools-exec-1217, it is dead/stuck/hung/io-starved

2016-08-23

2016-08-22

2016-08-20

  • 11:42 valhallasw`cloud: rebooting tools-mail (hanging)

2016-08-19

  • 14:52 chasemp: reboot 82323ee4-762e-4b1f-87a7-d7aa7afa22f6

2016-08-18

  • 20:00 yuvipanda: restarted maintain-kubeusers on tools-k8s-master-01

2016-08-15

  • 22:10 yuvipanda: depool tools-exec-1211 and 1205, seem to be out of action
  • 19:12 yuvipanda: kill unused tools-merlbot-proxy

2016-08-12

  • 20:39 yuvipanda: delete tools-webgrid-lighttpd-1415, enough webservices have moved to k8s from that queue
  • 20:37 yuvipanda: delete tools-logs-01, going to recreate with a smaller image
  • 20:36 yuvipanda: delete tools-webgrid-generic-1405, enough things have moved to k8s from that queue!
  • 20:10 yuvipanda: migration of tools-grid-master to labvirt1013 complete
  • 20:01 yuvipanda: migrating tools-grid-master (currently inactive) to labvirt1013 away from crowded 1010
  • 12:40 chasemp: tools.templatetransclusioncheck@tools-bastion-03:~$ webservice restart

2016-08-11

  • 20:13 yuvipanda: tools-grid-master finally stopped
  • 20:05 yuvipanda: disabled tools-webgrid-lighttpd-1202, is hung
  • 17:23 yuvipanda: instance being rebooted is tools-grid-master
  • 17:22 chasemp: reboot master via nova as it is stuck

2016-08-05

  • 19:29 paladox: adding tom29739 to lolrrit-wm project

2016-08-04

  • 19:09 yuvipanda: cleaned up nginx log files in tools-docker-registry-01 to fix free space warning
  • 00:19 yuvipanda: added Krenair as admin to help with T132225 and other issues.

2016-08-03

  • 22:48 yuvipanda: deleted tools-worker-1005
  • 22:08 yuvipanda: depool & delete tools-worker-1007 and 1008
  • 21:34 yuvipanda: rebooting tools-puppetmaster-01 to test a hypothesis
  • 21:10 yuvipanda: rebooting tools-puppetmaster-01 for kernel upgrade
  • 00:20 madhuvishy: Repooled nodes tools-worker 1012 and 1013 for T141126

2016-08-02

  • 22:49 yuvipanda: depooled tools-worker-1014 as well for T141126
  • 22:44 yuvipanda: depool tools-worker-1015 for T141126
  • 22:42 paladox: cherry picking 302617 onto lolrrit-wm
  • 22:41 madhuvishy: Depooling tools-worker 1012 and 1013 for T141126
  • 22:32 yuvipanda: added paladox to tools
  • 09:38 godog: bounce morebots production
  • 00:01 yuvipanda: depool tools-worker-1017 for T141126

2016-08-01

  • 23:48 madhuvishy: Repooled tools-worker-1011 and tools-worker-1018 (Yuvi) for T141126
  • 23:41 madhuvishy: Repooled tools-worker-1010 and tools-worker-1019 (Yuvi) for T141126
  • 23:21 madhuvishy: Yuvi is depooling tools-worker-1018 for T141126
  • 23:19 madhuvishy: Depooling tools-worker 1010 and 1011 for T141126
  • 23:17 madhuvishy: Yuvi depooled tools-worker-1019 for T141126
  • 23:06 madhuvishy: Added tools-worker-1022 as new k8s worker node
  • 23:06 madhuvishy: Repooled tools-worker-1009 (T141126)
  • 22:48 madhuvishy: Depooling tools-worker-1009 to prepare for T141126

2016-07-29

  • 22:04 YuviPanda: repooled tools-worker-1006
  • 21:48 YuviPanda: deleted tools-worker-1006 after depooling+draining
  • 21:45 YuviPanda: repool new tools-worker-1003 with direct-lvm docker storage backend
  • 21:30 YuviPanda: depool tools-worker-1003 to be recreated with new docker config, picking this because it's on a non-ssd host
  • 21:17 YuviPanda: depooled tools-worker-1020/21 after fixing them up
  • 20:41 YuviPanda: delete tools-worker-1001
  • 20:29 YuviPanda: depool tools-worker-1001, going to recreate with to test new puppet deploying-first-run
  • 20:26 YuviPanda: built new worker nodes tools-worker-1020 and 21 with direct-lvm storage backend
  • 17:48 YuviPanda: disable puppet on all tools k8s worker nodes

2016-07-25

  • 14:17 chasemp: nova reboot 64f01f90-c805-4a2e-9ed5-f523b909094e (grid master)

2016-07-23

  • 23:21 YuviPanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck on connecting to seaborgium preventing new tool creation
  • 01:56 YuviPanda: deploy kubernetes v1.3.3wmf1

2016-07-22

  • 17:30 YuviPanda: repool tools-worker-1018
  • 14:04 chasemp: reboot tools-worker-1015 as stuck w/ high iowait warning seconds ago. I cannot ssh in as root.

2016-07-21

  • 22:42 chasemp: reboot tools-worker-1018 as stuck T141017

2016-07-20

  • 21:27 andrewbogott: rebooting tools-k8s-etcd-01
  • 11:14 Guest9334: rebooted tools-worker-1004

2016-07-19

  • 01:06 bd808: Upgraded Elasticsearch on tools-elastic-* to 2.3.4

2016-07-18

  • 21:50 YuviPanda: force downgrade hhvm on tools-webgrid-lighttpd-1408 to fix puppet issues
  • 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm on tools-worker-1004
  • 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm
  • 21:37 YuviPanda: killed tools-pastion-01, no longer in use
  • 20:59 bd808: Disabled puppet on tools-elastic-0[123]. Elasticsearch needs to be upgraded.
  • 15:15 YuviPanda: kill 8807036 for Luke081515
  • 12:48 YuviPanda: reboot tools-flannel-etcd-03 for T140256
  • 12:41 YuviPanda: reboot tools-k8s-etcd-02 for T140256

2016-07-15

  • 10:24 yuvipanda: depool tools-exec-1402 for T138447
  • 10:24 yuvipanda: reboot tools-exec-1402 for T138447
  • 10:16 yuvipanda: depooling tools-webgrid-lighttpd-1402 and -1412 since they seem to be suffering from T138447
  • 10:08 yuvipanda: reboot tools-webgrid-lighttpd-1402 and 1412

2016-07-14

  • 23:12 bd808: Added Madhuvishy to project "roots" sudoer list
  • 22:58 bd808: Added Madhuvishy as projectadmin
  • 21:25 chasemp: change perms for tools.readmore to correct bot

2016-07-13

  • 11:40 yuvipanda: cold-migrate tools-worker-1014 off labvirt1010 to see if that improves the ksoftirqd situation
  • 11:19 yuvipanda: drained tools-worker-1004 - high ksoftirqd usage even with no load
  • 11:13 yuvipanda: depool tools-worker-1014 - unusable, totally in iowait
  • 11:13 yuvipanda: reboot tools-worker-1004, was unresponsive

2016-07-12

  • 18:07 yuvipanda: reboot tools-worker-1012, it seems to have failed LDAP connectivity :|

2016-07-08

  • 12:38 yuvipanda: starting up tools-web-static-02 again

2016-07-07

  • 12:45 yuvipanda: start deployment of k8s 1.3.0wmf4 for T139259

2016-07-06

  • 13:09 yuvipanda: associated a floating IP with tools-k8s-master-01 for T139461
  • 11:47 yuvipanda: moved tools-checker-0[12] to use tools-puppetmaster-01 as puppetmaster so they get appropriate CA for use when talking to kubernetes API

2016-07-04

  • 11:13 yuvipanda: delete tools-prometheus-01 to free up resources on labvirt1010
  • 11:11 yuvipanda: actually deleted instance tools-cron-02 to free up resources on labvirt1010 - was large and not currently used, and failover process takes a while anyway, so we can recreate if needed
  • 11:11 yuvipanda: stopped instance tools-cron-02 to free up some resources on labvirt1010

2016-07-03

  • 17:09 yuvipanda: run qstat -u '*' | grep 'dr ' | awk '{ print $1;}' | xargs -L1 qdel -f to clean out jobs stuck in dr state
  • 16:59 yuvipanda: migrate tools-web-static-02 to labvirt1011 to provide more breathing room
  • 16:56 yuvipanda: delete temp-test-trusty-package to provide more breathing room on labvirt1010
  • 13:49 yuvipanda: reboot tools-exec-1219
  • 13:37 yuvipanda: migrating tools-exec-1216 to labvirt1011
  • 13:07 yuvipanda: delete tools-bastion-01 which was shut down anyway
  • 13:04 yuvipanda: attempt to reboot tools-exec-1212

2016-06-28

  • 15:25 bd808: Signed client cert for tools-worker-1019.tools.eqiad.wmflabs on tools-puppetmaster-01.tools.eqiad.wmflabs

2016-06-21

  • 16:49 bd808: Updated jobutils to v1.14 for T138178

2016-06-17

  • 06:17 yuvipanda: forced deletion of 7033590 for dykbot for shubinator

2016-06-08

  • 20:31 yuvipanda: start tools-bastion-03, which was stuck in 'stopped' state
  • 20:31 yuvipanda: reboot tools-bastion-03

2016-05-31

  • 17:35 valhallasw`cloud: re-enabled queues on tools-exec-1407, tools-exec-1216, tools-exec-1219
  • 13:13 chasemp: reboot of tools-exec-1203 see T136495 all jobs seem gone now

2016-05-30

2016-05-29

  • 18:58 YuviPanda: deleted tools-k8s-bastion-01 for T136496
  • 14:29 valhallasw`cloud: chowned /data/project/xtools-mab-dev to root and back to stop rogue process that was writing to the directory. I'm still not sure where that process was running, but at least this seems to have solved the issue

2016-05-28

  • 21:52 valhallasw`cloud: rebooted tools-webgrid-lighttpd-1408, tools-pastion-01, tools-exec-1205
  • 21:21 valhallasw`cloud: rebooting tools-exec-1204 (T136495)

2016-05-27

  • 14:45 YuviPanda: start moving tools-bastion-03 to use tools-puppetmaster-01 as puppetmaster

2016-05-25

  • 20:15 YuviPanda: deleted tools-bastion-mtemp per chasemp
  • 19:43 YuviPanda: delete devpi instance, not currently in use
  • 19:39 YuviPanda: run sudo dpkg --configure -a on tools-worker-1007 to get it unstuck
  • 19:19 YuviPanda: deleted tools-docker-builder-01 and -02, hosed hosts that are unused
  • 17:18 YuviPanda: fixed hhvm upgrade on tools-cron-01
  • 07:19 YuviPanda: hard reboot tools-services-01, was completely stuck on /public/dumps
  • 06:06 bd808: Restarting all webservice jobs
  • 05:33 andrewbogott: rebooting tools-proxy-02

2016-05-24

  • 01:36 scfc_de: tools-cron-02: Downgraded hhvm (sudo apt-get install hhvm).
  • 01:36 scfc_de: tools-bastion-03, tools-checker-01, tools-cron-02, tools-exec-1202, tools-proxy-02, tools-redis-1001: Remounted /public/dumps read-only (while sudo umount /public/dumps; do :; done && sudo puppet agent -t).

2016-05-23

  • 19:36 YuviPanda: switched tools-checker to tools-checker-03
  • 16:33 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs
  • 13:28 chasemp: 'apt-get install hhvm -y --force-yes' across trusty hosts to handle hhvm downgrade

2016-05-20

  • 23:39 bd808: Forced puppet run on bastion-02 & bastion-05 to apply fix for T135861
  • 19:47 chasemp: tools-exec-1406 having issues rebooting

2016-05-19

  • 21:07 bd808: deployed jobutils 1.13 on bastions; now with '-l release=...' validation!
  • 15:43 YuviPanda: rebooting all tools worker instances
  • 13:12 chasemp: reboot tools-exec-1220, stuck in a state of unresponsiveness

2016-05-13

  • 00:40 YuviPanda: cleared all queues that were in error state

2016-05-12

  • 22:59 YuviPanda: restart tools-worker-1004 to attempt bringing it back up
  • 22:59 YuviPanda: deploy k8s 1.2.4wmf1 on all proxy nodes
  • 22:58 YuviPanda: deploy k8s on all worker nodes
  • 22:46 YuviPanda: deploy k8s master for 1.2.4wmf1

2016-05-10

  • 04:25 bd808: Added role::package::builder to tools-services-01

2016-05-09

  • 04:33 YuviPanda: reboot tools-worker-1004, lots of ksoftirqd stuckness despite no actual containers running

2016-05-08

  • 07:06 YuviPanda: restarted admin tool

2016-05-05

2016-04-28

  • 04:15 YuviPanda: delete half of the trusty webservice jobs
  • 04:00 YuviPanda: deleted all precise webservice jobs, waiting for webservicemonitor to bring them back up

2016-04-24

  • 12:22 YuviPanda: force deleted job 5435259 from pbbot per PeterBowman

2016-04-11

  • 14:20 andrewbogott: moving tools-bastion-mtemp to labvirt1009

2016-04-06

  • 15:20 bd808: Removed local hack for T131906 from tools-puppetmaster-01

2016-04-05

  • 21:24 bd808: Committed local hack on tools-puppetmaster-01 to get elasticsearch working again
  • 21:02 bd808: Forcing puppet runs to fix elasticsearch
  • 20:39 bd808: Elasticsearch processes down. Looks like a prod puppet change that needs tweaking for tool labs

2016-04-04

  • 19:43 YuviPanda: new bastion!
  • 19:15 chasemp: reboot tools-bastion-05

2016-03-30

  • 15:50 andrewbogott: rebooting tools-proxy-01 in hopes of clearing some bad caches

2016-03-28

  • 20:51 yuvipanda: lifted RAM quota from 900Gigs to 1TB?!
  • 20:30 chasemp: changed permissions on grant files from create-dbusers to chmod 400 and chattr +i

2016-03-27

  • 17:40 scfc_de: tools-webgrid-generic-1405, tools-webgrid-lighttpd-1411, tools-web-static-01, tools-web-static-02: "apt-get install cloud-init" and accepted changes for /etc/cloud/cloud.cfg (users: + default; cloud_config_modules: + ssh-import-id, + puppet, + chef, + salt-minion; system_info/package_mirrors/arches[i386, amd64]/search/primary: + http://%(region)s.clouds.archive.ubuntu.com/ubuntu/).

2016-03-18

  • 15:47 chasemp: had to kill stalkboten as it was logging constant errors filling logs to the tune of hundreds of gigs
  • 15:36 chasemp: cleanup huge log collection for broken bot: /srv/project/tools/project/betacommand-dev/tspywiki/irc/logs# rm -fR SpamBotLog.log\.*

2016-03-11

  • 20:57 mutante: reverted font changes - puppet runs recovering
  • 20:37 mutante: more puppet issues due to font dependencies on trusty, on it
  • 19:39 mutante: should a tools-exec server be influenced by font packages on an mw appserver?
  • 19:39 mutante: fixed puppet runs on tools-exec (gerrit 276792)

2016-03-02

  • 14:56 chasemp: qdel 3956069 and 3758653 for abusing auth

2016-02-29

  • 21:49 scfc_de: tools-exec-1218: rm -f /usr/local/lib/nagios/plugins/check_eth to work around "Got passed new contents for sum" (https://tickets.puppetlabs.com/browse/PUP-1334).
  • 21:20 scfc_de: tools-exec-1209: rm -f /var/lib/puppet/state/agent_catalog_run.lock (no Puppet process running, probably from the reboots).
  • 20:58 scfc_de: Ran "dpkg --configure -a" on all instances.
  • 13:50 scfc_de: Deployed jobutils/misctools 1.10.

2016-02-28

  • 20:08 bd808: Removed unwanted NFS mounts from tools-elastic-01.tools.eqiad.wmflabs

2016-02-26

  • 19:08 bd808: Upgraded Elasticsearch on tools-elastic-0[123] to 1.7.5

2016-02-25

  • 21:43 scfc_de: Deployed jobutils/misctools 1.9.

2016-02-24

2016-02-22

  • 15:55 andrewbogott: redirecting tools-login.wmflabs.org to tools-bastion-05

2016-02-19

  • 15:58 chasemp: re-rolled out the tools NFS shaping pilot for sanity, in anticipation of formalization
  • 09:21 _joe_: killed cluebot3 instance on tools-exec-1207, writing 20 M/s to the error log
  • 00:50 yuvipanda: failover services to services-02

2016-02-18

  • 20:37 yuvipanda: failover proxy back to tools-proxy-01
  • 19:46 chasemp: repool labvirt1003 and depool labvirt1004
  • 18:19 chasemp: draining nodes from labvirt1001

2016-02-16

  • 21:33 chasemp: reboot of bastion-1002

2016-02-12

  • 19:56 chasemp: nfs traffic shaping pilot round 2

2016-02-05

  • 22:01 chasemp: throttle some vm nfs write speeds
  • 16:49 scfc_de: find /data/project/wikidata-edits -group ssh-key-ldap-lookup -exec chgrp tools.wikidata-edits \{\} + (probably a remnant of the work on ssh-key-ldap-lookup last summer).
  • 16:45 scfc_de: Removed /data/project/test300 (uid/gid 52080; none of them resolves, no databases, just an unmodified pywikipedia clone inside).

2016-02-03

  • 03:00 YuviPanda: upgraded flannel on all hosts running it

2016-01-31

  • 20:01 scfc_de: tools-webgrid-generic-1405: Rebooted via wikitech; rebooting via "shutdown -r now" did not seem to work.
  • 18:51 bd808: tools-elastic-01.tools.eqiad.wmflabs console shows blocked tasks, possible kernel bug?
  • 18:49 bd808: tools-elastic-01.tools.eqiad.wmflabs not responsive to ssh or Elasticsearch requests; rebooting via wikitech interface
  • 13:32 hashar: restarted qamorebot

2016-01-30

  • 06:38 scfc_de: tools-webgrid-generic-1405: Rebooted for load ~ 175 and lots of processes stuck in D.

2016-01-29

  • 21:25 YuviPanda: restarted image-resize-calc manually, no service.manifest file

2016-01-28

  • 15:02 scfc_de: tools-cron-01: Rebooted via wikitech as "shutdown -r now" => "@sbin/plymouthd --mode=shutdown" => "/bin/sh -e /proc/self/fd/9" => "/bin/sh /etc/init.d/rc 6" => "/bin/sh /etc/rc6.d/S20sendsigs stop" => "sync" stuck in D. *argl*
  • 14:56 scfc_de: tools-cron-01: Rebooted due to high number of processes stuck in D and load >> 100.
  • 14:54 scfc_de: tools-cron-01: HUPped 43 processes wikitrends/refresh.sh, though a lot of all processes seem to be stuck in D, so I'll reboot this instance.
  • 14:50 scfc_de: tools-cron-01: HUPped 85 processes /usr/lib/php5/sessionclean.

2016-01-27

  • 23:07 YuviPanda: removed all members of templatetiger, added self instead, removed active shell sessions
  • 20:24 chasemp: master stop, truncate accounting log to accounting.01272016, master start
  • 19:34 chasemp: started master on grid master
  • 19:23 chasemp: stopped master
  • 19:11 YuviPanda: depooled tools-webgrid-1405 to prep for restart, lots of stuck processes
  • 18:29 valhallasw`cloud: job 2551539 is ifttt, which is also running as 2700629. Killing 2551539 .
  • 18:26 valhallasw`cloud: messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate
  • 18:24 valhallasw`cloud: 'sleep' test job also seems to work without issues
  • 18:23 valhallasw`cloud: no errors in log file, qstat works
  • 18:23 chasemp: sge master restarted after the dump and reload of the jobs db (sketched at the end of this section)
  • 18:22 valhallasw`cloud: messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016'
  • 18:20 chasemp: master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job
  • 18:19 valhallasw`cloud: dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M
  • 18:17 valhallasw`cloud: SGE Configuration successfully saved to /root/sge_maint_01272016 directory.
  • 18:14 chasemp: grid master stopped
  • 00:56 scfc_de: Deployed admin/www bde15df..12a3586.
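
Note: the maintenance above amounts to stopping the grid master, dumping the Berkeley DB job spool, reloading it, and starting the master again. A rough sketch; the db_dump invocation and the assumption that this runs from the master's BDB spool directory are not in the log, only the db_load call and the file names are:

    # Stop the master before touching its spool
    sudo service gridengine-master stop

    # Dump and reload the jobs database (Berkeley DB spooling); file name as logged above
    db_dump -f /root/sge_maint_pre_jobs_dump_01272016 sge_job
    db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job

    # Bring the master back
    sudo service gridengine-master start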

2016-01-26

  • 21:28 YuviPanda: qstat -u '*' | grep E | awk '{print $1}' | xargs -L1 qmod -cj
  • 21:16 chasemp: reboot tools-exec-1217.tools.eqiad.wmflabs

2016-01-25

  • 20:30 YuviPanda: switched over cron host to tools-cron-01, manually copied all old cron files from tools-submit to tools-cron-01
  • 19:06 chasemp: kill python merge/merge-unique.py tools-exec-1213 as it seemed to be overwhelming nfs
  • 17:07 scfc_de: Deployed admin/www at bde15df2a379c33edfb8350afd2f0c7186705a93.

2016-01-23

  • 15:49 scfc_de: Removed remnant send_puppet_failure_emails cron entries except from unreachable hosts sacrificial-kitten, tools-worker-06 and tools-worker-1003.

2016-01-21

  • 22:24 YuviPanda: deleted tools-redis-01 and -02 (are on 1001 and 1002 now)
  • 21:13 YuviPanda: repooled exec nodes on labvirt1010
  • 21:08 YuviPanda: gridengine-master started, verified shadow hasn't started
  • 21:00 YuviPanda: stop gridengine master
  • 20:51 YuviPanda: repooled exec nodes on labvirt1007 (correction to the previous message)
  • 20:51 YuviPanda: repooled exec nodes on labvirt1006
  • 20:39 YuviPanda: failover tools-static to tools-web-static-01
  • 20:38 YuviPanda: failover tools-checker to tools-checker-01
  • 20:32 YuviPanda: depooled exec nodes on 1007
  • 20:32 YuviPanda: repooled exec nodes on 1006
  • 20:14 YuviPanda: depooled all exec nodes in labvirt1006
  • 20:11 YuviPanda: repooled exec nodes on 1005
  • 19:53 YuviPanda: depooled exec nodes on labvirt1005
  • 19:49 YuviPanda: repooled exec nodes from labvirt1004
  • 19:48 YuviPanda: failed over proxy to tools-proxy-01 again
  • 19:31 YuviPanda: depooled exec nodes from labvirt1004
  • 19:29 YuviPanda: repooled exec nodes from labvirt1003
  • 19:13 YuviPanda: depooled instances on labvirt1003
  • 19:06 YuviPanda: re-enabled queues on exec nodes that were on labvirt1002
  • 19:02 YuviPanda: failed over tools proxy to tools-proxy-02
  • 18:46 YuviPanda: drained and disabled queues on all nodes on labvirt1002
  • 18:38 YuviPanda: restarted all restartable jobs on instances on labvirt1001 and deleted all non-restartable ghost jobs; these were already dead (depool/repool commands sketched below)
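
The per-node depool/repool cycle above boils down to a few qmod calls. A minimal sketch; the hostname is a placeholder:

    # Depool: disable every queue instance on the node so nothing new is scheduled there
    sudo qmod -d '*@tools-exec-1201.eqiad.wmflabs'

    # Move restartable jobs off the node (non-restartable ones have to finish or be qdel'd)
    sudo qmod -rq '*@tools-exec-1201.eqiad.wmflabs'

    # Repool once the underlying labvirt host is back
    sudo qmod -e '*@tools-exec-1201.eqiad.wmflabs'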

2016-01-12

  • 09:48 scfc_de: tools-checker-01: Removed exim paniclog (OOM).

2016-01-11

  • 22:19 valhallasw`cloud: reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_adjustment_decay_time -> 0:7:30 (see the qconf sketch at the end of this section)
  • 22:12 YuviPanda: restarted gridengine master again
  • 22:07 valhallasw`cloud: set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0
  • 22:05 valhallasw`cloud: set maxujobs back to 0, but doesn't help
  • 21:57 valhallasw`cloud: reset to 7:30
  • 21:57 valhallasw`cloud: that cleared the measure, but jobs still not starting. Ugh!
  • 21:56 valhallasw`cloud: set job_load_adjustments_decay_time = 0:0:0
  • 21:45 YuviPanda: restarted gridengine master
  • 21:43 valhallasw`cloud: qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting
  • 21:42 valhallasw`cloud: resetting to 0:7:30, as it's not having the intended effect
  • 21:41 valhallasw`cloud: currently 353 jobs in qw state
  • 21:40 valhallasw`cloud: that's load_adjustment_decay_time
  • 21:40 valhallasw`cloud: temporarily sudo qconf -msconf to 0:0:1
  • 19:59 YuviPanda: Set maxujobs (max concurrent jobs per user) on gridengine to 128
  • 17:51 YuviPanda: kill all queries running on labsdb1003
  • 17:20 YuviPanda: stopped webservice for quentinv57-tools
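
The parameters juggled above (maxujobs, job_load_adjustments, load_adjustment_decay_time) live in the gridengine scheduler configuration. A minimal sketch of the qconf calls involved, with the values taken from the entries above:

    # Show the current scheduler configuration
    qconf -ssconf

    # Edit it interactively; the entries changed above were:
    #   maxujobs                    128               (0 = unlimited concurrent jobs per user)
    #   job_load_adjustments        np_load_avg=0.50  (or NONE to disable)
    #   load_adjustment_decay_time  0:7:30
    sudo qconf -msconf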

2016-01-09

  • 21:07 valhallasw`cloud: moved tools-checker/208.80.155.229 back to tools-checker-01
  • 21:02 andrewbogott: rebooting tools-checker-01 as it is unresponsive.
  • 13:12 valhallasw`cloud: tools-worker-1002. is unresponsive. Maybe that's where the other grrrit-wm is hiding? Rebooting.

2016-01-08

2015-12-30

  • 04:06 YuviPanda: delete all webgrid jobs to start with a clean slate
  • 03:54 YuviPanda: qmod -rj all tools in the continuous queue, they are all orphaned
  • 02:39 YuviPanda: remove lbenedix and ebekebe from tools.hcclab
  • 00:40 YuviPanda: restarted master on grid-master
  • 00:40 YuviPanda: copied and cleaned out spooldb
  • 00:10 YuviPanda: reboot tools-grid-shadow
  • 00:08 YuviPanda: attempt to stop shadowd
  • 00:03 YuviPanda: attempting to start gridengine-master on tools-grid-shadow
  • 00:00 YuviPanda: kill -9'd gridengine master

2015-12-29

  • 23:31 YuviPanda: rebooting tools-grid-master
  • 23:22 YuviPanda: restart gridengine-master on tools-grid-master
  • 00:18 YuviPanda: shut down redis on tools-redis-01

2015-12-28

  • 22:34 chasemp: attempt to unmount nfs volumes on tools-redis-01 to debug but it hangs (I am on console and see root at console hang on login)
  • 22:31 YuviPanda: disable NFS on tools-redis-1001 and 1002
  • 21:32 YuviPanda: disable puppet on tools-redis-01 and -02
  • 21:27 YuviPanda: created tools-redis-1001

2015-12-23

  • 21:21 YuviPanda: deleted tools-worker-01 to -05, creating tools-worker-1001 to 1005
  • 21:19 valhallasw`cloud: tools-proxy-01: umount /home /data/project /data/scratch /public/dumps
  • 19:01 valhallasw`cloud: ah, connections that are kept open. A new incognito window is routed correctly.
  • 18:59 valhallasw`cloud: switched to -02, worked correctly, switched back. Switching back does not seem to fully work?!
  • 18:40 valhallasw`cloud: scratch that, first going to eat dinner
  • 18:38 valhallasw`cloud: dynamicproxy ban system deployed on tools-proxy-02 working correctly for localhost; switching over users there by moving the external IP.
  • 14:42 valhallasw`cloud: toollabs homepage is unhappy because tools.xtools-articleinfo is using a lot of cpu on tools-webgrid-lighttpd-1409. Checking to see what's happening there.
  • 10:46 YuviPanda: migrate tools-worker-01 to 3.19 kernel

2015-12-22

  • 18:30 YuviPanda: rescheduling all webservices
  • 18:17 YuviPanda: failed over active proxy to proxy-01
  • 18:12 YuviPanda: upgraded kernel and rebooted tools-proxy-01
  • 01:42 YuviPanda: rebooting tools-worker-08

2015-12-21

  • 18:44 YuviPanda: reboot tools-proxy-01
  • 18:31 YuviPanda: failover proxy to tools-proxy-02

2015-12-20

  • 00:00 YuviPanda: tools-worker-08 stuck again :|

2015-12-18

  • 15:16 andrewbogott: rebooting locked up host tools-exec-1409

2015-12-16

  • 23:14 andrewbogott: rebooting tools-exec-1407, unresponsive
  • 22:48 YuviPanda: run qmod -c '*' to clear error state on gridengine
  • 21:28 andrewbogott: deleted tools-docker-registry-01
  • 16:24 andrewbogott: rebooting tools-exec-1221 as it was in kernel lockup

2015-12-12

  • 10:08 YuviPanda: restarted cron on tools-submit

2015-12-10

  • 12:47 valhallasw`cloud: broke tools-proxy-02 login (for valhallasw, root still works) by restarting nslcd. Restarting; current proxy is -01.

2015-12-07

  • 13:46 Coren: The new grid masters are happy, killing the old ones (-shadow, -master)
  • 10:46 YuviPanda: restarted nscd on tools-proxy-01

2015-12-06

  • 10:29 YuviPanda: did webservice start on tool 'derivative', was missing service.manifest

2015-12-04

  • 19:33 Coren: switching master role to tools-grid-master
  • 04:42 yuvipanda: disabled puppet on tools-puppetmaster-01 because everything sucks
  • 04:09 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/256618 to tools-puppetmaster-01

2015-12-02

  • 18:29 Coren: switching gridmaster activity to tools-grid-shadow
  • 05:13 yuvipanda: increased security groups quota to 50 because why not

2015-12-01

  • 21:07 yuvipanda: added bd808 as admin
  • 21:01 andrewbogott: deleted tool/service group tools.test300

2015-11-25

  • 15:42 Coren: migrating tools-web-static-02 to labvirt1010 to free space on labvirt1002

2015-11-20

  • 22:02 Coren: tools-webgrid-lighttpd-1412 tools-webgrid-lighttpd-1413 tools-webgrid-lighttpd-1414 tools-webgrid-lighttpd-1415 done and back in rotation.
  • 21:46 Coren: tools-webgrid-lighttpd-1411 tools-webgrid-lighttpd-1211 done and back in rotation.
  • 21:30 Coren: tools-webgrid-lighttpd-1410 tools-webgrid-lighttpd-1210 done and back in rotation.
  • 21:25 Coren: tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1209 done and back in rotation.
  • 21:13 Coren: tools-webgrid-lighttpd-1408 tools-webgrid-lighttpd-1208 done and back in rotation.
  • 20:58 Coren: tools-webgrid-lighttpd-1407 tools-webgrid-lighttpd-1207 done and back in rotation.
  • 20:53 Coren: tools-webgrid-lighttpd-1406 tools-webgrid-lighttpd-1206 done and back in rotation.
  • 20:41 Coren: tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1205 tools-webgrid-generic-1405 done and back in rotation.
  • 20:28 Coren: tools-webgrid-lighttpd-1404 tools-webgrid-lighttpd-1204 tools-webgrid-generic-1404 done and back in rotation.
  • 19:49 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1203 tools-webgrid-generic-1403
  • 19:25 Coren: -lighttpd-1403 wants a restart.
  • 19:15 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1202 tools-webgrid-generic-1402
  • 18:55 Coren: Putting -lighttpd-1401 -lighttpd-1201 -generic-1401 back in rotation, disabling the others.
  • 18:24 Coren: Beginning draining web nodes; -lighttpd-1401 -lighttpd-1201 -generic-1401
  • 18:10 Coren: disabling puppet on the grid nodes listed at https://phabricator.wikimedia.org/P2337 so that the /tmp change in https://gerrit.wikimedia.org/r/#/c/252506/ do not apply early and break services

2015-11-17

  • 19:39 YuviPanda: created tools-worker-03 to be k8s worker node
  • 19:34 YuviPanda: blanked 'realm' for tools-bastion-01 to figure out what happens

2015-11-16

2015-11-03

  • 03:59 scfc_de: tools-submit, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411: Removed exim paniclog (OOM).

2015-11-02

  • 22:57 YuviPanda: pooled tools-webgrid-lighttpd-1413
  • 22:10 YuviPanda: created tools-webgrid-lighttpd-1414 and 1415
  • 22:04 YuviPanda: created tools-webgrid-lighttpd-1412 and 1413
  • 19:53 YuviPanda: drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
  • 19:50 YuviPanda: drain webgrid-lighttpd-1408 of jobs

2015-10-26

  • 20:53 YuviPanda: updated 6.9 ssh backport to all trusty hosts

2015-10-11

  • 22:54 yuvipanda: delete service.manifest for tool wikiviz to prevent it from attempting to be started. It set itself up for nodejs but didn't actually have any code

2015-10-09

2015-10-06

  • 04:35 yuvipanda: created tools-puppetmaster-02 as hot spare

2015-10-02

  • 17:30 scfc_de: tools-webgrid-lighttpd-1402: Removed exim paniclog (OOM).

2015-10-01

  • 23:38 yuvipanda: actually rebooting tools-worker-02; had actually rebooted -01 earlier #facepalm
  • 23:20 yuvipanda: rebooting tools-worker-02 to pickup new kernel
  • 23:10 yuvipanda: failed over tools-proxy-01 to -02, restarting -01 to pick up new kernel
  • 22:58 yuvipanda: rebooted tools-proxy-02 to pick up new kernel

2015-09-30

  • 07:12 yuvipanda: deleted tools-webproxy-01 and -02, running on proxy-01 and -02 now
  • 06:40 yuvipanda: migrated webproxy to tools-proxy-01

2015-09-29

  • 12:08 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).

2015-09-28

  • 15:24 Coren: rebooting tools-shadow after mount option changes.

2015-09-25

  • 16:02 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).

2015-09-24

  • 14:06 scfc_de: tools-exec-1201: Restarted grid engine exec for T109485.
  • 13:56 scfc_de: tools-master: Restarted grid engine master for T109485.

2015-09-23

2015-09-16

  • 17:33 scfc_de: Removed python-tools-webservice from precise-tools as apparently old version of tools-webservice.
  • 01:17 YuviPanda: attempting to move grrrit-wm to kubernetes
  • 01:17 YuviPanda: attempting to move to kubernetes

2015-09-15

  • 01:18 scfc_de: Added unixodbc_2.2.14p2-5_amd64.deb back to precise-tools to diagnose if it is related to T111760.

2015-09-14

  • 23:47 scfc_de: Archived unixodbc_2.2.14p2-5_amd64 from deb-precise and aptly, no reference in Puppet or Phabricator and same version as distribution.

2015-09-13

  • 20:53 scfc_de: Archived lua-json_1.3.2-1 from labsdebrepo and aptly, upgraded manually to Trusty's new 1.3.1-1ubuntu0.1~ubuntu14.04.1, restarted nginx on tools-webproxy-01 and tools-webproxy-02, checked that proxy and localhost:8081/list works.
  • 20:42 scfc_de: rm -f /etc/apt/apt.conf.d/20auto-upgrades.ucf-dist on all hosts (cf. T110055).

2015-09-11

  • 14:54 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).

2015-09-08

  • 08:05 valhallasw`cloud: Publish for local repo ./trusty-tools [all, amd64] publishes {main: [trusty-tools]} has been successfully updated.
    Publish for local repo ./precise-tools [all, amd64] publishes {main: [precise-tools]} has been successfully updated.
  • 08:04 valhallasw`cloud: added all packages in data/project/.system/deb-precise to aptly repo precise-tools
  • 08:03 valhallasw`cloud: added all packages in data/project/.system/deb-trusty to aptly repo trusty-tools
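
The aptly messages above correspond roughly to an import plus a publish refresh. The exact invocations are not logged, so the repo/distribution names below are an assumption based on the recorded output:

    # Import every package from the shared directories into the local aptly repos (assumed names)
    aptly repo add precise-tools /data/project/.system/deb-precise
    aptly repo add trusty-tools  /data/project/.system/deb-trusty

    # Refresh the published repos so the new packages become visible
    aptly publish update precise-tools
    aptly publish update trusty-tools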

2015-09-07

  • 18:49 valhallasw`cloud: ran sudo mount -o remount /data/project on tools-static-01, which also solved the issue, so skipping the reboot
  • 18:47 valhallasw`cloud: switched static webserver to tools-static-02
  • 18:45 valhallasw`cloud: weird NFS issue on tools-web-static-01. Switching over to -02 before rebooting.
  • 17:57 YuviPanda: created tools-k8s-master-01 with jessie, will be etcd and kubernetes master

2015-09-03

  • 07:09 valhallasw`cloud: and just re-running puppet solves the issue. Sigh.
  • 07:09 valhallasw`cloud: last message in puppet.log.1.gz is Error: /Stage[main]/Toollabs::Exec_environ/Package[fonts-ipafont-gothic]/ensure: change from 00303-5 to latest failed: Could not get latest version: Execution of '/usr/bin/apt-cache policy fonts-ipafont-gothic' returned 100: fonts-ipafont-gothic: (...) E: Cache is out of sync, can't x-ref a package file
  • 07:07 valhallasw`cloud: err, is empty.
  • 07:07 valhallasw`cloud: Puppet failure on tools-exec-1215 is CRITICAL 66.67% of data above the critical threshold -- but /var/log/puppet.log doesn't exist?!

2015-09-02

  • 15:01 scfc_de: Added -M option to qsub call for crontab of tools.sdbot.
  • 13:58 valhallasw`cloud: rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load
  • 13:55 valhallasw`cloud: restarted gridengine_exec on tools-exec-1403
  • 13:53 valhallasw`cloud: tools-exec-1403 does lots of locking operations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py . Rescheduled that job.
  • 13:16 YuviPanda: deleted all jobs of ralgisbot
  • 13:12 YuviPanda: suspended all jobs in ralgisbot temporarily
  • 12:57 YuviPanda: rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles

2015-09-01

  • 21:01 valhallasw`cloud: killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately.
  • 16:12 scfc_de: tools-bastion-01: Killed bot of tools.cobain.
  • 15:47 valhallasw`cloud: git reset --hard cdnjs on tools-web-static-01
  • 06:23 valhallasw`cloud: seems to have worked. SGE :(
  • 06:17 valhallasw`cloud: going to restart sge_qmaster, hoping this solves the issue :/
  • 06:08 valhallasw`cloud: e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!
  • 06:06 valhallasw`cloud: test job does not get submitted because all queues are overloaded?!
  • 06:06 valhallasw`cloud: investigating SGE issues reported on irc/email

2015-08-31

  • 23:20 scfc_de: Changed host name tools-webgrid-generic-1405 in "qconf -mq webgrid-generic" to fix the "au" state of the queue on that host.
  • 21:21 valhallasw`cloud: webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest)
  • 21:20 valhallasw`cloud: restarted webservicemonitor
  • 21:19 valhallasw`cloud: seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
  • 21:18 valhallasw`cloud: running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running
  • 21:15 valhallasw`cloud: several webservices seem to actually have not gotten back online?! what on earth is going on.
  • 21:10 valhallasw`cloud: some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again
  • 20:29 valhallasw`cloud: |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.
  • 20:25 valhallasw`cloud: ca 500 jobs @ 5s/job = approx 40 minutes
  • 20:23 valhallasw`cloud: doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh*
  • 20:21 valhallasw`cloud: now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues
  • 19:36 valhallasw`cloud: last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs
  • 19:35 valhallasw`cloud: one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?
  • 19:31 valhallasw`cloud: https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues (throttled loop sketched after this list)
  • 07:31 valhallasw`cloud: removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)
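
The throttled rescheduling described above is, roughly, a loop over the saved job ids with a pause between restarts. A sketch; the job-id file is the one named in the log, the loop itself is an assumption:

    # Reschedule each webgrid job from the saved, sorted list,
    # pausing 5 seconds between jobs so SGE is not overwhelmed.
    sort /home/valhallaw/webgrid_jobs | while read -r jobid; do
        sudo qmod -rj "$jobid"
        sleep 5
    done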

2015-08-30

  • 13:23 valhallasw`cloud: killed wikibugs-backup and grrrit-wm on tools-webproxy-01
  • 13:20 valhallasw`cloud: disabling 503 error page

2015-08-29

  • 04:09 scfc_de: Disabled queue webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs (qmod -d) because I can't ssh to it and jobs deployed there fail with "failed assumedly before job:can't get password entry for user".

2015-08-27

  • 15:00 valhallasw`cloud: killed multiple kmlexport processes on tools-webgrid-lighttpd-1401 again

2015-08-26

  • 01:10 scfc_de: Felt lucky: kill -STOP bigbrother on tools-submit, installed I00cd7a90273e0d745699855eb671710afb4e85a7 on tools-services-02 and service bigbrothermonitor start. If it goes berserk, please service bigbrothermonitor stop.

2015-08-25

  • 20:23 scfc_de: tools-webgrid-generic-1405: killall mpt-statusd.
  • 14:58 YuviPanda: pooled in two new instances for the precise exec pool
  • 14:45 YuviPanda: reboot tools-exec-1221
  • 14:26 YuviPanda: rebooting tools-exec-1220 because NFS wedge...
  • 14:18 YuviPanda: pooled in tools-webgrid-generic-1405
  • 10:16 YuviPanda: created tools-webgrid-generic-1405
  • 10:04 YuviPanda: apply exec node puppet roles to tools-exec-1220 and -1221
  • 09:59 YuviPanda: created tools-exec-1220 and -1221

2015-08-24

  • 16:37 valhallasw`cloud: more processes were started, so added a talk page message on User:Coet (who was starting the processes according to /var/log/auth.log) and using 'write coet' on tools-bastion-01
  • 16:15 valhallasw`cloud: kill -9'ing because normal killing doesn't work
  • 16:13 valhallasw`cloud: killing all processes of tools.cobain which are flooding tools-bastion-01

2015-08-20

  • 18:44 valhallasw`cloud: both are now at 3dbbc87
  • 18:43 valhallasw`cloud: running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).
  • 18:42 valhallasw`cloud: tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.
  • 18:39 valhallasw`cloud: cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files
  • 17:06 valhallasw`cloud: wait, what timezone is this?!

2015-08-19

  • 10:45 valhallasw`cloud: ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406
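
The same one-liner, unrolled for readability (identical commands, nothing new):

    # Find queue instances in the 'au' (alarm, unknown) state, take the host part of
    # the queue instance name, and start the exec daemon on each of those hosts.
    for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do
        echo "$i"
        ssh "$i" sudo service gridengine-exec start
    done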

2015-08-18

  • 15:53 scfc_de: Added valhallasw as grid manager (qconf -am valhallasw).
  • 14:42 scfc_de: tools-webgrid-lighttpd-1411: Killed mpt-statusd (T104779).
  • 13:57 valhallasw`cloud: same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.
  • 13:55 valhallasw`cloud: no, wait, that's tools-webgrid-lighttpd-1411.eqiad.wmflabs, not the actual host tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs. We should fix that dns mess as well.
  • 13:54 valhallasw`cloud: tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state.
  • 13:47 valhallasw`cloud: that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state
  • 13:46 valhallasw`cloud: starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done
  • 08:37 valhallasw`cloud: sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs, tools-webgrid-lighttpd-1406.eqiad.wmflabs, tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
  • 08:33 valhallasw`cloud: tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available)
  • 08:30 valhallasw`cloud: hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config
  • 08:21 valhallasw`cloud: still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
  • 08:20 valhallasw`cloud: sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs
  • 08:14 valhallasw`cloud: and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻
  • 08:10 valhallasw`cloud: /var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list.
  • 08:07 valhallasw`cloud: sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> root@tools-bastion-01.eqiad.wmflabs added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list
  • 08:06 valhallasw`cloud: ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so.
  • 08:04 valhallasw`cloud: installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps.
  • 08:00 valhallasw`cloud: running puppet agent -tv again
  • 07:55 valhallasw`cloud: argh. Disabling toollabs::node::web::generic again and enabling toollabs::node::web::lighttpd
  • 07:54 valhallasw`cloud: various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in
  • 07:53 valhallasw`cloud: Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory --- ran sudo touch /usr/lib/adminbot/README
  • 07:37 valhallasw`cloud: applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411
  • 07:31 valhallasw`cloud: reading puppet suggests I should qconf -ah /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs but that file is missing?
  • 07:26 valhallasw`cloud: andrewbogott built tools-webgrid-lighttpd-1411 yesterday but it's not actually added as exec host. Trying to figure out how to do that...
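
For reference, the walkthrough above (reading bottom to top) boils down to registering the new instance with gridengine under its full tools.eqiad.wmflabs name everywhere. A condensed sketch of the steps involved, per the entries above, with the FQDN written the way that eventually worked:

    # Add the instance as an execution host from its puppet-managed description file
    sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs

    # Add it to the host group the queues reference; the FQDN must match the exec host
    # entry exactly (the .eqiad.wmflabs vs .tools.eqiad.wmflabs mismatch above broke this)
    sudo qconf -mhgrp "@webgrid"

    # Re-enable the queue instances on the host once the names agree
    sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"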

2015-08-17

  • 19:00 scfc_de: tools-checker-01, tools-exec-1410, tools-exec-catscan, tools-redis-01, tools-redis-02, tools-web-static-01, tools-webgrid-lighttpd-1406, tools-webproxy-02: Remounted /public/dumps (T109261).
  • 16:17 andrewbogott: disable queues for tools-exec-1205 tools-exec-1207 tools-exec-1208 tools-exec-140 tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-exec-catscan tools-web-static-01 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1205 tools-webgrid-lighttpd-1206 tools-webgrid-lighttpd-1406 tools-webproxy-02
  • 15:33 andrewbogott: re-enabling the queue on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
  • 14:50 andrewbogott: killing remaining jobs on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01

2015-08-15

  • 05:14 andrewbogott: resumed tools-exec-gift, seems not to have been the culprit
  • 05:10 andrewbogott: suspending tools-exec-gift, just for a moment...

2015-08-14

  • 17:21 andrewbogott: disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
  • 15:20 andrewbogott: Adding back to the grid engine queue: tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
  • 14:43 andrewbogott: killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407

2015-08-13

  • 18:51 valhallasw`cloud: which was resolved by scfc earlier
  • 18:50 valhallasw`cloud: tools-exec-1201/Puppet staleness was critical due to an agent lock (Ignoring stale puppet agent lock for pid
    Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists))
  • 18:08 scfc_de: scfc@tools-exec-1201: Removed stale /var/lib/puppet/state/agent_catalog_run.lock; Puppet run was started Aug 12 15:06:08, instance was rebooted ~ 15:14.
  • 16:44 andrewbogott: disabling job queue for tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
  • 14:48 andrewbogott: and tools-webgrid-lighttpd-1408
  • 14:48 andrewbogott: rescheduling (and in some cases killing) jobs on tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405

2015-08-12

  • 16:05 andrewbogott: depooling tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1408
  • 15:20 valhallasw`cloud: re-enabling queues on restarted hosts
  • 14:41 andrewbogott: forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410

2015-08-11

  • 18:17 andrewbogott: depooling tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410 in anticipation of labvirt1001 reboot tomorrow

2015-08-04

  • 13:43 scfc_de: Fixed owner of ~tools.kasparbot/error.log (T99576).

2015-08-03

  • 19:13 andrewbogott: deleted tools-static-01

2015-08-01

  • 18:09 andrewbogott: depooling/rebooting tools-webgrid-lighttpd-1407 because it’s unable to fork
  • 16:54 scfc_de: tools-webgrid-lighttpd-1407: Removed exim paniclog (OOM).

2015-07-30

  • 15:00 andrewbogott: rebooting tools-bastion-01 aka tools-login
  • 14:46 scfc_de: tools-webgrid-lighttpd-1408, tools-webgrid-lighttpd-1409: Removed exim paniclog (OOM).
  • 02:53 scfc_de: "webservice uwsgi-python start" for blogconverter.
  • 02:40 scfc_de: qdel 545479 (hazard-bot, "release=trusty-quiet", stuck since July 9th).
  • 02:39 scfc_de: qdel 301895 (projanalysis, "release=trust", stuck since July 1st).
  • 02:38 scfc_de: tools-webgrid-generic-1401, tools-webgrid-generic-1402, tools-webgrid-generic-1403: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
  • 01:41 scfc_de: tools-webgrid-lighttpd-1406: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).

2015-07-29

  • 23:43 andrewbogott: draining, rebooting tools-webgrid-lighttpd-1408
  • 20:11 andrewbogott: rebooting tools-webgrid-lighttpd-1404
  • 19:58 scfc_de: tools-*: sudo rmdir /etc/ssh/userkeys/ubuntu{/.ssh{/authorized_keys\ {/public{/keys{/ubuntu{/.ssh,},},},},},}

2015-07-28

  • 17:49 valhallasw`cloud: Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server
  • 17:43 valhallasw`cloud: rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty
  • 17:16 valhallasw`cloud: disabled queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs"
  • 02:07 YuviPanda: removed pacct files from tools-bastion-01

2015-07-27

  • 21:27 valhallasw`cloud: turned off process accounting on tools-login while we try to find the root cause of phab:T107052:
    accton off

2015-07-19

  • 01:51 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).

2015-07-11

  • 00:01 mutante: fixing puppet runs on tools-webgrid-* via salt

2015-07-10

  • 23:59 mutante: fixing puppet runs on tools-exec via salt
  • 20:09 valhallasw`cloud: it took three of us, but adminbot is updated!

July 6

  • 09:49 valhallasw`cloud: 10:14 <jynus> s51053 is abusing his/her access to replica dbs and creating lag for other users. His/her queries are to be terminated. (= tools.jackbot / user jackpotte)

July 2

  • 17:07 valhallasw`cloud: can't login to tools-mailrelay-01., probably because puppet was disabled for too long. Deleting instance.
  • 16:12 valhallasw`cloud: I mean tools-bastion-01
  • 16:12 valhallasw`cloud: stopping puppet on tools-login and tools-mail to check for changes in deploying https://gerrit.wikimedia.org/r/#/c/205914/

June 29

  • 17:29 YuviPanda: failed over tools webproxy to tools-webproxy-02

June 21

  • 18:57 scfc_de: tools-precise-dev: apt-get purge python-ldap3 (the previous fix for "Cache has broken packages, exiting" didn't work).
  • 16:39 scfc_de: tools-precise-dev: apt-get clean ("Cache has broken packages, exiting").
  • 16:33 scfc_de: tools-submit: Removed exim4 paniclog (OOM).

June 19

  • 15:07 YuviPanda: remounting /data/scratch

June 10

  • 11:52 YuviPanda: tools-trusty be gone

June 8

  • 16:31 YuviPanda: added Nova Tools Bot as admin, for automated nova API access

June 7

  • 17:05 YuviPanda: killed sort /data/project/templatetiger/public_html/dumps/ruwiki-2015-03-24.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/ruwiki-2015-03-24.txt -T /data/project/templatetiger to rescue NFS

June 5

  • 17:44 YuviPanda: migrate tools-shadow to labvirt1002

June 2

  • 18:34 Coren: rebooting tools-webgrid-lighttpd-1406.eqiad.wmflabs
  • 16:27 YuviPanda: cleaned out /etc/hosts file on tools-shadow
  • 16:20 Coren: switching back to tools-master
  • 16:10 YuviPanda: restart nscd on tools-submit
  • 15:54 Coren: Switching names for tools-exec-1401
  • 15:43 Coren: adding the "new" exec nodes (aka, current nodes with new names)
  • 14:34 YuviPanda: turned off dnsmasq for toollabs
  • 13:54 Coren: adding new-style names for submit hosts
  • 13:53 YuviPanda: moved tools-master / shadow to designate
  • 13:52 Coren: new-style names for gridengine admin hosts added
  • 13:28 Coren: sge_shadowd started a new master as expected, after /two/ timeouts of 60s (unexpected)
  • 13:23 Coren: stracing the shadowd to see what's up; master is down as expected.
  • 13:17 Coren: killing the sge_qmaster to test failover
  • 12:56 YuviPanda: switched labs webproxies to designate, forcing puppet run and restarting nscd

May 29

  • 13:39 YuviPanda: tools-redis-01 is redis master now
  • 13:35 YuviPanda: enable puppet on all hosts, redis move-around completed
  • 13:01 YuviPanda: recreating tools-redis-01 and -02
  • 12:52 YuviPanda: disable puppet on all toollabs hosts for tools-redis update
  • 12:27 YuviPanda: created two redis instances (tools-redis-01 and tools-redis-02), beginning to set up stuff

May 28

  • 12:22 wm-bot: petrb: inserted some local IP's to hosts file
  • 12:15 wm-bot: petrb: shutting nscd off on tools-master
  • 12:14 wm-bot: petrb: test
  • 11:28 petan: syslog is full of these May 28 11:27:36 tools-master nslcd[1041]: [81823a] <group=550> error writing to client: Broken pipe
  • 11:25 petan: rebooted tools-master in order to try fix that network issues

May 27

  • 20:10 LostPanda: disabled puppet on tools-shadow too
  • 19:46 LostPanda: echo -n 'tools-master.eqiad.wmflabs' > /var/lib/gridengine/default/common/act_qmaster haaail someone?
  • 19:10 YuviPanda: reverted gridengine-common on tools-shadow to 6.2u5-4 as well, to match tools-master
  • 18:58 YuviPanda: rebooting tools-master after switchover failed and it cannot seem to do DNS

May 23

  • 19:56 scfc_de: tools-webgrid-lighttpd-1410: Removed exim4 paniclog (OOM).

May 22

  • 20:37 yuvipanda: deleted and depooled tools-exec-07

May 20

  • 20:09 yuvipanda: transient shinken puppet alerts because I tried to force puppet runs on all tools hosts but cancelled
  • 20:01 yuvipanda: enabling puppet on all hosts
  • 20:01 yuvipanda: tested new /etc/hosts on tools-bastion-01, puppet run produced no diffs, all good
  • 19:56 yuvipanda: copy cleaned up and regenerated /etc/hosts from tools-precise-dev to all toollabs hosts
  • 19:54 yuvipanda: copy cleaned up hosts file to /etc/hosts on tools-precise-dev
  • 19:54 yuvipanda: enabled puppet on tools-precise-dev
  • 19:33 yuvipanda: disabling puppet on *all* hosts for https://gerrit.wikimedia.org/r/#/c/210000/
  • 06:21 yuvipanda: killed a bunch of webservice jobs stuck in dRr state

May 19

  • 21:06 yuvipanda: failed over services to tools-services-02, -01 was refusing to start some webservices with permission denied errors for setegid
  • 20:16 yuvipanda: qdel -f for all webservice jobs that were in dr state
  • 20:12 yuvipanda: force killed croptool webservice

May 18

  • 01:36 yuvipanda: created new tools-checker-01, applying role and provisioning
  • 01:32 yuvipanda: killed tools-checker-01 instance, recreating

May 15

  • 12:06 valhallasw: killed those perl scripts; kmlexport's lighttpd is also using excessive memory (5%), so restarting that
  • 12:01 valhallasw: webgrid-lighttpd-1402 puppet failure caused by major memory usage; tools.kmlexport is running heavy perl scripts
  • 00:27 yuvipanda: cleared graphite data for /var/* mounts on tools-redis

May 14

  • 21:53 valhallasw: shut down & removed "tools-exec-08.eqiad.wmflabs" from execution host list
  • 21:11 valhallasw: forced rescheduling of (non-cont) welcome.py job (iluvatarbot, jobid 8869)
  • 03:29 yuvipanda: drained, depooled and deleted tools-exec-15

May 10

  • 22:08 yuvipanda: created tools-precise-dev instance
  • 09:28 yuvipanda: cleared and depooled tools-exec-02 and -13. only job running was deadlocked for a long, long time (week)
  • 05:47 scfc_de: tools-submit: Removed paniclog (OOM) and stopped apache2.

May 5

  • 18:50 Betacommand: helperbot (WP:AVI bot) was running logged out and its owner is MIA; Coren killed the job from 1204 and commented out the crontab

May 4

  • 21:24 yuvipanda: reboot tools-submit, was stuck

May 2

  • 10:21 yuvipanda: drained all the old webgrid nodes, pooled in all the new webgrid nodes! POTATO!
  • 10:13 yuvipanda: cleaned out webgrid jobs from tools-webgrid-03
  • 10:12 yuvipanda: pooled tools-webgrid-lighttpd-{06-10}
  • 08:56 yuvipanda: drained and deleted tools-webgrid-01
  • 07:31 yuvipanda: depooled and deleted tools-webgrid-{01,02}
  • 07:31 yuvipanda: disabled catmonitor task / cron, was heavily using an sqlite db on NFS
  • 06:56 yuvipanda: pooled tools-webgrid-generic-{01-04}
  • 03:44 yuvipanda: drained and deleted old trusty webgrid tools-webgrid-{05-07}
  • 02:13 yuvipanda: created tools-webgrid-lighttpd-12{01-05} and tools-webgrid-generic-14{01-04}
  • 01:59 yuvipanda: created tools-webgrid-lighttpd-14{01-10}
  • 01:58 yuvipanda: increased tools instance quota

May 1

  • 03:55 YuviKTM: depooled and deleted tools-exec-20
  • 03:54 YuviKTM: killed final job in tools-exec-20 (9911317), decommissioning node

April 30

  • 19:33 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
  • 19:31 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
  • 06:30 YuviKTM: added public IPs for all exec nodes so IRC tools continue to work. Removed all associated hostnames, let’s not do those
  • 06:13 YuviKTM: allocating new floating IPs for the new instances, because IRC bots need them.
  • 05:42 YuviKTM: disabled and drained tools-exec-1{1-5} of continuous jobs
  • 05:40 YuviKTM: pooled in tools-exec-121{1-9}
  • 05:39 YuviKTM: rebooted tools-exec-121{1-9} instances so they can apply gridengine-common properly
  • 05:39 YuviKTM: created new instances tools-exec-121{1-9} as precise
  • 05:39 YuviKTM: killed tools-dev, nobody still ssh’d in, no crontabs
  • 05:39 YuviKTM: depooled exec-{06-10}, rejigged jobs to newer nodes
  • 05:39 YuviKTM: delete tools-exec-10, was out of jobs
  • 04:28 YuviKTM: deleted tools-exec-09
  • 04:27 YuviKTM: depooled tools-exec-09.eqiad.wmflabs
  • 04:23 YuviKTM: repooled tools-exec-1201 is all good now
  • 04:19 YuviKTM: rejuggle jobs again in trustyland
  • 04:14 YuviKTM: repooled tools-exec-09, apt troubles fixed
  • 04:08 YuviKTM: depooled tools-exec-09, apt troubles
  • 04:04 YuviKTM: pooled tools-exec-1408 and tools-exec-1409
  • 04:00 YuviKTM: pooled tools-exec-1406 and 1407
  • 03:58 YuviKTM: pooled tools-exec-12{02-10}, forgot to put appropriate roles on 1201, fixing now
  • 03:54 YuviKTM: tools-exec-03 and -04 have been deleted a long time ago
  • 03:53 YuviKTM: depooled tools-exec-03 / 04
  • 03:31 YuviKTM: depooled and deleted tools-exec-12 had nothing on it
  • 03:28 YuviKTM: deleted tools-exec-21 to 24, one task still running on tools-exec
  • 03:24 YuviKTM: disabled and drained continuous tasks off tools-exec-20 to tools-exec-24
  • 03:18 YuviKTM: pooled tools-exec-1403, 1404
  • 03:13 YuviKTM: pooled tools-exec-1402
  • 03:07 YuviKTM: pooled tools-exec-1405
  • 03:04 YuviKTM: pooled tools-exec-1401
  • 02:53 YuviKTM: created tools-exec-14{06-10}
  • 02:14 YuviKTM: created tools-exec-14{01-05}
  • 01:09 YuviPanda: killing local copy of python-requests, there seems to be a newer version in prod

April 29

  • 19:33 valhallasw`cloud: re-created tools-mailrelay-01 with precise: Nova_Resource:I-00000bca.eqiad.wmflabs
  • 19:30 YuviPanda: set appropriate classes for recreated tools-exec-12* nodes
  • 19:28 YuviPanda: recreated tools-static-02
  • 19:11 YuviPanda: failed over tools-static to tools-static-01
  • 14:47 andrewbogott: deleting tools-exec-04
  • 14:44 Coren: -exec-04 drained; removed from queues. Rest well, old friend.
  • 14:41 Coren: disabled -exec-04 (going away)
  • 02:35 YuviPanda: set tools-exec-12{01-10} to configure as exec nodes
  • 02:27 YuviPanda: created tools-exec-12{01-10}

April 28

  • 21:41 andrewbogott: shrinking tools-master
  • 21:33 YuviPanda: failover is going to take longer than actual recompression for tools-master, so let’s just recompress. tools-shadow should take over automatically if that doesn’t work
  • 21:32 andrewbogott: shrinking tools-redis
  • 21:28 YuviPanda: attempting to failover gridengine to tools-shadow
  • 21:27 andrewbogott: shrinking tools-submit
  • 21:21 YuviPanda: backup crontabs onto NFS
  • 21:18 andrewbogott: shrinking tools-webproxy-02
  • 21:14 andrewbogott: shrinking tools-static-01
  • 21:11 andrewbogott: shrinking tools-exec-gift
  • 21:06 YuviPanda: failover tools-webproxy to tools-webproxy-01
  • 21:06 andrewbogott: stopping, shrinking and starting tools-exec-catscan
  • 21:01 YuviPanda: failover tools-static to tools-static-02
  • 20:53 andrewbogott: stopping, shrinking, restarting tools-shadow
  • 20:43 andrewbogott: stopping, shrinking, starting tools-static-02
  • 20:39 valhallasw`cloud: created tools-mailrelay-01 Nova_Resource:I-00000bac.eqiad.wmflabs
  • 20:26 YuviPanda: failed over tools-services to services-01
  • 18:11 Coren: reenabled -webgrid-generic-02
  • 18:05 Coren: reenabled -webgrid-03, -webgrid-08, -webgrid-generic-01; drained -webgrid-generic-02
  • 17:44 Coren: -webgrid-03, -webgrid-08 and -webgrid-generic-01 drained
  • 14:04 Coren: reenable -exec-11 for jobs.
  • 13:55 andrewbogott: stopping tools-exec-11 for a resize experiment

April 25

  • 01:32 YuviPanda: deleted tools-static, tools-static-01 has taken over
  • 01:02 YuviPanda: deleted tools-login, tools-bastion-01 has been running for long enough

April 24

  • 16:29 Coren: repooled -exec-02, -08, -12
  • 16:05 Coren: -exec-02, -08 and -12 draining
  • 15:54 Coren: reenabled tools-exec-07, -10 and -11 after reboot of host
  • 15:41 Coren: -exec-03 goes away for good.
  • 15:31 Coren: draining -exec-03 to ease migration
  • 13:43 Coren: draining tools-exec-07,10,11 to allow virt host reboot

April 23

  • 22:41 YuviPanda: disabled *@tools-exec-09
  • 22:40 YuviPanda: add tools-exec-09 back to @general
  • 22:38 YuviPanda: take tools-exec-09 from @general group
  • 20:53 YuviPanda: restart bigbrother
  • 20:28 YuviPanda: restarted nscd on tools-login and tools-dev
  • 20:22 valhallasw`cloud: removed 10.68.16.4 tools-webproxy tools.wmflabs.org from /etc/hosts
  • 13:17 andrewbogott: beginning migration of tools instances to labvirt100x hosts
  • 01:00 YuviPanda: good bye tools-login.eqiad.wmflabs

April 20

  • 13:38 scfc_de: tools-mail: Removed paniclog and killed superfluous exim.

April 18

  • 20:09 YuviPanda: sysctl vm.overcommit_memory=1 on tools-redis to allow it to bgsave again
  • 19:52 valhallasw`cloud: tools-redis unresponsive (T96485); rebooting
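
Context for the overcommit tweak above: redis BGSAVE forks the process, and with vm.overcommit_memory=0 the kernel may refuse the fork on a busy host, so background saves fail. A sketch; persisting the setting would normally be puppetized here, the sysctl.d file below is just an illustration:

    # Apply immediately (what was done above)
    sudo sysctl vm.overcommit_memory=1

    # Illustrative way to persist across reboots (assumed path, not from the log)
    echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/60-redis-overcommit.conf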

April 17

  • 01:48 YuviPanda: disable puppet on live webproxy (-01) to apply firewall changes to -02

April 16

  • 20:57 Coren: -webgrid-08 drained, rebooting
  • 20:46 Coren: -webgrid-03 repooled, depooling -webgrid-08
  • 20:45 Coren: -webgrid-03 drained, rebooting
  • 20:38 Coren: -webgrid-03 depooled
  • 20:38 Coren: -webgrid-02 repooled
  • 20:35 Coren: -webgrid-02 drained, rebooting
  • 20:33 Coren: -webgrid-02 depooled
  • 20:32 Coren: -webgrid-01 repooled
  • 20:06 Coren: -webgrid-01 drained, rebooting.
  • 19:56 Coren: depooling -webgrid-01 for reboot
  • 14:37 Coren: rebooting -master
  • 14:29 Coren: rebooting -mail
  • 14:22 Coren: rebooting -shadow
  • 14:22 Coren: -exec-15 repooled
  • 14:19 Coren: -exec-15 drained, rebooting.
  • 13:46 Coren: -exec-14 repooled. That's it for general exec nodes.
  • 13:44 Coren: -exec-14 drained, rebooting.

April 15

  • 21:06 Coren: -exec-10 repooled
  • 20:55 Coren: -exec-10 drained, rebooting
  • 20:49 Coren: -exec-07 repooled.
  • 20:47 Coren: -exec-07 drained, rebooting
  • 20:43 Coren: -exec-06 requeued
  • 20:41 Coren: -exec-06 drained, rebooting
  • 20:15 Coren: repool -exec-05
  • 20:10 Coren: -exec-05 drained, rebooting.
  • 19:56 Coren: -exec-04 repooled
  • 19:52 Coren: -exec-04 drained, rebooting.
  • 19:41 Coren: disabling new jobs on remaining (exec) precise instances
  • 19:32 Coren: repool -exec-02
  • 19:30 Coren: draining -exec-04
  • 19:29 Coren: -exec-02 drained, rebooting
  • 19:28 Coren: -exec-03 rebooted, requeing
  • 19:26 Coren: -exec-03 drained, rebooting
  • 18:50 Coren: dequeuing tools-exec-03 whilst waiting for -02 to drain.
  • 18:43 Coren: tools-exec-01 back sans idmap, returning to pool
  • 18:40 Coren: tools-exec-01 drained of jobs; rebooting
  • 18:39 YuviPanda: disabled puppet on running webproxy, tools-webproxy-01
  • 18:25 Coren: disabled -exec-01 and -exec-02 to new jobs.

April 14

  • 13:13 scfc_de: tools-submit: Removed exim paniclog (OOM doom).
  • 13:13 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

April 13

  • 21:11 YuviPanda: restart portgranter on all webgrid nodes

April 12

  • 10:52 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 11

  • 21:49 andrewbogott: moved /data/project/admin/toollabs to /data/project/admin/toollabsbak on tools-webproxy-01 and tools-webproxy-02 to fix permission errors
  • 02:15 YuviPanda: rebooted tools-submit, was not responding

April 10

  • 07:10 PissedPanda: take out tools-services-01 to test switchover and also to recreate as small
  • 05:20 YuviPanda: delete the tomcat node finally :D

April 9

  • 23:24 scfc_de: rm -f /puppet_{host,service}groups.cfg on all hosts (apparently a Puppet/hiera mishap last November).
  • 23:11 scfc_de: tools-webgrid-04: Rescheduled all jobs running on this instance (T95537).
  • 08:32 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

April 8

  • 13:25 scfc_de: Repaired servicegroups repository and restarted toolhistory job; was stuck at 2015-03-29T09:15:05Z (NFS?).
  • 12:01 scfc_de: Removed empty tools with no maintainers javed/javedbaker/shell.
  • 09:10 scfc_de: Removed stale proxy entries for analytalks/anno/commons-coverage/coursestats/eagleeye/hashtags/itwiki/mathbot/nasirkhanbot/rc-vikidia/wikistream.

April 7

  • 07:42 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

April 5

  • 10:11 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 4

  • 22:48 scfc_de: Removed zombie jobs (qdel 1991607,1994800,1994826,1994827,2054201,3449476,3450329,3451518,3451549,3451590,3451628,3451635,3451830,3451869,3452632,3452633,3452654,3452655,3452657,3452668,4218785,4219210,4219674,4219722,4219791,4219923,4220646).
  • 08:49 scfc_de: tools-submit: Restarted bigbrother because it didn't notice admin's .bigbrotherrc.
  • 08:49 scfc_de: Add webservice to .bigbrotherrc for admin tool.
  • 03:35 scfc_de: Deployed jobutils/misctools 1.5 (T91954).

April 3

  • 22:55 scfc_de: Removed empty cgi-bin directories.
  • 20:35 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 2

  • 20:07 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
  • 20:06 scfc_de: tools-submit: Removed exim paniclog (OOM).
  • 01:25 YuviPanda: created tools-bastion-02

April 1

  • 00:14 scfc_de: tools-webgrid-03: Rebooted, was stuck on console input when unable to mount NFS on boot (per wikitech console output).

March 31

  • 14:02 Coren: rebooting tools-submit
  • 07:07 YuviPanda: moved tools.wmflabs.org to tools-webproxy-01
  • 07:02 YuviPanda: reboot tools-webgrid-03 and tools-exec-03
  • 00:21 andrewbogott: temporarily shutting ‘toolsbeta-pam-sshd-motd-test’ down to conserve resources. It can be restarted any time.

March 30

  • 22:53 Coren: resyncing project storage with rsync
  • 22:40 Coren: reboot tools-login
  • 22:30 Coren: also bastion2
  • 22:28 Coren: reboot bastion1 so users can log in
  • 21:49 Coren: rebooting dedicated exec nodes.
  • 21:49 Coren: rebooting tools-submit
  • 17:27 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

March 29

  • 19:30 scfc_de: tools-submit: Restarted bigbrother for T90384.

March 28

  • 19:42 YuviPanda: created tools-exec-20

March 26

  • 21:24 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 25

  • 16:49 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

March 24

  • 16:03 scfc_de: tools-login: Removed exim paniclog (entries from Sunday).
  • 15:51 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 23

  • 21:23 scfc_de: tools-login, tools-dev, tools-trusty: Now actually disabled role::labs::bastion per T93661 :-).
  • 21:08 scfc_de: tools-login, tools-dev, tools-trusty: role::labs::bastion is still enabled due to T93663.
  • 20:57 scfc_de: tools-login, tools-dev, tools-trusty: Disabled role::labs::bastion per T93661.
  • 03:02 andrewbogott: wiped out atop.log on tools-dev because /var was filling up

March 22

  • 23:08 scfc_de: qconf -ah tools-bastion-01.eqiad.wmflabs
  • 23:07 scfc_de: for host in {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}.eqiad.wmflabs; do qconf -as "$host"; done
  • 23:07 yuvipanda: copied /etc/hosts into place on tools-bastion-01

March 21

  • 16:18 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

March 15

  • 22:38 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 13

  • 16:23 YuviPanda: cleaned out / on tools-trusty

March 11

  • 04:28 YuviPanda: tools-redis is back now, as trusty and hopefully slightly more fortified
  • 04:14 YuviPanda: kill tools-redis instance, upgrade to trusty while it is down anyway
  • 03:56 YuviPanda: restarted redis server, it had OOM-killed
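
When tools-redis OOMs as above, the usual first look is at how much memory the dataset uses and whether any cap is configured. A minimal sketch with stock redis-cli commands (run on the tools-redis instance):

    redis-cli info memory | grep used_memory_human   # current memory footprint of the dataset
    redis-cli config get maxmemory                   # whether a memory cap is configured at all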

March 9

  • 11:02 scfc_de: Deleted probably outdated proxy entry for tool wp-signpost and restarted webservice.
  • 10:22 scfc_de: Deleted obsolete proxy entries without webservice for tools bracketbot/herculebot/extreg-wos/pirsquared/searchsbl/translate/yifeibot.
  • 10:11 scfc_de: Restarted webservices for tools blahma/catmonitor/catscan2/contributions-summary/eagleeye/imagemapedit/jackbot/tb-dev/vcat/wikihistory/xtools-ec (cf. T91939).
  • 08:27 scfc_de: qmod -cq webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs (OOM of two jobs in the past).
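
A queue instance ends up in the error state ("E") after failures like the OOMs noted above, and stays unusable until the state is cleared. A minimal sketch of inspecting and then clearing it, using the same queue instance as the entry above:

    qstat -f -explain E -q 'webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs'   # show why the instance is flagged E
    qmod -cq 'webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs'                 # clear the error state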

March 7

  • 12:17 scfc_de: Moved obsolete packages that are installed on no instance at all from /data/project/.system/deb to ~tools.admin/archived-packages.

March 6

  • 07:46 scfc_de: Set role::labs::tools::toolwatcher for tools-login.
  • 07:43 scfc_de: Deployed jobutils/misctools 1.4.

March 2

March 1

  • 15:11 YuviPanda|brb: pooled tools-webgrid-07 into the lighty webgrid, moving some tools off -05 and -06 to relieve pressure

February 28

  • 07:51 YuviPanda: create tools-webgrid-07
  • 01:00 Coren: Set vm.overcommit_memory=0 on -webgrid-05 (also trusty)
  • 01:00 Coren: Also: that was -webgrid-05 (correcting the 00:59 entry below)
  • 00:59 Coren: set exec-06 to vm.overcommit_memory=0 for now, until the vm behaviour difference between precise and trusty can be nailed down.
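
The overcommit change above is a plain sysctl. A minimal sketch of applying it live and persisting it across reboots (the sysctl.d file name is assumed):

    sudo sysctl -w vm.overcommit_memory=0                                        # heuristic overcommit, the kernel default
    echo 'vm.overcommit_memory = 0' | sudo tee /etc/sysctl.d/60-overcommit.conf  # assumed file name; persists across reboots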

February 27

  • 17:53 YuviPanda: increased quota to 512G RAM and 256 cores
  • 15:33 Coren: Switched back to -master. I'm making a note here: great success.
  • 15:27 Coren: Gridengine master failover test part three; killing the master with -9
  • 15:20 Coren: Gridengine master failover test part deux - now with verbose logs
  • 15:10 YuviPanda: created tools-webgrid-generic-02
  • 15:10 YuviPanda: increase instance quota to 64
  • 15:10 Coren: Master restarted - test not successful.
  • 14:50 Coren: testing gridengine master failover starting now
  • 08:27 YuviPanda: restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well

February 24

  • 18:33 Coren: tools-submit not recovering well from outage, kicking it.
  • 17:58 YuviPanda: restarting *all* webgrid jobs on toollabs

February 16

  • 02:31 scfc_de: rm -f /var/log/exim4/paniclog.

February 13

  • 18:01 Coren: tools-redis is dead, long live tools-redis
  • 17:48 Coren: rebuilding tools-redis with moar ramz
  • 17:38 legoktm: redis on tools-redis is OOMing?
  • 17:26 marktraceur: restarting grrrit-wm because it's not behaving

February 1

  • 10:55 scfc_de: Submitted dummy jobs for tools ftl/limesmap/newwebtest/osm-add-tags/render/tsreports/typoscan/usersearch to get bigbrother to recognize those users and cleaned up output files afterwards.
  • 07:51 YuviPanda: cleared error state of stuck queues
  • 06:41 YuviPanda: set chmod +xw manually on /var/run/lighttpd on webgrid-05, need to investigate why it was necessary
  • 05:47 YuviPanda: completed migrating magnus' tools to trusty, more details at https://etherpad.wikimedia.org/p/tools-trusty-move
  • 05:37 YuviPanda: added tools-webgrid-06 as trusty webnode, operational now
  • 04:52 YuviPanda: migrating all of magnus’ tools, after consultation with him (https://etherpad.wikimedia.org/p/tools-trusty-move for status)
  • 04:10 YuviPanda: widar moved to trusty
  • 03:01 YuviPanda: ran salt -G 'instanceproject:tools' cmd.run 'sudo rm -rf /var/tmp/core' because disks were getting full.

January 29

  • 17:26 YuviPanda: reschedule all tomcat jobs

January 27

  • 23:27 YuviPanda: qdel -f 7662482 7661111 for Merlissimo

January 19

  • 20:51 YuviPanda: because valhallasw is nice
  • 10:34 YuviPanda: manually started tools-webgrid-generic-01
  • 09:48 YuviPanda: restarted tools-webgrid-03
  • 08:42 scfc_de: qmod -cq {continuous,mailq,task}@tools-exec-{06,10,11,15}.eqiad.wmflabs
  • 08:36 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog and killed second exim (belated SAL amendment).

January 16

  • 22:11 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.

January 15

  • 22:10 YuviPanda: created instance tools-webgrid-generic-01

January 11

  • 06:38 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.

January 8

  • 07:40 YuviPanda: increase memory limit for autolist from 4G to 7G
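
Per-job memory limits on the grid are requested at submission time, so the autolist bump above most likely corresponds to a larger -mem request. A sketch only: the jsub flags and script name are assumptions, and only the 7G figure and the tool name come from the entry:

    jsub -mem 7g -N autolist python autolist.py   # hypothetical command line for the autolist job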