Nova Resource:Tools/SAL

2017-08-12

  • 18:38 chasemp: restart admin webservice

2017-08-10

  • 14:59 chasemp: 'become stimmberechtigung && restart' && 'become intersect-contribs && restart'

2017-08-09

  • 17:28 chasemp: webservices restart tools.orphantalk

2017-08-03

  • 00:47 bd808: tools-bastion-03 not usably responsive to interactive commands; will reboot
  • 00:00 bd808: Restarted kube-proxy service on bastion-03

2017-08-02

  • 16:59 bd808: Force deleted 6 jobs stuck in 'dr' state

2017-07-31

  • 15:28 chasemp: remove python-keystoneclient from bastion-03

2017-07-27

  • 23:27 bd808: Killed python procs owned by sdesabbata on tools-login that were stealing all cpu/io
  • 21:16 bd808: Disabled puppet on tools-proxy-01 to test nginx proxy config changes
  • 16:27 bd808: Enabled puppet on tools-static-11
  • 16:10 bd808: Disabled puppet on tools-static-11 to test https://gerrit.wikimedia.org/r/#/c/357878

2017-07-26

  • 22:33 chasemp: hotpatching an hiera value on tools master to see effects

2017-07-20

  • 19:48 bd808: Clearing all Eqw state jobs in all queues with: qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 qmod -cj (annotated below)
  • 13:54 andrewbogott: upgrading apache2 on tools-puppetmaster-01
  • 04:00 chasemp: tools-webgrid-lighttpd-1402:~# service nslcd restart && service nscd restart
  • 03:57 chasemp: tools-exec-1428:~# service nslcd restart && service nscd restart
  • 03:57 bd808: Restarted cron, nscd, nslcd on tools-cron-01
  • 03:45 chasemp: tools-puppetmaster-01:~# service nslcd restart && service nscd restart
  • 03:44 chasemp: tools-puppetmaster-01:~# service nslcd restart && service nscd restart
  • 03:37 bd808: Restarted apache on tools-puppetmaster-01
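
A commented copy of the 19:48 pipeline above; 'qmod -cj' clears a job's error state so the scheduler retries it, and the rest of the pipeline just collects job IDs:

    # List all jobs from every user, keep those flagged Eqw (error while
    # queued), pull the job ID from column 1, and clear each error state.
    qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 qmod -cj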

2017-07-19

  • 23:52 bd808: Restarted cron on tools-cron-01; toolschecker job showing user not found errors
  • 21:19 valhallasw`cloud: Restarted nslcd on tools-bastion-03 (=tools-login); logins seem functional again.
  • 21:18 bd808: Forced puppet run and restarted nscd, nslcd on tools-bastion-02

2017-07-18

  • 19:51 andrewbogott: enabling puppet on tools-proxy-02. I don't know why it was disabled.

2017-07-17

  • 01:43 bd808: Uncordoned tools-worker-1020 after it deleted pods with local storage that were filling the entire disk
  • 01:36 bd808: Depooling tools-worker-1020

2017-07-13

  • 21:59 bd808: Elasticsearch cluster upgraded to 5.3.2
  • 21:25 bd808: Upgrading ElasticSearch cluster for T164842. There will be service interruptions
  • 17:59 bd808: Puppet is disabled on tools-proxy-02 with no reason specified.
  • 17:09 bd808: Upgraded nginx-common on tools-proxy-02
  • 17:05 bd808: Upgraded nginx-common on tools-proxy-01

2017-07-12

  • 15:46 chasemp: push out puppet run across tools
  • 12:15 andrewbogott: restarting 'admin' webservice

2017-07-07

  • 18:26 bd808: Forced puppet runs on tools-redis-* for security fix

2017-07-03

  • 04:26 bd808: cdnjs on tools-static-10 is up to date
  • 03:38 bd808: cdnjs on tools-static-11 is up to date
  • 02:19 bd808: Cleaning up stuck merges for cdnjs clones on tools-static-10 and tools-static-11

2017-07-01

  • 19:40 bd808: Disabled puppet on tools-k8s-master-01 to try and fix maintain-kubeusers
  • 19:32 bd808: Restarted maintain-kubeusers on tools-k8s-master-01

2017-06-30

  • 01:33 chasemp: time for i in `cat tools-hosts`; do ssh -i ~/.ssh/labs_root_id_rsa root@$i.eqiad.wmflabs 'hostname -f; uptime; tc-setup'; done
  • 01:29 andrewbogott: rebooting tools-cron-01

2017-06-29

  • 23:01 madhuvishy: Uncordoned all k8s-workers
  • 20:50 madhuvishy: depooling, rebooting and repooling all grid exec nodes
  • 20:36 andrewbogott: depooling, rebooting, and repooling every lighttpd node three at a time
  • 19:55 madhuvishy: Killed liangent-php jobs and usrd-tools jobs
  • 18:00 madhuvishy: drain cordon reboot uncordon tools-worker-1015 (sketch of this cycle below)
  • 17:37 madhuvishy: drain cordon reboot uncordon tools-worker-1005 tools-worker-1007 tools-worker-1008
  • 17:22 bd808: rebooting tools-static-11
  • 17:20 andrewbogott: rebooting tools-static-10
  • 17:20 madhuvishy: drain cordon reboot uncordon tools-worker-1012 tools-worker-1003
  • 17:13 madhuvishy: drain cordon reboot uncordon tools-worker-1022, tools-worker-1009, tools-worker-1002
  • 16:27 chasemp: restart k8s components on master (madhu)
  • 16:10 chasemp: tools-flannel-etcd-01:~$ sudo service etcd restart
  • 16:04 madhuvishy: reboot tools-worker-1022 tools-worker-1009
  • 15:57 chasemp: reboot tools-docker-registry-01 for nfs
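
The "drain cordon reboot uncordon" entries above follow the usual Kubernetes node-maintenance cycle. A minimal sketch for a single node, reusing the flags from the drain loop logged on 2017-03-15 (the reboot step here is illustrative; reboots were also done via nova/Horizon):

    NODE=tools-worker-1015
    kubectl cordon "$NODE"                             # stop new pods landing here
    kubectl drain --delete-local-data --force "$NODE"  # evict running pods
    ssh "$NODE" sudo reboot                            # restart the instance
    kubectl uncordon "$NODE"                           # make it schedulable again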

2017-06-27

  • 21:32 andrewbogott: moving all tools nodes to new puppetmaster, tools-puppetmaster-01.tools.eqiad.wmflabs

2017-06-25

  • 15:13 madhuvishy: Restarted webservice on tools.fatameh

2017-06-24

  • 16:01 bd808: Created and provisioned elasticsearch password for tools.wmde-uca-test (T167971)

2017-06-23

  • 20:20 bd808: Reindexing various elasticsearch indexes created before we upgraded to v2.x
  • 20:19 bd808: Dropped garbage indexes in elasticsearch cluster

2017-06-22

  • 17:03 bd808: Rolled back attempt at Elasticsearch upgrade. Indices need to be rebuilt with 2.x before 5.x can be installed. T164842
  • 16:19 bd808: Backed up elasticsearch indexes to personal laptop using elasticdump in case T164842 goes horribly wrong (sketch below)
  • 00:12 bd808: Set ownership and permissions on $HOME/.kube for all tools (T165875)
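
A minimal sketch of the 16:19 elasticdump backup, assuming the stock elasticdump CLI; the index name and port are illustrative, not taken from the log:

    # Dump one index's mapping and its documents to local JSON files.
    elasticdump --input=http://tools-elastic-01:9200/example-index \
                --output=example-index-mapping.json --type=mapping
    elasticdump --input=http://tools-elastic-01:9200/example-index \
                --output=example-index-data.json --type=data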

2017-06-21

  • 17:43 andrewbogott: repooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 17:42 madhuvishy: Restarted webservice for openstack-browser
  • 17:36 andrewbogott: depooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 17:35 andrewbogott: repooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
  • 17:24 andrewbogott: depooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
  • 17:23 andrewbogott: repooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
  • 17:11 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
  • 17:10 andrewbogott: repooling tools-webgrid-lighttpd-1412, tools-exec-1423
  • 16:57 andrewbogott: depooling tools-webgrid-lighttpd-1412, tools-exec-1423
  • 16:53 andrewbogott: repooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
  • 16:52 andrewbogott: repooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 16:35 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 16:29 andrewbogott: depooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
  • 16:05 godog: delete pods for lolrrit-wm to force restart
  • 15:45 andrewbogott: repooling tools-exec-1422, tools-webgrid-lighttpd-1413
  • 15:41 andrewbogott: switching the proxy ip back to tools-proxy-02
  • 15:31 andrewbogott: temporarily pointing the tools-proxy IP to tools-proxy-01
  • 15:26 andrewbogott: depooling tools-exec-1422, tools-webgrid-lighttpd-1413
  • 15:12 andrewbogott: depooling tools-exec-1404, tools-exec-1434, tools-worker-1026
  • 15:10 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 14:53 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 14:52 andrewbogott: repooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
  • 14:37 andrewbogott: depooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
  • 14:32 andrewbogott: repooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
  • 14:20 andrewbogott: depooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
  • 14:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407
  • 13:56 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407

2017-06-14

  • 22:09 bd808: Restarted apache2 proc on tools-puppetmaster-02

2017-06-08

  • 18:14 madhuvishy: Also delete from /tmp on tools-webgrid-lighttpd-1411 xvfb-run.*, calibre_* and ws-*.epub
  • 18:10 madhuvishy: Delete ws-*.epub from /tmp on tools-webgrid-lighttpd-1426
  • 18:07 madhuvishy: Clean up space on /tmp on tools-webgrid-lighttpd-1426 by deleting temp files xvfb-run.* and calibre_1.25.0_tmp_* created by the wsexport tool

2017-06-07

  • 19:05 madhuvishy: Killed scp job run by user torin8 on tools-bastion-02

2017-06-06

  • 20:30 chasemp: rebooting tools-bastion-02 as unresponsive (up 76 days and lots of seemingly left behind things running)

2017-06-05

  • 23:44 bd808: Deleted tools.iabot crontab that somehow got locally installed on tools-exec-1412 on 2017-05-24T20:55Z
  • 22:15 bd808: Deleted tools.iabot crontab that somehow got locally installed on tools-exec-1436 on 2017-05-24T20:55Z
  • 19:55 andrewbogott: disabling puppet on tools-proxy-01 and -02 for a staged rollout of https://gerrit.wikimedia.org/r/#/c/350494/16

2017-06-01

  • 15:15 andrewbogott: depooling/rebooting/repooling tools-exec-1403 as part of old kernel-purge testing

2017-05-31

  • 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice v0.37 (T163355)
  • 19:24 bd808: Updating toollabs-webservice package via clush (T163355)
  • 19:16 bd808: Installed toollabs-webservice_0.37_all.deb from local file on tools-bastion-02 (T163355)
  • 16:34 andrewbogott: running 'apt-get -yq autoremove' env='{DEBIAN_FRONTEND: "noninteractive"}' on all instances with salt
  • 16:25 andrewbogott: rebooting tools-exec-1404 as part of a disk-space-saving test
  • 14:07 andrewbogott: migrating tools-exec-1409 to labvirt1009 to reduce CPU load on labvirt1006 (T165753)

2017-05-30

  • 22:32 andrewbogott: migrating tools-webgrid-lighttpd-1406, tools-exec-1410 from labvirt1006 to labvirt1009 to balance cpu usage
  • 18:15 andrewbogott: restarted robokobot virgule to free up leaked files
  • 17:36 andrewbogott: restarting excel2wiki to clean up file leaks
  • 17:36 andrewbogott: restarting idwiki-welcome in kenrick95bot to free up leaked files
  • 17:31 andrewbogott: restarting onetools to clean up file leaks
  • 17:29 andrewbogott: restarting ytcleaner webservice to clean up leaked files
  • 17:22 andrewbogott: restarting vltools to clean up leaked files
  • 17:20 madhuvishy: Uncordoned tools-worker-1006
  • 17:16 madhuvishy: Killed tool videoconvert on tools-exec-1440 in debugging labstore disk space issues
  • 17:15 madhuvishy: Drained and rebooted tools-worker-1006
  • 17:15 andrewbogott: restarted croptool to clean up stray files
  • 17:15 madhuvishy: depooled, rebooted, and repooled tools-exec-1412
  • 17:15 andrewbogott: restarted catmon tool to clean up stray files

2017-05-26

  • 20:32 bd808: Added tools-webgrid-lighttpd-14{19,2[0-8]} as submit hosts (loop sketch below)
  • 20:31 bd808: Added tools-webgrid-lighttpd-1412 and tools-webgrid-lighttpd-1413 as submit hosts
  • 20:28 bd808: sudo qconf -as tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs
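
'qconf -as HOST' registers HOST as a gridengine submit host, as in the 20:28 entry. A sketch of how the 20:32 batch expands, one call per host:

    # tools-webgrid-lighttpd-14{19,2[0-8]} covers 1419 plus 1420-1428.
    for n in 1419 142{0..8}; do
        sudo qconf -as "tools-webgrid-lighttpd-${n}.tools.eqiad.wmflabs"
    done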

2017-05-22

  • 07:49 chasemp: move ooooold shared resources into archive for later cleanup

2017-05-20

  • 09:27 madhuvishy: Truncating jerr.log for tool videoconvert since it's 967GB

2017-05-10

  • 19:11 bd808: Edited striker db record for user Stepan Grigoryev to detach SUL and Phab accounts. T164849
  • 17:47 bd808: Signed and revoked puppet certs generated when our DNS flipped out and gave hosts non-FQDN hostnames
  • 17:29 bd808: Fixed broken puppet cert on tools-package-builder-01

2017-05-04

  • 19:23 madhuvishy: Rebooting tools-grid-shadow
  • 16:21 madhuvishy: Start instance tools-grid-master.tools from horizon
  • 16:20 madhuvishy: Shut off tools-grid-master.tools instance from horizon
  • 16:16 madhuvishy: Stopped gridengine-shadow on tools-grid-shadow.tools (service gridengine-shadow stop and kill -9 individual shadowd processes)

2017-04-24

  • 15:33 bd808: Removed Gergő Tisza as a projectadmin for T163611; event done

2017-04-21

  • 22:30 bd808: Added Gergő Tisza as a projectadmin for T163611
  • 13:43 chasemp: T161898 clush -g all 'sudo puppet agent --disable "rollout nfs-mount-manager"'

2017-04-20

  • 17:15 bd808: Deleted shutdown VM tools-docker-builder-04; tools-docker-builder-05 is the new hotness
  • 17:11 bd808: kill -INT 19897 on tools-proxy-02 to stop a hung nginx child process left from the last graceful restart of nginx

2017-04-19

  • 15:10 bd808: apt-get install psmisc on tools-proxy-0[12]
  • 13:23 chasemp: stop docker on tools-proxy-01
  • 13:20 chasemp: clean up disk space on tools-proxy-01

2017-04-18

  • 20:37 bd808: Restarted bigbrother on tools-services-02
  • 04:23 bd808: Shutdown tools-docker-builder-04; will wait a bit before deleting
  • 04:04 bd808: Built and pushed new Docker images based on 82a46b4 (Refactor apt-get actions in Dockerfiles)
  • 03:42 bd808: Made tools-docker-builder-05.tools.eqiad.wmflabs the active docker build host
  • 01:01 bd808: Built instance tools-package-builder-01

2017-04-17

  • 20:41 bd808: Building tools-docker-builder-05
  • 19:35 chasemp: add reedy to sudo all perms so he can admin things
  • 17:21 andrewbogott: adding 8 more exec nodes: tools-exec-1435 through 1442

2017-04-11

  • 16:46 andrewbogott: added exec nodes tools-exec-1430, 31, 32, 33, 34.
  • 14:15 andrewbogott: emptied /srv/pbuilder to make space on tools-docker-04
  • 02:35 bd808: Restarted maintain-kubeusers on tools-k8s-master-01

2017-04-03

  • 13:48 chasemp: enable puppet on gridmaster

2017-04-01

  • 15:28 andrewbogott: added five new exec nodes, tools-exec-1425 through 1429
  • 14:26 chasemp: up nfs thresholds https://gerrit.wikimedia.org/r/#/c/345975/
  • 14:00 chasemp: disable puppet on tools-grid-master
  • 13:52 chasemp: tools-grid-master tc-setup clean
  • 13:40 chasemp: restart nscd and nslcd on tools-grid-master
  • 13:31 chasemp: reboot tools-exec-1420

2017-03-31

  • 22:25 yuvipanda: apt-get update && apt-get install kubernetes-node on tools-proxy-01 to upgrade kube-proxy systemd service unit

2017-03-30

  • 20:29 chasemp: stop grid-master temporarily & umount -fl project nfs & remount & start grid-master
  • 17:38 chasemp: reboot tools-exec-1401
  • 17:30 madhuvishy: Updating tools project hiera config to add role::labs::nfsclient::lookupcache: all via Horizon (T136712)
  • 17:29 madhuvishy: Disabled puppet across tools in prep for T136712

2017-03-27

  • 04:06 andrewbogott: erasing random log files on tools-proxy-01 to avoid filling the disk

2017-03-23

  • 20:38 andrewbogott: migrating tools-exec-1401 to labvirt1001
  • 19:56 andrewbogott: migrating tools-exec-1408 to labvirt1001
  • 19:02 andrewbogott: migrating tools-exec-1407 to labvirt1001
  • 16:37 andrewbogott: migrating tools-webgrid-lighttpd-1402 and 1407 to labvirt1001 (testing labvirt1001 and easing CPU load on labvirt1010)

2017-03-22

  • 13:48 andrewbogott: migrating tools-bastion-02 in 15 minutes

2017-03-21

  • 17:06 andrewbogott: moving tools-webgrid-lighttpd-1404 to labvirt1012 to ease pressure on labvirt1004
  • 16:19 andrewbogott: moving tools-exec-1406 to labvirt1011 to ease CPU usage on labvirt1004

2017-03-20

  • 22:47 yuvipanda: disable puppet on all k8s workers to test https://gerrit.wikimedia.org/r/#/c/343708/
  • 18:36 bd808: Applied openstack::clientlib on tools-checker-02 and forced puppet run
  • 18:03 bd808: Applied openstack::clientlib on tools-checker-01 and forced puppet run
  • 17:31 andrewbogott: migrating tools-exec-1417 to labvirt1013
  • 17:05 andrewbogott: migrating tools-webgrid-lighttpd-1410 to labvirt1012 to reduce load on labvirt1001
  • 16:42 andrewbogott: migrating tools-webgrid-generic-1404 to labvirt1011 to reduce load on labvirt1001
  • 16:13 andrewbogott: migrating tools-exec-1408 to labvirt1010 to reduce load on labvirt1001

2017-03-17

  • 17:24 andrewbogott: moving tools-webgrid-lighttpd-1416 to labvirt1013 to reduce load on labvirt1004
  • 17:15 andrewbogott: moving tools-exec-1424 to labvirt1012 to ease load on labvirt1004

2017-03-15

  • 19:21 andrewbogott: added new exec nodes: tools-exec-1421 and tools-exec-1422
  • 17:42 madhuvishy: Restarted stashbot
  • 17:29 chasemp: docker stop && rm -fR /var/lib/docker/* on worker-1001
  • 17:20 chasemp: test of logging
  • 16:11 chasemp: k8s master 'for h in `kubectl get nodes | grep worker | grep -v NotReady | grep -v Disabled | awk '{print $1}'`; do echo $h && kubectl drain --delete-local-data --force $h && sleep 10 ; done'
  • 16:08 chasemp: stop puppet on k8s master and drain nodes
  • 15:50 chasemp: (late) kill what appears to be an android emulator? unsure but it's eating all IO

2017-03-14

  • 21:24 bd808: Deleted tools-precise-dev (T160466)
  • 21:13 bd808: Removed non-existent tools-submit.eqiad.wmflabs from submit hosts list
  • 21:02 bd808: Deleted tools-exec-gift (T160461)
  • 20:45 bd808: Deleted tools-webgrid-lighttpd-12* nodes (T160442)
  • 20:29 bd808: Deleted tools-exec-12* nodes (T160457)
  • 20:27 bd808: Disassociated floating IPs from tools-exec-12* nodes (T160457)
  • 17:41 madhuvishy: Hand fix tools-puppetmaster by removing the old mariadb submodule directory
  • 17:23 madhuvishy: Remove role::toollabs::precise_reminder from tools-bastion-03
  • 15:40 bd808: Installing toollabs-webservice 0.36 across cluster using clush
  • 15:36 bd808: Upgraded toollabs-webservice to 0.36 on tools-bastion-02.tools
  • 15:25 bd808: Installing jobutils 1.21 across cluster using clush
  • 15:23 bd808: Installed jobutils 1.21 on tools-bastion-02
  • 15:03 bd808: Shutting down webservices running on Precise job grid nodes

2017-03-13

  • 21:12 valhallasw`cloud: tools-bastion-03: killed heavy unzip operation from staeiou, and heavy (inadvertent large file opening?) vim operation from steenth, as the entire server was blocked due to high i/o

2017-03-07

  • 17:59 andrewbogott: depooling, migrating tools-exec-1416 as part of ongoing labvirt1001 issues
  • 17:21 madhuvishy: tools-webgrid-lighttpd-1409 migrated to labvirt1011 and repooled
  • 16:31 madhuvishy: Depooled tools-webgrid-lighttpd-1409 for cold migrating to different labvirt

2017-03-06

  • 22:52 andrewbogott: migrating tools-webgrid-lighttpd-1411 to labvirt1011 to give labvirt1001 a break
  • 19:03 madhuvishy: Stopping webservice running on tool tree-of-life on author request
  • 18:25 yuvipanda: set complex_values slots=300,release=trusty for tools-exec-gift-trusty-01.tools.eqiad.wmflabs

2017-03-04

  • 23:47 madhuvishy: Added new k8s workers 1028, 1029

2017-02-28

  • 03:52 scfc_de: Deployed jobutils and misctools 1.20/1.20~precise+1 (T158722).

2017-02-27

  • 02:42 scfc_de: Purged misctools from instances where not puppetized.
  • 02:42 scfc_de: Deployed jobutils and misctools 1.19/1.19~precise+1 (T155787, T156886).

2017-02-17

  • 12:51 chasemp: create tools-exec-gift-trusty-01
  • 12:40 chasemp: create tools-exec-gift-trusty
  • 12:24 chasemp: mass apt-get clean and removal of some old .gz log files due to 30+ low space warnings

2017-02-15

  • 18:45 yuvipanda: clush a restart of nscd across all of tools
  • 00:01 bd808: Rebuilt python and python2 Docker images (T157744)

2017-02-08

  • 06:22 yuvipanda: drain tools-worker-1026 for docker upgrade
  • 05:28 yuvipanda: drain pods from tools-worker-1027.tools.eqiad.wmflabs for docker upgrade
  • 05:28 yuvipanda: disable puppet on all k8s nodes in preparation for docker upgrade

2017-02-07

  • 13:49 scfc_de: Deployed toollabs-webservice_0.33_all.deb (T156605, T156626).
  • 13:49 scfc_de: Deployed tools-manifest_0.11_all.deb.

2017-02-04

  • 02:13 yuvipanda: launch tools-worker-1027 to see if puppet works fine on first run!
  • 02:13 yuvipanda: reboot tools-worker-1026 to see if it comes up fine
  • 01:46 yuvipanda: launch tools-worker-1026

2017-02-03

  • 21:34 madhuvishy: Migrated over precise tools to trusty for user multichill (catbot, family, locator, multichill, nlwikibots, railways, wlmtrafo, wikidata-janitor)
  • 21:13 chasemp: reboot tools-bastion-03 as unresponsive

2017-02-02

  • 20:39 yuvipanda: import docker-engine 1.11.2 (currently running version) and 1.12.6 (latest version) into aptly
  • 00:06 madhuvishy: Remove user maximilianklein from tools.cite-o-meter (on request)

2017-01-30

  • 20:25 yuvipanda: sudo ln -s /usr/bin/kubectl /usr/local/bin/kubectl to temporarily fix webservice shell not working

2017-01-27

  • 19:22 chasemp: reboot tools-bastion-02 as it is having issues
  • 02:01 madhuvishy: Reenabled puppet on tools-checker-01
  • 00:29 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/

2017-01-26

  • 23:37 madhuvishy: reenabled puppet on tools-checker
  • 23:02 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/
  • 16:08 chasemp: major cleanup for stale var items on tools-exec-1221

2017-01-24

  • 18:14 andrewbogott: one last reboot of tools-mail
  • 18:00 andrewbogott: apt-get autoremove on tools-mail
  • 17:51 andrewbogott: rebooting tools-mail post upgrade
  • 17:19 andrewbogott: restarting tools-mail, beginning do-release-upgrade -d -q
  • 17:17 andrewbogott: backing up tools-mail to ~root/8c499e6e-1b79-4bb1-8f7f-72fee1f74ea5-backup on labvirt1009
  • 17:15 andrewbogott: stopping tools-mail, backing up, upgrading from precise to trusty
  • 15:49 yuvipanda: clush -g all 'sudo rm /usr/local/bin/kube*' to get rid of old kube related binaries
  • 14:42 yuvipanda: re-enable puppet on tools-proxy-01, test success on proxy-02
  • 14:37 yuvipanda: disable puppet on tools-proxy-01 (active proxy) to check deploying debianized kube-proxy on proxy-02
  • 13:52 yuvipanda: upgrading k8s on worker nodes to use debs + new k8s version
  • 13:52 yuvipanda: finished upgrading k8s + using debs
  • 12:49 yuvipanda: purge ancient kubectl, kube-apiserver, kube-controller-manager, kube-scheduler packages from tools-k8s-master-01, these were my old terrible packages

2017-01-23

  • 19:36 andrewbogott: temporarily shutting down tools-webgrid-lighttpd-1201
  • 19:35 yuvipanda: depool tools-webgrid-lighttpd-1201 for snapshotting tests
  • 17:13 chasemp: reboot tools-exec-1411 as having serious transient issues

2017-01-20

  • 15:58 yuvipanda: enabling puppet across all hosts
  • 15:36 yuvipanda: disable puppet everywhere to cherrypick patch moving base to a profile
  • 00:50 bd808: sudo qdel -f 1199218 to force delete a stuck toolschecker job

2017-01-11

  • 22:09 chasemp: add Reedy to admin in tool labs (approved by bryon and chase for access to investigate specific tool abuse behavior)

2017-01-10

  • 19:05 madhuvishy: Killed 3 jobs from tools.arnaub that were causing high load on tools-exec-1411

2017-01-06

  • 19:02 bd808: Terminated deprecated instances tools-exec-121[2-6] (T154539)

2017-01-04

  • 02:43 madhuvishy: Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. T152369

2017-01-03

  • 23:56 bd808: Removed tools-exec-12[12-16] from gridengine (T154539)
  • 23:27 bd808: drained tools-exec-1216 (T154539)
  • 23:26 bd808: drained tools-exec-1215 (T154539)
  • 23:25 bd808: drained tools-exec-1214 (T154539)
  • 23:25 bd808: drained tools-exec-1213 (T154539)
  • 23:24 bd808: drained tools-exec-1212 (T154539)
  • 23:11 madhuvishy: Disabled puppet on tools-checker-01 (T152369)
  • 21:43 madhuvishy: Adding iptables rule to drop incoming connections from toolschecker on labservices1001
  • 20:51 madhuvishy: Adding iptables rule to block outgoing connections to labservices1001 on tools-checker-01
  • 20:43 madhuvishy: Silenced tools checker on icinga to test labservices1001 failure causing toolschecker to flake out T152369

2016-12-25

  • 00:28 yuvipanda: comment out cron running 'clean' script of avicbot every minute without -once
  • 00:28 yuvipanda: force delete all jobs of avicbot
  • 00:25 yuvipanda: delete all jobs of avicbot. This is 419 jobs
  • 00:20 yuvipanda: kill clean.sh process of avicbot

2016-12-19

  • 20:07 valhallasw`cloud: killed gps_exif_bot2.py (tools.gpsexif), was using 50MB/s io, lagging all of tools-bastion-03
  • 13:06 yuvipanda: run /usr/local/bin/deploy-master http://tools-docker-builder-03.tools.eqiad.wmflabs v1.3.3wmf1 on tools-k8s-master-01
  • 12:53 yuvipanda: cleaned out pbuilder from tools-docker-builder-01 to clean up

2016-12-17

  • 04:49 yuvipanda: turned on lookupcache again for bastions

2016-12-15

  • 18:52 yuvipanda: reboot tools-exec-1204
  • 18:49 yuvipanda: reboot tools-webgrid-lighttpd-12[01-05]
  • 18:45 yuvipanda: reboot tools-exec-gift
  • 18:41 yuvipanda: reboot tools-exec-1217 to 1221
  • 18:30 yuvipanda: rebooted tools-exec-1212 to 1216
  • 14:55 yuvipanda: reboot tools-services-01

2016-12-14

  • 18:43 mutante: tools-bastion-03 - ran 'locale-gen ko_KR.EUC-KR' for T130532

2016-12-13

  • 20:54 chasemp: reboot bastion-03 as unresponsive

2016-12-09

  • 19:32 godog: upgrade / restart prometheus-node-exporter
  • 08:37 YuviPanda: run delete-dbusers and force replica.my.cnf creation for all tools that did not have it

2016-12-08

  • 18:48 YuviPanda: restarted toolschecker on tools-checker-01

2016-12-06

  • 00:36 bd808: Updated toollabs-webservice to 0.31 on rest of cluster (T147350)

2016-12-05

  • 23:19 bd808: Updated toollabs-webservice to 0.31 on tools-bastion-02 (T147350)
  • 22:55 bd808: Updated jobutils to 1.17 on tools-mail (T147350)
  • 22:53 bd808: Updated jobutils to 1.17 on tools-precise-dev (T147350)
  • 22:53 bd808: Updated jobutils to 1.17 on tools-cron-01 (T147350)
  • 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-03 (T147350)
  • 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-02 (T147350)
  • 16:53 bd808: Terminated deprecated instances: "tools-exec-1201", "tools-exec-1202", "tools-exec-1203", "tools-exec-1205", "tools-exec-1206", "tools-exec-1207", "tools-exec-1208", "tools-exec-1209", "tools-exec-1210", "tools-exec-1211" (T151980)
  • 16:50 bd808: Released floating IPs from decommissioned tools-exec-12[01-11] instances

2016-11-30

  • 23:06 bd808: Removed tools-exec-12[00-11] from gridengine (T151980)
  • 22:54 bd808: Removed tools-exec-12[00-11] from @general hostgroup
  • 15:17 chasemp: restart coibot 'coibot.sh -o syslog.output -e syslog.errors -r yes'
  • 05:20 bd808: rescheduled continuous jobs on tools-exec-1210; 2 task queue jobs remain (T151980)
  • 05:18 bd808: drained tools-exec-1211 (T151980)
  • 05:14 bd808: drained tools-exec-1209 (T151980)
  • 05:13 bd808: drained tools-exec-1208 (T151980)
  • 05:12 bd808: drained tools-exec-1207 (T151980)
  • 05:10 bd808: drained tools-exec-1206 (T151980)
  • 05:07 bd808: drained tools-exec-1205 (T151980)
  • 05:04 bd808: drained tools-exec-1204 (T151980)
  • 05:00 bd808: drained tools-exec-1203 (T151980)
  • 05:00 bd808: drained tools-exec-1202 (T151980)
  • 04:58 bd808: disabled queues on tools-exec-1211 (T151980)
  • 04:58 bd808: disabled queues on tools-exec-1210 (T151980)
  • 04:58 bd808: disabled queues on tools-exec-1209 (T151980)
  • 04:57 bd808: disabled queues on tools-exec-1208 (T151980)
  • 04:57 bd808: disabled queues on tools-exec-1207 (T151980)
  • 04:57 bd808: disabled queues on tools-exec-1206 (T151980)
  • 04:56 bd808: disabled queues on tools-exec-1205 (T151980)
  • 04:56 bd808: disabled queues on tools-exec-1204 (T151980)
  • 04:56 bd808: disabled queues on tools-exec-1203 (T151980)
  • 04:55 bd808: disabled queues on tools-exec-1202 (T151980)
  • 04:52 bd808: drained tools-exec-1201 (T151980)
  • 04:48 bd808: draining tools-exec-1201
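
The disable-then-drain sequence above is standard gridengine decommissioning. A sketch for one node, with qmod/qhost usage as elsewhere in this log (the exact host FQDN is illustrative):

    HOST=tools-exec-1211.eqiad.wmflabs
    sudo qmod -d "*@${HOST}"    # disable every queue instance on the host
    qhost -j -h "${HOST}"       # watch the remaining jobs finish ("drained")
    # Continuous jobs can be pushed to other nodes with: qmod -rj <jobid>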

2016-11-22

  • 15:13 chasemp: readd attr +i to replica.my.cnf that seems to have gotten lost in rsync migration

2016-11-21

  • 21:15 YuviPanda: disable puppet everywhere
  • 19:49 YuviPanda: restart all webservice jobs on gridengine to pick up logging again

2016-11-20

  • 06:51 Krenair: ran `qmod -rj lighttpd-admin` as tools.admin to try to get the main page back up, it worked briefly but then broke again

2016-11-16

  • 20:14 yuvipanda: upgrade toollabs-webservice to 0.30 on all webgrid nodes
  • 18:31 chasemp: reboot tools-exec-1404 (already depooled)
  • 18:19 chasemp: reboot tools-exec-1403
  • 17:23 chasemp: reboot tools-exec-1212 (converted via 321786 testing for recovery on boot)
  • 16:55 chasemp: clush -g all "puppet agent --disable 'trial run for changeset 321786 handling /var/lib/gridengine'"
  • 02:05 yuvipanda: rebooting tools-docker-registry-01, can't ssh in
  • 01:43 yuvipanda: cleanup old images on tools-docker-builder-03

2016-11-15

  • 19:52 chasemp: reboot tools-precise-dev
  • 05:20 yuvipanda: restart all k8s webservices too
  • 05:05 yuvipanda: restarting all webservices on gridengine
  • 03:21 chasemp: reboot tools-checker-01
  • 02:56 chasemp: reboot tools-exec-1405 to ensure noauto works (because atboot=>false is lies)
  • 02:31 chasemp: reboot tools-exec-1406

2016-11-14

  • 22:51 chasemp: shut down bastion 02 and 05 and make 03 root only
  • 19:35 madhuvishy: Stopped cron on tools-cron-01 (T146154)
  • 18:24 madhuvishy: Tools NFS is read-only. /data/project and /home across tools are ro T146154
  • 16:57 yuvipanda: stopped gridengine master
  • 16:47 yuvipanda: start restarting kubernetes webservice pods
  • 16:30 madhuvishy: Unmounted all nfs shares from tools-k8s-master-01 (sudo /usr/local/sbin/nfs-mount-manager clean) T146154
  • 16:22 yuvipanda: kill maintain-kubeusers on tools-k8s-master-01, sole process touching NFS
  • 16:22 chasemp: enable puppet and run on tools-services-01
  • 16:21 yuvipanda: restarting all webservice jobs, watching webservicewatcher logs on tools-services-02
  • 16:14 madhuvishy: Disabling puppet across tools T146154

2016-11-11

  • 20:49 madhuvishy: Dual mount of tools share complete. Puppet reenabled across tools hosts. T146154
  • 20:18 madhuvishy: Rolling out dual mount of tools share across all hosts T146154
  • 19:29 madhuvishy: Disabling puppet across tools to dual mount tools share from labstore-secondary T146154

2016-11-02

  • 18:23 yuvipanda: manually stop tools-grid-master for reboot
  • 17:42 yuvipanda: drain nodes from labvirt1012 and 13
  • 13:42 chasemp: depool tools-exec-1404 for maint

2016-11-01

  • 21:54 yuvipanda: stop gridengine-master on tools-grid-master in preparation for reboot
  • 21:34 yuvipanda: depool tools nodes on labvirt1012
  • 21:16 yuvipanda: depool things in labvirt1011
  • 20:58 yuvipanda: depool tools nodes on labvirt1010
  • 20:32 yuvipanda: depool tools things on labvirt1005 and 1009
  • 20:08 yuvipanda: depooled things on labvirt1006 and 1008
  • 19:51 yuvipanda: move tools-elastic-03 to labvirt1010, -02 already in 09
  • 19:34 yuvipanda: migrate tools-elastic-03 to labvirt1009
  • 19:10 yuvipanda: depooled tools nodes from labvirt1004 and 1007
  • 17:57 yuvipanda: depool exec nodes on labvirt1002
  • 13:27 chasemp: reboot tools-exec-1404 post depool for test

2016-10-31

  • 21:50 yuvipanda: deleted cyberbot queue with qconf -dq cyberbot
  • 21:44 yuvipanda: restarted cron on tools-cron-01

2016-10-30

  • 02:25 yuvipanda: restarted maintain-kubeusers

2016-10-29

  • 17:21 yuvipanda: depool tools-worker-1005

2016-10-28

  • 20:15 chasemp: restart prometheus service on tools-prometheus-01 to see if that wakes it up
  • 20:06 yuvipanda: restart kube-apiserver again, ran into too many open file handles
  • 15:58 Yuvi[m]: restart k8s master, seems to have run out of fds
  • 15:43 chasemp: restart toolschecker service on 01 and 02

2016-10-27

  • 21:09 godog: upgrade prometheus on tools-prometheus0[12]
  • 18:49 andrewbogott: rebooting tools-webgrid-lighttpd-1401
  • 13:51 chasemp: reboot tools-webgrid-generic-1403
  • 13:50 chasemp: reboot dockerbuilder-01

2016-10-26

  • 23:20 madhuvishy: Disabling puppet on tools proxy hosts for applying proxy health check endpoint T143638
  • 23:17 godog: upgrade prometheus on tools-prometheus-02
  • 16:52 bd808: Deployed jobutils_1.16_all.deb on tools-mail (default jsub target to trusty)
  • 16:50 bd808: Deployed jobutils_1.16_all.deb on tools-precise-dev (default jsub target to trusty)
  • 16:48 bd808: Deployed jobutils_1.16_all.deb on tools-bastion-02, tools-bastion-03, tools-cron-01 (default jsub target to trusty)

2016-10-24

  • 03:45 Krenair: reset host keys for tools-puppetmaster-02 on -01, looks like it was recreated 5-6 days ago

2016-10-20

  • 16:55 yuvipanda: killed bzip2 taking 100% CPU on tools-bastion-03

2016-10-18

  • 22:56 Guest20046: flip tools-k8s-master-01 to tools-puppetmaster-02
  • 07:43 yuvipanda: move all tools webgrid nodes to tools-puppetmaster-02 too
  • 07:40 yuvipanda: complete moving all general tools exec nodes to tools-puppetmaster-02
  • 07:33 yuvipanda: restarted puppetmaster on tools-puppetmaster-01

2016-10-17

  • 14:37 chasemp: remove bdsync-deb and bdsync-deb-2 erroneously created in Tools and now defunct anyway
  • 14:05 chasemp: restart puppetmaster on tools-puppetmaster-01 (instances sticking on puppet runs for a long time)
  • 14:01 chasemp: reboot tools-exec-1215 and tools-exec-1410 as unresponsive

2016-10-14

  • 16:20 yuvipanda: repooled tools-worker-1012, seems to have recovered?!
  • 15:57 yuvipanda: drain tools-worker-1012, seems stuck

2016-10-10

  • 18:04 valhallasw`vecto: sudo service bigbrother restart @ tools-services-02

2016-10-09

  • 18:33 valhallasw`cloud: removed empty local crontabs for {yuvipanda, yuvipanda, tools.toolschecker} on {tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1204, tools-checker-01}. No other local crontabs remaining.

2016-10-05

  • 12:15 chasemp: reboot tools-webgrid-generic-1404 as locked up

2016-10-01

  • 10:03 yuvipanda: re-enable puppet on tools-checker-02

2016-09-29

  • 18:15 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs via wikitech; couldn't ssh in
  • 18:10 bd808: Investigating elasticsearch cluster issues affecting stashbot

2016-09-27

  • 08:07 chasemp: tools-bastion-03:~# chmod 640 /var/log/syslog

2016-09-25

  • 15:27 Krenair: restarted labs-logbot under tools.morebots

2016-09-21

  • 18:56 madhuvishy: Repooled tools-webgrid-lighttpd-1418 (T146212) after dns records cleanup
  • 18:42 madhuvishy: Repooled tools-webgrid-lighttpd-1416 (T146212) after dns records cleanup
  • 16:57 chasemp: reboot tools-webgrid-lighttpd-1407, tools-webgrid-lighttpd-1210, tools-webgrid-lighttpd-1414, and then tools-webgrid-lighttpd-1405 as the first 3 return

2016-09-20

  • 23:24 yuvipanda: depool tools-webgrid-lighttpd-1416 and 1418, they aren't in actual working order
  • 21:23 madhuvishy|food: Pooled new sge exec node tools-webgrid-lighttpd-1416 (T146212)
  • 21:17 madhuvishy|food: Pooled new sge exec node tools-webgrid-lighttpd-1415 (T146212)
  • 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1418 (T146212)
  • 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1416 (T146212)
  • 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1415 (T146212)
  • 17:58 andrewbogott: reboot tools-exec-1410
  • 17:54 yuvipanda: repool tools-webgrid-lighttpd-1412
  • 17:49 yuvipanda: webgrid-lighttpd-1412 hung on io (no change in nova diagnostics), rebooting
  • 17:33 yuvipanda: reboot tools-puppetmaster-01
  • 17:20 yuvipanda: reboot tools-checker-02
  • 15:42 chasemp: move floating ip from tools-checker-02 (failed) to tools-checker-01

2016-09-13

  • 21:09 madhuvishy: Bumped proxy nginx worker_connections limit T143637
  • 21:08 madhuvishy: Reenabled puppet across proxy hosts
  • 20:44 madhuvishy: Disabling puppet across proxy hosts

2016-09-12

  • 18:33 bd808: Forcing puppet run on tools-cron-01
  • 18:31 bd808: Forcing puppet run on tools-bastion-03
  • 18:28 bd808: Forcing puppet run on tools-bastion-02
  • 18:26 bd808: Forcing puppet run on tools-precise-dev
  • 18:26 bd808: Built toollabs-webservice v0.27 package and added to aptly

2016-09-10

  • 01:06 yuvipanda: migrate tools-k8s-etcd-01 to labvirt1012, it is in a state of doing no io

2016-09-09

  • 19:27 yuvipanda: reboot tools-exec-1218 and 1219
  • 18:10 yuvipanda: killed massive grep running as root

2016-09-08

  • 21:49 bd808: forcing puppet runs to install toollabs-webservice_0.26_all.deb
  • 20:51 bd808: forcing puppet runs to install jobutils_1.15_all.deb

2016-09-07

  • 21:11 Krenair: brought labs/private.git up to date on tools-puppetmaster-01
  • 02:32 Krenair: ran `SULWatcher/restart_SULWatcher.sh` as `tools.stewardbots` on bastion-03 to fix T144887

2016-09-06

  • 22:14 yuvipanda: got pbuilder off tools-services-01, was taking up too much space.
  • 22:10 madhuvishy: Deleted instance tools-web-static-01 and tools-web-static-02 (T143637)
  • 21:45 yuvipanda: reboot tools-prometheus-02. nova diagnostics shows no vda activity.
  • 20:43 chasemp: drain and reboot tools-exec-1410 for testing
  • 07:32 yuvipanda: depooled tools-exec-1219 and 1218, seem to be unresponsive, causing jobs that appear to run but aren't really

2016-09-05

  • 16:27 andrewbogott: rebooting tools-cron-01 because it is hanging all over the place

2016-09-01

  • 05:19 yuvipanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck

2016-08-31

  • 20:48 madhuvishy: Reenabled puppet across tools hosts
  • 20:45 madhuvishy: Scratch migration complete on all grid exec nodes (T134896)
  • 19:36 madhuvishy: Scratch migration on all non exec/worker nodes complete (T134896)
  • 18:18 madhuvishy: Scratch migration complete for all k8s workers (T134896)
  • 17:50 madhuvishy: Reenabling puppet across tools hosts.
  • 16:55 madhuvishy: Rsync-ed over latest backup of /srv/scratch from labstore1001 to labstore1003
  • 16:50 madhuvishy: Puppet disabling complete (T134896)

2016-08-29

  • 23:38 Krenair: added myself to the tools.admin service group earlier to try to figure out what was causing the outage, removed again now
  • 16:35 yuvipanda: run chmod u+x /data/project/framabot
  • 13:40 chasemp: restart jouncebot

2016-08-28

  • 05:34 bd808: After git gc on web-static-02.tools:/srv/cdnjs: /dev/mapper/vd-cdnjs--disk 61G 54G 3.3G 95% /srv
  • 05:25 bd808: sudo git gc --aggressive on tools-web-static-01.tools:/srv/cdnjs
  • 04:56 bd808: sudo git gc --aggressive on tools-web-static-02.tools:/srv/cdnjs

2016-08-26

  • 16:53 yuvipanda: migrate tools-static-02 to labvirt1001

2016-08-25

  • 18:07 yuvipanda: restart puppetmaster on tools-puppetmaster-01
  • 17:41 yuvipanda: depooled tools-webgrid-1413
  • 01:16 yuvipanda: restarted puppetmaster on tools-puppetmaster-01

2016-08-24

  • 23:03 chasemp: reboot tools-exec-1217
  • 17:25 yuvipanda: depool tools-exec-1217, it is dead/stuck/hung/io-starved

2016-08-20

  • 11:42 valhallasw`cloud: rebooting tools-mail (hanging)

2016-08-19

  • 14:52 chasemp: reboot 82323ee4-762e-4b1f-87a7-d7aa7afa22f6

2016-08-18

  • 20:00 yuvipanda: restarted maintain-kubeusers on tools-k8s-master-01

2016-08-15

  • 22:10 yuvipanda: depool tools-exec-1211 and 1205, seem to be out of action
  • 19:12 yuvipanda: kill unused tools-merlbot-proxy

2016-08-12

  • 20:39 yuvipanda: delete tools-webgrid-lighttpd-1415, enough webservices have moved to k8s from that queue
  • 20:37 yuvipanda: delete tools-logs-01, going to recreate with a smaller image
  • 20:36 yuvipanda: delete tools-webgrid-generic-1405, enough things have moved to k8s from that queue!
  • 20:10 yuvipanda: migration of tools-grid-master to labvirt1013 complete
  • 20:01 yuvipanda: migrating tools-grid-master (currently inactive) to labvirt1013 away from crowded 1010
  • 12:40 chasemp: tools.templatetransclusioncheck@tools-bastion-03:~$ webservice restart

2016-08-11

  • 20:13 yuvipanda: tools-grid-master finally stopped
  • 20:05 yuvipanda: disabled tools-webgrid-lighttpd-1202, is hung
  • 17:23 yuvipanda: instance being rebooted is tools-grid-master
  • 17:22 chasemp: reboot via nova master as it is stuck

2016-08-05

  • 19:29 paladox: adding tom29739 to lolrrit-wm project

2016-08-04

  • 19:09 yuvipanda: cleaned up nginx log files in tools-docker-registry-01 to fix free space warning
  • 00:19 yuvipanda: added Krenair as admin to help with T132225 and other issues.

2016-08-03

  • 22:48 yuvipanda: deleted tools-worker-1005
  • 22:08 yuvipanda: depool & delete tools-worker-1007 and 1008
  • 21:34 yuvipanda: rebooting tools-puppetmaster-01 to test a hypothesis
  • 21:10 yuvipanda: rebooting tools-puppetmaster-01 for kernel upgrade
  • 00:20 madhuvishy: Repooled nodes tools-worker 1012 and 1013 for T141126

2016-08-02

  • 22:49 yuvipanda: depooled tools-worker-1014 as well for T141126
  • 22:44 yuvipanda: depool tools-worker-1015 for T141126
  • 22:42 paladox: cherry picking 302617 onto lolrrit-wm
  • 22:41 madhuvishy: Depooling tools-worker 1012 and 1013 for T141126
  • 22:32 yuvipanda: added paladox to tools
  • 09:38 godog: bounce morebots production
  • 00:01 yuvipanda: depool tools-worker-1017 for T141126

2016-08-01

  • 23:48 madhuvishy: Repooled tools-worker-1011 and tools-worker-1018 (Yuvi) for T141126
  • 23:41 madhuvishy: Repooled tools-worker-1010 and tools-worker-1019 (Yuvi) for T141126
  • 23:21 madhuvishy: Yuvi is depooling tools-worker-1018 for T141126
  • 23:19 madhuvishy: Depooling tools-worker 1010 and 1011 for T141126
  • 23:17 madhuvishy: Yuvi depooled tools-worker-1019 for T141126
  • 23:06 madhuvishy: Added tools-worker-1022 as new k8s worker node
  • 23:06 madhuvishy: Repooled tools-worker-1009 (T141126)
  • 22:48 madhuvishy: Depooling tools-worker-1009 to prepare for T141126

2016-07-29

  • 22:04 YuviPanda: repooled tools-worker-1006
  • 21:48 YuviPanda: deleted tools-worker-1006 after depooling+draining
  • 21:45 YuviPanda: repool new tools-worker-1003 with direct-lvm docker storage backend
  • 21:30 YuviPanda: depool tools-worker-1003 to be recreated with new docker config, picking this because it's on a non-ssd host
  • 21:17 YuviPanda: depooled tools-worker-1020/21 after fixing them up
  • 20:41 YuviPanda: delete tools-worker-1001
  • 20:29 YuviPanda: depool tools-worker-1001, going to recreate it to test the new puppet first-run deployment
  • 20:26 YuviPanda: built new worker nodes tools-worker-1020 and 21 with direct-lvm storage backend
  • 17:48 YuviPanda: disable puppet on all tools k8s worker nodes

2016-07-25

  • 14:17 chasemp: nova reboot 64f01f90-c805-4a2e-9ed5-f523b909094e (grid master)

2016-07-23

  • 23:21 YuviPanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck on connecting to seaborgium preventing new tool creation
  • 01:56 YuviPanda: deploy kubernetes v1.3.3wmf1

2016-07-22

  • 17:30 YuviPanda: repool tools-worker-1018
  • 14:04 chasemp: reboot tools-worker-1015 as stuck w/ high iowait warning seconds ago. I cannot ssh in as root.

2016-07-21

  • 22:42 chasemp: reboot tools-worker-1018 as stuck T141017

2016-07-20

  • 21:27 andrewbogott: rebooting tools-k8s-etcd-01
  • 11:14 Guest9334: rebooted tools-worker-1004

2016-07-19

  • 01:06 bd808: Upgraded Elasticsearch on tools-elastic-* to 2.3.4

2016-07-18

  • 21:50 YuviPanda: force downgrade hhvm on tools-webgrid-lighttpd-1408 to fix puppet issues
  • 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm on tools-worker-1004
  • 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm
  • 21:37 YuviPanda: killed tools-pastion-01, no longer in use
  • 20:59 bd808: Disabled puppet on tools-elastic-0[123]. Elasticsearch needs to be upgraded.
  • 15:15 YuviPanda: kill 8807036 for Luke081515
  • 12:48 YuviPanda: reboot tools-flannel-etcd-03 for T140256
  • 12:41 YuviPanda: reboot tools-k8s-etcd-02 for T140256

2016-07-15

  • 10:24 yuvipanda: depool tools-exec-1402 for T138447
  • 10:24 yuvipanda: reboot tools-exec-1402 for T138447
  • 10:16 yuvipanda: depooling tools-webgrid-lighttpd-1402 and -1412 since they seem to be suffering from T138447
  • 10:08 yuvipanda: reboot tools-webgrid-lighttpd-1402 and 1412

2016-07-14

  • 23:12 bd808: Added Madhuvishy to project "roots" sudoer list
  • 22:58 bd808: Added Madhuvishy as projectadmin
  • 21:25 chasemp: change perms for tools.readmore to correct bot

2016-07-13

  • 11:40 yuvipanda: cold-migrate tools-worker-1014 off labvirt1010 to see if that improves the ksoftirqd situation
  • 11:19 yuvipanda: drained tools-worker-1004 - high ksoftirqd usage even with no load
  • 11:13 yuvipanda: depool tools-worker-1014 - unusable, totally in iowait
  • 11:13 yuvipanda: reboot tools-worker-1004, was unresponsive

2016-07-12

  • 18:07 yuvipanda: reboot tools-worker-1012, it seems to have failed LDAP connectivity :|

2016-07-08

  • 12:38 yuvipanda: starting up tools-web-static-02 again

2016-07-07

  • 12:45 yuvipanda: start deployment of k8s 1.3.0wmf4 for T139259

2016-07-06

  • 13:09 yuvipanda: associated a floating IP with tools-k8s-master-01 for T139461
  • 11:47 yuvipanda: moved tools-checker-0[12] to use tools-puppetmaster-01 as puppetmaster so they get appropriate CA for use when talking to kubernetes API

2016-07-04

  • 11:13 yuvipanda: delete tools-prometheus-01 to free up resources on labvirt1010
  • 11:11 yuvipanda: actually deleted instance tools-cron-02 to free up resources on labvirt1010 - was large and not currently used, and failover process takes a while anyway, so we can recreate if needed
  • 11:11 yuvipanda: stopped instance tools-cron-02 to free up some resources on labvirt1010

2016-07-03

  • 17:09 yuvipanda: run qstat -u '*' | grep 'dr ' | awk '{ print $1;}' | xargs -L1 qdel -f to clean out jobs stuck in dr state (annotated below)
  • 16:59 yuvipanda: migrate tools-web-static-02 to labvirt1011 to provide more breathing room
  • 16:56 yuvipanda: delete temp-test-trusty-package to provide more breathing room on labvirt1010
  • 13:49 yuvipanda: reboot tools-exec-1219
  • 13:37 yuvipanda: migrating tools-exec-1216 to labvirt1011
  • 13:07 yuvipanda: delete tools-bastion-01 which was shut down anyway
  • 13:04 yuvipanda: attempt to reboot tools-exec-1212
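
A commented copy of the 17:09 cleanup above. Jobs in 'dr' state have a pending delete that the exec host never confirmed; 'qdel -f' removes them from the master's books regardless:

    # Find jobs stuck in 'dr' state and force-delete each by job ID.
    qstat -u '*' | grep 'dr ' | awk '{ print $1;}' | xargs -L1 qdel -f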

2016-06-28

  • 15:25 bd808: Signed client cert for tools-worker-1019.tools.eqiad.wmflabs on tools-puppetmaster-01.tools.eqiad.wmflabs

2016-06-21

  • 16:49 bd808: Updated jobutils to v1.14 for T138178

2016-06-17

  • 06:17 yuvipanda: forced deletion of 7033590 for dykbot for shubinator

2016-06-08

  • 20:31 yuvipanda: start tools-bastion-03 was stuck in 'stopped' state
  • 20:31 yuvipanda: reboot tools-bastion-03

2016-05-31

  • 17:35 valhallasw`cloud: re-enabled queues on tools-exec-1407, tools-exec-1216, tools-exec-1219
  • 13:13 chasemp: reboot of tools-exec-1203 (see T136495); all jobs seem gone now

2016-05-29

  • 18:58 YuviPanda: deleted tools-k8s-bastion-01 for T136496
  • 14:29 valhallasw`cloud: chowned /data/project/xtools-mab-dev to root and back to stop rogue process that was writing to the directory. I'm still not sure where that process was running, but at least this seems to have solved the issue

2016-05-28

  • 21:52 valhallasw`cloud: rebooted tools-webgrid-lighttpd-1408, tools-pastion-01, tools-exec-1205
  • 21:21 valhallasw`cloud: rebooting tools-exec-1204 (T136495)

2016-05-27

  • 14:45 YuviPanda: start moving tools-bastion-03 to use tools-puppetmaster-01 as puppetmaster

2016-05-25

  • 20:15 YuviPanda: deleted tools-bastion-mtemp per chasemp
  • 19:43 YuviPanda: delete devpi instance, not currently in use
  • 19:39 YuviPanda: run sudo dpkg --configure -a on tools-worker-1007 to get it unstuck
  • 19:19 YuviPanda: deleted tools-docker-builder-01 and -02, hosed hosts that are unused
  • 17:18 YuviPanda: fixed hhvm upgrade on tools-cron-01
  • 07:19 YuviPanda: hard reboot tools-services-01, was completely stuck on /public/dumps
  • 06:06 bd808: Restarting all webservice jobs
  • 05:33 andrewbogott: rebooting tools-proxy-02

2016-05-24

  • 01:36 scfc_de: tools-cron-02: Downgraded hhvm (sudo apt-get install hhvm).
  • 01:36 scfc_de: tools-bastion-03, tools-checker-01, tools-cron-02, tools-exec-1202, tools-proxy-02, tools-redis-1001: Remounted /public/dumps read-only (while sudo umount /public/dumps; do :; done && sudo puppet agent -t).
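
An annotated copy of the 01:36 remount loop above. Mounts can stack on the same mount point, so umount is repeated until it fails (nothing left to unmount), after which puppet remounts the share with the new read-only options:

    # Peel off every stacked mount, then let puppet remount read-only.
    while sudo umount /public/dumps; do :; done && sudo puppet agent -t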

2016-05-23

  • 19:36 YuviPanda: switched tools-checker to tools-checker-03
  • 16:33 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs
  • 13:28 chasemp: 'apt-get install hhvm -y --force-yes' across trusty hosts to handle hhvm downgrade

2016-05-20

  • 23:39 bd808: Forced puppet run on bastion-02 & bastion-05 to apply fix for T135861
  • 19:47 chasemp: tools-exec-1406 having issues rebooting

2016-05-19

  • 21:07 bd808: deployed jobutils 1.13 on bastions; now with '-l release=...' validation!
  • 15:43 YuviPanda: rebooting all tools worker instances
  • 13:12 chasemp: reboot tools-exec-1220 stuck in state of unresponsiveness

2016-05-13

  • 00:40 YuviPanda: cleared all queues that were in error state

2016-05-12

  • 22:59 YuviPanda: restart tools-worker-1004 to attempt bringing it back up
  • 22:59 YuviPanda: deploy k8s 1.2.4wmf1 on all proxy nodes
  • 22:58 YuviPanda: deploy k8s on all worker nodes
  • 22:46 YuviPanda: deploy k8s master for 1.2.4wmf1

2016-05-10

  • 04:25 bd808: Added role::package::builder to tools-services-01

2016-05-09

  • 04:33 YuviPanda: reboot tools-worker-1004, lots of ksoftirqd stuckness despite no actual containers running

2016-05-08

  • 07:06 YuviPanda: restarted admin tool

2016-04-28

  • 04:15 YuviPanda: delete half of the trusty webservice jobs
  • 04:00 YuviPanda: deleted all precise webservice jobs, waiting for webservicemonitor to bring them back up

2016-04-24

  • 12:22 YuviPanda: force deleted job 5435259 from pbbot per PeterBowman

2016-04-11

  • 14:20 andrewbogott: moving tools-bastion-mtemp to labvirt1009

2016-04-06

  • 15:20 bd808: Removed local hack for T131906 from tools-puppetmaster-01

2016-04-05

  • 21:24 bd808: Committed local hack on tools-puppetmaster-01 to get elasticsearch working again
  • 21:02 bd808: Forcing puppet runs to fix elasticsearch
  • 20:39 bd808: Elasticsearch processes down. Looks like a prod puppet change that needs tweaking for tool labs

2016-04-04

  • 19:43 YuviPanda: new bastion!
  • 19:15 chasemp: reboot tools-bastion-05

2016-03-30

  • 15:50 andrewbogott: rebooting tools-proxy-01 in hopes of clearing some bad caches

2016-03-28

  • 20:51 yuvipanda: lifted RAM quota from 900Gigs to 1TB?!
  • 20:30 chasemp: change perms on grant files from create-dbusers to chmod 400 and chattr +i

2016-03-27

  • 17:40 scfc_de: tools-webgrid-generic-1405, tools-webgrid-lighttpd-1411, tools-web-static-01, tools-web-static-02: "apt-get install cloud-init" and accepted changes for /etc/cloud/cloud.cfg (users: + default; cloud_config_modules: + ssh-import-id, + puppet, + chef, + salt-minion; system_info/package_mirrors/arches[i386, amd64]/search/primary: + http://%(region)s.clouds.archive.ubuntu.com/ubuntu/).

2016-03-18

  • 15:47 chasemp: had to kill stalkboten as it was logging constant errors filling logs to the tune of hundreds of gigs
  • 15:36 chasemp: cleanup huge log collection for broken bot: /srv/project/tools/project/betacommand-dev/tspywiki/irc/logs# rm -fR SpamBotLog.log\.*

2016-03-11

  • 20:57 mutante: reverted font changes - puppet runs recovering
  • 20:37 mutante: more puppet issues due to font dependencies on trusty, on it
  • 19:39 mutante: should a tools-exec server be influenced by font packages on an mw appserver?
  • 19:39 mutante: fixed puppet runs on tools-exec (gerrit 276792)

2016-03-02

  • 14:56 chasemp: qdel 3956069 and 3758653 for abusing auth

2016-02-29

  • 21:49 scfc_de: tools-exec-1218: rm -f /usr/local/lib/nagios/plugins/check_eth to work around "Got passed new contents for sum" (https://tickets.puppetlabs.com/browse/PUP-1334).
  • 21:20 scfc_de: tools-exec-1209: rm -f /var/lib/puppet/state/agent_catalog_run.lock (no Puppet process running, probably from the reboots).
  • 20:58 scfc_de: Ran "dpkg --configure -a" on all instances.
  • 13:50 scfc_de: Deployed jobutils/misctools 1.10.

2016-02-28

  • 20:08 bd808: Removed unwanted NFS mounts from tools-elastic-01.tools.eqiad.wmflabs

2016-02-26

  • 19:08 bd808: Upgraded Elasticsearch on tools-elastic-0[123] to 1.7.5

2016-02-25

  • 21:43 scfc_de: Deployed jobutils/misctools 1.9.

2016-02-22

  • 15:55 andrewbogott: redirecting tools-login.wmflabs.org to tools-bastion-05

2016-02-19

  • 15:58 chasemp: rerollout tools nfs shaping pilot for sanity in anticipation of formalization
  • 09:21 _joe_: killed cluebot3 instance on tools-exec-1207, writing 20 M/s to the error log
  • 00:50 yuvipanda: failover services to services-02

2016-02-18

  • 20:37 yuvipanda: failover proxy back to tools-proxy-01
  • 19:46 chasemp: repool labvirt1003 and depool labvirt1004
  • 18:19 chasemp: draining nodes from labvirt1001

2016-02-16

  • 21:33 chasemp: reboot of bastion-1002

2016-02-12

  • 19:56 chasemp: nfs traffic shaping pilot round 2

2016-02-05

  • 22:01 chasemp: throttle some vm nfs write speeds
  • 16:49 scfc_de: find /data/project/wikidata-edits -group ssh-key-ldap-lookup -exec chgrp tools.wikidata-edits \{\} + (probably a remnant of the work on ssh-key-ldap-lookup last summer).
  • 16:45 scfc_de: Removed /data/project/test300 (uid/gid 52080; none of them resolves, no databases, just an unmodified pywikipedia clone inside).

2016-02-03

  • 03:00 YuviPanda: upgraded flannel on all hosts running it

2016-01-31

  • 20:01 scfc_de: tools-webgrid-generic-1405: Rebooted via wikitech; rebooting via "shutdown -r now" did not seem to work.
  • 18:51 bd808: tools-elastic-01.tools.eqiad.wmflabs console shows blocked tasks, possible kernel bug?
  • 18:49 bd808: tools-elastic-01.tools.eqiad.wmflabs not responsive to ssh or Elasticsearch requests; rebooting via wikitech interface
  • 13:32 hashar: restarted qamorebot

2016-01-30

  • 06:38 scfc_de: tools-webgrid-generic-1405: Rebooted for load ~ 175 and lots of processes stuck in D.

2016-01-29

  • 21:25 YuviPanda: restarted image-resize-calc manually, no service.manifest file

2016-01-28

  • 15:02 scfc_de: tools-cron-01: Rebooted via wikitech as "shutdown -r now" => "@sbin/plymouthd --mode=shutdown" => "/bin/sh -e /proc/self/fd/9" => "/bin/sh /etc/init.d/rc 6" => "/bin/sh /etc/rc6.d/S20sendsigs stop" => "sync" stuck in D. *argl*
  • 14:56 scfc_de: tools-cron-01: Rebooted due to high number of processes stuck in D and load >> 100.
  • 14:54 scfc_de: tools-cron-01: HUPped 43 processes wikitrends/refresh.sh, though a lot of all processes seem to be stuck in D, so I'll reboot this instance.
  • 14:50 scfc_de: tools-cron-01: HUPped 85 processes /usr/lib/php5/sessionclean.

2016-01-27

  • 23:07 YuviPanda: removed all members of templatetiger, added self instead, removed active shell sessions
  • 20:24 chasemp: master stop, truncate accounting log to accounting.01272016, master start
  • 19:34 chasemp: master start grid master
  • 19:23 chasemp: stopped master
  • 19:11 YuviPanda: depooled tools-webgrid-1405 to prep for restart, lots of stuck processes
  • 18:29 valhallasw`cloud: job 2551539 is ifttt, which is also running as 2700629. Killing 2551539 .
  • 18:26 valhallasw`cloud: messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate
  • 18:24 valhallasw`cloud: 'sleep' test job also seems to work without issues
  • 18:23 valhallasw`cloud: no errors in log file, qstat works
  • 18:23 chasemp: master sge restarted post dump and restart for jobs db
  • 18:22 valhallasw`cloud: messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016'
  • 18:20 chasemp: master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job
  • 18:19 valhallasw`cloud: dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M
  • 18:17 valhallasw`cloud: SGE Configuration successfully saved to /root/sge_maint_01272016 directory.
  • 18:14 chasemp: grid master stopped
  • 00:56 scfc_de: Deployed admin/www bde15df..12a3586.

2016-01-26

  • 21:28 YuviPanda: qstat -u '*' | grep E | awk '{print $1}' | xargs -L1 qmod -cj
  • 21:16 chasemp: reboot tools-exec-1217.tools.eqiad.wmflabs

2016-01-25

  • 20:30 YuviPanda: switched over cron host to tools-cron-01, manually copied all old cron files from tools-submit to tools-cron-01
  • 19:06 chasemp: kill python merge/merge-unique.py tools-exec-1213 as it seemed to be overwhelming nfs
  • 17:07 scfc_de: Deployed admin/www at bde15df2a379c33edfb8350afd2f0c7186705a93.

2016-01-23

  • 15:49 scfc_de: Removed remnant send_puppet_failure_emails cron entries except from unreachable hosts sacrificial-kitten, tools-worker-06 and tools-worker-1003.

2016-01-21

  • 22:24 YuviPanda: deleted tools-redis-01 and -02 (are on 1001 and 1002 now)
  • 21:13 YuviPanda: repooled exec nodes on labvirt1010
  • 21:08 YuviPanda: gridengine-master started, verified shadow hasn't started
  • 21:00 YuviPanda: stop gridengine master
  • 20:51 YuviPanda: repooled exec nodes on labvirt1007 (correction to the last message)
  • 20:51 YuviPanda: repooled exec nodes on labvirt1006
  • 20:39 YuviPanda: failover tools-static to tools-web-static-01
  • 20:38 YuviPanda: failover tools-checker to tools-checker-01
  • 20:32 YuviPanda: depooled exec nodes on 1007
  • 20:32 YuviPanda: repooled exec nodes on 1006
  • 20:14 YuviPanda: depooled all exec nodes in labvirt1006
  • 20:11 YuviPanda: repooled exec nodes on 1005
  • 19:53 YuviPanda: depooled exec nodes on labvirt1005
  • 19:49 YuviPanda: repooled exec nodes from labvirt1004
  • 19:48 YuviPanda: failed over proxy to tools-proxy-01 again
  • 19:31 YuviPanda: depooled exec nodes from labvirt1004
  • 19:29 YuviPanda: repooled exec nodes from labvirt1003
  • 19:13 YuviPanda: depooled instances on labvirt1003
  • 19:06 YuviPanda: re-enabled queues on exec nodes that were on labvirt1002
  • 19:02 YuviPanda: failed over tools proxy to tools-proxy-02
  • 18:46 YuviPanda: drained and disabled queues on all nodes on labvirt1002
  • 18:38 YuviPanda: restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs. these were already dead
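
The depool/repool cycle logged above follows the same gridengine pattern for every node on the affected labvirt host; a sketch with an illustrative hostname:

    HOST=tools-exec-1403                       # example node; repeat per node on the labvirt host
    qmod -d "*@$HOST"                          # disable its queues so no new jobs land there
    # reschedule whatever is still running (restartable jobs move; the rest must drain or die)
    qstat -u '*' | grep "$HOST" | awk '{print $1}' | xargs -r -n1 qmod -rj
    # ...reboot or migrate the underlying labvirt host...
    qmod -e "*@$HOST"                          # repool the node afterwards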

2016-01-12

  • 09:48 scfc_de: tools-checker-01: Removed exim paniclog (OOM).

2016-01-11

  • 22:19 valhallasw`cloud: reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_adjustment_decay_time -> 0:7:30
  • 22:12 YuviPanda: restarted gridengine master again
  • 22:07 valhallasw`cloud: set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0
  • 22:05 valhallasw`cloud: set maxujobs back to 0, but doesn't help
  • 21:57 valhallasw`cloud: reset to 7:30
  • 21:57 valhallasw`cloud: that cleared the measure, but jobs still not starting. Ugh!
  • 21:56 valhallasw`cloud: set job_load_adjustments_decay_time = 0:0:0
  • 21:45 YuviPanda: restarted gridengine master
  • 21:43 valhallasw`cloud: qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting
  • 21:42 valhallasw`cloud: resetting to 0:7:30, as it's not having the intended effect
  • 21:41 valhallasw`cloud: currently 353 jobs in qw state
  • 21:40 valhallasw`cloud: that's load_adjustment_decay_time
  • 21:40 valhallasw`cloud: temporarily sudo qconf -msconf to 0:0:1
  • 19:59 YuviPanda: Set maxujobs (max concurrent jobs per user) on gridengine to 128
  • 17:51 YuviPanda: kill all queries running on labsdb1003
  • 17:20 YuviPanda: stopped webservice for quentinv57-tools
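
The back-and-forth above is all edits to the SGE scheduler configuration. For reference, those parameters live in qconf's scheduler config; the values in the comments are the ones the log ends up with:

    qconf -ssconf                      # show the current scheduler configuration
    # qconf -msconf opens it in $EDITOR; the values finally left in place were:
    #   maxujobs                    128                 # max concurrent running jobs per user
    #   job_load_adjustments        np_load_avg=0.50
    #   load_adjustment_decay_time  0:7:30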

2016-01-09

  • 21:07 valhallasw`cloud: moved tools-checker/208.80.155.229 back to tools-checker-01
  • 21:02 andrewbogott: rebooting tools-checker-01 as it is unresponsive.
  • 13:12 valhallasw`cloud: tools-worker-1002 is unresponsive. Maybe that's where the other grrrit-wm is hiding? Rebooting.

2016-01-08

2015-12-30

  • 04:06 YuviPanda: delete all webgrid jobs to start with a clean slate
  • 03:54 YuviPanda: qmod -rj all tools in the continuous queue, they are all orphaned
  • 02:39 YuviPanda: remove lbenedix and ebekebe from tools.hcclab
  • 00:40 YuviPanda: restarted master on grid-master
  • 00:40 YuviPanda: copied and cleaned out spooldb
  • 00:10 YuviPanda: reboot tools-grid-shadow
  • 00:08 YuviPanda: attempt to stop shadowd
  • 00:03 YuviPanda: attempting to start gridengine-master on tools-grid-shadow
  • 00:00 YuviPanda: kill -9'd gridengine master

2015-12-29

  • 23:31 YuviPanda: rebooting tools-grid-master
  • 23:22 YuviPanda: restart gridengine-master on tools-grid-master
  • 00:18 YuviPanda: shut down redis on tools-redis-01

2015-12-28

  • 22:34 chasemp: attempt to unmount nfs volumes on tools-redis-01 to debug but it hangs (I am on console and see root hang on login)
  • 22:31 YuviPanda: disable NFS on tools-redis-1001 and 1002
  • 21:32 YuviPanda: disable puppet on tools-redis-01 and -02
  • 21:27 YuviPanda: created tools-redis-1001

2015-12-23

  • 21:21 YuviPanda: deleted tools-worker-01 to -05, creating tools-worker-1001 to 1005
  • 21:19 valhallasw`cloud: tools-proxy-01: umount /home /data/project /data/scratch /public/dumps
  • 19:01 valhallasw`cloud: ah, connections that are kept open. A new incognito window is routed correctly.
  • 18:59 valhallasw`cloud: switched to -02, worked correctly, switched back. Switching back does not seem to fully work?!
  • 18:40 valhallasw`cloud: scratch that, first going to eat dinner
  • 18:38 valhallasw`cloud: dynamicproxy ban system deployed on tools-proxy-02 working correctly for localhost; switching over users there by moving the external IP.
  • 14:42 valhallasw`cloud: toollabs homepage is unhappy because tools.xtools-articleinfo is using a lot of cpu on tools-webgrid-lighttpd-1409. Checking to see what's happening there.
  • 10:46 YuviPanda: migrate tools-worker-01 to 3.19 kernel

2015-12-22

  • 18:30 YuviPanda: rescheduling all webservices
  • 18:17 YuviPanda: failed over active proxy to proxy-01
  • 18:12 YuviPanda: upgraded kernel and rebooted tools-proxy-01
  • 01:42 YuviPanda: rebooting tools-worker-08

2015-12-21

  • 18:44 YuviPanda: reboot tools-proxy-01
  • 18:31 YuviPanda: failover proxy to tools-proxy-02

2015-12-20

  • 00:00 YuviPanda: tools-worker-08 stuck again :|

2015-12-18

  • 15:16 andrewbogott: rebooting locked up host tools-exec-1409

2015-12-16

  • 23:14 andrewbogott: rebooting tools-exec-1407, unresponsive
  • 22:48 YuviPanda: run qmod -c '*' to clear error state on gridengine
  • 21:28 andrewbogott: deleted tools-docker-registry-01
  • 16:24 andrewbogott: rebooting tools-exec-1221 as it was in kernel lockup

2015-12-12

  • 10:08 YuviPanda: restarted cron on tools-submit

2015-12-10

  • 12:47 valhallasw`cloud: broke tools-proxy-02 login (for valhallasw, root still works) by restarting nslcd. Restarting; current proxy is -01.

2015-12-07

  • 13:46 Coren: The new grid masters are happy, killing the old ones (-shadow, -master)
  • 10:46 YuviPanda: restarted nscd on tools-proxy-01

2015-12-06

  • 10:29 YuviPanda: did webservice start on tool 'derivative', was missing service.manifest

2015-12-04

  • 19:33 Coren: switching master role to tools-grid-master
  • 04:42 yuvipanda: disabled puppet on tools-puppetmaster-01 because everything sucks
  • 04:09 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/256618 to tools-puppetmaster-01

2015-12-02

  • 18:29 Coren: switching gridmaster activity to tools-grid-shadow
  • 05:13 yuvipanda: increased security groups quota to 50 because why not

2015-12-01

  • 21:07 yuvipanda: added bd808 as admin
  • 21:01 andrewbogott: deleted tool/service group tools.test300

2015-11-25

  • 15:42 Coren: migrating tools-web-static-02 to labvirt1010 to free space on labvirt1002

2015-11-20

  • 22:02 Coren: tools-webgrid-lighttpd-1412 tools-webgrid-lighttpd-1413 tools-webgrid-lighttpd-1414 tools-webgrid-lighttpd-1415 done and back in rotation.
  • 21:46 Coren: tools-webgrid-lighttpd-1411 tools-webgrid-lighttpd-1211 done and back in rotation.
  • 21:30 Coren: tools-webgrid-lighttpd-1410 tools-webgrid-lighttpd-1210 done and back in rotation.
  • 21:25 Coren: tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1209 done and back in rotation.
  • 21:13 Coren: tools-webgrid-lighttpd-1408 tools-webgrid-lighttpd-1208 done and back in rotation.
  • 20:58 Coren: tools-webgrid-lighttpd-1407 tools-webgrid-lighttpd-1207 done and back in rotation.
  • 20:53 Coren: tools-webgrid-lighttpd-1406 tools-webgrid-lighttpd-1206 done and back in rotation.
  • 20:41 Coren: tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1205 tools-webgrid-generic-1405 done and back in rotation.
  • 20:28 Coren: tools-webgrid-lighttpd-1404 tools-webgrid-lighttpd-1204 tools-webgrid-generic-1404 done and back in rotation.
  • 19:49 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1203 tools-webgrid-generic-1403
  • 19:25 Coren: -lighttpd-1403 wants a restart.
  • 19:15 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1202 tools-webgrid-generic-1402
  • 18:55 Coren: Putting -lighttpd-1401 -lighttpd-1201 -generic-1401 back in rotation, disabling the others.
  • 18:24 Coren: Beginning draining web nodes; -lighttpd-1401 -lighttpd-1201 -generic-1401
  • 18:10 Coren: disabling puppet on the grid nodes listed at https://phabricator.wikimedia.org/P2337 so that the /tmp change in https://gerrit.wikimedia.org/r/#/c/252506/ do not apply early and break services

2015-11-17

  • 19:39 YuviPanda: created tools-worker-03 to be k8s worker node
  • 19:34 YuviPanda: blanked 'realm' for tools-bastion-01 to figure out what happens

2015-11-16

2015-11-03

  • 03:59 scfc_de: tools-submit, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411: Removed exim paniclog (OOM).

2015-11-02

  • 22:57 YuviPanda: pooled tools-webgrid-lighttpd-1413
  • 22:10 YuviPanda: created tools-webgrid-lighttpd-1414 and 1415
  • 22:04 YuviPanda: created tools-webgrid-lighttpd-1412 and 1413
  • 19:53 YuviPanda: drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
  • 19:50 YuviPanda: drain webgrid-lighttpd-1408 of jobs

2015-10-26

  • 20:53 YuviPanda: updated 6.9 ssh backport to all trusty hosts

2015-10-11

  • 22:54 yuvipanda: delete service.manifest for tool wikiviz to prevent it from attempting to be started. It set itself up for nodejs but didn't actually have any code

2015-10-09

2015-10-06

  • 04:35 yuvipanda: created tools-puppetmaster-02 as hot spare

2015-10-02

  • 17:30 scfc_de: tools-webgrid-lighttpd-1402: Removed exim paniclog (OOM).

2015-10-01

  • 23:38 yuvipanda: actually rebooting tools-worker-02, had actually rebooted -01 earlier #facepalm
  • 23:20 yuvipanda: rebooting tools-worker-02 to pickup new kernel
  • 23:10 yuvipanda: failed over tools-proxy-01 to -02, restarting -01 to pick up new kernel
  • 22:58 yuvipanda: rebooted tools-proxy-02 to pick up new kernel

2015-09-30

  • 07:12 yuvipanda: deleted tools-webproxy-01 and -02, running on proxy-01 and -02 now
  • 06:40 yuvipanda: migrated webproxy to tools-proxy-01

2015-09-29

  • 12:08 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).

2015-09-28

  • 15:24 Coren: rebooting tools-shadow after mount option changes.

2015-09-25

  • 16:02 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).

2015-09-24

  • 14:06 scfc_de: tools-exec-1201: Restarted grid engine exec for T109485.
  • 13:56 scfc_de: tools-master: Restarted grid engine master for T109485.

2015-09-23

2015-09-16

  • 17:33 scfc_de: Removed python-tools-webservice from precise-tools as apparently old version of tools-webservice.
  • 01:17 YuviPanda: attempting to move grrrit-wm to kubernetes
  • 01:17 YuviPanda: attempting to move to kubernetes

2015-09-15

  • 01:18 scfc_de: Added unixodbc_2.2.14p2-5_amd64.deb back to precise-tools to diagnose if it is related to T111760.

2015-09-14

  • 23:47 scfc_de: Archived unixodbc_2.2.14p2-5_amd64 from deb-precise and aptly, no reference in Puppet or Phabricator and same version as distribution.

2015-09-13

  • 20:53 scfc_de: Archived lua-json_1.3.2-1 from labsdebrepo and aptly, upgraded manually to Trusty's new 1.3.1-1ubuntu0.1~ubuntu14.04.1, restarted nginx on tools-webproxy-01 and tools-webproxy-02, checked that proxy and localhost:8081/list works.
  • 20:42 scfc_de: rm -f /etc/apt/apt.conf.d/20auto-upgrades.ucf-dist on all hosts (cf. T110055).

2015-09-11

  • 14:54 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).

2015-09-08

  • 08:05 valhallasw`cloud: Publish for local repo ./trusty-tools [all, amd64] publishes {main: [trusty-tools]} has been successfully updated.
    Publish for local repo ./precise-tools [all, amd64] publishes {main: [precise-tools]} has been successfully updated.
  • 08:04 valhallasw`cloud: added all packages in data/project/.system/deb-precise to aptly repo precise-tools
  • 08:03 valhallasw`cloud: added all packages in data/project/.system/deb-trusty to aptly repo trusty-tools
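
The three entries above import the old flat package directories into aptly and republish them. A sketch of the equivalent commands; the repo names come from the log, but the exact publish arguments are an assumption:

    aptly repo add trusty-tools /data/project/.system/deb-trusty/*.deb      # import packages
    aptly repo add precise-tools /data/project/.system/deb-precise/*.deb
    aptly publish update trusty-tools      # regenerate the published repo metadata
    aptly publish update precise-tools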

2015-09-07

  • 18:49 valhallasw`cloud: ran sudo mount -o remount /data/project on tools-static-01, which also solved the issue, so skipping the reboot
  • 18:47 valhallasw`cloud: switched static webserver to tools-static-02
  • 18:45 valhallasw`cloud: weird NFS issue on tools-web-static-01. Switching over to -02 before rebooting.
  • 17:57 YuviPanda: created tools-k8s-master-01 with jessie, will be etcd and kubernetes master

2015-09-03

  • 07:09 valhallasw`cloud: and just re-running puppet solves the issue. Sigh.
  • 07:09 valhallasw`cloud: last message in puppet.log.1.gz is Error: /Stage[main]/Toollabs::Exec_environ/Package[fonts-ipafont-gothic]/ensure: change from 00303-5 to latest failed: Could not get latest version: Execution of '/usr/bin/apt-cache policy fonts-ipafont-gothic' returned 100: fonts-ipafont-gothic: (...) E: Cache is out of sync, can't x-ref a package file
  • 07:07 valhallasw`cloud: err, is empty.
  • 07:07 valhallasw`cloud: Puppet failure on tools-exec-1215 is CRITICAL 66.67% of data above the critical threshold -- but /var/log/puppet.log doesn't exist?!

2015-09-02

  • 15:01 scfc_de: Added -M option to qsub call for crontab of tools.sdbot.
  • 13:58 valhallasw`cloud: rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load
  • 13:55 valhallasw`cloud: restarted gridengine_exec on tools-exec-1403
  • 13:53 valhallasw`cloud: tools-exec-1403 does lots of locking operations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py. Rescheduled that job.
  • 13:16 YuviPanda: deleted all jobs of ralgisbot
  • 13:12 YuviPanda: suspended all jobs in ralgisbot temporarily
  • 12:57 YuviPanda: rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles

2015-09-01

  • 21:01 valhallasw`cloud: killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately.
  • 16:12 scfc_de: tools-bastion-01: Killed bot of tools.cobain.
  • 15:47 valhallasw`cloud: git reset --hard cdnjs on tools-web-static-01
  • 06:23 valhallasw`cloud: seems to have worked. SGE :(
  • 06:17 valhallasw`cloud: going to restart sge_qmaster, hoping this solves the issue :/
  • 06:08 valhallasw`cloud: e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!
  • 06:06 valhallasw`cloud: test job does not get submitted because all queues are overloaded?!
  • 06:06 valhallasw`cloud: investigating SGE issues reported on irc/email

2015-08-31

  • 23:20 scfc_de: Changed host name tools-webgrid-generic-1405 in "qconf -mq webgrid-generic" to fix the "au" state of the queue on that host.
  • 21:21 valhallasw`cloud: webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest)
  • 21:20 valhallasw`cloud: restarted webservicemonitor
  • 21:19 valhallasw`cloud: seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
  • 21:18 valhallasw`cloud: running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running
  • 21:15 valhallasw`cloud: several webservices seem to actually have not gotten back online?! what on earth is going on.
  • 21:10 valhallasw`cloud: some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again
  • 20:29 valhallasw`cloud: |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.
  • 20:25 valhallasw`cloud: ca 500 jobs @ 5s/job = approx 40 minutes
  • 20:23 valhallasw`cloud: doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh*
  • 20:21 valhallasw`cloud: now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues
  • 19:36 valhallasw`cloud: last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs
  • 19:35 valhallasw`cloud: one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?
  • 19:31 valhallasw`cloud: https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues
  • 07:31 valhallasw`cloud: removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)
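
The rescheduling above settled on one qmod -rj every 5 seconds from a pre-built list of job IDs. A sketch, assuming one job ID per line in the list file:

    # reschedule webgrid jobs slowly so the master and NFS are not overwhelmed
    while read -r jid; do
        qmod -rj "$jid"      # ask gridengine to restart this job elsewhere
        sleep 5              # ~500 jobs at 5 s/job is roughly 40 minutes, as noted above
    done < /home/valhallaw/webgrid_jobs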

2015-08-30

  • 13:23 valhallasw`cloud: killed wikibugs-backup and grrrit-wm on tools-webproxy-01
  • 13:20 valhallasw`cloud: disabling 503 error page

2015-08-29

  • 04:09 scfc_de: Disabled queue webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs (qmod -d) because I can't ssh to it and jobs deployed there fail with "failed assumedly before job:can't get password entry for user".

2015-08-27

  • 15:00 valhallasw`cloud: killed multiple kmlexport processes on tools-webgrid-lighttpd-1401 again

2015-08-26

  • 01:10 scfc_de: Felt lucky: kill -STOP bigbrother on tools-submit, installed I00cd7a90273e0d745699855eb671710afb4e85a7 on tools-services-02 and service bigbrothermonitor start. If it goes berserk, please service bigbrothermonitor stop.

2015-08-25

  • 20:23 scfc_de: tools-webgrid-generic-1405: killall mpt-statusd.
  • 14:58 YuviPanda: pooled in two new instances for the precise exec pool
  • 14:45 YuviPanda: reboot tools-exec-1221
  • 14:26 YuviPanda: rebooting tools-exec-1220 because NFS wedge...
  • 14:18 YuviPanda: pooled in tools-webgrid-generic-1405
  • 10:16 YuviPanda: created tools-webgrid-generic-1405
  • 10:04 YuviPanda: apply exec node puppet roles to tools-exec-1220 and -1221
  • 09:59 YuviPanda: created tools-exec-1220 and -1221

2015-08-24

  • 16:37 valhallasw`cloud: more processes were started, so added a talk page message on User:Coet (who was starting the processes according to /var/log/auth.log) and using 'write coet' on tools-bastion-01
  • 16:15 valhallasw`cloud: kill -9'ing because normal killing doesn't work
  • 16:13 valhallasw`cloud: killing all processes of tools.cobain which are flooding tools-bastion-01

2015-08-20

  • 18:44 valhallasw`cloud: both are now at 3dbbc87
  • 18:43 valhallasw`cloud: running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).
  • 18:42 valhallasw`cloud: tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.
  • 18:39 valhallasw`cloud: cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files
  • 17:06 valhallasw`cloud: wait, what timezone is this?!

2015-08-19

  • 10:45 valhallasw`cloud: ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406
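
The same one-liner, unpacked: it scrapes qstat's XML output for queues in 'au' (alarm, unknown) state and restarts the exec daemon on each affected host:

    # <name>queue@host...</name> appears a few lines before the matching <state>au</state>
    hosts=$(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1)
    for i in $hosts; do
        echo "$i"
        ssh "$i" sudo service gridengine-exec start    # restart the lost exec daemon
    done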

2015-08-18

  • 15:53 scfc_de: Added valhallasw as grid manager (qconf -am valhallasw).
  • 14:42 scfc_de: tools-webgrid-lighttpd-1411: Killed mpt-statusd (T104779).
  • 13:57 valhallasw`cloud: same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.
  • 13:55 valhallasw`cloud: no, wait, that's tools-webgrid-lighttpd-1411.eqiad.wmflabs, not the actual host tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs. We should fix that dns mess as well.
  • 13:54 valhallasw`cloud: tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state.
  • 13:47 valhallasw`cloud: that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state
  • 13:46 valhallasw`cloud: starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done
  • 08:37 valhallasw`cloud: sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs, tools-webgrid-lighttpd-1406.eqiad.wmflabs, tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
  • 08:33 valhallasw`cloud: tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available)
  • 08:30 valhallasw`cloud: hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config
  • 08:21 valhallasw`cloud: still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
  • 08:20 valhallasw`cloud: sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs
  • 08:14 valhallasw`cloud: and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻
  • 08:10 valhallasw`cloud: /var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list.
  • 08:07 valhallasw`cloud: sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> root@tools-bastion-01.eqiad.wmflabs added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list
  • 08:06 valhallasw`cloud: ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so.
  • 08:04 valhallasw`cloud: installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps.
  • 08:00 valhallasw`cloud: running puppet agent -tv again
  • 07:55 valhallasw`cloud: argh. Disabling toollabs::node::web::generic again and enabling toollabs::node::web::lighttpd
  • 07:54 valhallasw`cloud: various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in
  • 07:53 valhallasw`cloud: Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory --- ran sudo touch /usr/lib/adminbot/README
  • 07:37 valhallasw`cloud: applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411
  • 07:31 valhallasw`cloud: reading puppet suggests I should qconf -ah /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs but that file is missing?
  • 07:26 valhallasw`cloud: andrewbogott built tools-webgrid-lighttpd-1411 yesterday but it's not actually added as exec host. Trying to figure out how to do that...
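
Read bottom-up, the trial-and-error above amounts to the standard registration dance for a new exec node. A sketch, using the non-interactive hostgroup edit instead of the interactive -mhgrp from the log:

    HOST=tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
    qconf -Ae /var/lib/gridengine/etc/exechosts/$HOST   # register as execution host (from file)
    qconf -as "$HOST"                                   # allow it to submit jobs
    qconf -aattr hostgroup hostlist "$HOST" @webgrid    # add it to the hostgroup the queues use
    qmod -e "*@$HOST"                                   # enable its queue instances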

2015-08-17

  • 19:00 scfc_de: tools-checker-01, tools-exec-1410, tools-exec-catscan, tools-redis-01, tools-redis-02, tools-web-static-01, tools-webgrid-lighttpd-1406, tools-webproxy-02: Remounted /public/dumps (T109261).
  • 16:17 andrewbogott: disable queues for tools-exec-1205 tools-exec-1207 tools-exec-1208 tools-exec-140 tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-exec-catscan tools-web-static-01 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1205 tools-webgrid-lighttpd-1206 tools-webgrid-lighttpd-1406 tools-webproxy-02
  • 15:33 andrewbogott: re-enabling the queue on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
  • 14:50 andrewbogott: killing remaining jobs on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01

2015-08-15

  • 05:14 andrewbogott: resumed tools-exec-gift, seems not to have been the culprit
  • 05:10 andrewbogott: suspending tools-exec-gift, just for a moment...

2015-08-14

  • 17:21 andrewbogott: disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
  • 15:20 andrewbogott: Adding back to the grid engine queue: tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
  • 14:43 andrewbogott: killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407

2015-08-13

  • 18:51 valhallasw`cloud: which was resolved by scfc earlier
  • 18:50 valhallasw`cloud: tools-exec-1201/Puppet staleness was critical due to an agent lock (Ignoring stale puppet agent lock for pid
    Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists))
  • 18:08 scfc_de: scfc@tools-exec-1201: Removed stale /var/lib/puppet/state/agent_catalog_run.lock; Puppet run was started Aug 12 15:06:08, instance was rebooted ~ 15:14.
  • 16:44 andrewbogott: disabling job queue for tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
  • 14:48 andrewbogott: and tools-webgrid-lighttpd-1408
  • 14:48 andrewbogott: rescheduling (and in some cases killing) jobs on tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405

2015-08-12

  • 16:05 andrewbogott: depooling tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1408
  • 15:20 valhallasw`cloud: re-enabling queues on restarted hosts
  • 14:41 andrewbogott: forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410

2015-08-11

  • 18:17 andrewbogott: depooling tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410 in anticipation of labvirt1001 reboot tomorrow

2015-08-04

  • 13:43 scfc_de: Fixed owner of ~tools.kasparbot/error.log (T99576).

2015-08-03

  • 19:13 andrewbogott: deleted tools-static-01

2015-08-01

  • 18:09 andrewbogott: depooling/rebooting tools-webgrid-lighttpd-1407 because it’s unable to fork
  • 16:54 scfc_de: tools-webgrid-lighttpd-1407: Removed exim paniclog (OOM).

2015-07-30

  • 15:00 andrewbogott: rebooting tools-bastion-01 aka tools-login
  • 14:46 scfc_de: tools-webgrid-lighttpd-1408, tools-webgrid-lighttpd-1409: Removed exim paniclog (OOM).
  • 02:53 scfc_de: "webservice uwsgi-python start" for blogconverter.
  • 02:40 scfc_de: qdel 545479 (hazard-bot, "release=trusty-quiet", stuck since July 9th).
  • 02:39 scfc_de: qdel 301895 (projanalysis, "release=trust", stuck since July 1st).
  • 02:38 scfc_de: tools-webgrid-generic-1401, tools-webgrid-generic-1402, tools-webgrid-generic-1403: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
  • 01:41 scfc_de: tools-webgrid-lighttpd-1406: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).

2015-07-29

  • 23:43 andrewbogott: draining, rebooting tools-webgrid-lighttpd-1408
  • 20:11 andrewbogott: rebooting tools-webgrid-lighttpd-1404
  • 19:58 scfc_de: tools-*: sudo rmdir /etc/ssh/userkeys/ubuntu{/.ssh{/authorized_keys\ {/public{/keys{/ubuntu{/.ssh,},},},},},}
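
The rmdir incantation above leans on nested shell brace expansion: each X{Y,} expands to "XY X", so nested braces emit the deepest path first and every parent after it, letting one rmdir remove the whole chain of empty directories inner-to-outer:

    $ echo a{/b{/c,},}
    a/b/c a/b a
    # hence: rmdir a/b/c a/b a  -- each directory is empty by the time rmdir reaches it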

2015-07-28

  • 17:49 valhallasw`cloud: Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server
  • 17:43 valhallasw`cloud: rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty
  • 17:16 valhallasw`cloud: disabled queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs"
  • 02:07 YuviPanda: removed pacct files from tools-bastion-01

2015-07-27

  • 21:27 valhallasw`cloud: turned off process accounting on tools-login while we try to find the root cause of phab:T107052:
    accton off

2015-07-19

  • 01:51 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).

2015-07-11

  • 00:01 mutante: fixing puppet runs on tools-webgrid-* via salt

2015-07-10

  • 23:59 mutante: fixing puppet runs on tools-exec via salt
  • 20:09 valhallasw`cloud: it took three of us, but adminbot is updated!

July 6

  • 09:49 valhallasw`cloud: 10:14 <jynus> s51053 is abusing his/her access to replica dbs and creating lag for other users. His/her queries are to be terminated. (= tools.jackbot / user jackpotte)

July 2

  • 17:07 valhallasw`cloud: can't log in to tools-mailrelay-01, probably because puppet was disabled for too long. Deleting instance.
  • 16:12 valhallasw`cloud: I mean tools-bastion-01
  • 16:12 valhallasw`cloud: stopping puppet on tools-login and tools-mail to check for changes in deploying https://gerrit.wikimedia.org/r/#/c/205914/

June 29

  • 17:29 YuviPanda: failed over tools webproxy to tools-webproxy-02

June 21

  • 18:57 scfc_de: tools-precise-dev: apt-get purge python-ldap3 (the previous fix for "Cache has broken packages, exiting" didn't work).
  • 16:39 scfc_de: tools-precise-dev: apt-get clean ("Cache has broken packages, exiting").
  • 16:33 scfc_de: tools-submit: Removed exim4 paniclog (OOM).

June 19

  • 15:07 YuviPanda: remounting /data/scratch

June 10

  • 11:52 YuviPanda: tools-trusty be gone

June 8

  • 16:31 YuviPanda: added Nova Tools Bot as admin, for automated nova API access

June 7

  • 17:05 YuviPanda: killed sort /data/project/templatetiger/public_html/dumps/ruwiki-2015-03-24.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/ruwiki-2015-03-24.txt -T /data/project/templatetiger to rescue NFS

June 5

  • 17:44 YuviPanda: migrate tools-shadow to labvirt1002

June 2

  • 18:34 Coren: rebooting tools-webgrid-lighttpd-1406.eqiad.wmflabs
  • 16:27 YuviPanda: cleaned out /etc/hosts file on tools-shadow
  • 16:20 Coren: switching back to tools-master
  • 16:10 YuviPanda: restart nscd on tools-submit
  • 15:54 Coren: Switching names for tools-exec-1401
  • 15:43 Coren: adding the "new" exec nodes (aka, current nodes with new names)
  • 14:34 YuviPanda: turned off dnsmasq for toollabs
  • 13:54 Coren: adding new-style names for submit hosts
  • 13:53 YuviPanda: moved tools-master / shadow to designate
  • 13:52 Coren: new-style names for gridengin admin hosts added
  • 13:28 Coren: sge_shadowd started a new master as expected, after /two/ timeouts of 60s (unexpected)
  • 13:23 Coren: stracing the shadowd to see what's up; master is down as expected.
  • 13:17 Coren: killing the sge_qmaster to test failover
  • 12:56 YuviPanda: switched labs webproxies to designate, forcing puppet run and restarting nscd

May 29

  • 13:39 YuviPanda: tools-redis-01 is redis master now
  • 13:35 YuviPanda: enable puppet on all hosts, redis move-around completed
  • 13:01 YuviPanda: recreating tools-redis-01 and -02
  • 12:52 YuviPanda: disable puppet on all toollabs hosts for tools-redis update
  • 12:27 YuviPanda: created two redis instances (tools-redis-01 and tools-redis-02), beginning to set up stuff

May 28

  • 12:22 wm-bot: petrb: inserted some local IPs into hosts file
  • 12:15 wm-bot: petrb: shutting nscd off on tools-master
  • 12:14 wm-bot: petrb: test
  • 11:28 petan: syslog is full of these May 28 11:27:36 tools-master nslcd[1041]: [81823a] <group=550> error writing to client: Broken pipe
  • 11:25 petan: rebooted tools-master in order to try fix that network issues

May 27

  • 20:10 LostPanda: disabled puppet on tools-shadow too
  • 19:46 LostPanda: echo -n 'tools-master.eqiad.wmflabs' > /var/lib/gridengine/default/common/act_qmaster haaail someone?
  • 19:10 YuviPanda: reverted gridengine-common on tools-shadow to 6.2u5-4 as well, to match tools-master
  • 18:58 YuviPanda: rebooting tools-master after switchover failed and it cannot seem to do DNS

May 23

  • 19:56 scfc_de: tools-webgrid-lighttpd-1410: Removed exim4 paniclog (OOM).

May 22

  • 20:37 yuvipanda: deleted and depooled tools-exec-07

May 20

  • 20:09 yuvipanda: transient shinken puppet alerts because I tried to force puppet runs on all tools hosts but cancelled
  • 20:01 yuvipanda: enabling puppet on all hosts
  • 20:01 yuvipanda: tested new /etc/hosts on tools-bastion-01, puppet run produced no diffs, all good
  • 19:56 yuvipanda: copy cleaned up and regenerated /etc/hosts from tools-precise-dev to all toollabs hosts
  • 19:54 yuvipanda: copy cleaned up hosts file to /etc/hosts on tools-precise-dev
  • 19:54 yuvipanda: enabled puppet on tools-precise-dev
  • 19:33 yuvipanda: disabling puppet on *all* hosts for https://gerrit.wikimedia.org/r/#/c/210000/
  • 06:21 yuvipanda: killed a bunch of webservice jobs stuck in dRr state

May 19

  • 21:06 yuvipanda: failed over services to tools-services-02, -01 was refusing to start some webservices with permission denied errors for setegid
  • 20:16 yuvipanda: qdel -f for all webservice jobs that were in dr state
  • 20:12 yuvipanda: force killed croptool webservice

May 18

  • 01:36 yuvipanda: created new tools-checker-01, applying role and provisioning
  • 01:32 yuvipanda: killed tools-checker-01 instance, recreating

May 15

  • 12:06 valhallasw: killed those perl scripts; kmlexport's lighttpd is also using excessive memory (5%), so restarting that
  • 12:01 valhallasw: webgrid-lighttpd-1402 puppet failure caused by major memory usage; tools.kmlexport is running heavy perl scripts
  • 00:27 yuvipanda: cleared graphite data for /var/* mounts on tools-redis

May 14

  • 21:53 valhallasw: shut down & removed "tools-exec-08.eqiad.wmflabs" from execution host list
  • 21:11 valhallasw: forced rescheduling of (non-cont) welcome.py job (iluvatarbot, jobid 8869)
  • 03:29 yuvipanda: drained, depooled and deleted tools-exec-15

May 10

  • 22:08 yuvipanda: created tools-precise-dev instance
  • 09:28 yuvipanda: cleared and depooled tools-exec-02 and -13. only job running was deadlocked for a long, long time (week)
  • 05:47 scfc_de: tools-submit: Removed paniclog (OOM) and stopped apache2.

May 5

  • 18:50 Betacommand: helperbot (WP:AIV bot) running logged out and its owner is MIA; Coren killed the job on -1204 and commented out the crontab

May 4

  • 21:24 yuvipanda: reboot tools-submit, was stuck

May 2

  • 10:21 yuvipanda: drained all the old webgrid nodes, pooled in all the new webgrid nodes! POTATO!
  • 10:13 yuvipanda: cleaned out webgrid jobs from tools-webgrid-03
  • 10:12 yuvipanda: pooled tools-webgrid-lighttpd-{06-10}
  • 08:56 yuvipanda: drained and deleted tools-webgrid-01
  • 07:31 yuvipanda: depooled and deleted tools-webgrid-{01,02}
  • 07:31 yuvipanda: disabled catmonitor task / cron, was heavily using an sqlite db on NFS
  • 06:56 yuvipanda: pooled tools-webgrid-generic-{01-04}
  • 03:44 yuvipanda: drained and deleted old trusty webgrid tools-webgrid-{05-07}
  • 02:13 yuvipanda: created tools-webgrid-lighttpd-12{01-05} and tools-webgrid-generic-14{01-04}
  • 01:59 yuvipanda: created tools-webgrid-lighttpd-14{01-10}
  • 01:58 yuvipanda: increased tools instance quota

May 1

  • 03:55 YuviKTM: depooled and deleted tools-exec-20
  • 03:54 YuviKTM: killed final job in tools-exec-20 (9911317), decommissioning node

April 30

  • 19:33 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
  • 19:31 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
  • 06:30 YuviKTM: added public IPs for all exec nodes so IRC tools continue to work. Removed all associated hostnames, let’s not do those
  • 06:13 YuviKTM: allocating new floating IPs for the new instances, because IRC bots need them.
  • 05:42 YuviKTM: disabled and drained tools-exec-1{1-5} of continuous jobs
  • 05:40 YuviKTM: pooled in tools-exec-121{1-9}
  • 05:39 YuviKTM: rebooted tools-exec-121{1-9} instances so they can apply gridengine-common properly
  • 05:39 YuviKTM: created new instances tools-exec-121{1-9} as precise
  • 05:39 YuviKTM: killed tools-dev, nobody still ssh’d in, no crontabs
  • 05:39 YuviKTM: depooled exec-{06-10}, rejigged jobs to newer nodes
  • 05:39 YuviKTM: delete tools-exec-10, was out of jobs
  • 04:28 YuviKTM: deleted tools-exec-09
  • 04:27 YuviKTM: depooled tools-exec-09.eqiad.wmflabs
  • 04:23 YuviKTM: repooled tools-exec-1201 is all good now
  • 04:19 YuviKTM: rejuggle jobs again in trustyland
  • 04:14 YuviKTM: repooled tools-exec-09, apt troubles fixed
  • 04:08 YuviKTM: depooled tools-exec-09, apt troubles
  • 04:04 YuviKTM: pooled tools-exec-1408 and tools-exec-1409
  • 04:00 YuviKTM: pooled tools-exec-1406 and 1407
  • 03:58 YuviKTM: pooled tools-exec-12{02-10}, forgot to put appropriate roles on 1201, fixing now
  • 03:54 YuviKTM: tools-exec-03 and -04 have been deleted a long time ago
  • 03:53 YuviKTM: depooled tools-exec-03 / 04
  • 03:31 YuviKTM: depooled and deleted tools-exec-12 had nothing on it
  • 03:28 YuviKTM: deleted tools-exec-21 to 24, one task still running on tools-exec
  • 03:24 YuviKTM: disabled and drained continuous tasks off tools-exec-20 to tools-exec-24
  • 03:18 YuviKTM: pooled tools-exec-1403, 1404
  • 03:13 YuviKTM: pooled tools-exec-1402
  • 03:07 YuviKTM: pooled tools-exec-1405
  • 03:04 YuviKTM: pooled tools-exec-1401
  • 02:53 YuviKTM: created tools-exec-14{06-10}
  • 02:14 YuviKTM: created tools-exec-14{01-05}
  • 01:09 YuviPanda: killing local copy of python-requests, there seems to be a newer version in prod

April 29

  • 19:33 valhallasw`cloud: re-created tools-mailrelay-01 with precise: Nova_Resource:I-00000bca.eqiad.wmflabs
  • 19:30 YuviPanda: set appropriate classes for recreated tools-exec-12* nodes
  • 19:28 YuviPanda: recreated tools-static-02
  • 19:11 YuviPanda: failed over tools-static to tools-static-01
  • 14:47 andrewbogott: deleting tools-exec-04
  • 14:44 Coren: -exec-04 drained; removed from queues. Rest well, old friend.
  • 14:41 Coren: disabled -exec-04 (going away)
  • 02:35 YuviPanda: set tools-exec-12{01-10} to configure as exec nodes
  • 02:27 YuviPanda: created tools-exec-12{01-10}

April 28

  • 21:41 andrewbogott: shrinking tools-master
  • 21:33 YuviPanda: failover is going to take longer than actual recompression for tools-master, so let’s just recompress. tools-shadow should take over automatically if that doesn’t work
  • 21:32 andrewbogott: shrinking tools-redis
  • 21:28 YuviPanda: attempting to failover gridengine to tools-shadow
  • 21:27 andrewbogott: shrinking tools-submit
  • 21:21 YuviPanda: backup crontabs onto NFS
  • 21:18 andrewbogott: shrinking tools-webproxy-02
  • 21:14 andrewbogott: shrinking tools-static-01
  • 21:11 andrewbogott: shrinking tools-exec-gift
  • 21:06 YuviPanda: failover tools-webproxy to tools-webproxy-01
  • 21:06 andrewbogott: stopping, shrinking and starting tools-exec-catscan
  • 21:01 YuviPanda: failover tools-static to tools-static-02
  • 20:53 andrewbogott: stopping, shrinking, restarting tools-shadow
  • 20:43 andrewbogott: stopping, shrinking, starting tools-static-02
  • 20:39 valhallasw`cloud: created tools-mailrelay-01 Nova_Resource:I-00000bac.eqiad.wmflabs
  • 20:26 YuviPanda: failed over tools-services to services-01
  • 18:11 Coren: reenabled -webgrid-generic-02
  • 18:05 Coren: reenabled -webgrid-03, -webgrid-08, -webgrid-generic-01; drained -webgrid-generic-02
  • 17:44 Coren: -webgrid-03, -webgrid-08 and -webgrid-generic-01 drained
  • 14:04 Coren: reenable -exec-11 for jobs.
  • 13:55 andrewbogott: stopping tools-exec-11 for a resize experiment

April 25

  • 01:32 YuviPanda: deleted tools-static, tools-static-01 has taken over
  • 01:02 YuviPanda: deleted tools-login, tools-bastion-01 has been running for long enough

April 24

  • 16:29 Coren: repooled -exec-02, -08, -12
  • 16:05 Coren: -exec-02, -08 and -12 draining
  • 15:54 Coren: reenabled tools-exec-07, -10 and -11 after reboot of host
  • 15:41 Coren: -exec-03 goes away for good.
  • 15:31 Coren: draining -exec-03 to ease migration
  • 13:43 Coren: draining tools-exec-07,10,11 to allow virt host reboot

April 23

  • 22:41 YuviPanda: disabled *@tools-exec-09
  • 22:40 YuviPanda: add tools-exec-09 back to @general
  • 22:38 YuviPanda: take tools-exec-09 from @general group
  • 20:53 YuviPanda: restart bigbrother
  • 20:28 YuviPanda: restarted nscd on tools-login and tools-dev
  • 20:22 valhallasw`cloud: removed 10.68.16.4 tools-webproxy tools.wmflabs.org from /etc/hosts
  • 13:17 andrewbogott: beginning migration of tools instances to labvirt100x hosts
  • 01:00 YuviPanda: good bye tools-login.eqiad.wmflabs

April 20

  • 13:38 scfc_de: tools-mail: Removed paniclog and killed superfluous exim.

April 18

  • 20:09 YuviPanda: sysctl vm.overcommit_memory=1 on tools-redis to allow it to bgsave again
  • 19:52 valhallasw`cloud: tools-redis unresponsive (T96485); rebooting
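
Background for the 20:09 entry above: with vm.overcommit_memory=0 the kernel can refuse the fork() Redis needs for BGSAVE once the dataset is large, so Redis itself recommends overcommit mode 1. A sketch for applying it now and persistently; the sysctl.d filename is illustrative:

    sysctl -w vm.overcommit_memory=1    # immediate: lets redis fork for bgsave again
    echo 'vm.overcommit_memory = 1' > /etc/sysctl.d/60-redis-overcommit.conf    # persist across reboots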

April 17

  • 01:48 YuviPanda: disable puppet on live webproxy (-01) to apply firewall changes to -02

April 16

  • 20:57 Coren: -webgrid-08 drained, rebooting
  • 20:46 Coren: -webgrid-03 repooled, depooling -webgrid-08
  • 20:45 Coren: -webgrid-03 drained, rebooting
  • 20:38 Coren: -webgrid-03 depooled
  • 20:38 Coren: -webgrid-02 repooled
  • 20:35 Coren: -webgrid-02 drained, rebooting
  • 20:33 Coren: -webgrid-02 depooled
  • 20:32 Coren: -webgrid-01 repooled
  • 20:06 Coren: -webgrid-01 drained, rebooting.
  • 19:56 Coren: depooling -webgrid-01 for reboot
  • 14:37 Coren: rebooting -master
  • 14:29 Coren: rebooting -mail
  • 14:22 Coren: rebooting -shadow
  • 14:22 Coren: -exec-15 repooled
  • 14:19 Coren: -exec-15 drained, rebooting.
  • 13:46 Coren: -exec-14 repooled. That's it for general exec nodes.
  • 13:44 Coren: -exec-14 drained, rebooting.

April 15

  • 21:06 Coren: -exec-10 repooled
  • 20:55 Coren: -exec-10 drained, rebooting
  • 20:49 Coren: -exec-07 repooled.
  • 20:47 Coren: -exec-07 drained, rebooting
  • 20:43 Coren: -exec-06 requeued
  • 20:41 Coren: -exec-06 drained, rebooting
  • 20:15 Coren: repool -exec-05
  • 20:10 Coren: -exec-05 drained, rebooting.
  • 19:56 Coren: -exec-04 repooled
  • 19:52 Coren: -exec-04 drained, rebooting.
  • 19:41 Coren: disabling new jobs on remaining (exec) precise instances
  • 19:32 Coren: repool -exec-02
  • 19:30 Coren: draining -exec-04
  • 19:29 Coren: -exec-02 drained, rebooting
  • 19:28 Coren: -exec-03 rebooted, requeing
  • 19:26 Coren: -exec-03 drained, rebooting
  • 18:50 Coren: dequeuing tools-exec-03 whilst waiting for -02 to drain.
  • 18:43 Coren: tools-exec-01 back sans idmap, returning to pool
  • 18:40 Coren: tools-exec-01 drained of jobs; rebooting
  • 18:39 YuviPanda: disabled puppet on running webproxy, tools-webproxy-01
  • 18:25 Coren: disabled -exec-01 and -exec-02 to new jobs.

April 14

  • 13:13 scfc_de: tools-submit: Removed exim paniclog (OOM doom).
  • 13:13 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

April 13

  • 21:11 YuviPanda: restart portgranter on all webgrid nodes

April 12

  • 10:52 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 11

  • 21:49 andrewbogott: moved /data/project/admin/toollabs to /data/project/admin/toollabsbak on tools-webproxy-01 and tools-webproxy-02 to fix permission errors
  • 02:15 YuviPanda: rebooted tools-submit, was not responding

April 10

  • 07:10 PissedPanda: take out tools-services-01 to test switchover and also to recreate as small
  • 05:20 YuviPanda: delete the tomcat node finally :D

April 9

  • 23:24 scfc_de: rm -f /puppet_{host,service}groups.cfg on all hosts (apparently a Puppet/hiera mishap last November).
  • 23:11 scfc_de: tools-webgrid-04: Rescheduled all jobs running on this instance (T95537).
  • 08:32 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

April 8

  • 13:25 scfc_de: Repaired servicegroups repository and restarted toolhistory job; was stuck at 2015-03-29T09:15:05Z (NFS?).
  • 12:01 scfc_de: Removed empty tools with no maintainers javed/javedbaker/shell.
  • 09:10 scfc_de: Removed stale proxy entries for analytalks/anno/commons-coverage/coursestats/eagleeye/hashtags/itwiki/mathbot/nasirkhanbot/rc-vikidia/wikistream.

April 7

  • 07:42 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

April 5

  • 10:11 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 4

  • 22:48 scfc_de: Removed zombie jobs (qdel 1991607,1994800,1994826,1994827,2054201,3449476,3450329,3451518,3451549,3451590,3451628,3451635,3451830,3451869,3452632,3452633,3452654,3452655,3452657,3452668,4218785,4219210,4219674,4219722,4219791,4219923,4220646).
  • 08:49 scfc_de: tools-submit: Restarted bigbrother because it didn't notice admin's .bigbrotherrc.
  • 08:49 scfc_de: Add webservice to .bigbrotherrc for admin tool.
  • 03:35 scfc_de: Deployed jobutils/misctools 1.5 (T91954).

April 3

  • 22:55 scfc_de: Removed empty cgi-bin directories.
  • 20:35 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 2

  • 20:07 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
  • 20:06 scfc_de: tools-submit: Removed exim paniclog (OOM).
  • 01:25 YuviPanda: created tools-bastion-02

April 1

  • 00:14 scfc_de: tools-webgrid-03: Rebooted, was stuck on console input when unable to mount NFS on boot (per wikitech console output).

March 31

  • 14:02 Coren: rebooting tools-submit
  • 07:07 YuviPanda: moved tools.wmflabs.org to tools-webproxy-01
  • 07:02 YuviPanda: reboot tools-webgrid-03 and tools-exec-03
  • 00:21 andrewbogott: temporarily shutting ‘toolsbeta-pam-sshd-motd-test’ down to conserve resources. It can be restarted any time.

March 30

  • 22:53 Coren: resyncing project storage with rsync
  • 22:40 Coren: reboot tools-login
  • 22:30 Coren: also bastion2
  • 22:28 Coren: reboot bastion1 so users can log in
  • 21:49 Coren: rebooting dedicated exec nodes.
  • 21:49 Coren: rebooting tools-submit
  • 17:27 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

March 29

  • 19:30 scfc_de: tools-submit: Restarted bigbrother for T90384.

March 28

  • 19:42 YuviPanda: created tools-exec-20

March 26

  • 21:24 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 25

  • 16:49 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

March 24

  • 16:03 scfc_de: tools-login: Removed exim paniclog (entries from Sunday).
  • 15:51 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 23

  • 21:23 scfc_de: tools-login, tools-dev, tools-trusty: Now actually disabled role::labs::bastion per T93661 :-).
  • 21:08 scfc_de: tools-login, tools-dev, tools-trusty: role::labs::bastion is still enabled due to T93663.
  • 20:57 scfc_de: tools-login, tools-dev, tools-trusty: Disabled role::labs::bastion per T93661.
  • 03:02 andrewbogott: wiped out atop.log on tools-dev because /var was filling up

March 22

  • 23:08 scfc_de: qconf -ah tools-bastion-01.eqiad.wmflabs
  • 23:07 scfc_de: for host in {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}.eqiad.wmflabs; do qconf -as "$host"; done
  • 23:07 yuvipanda: copied /etc/hosts into place on tools-bastion-01

March 21

  • 16:18 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

March 15

  • 22:38 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 13

  • 16:23 YuviPanda: cleaned out / on tools-trusty

March 11

  • 04:28 YuviPanda: tools-redis is back now, as trusty and hopefully slightly more fortified
  • 04:14 YuviPanda: kill tools-redis instance, upgrade to trusty while it is down anyway
  • 03:56 YuviPanda: restarted redis server, it had OOM-killed

March 9

  • 11:02 scfc_de: Deleted probably outdated proxy entry for tool wp-signpost and restarted webservice.
  • 10:22 scfc_de: Deleted obsolete proxy entries without webservice for tools bracketbot/herculebot/extreg-wos/pirsquared/searchsbl/translate/yifeibot.
  • 10:11 scfc_de: Restarted webservices for tools blahma/catmonitor/catscan2/contributions-summary/eagleeye/imagemapedit/jackbot/tb-dev/vcat/wikihistory/xtools-ec (cf. T91939).
  • 08:27 scfc_de: qmod -cq webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs (OOM of two jobs in the past).

March 7

  • 12:17 scfc_de: Moved obsolete packages that are installed on no instance at all from /data/project/.system/deb to ~tools.admin/archived-packages.

March 6

  • 07:46 scfc_de: Set role::labs::tools::toolwatcher for tools-login.
  • 07:43 scfc_de: Deployed jobutils/misctools 1.4.

March 2

March 1

  • 15:11 YuviPanda|brb: pooled in tools-webgrid-07 to lighty webgrid, moving some tools off -05 and -06 to relieve pressure

February 28

  • 07:51 YuviPanda: create tools-webgrid-07
  • 01:00 Coren: Set vm.overcommit_memory=0 on -webgrid-05 (also trusty)
  • 01:00 Coren: Also: that was -webgrid-05
  • 00:59 Coren: set exec-06 to vm.overcommit_memory=0 for now, until the vm behaviour difference between precise and trusty can be nailed down.

February 27

  • 17:53 YuviPanda: increased quota to 512G RAM and 256 cores
  • 15:33 Coren: Switched back to -master. I'm making a note here: great success.
  • 15:27 Coren: Gridengine master failover test part three; killing the master with -9
  • 15:20 Coren: Gridengine master failover test part deux - now with verbose logs
  • 15:10 YuviPanda: created tools-webgrid-generic-02
  • 15:10 YuviPanda: increase instance quota to 64
  • 15:10 Coren: Master restarted - test not successful.
  • 14:50 Coren: testing gridengine master failover starting now
  • 08:27 YuviPanda: restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well

February 24

  • 18:33 Coren: tools-submit not recovering well from outage, kicking it.
  • 17:58 YuviPanda: rebooting *all* webgrid jobs on toollabs

February 16

  • 02:31 scfc_de: rm -f /var/log/exim4/paniclog.

February 13

  • 18:01 Coren: tools-redis is dead, long live tools-redis
  • 17:48 Coren: rebuilding tools-redis with moar ramz
  • 17:38 legoktm: redis on tools-redis is OOMing?
  • 17:26 marktraceur: restarting grrrit-wm because it's not behaving

February 1

  • 10:55 scfc_de: Submitted dummy jobs for tools ftl/limesmap/newwebtest/osm-add-tags/render/tsreports/typoscan/usersearch to get bigbrother to recognize those users and cleaned up output files afterwards.
  • 07:51 YuviPanda: cleared error state of stuck queues
  • 06:41 YuviPanda: set chmod +xw manually on /var/run/lighttpd on webgrid-05, need to investigate why it was necessary
  • 05:47 YuviPanda: completed migrating magnus' tools to trusty, more details at https://etherpad.wikimedia.org/p/tools-trusty-move
  • 05:37 YuviPanda: added tools-webgrid-06 as trusty webnode, operational now
  • 04:52 YuviPanda: migrating all of magnus’ tools, after consultation with him (https://etherpad.wikimedia.org/p/tools-trusty-move for status)
  • 04:10 YuviPanda: widar moved to trusty
  • 03:01 YuviPanda: ran salt -G 'instanceproject:tools' cmd.run 'sudo rm -rf /var/tmp/core' because disks were getting full.

January 29

  • 17:26 YuviPanda: reschedule all tomcat jobs

January 27

  • 23:27 YuviPanda: qdel -f 7662482 7661111 for Merlissimo

January 19

  • 20:51 YuviPanda: because valhallasw is nice
  • 10:34 YuviPanda: manually started tools-webgrid-generic-01
  • 09:48 YuviPanda: restarted tools-webgrid-03
  • 08:42 scfc_de: qmod -cq {continuous,mailq,task}@tools-exec-{06,10,11,15}.eqiad.wmflabs
  • 08:36 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog and killed second exim (belated SAL amendment).

January 16

  • 22:11 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.

January 15

  • 22:10 YuviPanda: created instance tools-webgrid-generic-01

January 11

  • 06:38 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.

January 8

  • 07:40 YuviPanda: increase memory limit for autolist from 4G to 7G

December 23

  • 06:00 YuviPanda: tools-uwsgi-01 randomly went to SHUTOFF state, rebooting from virt1000

December 22

  • 07:43 YuviPanda: increased RAM and Cores quota for tools

December 19

  • 16:38 YuviPanda: puppet disabled on tools-webproxy because urlproxy.lua is handhacked to remove stupid syntax errors that got merged.
  • 12:00 YuviPanda|brb: created tools-static, static http server
  • 07:07 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

December 17

  • 22:38 YuviPanda: touched /data/project/repo/Packages so tools-webproxy stops complaining about that not existing and never running apt-get

December 12

  • 14:08 scfc_de: Ran Puppet on all hosts to fix puppet-run issue.

December 11

  • 07:58 YuviPanda: rebooted tools-login, wasn’t responsive.

December 8

  • 00:15 YuviPanda: killed all db and tools-webproxy aliases in /etc/hosts for tools-webproxy, since otherwise puppet fails because ec2id thinks we’re not in labs because hostname -d is empty because we set /etc/hosts to resolve IP directly to tools-webproxy

December 7

  • 06:31 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
  • 06:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).

December 2

  • 21:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
  • 21:30 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 26

  • 19:26 YuviPanda: created tools-webgrid-05 on trusty to set up a working webnode for trusty

November 25

  • 06:53 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 24

  • 14:02 YuviPanda: rebooting tools-login, OOM'd
  • 02:51 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 22

  • 19:05 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 17

  • 20:40 YuviPanda: cleaned out /tmp on tools-login

November 16

  • 21:31 matanya: back to normal
  • 21:27 matanya: "Could not resolve hostname bastion.wmflabs.org"

November 15

  • 07:24 YuviPanda|zzz: move coredumps from tools-webgrid-04 to /home/yuvipanda

November 14

  • 20:23 YuviPanda: cleared out coredumps on tools-webgrid-01 to free up space
  • 18:26 YuviPanda: cleaned out core dumps on tools-webgrid
  • 16:55 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM).

November 13

  • 21:11 YuviPanda: disable puppet on tools-dev to check shinken
  • 21:00 scfc_de: qmod -cq continuous@tools-exec-09,continuous@tools-exec-11,continuous@tools-exec-13,continuous@tools-exec-14,mailq@tools-exec-09,mailq@tools-exec-11,mailq@tools-exec-13,mailq@tools-exec-14,task@tools-exec-06,task@tools-exec-09,task@tools-exec-11,task@tools-exec-13,task@tools-exec-14,task@tools-exec-15,webgrid-lighttpd@tools-webgrid-01,webgrid-lighttpd@tools-webgrid-02,webgrid-lighttpd@tools-webgrid-04 (fallout from /var being full).
  • 20:38 YuviPanda: didn't actually stop puppet, need more patches
  • 20:38 YuviPanda: stopping puppet on tools-dev to test shinken
  • 15:30 scfc_de: tools-exec-06, tools-webgrid-01: rm -f /var/tmp/core/*.
  • 13:31 scfc_de: tools-exec-09, tools-exec-11, tools-exec-13, tools-exec-14, tools-exec-15, tools-webgrid-02, tools-webgrid-04: rm -f /var/tmp/core/*.
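
A minimal sketch of scripting queue resets like the one above across several hosts; the host list is illustrative:

    # Clear the error state on each host's queue instances
    for h in tools-exec-09 tools-exec-11 tools-exec-13 tools-exec-14; do
        qmod -cq "continuous@$h,mailq@$h,task@$h"
    done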

November 12

  • 22:07 StupidPanda: enabled puppet on tools-exec-07
  • 21:47 StupidPanda: removed coredumps from tools-webgrid-04 to reclaim space
  • 21:45 StupidPanda: removed coredump from tools-webgrid-01 to reclaim space
  • 20:31 YuviPanda: disabling puppet on tools-exec-07 to test shinken

November 7

  • 13:56 scfc_de: tools-submit, tools-webgrid-04: rm -f /var/log/exim4/paniclog (OOM around the time of the filesystem outage).

November 6

  • 13:21 scfc_de: tools-dev: Gzipped /var/log/account/pacct.0 (804111872 bytes); looks like root had his own bigbrother instance running on tools-dev (multiple invocations of webservice per second).

November 5

  • 19:15 mutante: exec nodes have p7zip-full now
  • 10:07 YuviPanda: cleaned out pacct and atop logs on tools-login

November 4

  • 19:50 mutante: apt-get clean on tools-login, and gzipped some logs

November 1

  • 12:51 scfc_de: Removed log files in /var/log/diamond older than five weeks (pdsh -f 1 -g tools sudo find /var/log/diamond -type f -mtime +35 -ls -delete).

October 30

  • 14:37 YuviPanda: cleaned out pacct and atop logs on tools-dev
  • 06:18 paravoid: killed a "vi" process belonging to user icelabs and running for two days saturating the I/O network bandwidth, and rm'ed a 3.5T(!) .final_mg.txt.swp

October 27

  • 16:06 scfc_de: tools-mail: Killed -HUP old queue runners and restarted exim4; probably the source of paniclog's "re-exec of exim (/usr/sbin/exim4) with -Mc failed: No such file or directory".
  • 15:36 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Recreated (empty) /var/log/apache2 and /var/log/upstart.

October 26

  • 12:35 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/account.
  • 12:33 scfc_de: tools-trusty: Went through shadowed /var and rebooted.
  • 12:31 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/exim4, started exim4 and ran queue.

October 24

  • 20:31 andrewbogott: moved tools-exec-12, tools-shadow and tools-mail to virt1006

October 23

  • 22:55 Coren: reboot tools-shadow, upstart seems hosed

October 14

  • 23:22 YuviPanda|zzz: removed stale puppet lockfile and ran puppet manually on tools-exec-07

October 11

  • 15:31 andrewbogott: rebooting tools-master, stab in the dark
  • 06:01 YuviPanda: restarted gridengine-master on tools-master

October 4

  • 18:31 scfc_de: tools-mail: Deleted /usr/local/bin/collect_exim_stats_via_gmetric and root's crontab; clean-up for Ic9e0b5bb36931aacfb9128cfa5d24678c263886b

October 2

  • 17:59 andrewbogott: added Ryan back to tools admins because that turned out to not have anything to do with the bounce messages
  • 17:32 andrewbogott: removing ryan lane from tools admins, because his email in ldap is defunct and I get bounces every time something goes wrong in tools

September 28

  • 14:45 andrewbogott: rebased /var/lib/git/operations/puppet on toolsbeta-puppetmaster3

September 25

  • 14:43 YuviPanda: cleaned up ghost /var/log (from before biglogs mount) that was taking up space, /var space situation better now

September 17

  • 21:40 andrewbogott: caused a brief auth outage while messing with codfw ldap

September 15

  • 11:00 YuviPanda: tested CPU monitoring on tools-exec-12 by running stress, seems to work

September 13

  • 20:52 yuvipanda: cleaned out rotated log files on tools-webproxy

September 12

  • 21:54 jeremyb: [morebots] booted all bots, reverted to using systemwide (.deb) codebase

September 8

  • 16:08 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM @ 2014-09-07 15:13:59)

September 5

  • 22:22 scfc_de: Deleted stale nginx entries for "rightstool" and "svgcheck"
  • 22:20 scfc_de: Stopped 12 webservices for tool "meta" and started one
  • 18:50 scfc_de: geohack's lighttpd dumped core and left an entry in Redis behind; tools-webproxy: "DEL prefix:geohack"; geohack: "webservice start"

September 4

  • 19:47 lokal-profil: local-heritage: Renamed two Swedish tables

September 2

  • 04:31 scfc_de: "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all hosts in support of bug #70076
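
An iptables rule with no -j target matches packets and updates its counters without otherwise affecting traffic, so the rule above is pure DNS accounting. A sketch of reading the counters back:

    # Counting-only rule: no -j target means count, don't act
    iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53
    # Read the accumulated packet/byte counters
    iptables -L OUTPUT -v -n -x | grep 10.68.16.1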

August 23

  • 17:44 scfc_de: qmod -cq task@tools-exec-07 (job #2796555, "11  : before job")

August 21

  • 20:05 scfc_de: Deployed release 1.0.11 of jobutils and miscutils

August 15

  • 16:45 legoktm: fixed grrrit-wm
  • 16:36 legoktm: restarting grrrit-wm

August 14

  • 22:36 scfc_de: Removed again jobs in error state due to LDAP with "for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then qdel "$JOBID"; fi; done"; cf. also bug #69529
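
The same one-liner, unrolled for readability (logic unchanged):

    # Delete every Eqw job whose error is the LDAP "can't get password" failure
    for JOBID in $(qstat -u '*' | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p'); do
        if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then
            qdel "$JOBID"
        fi
    done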

August 12

  • 03:32 scfc_de: tools-exec-08, tools-exec-wmt, tools-webgrid-02, tools-webgrid-03, tools-webgrid-04: Removed stale "apt-get update" processes to get Puppet working again

August 2

  • 16:39 scfc_de: tools.mybot's crontab uses qsub without -M, added that as a temporary measure and will inform user later
  • 16:36 scfc_de: Manually rerouted mails for tools.mybot@tools-submit.eqiad.wmflabs

August 1

  • 22:41 scfc_de: Deleted all jobs in "E" state that were caused by an LDAP failure at ~ 2014-07-30 07:00Z ("can't get password entry for user [...]")

July 24

  • 20:53 scfc_de: Set SGE "mailer" parameter again for bug #61160
  • 14:51 scfc_de: Removed ignored file /etc/apt/preferences.d/puppet_base_2.7 on all hosts

July 21

  • 18:39 scfc_de: Removed stale Redis entries for currentevents, misc2svg, osm4wiki, wp-signpost, wscredits and yadfa
  • 18:38 scfc_de: Restarted webservice for stewardbots because it wasn't in Redis
  • 18:33 scfc_de: Stopped eight (!) webservices of tools.bookmanagerv2 and started one again

July 18

  • 14:29 scfc_de: admin: Set up .bigbrotherrc for toolhistory
  • 13:24 scfc_de: Made tools-webgrid-04 a grid submit host
  • 12:58 scfc_de: Made tools-webgrid-03 a grid submit host

July 16

  • 22:41 YuviPanda: reloaded nginx on tools-webproxy to pick up https://gerrit.wikimedia.org/r/#/c/146466/3
  • 15:18 scfc_de: replagstats OOMed four hours after start on May 6th; with ganglia.wmflabs.org down, not restarting
  • 15:14 scfc_de: Restarted toolhistory with 350 MBytes; OOMed June 1st

July 15

  • 11:31 scfc_de: Started webservice for sulinfo; stopped at 2014-06-29 18:31:04

July 14

  • 20:40 andrewbogott: on tools-login
  • 20:39 andrewbogott: manually deleted /var/lib/apt/lists/lock, forcing apt to update

July 13

  • 13:13 scfc_de: tools-exec-13: Moved /var/log around, reboot, iptables-restore & reenabled queues
  • 13:11 scfc_de: tools-exec-12: Moved /var/log around, reboot & iptables-restore

July 12

  • 17:57 scfc_de: tools-exec-11: Stopping apache2 service; no clue how it got there
  • 17:53 scfc_de: tools-exec-11: Moved log files around, rebooted, restored iptables and reenabled queue ("qmod -e {continuous,task}@tools-exec-11...")
  • 13:00 scfc_de: tools-exec-11, tools-exec-13: qmod -r continuous@tools-exec-1[13].eqiad.wmflabs in preparation of reboot
  • 12:58 scfc_de: tools-exec-11, tools-exec-13: Disabled queues in preparation of reboot
  • 11:58 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: mkdir -m 2750 /var/log/exim4 && chown Debian-exim:adm /var/log/exim4; I'll file a bug later about why the directory wasn't created

July 11

  • 11:59 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: cp -f /data/project/.system/hosts /etc/hosts

July 10

  • 20:35 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: iptables-restore /data/project/.system/iptables.conf
  • 16:00 YuviPanda: manually removed mariadb remote repo from tools-exec-12 instance, won't be added to new instances (puppet patch was merged)
  • 01:33 YuviPanda|zzz: tools-exec-11 and tools-exec-13 have been added to the @general hostgroup

July 9

  • 23:14 YuviPanda: applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13
  • 23:09 YuviPanda: created tools-exec-13 with precise
  • 23:08 YuviPanda: created tools-exec-12 as trusty by accident, will keep on standby for testing
  • 23:07 YuviPanda: created tools-exec-12
  • 23:06 YuviPanda: created tools-exec-11
  • 19:23 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again
  • 14:12 scfc_de: tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later
  • 13:33 scfc_de: tools-exec-cyberbot: Freed 402398 inodes ...
  • 12:50 scfc_de: tools-exec-cyberbot: "find /tmp -maxdepth 1 -type f -name \*cyberbotpeachy.cookies\* -mtime +30 -delete" as a first step
  • 12:40 scfc_de: tools-exec-cyberbot: Root partition has run out of inodes
  • 12:34 scfc_de: tools-exec-gift: Forgot to log yesterday: The problems were due to overload (load >> 150); SGE shouldn't have allowed that
  • 12:28 YuviPanda: cleaned out old diamond archive logs on tools-master
  • 12:28 YuviPanda: cleaned out old diamond archive logs on tools-webgrid-04
  • 12:25 YuviPanda: cleaned out old diamond archive logs from tools-exec-08

July 8

  • 20:57 scfc_de: tools-exec-gift: Puppet hangs due to "apt-get update" not finishing in time; manual runs of the latter take forever
  • 19:52 scfc_de: tools-exec-wmt, tools-shadow: Removed stale Puppet lock files and reran manually (handy: "sudo find /var/lib/puppet/state -maxdepth 1 -type f -name agent_catalog_run.lock -ls -ok rm -f \{\} \; -exec sudo puppet agent -tv \;")
  • 18:09 scfc_de: tools-webgrid-03, tools-webgrid-04: killall -TERM gmond (bug #64216)
  • 17:57 scfc_de: tools-exec-08, tools-exec-09, tools-webgrid-02, tools-webgrid-03: Removed stale Puppet lock files and reran manually
  • 17:26 scfc_de: tools-tcl-test: Rebooted because system said so
  • 17:04 YuviPanda: webservice start on tools.meetbot since it seemed down
  • 14:55 YuviPanda: cleaned out old diamond archive logs on tools-webproxy
  • 13:39 scfc_de: tools-login: rm -f /var/log/exim4/paniclog ("daemon: fork of queue-runner process failed: Cannot allocate memory")

July 6

  • 12:09 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog after I20afa5fb2be7d8b9cf5c3bf4018377d0e847daef got merged

July 5

July 4

  • 08:51 scfc_de: tools-exec-08 (some hours ago): rm -f /var/log/diamond/* && restart diamond
  • 00:02 scfc_de: tools-master: rm -f /var/log/diamond/* && restart diamond

July 3

  • 16:59 Betacommand: Coren: It may take a while though; what the catscan queries was blocking is a DDL query changing the schema and that pauses replication.
  • 16:58 Betacommand: Coren: transactions over 30ks killed; the DB should start catching up soon.
  • 14:37 Betacommand: replication for enwiki is halted; current lag is at 9876

July 2

  • 00:21 YuviPanda: restarted diamond on almost all nodes to stop sending nfs stats, some still need to be flushed
  • 00:21 YuviPanda: restarted diamond on all exec nodes to stop sending nfs stats

July 1

  • 23:09 legoktm: tools-pywikibot: started the webservice; don't know why it wasn't running
  • 21:08 scfc_de: Reset queues in error state again
  • 17:51 YuviPanda: tools-exec-04 removed stale pid file and force puppet run
  • 16:07 YuviPanda: applied biglogs to tools-exec-02 and rejigged things
  • 15:54 YuviPanda: tools-exec-02 removed stale puppet pid file, forcing run
  • 15:51 Coren: adjusted resource limits for -exec-07 to match the smaller instance size.
  • 15:50 Coren: created logfile disk for -exec-07 by hand (smaller instance)
  • 01:53 YuviPanda: tools-exec-10 applied biglogs, moved logs around, killed some old diamond logs
  • 01:41 YuviPanda: tools-exec-03 restarted diamond, atop, exim4, ssh to pick up new log partition
  • 01:40 YuviPanda: tools-exec-03 applied biglogs, moved logs around, killed some old diamond logs
  • 01:34 scfc_de: tools-exec-03, tools-exec-10: Removed /var/log/diamond/diamond.log, restarted diamond and bzip2'ed /var/log/diamond/*.log.2014*

June 30

  • 22:10 YuviPanda: ran webservice start for enwp10
  • 22:06 YuviPanda: stale lockfile in tools-login as well, removing and forcing puppet run
  • 22:01 YuviPanda: removed stale lockfile for puppet, forcing run
  • 19:58 YuviPanda|food: added tools-webgrid-04 to webgrid queue, had to start portgranter manually
  • 17:43 YuviPanda: created tools-webgrid-04, applying webnode role and running puppet
  • 17:27 YuviPanda: created tools-webgrid-03 and added it to the queue

June 29

  • 19:45 scfc_de: magnustools: "webservice start"
  • 18:24 YuviPanda: rebooted tools-webgrid-02. Could not ssh, was dead

June 28

  • 21:07 YuviPanda: removed alias for tools-webproxy and tools.wmflabs.org from /etc/hosts on tools-webproxy

June 21

  • 20:09 scfc_de: Created tool mediawiki-mirror (yuvipanda + Nemo_bis) and chown'ed & chmod o-w /shared/mediawiki

June 20

  • 21:01 scfc_de: tools-webgrid-tomcat: Added to submit host list with "qconf -as" for bug #66882
  • 14:47 scfc_de: Restarted webservice for mono; cf. bug #64219

June 16

  • 23:50 scfc_de: Shut down diamond services and removed log files on all hosts

June 15

  • 17:12 YuviPanda: deleted tools-mongo. MongoDB pre-allocates db files, and so allocating one db to every tool fills up the disk *really* quickly, even with 0 data. Their non-preallocating version is 'not meant for production', so putting this on hold for now
  • 16:50 scfc_de: qmod -cq cyberbot@tools-exec-cyberbot.eqiad.wmflabs
  • 16:48 scfc_de: tools-exec-cyberbot: rm -f /var/log/diamond/diamond.log && restart diamond
  • 16:48 scfc_de: tools-exec-cyberbot: No DNS entry (again)

June 13

  • 22:59 YuviPanda: "sudo -u ineditable -s" to force creation of homedir, since the user was unable to log in before. /var/log/auth.log had no record of their attempts, but it now seems to work. Strange

June 10

  • 21:51 scfc_de: Restarted diamond service on all Tools hosts to actually free the disk space :-)
  • 21:36 scfc_de: Deleted /var/log/diamond/diamond.log on all Tools hosts to free up space on /var

June 3

  • 17:50 Betacommand: Brief network outage. Source: not clearly determined yet; we aborted the investigation to roll back and restore service. As far as we can tell, there is something subtly wrong with the LACP switch configuration.

June 2

  • 20:15 YuviPanda: create instance tools-trusty-test to test nginx proxy on trusty
  • 19:00 scfc_de: zoomviewer: Set TMPDIR to /data/project/zoomviewer/var/tmp and ./webwatcher.sh; cannot see *any* temporary files being created anywhere, though. iipsrv.fcgi however has TMPDIR set as planned.

May 27

  • 18:49 wm-bot: petrb: temporarily hardcoding tools-exec-cyberbot to /etc/hosts so that host resolution works
  • 10:36 scfc_de: tools-webgrid-01: removed all files of tools.zoomviewer in /tmp
  • 10:22 scfc_de: tools-webgrid-01: /tmp was full, removed files of tools.zoomviewer older than five days
  • 07:52 wm-bot: petrb: restarted webservice of tool admin in order to purge that huge access.log

May 25

  • 14:27 scfc_de: tools-mail: "rm -f /var/log/exim4/paniclog" to leave only relay_domains errors

May 23

  • 14:14 andrewbogott: rebooting tools-webproxy so that services start logging again
  • 14:10 andrewbogott: applying role::labs::lvm::biglogs on tools-webproxy because /var/log was full and causing errors

May 22

  • 02:45 scfc_de: tools-mail: Enabled role::labs::lvm::biglogs, moved data around & rebooted.
  • 02:36 scfc_de: tools-mail: Removed all jsub notifications from hazard-bot from queue.
  • 01:46 scfc_de: hazard-bot: Disabled minutely cron job github-updater
  • 01:36 scfc_de: tools-mail: Freezing all messages to Yahoo!: "421 4.7.1 [TS03] All messages from 208.80.155.162 will be permanently deferred; Retrying will NOT succeed. See http://postmaster.yahoo.com/421-ts03.html"
  • 01:12 scfc_de: tools-mail: /var is full

May 20

  • 18:34 YuviPanda: back to home-rolled nginx 1.5 on proxy; newer versions were causing too many issues

May 16

  • 17:01 scfc_de: tools-webgrid-02: rm -f /tmp/core (tools.misc2svg, May 13 06:10, 3861106688)

May 14

  • 16:31 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis
  • 00:23 Betacommand: 503's related to bug 65179
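
A readable sketch of the Redis lockdown rule logged above at 16:31, plus persisting it; the save path is an assumption based on the iptables-restore commands used elsewhere in this log:

    # Reject Redis (6379) connections that do not originate from localhost
    iptables -A INPUT -p tcp ! -s 127.0.0.0/8 --dport 6379 -j REJECT
    # Assumed: persist into the file later fed to iptables-restore
    iptables-save > /data/project/.system/iptables.conf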

May 13

  • 20:36 YuviPanda: restarting redis on tools-webproxy fixed 503s
  • 20:36 valhallasw: redis failed, causing tools-webproxy to throw 503's
  • 19:09 marktraceur: Restarted grrrit because it had a stupid nick

May 10

  • 14:50 YuviPanda: upgraded nginx to 1.7.0 on tools-webproxy to get SPDY/3.1

May 9

  • 13:16 scfc_de: Cleared error state of queues {continuous,mailq,task}@tools-exec-06 and webgrid-lighttpd; no obvious or persistent causes

May 6

  • 19:31 scfc_de: replagstats fixed; Ganglia graphs are now under the virtual host "tools-replags"
  • 17:53 scfc_de: Don't think replagstats is really working ...
  • 16:40 scfc_de: Moved ~scfc/bin/replagstats to ~tools.admin/bin/ and enabled as a continuous job (cf. also bug #48694).

April 28

  • 11:51 YuviPanda: pywikibugs: deployed bf1be7b

April 27

  • 13:34 scfc_de: Restarted webservice for geohack and moved {access,error}.log to {access,error}.log.1

April 24

  • 23:39 YuviPanda: restarted grrrit-wm, not greg-g. greg-g does not survive restarts and hence care must be taken to make sure he is not.
  • 23:38 YuviPanda: restarted greg-g after cherry-picking aec09a6 for auth of IRC bot
  • 23:33 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/129610
  • 13:07 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (relay_domains bug)

April 20

  • 14:27 scfc_de: tools-redis: Set role::labs::lvm::mnt and $lvm_mount_point=/var/lib, moved the data around and rebooted
  • 14:08 scfc_de: tools-redis: /var is full
  • 08:59 legoktm: grrrit-wm: 2014-04-20T08:28:15.889Z - error: Caught error in redisClient.brpop: Redis connection to tools-redis:6379 failed - connect ECONNREFUSED
  • 08:48 legoktm: Your job 438884 ("lolrrit-wm") has been submitted
  • 08:47 legoktm: [01:28:28] * grrrit-wm has quit (Remote host closed the connection)

April 13

April 12

  • 23:51 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("unknown named domain list "+relay_domains"")

April 11

April 10

  • 18:20 scfc_de: tools-webgrid-01, tools-webgrid-02: "kill -HUP" all php-cgis that are not (grand-)children of lighttpd processes

April 8

  • 05:06 Ryan_Lane: restart nginx on tools-proxy-test
  • 05:03 Ryan_Lane: upgraded libssl on all nodes

April 4

  • 15:48 Coren: Moar powar!!1!one: added two exec nodes (-09 -10) and one webgrid node (-02)
  • 11:11 scfc_de: Set /data/project/.system/config/wikihistory.workers to 20 on apper's request

March 30

  • 18:16 scfc_de: Removed empty directories /data/project/{d930913,sudo-test{,-2},testbug{,2,3}}: Corresponding service groups don't exist (anymore)
  • 18:13 scfc_de: Removed /data/project/backup: Only empty dynamic-proxy backup files of January 3rd and earlier

March 29

  • 10:14 wm-bot: petrb: disabled 1 cron job of user tools.tools-info on -login which was killing the login server

March 28

  • 11:53 wm-bot: petrb: did the same on -mail server (removed /var/log/exim4/paniclog) so that we don't get spam every day
  • 11:51 wm-bot: petrb: removed content of /var/log/exim4/paniclog
  • 11:49 wm-bot: petrb: disabled default vimrc which everybody hates on -login

March 21

  • 16:50 scfc_de: tools-login: pkill -u tools.bene (OOM)
  • 16:13 scfc_de: rmdir /home/icinga (totally empty, "drwxr-xr-x 2 nemobis 50383 4096 Mar 17 16:42", perhaps an artifact of mass migration?)
  • 15:49 scfc_de: sudo cp -R /etc/skel /home/csroychan && sudo chown -R csroychan.wikidev /home/csroychan; that should close [[bugzilla:62132]]
  • 15:15 scfc_de: sudo cp -R /etc/skel /home/annabel && sudo chown -R annabel.wikidev /home/annabel
  • 15:14 scfc_de: sudo chown -R torin8.wikidev /home/torin8

March 20

  • 18:36 scfc_de: Pointed tools-dev.wmflabs.org at tools-dev.eqiad.wmflabs; cf. [[Bugzilla:62883]]

March 5

  • 13:57 wm-bot: petrb: test

March 4

  • 22:35 wm-bot: petrb: uninstalling it from -login too
  • 22:32 wm-bot: petrb: uninstalling apache2 from tools-dev; it has nothing to do there

March 3

  • 19:20 wm-bot: petrb: shutting down almost all services on webserver-02 in order to make the system usable and finish the upgrade
  • 19:17 wm-bot: petrb: upgrading all packages on webserver-02
  • 19:15 petan: rebooting webserver-01 which is totally dead
  • 19:07 wm-bot: petrb: restarting apache on webserver-02; it complains about OOM but the server has more than 1.5g memory free
  • 19:03 wm-bot: petrb: switched local-svg-map-maker to webserver-02 because 01 is not accessible to me, hence I can't debug that
  • 16:44 scfc_de: tools-webserver-03: Apache was swamped by requests for /guc. "webservice start" for that, and pkill -HUP -u local-guc.
  • 12:54 scfc_de: tools-webserver-02: Rebooted, apache2/error.log told of OOM, though more than 1G free memory.
  • 12:50 scfc_de: tools-webserver-03: Rebooted, scripts were timing out
  • 12:42 scfc_de: tools-webproxy: Rebooted; wasn't accessible by ssh.

March 1

  • 03:42 Coren: disabled puppet in pmtpa tool labs

February 28

  • 14:46 wm-bot: petrb: extending /usr on tools-dev by 800mb
  • 00:26 scfc_de: tools-webserver-02: Rebooted; inaccessible via ssh, http said "500 Internal Server Error"

February 27

  • 15:28 scfc_de: chmod g-w ~fsainsbu/.forward

February 25

  • 22:48 rdwrer: Lol, so, something happened with grrrit-wm earlier and nobody logged any of it. It was yoyoing, Yuvi killed it, then aude did something and now it's back.

February 23

  • 20:46 scfc_de: morebots: labs HUPped to reconnect to IRC

February 21

  • 17:32 scfc_de: tools-dev: mount -t nfs -o nfsvers=3,ro labstore1.pmtpa.wmnet:/publicdata-project /public/datasets; automount seems to have been stuck
  • 15:24 scfc_de: tools-webserver-03: Rebooted, wasn't accessible by ssh and apparently no access to /public/datasets either

February 20

  • 21:23 scfc_de: tools-login: Disabled crontab for local-rezabot and left a message at User talk:Reza#Running bots on tools-login, etc. (fa:بحث_کاربر:Reza1615 is write-protected)
  • 20:15 scfc_de: tools-login: Disabled crontab for local-chobot and left a message at ko:사용자토론:ChongDae#Running bots on tools-login, etc.
  • 10:42 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list", cf. [[bugzilla:61583]])
  • 10:30 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
  • 10:28 scfc_de: Reset error status of task@tools-exec-09 ("can't get password entry for user 'local-voxelbot'"); "getent passwd local-voxelbot" works on tools-exec-09, possibly a glitch

February 19

  • 20:21 scfc_de: morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots
  • 19:14 scfc_de: tools-login: Disabled crontab and pkill -HUP -u fatemi127

February 18

  • 11:42 scfc_de: tools-mail: Rerouted queued mail (@tools-login.pmtpa.wmflabs => @tools.wmflabs.org)
  • 11:34 scfc_de: tools-exec-08: Rebooted due to not responding on ssh and SGE
  • 10:39 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list" => probably artifacts from Coren's LDAP changes)
  • 10:37 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)

February 14

  • 23:54 legoktm: restarting grrrit-wm since it disappeared
  • 08:19 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)

February 13

  • 13:11 scfc_de: Deleted old job of user veblenbot stuck in error state
  • 13:08 scfc_de: Deleted old jobs of user v2 stuck in error state
  • 10:49 scfc_de: tools-login: Commented out local-shuaib-bot's crontab with a pointer to Tools/Help

February 12

  • 07:51 wm-bot: petrb: removed /data/project/james/adminstats/wikitools per request from james on irc

February 11

  • 15:47 scfc_de: Restarted webservice for geohack
  • 13:02 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
  • 13:00 scfc_de: Killed -HUP local-hawk-eye-bot's jobs; one was hanging with a stale NFS handle on tools-exec-05

February 10

  • 23:16 Coren: rebooting webproxy (braindead autofs)

February 9

February 6

February 4

January 31

  • 03:43 scfc_de: Cleaned up all exim queues
  • 01:26 scfc_de: chmod g-w ~{bgwhite,daniel,euku,fale,henna,hydriz,lfaraone}/.forward (test: sudo find /home -mindepth 2 -maxdepth 2 -type f -name .forward -perm /g=w -ls)
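
A hedged one-shot variant that fixes whatever the test above finds instead of just listing it (same find predicates):

    # Strip group-write from every matching .forward, then list the result
    sudo find /home -mindepth 2 -maxdepth 2 -type f -name .forward \
        -perm /g=w -exec chmod g-w {} \; -ls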

January 30

  • 21:48 scfc_de: chmod g-w ~fluff/.forward
  • 21:40 scfc_de: local-betabot: Added "-M" option to crontab's qsub call and rerouted queued mail (freeze, exim -Mar, exim -Mmd, thaw)
  • 18:33 scfc_de: tools-exec-04: puppetd --enable (apparently disabled sometime around 2014-01-16?!)
  • 17:25 scfc_de: tools-exec-06: mv -f /etc/init.d/nagios-nrpe-server{.dpkg-dist,} (nagios-nrpe-server didn't start because start-up script tried to "chown icinga" instead of "chown nagios")

January 28

  • 04:27 scfc_de: tools-webproxy: Blocked Phonifier

January 25

  • 05:37 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (OOM)

January 24

  • 01:07 scfc_de: tools-db: Removed /var/lib/mysql2, set expire_logs_days to 1 day
  • 00:11 scfc_de: tools-db: and restarted mysqld
  • 00:11 scfc_de: tools-db: Moved 4.2 GBytes of the oldest binlogs to /var/lib/mysql2/
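
A minimal sketch of making that retention change stick at runtime, assuming root client access to the local MariaDB; expire_logs_days only affects future rotations, so the explicit purge clears the backlog:

    # Keep only one day of binary logs from now on
    mysql -e "SET GLOBAL expire_logs_days = 1;"
    # Purge everything older than a day right away
    mysql -e "PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY;"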

January 23

  • 19:24 legoktm: restarting grrrit-wm now https://gerrit.wikimedia.org/r/#/c/109116/
  • 19:23 legoktm: ^ was for grrrit-wm
  • 19:23 legoktm: re-committed password to local repo, not sure why that wasn't committed already

January 21

  • 17:41 scfc_de: tools-exec-09: iptables-restore /data/project/.system/iptables.conf

January 20

  • 07:02 andrewbogott: merged a lint patch to the gridengine module. Should be a noop

January 16

  • 17:11 scfc_de: tools-exec-09: "iptables-restore /data/project/.system/iptables.conf" after reboot

January 15

  • 13:36 scfc_de: After reboot of tools-exec-09, all continuous jobs were successfully restarted ("Rr"); task jobs (1974113, 2188472) failed ("19  : before writing exit_status")
  • 13:27 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
  • 08:54 andrewbogott: rebooted tools-exec-09
  • 08:32 andrewbogott: rebooted tools-db

January 14

  • 15:10 scfc_de: tools-login: pkill -u local-mlwikisource: Freed 1 GByte of memory
  • 14:58 scfc_de: tools-login: Disabled local-mlwikisource's crontab with explanation
  • 13:57 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (out of memory errors on 2014-01-10)

January 10

January 9

January 8

  • 13:44 scfc_de: Cleared error states of continuous@tools-exec-05, task@tools-exec-05, task@tools-exec-09

January 7

  • 18:59 scfc_de: tools-login, tools-mail: rm -f /var/log/exim4/paniclog (apparently some artifacts of the LDAP failure)

January 6

  • 14:06 YuviPanda: deleted instance tools-mc, didn't know it had come back from the dead

January 1

  • 13:24 scfc_de: tools-exec-02, tools-master, tools-shadow, tools-webserver-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
  • 11:27 scfc_de: tools-webserver-01: rm -f /var/log/exim4/paniclog; out of memory errors
  • 11:18 scfc_de: Emptied /{data/project,home}/.snaplist as the snapshots themselves are not available
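
A sketch of spotting duplicate source entries like the ones above before apt-get update complains, assuming the standard Debian/Ubuntu source file locations:

    # Print any apt source line that appears more than once
    grep -h '^deb' /etc/apt/sources.list /etc/apt/sources.list.d/*.list \
        2>/dev/null | sort | uniq -d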

December 27

  • 07:39 legoktm: grrrit-wm restart didn't really work.
  • 07:38 legoktm: restarting grrrit-wm, for some reason it reconnected and lost its cloak

December 23

  • 18:30 marktraceur: restart grrrit-wm for subbu

December 21

  • 06:50 scfc_de: tools-exec-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update

December 19

  • 17:22 marktraceur: deploying grrrit config change

December 17

  • 23:19 legoktm: rebooted grrrit-wm with new config stuffs

December 14

  • 18:13 marktraceur: restarting grrrit-wm to fix its nickname
  • 13:17 scfc_de: tools-exec-08: Purged packages libapache2-mod-suphp and suphp-common (probably remnants from when the host was misconfigured as a webserver)
  • 13:09 scfc_de: tools-dev, tools-login, tools-mail, tools-webserver-01, tools-webserver-02: rm /var/log/exim4/paniclog (mostly out of memory errors)

December 4

  • 22:15 Coren: tools-exec-01 rebooted to fix the autofs issue; will return to rotation shortly.
  • 16:33 Coren: rebooting webproxy with new kernel settings to help against the DDOS

December 1

  • 14:05 Coren: underlying virtualization hardware rebooted; tools-master and friends coming back up.

November 25

  • 21:03 YuviPanda: created tools-proxy-test instance to play around with the dynamicproxy
  • 12:16 wm-bot: petrb: deswapping -login (swapoff -a && swapon -a)

November 24

  • 07:19 paravoid: disabled crontab for user avocato on tools-login, see above
  • 07:17 paravoid: pkill -u avocato on tools-login, multiple /home/avocato/pywikipedia/redirect.py DoSing the bastion

November 14

  • 09:12 ori-l: Added aude to lolrrit-wm maintainers group

November 13

  • 22:36 andrewbogott: removed 'imagescaler' class from tools-login because that class hasn't existed for a year. A year ago is before that instance even existed, so what the heck?

November 3

  • 16:49 ori-l: grrrit-wm stopped receiving events. restarted it; didn't help. then restarted gerrit-to-redis, which seems to have fixed it.

November 1

  • 16:11 wm-bot: petrb: restarted terminator daemon on -login to sort out memory issues caused by heavy mysql client by elbransco

October 23

  • 15:19 Coren: deleted tools-tyrant and tools-exec-cyberbot (cleanup of obsoleted instances)

October 20

  • 18:52 wm-bot: petrb: everything looks better
  • 18:51 wm-bot: petrb: restarting apache server on tools-webproxy
  • 18:49 wm-bot: petrb: installed links on -dev; going to investigate what is wrong with the apaches. Coren, please update the documentation

October 15

  • 21:03 Coren: rebooted labs-login, which successfully fixed the ownership/take issue.

October 10

  • 09:49 addshore: tools-webserver-01 is getting a 500 Internal Server Error again

September 23

  • 06:44 YuviPanda: remove unpuppetized install of openjdk-6 packages causing problems in -dev (for bug: 54444)
  • 05:15 legoktm: logging a log to test the log logging
  • 05:13 legoktm: logging a log to test the log logging

September 11

  • 09:39 wm-bot: petrb: started toolwatcher

August 24

  • 18:00 wm-bot: petrb: freed 1600mb of ram by killing yasbot processes on -login
  • 17:59 wm-bot: petrb: killing all python processes of yasbot on -login, this bot needs to run on grid, -login is constantly getting OOM because of this bot

August 23

  • 12:17 wm-bot: petrb: test
  • 12:15 wm-bot: petrb: making pv from /dev/vdb on new nodes
  • 11:49 wm-bot: petrb: syncing packages of -login with exec nodes
  • 11:48 petan: someone installed firefox on exec nodes, should investigate / remove

August 22

  • 01:24 scfc_de: tools-webserver-03: Installed python-oursql

August 20

  • 23:00 scfc_de: Opened port 3000 for intra-Labs traffic in execnode security group for YuviPanda's proxy experiments

August 19

  • 09:52 wm-bot: petrb: deleting fatestwiki tool, requested by creator

August 16

  • 00:16 scfc_de: tools-exec-01 doesn't come up again even after repeated reboots

August 15

  • 15:14 scfc_de: tools-webserver-01: Simplified /usr/local/bin/php-wrapper
  • 14:31 scfc_de: tools-webserver-01: "dpkg --configure -a" on apt-get's advice
  • 14:24 scfc_de: chmod 644 ~magnus/.forward
  • 03:07 scfc_de: tools-webproxy: Temporarily serving 403s to AhrefsBot/bingbot/Googlebot/PaperLiBot/TweetmemeBot/YandexBot until they reread robots.txt
  • 02:02 scfc_de: robots.txt: "Disallow: /"
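
A sketch of the blanket-disallow robots.txt referenced above; the docroot path is hypothetical:

    # Serve "Disallow: /" to every crawler (path is an assumption)
    printf 'User-agent: *\nDisallow: /\n' > /var/www/robots.txt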

August 11

  • 03:14 scfc_de: tools-mc: Purged memcached

August 10

  • 02:36 scfc_de: Disabled terminatord on tools-login and tools-dev
  • 02:24 scfc_de: chmod g-w ~whym/.forward

August 6

  • 19:26 scfc_de: Set up basic robots.txt to exclude Geohack to see how that affects traffic
  • 02:09 scfc_de: tools-mail: Enabled rudimentary Ganglia monitoring in root's crontab

August 5

  • 20:32 scfc_de: chmod g-w ~ladsgroup/.forward

August 2

  • 23:45 scfc_de: tools-dev: Installed dialog for testing

August 1

  • 19:57 scfc_de: Created new instance tools-redis with redis_maxmemory = "7GB"
  • 19:56 scfc_de: Added redis_maxmemory to wikitech Puppet variables

July 31

  • 10:50 HenriqueCrang: ptwikis: added a graph with mobile edits

July 30

  • 19:08 scfc_de: tools-webproxy: Purged popularity-contest and ubuntu-standard
  • 07:32 wm-bot: petrb: deleted local-addbot jobs
  • 02:01 scfc_de: tools-webserver-01: Symlinked /usr/local/bin/{job,jstart,jstop,jsub} to /usr/bin; were obsolete versions.

July 29

  • 15:15 scfc_de: tools-webserver-01: rm /var/log/exim4/paniclog
  • 15:10 scfc_de: Purged popularity-contest from tools-webserver-01.
  • 02:40 scfc_de: Restarted toolwatcher on tools-login.
  • 02:11 scfc_de: Reboot tools-login, was not responsive

July 25

  • 23:37 Ryan_Lane: added myself to lolrrit-wm tool
  • 12:06 wm-bot: petrb: test
  • 07:11 wm-bot: petrb: created /var/log/glusterfs/bricks/ to stop rotatelogs from complaining about it being missing

July 20

  • 15:19 petan: rebooting tools-redis

July 19

  • 07:06 petan: instances were rebooted for unknown reasons
  • 00:42 helderwiki: it works! :-)
  • 00:41 legoktm: test

July 10

  • 18:04 wm-bot: petrb: installing mysqltcl on grid
  • 18:01 wm-bot: petrb: installing tclodbc on grid

July 5

  • 19:38 AzaToth: test
  • 19:36 AzaToth: test for example
  • 18:23 Coren: brief outage of webproxy complete (back to business!)
  • 18:13 Coren: brief outage of webproxy (rollback 2.4 upgrade)

July 3

  • 13:44 scfc_de: Set "HostbasedAuthentication yes" and "EnableSSHKeysign yes" in tools-dev's /etc/ssh/ssh_config
  • 12:58 petan: rebooting -mc; it's apparently dying from OOM
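
The two client-side options from the 13:44 entry above, as a sketch; appending assumes the options are not already present in the file:

    # Enable host-based authentication and key signing on the ssh client
    printf 'HostbasedAuthentication yes\nEnableSSHKeysign yes\n' \
        >> /etc/ssh/ssh_config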

July 2

  • 16:24 wm-bot: petrb: installed MariaDB on all nodes so we can connect to the db even from SGE
  • 12:19 wm-bot: petrb: installing packages -- libmediawiki-api-perl libdatetime-format-strptime-perl libbot-basicbot-perl libdatetime-format-duration-perl

July 1

  • 18:39 wm-bot: petrb: started toolwatcher on -login
  • 14:22 wm-bot: petrb: installing following packages on grid: libdata-dumper-simple-perl libhtml-html5-entities-perl libirc-utils-perl libtask-weaken-perl libobject-pluggable-perl libpoe-component-syndicator-perl libpoe-filter-ircd-perl libsocket-getaddrinfo-perl libpoe-component-irc-perl libxml-simple-perl
  • 12:05 wm-bot: petrb: starting toolwatcher
  • 11:40 wm-bot: petrb: tools is back o/
  • 09:42 wm-bot: petrb: installing python-zmq and python-matplotlib on -dev
  • 03:33 scfc_de: Rebooted tools-login; apparently out of memory and not responding to ssh

June 30

  • 17:58 scfc_de: Set ssh_hba to yes on tools-exec-06
  • 17:13 scfc_de: Installed python-matplotlib and python-zmq on tools-login for YuviPanda

June 26

  • 21:16 Coren: +Tim Landscheidt to project admins, local-admin
  • 14:23 wm-bot: petrb: updating several packages on -login
  • 13:43 wm-bot: petrb: killing old instance of redis: Jun15 ? 00:06:49 /usr/bin/redis-server /etc/redis/redis.conf
  • 13:42 wm-bot: petrb: restarting redis
  • 13:28 wm-bot: petrb: running puppet on -mc
  • 13:27 wm-bot: petrb: adding ::redis role to tools-mc - if anything will break, YuviPanda did it :P
  • 09:35 wm-bot: petrb: updated status.php to a version which displays free vmem as well

June 25

  • 12:34 wm-bot: petrb: installing php5-mcrypt on exec and web

June 24

  • 15:45 wm-bot: petrb: changed colors of root prompt: production vs testing
  • 07:57 wm-bot: petrb: 50527 4186 22830 1 Jun23 pts/41 00:08:54 python fill2.py eats 48% of ram on -login

June 19

  • 12:17 wm-bot: petrb: increasing limit on mysql connections

June 17

  • 17:34 wm-bot: petrb: /var/spool/cron/crontabs/ has -rw------- 1 8006 crontab 1176 Apr 11 14:07 local-voxelbot; fixing

June 16

  • 21:23 Coren: 1.0.3 deployed (jobutils, misctools)

June 15

  • 21:40 wm-bot: petrb: there is no lvm on -db, which we badly need; therefore no swap either, nor storage for binary logs :( I've got a feeling that mysql will die OOM soonish
  • 21:39 wm-bot: petrb: db has 5% free RAM eeeek
  • 18:36 wm-bot: root: removed a lot of "audit" logs from exec-04; they were eating too much storage
  • 18:23 wm-bot: petrb: temporarily disabling /tmp on exec-04 in order to set up lvm
  • 18:23 wm-bot: petrb: exec-04 96% / usage, creating a new volume
  • 12:33 wm-bot: petrb: installing redis on tools-mc

June 14

  • 12:35 wm-bot: petrb: updating logsplitter to new version

June 13

  • 21:59 wm-bot: petrb: replaced logsplitter on both apache servers with a far more powerful C++ version, thus saving a lot of resources on both servers
  • 12:43 wm-bot: petrb: tools-webserver-01 is running a quite expensive python job (currently eating almost 1gb of ram); it may need to be fixed or moved to a separate webserver. Adding swap to prevent the machine dying OOM
  • 12:22 wm-bot: petrb: killing process 31187 sort -T./enwiki/target -t of user local-enwp10 for same reason as previous one
  • 12:21 wm-bot: petrb: killing process 31190 sort -T./enwiki/target of user local-enwp10 for same reason as previous one
  • 12:17 wm-bot: petrb: killing process 31186 31185 69 Jun11 pts/32 1-13:14:41 /usr/bin/perl ./bin/catpagelinks.pl ./enwiki/target/main_pages_sort_by_ids.lst ./enwiki/target/pagelinks_main_sort_by_ids.lst because it seems to be a bot running on login server eating too many resources

June 11

  • 07:36 wm-bot: petrb: installed libdigest-crc-perl

June 10

  • 13:05 wm-bot: petrb: installing libcrypt-gcrypt-perl
  • 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix !b 49383
  • 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix become afcbot 49383
  • 08:44 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix become afcbot 49383
  • 08:25 wm-bot: petrb: fixing missing packages on exec nodes

June 9

  • 20:44 wm-bot: petrb: moved logs on -login to separate storage

June 8

  • 21:24 wm-bot: petrb: installing python-imaging-tk on grid
  • 21:20 wm-bot: petrb: installing python-tk
  • 21:16 wm-bot: petrb: installing python-flickrapi on grid
  • 21:16 wm-bot: petrb: installing
  • 16:49 wm-bot: petrb: turned off wmf style of vi on tools-dev feel free to slap me :o or do cat /etc/vim/vimrc.local >> .vimrc if you love it
  • 15:33 wm-bot: petrb: grid is overloaded, needs to be either enlarged or jobs calmed down :o
  • 09:55 wm-bot: petrb: backporting tcl 8.6 from debian
  • 09:38 wm-bot: petrb: update python requests to version 1.2.3.1

June 7

  • 15:29 Coren: Deleted no-longer-needed tools-exec-cg node (spun off to its own project)

June 5

  • 09:52 wm-bot: petrb: on -dev
  • 09:52 wm-bot: petrb: moving /usr to a separate volume; expect problems :o
  • 09:41 wm-bot: petrb: moved /var/log to separate volume on -dev
  • 09:31 wm-bot: petrb: Houston, we have a problem: / on -dev is 94% full
  • 09:28 wm-bot: petrb: installed openjdk7 on -dev
  • 09:00 wm-bot: petrb: removing wd-terminator service
  • 08:39 wm-bot: petrb: started toolwatcher
  • 07:04 wm-bot: petrb: installing maven on -dev

June 4

  • 14:49 wm-bot: petrb: installing sbt in order to fix b48859
  • 13:28 wm-bot: petrb: installing csh on cluster
  • 08:37 wm-bot: petrb: installing python-memcache on exec nodes

June 3

  • 21:40 Coren: Rebooting -login; it's thrashing. Will keep an eye on it.
  • 14:15 wm-bot: petrb: removing popularity contest
  • 14:11 wm-bot: petrb: removing /etc/logrotate.d/glusterlogs on all servers to fix logrotate daemon
  • 09:43 wm-bot: petrb: syncing packages on exec nodes to avoid troubles with missing libs on some etc

June 2

  • 08:39 wm-bot: petrb: installing ack-grep everywhere per yuvipanda and irc

June 1

  • 20:57 wm-bot: petrb: installed this to exec nodes because it was on some and not on others cpp-4.4 cpp-4.5 cython dbus dosfstools ed emacs23 ftp gcc-4.4-base iptables iputils-tracepath ksh lsof ltrace lshw mariadb-client-5.5 nano python-dbus python-egenix-mxdatetime python-egenix-mxtools python-gevent python-greenlet strace telnet time -y
  • 20:42 wm-bot: petrb: installing wikitools cluster wide
  • 20:40 wm-bot: petrb: installing oursql cluster wide
  • 10:46 wm-bot: petrb: created new instance for experiments with sasl memcache tools-mc

May 31

  • 19:17 petan: deleting xtools project (requested by Cyberpower678)
  • 17:24 wm-bot: petrb: removing old kernels from -dev because / is almost full
  • 17:17 wm-bot: petrb: installed lsof to -dev
  • 15:55 wm-bot: petrb: installed subversion to exec nodes 4 legoktm
  • 15:47 wm-bot: petrb: replacing mysql with maria on exec nodes
  • 15:46 wm-bot: petrb: replacing mysql with maria on exec nodes
  • 15:14 wm-bot: petrb: installing default-jre in order to satisfy its dependencies
  • 15:13 wm-bot: petrb: installing /data/project/.system/deb/all/sbt.deb to -dev in order to test it
  • 13:04 wm-bot: petrb: installing bashdb on tools and -dev
  • 12:27 wm-bot: petrb: removing project local-jimmyxu - per request on irc
  • 10:54 wm-bot: petrb: killing process 3060 on -login (mahdiz 3060 1964 88 May30 ? 21:32:51 /bin/nano /tmp/crontab.Ht3bSO/crontab) it takes max cpu and doesn't seem to be attached

May 30

  • 12:24 wm-bot: petrb: deleted job 1862 from queue (error state)
  • 08:26 wm-bot: petrb: updated sql command

May 29

  • 21:05 wm-bot: petrb: running sudo apt-get install php5-gd

May 28

  • 20:00 wm-bot: petrb: installing p7zip-full to -dev and -login

May 27

  • 08:46 wm-bot: petrb: changed config of mysql to use /mnt as the path to save binary logs; this however requires the server to be restarted
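
A hedged sketch of the corresponding configuration change; the binlog directory name and the conf.d drop-in path are assumptions, and the restart matches the entry's note:

    # Prepare the new binlog location and point mysqld at it
    mkdir -p /mnt/mysql-binlog && chown mysql:mysql /mnt/mysql-binlog
    printf '[mysqld]\nlog_bin = /mnt/mysql-binlog/mysql-bin\n' \
        > /etc/mysql/conf.d/binlog-path.cnf
    service mysql restart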

May 24

  • 08:44 petan: setting up lvm on new exec nodes because it is more flexible and allows us to change the size of volumes on the fly
  • 08:28 petan: created 2 more exec nodes, setting up now...
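
A minimal sketch of the LVM setup on a new exec node; /dev/vdb matches the "making pv from /dev/vdb" entry further up, while the volume group name and sizes are assumptions:

    # Carve a resizable logical volume out of the spare disk
    pvcreate /dev/vdb
    vgcreate vd /dev/vdb
    lvcreate -L 20G -n logs vd
    mkfs.ext4 /dev/vd/logs
    mount /dev/vd/logs /var/log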

May 23

  • 09:20 wm-bot: petrb: process 27618 on -login is constantly eating 100% of cpu, changing priority to 20

May 22

  • 20:54 wm-bot: petrb: changing ownership of /data/project/bracketbot/ to local-bracketbot
  • 14:28 labs-logs-bottie: petrb: installed netcat as well
  • 14:28 labs-logs-bottie: petrb: installed telnet to -dev
  • 14:02 Coren: tools-webserver-02 now live; / and /cluebot/ moved there

May 21

  • 20:27 labs-logs-bottie: petrb: uploaded hosts to -dev

May 19

  • 13:40 labs-logs-bottie: petrb: killing that nano process; it seems to be hung and unattached anyway
  • 12:59 labs-logs-bottie: petrb: changed priority of nano process to 19
  • 12:55 labs-logs-bottie: petrb: local-hawk-eye-bot /bin/nano /tmp/crontab.d4JhUj/crontab eats too much cpu
  • 12:50 petan: nvm previous line
  • 12:50 labs-logs-bottie: petrb: vul alias viewuserlang

May 14

  • 21:22 labs-logs-bottie: petrb: created a separate volume for /tmp on login so that temp files do not fragment the root fs and it does not get filled up by them; it also makes it easier to track filesystem usage
  • 13:16 Coren: reboot -dev, need to test kernel upgrade

May 10

  • 15:08 Coren: create tools-webserver-02 for Apache 2.4 experimentation

May 9

  • 04:12 Coren: added -exec-03 and -exec-04. Moar power!!1!

May 6

  • 19:59 Coren: made tools-dev.wmflabs.org public
  • 08:04 labs-logs-bottie: petrb: created a small swap on -login so that users cannot bring it to OOM so easily and so that unused memory blocks can be swapped out in order to use the remaining memory more effectively
  • 08:00 labs-logs-bottie: petrb: making lvm from unused disk from /mnt on -login so that we can eventually use it somewhere if needed
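
A minimal sketch of adding a small swap space like the 08:04 entry above; this file-based variant is an assumption (the logged change used the spare /mnt disk), and the size is illustrative:

    # Create, protect, format and enable a 512 MB swap file
    dd if=/dev/zero of=/swapfile bs=1M count=512
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile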

May 4

  • 17:50 labs-logs-bottie: petrb: foobar as well
  • 17:47 labs-logs-bottie: petrb: removing project flask-stub using rmtool
  • 15:33 labs-logs-bottie: petrb: fixing missing db user for local-stub
  • 12:51 labs-logs-bottie: petrb: creating mysql accounts by hand for alchimista and fubar

May 2

  • 20:49 labs-logs-bottie: petrb: uploaded motd to exec-N as well, with information about which server users connected to

May 1

  • 16:59 labs-logs-bottie: petrb: fixed invalid permissions on /home

April 27

  • 18:54 labs-logs-bottie: petrb: installing pymysql using pip on the whole grid because it is needed for greenrosseta (for some reason it is better than the python-mysql package)

April 26

  • 23:55 Coren: reboot to finish security updates
  • 08:00 labs-logs-bottie: petrb: patching qtop
  • 07:57 labs-logs-bottie: petrb: added tools-dev to the admin host list so that qtop works, and fixed the qtop bug
  • 07:28 labs-logs-bottie: petrb: installing GE tools to -dev so that we can develop new j|q* stuff there

April 25

  • 19:00 Coren: Maintenance over; systems restarted and should be working.
  • 18:18 labs-logs-bottie: petrb: we are getting into trouble with memory on tools-db; there is less than 20% free memory
  • 18:01 Coren: Begin maintenance (login disabled)
  • 13:21 petan: removing local-wikidatastats from ldap

April 24

  • 13:17 labs-logs-bottie: petrb: sudo chown local-peachy PeachyFrameworkLogo.png
  • 11:37 labs-logs-bottie: petrb: created new project stats and cloned acl from wikidatastats, which is supposed to be deleted
  • 11:32 legoktm: wikidatastats attempting to install limn
  • 11:15 labs-logs-bottie: petrb: installing npm to -login instance
  • 07:34 petan: creating project wikidatastats for legoktm addshore and yuvipandianablah :P

April 23

  • 13:32 labs-logs-bottie: petrb: changing permissions of cyberbot and peachy to 775 so that it is easier to use them
  • 12:14 labs-logs-bottie: petrb: qtop on -dev
  • 12:12 labs-logs-bottie: petrb: removed part of motd from login server that got there in a mysterious way

April 19

  • 22:38 Coren: reboot -login, all done with the NFS config. yeay.
  • 17:13 Coren: (final?) reboot of -login with the new autofs configuration
  • 16:24 Coren: (rebooted -login)
  • 16:24 Coren: autofs + gluster = fail
  • 14:45 Coren: reboot -login (NFS mount woes)

April 15

  • 22:29 Coren: also a test; note how said bot knows its place.  :-)
  • 22:14 andrewbogott: this is a test of labs-morebots.
  • 21:49 andrewbogott: this is a test
  • 15:41 labs-logs-bottie: petrb: installing p7zip everywhere
  • 08:00 labs-logs-bottie: petrb: installing dev packages needed for YuviPanda on login box

April 11

  • 22:39 Coren: rebooted tools-puppet-test (no end-user impact): hung filesystem prevents login
  • 07:42 labs-logs-bottie: petrb: removed reboot information from motd

April 10

  • 21:42 labs-logs-bottie: petrb: reverting the change
  • 21:35 labs-logs-bottie: petrb: inserting /lib to /etc/ld.so.conf in order to fix the bug with gcc / ubuntu see irc logs (22:30 GMT)
  • 21:22 labs-logs-bottie: petrb: installing jobutils.deb on login
  • 20:30 labs-logs-bottie: petrb: installing some dev tools to -dev
  • 20:23 petan: created -dev instance for various purposes

April 8

  • 14:07 labs-logs-bottie: petrb: on grid: apt-get install mono-complete
  • 13:50 labs-logs-bottie: local-afcbot: unable to run mono applications: The assembly mscorlib.dll was not found or could not be loaded.

April 4

  • 14:40 labs-logs-bottie: petrb: trying to convert afcbot to new service group local-afcbot

April 2

  • 16:04 labs-logs-bottie: petrb: installed log to /home/petrb/bin/ and testing it
  • 15:55 petan: patched /usr/local/bin/qdisplay so that it can display jobs per node properly
  • 15:54 petan: giving sudo to Petrb in order to update qdisplay

March 28

  • 15:44 Coren: reboot (still unactivated) tools-shadow

March 26

  • 18:17 Coren: Doubled the size of the compute grid! (added tools-exec-02 to the grid)

March 21

  • 23:30 Coren: turned on interpretation of .py as CGI by default on tools-webserver-* to parallel .php
  • 16:15 Coren: Added tools-login.wmflabs.org public IP for the tools-login instance and allowed incoming ssh to it.

March 19

  • 14:21 Coren: reboot cycle (all instances) to apply security updates

March 13

  • 14:04 Coren: restarted webserver: relax AllowOverride options

March 11

  • 15:47 Coren: enabled X forwarding for qmon. Also, installed qmon.
  • 13:17 Coren: added python-requests (1.0, from pip)

March 7

  • 20:41 Coren: tools' php errors now sent to ~/php_errors.log
  • 19:31 Coren: access.log now split by tools (in tool homedir)
  • 16:15 Coren: can haz database (support for user/tool databases in place)

March 6

  • 20:25 Coren: tools-db installed mariadb-server from official repo
  • 19:50 Coren: created tools-db instance for a (temporary) mysql install

March 5

  • 21:45 Coren: rejiggered the webproxy config to be smarter about paths not leading to specific tools

February 26

  • 23:49 Coren: Original note structure: created tools-{master,exec-01,webserver-01,webproxy} instances
  • 18:39 Coren: Created tools-puppet-test for dev and testing of tools' puppet classes.
  • 01:52 Coren: created instance tools-login (primary login/dev instance)
  • 01:52 Coren: created sudo policies and security groups (skeletal)
  • 01:08 Coren: Creation of the new project for preproduction deployment of the current (preliminary) plan mw:Wikimedia Labs/Tool Labs/Design