Nova Resource:Tools/SAL/Archive 3

2019-12-30

  • 05:02 andrewbogott: moving tools-worker-1012 to cloudvirt1024 for T241523
  • 04:49 andrewbogott: draining and rebooting tools-worker-1031, its drive is full

2019-12-29

  • 01:38 Krenair: Cordoned tools-worker-1012 and deleted pods associated with dplbot and dewikigreetbot as well as my own testing one, host seems to be under heavy load - T241523
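
A minimal sketch of the cordon-and-delete sequence behind the entry above, assuming each tool's pods run in a namespace named after the tool and using a placeholder pod name:

    # mark the node unschedulable so rescheduled pods land elsewhere
    kubectl cordon tools-worker-1012
    # locate the tool's pods on the overloaded node
    kubectl get pods -n dplbot -o wide | grep tools-worker-1012
    # delete a pod; its controller recreates it on another node
    kubectl delete pod -n dplbot <pod-name>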

2019-12-27

  • 15:06 Krenair: Killed a "python parse_page.py outreachy" process by aikochou that was hogging IO on tools-sgebastion-07

2019-12-25

  • 16:07 zhuyifei1999_: pkilled 5 `python pwb.py` processes belonging to `tools.kaleem-bot` on tools-sgebastion-07
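
The exact invocation wasn't logged; a minimal sketch of an equivalent command:

    # kill all `python pwb.py` processes owned by the tool account
    sudo pkill -u tools.kaleem-bot -f 'python pwb.py'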

2019-12-22

  • 20:13 bd808: Enabled Puppet on tools-proxy-06.tools.eqiad.wmflabs after nginx config test (T241310)
  • 18:52 bd808: Disabled Puppet on tools-proxy-06.tools.eqiad.wmflabs to test nginx config change (T241310)
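
A sketch of the disable/test/re-enable cycle behind the two entries above, assuming the candidate config was applied by hand while puppet was off:

    sudo puppet agent --disable 'bd808: testing nginx config change (T241310)'
    # check that the edited config parses before reloading nginx
    sudo nginx -t
    sudo puppet agent --enable
    sudo puppet agent --test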

2019-12-20

  • 22:28 bd808: Re-enabled Puppet on tools-sgebastion-09. Reason for disable was "arturo raising systemd limits"
  • 11:33 arturo: reboot tools-k8s-control-3 to fix some stale NFS mount issues

2019-12-18

  • 17:33 bstorm_: updated package in aptly for toollabs-webservice to 0.53
  • 11:49 arturo: introduce placeholder DNS records for toolforge.org domain. No services are provided under this domain yet for end users, this is just us testing (SSL, proxy stuff etc). This may be reverted anytime.
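
Assuming the zone is managed through OpenStack designate, a placeholder record could be created like this (the record name and IP are invented for illustration):

    openstack recordset create toolforge.org. test --type A --record 203.0.113.10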

2019-12-17

  • 20:25 bd808: Fixed https://tools.wmflabs.org/ to redirect to https://tools.wmflabs.org/admin/
  • 19:21 bstorm_: deployed the changes to the live proxy to enable the new kubernetes cluster T234037
  • 16:53 bstorm_: maintain-kubeusers app deployed fully in tools for new kubernetes cluster T214513 T228499
  • 16:50 bstorm_: updated the maintain-kubeusers docker image for beta and tools
  • 04:48 bstorm_: completed first run of maintain-kubeusers 2 in the new cluster T214513
  • 01:26 bstorm_: running the first run of maintain-kubeusers 2.0 for the new cluster T214513 (more successfully this time)
  • 01:25 bstorm_: unset the immutable bit from 1704 tool kubeconfigs T214513 (see the chattr sketch below)
  • 01:05 bstorm_: beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit T214513
  • 00:45 bstorm_: enabled encryption at rest on the new k8s cluster
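
The immutable bit mentioned in the 01:25 and 01:05 entries is handled with chattr(1); a sketch, assuming the kubeconfigs live under each tool's home on NFS:

    # list kubeconfigs still carrying the immutable (i) attribute
    lsattr /data/project/*/.kube/config 2>/dev/null | awk '$1 ~ /i/ {print $2}'
    # clear the bit so maintain-kubeusers can rewrite the file
    sudo chattr -i /data/project/<tool>/.kube/config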

2019-12-16

  • 22:04 bd808: Added 'ALLOW IPv4 25/tcp from 0.0.0.0/0' to "MTA" security group applied to tools-mail-02 (CLI sketch below)
  • 19:05 bstorm_: deployed the maintain-kubeusers operations pod to the new cluster
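
A CLI equivalent of the security group rule added at 22:04 (a sketch; the rule may well have been added via Horizon):

    # allow inbound SMTP from anywhere to instances in the MTA group
    openstack security group rule create --protocol tcp --dst-port 25 --remote-ip 0.0.0.0/0 MTA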

2019-12-14

  • 10:48 valhallasw`cloud: re-enabling puppet on tools-sgeexec-0912, likely left-over from NFS maintenance (no reason was specified).

2019-12-13

  • 18:46 bstorm_: updated tools-k8s-control-2 and 3 to the new config as well
  • 17:56 bstorm_: updated tools-k8s-control-1 to the new control plane configuration
  • 17:47 bstorm_: edited the kubeadm-config configMap object to match the new init config (see the sketch below)
  • 17:32 bstorm_: rebooting tools-k8s-control-2 to correct mount issue
  • 00:45 bstorm_: rebooting tools-static-13
  • 00:28 bstorm_: rebooting the k8s master to clear NFS errors
  • 00:15 bstorm_: switch tools-acme-chief config to match the new authdns_servers format upstream
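
The 17:47 configMap edit noted above amounts to the following, since kubeadm keeps its cluster configuration in kube-system:

    kubectl -n kube-system edit configmap kubeadm-config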

2019-12-12

  • 23:36 bstorm_: rebooting toolschecker after downtiming the services
  • 22:58 bstorm_: rebooting tools-acme-chief-01
  • 22:53 bstorm_: rebooting the cron server, tools-sgecron-01 as it wasn't recovered from last night's maintenance
  • 11:20 arturo: rolling reboot for all grid & k8s worker nodes due to NFS staleness
  • 09:22 arturo: reboot tools-sgeexec-0911 to try fixing weird NFS state
  • 08:46 arturo: doing `run-puppet-agent` in all VMs to see state of NFS
  • 08:34 arturo: reboot tools-worker-1033/1034 and tools-sgebastion-08 to try to correct NFS mount issues

2019-12-11

  • 18:13 bd808: Restarted maintain-dbusers on labstore1004. Process had not logged any account creations since 2019-12-01T22:45:45.
  • 17:24 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1031

2019-12-10

  • 13:59 arturo: set pod replicas to 3 in the new k8s cluster (T239405)

2019-12-09

  • 11:06 andrewbogott: deleting unused security groups: catgraph, devpi, MTA, mysql, syslog, test T91619

2019-12-04

  • 13:45 arturo: drop puppet prefix `tools-cron`, deprecated and no longer in use

2019-11-29

  • 11:45 arturo: created 3 new VMs `tools-k8s-worker-[3,4,5]` (T239403)
  • 10:28 arturo: re-arm keyholder in tools-acme-chief-01 (password in labs/private.git @ tools-puppetmaster-01)
  • 10:27 arturo: re-arm keyholder in tools-acme-chief-02 (password in labs/private.git @ tools-puppetmaster-01)
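
Re-arming keyholder follows the usual sequence; a minimal sketch:

    # check which keys are armed
    sudo keyholder status
    # arm the agent; prompts for the passphrase noted above
    sudo keyholder arm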

2019-11-26

  • 23:25 bstorm_: rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones T236202
  • 22:57 bstorm_: push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds T236202
  • 19:55 phamhi: drained tools-worker-1002,8,15,32 to rebalance the cluster
  • 19:45 phamhi: cleaned up a container that was taking up 16G of disk space on tools-worker-1020 in order to re-run the puppet client
  • 14:01 arturo: drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. Such VM no longer exists
  • 14:00 arturo: introduce the `profile::toolforge::proxies` hiera key in the global puppet config

2019-11-25

  • 10:35 arturo: refresh puppet certs for tools-k8s-etcd-[4-6] nodes (T238655)
  • 10:35 arturo: add puppet cert SANs via instance hiera to tools-k8s-etcd-[4-6] nodes (T238655)

2019-11-22

  • 13:32 arturo: created security group `tools-new-k8s-full-connectivity` and add new k8s VMs to it (T238654)
  • 05:55 jeh: add Riley Huntley `riley` to base tools project

2019-11-21

  • 12:48 arturo: reboot the new k8s cluster after the upgrade
  • 11:49 arturo: upgrading new k8s kubectl version to 1.15.6 (T238654)
  • 11:44 arturo: upgrading new k8s kubelet version to 1.15.6 (T238654)
  • 10:29 arturo: upgrading new k8s cluster version to 1.15.6 using kubeadm (T238654)
  • 10:28 arturo: install kubeadm 1.15.6 on worker/control nodes in the new k8s cluster (T238654)
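
A sketch of the kubeadm upgrade flow the entries above walk through, starting on a control node (the Debian package revision suffix is an assumption):

    sudo apt-get install -y kubeadm=1.15.6-00
    sudo kubeadm upgrade plan
    sudo kubeadm upgrade apply v1.15.6
    # then, per node: upgrade kubelet/kubectl and restart the kubelet
    sudo apt-get install -y kubelet=1.15.6-00 kubectl=1.15.6-00
    sudo systemctl restart kubelet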

2019-11-19

  • 13:49 arturo: re-create nginx-ingress pod due to deployment template refresh (T237643)
  • 12:46 arturo: deploy changes to tools-prometheus to account for the new k8s cluster (T237643)

2019-11-15

  • 14:44 arturo: stop live-hacks on tools-prometheus-01 T237643

2019-11-13

  • 17:20 arturo: live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster (T237643)

2019-11-12

  • 12:52 arturo: reboot tools-proxy-06 to reset iptables setup T238058

2019-11-08

  • 22:47 bstorm_: adding rsync::server::wrap_with_stunnel: false to the tools-docker-registry-03/4 servers to unbreak puppet
  • 18:40 bstorm_: pushed new webservice package to the bastions T230961
  • 18:37 bstorm_: pushed new webservice package supporting buster containers to repo T230961
  • 18:36 bstorm_: pushed buster-sssd images to the docker repo
  • 17:15 phamhi: pushed new buster images with the prefix name "toolforge"

2019-11-07

  • 13:27 arturo: deployed registry-admission-webhook and ingress-admission-controller into the new k8s cluster (T236826)
  • 13:01 arturo: creating puppet prefix `tools-k8s-worker` and a couple of VMs `tools-k8s-worker-[1,2]` T236826
  • 12:57 arturo: increasing project quota T237633
  • 11:54 arturo: point `k8s.tools.eqiad1.wikimedia.cloud` to tools-k8s-haproxy-1 T236826
  • 11:43 arturo: create VMs `tools-k8s-haproxy-[1,2]` T236826
  • 11:43 arturo: create puppet prefix `tools-k8s-haproxy` T236826

2019-11-06

  • 22:32 bstorm_: added rsync::server::wrap_with_stunnel: false to tools-sge-services prefix to fix puppet
  • 21:33 bstorm_: docker images needed for kubernetes cluster upgrade deployed T215531
  • 20:26 bstorm_: building and pushing docker images needed for kubernetes cluster upgrade
  • 16:10 arturo: new k8s cluster control nodes are bootstrapped (T236826)
  • 13:51 arturo: created FQDN `k8s.tools.eqiad1.wikimedia.cloud` pointing to `tools-k8s-control-1` for the initial bootstrap (T236826)
  • 13:50 arturo: created 3 VMs `tools-k8s-control-[1,2,3]` (T236826)
  • 13:43 arturo: created `tools-k8s-control` puppet prefix T236826
  • 11:57 phamhi: restarted all webservices in grid (T233347)

2019-11-05

  • 23:08 Krenair: Dropped 59a77a3, 3830802, and 83df61f from tools-puppetmaster-01:/var/lib/git/labs/private cherry-picks as these are no longer required T206235
  • 22:49 Krenair: Disassociated floating IP 185.15.56.60 from tools-static-13, traffic to this host goes via the project-proxy now. DNS was already changed a few days ago. T236952
  • 22:35 bstorm_: upgraded libpython3.4 libpython3.4-dbg libpython3.4-minimal libpython3.4-stdlib python3.4 python3.4-dbg python3.4-minimal to fix an old broken patch T237468
  • 22:12 bstorm_: pushed docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to the registry to deploy in toolsbeta
  • 17:38 phamhi: restarted lighttpd based webservice pods on tools-worker-103x and 1040 (T233347)
  • 17:34 phamhi: restarted lighttpd based webservice pods on tools-worker-102[0-9] (T233347)
  • 17:06 phamhi: restarted lighttpd based webservice pods on tools-worker-101[0-9] (T233347)
  • 16:44 phamhi: restarted lighttpd based webservice pods on tools-worker-100[1-9] (T233347)
  • 13:55 arturo: created 3 new VMs: `tools-k8s-etcd-[4,5,6]` T236826

2019-11-04

  • 14:45 phamhi: Built and pushed ruby25 docker image based on buster (T230961)
  • 14:45 phamhi: Built and pushed golang111 docker image based on buster (T230961)
  • 14:45 phamhi: Built and pushed jdk11 docker image based on buster (T230961)
  • 14:45 phamhi: Built and pushed php73 docker image based on buster (T230961)
  • 11:10 phamhi: Built and pushed python37 docker image based on buster (T230961)

2019-11-01

  • 21:00 Krenair: Removed tools-checker.wmflabs.org A record to 208.80.155.229 as that target IP is in the old pre-neutron range that is no longer routed
  • 20:57 Krenair: Removed trusty.tools.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
  • 20:56 Krenair: Removed tools-trusty.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
  • 20:38 Krenair: Updated A record for tools-static.wmflabs.org to point towards project-proxy T236952

2019-10-31

  • 18:47 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1001. Runaway logfiles filled up the drive which prevented puppet from running. If puppet had run, it would have prevented the runaway logfiles.
  • 13:59 arturo: update puppet prefix `tools-k8s-etcd-` to use the `role::wmcs::toolforge::k8s::etcd` role T236826
  • 13:41 arturo: disabling puppet in tools-k8s-etcd- nodes to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/546995
  • 10:15 arturo: SSL cert replacement for tools-docker-registry and tools-k8s-master went fine apparently (T236962)
  • 10:02 arturo: icinga downtime toolschecker for 1h for replacing SSL certs in tools-docker-registry and tools-k8s-master (T236962)

2019-10-29

  • 10:49 arturo: deleting VMs tools-test-proxy-01, no longer in use
  • 10:07 arturo: deleting old jessie VMs tools-proxy-03/04 T235627

2019-10-28

  • 16:06 arturo: delete VM instance `tools-test-proxy-01` and the puppet prefix `tools-test-proxy`
  • 15:54 arturo: tools-proxy-05 now has the 185.15.56.11 floating IP as the active proxy. The old one, 185.15.56.6, has been freed T235627
  • 15:54 arturo: shutting down tools-proxy-03 T235627
  • 15:26 bd808: Killed all processes owned by jem on tools-sgebastion-08
  • 15:16 arturo: tools-proxy-05 now has the 185.15.56.5 floating IP as the active proxy T235627
  • 15:14 arturo: refresh hiera to use tools-proxy-05 as active proxy T235627
  • 15:11 bd808: Killed ircbot.php processes started by jem on tools-sgebastion-08 per request on irc
  • 14:58 arturo: added `webproxy` security group to tools-proxy-05 and tools-proxy-06 (T235627)
  • 14:57 phamhi: drained tools-worker-1031.tools.eqiad.wmflabs to clean up disk space
  • 14:45 arturo: created VMs tools-proxy-05 and tools-proxy-06 (T235627)
  • 14:43 arturo: adding `role::wmcs::toolforge::proxy` to the `tools-proxy` puppet prefix (T235627)
  • 14:42 arturo: deleted `role::toollabs::proxy` from the `tools-proxy` puppet profile (T235627)
  • 14:34 arturo: icinga downtime toolschecker for 1h (T235627)
  • 12:25 arturo: upload image `coredns` v1.3.1 (eb516548c180) to docker registry (T236249)
  • 12:23 arturo: upload image `kube-apiserver` v1.15.1 (68c3eb07bfc3) to docker registry (T236249)
  • 12:22 arturo: upload image `kube-controller-manager` v1.15.1 (d75082f1d121) to docker registry (T236249)
  • 12:20 arturo: upload image `kube-proxy` v1.15.1 (89a062da739d) to docker registry (T236249)
  • 12:19 arturo: upload image `kube-scheduler` v1.15.1 (b0b3c4c404da) to docker registry (T236249)
  • 12:04 arturo: upload image `calico/node` v3.8.0 (cd3efa20ff37) to docker registry (T236249)
  • 12:03 arturo: upload image `calico/calico/pod2daemon-flexvol` v3.8.0 (f68c8f870a03) to docker registry (T236249)
  • 12:01 arturo: upload image `calico/cni` v3.8.0 (539ca36a4c13) to docker registry (T236249)
  • 11:58 arturo: upload image `calico/kube-controllers` v3.8.0 (df5ff96cd966) to docker registry (T236249)
  • 11:47 arturo: upload image `nginx-ingress-controller` v0.25.1 (0439eb3e11f1) to docker registry (T236249)
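
Each upload above is the standard pull/tag/push cycle into the local registry; one image shown, and the upstream source registry is an assumption:

    docker pull k8s.gcr.io/kube-apiserver:v1.15.1
    docker tag k8s.gcr.io/kube-apiserver:v1.15.1 docker-registry.tools.wmflabs.org/kube-apiserver:v1.15.1
    docker push docker-registry.tools.wmflabs.org/kube-apiserver:v1.15.1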

2019-10-24

  • 16:32 bstorm_: set the prod rsyslog config for kubernetes to false for Toolforge

2019-10-23

  • 20:00 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 (T233347)
  • 12:09 phamhi: Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools (T233347)
  • 09:13 arturo: 9 tools-sgeexec nodes and 6 other related VMs are down because hypervisor is rebooting
  • 09:03 arturo: tools-sgebastion-08 is down because hypervisor is rebooting

2019-10-22

  • 16:56 bstorm_: drained tools-worker-1025.tools.eqiad.wmflabs which was malfunctioning
  • 09:25 arturo: created the `tools.eqiad1.wikimedia.cloud.` DNS zone

2019-10-21

  • 17:32 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.46

2019-10-18

  • 22:15 bd808: Rescheduled continuous jobs away from tools-sgeexec-0904 because of high system load
  • 22:09 bd808: Cleared error state of webgrid-generic@tools-sgewebgrid-generic-0901, webgrid-lighttpd@tools-sgewebgrid-lighttpd-09{12,15,19,20,26}
  • 21:29 bd808: Rescheduled all grid engine webservice jobs (T217815)
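
Clearing queue error states and rescheduling jobs, as in the two entries above, is done with qmod on the grid master; a sketch (the job id is a placeholder):

    # clear the (E)rror state on a queue instance
    sudo qmod -cq 'webgrid-generic@tools-sgewebgrid-generic-0901'
    # reschedule a job so it restarts on another node
    sudo qmod -rj <job-id>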

2019-10-16

  • 16:21 phamhi: Deployed toollabs-webservice 0.46 to buster-tools and stretch-tools (T218461)
  • 09:29 arturo: toolforge is recovered from the reboot of cloudvirt1029
  • 09:17 arturo: due to the reboot of cloudvirt1029, several sgeexec nodes (8) are offline, also sgewebgrid-lighttpd (8) and tools-worker (3) and the main toolforge proxy (tools-proxy-03)

2019-10-15

  • 17:10 phamhi: restart tools-worker-1035 because it is no longer responding

2019-10-14

  • 09:26 arturo: cleaned-up updatetools from tools-sge-services nodes (T229261)

2019-10-11

  • 19:52 bstorm_: restarted docker on tools-docker-builder after phamhi noticed the daemon had a routing issue (blank iptables)
  • 11:55 arturo: create tools-test-proxy-01 VM for testing T235059 and a puppet prefix for it
  • 10:53 arturo: added kubernetes-node_1.4.6-7_amd64.deb to buster-tools and buster-toolsbeta (aptly) for T235059
  • 10:51 arturo: added docker-engine_1.12.6-0~debian-jessie_amd64.deb to buster-tools and buster-toolsbeta (aptly) for T235059
  • 10:46 arturo: added logster_0.0.10-2~jessie1_all.deb to buster-tools and buster-toolsbeta (aptly) for T235059

2019-10-10

  • 02:33 bd808: Rebooting tools-sgewebgrid-lighttpd-0903. Instance hung.

2019-10-09

  • 22:52 jeh: removing test instances tools-sssd-sgeexec-test-[12] from SGE
  • 15:32 phamhi: drained tools-worker-1020/23/33/35/36/40 to rebalance the cluster
  • 14:46 phamhi: drained and cordoned tools-worker-1029 after status reset on reboot
  • 12:37 arturo: drain tools-worker-1038 to rebalance load in the k8s cluster
  • 12:35 arturo: uncordon tools-worker-1029 (was disabled for unknown reasons)
  • 12:33 arturo: drain tools-worker-1010 to rebalance load
  • 10:33 arturo: several sgewebgrid-lighttpd nodes (9) not available because cloudvirt1013 is rebooting
  • 10:21 arturo: several worker nodes (7) not available because cloudvirt1012 is rebooting
  • 10:08 arturo: several worker nodes (6) not available because cloudvirt1009 is rebooting
  • 09:59 arturo: several worker nodes (5) not available because cloudvirt1008 is rebooting

2019-10-08

  • 19:40 bstorm_: drained tools-worker-1007/8 to rebalance the cluster
  • 19:34 bstorm_: drained tools-worker-1009 and then 1014 for rebalancing
  • 19:27 bstorm_: drained tools-worker-1005 for rebalancing (and put these back in service as I went)
  • 19:24 bstorm_: drained tools-worker-1003 and 1009 for rebalancing
  • 15:41 arturo: deleted VM instance tools-sgebastion-0test. No longer in use.

2019-10-07

  • 20:17 bd808: Dropped backlog of messages for delivery to tools.usrd-tools
  • 20:16 bd808: Dropped backlog of messages for delivery to tools.mix-n-match
  • 20:13 bd808: Dropped backlog of frozen messages for delivery (240 dropped)
  • 19:25 bstorm_: deleted tools-puppetmaster-02
  • 19:20 Krenair: reboot tools-k8s-master-01 due to nfs stale issue
  • 19:18 Krenair: reboot tools-paws-worker-1006 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1040 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1039 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1038 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1037 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1036 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1035 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1034 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1033 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1032 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1031 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1030 due to nfs stale issue
  • 19:10 Krenair: reboot tools-puppetmaster-02 due to nfs stale issue
  • 19:09 Krenair: reboot tools-sgebastion-0test due to nfs stale issue
  • 19:08 Krenair: reboot tools-sgebastion-09 due to nfs stale issue
  • 19:08 Krenair: reboot tools-sge-services-04 due to nfs stale issue
  • 19:07 Krenair: reboot tools-paws-worker-1002 due to nfs stale issue
  • 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
  • 19:06 Krenair: reboot tools-docker-registry-03 due to nfs stale issue
  • 19:04 Krenair: reboot tools-worker-1029 due to nfs stale issue
  • 19:00 Krenair: reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
  • 18:55 phamhi: reboot tools-worker-1028 due to nfs stale issue
  • 18:55 phamhi: reboot tools-worker-1027 due to nfs stale issue
  • 18:55 phamhi: reboot tools-worker-1026 due to nfs stale issue
  • 18:55 phamhi: reboot tools-worker-1025 due to nfs stale issue
  • 18:47 phamhi: reboot tools-worker-1023 due to nfs stale issue
  • 18:47 phamhi: reboot tools-worker-1022 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1021 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1020 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1019 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1018 due to nfs stale issue
  • 18:34 phamhi: reboot tools-worker-1017 due to nfs stale issue
  • 18:34 phamhi: reboot tools-worker-1016 due to nfs stale issue
  • 18:32 phamhi: reboot tools-worker-1015 due to nfs stale issue
  • 18:32 phamhi: reboot tools-worker-1014 due to nfs stale issue
  • 18:23 phamhi: reboot tools-worker-1013 due to nfs stale issue
  • 18:21 phamhi: reboot tools-worker-1012 due to nfs stale issue
  • 18:12 phamhi: reboot tools-worker-1011 due to nfs stale issue
  • 18:12 phamhi: reboot tools-worker-1010 due to nfs stale issue
  • 18:08 phamhi: reboot tools-worker-1009 due to nfs stale issue
  • 18:07 phamhi: reboot tools-worker-1008 due to nfs stale issue
  • 17:58 phamhi: reboot tools-worker-1007 due to nfs stale issue
  • 17:57 phamhi: reboot tools-worker-1006 due to nfs stale issue
  • 17:47 phamhi: reboot tools-worker-1005 due to nfs stale issue
  • 17:47 phamhi: reboot tools-worker-1004 due to nfs stale issue
  • 17:43 phamhi: reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
  • 17:35 phamhi: drained and uncordoned tools-worker-100[1-5]
  • 17:32 bstorm_: reboot tools-sgewebgrid-lighttpd-0912
  • 17:30 bstorm_: reboot tools-sgewebgrid-lighttpd-0923/24/08
  • 17:01 bstorm_: rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
  • 16:58 bstorm_: rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
  • 16:53 bstorm_: rebooting tools-sgewebgrid-generic-0902/4
  • 16:50 bstorm_: rebooting tools-sgeexec-0915/18/19/23/26
  • 16:49 bstorm_: rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
  • 16:46 bd808: `sudo shutdown -r now` for tools-sgebastion-08
  • 16:41 bstorm_: reboot tools-sgebastion-07
  • 16:39 bd808: `sudo service nslcd restart` on tools-sgebastion-08
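
Most of the reboots above were responses to stale NFS file handles; a quick way to check for one, assuming the standard Toolforge mount point:

    # a stat that hangs or reports "Stale file handle" means the mount is dead
    timeout 10 stat /mnt/nfs/labstore-secondary-tools-project >/dev/null || echo "stale NFS mount"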

2019-10-04

  • 21:43 bd808: `sudo exec-manage repool tools-sgeexec-0923.tools.eqiad.wmflabs`
  • 21:26 bd808: Rebooting tools-sgeexec-0923 after lots of messing about with a broken update-initramfs build
  • 20:35 bd808: Manually running `/usr/bin/python3 /usr/bin/unattended-upgrade` on tools-sgeexec-0923
  • 20:33 bd808: Killed 2 /usr/bin/unattended-upgrade procs on tools-sgeexec-0923 that seemed stuck
  • 13:33 arturo: remove /etc/init.d/rsyslog on tools-worker-XXXX nodes so the rsyslog deb prerm script doesn't prevent the package from being updated

2019-10-03

  • 13:05 arturo: delete servers tools-sssd-sgeexec-test-[1,2], no longer required

2019-09-27

  • 16:59 bd808: Set "profile::rsyslog::kafka_shipper::kafka_brokers: []" in tools-elastic prefix puppet
  • 00:40 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0927

2019-09-25

  • 19:08 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1021

2019-09-23

  • 16:58 bstorm_: deployed tools-manifest 0.20 and restarted webservicemonitor
  • 06:01 bd808: Restarted maintain-dbusers process on labstore1004. (T233530)

2019-09-12

  • 20:48 phamhi: Deleted tools-puppetdb-01.tools as it is no longer in use

2019-09-11

  • 13:30 jeh: restart tools-sgeexec-0912

2019-09-09

  • 22:44 bstorm_: uncordoned tools-worker-1030 and tools-worker-1038

2019-09-06

  • 15:11 bd808: `sudo kill -9 10635` on tools-k8s-master-01 (T194859)

2019-09-05

  • 21:02 bd808: Enabled Puppet on tools-docker-registry-03 and forced puppet run (T232135)
  • 18:13 bd808: Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)

2019-09-01

  • 20:51 Reedy: `sudo service maintain-kubeusers restart` on tools-k8s-master-01

2019-08-30

  • 16:54 phamhi: restart maintain-kubeusers service in tools-k8s-master-01
  • 16:21 bstorm_: depooling tools-sgewebgrid-lighttpd-0923 to reboot -- high iowait likely from NFS mounts

2019-08-29

  • 22:18 bd808: Finished building new stretch Docker images for Toolforge Kubernetes use
  • 22:06 bd808: Starting process of building new stretch Docker images for Toolforge Kubernetes use
  • 22:05 bd808: Jessie Docker image rebuild complete
  • 21:31 bd808: Starting process of building new jessie Docker images for Toolforge Kubernetes use

2019-08-27

  • 19:10 bd808: Restarted maintain-kubeusers after complaint on irc. It was stuck in limbo again

2019-08-26

  • 21:48 bstorm_: repooled tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0902, tools-sgewebgrid-lighttpd-0903 and tools-sgeexec-0905

2019-08-18

  • 08:11 arturo: restart maintain-kubeusers service in tools-k8s-master-01

2019-08-17

  • 10:56 arturo: force-reboot tools-worker-1006. Is completely stuck

2019-08-15

  • 15:32 jeh: upgraded jobutils debian package to 1.38 T229551
  • 09:22 arturo: restart maintain-kubeusers service in tools-k8s-master-01 because some tools were missing their namespaces

2019-08-13

  • 22:00 bstorm_: truncated exim paniclog on tools-sgecron-01 because it was being spammy
  • 13:41 jeh: Set icinga downtime for toolschecker labs showmount T229448

2019-08-12

  • 16:08 phamhi: updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes (T230147)

2019-08-08

  • 19:26 jeh: restarting tools-sgewebgrid-lighttpd-0915 T230157

2019-08-07

  • 19:07 bd808: Disassociated SUL and Phabricator accounts from user Lophi (T229713)

2019-08-06

  • 16:18 arturo: add phamhi as user/projectadmin (T228942) and delete hpham
  • 15:59 arturo: add hpham as user/projectadmin (T228942)
  • 13:44 jeh: disabling puppet on tools-checker-03 while testing nginx timeouts T221301

2019-08-05

  • 22:49 bstorm_: launching tools-worker-1040
  • 20:36 andrewbogott: rebooting oom tools-worker-1026
  • 16:10 jeh: `tools-k8s-master-01: systemctl restart maintain-kubeusers` T229846
  • 09:39 arturo: `root@tools-checker-03:~# toolscheckerctl restart` again (T229787)
  • 09:30 arturo: `root@tools-checker-03:~# toolscheckerctl restart` (T229787)

2019-08-02

  • 14:00 andrewbogott_: rebooting tools-worker-1022 as it is unresponsive

2019-07-31

  • 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
  • 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
  • 17:32 bstorm_: drained tools-worker-1028 to rebalance load
  • 17:29 bstorm_: drained tools-worker-1008 to rebalance load
  • 17:23 bstorm_: drained tools-worker-1021 to rebalance load
  • 17:17 bstorm_: drained tools-worker-1007 to rebalance load
  • 17:07 bstorm_: drained tools-worker-1004 to rebalance load
  • 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
  • 15:33 bstorm_: T228573 spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)

2019-07-26

  • 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
  • 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
  • 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
  • 16:32 bstorm_: created tools-worker-1034 - T228573
  • 15:57 bstorm_: created tools-worker-1032 and 1033 - T228573
  • 15:55 bstorm_: created tools-worker-1031 - T228573

2019-07-25

  • 22:01 bstorm_: T228573 created tools-worker-1030
  • 21:22 jeh: rebooting tools-worker-1016 unresponsive

2019-07-24

  • 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 (T227539)
  • 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 (T227539)

2019-07-22

  • 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
  • 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
  • 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
  • 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
  • 17:55 bstorm_: draining tools-worker-1023 since it is having issues
  • 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats T228573

2019-07-20

  • 19:52 andrewbogott: rebooting tools-worker-1023

2019-07-17

  • 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014

2019-07-15

  • 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after error from job 5190035

2019-06-25

  • 09:30 arturo: detected puppet issue in all VMs: T226480

2019-06-24

  • 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015

2019-06-17

  • 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
  • 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: T220853 )

2019-06-11

  • 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs

2019-06-05

  • 18:33 andrewbogott: repooled tools-sgeexec-0921 and tools-sgeexec-0929
  • 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929

2019-05-30

  • 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
  • 13:01 arturo: reboot tools-worker-1003 to cleanup sssd config and let nslcd/nscd start freshly
  • 12:47 arturo: reboot tools-worker-1002 to cleanup sssd config and let nslcd/nscd start freshly
  • 12:42 arturo: reboot tools-worker-1001 to cleanup sssd config and let nslcd/nscd start freshly
  • 12:35 arturo: enable puppet in tools-worker nodes
  • 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because T224651 (T224558)
  • 12:25 arturo: cordon/drain tools-worker-1002 because T224651
  • 12:23 arturo: cordon/drain tools-worker-1001 because T224651
  • 12:22 arturo: cordon/drain tools-worker-1029 because T224651
  • 12:20 arturo: cordon/drain tools-worker-1003 because T224651
  • 11:59 arturo: T224558 repool tools-worker-1003 (using sssd/sudo now!)
  • 11:23 arturo: T224558 depool tools-worker-1003
  • 10:48 arturo: T224558 drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
  • 10:33 arturo: T224558 switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
  • 10:28 arturo: T224558 use hiera config in prefix tools-worker for sssd/sudo
  • 10:27 arturo: T224558 switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
  • 10:09 arturo: T224558 disable puppet in all tools-worker- nodes
  • 10:01 arturo: T224558 add tools-worker-1029 to the nodes pool of k8s
  • 09:58 arturo: T224558 reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie

2019-05-29

  • 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
  • 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes (T221225)
  • 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
  • 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning

2019-05-28

  • 18:15 arturo: T221225 for the record, tools-worker-1001 is not working after trying with sssd
  • 18:13 arturo: T221225 created tools-worker-1029 to test sssd/sudo stuff
  • 17:49 arturo: T221225 repool tools-worker-1002 (using nscd/nslcd and sudoldap)
  • 17:44 arturo: T221225 back to classic/ldap hiera config in the tools-worker puppet prefix
  • 17:35 arturo: T221225 hard reboot tools-worker-1001 again
  • 17:27 arturo: T221225 hard reboot tools-worker-1001
  • 17:12 arturo: T221225 depool & switch to sssd/sudo & reboot & repool tools-worker-1002
  • 17:09 arturo: T221225 depool & switch to sssd/sudo & reboot & repool tools-worker-1001
  • 17:08 arturo: T221225 switch to sssd/sudo in puppet prefix for tools-worker
  • 13:04 arturo: T221225 depool and rebooted tools-worker-1001 in preparation for sssd migration
  • 12:39 arturo: T221225 disable puppet in all tools-worker nodes in preparation for sssd
  • 12:32 arturo: drop the tools-bastion puppet prefix, unused
  • 12:31 arturo: T221225 set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
  • 12:27 arturo: T221225 set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
  • 12:16 arturo: T221225 set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
  • 11:26 arturo: merged change to the sudo module to allow sssd transition

2019-05-27

  • 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)

2019-05-21

  • 12:35 arturo: T223992 rebooting tools-redis-1002

2019-05-20

  • 11:25 arturo: T223332 enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
  • 10:53 arturo: T223332 disable puppet agent in tools-k8s-master and tools-docker-registry nodes

2019-05-18

  • 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image (T217908)
  • 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45

2019-05-17

  • 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
  • 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)

2019-05-16

  • 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
  • 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as busiest time

2019-05-15

  • 16:20 arturo: T223148 repool both tools-sgeexec-0921 and -0929
  • 15:32 arturo: T223148 depool tools-sgeexec-0921 and move to cloudvirt1014
  • 15:32 arturo: T223148 depool tools-sgeexec-0920 and move to cloudvirt1014
  • 12:29 arturo: T223148 repool both tools-sgeexec-09[37,39]
  • 12:13 arturo: T223148 depool tools-sgeexec-0937 and move to cloudvirt1008
  • 12:13 arturo: T223148 depool tools-sgeexec-0939 and move to cloudvirt1007
  • 11:34 arturo: T223148 repool tools-sgeexec-0940
  • 11:20 arturo: T223148 depool tools-sgeexec-0940 and move to cloudvirt1006
  • 11:11 arturo: T223148 repool tools-sgeexec-0941
  • 10:46 arturo: T223148 depool tools-sgeexec-0941 and move to cloudvirt1005
  • 09:44 arturo: T223148 repool tools-sgeexec-0901
  • 09:00 arturo: T223148 depool tools-sgeexec-0901 and reallocate to cloudvirt1004

2019-05-14

  • 17:12 arturo: T223148 repool tools-sgeexec-0920
  • 16:37 arturo: T223148 depool tools-sgeexec-0920 and reallocate to cloudvirt1003
  • 16:36 arturo: T223148 repool tools-sgeexec-0911
  • 15:56 arturo: T223148 depool tools-sgeexec-0911 and reallocate to cloudvirt1003
  • 15:52 arturo: T223148 repool tools-sgeexec-0909
  • 15:24 arturo: T223148 depool tools-sgeexec-0909 and reallocate to cloudvirt1002
  • 15:24 arturo: T223148 last SAL entry is bogus, please ignore (depool tools-worker-1009)
  • 15:23 arturo: T223148 depool tools-worker-1009
  • 15:13 arturo: T223148 repool tools-worker-1023
  • 13:16 arturo: T223148 repool tools-sgeexec-0942
  • 13:03 arturo: T223148 repool tools-sgewebgrid-generic-0904
  • 12:58 arturo: T223148 reallocating tools-worker-1023 to cloudvirt1001
  • 12:56 arturo: T223148 depool tools-worker-1023
  • 12:52 arturo: T223148 reallocating tools-sgeexec-0942 to cloudvirt1001
  • 12:50 arturo: T223148 depool tools-sgeexec-0942
  • 12:49 arturo: T223148 reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
  • 12:43 arturo: T223148 depool tools-sgewebgrid-generic-0904

2019-05-13

  • 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs

2019-05-07

  • 14:38 arturo: T222718 uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
  • 14:31 arturo: T222718 reboot tools-worker-1009 and 1022 after being drained
  • 14:28 arturo: k8s drain tools-worker-1009 and 1022
  • 11:46 arturo: T219362 enable puppet in tools-redis servers and use the new puppet role
  • 11:33 arturo: T219362 disable puppet in tools-redis servers for puppet code cleanup
  • 11:12 arturo: T219362 drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
  • 11:10 arturo: T219362 enable puppet in tools-static servers and use new puppet role
  • 11:01 arturo: T219362 disable puppet in tools-static servers for puppet code cleanup
  • 10:16 arturo: T219362 drop the `tools-webgrid-lighttpd` puppet prefix
  • 10:14 arturo: T219362 drop the `tools-webgrid-generic` puppet prefix
  • 10:06 arturo: T219362 drop the `tools-exec-1` puppet prefix

2019-05-06

  • 11:34 arturo: T221225 reenable puppet
  • 10:53 arturo: T221225 disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)

2019-05-03

  • 09:43 arturo: fixed puppet in tools-puppetdb-01 too
  • 09:39 arturo: puppet should now be fine across toolforge (except tools-puppetdb-01, which is WIP I think)
  • 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
  • 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
  • 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
  • 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
  • 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package

2019-04-30

  • 12:50 arturo: enable puppet in all servers T221225
  • 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd (T221225)
  • 11:07 arturo: T221225 disable puppet in toolforge
  • 10:56 arturo: T221225 create tools-sgebastion-0test for more sssd tests

2019-04-29

  • 11:22 arturo: T221225 re-enable puppet agent in all toolforge servers
  • 10:27 arturo: T221225 reboot tool-sgebastion-09 for testing sssd
  • 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test T221225
  • 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages

2019-04-26

  • 12:20 andrewbogott: rescheduling every pod everywhere
  • 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs

2019-04-25

  • 12:49 arturo: T221225 using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
  • 11:43 arturo: T221793 removing prometheus crontab and letting puppet agent re-create it again to resolve staleness

2019-04-24

  • 12:54 arturo: puppet broken, fixing right now
  • 09:18 arturo: T221225 reallocating tools-sgebastion-09 to cloudvirt1008

2019-04-23

  • 15:26 arturo: T221225 rebooting tools-sgebastion-08 to cleanup sssd
  • 15:19 arturo: T221225 creating tools-sgebastion-09 for testing sssd stuff
  • 13:06 arturo: T221225 use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
  • 12:57 arturo: T221225 use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
  • 10:28 arturo: T221225 use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
  • 10:27 arturo: T221225 rebooting tools-sgebastion-07 to clean sssd configuration
  • 10:16 arturo: T221225 disable puppet in tools-sgebastion-08 for sssd testing
  • 09:49 arturo: T221225 run puppet agent in the bastions and reboot them with sssd
  • 09:43 arturo: T221225 use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
  • 09:41 arturo: T221225 disable puppet agent in the bastions

2019-04-17

  • 12:09 arturo: T221225 rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
  • 11:59 arturo: T221205 sssd was deployed successfully into all webgrid nodes
  • 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
  • 11:31 arturo: reboot bastions for sssd deployment
  • 11:30 arturo: deploy sssd to bastions
  • 11:24 arturo: disable puppet in bastions to deploy sssd
  • 09:52 arturo: T221205 tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting the nscd/nslcd packages
  • 09:45 arturo: T221205 tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
  • 09:12 arturo: T221205 start deploying sssd to sgewebgrid nodes
  • 09:00 arturo: T221205 add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
  • 08:57 arturo: T221205 disable puppet in all tools-sgewebgrid-* nodes
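
Fleet-wide puppet toggles like the 08:57 one are typically run from the clush master; a sketch (the node-group name is an assumption):

    # from tools-clushmaster-02: disable puppet with a reason on all webgrid nodes
    clush -w @sgewebgrid -b 'sudo puppet agent --disable "arturo: T221205 sssd rollout"'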

2019-04-16

  • 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
  • 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
  • 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r

2019-04-15

  • 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
  • 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r

2019-04-14

  • 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them

2019-04-13

  • 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for T220853
  • 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for T220853
  • 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 T220853
  • 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 T220853

2019-04-11

  • 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
  • 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
  • 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
  • 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
  • 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
  • 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
  • 15:40 andrewbogott: moving tools-redis-1002 to eqiad1-r
  • 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
  • 12:01 arturo: T151704 deploying oidentd
  • 11:54 arturo: disable puppet in all hosts to deploy oidentd
  • 02:33 andrewbogott: tools-paws-worker-1005, tools-paws-worker-1006 to eqiad1-r
  • 00:03 andrewbogott: tools-paws-worker-1002, tools-paws-worker-1003 to eqiad1-r

2019-04-10

  • 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
  • 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
  • 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
  • 14:49 bstorm_: cleared E state from 5 queues
  • 13:06 arturo: T218126 hard reboot tools-sgeexec-0906
  • 12:31 arturo: T218126 hard reboot tools-sgeexec-0926
  • 12:27 arturo: T218126 hard reboot tools-sgeexec-0925
  • 12:06 arturo: T218126 hard reboot tools-sgeexec-0901
  • 11:55 arturo: T218126 hard reboot tools-sgeexec-0924
  • 11:47 arturo: T218126 hard reboot tools-sgeexec-0921
  • 11:23 arturo: T218126 hard reboot tools-sgeexec-0940
  • 11:03 arturo: T218126 hard reboot tools-sgeexec-0928
  • 10:49 arturo: T218126 hard reboot tools-sgeexec-0923
  • 10:43 arturo: T218126 hard reboot tools-sgeexec-0915
  • 10:27 arturo: T218126 hard reboot tools-sgeexec-0935
  • 10:19 arturo: T218126 hard reboot tools-sgeexec-0914
  • 10:02 arturo: T218126 hard reboot tools-sgeexec-0907
  • 09:41 arturo: T218126 hard reboot tools-sgeexec-0918
  • 09:27 arturo: T218126 hard reboot tools-sgeexec-0932
  • 09:26 arturo: T218216 hard reboot tools-sgeexec-0932
  • 09:04 arturo: T218216 add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
  • 09:03 arturo: T218216 do a controlled rollover of sssd, depooling sgeexec nodes, reboot and repool
  • 08:39 arturo: T218216 disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
  • 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r

2019-04-09

  • 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
  • 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
  • 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
  • 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
  • 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
  • 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
  • 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
  • 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
  • 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
  • 17:05 andrewbogott: migrating tools-k8s-etcd-01 to eqiad1-r
  • 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
  • 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
  • 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
  • 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
  • 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] to get the k8s node moves to register

2019-04-08

  • 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
  • 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r

2019-04-07

  • 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
  • 01:06 bstorm_: cleared E state from 6 queues

2019-04-05

  • 15:44 bstorm_: cleared E state from two exec queues

2019-04-04

  • 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
  • 20:53 bd808: Rebooting tools-worker-1013
  • 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
  • 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
  • 20:28 bd808: Shutdown tools-checker-01 via Horizon
  • 20:17 bd808: Repooled tools-sgewebgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
  • 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
  • 20:09 bd808: Repooled tools-sgewebgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
  • 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
  • 20:05 bstorm_: rebooted tools-sgewebgrid-lighttpd-0912
  • 20:03 bstorm_: depooled tools-sgewebgrid-lighttpd-0912
  • 19:59 bstorm_: depooling and rebooting tools-sgewebgrid-lighttpd-0906
  • 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-upgrade, and forced puppet run
  • 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
  • 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
  • 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
  • 19:13 bstorm_: cleared E state from 7 queues
  • 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host

2019-04-03

  • 11:22 arturo: puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up already

2019-04-02

  • 12:11 arturo: icinga downtime toolschecker for 1 month T219243
  • 03:55 bd808: Added etcd service group to tools-k8s-etcd-* (T219243)

2019-04-01

  • 19:44 bd808: Deleted tools-checker-02 via Horizon (T219243)
  • 19:43 bd808: Shutdown tools-checker-02 via Horizon (T219243)
  • 16:53 bstorm_: cleared E state on 6 grid queues
  • 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)

2019-03-29

  • 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
  • 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 (T219243)
  • 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
  • 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker (T219243)
  • 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing (T219243)
  • 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier (T219243)
  • 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 sudo qmod -cj` on tools-sgegrid-master
  • 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
  • 17:11 bd808: Restarted nginx on tools-static-13
  • 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
  • 16:49 bstorm_: cleared E state from 21 queues
  • 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
  • 13:54 andrewbogott: moving tools-static-13 to eqiad1-r

2019-03-28

  • 01:00 bstorm_: cleared error states from two queues
  • 00:23 bstorm_: T216060 created tools-sgewebgrid-generic-0901...again!

2019-03-27

  • 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue T219460
  • 14:45 bstorm_: cleared several "E" state queues
  • 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
  • 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
  • 12:15 arturo: T218126 `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)

2019-03-26

  • 22:00 gtirloni: downtimed toolschecker
  • 17:31 arturo: T218126 create VM instances tools-sssd-sgeexec-test-[12]
  • 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
  • 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org

2019-03-25

  • 21:21 bd808: All Trusty grid engine hosts shutdown and deleted (T217152)
  • 21:19 bd808: Deleted tools-grid-{master,shadow} (T217152)
  • 21:18 bd808: Deleted tools-webgrid-lighttpd-14* (T217152)
  • 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
  • 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung /usr/bin/lsof +c 15 -nXd DEL processes
  • 20:51 bd808: Deleted tools-webgrid-generic-14* (T217152)
  • 20:49 bd808: Deleted tools-exec-143* (T217152)
  • 20:49 bd808: Deleted tools-exec-142* (T217152)
  • 20:48 bd808: Deleted tools-exec-141* (T217152)
  • 20:47 bd808: Deleted tools-exec-140* (T217152)
  • 20:43 bd808: Deleted tools-cron-01 (T217152)
  • 20:42 bd808: Deleted tools-bastion-0{2,3} (T217152)
  • 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
  • 19:59 bd808: Shutdown tools-exec-143* (T217152)
  • 19:51 bd808: Shutdown tools-exec-142* (T217152)
  • 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
  • 19:33 bd808: Shutdown tools-exec-141* (T217152)
  • 19:31 bd808: Shutdown tools-bastion-0{2,3} (T217152)
  • 19:19 bd808: Shutdown tools-exec-140* (T217152)
  • 19:12 bd808: Shutdown tools-webgrid-generic-14* (T217152)
  • 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* (T217152)
  • 18:53 bd808: Shutdown tools-grid-master (T217152)
  • 18:53 bd808: Shutdown tools-grid-shadow (T217152)
  • 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
  • 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
  • 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
  • 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs (T217152)
  • 15:27 bd808: Copied all crontab files still on tools-cron-01 to each tool's $HOME/crontab.trusty.save
  • 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} (T217152)
  • 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} (T217152)

2019-03-22

  • 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
  • 16:12 bstorm_: cleared errored out stretch grid queues
  • 15:56 bd808: Rebooting tools-static-12
  • 03:09 bstorm_: T217280 depooled and rebooted 15 other nodes. Entire stretch grid is in a good state for now.
  • 02:31 bstorm_: T217280 depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
  • 02:09 bstorm_: T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0924
  • 00:39 bstorm_: T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0902

2019-03-21

  • 23:28 bstorm_: T217280 depooled, reloaded and repooled tools-sgeexec-0938
  • 21:53 bstorm_: T217280 rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
  • 21:51 bstorm_: T217280 rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
  • 21:26 bstorm_: T217280 cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related

2019-03-18

  • 18:43 bd808: Rebooting tools-static-12
  • 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01|07|10)` all else working
  • 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
  • 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
  • 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.org is down

2019-03-17

  • 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for T218494
  • 22:30 bd808: Investigating strange system state on tools-bastion-03.
  • 17:48 bstorm_: T218514 rebooting tools-worker-1009 and 1012
  • 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for T218514
  • 17:13 bstorm_: depooled and rebooting tools-worker-1018
  • 15:09 andrewbogott: running `killall dpkg` and `dpkg --configure -a` on all nodes to try to work around a race with initramfs

2019-03-16

  • 22:34 bstorm_: clearing errored out queues again

2019-03-15

  • 21:08 bstorm_: cleared error state on several queues T217280
  • 15:58 gtirloni: rebooted tools-clushmaster-02
  • 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - T130532
  • 14:32 mutante: tools-sgebastion-07 - generating locales for user request in T130532

2019-03-14

  • 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} (T217152)
  • 23:28 bd808: Deleted tools-bastion-05 (T217152)
  • 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
  • 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon (T217152)
  • 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} (T217152)
  • 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon (T217152)
  • 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon (T217152)
  • 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 (T218341)
  • 21:32 gtirloni: rebooted tools-exec-1020 (T218341)
  • 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 (T218341)
  • 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled (T217152)
  • 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
  • 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
  • 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
  • 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
  • 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
  • 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
  • 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
  • 20:36 bd808: depooled and rebooted tools-sgeexec-0908
  • 19:08 gtirloni: rebooted tools-worker-1028 (T218341)
  • 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 (T218341)
  • 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
  • 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)

2019-03-13

  • 23:30 bd808: Rebuilding stretch Kubernetes images
  • 22:55 bd808: Rebuilding jessie Kubernetes images
  • 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
  • 17:10 bstorm_: rebooted cron server
  • 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
  • 12:33 arturo: reboot tools-sgebastion-08 (T215154)
  • 12:17 arturo: reboot tools-sgebastion-07 (T215154)
  • 11:53 arturo: enable puppet in tools-sgebastion-07 (T215154)
  • 11:20 arturo: disable puppet in tools-sgebastion-07 for testing T215154
  • 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
  • 04:59 bstorm_: disabled puppet for a little bit on tools-sgebastion-07
  • 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 (T217406)

2019-03-11

  • 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot (T218038)
  • 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI (T218038)
  • 15:42 bd808: Rebooting tools-sgegrid-master (T218038)
  • 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
  • 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization (T217280)

2019-03-10

  • 22:36 gtirloni: increased nscd group TTL from 60 to 300sec

2019-03-08

  • 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
  • 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
  • 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization (T217280)
  • 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)

2019-03-07

  • 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
  • 04:15 bd808: Killed 3 orphan processes on Trusty grid
  • 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups (T217280)
  • 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch T217406
  • 00:38 zhuyifei1999_: published misctools 1.37 T217406
  • 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild T217406

2019-03-06

  • 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02

2019-03-04

  • 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for T217473
  • 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)

2019-03-03

  • 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412

2019-02-28

  • 19:36 zhuyifei1999_: built with debuild instead T217297
  • 19:08 zhuyifei1999_: test failures during build, see ticket
  • 18:55 zhuyifei1999_: start building jobutils 1.36 T217297

2019-02-27

  • 20:41 andrewbogott: restarting nginx on tools-checker-01
  • 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
  • 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test T176027
  • 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
  • 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon (T217152)
  • 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs (T217152)
  • 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs (T217152)
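
In gridengine terms, disabling a node's queues and rescheduling its continuous jobs looks roughly like this (a sketch; the job ID is a placeholder):

 qmod -d '*@tools-exec-1433'   # disable every queue instance on the node
 qhost -j -h tools-exec-1433   # see which jobs are still running there
 qmod -rj 1234567              # ask the scheduler to restart a continuous job elsewhere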

2019-02-26

  • 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
  • 19:01 gtirloni: pushed updated docker images
  • 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test

2019-02-25

  • 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for T217066
  • 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test T217066
  • 13:11 chicocvenancio: PAWS: Stopped AABot notebook pod T217010
  • 12:54 chicocvenancio: PAWS: Restarted Criscod notebook pod T217010
  • 12:21 chicocvenancio: PAWS: killed proxy and hub pods to try to get the proxy to see routes to open notebook servers, to no avail. Restarted BernhardHumm's notebook pod T217010
  • 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} (T216988)
  • 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
  • 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
  • 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
  • 07:48 zhuyifei1999_: systemd stuck in D state. :(
  • 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
  • 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
  • 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.
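
The Kubernetes drain/reboot/uncordon cycle used above, as a sketch (flags vary by kubectl version):

 node=tools-worker-1015.tools.eqiad.wmflabs
 kubectl drain "$node" --ignore-daemonsets --delete-local-data   # evict pods
 # hard reboot via Horizon once D-state processes ignore SIGKILL
 kubectl uncordon "$node"                                        # allow scheduling again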

2019-02-22

  • 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
  • 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
  • 15:13 gtirloni: shutdown tools-puppetmaster-01

2019-02-21

  • 09:59 gtirloni: upgraded all packages in all stretch nodes
  • 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
  • 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up

2019-02-20

  • 23:30 zhuyifei1999_: begin rebuilding all docker images T178601 T193646 T215683
  • 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
  • 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
  • 23:17 zhuyifei1999_: begin build new tools-webservice package T178601 T193646 T215683
  • 21:57 andrewbogott: moving tools-static-13 to a new virt host
  • 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
  • 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
  • 16:56 andrewbogott: moving tools-paws-worker-1003
  • 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
  • 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442

2019-02-19

  • 01:49 bd808: Revoked Toolforge project membership for user DannyS712 (T215092)

2019-02-18

  • 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
  • 20:22 gtirloni: enabled toolsdb monitoring in Icinga
  • 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
  • 18:50 chicocvenancio: moving paws back to toolsdb T216208
  • 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness

2019-02-17

  • 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
  • 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
  • 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever

2019-02-16

  • 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
  • 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
  • 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
  • 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
  • 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
  • 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
  • 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
  • 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
  • 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
  • 04:31 zhuyifei1999_: then started nslcd via systemctl and `id zhuyifei1999` returns correct results
  • 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work (recovery sketch below)
  • 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
  • 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
  • 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
  • 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP
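
Condensed, the nslcd recovery above (paths exactly as logged; the root cause was /var/run/nslcd/socket existing as a directory):

 systemctl stop nslcd           # stop the managed service
 nslcd -nd                      # foreground debug run: bind() fails with "Address already in use"
 ls -ld /var/run/nslcd/socket   # the socket path turned out to be a directory
 rmdir /var/run/nslcd/socket    # remove the stray directory
 systemctl start nslcd
 id zhuyifei1999                # LDAP lookups should resolve again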

2019-02-14

  • 21:57 bd808: Deleted old tools-proxy-02 instance
  • 21:57 bd808: Deleted old tools-proxy-01 instance
  • 21:56 bd808: Deleted old tools-package-builder-01 instance
  • 20:57 andrewbogott: rebooting tools-worker-1005
  • 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
  • 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
  • 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
  • 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
  • 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
  • 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
  • 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
  • 17:35 arturo: T215154 tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
  • 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r

2019-02-13

  • 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
  • 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml|awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
  • 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
  • 13:03 arturo: T216030 switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07

2019-02-12

  • 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers (T215704)

2019-02-11

  • 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
  • 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
  • 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
  • 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
  • 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
  • 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
  • 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
  • 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
  • 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
  • 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 (T107878)
  • 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 (T107878)
  • 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos (T107878)
  • 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 (T107878)
  • 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos (T107878)
  • 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 (T107878)
  • 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos (T107878)
  • 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1

2019-02-08

  • 19:17 hauskatze: Stopped webservice of `tools.sulinfo`, which redirects to `tools.quentinv57-tools`, which is also unavailable
  • 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for T210829.
  • 13:49 gtirloni: upgraded all packages in SGE cluster
  • 12:25 arturo: install aptitude in tools-sgebastion-06
  • 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - T215272
  • 01:07 bd808: Creating tools-sgebastion-07

2019-02-07

  • 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
  • 20:18 gtirloni: cleared mail queue on tools-mail-02
  • 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - T215272

2019-02-04

  • 13:20 arturo: T215154 another reboot for tools-sgebastion-06
  • 12:26 arturo: T215154 another reboot for tools-sgebastion-06. Puppet is disabled
  • 11:38 arturo: T215154 reboot tools-sgebastion-06 to totally refresh systemd status
  • 11:36 arturo: T215154 manually install systemd 239 in tools-sgebastion-06

2019-01-30

  • 23:54 gtirloni: cleared apt cache on sge* hosts

2019-01-25

  • 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch (T214668)
  • 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for T214447
  • 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for T214447

2019-01-24

  • 11:09 arturo: T213421 delete tools-services-01/02
  • 09:46 arturo: T213418 delete tools-docker-registry-02
  • 09:45 arturo: T213418 delete tools-docker-builder-05 and tools-docker-registry-01
  • 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
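
Rebase conflicts in a puppetmaster's local labs/private checkout are usually resolved along these lines (the checkout path and branch are assumptions):

 cd /var/lib/git/labs/private           # assumed checkout location
 sudo git status                        # inspect the conflicted rebase
 sudo git rebase --abort                # drop the stuck rebase...
 sudo git pull --rebase origin master   # ...and replay local commits cleanly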

2019-01-23

  • 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image (T214519)
  • 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image (T214519)
  • 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance (T214519)
  • 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon (T214519)
  • 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
  • 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 (T211684)

2019-01-22

  • 20:21 gtirloni: published new docker images (all)
  • 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs

2019-01-21

  • 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet

2019-01-18

  • 21:22 bd808: Forcing php-igbinary update via clush for T213666

2019-01-17

  • 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
  • 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
  • 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
  • 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
  • 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
  • 17:16 arturo: T213421 shutdown tools-services-01/02. Will delete VMs after a grace period
  • 12:54 arturo: add webservice security group to tools-sge-services-03/04

2019-01-16

  • 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
  • 16:38 arturo: T213418 shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
  • 14:34 arturo: T213418 point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
  • 14:24 arturo: T213418 allocate floating IPs for tools-docker-registry-03 & 04

2019-01-15

  • 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
  • 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
  • 18:29 bstorm_: T213711 installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
  • 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
  • 14:21 arturo: T213418 put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`

2019-01-14

  • 22:03 bstorm_: T213711 Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
  • 22:03 bstorm_: T213711 Added ports needed for etcd-flannel to work on the etcd security group in eqiad
  • 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
  • 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
  • 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
  • 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
  • 16:44 arturo: T213418 docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
  • 14:00 arturo: T213421 disable updatetools in the new services nodes while building them
  • 13:53 arturo: T213421 delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
  • 13:47 arturo: T213421 create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`

2019-01-11

  • 11:55 arturo: T213418 shutdown tools-docker-builder-05, will give a grace period before deleting the VM
  • 10:51 arturo: T213418 created tools-docker-builder-06 in eqiad1
  • 10:46 arturo: T213418 migrating tools-docker-registry-02 from eqiad to eqiad1

2019-01-10

  • 22:45 bstorm_: T213357 - Added 24 lighttpd nodes to the new grid
  • 18:54 bstorm_: T213355 built and configured two more generic web nodes for the new grid
  • 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
  • 00:12 bstorm_: T213353 Added 36 exec nodes to the new grid

2019-01-09

  • 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
  • 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
  • 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
  • 09:59 gtirloni: rebooted tools-checker-01 (T213252)

2019-01-07

  • 17:21 bstorm_: T67777 - set the max_u_jobs global grid config setting to 50 in the new grid
  • 15:54 bstorm_: T67777 Set stretch grid user job limit to 16
  • 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.

2019-01-06

  • 22:06 bd808: Added floating ip to tools-sgebastion-06 (T212360)

2019-01-05

  • 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.

2019-01-04

  • 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
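
A sketch of the archive-then-truncate step (the archive filename is hypothetical; the accounting path is from the entry above):

 cd /data/project/.system
 sudo cp accounting "accounting.$(date +%Y%m%d)"   # hypothetical archive name
 sudo truncate -s 0 accounting                     # gridengine keeps appending to the same inode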

2019-01-03

  • 21:03 bd808: Enabled Puppet on tools-proxy-02
  • 20:53 bd808: Disabled Puppet on tools-proxy-02
  • 20:51 bd808: Enabled Puppet on tools-proxy-01
  • 20:49 bd808: Disabled Puppet on tools-proxy-01

2018-12-21

  • 16:29 andrewbogott: migrating tools-exec-1416 to labvirt1004
  • 16:01 andrewbogott: moving tools-grid-master to labvirt1004
  • 00:35 bd808: Installed tools-manifest 0.14 for T212390
  • 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for T212390
  • 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for T212390
  • 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for T212390

2018-12-20

  • 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
  • 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
  • 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002

2018-12-17

  • 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - T212153
  • 19:18 gtirloni: decreased nfs-mount-manager verbosity (T211817)
  • 19:02 arturo: T211977 add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
  • 13:46 arturo: T211977 `aborrero@tools-services-01:~$ sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`

2018-12-11

  • 13:19 gtirloni: Removed BigBrother (T208357)

2018-12-05

  • 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from the cluster (T196973)

2018-12-04

  • 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage T164123
  • 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 (T164123)

2018-12-01

  • 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 (T194615)
  • 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts

2018-11-30

  • 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
  • 22:18 gtirloni: Pushed new jdk8 docker image based on stretch (T205774)
  • 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance (T194615)

2018-11-27

  • 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb

2018-11-26

  • 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) (T210190)
  • 17:34 gtirloni: T186571 removed legofan4000 user from project-tools group (again)
  • 13:31 gtirloni: deleted instance tools-clushmaster-01 (T209701)

2018-11-20

  • 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
  • 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
  • 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
  • 10:52 arturo: T208579 distributing now misctools and jobutils 1.33 in all aptly repos
  • 09:43 godog: restart prometheus@tools on prometheus-01

2018-11-16

  • 21:16 bd808: Ran grid engine orphan process kill script from T153281. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
  • 17:47 gtirloni: deleted tools-mail instance
  • 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
  • 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
  • 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades

2018-11-14

  • 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
  • 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
  • 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009

2018-11-13

  • 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo (T207970)
  • 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
  • 13:29 gtirloni: Changed active mail relay to tools-mail-02 (T209356)
  • 13:22 arturo: T207970 misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
  • 13:05 arturo: T207970 there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
  • 12:59 arturo: the puppet issue has been solved by reverting the code
  • 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit

2018-11-08

  • 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
  • 17:58 arturo: installing jobutils and misctools v1.32 (T207970)
  • 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
  • 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
  • 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
  • 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
  • 11:32 gtirloni: removed temporary /var/mail fix (T208843)

2018-11-07

  • 10:37 gtirloni: removed invalid apt.conf.d file from all hosts (T110055)

2018-11-02

  • 18:11 arturo: T206223 some disturbances due to the certificate renewal
  • 17:04 arturo: renewing *.wmflabs.org T206223

2018-10-31

  • 18:02 gtirloni: truncated big .err and error.log files
  • 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde

2018-10-29

  • 17:00 bd808: Ran grid engine orphan process kill script from T153281

2018-10-26

  • 10:34 arturo: T207970 added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo

2018-10-19

  • 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
  • 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017

2018-10-18

  • 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017

2018-10-16

  • 15:13 bd808: (repost for gtirloni) T186571 removed legofan4000 user from project-tools group (leftover from T165624 legofan4000->macfan4000 rename)

2018-10-07

  • 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 T194859
  • 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be stuck in an infinite loop, sleeping 10 seconds per iteration; installed python3-dbg
  • 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens

2018-09-21

  • 12:35 arturo: cleanup stale apt preference files (pinning) in tools-clushmaster-01
  • 12:14 arturo: T205078 same for {jessie,stretch}-wikimedia
  • 12:12 arturo: T205078 upgrade trusty-wikimedia packages (git-fat, debmonitor)
  • 11:57 arturo: T205078 purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines

2018-09-17

  • 09:13 arturo: T204481 aborrero@tools-mail:~$ sudo exiqgrep -i | xargs sudo exim -Mrm

2018-09-14

  • 11:22 arturo: T204267 stop the corhist tool (k8s) because it is hammering the wikidata API
  • 10:51 arturo: T204267 stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API

2018-09-08

  • 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog (T196137)

2018-09-07

  • 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb

2018-08-27

  • 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` T202932
  • 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
  • 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` T202932

2018-08-22

  • 13:02 arturo: I used this command: `sudo exim -bp | sudo exiqgrep -i | xargs sudo exim -Mrm`
  • 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
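
For reference, `exiqgrep -i` runs exim's queue listing itself and prints only message IDs, so the `exim -bp` stage in the pipe above is redundant; the shorter form logged under 2018-09-17 below is equivalent:

 # remove every queued message (here: bounce backscatter); -i prints message IDs only
 sudo exiqgrep -i | xargs sudo exim -Mrm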

2018-08-13

  • 23:31 legoktm: rebuilding docker images for webservice upgrade
  • 23:16 legoktm: published toollabs-webservice_0.41_all.deb
  • 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice

2018-08-09

  • 10:40 arturo: T201602 upgrade packages from jessie-backports (excluding python-designateclient)
  • 10:30 arturo: T201602 upgrade packages from jessie-wikimedia
  • 10:27 arturo: T201602 upgrade packages from trusty-updates

2018-08-08

  • 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images T156626 T148872 T158244
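
Builds like this one generally follow the debuild/aptly pattern used throughout this archive; a sketch with assumed paths and repo names:

 cd ~/src/tools-webservice            # assumed source checkout
 debuild -us -uc                      # build an unsigned .deb
 aptly repo add stretch-tools ../toollabs-webservice_0.40_all.deb
 aptly publish update stretch-tools   # refresh the published repo (prefix assumed)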

2018-08-06

  • 12:33 arturo: T197176 installing texlive-full in toolforge

2018-08-01

  • 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break

2018-07-30

  • 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
  • 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools

2018-07-27

  • 04:52 zhuyifei1999_: rebuilding python/base docker container T190274

2018-07-25

  • 19:02 chasemp: tools-worker-1004 reboot
  • 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)

2018-07-18

  • 13:24 arturo: upgrading packages from `stretch-wikimedia` T199905
  • 13:18 arturo: upgrading packages from `stable` T199905
  • 12:51 arturo: upgrading packages from `oldstable` T199905
  • 12:31 arturo: upgrading packages from `trusty-updates` T199905
  • 12:16 arturo: upgrading packages from `jessie-wikimedia` T199905
  • 12:09 arturo: upgrading packages from `trusty-wikimedia` T199905

2018-06-30

  • 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
  • 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
  • 16:39 zhuyifei1999_: reboot tools-paws-master-01
  • 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
  • 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere

2018-06-29

  • 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
  • 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121)
  • 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070

2018-06-28

  • 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
  • 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
  • 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
  • 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
  • 16:48 arturo: rebooting tools-docker-registry-01
  • 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
  • 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck

2018-06-21

  • 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash

2018-06-20

  • 15:09 bd808: Killed orphan processes on webgrid nodes (T182070); most owned by jembot and croptool

2018-06-14

  • 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash

2018-06-11

  • 10:11 arturo: T196137 `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart || true'`

2018-06-08

  • 07:46 arturo: T196137 more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes

2018-06-07

  • 11:01 arturo: T196137 force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`

2018-06-06

  • 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)
  • 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
  • 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220; sketch below)
  • 19:04 chasemp: tools-bastion-03 is virtually unusable
  • 09:49 arturo: T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
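
A sketch of what such a scripted restart can look like (the namespace-to-tool mapping and the webservice invocation are assumptions, not the exact script used):

 # restart the webservice of every tool whose pod is crash-looping (sketch)
 kubectl get pods --all-namespaces | awk '/CrashLoopBackOff/ {print $1}' | sort -u |
 while read -r ns; do
     sudo -i -u "tools.${ns}" webservice restart   # assumes namespace == tool name
 done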

2018-06-05

  • 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by debenben (T196486)
  • 17:39 arturo: T196137 clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
  • 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs (T196486)

2018-06-04

  • 10:28 arturo: T196006 installing sqlite3 package in exec nodes

2018-06-03

  • 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs whose names start with 'comm_delin' or 'delfilexcl' T195834
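
A sketch of that selective qdel, assuming standard qstat output (two header lines; column 3 is the job name):

 # delete tools.dibot jobs except its lighttpd webservice (sketch; column positions assumed)
 qstat -u tools.dibot | awk 'NR>2 && $3 !~ /lighttpd/ {print $1}' | xargs -r qdel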

2018-05-30

  • 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
  • 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
  • 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close T195834

2018-05-28

  • 12:09 arturo: T194665 adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
  • 12:06 arturo: T194665 adding mono packages to apt.wikimedia.org for trusty-wikimedia

2018-05-25

  • 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty T195558
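
sge_request holds default qsub options applied to every job, so the change above corresponds to a default-request line roughly like this (a sketch, not a verbatim diff):

 # /data/project/.system/gridengine/default/common/sge_request (defaults for all jobs)
 -l h_vmem=512M,release=trusty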

2018-05-18

  • 16:36 bd808: Restarted bigbrother on tools-services-02

2018-05-16

  • 21:17 zhuyifei1999_: maintain-kubeusers stuck in an infinite loop of 10-second sleeps

2018-05-15

  • 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons.
  • 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
  • 04:05 zhuyifei1999_: Force deletion of grid job 5221417 (tools.giftbot sga), host tools-exec-1414 not responding

2018-05-12

  • 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop | T194343

2018-05-11

  • 14:34 andrewbogott: repooling labvirt1001 tools instances
  • 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for T194258: tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407

2018-05-10

  • 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update

2018-05-09

  • 21:11 Reedy: Added Tim Starling as member/admin

2018-05-07

  • 21:02 zhuyifei1999_: re-building all docker images T190893
  • 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 T190893
  • 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours

2018-05-05

  • 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing

2018-05-03

  • 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package T192566

2018-05-01

  • 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)

2018-04-27

  • 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
  • 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker

2018-04-23

  • 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools T192732

2018-04-22

  • 13:07 bd808: Kill orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -E " 1 " | grep php-cgi | xargs sudo kill -9'`

2018-04-15

  • 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] T192224
  • 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci T192224

2018-04-11

  • 13:25 chasemp: cleaned up frozen exim messages in an effort to alleviate queue pressure

2018-04-06

  • 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
  • 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to T159254
  • 11:23 arturo: manually upgrade apache2 on tools-puppetmaster for T159254

2018-04-05

  • 18:46 chicocvenancio: killed wget that was hogging io

2018-03-29

  • 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
  • 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done

2018-03-28

  • 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid

2018-03-26

  • 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'

2018-03-22

  • 22:04 bd808: Forced puppet run on tools-proxy-02 for T130748
  • 21:52 bd808: Forced puppet run on tools-proxy-01 for T130748
  • 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
  • 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'

2018-03-21

  • 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
  • 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid (T190185)

2018-03-20

  • 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) T189018 T190126

2018-03-19

  • 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools

2018-03-16

  • 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
  • 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp

2018-03-15

  • 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot T185624

2018-03-14

  • 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 (T181531)
  • 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 (T181531)
  • 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 (T181531)
  • 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
  • 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
  • 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full

2018-03-12

  • 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
  • 17:13 arturo: T188994 upgrading packages from `stable`
  • 16:53 arturo: T188994 upgrading packages from stretch-wikimedia
  • 16:33 arturo: T188994 upgrading packages form jessie-wikimedia
  • 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 5f3561e T189430
  • 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
  • 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
  • 13:19 arturo: T188994 upgrade packages from jessie-backports in all jessie servers
  • 12:49 arturo: T188994 upgrade packages from trusty-updates in all ubuntu servers
  • 12:34 arturo: T188994 upgrade packages from trusty-wikimedia in all ubuntu servers

2018-03-08

  • 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
  • 14:02 arturo: T188994 upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server

2018-03-06

  • 16:15 madhuvishy: Reboot tools-docker-registry-02 T189018
  • 15:50 madhuvishy: Rebooting tools-worker-1011
  • 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
  • 15:03 arturo: drain and reboot tools-worker-1011
  • 15:03 chasemp: rebooted tools-worker 1001-1008
  • 14:58 arturo: drain and reboot tools-worker-1010
  • 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
  • 14:27 chasemp: reboot tools-worker-100[12]
  • 14:23 chasemp: downtime icinga alert for k8s workers ready
  • 13:21 arturo: T188994 in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
  • 12:58 arturo: T188994 upgrading packages in jessie nodes from the oldstable source
  • 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
  • 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did this on canary servers last week and it went fine, so now running fleet-wide
  • 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic (T188911)
  • 11:33 arturo: removing unused kernel packages in ubuntu nodes
  • 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster

2018-03-05

  • 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
  • 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb T167026 T181492
  • 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for T188911
  • 14:01 arturo: deleting old kernel packages in jessie instances for T188911
  • 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
  • 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for T187193
  • 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for T187193

2018-03-02

  • 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon

2018-02-27

  • 17:37 chasemp: add chico as admin to toolsbeta
  • 12:23 arturo: running `apt-get autoclean` in canary servers
  • 12:16 arturo: running `apt-get autoremove` in canary servers

2018-02-26

  • 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
  • 10:35 arturo: enable puppet in tools-proxy-01
  • 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests

2018-02-25

  • 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals

2018-02-23

  • 19:11 arturo: enable puppet in tools-proxy-01
  • 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
  • 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
  • 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded

2018-02-22

  • 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server

2018-02-21

  • 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
  • 18:15 arturo: puppet should be fine across the fleet
  • 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
  • 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
  • 16:59 arturo: puppet is broken across the cluster due to last change
  • 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
  • 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
  • 11:43 arturo: package upgrades in tools-webgrid-lighttpd-1401
  • 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
  • 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tools-logs-02
  • 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
  • 09:18 chicocvenancio: killed io intensive tool job in bastion
  • 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, leaks the creds of /data/project/strephit/.elasticsearch.ini Might need to cycle it as well...

2018-02-20

  • 12:42 arturo: upgrading tools-flannel-etcd-01
  • 12:42 arturo: upgrading tools-k8s-etcd-01

2018-02-19

  • 19:13 arturo: upgrade all packages of tools-services-01
  • 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
  • 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
  • 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration

2018-02-16

  • 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
  • 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
  • 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
  • 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
  • 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
  • 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
  • 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
  • 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y

2018-02-15

  • 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for T187435
  • 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
  • 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
  • 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
  • 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
  • 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
  • 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia

2018-02-14

  • 13:09 arturo: the reboot was OK, the server seems working and kubectl sees all the pods running in the deployment (T187315)
  • 13:04 arturo: reboot tools-paws-master-01 for T187315

2018-02-11

  • 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
  • 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775

2018-02-09

  • 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ T179343 T182562 T186846
  • 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
  • 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
  • 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
  • 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that were running on tools-webgrid-lighttpd-1409
  • 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
  • 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 (T186830)
  • 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there

2018-02-08

  • 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
  • 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
  • 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
  • 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
  • 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
  • 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
  • 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
  • 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
  • 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
  • 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
  • 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
  • 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
  • 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
  • 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.

2018-02-06

  • 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
  • 13:05 arturo: unpublish/publish trusty-tools repo
  • 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for T186539 after adding it to trusty-tools repo (self contained)
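
The unpublish/publish cycle used here, sketched with aptly's publish commands (distribution name from these entries; the default publish prefix is an assumption):

 aptly publish drop trusty-tools                              # unpublish the repo
 aptly publish repo -distribution=trusty-tools trusty-tools   # publish it again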

2018-02-05

  • 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address T186539
  • 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
  • 13:06 arturo: deploying fix for T186230 using clush

2018-02-03

  • 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools python3 ./broken_ref_anchors.py"

2018-01-31

  • 22:54 chasemp: add bstorm to sudoers as root

2018-01-29

  • 20:02 chasemp: add zhuyifei1999_ tools root for T185577
  • 20:01 chasemp: blast a puppet run to see if any errors are persistent

2018-01-28

  • 22:49 chicocvenancio: killed compromised session generating miner processes
  • 22:48 chicocvenancio: killed miner processes in tools-bastion-03

2018-01-27

  • 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
  • 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive

2018-01-25

  • 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing by hand configtimeout with http_configtimeout in /etc/puppet/puppet.conf
  • 23:20 arturo: T179386 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
  • 05:25 arturo: deploying misctools and jobutils 1.29 for T179386

2018-01-23

  • 19:41 madhuvishy: Add bstorm to project admins
  • 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
  • 14:17 chasemp: add me, arturo, chico to sudoers and removed marc

2018-01-22

  • 18:32 arturo: T181948 T185314 deploying jobutils and misctools v1.28 in the cluster
  • 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
  • 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how is the cluster with puppet
  • 10:18 arturo: T181948 deploy misctools 1.27 in the cluster

2018-01-19

  • 17:32 arturo: T185314 deploying new version of jobutils 1.27
  • 12:56 arturo: the puppet status across the fleet seems good, only minor things like T185314 , T179388 and T179386
  • 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'

2018-01-18

  • 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to T182781)
  • 15:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
  • 13:52 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter | grep lsbdistcodename | grep trusty && sudo apt-upgrade trusty-wikimedia -v'
  • 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
  • 12:24 arturo: T178717 aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
  • 12:11 arturo: T178717 aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
  • 11:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'

2018-01-17

  • 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions | grep upgradeable | grep trusty-wikimedia' | tee pending-upgrades-report-trusty-wikimedia.txt
  • 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' | tee pending-upgrades-report.txt
  • 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
  • 15:15 andrewbogott: repooled tools-exec-1430 via exec-manage.
  • 15:04 andrewbogott: depooled tools-exec-1430 via exec-manage. Experimenting with purge-old-kernels
  • 14:09 arturo: T181647 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'

2018-01-16

  • 22:01 chasemp: qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//' | xargs qmod -cq
  • 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
  • 21:24 andrewbogott: repooled tools-exec-1420 and tools-webgrid-lighttpd-1417
  • 21:14 andrewbogott: depooling tools-exec-1420 and tools-webgrid-lighttpd-1417
  • 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
  • 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
  • 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
  • 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
  • 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412 and tools-exec-1423 for host reboot
  • 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413 tools-exec-1442 for host reboot
  • 18:50 andrewbogott: switched active proxy back to tools-proxy-02
  • 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
  • 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
  • 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
  • 18:26 andrewbogott: repooling tools-exec-1404 and 1434 for host reboot
  • 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
  • 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
  • 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
  • 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
  • 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
  • 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
  • 13:35 chasemp: tools-mail almouked@ltnet.net 719 pending messages cleared

2018-01-11

  • 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
  • 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
  • 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
  • 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
  • 19:00 chasemp: reboot tools-worker-1015
  • 15:08 chasemp: reboot tools-exec-1405
  • 15:06 chasemp: reboot tools-exec-1404
  • 15:06 chasemp: reboot tools-exec-1403
  • 15:02 chasemp: reboot tools-exec-1402
  • 14:57 chasemp: reboot tools-exec-1401 again...
  • 14:53 chasemp: reboot tools-exec-1401
  • 14:46 chasemp: install Meltdown-patched kernel and reboot workers 1011-1016 as jessie pilot

2018-01-10

  • 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
  • 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
  • 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
  • 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
  • 13:57 arturo: T184604 cleaned stale log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand on tools-worker-1020.tools.eqiad.wmflabs
  • 13:46 arturo: T184604 aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
  • 13:45 arturo: T184604 aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
  • 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
  • 13:22 arturo: emptied syslog and daemon.log by hand; they were too big for logrotate to handle (see the cleanup sketch after this list)
  • 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
  • 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for T184604
  • 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened T184604
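
When /var/log fills up like this, truncating the oversized files in place is safer than deleting them: the daemons writing to them would otherwise keep the unlinked files open and the space would never be freed. A hypothetical cleanup sequence for a worker in this state:

    # Truncate the runaway logs in place, then force a logrotate pass
    sudo truncate -s 0 /var/log/syslog /var/log/daemon.log
    sudo logrotate -f /etc/logrotate.conf
    df -h /var/log    # confirm the space was actually reclaimed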

2018-01-09

  • 23:21 yuvipanda: new paws cluster master is up; re-adding nodes with the same sequence of commands used for the upgrade
  • 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroyed the entire cluster again and installed 1.9.1 (see the kubeadm sketch after this list)
  • 23:01 yuvipanda: kill paws master and reboot it
  • 22:54 yuvipanda: kill all kube-system pods in paws cluster
  • 22:54 yuvipanda: kill all PAWS pods
  • 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
  • 22:49 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
  • 22:48 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash' to set up kubeadm on all paws worker nodes
  • 22:46 yuvipanda: reboot all paws-worker nodes
  • 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
  • 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
  • 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
  • 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
  • 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
  • 20:55 chasemp: for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016`; do kubectl cordon $n; done
  • 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
  • 20:15 chasemp: disable puppet on proxies and k8s workers
  • 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
  • 19:42 chasemp: reboot tools-worker-1010
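
The kubeadm-bootstrap scripts themselves are not quoted in the log, but a kubeadm tear-down-and-rebuild of this era looks roughly like the outline below; the version, token, and master address are assumptions:

    # On every node: remove the old cluster state
    sudo kubeadm reset
    # On the master: initialize the new control plane
    sudo kubeadm init --kubernetes-version v1.9.1
    # On each worker, using the token and hash printed by `kubeadm init`;
    # <token>, <master>, and <hash> are placeholders
    sudo kubeadm join --token <token> <master>:6443 \
        --discovery-token-ca-cert-hash sha256:<hash>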

2018-01-08

  • 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
  • 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02 (see the restart-order note after this list)
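
The chain restarts the node agents in rough dependency order: the container runtime first, then the flannel overlay network, then kube-proxy on top of both. As logged for tools-proxy-02; on a worker, a `sudo service kubelet restart` would sit between the last two steps:

    # Restart order as logged on tools-proxy-02
    sudo service docker restart
    sudo service flannel restart
    sudo service kube-proxy restart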

2018-01-06

  • 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'` (see the FORWARD-policy sketch after this list)
  • 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)
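
Aggregating the FORWARD chain across the fleet points at a failure mode common at the time, though the log does not confirm it was the root cause here: Docker 1.13+ sets the iptables FORWARD policy to DROP, which silently breaks pod-to-pod traffic, including cluster DNS, on flannel clusters. A hypothetical check-and-fix with clush:

    # Show each node's FORWARD policy; clush -b merges identical output
    clush -w @paws-worker -b 'sudo iptables -L FORWARD -n | head -1'
    # If the policy is DROP, flip it back on the affected nodes
    clush -w @paws-worker 'sudo iptables -P FORWARD ACCEPT'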

2018-01-05

  • 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
  • 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
  • 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
  • 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing; see the live-migration sketch after this list)
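
The CPU-balancing entries are OpenStack live migrations of the instances between labvirt hypervisors. With the client syntax of the time this looked roughly like the sketch below; the flags and credential handling are assumptions:

    # Hypothetical admin-side live migration of one instance
    openstack server migrate --live labvirt1017 tools-worker-1012
    # Poll until the instance reports its new hypervisor
    openstack server show tools-worker-1012 -c status -c OS-EXT-SRV-ATTR:host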

2018-01-04

  • 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of T184018

2018-01-03