Nova Resource:Tools/SAL/Archive 3

2019-12-30

  • 05:02 andrewbogott: moving tools-worker-1012 to cloudvirt1024 for T241523
  • 04:49 andrewbogott: draining and rebooting tools-worker-1031, its drive is full

2019-12-29

  • 01:38 Krenair: Cordoned tools-worker-1012 and deleted pods associated with dplbot and dewikigreetbot as well as my own testing one, host seems to be under heavy load - T241523
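
A minimal sketch of the cordon-and-delete sequence behind the entry above, assuming each tool's pods run in a namespace named after the tool and using a placeholder pod name:

    # mark the node unschedulable so rescheduled pods land elsewhere
    kubectl cordon tools-worker-1012
    # locate the tool's pods on the overloaded node
    kubectl get pods -n dplbot -o wide | grep tools-worker-1012
    # delete a pod; its controller recreates it on another node
    kubectl delete pod -n dplbot <pod-name>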

2019-12-27

  • 15:06 Krenair: Killed a "python parse_page.py outreachy" process by aikochou that was hogging IO on tools-sgebastion-07

2019-12-25

  • 16:07 zhuyifei1999_: pkilled 5 `python pwb.py` processes belonging to `tools.kaleem-bot` on tools-sgebastion-07
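
The exact invocation wasn't logged; a minimal sketch of an equivalent command:

    # kill all `python pwb.py` processes owned by the tool account
    sudo pkill -u tools.kaleem-bot -f 'python pwb.py'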

2019-12-22

  • 20:13 bd808: Enabled Puppet on tools-proxy-06.tools.eqiad.wmflabs after nginx config test (T241310)
  • 18:52 bd808: Disabled Puppet on tools-proxy-06.tools.eqiad.wmflabs to test nginx config change (T241310)
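
A sketch of the disable/test/re-enable cycle behind the two entries above, assuming the candidate config was applied by hand while puppet was off:

    sudo puppet agent --disable 'bd808: testing nginx config change (T241310)'
    # check that the edited config parses before reloading nginx
    sudo nginx -t
    sudo puppet agent --enable
    sudo puppet agent --test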

2019-12-20

  • 22:28 bd808: Re-enabled Puppet on tools-sgebastion-09. Reason for disable was "arturo raising systemd limits"
  • 11:33 arturo: reboot tools-k8s-control-3 to fix some stale NFS mount issues

2019-12-18

  • 17:33 bstorm_: updated package in aptly for toollabs-webservice to 0.53
  • 11:49 arturo: introduce placeholder DNS records for toolforge.org domain. No services are provided under this domain yet for end users, this is just us testing (SSL, proxy stuff etc). This may be reverted anytime.
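
Assuming the zone is managed through OpenStack designate, a placeholder record could be created like this (the record name and IP are invented for illustration):

    openstack recordset create toolforge.org. test --type A --record 203.0.113.10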

2019-12-17

  • 20:25 bd808: Fixed https://tools.wmflabs.org/ to redirect to https://tools.wmflabs.org/admin/
  • 19:21 bstorm_: deployed the changes to the live proxy to enable the new kubernetes cluster T234037
  • 16:53 bstorm_: maintain-kubeusers app deployed fully in tools for new kubernetes cluster T214513 T228499
  • 16:50 bstorm_: updated the maintain-kubeusers docker image for beta and tools
  • 04:48 bstorm_: completed first run of maintain-kubeusers 2 in the new cluster T214513
  • 01:26 bstorm_: running the first run of maintain-kubeusers 2.0 for the new cluster T214513 (more successfully this time)
  • 01:25 bstorm_: unset the immutable bit from 1704 tool kubeconfigs T214513 (see the chattr sketch below)
  • 01:05 bstorm_: beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit T214513
  • 00:45 bstorm_: enabled encryption at rest on the new k8s cluster
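
The immutable bit mentioned in the 01:25 and 01:05 entries is handled with chattr(1); a sketch, assuming the kubeconfigs live under each tool's home on NFS:

    # list kubeconfigs still carrying the immutable (i) attribute
    lsattr /data/project/*/.kube/config 2>/dev/null | awk '$1 ~ /i/ {print $2}'
    # clear the bit so maintain-kubeusers can rewrite the file
    sudo chattr -i /data/project/<tool>/.kube/config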

2019-12-16

  • 22:04 bd808: Added 'ALLOW IPv4 25/tcp from 0.0.0.0/0' to "MTA" security group applied to tools-mail-02 (CLI sketch below)
  • 19:05 bstorm_: deployed the maintain-kubeusers operations pod to the new cluster
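
A CLI equivalent of the security group rule added at 22:04 (a sketch; the rule may well have been added via Horizon):

    # allow inbound SMTP from anywhere to instances in the MTA group
    openstack security group rule create --protocol tcp --dst-port 25 --remote-ip 0.0.0.0/0 MTA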

2019-12-14

  • 10:48 valhallasw`cloud: re-enabling puppet on tools-sgeexec-0912, likely left-over from NFS maintenance (no reason was specified).

2019-12-13

  • 18:46 bstorm_: updated tools-k8s-control-2 and 3 to the new config as well
  • 17:56 bstorm_: updated tools-k8s-control-1 to the new control plane configuration
  • 17:47 bstorm_: edited the kubeadm-config configMap object to match the new init config (see the sketch below)
  • 17:32 bstorm_: rebooting tools-k8s-control-2 to correct mount issue
  • 00:45 bstorm_: rebooting tools-static-13
  • 00:28 bstorm_: rebooting the k8s master to clear NFS errors
  • 00:15 bstorm_: switch tools-acme-chief config to match the new authdns_servers format upstream
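
The 17:47 configMap edit noted above amounts to the following, since kubeadm keeps its cluster configuration in kube-system:

    kubectl -n kube-system edit configmap kubeadm-config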

2019-12-12

  • 23:36 bstorm_: rebooting toolschecker after downtiming the services
  • 22:58 bstorm_: rebooting tools-acme-chief-01
  • 22:53 bstorm_: rebooting the cron server, tools-sgecron-01 as it wasn't recovered from last night's maintenance
  • 11:20 arturo: rolling reboot for all grid & k8s worker nodes due to NFS staleness
  • 09:22 arturo: reboot tools-sgeexec-0911 to try fixing weird NFS state
  • 08:46 arturo: doing `run-puppet-agent` in all VMs to see state of NFS
  • 08:34 arturo: reboot tools-worker-1033/1034 and tools-sgebastion-08 to try to correct NFS mount issues

2019-12-11

  • 18:13 bd808: Restarted maintain-dbusers on labstore1004. Process had not logged any account creations since 2019-12-01T22:45:45.
  • 17:24 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1031

2019-12-10

  • 13:59 arturo: set pod replicas to 3 in the new k8s cluster (T239405)

2019-12-09

  • 11:06 andrewbogott: deleting unused security groups: catgraph, devpi, MTA, mysql, syslog, test T91619

2019-12-04

  • 13:45 arturo: drop puppet prefix `tools-cron`, deprecated and no longer in use

2019-11-29

  • 11:45 arturo: created 3 new VMs `tools-k8s-worker-[3,4,5]` (T239403)
  • 10:28 arturo: re-arm keyholder in tools-acme-chief-01 (password in labs/private.git @ tools-puppetmaster-01)
  • 10:27 arturo: re-arm keyholder in tools-acme-chief-02 (password in labs/private.git @ tools-puppetmaster-01)
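
Re-arming keyholder follows the usual sequence; a minimal sketch:

    # check which keys are armed
    sudo keyholder status
    # arm the agent; prompts for the passphrase noted above
    sudo keyholder arm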

2019-11-26

  • 23:25 bstorm_: rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones T236202
  • 22:57 bstorm_: push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds T236202
  • 19:55 phamhi: drained tools-worker-1002,8,15,32 to rebalance the cluster
  • 19:45 phamhi: cleaned up a container that was taking up 16G of disk space on tools-worker-1020 in order to re-run the puppet client
  • 14:01 arturo: drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. Such VM no longer exists
  • 14:00 arturo: introduce the `profile::toolforge::proxies` hiera key in the global puppet config

2019-11-25

  • 10:35 arturo: refresh puppet certs for tools-k8s-etcd-[4-6] nodes (T238655)
  • 10:35 arturo: add puppet cert SANs via instance hiera to tools-k8s-etcd-[4-6] nodes (T238655)

2019-11-22

  • 13:32 arturo: created security group `tools-new-k8s-full-connectivity` and add new k8s VMs to it (T238654)
  • 05:55 jeh: add Riley Huntley `riley` to base tools project

2019-11-21

  • 12:48 arturo: reboot the new k8s cluster after the upgrade
  • 11:49 arturo: upgrading new k8s kubectl version to 1.15.6 (T238654)
  • 11:44 arturo: upgrading new k8s kubelet version to 1.15.6 (T238654)
  • 10:29 arturo: upgrading new k8s cluster version to 1.15.6 using kubeadm (T238654)
  • 10:28 arturo: install kubeadm 1.15.6 on worker/control nodes in the new k8s cluster (T238654)
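
A sketch of the kubeadm upgrade flow the entries above walk through, starting on a control node (the Debian package revision suffix is an assumption):

    sudo apt-get install -y kubeadm=1.15.6-00
    sudo kubeadm upgrade plan
    sudo kubeadm upgrade apply v1.15.6
    # then, per node: upgrade kubelet/kubectl and restart the kubelet
    sudo apt-get install -y kubelet=1.15.6-00 kubectl=1.15.6-00
    sudo systemctl restart kubelet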

2019-11-19

  • 13:49 arturo: re-create nginx-ingress pod due to deployment template refresh (T237643)
  • 12:46 arturo: deploy changes to tools-prometheus to account for the new k8s cluster (T237643)

2019-11-15

  • 14:44 arturo: stop live-hacks on tools-prometheus-01 T237643

2019-11-13

  • 17:20 arturo: live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster (T237643)

2019-11-12

  • 12:52 arturo: reboot tools-proxy-06 to reset iptables setup T238058

2019-11-08

  • 22:47 bstorm_: adding rsync::server::wrap_with_stunnel: false to the tools-docker-registry-03/4 servers to unbreak puppet
  • 18:40 bstorm_: pushed new webservice package to the bastions T230961
  • 18:37 bstorm_: pushed new webservice package supporting buster containers to repo T230961
  • 18:36 bstorm_: pushed buster-sssd images to the docker repo
  • 17:15 phamhi: pushed new buster images with the prefix name "toolforge"

2019-11-07

  • 13:27 arturo: deployed registry-admission-webhook and ingress-admission-controller into the new k8s cluster (T236826)
  • 13:01 arturo: creating puppet prefix `tools-k8s-worker` and a couple of VMs `tools-k8s-worker-[1,2]` T236826
  • 12:57 arturo: increasing project quota T237633
  • 11:54 arturo: point `k8s.tools.eqiad1.wikimedia.cloud` to tools-k8s-haproxy-1 T236826
  • 11:43 arturo: create VMs `tools-k8s-haproxy-[1,2]` T236826
  • 11:43 arturo: create puppet prefix `tools-k8s-haproxy` T236826

2019-11-06

  • 22:32 bstorm_: added rsync::server::wrap_with_stunnel: false to tools-sge-services prefix to fix puppet
  • 21:33 bstorm_: docker images needed for kubernetes cluster upgrade deployed T215531
  • 20:26 bstorm_: building and pushing docker images needed for kubernetes cluster upgrade
  • 16:10 arturo: new k8s cluster control nodes are bootstrapped (T236826)
  • 13:51 arturo: created FQDN `k8s.tools.eqiad1.wikimedia.cloud` pointing to `tools-k8s-control-1` for the initial bootstrap (T236826)
  • 13:50 arturo: created 3 VMs `tools-k8s-control-[1,2,3]` (T236826)
  • 13:43 arturo: created `tools-k8s-control` puppet prefix T236826
  • 11:57 phamhi: restarted all webservices in grid (T233347)

2019-11-05

  • 23:08 Krenair: Dropped 59a77a3, 3830802, and 83df61f from tools-puppetmaster-01:/var/lib/git/labs/private cherry-picks as these are no longer required T206235
  • 22:49 Krenair: Disassociated floating IP 185.15.56.60 from tools-static-13, traffic to this host goes via the project-proxy now. DNS was already changed a few days ago. T236952
  • 22:35 bstorm_: upgraded libpython3.4 libpython3.4-dbg libpython3.4-minimal libpython3.4-stdlib python3.4 python3.4-dbg python3.4-minimal to fix an old broken patch T237468
  • 22:12 bstorm_: pushed docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to the registry to deploy in toolsbeta
  • 17:38 phamhi: restarted lighttpd based webservice pods on tools-worker-103x and 1040 (T233347)
  • 17:34 phamhi: restarted lighttpd based webservice pods on tools-worker-102[0-9] (T233347)
  • 17:06 phamhi: restarted lighttpd based webservice pods on tools-worker-101[0-9] (T233347)
  • 16:44 phamhi: restarted lighttpd based webservice pods on tools-worker-100[1-9] (T233347)
  • 13:55 arturo: created 3 new VMs: `tools-k8s-etcd-[4,5,6]` T236826

2019-11-04

  • 14:45 phamhi: Built and pushed ruby25 docker image based on buster (T230961)
  • 14:45 phamhi: Built and pushed golang111 docker image based on buster (T230961)
  • 14:45 phamhi: Built and pushed jdk11 docker image based on buster (T230961)
  • 14:45 phamhi: Built and pushed php73 docker image based on buster (T230961)
  • 11:10 phamhi: Built and pushed python37 docker image based on buster (T230961)

2019-11-01

  • 21:00 Krenair: Removed tools-checker.wmflabs.org A record to 208.80.155.229 as that target IP is in the old pre-neutron range that is no longer routed
  • 20:57 Krenair: Removed trusty.tools.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
  • 20:56 Krenair: Removed tools-trusty.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
  • 20:38 Krenair: Updated A record for tools-static.wmflabs.org to point towards project-proxy T236952

2019-10-31

  • 18:47 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1001. Runaway logfiles filled up the drive which prevented puppet from running. If puppet had run, it would have prevented the runaway logfiles.
  • 13:59 arturo: update puppet prefix `tools-k8s-etcd-` to use the `role::wmcs::toolforge::k8s::etcd` role T236826
  • 13:41 arturo: disabling puppet in tools-k8s-etcd- nodes to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/546995
  • 10:15 arturo: SSL cert replacement for tools-docker-registry and tools-k8s-master went fine apparently (T236962)
  • 10:02 arturo: icinga downtime toolschecker for 1h for replacing SSL certs in tools-docker-registry and tools-k8s-master (T236962)

2019-10-29

  • 10:49 arturo: deleting VMs tools-test-proxy-01, no longer in use
  • 10:07 arturo: deleting old jessie VMs tools-proxy-03/04 T235627

2019-10-28

  • 16:06 arturo: delete VM instance `tools-test-proxy-01` and the puppet prefix `tools-test-proxy`
  • 15:54 arturo: tools-proxy-05 now has the 185.15.56.11 floating IP as the active proxy. The old one, 185.15.56.6, has been freed T235627
  • 15:54 arturo: shutting down tools-proxy-03 T235627
  • 15:26 bd808: Killed all processes owned by jem on tools-sgebastion-08
  • 15:16 arturo: tools-proxy-05 now has the 185.15.56.5 floating IP as the active proxy T235627
  • 15:14 arturo: refresh hiera to use tools-proxy-05 as active proxy T235627
  • 15:11 bd808: Killed ircbot.php processes started by jem on tools-sgebastion-08 per request on irc
  • 14:58 arturo: added `webproxy` security group to tools-proxy-05 and tools-proxy-06 (T235627)
  • 14:57 phamhi: drained tools-worker-1031.tools.eqiad.wmflabs to clean up disk space
  • 14:45 arturo: created VMs tools-proxy-05 and tools-proxy-06 (T235627)
  • 14:43 arturo: adding `role::wmcs::toolforge::proxy` to the `tools-proxy` puppet prefix (T235627)
  • 14:42 arturo: deleted `role::toollabs::proxy` from the `tools-proxy` puppet profile (T235627)
  • 14:34 arturo: icinga downtime toolschecker for 1h (T235627)
  • 12:25 arturo: upload image `coredns` v1.3.1 (eb516548c180) to docker registry (T236249)
  • 12:23 arturo: upload image `kube-apiserver` v1.15.1 (68c3eb07bfc3) to docker registry (T236249)
  • 12:22 arturo: upload image `kube-controller-manager` v1.15.1 (d75082f1d121) to docker registry (T236249)
  • 12:20 arturo: upload image `kube-proxy` v1.15.1 (89a062da739d) to docker registry (T236249)
  • 12:19 arturo: upload image `kube-scheduler` v1.15.1 (b0b3c4c404da) to docker registry (T236249)
  • 12:04 arturo: upload image `calico/node` v3.8.0 (cd3efa20ff37) to docker registry (T236249)
  • 12:03 arturo: upload image `calico/calico/pod2daemon-flexvol` v3.8.0 (f68c8f870a03) to docker registry (T236249)
  • 12:01 arturo: upload image `calico/cni` v3.8.0 (539ca36a4c13) to docker registry (T236249)
  • 11:58 arturo: upload image `calico/kube-controllers` v3.8.0 (df5ff96cd966) to docker registry (T236249)
  • 11:47 arturo: upload image `nginx-ingress-controller` v0.25.1 (0439eb3e11f1) to docker registry (T236249)
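
Each upload above is the standard pull/tag/push cycle into the local registry; one image shown, and the upstream source registry is an assumption:

    docker pull k8s.gcr.io/kube-apiserver:v1.15.1
    docker tag k8s.gcr.io/kube-apiserver:v1.15.1 docker-registry.tools.wmflabs.org/kube-apiserver:v1.15.1
    docker push docker-registry.tools.wmflabs.org/kube-apiserver:v1.15.1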

2019-10-24

  • 16:32 bstorm_: set the prod rsyslog config for kubernetes to false for Toolforge

2019-10-23

  • 20:00 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 (T233347)
  • 12:09 phamhi: Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools (T233347)
  • 09:13 arturo: 9 tools-sgeexec nodes and 6 other related VMs are down because hypervisor is rebooting
  • 09:03 arturo: tools-sgebastion-08 is down because hypervisor is rebooting

2019-10-22

  • 16:56 bstorm_: drained tools-worker-1025.tools.eqiad.wmflabs which was malfunctioning
  • 09:25 arturo: created the `tools.eqiad1.wikimedia.cloud.` DNS zone

2019-10-21

  • 17:32 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.46

2019-10-18

  • 22:15 bd808: Rescheduled continuous jobs away from tools-sgeexec-0904 because of high system load
  • 22:09 bd808: Cleared error state of webgrid-generic@tools-sgewebgrid-generic-0901, webgrid-lighttpd@tools-sgewebgrid-lighttpd-09{12,15,19,20,26}
  • 21:29 bd808: Rescheduled all grid engine webservice jobs (T217815)
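
Clearing queue error states and rescheduling jobs, as in the two entries above, is done with qmod on the grid master; a sketch (the job id is a placeholder):

    # clear the (E)rror state on a queue instance
    sudo qmod -cq 'webgrid-generic@tools-sgewebgrid-generic-0901'
    # reschedule a job so it restarts on another node
    sudo qmod -rj <job-id>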

2019-10-16

  • 16:21 phamhi: Deployed toollabs-webservice 0.46 to buster-tools and stretch-tools (T218461)
  • 09:29 arturo: toolforge is recovered from the reboot of cloudvirt1029
  • 09:17 arturo: due to the reboot of cloudvirt1029, several sgeexec nodes (8) are offline, also sgewebgrid-lighttpd (8) and tools-worker (3) and the main toolforge proxy (tools-proxy-03)

2019-10-15

  • 17:10 phamhi: restart tools-worker-1035 because it is no longer responding

2019-10-14

  • 09:26 arturo: cleaned-up updatetools from tools-sge-services nodes (T229261)

2019-10-11

  • 19:52 bstorm_: restarted docker on tools-docker-builder after phamhi noticed the daemon had a routing issue (blank iptables)
  • 11:55 arturo: create tools-test-proxy-01 VM for testing T235059 and a puppet prefix for it
  • 10:53 arturo: added kubernetes-node_1.4.6-7_amd64.deb to buster-tools and buster-toolsbeta (aptly) for T235059
  • 10:51 arturo: added docker-engine_1.12.6-0~debian-jessie_amd64.deb to buster-tools and buster-toolsbeta (aptly) for T235059
  • 10:46 arturo: added logster_0.0.10-2~jessie1_all.deb to buster-tools and buster-toolsbeta (aptly) for T235059

2019-10-10

  • 02:33 bd808: Rebooting tools-sgewebgrid-lighttpd-0903. Instance hung.

2019-10-09

  • 22:52 jeh: removing test instances tools-sssd-sgeexec-test-[12] from SGE
  • 15:32 phamhi: drained tools-worker-1020/23/33/35/36/40 to rebalance the cluster
  • 14:46 phamhi: drained and cordoned tools-worker-1029 after status reset on reboot
  • 12:37 arturo: drain tools-worker-1038 to rebalance load in the k8s cluster
  • 12:35 arturo: uncordon tools-worker-1029 (was disabled for unknown reasons)
  • 12:33 arturo: drain tools-worker-1010 to rebalance load
  • 10:33 arturo: several sgewebgrid-lighttpd nodes (9) not available because cloudvirt1013 is rebooting
  • 10:21 arturo: several worker nodes (7) not available because cloudvirt1012 is rebooting
  • 10:08 arturo: several worker nodes (6) not available because cloudvirt1009 is rebooting
  • 09:59 arturo: several worker nodes (5) not available because cloudvirt1008 is rebooting

2019-10-08

  • 19:40 bstorm_: drained tools-worker-1007/8 to rebalance the cluster
  • 19:34 bstorm_: drained tools-worker-1009 and then 1014 for rebalancing
  • 19:27 bstorm_: drained tools-worker-1005 for rebalancing (and put these back in service as I went)
  • 19:24 bstorm_: drained tools-worker-1003 and 1009 for rebalancing
  • 15:41 arturo: deleted VM instance tools-sgebastion-0test. No longer in use.

2019-10-07

  • 20:17 bd808: Dropped backlog of messages for delivery to tools.usrd-tools
  • 20:16 bd808: Dropped backlog of messages for delivery to tools.mix-n-match
  • 20:13 bd808: Dropped backlog of frozen messages for delivery (240 dropped)
  • 19:25 bstorm_: deleted tools-puppetmaster-02
  • 19:20 Krenair: reboot tools-k8s-master-01 due to nfs stale issue
  • 19:18 Krenair: reboot tools-paws-worker-1006 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1040 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1039 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1038 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1037 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1036 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1035 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1034 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1033 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1032 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1031 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1030 due to nfs stale issue
  • 19:10 Krenair: reboot tools-puppetmaster-02 due to nfs stale issue
  • 19:09 Krenair: reboot tools-sgebastion-0test due to nfs stale issue
  • 19:08 Krenair: reboot tools-sgebastion-09 due to nfs stale issue
  • 19:08 Krenair: reboot tools-sge-services-04 due to nfs stale issue
  • 19:07 Krenair: reboot tools-paws-worker-1002 due to nfs stale issue
  • 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
  • 19:06 Krenair: reboot tools-docker-registry-03 due to nfs stale issue
  • 19:04 Krenair: reboot tools-worker-1029 due to nfs stale issue
  • 19:00 Krenair: reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
  • 18:55 phamhi: reboot tools-worker-1028 due to nfs stale issue
  • 18:55 phamhi: reboot tools-worker-1027 due to nfs stale issue
  • 18:55 phamhi: reboot tools-worker-1026 due to nfs stale issue
  • 18:55 phamhi: reboot tools-worker-1025 due to nfs stale issue
  • 18:47 phamhi: reboot tools-worker-1023 due to nfs stale issue
  • 18:47 phamhi: reboot tools-worker-1022 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1021 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1020 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1019 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1018 due to nfs stale issue
  • 18:34 phamhi: reboot tools-worker-1017 due to nfs stale issue
  • 18:34 phamhi: reboot tools-worker-1016 due to nfs stale issue
  • 18:32 phamhi: reboot tools-worker-1015 due to nfs stale issue
  • 18:32 phamhi: reboot tools-worker-1014 due to nfs stale issue
  • 18:23 phamhi: reboot tools-worker-1013 due to nfs stale issue
  • 18:21 phamhi: reboot tools-worker-1012 due to nfs stale issue
  • 18:12 phamhi: reboot tools-worker-1011 due to nfs stale issue
  • 18:12 phamhi: reboot tools-worker-1010 due to nfs stale issue
  • 18:08 phamhi: reboot tools-worker-1009 due to nfs stale issue
  • 18:07 phamhi: reboot tools-worker-1008 due to nfs stale issue
  • 17:58 phamhi: reboot tools-worker-1007 due to nfs stale issue
  • 17:57 phamhi: reboot tools-worker-1006 due to nfs stale issue
  • 17:47 phamhi: reboot tools-worker-1005 due to nfs stale issue
  • 17:47 phamhi: reboot tools-worker-1004 due to nfs stale issue
  • 17:43 phamhi: reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
  • 17:35 phamhi: drained and uncordoned tools-worker-100[1-5]
  • 17:32 bstorm_: reboot tools-sgewebgrid-lighttpd-0912
  • 17:30 bstorm_: reboot tools-sgewebgrid-lighttpd-0923/24/08
  • 17:01 bstorm_: rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
  • 16:58 bstorm_: rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
  • 16:53 bstorm_: rebooting tools-sgewebgrid-generic-0902/4
  • 16:50 bstorm_: rebooting tools-sgeexec-0915/18/19/23/26
  • 16:49 bstorm_: rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
  • 16:46 bd808: `sudo shutdown -r now` for tools-sgebastion-08
  • 16:41 bstorm_: reboot tools-sgebastion-07
  • 16:39 bd808: `sudo service nslcd restart` on tools-sgebastion-08
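
Most of the reboots above were responses to stale NFS file handles; a quick way to check for one, assuming the standard Toolforge mount point:

    # a stat that hangs or reports "Stale file handle" means the mount is dead
    timeout 10 stat /mnt/nfs/labstore-secondary-tools-project >/dev/null || echo "stale NFS mount"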

2019-10-04

  • 21:43 bd808: `sudo exec-manage repool tools-sgeexec-0923.tools.eqiad.wmflabs`
  • 21:26 bd808: Rebooting tools-sgeexec-0923 after lots of messing about with a broken update-initramfs build
  • 20:35 bd808: Manually running `/usr/bin/python3 /usr/bin/unattended-upgrade` on tools-sgeexec-0923
  • 20:33 bd808: Killed 2 /usr/bin/unattended-upgrade procs on tools-sgeexec-0923 that seemed stuck
  • 13:33 arturo: remove /etc/init.d/rsyslog on tools-worker-XXXX nodes so the rsyslog deb prerm script doesn't prevent the package from being updated

2019-10-03

  • 13:05 arturo: delete servers tools-sssd-sgeexec-test-[1,2], no longer required

2019-09-27

  • 16:59 bd808: Set "profile::rsyslog::kafka_shipper::kafka_brokers: []" in tools-elastic prefix puppet
  • 00:40 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0927

2019-09-25

  • 19:08 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1021

2019-09-23

  • 16:58 bstorm_: deployed tools-manifest 0.20 and restarted webservicemonitor
  • 06:01 bd808: Restarted maintain-dbusers process on labstore1004. (T233530)

2019-09-12

  • 20:48 phamhi: Deleted tools-puppetdb-01.tools as it is no longer in use

2019-09-11

  • 13:30 jeh: restart tools-sgeexec-0912

2019-09-09

  • 22:44 bstorm_: uncordoned tools-worker-1030 and tools-worker-1038

2019-09-06

  • 15:11 bd808: `sudo kill -9 10635` on tools-k8s-master-01 (T194859)

2019-09-05

  • 21:02 bd808: Enabled Puppet on tools-docker-registry-03 and forced puppet run (T232135)
  • 18:13 bd808: Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)

2019-09-01

  • 20:51 Reedy: `sudo service maintain-kubeusers restart` on tools-k8s-master-01

2019-08-30

  • 16:54 phamhi: restart maintain-kubeusers service in tools-k8s-master-01
  • 16:21 bstorm_: depooling tools-sgewebgrid-lighttpd-0923 to reboot -- high iowait likely from NFS mounts

2019-08-29

  • 22:18 bd808: Finished building new stretch Docker images for Toolforge Kubernetes use
  • 22:06 bd808: Starting process of building new stretch Docker images for Toolforge Kubernetes use
  • 22:05 bd808: Jessie Docker image rebuild complete
  • 21:31 bd808: Starting process of building new jessie Docker images for Toolforge Kubernetes use

2019-08-27

  • 19:10 bd808: Restarted maintain-kubeusers after complaint on irc. It was stuck in limbo again

2019-08-26

  • 21:48 bstorm_: repooled tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0902, tools-sgewebgrid-lighttpd-0903 and tools-sgeexec-0905

2019-08-18

  • 08:11 arturo: restart maintain-kubeusers service in tools-k8s-master-01

2019-08-17

  • 10:56 arturo: force-reboot tools-worker-1006. Is completely stuck

2019-08-15

  • 15:32 jeh: upgraded jobutils debian package to 1.38 T229551
  • 09:22 arturo: restart maintain-kubeusers service in tools-k8s-master-01 because some tools were missing their namespaces

2019-08-13

  • 22:00 bstorm_: truncated exim paniclog on tools-sgecron-01 because it was being spammy
  • 13:41 jeh: Set icinga downtime for toolschecker labs showmount T229448

2019-08-12

  • 16:08 phamhi: updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes (T230147)

2019-08-08

  • 19:26 jeh: restarting tools-sgewebgrid-lighttpd-0915 T230157

2019-08-07

  • 19:07 bd808: Disassociated SUL and Phabricator accounts from user Lophi (T229713)

2019-08-06

  • 16:18 arturo: add phamhi as user/projectadmin (T228942) and delete hpham
  • 15:59 arturo: add hpham as user/projectadmin (T228942)
  • 13:44 jeh: disabling puppet on tools-checker-03 while testing nginx timeouts T221301

2019-08-05

  • 22:49 bstorm_: launching tools-worker-1040
  • 20:36 andrewbogott: rebooting oom tools-worker-1026
  • 16:10 jeh: `tools-k8s-master-01: systemctl restart maintain-kubeusers` T229846
  • 09:39 arturo: `root@tools-checker-03:~# toolscheckerctl restart` again (T229787)
  • 09:30 arturo: `root@tools-checker-03:~# toolscheckerctl restart` (T229787)

2019-08-02

  • 14:00 andrewbogott_: rebooting tools-worker-1022 as it is unresponsive

2019-07-31

  • 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
  • 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
  • 17:32 bstorm_: drained tools-worker-1028 to rebalance load
  • 17:29 bstorm_: drained tools-worker-1008 to rebalance load
  • 17:23 bstorm_: drained tools-worker-1021 to rebalance load
  • 17:17 bstorm_: drained tools-worker-1007 to rebalance load
  • 17:07 bstorm_: drained tools-worker-1004 to rebalance load
  • 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
  • 15:33 bstorm_: T228573 spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)

2019-07-26

  • 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
  • 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
  • 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
  • 16:32 bstorm_: created tools-worker-1034 - T228573
  • 15:57 bstorm_: created tools-worker-1032 and 1033 - T228573
  • 15:55 bstorm_: created tools-worker-1031 - T228573

2019-07-25

  • 22:01 bstorm_: T228573 created tools-worker-1030
  • 21:22 jeh: rebooting tools-worker-1016 unresponsive

2019-07-24

  • 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 (T227539)
  • 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 (T227539)

2019-07-22

  • 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
  • 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
  • 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
  • 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
  • 17:55 bstorm_: draining tools-worker-1023 since it is having issues
  • 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats T228573

2019-07-20

  • 19:52 andrewbogott: rebooting tools-worker-1023

2019-07-17

  • 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014

2019-07-15

  • 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after error from job 5190035

2019-06-25

  • 09:30 arturo: detected puppet issue in all VMs: T226480

2019-06-24

  • 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015

2019-06-17

  • 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
  • 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: T220853 )

2019-06-11

  • 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs

2019-06-05

  • 18:33 andrewbogott: repooled tools-sgeexec-0921 and tools-sgeexec-0929
  • 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929

2019-05-30

  • 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
  • 13:01 arturo: reboot tools-worker-1003 to cleanup sssd config and let nslcd/nscd start freshly
  • 12:47 arturo: reboot tools-worker-1002 to cleanup sssd config and let nslcd/nscd start freshly
  • 12:42 arturo: reboot tools-worker-1001 to cleanup sssd config and let nslcd/nscd start freshly
  • 12:35 arturo: enable puppet in tools-worker nodes
  • 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because T224651 (T224558)
  • 12:25 arturo: cordon/drain tools-worker-1002 because T224651
  • 12:23 arturo: cordon/drain tools-worker-1001 because T224651
  • 12:22 arturo: cordon/drain tools-worker-1029 because T224651
  • 12:20 arturo: cordon/drain tools-worker-1003 because T224651
  • 11:59 arturo: T224558 repool tools-worker-1003 (using sssd/sudo now!)
  • 11:23 arturo: T224558 depool tools-worker-1003
  • 10:48 arturo: T224558 drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
  • 10:33 arturo: T224558 switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
  • 10:28 arturo: T224558 use hiera config in prefix tools-worker for sssd/sudo
  • 10:27 arturo: T224558 switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
  • 10:09 arturo: T224558 disable puppet in all tools-worker- nodes
  • 10:01 arturo: T224558 add tools-worker-1029 to the nodes pool of k8s
  • 09:58 arturo: T224558 reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie

2019-05-29

  • 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
  • 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes (T221225)
  • 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
  • 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning

2019-05-28

  • 18:15 arturo: T221225 for the record, tools-worker-1001 is not working after trying with sssd
  • 18:13 arturo: T221225 created tools-worker-1029 to test sssd/sudo stuff
  • 17:49 arturo: T221225 repool tools-worker-1002 (using nscd/nslcd and sudoldap)
  • 17:44 arturo: T221225 back to classic/ldap hiera config in the tools-worker puppet prefix
  • 17:35 arturo: T221225 hard reboot tools-worker-1001 again
  • 17:27 arturo: T221225 hard reboot tools-worker-1001
  • 17:12 arturo: T221225 depool & switch to sssd/sudo & reboot & repool tools-worker-1002
  • 17:09 arturo: T221225 depool & switch to sssd/sudo & reboot & repool tools-worker-1001
  • 17:08 arturo: T221225 switch to sssd/sudo in puppet prefix for tools-worker
  • 13:04 arturo: T221225 depool and rebooted tools-worker-1001 in preparation for sssd migration
  • 12:39 arturo: T221225 disable puppet in all tools-worker nodes in preparation for sssd
  • 12:32 arturo: drop the tools-bastion puppet prefix, unused
  • 12:31 arturo: T221225 set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
  • 12:27 arturo: T221225 set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
  • 12:16 arturo: T221225 set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
  • 11:26 arturo: merged change to the sudo module to allow sssd transition

2019-05-27

  • 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)

2019-05-21

  • 12:35 arturo: T223992 rebooting tools-redis-1002

2019-05-20

  • 11:25 arturo: T223332 enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
  • 10:53 arturo: T223332 disable puppet agent in tools-k8s-master and tools-docker-registry nodes

2019-05-18

  • 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image (T217908)
  • 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45

2019-05-17

  • 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
  • 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)

2019-05-16

  • 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
  • 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as busiest time

2019-05-15

  • 16:20 arturo: T223148 repool both tools-sgeexec-0921 and -0929
  • 15:32 arturo: T223148 depool tools-sgeexec-0921 and move to cloudvirt1014
  • 15:32 arturo: T223148 depool tools-sgeexec-0920 and move to cloudvirt1014
  • 12:29 arturo: T223148 repool both tools-sgeexec-09[37,39]
  • 12:13 arturo: T223148 depool tools-sgeexec-0937 and move to cloudvirt1008
  • 12:13 arturo: T223148 depool tools-sgeexec-0939 and move to cloudvirt1007
  • 11:34 arturo: T223148 repool tools-sgeexec-0940
  • 11:20 arturo: T223148 depool tools-sgeexec-0940 and move to cloudvirt1006
  • 11:11 arturo: T223148 repool tools-sgeexec-0941
  • 10:46 arturo: T223148 depool tools-sgeexec-0941 and move to cloudvirt1005
  • 09:44 arturo: T223148 repool tools-sgeexec-0901
  • 09:00 arturo: T223148 depool tools-sgeexec-0901 and reallocate to cloudvirt1004

2019-05-14

  • 17:12 arturo: T223148 repool tools-sgeexec-0920
  • 16:37 arturo: T223148 depool tools-sgeexec-0920 and reallocate to cloudvirt1003
  • 16:36 arturo: T223148 repool tools-sgeexec-0911
  • 15:56 arturo: T223148 depool tools-sgeexec-0911 and reallocate to cloudvirt1003
  • 15:52 arturo: T223148 repool tools-sgeexec-0909
  • 15:24 arturo: T223148 depool tools-sgeexec-0909 and reallocate to cloudvirt1002
  • 15:24 arturo: T223148 last SAL entry is bogus, please ignore (depool tools-worker-1009)
  • 15:23 arturo: T223148 depool tools-worker-1009
  • 15:13 arturo: T223148 repool tools-worker-1023
  • 13:16 arturo: T223148 repool tools-sgeexec-0942
  • 13:03 arturo: T223148 repool tools-sgewebgrid-generic-0904
  • 12:58 arturo: T223148 reallocating tools-worker-1023 to cloudvirt1001
  • 12:56 arturo: T223148 depool tools-worker-1023
  • 12:52 arturo: T223148 reallocating tools-sgeexec-0942 to cloudvirt1001
  • 12:50 arturo: T223148 depool tools-sgeexec-0942
  • 12:49 arturo: T223148 reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
  • 12:43 arturo: T223148 depool tools-sgewebgrid-generic-0904

2019-05-13

  • 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs

2019-05-07

  • 14:38 arturo: T222718 uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
  • 14:31 arturo: T222718 reboot tools-worker-1009 and 1022 after being drained
  • 14:28 arturo: k8s drain tools-worker-1009 and 1022
  • 11:46 arturo: T219362 enable puppet in tools-redis servers and use the new puppet role
  • 11:33 arturo: T219362 disable puppet in tools-redis servers for puppet code cleanup
  • 11:12 arturo: T219362 drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
  • 11:10 arturo: T219362 enable puppet in tools-static servers and use new puppet role
  • 11:01 arturo: T219362 disable puppet in tools-static servers for puppet code cleanup
  • 10:16 arturo: T219362 drop the `tools-webgrid-lighttpd` puppet prefix
  • 10:14 arturo: T219362 drop the `tools-webgrid-generic` puppet prefix
  • 10:06 arturo: T219362 drop the `tools-exec-1` puppet prefix

2019-05-06

  • 11:34 arturo: T221225 reenable puppet
  • 10:53 arturo: T221225 disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)

2019-05-03

  • 09:43 arturo: fixed puppet in tools-puppetdb-01 too
  • 09:39 arturo: puppet should now be fine across toolforge (except tools-puppetdb-01, which is WIP I think)
  • 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
  • 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
  • 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
  • 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
  • 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package

2019-04-30

  • 12:50 arturo: enable puppet in all servers T221225
  • 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd (T221225)
  • 11:07 arturo: T221225 disable puppet in toolforge
  • 10:56 arturo: T221225 create tools-sgebastion-0test for more sssd tests

2019-04-29

  • 11:22 arturo: T221225 re-enable puppet agent in all toolforge servers
  • 10:27 arturo: T221225 reboot tool-sgebastion-09 for testing sssd
  • 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test T221225
  • 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages

2019-04-26

  • 12:20 andrewbogott: rescheduling every pod everywhere
  • 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs

2019-04-25

  • 12:49 arturo: T221225 using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
  • 11:43 arturo: T221793 removing prometheus crontab and letting puppet agent re-create it again to resolve staleness

2019-04-24

  • 12:54 arturo: puppet broken, fixing right now
  • 09:18 arturo: T221225 reallocating tools-sgebastion-09 to cloudvirt1008

2019-04-23

  • 15:26 arturo: T221225 rebooting tools-sgebastion-08 to cleanup sssd
  • 15:19 arturo: T221225 creating tools-sgebastion-09 for testing sssd stuff
  • 13:06 arturo: T221225 use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
  • 12:57 arturo: T221225 use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
  • 10:28 arturo: T221225 use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
  • 10:27 arturo: T221225 rebooting tools-sgebastion-07 to clean sssd configuration
  • 10:16 arturo: T221225 disable puppet in tools-sgebastion-08 for sssd testing
  • 09:49 arturo: T221225 run puppet agent in the bastions and reboot them with sssd
  • 09:43 arturo: T221225 use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
  • 09:41 arturo: T221225 disable puppet agent in the bastions

2019-04-17

  • 12:09 arturo: T221225 rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
  • 11:59 arturo: T221205 sssd was deployed successfully into all webgrid nodes
  • 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
  • 11:31 arturo: reboot bastions for sssd deployment
  • 11:30 arturo: deploy sssd to bastions
  • 11:24 arturo: disable puppet in bastions to deploy sssd
  • 09:52 arturo: T221205 tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting the nscd/nslcd packages
  • 09:45 arturo: T221205 tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
  • 09:12 arturo: T221205 start deploying sssd to sgewebgrid nodes
  • 09:00 arturo: T221205 add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
  • 08:57 arturo: T221205 disable puppet in all tools-sgewebgrid-* nodes
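
Fleet-wide puppet toggles like the 08:57 one are typically run from the clush master; a sketch (the node-group name is an assumption):

    # from tools-clushmaster-02: disable puppet with a reason on all webgrid nodes
    clush -w @sgewebgrid -b 'sudo puppet agent --disable "arturo: T221205 sssd rollout"'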

2019-04-16

  • 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
  • 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
  • 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r

2019-04-15

  • 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
  • 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r

2019-04-14

  • 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them

2019-04-13

  • 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for T220853
  • 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for T220853
  • 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 T220853
  • 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 T220853

2019-04-11

  • 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
  • 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
  • 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
  • 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
  • 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
  • 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
  • 15:40 andrewbogott: moving tools-redis-1002 to eqiad1-r
  • 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
  • 12:01 arturo: T151704 deploying oidentd
  • 11:54 arturo: disable puppet in all hosts to deploy oidentd
  • 02:33 andrewbogott: tools-paws-worker-1005, tools-paws-worker-1006 to eqiad1-r
  • 00:03 andrewbogott: tools-paws-worker-1002, tools-paws-worker-1003 to eqiad1-r

2019-04-10

  • 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
  • 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
  • 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
  • 14:49 bstorm_: cleared E state from 5 queues
  • 13:06 arturo: T218126 hard reboot tools-sgeexec-0906
  • 12:31 arturo: T218126 hard reboot tools-sgeexec-0926
  • 12:27 arturo: T218126 hard reboot tools-sgeexec-0925
  • 12:06 arturo: T218126 hard reboot tools-sgeexec-0901
  • 11:55 arturo: T218126 hard reboot tools-sgeexec-0924
  • 11:47 arturo: T218126 hard reboot tools-sgeexec-0921
  • 11:23 arturo: T218126 hard reboot tools-sgeexec-0940
  • 11:03 arturo: T218126 hard reboot tools-sgeexec-0928
  • 10:49 arturo: T218126 hard reboot tools-sgeexec-0923
  • 10:43 arturo: T218126 hard reboot tools-sgeexec-0915
  • 10:27 arturo: T218126 hard reboot tools-sgeexec-0935
  • 10:19 arturo: T218126 hard reboot tools-sgeexec-0914
  • 10:02 arturo: T218126 hard reboot tools-sgeexec-0907
  • 09:41 arturo: T218126 hard reboot tools-sgeexec-0918
  • 09:27 arturo: T218126 hard reboot tools-sgeexec-0932
  • 09:26 arturo: T218216 hard reboot tools-sgeexec-0932
  • 09:04 arturo: T218216 add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
  • 09:03 arturo: T218216 do a controlled rollover of sssd, depooling sgeexec nodes, reboot and repool
  • 08:39 arturo: T218216 disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
  • 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r

2019-04-09

  • 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
  • 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
  • 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
  • 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
  • 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
  • 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
  • 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
  • 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
  • 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
  • 17:05 andrewbogott: migrating tools-k8s-etcd-01 to eqiad1-r
  • 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
  • 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
  • 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
  • 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
  • 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] to get the k8s node moves to register

2019-04-08

  • 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
  • 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r

2019-04-07

  • 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
  • 01:06 bstorm_: cleared E state from 6 queues

2019-04-05

  • 15:44 bstorm_: cleared E state from two exec queues

2019-04-04

  • 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
  • 20:53 bd808: Rebooting tools-worker-1013
  • 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
  • 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
  • 20:28 bd808: Shutdown tools-checker-01 via Horizon
  • 20:17 bd808: Repooled tools-sgewebgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
  • 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
  • 20:09 bd808: Repooled tools-sgewebgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
  • 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
  • 20:05 bstorm_: rebooted tools-sgewebgrid-lighttpd-0912
  • 20:03 bstorm_: depooled tools-sgewebgrid-lighttpd-0912
  • 19:59 bstorm_: depooling and rebooting tools-sgewebgrid-lighttpd-0906
  • 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-upgrade, and forced puppet run
  • 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
  • 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
  • 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
  • 19:13 bstorm_: cleared E state from 7 queues
  • 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host

2019-04-03

  • 11:22 arturo: puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up already

2019-04-02

  • 12:11 arturo: icinga downtime toolschecker for 1 month T219243
  • 03:55 bd808: Added etcd service group to tools-k8s-etcd-* (T219243)

2019-04-01

  • 19:44 bd808: Deleted tools-checker-02 via Horizon (T219243)
  • 19:43 bd808: Shutdown tools-checker-02 via Horizon (T219243)
  • 16:53 bstorm_: cleared E state on 6 grid queues
  • 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)

2019-03-29

  • 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
  • 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 (T219243)
  • 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
  • 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker (T219243)
  • 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing (T219243)
  • 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier (T219243)
  • 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 sudo qmod -cj` on tools-sgegrid-master
  • 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
  • 17:11 bd808: Restarted nginx on tools-static-13
  • 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
  • 16:49 bstorm_: cleared E state from 21 queues
  • 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
  • 13:54 andrewbogott: moving tools-static-13 to eqiad1-r

2019-03-28

  • 01:00 bstorm_: cleared error states from two queues
  • 00:23 bstorm_: T216060 created tools-sgewebgrid-generic-0901...again!

2019-03-27

  • 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue T219460
  • 14:45 bstorm_: cleared several "E" state queues
  • 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
  • 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
  • 12:15 arturo: T218126 `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)

2019-03-26

  • 22:00 gtirloni: downtimed toolschecker
  • 17:31 arturo: T218126 create VM instances tools-sssd-sgeexec-test-[12]
  • 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
  • 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org

2019-03-25

  • 21:21 bd808: All Trusty grid engine hosts shutdown and deleted (T217152)
  • 21:19 bd808: Deleted tools-grid-{master,shadow} (T217152)
  • 21:18 bd808: Deleted tools-webgrid-lighttpd-14* (T217152)
  • 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
  • 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung /usr/bin/lsof +c 15 -nXd DEL processes
  • 20:51 bd808: Deleted tools-webgrid-generic-14* (T217152)
  • 20:49 bd808: Deleted tools-exec-143* (T217152)
  • 20:49 bd808: Deleted tools-exec-142* (T217152)
  • 20:48 bd808: Deleted tools-exec-141* (T217152)
  • 20:47 bd808: Deleted tools-exec-140* (T217152)
  • 20:43 bd808: Deleted tools-cron-01 (T217152)
  • 20:42 bd808: Deleted tools-bastion-0{2,3} (T217152)
  • 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
  • 19:59 bd808: Shutdown tools-exec-143* (T217152)
  • 19:51 bd808: Shutdown tools-exec-142* (T217152)
  • 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
  • 19:33 bd808: Shutdown tools-exec-141* (T217152)
  • 19:31 bd808: Shutdown tools-bastion-0{2,3} (T217152)
  • 19:19 bd808: Shutdown tools-exec-140* (T217152)
  • 19:12 bd808: Shutdown tools-webgrid-generic-14* (T217152)
  • 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* (T217152)
  • 18:53 bd808: Shutdown tools-grid-master (T217152)
  • 18:53 bd808: Shutdown tools-grid-shadow (T217152)
  • 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
  • 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
  • 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
  • 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs (T217152)
  • 15:27 bd808: Copied all crontab files still on tools-cron-01 to each tool's $HOME/crontab.trusty.save
  • 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} (T217152)
  • 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} (T217152)

2019-03-22

  • 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
  • 16:12 bstorm_: cleared errored out stretch grid queues
  • 15:56 bd808: Rebooting tools-static-12
  • 03:09 bstorm_: T217280 depooled and rebooted 15 other nodes. Entire stretch grid is in a good state for now.
  • 02:31 bstorm_: T217280 depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
  • 02:09 bstorm_: T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0924
  • 00:39 bstorm_: T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0902

2019-03-21

  • 23:28 bstorm_: T217280 depooled, reloaded and repooled tools-sgeexec-0938
  • 21:53 bstorm_: T217280 rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
  • 21:51 bstorm_: T217280 rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
  • 21:26 bstorm_: T217280 cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related

2019-03-18

  • 18:43 bd808: Rebooting tools-static-12
  • 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01|07|10)` all else working
  • 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
  • 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
  • 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.org is down

2019-03-17

  • 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for T218494
  • 22:30 bd808: Investigating strange system state on tools-bastion-03.
  • 17:48 bstorm_: T218514 rebooting tools-worker-1009 and 1012
  • 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for T218514
  • 17:13 bstorm_: depooled and rebooting tools-worker-1018
  • 15:09 andrewbogott: running `killall dpkg` and `dpkg --configure -a` on all nodes to try to work around a race with initramfs

2019-03-16

  • 22:34 bstorm_: clearing errored out queues again

2019-03-15

  • 21:08 bstorm_: cleared error state on several queues T217280
  • 15:58 gtirloni: rebooted tools-clushmaster-02
  • 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - T130532
  • 14:32 mutante: tools-sgebastion-07 - generating locales for user request in T130532

2019-03-14

  • 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} (T217152)
  • 23:28 bd808: Deleted tools-bastion-05 (T217152)
  • 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
  • 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon (T217152)
  • 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} (T217152)
  • 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon (T217152)
  • 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon (T217152)
  • 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 (T218341)
  • 21:32 gtirloni: rebooted tools-exec-1020 (T218341)
  • 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 (T218341)
  • 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled (T217152)
  • 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
  • 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
  • 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
  • 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
  • 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
  • 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
  • 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
  • 20:36 bd808: depooled and rebooted tools-sgeexec-0908
  • 19:08 gtirloni: rebooted tools-worker-1028 (T218341)
  • 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 (T218341)
  • 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
  • 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)

2019-03-13

  • 23:30 bd808: Rebuilding stretch Kubernetes images
  • 22:55 bd808: Rebuilding jessie Kubernetes images
  • 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
  • 17:10 bstorm_: rebooted cron server
  • 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
  • 12:33 arturo: reboot tools-sgebastion-08 (T215154)
  • 12:17 arturo: reboot tools-sgebastion-07 (T215154)
  • 11:53 arturo: enable puppet in tools-sgebastion-07 (T215154)
  • 11:20 arturo: disable puppet in tools-sgebastion-07 for testing T215154
  • 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
  • 04:59 bstorm_: disabled puppet for a little bit on tools-sgebastion-07
  • 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 (T217406)

2019-03-11

  • 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot (T218038)
  • 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI (T218038)
  • 15:42 bd808: Rebooting tools-sgegrid-master (T218038)
  • 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
  • 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization (T217280)

2019-03-10

  • 22:36 gtirloni: increased nscd group TTL from 60 to 300sec

2019-03-08

  • 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
  • 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
  • 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization (T217280)
  • 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)

2019-03-07

  • 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
  • 04:15 bd808: Killed 3 orphan processes on Trusty grid
  • 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups (T217280)
  • 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch T217406
  • 00:38 zhuyifei1999_: published misctools 1.37 T217406
  • 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild T217406

2019-03-06

  • 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02

2019-03-04

  • 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for T217473
  • 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)

2019-03-03

  • 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412

2019-02-28

  • 19:36 zhuyifei1999_: built with debuild instead T217297
  • 19:08 zhuyifei1999_: test failures during build, see ticket
  • 18:55 zhuyifei1999_: start building jobutils 1.36 T217297

2019-02-27

  • 20:41 andrewbogott: restarting nginx on tools-checker-01
  • 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
  • 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test T176027
  • 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
  • 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon (T217152)
  • 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs (T217152)
  • 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs (T217152)
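
In gridengine terms, disabling a node's queues and rescheduling its continuous jobs looks roughly like this (a sketch; the job ID is a placeholder):

 qmod -d '*@tools-exec-1433'   # disable every queue instance on the node
 qhost -j -h tools-exec-1433   # see which jobs are still running there
 qmod -rj 1234567              # ask the scheduler to restart a continuous job elsewhere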

2019-02-26

  • 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
  • 19:01 gtirloni: pushed updated docker images
  • 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test

2019-02-25

  • 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for T217066
  • 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test T217066
  • 13:11 chicocvenancio: PAWS: Stopped AABot notebook pod T217010
  • 12:54 chicocvenancio: PAWS: Restarted Criscod notebook pod T217010
  • 12:21 chicocvenancio: PAWS: killed proxy and hub pods to try to get the proxy to see routes to open notebook servers, to no avail. Restarted BernhardHumm's notebook pod T217010
  • 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} (T216988)
  • 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
  • 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
  • 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
  • 07:48 zhuyifei1999_: systemd stuck in D state. :(
  • 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
  • 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
  • 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.
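
The Kubernetes drain/reboot/uncordon cycle used above, as a sketch (flags vary by kubectl version):

 node=tools-worker-1015.tools.eqiad.wmflabs
 kubectl drain "$node" --ignore-daemonsets --delete-local-data   # evict pods
 # hard reboot via Horizon once D-state processes ignore SIGKILL
 kubectl uncordon "$node"                                        # allow scheduling again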

2019-02-22

  • 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
  • 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
  • 15:13 gtirloni: shutdown tools-puppetmaster-01

2019-02-21

  • 09:59 gtirloni: upgraded all packages in all stretch nodes
  • 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
  • 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up

2019-02-20

  • 23:30 zhuyifei1999_: begin rebuilding all docker images T178601 T193646 T215683
  • 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
  • 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
  • 23:17 zhuyifei1999_: begin build new tools-webservice package T178601 T193646 T215683
  • 21:57 andrewbogott: moving tools-static-13 to a new virt host
  • 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
  • 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
  • 16:56 andrewbogott: moving tools-paws-worker-1003
  • 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
  • 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442

2019-02-19

  • 01:49 bd808: Revoked Toolforge project membership for user DannyS712 (T215092)

2019-02-18

  • 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
  • 20:22 gtirloni: enabled toolsdb monitoring in Icinga
  • 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
  • 18:50 chicocvenancio: moving paws back to toolsdb T216208
  • 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness

2019-02-17

  • 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
  • 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
  • 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever

2019-02-16

  • 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
  • 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
  • 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
  • 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
  • 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
  • 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
  • 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
  • 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
  • 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
  • 04:31 zhuyifei1999_: then started nslcd via systemctl and `id zhuyifei1999` returns correct results
  • 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work (recovery sketch below)
  • 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
  • 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
  • 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
  • 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP
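
Condensed, the nslcd recovery above (paths exactly as logged; the root cause was /var/run/nslcd/socket existing as a directory):

 systemctl stop nslcd           # stop the managed service
 nslcd -nd                      # foreground debug run: bind() fails with "Address already in use"
 ls -ld /var/run/nslcd/socket   # the socket path turned out to be a directory
 rmdir /var/run/nslcd/socket    # remove the stray directory
 systemctl start nslcd
 id zhuyifei1999                # LDAP lookups should resolve again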

2019-02-14

  • 21:57 bd808: Deleted old tools-proxy-02 instance
  • 21:57 bd808: Deleted old tools-proxy-01 instance
  • 21:56 bd808: Deleted old tools-package-builder-01 instance
  • 20:57 andrewbogott: rebooting tools-worker-1005
  • 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
  • 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
  • 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
  • 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
  • 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
  • 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
  • 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
  • 17:35 arturo: T215154 tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
  • 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r

2019-02-13

  • 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
  • 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml|awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
  • 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
  • 13:03 arturo: T216030 switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07

2019-02-12

  • 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers (T215704)

2019-02-11

  • 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
  • 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
  • 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
  • 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
  • 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
  • 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
  • 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
  • 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
  • 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
  • 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 (T107878)
  • 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 (T107878)
  • 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos (T107878)
  • 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 (T107878)
  • 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos (T107878)
  • 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 (T107878)
  • 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos (T107878)
  • 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1

2019-02-08

  • 19:17 hauskatze: Stopped webservice of `tools.sulinfo`, which redirects to `tools.quentinv57-tools`, which is also unavailable
  • 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for T210829.
  • 13:49 gtirloni: upgraded all packages in SGE cluster
  • 12:25 arturo: install aptitude in tools-sgebastion-06
  • 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - T215272
  • 01:07 bd808: Creating tools-sgebastion-07

2019-02-07

  • 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
  • 20:18 gtirloni: cleared mail queue on tools-mail-02
  • 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - T215272

2019-02-04

  • 13:20 arturo: T215154 another reboot for tools-sgebastion-06
  • 12:26 arturo: T215154 another reboot for tools-sgebastion-06. Puppet is disabled
  • 11:38 arturo: T215154 reboot tools-sgebastion-06 to totally refresh systemd status
  • 11:36 arturo: T215154 manually install systemd 239 in tools-sgebastion-06

2019-01-30

  • 23:54 gtirloni: cleared apt cache on sge* hosts

2019-01-25

  • 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch (T214668)
  • 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for T214447
  • 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for T214447

2019-01-24

  • 11:09 arturo: T213421 delete tools-services-01/02
  • 09:46 arturo: T213418 delete tools-docker-registry-02
  • 09:45 arturo: T213418 delete tools-docker-builder-05 and tools-docker-registry-01
  • 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
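
Rebase conflicts in a puppetmaster's local labs/private checkout are usually resolved along these lines (the checkout path and branch are assumptions):

 cd /var/lib/git/labs/private           # assumed checkout location
 sudo git status                        # inspect the conflicted rebase
 sudo git rebase --abort                # drop the stuck rebase...
 sudo git pull --rebase origin master   # ...and replay local commits cleanly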

2019-01-23

  • 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image (T214519)
  • 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image (T214519)
  • 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance (T214519)
  • 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon (T214519)
  • 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
  • 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 (T211684)

2019-01-22

  • 20:21 gtirloni: published new docker images (all)
  • 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs

2019-01-21

  • 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet

2019-01-18

  • 21:22 bd808: Forcing php-igbinary update via clush for T213666

2019-01-17

  • 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
  • 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
  • 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
  • 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
  • 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
  • 17:16 arturo: T213421 shutdown tools-services-01/02. Will delete VMs after a grace period
  • 12:54 arturo: add webservice security group to tools-sge-services-03/04

2019-01-16

  • 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
  • 16:38 arturo: T213418 shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
  • 14:34 arturo: T213418 point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
  • 14:24 arturo: T213418 allocate floating IPs for tools-docker-registry-03 & 04

2019-01-15

  • 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
  • 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
  • 18:29 bstorm_: T213711 installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
  • 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
  • 14:21 arturo: T213418 put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`

2019-01-14

  • 22:03 bstorm_: T213711 Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
  • 22:03 bstorm_: T213711 Added ports needed for etcd-flannel to work on the etcd security group in eqiad
  • 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
  • 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
  • 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
  • 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
  • 16:44 arturo: T213418 docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
  • 14:00 arturo: T213421 disable updatetools in the new services nodes while building them
  • 13:53 arturo: T213421 delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
  • 13:47 arturo: T213421 create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`

2019-01-11

  • 11:55 arturo: T213418 shutdown tools-docker-builder-05, will give a grace period before deleting the VM
  • 10:51 arturo: T213418 created tools-docker-builder-06 in eqiad1
  • 10:46 arturo: T213418 migrating tools-docker-registry-02 from eqiad to eqiad1

2019-01-10

  • 22:45 bstorm_: T213357 - Added 24 lighttpd nodes to the new grid
  • 18:54 bstorm_: T213355 built and configured two more generic web nodes for the new grid
  • 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
  • 00:12 bstorm_: T213353 Added 36 exec nodes to the new grid

2019-01-09

  • 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
  • 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
  • 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
  • 09:59 gtirloni: rebooted tools-checker-01 (T213252)

2019-01-07

  • 17:21 bstorm_: T67777 - set the max_u_jobs global grid config setting to 50 in the new grid
  • 15:54 bstorm_: T67777 Set stretch grid user job limit to 16
  • 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.

2019-01-06

  • 22:06 bd808: Added floating ip to tools-sgebastion-06 (T212360)

2019-01-05

  • 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.

2019-01-04

  • 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
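
A sketch of the archive-then-truncate step (the archive filename is hypothetical; the accounting path is from the entry above):

 cd /data/project/.system
 sudo cp accounting "accounting.$(date +%Y%m%d)"   # hypothetical archive name
 sudo truncate -s 0 accounting                     # gridengine keeps appending to the same inode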

2019-01-03

  • 21:03 bd808: Enabled Puppet on tools-proxy-02
  • 20:53 bd808: Disabled Puppet on tools-proxy-02
  • 20:51 bd808: Enabled Puppet on tools-proxy-01
  • 20:49 bd808: Disabled Puppet on tools-proxy-01

2018-12-21

  • 16:29 andrewbogott: migrating tools-exec-1416 to labvirt1004
  • 16:01 andrewbogott: moving tools-grid-master to labvirt1004
  • 00:35 bd808: Installed tools-manifest 0.14 for T212390
  • 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for T212390
  • 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for T212390
  • 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for T212390

2018-12-20

  • 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
  • 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
  • 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002

2018-12-17

  • 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - T212153
  • 19:18 gtirloni: decreased nfs-mount-manager verbosity (T211817)
  • 19:02 arturo: T211977 add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
  • 13:46 arturo: T211977 `aborrero@tools-services-01:~$ sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`

2018-12-11

  • 13:19 gtirloni: Removed BigBrother (T208357)

2018-12-05

  • 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from the cluster (T196973)

2018-12-04

  • 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage T164123
  • 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 (T164123)

2018-12-01

  • 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 (T194615)
  • 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts

2018-11-30

  • 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
  • 22:18 gtirloni: Pushed new jdk8 docker image based on stretch (T205774)
  • 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance (T194615)

2018-11-27

  • 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb

2018-11-26

  • 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) (T210190)
  • 17:34 gtirloni: T186571 removed legofan4000 user from project-tools group (again)
  • 13:31 gtirloni: deleted instance tools-clushmaster-01 (T209701)

2018-11-20

  • 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
  • 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
  • 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
  • 10:52 arturo: T208579 distributing now misctools and jobutils 1.33 in all aptly repos
  • 09:43 godog: restart prometheus@tools on prometheus-01

2018-11-16

  • 21:16 bd808: Ran grid engine orphan process kill script from T153281. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
  • 17:47 gtirloni: deleted tools-mail instance
  • 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
  • 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
  • 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades

2018-11-14

  • 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
  • 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
  • 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009

2018-11-13

  • 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo (T207970)
  • 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
  • 13:29 gtirloni: Changed active mail relay to tools-mail-02 (T209356)
  • 13:22 arturo: T207970 misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
  • 13:05 arturo: T207970 there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
  • 12:59 arturo: the puppet issue has been solved by reverting the code
  • 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit

2018-11-08

  • 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
  • 17:58 arturo: installing jobutils and misctools v1.32 (T207970)
  • 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
  • 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
  • 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
  • 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
  • 11:32 gtirloni: removed temporary /var/mail fix (T208843)

2018-11-07

  • 10:37 gtirloni: removed invalid apt.conf.d file from all hosts (T110055)

2018-11-02

  • 18:11 arturo: T206223 some disturbances due to the certificate renewal
  • 17:04 arturo: renewing *.wmflabs.org T206223

2018-10-31

  • 18:02 gtirloni: truncated big .err and error.log files
  • 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde

2018-10-29

  • 17:00 bd808: Ran grid engine orphan process kill script from T153281

2018-10-26

  • 10:34 arturo: T207970 added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo

2018-10-19

  • 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
  • 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017

2018-10-18

  • 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017

2018-10-16

  • 15:13 bd808: (repost for gtirloni) T186571 removed legofan4000 user from project-tools group (leftover from T165624 legofan4000->macfan4000 rename)

2018-10-07

  • 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 T194859
  • 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be stuck in an infinite loop, sleeping 10 seconds per iteration; installed python3-dbg
  • 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens

2018-09-21

  • 12:35 arturo: cleanup stale apt preference files (pinning) in tools-clushmaster-01
  • 12:14 arturo: T205078 same for {jessie,stretch}-wikimedia
  • 12:12 arturo: T205078 upgrade trusty-wikimedia packages (git-fat, debmonitor)
  • 11:57 arturo: T205078 purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines

2018-09-17

  • 09:13 arturo: T204481 aborrero@tools-mail:~$ sudo exiqgrep -i | xargs sudo exim -Mrm

2018-09-14

  • 11:22 arturo: T204267 stop the corhist tool (k8s) because it is hammering the wikidata API
  • 10:51 arturo: T204267 stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API

2018-09-08

  • 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog (T196137)

2018-09-07

  • 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb

2018-08-27

  • 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` T202932
  • 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
  • 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` T202932

2018-08-22

  • 13:02 arturo: I used this command: `sudo exim -bp | sudo exiqgrep -i | xargs sudo exim -Mrm`
  • 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
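
For reference, `exiqgrep -i` runs exim's queue listing itself and prints only message IDs, so the `exim -bp` stage in the pipe above is redundant; the shorter form logged under 2018-09-17 below is equivalent:

 # remove every queued message (here: bounce backscatter); -i prints message IDs only
 sudo exiqgrep -i | xargs sudo exim -Mrm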

2018-08-13

  • 23:31 legoktm: rebuilding docker images for webservice upgrade
  • 23:16 legoktm: published toollabs-webservice_0.41_all.deb
  • 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice

2018-08-09

  • 10:40 arturo: T201602 upgrade packages from jessie-backports (excluding python-designateclient)
  • 10:30 arturo: T201602 upgrade packages from jessie-wikimedia
  • 10:27 arturo: T201602 upgrade packages from trusty-updates

2018-08-08

  • 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images T156626 T148872 T158244
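
Builds like this one generally follow the debuild/aptly pattern used throughout this archive; a sketch with assumed paths and repo names:

 cd ~/src/tools-webservice            # assumed source checkout
 debuild -us -uc                      # build an unsigned .deb
 aptly repo add stretch-tools ../toollabs-webservice_0.40_all.deb
 aptly publish update stretch-tools   # refresh the published repo (prefix assumed)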

2018-08-06

  • 12:33 arturo: T197176 installing texlive-full in toolforge

2018-08-01

  • 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break

2018-07-30

  • 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
  • 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools

2018-07-27

  • 04:52 zhuyifei1999_: rebuilding python/base docker container T190274

2018-07-25

  • 19:02 chasemp: tools-worker-1004 reboot
  • 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)

2018-07-18

  • 13:24 arturo: upgrading packages from `stretch-wikimedia` T199905
  • 13:18 arturo: upgrading packages from `stable` T199905
  • 12:51 arturo: upgrading packages from `oldstable` T199905
  • 12:31 arturo: upgrading packages from `trusty-updates` T199905
  • 12:16 arturo: upgrading packages from `jessie-wikimedia` T199905
  • 12:09 arturo: upgrading packages from `trusty-wikimedia` T199905

2018-06-30

  • 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
  • 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
  • 16:39 zhuyifei1999_: reboot tools-paws-master-01
  • 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
  • 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere

2018-06-29

  • 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
  • 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121)
  • 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070

2018-06-28

  • 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
  • 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
  • 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
  • 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
  • 16:48 arturo: rebooting tools-docker-registry-01
  • 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
  • 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck

2018-06-21

  • 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash

2018-06-20

  • 15:09 bd808: Killed orphan processes on webgrid nodes (T182070); most owned by jembot and croptool

2018-06-14

  • 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash

2018-06-11

  • 10:11 arturo: T196137 `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart || true'`

2018-06-08

  • 07:46 arturo: T196137 more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes

2018-06-07

  • 11:01 arturo: T196137 force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`

2018-06-06

  • 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)
  • 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
  • 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220; sketch below)
  • 19:04 chasemp: tools-bastion-03 is virtually unusable
  • 09:49 arturo: T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
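
A sketch of what such a scripted restart can look like (the namespace-to-tool mapping and the webservice invocation are assumptions, not the exact script used):

 # restart the webservice of every tool whose pod is crash-looping (sketch)
 kubectl get pods --all-namespaces | awk '/CrashLoopBackOff/ {print $1}' | sort -u |
 while read -r ns; do
     sudo -i -u "tools.${ns}" webservice restart   # assumes namespace == tool name
 done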

2018-06-05

  • 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by debenben (T196486)
  • 17:39 arturo: T196137 clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
  • 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs (T196486)

2018-06-04

  • 10:28 arturo: T196006 installing sqlite3 package in exec nodes

2018-06-03

  • 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs whose names start with 'comm_delin' or 'delfilexcl' T195834
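
A sketch of that selective qdel, assuming standard qstat output (two header lines; column 3 is the job name):

 # delete tools.dibot jobs except its lighttpd webservice (sketch; column positions assumed)
 qstat -u tools.dibot | awk 'NR>2 && $3 !~ /lighttpd/ {print $1}' | xargs -r qdel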

2018-05-30

  • 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
  • 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
  • 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close T195834

2018-05-28

  • 12:09 arturo: T194665 adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
  • 12:06 arturo: T194665 adding mono packages to apt.wikimedia.org for trusty-wikimedia

2018-05-25

  • 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty T195558
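
sge_request holds default qsub options applied to every job, so the change above corresponds to a default-request line roughly like this (a sketch, not a verbatim diff):

 # /data/project/.system/gridengine/default/common/sge_request (defaults for all jobs)
 -l h_vmem=512M,release=trusty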

2018-05-18

  • 16:36 bd808: Restarted bigbrother on tools-services-02

2018-05-16

  • 21:17 zhuyifei1999_: maintain-kubeusers stuck in an infinite loop of 10-second sleeps

2018-05-15

  • 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons.
  • 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
  • 04:05 zhuyifei1999_: Force deletion of grid job 5221417 (tools.giftbot sga), host tools-exec-1414 not responding

2018-05-12

  • 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop | T194343

2018-05-11

  • 14:34 andrewbogott: repooling labvirt1001 tools instances
  • 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for T194258: tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407

2018-05-10

  • 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update

2018-05-09

  • 21:11 Reedy: Added Tim Starling as member/admin

2018-05-07

  • 21:02 zhuyifei1999_: re-building all docker images T190893
  • 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 T190893
  • 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours

2018-05-05

  • 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing

2018-05-03

  • 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package T192566

2018-05-01

  • 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)

2018-04-27

  • 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
  • 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker

2018-04-23

  • 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools T192732

2018-04-22

  • 13:07 bd808: Kill orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -E " 1 " | grep php-cgi | xargs sudo kill -9'`

2018-04-15

  • 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] T192224
  • 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci T192224

2018-04-11

  • 13:25 chasemp: cleaned up frozen exim messages in an effort to alleviate queue pressure

2018-04-06

  • 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
  • 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to T159254
  • 11:23 arturo: manually upgrade apache2 on tools-puppetmaster for T159254

2018-04-05

  • 18:46 chicocvenancio: killed wget that was hogging io

2018-03-29

  • 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
  • 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done

2018-03-28

  • 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid

2018-03-26

  • 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'

2018-03-22

  • 22:04 bd808: Forced puppet run on tools-proxy-02 for T130748
  • 21:52 bd808: Forced puppet run on tools-proxy-01 for T130748
  • 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
  • 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'

2018-03-21

  • 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
  • 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid (T190185)

2018-03-20

  • 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) T189018 T190126

2018-03-19

  • 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools

2018-03-16

  • 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
  • 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp

2018-03-15

  • 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot T185624

2018-03-14

  • 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 (T181531)
  • 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 (T181531)
  • 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 (T181531)
  • 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
  • 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
  • 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full

2018-03-12

  • 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
  • 17:13 arturo: T188994 upgrading packages from `stable`
  • 16:53 arturo: T188994 upgrading packages from stretch-wikimedia
  • 16:33 arturo: T188994 upgrading packages form jessie-wikimedia
  • 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 5f3561e T189430
  • 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
  • 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
  • 13:19 arturo: T188994 upgrade packages from jessie-backports in all jessie servers
  • 12:49 arturo: T188994 upgrade packages from trusty-updates in all ubuntu servers
  • 12:34 arturo: T188994 upgrade packages from trusty-wikimedia in all ubuntu servers

2018-03-08

  • 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
  • 14:02 arturo: T188994 upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server

2018-03-06

  • 16:15 madhuvishy: Reboot tools-docker-registry-02 T189018
  • 15:50 madhuvishy: Rebooting tools-worker-1011
  • 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
  • 15:03 arturo: drain and reboot tools-worker-1011
  • 15:03 chasemp: rebooted tools-worker 1001-1008
  • 14:58 arturo: drain and reboot tools-worker-1010
  • 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
  • 14:27 chasemp: reboot tools-worker-100[12]
  • 14:23 chasemp: downtime icinga alert for k8s workers ready
  • 13:21 arturo: T188994 in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
  • 12:58 arturo: T188994 upgrading packages in jessie nodes from the oldstable source
  • 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
  • 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did this on canary servers last week and it went fine, so now running fleet-wide
  • 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic (T188911)
  • 11:33 arturo: removing unused kernel packages in ubuntu nodes
  • 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster

2018-03-05

  • 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
  • 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb T167026 T181492
  • 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for T188911
  • 14:01 arturo: deleting old kernel packages in jessie instances for T188911
  • 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
  • 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for T187193
  • 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for T187193

2018-03-02

  • 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon

2018-02-27

  • 17:37 chasemp: add chico as admin to toolsbeta
  • 12:23 arturo: running `apt-get autoclean` in canary servers
  • 12:16 arturo: running `apt-get autoremove` in canary servers

2018-02-26

  • 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
  • 10:35 arturo: enable puppet in tools-proxy-01
  • 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests

2018-02-25

  • 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals

2018-02-23

  • 19:11 arturo: enable puppet in tools-proxy-01
  • 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
  • 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
  • 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded

2018-02-22

  • 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server

2018-02-21

  • 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
  • 18:15 arturo: puppet should be fine across the fleet
  • 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
  • 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
  • 16:59 arturo: puppet is broken across the cluster due to last change
  • 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
  • 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
  • 11:43 arturo: package upgrades in tools-webgrid-lighttpd-1401
  • 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
  • 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tools-logs-02
  • 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
  • 09:18 chicocvenancio: killed io intensive tool job in bastion
  • 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, leaks the creds of /data/project/strephit/.elasticsearch.ini Might need to cycle it as well...

2018-02-20

  • 12:42 arturo: upgrading tools-flannel-etcd-01
  • 12:42 arturo: upgrading tools-k8s-etcd-01

2018-02-19

  • 19:13 arturo: upgrade all packages of tools-services-01
  • 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
  • 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
  • 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration

2018-02-16

  • 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
  • 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
  • 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
  • 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
  • 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
  • 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
  • 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
  • 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y

2018-02-15

  • 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for T187435
  • 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
  • 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
  • 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
  • 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
  • 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
  • 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia

2018-02-14

  • 13:09 arturo: the reboot was OK, the server seems working and kubectl sees all the pods running in the deployment (T187315)
  • 13:04 arturo: reboot tools-paws-master-01 for T187315

2018-02-11

  • 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
  • 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775

2018-02-09

  • 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ T179343 T182562 T186846
  • 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
  • 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
  • 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
  • 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that were running on tools-webgrid-lighttpd-1409
  • 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
  • 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 (T186830)
  • 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there

2018-02-08

  • 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
  • 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
  • 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
  • 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
  • 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
  • 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
  • 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
  • 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
  • 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
  • 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
  • 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
  • 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
  • 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
  • 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.

2018-02-06

  • 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
  • 13:05 arturo: unpublish/publish trusty-tools repo
  • 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for T186539 after adding it to trusty-tools repo (self contained)
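
The unpublish/publish cycle used here, sketched with aptly's publish commands (distribution name from these entries; the default publish prefix is an assumption):

 aptly publish drop trusty-tools                              # unpublish the repo
 aptly publish repo -distribution=trusty-tools trusty-tools   # publish it again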

2018-02-05

  • 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address T186539
  • 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
  • 13:06 arturo: deploying fix for T186230 using clush

2018-02-03

  • 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools python3 ./broken_ref_anchors.py"

2018-01-31

  • 22:54 chasemp: add bstorm to sudoers as root

2018-01-29

  • 20:02 chasemp: add zhuyifei1999_ tools root for T185577
  • 20:01 chasemp: blast a puppet run to see if any errors are persistent

2018-01-28

  • 22:49 chicocvenancio: killed compromised session generating miner processes
  • 22:48 chicocvenancio: killed miner processes in tools-bastion-03

2018-01-27

  • 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
  • 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive

2018-01-25

  • 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing by hand configtimeout with http_configtimeout in /etc/puppet/puppet.conf
  • 23:20 arturo: T179386 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
  • 05:25 arturo: deploying misctools and jobutils 1.29 for T179386

2018-01-23

  • 19:41 madhuvishy: Add bstorm to project admins
  • 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
  • 14:17 chasemp: add me, arturo, chico to sudoers and removed marc

2018-01-22

  • 18:32 arturo: T181948 T185314 deploying jobutils and misctools v1.28 in the cluster
  • 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
  • 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how is the cluster with puppet
  • 10:18 arturo: T181948 deploy misctools 1.27 in the cluster

2018-01-19

  • 17:32 arturo: T185314 deploying new version of jobutils 1.27
  • 12:56 arturo: the puppet status across the fleet seems good, only minor things like T185314 , T179388 and T179386
  • 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'

2018-01-18

  • 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to T182781)
  • 15:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
  • 13:52 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter | grep lsbdistcodename | grep trusty && sudo apt-upgrade trusty-wikimedia -v'
  • 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
  • 12:24 arturo: T178717 aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
  • 12:11 arturo: T178717 aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
  • 11:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'

2018-01-17

  • 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions | grep upgradeable | grep trusty-wikimedia' | tee pending-upgrades-report-trusty-wikimedia.txt
  • 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' | tee pending-upgrades-report.txt
  • 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
  • 15:15 andrewbogott: repooled tools-exec-1430 via exec-manage.
  • 15:04 andrewbogott: depooled tools-exec-1430 via exec-manage. Experimenting with purge-old-kernels
  • 14:09 arturo: T181647 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'

2018-01-16

  • 22:01 chasemp: qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//' | xargs qmod -cq
  • 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
  • 21:24 andrewbogott: repooled tools-exec-1420 and tools-webgrid-lighttpd-1417
  • 21:14 andrewbogott: depooling tools-exec-1420 and tools-webgrid-lighttpd-1417
  • 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
  • 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
  • 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
  • 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
  • 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412 and tools-exec-1423 for host reboot
  • 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413 tools-exec-1442 for host reboot
  • 18:50 andrewbogott: switched active proxy back to tools-proxy-02
  • 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
  • 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
  • 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
  • 18:26 andrewbogott: repooling tools-exec-1404 and 1434 for host reboot
  • 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
  • 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
  • 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
  • 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
  • 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
  • 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
  • 13:35 chasemp: tools-mail almouked@ltnet.net 719 pending messages cleared

2018-01-11

  • 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
  • 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
  • 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
  • 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
  • 19:00 chasemp: reboot tools-worker-1015
  • 15:08 chasemp: reboot tools-exec-1405
  • 15:06 chasemp: reboot tools-exec-1404
  • 15:06 chasemp: reboot tools-exec-1403
  • 15:02 chasemp: reboot tools-exec-1402
  • 14:57 chasemp: reboot tools-exec-1401 again...
  • 14:53 chasemp: reboot tools-exec-1401
  • 14:46 chasemp: install Meltdown-patched kernel and reboot workers 1011-1016 as jessie pilot

2018-01-10

  • 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
  • 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
  • 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
  • 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
  • 13:57 arturo: T184604 cleaned stale log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand on tools-worker-1020.tools.eqiad.wmflabs
  • 13:46 arturo: T184604 aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
  • 13:45 arturo: T184604 aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
  • 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
  • 13:22 arturo: emptied syslog and daemon.log by hand; they were too big for logrotate to handle (see the cleanup sketch after this list)
  • 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
  • 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for T184604
  • 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened T184604
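
When /var/log fills up like this, truncating the oversized files in place is safer than deleting them: the daemons writing to them would otherwise keep the unlinked files open and the space would never be freed. A hypothetical cleanup sequence for a worker in this state:

    # Truncate the runaway logs in place, then force a logrotate pass
    sudo truncate -s 0 /var/log/syslog /var/log/daemon.log
    sudo logrotate -f /etc/logrotate.conf
    df -h /var/log    # confirm the space was actually reclaimed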

2018-01-09

  • 23:21 yuvipanda: new paws cluster master is up; re-adding nodes with the same sequence of commands used for the upgrade
  • 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroyed the entire cluster again and installed 1.9.1 (see the kubeadm sketch after this list)
  • 23:01 yuvipanda: kill paws master and reboot it
  • 22:54 yuvipanda: kill all kube-system pods in paws cluster
  • 22:54 yuvipanda: kill all PAWS pods
  • 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
  • 22:49 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
  • 22:48 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash' to set up kubeadm on all paws worker nodes
  • 22:46 yuvipanda: reboot all paws-worker nodes
  • 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
  • 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
  • 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
  • 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
  • 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
  • 20:55 chasemp: for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016`; do kubectl cordon $n; done
  • 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
  • 20:15 chasemp: disable puppet on proxies and k8s workers
  • 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
  • 19:42 chasemp: reboot tools-worker-1010
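
The kubeadm-bootstrap scripts themselves are not quoted in the log, but a kubeadm tear-down-and-rebuild of this era looks roughly like the outline below; the version, token, and master address are assumptions:

    # On every node: remove the old cluster state
    sudo kubeadm reset
    # On the master: initialize the new control plane
    sudo kubeadm init --kubernetes-version v1.9.1
    # On each worker, using the token and hash printed by `kubeadm init`;
    # <token>, <master>, and <hash> are placeholders
    sudo kubeadm join --token <token> <master>:6443 \
        --discovery-token-ca-cert-hash sha256:<hash>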

2018-01-08

  • 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
  • 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02 (see the restart-order note after this list)
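
The chain restarts the node agents in rough dependency order: the container runtime first, then the flannel overlay network, then kube-proxy on top of both. As logged for tools-proxy-02; on a worker, a `sudo service kubelet restart` would sit between the last two steps:

    # Restart order as logged on tools-proxy-02
    sudo service docker restart
    sudo service flannel restart
    sudo service kube-proxy restart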

2018-01-06

  • 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'` (see the FORWARD-policy sketch after this list)
  • 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)
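
Aggregating the FORWARD chain across the fleet points at a failure mode common at the time, though the log does not confirm it was the root cause here: Docker 1.13+ sets the iptables FORWARD policy to DROP, which silently breaks pod-to-pod traffic, including cluster DNS, on flannel clusters. A hypothetical check-and-fix with clush:

    # Show each node's FORWARD policy; clush -b merges identical output
    clush -w @paws-worker -b 'sudo iptables -L FORWARD -n | head -1'
    # If the policy is DROP, flip it back on the affected nodes
    clush -w @paws-worker 'sudo iptables -P FORWARD ACCEPT'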

2018-01-05

  • 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
  • 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
  • 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
  • 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing; see the live-migration sketch after this list)
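
The CPU-balancing entries are OpenStack live migrations of the instances between labvirt hypervisors. With the client syntax of the time this looked roughly like the sketch below; the flags and credential handling are assumptions:

    # Hypothetical admin-side live migration of one instance
    openstack server migrate --live labvirt1017 tools-worker-1012
    # Poll until the instance reports its new hypervisor
    openstack server show tools-worker-1012 -c status -c OS-EXT-SRV-ATTR:host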

2018-01-04

  • 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of T184018

2018-01-03