Nova Resource:Tools/SAL/Archive 4

2021-12-31

  • 19:48 taavi: reset grid error status on webgrid-lighttpd@tools-sgewebgrid-lighttpd-0915
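
Clearing a grid error state like this is done with the gridengine admin tools on the grid master. A minimal sketch, assuming admin rights there (queue and host names taken from the entry above):

    sudo qstat -f -explain E                                          # list queue instances in error state and why
    sudo qmod -c 'webgrid-lighttpd@tools-sgewebgrid-lighttpd-0915'    # clear the error state on that queue instance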

2021-12-28

  • 20:31 taavi: restarting acme-chief to debug T298353

2021-12-24

  • 07:58 majavah: cleared error state from 4 webgrid-lighttpd nodes

2021-12-23

  • 22:57 bd808: Marked tool stang for deletion (T296496)
  • 22:57 bd808: Marked tool wplist for deletion (T295523)
  • 22:56 bd808: Marked tool antigng for deletion (T294708)
  • 22:55 bd808: Marked tool ytrb for deletion (T291909)
  • 22:54 bd808: Marked tool geolink for deletion (T291801)
  • 22:54 bd808: Marked tool wmf-task-samtar for deletion (T286622)
  • 22:53 bd808: Marked tool coi for deletion (T286619)
  • 22:52 bd808: Marked tool abusereport for deletion (T286618)
  • 22:51 bd808: Marked tool chi for deletion (T282702)
  • 22:43 bd808: Marked tool algo-news for deletion (T280444)
  • 22:43 bd808: Marked tool ircclient for deletion (T279209)
  • 22:42 bd808: Marked tool vagrant-test for deletion (T279209)
  • 22:42 bd808: Marked tool vagrant2 for deletion (T279209)
  • 22:42 bd808: Marked tool testwiki for deletion (T279209)
  • 22:41 bd808: Marked tool zoranzoki21wiki for deletion (T279209)
  • 22:41 bd808: Marked tool zoranzoki21bot for deletion (T279209)
  • 22:40 bd808: Marked tool filesearch for deletion (T279209)
  • 22:40 bd808: Marked tool sourceror for deletion (T275690)
  • 22:39 bd808: Marked tool move for deletion (T270535)
  • 22:38 bd808: Marked tool hastagwatcher for deletion (T270534)
  • 22:37 bd808: Marked tool outreacy-wikicv for deletion (T270532)
  • 22:36 bd808: Marked tool dawiki for deletion (T270105)
  • 22:33 bd808: Marked tool rubinbot3 for deletion (T266963)
  • 22:32 bd808: Marked tool rubinbot2 for deletion (T266963)
  • 22:32 bd808: Marked tool rubinbot for deletion (T266963)
  • 22:31 bd808: Marked tool google-drive-photos-to-commons for deletion (T259870)
  • 22:30 bd808: Marked tool wdqs-wmil-tutorial for deletion (T258394)
  • 22:29 bd808: Marked tool base-encode for deletion (T258340)
  • 22:28 bd808: Marked tool wikidata-exports for deletion (T255192)
  • 22:27 bd808: Marked tool oar for deletion (T254044)
  • 22:27 bd808: Marked tool wmde-uca-test for deletion (T249089)
  • 22:26 bd808: Marked tool fastilybot for deletion (T248248)
  • 22:25 bd808: Marked tool mtc-rest for deletion (T248247)
  • 22:24 bd808: Marked tool squirrelnest-upf for deletion (T248235)
  • 22:23 bd808: Marked tool wikibase-databridge-storybook for deletion (T245026)
  • 22:22 bd808: Marked tool draft-uncategorize-script for deletion (T236646)
  • 22:21 bd808: Marked tool maplink-generator for deletion (T231766)
  • 22:20 bd808: Marked tool rhinosf1-afdclose for deletion (T225838)
  • 22:18 bd808: Marked tool asdf for deletion (T223699)
  • 22:17 bd808: Marked tool basyounybot for deletion (T218524)
  • 22:14 bd808: Marked tool design-research-methods for deletion (T218523)
  • 22:12 bd808: Marked tool he-wiktionary-rule-checker for deletion (T218500)
  • 22:11 bd808: Marked tool outofband for deletion (T218382)
  • 22:10 bd808: Marked tool sync-badges for deletion (T218187)
  • 22:09 bd808: Marked tool grafana-json-datasource for deletion (T218075)
  • 22:08 bd808: Marked tool gsociftttdev for deletion (T217478)
  • 22:04 bd808: Marked tool wikipagestats for deletion (T216970)
  • 22:02 bd808: Marked tool bd808-test4 for deletion (T216440)
  • 22:02 bd808: Marked tool bd808-test3 for deletion (T216439)
  • 22:01 bd808: Marked tool tei2wikitext for deletion (T216427)
  • 22:00 bd808: Marked tool projetpp for deletion (T216427)
  • 22:00 bd808: Marked tool ppp-sparql for deletion (T216427)
  • 21:59 bd808: Marked tool platypus-qa for deletion (T216427)
  • 21:59 bd808: Marked tool creatorlinks for deletion (T216427)
  • 21:58 bd808: Marked tool corenlp for deletion (T216427)
  • 21:57 bd808: Marked tool strikertest2017-08-23 for deletion (T216211)
  • 21:46 bd808: Marked tool languagetool for deletion (T215734)
  • 21:45 bd808: Marked tool gdk-artists-research for deletion (T214495)
  • 21:44 bd808: Marked tool phragile for deletion (T214495)
  • 21:44 bd808: Marked tool commons-mass-upload for deletion (T214495)
  • 21:43 bd808: Marked tool wmde-uca-test for deletion (T214495)
  • 21:43 bd808: Marked tool wmde-editconflict-test for deletion (T214495)
  • 21:42 bd808: Marked tool wmde-inline-movedparagraphs for deletion (T214495)
  • 21:41 bd808: Marked tool prometheus for deletion (T211972)
  • 21:40 bd808: Marked tool quentinv57-tools for deletion (T210829)
  • 21:38 bd808: Marked tool addbot for deletion (T208427)
  • 21:37 bd808: Marked tool addshore-dev for deletion (T208427)
  • 21:37 bd808: Marked tool addshore for deletion (T208427)
  • 21:36 bd808: Marked tool miraheze-notifico for deletion (T203124)
  • 21:34 bd808: Marked tool mh-signbot for deletion (T202946)
  • 21:33 bd808: Marked tool messenger-chatbot for deletion (T198808)
  • 21:22 bd808: Marked tool harvesting-data-rafinery for deletion (T197214)
  • 21:21 bd808: Marked tool miraheze-discord-irc for deletion (T192410)
  • 21:20 bd808: Marked tool sau226-wiki-bug-testing for deletion (T188608)
  • 21:18 bd808: Marked tool kmlexport-cswiki for deletion (T186916)
  • 21:17 bd808: Marked tool www-portal-builder for deletion (T182140)
  • 21:15 bd808: Marked tool recoin-sample for deletion (T181541)
  • 21:13 bd808: Marked tool wlm-jury-yarl for deletion (T172590)
  • 21:12 bd808: Marked tool wlm-jury-at for deletion (T172590)
  • 19:43 bd808: Marked tool yunomi for deletion (T170070)
  • 19:42 bd808: Marked tool datbotcommons for deletion (T164662)
  • 19:40 bd808: Marked tool ut-iw-bot for deletion (T158303)
  • 19:39 bd808: Marked tool hujibot for deletion (T157916)
  • 19:37 bd808: Marked tool contributions-summary for deletion (T157749)
  • 19:35 bd808: Marked tool morebots for deletion (T157399)
  • 19:32 bd808: Marked tool rcm for deletion (T136216)

2021-12-20

  • 18:01 majavah: deploying calico v3.21.0 (T292698)
  • 12:17 arturo: running `aborrero@tools-sgegrid-master:~$ sudo grid-configurator --all-domains` after merging a few patches to the script to handle dead config

2021-12-14

  • 09:46 majavah: testing delete-crashing-pods emailer component with a test tool T292925

2021-12-08

  • 05:21 andrewbogott: moving tools-k8s-etcd-13 to cloudvirt1028

2021-12-07

  • 11:11 arturo: updated member roles in github.com/toolforge: remove brooke as owner, add dcaro

2021-12-06

  • 13:23 majavah: root@toolserver-proxy-01:~# systemctl restart apache2.service # working around T293826

2021-12-04

  • 12:18 majavah: deploying delete-crashing-pods in dry run mode T292925

2021-11-28

  • 17:46 andrewbogott: moving tools-k8s-etcd-13 to cloudvirt1020; cloudvirt1018 (its old host) has a degraded raid which is affecting performance

2021-11-19

  • 13:16 majavah: manually add 3 project members after ldap issues were fixed

2021-11-16

  • 12:31 majavah: uploading calico 3.21.0 to the internal docker registry T292698
  • 10:28 majavah: deploying maintain-kubeusers changes T286857

2021-11-11

  • 10:50 arturo: add user `srv-networktests` as project user (T294955)

2021-11-05

  • 19:18 majavah: deploying registry-admission changes

2021-10-29

  • 23:58 andrewbogott: deleting all files older than 14 days in /srv/tools/shared/tools/project/.shared/cache

2021-10-28

  • 12:42 arturo: set `allow-snippet-annotations: "false"` for ingress-nginx (T294330)

2021-10-26

  • 18:00 majavah: deleting legacy ingresses for tools.wmflabs.org urls
  • 12:26 majavah: deploy ingress-admission updates
  • 12:11 majavah: deploy ingress-nginx v1.0.4 / chart v4.0.6 on toolforge T292771

2021-10-25

  • 14:33 majavah: copy nginx-ingress controller v1.0.4 to internal registry T292771
  • 11:32 majavah: depool tools-sgeexec-0910 T294228
  • 11:13 majavah: removed tons of duplicate qw jobs across multiple tools

2021-10-22

  • 15:35 majavah: remove "^tools-k8s-master-[0-9]+\.tools\.eqiad\.wmflabs$" from authorized_regexes for the main certificate
  • 15:35 majavah: add mail.tools.wmcloud.org to the tools mail tls certificate alternative names

2021-10-21

  • 09:48 majavah: deploying toolforge-webservice 0.79

2021-10-20

  • 15:41 majavah: removing toollabs-webservice from grid exec and master nodes where it's not needed and not managed by puppet
  • 12:51 majavah: rolling out toolforge-webservice 0.78 T292706 T282975 T276626

2021-10-15

  • 15:01 arturo: add updated ingress-nginx docker image in the registry (v1.0.1) for T293472

2021-10-07

  • 09:13 majavah: disabling settings api, now that all pod presets are gone T279106
  • 08:00 majavah: removing all pod presets T279106
  • 05:44 majavah: deploying fix for T292672

2021-10-06

  • 06:46 majavah: taavi@toolserver-proxy-01:~$ sudo systemctl restart apache2.service # see if it helps with toolserver.org ssl alerts

2021-10-03

  • 21:31 bstorm: rebuilding buster containers since they are also affected T291387 T292355
  • 21:29 bstorm: rebuilt stretch containers for potential issues with LE cert updates T291387

2021-10-01

  • 21:59 bd808: clush -w @all -b 'sudo sed -i "s#mozilla/DST_Root_CA_X3.crt#!mozilla/DST_Root_CA_X3.crt#" /etc/ca-certificates.conf && sudo update-ca-certificates' for T292289

2021-09-30

  • 13:43 majavah: cleaning up unused kubernetes ingress objects for tools.wmflabs.org urls T292105

2021-09-29

  • 22:39 bstorm: finished deploy of the toollabs-webservice 0.77 and updating labels across the k8s cluster to match
  • 22:26 bstorm: pushing toollabs-webservice 0.77 to tools releases
  • 21:46 bstorm: pushing toollabs-webservice 0.77 to toolsbeta

2021-09-27

  • 16:19 majavah: deploy volume-admission fix for containers for some volumes mounted
  • 13:01 majavah: publish jobutils and misctools 0.43 T286072
  • 11:34 majavah: disabling pod preset controller T279106

2021-09-23

  • 17:20 majavah: deploying new maintain-kubeusers for lack of podpresets T279106

2021-09-22

  • 18:06 bstorm: launching tools-nfs-test-client-01 to run a "fair" test battery against T291406
  • 11:37 dcaro: controlled undrain tools-k8s-worker-53 (T291546)
  • 08:57 majavah: drain tools-k8s-worker-53
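
Draining and undraining a Kubernetes worker, as in the two entries above, is normally a pair of kubectl commands run from a control node. A minimal sketch (node name from the entries):

    kubectl drain tools-k8s-worker-53 --ignore-daemonsets   # evict pods and cordon the node
    # ...perform the maintenance...
    kubectl uncordon tools-k8s-worker-53                    # allow scheduling again ("undrain")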

2021-09-20

  • 12:44 majavah: deploying volume-admission to tools, should not affect anything yet T279106

2021-09-15

  • 08:08 majavah: update tools-manifest to 0.24

2021-09-14

  • 10:36 arturo: add toolforge-jobs-framework-cli v5 to aptly buster-tools/toolsbeta

2021-09-13

  • 08:57 arturo: cleared grid queues error states (T290844)
  • 08:55 arturo: repooling sgeexec-0907 (T290798)
  • 08:14 arturo: rebooting sgeexec-0907 (T290798)
  • 08:12 arturo: depool sgeexec-0907 (T290798)
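
The depool/reboot/repool cycle above can be expressed with plain gridengine commands; Toolforge also has wrapper tooling for this, so treat the following as an illustrative sketch only:

    sudo qmod -d '*@tools-sgeexec-0907'   # depool: disable every queue instance on the node
    sudo reboot                           # on the node itself, once running jobs have drained
    sudo qmod -e '*@tools-sgeexec-0907'   # repool: re-enable the queues
    sudo qmod -c '*@tools-sgeexec-0907'   # clear any leftover error states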

2021-09-11

  • 08:51 majavah: depool tools-sgeexec-0907

2021-09-10

  • 23:26 bstorm: cleared error state for tools-sgeexec-0907.tools.eqiad.wmflabs
  • 12:00 arturo: shutdown tools-package-builder-03 (buster), leave -04 online (bullseye)
  • 09:35 arturo: live-hacking tools puppetmaster with a couple of ops/puppet changes
  • 07:54 arturo: created bullseye VM tools-package-builder-04 (T273942)

2021-09-09

  • 16:20 arturo: 70017ec0ac root@tools-k8s-control-3:~# kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml

2021-09-07

  • 15:27 majavah: rolling out python3-prometheus-client updates
  • 14:41 majavah: manually removing some absented but still present crontabs to stop root@ spam

2021-09-06

  • 16:31 arturo: deploying jobs-framework-cli v4
  • 16:22 arturo: deploying jobs-framework-api 3228d97

2021-09-03

  • 22:36 bstorm: backfilling quotas in screen for T286784
  • 12:49 majavah: deploying new tools-manifest version

2021-09-02

  • 01:02 bstorm: deployed new version of maintain-kubeusers with new count quotas for new tools T286784

2021-08-20

  • 19:10 majavah: rebuilding node12-sssd/{base,web} to use debian packaged npm 7
  • 18:42 majavah: rebuilding php74-sssd/{base,web} to use composer 2

2021-08-18

  • 21:32 bstorm: rebooted tools-sgecron-01 due to RAM filling up and killing everything
  • 16:34 bstorm: deleting the sssd cache on tools-sgecron-01 to fix a peculiar passwd db issue

2021-08-16

  • 17:00 majavah: remove and re-add toollabs-webservice 0.75 on stretch-toolsbeta repository
  • 15:45 majavah: reset sul account mapping on striker for developer account "DutchTom" T288969
  • 14:19 majavah: building node12 images - T284590 T243159

2021-08-15

  • 17:30 majavah: deploying updated jobs-framework-api container list to include bullseye images
  • 17:22 majavah: finished initial build of images: php74, jdk17, python39, ruby27 - T284590
  • 16:51 majavah: starting build of initial bullseye based images - T284590
  • 16:44 majavah: tagged and building toollabs-webservice 0.76 with bullseye images defined T284590
  • 15:14 majavah: building tools-webservice 0.74 (currently live version) to bullseye-tools and bullseye-toolsbeta

2021-08-12

  • 16:59 bstorm: deployed updated manifest for ingress-admission
  • 16:45 bstorm: restarted ingress admission pods in tools after testing in toolsbeta
  • 16:27 bstorm: updated the docker image for docker-registry.tools.wmflabs.org/ingress-admission:latest
  • 16:22 bstorm: rebooting tools-docker-registry-05 after exchanging uids for puppet and docker-registry

2021-08-07

  • 05:59 majavah: restart nginx on toolserver-proxy-01 to see if that helps with the flapping icinga certificate expiry check

2021-08-06

  • 16:17 bstorm: failed over to tools-docker-registry-06 (which has more space) T288229
  • 00:43 bstorm: set up sync between the new registry host and the existing one T288229
  • 00:21 bstorm: provisioning second docker registry server to rsync to (120GB disk and fairly large server) T288229

2021-08-05

  • 23:50 bstorm: rebooting the docker registry T288229
  • 23:04 bstorm: extended docker registry volume to 120GB T288229

2021-07-29

  • 18:04 majavah: reset sul account mapping on striker for developer account "Derek Zax" T287369

2021-07-28

  • 21:33 majavah: add mdipietro as projectadmin and to sudo policy T287287

2021-07-27

  • 16:20 bstorm: built new php images with python2 on board T287421
  • 00:04 bstorm: deploy a version of the php7.3 web image that includes the python2 package with tag :testing T287421

2021-07-26

  • 17:37 bstorm: repooled the whole set of ingress workers after upgrades T280340
  • 16:37 bstorm: removing tools-k8s-ingress-4 from active ingress nodes at the proxy T280340

2021-07-23

  • 07:15 majavah: restart nginx on tools-static-14 to see if it helps with fontcdn issues

2021-07-22

  • 23:35 bstorm: deleted tools-sgebastion-09 since it has been shut off since March anyway
  • 15:32 arturo: re-deploying toolforge-jobs-framework-api
  • 15:30 arturo: pushed new docker image on the registry for toolforge-jobs-framework-api 4d8235b (T287077)

2021-07-21

  • 20:01 bstorm: deployed new maintain-kubeusers to toolforge T285011
  • 19:55 bstorm: deployed new rbac for maintain-kubeusers changes T285011
  • 17:10 majavah: deploying calico v3.18.4 T280342
  • 14:35 majavah: updating systemd on toolforge stretch bastions T287036
  • 11:59 arturo: deploying jobs-framework-api 07346d7 (T286108)
  • 11:04 arturo: enabling TTLAfterFinished feature gate on kubeadm live configmap (T286108)
  • 11:01 arturo: enabling TTLAfterFinished feature gate on static pod manifests on /etc/kubernetes/manifests/kube-{apiserver,controller-manager}.yaml in all 3 control nodes (T286108)

2021-07-20

  • 18:42 majavah: deploying systemd security updates on toolforge public stretch machines T287004
  • 17:45 arturo: pushed new toolforge-jobs-framework-api docker image into the registry (3a6ae38) (T286126)
  • 17:37 arturo: added toolforge-jobs-framework-cli v3 to aptly buster-tools and buster-toolsbeta
  • 13:25 majavah: apply buster systemd security updates

2021-07-19

  • 23:24 bstorm: applied matchPolicy: equivalent to tools ingress validation controller T280360
  • 16:43 bstorm: cleared queue error state caused by excessive resource use by topicmatcher T282474

2021-07-16

  • 14:04 arturo: deployed jobs-framework-api 42b7a88 (T286132)
  • 11:57 arturo: added toollabs-webservice_0.75_all to jessie-tools aptly repo (T286003)
  • 11:52 arturo: created `jessie-tools` aptly repository on tools-services-05 (T286003)
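
A minimal sketch of the aptly steps above, assuming a shell on tools-services-05 (package filename from the 11:57 entry; the `--skip-signing` form mirrors the invocation used elsewhere in this log):

    sudo aptly repo create jessie-tools                                  # create the new repository
    sudo aptly repo add jessie-tools toollabs-webservice_0.75_all.deb    # add the package to it
    sudo aptly publish --skip-signing repo jessie-tools                  # publish so clients can install from it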

2021-07-14

  • 23:29 bstorm: mounted nfs on tools-services-05 and backing up aptly to NFS dir T286003
  • 09:17 majavah: copying calico 3.18.4 images from docker hub to docker-registry.tools.wmflabs.org T280342
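
Copying an upstream image into the internal registry, as in the calico entry above, is typically a pull/tag/push sequence. A minimal sketch; the exact source and destination image paths are illustrative:

    docker pull docker.io/calico/node:v3.18.4
    docker tag docker.io/calico/node:v3.18.4 docker-registry.tools.wmflabs.org/calico/node:v3.18.4
    docker push docker-registry.tools.wmflabs.org/calico/node:v3.18.4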

2021-07-12

  • 16:56 bstorm: deleted job 4720371 due to LDAP failure
  • 16:51 bstorm: cleared the E state from two job queues

2021-07-02

  • 18:46 bstorm: cleared error state for tools-sgeexec-0940.tools.eqiad.wmflabs

2021-07-01

  • 22:08 bstorm: releasing webservice 0.75
  • 17:03 andrewbogott: rebooting tools-k8s-worker-[31,33,35,44,49,51,57-58,70].tools.eqiad1.wikimedia.cloud
  • 16:47 bstorm: remounted scratch everywhere...but mostly tools T224747
  • 15:47 arturo: rebased labs/private.git
  • 11:04 arturo: added toolforge-jobs-framework-cli_1_all.deb to aptly buster-tools,buster-toolsbeta
  • 10:34 arturo: refreshed jobs-api deployment

2021-06-29

  • 21:58 bstorm: clearing one errored queue and a stack of discarded jobs
  • 20:11 majavah: toolforge kubernetes upgrade complete T280299
  • 17:03 majavah: starting toolforge kubernetes 1.18 upgrade - T280299
  • 16:17 arturo: deployed jobs-framework-api in the k8s cluster
  • 15:34 majavah: remove duplicate definitions from tools-clushmaster-02 /root/.ssh/known_hosts
  • 15:12 arturo: livehacking puppetmaster for T283238
  • 10:24 dcaro: running puppet on the buster bastions after 20000 minutes failing... might break something

2021-06-15

  • 19:02 bstorm: cleared error status from a few queues
  • 16:15 majavah: deleting unused shutdown nodes: tools-checker-03 tools-k8s-haproxy-1 tools-k8s-haproxy-2

2021-06-14

  • 22:21 bstorm: push docker-registry.tools.wmflabs.org/toolforge-python37-sssd-web:testing to test staged os.execv (and other patches) using toolsbeta toollabs-webservice version 0.75 T282975

2021-06-13

  • 08:15 majavah: clear grid error state from tools-sgeexec-0907, tools-sgeexec-0916, tools-sgeexec-0940

2021-06-12

  • 14:39 majavah: remove nonexistent tools-prometheus-04 and add tools-prometheus-05 to hiera key "prometheus_nodes"
  • 13:53 majavah: create empty bullseye-{tools,toolsbeta} repositories on tools-services-05 aptly

2021-06-10

  • 17:38 majavah: clear error state from tools-sgeexec-0907, task@tools-sgeexec-0939

2021-06-09

  • 13:57 majavah: clear error state from exec nodes tools-sgeexec-0913, tools-sgeexec-0936, task@tools-sgeexec-0940

2021-06-04

  • 21:30 bstorm: deleting "tools-k8s-ingress-3", "tools-k8s-ingress-2", "tools-k8s-ingress-1" T264221
  • 21:21 bstorm: cleared error state from 4 grid queues

2021-06-03

  • 18:27 majavah: renew prometheus kubernetes certificate T280301
  • 17:06 majavah: renew admission webhook certificates T280301

2021-06-01

  • 10:10 majavah: properly clean up deleted vms tools-k8s-haproxy-[1,2], tools-checker-03 from puppet after using the wrong fqdn the first time
  • 09:54 majavah: clear error state from tools-sgeexec-0913, tools-sgeexec-0950

2021-05-30

  • 18:58 majavah: clear grid error state from 14 queues

2021-05-27

  • 18:03 bstorm: adjusted profile::wmcs::kubeadm::etcd_latency_ms from 30 back to the default (10)
  • 16:04 bstorm: cleared error state from several exec node queues
  • 14:49 andrewbogott: swapping in three new etcd nodes with local storage: tools-k8s-etcd-13,14,15

2021-05-24

  • 10:36 arturo: rebased labs/private.git after merge conflict
  • 06:49 majavah: remove scfc kubernetes admin access after bd808 removed tools.admin membership to avoid maintain-kubeusers crashes when it expires

2021-05-22

  • 14:47 majavah: manually remove jeh admin certificates and the corresponding entry from the maintain-kubeusers configmap T282725
  • 14:32 majavah: manually remove valhallasw and yuvipanda admin certificates and their entries from the configmap, then restart the maintain-kubeusers pod T282725
  • 02:51 bd808: Restarted nginx on tools-static-14 to see if that clears up the fontcdn 502 errors

2021-05-21

  • 17:06 majavah: unpool tools-k8s-ingress-[4-6]
  • 17:06 majavah: repool tools-k8s-ingress-6
  • 17:02 majavah: repool tools-k8s-ingress-4 and -5
  • 16:59 bstorm: upgrading the ingress-gen2 controllers to release 3 to capture new RAM/CPU limits
  • 16:43 bstorm: resize tools-k8s-ingress-4 to g3.cores4.ram8.disk20
  • 16:43 bstorm: resize tools-k8s-ingress-6 to g3.cores4.ram8.disk20
  • 16:40 bstorm: resize tools-k8s-ingress-5 to g3.cores4.ram8.disk20
  • 16:04 majavah: rollback kubernetes ingress update from front proxy
  • 06:52 Majavah: pool tools-k8s-ingress-6 and depool ingress-[2,3] T264221

2021-05-20

  • 17:05 Majavah: pool tools-k8s-ingress-5 as an ingress node, depool ingress-1 T264221
  • 16:31 Majavah: pool tools-k8s-worker-4 as an ingress node T264221
  • 15:17 Majavah: trying to install ingress-nginx via helm again after adjusting security groups T264221
  • 15:15 Majavah: move tools-k8s-ingress-[5-6] from "tools-k8s-full-connectivity" to "tools-new-k8s-full-connectivity" security group T264221

2021-05-19

  • 12:15 Majavah: rollback ingress-nginx-gen2
  • 11:09 Majavah: deploy helm-based nginx ingress controller v0.46.0 to ingress-nginx-gen2 namespace T264221
  • 10:44 Majavah: create tools-k8s-ingress-[4-6] T264221

2021-05-16

  • 16:52 Majavah: clear error state from tools-sgeexec-0905 tools-sgeexec-0907 tools-sgeexec-0936 tools-sgeexec-0941

2021-05-14

  • 19:18 bstorm: adjusting the rate limits for bastions nfs_write upward a lot to make NFS writes faster now that the cluster is finally using 10Gb on the backend and frontend T218338
  • 16:55 andrewbogott: rebooting toolserver-proxy-01 to clear up stray files
  • 16:47 andrewbogott: deleting log files older than 14 days on toolserver-proxy-01

2021-05-12

  • 19:45 bstorm: cleared error state from some queues
  • 19:05 Majavah: remove phamhi-binding phamhi-view-binding cluster role bindings T282725
  • 19:04 bstorm: deleted the maintain-kubeusers pod to get it up and running fast T282725
  • 19:03 bstorm: deleted phamhi from admin configmap in maintain-kubeusers T282725

2021-05-11

  • 17:17 Majavah: shutdown and delete tools-checker-03 T278540
  • 17:14 Majavah: move floating ip 185.15.56.61 to tools-checker-04
  • 17:12 Majavah: add tools-checker-04 as a grid submit host T278540
  • 16:58 Majavah: add tools-checker-04 to toollabs::checker_hosts hiera key T278540
  • 16:49 Majavah: creating tools-checker-04 with buster T278540
  • 16:32 Majavah: carefully shutdown tools-k8s-haproxy-1 T252239
  • 16:29 Majavah: carefully shutdown tools-k8s-haproxy-2 T252239

2021-05-10

  • 22:58 bstorm: cleared error state on a grid queue
  • 22:58 bstorm: setting `profile::wmcs::kubeadm::docker_vol: false` on ingress nodes
  • 15:22 Majavah: change k8s.svc.tools.eqiad1.wikimedia.cloud. to point to the tools-k8s-haproxy-keepalived-vip address 172.16.6.113 (T252239)
  • 15:06 Majavah: carefully rolling out keepalived to tools-k8s-haproxy-[3-4] while making sure [1-2] do not have changes
  • 15:03 Majavah: clear all error states caused by overloaded exec nodes
  • 14:57 arturo: allow tools-k8s-haproxy-[3-4] to use the tools-k8s-haproxy-keepalived-vip address (172.16.6.113) (T252239)
  • 12:53 Majavah: creating tools-k8s-haproxy-[3-4] to rebuild current ones without nfs and with keepalived

2021-05-09

  • 06:55 Majavah: clear error state from tools-sgeexec-0916

2021-05-08

  • 10:57 Majavah: import docker image k8s.gcr.io/ingress-nginx/controller:v0.46.0 to local registry as docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0 T264221

2021-05-07

  • 18:07 Majavah: generate and add k8s haproxy keepalived password (profile::toolforge::k8s::haproxy::keepalived_password) to private puppet repo
  • 17:15 bstorm: recreated recordset of k8s.tools.eqiad1.wikimedia.cloud as CNAME to k8s.svc.tools.eqiad1.wikimedia.cloud T282227
  • 17:12 bstorm: created A record of k8s.svc.tools.eqiad1.wikimedia.cloud pointing at current cluster with TTL of 300 for quick initial failover when the new set of haproxy nodes are ready T282227
  • 09:44 arturo: `sudo wmcs-openstack --os-project-id=tools port create --network lan-flat-cloudinstances2b tools-k8s-haproxy-keepalived-vip`
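
The DNS changes above go through OpenStack Designate. A minimal sketch with the project-scoped CLI; the record names and 300s TTL come from the entries, while the target address is a placeholder:

    openstack recordset create --type A --record <haproxy-address> --ttl 300 tools.eqiad1.wikimedia.cloud. k8s.svc.tools.eqiad1.wikimedia.cloud.
    openstack recordset create --type CNAME --record k8s.svc.tools.eqiad1.wikimedia.cloud. tools.eqiad1.wikimedia.cloud. k8s.tools.eqiad1.wikimedia.cloud.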

2021-05-06

  • 14:43 Majavah: clear error states from all currently erroring exec nodes
  • 14:37 Majavah: clear error state from tools-sgeexec-0913
  • 04:35 Majavah: add own root key to project hiera on horizon T278390
  • 02:36 andrewbogott: removing jhedden from sudo roots

2021-05-05

  • 19:27 andrewbogott: adding taavi as a sudo root to project toolforge for T278390

2021-05-04

  • 15:23 arturo: upgrading exim4-daemon-heavy in tools-mail-03
  • 10:47 arturo: rebase & resolve merge conflicts in labs/private.git

2021-05-03

  • 16:24 dcaro: started tools-sgeexec-0907, was stuck on initramfs due to an unclean fs (/dev/vda3, root), ran fsck manually fixing all the errors and booted up correctly after (T280641)
  • 14:07 dcaro: depooling tools-sgeexec-0908/7 to be able to restart the VMs as they got stuck during migration (T280641)
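
A minimal sketch of the initramfs recovery described in the 16:24 entry (device name from the entry):

    # at the (initramfs) prompt
    fsck -y /dev/vda3   # repair the unclean root filesystem
    exit                # continue booting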

2021-04-29

  • 18:23 bstorm: removing one more etcd node via cookbook T279723
  • 18:12 bstorm: removing an etcd node via cookbook T279723

2021-04-27

  • 16:40 bstorm: deleted all the errored out grid jobs stuck in queue wait
  • 16:16 bstorm: cleared E status on grid queues to get things flowing again

2021-04-26

  • 12:17 arturo: allowing more tools into the legacy redirector (T281003)

2021-04-22

  • 08:44 Krenair: Removed yuvipanda from roots sudo policy
  • 08:42 Krenair: Removed yuvipanda from projectadmin per request
  • 08:40 Krenair: Removed yuvipanda from tools.admin per request

2021-04-20

  • 22:20 bd808: `clush -w @all -b "sudo exiqgrep -z -i | xargs sudo exim -Mt"`
  • 22:19 bd808: `clush -w @exec -b "sudo exiqgrep -z -i | xargs sudo exim -Mt"`
  • 21:52 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad1.wikimedia.cloud`. Was using wrong domain name in prior update.
  • 21:49 bstorm: tagged the latest maintain-kubeusers and deployed to toolforge (with kustomize changes to rbac) after testing in toolsbeta T280300
  • 21:27 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad.wmflabs`. was -2 which is decommed.
  • 10:18 dcaro: setting the retention on the tools-prometheus VMs to 250GB (they have 276GB total, leaving some space for online data operations if needed) (T279990)

2021-04-19

  • 10:53 dcaro: reverting setting prometheus data source in grafana to 'server'; it can't connect
  • 10:51 dcaro: setting prometheus data source in grafana to 'server' to avoid CORS issues

2021-04-16

  • 23:15 bstorm: cleaned up all source files for the grid with the old domain name to enable future node creation T277653
  • 14:38 dcaro: added 'will get out of space in X days' panel to the dashboard https://grafana-labs.wikimedia.org/goto/kBlGd0uGk (T279990), we got <5days xd
  • 11:35 arturo: running `grid-configurator --all-domains` which basically added tools-sgebastion-10,11 as submit hosts and removed tools-sgegrid-master,shadow as submit hosts

2021-04-15

  • 17:45 bstorm: cleared error state from tools-sgeexec-0920.tools.eqiad.wmflabs for a failed job

2021-04-13

  • 13:26 dcaro: upgrade puppet and python-wmflib on tools-prometheus-03
  • 11:23 arturo: deleted shutoff VM tools-package-builder-02 (T275864)
  • 11:21 arturo: deleted shutoff VM tools-sge-services-03,04 (T278354)
  • 11:20 arturo: deleted shutoff VM tools-docker-registry-03,04 (T278303)
  • 11:18 arturo: deleted shutoff VM tools-mail-02 (T278538)
  • 11:17 arturo: deleted shutoff VMs tools-static-12,13 (T278539)

2021-04-11

  • 16:07 bstorm: cleared E state from tools-sgeexec-0917 tools-sgeexec-0933 tools-sgeexec-0934 tools-sgeexec-0937 from failures of jobs 761759, 815031, 815056, 855676, 898936

2021-04-08

  • 18:25 bstorm: cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for tools-sgegrid-master and tools-sgegrid-shadow using the old fqdns T277653
  • 09:24 arturo: allocate & associate floating IP 185.15.56.122 for tools-sgebastion-11, also with DNS A record `dev-buster.toolforge.org` (T275865)
  • 09:22 arturo: create DNS A record `login-buster.toolforge.org` pointing to 185.15.56.66 (tools-sgebastion-10) (T275865)
  • 09:20 arturo: associate floating IP 185.15.56.66 to tools-sgebastion-10 (T275865)
  • 09:13 arturo: created tools-sgebastion-11 (buster) (T275865)

2021-04-07

  • 04:35 andrewbogott: replacing the mx record '10 mail.tools.wmcloud.org' with '10 mail.tools.wmcloud.org.' — trying to fix axfr for the tools.wmcloud.org zone

2021-04-06

  • 15:16 bstorm: cleared queue state since a few had "errored" for failed jobs.
  • 12:59 dcaro: Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)
  • 11:45 arturo: upgrading jobutils & misctools to 1.42 everywhere
  • 11:39 arturo: cleaning up aptly: old package versions, old repos (jessie, trusty, precise) etc
  • 10:31 dcaro: Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs (T267082)
  • 10:21 arturo: published jobutils & misctools 1.42 (T278748)
  • 10:21 arturo: published jobutils & misctools 1.42
  • 10:21 arturo: aptly repo had some weirdness due to the cinder volume: hardlinks created by aptly were broken, solved with `sudo aptly publish --skip-signing repo stretch-tools -force-overwrite`
  • 10:07 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
  • 10:05 arturo: installed aptly from buster-backports on tools-services-05 to see if that makes any difference with an issue when publishing repos
  • 09:53 dcaro: Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs (T267082)
  • 08:55 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
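
The member removals in this block can also be done by hand with etcdctl from one of the etcd nodes; the cookbook wraps the same steps. A minimal sketch in which the endpoint, certificate paths and member ID are illustrative:

    # run as root on an etcd node
    export ETCDCTL_API=3
    etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/client.pem --key=/etc/etcd/ssl/client.key member list               # find the ID of the member to drop
    etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/client.pem --key=/etc/etcd/ssl/client.key member remove <MEMBER_ID>  # remove it from the cluster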

2021-04-05

  • 17:02 bstorm: chowned the data volume for the docker registry to docker-registry:docker-registry
  • 09:56 arturo: make jhernandez (IRC joakino) projectadmin (T278975)

2021-04-01

  • 20:43 bstorm: cleared error state from the grid queues caused by unspecified job errors
  • 15:53 dcaro: Removed etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs, adding a new member (T267082)
  • 15:43 dcaro: Removing etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs (T267082)
  • 15:36 dcaro: Added new etcd member tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud (T267082)
  • 15:18 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)

2021-03-31

  • 15:57 arturo: rebooting `tools-mail-03` after enabling NFS (T267082, T278538)
  • 15:04 arturo: created MX record for `tools.wmcloud.org` pointing to `mail.tools.wmcloud.org`
  • 15:03 arturo: created DNS A record `mail.tools.wmcloud.org` pointing to 185.15.56.63
  • 14:56 arturo: shutoff tools-mail-02 (T278538)
  • 14:55 arturo: point floating IP 185.15.56.63 to tools-mail-03 (T278538)
  • 14:45 arturo: created VM `tools-mail-03` as Debian Buster (T278538)
  • 14:39 arturo: relocate some of the hiera keys for email server from project-level to prefix
  • 09:44 dcaro: running disk performance test on etcd-4 (round2)
  • 09:05 dcaro: running disk performance test on etcd-8
  • 08:43 dcaro: running disk performance test on etcd-4

2021-03-30

  • 16:15 bstorm: added `labstore::traffic_shaping::egress: 800mbps` to tools-static prefix T278539
  • 15:44 arturo: shutoff tools-static-12/13 (T278539)
  • 15:41 arturo: point horizon web proxy `tools-static.wmflabs.org` to tools-static-14 (T278539)
  • 15:37 arturo: add `mount_nfs: true` to tools-static prefix (T278539)
  • 15:26 arturo: create VM tools-static-14 with Debian Buster image (T278539)
  • 12:19 arturo: introduce horizon proxy `deb-tools.wmcloud.org` (T278436)
  • 12:15 arturo: shutdown tools-sgebastion-09 (stretch)
  • 11:05 arturo: created VM `tools-sgebastion-10` as Debian Buster (T275865)
  • 11:04 arturo: created server group `tools-bastion` with anti-affinity policy

2021-03-28

  • 19:31 legoktm: legoktm@tools-sgebastion-08:~$ sudo qdel -f 9999704 # T278645

2021-03-26

  • 12:21 arturo: shutdown tools-package-builder-02 (stretch), we keep -03 which is buster (T275864)

2021-03-25

  • 19:30 bstorm: forced deletion of all jobs stuck in a deleting state T277653
  • 17:46 arturo: rebooting tools-sgeexec-* nodes to account for new grid master (T277653)
  • 16:20 arturo: rebuilding tools-sgegrid-master VM as debian buster (T277653)
  • 16:18 arturo: icinga-downtime toolschecker for 2h
  • 16:05 bstorm: failed over the tools grid to the shadow master T277653
  • 13:36 arturo: shutdown tools-sge-services-03 (T278354)
  • 13:33 arturo: shutdown tools-sge-services-04 (T278354)
  • 13:31 arturo: point aptly clients to `tools-services-05.tools.eqiad1.wikimedia.cloud` (hiera change) (T278354)
  • 12:58 arturo: created VM `tools-services-05` as Debian Buster (T278354)
  • 12:51 arturo: create cinder volume `tools-aptly-data` (T278354)

2021-03-24

  • 12:46 arturo: shutoff the old stretch VMs `tools-docker-registry-03` and `tools-docker-registry-04` (T278303)
  • 12:38 arturo: associate floating IP 185.15.56.67 with `tools-docker-registry-05` and refresh FQDN docker-registry.tools.wmflabs.org accordingly (T278303)
  • 12:33 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-05` (T278303)
  • 12:32 arturo: snapshot cinder volume `tools-docker-registry-data` into `tools-docker-registry-data-stretch-migration` (T278303)
  • 12:32 arturo: bump cinder storage quota from 80G to 400G (without quota request task)
  • 12:11 arturo: created VM `tools-docker-registry-06` as Debian Buster (T278303)
  • 12:09 arturo: detach cinder volume `tools-docker-registry-data` (T278303)
  • 11:46 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-03` to format it and pre-populate it with registry data (T278303)
  • 11:20 arturo: created 80G cinder volume tools-docker-registry-data (T278303)
  • 11:10 arturo: starting VM tools-docker-registry-04 which was stopped probably since 2021-03-09 due to hypervisor draining
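
A minimal sketch of the cinder volume shuffle described in this block, using the project-scoped OpenStack CLI (volume, snapshot and server names come from the entries above):

    openstack volume create --size 80 tools-docker-registry-data                         # create the data volume
    openstack server add volume tools-docker-registry-03 tools-docker-registry-data      # attach to the old VM to pre-populate it
    openstack server remove volume tools-docker-registry-03 tools-docker-registry-data   # detach again
    openstack volume snapshot create --volume tools-docker-registry-data tools-docker-registry-data-stretch-migration
    openstack server add volume tools-docker-registry-05 tools-docker-registry-data      # attach to the new buster VM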

2021-03-23

  • 12:46 arturo: aborrero@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
  • 12:16 arturo: delete & re-create VM tools-sgegrid-shadow as Debian Buster (T277653)
  • 12:14 arturo: created puppet prefix 'tools-sgegrid-shadow' and migrated puppet configuration from VM-puppet
  • 12:13 arturo: created server group 'tools-grid-master-shadow' with anti-affinity policy

2021-03-18

  • 19:24 bstorm: set profile::toolforge::infrastructure across the entire project with login_server set on the bastion and exec node-related prefixes
  • 16:21 andrewbogott: enabling puppet tools-wide
  • 16:20 andrewbogott: disabling puppet tools-wide to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456 (see the sketch after this list)
  • 16:19 bstorm: added profile::toolforge::infrastructure class to puppetmaster T277756
  • 04:12 bstorm: rebooted tools-sgeexec-0935.tools.eqiad.wmflabs because it forgot how to LDAP...likely root cause of the issues tonight
  • 03:59 bstorm: rebooting grid master. sorry for the cron spam
  • 03:49 bstorm: restarting sssd on tools-sgegrid-master
  • 03:37 bstorm: deleted a massive number of stuck jobs that misfired from the cron server
  • 03:35 bstorm: rebooting tools-sgecron-01 to try to clear up the ldap-related errors coming out of it
  • 01:46 bstorm: killed the toolschecker cron job, which had an LDAP error, and ran it again by hand
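
Fleet-wide puppet toggles like the 16:20/16:21 entries are typically driven with clush from the clushmaster, as elsewhere in this log. A minimal sketch; the disable reason is illustrative:

    clush -w @all -b 'sudo puppet agent --disable "testing gerrit 672456"'   # pause puppet while the change is tested
    clush -w @all -b 'sudo puppet agent --enable && sudo puppet agent -t'    # re-enable and run it everywhere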

2021-03-17

  • 20:57 bstorm: deployed changes to rbac for kubernetes to add kubectl top access for tools
  • 20:26 andrewbogott: moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv

2021-03-16

  • 16:31 arturo: installing jobutils and misctools 1.41
  • 15:55 bstorm: deleted a bunch of messed up grid jobs (9989481,8813,81682,86317,122602,122623,583621,606945,606999)
  • 12:32 arturo: add packages jobutils / misctools v1.41 to {stretch,buster}-tools aptly repository in tools-sge-services-03

2021-03-12

  • 23:13 bstorm: cleared error state for all grid queues

2021-03-11

  • 17:40 bstorm: deployed metrics-server:0.4.1 to kubernetes
  • 16:21 bstorm: add jobutils 1.40 and misctools 1.40 to stretch-tools
  • 13:11 arturo: add misctools 1.37 to buster-tools|toolsbeta aptly repo for T275865
  • 13:10 arturo: add jobutils 1.40 to buster-tools aptly repo for T275865

2021-03-10

  • 10:56 arturo: briefly stopped VM tools-k8s-etcd-7 to disable VMX cpu flag

2021-03-09

  • 13:31 arturo: hard-reboot tools-docker-registry-04 because of issues related to T276922
  • 12:34 arturo: briefly rebooting VM tools-docker-registry-04; we need to reboot the hypervisor cloudvirt1038 and it failed to migrate away

2021-03-05

  • 12:30 arturo: started tools-redis-1004 again
  • 12:22 arturo: stop tools-redis-1004 to ease draining of cloudvirt1035

2021-03-04

  • 11:25 arturo: rebooted tools-sgewebgrid-generic-0901, repool it again
  • 09:58 arturo: depool tools-sgewebgrid-generic-0901 to reboot VM. It was stuck in MIGRATING state when draining cloudvirt1022

2021-03-03

  • 15:17 arturo: shutting down tools-sgebastion-07 in an attempt to fix nova state and finish hypervisor migration
  • 15:11 arturo: tools-sgebastion-07 triggered a neutron exception (unauthorized) while being live-migrated from cloudvirt1021 to 1029. Resetting nova state with `nova reset-state bd685d48-1011-404e-a755-372f6022f345 --active` and trying again
  • 14:48 arturo: killed pywikibot instance running in tools-sgebastion-07 by user msyn

2021-03-02

  • 15:23 bstorm: depooling tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs for reboot. It isn't communicating right
  • 15:22 bstorm: cleared queue error states...will need to keep a better eye on what's causing those

2021-02-27

  • 02:23 bstorm: deployed typo fix to maintain-kubeusers in an innocent effort to make the weekend better T275910
  • 02:00 bstorm: running a script to repair the dumps mount in all podpresets T275371

2021-02-26

  • 22:04 bstorm: cleaned up grid jobs 1230666,1908277,1908299,2441500,2441513
  • 21:27 bstorm: hard rebooting tools-sgeexec-0947
  • 21:21 bstorm: hard rebooting tools-sgeexec-0952.tools.eqiad.wmflabs
  • 20:01 bd808: Deleted csr in strange state for tool-ores-inspect

2021-02-24

  • 18:30 bd808: `sudo wmcs-openstack role remove --user zfilipin --project tools user` T267313
  • 01:04 bstorm: hard rebooting tools-k8s-worker-76 because it's in a sorry state

2021-02-23

  • 23:11 bstorm: draining a bunch of k8s workers to clean up after dumps changes T272397
  • 23:06 bstorm: draining tools-k8s-worker-55 to clean up after dumps changes T272397

2021-02-22

  • 20:40 bstorm: repooled tools-sgeexec-0918.tools.eqiad.wmflabs
  • 19:09 bstorm: hard rebooted tools-sgeexec-0918 from openstack T275411
  • 19:07 bstorm: shutting down tools-sgeexec-0918 with the VM's command line (not libvirt directly yet) T275411
  • 19:05 bstorm: shutting down tools-sgeexec-0918 (with openstack to see what happens) T275411
  • 19:03 bstorm: depooled tools-sgeexec-0918 T275411
  • 18:56 bstorm: deleted job 1962508 from the grid to clear it up T275301
  • 16:58 bstorm: cleared error state on several grid queues

2021-02-19

  • 12:31 arturo: deploying new version of toolforge ingress admission controller

2021-02-17

  • 21:26 bstorm: deleted tools-puppetdb-01 since it is unused at this time (and undersized anyway)

2021-02-04

  • 16:27 bstorm: rebooting tools-package-builder-02

2021-01-26

  • 16:27 bd808: Hard reboot of tools-sgeexec-0906 via Horizon for T272978

2021-01-22

  • 09:59 dcaro: added the record redis.svc.tools.eqiad1.wikimedia.cloud pointing to tools-redis1003 (T272679)

2021-01-21

  • 23:58 bstorm: deployed new maintain-kubeusers to tools T271847

2021-01-19

  • 22:57 bstorm: truncated 75GB error log /data/project/robokobot/virgule.err T272247
  • 22:48 bstorm: truncated 100GB error log /data/project/magnus-toolserver/error.log T272247
  • 22:43 bstorm: truncated 107GB log '/data/project/meetbot/logs/messages.log' T272247
  • 22:34 bstorm: truncating 194 GB error log '/data/project/mix-n-match/mnm-microsync.err' T272247
  • 16:37 bd808: Added Jhernandez to root sudoers group
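
The truncations above reclaim space without removing the files, so the writing processes keep their open handles. A minimal sketch (path from the 22:57 entry):

    sudo truncate -s 0 /data/project/robokobot/virgule.err   # zero the runaway log in place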

2021-01-14

  • 20:56 bstorm: setting bastions to have mostly-uncapped egress network and 40MBps nfs_read for better shared use
  • 20:43 bstorm: running tc-setup across the k8s workers
  • 20:40 bstorm: running tc-setup across the grid fleet
  • 17:58 bstorm: hard rebooting tools-sgecron-01 following network issues during upgrade to stein T261134

2021-01-13

  • 10:02 arturo: delete floating IP allocation 185.15.56.245 (T271867)

2021-01-12

  • 18:16 bstorm: deleted wedged CSR tool-adhs-wde to get maintain-kubeusers working again T271842
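
maintain-kubeusers stalls when a certificate signing request it created cannot be reconciled; deleting the wedged CSR lets it be recreated. A minimal sketch, assuming cluster-admin kubectl access (CSR name from the entry above):

    kubectl get csr                    # find the stuck request
    kubectl delete csr tool-adhs-wde   # remove it so maintain-kubeusers can proceed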

2021-01-05

  • 18:49 bstorm: changing the limits on k8s etcd nodes again, so disabling puppet on them T267966

2021-01-04

  • 18:21 bstorm: ran 'sudo systemctl stop getty@ttyS1.service && sudo systemctl disable getty@ttyS1.service' on tools-k8s-etcd-5. I have no idea why that keeps coming back.

2020-12-22

  • 18:22 bstorm: rebooting the grid master because it is misbehaving following the NFS outage
  • 10:53 arturo: rebase & resolve ugly git merge conflict in labs/private.git

2020-12-18

  • 18:37 bstorm: set profile::wmcs::kubeadm::etcd_latency_ms: 15 T267966

2020-12-11

  • 18:29 bstorm: certificatesigningrequest.certificates.k8s.io "tool-production-error-tasks-metrics" deleted to stop maintain-kubeusers issues
  • 12:14 dcaro: upgrading stable/main (clinic duty)
  • 12:12 dcaro: upgrading buster-wikimedia/main (clinic duty)
  • 12:03 dcaro: upgrading stable-updates/main, mainly ca-certificates (clinic duty)
  • 12:01 dcaro: upgrading stretch-backports/main, mainly libuv (clinic duty)
  • 11:58 dcaro: disabled all the repos blocking upgrades on tools-package-builder-02 (duplicated, other releases...)
  • 11:35 arturo: uncordon tools-k8s-worker-71 and tools-k8s-worker-55, they weren't uncordoned yesterday for whatever reasons (T263284)
  • 11:27 dcaro: upgrading stretch-wikimedia/main (clinic duty)
  • 11:20 dcaro: upgrading stretch-wikimedia/thirdparty/mono-project-stretch (clinic duty)
  • 11:08 dcaro: upgrade stretch-wikimedia/component/php72 (minor upgrades) (clinic duty)
  • 11:04 dcaro: upgrade oldstable/main packages (clinic duty)
  • 10:58 dcaro: upgrade kubectl done (clinic duty)
  • 10:53 dcaro: upgrade kubectl (clinic duty)
  • 10:16 dcaro: upgrading oldstable/main packages (clinic duty)

2020-12-10

  • 17:35 bstorm: k8s-control nodes upgraded to 1.17.13 T263284 (see the sketch after this list)
  • 17:16 arturo: k8s control nodes were all upgraded to 1.17, now upgrading worker nodes (T263284)
  • 15:50 dcaro: puppet upgraded to 5.5.10 on the hosts, ping me if you see anything weird (clinic duty)
  • 15:41 arturo: icinga-downtime toolschecker for 2h (T263284)
  • 15:35 dcaro: Puppet 5 on tools-sgebastion-09 ran well and without issues, upgrading the other sge nodes (clinic duty)
  • 15:32 dcaro: Upgrading puppet from 4 to 5 on tools-sgebastion-09 (clinic duty)
  • 12:41 arturo: set hiera `profile::wmcs::kubeadm::component: thirdparty/kubeadm-k8s-1-17` in project & tools-k8s-control prefix (T263284)
  • 11:50 arturo: disabled puppet in all k8s nodes in preparation for version upgrade (T263284)
  • 09:58 dcaro: successful tesseract upgrade on tools-sgewebgrid-lighttpd-0914, upgrading the rest of nodes (clinic duty)
  • 09:49 dcaro: upgrading tesseract on tools-sgewebgrid-lighttpd-0914 (clinic duty)
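
The control-plane upgrade to 1.17.13 noted in the 17:35 entry follows the standard kubeadm flow. A minimal sketch; exact package names and versions provided by the thirdparty/kubeadm-k8s-1-17 component are assumptions:

    sudo apt-get install kubeadm            # pull the new kubeadm from the pinned component
    sudo kubeadm upgrade plan               # on the first control node: check the upgrade path
    sudo kubeadm upgrade apply v1.17.13     # upgrade the control plane components
    sudo apt-get install kubelet kubectl    # then the node packages
    sudo systemctl restart kubelet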

2020-12-08

  • 19:01 bstorm: pushed updated calico node image (v3.14.0) to internal docker registry as well T269016

2020-12-07

  • 22:56 bstorm: pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry T269016

2020-12-03

  • 09:18 arturo: restarted kubelet systemd service on tools-k8s-worker-38. Node was NotReady, complaining about 'use of closed network connection'
  • 09:16 arturo: restarted kubelet systemd service on tools-k8s-worker-59. Node was NotReady, complaining about 'use of closed network connection'
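
A minimal sketch of the recovery used for these NotReady workers (node name from the 09:18 entry; commands run from a control node and over ssh respectively):

    kubectl get nodes | grep NotReady                                                       # confirm which workers are NotReady
    ssh tools-k8s-worker-38.tools.eqiad1.wikimedia.cloud 'sudo systemctl restart kubelet'   # restart the kubelet on the affected worker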

2020-11-28

  • 23:35 Krenair: Re-scheduled 4 continuous jobs from tools-sgeexec-0908 as it appears to be broken, at about 23:20 UTC
  • 04:35 Krenair: Ran `sudo -i kubectl -n tool-mdbot delete cm maintain-kubeusers` on tools-k8s-control-1 for T268904, seems to have regenerated ~tools.mdbot/.kube/config

2020-11-24

  • 17:44 arturo: rebased labs/private.git. 2 patches had merge conflicts
  • 16:36 bd808: clush -w @all -b 'sudo -i apt-get purge nscd'
  • 16:31 bd808: Ran `sudo -i apt-get purge nscd` on tools-sgeexec-0932 to try and fix apt state for puppet

2020-11-10

  • 19:45 andrewbogott: rebooting tools-sgeexec-0950; OOM

2020-11-02

  • 13:35 arturo: (typo: dcaro)
  • 13:35 arturo: added dcar as projectadmin & user (T266068)

2020-10-29

  • 21:33 legoktm: published docker-registry.tools.wmflabs.org/toolbeta-test image (T265681)
  • 21:10 bstorm: Added another ingress node to k8s cluster in case the load spikes are the problem T266506
  • 17:33 bstorm: hard rebooting tools-sgeexec-0905 and tools-sgeexec-0916 to get the grid back to full capacity
  • 04:03 legoktm: published docker-registry.tools.wmflabs.org/toolforge-buster0-builder:latest image (T265686)

2020-10-28

  • 23:42 bstorm: dramatically elevated the egress cap on tools-k8s-ingress nodes that were affected by the NFS settings T266506
  • 22:10 bstorm: launching tools-k8s-ingress-3 to try and get an NFS-free node T266506
  • 21:58 bstorm: set 'mount_nfs: false' on the tools-k8s-ingress prefix T266506

2020-10-23

  • 22:22 legoktm: imported pack_0.14.2-1_amd64.deb into buster-tools (T266270)

2020-10-21

  • 17:58 legoktm: pushed toolforge-buster0-{build,run}:latest images to docker registry

2020-10-15

  • 22:00 bstorm: manually removing nscd from tools-sgebastion-08 and running puppet
  • 18:23 andrewbogott: uncordoning tools-k8s-worker-53, 54, 55, 59
  • 17:28 andrewbogott: depooling tools-k8s-worker-53, 54, 55, 59
  • 17:27 andrewbogott: uncordoning tools-k8s-worker-35, 37, 45
  • 16:44 andrewbogott: depooling tools-k8s-worker-35, 37, 45

2020-10-14

  • 21:00 andrewbogott: repooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
  • 20:37 andrewbogott: depooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
  • 20:35 andrewbogott: repooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16
  • 20:31 bd808: Deployed toollabs-webservice v0.74
  • 19:53 andrewbogott: depooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16 and moving to Ceph
  • 19:47 andrewbogott: repooling tools-sgeexec-0932, 33, 34 and moving to Ceph
  • 19:07 andrewbogott: depooling tools-sgeexec-0932, 33, 34 and moving to Ceph
  • 19:06 andrewbogott: repooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
  • 16:56 andrewbogott: depooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph

2020-10-10

  • 17:07 bstorm: cleared errors on tools-sgeexec-0912.tools.eqiad.wmflabs to get the queue moving again

2020-10-08

  • 17:07 bstorm: rebuilding docker images with locales-all T263339

2020-10-06

  • 19:04 andrewbogott: uncordoned tools-k8s-worker-38
  • 18:51 andrewbogott: uncordoned tools-k8s-worker-52
  • 18:40 andrewbogott: draining and cordoning tools-k8s-worker-52 and tools-k8s-worker-38 for ceph migration

2020-10-02

  • 21:09 bstorm: rebooting tools-k8s-worker-70 because it seems to be unable to recover from an old NFS disconnect
  • 17:37 andrewbogott: stopping tools-prometheus-03 to attempt a snapshot
  • 16:03 bstorm: shutting down tools-prometheus-04 to try to fsck the disk

2020-10-01

  • 21:39 andrewbogott: migrating tools-proxy-06 to ceph
  • 21:35 andrewbogott: moving k8s.tools.eqiad1.wikimedia.cloud from 172.16.0.99 (toolsbeta-test-k8s-haproxy-1) to 172.16.0.108 (toolsbeta-test-k8s-haproxy-2) in anticipation of downtime for haproxy-1 tomorrow

2020-09-30

  • 18:34 andrewbogott: repooling tools-sgeexec-0918
  • 18:29 andrewbogott: depooling tools-sgeexec-0918 so I can reboot cloudvirt1036

2020-09-23

  • 21:38 bstorm: ran an 'apt clean' across the fleet to get ahead of the new locale install

2020-09-18

  • 19:41 andrewbogott: repooling tools-k8s-worker-30, 33, 34, 57, 60
  • 19:04 andrewbogott: depooling tools-k8s-worker-30, 33, 34, 57, 60
  • 19:02 andrewbogott: repooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
  • 17:48 andrewbogott: depooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
  • 17:47 andrewbogott: repooling tools-k8s-worker-31, 32, 36, 39, 40
  • 16:40 andrewbogott: depooling tools-k8s-worker-31, 32, 36, 39, 40
  • 16:38 andrewbogott: repooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
  • 16:10 andrewbogott: depooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
  • 13:54 andrewbogott: repooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916
  • 13:50 andrewbogott: depooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916 for flavor update
  • 01:20 andrewbogott: repooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 after flavor update
  • 01:11 andrewbogott: depooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 for flavor update
  • 01:08 andrewbogott: repooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 after flavor update
  • 01:00 andrewbogott: depooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 for flavor update
  • 00:58 andrewbogott: repooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 after flavor update
  • 00:49 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update

2020-09-17

  • 21:56 bd808: Built and deployed tools-manifest v0.22 (T263190)
  • 21:55 bd808: Built and deployed tools-manifest v0.22 (T169695)
  • 20:34 bd808: Live hacked "--backend=gridengine" into webservicemonitor on tools-sgecron-01 (T263190)
  • 20:21 bd808: Restarted webservicemonitor on tools-sgecron-01.tools.eqiad.wmflabs
  • 20:09 andrewbogott: I didn't actually depool tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 because there was some kind of brief outage just now
  • 19:58 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
  • 19:55 andrewbogott: repooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
  • 19:29 andrewbogott: depooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
  • 15:38 andrewbogott: repooling tools-k8s-worker-70 and tools-k8s-worker-66 after flavor remapping
  • 15:34 andrewbogott: depooling tools-k8s-worker-70 and tools-k8s-worker-66 for flavor remapping
  • 15:30 andrewbogott: repooling tools-sgeexec-0909, 0908, 0907, 0906, 0904
  • 15:21 andrewbogott: depooling tools-sgeexec-0909, 0908, 0907, 0906, 0904 for flavor remapping
  • 13:55 andrewbogott: depooled tools-sgewebgrid-lighttpd-0917 and tools-sgewebgrid-lighttpd-0920
  • 13:55 andrewbogott: repooled tools-sgeexec-0937 after move to ceph
  • 13:45 andrewbogott: depooled tools-sgeexec-0937 for move to ceph

2020-09-16

  • 23:20 andrewbogott: repooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
  • 23:03 andrewbogott: depooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
  • 23:02 andrewbogott: uncordoned tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
  • 22:29 andrewbogott: draining tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
  • 17:37 andrewbogott: service gridengine-master restart on tools-sgegrid-master

2020-09-10

  • 15:37 arturo: hard-rebooting tools-proxy-05
  • 15:33 arturo: rebooting tools-proxy-05 to try flushing local DNS caches
  • 15:25 arturo: detected missing DNS record for k8s.tools.eqiad1.wikimedia.cloud which means the k8s cluster is down
  • 10:22 arturo: enabling ingress dedicated worker nodes in the k8s cluster (T250172)

2020-09-08

  • 23:24 bstorm: clearing grid queue error states blocking job runs
  • 22:53 bd808: forcing puppet run on tools-sgebastion-07

2020-09-02

  • 18:13 andrewbogott: moving tools-sgeexec-0920 to ceph
  • 17:57 andrewbogott: moving tools-sgeexec-0942 to ceph

2020-08-31

  • 19:58 andrewbogott: migrating tools-sgeexec-091[0-9] to ceph
  • 17:19 andrewbogott: migrating tools-sgeexec-090[4-9] to ceph
  • 17:19 andrewbogott: repooled tools-sgeexec-0901
  • 16:52 bstorm: `apt install uwsgi` was run on tools-checker-03 in the last log T261677
  • 16:51 bstorm: running `apt install uwsgi` with --allow-downgrades to fix the puppet setup there T261677
  • 14:26 andrewbogott: depooling tools-sgeexec-0901, migrating to ceph

2020-08-30

  • 00:57 Krenair: also ran qconf -ds on each
  • 00:35 Krenair: Tidied up SGE problems (it was spamming root@ every minute for hours) following host deletions some hours ago - removed tools-sgeexec-0921 through 0931 from @general, ran qmod -rj on all jobs registered for those nodes, then qdel -f on the remainders, then qconf -de on each deleted node
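
A minimal sketch of that cleanup sequence with gridengine commands (the host and job IDs are illustrative; repeat per deleted node):

    sudo qconf -dattr hostgroup hostlist tools-sgeexec-0921.tools.eqiad.wmflabs @general   # drop the node from @general
    sudo qmod -rj <jobid>                                    # reschedule jobs registered on the dead node
    sudo qdel -f <jobid>                                     # force-delete the remainders
    sudo qconf -de tools-sgeexec-0921.tools.eqiad.wmflabs    # delete it as an execution host
    sudo qconf -ds tools-sgeexec-0921.tools.eqiad.wmflabs    # and as a submit host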

2020-08-29

  • 16:02 bstorm: deleting "tools-sgeexec-0931", "tools-sgeexec-0930", "tools-sgeexec-0929", "tools-sgeexec-0928", "tools-sgeexec-0927"
  • 16:00 bstorm: deleting "tools-sgeexec-0926", "tools-sgeexec-0925", "tools-sgeexec-0924", "tools-sgeexec-0923", "tools-sgeexec-0922", "tools-sgeexec-0921"

2020-08-25

  • 19:38 andrewbogott: deleting tools-sgeexec-0943.tools.eqiad.wmflabs, tools-sgeexec-0944.tools.eqiad.wmflabs, tools-sgeexec-0945.tools.eqiad.wmflabs, tools-sgeexec-0946.tools.eqiad.wmflabs, tools-sgeexec-0948.tools.eqiad.wmflabs, tools-sgeexec-0949.tools.eqiad.wmflabs, tools-sgeexec-0953.tools.eqiad.wmflabs — they are broken and we're not very curious why; will retry this exercise when everything is standardized on
  • 15:03 andrewbogott: removing non-ceph nodes tools-sgeexec-0921 through tools-sgeexec-0931
  • 15:02 andrewbogott: added new sge-exec nodes tools-sgeexec-0943 through tools-sgeexec-0953 (for real this time)

2020-08-19

  • 21:29 andrewbogott: shutting down and removing tools-k8s-worker-20 through tools-k8s-worker-29; this load can now be handled by new nodes on ceph hosts
  • 21:15 andrewbogott: shutting down and removing tools-k8s-worker-1 through tools-k8s-worker-19; this load can now be handled by new nodes on ceph hosts
  • 18:40 andrewbogott: creating 13 new xlarge k8s worker nodes, tools-k8s-worker-67 through tools-k8s-worker-79

2020-08-18

  • 15:24 bd808: Rebuilding all Docker containers to pick up newest versions of installed packages

2020-07-30

  • 16:28 andrewbogott: added new xlarge ceph-hosted worker nodes: tools-k8s-worker-61, 62, 63, 64, 65, 66. T258663

2020-07-29

  • 23:24 bd808: Pushed a copy of docker-registry.wikimedia.org/wikimedia-jessie:latest to docker-registry.tools.wmflabs.org/wikimedia-jessie:latest in preparation for the upstream image going away

2020-07-24

  • 22:33 bd808: Removed a few more ancient docker images: grrrit, jessie-toollabs, and nagf
  • 21:02 bd808: Running cleanup script to delete the non-sssd toolforge images from docker-registry.tools.wmflabs.org
  • 20:17 bd808: Forced garbage collection on docker-registry.tools.wmflabs.org
  • 20:06 bd808: Running cleanup script to delete all of the old toollabs-* images from docker-registry.tools.wmflabs.org

2020-07-22

  • 23:24 bstorm: created server group 'tools-k8s-worker' to create any new worker nodes in so that they have a low chance of being scheduled together by openstack unless it is necessary T258663
  • 23:22 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[56-60] T257945
  • 23:17 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[41-55] T257945
  • 23:14 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[21-40] T257945
  • 23:11 bstorm: running puppet and NFS remount on tools-k8s-worker-[1-15] T257945
  • 23:07 bstorm: disabling puppet on k8s workers to reduce the effect of changing the NFS mount version all at once T257945
  • 22:28 bstorm: setting tools-k8s-control prefix to mount NFS v4.2 T257945
  • 22:15 bstorm: set the tools-k8s-control nodes to also use 800MB/s to prevent issues with the Toolforge ingress and API system
  • 22:07 bstorm: set tools-k8s-haproxy-1 (main load balancer for Toolforge) to an egress limit of 800MB/s instead of the default applied to all the other servers
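
A hedged sketch of how the batched NFS 4.2 rollout above might look from the clush master; the node sets follow the log, while the mount point and the umount/mount step are assumptions (puppet manages the actual fstab options).

```bash
# Hedged sketch of the batched NFS 4.2 remount (T257945); the mount point and
# remount method are assumptions.
clush -w 'tools-k8s-worker-[1-15,21-60]' "sudo puppet agent --disable 'NFS 4.2 rollout'"

for batch in 'tools-k8s-worker-[1-15]' 'tools-k8s-worker-[21-40]' \
             'tools-k8s-worker-[41-55]' 'tools-k8s-worker-[56-60]'; do
    clush -w "$batch" "
        sudo puppet agent --enable &&
        sudo run-puppet-agent &&
        sudo umount /mnt/nfs/labstore-secondary-tools-project &&
        sudo mount /mnt/nfs/labstore-secondary-tools-project
    "
done
```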

2020-07-21

  • 16:09 bstorm: rebooting tools-sgegrid-shadow to remount NFS correctly
  • 15:55 bstorm: set the bastion prefix to have explicitly set hiera value of profile::wmcs::nfsclient::nfs_version: '4'

2020-07-17

  • 16:47 bd808: Enabled Puppet on tools-proxy-06 following successful test (T102367)
  • 16:29 bd808: Disabled Puppet on tools-proxy-06 to test nginx config changes manually (T102367)

2020-07-15

  • 23:11 bd808: Removed ssh root key for valhallasw from project hiera (T255697)

2020-07-09

  • 18:53 bd808: Updating git-review to 1.27 via clush across cluster (T257496)

2020-07-08

2020-07-07

  • 23:22 bd808: Rebuilding all Docker images to pick up webservice v0.73 (T234617, T257229)
  • 23:19 bd808: Deploying webservice v0.73 via clush (T234617, T257229)
  • 23:16 bd808: Building webservice v0.73 (T234617, T257229)
  • 15:01 Reedy: killed python process from tools.experimental-embeddings using a lot of cpu on tools-sgebastion-07
  • 15:01 Reedy: killed meno25 process running pwb.py on tools-sgebastion-07
  • 09:59 arturo: point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) (T247236)

2020-07-06

  • 11:54 arturo: briefly point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) and then switch back to 185.15.56.11 (tools-proxy-05); the legacy redirector answers with HTTP 307 redirects (see the sketch below) (T247236)
  • 11:50 arturo: associate floating IP address 185.15.56.60 to tools-legacy-redirector (T247236)
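
For reference, a hedged sketch of the floating-IP association and A-record flip with the OpenStack/Designate CLI; it assumes the Designate OSC plugin is available and that zone and recordset names are accepted in place of IDs.

```bash
# Hedged sketch of the tools.wmflabs.org switch-over test (T247236).
openstack server add floating ip tools-legacy-redirector 185.15.56.60
openstack recordset set wmflabs.org. tools.wmflabs.org. --record 185.15.56.60
# ...and back to the active proxy after the brief test:
openstack recordset set wmflabs.org. tools.wmflabs.org. --record 185.15.56.11
```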

2020-07-01

2020-06-30

  • 11:18 arturo: set some hiera keys for mtail in puppet prefix `tools-mail` (T256737)

2020-06-29

2020-06-25

  • 21:49 zhuyifei1999_: re-enabling puppet on tools-sgebastion-09 T256426
  • 21:39 zhuyifei1999_: disabling puppet on tools-sgebastion-09 so I can play with mount settings T256426
  • 21:24 bstorm: hard rebooting tools-sgebastion-09

2020-06-24

2020-06-23

  • 17:55 arturo: killed processes for users `hamishz` and `msyn`, which apparently belonged to tools that should be running on the grid / Kubernetes instead
  • 16:08 arturo: created acme-chief cert `tools_mail` in the prefix hiera

2020-06-17

  • 10:40 arturo: created VM tools-legacy-redirector, with the corresponding puppet prefix (T247236, T234617)

2020-06-16

  • 23:01 bd808: Building new Docker images to pick up webservice 0.72
  • 22:58 bd808: Deploying webservice 0.72 to bastions and grid
  • 22:56 bd808: Building webservice 0.72
  • 15:10 arturo: merging a patch with changes to the template for keepalived (used in the elastic cluster) https://gerrit.wikimedia.org/r/c/operations/puppet/+/605898

2020-06-15

  • 21:28 bstorm_: cleaned up killgridjobs.sh on the tools bastions T157792
  • 18:14 bd808: Rebuilding all Docker images to pick up webservice 0.71 (T254640, T253412)
  • 18:12 bd808: Deploying webservice 0.71 to bastions and grid via clush
  • 18:05 bd808: Building webservice 0.71

2020-06-12

  • 13:13 arturo: live-hacking session in the puppetmaster ended
  • 13:10 arturo: live-hacking puppet tree in tools-puppetmaster-02 for testing a PAWS related patch (they share haproxy puppet code)
  • 00:16 bstorm_: remounted NFS for tools-k8s-control-3 and tools-acme-chief-01

2020-06-11

  • 23:35 bstorm_: rebooting tools-k8s-control-2 because it seems to be confused on NFS, interestingly enough

2020-06-04

  • 13:32 bd808: Manually restored /etc/haproxy/conf.d/elastic.cfg on tools-elastic-*

2020-06-02

2020-06-01

  • 23:51 bstorm_: refreshed certs for the custom webhook controllers on the k8s cluster T250874
  • 00:39 bd808: Ugh. Prior SAL message was about tools-sgeexec-0940
  • 00:39 bd808: Compressed /var/log/account/pacct.0 ahead of rotation schedule to free some space on the root partition

2020-05-29

  • 19:37 bstorm_: adding docker image for paws-public docker-registry.tools.wmflabs.org/paws-public-nginx:openresty T252217

2020-05-28

  • 21:19 bd808: Killed 7 python processes run by user 'mattho69' on login.toolforge.org
  • 21:06 bstorm_: upgrading tools-k8s-worker-[30-60] to kubernetes 1.16.10 (upgrade flow sketched below) T246122
  • 17:54 bstorm_: upgraded tools-k8s-worker-[11..15] and starting on -21-29 now T246122
  • 16:01 bstorm_: kubectl upgraded to 1.16.10 on all bastions T246122
  • 15:58 arturo: upgrading tools-k8s-worker-[1..10] to 1.16.10 (T246122)
  • 15:41 arturo: upgrading tools-k8s-control-3 to 1.16.10 (T246122)
  • 15:17 arturo: upgrading tools-k8s-control-2 to 1.16.10 (T246122)
  • 15:09 arturo: upgrading tools-k8s-control-1 to 1.16.10 (T246122)
  • 14:49 arturo: cleanup /etc/apt/sources.list.d/ directory in all tools-k8s-* VMs
  • 11:27 arturo: merging change to front-proxy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/599139 (T253816)
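
A hedged sketch of the per-node upgrade flow behind the 1.16.10 entries above; it follows the standard kubeadm procedure, and the package pins are assumptions rather than the exact commands used.

```bash
# Hedged sketch of the 1.16.10 upgrade (T246122), standard kubeadm flow.

# First control-plane node (tools-k8s-control-1):
sudo apt-get install -y kubeadm=1.16.10-00
sudo kubeadm upgrade apply v1.16.10
sudo apt-get install -y kubelet=1.16.10-00 kubectl=1.16.10-00
sudo systemctl restart kubelet

# Remaining control nodes and each worker (workers drained beforehand):
sudo apt-get install -y kubeadm=1.16.10-00
sudo kubeadm upgrade node
sudo apt-get install -y kubelet=1.16.10-00
sudo systemctl restart kubelet
```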

2020-05-27

  • 17:23 bstorm_: deleting "tools-k8s-worker-20", "tools-k8s-worker-19", "tools-k8s-worker-18", "tools-k8s-worker-17", "tools-k8s-worker-16"

2020-05-26

  • 18:45 bstorm_: upgrading maintain-kubeusers to match what is in toolsbeta T246059 T211096
  • 16:20 bstorm_: fix incorrect volume name in kubeadm-config configmap T246122

2020-05-22

  • 20:00 bstorm_: rebooted tools-sgebastion-07 (with a 10 min warning) to clear up tmp file problems
  • 19:12 bstorm_: running command to delete over 2000 tmp ca certs on tools-bastion-07 T253412

2020-05-21

  • 22:40 bd808: Rebuilding all Docker containers for tools-webservice 0.70 (T252700)
  • 22:36 bd808: Updated tools-webservice to 0.70 across instances (T252700)
  • 22:29 bd808: Building tools-webservice 0.70 via wmcs-package-build.py

2020-05-20

  • 09:59 arturo: now running tesseract-ocr v4.1.1-2~bpo9+1 in the Toolforge grid (T247422)
  • 09:50 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'apt-get install tesseract-ocr -t stretch-backports -y'` (T247422)
  • 09:35 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'rm /etc/apt/sources.lists.d/kubeadm-k8s-component-repo.list ; rm /etc/apt/sources.list.d/repository_thirdparty-kubeadm-k8s-1-15.list ; run-puppet-agent'` (T247422)
  • 09:23 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'rm /etc/apt/preferences.d/* ; run-puppet-agent'` (T247422)

2020-05-19

  • 17:00 bstorm_: deleting/restarting the paws db-proxy pod because it cannot connect to the replicas...and I'm hoping that's due to depooling and such

2020-05-13

  • 18:14 bstorm_: upgrading calico to 3.14.0 with typha enabled in Toolforge K8s T250863
  • 18:10 bstorm_: set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade T250863

2020-05-09

  • 00:28 bstorm_: added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera T252260

2020-05-08

  • 18:17 bd808: Building all jessie-sssd derived images (T197930)
  • 17:29 bd808: Building new jessie-sssd base image (T197930)

2020-05-07

  • 21:51 bstorm_: rebuilding the docker images for Toolforge k8s
  • 19:03 bstorm_: toollabs-webservice 0.69 now pushed to the Toolforge bastions
  • 18:57 bstorm_: pushing new toollabs-webservice package v0.69 to the tools repos

2020-05-06

  • 21:20 bd808: Kubectl delete node tools-k8s-worker-[16-20] (decommission flow sketched below) (T248702)
  • 18:24 bd808: Updated "profile::toolforge::k8s::worker_nodes" list in "tools-k8s-haproxy" prefix puppet (T248702)
  • 18:14 bd808: Shutdown tools-k8s-worker-[16-20] instances (T248702)
  • 18:04 bd808: Draining tools-k8s-worker-[16-20] in preparation for decomm (T248702)
  • 17:56 bd808: Cordoned tools-k8s-worker-[16-20] in preparation for decomm (T248702)
  • 00:01 bd808: Joining tools-k8s-worker-60 to the k8s worker pool
  • 00:00 bd808: Joining tools-k8s-worker-59 to the k8s worker pool
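
A hedged sketch of the cordon/drain/delete flow used for those workers; the drain flags are assumptions.

```bash
# Hedged sketch of the worker decommission (T248702).
for node in tools-k8s-worker-{16..20}; do
    kubectl cordon "$node"
    kubectl drain "$node" --ignore-daemonsets --delete-local-data
done
# The instances are then shut down and removed from the
# profile::toolforge::k8s::worker_nodes list in the tools-k8s-haproxy prefix,
# and finally dropped from the API:
for node in tools-k8s-worker-{16..20}; do
    kubectl delete node "$node"
done
```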

2020-05-05

  • 23:58 bd808: Joining tools-k8s-worker-58 to the k8s worker pool
  • 23:55 bd808: Joining tools-k8s-worker-57 to the k8s worker pool
  • 23:53 bd808: Joining tools-k8s-worker-56 to the k8s worker pool
  • 21:51 bd808: Building 5 new k8s worker nodes (T248702)

2020-05-04

  • 22:08 bstorm_: deleting tools-elastic-01/2/3 T236606
  • 16:46 arturo: removing the now unused `/etc/apt/preferences.d/toolforge_k8s_kubeadmrepo*` files (T250866)
  • 16:43 arturo: removing the now unused `/etc/apt/sources.list.d/toolforge-k8s-kubeadmrepo.list` file (T250866)

2020-04-29

  • 22:13 bstorm_: running a fixup script after fixing a bug T247455
  • 21:28 bstorm_: running the rewrite-psp-preset.sh script across all tools T247455
  • 16:54 bstorm_: deleted the maintain-kubeusers pod to start running the new image T247455
  • 16:52 bstorm_: tagged docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to latest to deploy to toolforge T247455

2020-04-28

  • 22:58 bstorm_: rebuilding docker-registry.tools.wmflabs.org/maintain-kubeusers:beta T247455

2020-04-23

  • 19:22 bd808: Increased Kubernetes services quota for bd808-test tool.

2020-04-21

  • 23:06 bstorm_: repooled tools-k8s-worker-38/52, tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869
  • 22:09 bstorm_: depooling tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869
  • 22:02 bstorm_: draining tools-k8s-worker-38 and tools-k8s-worker-52 as they are on the crashed host T250869

2020-04-20

  • 15:31 bd808: Rebuilding Docker containers to pick up tools-webservice v0.68 (T250625)
  • 14:47 arturo: added joakino to tools.admin LDAP group
  • 13:28 jeh: shutdown elasticsearch v5 cluster running Jessie T236606
  • 12:46 arturo: uploading tools-webservice v0.68 to aptly stretch-tools and update it on relevant servers (T250625)
  • 12:06 arturo: uploaded tools-webservice v0.68 to stretch-toolsbeta for testing
  • 11:59 arturo: `root@tools-sge-services-03:~# aptly db cleanup` removed 340 unreferenced packages, and 2 unreferenced files

2020-04-15

  • 23:20 bd808: Building ruby25-sssd/base and children (T141388, T250118)
  • 20:09 jeh: update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206
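
One way the security-group change above could look with the OpenStack CLI; the source restriction shown is an assumption and the prometheus host IP is a placeholder.

```bash
# Hedged sketch of the node_exporter scrape rule (T250206).
openstack security group rule create default \
    --protocol tcp --dst-port 9100:9100 --ingress \
    --remote-ip <prometheus01.metricsinfra.eqiad.wmflabs IP>/32
```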

2020-04-14

  • 18:26 bstorm_: Deployed new code and RBAC for maintain-kubeusers T246123
  • 18:19 bstorm_: updating the maintain-kubeusers:latest image T246123
  • 17:32 bstorm_: updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 T246123

2020-04-10

2020-04-09

  • 15:13 bd808: Rebuilding all stretch and buster Docker images. Jessie is broken at the moment due to package version mismatches
  • 11:18 arturo: bump nproc limit in bastions https://gerrit.wikimedia.org/r/c/operations/puppet/+/587715 (T219070)
  • 04:29 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 [try #2] (T154504, T234617)
  • 04:19 bd808: python3 build.py --image-prefix toolforge --tag latest --no-cache --push --single jessie-sssd
  • 00:20 bd808: Docker rebuild failed in toolforge-python2-sssd-base: "zlib1g-dev : Depends: zlib1g (= 1:1.2.8.dfsg-2+b1) but 1:1.2.8.dfsg-2+deb8u1 is to be installed"

2020-04-08

  • 23:49 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 (T154504, T234617)
  • 23:35 bstorm_: deploy toollabs-webservice v0.66 T154504 T234617

2020-04-07

  • 20:06 andrewbogott: sss_cache -E on tools-sgebastion-08 and tools-sgebastion-09
  • 20:00 andrewbogott: sss_cache -E on tools-sgebastion-07

2020-04-06

  • 19:16 bstorm_: deleted tools-redis-1001/2 T248929

2020-04-03

  • 22:40 bstorm_: shut down tools-redis-1001/2 T248929
  • 22:32 bstorm_: switch tools-redis-1003 to the active redis server T248929
  • 20:41 bstorm_: deleting tools-redis-1003/4 to re-create them in an anti-affinity server group (sketched below) T248929
  • 18:53 bstorm_: spin up tools-redis-1004 on stretch and connect to cluster T248929
  • 18:23 bstorm_: spin up tools-redis-1003 on stretch and connect to the cluster T248929
  • 16:50 bstorm_: launching tools-redis-03 (Buster) to see what happens
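
A hedged sketch of creating the anti-affinity server group and booting a replacement instance into it; the flavor, image, network and group UUID are placeholders, not values from the log.

```bash
# Hedged sketch of the redis anti-affinity setup (T248929).
openstack server group create --policy anti-affinity tools-redis
openstack server create tools-redis-1003 \
    --flavor <flavor> --image <debian-stretch-image> --network <tools-network> \
    --hint group=<tools-redis server group UUID>
```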

2020-03-30

  • 18:28 bstorm_: Beginning rolling depool, remount, repool of k8s workers for T248702
  • 18:22 bstorm_: disabled puppet across tools-k8s-worker-[1-55].tools.eqiad.wmflabs T248702
  • 16:56 arturo: dropping `_psl.toolforge.org` TXT record (T168677)

2020-03-27

  • 21:22 bstorm_: removed puppet prefix tools-docker-builder T248703
  • 21:15 bstorm_: deleted tools-docker-builder-06 T248703
  • 18:55 bstorm_: launching tools-docker-imagebuilder-01 T248703
  • 12:52 arturo: install python3-pykube on tools-k8s-control-3 for some test interactions with the API from Python

2020-03-24

2020-03-18

  • 19:07 bstorm_: removed role::toollabs::logging::sender from project puppet (it wouldn't work anyway)
  • 18:04 bstorm_: removed puppet prefix tools-flannel-etcd T246689
  • 17:58 bstorm_: removed puppet prefix tools-worker T246689
  • 17:57 bstorm_: removed puppet prefix tools-k8s-master T246689
  • 17:36 bstorm_: removed lots of deprecated hiera keys from horizon for the old cluster T246689
  • 16:59 bstorm_: deleting "tools-worker-1002", "tools-worker-1001", "tools-k8s-master-01", "tools-flannel-etcd-03", "tools-k8s-etcd-03", "tools-flannel-etcd-02", "tools-k8s-etcd-02", "tools-flannel-etcd-01", "tools-k8s-etcd-01" T246689

2020-03-17

  • 13:29 arturo: set `profile::toolforge::bastion::nproc: 200` for tools-sgebastion-08 (T219070)
  • 00:08 bstorm_: shut off tools-flannel-etcd-01/02/03 T246689

2020-03-16

  • 22:01 bstorm_: shut off tools-k8s-etcd-01/02/03 T246689
  • 22:00 bstorm_: shut off tools-k8s-master-01 T246689
  • 21:59 bstorm_: shut down tools-worker-1001 and tools-worker-1002 T246689

2020-03-11

  • 17:00 jeh: clean up apt cache on tools-sgebastion-07

2020-03-06

  • 16:25 bstorm_: updating maintain-kubeusers image to filter invalid tool names

2020-03-03

  • 18:16 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) T236606
  • 18:02 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud T236606
  • 17:31 jeh: create an OpenStack virtual IP address for the new elasticsearch cluster T236606
  • 10:54 arturo: deleted VMs `tools-worker-[1003-1020]` (legacy k8s cluster) (T246689)
  • 10:51 arturo: cordoned/drained all legacy k8s worker nodes except 1001/1002 (T246689)

2020-03-02

  • 22:26 jeh: starting first pass of elasticsearch data migration to new cluster T236606

2020-03-01

2020-02-28

  • 22:14 bstorm_: shutting down the old maintain-kubeusers and taking the gloves off the new one (removing --gentle-mode)
  • 16:51 bstorm_: node/tools-k8s-worker-15 uncordoned
  • 16:44 bstorm_: drained tools-k8s-worker-15 and hard rebooting it because it wasn't happy
  • 16:36 bstorm_: rebooting k8s workers 1-35 on the 2020 cluster to clear a strange nologin condition that has been there since the NFS maintenance
  • 16:14 bstorm_: rebooted tools-k8s-worker-7 to clear some puppet issues
  • 16:00 bd808: Devoicing stashbot in #wikimedia-cloud to reduce irc spam while migrating tools to 2020 Kubernetes cluster
  • 15:28 jeh: create OpenStack server group tools-elastic with anti-affinty policy enabled T236606
  • 15:09 jeh: create 3 new elasticsearch VMs tools-elastic-[1,2,3] T236606
  • 14:20 jeh: create new puppet prefixes for existing (no change in data) and new elasticsearch VMs
  • 04:35 bd808: Joined tools-k8s-worker-54 to 2020 Kubernetes cluster
  • 04:34 bd808: Joined tools-k8s-worker-53 to 2020 Kubernetes cluster
  • 04:32 bd808: Joined tools-k8s-worker-52 to 2020 Kubernetes cluster
  • 04:31 bd808: Joined tools-k8s-worker-51 to 2020 Kubernetes cluster
  • 04:28 bd808: Joined tools-k8s-worker-50 to 2020 Kubernetes cluster
  • 04:24 bd808: Joined tools-k8s-worker-49 to 2020 Kubernetes cluster
  • 04:23 bd808: Joined tools-k8s-worker-48 to 2020 Kubernetes cluster
  • 04:21 bd808: Joined tools-k8s-worker-47 to 2020 Kubernetes cluster
  • 04:21 bd808: Joined tools-k8s-worker-46 to 2020 Kubernetes cluster
  • 04:19 bd808: Joined tools-k8s-worker-45 to 2020 Kubernetes cluster
  • 04:14 bd808: Joined tools-k8s-worker-44 to 2020 Kubernetes cluster
  • 04:13 bd808: Joined tools-k8s-worker-43 to 2020 Kubernetes cluster
  • 04:12 bd808: Joined tools-k8s-worker-42 to 2020 Kubernetes cluster
  • 04:10 bd808: Joined tools-k8s-worker-41 to 2020 Kubernetes cluster
  • 04:09 bd808: Joined tools-k8s-worker-40 to 2020 Kubernetes cluster
  • 04:08 bd808: Joined tools-k8s-worker-39 to 2020 Kubernetes cluster
  • 04:07 bd808: Joined tools-k8s-worker-38 to 2020 Kubernetes cluster
  • 04:06 bd808: Joined tools-k8s-worker-37 to 2020 Kubernetes cluster
  • 03:49 bd808: Joined tools-k8s-worker-36 to 2020 Kubernetes cluster
  • 00:50 bstorm_: rebuilt all docker images to include webservice 0.64

2020-02-27

  • 23:27 bstorm_: installed toollabs-webservice 0.64 on the bastions
  • 23:24 bstorm_: pushed toollabs-webservice version 0.64 to all toolforge repos
  • 21:03 jeh: add reindex service account to elasticsearch for data migration T236606
  • 20:57 bstorm_: upgrading toollabs-webservice to stretch-toolsbeta version for jdk8:testing image only
  • 20:19 jeh: update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 T236606
  • 18:53 bstorm_: hard rebooted a rather stuck tools-sgecron-01
  • 18:20 bd808: Building tools-k8s-worker-[36-55]
  • 17:56 bd808: Deleted instances tools-worker-10[21-40]
  • 16:14 bd808: Decommissioning tools-worker-10[21-40]
  • 16:02 bd808: Drained tools-worker-1021
  • 15:51 bd808: Drained tools-worker-1022
  • 15:44 bd808: Drained tools-worker-1023 (there is no tools-worker-1024)
  • 15:39 bd808: Drained tools-worker-1025
  • 15:39 bd808: Drained tools-worker-1026
  • 15:11 bd808: Drained tools-worker-1027
  • 15:09 bd808: Drained tools-worker-1028 (there is no tools-worker-1029)
  • 15:07 bd808: Drained tools-worker-1030
  • 15:06 bd808: Uncordoned tools-worker-10[16-20]. Was overly optimistic about repacking the legacy Kubernetes cluster into 15 instances. Will keep 20 for now.
  • 15:00 bd808: Drained tools-worker-1031
  • 14:54 bd808: Hard reboot tools-worker-1016. Direct virsh console unresponsive. Stuck in shutdown since 2020-01-22?
  • 14:44 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
  • 14:41 bd808: Drained tools-worker-1032
  • 14:37 bd808: Drained tools-worker-1033
  • 14:35 bd808: Drained tools-worker-1034
  • 14:34 bd808: Drained tools-worker-1035
  • 14:33 bd808: Drained tools-worker-1036
  • 14:33 bd808: Drained tools-worker-10{39,38,37} yesterday but did not !log
  • 00:29 bd808: Drained tools-worker-1009 for reboot (NFS flakey)
  • 00:11 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
  • 00:08 bd808: Uncordoned tools-worker-1002.tools.eqiad.wmflabs
  • 00:02 bd808: Rebooting tools-worker-1002
  • 00:00 bd808: Draining tools-worker-1002 to reboot for NFS problems

2020-02-26

  • 23:42 bd808: Drained tools-worker-1040
  • 23:41 bd808: Cordoned tools-worker-10[16-40] in preparation for shrinking legacy Kubernetes cluster
  • 23:12 bstorm_: replacing all tool limit-ranges in the 2020 cluster with a lower cpu request version
  • 22:29 bstorm_: deleted pod maintain-kubeusers-6d9c45f4bc-5bqq5 to deploy new image
  • 21:06 bstorm_: deleting loads of stuck grid jobs
  • 20:27 jeh: rebooting tools-worker-[1008,1015,1021]
  • 20:15 bstorm_: rebooting tools-sgegrid-master because it actually still had the permissions issue going on
  • 18:03 bstorm_: downtimed toolschecker for nfs maintenance

2020-02-25

  • 15:31 bd808: `wmcs-k8s-enable-cluster-monitor toolschecker`

2020-02-23

2020-02-21

  • 16:02 andrewbogott: moving tools-sgecron-01 to cloudvirt1022

2020-02-20

  • 14:49 andrewbogott: moving tools-k8s-worker-19 and tools-k8s-worker-18 to cloudvirt1022 (as part of draining 1014)
  • 00:04 Krenair: Shut off tools-puppetmaster-01 - to be deleted in one week T245365

2020-02-19

  • 22:05 Krenair: Project-wide hiera change to swap puppetmaster to tools-puppetmaster-02 T245365
  • 15:36 bstorm_: setting 'puppetmaster: tools-puppetmaster-02.tools.eqiad.wmflabs' on tools-sgeexec-0942 to test new puppetmaster on grid T245365
  • 11:50 arturo: fix invalid YAML in the Horizon puppet prefix 'tools-k8s-haproxy' that prevented a clean puppet run in the VMs
  • 00:59 bd808: Live hacked the "nginx-configuration" ConfigMap for T245426 (done several hours ago, but I forgot to !log it)

2020-02-18

  • 23:26 bstorm_: added tools-sgegrid-master.tools.eqiad1.wikimedia.cloud and tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud to gridengine admin host lists
  • 09:50 arturo: temporarily delete DNS zone tools.wmcloud.org to try re-creating it

2020-02-17

2020-02-14

  • 00:38 bd808: Added tools-k8s-worker-35 to 2020 Kubernetes cluster (T244791)
  • 00:34 bd808: Added tools-k8s-worker-34 to 2020 Kubernetes cluster (T244791)
  • 00:32 bd808: Added tools-k8s-worker-33 to 2020 Kubernetes cluster (T244791)
  • 00:29 bd808: Added tools-k8s-worker-32 to 2020 Kubernetes cluster (T244791)
  • 00:25 bd808: Added tools-k8s-worker-31 to 2020 Kubernetes cluster (T244791)
  • 00:25 bd808: Added tools-k8s-worker-30 to 2020 Kubernetes cluster (T244791)
  • 00:17 bd808: Added tools-k8s-worker-29 to 2020 Kubernetes cluster (T244791)
  • 00:15 bd808: Added tools-k8s-worker-28 to 2020 Kubernetes cluster (T244791)
  • 00:13 bd808: Added tools-k8s-worker-27 to 2020 Kubernetes cluster (T244791)
  • 00:07 bd808: Added tools-k8s-worker-26 to 2020 Kubernetes cluster (T244791)
  • 00:03 bd808: Added tools-k8s-worker-25 to 2020 Kubernetes cluster (T244791)

2020-02-13

  • 23:53 bd808: Added tools-k8s-worker-24 to 2020 Kubernetes cluster (T244791)
  • 23:50 bd808: Added tools-k8s-worker-23 to 2020 Kubernetes cluster (T244791)
  • 23:38 bd808: Added tools-k8s-worker-22 to 2020 Kubernetes cluster (T244791)
  • 21:35 bd808: Deleted tools-sgewebgrid-lighttpd-092{1,2,3,4,5,6,7,8} & tools-sgewebgrid-generic-090{3,4} (T244791)
  • 21:33 bd808: Removed tools-sgewebgrid-lighttpd-092{1,2,3,4,5,6,7,8} & tools-sgewebgrid-generic-090{3,4} from grid engine config (T244791)
  • 17:43 andrewbogott: migrating b24e29d7-a468-4882-9652-9863c8acfb88 to cloudvirt1022

2020-02-12

  • 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice (0.63) (T244954)
  • 19:15 bd808: Deployed toollabs-webservice (0.63) on bastions (T244954)
  • 00:20 bd808: Depooling tools-sgewebgrid-generic-0903 (T244791)
  • 00:19 bd808: Depooling tools-sgewebgrid-generic-0904 (T244791)
  • 00:14 bd808: Depooling tools-sgewebgrid-lighttpd-0921 (T244791)
  • 00:09 bd808: Depooling tools-sgewebgrid-lighttpd-0922 (T244791)
  • 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0923 (T244791)
  • 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0924 (T244791)

2020-02-11

  • 23:58 bd808: Depooling tools-sgewebgrid-lighttpd-0925 (T244791)
  • 23:56 bd808: Depooling tools-sgewebgrid-lighttpd-0926 (T244791)
  • 23:38 bd808: Depooling tools-sgewebgrid-lighttpd-0927 (T244791)

2020-02-10

  • 23:39 bstorm_: updated tools-manifest to 0.21 on aptly for stretch
  • 22:51 bstorm_: all docker images now use webservice 0.62
  • 22:01 bd808: Manually starting webservices for tools that were running on tools-sgewebgrid-lighttpd-0928 (T244791)
  • 21:47 bd808: Depooling tools-sgewebgrid-lighttpd-0928 (T244791)
  • 21:25 bstorm_: upgraded toollabs-webservice package for tools to 0.62 T244293 T244289 T234617 T156626

2020-02-07

  • 10:55 arturo: drop jessie VM instances tools-prometheus-{01,02} which were shutdown (T238096)

2020-02-06

2020-02-05

  • 11:22 arturo: restarting ferm fleet-wide to account for prometheus servers changed IP (but same hostname) (T238096)

2020-02-04

  • 11:38 arturo: start tools-prometheus-01 again to sync data to the new tools-prometheus-03/04 VMs (T238096)
  • 11:37 arturo: re-create tools-prometheus-03/04 as 'bigdisk2' instances (300GB) T238096

2020-02-03

  • 14:12 arturo: move tools-prometheus-04 from cloudvirt1022 to cloudvirt1013
  • 12:48 arturo: shutdown tools-prometheus-01 and tools-prometheus-02 after pointing the proxy `tools-prometheus.wmflabs.org` to tools-prometheus-03; data is synced (T238096)
  • 09:38 arturo: tools-prometheus-01: systemctl stop prometheus@tools. Another try to migrate data to tools-prometheus-{03,04} (T238096)

2020-01-31

  • 14:06 arturo: leave tools-prometheus-01 as the backend for tools-prometheus.wmflabs.org for the weekend so grafana dashboards keep working (T238096)
  • 14:00 arturo: syncing again prometheus data from tools-prometheus-01 to tools-prometheus-0{3,4} due to some inconsistencies preventing prometheus from starting (T238096)

2020-01-30

  • 21:04 andrewbogott: also apt-get install python3-novaclient on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam. Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
  • 20:39 andrewbogott: apt-get install python3-keystoneclient on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam. Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
  • 16:27 arturo: create VM tools-prometheus-04 as cold standby of tools-prometheus-03 (T238096)
  • 16:25 arturo: point tools-prometheus.wmflabs.org proxy to tools-prometheus-03 (T238096)
  • 13:42 arturo: disable puppet in prometheus servers while syncing metric data (T238096)
  • 13:15 arturo: drop floating IP 185.15.56.60 and FQDN `prometheus.tools.wmcloud.org` because this is not how the prometheus setup currently works; using a web proxy instead: `tools-prometheus-new.wmflabs.org` (T238096)
  • 13:09 arturo: created FQDN `prometheus.tools.wmcloud.org` pointing to IPv4 185.15.56.60 (tools-prometheus-03) to test T238096
  • 12:59 arturo: associated floating IPv4 185.15.56.60 to tools-prometheus-03 (T238096)
  • 12:57 arturo: created domain `tools.wmcloud.org` in the tools project after some back and forth with Designate, permissions and the database. I plan to use this domain to test the new Debian Buster-based prometheus setup (T238096)
  • 10:20 arturo: create new VM instance tools-prometheus-03 (T238096)

2020-01-29

  • 20:07 bd808: Created {bastion,login,dev}.toolforge.org service names for Toolforge bastions using Horizon & Designate

2020-01-28

  • 13:35 arturo: `aborrero@tools-clushmaster-02:~$ clush -w @exec-stretch 'for i in $(ps aux | grep [t]ools.j | awk -F" " "{print \$2}") ; do echo "killing $i" ; sudo kill $i ; done || true'` (T243831)

2020-01-27

  • 07:05 zhuyifei1999_: wrong package; uninstalled. The correct one is bpfcc-tools, which seems to be available only in buster+. T115231
  • 07:01 zhuyifei1999_: apt installing bcc on tools-worker-1037 to see who is sending SIGTERM, will uninstall after done. dependency: bin86. T115231

2020-01-24

  • 20:58 bd808: Built tools-k8s-worker-21 to test out build script following openstack client upgrade
  • 15:45 bd808: Rebuilding all Docker containers again because I failed to actually update the build server git clone properly last time I did this
  • 05:23 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster (take 2)
  • 04:41 bd808: Rebuilding all Docker images to pick up webservice-python-bootstrap changes

2020-01-23

  • 23:38 bd808: Halted tools-k8s-worker build script after first instance (tools-k8s-worker-10) stuck in "scheduling" state for 20 minutes
  • 23:16 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster
  • 05:15 bd808: Building tools-elastic-04
  • 04:39 bd808: wmcs-openstack quota set --instances 192
  • 04:36 bd808: wmcs-openstack quota set --cores 768 --ram 1536000

2020-01-22

  • 12:43 arturo: for the record, issue with tools-worker-1016 was memory exhaustion apparently
  • 12:35 arturo: hard-reboot tools-worker-1016 (not responding to even console access)

2020-01-21

  • 19:25 bstorm_: hard rebooting tools-sgeexec-0913/14/35 because they aren't even on the network
  • 19:17 bstorm_: depooled and rebooted tools-sgeexec-0914 because it was acting funny
  • 18:30 bstorm_: depooling and rebooting tools-sgeexec-[0911,0913,0919,0921,0924,0931,0933,0935,0939,0941].tools.eqiad.wmflabs
  • 17:21 bstorm_: rebooting toolschecker to recover stale nfs handle

2020-01-16

  • 23:54 bstorm_: rebooting tools-docker-builder-06 because there are a couple running containers that don't want to die cleanly
  • 23:45 bstorm_: rebuilding docker containers to include new webservice version (0.58)
  • 23:41 bstorm_: deployed toollabs-webservice 0.58 to everything that isn't a container
  • 16:45 bstorm_: ran configurator to set the gridengine web queues to `rerun FALSE` T242397

2020-01-14

  • 15:29 bstorm_: failed the gridengine master back to the master server from the shadow
  • 02:23 andrewbogott: rebooting tools-paws-worker-1006 to resolve hangs associated with an old NFS failure

2020-01-13

  • 17:48 bd808: Running `puppet ca destroy` for each unsigned cert on tools-puppetmaster-01 (sketched below) (T242642)
  • 16:42 bd808: Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. T242559
  • 16:33 bd808: Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. T242559
  • 16:31 bd808: Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. T242559
  • 16:26 bd808: Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. T242559
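
A hedged sketch of that cert cleanup; the parsing of `puppet cert list` output (pending requests with quoted hostnames) is an assumption.

```bash
# Hedged sketch of destroying the unsigned cert requests (T242642).
for cert in $(sudo puppet cert list 2>/dev/null | awk -F'"' '{print $2}'); do
    sudo puppet ca destroy "$cert"
done
```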

2020-01-12

  • 22:31 Krenair: same on -13 and -14
  • 22:28 Krenair: same on -8
  • 22:18 Krenair: same on -7
  • 22:11 Krenair: Did usual new instance creation puppet dance on tools-k8s-worker-6, /data/project got created

2020-01-11

  • 01:33 bstorm_: updated toollabs-webservice package to 0.57, which should allow persisting mem and cpu in manifests with burstable qos.

2020-01-10

  • 23:31 bstorm_: updated toollabs-webservice package to 0.56
  • 15:45 bstorm_: depooled tools-paws-worker-1013 to reboot because I think it is the last tools server with that mount issue (I hope)
  • 15:35 bstorm_: depooling and rebooting tools-worker-1016 because it still had the leftover mount problems
  • 15:30 bstorm_: git stash-ing local puppet changes in hopes that arturo has that material locally, and it doesn't break anything to do so

2020-01-09

  • 23:35 bstorm_: depooled tools-sgeexec-0939, which isn't acting right, and rebooting it
  • 18:26 bstorm_: re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs T242353
  • 18:25 bstorm_: re-joining the k8s nodes to the cluster one at a time to rotate the certs T242353
  • 18:06 bstorm_: rebooting tools-paws-master-01 T242353
  • 17:46 bstorm_: refreshing the paws cluster's entire x509 environment T242353

2020-01-07

  • 22:40 bstorm_: rebooted tools-worker-1007 to recover it from disk full and general badness
  • 16:33 arturo: deleted pod metrics/cadvisor-5pd46 by hand because prometheus was having issues scraping it
  • 15:46 bd808: Rebooting tools-k8s-worker-[6-14]
  • 15:35 bstorm_: changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster T242067
  • 14:02 arturo: `root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server` (T241853)
  • 13:33 arturo: upload docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0 copied from quay.io/coreos/kube-state-metrics:v1.8.0 (T241853)
  • 13:31 arturo: upload docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6 copied from k8s.gcr.io/metrics-server-amd64:v0.3.6 (mirroring sketched below) (T241853)
  • 13:23 arturo: [new k8s] doing changes to kube-state-metrics and metrics-server trying to relocate them to the 'metrics' namespace (T241853)
  • 05:28 bd808: Creating tools-k8s-worker-[6-14] (again)
  • 05:20 bd808: Deleting busted tools-k8s-worker-[6-14]
  • 05:02 bd808: Creating tools-k8s-worker-[6-14]
  • 00:26 bstorm_: repooled tools-sgewebgrid-lighttpd-0919
  • 00:17 bstorm_: repooled tools-sgewebgrid-lighttpd-0918
  • 00:15 bstorm_: moving tools-sgewebgrid-lighttpd-0918 and -0919 to cloudvirt1004 from cloudvirt1029 to rebalance load
  • 00:02 bstorm_: depooled tools-sgewebgrid-lighttpd-0918 and 0919 to move to cloudvirt1004 to improve spread
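
A hedged sketch of mirroring those upstream images into the Toolforge registry, assuming a plain pull/tag/push from a docker-capable host.

```bash
# Hedged sketch of the image mirroring (T241853).
docker pull k8s.gcr.io/metrics-server-amd64:v0.3.6
docker tag  k8s.gcr.io/metrics-server-amd64:v0.3.6 \
            docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6
docker push docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6

docker pull quay.io/coreos/kube-state-metrics:v1.8.0
docker tag  quay.io/coreos/kube-state-metrics:v1.8.0 \
            docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0
docker push docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0
```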

2020-01-06

  • 23:40 bd808: Deleted tools-sgewebgrid-lighttpd-09{0[1-9],10}
  • 23:36 bd808: Shutdown tools-sgewebgrid-lighttpd-09{0[1-9],10}
  • 23:34 bd808: Decommissioned tools-sgewebgrid-lighttpd-09{0[1-9],10}
  • 23:13 bstorm_: Repooled tools-sgeexec-0922 because I don't know why it was depooled
  • 23:01 bd808: Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
  • 22:58 bd808: Depooling tools-sgewebgrid-lighttpd-090[2-9]
  • 22:57 bd808: Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
  • 21:07 bd808: Restarted kube2proxy on tools-proxy-05 to try and refresh admin tool's routes
  • 18:54 bstorm_: edited /etc/fstab to remove NFS and unmounted the NFS volumes on tools-k8s-haproxy-1 (see the sketch below) T241908
  • 18:49 bstorm_: edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 T241908
  • 18:47 bstorm_: added mount_nfs=false to tools-k8s-haproxy puppet prefix T241908
  • 18:24 bd808: Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
  • 16:42 bstorm_: failed sge-shadow-master back to the main grid master
  • 16:42 bstorm_: Removed files for old S1tty that wasn't working on sge-grid-master
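
A hedged sketch of dropping NFS from the haproxy nodes; the fstab filter and mount points are assumptions based on the usual Toolforge layout.

```bash
# Hedged sketch of removing the NFS mounts (T241908); paths are assumptions.
sudo sed -i '/\/mnt\/nfs\//d' /etc/fstab
sudo umount -l /mnt/nfs/labstore-secondary-tools-project \
               /mnt/nfs/labstore-secondary-tools-home
# mount_nfs: false was also set on the tools-k8s-haproxy puppet prefix so that
# puppet does not re-create the mounts.
```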

2020-01-04

  • 18:11 bd808: Shutdown tools-worker-1029
  • 18:10 bd808: kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
  • 18:06 bd808: Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
  • 16:54 bstorm_: moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
  • 16:47 bstorm_: moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
  • 16:16 bd808: Draining tools-worker-10{05,12,28} due to hardware errors (T241884)
  • 16:13 arturo: moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:11 arturo: moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:09 arturo: moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:08 arturo: moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:07 arturo: moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:06 arturo: moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:04 arturo: moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:02 arturo: moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241873)

2020-01-03

  • 16:48 bstorm_: updated the ValidatingWebhookConfiguration for the ingress admission controller to the working settings
  • 11:51 arturo: [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 (T237643)
  • 11:21 arturo: upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for T237643
  • 03:04 bd808: Really rebuilding all {jessie,stretch,buster}-sssd images. Last time I forgot to actually update the git clone.
  • 00:11 bd808: Rebuilding all stretch-sssd Docker images to pick up busybox

2020-01-02

  • 23:54 bd808: Rebuilding all buster-sssd Docker images to pick up busybox