Nova Resource:Tools/SAL/Archive 4

2021-12-31

  • 19:48 taavi: reset grid error status on webgrid-lighttpd@tools-sgewebgrid-lighttpd-0915
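
Clearing a grid error state like this is done with the gridengine admin tools on the grid master. A minimal sketch, assuming admin rights there (queue and host names taken from the entry above):

    sudo qstat -f -explain E                                          # list queue instances in error state and why
    sudo qmod -c 'webgrid-lighttpd@tools-sgewebgrid-lighttpd-0915'    # clear the error state on that queue instance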

2021-12-28

  • 20:31 taavi: restarting acme-chief to debug T298353

2021-12-24

  • 07:58 majavah: cleared error state from 4 webgrid-lighttpd nodes

2021-12-23

  • 22:57 bd808: Marked tool stang for deletion (T296496)
  • 22:57 bd808: Marked tool wplist for deletion (T295523)
  • 22:56 bd808: Marked tool antigng for deletion (T294708)
  • 22:55 bd808: Marked tool ytrb for deletion (T291909)
  • 22:54 bd808: Marked tool geolink for deletion (T291801)
  • 22:54 bd808: Marked tool wmf-task-samtar for deletion (T286622)
  • 22:53 bd808: Marked tool coi for deletion (T286619)
  • 22:52 bd808: Marked tool abusereport for deletion (T286618)
  • 22:51 bd808: Marked tool chi for deletion (T282702)
  • 22:43 bd808: Marked tool algo-news for deletion (T280444)
  • 22:43 bd808: Marked tool ircclient for deletion (T279209)
  • 22:42 bd808: Marked tool vagrant-test for deletion (T279209)
  • 22:42 bd808: Marked tool vagrant2 for deletion (T279209)
  • 22:42 bd808: Marked tool testwiki for deletion (T279209)
  • 22:41 bd808: Marked tool zoranzoki21wiki for deletion (T279209)
  • 22:41 bd808: Marked tool zoranzoki21bot for deletion (T279209)
  • 22:40 bd808: Marked tool filesearch for deletion (T279209)
  • 22:40 bd808: Marked tool sourceror for deletion (T275690)
  • 22:39 bd808: Marked tool move for deletion (T270535)
  • 22:38 bd808: Marked tool hastagwatcher for deletion (T270534)
  • 22:37 bd808: Marked tool outreacy-wikicv for deletion (T270532)
  • 22:36 bd808: Marked tool dawiki for deletion (T270105)
  • 22:33 bd808: Marked tool rubinbot3 for deletion (T266963)
  • 22:32 bd808: Marked tool rubinbot2 for deletion (T266963)
  • 22:32 bd808: Marked tool rubinbot for deletion (T266963)
  • 22:31 bd808: Marked tool google-drive-photos-to-commons for deletion (T259870)
  • 22:30 bd808: Marked tool wdqs-wmil-tutorial for deletion (T258394)
  • 22:29 bd808: Marked tool base-encode for deletion (T258340)
  • 22:28 bd808: Marked tool wikidata-exports for deletion (T255192)
  • 22:27 bd808: Marked tool oar for deletion (T254044)
  • 22:27 bd808: Marked tool wmde-uca-test for deletion (T249089)
  • 22:26 bd808: Marked tool fastilybot for deletion (T248248)
  • 22:25 bd808: Marked tool mtc-rest for deletion (T248247)
  • 22:24 bd808: Marked tool squirrelnest-upf for deletion (T248235)
  • 22:23 bd808: Marked tool wikibase-databridge-storybook for deletion (T245026)
  • 22:22 bd808: Marked tool draft-uncategorize-script for deletion (T236646)
  • 22:21 bd808: Marked tool maplink-generator for deletion (T231766)
  • 22:20 bd808: Marked tool rhinosf1-afdclose for deletion (T225838)
  • 22:18 bd808: Marked tool asdf for deletion (T223699)
  • 22:17 bd808: Marked tool basyounybot for deletion (T218524)
  • 22:14 bd808: Marked tool design-research-methods for deletion (T218523)
  • 22:12 bd808: Marked tool he-wiktionary-rule-checker for deletion (T218500)
  • 22:11 bd808: Marked tool outofband for deletion (T218382)
  • 22:10 bd808: Marked tool sync-badges for deletion (T218187)
  • 22:09 bd808: Marked tool grafana-json-datasource for deletion (T218075)
  • 22:08 bd808: Marked tool gsociftttdev for deletion (T217478)
  • 22:04 bd808: Marked tool wikipagestats for deletion (T216970)
  • 22:02 bd808: Marked tool bd808-test4 for deletion (T216440)
  • 22:02 bd808: Marked tool bd808-test3 for deletion (T216439)
  • 22:01 bd808: Marked tool tei2wikitext for deletion (T216427)
  • 22:00 bd808: Marked tool projetpp for deletion (T216427)
  • 22:00 bd808: Marked tool ppp-sparql for deletion (T216427)
  • 21:59 bd808: Marked tool platypus-qa for deletion (T216427)
  • 21:59 bd808: Marked tool creatorlinks for deletion (T216427)
  • 21:58 bd808: Marked tool corenlp for deletion (T216427)
  • 21:57 bd808: Marked tool strikertest2017-08-23 for deletion (T216211)
  • 21:46 bd808: Marked tool languagetool for deletion (T215734)
  • 21:45 bd808: Marked tool gdk-artists-research for deletion (T214495)
  • 21:44 bd808: Marked tool phragile for deletion (T214495)
  • 21:44 bd808: Marked tool commons-mass-upload for deletion (T214495)
  • 21:43 bd808: Marked tool wmde-uca-test for deletion (T214495)
  • 21:43 bd808: Marked tool wmde-editconflict-test for deletion (T214495)
  • 21:42 bd808: Marked tool wmde-inline-movedparagraphs for deletion (T214495)
  • 21:41 bd808: Marked tool prometheus for deletion (T211972)
  • 21:40 bd808: Marked tool quentinv57-tools for deletion (T210829)
  • 21:38 bd808: Marked tool addbot for deletion (T208427)
  • 21:37 bd808: Marked tool addshore-dev for deletion (T208427)
  • 21:37 bd808: Marked tool addshore for deletion (T208427)
  • 21:36 bd808: Marked tool miraheze-notifico for deletion (T203124)
  • 21:34 bd808: Marked tool mh-signbot for deletion (T202946)
  • 21:33 bd808: Marked tool messenger-chatbot for deletion (T198808)
  • 21:22 bd808: Marked tool harvesting-data-rafinery for deletion (T197214)
  • 21:21 bd808: Marked tool miraheze-discord-irc for deletion (T192410)
  • 21:20 bd808: Marked tool sau226-wiki-bug-testing for deletion (T188608)
  • 21:18 bd808: Marked tool kmlexport-cswiki for deletion (T186916)
  • 21:17 bd808: Marked tool www-portal-builder for deletion (T182140)
  • 21:15 bd808: Marked tool recoin-sample for deletion (T181541)
  • 21:13 bd808: Marked tool wlm-jury-yarl for deletion (T172590)
  • 21:12 bd808: Marked tool wlm-jury-at for deletion (T172590)
  • 19:43 bd808: Marked tool yunomi for deletion (T170070)
  • 19:42 bd808: Marked tool datbotcommons for deletion (T164662)
  • 19:40 bd808: Marked tool ut-iw-bot for deletion (T158303)
  • 19:39 bd808: Marked tool hujibot for deletion (T157916)
  • 19:37 bd808: Marked tool contributions-summary for deletion (T157749)
  • 19:35 bd808: Marked tool morebots for deletion (T157399)
  • 19:32 bd808: Marked tool rcm for deletion (T136216)

2021-12-20

  • 18:01 majavah: deploying calico v3.21.0 (T292698)
  • 12:17 arturo: running `aborrero@tools-sgegrid-master:~$ sudo grid-configurator --all-domains` after merging a few patches to the script to handle dead config

2021-12-14

  • 09:46 majavah: testing delete-crashing-pods emailer component with a test tool T292925

2021-12-08

  • 05:21 andrewbogott: moving tools-k8s-etcd-13 to cloudvirt1028

2021-12-07

  • 11:11 arturo: updated member roles in github.com/toolforge: remove brooke as owner, add dcaro

2021-12-06

  • 13:23 majavah: root@toolserver-proxy-01:~# systemctl restart apache2.service # working around T293826

2021-12-04

  • 12:18 majavah: deploying delete-crashing-pods in dry run mode T292925

2021-11-28

  • 17:46 andrewbogott: moving tools-k8s-etcd-13 to cloudvirt1020; cloudvirt1018 (its old host) has a degraded raid which is affecting performance

2021-11-19

  • 13:16 majavah: manually add 3 project members after ldap issues were fixed

2021-11-16

  • 12:31 majavah: uploading calico 3.21.0 to the internal docker registry T292698
  • 10:28 majavah: deploying maintain-kubeusers changes T286857

2021-11-11

  • 10:50 arturo: add user `srv-networktests` as project user (T294955)

2021-11-05

  • 19:18 majavah: deploying registry-admission changes

2021-10-29

  • 23:58 andrewbogott: deleting all files older than 14 days in /srv/tools/shared/tools/project/.shared/cache

2021-10-28

  • 12:42 arturo: set `allow-snippet-annotations: "false"` for ingress-nginx (T294330)

2021-10-26

  • 18:00 majavah: deleting legacy ingresses for tools.wmflabs.org urls
  • 12:26 majavah: deploy ingress-admission updates
  • 12:11 majavah: deploy ingress-nginx v1.0.4 / chart v4.0.6 on toolforge T292771

2021-10-25

  • 14:33 majavah: copy nginx-ingress controller v1.0.4 to internal registry T292771
  • 11:32 majavah: depool tools-sgeexec-0910 T294228
  • 11:13 majavah: removed tons of duplicate qw jobs across multiple tools

2021-10-22

  • 15:35 majavah: remove "^tools-k8s-master-[0-9]+\.tools\.eqiad\.wmflabs$" from authorized_regexes for the main certificate
  • 15:35 majavah: add mail.tools.wmcloud.org to the tools mail tls certificate alternative names

2021-10-21

  • 09:48 majavah: deploying toolforge-webservice 0.79

2021-10-20

  • 15:41 majavah: removing toollabs-webservice from grid exec and master nodes where it's not needed and not managed by puppet
  • 12:51 majavah: rolling out toolforge-webservice 0.78 T292706 T282975 T276626

2021-10-15

  • 15:01 arturo: add updated ingress-nginx docker image in the registry (v1.0.1) for T293472

2021-10-07

  • 09:13 majavah: disabling settings api, now that all pod presets are gone T279106
  • 08:00 majavah: removing all pod presets T279106
  • 05:44 majavah: deploying fix for T292672

2021-10-06

  • 06:46 majavah: taavi@toolserver-proxy-01:~$ sudo systemctl restart apache2.service # see if it helps with toolserver.org ssl alerts

2021-10-03

  • 21:31 bstorm: rebuilding buster containers since they are also affected T291387 T292355
  • 21:29 bstorm: rebuilt stretch containers for potential issues with LE cert updates T291387

2021-10-01

  • 21:59 bd808: clush -w @all -b 'sudo sed -i "s#mozilla/DST_Root_CA_X3.crt#!mozilla/DST_Root_CA_X3.crt#" /etc/ca-certificates.conf && sudo update-ca-certificates' for T292289

2021-09-30

  • 13:43 majavah: cleaning up unused kubernetes ingress objects for tools.wmflabs.org urls T292105

2021-09-29

  • 22:39 bstorm: finished deploy of the toollabs-webservice 0.77 and updating labels across the k8s cluster to match
  • 22:26 bstorm: pushing toollabs-webservice 0.77 to tools releases
  • 21:46 bstorm: pushing toollabs-webservice 0.77 to toolsbeta

2021-09-27

  • 16:19 majavah: deploy volume-admission fix for containers for some volumes mounted
  • 13:01 majavah: publish jobutils and misctools 0.43 T286072
  • 11:34 majavah: disabling pod preset controller T279106

2021-09-23

  • 17:20 majavah: deploying new maintain-kubeusers for lack of podpresets T279106

2021-09-22

  • 18:06 bstorm: launching tools-nfs-test-client-01 to run a "fair" test battery against T291406
  • 11:37 dcaro: controlled undrain tools-k8s-worker-53 (T291546)
  • 08:57 majavah: drain tools-k8s-worker-53
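
Draining and undraining a Kubernetes worker, as in the two entries above, is normally a pair of kubectl commands run from a control node. A minimal sketch (node name from the entries):

    kubectl drain tools-k8s-worker-53 --ignore-daemonsets   # evict pods and cordon the node
    # ...perform the maintenance...
    kubectl uncordon tools-k8s-worker-53                    # allow scheduling again ("undrain")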

2021-09-20

  • 12:44 majavah: deploying volume-admission to tools, should not affect anything yet T279106

2021-09-15

  • 08:08 majavah: update tools-manifest to 0.24

2021-09-14

  • 10:36 arturo: add toolforge-jobs-framework-cli v5 to aptly buster-tools/toolsbeta

2021-09-13

  • 08:57 arturo: cleared grid queues error states (T290844)
  • 08:55 arturo: repooling sgeexec-0907 (T290798)
  • 08:14 arturo: rebooting sgeexec-0907 (T290798)
  • 08:12 arturo: depool sgeexec-0907 (T290798)
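
The depool/reboot/repool cycle above can be expressed with plain gridengine commands; Toolforge also has wrapper tooling for this, so treat the following as an illustrative sketch only:

    sudo qmod -d '*@tools-sgeexec-0907'   # depool: disable every queue instance on the node
    sudo reboot                           # on the node itself, once running jobs have drained
    sudo qmod -e '*@tools-sgeexec-0907'   # repool: re-enable the queues
    sudo qmod -c '*@tools-sgeexec-0907'   # clear any leftover error states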

2021-09-11

  • 08:51 majavah: depool tools-sgeexec-0907

2021-09-10

  • 23:26 bstorm: cleared error state for tools-sgeexec-0907.tools.eqiad.wmflabs
  • 12:00 arturo: shutdown tools-package-builder-03 (buster), leave -04 online (bullseye)
  • 09:35 arturo: live-hacking tools puppetmaster with a couple of ops/puppet changes
  • 07:54 arturo: created bullseye VM tools-package-builder-04 (T273942)

2021-09-09

  • 16:20 arturo: 70017ec0ac root@tools-k8s-control-3:~# kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml

2021-09-07

  • 15:27 majavah: rolling out python3-prometheus-client updates
  • 14:41 majavah: manually removing some absented but still present crontabs to stop root@ spam

2021-09-06

  • 16:31 arturo: deploying jobs-framework-cli v4
  • 16:22 arturo: deploying jobs-framework-api 3228d97

2021-09-03

  • 22:36 bstorm: backfilling quotas in screen for T286784
  • 12:49 majavah: deploying new tools-manifest version

2021-09-02

  • 01:02 bstorm: deployed new version of maintain-kubeusers with new count quotas for new tools T286784

2021-08-20

  • 19:10 majavah: rebuilding node12-sssd/{base,web} to use debian packaged npm 7
  • 18:42 majavah: rebuilding php74-sssd/{base,web} to use composer 2

2021-08-18

  • 21:32 bstorm: rebooted tools-sgecron-01 due to RAM filling up and killing everything
  • 16:34 bstorm: deleting the sssd cache on tools-sgecron-01 to fix a peculiar passwd db issue

2021-08-16

  • 17:00 majavah: remove and re-add toollabs-webservice 0.75 on stretch-toolsbeta repository
  • 15:45 majavah: reset sul account mapping on striker for developer account "DutchTom" T288969
  • 14:19 majavah: building node12 images - T284590 T243159

2021-08-15

  • 17:30 majavah: deploying updated jobs-framework-api container list to include bullseye images
  • 17:22 majavah: finished initial build of images: php74, jdk17, python39, ruby27 - T284590
  • 16:51 majavah: starting build of initial bullseye based images - T284590
  • 16:44 majavah: tagged and building toollabs-webservice 0.76 with bullseye images defined T284590
  • 15:14 majavah: building tools-webservice 0.74 (currently live version) to bullseye-tools and bullseye-toolsbeta

2021-08-12

  • 16:59 bstorm: deployed updated manifest for ingress-admission
  • 16:45 bstorm: restarted ingress admission pods in tools after testing in toolsbeta
  • 16:27 bstorm: updated the docker image for docker-registry.tools.wmflabs.org/ingress-admission:latest
  • 16:22 bstorm: rebooting tools-docker-registry-05 after exchanging uids for puppet and docker-registry

2021-08-07

  • 05:59 majavah: restart nginx on toolserver-proxy-01 to see if that helps with the flapping icinga certificate expiry check

2021-08-06

  • 16:17 bstorm: failed over to tools-docker-registry-06 (which has more space) T288229
  • 00:43 bstorm: set up sync between the new registry host and the existing one T288229
  • 00:21 bstorm: provisioning second docker registry server to rsync to (120GB disk and fairly large server) T288229

2021-08-05

  • 23:50 bstorm: rebooting the docker registry T288229
  • 23:04 bstorm: extended docker registry volume to 120GB T288229

2021-07-29

  • 18:04 majavah: reset sul account mapping on striker for developer account "Derek Zax" T287369

2021-07-28

  • 21:33 majavah: add mdipietro as projectadmin and to sudo policy T287287

2021-07-27

  • 16:20 bstorm: built new php images with python2 on board T287421
  • 00:04 bstorm: deploy a version of the php7.3 web image that includes the python2 package with tag :testing T287421

2021-07-26

  • 17:37 bstorm: repooled the whole set of ingress workers after upgrades T280340
  • 16:37 bstorm: removing tools-k8s-ingress-4 from active ingress nodes at the proxy T280340

2021-07-23

  • 07:15 majavah: restart nginx on tools-static-14 to see if it helps with fontcdn issues

2021-07-22

  • 23:35 bstorm: deleted tools-sgebastion-09 since it has been shut off since March anyway
  • 15:32 arturo: re-deploying toolforge-jobs-framework-api
  • 15:30 arturo: pushed new docker image on the registry for toolforge-jobs-framework-api 4d8235b (T287077)

2021-07-21

  • 20:01 bstorm: deployed new maintain-kubeusers to toolforge T285011
  • 19:55 bstorm: deployed new rbac for maintain-kubeusers changes T285011
  • 17:10 majavah: deploying calico v3.18.4 T280342
  • 14:35 majavah: updating systemd on toolforge stretch bastions T287036
  • 11:59 arturo: deploying jobs-framework-api 07346d7 (T286108)
  • 11:04 arturo: enabling TTLAfterFinished feature gate on kubeadm live configmap (T286108)
  • 11:01 arturo: enabling TTLAfterFinished feature gate on static pod manifests on /etc/kubernetes/manifests/kube-{apiserver,controller-manager}.yaml in all 3 control nodes (T286108)

2021-07-20

  • 18:42 majavah: deploying systemd security updates on toolforge public stretch machines T287004
  • 17:45 arturo: pushed new toolforge-jobs-framework-api docker image into the registry (3a6ae38) (T286126)
  • 17:37 arturo: added toolforge-jobs-framework-cli v3 to aptly buster-tools and buster-toolsbeta
  • 13:25 majavah: apply buster systemd security updates

2021-07-19

  • 23:24 bstorm: applied matchPolicy: equivalent to tools ingress validation controller T280360
  • 16:43 bstorm: cleared queue error state caused by excessive resource use by topicmatcher T282474

2021-07-16

  • 14:04 arturo: deployed jobs-framework-api 42b7a88 (T286132)
  • 11:57 arturo: added toollabs-webservice_0.75_all to jessie-tools aptly repo (T286003)
  • 11:52 arturo: created `jessie-tools` aptly repository on tools-services-05 (T286003)
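
A minimal sketch of the aptly steps above, assuming a shell on tools-services-05 (package filename from the 11:57 entry; the `--skip-signing` form mirrors the invocation used elsewhere in this log):

    sudo aptly repo create jessie-tools                                  # create the new repository
    sudo aptly repo add jessie-tools toollabs-webservice_0.75_all.deb    # add the package to it
    sudo aptly publish --skip-signing repo jessie-tools                  # publish so clients can install from it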

2021-07-14

  • 23:29 bstorm: mounted nfs on tools-services-05 and backing up aptly to NFS dir T286003
  • 09:17 majavah: copying calico 3.18.4 images from docker hub to docker-registry.tools.wmflabs.org T280342
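
Copying an upstream image into the internal registry, as in the calico entry above, is typically a pull/tag/push sequence. A minimal sketch; the exact source and destination image paths are illustrative:

    docker pull docker.io/calico/node:v3.18.4
    docker tag docker.io/calico/node:v3.18.4 docker-registry.tools.wmflabs.org/calico/node:v3.18.4
    docker push docker-registry.tools.wmflabs.org/calico/node:v3.18.4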

2021-07-12

  • 16:56 bstorm: deleted job 4720371 due to LDAP failure
  • 16:51 bstorm: cleared the E state from two job queues

2021-07-02

  • 18:46 bstorm: cleared error state for tools-sgeexec-0940.tools.eqiad.wmflabs

2021-07-01

  • 22:08 bstorm: releasing webservice 0.75
  • 17:03 andrewbogott: rebooting tools-k8s-worker-[31,33,35,44,49,51,57-58,70].tools.eqiad1.wikimedia.cloud
  • 16:47 bstorm: remounted scratch everywhere...but mostly tools T224747
  • 15:47 arturo: rebased labs/private.git
  • 11:04 arturo: added toolforge-jobs-framework-cli_1_all.deb to aptly buster-tools,buster-toolsbeta
  • 10:34 arturo: refreshed jobs-api deployment

2021-06-29

  • 21:58 bstorm: clearing one errored queue and a stack of discarded jobs
  • 20:11 majavah: toolforge kubernetes upgrade complete T280299
  • 17:03 majavah: starting toolforge kubernetes 1.18 upgrade - T280299
  • 16:17 arturo: deployed jobs-framework-api in the k8s cluster
  • 15:34 majavah: remove duplicate definitions from tools-clushmaster-02 /root/.ssh/known_hosts
  • 15:12 arturo: livehacking puppetmaster for T283238
  • 10:24 dcaro: running puppet on the buster bastions after 20000 minutes failing... might break something

2021-06-15

  • 19:02 bstorm: cleared error status from a few queues
  • 16:15 majavah: deleting unused shutdown nodes: tools-checker-03 tools-k8s-haproxy-1 tools-k8s-haproxy-2

2021-06-14

  • 22:21 bstorm: push docker-registry.tools.wmflabs.org/toolforge-python37-sssd-web:testing to test staged os.execv (and other patches) using toolsbeta toollabs-webservice version 0.75 T282975

2021-06-13

  • 08:15 majavah: clear grid error state from tools-sgeexec-0907, tools-sgeexec-0916, tools-sgeexec-0940

2021-06-12

  • 14:39 majavah: remove nonexistent tools-prometheus-04 and add tools-prometheus-05 to hiera key "prometheus_nodes"
  • 13:53 majavah: create empty bullseye-{tools,toolsbeta} repositories on tools-services-05 aptly

2021-06-10

  • 17:38 majavah: clear error state from tools-sgeexec-0907, task@tools-sgeexec-0939

2021-06-09

  • 13:57 majavah: clear error state from exec nodes tools-sgeexec-0913, tools-sgeexec-0936, task@tools-sgeexec-0940

2021-06-04

  • 21:30 bstorm: deleting "tools-k8s-ingress-3", "tools-k8s-ingress-2", "tools-k8s-ingress-1" T264221
  • 21:21 bstorm: cleared error state from 4 grid queues

2021-06-03

  • 18:27 majavah: renew prometheus kubernetes certificate T280301
  • 17:06 majavah: renew admission webhook certificates T280301

2021-06-01

  • 10:10 majavah: properly clean up deleted vms tools-k8s-haproxy-[1,2], tools-checker-03 from puppet after using the wrong fqdn the first time
  • 09:54 majavah: clear error state from tools-sgeexec-0913, tools-sgeexec-0950

2021-05-30

  • 18:58 majavah: clear grid error state from 14 queues

2021-05-27

  • 18:03 bstorm: adjusted profile::wmcs::kubeadm::etcd_latency_ms from 30 back to the default (10)
  • 16:04 bstorm: cleared error state from several exec node queues
  • 14:49 andrewbogott: swapping in three new etcd nodes with local storage: tools-k8s-etcd-13,14,15

2021-05-24

  • 10:36 arturo: rebased labs/private.git after merge conflict
  • 06:49 majavah: remove scfc kubernetes admin access after bd808 removed tools.admin membership to avoid maintain-kubeusers crashes when it expires

2021-05-22

  • 14:47 majavah: manually remove jeh admin certificates and the corresponding entry from the maintain-kubeusers configmap T282725
  • 14:32 majavah: manually remove valhallasw and yuvipanda admin certificates and their entries from the configmap, then restart the maintain-kubeusers pod T282725
  • 02:51 bd808: Restarted nginx on tools-static-14 to see if that clears up the fontcdn 502 errors

2021-05-21

  • 17:06 majavah: unpool tools-k8s-ingress-[4-6]
  • 17:06 majavah: repool tools-k8s-ingress-6
  • 17:02 majavah: repool tools-k8s-ingress-4 and -5
  • 16:59 bstorm: upgrading the ingress-gen2 controllers to release 3 to capture new RAM/CPU limits
  • 16:43 bstorm: resize tools-k8s-ingress-4 to g3.cores4.ram8.disk20
  • 16:43 bstorm: resize tools-k8s-ingress-6 to g3.cores4.ram8.disk20
  • 16:40 bstorm: resize tools-k8s-ingress-5 to g3.cores4.ram8.disk20
  • 16:04 majavah: rollback kubernetes ingress update from front proxy
  • 06:52 Majavah: pool tools-k8s-ingress-6 and depool ingress-[2,3] T264221

2021-05-20

  • 17:05 Majavah: pool tools-k8s-ingress-5 as an ingress node, depool ingress-1 T264221
  • 16:31 Majavah: pool tools-k8s-worker-4 as an ingress node T264221
  • 15:17 Majavah: trying to install ingress-nginx via helm again after adjusting security groups T264221
  • 15:15 Majavah: move tools-k8s-ingress-[5-6] from "tools-k8s-full-connectivity" to "tools-new-k8s-full-connectivity" security group T264221

2021-05-19

  • 12:15 Majavah: rollback ingress-nginx-gen2
  • 11:09 Majavah: deploy helm-based nginx ingress controller v0.46.0 to ingress-nginx-gen2 namespace T264221
  • 10:44 Majavah: create tools-k8s-ingress-[4-6] T264221

2021-05-16

  • 16:52 Majavah: clear error state from tools-sgeexec-0905 tools-sgeexec-0907 tools-sgeexec-0936 tools-sgeexec-0941

2021-05-14

  • 19:18 bstorm: adjusting the rate limits for bastions nfs_write upward a lot to make NFS writes faster now that the cluster is finally using 10Gb on the backend and frontend T218338
  • 16:55 andrewbogott: rebooting toolserver-proxy-01 to clear up stray files
  • 16:47 andrewbogott: deleting log files older than 14 days on toolserver-proxy-01

2021-05-12

  • 19:45 bstorm: cleared error state from some queues
  • 19:05 Majavah: remove phamhi-binding phamhi-view-binding cluster role bindings T282725
  • 19:04 bstorm: deleted the maintain-kubeusers pod to get it up and running fast T282725
  • 19:03 bstorm: deleted phamhi from admin configmap in maintain-kubeusers T282725

2021-05-11

  • 17:17 Majavah: shutdown and delete tools-checker-03 T278540
  • 17:14 Majavah: move floating ip 185.15.56.61 to tools-checker-04
  • 17:12 Majavah: add tools-checker-04 as a grid submit host T278540
  • 16:58 Majavah: add tools-checker-04 to toollabs::checker_hosts hiera key T278540
  • 16:49 Majavah: creating tools-checker-04 with buster T278540
  • 16:32 Majavah: carefully shutdown tools-k8s-haproxy-1 T252239
  • 16:29 Majavah: carefully shutdown tools-k8s-haproxy-2 T252239

2021-05-10

  • 22:58 bstorm: cleared error state on a grid queue
  • 22:58 bstorm: setting `profile::wmcs::kubeadm::docker_vol: false` on ingress nodes
  • 15:22 Majavah: change k8s.svc.tools.eqiad1.wikimedia.cloud. to point to the tools-k8s-haproxy-keepalived-vip address 172.16.6.113 (T252239)
  • 15:06 Majavah: carefully rolling out keepalived to tools-k8s-haproxy-[3-4] while making sure [1-2] do not have changes
  • 15:03 Majavah: clear all error states caused by overloaded exec nodes
  • 14:57 arturo: allow tools-k8s-haproxy-[3-4] to use the tools-k8s-haproxy-keepalived-vip address (172.16.6.113) (T252239)
  • 12:53 Majavah: creating tools-k8s-haproxy-[3-4] to rebuild current ones without nfs and with keepalived

2021-05-09

  • 06:55 Majavah: clear error state from tools-sgeexec-0916

2021-05-08

  • 10:57 Majavah: import docker image k8s.gcr.io/ingress-nginx/controller:v0.46.0 to local registry as docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0 T264221

2021-05-07

  • 18:07 Majavah: generate and add k8s haproxy keepalived password (profile::toolforge::k8s::haproxy::keepalived_password) to private puppet repo
  • 17:15 bstorm: recreated recordset of k8s.tools.eqiad1.wikimedia.cloud as CNAME to k8s.svc.tools.eqiad1.wikimedia.cloud T282227
  • 17:12 bstorm: created A record of k8s.svc.tools.eqiad1.wikimedia.cloud pointing at current cluster with TTL of 300 for quick initial failover when the new set of haproxy nodes are ready T282227
  • 09:44 arturo: `sudo wmcs-openstack --os-project-id=tools port create --network lan-flat-cloudinstances2b tools-k8s-haproxy-keepalived-vip`
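
The DNS changes above go through OpenStack Designate. A minimal sketch with the project-scoped CLI; the record names and 300s TTL come from the entries, while the target address is a placeholder:

    openstack recordset create --type A --record <haproxy-address> --ttl 300 tools.eqiad1.wikimedia.cloud. k8s.svc.tools.eqiad1.wikimedia.cloud.
    openstack recordset create --type CNAME --record k8s.svc.tools.eqiad1.wikimedia.cloud. tools.eqiad1.wikimedia.cloud. k8s.tools.eqiad1.wikimedia.cloud.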

2021-05-06

  • 14:43 Majavah: clear error states from all currently erroring exec nodes
  • 14:37 Majavah: clear error state from tools-sgeexec-0913
  • 04:35 Majavah: add own root key to project hiera on horizon T278390
  • 02:36 andrewbogott: removing jhedden from sudo roots

2021-05-05

  • 19:27 andrewbogott: adding taavi as a sudo root to project toolforge for T278390

2021-05-04

  • 15:23 arturo: upgrading exim4-daemon-heavy in tools-mail-03
  • 10:47 arturo: rebase & resolve merge conflicts in labs/private.git

2021-05-03

  • 16:24 dcaro: started tools-sgeexec-0907, was stuck on initramfs due to an unclean fs (/dev/vda3, root), ran fsck manually fixing all the errors and booted up correctly after (T280641)
  • 14:07 dcaro: depooling tools-sgeexec-0908/7 to be able to restart the VMs as they got stuck during migration (T280641)
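
A minimal sketch of the initramfs recovery described in the 16:24 entry (device name from the entry):

    # at the (initramfs) prompt
    fsck -y /dev/vda3   # repair the unclean root filesystem
    exit                # continue booting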

2021-04-29

  • 18:23 bstorm: removing one more etcd node via cookbook T279723
  • 18:12 bstorm: removing an etcd node via cookbook T279723

2021-04-27

  • 16:40 bstorm: deleted all the errored out grid jobs stuck in queue wait
  • 16:16 bstorm: cleared E status on grid queues to get things flowing again

2021-04-26

  • 12:17 arturo: allowing more tools into the legacy redirector (T281003)

2021-04-22

  • 08:44 Krenair: Removed yuvipanda from roots sudo policy
  • 08:42 Krenair: Removed yuvipanda from projectadmin per request
  • 08:40 Krenair: Removed yuvipanda from tools.admin per request

2021-04-20

  • 22:20 bd808: `clush -w @all -b "sudo exiqgrep -z -i | xargs sudo exim -Mt"`
  • 22:19 bd808: `clush -w @exec -b "sudo exiqgrep -z -i | xargs sudo exim -Mt"`
  • 21:52 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad1.wikimedia.cloud`. Was using wrong domain name in prior update.
  • 21:49 bstorm: tagged the latest maintain-kubeusers and deployed to toolforge (with kustomize changes to rbac) after testing in toolsbeta T280300
  • 21:27 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad.wmflabs`. was -2 which is decommed.
  • 10:18 dcaro: setting the retention on the tools-prometheus VMs to 250GB (they have 276GB total, leaving some space for online data operations if needed) (T279990)

2021-04-19

  • 10:53 dcaro: reverting setting prometheus data source in grafana to 'server'; it can't connect
  • 10:51 dcaro: setting prometheus data source in grafana to 'server' to avoid CORS issues

2021-04-16

  • 23:15 bstorm: cleaned up all source files for the grid with the old domain name to enable future node creation T277653
  • 14:38 dcaro: added 'will get out of space in X days' panel to the dashboard https://grafana-labs.wikimedia.org/goto/kBlGd0uGk (T279990), we got <5days xd
  • 11:35 arturo: running `grid-configurator --all-domains` which basically added tools-sgebastion-10,11 as submit hosts and removed tools-sgegrid-master,shadow as submit hosts

2021-04-15

  • 17:45 bstorm: cleared error state from tools-sgeexec-0920.tools.eqiad.wmflabs for a failed job

2021-04-13

  • 13:26 dcaro: upgrade puppet and python-wmflib on tools-prometheus-03
  • 11:23 arturo: deleted shutoff VM tools-package-builder-02 (T275864)
  • 11:21 arturo: deleted shutoff VM tools-sge-services-03,04 (T278354)
  • 11:20 arturo: deleted shutoff VM tools-docker-registry-03,04 (T278303)
  • 11:18 arturo: deleted shutoff VM tools-mail-02 (T278538)
  • 11:17 arturo: deleted shutoff VMs tools-static-12,13 (T278539)

2021-04-11

  • 16:07 bstorm: cleared E state from tools-sgeexec-0917 tools-sgeexec-0933 tools-sgeexec-0934 tools-sgeexec-0937 from failures of jobs 761759, 815031, 815056, 855676, 898936

2021-04-08

  • 18:25 bstorm: cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for tools-sgegrid-master and tools-sgegrid-shadow using the old fqdns T277653
  • 09:24 arturo: allocate & associate floating IP 185.15.56.122 for tools-sgebastion-11, also with DNS A record `dev-buster.toolforge.org` (T275865)
  • 09:22 arturo: create DNS A record `login-buster.toolforge.org` pointing to 185.15.56.66 (tools-sgebastion-10) (T275865)
  • 09:20 arturo: associate floating IP 185.15.56.66 to tools-sgebastion-10 (T275865)
  • 09:13 arturo: created tools-sgebastion-11 (buster) (T275865)

2021-04-07

  • 04:35 andrewbogott: replacing the mx record '10 mail.tools.wmcloud.org' with '10 mail.tools.wmcloud.org.' — trying to fix axfr for the tools.wmcloud.org zone

2021-04-06

  • 15:16 bstorm: cleared queue state since a few had "errored" for failed jobs.
  • 12:59 dcaro: Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)
  • 11:45 arturo: upgrading jobutils & misctools to 1.42 everywhere
  • 11:39 arturo: cleaning up aptly: old package versions, old repos (jessie, trusty, precise) etc
  • 10:31 dcaro: Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs (T267082)
  • 10:21 arturo: published jobutils & misctools 1.42 (T278748)
  • 10:21 arturo: published jobutils & misctools 1.42
  • 10:21 arturo: aptly repo had some weirdness due to the cinder volume: hardlinks created by aptly were broken, solved with `sudo aptly publish --skip-signing repo stretch-tools -force-overwrite`
  • 10:07 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
  • 10:05 arturo: installed aptly from buster-backports on tools-services-05 to see if that makes any difference with an issue when publishing repos
  • 09:53 dcaro: Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs (T267082)
  • 08:55 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
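
The member removals in this block can also be done by hand with etcdctl from one of the etcd nodes; the cookbook wraps the same steps. A minimal sketch in which the endpoint, certificate paths and member ID are illustrative:

    # run as root on an etcd node
    export ETCDCTL_API=3
    etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/client.pem --key=/etc/etcd/ssl/client.key member list               # find the ID of the member to drop
    etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/client.pem --key=/etc/etcd/ssl/client.key member remove <MEMBER_ID>  # remove it from the cluster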

2021-04-05

  • 17:02 bstorm: chowned the data volume for the docker registry to docker-registry:docker-registry
  • 09:56 arturo: make jhernandez (IRC joakino) projectadmin (T278975)

2021-04-01

  • 20:43 bstorm: cleared error state from the grid queues caused by unspecified job errors
  • 15:53 dcaro: Removed etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs, adding a new member (T267082)
  • 15:43 dcaro: Removing etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs (T267082)
  • 15:36 dcaro: Added new etcd member tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud (T267082)
  • 15:18 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)

2021-03-31

  • 15:57 arturo: rebooting `tools-mail-03` after enabling NFS (T267082, T278538)
  • 15:04 arturo: created MX record for `tools.wmcloud.org` pointing to `mail.tools.wmcloud.org`
  • 15:03 arturo: created DNS A record `mail.tools.wmcloud.org` pointing to 185.15.56.63
  • 14:56 arturo: shutoff tools-mail-02 (T278538)
  • 14:55 arturo: point floating IP 185.15.56.63 to tools-mail-03 (T278538)
  • 14:45 arturo: created VM `tools-mail-03` as Debian Buster (T278538)
  • 14:39 arturo: relocate some of the hiera keys for email server from project-level to prefix
  • 09:44 dcaro: running disk performance test on etcd-4 (round2)
  • 09:05 dcaro: running disk performance test on etcd-8
  • 08:43 dcaro: running disk performance test on etcd-4

2021-03-30

  • 16:15 bstorm: added `labstore::traffic_shaping::egress: 800mbps` to tools-static prefix T278539
  • 15:44 arturo: shutoff tools-static-12/13 (T278539)
  • 15:41 arturo: point horizon web proxy `tools-static.wmflabs.org` to tools-static-14 (T278539)
  • 15:37 arturo: add `mount_nfs: true` to tools-static prefix (T278539)
  • 15:26 arturo: create VM tools-static-14 with Debian Buster image (T278539)
  • 12:19 arturo: introduce horizon proxy `deb-tools.wmcloud.org` (T278436)
  • 12:15 arturo: shutdown tools-sgebastion-09 (stretch)
  • 11:05 arturo: created VM `tools-sgebastion-10` as Debian Buster (T275865)
  • 11:04 arturo: created server group `tools-bastion` with anti-affinity policy

2021-03-28

  • 19:31 legoktm: legoktm@tools-sgebastion-08:~$ sudo qdel -f 9999704 # T278645

2021-03-26

  • 12:21 arturo: shutdown tools-package-builder-02 (stretch), we keep -03 which is buster (T275864)

2021-03-25

  • 19:30 bstorm: forced deletion of all jobs stuck in a deleting state T277653
  • 17:46 arturo: rebooting tools-sgeexec-* nodes to account for new grid master (T277653)
  • 16:20 arturo: rebuilding tools-sgegrid-master VM as debian buster (T277653)
  • 16:18 arturo: icinga-downtime toolschecker for 2h
  • 16:05 bstorm: failed over the tools grid to the shadow master T277653
  • 13:36 arturo: shutdown tools-sge-services-03 (T278354)
  • 13:33 arturo: shutdown tools-sge-services-04 (T278354)
  • 13:31 arturo: point aptly clients to `tools-services-05.tools.eqiad1.wikimedia.cloud` (hiera change) (T278354)
  • 12:58 arturo: created VM `tools-services-05` as Debian Buster (T278354)
  • 12:51 arturo: create cinder volume `tools-aptly-data` (T278354)

2021-03-24

  • 12:46 arturo: shutoff the old stretch VMs `tools-docker-registry-03` and `tools-docker-registry-04` (T278303)
  • 12:38 arturo: associate floating IP 185.15.56.67 with `tools-docker-registry-05` and refresh FQDN docker-registry.tools.wmflabs.org accordingly (T278303)
  • 12:33 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-05` (T278303)
  • 12:32 arturo: snapshot cinder volume `tools-docker-registry-data` into `tools-docker-registry-data-stretch-migration` (T278303)
  • 12:32 arturo: bump cinder storage quota from 80G to 400G (without quota request task)
  • 12:11 arturo: created VM `tools-docker-registry-06` as Debian Buster (T278303)
  • 12:09 arturo: detach cinder volume `tools-docker-registry-data` (T278303)
  • 11:46 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-03` to format it and pre-populate it with registry data (T278303)
  • 11:20 arturo: created 80G cinder volume tools-docker-registry-data (T278303)
  • 11:10 arturo: starting VM tools-docker-registry-04 which was stopped probably since 2021-03-09 due to hypervisor draining
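
A minimal sketch of the cinder volume shuffle described in this block, using the project-scoped OpenStack CLI (volume, snapshot and server names come from the entries above):

    openstack volume create --size 80 tools-docker-registry-data                         # create the data volume
    openstack server add volume tools-docker-registry-03 tools-docker-registry-data      # attach to the old VM to pre-populate it
    openstack server remove volume tools-docker-registry-03 tools-docker-registry-data   # detach again
    openstack volume snapshot create --volume tools-docker-registry-data tools-docker-registry-data-stretch-migration
    openstack server add volume tools-docker-registry-05 tools-docker-registry-data      # attach to the new buster VM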

2021-03-23

  • 12:46 arturo: aborrero@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
  • 12:16 arturo: delete & re-create VM tools-sgegrid-shadow as Debian Buster (T277653)
  • 12:14 arturo: created puppet prefix 'tools-sgegrid-shadow' and migrated puppet configuration from VM-puppet
  • 12:13 arturo: created server group 'tools-grid-master-shadow' with anti-affinity policy

2021-03-18

  • 19:24 bstorm: set profile::toolforge::infrastructure across the entire project with login_server set on the bastion and exec node-related prefixes
  • 16:21 andrewbogott: enabling puppet tools-wide
  • 16:20 andrewbogott: disabling puppet tools-wide to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456 (see the sketch after this list)
  • 16:19 bstorm: added profile::toolforge::infrastructure class to puppetmaster T277756
  • 04:12 bstorm: rebooted tools-sgeexec-0935.tools.eqiad.wmflabs because it forgot how to LDAP...likely root cause of the issues tonight
  • 03:59 bstorm: rebooting grid master. sorry for the cron spam
  • 03:49 bstorm: restarting sssd on tools-sgegrid-master
  • 03:37 bstorm: deleted a massive number of stuck jobs that misfired from the cron server
  • 03:35 bstorm: rebooting tools-sgecron-01 to try to clear up the ldap-related errors coming out of it
  • 01:46 bstorm: killed the toolschecker cron job, which had an LDAP error, and ran it again by hand
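
Fleet-wide puppet toggles like the 16:20/16:21 entries are typically driven with clush from the clushmaster, as elsewhere in this log. A minimal sketch; the disable reason is illustrative:

    clush -w @all -b 'sudo puppet agent --disable "testing gerrit 672456"'   # pause puppet while the change is tested
    clush -w @all -b 'sudo puppet agent --enable && sudo puppet agent -t'    # re-enable and run it everywhere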

2021-03-17

  • 20:57 bstorm: deployed changes to rbac for kubernetes to add kubectl top access for tools
  • 20:26 andrewbogott: moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv

2021-03-16

  • 16:31 arturo: installing jobutils and misctools 1.41
  • 15:55 bstorm: deleted a bunch of messed up grid jobs (9989481,8813,81682,86317,122602,122623,583621,606945,606999)
  • 12:32 arturo: add packages jobutils / misctools v1.41 to {stretch,buster}-tools aptly repository in tools-sge-services-03

2021-03-12

  • 23:13 bstorm: cleared error state for all grid queues

2021-03-11

  • 17:40 bstorm: deployed metrics-server:0.4.1 to kubernetes
  • 16:21 bstorm: add jobutils 1.40 and misctools 1.40 to stretch-tools
  • 13:11 arturo: add misctools 1.37 to buster-tools|toolsbeta aptly repo for T275865
  • 13:10 arturo: add jobutils 1.40 to buster-tools aptly repo for T275865

2021-03-10

  • 10:56 arturo: briefly stopped VM tools-k8s-etcd-7 to disable VMX cpu flag

2021-03-09

  • 13:31 arturo: hard-reboot tools-docker-registry-04 because of issues related to T276922
  • 12:34 arturo: briefly rebooting VM tools-docker-registry-04; we need to reboot the hypervisor cloudvirt1038 and it failed to migrate away

2021-03-05

  • 12:30 arturo: started tools-redis-1004 again
  • 12:22 arturo: stop tools-redis-1004 to ease draining of cloudvirt1035

2021-03-04

  • 11:25 arturo: rebooted tools-sgewebgrid-generic-0901, repool it again
  • 09:58 arturo: depool tools-sgewebgrid-generic-0901 to reboot VM. It was stuck in MIGRATING state when draining cloudvirt1022

2021-03-03

  • 15:17 arturo: shutting down tools-sgebastion-07 in an attempt to fix nova state and finish hypervisor migration
  • 15:11 arturo: tools-sgebastion-07 triggered a neutron exception (unauthorized) while being live-migrated from cloudvirt1021 to 1029. Resetting nova state with `nova reset-state bd685d48-1011-404e-a755-372f6022f345 --active` and trying again
  • 14:48 arturo: killed pywikibot instance running in tools-sgebastion-07 by user msyn

2021-03-02

  • 15:23 bstorm: depooling tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs for reboot. It isn't communicating right
  • 15:22 bstorm: cleared queue error states...will need to keep a better eye on what's causing those

2021-02-27

  • 02:23 bstorm: deployed typo fix to maintain-kubeusers in an innocent effort to make the weekend better T275910
  • 02:00 bstorm: running a script to repair the dumps mount in all podpresets T275371

2021-02-26

  • 22:04 bstorm: cleaned up grid jobs 1230666,1908277,1908299,2441500,2441513
  • 21:27 bstorm: hard rebooting tools-sgeexec-0947
  • 21:21 bstorm: hard rebooting tools-sgeexec-0952.tools.eqiad.wmflabs
  • 20:01 bd808: Deleted csr in strange state for tool-ores-inspect

2021-02-24

  • 18:30 bd808: `sudo wmcs-openstack role remove --user zfilipin --project tools user` T267313
  • 01:04 bstorm: hard rebooting tools-k8s-worker-76 because it's in a sorry state

2021-02-23

  • 23:11 bstorm: draining a bunch of k8s workers to clean up after dumps changes T272397
  • 23:06 bstorm: draining tools-k8s-worker-55 to clean up after dumps changes T272397

2021-02-22

  • 20:40 bstorm: repooled tools-sgeexec-0918.tools.eqiad.wmflabs
  • 19:09 bstorm: hard rebooted tools-sgeexec-0918 from openstack T275411
  • 19:07 bstorm: shutting down tools-sgeexec-0918 with the VM's command line (not libvirt directly yet) T275411
  • 19:05 bstorm: shutting down tools-sgeexec-0918 (with openstack to see what happens) T275411
  • 19:03 bstorm: depooled tools-sgeexec-0918 T275411
  • 18:56 bstorm: deleted job 1962508 from the grid to clear it up T275301
  • 16:58 bstorm: cleared error state on several grid queues

2021-02-19

  • 12:31 arturo: deploying new version of toolforge ingress admission controller

2021-02-17

  • 21:26 bstorm: deleted tools-puppetdb-01 since it is unused at this time (and undersized anyway)

2021-02-04

  • 16:27 bstorm: rebooting tools-package-builder-02

2021-01-26

  • 16:27 bd808: Hard reboot of tools-sgeexec-0906 via Horizon for T272978

2021-01-22

  • 09:59 dcaro: added the record redis.svc.tools.eqiad1.wikimedia.cloud pointing to tools-redis1003 (T272679)

2021-01-21

  • 23:58 bstorm: deployed new maintain-kubeusers to tools T271847

2021-01-19

  • 22:57 bstorm: truncated 75GB error log /data/project/robokobot/virgule.err T272247
  • 22:48 bstorm: truncated 100GB error log /data/project/magnus-toolserver/error.log T272247
  • 22:43 bstorm: truncated 107GB log '/data/project/meetbot/logs/messages.log' T272247
  • 22:34 bstorm: truncating 194 GB error log '/data/project/mix-n-match/mnm-microsync.err' T272247
  • 16:37 bd808: Added Jhernandez to root sudoers group
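
The truncations above reclaim space without removing the files, so the writing processes keep their open handles. A minimal sketch (path from the 22:57 entry):

    sudo truncate -s 0 /data/project/robokobot/virgule.err   # zero the runaway log in place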

2021-01-14

  • 20:56 bstorm: setting bastions to have mostly-uncapped egress network and 40MBps nfs_read for better shared use
  • 20:43 bstorm: running tc-setup across the k8s workers
  • 20:40 bstorm: running tc-setup across the grid fleet
  • 17:58 bstorm: hard rebooting tools-sgecron-01 following network issues during upgrade to stein T261134

2021-01-13

  • 10:02 arturo: delete floating IP allocation 185.15.56.245 (T271867)

2021-01-12

  • 18:16 bstorm: deleted wedged CSR tool-adhs-wde to get maintain-kubeusers working again T271842
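
maintain-kubeusers stalls when a certificate signing request it created cannot be reconciled; deleting the wedged CSR lets it be recreated. A minimal sketch, assuming cluster-admin kubectl access (CSR name from the entry above):

    kubectl get csr                    # find the stuck request
    kubectl delete csr tool-adhs-wde   # remove it so maintain-kubeusers can proceed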

2021-01-05

  • 18:49 bstorm: changing the limits on k8s etcd nodes again, so disabling puppet on them T267966

2021-01-04

  • 18:21 bstorm: ran 'sudo systemctl stop getty@ttyS1.service && sudo systemctl disable getty@ttyS1.service' on tools-k8s-etcd-5. I have no idea why that keeps coming back.

2020-12-22

  • 18:22 bstorm: rebooting the grid master because it is misbehaving following the NFS outage
  • 10:53 arturo: rebase & resolve ugly git merge conflict in labs/private.git

2020-12-18

  • 18:37 bstorm: set profile::wmcs::kubeadm::etcd_latency_ms: 15 T267966

2020-12-11

  • 18:29 bstorm: certificatesigningrequest.certificates.k8s.io "tool-production-error-tasks-metrics" deleted to stop maintain-kubeusers issues
  • 12:14 dcaro: upgrading stable/main (clinic duty)
  • 12:12 dcaro: upgrading buster-wikimedia/main (clinic duty)
  • 12:03 dcaro: upgrading stable-updates/main, mainly ca-certificates (clinic duty)
  • 12:01 dcaro: upgrading stretch-backports/main, mainly libuv (clinic duty)
  • 11:58 dcaro: disabled all the repos blocking upgrades on tools-package-builder-02 (duplicated, other releases...)
  • 11:35 arturo: uncordon tools-k8s-worker-71 and tools-k8s-worker-55, they weren't uncordoned yesterday for whatever reasons (T263284)
  • 11:27 dcaro: upgrading stretch-wikimedia/main (clinic duty)
  • 11:20 dcaro: upgrading stretch-wikimedia/thirdparty/mono-project-stretch (clinic duty)
  • 11:08 dcaro: upgrade stretch-wikimedia/component/php72 (minor upgrades) (clinic duty)
  • 11:04 dcaro: upgrade oldstable/main packages (clinic duty)
  • 10:58 dcaro: upgrade kubectl done (clinic duty)
  • 10:53 dcaro: upgrade kubectl (clinic duty)
  • 10:16 dcaro: upgrading oldstable/main packages (clinic duty)

2020-12-10

  • 17:35 bstorm: k8s-control nodes upgraded to 1.17.13 T263284 (see the sketch after this list)
  • 17:16 arturo: k8s control nodes were all upgraded to 1.17, now upgrading worker nodes (T263284)
  • 15:50 dcaro: puppet upgraded to 5.5.10 on the hosts, ping me if you see anything weird (clinic duty)
  • 15:41 arturo: icinga-downtime toolschecker for 2h (T263284)
  • 15:35 dcaro: Puppet 5 on tools-sgebastion-09 ran well and without issues, upgrading the other sge nodes (clinic duty)
  • 15:32 dcaro: Upgrading puppet from 4 to 5 on tools-sgebastion-09 (clinic duty)
  • 12:41 arturo: set hiera `profile::wmcs::kubeadm::component: thirdparty/kubeadm-k8s-1-17` in project & tools-k8s-control prefix (T263284)
  • 11:50 arturo: disabled puppet in all k8s nodes in preparation for version upgrade (T263284)
  • 09:58 dcaro: successful tesseract upgrade on tools-sgewebgrid-lighttpd-0914, upgrading the rest of nodes (clinic duty)
  • 09:49 dcaro: upgrading tesseract on tools-sgewebgrid-lighttpd-0914 (clinic duty)
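
The control-plane upgrade to 1.17.13 noted in the 17:35 entry follows the standard kubeadm flow. A minimal sketch; exact package names and versions provided by the thirdparty/kubeadm-k8s-1-17 component are assumptions:

    sudo apt-get install kubeadm            # pull the new kubeadm from the pinned component
    sudo kubeadm upgrade plan               # on the first control node: check the upgrade path
    sudo kubeadm upgrade apply v1.17.13     # upgrade the control plane components
    sudo apt-get install kubelet kubectl    # then the node packages
    sudo systemctl restart kubelet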

2020-12-08

  • 19:01 bstorm: pushed updated calico node image (v3.14.0) to internal docker registry as well T269016

2020-12-07

  • 22:56 bstorm: pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry T269016

2020-12-03

  • 09:18 arturo: restarted kubelet systemd service on tools-k8s-worker-38. Node was NotReady, complaining about 'use of closed network connection'
  • 09:16 arturo: restarted kubelet systemd service on tools-k8s-worker-59. Node was NotReady, complaining about 'use of closed network connection'
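
A minimal sketch of the recovery used for these NotReady workers (node name from the 09:18 entry; commands run from a control node and over ssh respectively):

    kubectl get nodes | grep NotReady                                                       # confirm which workers are NotReady
    ssh tools-k8s-worker-38.tools.eqiad1.wikimedia.cloud 'sudo systemctl restart kubelet'   # restart the kubelet on the affected worker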

2020-11-28

  • 23:35 Krenair: Re-scheduled 4 continuous jobs from tools-sgeexec-0908 as it appears to be broken, at about 23:20 UTC
  • 04:35 Krenair: Ran `sudo -i kubectl -n tool-mdbot delete cm maintain-kubeusers` on tools-k8s-control-1 for T268904, seems to have regenerated ~tools.mdbot/.kube/config

2020-11-24

  • 17:44 arturo: rebased labs/private.git. 2 patches had merge conflicts
  • 16:36 bd808: clush -w @all -b 'sudo -i apt-get purge nscd'
  • 16:31 bd808: Ran `sudo -i apt-get purge nscd` on tools-sgeexec-0932 to try and fix apt state for puppet

2020-11-10

  • 19:45 andrewbogott: rebooting tools-sgeexec-0950; OOM

2020-11-02

  • 13:35 arturo: (typo: dcaro)
  • 13:35 arturo: added dcar as projectadmin & user (T266068)

2020-10-29

  • 21:33 legoktm: published docker-registry.tools.wmflabs.org/toolbeta-test image (T265681)
  • 21:10 bstorm: Added another ingress node to k8s cluster in case the load spikes are the problem T266506
  • 17:33 bstorm: hard rebooting tools-sgeexec-0905 and tools-sgeexec-0916 to get the grid back to full capacity
  • 04:03 legoktm: published docker-registry.tools.wmflabs.org/toolforge-buster0-builder:latest image (T265686)

2020-10-28

  • 23:42 bstorm: dramatically elevated the egress cap on tools-k8s-ingress nodes that were affected by the NFS settings T266506
  • 22:10 bstorm: launching tools-k8s-ingress-3 to try and get an NFS-free node T266506
  • 21:58 bstorm: set 'mount_nfs: false' on the tools-k8s-ingress prefix T266506

2020-10-23

  • 22:22 legoktm: imported pack_0.14.2-1_amd64.deb into buster-tools (T266270)

2020-10-21

  • 17:58 legoktm: pushed toolforge-buster0-{build,run}:latest images to docker registry

2020-10-15

  • 22:00 bstorm: manually removing nscd from tools-sgebastion-08 and running puppet
  • 18:23 andrewbogott: uncordoning tools-k8s-worker-53, 54, 55, 59
  • 17:28 andrewbogott: depooling tools-k8s-worker-53, 54, 55, 59
  • 17:27 andrewbogott: uncordoning tools-k8s-worker-35, 37, 45
  • 16:44 andrewbogott: depooling tools-k8s-worker-35, 37, 45

2020-10-14

  • 21:00 andrewbogott: repooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
  • 20:37 andrewbogott: depooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
  • 20:35 andrewbogott: repooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16
  • 20:31 bd808: Deployed toollabs-webservice v0.74
  • 19:53 andrewbogott: depooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16 and moving to Ceph
  • 19:47 andrewbogott: repooling tools-sgeexec-0932, 33, 34 and moving to Ceph
  • 19:07 andrewbogott: depooling tools-sgeexec-0932, 33, 34 and moving to Ceph
  • 19:06 andrewbogott: repooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
  • 16:56 andrewbogott: depooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph

2020-10-10

  • 17:07 bstorm: cleared errors on tools-sgeexec-0912.tools.eqiad.wmflabs to get the queue moving again

2020-10-08

  • 17:07 bstorm: rebuilding docker images with locales-all T263339

2020-10-06

  • 19:04 andrewbogott: uncordoned tools-k8s-worker-38
  • 18:51 andrewbogott: uncordoned tools-k8s-worker-52
  • 18:40 andrewbogott: draining and cordoning tools-k8s-worker-52 and tools-k8s-worker-38 for ceph migration

2020-10-02

  • 21:09 bstorm: rebooting tools-k8s-worker-70 because it seems to be unable to recover from an old NFS disconnect
  • 17:37 andrewbogott: stopping tools-prometheus-03 to attempt a snapshot
  • 16:03 bstorm: shutting down tools-prometheus-04 to try to fsck the disk

2020-10-01

  • 21:39 andrewbogott: migrating tools-proxy-06 to ceph
  • 21:35 andrewbogott: moving k8s.tools.eqiad1.wikimedia.cloud from 172.16.0.99 (toolsbeta-test-k8s-haproxy-1) to 172.16.0.108 (toolsbeta-test-k8s-haproxy-2) in anticipation of downtime for haproxy-1 tomorrow

2020-09-30

  • 18:34 andrewbogott: repooling tools-sgeexec-0918
  • 18:29 andrewbogott: depooling tools-sgeexec-0918 so I can reboot cloudvirt1036

2020-09-23

  • 21:38 bstorm: ran an 'apt clean' across the fleet to get ahead of the new locale install

2020-09-18

  • 19:41 andrewbogott: repooling tools-k8s-worker-30, 33, 34, 57, 60
  • 19:04 andrewbogott: depooling tools-k8s-worker-30, 33, 34, 57, 60
  • 19:02 andrewbogott: repooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
  • 17:48 andrewbogott: depooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
  • 17:47 andrewbogott: repooling tools-k8s-worker-31, 32, 36, 39, 40
  • 16:40 andrewbogott: depooling tools-k8s-worker-31, 32, 36, 39, 40
  • 16:38 andrewbogott: repooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
  • 16:10 andrewbogott: depooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
  • 13:54 andrewbogott: repooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916
  • 13:50 andrewbogott: depooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916 for flavor update
  • 01:20 andrewbogott: repooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 after flavor update
  • 01:11 andrewbogott: depooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 for flavor update
  • 01:08 andrewbogott: repooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 after flavor update
  • 01:00 andrewbogott: depooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 for flavor update
  • 00:58 andrewbogott: repooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 after flavor update
  • 00:49 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update

2020-09-17

  • 21:56 bd808: Built and deployed tools-manifest v0.22 (T263190)
  • 21:55 bd808: Built and deployed tools-manifest v0.22 (T169695)
  • 20:34 bd808: Live hacked "--backend=gridengine" into webservicemonitor on tools-sgecron-01 (T263190)
  • 20:21 bd808: Restarted webservicemonitor on tools-sgecron-01.tools.eqiad.wmflabs
  • 20:09 andrewbogott: I didn't actually depool tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 because there was some kind of brief outage just now
  • 19:58 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
  • 19:55 andrewbogott: repooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
  • 19:29 andrewbogott: depooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
  • 15:38 andrewbogott: repooling tools-k8s-worker-70 and tools-k8s-worker-66 after flavor remapping
  • 15:34 andrewbogott: depooling tools-k8s-worker-70 and tools-k8s-worker-66 for flavor remapping
  • 15:30 andrewbogott: repooling tools-sgeexec-0909, 0908, 0907, 0906, 0904
  • 15:21 andrewbogott: depooling tools-sgeexec-0909, 0908, 0907, 0906, 0904 for flavor remapping
  • 13:55 andrewbogott: depooled tools-sgewebgrid-lighttpd-0917 and tools-sgewebgrid-lighttpd-0920
  • 13:55 andrewbogott: repooled tools-sgeexec-0937 after move to ceph
  • 13:45 andrewbogott: depooled tools-sgeexec-0937 for move to ceph

2020-09-16

  • 23:20 andrewbogott: repooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
  • 23:03 andrewbogott: depooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
  • 23:02 andrewbogott: uncordoned tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
  • 22:29 andrewbogott: draining tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
  • 17:37 andrewbogott: service gridengine-master restart on tools-sgegrid-master

2020-09-10

  • 15:37 arturo: hard-rebooting tools-proxy-05
  • 15:33 arturo: rebooting tools-proxy-05 to try flushing local DNS caches
  • 15:25 arturo: detected missing DNS record for k8s.tools.eqiad1.wikimedia.cloud which means the k8s cluster is down
  • 10:22 arturo: enabling ingress dedicated worker nodes in the k8s cluster (T250172)

2020-09-08

  • 23:24 bstorm: clearing grid queue error states blocking job runs
  • 22:53 bd808: forcing puppet run on tools-sgebastion-07

2020-09-02

  • 18:13 andrewbogott: moving tools-sgeexec-0920 to ceph
  • 17:57 andrewbogott: moving tools-sgeexec-0942 to ceph

2020-08-31

  • 19:58 andrewbogott: migrating tools-sgeexec-091[0-9] to ceph
  • 17:19 andrewbogott: migrating tools-sgeexec-090[4-9] to ceph
  • 17:19 andrewbogott: repooled tools-sgeexec-0901
  • 16:52 bstorm: `apt install uwsgi` was run on tools-checker-03 in the last log T261677
  • 16:51 bstorm: running `apt install uwsgi` with --allow-downgrades to fix the puppet setup there T261677
  • 14:26 andrewbogott: depooling tools-sgeexec-0901, migrating to ceph

2020-08-30

  • 00:57 Krenair: also ran qconf -ds on each
  • 00:35 Krenair: Tidied up SGE problems (it was spamming root@ every minute for hours) following host deletions some hours ago - removed tools-sgeexec-0921 through 0931 from @general, ran qmod -rj on all jobs registered for those nodes, then qdel -f on the remainders, then qconf -de on each deleted node
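
A minimal sketch of that cleanup sequence with gridengine commands (the host and job IDs are illustrative; repeat per deleted node):

    sudo qconf -dattr hostgroup hostlist tools-sgeexec-0921.tools.eqiad.wmflabs @general   # drop the node from @general
    sudo qmod -rj <jobid>                                    # reschedule jobs registered on the dead node
    sudo qdel -f <jobid>                                     # force-delete the remainders
    sudo qconf -de tools-sgeexec-0921.tools.eqiad.wmflabs    # delete it as an execution host
    sudo qconf -ds tools-sgeexec-0921.tools.eqiad.wmflabs    # and as a submit host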

2020-08-29

  • 16:02 bstorm: deleting "tools-sgeexec-0931", "tools-sgeexec-0930", "tools-sgeexec-0929", "tools-sgeexec-0928", "tools-sgeexec-0927"
  • 16:00 bstorm: deleting "tools-sgeexec-0926", "tools-sgeexec-0925", "tools-sgeexec-0924", "tools-sgeexec-0923", "tools-sgeexec-0922", "tools-sgeexec-0921"

2020-08-25

  • 19:38 andrewbogott: deleting tools-sgeexec-0943.tools.eqiad.wmflabs, tools-sgeexec-0944.tools.eqiad.wmflabs, tools-sgeexec-0945.tools.eqiad.wmflabs, tools-sgeexec-0946.tools.eqiad.wmflabs, tools-sgeexec-0948.tools.eqiad.wmflabs, tools-sgeexec-0949.tools.eqiad.wmflabs, tools-sgeexec-0953.tools.eqiad.wmflabs — they are broken and we're not very curious why; will retry this exercise when everything is standardized on
  • 15:03 andrewbogott: removing non-ceph nodes tools-sgeexec-0921 through tools-sgeexec-0931
  • 15:02 andrewbogott: added new sge-exec nodes tools-sgeexec-0943 through tools-sgeexec-0953 (for real this time)

2020-08-19

  • 21:29 andrewbogott: shutting down and removing tools-k8s-worker-20 through tools-k8s-worker-29; this load can now be handled by new nodes on ceph hosts
  • 21:15 andrewbogott: shutting down and removing tools-k8s-worker-1 through tools-k8s-worker-19; this load can now be handled by new nodes on ceph hosts
  • 18:40 andrewbogott: creating 13 new xlarge k8s worker nodes, tools-k8s-worker-67 through tools-k8s-worker-79

2020-08-18

  • 15:24 bd808: Rebuilding all Docker containers to pick up newest versions of installed packages

2020-07-30

  • 16:28 andrewbogott: added new xlarge ceph-hosted worker nodes: tools-k8s-worker-61, 62, 63, 64, 65, 66. T258663

2020-07-29

  • 23:24 bd808: Pushed a copy of docker-registry.wikimedia.org/wikimedia-jessie:latest to docker-registry.tools.wmflabs.org/wikimedia-jessie:latest in preparation for the upstream image going away

2020-07-24

  • 22:33 bd808: Removed a few more ancient docker images: grrrit, jessie-toollabs, and nagf
  • 21:02 bd808: Running cleanup script to delete the non-sssd toolforge images from docker-registry.tools.wmflabs.org
  • 20:17 bd808: Forced garbage collection on docker-registry.tools.wmflabs.org
  • 20:06 bd808: Running cleanup script to delete all of the old toollabs-* images from docker-registry.tools.wmflabs.org

2020-07-22

  • 23:24 bstorm: created server group 'tools-k8s-worker' to create any new worker nodes in so that they have a low chance of being scheduled together by openstack unless it is necessary T258663
  • 23:22 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[56-60] T257945
  • 23:17 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[41-55] T257945
  • 23:14 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[21-40] T257945
  • 23:11 bstorm: running puppet and NFS remount on tools-k8s-worker-[1-15] T257945
  • 23:07 bstorm: disabling puppet on k8s workers to reduce the effect of changing the NFS mount version all at once T257945
  • 22:28 bstorm: setting tools-k8s-control prefix to mount NFS v4.2 T257945
  • 22:15 bstorm: set the tools-k8s-control nodes to also use 800MB/s to prevent issues with the Toolforge ingress and API system
  • 22:07 bstorm: set tools-k8s-haproxy-1 (main load balancer for Toolforge) to an egress limit of 800MB/s instead of the default applied to all the other servers
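
A hedged sketch of how the batched NFS 4.2 rollout above might look from the clush master; the node sets follow the log, while the mount point and the umount/mount step are assumptions (puppet manages the actual fstab options).

```bash
# Hedged sketch of the batched NFS 4.2 remount (T257945); the mount point and
# remount method are assumptions.
clush -w 'tools-k8s-worker-[1-15,21-60]' "sudo puppet agent --disable 'NFS 4.2 rollout'"

for batch in 'tools-k8s-worker-[1-15]' 'tools-k8s-worker-[21-40]' \
             'tools-k8s-worker-[41-55]' 'tools-k8s-worker-[56-60]'; do
    clush -w "$batch" "
        sudo puppet agent --enable &&
        sudo run-puppet-agent &&
        sudo umount /mnt/nfs/labstore-secondary-tools-project &&
        sudo mount /mnt/nfs/labstore-secondary-tools-project
    "
done
```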

2020-07-21

  • 16:09 bstorm: rebooting tools-sgegrid-shadow to remount NFS correctly
  • 15:55 bstorm: set the bastion prefix to have explicitly set hiera value of profile::wmcs::nfsclient::nfs_version: '4'

2020-07-17

  • 16:47 bd808: Enabled Puppet on tools-proxy-06 following successful test (T102367)
  • 16:29 bd808: Disabled Puppet on tools-proxy-06 to test nginx config changes manually (T102367)

2020-07-15

  • 23:11 bd808: Removed ssh root key for valhallasw from project hiera (T255697)

2020-07-09

  • 18:53 bd808: Updating git-review to 1.27 via clush across cluster (T257496)

2020-07-08

2020-07-07

  • 23:22 bd808: Rebuilding all Docker images to pick up webservice v0.73 (T234617, T257229)
  • 23:19 bd808: Deploying webservice v0.73 via clush (T234617, T257229)
  • 23:16 bd808: Building webservice v0.73 (T234617, T257229)
  • 15:01 Reedy: killed python process from tools.experimental-embeddings using a lot of cpu on tools-sgebastion-07
  • 15:01 Reedy: killed meno25 process running pwb.py on tools-sgebastion-07
  • 09:59 arturo: point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) (T247236)

2020-07-06

  • 11:54 arturo: briefly point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) and then switch back to 185.15.56.11 (tools-proxy-05); the legacy redirector answers with HTTP 307 redirects (see the sketch below) (T247236)
  • 11:50 arturo: associate floating IP address 185.15.56.60 to tools-legacy-redirector (T247236)
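
For reference, a hedged sketch of the floating-IP association and A-record flip with the OpenStack/Designate CLI; it assumes the Designate OSC plugin is available and that zone and recordset names are accepted in place of IDs.

```bash
# Hedged sketch of the tools.wmflabs.org switch-over test (T247236).
openstack server add floating ip tools-legacy-redirector 185.15.56.60
openstack recordset set wmflabs.org. tools.wmflabs.org. --record 185.15.56.60
# ...and back to the active proxy after the brief test:
openstack recordset set wmflabs.org. tools.wmflabs.org. --record 185.15.56.11
```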

2020-07-01

2020-06-30

  • 11:18 arturo: set some hiera keys for mtail in puppet prefix `tools-mail` (T256737)

2020-06-29

2020-06-25

  • 21:49 zhuyifei1999_: re-enabling puppet on tools-sgebastion-09 T256426
  • 21:39 zhuyifei1999_: disabling puppet on tools-sgebastion-09 so I can play with mount settings T256426
  • 21:24 bstorm: hard rebooting tools-sgebastion-09

2020-06-24

2020-06-23

  • 17:55 arturo: killed processes for users `hamishz` and `msyn`, which apparently belonged to tools that should be running on the grid / Kubernetes instead
  • 16:08 arturo: created acme-chief cert `tools_mail` in the prefix hiera

2020-06-17

  • 10:40 arturo: created VM tools-legacy-redirector, with the corresponding puppet prefix (T247236, T234617)

2020-06-16

  • 23:01 bd808: Building new Docker images to pick up webservice 0.72
  • 22:58 bd808: Deploying webservice 0.72 to bastions and grid
  • 22:56 bd808: Building webservice 0.72
  • 15:10 arturo: merging a patch with changes to the template for keepalived (used in the elastic cluster) https://gerrit.wikimedia.org/r/c/operations/puppet/+/605898

2020-06-15

  • 21:28 bstorm_: cleaned up killgridjobs.sh on the tools bastions T157792
  • 18:14 bd808: Rebuilding all Docker images to pick up webservice 0.71 (T254640, T253412)
  • 18:12 bd808: Deploying webservice 0.71 to bastions and grid via clush
  • 18:05 bd808: Building webservice 0.71

2020-06-12

  • 13:13 arturo: live-hacking session in the puppetmaster ended
  • 13:10 arturo: live-hacking puppet tree in tools-puppetmaster-02 for testing a PAWS related patch (they share haproxy puppet code)
  • 00:16 bstorm_: remounted NFS for tools-k8s-control-3 and tools-acme-chief-01

2020-06-11

  • 23:35 bstorm_: rebooting tools-k8s-control-2 because it seems to be confused on NFS, interestingly enough

2020-06-04

  • 13:32 bd808: Manually restored /etc/haproxy/conf.d/elastic.cfg on tools-elastic-*

2020-06-02

2020-06-01

  • 23:51 bstorm_: refreshed certs for the custom webhook controllers on the k8s cluster T250874
  • 00:39 bd808: Ugh. Prior SAL message was about tools-sgeexec-0940
  • 00:39 bd808: Compressed /var/log/account/pacct.0 ahead of rotation schedule to free some space on the root partition

2020-05-29

  • 19:37 bstorm_: adding docker image for paws-public docker-registry.tools.wmflabs.org/paws-public-nginx:openresty T252217

2020-05-28

  • 21:19 bd808: Killed 7 python processes run by user 'mattho69' on login.toolforge.org
  • 21:06 bstorm_: upgrading tools-k8s-worker-[30-60] to kubernetes 1.16.10 (upgrade flow sketched below) T246122
  • 17:54 bstorm_: upgraded tools-k8s-worker-[11..15] and starting on -21-29 now T246122
  • 16:01 bstorm_: kubectl upgraded to 1.16.10 on all bastions T246122
  • 15:58 arturo: upgrading tools-k8s-worker-[1..10] to 1.16.10 (T246122)
  • 15:41 arturo: upgrading tools-k8s-control-3 to 1.16.10 (T246122)
  • 15:17 arturo: upgrading tools-k8s-control-2 to 1.16.10 (T246122)
  • 15:09 arturo: upgrading tools-k8s-control-1 to 1.16.10 (T246122)
  • 14:49 arturo: cleanup /etc/apt/sources.list.d/ directory in all tools-k8s-* VMs
  • 11:27 arturo: merging change to front-proxy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/599139 (T253816)
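
A hedged sketch of the per-node upgrade flow behind the 1.16.10 entries above; it follows the standard kubeadm procedure, and the package pins are assumptions rather than the exact commands used.

```bash
# Hedged sketch of the 1.16.10 upgrade (T246122), standard kubeadm flow.

# First control-plane node (tools-k8s-control-1):
sudo apt-get install -y kubeadm=1.16.10-00
sudo kubeadm upgrade apply v1.16.10
sudo apt-get install -y kubelet=1.16.10-00 kubectl=1.16.10-00
sudo systemctl restart kubelet

# Remaining control nodes and each worker (workers drained beforehand):
sudo apt-get install -y kubeadm=1.16.10-00
sudo kubeadm upgrade node
sudo apt-get install -y kubelet=1.16.10-00
sudo systemctl restart kubelet
```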

2020-05-27

  • 17:23 bstorm_: deleting "tools-k8s-worker-20", "tools-k8s-worker-19", "tools-k8s-worker-18", "tools-k8s-worker-17", "tools-k8s-worker-16"

2020-05-26

  • 18:45 bstorm_: upgrading maintain-kubeusers to match what is in toolsbeta T246059 T211096
  • 16:20 bstorm_: fix incorrect volume name in kubeadm-config configmap T246122

2020-05-22

  • 20:00 bstorm_: rebooted tools-sgebastion-07 (with a 10 min warning) to clear up tmp file problems
  • 19:12 bstorm_: running command to delete over 2000 tmp ca certs on tools-bastion-07 T253412

2020-05-21

  • 22:40 bd808: Rebuilding all Docker containers for tools-webservice 0.70 (T252700)
  • 22:36 bd808: Updated tools-webservice to 0.70 across instances (T252700)
  • 22:29 bd808: Building tools-webservice 0.70 via wmcs-package-build.py

2020-05-20

  • 09:59 arturo: now running tesseract-ocr v4.1.1-2~bpo9+1 in the Toolforge grid (T247422)
  • 09:50 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'apt-get install tesseract-ocr -t stretch-backports -y'` (T247422)
  • 09:35 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'rm /etc/apt/sources.lists.d/kubeadm-k8s-component-repo.list ; rm /etc/apt/sources.list.d/repository_thirdparty-kubeadm-k8s-1-15.list ; run-puppet-agent'` (T247422)
  • 09:23 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'rm /etc/apt/preferences.d/* ; run-puppet-agent'` (T247422)

2020-05-19

  • 17:00 bstorm_: deleting/restarting the paws db-proxy pod because it cannot connect to the replicas...and I'm hoping that's due to depooling and such

2020-05-13

  • 18:14 bstorm_: upgrading calico to 3.14.0 with typha enabled in Toolforge K8s T250863
  • 18:10 bstorm_: set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade T250863

2020-05-09

  • 00:28 bstorm_: added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera T252260

2020-05-08

  • 18:17 bd808: Building all jessie-sssd derived images (T197930)
  • 17:29 bd808: Building new jessie-sssd base image (T197930)

2020-05-07

  • 21:51 bstorm_: rebuilding the docker images for Toolforge k8s
  • 19:03 bstorm_: toollabs-webservice 0.69 now pushed to the Toolforge bastions
  • 18:57 bstorm_: pushing new toollabs-webservice package v0.69 to the tools repos

2020-05-06

  • 21:20 bd808: Kubectl delete node tools-k8s-worker-[16-20] (decommission flow sketched below) (T248702)
  • 18:24 bd808: Updated "profile::toolforge::k8s::worker_nodes" list in "tools-k8s-haproxy" prefix puppet (T248702)
  • 18:14 bd808: Shutdown tools-k8s-worker-[16-20] instances (T248702)
  • 18:04 bd808: Draining tools-k8s-worker-[16-20] in preparation for decomm (T248702)
  • 17:56 bd808: Cordoned tools-k8s-worker-[16-20] in preparation for decomm (T248702)
  • 00:01 bd808: Joining tools-k8s-worker-60 to the k8s worker pool
  • 00:00 bd808: Joining tools-k8s-worker-59 to the k8s worker pool
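
A hedged sketch of the cordon/drain/delete flow used for those workers; the drain flags are assumptions.

```bash
# Hedged sketch of the worker decommission (T248702).
for node in tools-k8s-worker-{16..20}; do
    kubectl cordon "$node"
    kubectl drain "$node" --ignore-daemonsets --delete-local-data
done
# The instances are then shut down and removed from the
# profile::toolforge::k8s::worker_nodes list in the tools-k8s-haproxy prefix,
# and finally dropped from the API:
for node in tools-k8s-worker-{16..20}; do
    kubectl delete node "$node"
done
```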

2020-05-05

  • 23:58 bd808: Joining tools-k8s-worker-58 to the k8s worker pool
  • 23:55 bd808: Joining tools-k8s-worker-57 to the k8s worker pool
  • 23:53 bd808: Joining tools-k8s-worker-56 to the k8s worker pool
  • 21:51 bd808: Building 5 new k8s worker nodes (T248702)

2020-05-04

  • 22:08 bstorm_: deleting tools-elastic-01/2/3 T236606
  • 16:46 arturo: removing the now unused `/etc/apt/preferences.d/toolforge_k8s_kubeadmrepo*` files (T250866)
  • 16:43 arturo: removing the now unused `/etc/apt/sources.list.d/toolforge-k8s-kubeadmrepo.list` file (T250866)

2020-04-29

  • 22:13 bstorm_: running a fixup script after fixing a bug T247455
  • 21:28 bstorm_: running the rewrite-psp-preset.sh script across all tools T247455
  • 16:54 bstorm_: deleted the maintain-kubeusers pod to start running the new image T247455
  • 16:52 bstorm_: tagged docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to latest to deploy to toolforge T247455

2020-04-28

  • 22:58 bstorm_: rebuilding docker-registry.tools.wmflabs.org/maintain-kubeusers:beta T247455

2020-04-23

  • 19:22 bd808: Increased Kubernetes services quota for bd808-test tool.

2020-04-21

  • 23:06 bstorm_: repooled tools-k8s-worker-38/52, tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869
  • 22:09 bstorm_: depooling tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869
  • 22:02 bstorm_: draining tools-k8s-worker-38 and tools-k8s-worker-52 as they are on the crashed host T250869

2020-04-20

  • 15:31 bd808: Rebuilding Docker containers to pick up tools-webservice v0.68 (T250625)
  • 14:47 arturo: added joakino to tools.admin LDAP group
  • 13:28 jeh: shutdown elasticsearch v5 cluster running Jessie T236606
  • 12:46 arturo: uploading tools-webservice v0.68 to aptly stretch-tools and update it on relevant servers (T250625)
  • 12:06 arturo: uploaded tools-webservice v0.68 to stretch-toolsbeta for testing
  • 11:59 arturo: `root@tools-sge-services-03:~# aptly db cleanup` removed 340 unreferenced packages, and 2 unreferenced files

2020-04-15

  • 23:20 bd808: Building ruby25-sssd/base and children (T141388, T250118)
  • 20:09 jeh: update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206
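
One way the security-group change above could look with the OpenStack CLI; the source restriction shown is an assumption and the prometheus host IP is a placeholder.

```bash
# Hedged sketch of the node_exporter scrape rule (T250206).
openstack security group rule create default \
    --protocol tcp --dst-port 9100:9100 --ingress \
    --remote-ip <prometheus01.metricsinfra.eqiad.wmflabs IP>/32
```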

2020-04-14

  • 18:26 bstorm_: Deployed new code and RBAC for maintain-kubeusers T246123
  • 18:19 bstorm_: updating the maintain-kubeusers:latest image T246123
  • 17:32 bstorm_: updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 T246123

2020-04-10

2020-04-09

  • 15:13 bd808: Rebuilding all stretch and buster Docker images. Jessie is broken at the moment due to package version mismatches
  • 11:18 arturo: bump nproc limit in bastions https://gerrit.wikimedia.org/r/c/operations/puppet/+/587715 (T219070)
  • 04:29 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 [try #2] (T154504, T234617)
  • 04:19 bd808: python3 build.py --image-prefix toolforge --tag latest --no-cache --push --single jessie-sssd
  • 00:20 bd808: Docker rebuild failed in toolforge-python2-sssd-base: "zlib1g-dev : Depends: zlib1g (= 1:1.2.8.dfsg-2+b1) but 1:1.2.8.dfsg-2+deb8u1 is to be installed"

2020-04-08

  • 23:49 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 (T154504, T234617)
  • 23:35 bstorm_: deploy toollabs-webservice v0.66 T154504 T234617

2020-04-07

  • 20:06 andrewbogott: sss_cache -E on tools-sgebastion-08 and tools-sgebastion-09
  • 20:00 andrewbogott: sss_cache -E on tools-sgebastion-07

2020-04-06

  • 19:16 bstorm_: deleted tools-redis-1001/2 T248929

2020-04-03

  • 22:40 bstorm_: shut down tools-redis-1001/2 T248929
  • 22:32 bstorm_: switch tools-redis-1003 to the active redis server T248929
  • 20:41 bstorm_: deleting tools-redis-1003/4 to re-create them in an anti-affinity server group (sketched below) T248929
  • 18:53 bstorm_: spin up tools-redis-1004 on stretch and connect to cluster T248929
  • 18:23 bstorm_: spin up tools-redis-1003 on stretch and connect to the cluster T248929
  • 16:50 bstorm_: launching tools-redis-03 (Buster) to see what happens
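
A hedged sketch of creating the anti-affinity server group and booting a replacement instance into it; the flavor, image, network and group UUID are placeholders, not values from the log.

```bash
# Hedged sketch of the redis anti-affinity setup (T248929).
openstack server group create --policy anti-affinity tools-redis
openstack server create tools-redis-1003 \
    --flavor <flavor> --image <debian-stretch-image> --network <tools-network> \
    --hint group=<tools-redis server group UUID>
```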

2020-03-30

  • 18:28 bstorm_: Beginning rolling depool, remount, repool of k8s workers for T248702
  • 18:22 bstorm_: disabled puppet across tools-k8s-worker-[1-55].tools.eqiad.wmflabs T248702
  • 16:56 arturo: dropping `_psl.toolforge.org` TXT record (T168677)

2020-03-27

  • 21:22 bstorm_: removed puppet prefix tools-docker-builder T248703
  • 21:15 bstorm_: deleted tools-docker-builder-06 T248703
  • 18:55 bstorm_: launching tools-docker-imagebuilder-01 T248703
  • 12:52 arturo: install python3-pykube on tools-k8s-control-3 for some test interactions with the API from Python

2020-03-24

2020-03-18

  • 19:07 bstorm_: removed role::toollabs::logging::sender from project puppet (it wouldn't work anyway)
  • 18:04 bstorm_: removed puppet prefix tools-flannel-etcd T246689
  • 17:58 bstorm_: removed puppet prefix tools-worker T246689
  • 17:57 bstorm_: removed puppet prefix tools-k8s-master T246689
  • 17:36 bstorm_: removed lots of deprecated hiera keys from horizon for the old cluster T246689
  • 16:59 bstorm_: deleting "tools-worker-1002", "tools-worker-1001", "tools-k8s-master-01", "tools-flannel-etcd-03", "tools-k8s-etcd-03", "tools-flannel-etcd-02", "tools-k8s-etcd-02", "tools-flannel-etcd-01", "tools-k8s-etcd-01" T246689

2020-03-17

  • 13:29 arturo: set `profile::toolforge::bastion::nproc: 200` for tools-sgebastion-08 (T219070)
  • 00:08 bstorm_: shut off tools-flannel-etcd-01/02/03 T246689

2020-03-16

  • 22:01 bstorm_: shut off tools-k8s-etcd-01/02/03 T246689
  • 22:00 bstorm_: shut off tools-k8s-master-01 T246689
  • 21:59 bstorm_: shut down tools-worker-1001 and tools-worker-1002 T246689

2020-03-11

  • 17:00 jeh: clean up apt cache on tools-sgebastion-07

2020-03-06

  • 16:25 bstorm_: updating maintain-kubeusers image to filter invalid tool names

2020-03-03

  • 18:16 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) T236606
  • 18:02 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud T236606
  • 17:31 jeh: create an OpenStack virtual IP address for the new elasticsearch cluster T236606
  • 10:54 arturo: deleted VMs `tools-worker-[1003-1020]` (legacy k8s cluster) (T246689)
  • 10:51 arturo: cordoned/drained all legacy k8s worker nodes except 1001/1002 (T246689)

2020-03-02

  • 22:26 jeh: starting first pass of elasticsearch data migration to new cluster T236606

2020-03-01

2020-02-28

  • 22:14 bstorm_: shutting down the old maintain-kubeusers and taking the gloves off the new one (removing --gentle-mode)
  • 16:51 bstorm_: node/tools-k8s-worker-15 uncordoned
  • 16:44 bstorm_: drained tools-k8s-worker-15 and hard rebooting it because it wasn't happy
  • 16:36 bstorm_: rebooting k8s workers 1-35 on the 2020 cluster to clear a strange nologin condition that has been there since the NFS maintenance
  • 16:14 bstorm_: rebooted tools-k8s-worker-7 to clear some puppet issues
  • 16:00 bd808: Devoicing stashbot in #wikimedia-cloud to reduce irc spam while migrating tools to 2020 Kubernetes cluster
  • 15:28 jeh: create OpenStack server group tools-elastic with anti-affinty policy enabled T236606
  • 15:09 jeh: create 3 new elasticsearch VMs tools-elastic-[1,2,3] T236606
  • 14:20 jeh: create new puppet prefixes for existing (no change in data) and new elasticsearch VMs
  • 04:35 bd808: Joined tools-k8s-worker-54 to 2020 Kubernetes cluster
  • 04:34 bd808: Joined tools-k8s-worker-53 to 2020 Kubernetes cluster
  • 04:32 bd808: Joined tools-k8s-worker-52 to 2020 Kubernetes cluster
  • 04:31 bd808: Joined tools-k8s-worker-51 to 2020 Kubernetes cluster
  • 04:28 bd808: Joined tools-k8s-worker-50 to 2020 Kubernetes cluster
  • 04:24 bd808: Joined tools-k8s-worker-49 to 2020 Kubernetes cluster
  • 04:23 bd808: Joined tools-k8s-worker-48 to 2020 Kubernetes cluster
  • 04:21 bd808: Joined tools-k8s-worker-47 to 2020 Kubernetes cluster
  • 04:21 bd808: Joined tools-k8s-worker-46 to 2020 Kubernetes cluster
  • 04:19 bd808: Joined tools-k8s-worker-45 to 2020 Kubernetes cluster
  • 04:14 bd808: Joined tools-k8s-worker-44 to 2020 Kubernetes cluster
  • 04:13 bd808: Joined tools-k8s-worker-43 to 2020 Kubernetes cluster
  • 04:12 bd808: Joined tools-k8s-worker-42 to 2020 Kubernetes cluster
  • 04:10 bd808: Joined tools-k8s-worker-41 to 2020 Kubernetes cluster
  • 04:09 bd808: Joined tools-k8s-worker-40 to 2020 Kubernetes cluster
  • 04:08 bd808: Joined tools-k8s-worker-39 to 2020 Kubernetes cluster
  • 04:07 bd808: Joined tools-k8s-worker-38 to 2020 Kubernetes cluster
  • 04:06 bd808: Joined tools-k8s-worker-37 to 2020 Kubernetes cluster
  • 03:49 bd808: Joined tools-k8s-worker-36 to 2020 Kubernetes cluster
  • 00:50 bstorm_: rebuilt all docker images to include webservice 0.64

2020-02-27

  • 23:27 bstorm_: installed toollabs-webservice 0.64 on the bastions
  • 23:24 bstorm_: pushed toollabs-webservice version 0.64 to all toolforge repos
  • 21:03 jeh: add reindex service account to elasticsearch for data migration T236606
  • 20:57 bstorm_: upgrading toollabs-webservice to stretch-toolsbeta version for jdk8:testing image only
  • 20:19 jeh: update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 T236606
  • 18:53 bstorm_: hard rebooted a rather stuck tools-sgecron-01
  • 18:20 bd808: Building tools-k8s-worker-[36-55]
  • 17:56 bd808: Deleted instances tools-worker-10[21-40]
  • 16:14 bd808: Decommissioning tools-worker-10[21-40]
  • 16:02 bd808: Drained tools-worker-1021
  • 15:51 bd808: Drained tools-worker-1022
  • 15:44 bd808: Drained tools-worker-1023 (there is no tools-worker-1024)
  • 15:39 bd808: Drained tools-worker-1025
  • 15:39 bd808: Drained tools-worker-1026
  • 15:11 bd808: Drained tools-worker-1027
  • 15:09 bd808: Drained tools-worker-1028 (there is no tools-worker-1029)
  • 15:07 bd808: Drained tools-worker-1030
  • 15:06 bd808: Uncordoned tools-worker-10[16-20]. Was overly optimistic about repacking the legacy Kubernetes cluster into 15 instances. Will keep 20 for now.
  • 15:00 bd808: Drained tools-worker-1031
  • 14:54 bd808: Hard reboot tools-worker-1016. Direct virsh console unresponsive. Stuck in shutdown since 2020-01-22?
  • 14:44 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
  • 14:41 bd808: Drained tools-worker-1032
  • 14:37 bd808: Drained tools-worker-1033
  • 14:35 bd808: Drained tools-worker-1034
  • 14:34 bd808: Drained tools-worker-1035
  • 14:33 bd808: Drained tools-worker-1036
  • 14:33 bd808: Drained tools-worker-10{39,38,37} yesterday but did not !log
  • 00:29 bd808: Drained tools-worker-1009 for reboot (NFS flakey)
  • 00:11 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
  • 00:08 bd808: Uncordoned tools-worker-1002.tools.eqiad.wmflabs
  • 00:02 bd808: Rebooting tools-worker-1002
  • 00:00 bd808: Draining tools-worker-1002 to reboot for NFS problems

2020-02-26

  • 23:42 bd808: Drained tools-worker-1040
  • 23:41 bd808: Cordoned tools-worker-10[16-40] in preparation for shrinking legacy Kubernetes cluster
  • 23:12 bstorm_: replacing all tool limit-ranges in the 2020 cluster with a lower cpu request version
  • 22:29 bstorm_: deleted pod maintain-kubeusers-6d9c45f4bc-5bqq5 to deploy new image
  • 21:06 bstorm_: deleting loads of stuck grid jobs
  • 20:27 jeh: rebooting tools-worker-[1008,1015,1021]
  • 20:15 bstorm_: rebooting tools-sgegrid-master because it actually still had the permissions issue going on
  • 18:03 bstorm_: downtimed toolschecker for nfs maintenance

2020-02-25

  • 15:31 bd808: `wmcs-k8s-enable-cluster-monitor toolschecker`

2020-02-23

2020-02-21

  • 16:02 andrewbogott: moving tools-sgecron-01 to cloudvirt1022

2020-02-20

  • 14:49 andrewbogott: moving tools-k8s-worker-19 and tools-k8s-worker-18 to cloudvirt1022 (as part of draining 1014)
  • 00:04 Krenair: Shut off tools-puppetmaster-01 - to be deleted in one week T245365

2020-02-19

  • 22:05 Krenair: Project-wide hiera change to swap puppetmaster to tools-puppetmaster-02 T245365
  • 15:36 bstorm_: setting 'puppetmaster: tools-puppetmaster-02.tools.eqiad.wmflabs' on tools-sgeexec-0942 to test new puppetmaster on grid T245365
  • 11:50 arturo: fix invalid YAML in the Horizon puppet prefix 'tools-k8s-haproxy' that prevented a clean puppet run in the VMs
  • 00:59 bd808: Live hacked the "nginx-configuration" ConfigMap for T245426 (done several hours ago, but I forgot to !log it)

2020-02-18

  • 23:26 bstorm_: added tools-sgegrid-master.tools.eqiad1.wikimedia.cloud and tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud to gridengine admin host lists
  • 09:50 arturo: temporarily delete DNS zone tools.wmcloud.org to try re-creating it

2020-02-17

2020-02-14

  • 00:38 bd808: Added tools-k8s-worker-35 to 2020 Kubernetes cluster (T244791)
  • 00:34 bd808: Added tools-k8s-worker-34 to 2020 Kubernetes cluster (T244791)
  • 00:32 bd808: Added tools-k8s-worker-33 to 2020 Kubernetes cluster (T244791)
  • 00:29 bd808: Added tools-k8s-worker-32 to 2020 Kubernetes cluster (T244791)
  • 00:25 bd808: Added tools-k8s-worker-31 to 2020 Kubernetes cluster (T244791)
  • 00:25 bd808: Added tools-k8s-worker-30 to 2020 Kubernetes cluster (T244791)
  • 00:17 bd808: Added tools-k8s-worker-29 to 2020 Kubernetes cluster (T244791)
  • 00:15 bd808: Added tools-k8s-worker-28 to 2020 Kubernetes cluster (T244791)
  • 00:13 bd808: Added tools-k8s-worker-27 to 2020 Kubernetes cluster (T244791)
  • 00:07 bd808: Added tools-k8s-worker-26 to 2020 Kubernetes cluster (T244791)
  • 00:03 bd808: Added tools-k8s-worker-25 to 2020 Kubernetes cluster (T244791)

2020-02-13

  • 23:53 bd808: Added tools-k8s-worker-24 to 2020 Kubernetes cluster (T244791)
  • 23:50 bd808: Added tools-k8s-worker-23 to 2020 Kubernetes cluster (T244791)
  • 23:38 bd808: Added tools-k8s-worker-22 to 2020 Kubernetes cluster (T244791)
  • 21:35 bd808: Deleted tools-sgewebgrid-lighttpd-092{1,2,3,4,5,6,7,8} & tools-sgewebgrid-generic-090{3,4} (T244791)
  • 21:33 bd808: Removed tools-sgewebgrid-lighttpd-092{1,2,3,4,5,6,7,8} & tools-sgewebgrid-generic-090{3,4} from grid engine config (T244791)
  • 17:43 andrewbogott: migrating b24e29d7-a468-4882-9652-9863c8acfb88 to cloudvirt1022

2020-02-12

  • 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice (0.63) (T244954)
  • 19:15 bd808: Deployed toollabs-webservice (0.63) on bastions (T244954)
  • 00:20 bd808: Depooling tools-sgewebgrid-generic-0903 (T244791)
  • 00:19 bd808: Depooling tools-sgewebgrid-generic-0904 (T244791)
  • 00:14 bd808: Depooling tools-sgewebgrid-lighttpd-0921 (T244791)
  • 00:09 bd808: Depooling tools-sgewebgrid-lighttpd-0922 (T244791)
  • 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0923 (T244791)
  • 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0924 (T244791)

2020-02-11

  • 23:58 bd808: Depooling tools-sgewebgrid-lighttpd-0925 (T244791)
  • 23:56 bd808: Depooling tools-sgewebgrid-lighttpd-0926 (T244791)
  • 23:38 bd808: Depooling tools-sgewebgrid-lighttpd-0927 (T244791)

2020-02-10

  • 23:39 bstorm_: updated tools-manifest to 0.21 on aptly for stretch
  • 22:51 bstorm_: all docker images now use webservice 0.62
  • 22:01 bd808: Manually starting webservices for tools that were running on tools-sgewebgrid-lighttpd-0928 (T244791)
  • 21:47 bd808: Depooling tools-sgewebgrid-lighttpd-0928 (T244791)
  • 21:25 bstorm_: upgraded toollabs-webservice package for tools to 0.62 T244293 T244289 T234617 T156626

2020-02-07

  • 10:55 arturo: drop jessie VM instances tools-prometheus-{01,02} which were shutdown (T238096)

2020-02-06

2020-02-05

  • 11:22 arturo: restarting ferm fleet-wide to account for prometheus servers changed IP (but same hostname) (T238096)

2020-02-04

  • 11:38 arturo: start tools-prometheus-01 again to sync data to the new tools-prometheus-03/04 VMs (T238096)
  • 11:37 arturo: re-create tools-prometheus-03/04 as 'bigdisk2' instances (300GB) T238096

2020-02-03

  • 14:12 arturo: move tools-prometheus-04 from cloudvirt1022 to cloudvirt1013
  • 12:48 arturo: shutdown tools-prometheus-01 and tools-prometheus-02 after pointing the proxy `tools-prometheus.wmflabs.org` to tools-prometheus-03; data is synced (T238096)
  • 09:38 arturo: tools-prometheus-01: systemctl stop prometheus@tools. Another try to migrate data to tools-prometheus-{03,04} (T238096)

2020-01-31

  • 14:06 arturo: leave tools-prometheus-01 as the backend for tools-prometheus.wmflabs.org for the weekend so grafana dashboards keep working (T238096)
  • 14:00 arturo: syncing again prometheus data from tools-prometheus-01 to tools-prometheus-0{3,4} due to some inconsistencies preventing prometheus from starting (T238096)

2020-01-30

  • 21:04 andrewbogott: also apt-get install python3-novaclient on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam. Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
  • 20:39 andrewbogott: apt-get install python3-keystoneclient on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam. Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
  • 16:27 arturo: create VM tools-prometheus-04 as cold standby of tools-prometheus-03 (T238096)
  • 16:25 arturo: point tools-prometheus.wmflabs.org proxy to tools-prometheus-03 (T238096)
  • 13:42 arturo: disable puppet in prometheus servers while syncing metric data (T238096)
  • 13:15 arturo: drop floating IP 185.15.56.60 and FQDN `prometheus.tools.wmcloud.org` because this is not how the prometheus setup currently works; using a web proxy instead: `tools-prometheus-new.wmflabs.org` (T238096)
  • 13:09 arturo: created FQDN `prometheus.tools.wmcloud.org` pointing to IPv4 185.15.56.60 (tools-prometheus-03) to test T238096
  • 12:59 arturo: associated floating IPv4 185.15.56.60 to tools-prometheus-03 (T238096)
  • 12:57 arturo: created domain `tools.wmcloud.org` in the tools project after some back and forth with Designate, permissions and the database. I plan to use this domain to test the new Debian Buster-based prometheus setup (T238096)
  • 10:20 arturo: create new VM instance tools-prometheus-03 (T238096)

2020-01-29

  • 20:07 bd808: Created {bastion,login,dev}.toolforge.org service names for Toolforge bastions using Horizon & Designate

2020-01-28

  • 13:35 arturo: `aborrero@tools-clushmaster-02:~$ clush -w @exec-stretch 'for i in $(ps aux | grep [t]ools.j | awk -F" " "{print \$2}") ; do echo "killing $i" ; sudo kill $i ; done || true'` (T243831)

2020-01-27

  • 07:05 zhuyifei1999_: wrong package; uninstalled. The correct one is bpfcc-tools, which seems to be available only in buster+. T115231
  • 07:01 zhuyifei1999_: apt installing bcc on tools-worker-1037 to see who is sending SIGTERM, will uninstall after done. dependency: bin86. T115231

2020-01-24

  • 20:58 bd808: Built tools-k8s-worker-21 to test out build script following openstack client upgrade
  • 15:45 bd808: Rebuilding all Docker containers again because I failed to actually update the build server git clone properly last time I did this
  • 05:23 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster (take 2)
  • 04:41 bd808: Rebuilding all Docker images to pick up webservice-python-bootstrap changes

2020-01-23

  • 23:38 bd808: Halted tools-k8s-worker build script after first instance (tools-k8s-worker-10) stuck in "scheduling" state for 20 minutes
  • 23:16 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster
  • 05:15 bd808: Building tools-elastic-04
  • 04:39 bd808: wmcs-openstack quota set --instances 192
  • 04:36 bd808: wmcs-openstack quota set --cores 768 --ram 1536000

2020-01-22

  • 12:43 arturo: for the record, issue with tools-worker-1016 was memory exhaustion apparently
  • 12:35 arturo: hard-reboot tools-worker-1016 (not responding to even console access)

2020-01-21

  • 19:25 bstorm_: hard rebooting tools-sgeexec-0913/14/35 because they aren't even on the network
  • 19:17 bstorm_: depooled and rebooted tools-sgeexec-0914 because it was acting funny
  • 18:30 bstorm_: depooling and rebooting tools-sgeexec-[0911,0913,0919,0921,0924,0931,0933,0935,0939,0941].tools.eqiad.wmflabs
  • 17:21 bstorm_: rebooting toolschecker to recover stale nfs handle

2020-01-16

  • 23:54 bstorm_: rebooting tools-docker-builder-06 because there are a couple running containers that don't want to die cleanly
  • 23:45 bstorm_: rebuilding docker containers to include new webservice version (0.58)
  • 23:41 bstorm_: deployed toollabs-webservice 0.58 to everything that isn't a container
  • 16:45 bstorm_: ran configurator to set the gridengine web queues to `rerun FALSE` T242397

2020-01-14

  • 15:29 bstorm_: failed the gridengine master back to the master server from the shadow
  • 02:23 andrewbogott: rebooting tools-paws-worker-1006 to resolve hangs associated with an old NFS failure

2020-01-13

  • 17:48 bd808: Running `puppet ca destroy` for each unsigned cert on tools-puppetmaster-01 (sketched below) (T242642)
  • 16:42 bd808: Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. T242559
  • 16:33 bd808: Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. T242559
  • 16:31 bd808: Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. T242559
  • 16:26 bd808: Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. T242559
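
A hedged sketch of that cert cleanup; the parsing of `puppet cert list` output (pending requests with quoted hostnames) is an assumption.

```bash
# Hedged sketch of destroying the unsigned cert requests (T242642).
for cert in $(sudo puppet cert list 2>/dev/null | awk -F'"' '{print $2}'); do
    sudo puppet ca destroy "$cert"
done
```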

2020-01-12

  • 22:31 Krenair: same on -13 and -14
  • 22:28 Krenair: same on -8
  • 22:18 Krenair: same on -7
  • 22:11 Krenair: Did usual new instance creation puppet dance on tools-k8s-worker-6, /data/project got created

2020-01-11

  • 01:33 bstorm_: updated toollabs-webservice package to 0.57, which should allow persisting mem and cpu in manifests with burstable qos.

2020-01-10

  • 23:31 bstorm_: updated toollabs-webservice package to 0.56
  • 15:45 bstorm_: depooled tools-paws-worker-1013 to reboot because I think it is the last tools server with that mount issue (I hope)
  • 15:35 bstorm_: depooling and rebooting tools-worker-1016 because it still had the leftover mount problems
  • 15:30 bstorm_: git stash-ing local puppet changes in hopes that arturo has that material locally, and it doesn't break anything to do so

2020-01-09

  • 23:35 bstorm_: depooled tools-sgeexec-0939, which isn't acting right, and rebooting it
  • 18:26 bstorm_: re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs T242353
  • 18:25 bstorm_: re-joining the k8s nodes to the cluster one at a time to rotate the certs T242353
  • 18:06 bstorm_: rebooting tools-paws-master-01 T242353
  • 17:46 bstorm_: refreshing the paws cluster's entire x509 environment T242353

2020-01-07

  • 22:40 bstorm_: rebooted tools-worker-1007 to recover it from disk full and general badness
  • 16:33 arturo: deleted pod metrics/cadvisor-5pd46 by hand because prometheus was having issues scraping it
  • 15:46 bd808: Rebooting tools-k8s-worker-[6-14]
  • 15:35 bstorm_: changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster T242067
  • 14:02 arturo: `root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server` (T241853)
  • 13:33 arturo: upload docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0 copied from quay.io/coreos/kube-state-metrics:v1.8.0 (T241853)
  • 13:31 arturo: upload docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6 copied from k8s.gcr.io/metrics-server-amd64:v0.3.6 (mirroring sketched below) (T241853)
  • 13:23 arturo: [new k8s] doing changes to kube-state-metrics and metrics-server trying to relocate them to the 'metrics' namespace (T241853)
  • 05:28 bd808: Creating tools-k8s-worker-[6-14] (again)
  • 05:20 bd808: Deleting busted tools-k8s-worker-[6-14]
  • 05:02 bd808: Creating tools-k8s-worker-[6-14]
  • 00:26 bstorm_: repooled tools-sgewebgrid-lighttpd-0919
  • 00:17 bstorm_: repooled tools-sgewebgrid-lighttpd-0918
  • 00:15 bstorm_: moving tools-sgewebgrid-lighttpd-0918 and -0919 to cloudvirt1004 from cloudvirt1029 to rebalance load
  • 00:02 bstorm_: depooled tools-sgewebgrid-lighttpd-0918 and 0919 to move to cloudvirt1004 to improve spread
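
A hedged sketch of mirroring those upstream images into the Toolforge registry, assuming a plain pull/tag/push from a docker-capable host.

```bash
# Hedged sketch of the image mirroring (T241853).
docker pull k8s.gcr.io/metrics-server-amd64:v0.3.6
docker tag  k8s.gcr.io/metrics-server-amd64:v0.3.6 \
            docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6
docker push docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6

docker pull quay.io/coreos/kube-state-metrics:v1.8.0
docker tag  quay.io/coreos/kube-state-metrics:v1.8.0 \
            docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0
docker push docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0
```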

2020-01-06

  • 23:40 bd808: Deleted tools-sgewebgrid-lighttpd-09{0[1-9],10}
  • 23:36 bd808: Shutdown tools-sgewebgrid-lighttpd-09{0[1-9],10}
  • 23:34 bd808: Decommissioned tools-sgewebgrid-lighttpd-09{0[1-9],10}
  • 23:13 bstorm_: Repooled tools-sgeexec-0922 because I don't know why it was depooled
  • 23:01 bd808: Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
  • 22:58 bd808: Depooling tools-sgewebgrid-lighttpd-090[2-9]
  • 22:57 bd808: Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
  • 21:07 bd808: Restarted kube2proxy on tools-proxy-05 to try and refresh admin tool's routes
  • 18:54 bstorm_: edited /etc/fstab to remove NFS and unmounted the NFS volumes on tools-k8s-haproxy-1 (see the sketch below) T241908
  • 18:49 bstorm_: edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 T241908
  • 18:47 bstorm_: added mount_nfs=false to tools-k8s-haproxy puppet prefix T241908
  • 18:24 bd808: Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
  • 16:42 bstorm_: failed sge-shadow-master back to the main grid master
  • 16:42 bstorm_: Removed files for old S1tty that wasn't working on sge-grid-master
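
A hedged sketch of dropping NFS from the haproxy nodes; the fstab filter and mount points are assumptions based on the usual Toolforge layout.

```bash
# Hedged sketch of removing the NFS mounts (T241908); paths are assumptions.
sudo sed -i '/\/mnt\/nfs\//d' /etc/fstab
sudo umount -l /mnt/nfs/labstore-secondary-tools-project \
               /mnt/nfs/labstore-secondary-tools-home
# mount_nfs: false was also set on the tools-k8s-haproxy puppet prefix so that
# puppet does not re-create the mounts.
```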

2020-01-04

  • 18:11 bd808: Shutdown tools-worker-1029
  • 18:10 bd808: kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
  • 18:06 bd808: Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
  • 16:54 bstorm_: moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
  • 16:47 bstorm_: moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
  • 16:16 bd808: Draining tools-worker-10{05,12,28} due to hardware errors (T241884)
  • 16:13 arturo: moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:11 arturo: moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:09 arturo: moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:08 arturo: moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:07 arturo: moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:06 arturo: moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:04 arturo: moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:02 arturo: moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241873)

2020-01-03

  • 16:48 bstorm_: updated the ValidatingWebhookConfiguration for the ingress admission controller to the working settings
  • 11:51 arturo: [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 (T237643)
  • 11:21 arturo: upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for T237643
  • 03:04 bd808: Really rebuilding all {jessie,stretch,buster}-sssd images. Last time I forgot to actually update the git clone.
  • 00:11 bd808: Rebuilding all stretch-sssd Docker images to pick up busybox

2020-01-02

  • 23:54 bd808: Rebuilding all buster-sssd Docker images to pick up busybox