Nova Resource:Tools/SAL/Archive 4
2021-12-31
- 19:48 taavi: reset grid error status on webgrid-lighttpd@tools-sgewebgrid-lighttpd-0915
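Many entries in this log reset grid queue error states like the one above. A minimal sketch of how that is typically done with standard Son of Grid Engine tooling on the grid master or a submit host; the queue instance name is taken from the entry above, the job id is hypothetical, and the exact invocation used is not recorded:

```bash
# Show queue instances in error state along with the reason they were marked.
qstat -f -explain E

# Clear the error ("E") state on a specific queue instance so it accepts jobs again.
sudo qmod -cq 'webgrid-lighttpd@tools-sgewebgrid-lighttpd-0915'

# Clearing an individual job stuck in error state works similarly (job id is a placeholder).
sudo qmod -cj 1234567
```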
2021-12-28
- 20:31 taavi: restarting acme-chief to debug T298353
2021-12-24
- 07:58 majavah: cleared error state from 4 webgrid-lighttpd nodes
2021-12-23
- 22:57 bd808: Marked tool stang for deletion (T296496)
- 22:57 bd808: Marked tool wplist for deletion (T295523)
- 22:56 bd808: Marked tool antigng for deletion (T294708)
- 22:55 bd808: Marked tool ytrb for deletion (T291909)
- 22:54 bd808: Marked tool geolink for deletion (T291801)
- 22:54 bd808: Marked tool wmf-task-samtar for deletion (T286622)
- 22:53 bd808: Marked tool coi for deletion (T286619)
- 22:52 bd808: Marked tool abusereport for deletion (T286618)
- 22:51 bd808: Marked tool chi for deletion (T282702)
- 22:43 bd808: Marked tool algo-news for deletion (T280444)
- 22:43 bd808: Marked tool ircclient for deletion (T279209)
- 22:42 bd808: Marked tool vagrant-test for deletion (T279209)
- 22:42 bd808: Marked tool vagrant2 for deletion (T279209)
- 22:42 bd808: Marked tool testwiki for deletion (T279209)
- 22:41 bd808: Marked tool zoranzoki21wiki for deletion (T279209)
- 22:41 bd808: Marked tool zoranzoki21bot for deletion (T279209)
- 22:40 bd808: Marked tool filesearch for deletion (T279209)
- 22:40 bd808: Marked tool sourceror for deletion (T275690)
- 22:39 bd808: Marked tool move for deletion (T270535)
- 22:38 bd808: Marked tool hastagwatcher for deletion (T270534)
- 22:37 bd808: Marked tool outreacy-wikicv for deletion (T270532)
- 22:36 bd808: Marked tool dawiki for deletion (T270105)
- 22:33 bd808: Marked tool rubinbot3 for deletion (T266963)
- 22:32 bd808: Marked tool rubinbot2 for deletion (T266963)
- 22:32 bd808: Marked tool rubinbot for deletion (T266963)
- 22:31 bd808: Marked tool google-drive-photos-to-commons for deletion (T259870)
- 22:30 bd808: Marked tool wdqs-wmil-tutorial for deletion (T258394)
- 22:29 bd808: Marked tool base-encode for deletion (T258340)
- 22:28 bd808: Marked tool wikidata-exports for deletion (T255192)
- 22:27 bd808: Marked tool oar for deletion (T254044)
- 22:27 bd808: Marked tool wmde-uca-test for deletion (T249089)
- 22:26 bd808: Marked tool fastilybot for deletion (T248248)
- 22:25 bd808: Marked tool mtc-rest for deletion (T248247)
- 22:24 bd808: Marked tool squirrelnest-upf for deletion (T248235)
- 22:23 bd808: Marked tool wikibase-databridge-storybook for deletion (T245026)
- 22:22 bd808: Marked tool draft-uncategorize-script for deletion (T236646)
- 22:21 bd808: Marked tool maplink-generator for deletion (T231766)
- 22:20 bd808: Marked tool rhinosf1-afdclose for deletion (T225838)
- 22:18 bd808: Marked tool asdf for deletion (T223699)
- 22:17 bd808: Marked tool basyounybot for deletion (T218524)
- 22:14 bd808: Marked tool design-research-methods for deletion (T218523)
- 22:12 bd808: Marked tool he-wiktionary-rule-checker for deletion (T218500)
- 22:11 bd808: Marked tool outofband for deletion (T218382)
- 22:10 bd808: Marked tool sync-badges for deletion (T218187)
- 22:09 bd808: Marked tool grafana-json-datasource for deletion (T218075)
- 22:08 bd808: Marked tool gsociftttdev for deletion (T217478)
- 22:04 bd808: Marked tool wikipagestats for deletion (T216970)
- 22:02 bd808: Marked tool bd808-test4 for deletion (T216440)
- 22:02 bd808: Marked tool bd808-test3 for deletion (T216439)
- 22:01 bd808: Marked tool tei2wikitext for deletion (T216427)
- 22:00 bd808: Marked tool projetpp for deletion (T216427)
- 22:00 bd808: Marked tool ppp-sparql for deletion (T216427)
- 21:59 bd808: Marked tool platypus-qa for deletion (T216427)
- 21:59 bd808: Marked tool creatorlinks for deletion (T216427)
- 21:58 bd808: Marked tool corenlp for deletion (T216427)
- 21:57 bd808: Marked tool strikertest2017-08-23 for deletion (T216211)
- 21:46 bd808: Marked tool languagetool for deletion (T215734)
- 21:45 bd808: Marked tool gdk-artists-research for deletion (T214495)
- 21:44 bd808: Marked tool phragile for deletion (T214495)
- 21:44 bd808: Marked tool commons-mass-upload for deletion (T214495)
- 21:43 bd808: Marked tool wmde-uca-test for deletion (T214495)
- 21:43 bd808: Marked tool wmde-editconflict-test for deletion (T214495)
- 21:42 bd808: Marked tool wmde-inline-movedparagraphs for deletion (T214495)
- 21:41 bd808: Marked tool prometheus for deletion (T211972)
- 21:40 bd808: Marked tool quentinv57-tools for deletion (T210829)
- 21:38 bd808: Marked tool addbot for deletion (T208427)
- 21:37 bd808: Marked tool addshore-dev for deletion (T208427)
- 21:37 bd808: Marked tool addshore for deletion (T208427)
- 21:36 bd808: Marked tool miraheze-notifico for deletion (T203124)
- 21:34 bd808: Marked tool mh-signbot for deletion (T202946)
- 21:33 bd808: Marked tool messenger-chatbot for deletion (T198808)
- 21:22 bd808: Marked tool harvesting-data-rafinery for deletion (T197214)
- 21:21 bd808: Marked tool miraheze-discord-irc for deletion (T192410)
- 21:20 bd808: Marked tool sau226-wiki-bug-testing for deletion (T188608)
- 21:18 bd808: Marked tool kmlexport-cswiki for deletion (T186916)
- 21:17 bd808: Marked tool www-portal-builder for deletion (T182140)
- 21:15 bd808: Marked tool recoin-sample for deletion (T181541)
- 21:13 bd808: Marked tool wlm-jury-yarl for deletion (T172590)
- 21:12 bd808: Marked tool wlm-jury-at for deletion (T172590)
- 19:43 bd808: Marked tool yunomi for deletion (T170070)
- 19:42 bd808: Marked tool datbotcommons for deletion (T164662)
- 19:40 bd808: Marked tool ut-iw-bot for deletion (T158303)
- 19:39 bd808: Marked tool hujibot for deletion (T157916)
- 19:37 bd808: Marked tool contributions-summary for deletion (T157749)
- 19:35 bd808: Marked tool morebots for deletion (T157399)
- 19:32 bd808: Marked tool rcm for deletion (T136216)
2021-12-20
- 18:01 majavah: deploying calico v3.21.0 (T292698)
- 12:17 arturo: running `aborrero@tools-sgegrid-master:~$ sudo grid-configurator --all-domains` after merging a few patches to the script to handle dead config
2021-12-14
- 09:46 majavah: testing delete-crashing-pods emailer component with a test tool T292925
2021-12-08
- 05:21 andrewbogott: moving tools-k8s-etcd-13 to cloudvirt1028
2021-12-07
- 11:11 arturo: updated member roles in github.com/toolforge: remove brooke as owner, add dcaro
2021-12-06
- 13:23 majavah: root@toolserver-proxy-01:~# systemctl restart apache2.service # working around T293826
2021-12-04
- 12:18 majavah: deploying delete-crashing-pods in dry run mode T292925
2021-11-28
- 17:46 andrewbogott: moving tools-k8s-etcd-13 to cloudvirt1020; cloudvirt1018 (its old host) has a degraded raid which is affecting performance
2021-11-19
- 13:16 majavah: manually add 3 project members after ldap issues were fixed
2021-11-16
- 12:31 majavah: uploading calico 3.21.0 to the internal docker registry T292698
- 10:28 majavah: deploying maintain-kubeusers changes T286857
2021-11-11
- 10:50 arturo: add user `srv-networktests` as project user (T294955)
2021-11-05
- 19:18 majavah: deploying registry-admission changes
2021-10-29
- 23:58 andrewbogott: deleting all files older than 14 days in /srv/tools/shared/tools/project/.shared/cache
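A sketch of one common way to do the age-based cleanup logged above; the path and 14-day threshold come from the entry, but the actual invocation used is not recorded:

```bash
# Delete regular files not modified in the last 14 days under the shared cache directory.
sudo find /srv/tools/shared/tools/project/.shared/cache -type f -mtime +14 -delete
```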
2021-10-28
- 12:42 arturo: set `allow-snippet-annotations: "false"` for ingress-nginx (T294330)
2021-10-26
- 18:00 majavah: deleting legacy ingresses for tools.wmflabs.org urls
- 12:26 majavah: deploy ingress-admission updates
- 12:11 majavah: deploy ingress-nginx v1.0.4 / chart v4.0.6 on toolforge T292771
2021-10-25
- 14:33 majavah: copy nginx-ingress controller v1.0.4 to internal registry T292771
- 11:32 majavah: depool tools-sgeexec-0910 T294228
- 11:13 majavah: removed tons of duplicate qw jobs across multiple tools
2021-10-22
- 15:35 majavah: remove "^tools-k8s-master-[0-9]+\.tools\.eqiad\.wmflabs$" from authorized_regexes for the main certificate
- 15:35 majavah: add mail.tools.wmcloud.org to the tools mail tls certificate alternative names
2021-10-21
- 09:48 majavah: deploying toolforge-webservice 0.79
2021-10-20
- 15:41 majavah: removing toollabs-webservice from grid exec and master nodes where it's not needed and not managed by puppet
- 12:51 majavah: rolling out toolforge-webservice 0.78 T292706 T282975 T276626
2021-10-15
- 15:01 arturo: add updated ingress-nginx docker image in the registry (v1.0.1) for T293472
2021-10-07
- 09:13 majavah: disabling settings api, now that all pod presets are gone T279106
- 08:00 majavah: removing all pod presets T279106
- 05:44 majavah: deploying fix for T292672
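The pod preset removal for T279106 logged above maps onto plain kubectl operations, assuming the settings.k8s.io/v1alpha1 API was still enabled at that point; the namespace and preset name below are placeholders, not a record of what was run:

```bash
# Inventory all PodPresets across tool namespaces before removing them.
kubectl get podpresets --all-namespaces

# Delete them individually; the namespace/name pair here is illustrative only.
kubectl delete podpreset mount-toolforge-vols -n tool-example
```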
2021-10-06
- 06:46 majavah: taavi@toolserver-proxy-01:~$ sudo systemctl restart apache2.service # see if it helps with toolserver.org ssl alerts
2021-10-03
- 21:31 bstorm: rebuilding buster containers since they are also affected T291387 T292355
- 21:29 bstorm: rebuilt stretch containers for potential issues with LE cert updates T291387
2021-10-01
- 21:59 bd808: clush -w @all -b 'sudo sed -i "s#mozilla/DST_Root_CA_X3.crt#!mozilla/DST_Root_CA_X3.crt#" /etc/ca-certificates.conf && sudo update-ca-certificates' for T292289
2021-09-30
- 13:43 majavah: cleaning up unused kubernetes ingress objects for tools.wmflabs.org urls T292105
2021-09-29
- 22:39 bstorm: finished deploy of the toollabs-webservice 0.77 and updating labels across the k8s cluster to match
- 22:26 bstorm: pushing toollabs-webservice 0.77 to tools releases
- 21:46 bstorm: pushing toollabs-webservice 0.77 to toolsbeta
2021-09-27
- 16:19 majavah: deploy volume-admission fix for some volumes not getting mounted in containers
- 13:01 majavah: publish jobutils and misctools 0.43 T286072
- 11:34 majavah: disabling pod preset controller T279106
2021-09-23
- 17:20 majavah: deploying new maintain-kubeusers for lack of podpresets T279106
2021-09-22
- 18:06 bstorm: launching tools-nfs-test-client-01 to run a "fair" test battery against T291406
- 11:37 dcaro: controlled undrain tools-k8s-worker-53 (T291546)
- 08:57 majavah: drain tools-k8s-worker-53
2021-09-20
- 12:44 majavah: deploying volume-admission to tools, should not affect anything yet T279106
2021-09-15
- 08:08 majavah: update tools-manifest to 0.24
2021-09-14
- 10:36 arturo: add toolforge-jobs-framework-cli v5 to aptly buster-tools/toolsbeta
2021-09-13
- 08:57 arturo: cleared grid queues error states (T290844)
- 08:55 arturo: repooling sgeexec-0907 (T290798)
- 08:14 arturo: rebooting sgeexec-0907 (T290798)
- 08:12 arturo: depool sgeexec-0907 (T290798)
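A sketch of the depool/reboot/repool cycle above using plain gridengine commands; Toolforge also has wrapper scripts for this, so treat these commands as an illustration under that assumption rather than a record of what was run:

```bash
# Disable all queue instances on the node so no new jobs land there.
sudo qmod -d '*@tools-sgeexec-0907'

# Once running jobs have drained, reboot the node, then re-enable its queues.
sudo qmod -e '*@tools-sgeexec-0907'
```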
2021-09-11
- 08:51 majavah: depool tools-sgeexec-0907
2021-09-10
- 23:26 bstorm: cleared error state for tools-sgeexec-0907.tools.eqiad.wmflabs
- 12:00 arturo: shutdown tools-package-builder-03 (buster), leave -04 online (bullseye)
- 09:35 arturo: live-hacking tools puppetmaster with a couple of ops/puppet changes
- 07:54 arturo: created bullseye VM tools-package-builder-04 (T273942)
2021-09-09
- 16:20 arturo: 70017ec0ac root@tools-k8s-control-3:~# kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml
2021-09-07
- 15:27 majavah: rolling out python3-prometheus-client updates
- 14:41 majavah: manually removing some absented but still present crontabs to stop root@ spam
2021-09-06
- 16:31 arturo: deploying jobs-framework-cli v4
- 16:22 arturo: deploying jobs-framework-api 3228d97
2021-09-03
- 22:36 bstorm: backfilling quotas in screen for T286784
- 12:49 majavah: deploying new tools-manifest version
2021-09-02
- 01:02 bstorm: deployed new version of maintain-kubeusers with new count quotas for new tools T286784
2021-08-20
- 19:10 majavah: rebuilding node12-sssd/{base,web} to use debian packaged npm 7
- 18:42 majavah: rebuilding php74-sssd/{base,web} to use composer 2
2021-08-18
- 21:32 bstorm: rebooted tools-sgecron-01 due to RAM filling up and killing everything
- 16:34 bstorm: deleting the sssd cache on tools-sgecron-01 to fix a peculiar passwd db issue
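The sssd cache deletion above does not record the exact commands; a hedged sketch of the two usual approaches on a node like tools-sgecron-01:

```bash
# Option 1: invalidate all cached sssd entries in place.
sudo sss_cache -E

# Option 2: stop sssd, remove the on-disk cache databases, and start it again.
sudo systemctl stop sssd
sudo rm -f /var/lib/sss/db/*.ldb
sudo systemctl start sssd
```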
2021-08-16
- 17:00 majavah: remove and re-add toollabs-webservice 0.75 on stretch-toolsbeta repository
- 15:45 majavah: reset sul account mapping on striker for developer account "DutchTom" T288969
- 14:19 majavah: building node12 images - T284590 T243159
2021-08-15
- 17:30 majavah: deploying updated jobs-framework-api container list to include bullseye images
- 17:22 majavah: finished initial build of images: php74, jdk17, python39, ruby27 - T284590
- 16:51 majavah: starting build of initial bullseye based images - T284590
- 16:44 majavah: tagged and building toollabs-webservice 0.76 with bullseye images defined T284590
- 15:14 majavah: building tools-webservice 0.74 (currently live version) to bullseye-tools and bullseye-toolsbeta
2021-08-12
- 16:59 bstorm: deployed updated manifest for ingress-admission
- 16:45 bstorm: restarted ingress admission pods in tools after testing in toolsbeta
- 16:27 bstorm: updated the docker image for docker-registry.tools.wmflabs.org/ingress-admission:latest
- 16:22 bstorm: rebooting tools-docker-registry-05 after exchanging uids for puppet and docker-registry
2021-08-07
- 05:59 majavah: restart nginx on toolserver-proxy-01 to see if that helps with the flapping icinga certificate expiry check
2021-08-06
- 16:17 bstorm: failed over to tools-docker-registry-06 (which has more space) T288229
- 00:43 bstorm: set up sync between the new registry host and the existing one T288229
- 00:21 bstorm: provisioning second docker registry server to rsync to (120GB disk and fairly large server) T288229
2021-08-05
- 23:50 bstorm: rebooting the docker registry T288229
- 23:04 bstorm: extended docker registry volume to 120GB T288229
2021-07-29
- 18:04 majavah: reset sul account mapping on striker for developer account "Derek Zax" T287369
2021-07-28
- 21:33 majavah: add mdipietro as projectadmin and to sudo policy T287287
2021-07-27
- 16:20 bstorm: built new php images with python2 on board T287421
- 00:04 bstorm: deploy a version of the php3.7 web image that includes the python2 package with tag :testing T287421
2021-07-26
- 17:37 bstorm: repooled the whole set of ingress workers after upgrades T280340
- 16:37 bstorm: removing tools-k8s-ingress-4 from active ingress nodes at the proxy T280340
2021-07-23
- 07:15 majavah: restart nginx on tools-static-14 to see if it helps with fontcdn issues
2021-07-22
- 23:35 bstorm: deleted tools-sgebastion-09 since it has been shut off since March anyway
- 15:32 arturo: re-deploying toolforge-jobs-framework-api
- 15:30 arturo: pushed new docker image on the registry for toolforge-jobs-framework-api 4d8235b (T287077)
2021-07-21
- 20:01 bstorm: deployed new maintain-kubeusers to toolforge T285011
- 19:55 bstorm: deployed new rbac for maintain-kubeusers changes T285011
- 17:10 majavah: deploying calico v3.18.4 T280342
- 14:35 majavah: updating systemd on toolforge stretch bastions T287036
- 11:59 arturo: deploying jobs-framework-api 07346d7 (T286108)
- 11:04 arturo: enabling TTLAfterFinished feature gate on kubeadm live configmap (T286108)
- 11:01 arturo: enabling TTLAfterFinished feature gate on static pod manifests on /etc/kubernetes/manifests/kube-{apiserver,controller-manager}.yaml in all 3 control nodes (T286108)
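For the TTLAfterFinished change above, the end state on each control node is a `--feature-gates=TTLAfterFinished=true` flag in both static pod manifests; a quick way to verify after editing, since the kubelet restarts static pods when the manifest files change (commands are illustrative, not a record of what was run):

```bash
# Confirm the flag is present in both static pod manifests on each control node.
sudo grep -n 'feature-gates' \
  /etc/kubernetes/manifests/kube-apiserver.yaml \
  /etc/kubernetes/manifests/kube-controller-manager.yaml

# Check that the control plane static pods came back up with the change.
kubectl -n kube-system get pods -l tier=control-plane
```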
2021-07-20
- 18:42 majavah: deploying systemd security tools on toolforge public stretch machines T287004
- 17:45 arturo: pushed new toolforge-jobs-framework-api docker image into the registry (3a6ae38) (T286126)
- 17:37 arturo: added toolforge-jobs-framework-cli v3 to aptly buster-tools and buster-toolsbeta
- 13:25 majavah: apply buster systemd security updates
2021-07-19
- 23:24 bstorm: applied matchPolicy: equivalent to tools ingress validation controller T280360
- 16:43 bstorm: cleared queue error state caused by excessive resource use by topicmatcher T282474
2021-07-16
- 14:04 arturo: deployed jobs-framework-api 42b7a88 (T286132)
- 11:57 arturo: added toollabs-webservice_0.75_all to jessie-tools aptly repo (T286003)
- 11:52 arturo: created `jessie-tools` aptly repository on tools-services-05 (T286003)
2021-07-15
- 16:12 arturo: deploy toolforge-jobs-framework-api git version d85d93e (T285944, T286107, T285979, T286485, T286107)
- 15:55 arturo: added toolforge-jobs-framework-cli_2_all.deb to buster-{tools,toolsbeta} (T285944)
2021-07-14
- 23:29 bstorm: mounted nfs on tools-services-05 and backing up aptly to NFS dir T286003
- 09:17 majavah: copying calico 3.18.4 images from docker hub to docker-registry.tools.wmflabs.org T280342
2021-07-12
- 16:56 bstorm: deleted job 4720371 due to LDAP failure
- 16:51 bstorm: cleared the E state from two job queues
2021-07-02
- 18:46 bstorm: cleared error state for tools-sgeexec-0940.tools.eqiad.wmflabs
2021-07-01
- 22:08 bstorm: releasing webservice 0.75
- 17:03 andrewbogott: rebooting tools-k8s-worker-[31,33,35,44,49,51,57-58,70].tools.eqiad1.wikimedia.cloud
- 16:47 bstorm: remounted scratch everywhere...but mostly tools T224747
- 15:47 arturo: rebased labs/private.git
- 11:04 arturo: added toolforge-jobs-framework-cli_1_all.deb to aptly buster-tools,buster-toolsbeta
- 10:34 arturo: refreshed jobs-api deployment
2021-06-29
- 21:58 bstorm: clearing one errored queue and a stack of discarded jobs
- 20:11 majavah: toolforge kubernetes upgrade complete T280299
- 17:03 majavah: starting toolforge kubernetes 1.18 upgrade - T280299
- 16:17 arturo: deployed jobs-framework-api in the k8s cluster
- 15:34 majavah: remove duplicate definitions from tools-clushmaster-02 /root/.ssh/known_hosts
- 15:12 arturo: livehacking puppetmaster for T283238
- 10:24 dcaro: running puppet on the buster bastions after 20000 minutes failing... might break something
2021-06-15
- 19:02 bstorm: cleared error status from a few queues
- 16:15 majavah: deleting unused shutdown nodes: tools-checker-03 tools-k8s-haproxy-1 tools-k8s-haproxy-2
2021-06-14
- 22:21 bstorm: push docker-registry.tools.wmflabs.org/toolforge-python37-sssd-web:testing to test staged os.execv (and other patches) using toolsbeta toollabs-webservice version 0.75 T282975
2021-06-13
- 08:15 majavah: clear grid error state from tools-sgeexec-0907, tools-sgeexec-0916, tools-sgeexec-0940
2021-06-12
- 14:39 majavah: remove nonexistent tools-prometheus-04 and add tools-prometheus-05 to hiera key "prometheus_nodes"
- 13:53 majavah: create empty bullseye-{tools,toolsbeta} repositories on tools-services-05 aptly
2021-06-10
- 17:38 majavah: clear error state from tools-sgeexec-0907, task@tools-sgeexec-0939
2021-06-09
- 13:57 majavah: clear error state from exec nodes tools-sgeexec-0913, tools-sgeexec-0936, task@tools-sgeexec-0940
2021-06-07
- 18:39 bstorm: cleaning up more error conditions on grid queues
- 17:42 majavah: delete `ingress-nginx` namespace and related objects T264221
- 17:37 majavah: remove tools-k8s-ingress-[1-3] from kubernetes, follow-up to https://sal.toolforge.org/log/nd7v2HkB1jz_IcWuCX5M T264221
2021-06-04
- 21:30 bstorm: deleting "tools-k8s-ingress-3", "tools-k8s-ingress-2", "tools-k8s-ingress-1" T264221
- 21:21 bstorm: cleared error state from 4 grid queues
2021-06-03
- 18:27 majavah: renew prometheus kubernetes certificate T280301
- 17:06 majavah: renew admission webhook certificates T280301
2021-06-01
- 10:10 majavah: properly clean up deleted VMs tools-k8s-haproxy-[1,2], tools-checker-03 from puppet after using the wrong fqdn the first time
- 09:54 majavah: clear error state from tools-sgeexec-0913, tools-sgeexec-0950
2021-05-30
- 18:58 majavah: clear grid error state from 14 queues
2021-05-27
- 18:03 bstorm: adjusted profile::wmcs::kubeadm::etcd_latency_ms from 30 back to the default (10)
- 16:04 bstorm: cleared error state from several exec node queues
- 14:49 andrewbogott: swapping in three new etcd nodes with local storage: tools-k8s-etcd-13,14,15
2021-05-24
- 10:36 arturo: rebased labs/private.git after merge conflict
- 06:49 majavah: remove scfc kubernetes admin access after bd808 removed tools.admin membership to avoid maintain-kubeusers crashes when it expires
2021-05-22
- 14:47 majavah: manually remove jeh admin certificates and the corresponding entry from the maintain-kubeusers configmap T282725
- 14:32 majavah: manually remove valhallasw and yuvipanda admin certificates and their entries from the configmap, and restart maintain-kubeusers pod T282725
- 02:51 bd808: Restarted nginx on tools-static-14 to see if that clears up the fontcdn 502 errors
2021-05-21
- 17:06 majavah: unpool tools-k8s-ingress-[4-6]
- 17:06 majavah: repool tools-k8s-ingress-6
- 17:02 majavah: repool tools-k8s-ingress-4 and -5
- 16:59 bstorm: upgrading the ingress-gen2 controllers to release 3 to capture new RAM/CPU limits
- 16:43 bstorm: resize tools-k8s-ingress-4 to g3.cores4.ram8.disk20
- 16:43 bstorm: resize tools-k8s-ingress-6 to g3.cores4.ram8.disk20
- 16:40 bstorm: resize tools-k8s-ingress-5 to g3.cores4.ram8.disk20
- 16:04 majavah: rollback kubernetes ingress update from front proxy
- 06:52 Majavah: pool tools-k8s-ingress-6 and depool ingress-[2,3] T264221
2021-05-20
- 17:05 Majavah: pool tools-k8s-ingress-5 as an ingress node, depool ingress-1 T264221
- 16:31 Majavah: pool tools-k8s-worker-4 as an ingress node T264221
- 15:17 Majavah: trying to install ingress-nginx via helm again after adjusting security groups T264221
- 15:15 Majavah: move tools-k8s-ingress-[5-6] from "tools-k8s-full-connectivity" to "tools-new-k8s-full-connectivity" security group T264221
2021-05-19
- 12:15 Majavah: rollback ingress-nginx-gen2
- 11:09 Majavah: deploy helm-based nginx ingress controller v0.46.0 to ingress-nginx-gen2 namespace T264221
- 10:44 Majavah: create tools-k8s-ingress-[4-6] T264221
2021-05-16
- 16:52 Majavah: clear error state from tools-sgeexec-0905 tools-sgeexec-0907 tools-sgeexec-0936 tools-sgeexec-0941
2021-05-14
- 19:18 bstorm: adjusting the rate limits for bastions nfs_write upward a lot to make NFS writes faster now that the cluster is finally using 10Gb on the backend and frontend T218338
- 16:55 andrewbogott: rebooting toolserver-proxy-01 to clear up stray files
- 16:47 andrewbogott: deleting log files older than 14 days on toolserver-proxy-01
2021-05-12
- 19:45 bstorm: cleared error state from some queues
- 19:05 Majavah: remove phamhi-binding phamhi-view-binding cluster role bindings T282725
- 19:04 bstorm: deleted the maintain-kubeusers pod to get it up and running fast T282725
- 19:03 bstorm: deleted phamhi from admin configmap in maintain-kubeusers T282725
2021-05-11
- 17:17 Majavah: shutdown and delete tools-checker-03 T278540
- 17:14 Majavah: move floating ip 185.15.56.61 to tools-checker-04
- 17:12 Majavah: add tools-checker-04 as a grid submit host T278540
- 16:58 Majavah: add tools-checker-04 to toollabs::checker_hosts hiera key T278540
- 16:49 Majavah: creating tools-checker-04 with buster T278540
- 16:32 Majavah: carefully shutdown tools-k8s-haproxy-1 T252239
- 16:29 Majavah: carefully shutdown tools-k8s-haproxy-2 T252239
2021-05-10
- 22:58 bstorm: cleared error state on a grid queue
- 22:58 bstorm: setting `profile::wmcs::kubeadm::docker_vol: false` on ingress nodes
- 15:22 Majavah: change k8s.svc.tools.eqiad1.wikimedia.cloud. to point to the tools-k8s-haproxy-keepalived-vip address 172.16.6.113 (T252239)
- 15:06 Majavah: carefully rolling out keepalived to tools-k8s-haproxy-[3-4] while making sure [1-2] do not have changes
- 15:03 Majavah: clear all error states caused by overloaded exec nodes
- 14:57 arturo: allow tools-k8s-haproxy-[3-4] to use the tools-k8s-haproxy-keepalived-vip address (172.16.6.113) (T252239)
- 12:53 Majavah: creating tools-k8s-haproxy-[3-4] to rebuild current ones without nfs and with keepalived
2021-05-09
- 06:55 Majavah: clear error state from tools-sgeexec-0916
2021-05-08
- 10:57 Majavah: import docker image k8s.gcr.io/ingress-nginx/controller:v0.46.0 to local registry as docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0 T264221
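The registry import above (and similar copies elsewhere in this log) is a plain pull/tag/push; a minimal sketch using the image names given in the entry:

```bash
# Pull the upstream image, retag it for the Toolforge internal registry, and push it.
docker pull k8s.gcr.io/ingress-nginx/controller:v0.46.0
docker tag  k8s.gcr.io/ingress-nginx/controller:v0.46.0 \
            docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0
docker push docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0
```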
2021-05-07
- 18:07 Majavah: generate and add k8s haproxy keepalived password (profile::toolforge::k8s::haproxy::keepalived_password) to private puppet repo
- 17:15 bstorm: recreated recordset of k8s.tools.eqiad1.wikimedia.cloud as CNAME to k8s.svc.tools.eqiad1.wikimedia.cloud T282227
- 17:12 bstorm: created A record of k8s.svc.tools.eqiad1.wikimedia.cloud pointing at current cluster with TTL of 300 for quick initial failover when the new set of haproxy nodes are ready T282227
- 09:44 arturo: `sudo wmcs-openstack --os-project-id=tools port create --network lan-flat-cloudinstances2b tools-k8s-haproxy-keepalived-vip`
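A sketch of how the DNS records mentioned above (T282227) can be created with the OpenStack Designate CLI, reusing the `wmcs-openstack` wrapper quoted in this log; the A-record target below is a placeholder for the then-current haproxy address, and the exact invocations used are not recorded:

```bash
# A record with a 300s TTL for quick failover (target IP is a placeholder).
sudo wmcs-openstack --os-project-id=tools recordset create \
  tools.eqiad1.wikimedia.cloud. k8s.svc.tools.eqiad1.wikimedia.cloud. \
  --type A --record 192.0.2.10 --ttl 300

# CNAME pointing the legacy name at the service name.
sudo wmcs-openstack --os-project-id=tools recordset create \
  tools.eqiad1.wikimedia.cloud. k8s.tools.eqiad1.wikimedia.cloud. \
  --type CNAME --record k8s.svc.tools.eqiad1.wikimedia.cloud.
```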
2021-05-06
- 14:43 Majavah: clear error states from all currently erroring exec nodes
- 14:37 Majavah: clear error state from tools-sgeexec-0913
- 04:35 Majavah: add own root key to project hiera on horizon T278390
- 02:36 andrewbogott: removing jhedden from sudo roots
2021-05-05
- 19:27 andrewbogott: adding taavi as a sudo root to project toolforge for T278390
2021-05-04
- 15:23 arturo: upgrading exim4-daemon-heavy in tools-mail-03
- 10:47 arturo: rebase & resolve merge conflicts in labs/private.git
2021-05-03
- 16:24 dcaro: started tools-sgeexec-0907, was stuck on initramfs due to an unclean fs (/dev/vda3, root), ran fsck manually fixing all the errors and booted up correctly after (T280641)
- 14:07 dcaro: depooling tools-sgeexec-0908/7 to be able to restart the VMs as they got stuck during migration (T280641)
2021-04-29
- 18:23 bstorm: removing one more etcd node via cookbook T279723
- 18:12 bstorm: removing an etcd node via cookbook T279723
2021-04-27
- 16:40 bstorm: deleted all the errored out grid jobs stuck in queue wait
- 16:16 bstorm: cleared E status on grid queues to get things flowing again
2021-04-26
- 12:17 arturo: allowing more tools into the legacy redirector (T281003)
2021-04-22
- 08:44 Krenair: Removed yuvipanda from roots sudo policy
- 08:42 Krenair: Removed yuvipanda from projectadmin per request
- 08:40 Krenair: Removed yuvipanda from tools.admin per request
2021-04-20
- 22:20 bd808: `clush -w @all -b "sudo exiqgrep -z -i | xargs sudo exim -Mt"`
- 22:19 bd808: `clush -w @exec -b "sudo exiqgrep -z -i | xargs sudo exim -Mt"`
- 21:52 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad1.wikimedia.cloud`. Was using wrong domain name in prior update.
- 21:49 bstorm: tagged the latest maintain-kubeusers and deployed to toolforge (with kustomize changes to rbac) after testing in toolsbeta T280300
- 21:27 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad.wmflabs`. was -2 which is decommed.
- 10:18 dcaro: setting the retention on the tools-prometheus VMs to 250GB (they have 276GB total, leaving some space for online data operations if needed) (T279990)
2021-04-19
- 10:53 dcaro: reverting setting prometheus data source in grafana to 'server'; it can't connect
- 10:51 dcaro: setting prometheus data source in grafana to 'server' to avoid CORS issues
2021-04-16
- 23:15 bstorm: cleaned up all source files for the grid with the old domain name to enable future node creation T277653
- 14:38 dcaro: added 'will get out of space in X days' panel to the dashboard https://grafana-labs.wikimedia.org/goto/kBlGd0uGk (T279990), we got <5 days xd
- 11:35 arturo: running `grid-configurator --all-domains` which basically added tools-sgebastion-10,11 as submit hosts and removed tools-sgegrid-master,shadow as submit hosts
2021-04-15
- 17:45 bstorm: cleared error state from tools-sgeexec-0920.tools.eqiad.wmflabs for a failed job
2021-04-13
- 13:26 dcaro: upgrade puppet and python-wmflib on tools-prometheus-03
- 11:23 arturo: deleted shutoff VM tools-package-builder-02 (T275864)
- 11:21 arturo: deleted shutoff VM tools-sge-services-03,04 (T278354)
- 11:20 arturo: deleted shutoff VM tools-docker-registry-03,04 (T278303)
- 11:18 arturo: deleted shutoff VM tools-mail-02 (T278538)
- 11:17 arturo: deleted shutoff VMs tools-static-12,13 (T278539)
2021-04-11
- 16:07 bstorm: cleared E state from tools-sgeexec-0917 tools-sgeexec-0933 tools-sgeexec-0934 tools-sgeexec-0937 from failures of jobs 761759, 815031, 815056, 855676, 898936
2021-04-08
- 18:25 bstorm: cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for tools-sgegrid-master and tools-sgegrid-shadow using the old fqdns T277653
- 09:24 arturo: allocate & associate floating IP 185.15.56.122 for tools-sgebastion-11, also with DNS A record `dev-buster.toolforge.org` (T275865)
- 09:22 arturo: create DNS A record `login-buster.toolforge.org` pointing to 185.15.56.66 (tools-sgebastion-10) (T275865)
- 09:20 arturo: associate floating IP 185.15.56.66 to tools-sgebastion-10 (T275865)
- 09:13 arturo: created tools-sgebastion-11 (buster) (T275865)
2021-04-07
- 04:35 andrewbogott: replacing the mx record '10 mail.tools.wmcloud.org' with '10 mail.tools.wmcloud.org.' — trying to fix axfr for the tools.wmcloud.org zone
2021-04-06
- 15:16 bstorm: cleared queue state since a few had "errored" for failed jobs.
- 12:59 dcaro: Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)
- 11:45 arturo: upgrading jobutils & misctools to 1.42 everywhere
- 11:39 arturo: cleaning up aptly: old package versions, old repos (jessie, trusty, precise) etc
- 10:31 dcaro: Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs (T267082)
- 10:21 arturo: published jobutils & misctools 1.42 (T278748)
- 10:21 arturo: published jobutils & misctools 1.42
- 10:21 arturo: aptly repo had some weirdness due to the cinder volume: hardlinks created by aptly were broken, solved with `sudo aptly publish --skip-signing repo stretch-tools -force-overwrite`
- 10:07 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
- 10:05 arturo: installed aptly from buster-backports on tools-services-05 to see if that makes any difference with an issue when publishing repos
- 09:53 dcaro: Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs (T267082)
- 08:55 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
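The etcd member add/remove steps above go through the wmcs.toolforge.add_etcd_node cookbook, but underneath they are standard etcdctl membership operations; a rough sketch in which the endpoint, certificate paths, and member ID are placeholders:

```bash
# Talk to the cluster over TLS; cert/key paths stand in for the puppet-managed ones.
export ETCDCTL_API=3
etcdctl --endpoints=https://tools-k8s-etcd-8.tools.eqiad1.wikimedia.cloud:2379 \
  --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/client.pem --key=/etc/etcd/client-key.pem \
  member list

# Remove a member by the ID shown in the list output (ID here is hypothetical).
etcdctl --endpoints=https://tools-k8s-etcd-8.tools.eqiad1.wikimedia.cloud:2379 \
  --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/client.pem --key=/etc/etcd/client-key.pem \
  member remove 8e9e05c52164694d
```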
2021-04-05
- 17:02 bstorm: chowned the data volume for the docker registry to docker-registry:docker-registry
- 09:56 arturo: make jhernandez (IRC joakino) projectadmin (T278975)
2021-04-01
- 20:43 bstorm: cleared error state from the grid queues caused by unspecified job errors
- 15:53 dcaro: Removed etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs, adding a new member (T267082)
- 15:43 dcaro: Removing etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs (T267082)
- 15:36 dcaro: Added new etcd member tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud (T267082)
- 15:18 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
2021-03-31
- 15:57 arturo: rebooting `tools-mail-03` after enabling NFS (T267082, T278538)
- 15:04 arturo: created MX record for `tools.wmcloud.org` pointing to `mail.tools.wmcloud.org`
- 15:03 arturo: created DNS A record `mail.tools.wmcloud.org` pointing to 185.15.56.63
- 14:56 arturo: shutoff tools-mail-02 (T278538)
- 14:55 arturo: point floating IP 185.15.56.63 to tools-mail-03 (T278538)
- 14:45 arturo: created VM `tools-mail-03` as Debian Buster (T278538)
- 14:39 arturo: relocate some of the hiera keys for email server from project-level to prefix
- 09:44 dcaro: running disk performance test on etcd-4 (round2)
- 09:05 dcaro: running disk performance test on etcd-8
- 08:43 dcaro: running disk performance test on etcd-4
2021-03-30
- 16:15 bstorm: added `labstore::traffic_shaping::egress: 800mbps` to tools-static prefix T278539
- 15:44 arturo: shutoff tools-static-12/13 (T278539)
- 15:41 arturo: point horizon web proxy `tools-static.wmflabs.org` to tools-static-14 (T278539)
- 15:37 arturo: add `mount_nfs: true` to tools-static prefix (T278539)
- 15:26 arturo: create VM tools-static-14 with Debian Buster image (T278539)
- 12:19 arturo: introduce horizon proxy `deb-tools.wmcloud.org` (T278436)
- 12:15 arturo: shutdown tools-sgebastion-09 (stretch)
- 11:05 arturo: created VM `tools-sgebastion-10` as Debian Buster (T275865)
- 11:04 arturo: created server group `tools-bastion` with anti-affinity policy
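A sketch of the server-group creation above with the OpenStack CLI; the `wmcs-openstack` wrapper and project id follow other entries in this log, and the policy matches what the entry describes, but the exact invocation is not recorded:

```bash
# Anti-affinity server group so bastion VMs land on different hypervisors where possible.
sudo wmcs-openstack --os-project-id=tools server group create \
  --policy anti-affinity tools-bastion
```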
2021-03-26
- 12:21 arturo: shutdown tools-package-builder-02 (stretch), we keep -03 which is buster (T275864)
2021-03-25
- 19:30 bstorm: forced deletion of all jobs stuck in a deleting state T277653
- 17:46 arturo: rebooting tools-sgeexec-* nodes to account for new grid master (T277653)
- 16:20 arturo: rebuilding tools-sgegrid-master VM as debian buster (T277653)
- 16:18 arturo: icinga-downtime toolschecker for 2h
- 16:05 bstorm: failed over the tools grid to the shadow master T277653
- 13:36 arturo: shutdown tools-sge-services-03 (T278354)
- 13:33 arturo: shutdown tools-sge-services-04 (T278354)
- 13:31 arturo: point aptly clients to `tools-services-05.tools.eqiad1.wikimedia.cloud` (hiera change) (T278354)
- 12:58 arturo: created VM `tools-services-05` as Debian Buster (T278354)
- 12:51 arturo: create cinder volume `tools-aptly-data` (T278354)
2021-03-24
- 12:46 arturo: shutoff the old stretch VMs `tools-docker-registry-03` and `tools-docker-registry-04` (T278303)
- 12:38 arturo: associate floating IP 185.15.56.67 with `tools-docker-registry-05` and refresh FQDN docker-registry.tools.wmflabs.org accordingly (T278303)
- 12:33 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-05` (T278303)
- 12:32 arturo: snapshot cinder volume `tools-docker-registry-data` into `tools-docker-registry-data-stretch-migration` (T278303)
- 12:32 arturo: bump cinder storage quota from 80G to 400G (without quota request task)
- 12:11 arturo: created VM `tools-docker-registry-06` as Debian Buster (T278303)
- 12:09 arturo: detach cinder volume `tools-docker-registry-data` (T278303)
- 11:46 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-03` to format it and pre-populate it with registry data (T278303)
- 11:20 arturo: created 80G cinder volume tools-docker-registry-data (T278303)
- 11:10 arturo: starting VM tools-docker-registry-04 which was stopped probably since 2021-03-09 due to hypervisor draining
2021-03-23
- 12:46 arturo: aborrero@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
- 12:16 arturo: delete & re-create VM tools-sgegrid-shadow as Debian Buster (T277653)
- 12:14 arturo: created puppet prefix 'tools-sgegrid-shadow' and migrated puppet configuration from VM-puppet
- 12:13 arturo: created server group 'tools-grid-master-shadow' with anti-affinity policy
2021-03-18
- 19:24 bstorm: set profile::toolforge::infrastructure across the entire project with login_server set on the bastion and exec node-related prefixes
- 16:21 andrewbogott: enabling puppet tools-wide
- 16:20 andrewbogott: disabling puppet tools-wide to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456
- 16:19 bstorm: added profile::toolforge::infrastructure class to puppetmaster T277756
- 04:12 bstorm: rebooted tools-sgeexec-0935.tools.eqiad.wmflabs because it forgot how to LDAP...likely root cause of the issues tonight
- 03:59 bstorm: rebooting grid master. sorry for the cron spam
- 03:49 bstorm: restarting sssd on tools-sgegrid-master
- 03:37 bstorm: deleted a massive number of stuck jobs that misfired from the cron server
- 03:35 bstorm: rebooting tools-sgecron-01 to try to clear up the ldap-related errors coming out of it
- 01:46 bstorm: killed the toolschecker cron job, which had an LDAP error, and ran it again by hand
2021-03-17
- 20:57 bstorm: deployed changes to rbac for kubernetes to add kubectl top access for tools
- 20:26 andrewbogott: moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv
2021-03-16
- 16:31 arturo: installing jobutils and misctools 1.41
- 15:55 bstorm: deleted a bunch of messed up grid jobs (9989481,8813,81682,86317,122602,122623,583621,606945,606999)
- 12:32 arturo: add packages jobutils / misctools v1.41 to {stretch,buster}-tools aptly repository in tools-sge-services-03
2021-03-12
- 23:13 bstorm: cleared error state for all grid queues
2021-03-11
- 17:40 bstorm: deployed metrics-server:0.4.1 to kubernetes
- 16:21 bstorm: add jobutils 1.40 and misctools 1.40 to stretch-tools
- 13:11 arturo: add misctools 1.37 to buster-tools|toolsbeta aptly repo for T275865
- 13:10 arturo: add jobutils 1.40 to buster-tools aptly repo for T275865
2021-03-10
- 10:56 arturo: briefly stopped VM tools-k8s-etcd-7 to disable VMX cpu flag
2021-03-09
- 13:31 arturo: hard-reboot tools-docker-registry-04 because of issues related to T276922
- 12:34 arturo: briefly rebooting VM tools-docker-registry-04; we need to reboot the hypervisor cloudvirt1038 and it failed to migrate away
2021-03-05
- 12:30 arturo: started tools-redis-1004 again
- 12:22 arturo: stop tools-redis-1004 to ease draining of cloudvirt1035
2021-03-04
- 11:25 arturo: rebooted tools-sgewebgrid-generic-0901, repool it again
- 09:58 arturo: depool tools-sgewebgrid-generic-0901 to reboot VM. It was stuck in MIGRATING state when draining cloudvirt1022
2021-03-03
- 15:17 arturo: shutting down tools-sgebastion-07 in an attempt to fix nova state and finish hypervisor migration
- 15:11 arturo: tools-sgebastion-07 triggered a neutron exception (unauthorized) while being live-migrated from cloudvirt1021 to 1029. Resetting nova state with `nova reset-state bd685d48-1011-404e-a755-372f6022f345 --active` and trying again
- 14:48 arturo: killed pywikibot instance running in tools-sgebastion-07 by user msyn
2021-03-02
- 15:23 bstorm: depooling tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs for reboot. It isn't communicating right
- 15:22 bstorm: cleared queue error states...will need to keep a better eye on what's causing those
2021-02-27
- 02:23 bstorm: deployed typo fix to maintain-kubeusers in an innocent effort to make the weekend better T275910
- 02:00 bstorm: running a script to repair the dumps mount in all podpresets T275371
2021-02-26
- 22:04 bstorm: cleaned up grid jobs 1230666,1908277,1908299,2441500,2441513
- 21:27 bstorm: hard rebooting tools-sgeexec-0947
- 21:21 bstorm: hard rebooting tools-sgeexec-0952.tools.eqiad.wmflabs
- 20:01 bd808: Deleted csr in strange state for tool-ores-inspect
2021-02-24
- 18:30 bd808: `sudo wmcs-openstack role remove --user zfilipin --project tools user` T267313
- 01:04 bstorm: hard rebooting tools-k8s-worker-76 because it's in a sorry state
2021-02-23
- 23:11 bstorm: draining a bunch of k8s workers to clean up after dumps changes T272397
- 23:06 bstorm: draining tools-k8s-worker-55 to clean up after dumps changes T272397
2021-02-22
- 20:40 bstorm: repooled tools-sgeexec-0918.tools.eqiad.wmflabs
- 19:09 bstorm: hard rebooted tools-sgeexec-0918 from openstack T275411
- 19:07 bstorm: shutting down tools-sgeexec-0918 with the VM's command line (not libvirt directly yet) T275411
- 19:05 bstorm: shutting down tools-sgeexec-0918 (with openstack to see what happens) T275411
- 19:03 bstorm: depooled tools-sgeexec-0918 T275411
- 18:56 bstorm: deleted job 1962508 from the grid to clear it up T275301
- 16:58 bstorm: cleared error state on several grid queues
2021-02-19
- 12:31 arturo: deploying new version of toolforge ingress admission controller
2021-02-17
- 21:26 bstorm: deleted tools-puppetdb-01 since it is unused at this time (and undersized anyway)
2021-02-04
- 16:27 bstorm: rebooting tools-package-builder-02
2021-01-26
- 16:27 bd808: Hard reboot of tools-sgeexec-0906 via Horizon for T272978
2021-01-22
- 09:59 dcaro: added the record redis.svc.tools.eqiad1.wikimedia.cloud pointing to tools-redis1003 (T272679)
2021-01-21
- 23:58 bstorm: deployed new maintain-kubeusers to tools T271847
2021-01-19
- 22:57 bstorm: truncated 75GB error log /data/project/robokobot/virgule.err T272247
- 22:48 bstorm: truncated 100GB error log /data/project/magnus-toolserver/error.log T272247
- 22:43 bstorm: truncated 107GB log '/data/project/meetbot/logs/messages.log' T272247
- 22:34 bstorm: truncating 194 GB error log '/data/project/mix-n-match/mnm-microsync.err' T272247
- 16:37 bd808: Added Jhernandez to root sudoers group
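The multi-GB log truncations above (T272247) are the usual way to reclaim NFS space without breaking the writing process, since the file keeps its inode; a minimal sketch using one of the paths from the entries:

```bash
# Zero out the file in place; a process holding it open keeps writing to the same inode.
sudo truncate -s 0 /data/project/robokobot/virgule.err
```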
2021-01-14
- 20:56 bstorm: setting bastions to have mostly-uncapped egress network and 40MBps nfs_read for better shared use
- 20:43 bstorm: running tc-setup across the k8s workers
- 20:40 bstorm: running tc-setup across the grid fleet
- 17:58 bstorm: hard rebooting tools-sgecron-01 following network issues during upgrade to stein T261134
2021-01-13
- 10:02 arturo: delete floating IP allocation 185.15.56.245 (T271867)
2021-01-12
- 18:16 bstorm: deleted wedged CSR tool-adhs-wde to get maintain-kubeusers working again T271842
2021-01-05
- 18:49 bstorm: changing the limits on k8s etcd nodes again, so disabling puppet on them T267966
2021-01-04
- 18:21 bstorm: ran 'sudo systemctl stop getty@ttyS1.service && sudo systemctl disable getty@ttyS1.service' on tools-k8s-etcd-5. I have no idea why that keeps coming back.
2020-12-22
- 18:22 bstorm: rebooting the grid master because it is misbehaving following the NFS outage
- 10:53 arturo: rebase & resolve ugly git merge conflict in labs/private.git
2020-12-18
- 18:37 bstorm: set profile::wmcs::kubeadm::etcd_latency_ms: 15 T267966
2020-12-17
- 21:42 bstorm: doing the same procedure to increase the timeouts more T267966
- 19:56 bstorm: puppet enabled one at a time, letting things catch up. Timeouts are now adjusted to something closer to fsync values T267966
- 19:44 bstorm: set etcd timeouts seed value to 20 instead of the default 10 (profile::wmcs::kubeadm::etcd_latency_ms) T267966
- 18:58 bstorm: disabling puppet on k8s-etcd servers to alter the timeouts T267966
- 14:23 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-4 (T267966)
- 14:21 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-5 (T267966)
- 14:19 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-6 (T267966)
- 14:17 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-7 (T267966)
- 14:15 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-8 (T267966)
- 14:12 arturo: updated kube-apiserver manifest with new etcd nodes (T267966)
- 13:56 arturo: adding etcd dns_alt_names hiera keys to the puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/beb27b45a74765a64552f2d4f70a40b217b4f4e9%5E%21/
- 13:12 arturo: making k8s api server aware of the new etcd nodes via hiera update https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/3761c4c4dab1c3ed0ab0a1133d2ccf3df6c28baf%5E%21/ (T267966)
- 12:54 arturo: joining new etcd nodes in the k8s etcd cluster (T267966)
- 12:52 arturo: adding more etcd nodes in the hiera key in tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/b4f60768078eccdabdfab4cd99c7c57076de51b2
- 12:50 arturo: dropping more unused hiera keys in the tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/e9e66a6787d9b91c08cf4742a27b90b3e6d05aac
- 12:49 arturo: dropping unused hiera keys in the tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/2b4cb4a41756e602fb0996e7d0210e9102172424
- 12:16 arturo: created VM `tools-k8s-etcd-8` (T267966)
- 12:15 arturo: created VM `tools-k8s-etcd-7` (T267966)
- 12:13 arturo: created `tools-k8s-etcd` anti-affinity server group
2020-12-11
- 18:29 bstorm: certificatesigningrequest.certificates.k8s.io "tool-production-error-tasks-metrics" deleted to stop maintain-kubeusers issues
- 12:14 dcaro: upgrading stable/main (clinic duty)
- 12:12 dcaro: upgrading buster-wikimedia/main (clinic duty)
- 12:03 dcaro: upgrading stable-updates/main, mainly ca-certificates (clinic duty)
- 12:01 dcaro: upgrading stretch-backports/main, mainly libuv (clinic duty)
- 11:58 dcaro: disabled all the repos blocking upgrades on tools-package-builder-02 (duplicated, other releases...)
- 11:35 arturo: uncordon tools-k8s-worker-71 and tools-k8s-worker-55, they weren't uncordoned yesterday for whatever reasons (T263284)
- 11:27 dcaro: upgrading stretch-wikimedia/main (clinic duty)
- 11:20 dcaro: upgrading stretch-wikimedia/thirdparty/mono-project-stretch (clinic duty)
- 11:08 dcaro: upgrade stretch-wikimedia/component/php72 (minor upgrades) (clinic duty)
- 11:04 dcaro: upgrade oldstable/main packages (clinic duty)
- 10:58 dcaro: upgrade kubectl done (clinic duty)
- 10:53 dcaro: upgrade kubectl (clinic duty)
- 10:16 dcaro: upgrading oldstable/main packages (clinic duty)
2020-12-10
- 17:35 bstorm: k8s-control nodes upgraded to 1.17.13 T263284
- 17:16 arturo: k8s control nodes were all upgraded to 1.17, now upgrading worker nodes (T263284)
- 15:50 dcaro: puppet upgraded to 5.5.10 on the hosts, ping me if you see anything weird (clinic duty)
- 15:41 arturo: icinga-downtime toolschecker for 2h (T263284)
- 15:35 dcaro: Puppet 5 on tools-sgebastion-09 ran well and without issues, upgrading the other sge nodes (clinic duty)
- 15:32 dcaro: Upgrading puppet from 4 to 5 on tools-sgebastion-09 (clinic duty)
- 12:41 arturo: set hiera `profile::wmcs::kubeadm::component: thirdparty/kubeadm-k8s-1-17` in project & tools-k8s-control prefix (T263284)
- 11:50 arturo: disabled puppet in all k8s nodes in preparation for version upgrade (T263284)
- 09:58 dcaro: successful tesseract upgrade on tools-sgewebgrid-lighttpd-0914, upgrading the rest of nodes (clinic duty)
- 09:49 dcaro: upgrading tesseract on tools-sgewebgrid-lighttpd-0914 (clinic duty)
2020-12-08
- 19:01 bstorm: pushed updated calico node image (v3.14.0) to internal docker registry as well T269016
2020-12-07
- 22:56 bstorm: pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry T269016
2020-12-03
- 09:18 arturo: restarted kubelet systemd service on tools-k8s-worker-38. Node was NotReady, complaining about 'use of closed network connection'
- 09:16 arturo: restarted kubelet systemd service on tools-k8s-worker-59. Node was NotReady, complaining about 'use of closed network connection'
2020-11-28
- 23:35 Krenair: Re-scheduled 4 continuous jobs from tools-sgeexec-0908 as it appears to be broken, at about 23:20 UTC
- 04:35 Krenair: Ran `sudo -i kubectl -n tool-mdbot delete cm maintain-kubeusers` on tools-k8s-control-1 for T268904, seems to have regenerated ~tools.mdbot/.kube/config
2020-11-24
- 17:44 arturo: rebased labs/private.git. 2 patches had merge conflicts
- 16:36 bd808: clush -w @all -b 'sudo -i apt-get purge nscd'
- 16:31 bd808: Ran `sudo -i apt-get purge nscd` on tools-sgeexec-0932 to try and fix apt state for puppet
2020-11-10
- 19:45 andrewbogott: rebooting tools-sgeexec-0950; OOM
2020-11-02
- 13:35 arturo: added dcaro as projectadmin & user (T266068)
2020-10-29
- 21:33 legoktm: published docker-registry.tools.wmflabs.org/toolbeta-test image (T265681)
- 21:10 bstorm: Added another ingress node to k8s cluster in case the load spikes are the problem T266506
- 17:33 bstorm: hard rebooting tools-sgeexec-0905 and tools-sgeexec-0916 to get the grid back to full capacity
- 04:03 legoktm: published docker-registry.tools.wmflabs.org/toolforge-buster0-builder:latest image (T265686)
2020-10-28
- 23:42 bstorm: dramatically elevated the egress cap on tools-k8s-ingress nodes that were affected by the NFS settings T266506
- 22:10 bstorm: launching tools-k8s-ingress-3 to try and get an NFS-free node T266506
- 21:58 bstorm: set 'mount_nfs: false' on the tools-k8s-ingress prefix T266506
2020-10-23
- 22:22 legoktm: imported pack_0.14.2-1_amd64.deb into buster-tools (T266270)
2020-10-21
- 17:58 legoktm: pushed toolforge-buster0-{build,run}:latest images to docker registry
2020-10-15
- 22:00 bstorm: manually removing nscd from tools-sgebastion-08 and running puppet
- 18:23 andrewbogott: uncordoning tools-k8s-worker-53, 54, 55, 59
- 17:28 andrewbogott: depooling tools-k8s-worker-53, 54, 55, 59
- 17:27 andrewbogott: uncordoning tools-k8s-worker-35, 37, 45
- 16:44 andrewbogott: depooling tools-k8s-worker-35, 37, 45
2020-10-14
- 21:00 andrewbogott: repooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
- 20:37 andrewbogott: depooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
- 20:35 andrewbogott: repooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16
- 20:31 bd808: Deployed toollabs-webservice v0.74
- 19:53 andrewbogott: depooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16 and moving to Ceph
- 19:47 andrewbogott: repooling tools-sgeexec-0932, 33, 34 and moving to Ceph
- 19:07 andrewbogott: depooling tools-sgeexec-0932, 33, 34 and moving to Ceph
- 19:06 andrewbogott: repooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
- 16:56 andrewbogott: depooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
2020-10-10
- 17:07 bstorm: cleared errors on tools-sgeexec-0912.tools.eqiad.wmflabs to get the queue moving again
2020-10-08
- 17:07 bstorm: rebuilding docker images with locales-all T263339
2020-10-06
- 19:04 andrewbogott: uncordoned tools-k8s-worker-38
- 18:51 andrewbogott: uncordoned tools-k8s-worker-52
- 18:40 andrewbogott: draining and cordoning tools-k8s-worker-52 and tools-k8s-worker-38 for ceph migration
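The cordon/drain/uncordon cycle above (and the many worker depool/repool entries elsewhere in this log) maps onto standard kubectl commands; the flags shown are the usual ones for that Kubernetes era, not a record of the exact invocation:

```bash
# Evict pods and mark the node unschedulable before migrating the VM.
kubectl drain tools-k8s-worker-52 --ignore-daemonsets --delete-local-data

# After the migration, allow the scheduler to place pods on it again.
kubectl uncordon tools-k8s-worker-52
```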
2020-10-02
- 21:09 bstorm: rebooting tools-k8s-worker-70 because it seems to be unable to recover from an old NFS disconnect
- 17:37 andrewbogott: stopping tools-prometheus-03 to attempt a snapshot
- 16:03 bstorm: shutting down tools-prometheus-04 to try to fsck the disk
2020-10-01
- 21:39 andrewbogott: migrating tools-proxy-06 to ceph
- 21:35 andrewbogott: moving k8s.tools.eqiad1.wikimedia.cloud from 172.16.0.99 (toolsbeta-test-k8s-haproxy-1) to 172.16.0.108 (toolsbeta-test-k8s-haproxy-2) in anticipation of downtime for haproxy-1 tomorrow
2020-09-30
- 18:34 andrewbogott: repooling tools-sgeexec-0918
- 18:29 andrewbogott: depooling tools-sgeexec-0918 so I can reboot cloudvirt1036
2020-09-23
- 21:38 bstorm: ran an 'apt clean' across the fleet to get ahead of the new locale install
2020-09-18
- 19:41 andrewbogott: repooling tools-k8s-worker-30, 33, 34, 57, 60
- 19:04 andrewbogott: depooling tools-k8s-worker-30, 33, 34, 57, 60
- 19:02 andrewbogott: repooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
- 17:48 andrewbogott: depooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
- 17:47 andrewbogott: repooling tools-k8s-worker-31, 32, 36, 39, 40
- 16:40 andrewbogott: depooling tools-k8s-worker-31, 32, 36, 39, 40
- 16:38 andrewbogott: repooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
- 16:10 andrewbogott: depooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
- 13:54 andrewbogott: repooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916
- 13:50 andrewbogott: depooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916 for flavor update
- 01:20 andrewbogott: repooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 after flavor update
- 01:11 andrewbogott: depooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 for flavor update
- 01:08 andrewbogott: repooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 after flavor update
- 01:00 andrewbogott: depooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 for flavor update
- 00:58 andrewbogott: repooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 after flavor update
- 00:49 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
2020-09-17
- 21:56 bd808: Built and deployed tools-manifest v0.22 (T263190)
- 21:55 bd808: Built and deployed tools-manifest v0.22 (T169695)
- 20:34 bd808: Live hacked "--backend=gridengine" into webservicemonitor on tools-sgecron-01 (T263190)
- 20:21 bd808: Restarted webservicemonitor on tools-sgecron-01.tools.eqiad.wmflabs
- 20:09 andrewbogott: I didn't actually depool tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 because there was some kind of brief outage just now
- 19:58 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
- 19:55 andrewbogott: repooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
- 19:29 andrewbogott: depooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
- 15:38 andrewbogott: repooling tools-k8s-worker-70 and tools-k8s-worker-66 after flavor remapping
- 15:34 andrewbogott: depooling tools-k8s-worker-70 and tools-k8s-worker-66 for flavor remapping
- 15:30 andrewbogott: repooling tools-sgeexec-0909, 0908, 0907, 0906, 0904
- 15:21 andrewbogott: depooling tools-sgeexec-0909, 0908, 0907, 0906, 0904 for flavor remapping
- 13:55 andrewbogott: depooled tools-sgewebgrid-lighttpd-0917 and tools-sgewebgrid-lighttpd-0920
- 13:55 andrewbogott: repooled tools-sgeexec-0937 after move to ceph
- 13:45 andrewbogott: depooled tools-sgeexec-0937 for move to ceph
2020-09-16
- 23:20 andrewbogott: repooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
- 23:03 andrewbogott: depooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
- 23:02 andrewbogott: uncordoned tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
- 22:29 andrewbogott: draining tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
- 17:37 andrewbogott: service gridengine-master restart on tools-sgegrid-master
2020-09-10
- 15:37 arturo: hard-rebooting tools-proxy-05
- 15:33 arturo: rebooting tools-proxy-05 to try flushing local DNS caches
- 15:25 arturo: detected missing DNS record for k8s.tools.eqiad1.wikimedia.cloud which means the k8s cluster is down
- 10:22 arturo: enabling ingress dedicated worker nodes in the k8s cluster (T250172)
2020-09-09
- 11:12 arturo: new ingress nodes added to the cluster, and tainted/labeled per the docs https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying#ingress_nodes (T250172)
- 10:50 arturo: created puppet prefix `tools-k8s-ingress` (T250172)
- 10:42 arturo: created VMs tools-k8s-ingress-1 and tools-k8s-ingress-2 in the `tools-ingress` server group (T250172)
- 10:38 arturo: created server group `tools-ingress` with soft anti affinity policy (T250172)
2020-09-08
- 23:24 bstorm: clearing grid queue error states blocking job runs
- 22:53 bd808: forcing puppet run on tools-sgebastion-07
2020-09-02
- 18:13 andrewbogott: moving tools-sgeexec-0920 to ceph
- 17:57 andrewbogott: moving tools-sgeexec-0942 to ceph
2020-08-31
- 19:58 andrewbogott: migrating tools-sgeexec-091[0-9] to ceph
- 17:19 andrewbogott: migrating tools-sgeexec-090[4-9] to ceph
- 17:19 andrewbogott: repooled tools-sgeexec-0901
- 16:52 bstorm: `apt install uwsgi` was run on tools-checker-03 in the last log T261677
- 16:51 bstorm: running `apt install uwsgi` with --allow-downgrades to fix the puppet setup there T261677
- 14:26 andrewbogott: depooling tools-sgeexec-0901, migrating to ceph
2020-08-30
- 00:57 Krenair: also ran qconf -ds on each
- 00:35 Krenair: Tidied up SGE problems (it was spamming root@ every minute for hours) following host deletions some hours ago - removed tools-sgeexec-0921 through 0931 from @general, ran qmod -rj on all jobs registered for those nodes, then qdel -f on the remainders, then qconf -de on each deleted node
2020-08-29
- 16:02 bstorm: deleting "tools-sgeexec-0931", "tools-sgeexec-0930", "tools-sgeexec-0929", "tools-sgeexec-0928", "tools-sgeexec-0927"
- 16:00 bstorm: deleting "tools-sgeexec-0926", "tools-sgeexec-0925", "tools-sgeexec-0924", "tools-sgeexec-0923", "tools-sgeexec-0922", "tools-sgeexec-0921"
2020-08-26
- 21:08 bd808: Disabled puppet on tools-proxy-06 to test fixes for a bug in the new T251628 code
- 08:54 arturo: merged several patches by bryan for toolforge front proxy (cleanups, etc) example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/622435
2020-08-25
- 19:38 andrewbogott: deleting tools-sgeexec-0943.tools.eqiad.wmflabs, tools-sgeexec-0944.tools.eqiad.wmflabs, tools-sgeexec-0945.tools.eqiad.wmflabs, tools-sgeexec-0946.tools.eqiad.wmflabs, tools-sgeexec-0948.tools.eqiad.wmflabs, tools-sgeexec-0949.tools.eqiad.wmflabs, tools-sgeexec-0953.tools.eqiad.wmflabs — they are broken and we're not very curious why; will retry this exercise when everything is standardized on
- 15:03 andrewbogott: removing non-ceph nodes tools-sgeexec-0921 through tools-sgeexec-0931
- 15:02 andrewbogott: added new sge-exec nodes tools-sgeexec-0943 through tools-sgeexec-0953 (for real this time)
2020-08-19
- 21:29 andrewbogott: shutting down and removing tools-k8s-worker-20 through tools-k8s-worker-29; this load can now be handled by new nodes on ceph hosts
- 21:15 andrewbogott: shutting down and removing tools-k8s-worker-1 through tools-k8s-worker-19; this load can now be handled by new nodes on ceph hosts
- 18:40 andrewbogott: creating 13 new xlarge k8s worker nodes, tools-k8s-worker-67 through tools-k8s-worker-79
2020-08-18
- 15:24 bd808: Rebuilding all Docker containers to pick up newest versions of installed packages
2020-07-30
- 16:28 andrewbogott: added new xlarge ceph-hosted worker nodes: tools-k8s-worker-61, 62, 63, 64, 65, 66. T258663
2020-07-29
- 23:24 bd808: Pushed a copy of docker-registry.wikimedia.org/wikimedia-jessie:latest to docker-registry.tools.wmflabs.org/wikimedia-jessie:latest in preparation for the upstream image going away
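Copying an image between registries like this is essentially a pull/tag/push; a minimal sketch using the image names from the entry above:
```
docker pull docker-registry.wikimedia.org/wikimedia-jessie:latest
docker tag docker-registry.wikimedia.org/wikimedia-jessie:latest \
    docker-registry.tools.wmflabs.org/wikimedia-jessie:latest
docker push docker-registry.tools.wmflabs.org/wikimedia-jessie:latest
```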
2020-07-24
- 22:33 bd808: Removed a few more ancient docker images: grrrit, jessie-toollabs, and nagf
- 21:02 bd808: Running cleanup script to delete the non-sssd toolforge images from docker-registry.tools.wmflabs.org
- 20:17 bd808: Forced garbage collection on docker-registry.tools.wmflabs.org
- 20:06 bd808: Running cleanup script to delete all of the old toollabs-* images from docker-registry.tools.wmflabs.org
2020-07-22
- 23:24 bstorm: created server group 'tools-k8s-worker' for any new worker nodes, so that openstack has a low chance of scheduling them together unless it is necessary T258663
- 23:22 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[56-60] T257945
- 23:17 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[41-55] T257945
- 23:14 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[21-40] T257945
- 23:11 bstorm: running puppet and NFS remount on tools-k8s-worker-[1-15] T257945
- 23:07 bstorm: disabling puppet on k8s workers to reduce the effect of changing the NFS mount version all at once T257945
- 22:28 bstorm: setting tools-k8s-control prefix to mount NFS v4.2 T257945
- 22:15 bstorm: set the tools-k8s-control nodes to the same 800 MB/s egress limit to prevent issues with the Toolforge ingress and API system
- 22:07 bstorm: set tools-k8s-haproxy-1 (the main load balancer for Toolforge) to an egress limit of 800 MB/s instead of the default applied to all the other servers
2020-07-21
- 16:09 bstorm: rebooting tools-sgegrid-shadow to remount NFS correctly
- 15:55 bstorm: set the bastion prefix to have explicitly set hiera value of profile::wmcs::nfsclient::nfs_version: '4'
2020-07-17
- 16:47 bd808: Enabled Puppet on tools-proxy-06 following successful test (T102367)
- 16:29 bd808: Disabled Puppet on tools-proxy-06 to test nginx config changes manually (T102367)
2020-07-15
- 23:11 bd808: Removed ssh root key for valhallasw from project hiera (T255697)
2020-07-09
- 18:53 bd808: Updating git-review to 1.27 via clush across cluster (T257496)
2020-07-08
- 11:16 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 -- important change to front-proxy (T234617)
- 11:11 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 (T234617)
2020-07-07
- 23:22 bd808: Rebuilding all Docker images to pick up webservice v0.73 (T234617, T257229)
- 23:19 bd808: Deploying webservice v0.73 via clush (T234617, T257229)
- 23:16 bd808: Building webservice v0.73 (T234617, T257229)
- 15:01 Reedy: killed python process from tools.experimental-embeddings using a lot of cpu on tools-sgebastion-07
- 15:01 Reedy: killed meno25 process running pwb.py on tools-sgebastion-07
- 09:59 arturo: point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) (T247236)
2020-07-06
- 11:54 arturo: briefly point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) and then switch back to 185.15.56.11 (tools-proxy-05). The legacy redirector does HTTP/307 (T247236)
- 11:50 arturo: associate floating IP address 185.15.56.60 to tools-legacy-redirector (T247236)
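The floating IP association and the A-record flips above can be done from the OpenStack CLI along these lines; the zone handling is an assumption (in practice Horizon/Designate may have been used):
```
# Attach the floating IP to the redirector instance
openstack server add floating ip tools-legacy-redirector 185.15.56.60

# Point tools.wmflabs.org at the redirector, or back at tools-proxy-05
openstack recordset set wmflabs.org. tools.wmflabs.org. --record 185.15.56.60
openstack recordset set wmflabs.org. tools.wmflabs.org. --record 185.15.56.11
```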
2020-07-01
- 11:19 arturo: cleanup exim email queue (4 frozen messages)
- 11:02 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/608849 (T256737)
2020-06-30
- 11:18 arturo: set some hiera keys for mtail in puppet prefix `tools-mail` (T256737)
2020-06-29
- 22:48 legoktm: built html-sssd/web image (T241817)
- 22:23 legoktm: rebuild python{34,35,37}-sssd/web images for https://gerrit.wikimedia.org/r/608093
- 12:01 arturo: introduced spam filter in the mail server (T120210)
2020-06-25
- 21:49 zhuyifei1999_: re-enabling puppet on tools-sgebastion-09 T256426
- 21:39 zhuyifei1999_: disabling puppet on tools-sgebastion-09 so I can play with mount settings T256426
- 21:24 bstorm: hard rebooting tools-sgebastion-09
2020-06-24
- 12:36 arturo: live-hacking puppetmaster with exim prometheus stuff (T175964)
- 11:57 arturo: merging email ratelimiting patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/607320 (T175964)
2020-06-23
- 17:55 arturo: killed procs for users `hamishz` and `msyn`, which apparently belong to tools that should be running in the grid / kubernetes instead
- 16:08 arturo: created acme-chief cert `tools_mail` in the prefix hiera
2020-06-17
- 10:40 arturo: created VM tools-legacy-redirector, with the corresponding puppet prefix (T247236, T234617)
2020-06-16
- 23:01 bd808: Building new Docker images to pick up webservice 0.72
- 22:58 bd808: Deploying webservice 0.72 to bastions and grid
- 22:56 bd808: Building webservice 0.72
- 15:10 arturo: merging a patch with changes to the template for keepalived (used in the elastic cluster) https://gerrit.wikimedia.org/r/c/operations/puppet/+/605898
2020-06-15
- 21:28 bstorm_: cleaned up killgridjobs.sh on the tools bastions T157792
- 18:14 bd808: Rebuilding all Docker images to pick up webservice 0.71 (T254640, T253412)
- 18:12 bd808: Deploying webservice 0.71 to bastions and grid via clush
- 18:05 bd808: Building webservice 0.71
2020-06-12
- 13:13 arturo: live-hacking session in the puppetmaster ended
- 13:10 arturo: live-hacking puppet tree in tools-puppetmaster-02 for testing a PAWS-related patch (they share haproxy puppet code)
- 00:16 bstorm_: remounted NFS for tools-k8s-control-3 and tools-acme-chief-01
2020-06-11
- 23:35 bstorm_: rebooting tools-k8s-control-2 because it seems to be confused on NFS, interestingly enough
2020-06-04
- 13:32 bd808: Manually restored /etc/haproxy/conf.d/elastic.cfg on tools-elastic-*
2020-06-02
- 12:23 arturo: renewed TLS cert for k8s metrics-server (T250874) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#internal_API_access
- 11:00 arturo: renewed TLS cert for prometheus to contact toolforge k8s (T250874) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#external_API_access
2020-06-01
- 23:51 bstorm_: refreshed certs for the custom webhook controllers on the k8s cluster T250874
- 00:39 bd808: Ugh. Prior SAL message was about tools-sgeexec-0940
- 00:39 bd808: Compressed /var/log/account/pacct.0 ahead of rotation schedule to free some space on the root partition
2020-05-29
- 19:37 bstorm_: adding docker image for paws-public docker-registry.tools.wmflabs.org/paws-public-nginx:openresty T252217
2020-05-28
- 21:19 bd808: Killed 7 python processes run by user 'mattho69' on login.toolforge.org
- 21:06 bstorm_: upgrading tools-k8s-worker-[30-60] to kubernetes 1.16.10 T246122
- 17:54 bstorm_: upgraded tools-k8s-worker-[11..15] and starting on -21-29 now T246122
- 16:01 bstorm_: kubectl upgraded to 1.16.10 on all bastions T246122
- 15:58 arturo: upgrading tools-k8s-worker-[1..10] to 1.16.10 (T246122)
- 15:41 arturo: upgrading tools-k8s-control-3 to 1.16.10 (T246122)
- 15:17 arturo: upgrading tools-k8s-control-2 to 1.16.10 (T246122)
- 15:09 arturo: upgrading tools-k8s-control-1 to 1.16.10 (T246122)
- 14:49 arturo: cleanup /etc/apt/sources.list.d/ directory in all tools-k8s-* VMs
- 11:27 arturo: merging change to front-proxy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/599139 (T253816)
2020-05-27
- 17:23 bstorm_: deleting "tools-k8s-worker-20", "tools-k8s-worker-19", "tools-k8s-worker-18", "tools-k8s-worker-17", "tools-k8s-worker-16"
2020-05-26
- 18:45 bstorm_: upgrading maintain-kubeusers to match what is in toolsbeta T246059 T211096
- 16:20 bstorm_: fix incorrect volume name in kubeadm-config configmap T246122
2020-05-22
- 20:00 bstorm_: rebooted tools-sgebastion-07 to clear up tmp file problems with 10 min warning
- 19:12 bstorm_: running command to delete over 2000 tmp ca certs on tools-sgebastion-07 T253412
2020-05-21
- 22:40 bd808: Rebuilding all Docker containers for tools-webservice 0.70 (T252700)
- 22:36 bd808: Updated tools-webservice to 0.70 across instances (T252700)
- 22:29 bd808: Building tools-webservice 0.70 via wmcs-package-build.py
2020-05-20
- 09:59 arturo: now running tesseract-ocr v4.1.1-2~bpo9+1 in the Toolforge grid (T247422)
- 09:50 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'apt-get install tesseract-ocr -t stretch-backports -y'` (T247422)
- 09:35 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'rm /etc/apt/sources.lists.d/kubeadm-k8s-component-repo.list ; rm /etc/apt/sources.list.d/repository_thirdparty-kubeadm-k8s-1-15.list ; run-puppet-agent'` (T247422)
- 09:23 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'rm /etc/apt/preferences.d/* ; run-puppet-agent'` (T247422)
2020-05-19
- 17:00 bstorm_: deleting/restarting the paws db-proxy pod because it cannot connect to the replicas...and I'm hoping that's due to depooling and such
2020-05-13
- 18:14 bstorm_: upgrading calico to 3.14.0 with typha enabled in Toolforge K8s T250863
- 18:10 bstorm_: set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade T250863
2020-05-09
- 00:28 bstorm_: added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera T252260
2020-05-08
- 18:17 bd808: Building all jessie-sssd derived images (T197930)
- 17:29 bd808: Building new jessie-sssd base image (T197930)
2020-05-07
- 21:51 bstorm_: rebuilding the docker images for Toolforge k8s
- 19:03 bstorm_: toollabs-webservice 0.69 now pushed to the Toolforge bastions
- 18:57 bstorm_: pushing new toollabs-webservice package v0.69 to the tools repos
2020-05-06
- 21:20 bd808: Kubectl delete node tools-k8s-worker-[16-20] (T248702)
- 18:24 bd808: Updated "profile::toolforge::k8s::worker_nodes" list in "tools-k8s-haproxy" prefix puppet (T248702)
- 18:14 bd808: Shutdown tools-k8s-worker-[16-20] instances (T248702)
- 18:04 bd808: Draining tools-k8s-worker-[16-20] in preparation for decomm (T248702)
- 17:56 bd808: Cordoned tools-k8s-worker-[16-20] in preparation for decomm (T248702)
- 00:01 bd808: Joining tools-k8s-worker-60 to the k8s worker pool
- 00:00 bd808: Joining tools-k8s-worker-59 to the k8s worker pool
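The decommission/join cycle above follows the standard kubeadm workflow; a minimal sketch, assuming the control-plane endpoint mentioned elsewhere in this log, with placeholder token/hash values:
```
# Retire an old worker
kubectl cordon tools-k8s-worker-16
kubectl drain tools-k8s-worker-16 --ignore-daemonsets --delete-local-data
kubectl delete node tools-k8s-worker-16
# (the instance itself is then shut down / deleted in OpenStack, as logged)

# Join a freshly built worker; token and CA hash come from
# `kubeadm token create --print-join-command` run on a control node
kubeadm join k8s.tools.eqiad1.wikimedia.cloud:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>
```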
2020-05-05
- 23:58 bd808: Joining tools-k8s-worker-58 to the k8s worker pool
- 23:55 bd808: Joining tools-k8s-worker-57 to the k8s worker pool
- 23:53 bd808: Joining tools-k8s-worker-56 to the k8s worker pool
- 21:51 bd808: Building 5 new k8s worker nodes (T248702)
2020-05-04
- 22:08 bstorm_: deleting tools-elastic-01/2/3 T236606
- 16:46 arturo: removing the now unused `/etc/apt/preferences.d/toolforge_k8s_kubeadmrepo*` files (T250866)
- 16:43 arturo: removing the now unused `/etc/apt/sources.list.d/toolforge-k8s-kubeadmrepo.list` file (T250866)
2020-04-29
- 22:13 bstorm_: running a fixup script after fixing a bug T247455
- 21:28 bstorm_: running the rewrite-psp-preset.sh script across all tools T247455
- 16:54 bstorm_: deleted the maintain-kubeusers pod to start running the new image T247455
- 16:52 bstorm_: tagged docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to latest to deploy to toolforge T247455
2020-04-28
- 22:58 bstorm_: rebuilding docker-registry.tools.wmflabs.org/maintain-kubeusers:beta T247455
2020-04-23
- 19:22 bd808: Increased Kubernetes services quota for bd808-test tool.
2020-04-21
- 23:06 bstorm_: repooled tools-k8s-worker-38/52, tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869
- 22:09 bstorm_: depooling tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869
- 22:02 bstorm_: draining tools-k8s-worker-38 and tools-k8s-worker-52 as they are on the crashed host T250869
2020-04-20
- 15:31 bd808: Rebuilding Docker containers to pick up tools-webservice v0.68 (T250625)
- 14:47 arturo: added joakino to tools.admin LDAP group
- 13:28 jeh: shutdown elasticsearch v5 cluster running Jessie T236606
- 12:46 arturo: uploading tools-webservice v0.68 to aptly stretch-tools and update it on relevant servers (T250625)
- 12:06 arturo: uploaded tools-webservice v0.68 to stretch-toolsbeta for testing
- 11:59 arturo: `root@tools-sge-services-03:~# aptly db cleanup` removed 340 unreferenced packages, and 2 unreferenced files
2020-04-15
- 23:20 bd808: Building ruby25-sssd/base and children (T141388, T250118)
- 20:09 jeh: update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206
2020-04-14
- 18:26 bstorm_: Deployed new code and RBAC for maintain-kubeusers T246123
- 18:19 bstorm_: updating the maintain-kubeusers:latest image T246123
- 17:32 bstorm_: updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 T246123
2020-04-10
- 21:33 bd808: Rebuilding all Docker images for the Kubernetes cluster (T249843)
- 19:36 bstorm_: after testing, deploying toollabs-webservice 0.67 to the tools repos T249843
- 14:53 arturo: live-hacking tools-puppetmaster-02 with https://gerrit.wikimedia.org/r/c/operations/puppet/+/587991 for T249837
2020-04-09
- 15:13 bd808: Rebuilding all stretch and buster Docker images. Jessie is broken at the moment due to package version mismatches
- 11:18 arturo: bump nproc limit in bastions https://gerrit.wikimedia.org/r/c/operations/puppet/+/587715 (T219070)
- 04:29 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 [try #2] (T154504, T234617)
- 04:19 bd808: python3 build.py --image-prefix toolforge --tag latest --no-cache --push --single jessie-sssd
- 00:20 bd808: Docker rebuild failed in toolforge-python2-sssd-base: "zlib1g-dev : Depends: zlib1g (= 1:1.2.8.dfsg-2+b1) but 1:1.2.8.dfsg-2+deb8u1 is to be installed"
2020-04-08
- 23:49 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 (T154504, T234617)
- 23:35 bstorm_: deploy toollabs-webservice v0.66 T154504 T234617
2020-04-07
- 20:06 andrewbogott: sss_cache -E on tools-sgebastion-08 and tools-sgebastion-09
- 20:00 andrewbogott: sss_cache -E on tools-sgebastion-07
2020-04-06
- 19:16 bstorm_: deleted tools-redis-1001/2 T248929
2020-04-03
- 22:40 bstorm_: shut down tools-redis-1001/2 T248929
- 22:32 bstorm_: switch tools-redis-1003 to the active redis server T248929
- 20:41 bstorm_: deleting tools-redis-1003/4 to attach them to an anti-affinity group T248929
- 18:53 bstorm_: spin up tools-redis-1004 on stretch and connect to cluster T248929
- 18:23 bstorm_: spin up tools-redis-1003 on stretch and connect to the cluster T248929
- 16:50 bstorm_: launching tools-redis-03 (Buster) to see what happens
2020-03-30
- 18:28 bstorm_: Beginning rolling depool, remount, repool of k8s workers for T248702
- 18:22 bstorm_: disabled puppet across tools-k8s-worker-[1-55].tools.eqiad.wmflabs T248702
- 16:56 arturo: dropping `_psl.toolforge.org` TXT record (T168677)
2020-03-27
- 21:22 bstorm_: removed puppet prefix tools-docker-builder T248703
- 21:15 bstorm_: deleted tools-docker-builder-06 T248703
- 18:55 bstorm_: launching tools-docker-imagebuilder-01 T248703
- 12:52 arturo: install python3-pykube on tools-k8s-control-3 for some test interactions with the API from python
2020-03-24
- 11:44 arturo: trying to solve a rebase/merge conflict in labs/private.git in tools-puppetmaster-02
- 11:33 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ (T234617) (second try with some additional bits in LUA)
- 10:16 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ (T234617)
2020-03-18
- 19:07 bstorm_: removed role::toollabs::logging::sender from project puppet (it wouldn't work anyway)
- 18:04 bstorm_: removed puppet prefix tools-flannel-etcd T246689
- 17:58 bstorm_: removed puppet prefix tools-worker T246689
- 17:57 bstorm_: removed puppet prefix tools-k8s-master T246689
- 17:36 bstorm_: removed lots of deprecated hiera keys from horizon for the old cluster T246689
- 16:59 bstorm_: deleting "tools-worker-1002", "tools-worker-1001", "tools-k8s-master-01", "tools-flannel-etcd-03", "tools-k8s-etcd-03", "tools-flannel-etcd-02", "tools-k8s-etcd-02", "tools-flannel-etcd-01", "tools-k8s-etcd-01" T246689
2020-03-17
- 13:29 arturo: set `profile::toolforge::bastion::nproc: 200` for tools-sgebastion-08 (T219070)
- 00:08 bstorm_: shut off tools-flannel-etcd-01/02/03 T246689
2020-03-16
- 22:01 bstorm_: shut off tools-k8s-etcd-01/02/03 T246689
- 22:00 bstorm_: shut off tools-k8s-master-01 T246689
- 21:59 bstorm_: shut down tools-worker-1001 and tools-worker-1002 T246689
2020-03-11
- 17:00 jeh: clean up apt cache on tools-sgebastion-07
2020-03-06
- 16:25 bstorm_: updating maintain-kubeusers image to filter invalid tool names
2020-03-03
- 18:16 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) T236606
- 18:02 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud T236606
- 17:31 jeh: create a OpenStack virtual ip address for the new elasticsearch cluster T236606
- 10:54 arturo: deleted VMs `tools-worker-[1003-1020]` (legacy k8s cluster) (T246689)
- 10:51 arturo: cordoned/drained all legacy k8s worker nodes except 1001/1002 (T246689)
2020-03-02
- 22:26 jeh: starting first pass of elasticsearch data migration to new cluster T236606
2020-03-01
- 01:48 bstorm_: old version of kubectl removed. Anyone who needs it can download it with `curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.4.12/bin/linux/amd64/kubectl`
- 01:27 bstorm_: running the force-migrate command to make sure any new kubernetes deployments are on the new cluster.
2020-02-28
- 22:14 bstorm_: shutting down the old maintain-kubeusers and taking the gloves off the new one (removing --gentle-mode)
- 16:51 bstorm_: node/tools-k8s-worker-15 uncordoned
- 16:44 bstorm_: drained tools-k8s-worker-15 and hard rebooting it because it wasn't happy
- 16:36 bstorm_: rebooting k8s workers 1-35 on the 2020 cluster to clear a strange nologin condition that has been there since the NFS maintenance
- 16:14 bstorm_: rebooted tools-k8s-worker-7 to clear some puppet issues
- 16:00 bd808: Devoicing stashbot in #wikimedia-cloud to reduce irc spam while migrating tools to 2020 Kubernetes cluster
- 15:28 jeh: create OpenStack server group tools-elastic with anti-affinty policy enabled T236606
- 15:09 jeh: create 3 new elasticsearch VMs tools-elastic-[1,2,3] T236606
- 14:20 jeh: create new puppet prefixes for existing (no change in data) and new elasticsearch VMs
- 04:35 bd808: Joined tools-k8s-worker-54 to 2020 Kubernetes cluster
- 04:34 bd808: Joined tools-k8s-worker-53 to 2020 Kubernetes cluster
- 04:32 bd808: Joined tools-k8s-worker-52 to 2020 Kubernetes cluster
- 04:31 bd808: Joined tools-k8s-worker-51 to 2020 Kubernetes cluster
- 04:28 bd808: Joined tools-k8s-worker-50 to 2020 Kubernetes cluster
- 04:24 bd808: Joined tools-k8s-worker-49 to 2020 Kubernetes cluster
- 04:23 bd808: Joined tools-k8s-worker-48 to 2020 Kubernetes cluster
- 04:21 bd808: Joined tools-k8s-worker-47 to 2020 Kubernetes cluster
- 04:21 bd808: Joined tools-k8s-worker-46 to 2020 Kubernetes cluster
- 04:19 bd808: Joined tools-k8s-worker-45 to 2020 Kubernetes cluster
- 04:14 bd808: Joined tools-k8s-worker-44 to 2020 Kubernetes cluster
- 04:13 bd808: Joined tools-k8s-worker-43 to 2020 Kubernetes cluster
- 04:12 bd808: Joined tools-k8s-worker-42 to 2020 Kubernetes cluster
- 04:10 bd808: Joined tools-k8s-worker-41 to 2020 Kubernetes cluster
- 04:09 bd808: Joined tools-k8s-worker-40 to 2020 Kubernetes cluster
- 04:08 bd808: Joined tools-k8s-worker-39 to 2020 Kubernetes cluster
- 04:07 bd808: Joined tools-k8s-worker-38 to 2020 Kubernetes cluster
- 04:06 bd808: Joined tools-k8s-worker-37 to 2020 Kubernetes cluster
- 03:49 bd808: Joined tools-k8s-worker-36 to 2020 Kubernetes cluster
- 00:50 bstorm_: rebuilt all docker images to include webservice 0.64
2020-02-27
- 23:27 bstorm_: installed toollabs-webservice 0.64 on the bastions
- 23:24 bstorm_: pushed toollabs-webservice version 0.64 to all toolforge repos
- 21:03 jeh: add reindex service account to elasticsearch for data migration T236606
- 20:57 bstorm_: upgrading toollabs-webservice to stretch-toolsbeta version for jdk8:testing image only
- 20:19 jeh: update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 T236606
- 18:53 bstorm_: hard rebooted a rather stuck tools-sgecron-01
- 18:20 bd808: Building tools-k8s-worker-[36-55]
- 17:56 bd808: Deleted instances tools-worker-10[21-40]
- 16:14 bd808: Decommissioning tools-worker-10[21-40]
- 16:02 bd808: Drained tools-worker-1021
- 15:51 bd808: Drained tools-worker-1022
- 15:44 bd808: Drained tools-worker-1023 (there is no tools-worker-1024)
- 15:39 bd808: Drained tools-worker-1025
- 15:39 bd808: Drained tools-worker-1026
- 15:11 bd808: Drained tools-worker-1027
- 15:09 bd808: Drained tools-worker-1028 (there is no tools-worker-1029)
- 15:07 bd808: Drained tools-worker-1030
- 15:06 bd808: Uncordoned tools-worker-10[16-20]. Was over-optimistic about repacking the legacy Kubernetes cluster into 15 instances. Will keep 20 for now.
- 15:00 bd808: Drained tools-worker-1031
- 14:54 bd808: Hard reboot tools-worker-1016. Direct virsh console unresponsive. Stuck in shutdown since 2020-01-22?
- 14:44 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
- 14:41 bd808: Drained tools-worker-1032
- 14:37 bd808: Drained tools-worker-1033
- 14:35 bd808: Drained tools-worker-1034
- 14:34 bd808: Drained tools-worker-1035
- 14:33 bd808: Drained tools-worker-1036
- 14:33 bd808: Drained tools-worker-10{39,38,37} yesterday but did not !log
- 00:29 bd808: Drained tools-worker-1009 for reboot (NFS flakey)
- 00:11 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
- 00:08 bd808: Uncordoned tools-worker-1002.tools.eqiad.wmflabs
- 00:02 bd808: Rebooting tools-worker-1002
- 00:00 bd808: Draining tools-worker-1002 to reboot for NFS problems
2020-02-26
- 23:42 bd808: Drained tools-worker-1040
- 23:41 bd808: Cordoned tools-worker-10[16-40] in preparation for shrinking legacy Kubernetes cluster
- 23:12 bstorm_: replacing all tool limit-ranges in the 2020 cluster with a lower cpu request version
- 22:29 bstorm_: deleted pod maintain-kubeusers-6d9c45f4bc-5bqq5 to deploy new image
- 21:06 bstorm_: deleting loads of stuck grid jobs
- 20:27 jeh: rebooting tools-worker-[1008,1015,1021]
- 20:15 bstorm_: rebooting tools-sgegrid-master because it was still affected by the permissions issue
- 18:03 bstorm_: downtimed toolschecker for nfs maintenance
2020-02-25
- 15:31 bd808: `wmcs-k8s-enable-cluster-monitor toolschecker`
2020-02-23
- 00:40 Krenair: T245932
2020-02-21
- 16:02 andrewbogott: moving tools-sgecron-01 to cloudvirt1022
2020-02-20
- 14:49 andrewbogott: moving tools-k8s-worker-19 and tools-k8s-worker-18 to cloudvirt1022 (as part of draining 1014)
- 00:04 Krenair: Shut off tools-puppetmaster-01 - to be deleted in one week T245365
2020-02-19
- 22:05 Krenair: Project-wide hiera change to swap puppetmaster to tools-puppetmaster-02 T245365
- 15:36 bstorm_: setting 'puppetmaster: tools-puppetmaster-02.tools.eqiad.wmflabs' on tools-sgeexec-0942 to test new puppetmaster on grid T245365
- 11:50 arturo: fix invalid yaml format in horizon puppet prefix 'tools-k8s-haproxy' that prevented clean puppet run in the VMs
- 00:59 bd808: Live hacked the "nginx-configuration" ConfigMap for T245426 (done several hours ago, but I forgot to !log it)
2020-02-18
- 23:26 bstorm_: added tools-sgegrid-master.tools.eqiad1.wikimedia.cloud and tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud to gridengine admin host lists
- 09:50 arturo: temporarily delete DNS zone tools.wmcloud.org to try re-creating it
2020-02-17
- 18:53 arturo: T168677 created DNS TXT record _psl.toolforge.org. with value `https://github.com/publicsuffix/list/pull/970`
- 13:22 arturo: relocating tools-sgewebgrid-lighttpd-0914 to cloudvirt1012 to spread VMs of the same type across different hypervisors
2020-02-14
- 00:38 bd808: Added tools-k8s-worker-35 to 2020 Kubernetes cluster (T244791)
- 00:34 bd808: Added tools-k8s-worker-34 to 2020 Kubernetes cluster (T244791)
- 00:32 bd808: Added tools-k8s-worker-33 to 2020 Kubernetes cluster (T244791)
- 00:29 bd808: Added tools-k8s-worker-32 to 2020 Kubernetes cluster (T244791)
- 00:25 bd808: Added tools-k8s-worker-31 to 2020 Kubernetes cluster (T244791)
- 00:25 bd808: Added tools-k8s-worker-30 to 2020 Kubernetes cluster (T244791)
- 00:17 bd808: Added tools-k8s-worker-29 to 2020 Kubernetes cluster (T244791)
- 00:15 bd808: Added tools-k8s-worker-28 to 2020 Kubernetes cluster (T244791)
- 00:13 bd808: Added tools-k8s-worker-27 to 2020 Kubernetes cluster (T244791)
- 00:07 bd808: Added tools-k8s-worker-26 to 2020 Kubernetes cluster (T244791)
- 00:03 bd808: Added tools-k8s-worker-25 to 2020 Kubernetes cluster (T244791)
2020-02-13
- 23:53 bd808: Added tools-k8s-worker-24 to 2020 Kubernetes cluster (T244791)
- 23:50 bd808: Added tools-k8s-worker-23 to 2020 Kubernetes cluster (T244791)
- 23:38 bd808: Added tools-k8s-worker-22 to 2020 Kubernetes cluster (T244791)
- 21:35 bd808: Deleted tools-sgewebgrid-lighttpd-092{1,2,3,4,5,6,7,8} & tools-sgewebgrid-generic-090{3,4} (T244791)
- 21:33 bd808: Removed tools-sgewebgrid-lighttpd-092{1,2,3,4,5,6,7,8} & tools-sgewebgrid-generic-090{3,4} from grid engine config (T244791)
- 17:43 andrewbogott: migrating b24e29d7-a468-4882-9652-9863c8acfb88 to cloudvirt1022
2020-02-12
- 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice (0.63) (T244954)
- 19:15 bd808: Deployed toollabs-webservice (0.63) on bastions (T244954)
- 00:20 bd808: Depooling tools-sgewebgrid-generic-0903 (T244791)
- 00:19 bd808: Depooling tools-sgewebgrid-generic-0904 (T244791)
- 00:14 bd808: Depooling tools-sgewebgrid-lighttpd-0921 (T244791)
- 00:09 bd808: Depooling tools-sgewebgrid-lighttpd-0922 (T244791)
- 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0923 (T244791)
- 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0924 (T244791)
2020-02-11
- 23:58 bd808: Depooling tools-sgewebgrid-lighttpd-0925 (T244791)
- 23:56 bd808: Depooling tools-sgewebgrid-lighttpd-0926 (T244791)
- 23:38 bd808: Depooling tools-sgewebgrid-lighttpd-0927 (T244791)
2020-02-10
- 23:39 bstorm_: updated tools-manifest to 0.21 on aptly for stretch
- 22:51 bstorm_: all docker images now use webservice 0.62
- 22:01 bd808: Manually starting webservices for tools that were running on tools-sgewebgrid-lighttpd-0928 (T244791)
- 21:47 bd808: Depooling tools-sgewebgrid-lighttpd-0928 (T244791)
- 21:25 bstorm_: upgraded toollabs-webservice package for tools to 0.62 T244293 T244289 T234617 T156626
2020-02-07
- 10:55 arturo: drop jessie VM instances tools-prometheus-{01,02} which were shutdown (T238096)
2020-02-06
- 10:44 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/565556 which is a behavior change to the Toolforge front proxy (T234617)
- 10:27 arturo: shutdown again tools-prometheus-01, no longer in use (T238096)
- 05:07 andrewbogott: cleared out old /tmp and /var/log files on tools-sgebastion-07
2020-02-05
- 11:22 arturo: restarting ferm fleet-wide to account for the prometheus servers' changed IPs (but same hostnames) (T238096)
2020-02-04
- 11:38 arturo: start tools-prometheus-01 again to sync data to the new tools-prometheus-03/04 VMs (T238096)
- 11:37 arturo: re-create tools-prometheus-03/04 as 'bigdisk2' instances (300GB) T238096
2020-02-03
- 14:12 arturo: move tools-prometheus-04 from cloudvirt1022 to cloudvirt1013
- 12:48 arturo: shutdown tools-prometheus-01 and tools-prometheus-02 after pointing the proxy `tools-prometheus.wmflabs.org` to tools-prometheus-03 and syncing the data (T238096)
- 09:38 arturo: tools-prometheus-01: systemctl stop prometheus@tools. Another try to migrate data to tools-prometheus-{03,04} (T238096)
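The data migration attempts above are essentially stop-then-copy; a minimal sketch, with the TSDB path and destination hostname as assumptions:
```
# On tools-prometheus-01: quiesce the tools instance before copying its TSDB
systemctl stop prometheus@tools

# Sync the data directory to the new host (path is illustrative)
rsync -a /srv/prometheus/tools/metrics/ \
    tools-prometheus-03.tools.eqiad.wmflabs:/srv/prometheus/tools/metrics/
```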
2020-01-31
- 14:06 arturo: leave tools-prometheus-01 as the backend for tools-prometheus.wmflabs.org for the weekend so grafana dashboards keep working (T238096)
- 14:00 arturo: syncing prometheus data again from tools-prometheus-01 to tools-prometheus-0{3,4} due to some inconsistencies preventing prometheus from starting (T238096)
2020-01-30
- 21:04 andrewbogott: also apt-get install python3-novaclient on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam. Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
- 20:39 andrewbogott: apt-get install python3-keystoneclient on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam. Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
- 16:27 arturo: create VM tools-prometheus-04 as cold standby of tools-prometheus-03 (T238096)
- 16:25 arturo: point tools-prometheus.wmflabs.org proxy to tools-prometheus-03 (T238096)
- 13:42 arturo: disable puppet in prometheus servers while syncing metric data (T238096)
- 13:15 arturo: drop floating IP 185.15.56.60 and FQDN `prometheus.tools.wmcloud.org`, since that is not how the prometheus setup currently works; using a web proxy `tools-prometheus-new.wmflabs.org` instead (T238096)
- 13:09 arturo: created FQDN `prometheus.tools.wmcloud.org` pointing to IPv4 185.15.56.60 (tools-prometheus-03) to test T238096
- 12:59 arturo: associated floating IPv4 185.15.56.60 to tools-prometheus-03 (T238096)
- 12:57 arturo: created domain `tools.wmcloud.org` in the tools project after some back and forth with Designate, permissions and the database. I plan to use this domain to test the new Debian Buster-based prometheus setup (T238096)
- 10:20 arturo: create new VM instance tools-prometheus-03 (T238096)
2020-01-29
- 20:07 bd808: Created {bastion,login,dev}.toolforge.org service names for Toolforge bastions using Horizon & Designate
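Creating those service names with the Designate CLI would look roughly like the following; the record type and target are assumptions (the entry says Horizon & Designate were used):
```
# One recordset per service name, all resolving to the bastion's public address
openstack recordset create toolforge.org. login   --type A --record <bastion floating IP>
openstack recordset create toolforge.org. dev     --type A --record <bastion floating IP>
openstack recordset create toolforge.org. bastion --type A --record <bastion floating IP>
```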
2020-01-28
- 13:35 arturo: `aborrero@tools-clushmaster-02:~$ clush -w @exec-stretch 'for i in $(ps aux | grep [t]ools.j | awk -F" " "{print \$2}") ; do echo "killing $i" ; sudo kill $i ; done || true'` (T243831)
2020-01-27
- 07:05 zhuyifei1999_: wrong package. uninstalled. the correct one is bpfcc-tools and seems only available in buster+. T115231
- 07:01 zhuyifei1999_: apt installing bcc on tools-worker-1037 to see who is sending SIGTERM, will uninstall after done. dependency: bin86. T115231
2020-01-24
- 20:58 bd808: Built tools-k8s-worker-21 to test out build script following openstack client upgrade
- 15:45 bd808: Rebuilding all Docker containers again because I failed to actually update the build server git clone properly last time I did this
- 05:23 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster (take 2)
- 04:41 bd808: Rebuilding all Docker images to pick up webservice-python-bootstrap changes
2020-01-23
- 23:38 bd808: Halted tools-k8s-worker build script after first instance (tools-k8s-worker-10) stuck in "scheduling" state for 20 minutes
- 23:16 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster
- 05:15 bd808: Building tools-elastic-04
- 04:39 bd808: wmcs-openstack quota set --instances 192
- 04:36 bd808: wmcs-openstack quota set --cores 768 --ram 1536000
2020-01-22
- 12:43 arturo: for the record, the issue with tools-worker-1016 was apparently memory exhaustion
- 12:35 arturo: hard-reboot tools-worker-1016 (not responding to even console access)
2020-01-21
- 19:25 bstorm_: hard rebooting tools-sgeexec-0913/14/35 because they aren't even on the network
- 19:17 bstorm_: depooled and rebooted tools-sgeexec-0914 because it was acting funny
- 18:30 bstorm_: depooling and rebooting tools-sgeexec-[0911,0913,0919,0921,0924,0931,0933,0935,0939,0941].tools.eqiad.wmflabs
- 17:21 bstorm_: rebooting toolschecker to recover stale nfs handle
2020-01-16
- 23:54 bstorm_: rebooting tools-docker-builder-06 because there are a couple running containers that don't want to die cleanly
- 23:45 bstorm_: rebuilding docker containers to include new webservice version (0.58)
- 23:41 bstorm_: deployed toollabs-webservice 0.58 to everything that isn't a container
- 16:45 bstorm_: ran configurator to set the gridengine web queues to `rerun FALSE` T242397
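Setting `rerun FALSE` on the web queues is a one-attribute change in gridengine; a minimal sketch, with the queue names as assumptions:
```
# Stop gridengine from automatically re-running webgrid jobs after a node failure
qconf -mattr queue rerun FALSE webgrid-lighttpd
qconf -mattr queue rerun FALSE webgrid-generic
```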
2020-01-14
- 15:29 bstorm_: failed the gridengine master back to the master server from the shadow
- 02:23 andrewbogott: rebooting tools-paws-worker-1006 to resolve hangs associated with an old NFS failure
2020-01-13
- 17:48 bd808: Running `puppet ca destroy` for each unsigned cert on tools-puppetmaster-01 (T242642)
- 16:42 bd808: Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. T242559
- 16:33 bd808: Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. T242559
- 16:31 bd808: Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. T242559
- 16:26 bd808: Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. T242559
2020-01-12
- 22:31 Krenair: same on -13 and -14
- 22:28 Krenair: same on -8
- 22:18 Krenair: same on -7
- 22:11 Krenair: Did usual new instance creation puppet dance on tools-k8s-worker-6, /data/project got created
2020-01-11
- 01:33 bstorm_: updated toollabs-webservice package to 0.57, which should allow persisting mem and cpu in manifests with burstable qos.
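With that change, a tool can pin its memory/CPU requests so they persist in the service manifest; a hypothetical invocation (flag names and placement per the tools-webservice docs of that era, values are examples):
```
# Request burstable resources for a Kubernetes webservice
webservice --backend=kubernetes --cpu 1 --mem 2Gi python3.7 start
```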
2020-01-10
- 23:31 bstorm_: updated toollabs-webservice package to 0.56
- 15:45 bstorm_: depooled tools-paws-worker-1013 to reboot because I think it is the last tools server with that mount issue (I hope)
- 15:35 bstorm_: depooling and rebooting tools-worker-1016 because it still had the leftover mount problems
- 15:30 bstorm_: git stash-ing local puppet changes in hopes that arturo has that material locally, and it doesn't break anything to do so
2020-01-09
- 23:35 bstorm_: depooled tools-sgeexec-0939 because it isn't acting right and rebooting it
- 18:26 bstorm_: re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs T242353
- 18:25 bstorm_: re-joining the k8s nodes to the cluster one at a time to rotate the certs T242353
- 18:06 bstorm_: rebooting tools-paws-master-01 T242353
- 17:46 bstorm_: refreshing the paws cluster's entire x509 environment T242353
2020-01-07
- 22:40 bstorm_: rebooted tools-worker-1007 to recover it from disk full and general badness
- 16:33 arturo: deleted by hand pod metrics/cadvisor-5pd46 due to prometheus having issues scraping it
- 15:46 bd808: Rebooting tools-k8s-worker-[6-14]
- 15:35 bstorm_: changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster T242067
- 14:02 arturo: `root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server` (T241853)
- 13:33 arturo: upload docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0 copied from quay.io/coreos/kube-state-metrics:v1.8.0 (T241853)
- 13:31 arturo: upload docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6 copied from k8s.gcr.io/metrics-server-amd64:v0.3.6 (T241853)
- 13:23 arturo: [new k8s] doing changes to kube-state-metrics and metrics-server trying to relocate them to the 'metrics' namespace (T241853)
- 05:28 bd808: Creating tools-k8s-worker-[6-14] (again)
- 05:20 bd808: Deleting busted tools-k8s-worker-[6-14]
- 05:02 bd808: Creating tools-k8s-worker-[6-14]
- 00:26 bstorm_: repooled tools-sgewebgrid-lighttpd-0919
- 00:17 bstorm_: repooled tools-sgewebgrid-lighttpd-0918
- 00:15 bstorm_: moving tools-sgewebgrid-lighttpd-0918 and -0919 to cloudvirt1004 from cloudvirt1029 to rebalance load
- 00:02 bstorm_: depooled tools-sgewebgrid-lighttpd-0918 and 0919 to move to cloudvirt1004 to improve spread
2020-01-06
- 23:40 bd808: Deleted tools-sgewebgrid-lighttpd-09{0[1-9],10}
- 23:36 bd808: Shutdown tools-sgewebgrid-lighttpd-09{0[1-9],10}
- 23:34 bd808: Decommissioned tools-sgewebgrid-lighttpd-09{0[1-9],10}
- 23:13 bstorm_: Repooled tools-sgeexec-0922 because I don't know why it was depooled
- 23:01 bd808: Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
- 22:58 bd808: Depooling tools-sgewebgrid-lighttpd-090[2-9]
- 22:57 bd808: Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
- 21:07 bd808: Restarted kube2proxy on tools-proxy-05 to try and refresh admin tool's routes
- 18:54 bstorm_: edited /etc/fstab to remove NFS and unmounted the nfs volumes on tools-k8s-haproxy-1 T241908
- 18:49 bstorm_: edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 T241908
- 18:47 bstorm_: added mount_nfs=false to tools-k8s-haproxy puppet prefix T241908
- 18:24 bd808: Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
- 16:42 bstorm_: failed sge-shadow-master back to the main grid master
- 16:42 bstorm_: Removed files for old S1tty that wasn't working on sge-grid-master
2020-01-04
- 18:11 bd808: Shutdown tools-worker-1029
- 18:10 bd808: kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
- 18:06 bd808: Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
- 16:54 bstorm_: moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
- 16:47 bstorm_: moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
- 16:16 bd808: Draining tools-worker-10{05,12,28} due to hardware errors (T241884)
- 16:13 arturo: moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
- 16:11 arturo: moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
- 16:09 arturo: moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
- 16:08 arturo: moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
- 16:07 arturo: moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
- 16:06 arturo: moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
- 16:04 arturo: moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
- 16:02 arturo: moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241873)
2020-01-03
- 16:48 bstorm_: updated the ValidatingWebhookConfiguration for the ingress admission controller to the working settings
- 11:51 arturo: [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 (T237643)
- 11:21 arturo: upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for T237643
- 03:04 bd808: Really rebuilding all {jessie,stretch,buster}-sssd images. Last time I forgot to actually update the git clone.
- 00:11 bd808: Rebuilding all stretch-sssd Docker images to pick up busybox
2020-01-02
- 23:54 bd808: Rebuilding all buster-sssd Docker images to pick up busybox