16:57 mutante: - puppetmaster-1003 reachable again but service fails to start and puppetserver-deploy-code fails
16:50 mutante: - rebooted unreachable puppetmaster-1003 - was "no route to host" - but is back now, log had a "/dev/sdb: Can't open blockdev" as well
2024-04-12
17:55 mutante: - changed both puppetmaster and puppetdb hiera setting back to puppetmaster-1001 for instance deploy-1004
17:49 mutante: - deploy-1004 itself is on buster, and buster and puppet 7 don't mix well - testing deployment role on bookworm, creating deploy-1005
17:48 mutante: - deploy-1004 has a puppet problem when talking to new puppetmaster-1003 that goes away when switching back to puppetmaster-1001
2024-04-11
20:08 mutante: zuul-1001 - switching to new puppetmaster-1003 in puppet.conf manually, switched project defaults in repo too
19:58 mutante: manually editing puppet.conf to use puppetmaster-1003 instead of -1001 because you can't switch the puppetmaster via puppet if puppet is already broken :)
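For reference, the manual switch amounts to pointing the agent's `server` setting at the new master. A minimal sketch on a scratch copy (on the instance the real file is /etc/puppet/puppet.conf and needs sudo; the hostnames match the log, the config layout is a standard Debian puppet agent assumption):

```shell
# Scratch copy of the agent config; the real file lives at
# /etc/puppet/puppet.conf on the instance.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[agent]
server = puppetmaster-1001.devtools.eqiad.wmflabs
EOF

# Point the agent at the new puppetmaster instead of -1001.
sed -i 's/puppetmaster-1001/puppetmaster-1003/' "$conf"
grep '^server' "$conf"
```

After an edit like this, a manual `puppet agent --test` run would confirm the agent can actually reach the new master.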
19:41 mutante: switching gitlab-runner-1005 from puppetmaster-1001 to puppetmaster-1003 via web Hiera
19:03 mutante: - deleting instance contint-bullseye which was only used by me for a test before we created contint1003 in prod T334517 T361224
18:38 mutante: - attempting to fix puppet run on vrts-1001 related to switching prod to cfssl for SSL certs
18:23 mutante: - shutting down puppetmaster-1001 on buster - should now be replaced by puppetmaster-1003 on bookworm (thanks brennen) T360964 T360470
18:02 mutante: - shutting down instance devtools-puppetdb1001 - which is on buster - basically to see what breaks or complains, if anything
17:27 mutante: - added profile::pki::client::ensure: present to instance hiera for etherpad-bookworm - fixing broken puppet run
17:25 mutante: - attempting to fix puppet on instance etherpad-bookworm but SSL provider cfssl doesn't appear to work in cloud
2024-02-09
19:08 mutante: deleting instance phabricator-prod-1001 (shut down a couple days ago, buster instance replaced by phabricator-bullseye instance) T356530
22:31 mutante: - phabricator-bullseye, instance hiera: setting phabricator_domain to phabricator.wmcloud.org and phabricator_altdomain to phab-usercontent.wmcloud.org T356530
22:27 mutante: - deleted proxies phab.wmflabs.org, phab-prod-usercontent.wmflabs.org, phabricator.wmflabs.org - created proxies phabricator.wmcloud.org, phab-usercontent.wmcloud.org - wmflabs names are legacy and should migrate T356530
20:59 mutante: - changing phabricator domain in instance Hiera of phabricator-bullseye to phab.wmflabs.org and running puppet to update apache config/rewrite rules T356530
20:41 mutante: shutting down instance phabricator-prod-1001 (buster), replaced by phab-bullseye (bullseye) T356530
20:38 mutante: deleting proxy phab-bull.wmcloud.org after previous proxy names are switched to bullseye backend T356530
20:17 mutante: editing proxies phab.wmflabs.org and phab-prod-usercontent.wmflabs.org to point to bullseye instance instead of buster T356530
20:13 mutante: running "scap deploy" in /srv/deployment/phabricator/deployment on deploy-1004 which deploys to phabricator-bullseye and phabricator-prod-1001 T356530
19:29 mutante: deleting web proxy phabricator-prod.wmflabs.org which pointed to port 443 on the buster instance and timed out T356530
19:12 mutante: deleting web proxy phorge.wmcloud.org which pointed to 172.16.7.98 which doesn't exist anymore T356530
2024-01-05
21:30 mutante: contint-bullseye - sudo /usr/sbin/a2dismod mpm_event ; sudo /usr/sbin/a2enmod php74 - the usual issue we have had for years
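For context, a2dismod/a2enmod only manage symlinks under /etc/apache2/mods-enabled; mod_php needs the prefork MPM, which conflicts with mpm_event. A sketch of that mechanism on scratch directories - the real commands need sudo on the instance, and the exact PHP module name (php74 as logged, vs php7.4) is an assumption taken from the entry above:

```shell
# Stand-ins for /etc/apache2/mods-available and mods-enabled.
avail=$(mktemp -d); enabled=$(mktemp -d)
touch "$avail/mpm_event.load" "$avail/mpm_prefork.load" "$avail/php74.load"
ln -s "$avail/mpm_event.load" "$enabled/mpm_event.load"

# Equivalent of: a2dismod mpm_event
rm "$enabled/mpm_event.load"
# Equivalent of: a2enmod mpm_prefork ; a2enmod php74
ln -s "$avail/mpm_prefork.load" "$enabled/mpm_prefork.load"
ln -s "$avail/php74.load" "$enabled/php74.load"

ls "$enabled"
```

On a real host this would be followed by `systemctl restart apache2` for the MPM change to take effect.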
2023-12-18
21:53 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.quota_increase (exit_code=99)
23:49 mutante: starting manual gitlab upgrade process on gitlab-prod-1002
2023-11-30
21:03 mutante: - phabricator-bullseye - created user app_user and granted privileges in mysql for user from 127.0.0.1, ran ./phabricator/bin/storage upgrade --force; set 'phabricator_domain: phab-bull.wmcloud.org' in web Hiera
20:33 mutante: - phabricator-bullseye - running 'mariadb-secure-installation' interactive script - this fixed mysql shell which previously exited with "bash: /nonexistent: No such file or directory"
20:27 mutante: - phabricator-bullseye - attempting to fix mariadb/mysql server, apt-get remove mariadb-server, running puppet, debugging why it won't start
2023-11-21
21:52 mutante: - commit fake key for phabricator-bullseye host in git /var/lib/git/labs/private/modules/secret/secrets/ssl on puppetmaster-1001.devtools T327068
21:41 mutante: - cert issue on new machine related to having local puppetmaster, like T349937#9288547 except "rm -rf /var/lib/puppet/ssl" was enough since puppetmaster did auto-sign new CSR - T327068
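The fix above works because the agent-side SSL state is disposable: wiping it makes the next puppet run generate a fresh key and CSR, which the auto-signing puppetmaster then signs. A sketch on a scratch directory standing in for /var/lib/puppet/ssl (the real command needs sudo on the instance):

```shell
# Scratch stand-in for /var/lib/puppet/ssl on the instance.
ssl_dir=$(mktemp -d)
mkdir -p "$ssl_dir/certs"
touch "$ssl_dir/certs/stale-self-signed.pem"  # the cert puppet refused to verify

# Equivalent of: sudo rm -rf /var/lib/puppet/ssl
# (the next 'puppet agent --test' run recreates the directory,
# generates a new key pair and submits a fresh CSR)
rm -rf "$ssl_dir"
[ -e "$ssl_dir" ] || echo "ssl state cleared"
```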
21:24 mutante: - initial puppet run on newly created VM fails with "SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs" T327068
21:00 mutante: - deleted instance phorge-1001 to get quota back and allow for creating new phabricator-on-bullseye instance T328595 T327068
2023-04-12
19:26 mutante: - vrts-1001 - editing /etc/my.cnf to set mariadb datadir to /var/lib/mysql instead of /srv/sqldata and restart service, issue like T329571
2023-03-11
00:20 mutante: - on phorge1001, enable general query log in mysql (mariadb), to learn about the database schema; don't forget to turn that off so the VM doesn't run out of disk (SET GLOBAL general_log=1;) T328595
2023-03-07
23:34 mutante: - phorge-1001 - MariaDB [(none)]> SET GLOBAL max_allowed_packet=33554432;
23:33 mutante: - phorge-1001 - MariaDB [(none)]> SET GLOBAL local_infile=0;
23:31 mutante: - phorge-1001 - MariaDB [(none)]> SET GLOBAL sql_mode = "STRICT_ALL_TABLES,STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION";
22:18 mutante: install package python3-certbot-apache on gerrit-prod-1001 - T329444
22:03 mutante: - re-activating disabled puppet on gerrit-prod-1001 (reason given was 'gerrit deploy' but it was about 17 days ago)
21:58 mutante: rebooting instance gerrit-prod-1001 which can't be reached T329444
2023-01-31
22:39 mutante: remove role::gitlab from gitlab-prod-1001. to be replaced with gitlab-prod-1002. T318521
2023-01-28
16:26 taavi: adjust gitlab-prod-1002 network port settings to allow adding the secondary IP, requested in T318521
2023-01-24
09:47 wm-bot2: Increased quotas by 1 instances (T327750) - cookbook ran by arturo@nostromo
2022-11-30
15:57 wm-bot2: Increased quotas by 1 floating-ips (T323986) - cookbook ran by dcaro@vulcanus
2022-11-17
18:18 andrewbogott: committed a local puppet change on puppetprimary to fix upstream syncs
2022-10-28
20:24 mutante: - removing from Horizon / project-wide hiera: profile::phabricator::main::manage_scap_user: true (set in the repo)
20:23 mutante: - removing from Horizon / project-wide hiera: profile::keyholder::server::require_encrypted_keys: 'no' (set in the repo)
20:19 mutante: - removing from Horizon / project-wide hiera: profile::gerrit::daemon_user: gerrit2, profile::gerrit::manage_scap_user: true, profile::gerrit::scap_user: gerrit-deploy (all of these are set in the repo)
2022-10-19
19:21 mutante: - on puppetmaster-1001.devtools created /var/lib/puppet/volatile/GeoIP directory - to fix puppet error on deploy-1004.devtools - reacting to puppet-broken-nagging-emails
20:30 mutante: - created gitlab-runner-1002 - applied puppet role - attached cinder volume "docker" - running puppet again
17:06 mutante: deleting instance gitlab-runner-1001 which just disconnects people. Gut feeling is it has to do with the fact that a previous instance name was reused
2022-06-14
23:29 mutante: - creating instance gitlab-runner-1001 since we did not have a test machine for gitlab-runners but need one to test things like gerrit:791655 before hitting prod T308271
2022-04-29
20:59 mutante: - restarting instance gitlab-prod-1001 - No route to host
20:55 mutante: - attempting to soft reboot instance deploy1004 (got the puppet fail mail and it wasn't reachable by ssh). This happened recently to gitlab-prod-1001 as well - same project, different instance - but this time it doesn't just come back right away
2022-04-20
17:03 mutante: soft rebooting gitlab-prod-1001 which was sending "failed puppet" reports and was unreachable, just like the other day.
2022-04-18
19:08 mutante: - gitlab-prod-1001 is indeed back after soft rebooting the instance. uptime 1 min T297411
19:07 mutante: - gitlab-prod-1001 randomly stopped working. we got the "puppet failed" mails without having made changes and can't ssh to the instance anymore when trying to check out why. trying soft reboot via Horizon T297411
17:03 mutante: - not sure if possible (for me) to create a bullseye deployment server in cloud, using scap: failed: Execution of '/usr/bin/scap deploy --init', missing PHP packages, missing prometheus-mcrouter-exporter and more T306069
17:02 mutante: - not sure if possible (for me) to create a deployment server in cloud, using scap: failed: Execution of '/usr/bin/scap deploy --init'
16:40 mutante: creating deploy1003 to replace deploy1002 T306069
16:36 mutante: deleting instance gitlab-runner-1001 - was just for testing, real runners are upgraded in their own project
2022-03-02
22:22 mutante: - creating gitlab-runner-1001 on bullseye - purely test for T297659
2022-03-01
18:16 taavi: allocated secondary IP for gitlab-prod-1001 per request on T302803
2022-02-15
16:08 taavi: created devtools.wmcloud.org dns zone for the devtools project T301793
2022-01-26
17:26 arturo: bump quota, floating IP from 1 to 2 (T299561)
15:56 arturo: bump quota, RAM from 32 to 40, cores from 16 to 20 (T299561)
2022-01-21
22:12 mutante: - created new instance gitlab-prod-1001 T297411
22:11 mutante: - created new instance gitlab-prod-1001 T297411
21:57 mutante: - deleted instances "doc" and "doc1002" to make room for gitlab instance T299561 - T297411
2022-01-19
17:36 mutante: - added brennen, aokoth and jelto as users and projectadmins (T297411)
2021-11-10
19:49 mutante: - removing manually added things in Horizon Hiera that were already in the repo, please don't keep adding in web UI, we don't want to repeat the same thing we did in deployment-prep
08:04 mutante: - broken puppet again from prod changes. this time: deploy-1002 - '[]' is not applicable to an Undef Value. mediawiki/mcrouter_wancache.pp, line: 19
2020-04-13
10:00 mutante: - phabricator-stage-1001: replace deployment-tin.deployment-prep with deploy-1002.devtools in deployment-cache/.config
09:40 mutante: set missing (and new) profile::tlsproxy::envoy::capitalize_headers: true to fix puppet errors
09:35 mutante: set phabricator::vcs::address::v6 to fe80 local address to fix puppet error on phabricator-stage-1001
2020-01-16
00:53 mutante: deploy-1002 - become 'trebuchet' user and ssh to phabricator scap targets, to fix ssh host key verification issue on first deploy
00:30 mutante: deploy-1002 live hack /srv/deployment/phabricator/deployment/scap/phabricator-targets and replace prod server with cloud instances; scap deploy in phabricator repo
2020-01-15
23:51 paladox: deploy-1002 rm -rf /srv/deployment
23:44 mutante: deploy-1002 sudo git init in /srv/deployment ; scap deploy --init (now fails with 'fatal: Not a valid object name HEAD')
23:42 mutante: deploy-1002 mkdir /srv/deployment/.git ; chown trebuchet:wikidev .git ; manually run "scap deploy --init" as trebuchet user in an attempt to fix initial puppet run on deployment_server
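The 'fatal: Not a valid object name HEAD' error in the entries above is what git reports for a repository with no commits yet - scap was trying to resolve HEAD in a freshly initialised /srv/deployment. A minimal reproduction with a throwaway repo:

```shell
repo=$(mktemp -d)
git -C "$repo" init -q

# No commits yet, so HEAD does not resolve - the state that broke
# 'scap deploy --init' here.
git -C "$repo" rev-parse --verify -q HEAD >/dev/null || echo "no valid HEAD yet"

# After a first commit, HEAD is a real object name again.
git -C "$repo" -c user.email=x@example.org -c user.name=tmp \
    commit -q --allow-empty -m 'initial commit'
git -C "$repo" rev-parse --verify -q HEAD >/dev/null && echo "HEAD resolves"
```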
16:01 bstorm_: moving vm puppetmaster-1001 from cloudvirt1024 to cloudvirt1009 due to hardware error T241884
2020-01-03
22:37 mutante: - sudo vi /srv/deployment/phabricator/deployment-cache/.config on both phabricator instances to fix deployment server (remove deployment-tin (!))
21:50 mutante: assigned 172.16.0.198/32 on eth0 on phabricator-prod-1001
21:50 jeh: add secondary interface to phabricator-prod-1001
21:32 mutante: configure 172.16.0.189 as "vcs" address v4 for phabricator-stage-1001
21:24 jeh: add secondary interface to phabricator-stage-1001