Nova Resource:Admin/SAL


2020-06-04

  • 14:24 andrewbogott: disabling puppet on all instances for /labs/private recovery
  • 14:23 arturo: disabling puppet on all instances for /labs/private recovery
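Entries on this page follow a fixed shape (a `YYYY-MM-DD` heading followed by `HH:MM user: message` bullets), so the log can be grouped mechanically. A minimal parsing sketch assuming exactly that shape; `parse_sal` is an illustrative helper, not an existing tool:

```python
import re

# Date headings stand alone as "YYYY-MM-DD"; each entry bullet has the
# shape "HH:MM user: message".
ENTRY = re.compile(r"^(\d{2}:\d{2}) (\S+): (.*)$")
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def parse_sal(lines):
    """Group timestamped entries under their date heading."""
    days, current = {}, None
    for line in lines:
        text = line.strip().lstrip("• ").strip()
        if DATE.match(text):
            current = text
            days.setdefault(current, [])
            continue
        m = ENTRY.match(text)
        if m and current is not None:
            days[current].append((m.group(1), m.group(2), m.group(3)))
    return days

sample = [
    "2020-06-04",
    "  • 14:24 andrewbogott: disabling puppet on all instances",
]
parsed = parse_sal(sample)
```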

2020-05-28

  • 23:02 bd808: `/usr/local/sbin/maintain-dbusers --debug harvest-replicas` (T253930)
  • 13:36 andrewbogott: rebuilding cloudservices2002-dev with Buster
  • 00:33 andrewbogott: shutting down cloudservices2002-dev to see if we can live without it. This is in anticipation of rebuilding it entirely for T253780


2020-05-27

  • 23:29 andrewbogott: disabling the backup job on cloudbackup2001 (just like last week) so the backup doesn't start while Brooke is rebuilding labstore1004 tomorrow.
  • 06:03 bd808: `systemctl start mariadb` on clouddb1001 following reboot (take 2)
  • 05:58 bd808: `systemctl start mariadb` on clouddb1001 following reboot
  • 05:53 bd808: Hard reboot of clouddb1001 via Horizon. Console unresponsive.

2020-05-25

  • 16:35 arturo: [codfw1dev] created zone `0-29.57.15.185.in-addr.arpa.` (T247972)
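The zone name above follows the RFC 2317 classless-delegation convention for a sub-/24 block: the first label encodes the network's last octet and prefix length, and the remaining octets are reversed as usual. A sketch of deriving the name from the CIDR, assuming that convention (`classless_ptr_zone` is a hypothetical helper):

```python
import ipaddress

def classless_ptr_zone(cidr: str) -> str:
    """Derive an RFC 2317-style reverse zone name for a sub-/24 network."""
    net = ipaddress.ip_network(cidr)
    o = str(net.network_address).split(".")
    # First label: "<last octet>-<prefix length>"; remaining octets are
    # reversed as in an ordinary in-addr.arpa zone.
    return f"{o[3]}-{net.prefixlen}.{o[2]}.{o[1]}.{o[0]}.in-addr.arpa."

zone = classless_ptr_zone("185.15.57.0/29")  # the floating-IP block above
# → "0-29.57.15.185.in-addr.arpa."
```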

2020-05-21

  • 19:23 andrewbogott: disabling puppet on cloudbackup2001 to prevent the backup job from starting during maintenance
  • 19:16 andrewbogott: systemctl disable block_sync-tools-project.service on cloudbackup2001.codfw.wmnet to avoid stepping on current upgrade
  • 15:48 andrewbogott: re-imaging cloudnet1003 with Buster

2020-05-19

  • 22:59 bd808: `apt-get install mariadb-client` on cloudcontrol1003
  • 21:12 bd808: Migrating wcdo.wcdo.eqiad.wmflabs to cloudvirt1023 (T251065)

2020-05-18

  • 21:37 andrewbogott: rebuilding cloudnet2003-dev with Buster

2020-05-15

  • 22:10 bd808: Added reedy as projectadmin in cloudinfra project (T249774)
  • 22:05 bd808: Added reedy as projectadmin in admin project (T249774)
  • 18:44 bstorm_: rebooting cloudvirt-wdqs1003 T252831
  • 15:47 bd808: Manually running wmcs-novastats-dnsleaks from cloudcontrol1003 (T252889)

2020-05-14

  • 23:28 bstorm_: downtimed cloudvirt1004/6 and cloudvirt-wdqs1003 until tomorrow around this time T252831
  • 22:21 bstorm_: upgrading qemu-system-x86 on cloudvirt1006 to backports version T252831
  • 22:15 bstorm_: changing /etc/libvirt/qemu.conf and restarting libvirtd on cloudvirt1006 T252831
  • 21:12 andrewbogott: rebuilding cloudvirt1003-wdqs as part of T252831
  • 15:47 andrewbogott: moving cloudvirt1004 and cloudvirt1006 to the 'ceph' aggregate for T252784
  • 15:02 andrewbogott: moving all of cloudvirt100[1-9] into the 'toobusy' host aggregate. These are slower, have spinning disks, and are due for replacement.

2020-05-12

  • 20:33 andrewbogott: moving cloudvirt1023 to the 'standard' pool and out of the 'spare' pool
  • 19:10 jeh: disable neutron-openvswitch-agent service on cloudvirt2001-dev.codfw T248881
  • 19:09 jeh: Shutdown the unused eno2 network interface on cloudvirt2001-dev.codfw to clear up monitoring errors T248425
  • 18:20 andrewbogott: moving cloudvirt1024 out of the 'maintenance' aggregate and into 'spare'
  • 16:45 andrewbogott: restarting neutron-l3-agent on cloudnet1004 so it knows about all three cloudcontrols. Leaving cloudnet1003 since restarting it there will cause network interruptions
  • 14:06 arturo: icinga downtime everything for 2h for Debian Buster migration in some cloud components

2020-05-09

  • 16:53 andrewbogott: rebuilding cloudcontrol2001-dev and 2003-dev with buster for T252121

2020-05-08

  • 19:02 bstorm_: moving tools-k8s-haproxy-2 from cloudvirt1021 to cloudvirt1017 to improve spread

2020-05-05

  • 13:58 andrewbogott: rebuilding cloudcontrol2004-dev to test new puppet changes

2020-05-04

  • 09:04 arturo: [codfw1dev] manually modify iptables ruleset to only allow SSH from WMF bastions on cloudservices2003-dev and cloudcontrol2004-dev (T251604)

2020-04-21

  • 22:12 andrewbogott: moving cloudvirt1004 out of the 'standard' aggregate and into the 'maintenance' aggregate
  • 16:01 jeh: restart cloudceph mon and osd services for openssl upgrades

2020-04-15

  • 18:44 jeh: create indexes and views for grwikimedia T245912

2020-04-13

  • 15:07 jeh: restart memcached on labwebs to increase cache size T145703

2020-04-09

  • 19:57 andrewbogott: upgrading eqiad1 designate to rocky
  • 16:52 andrewbogott: cleaned up a bunch of leaked .eqiad.wmflabs dns records

2020-04-08

  • 19:20 andrewbogott: rotated password and api token for pdns servers on cloudservices1003 and cloudservices1004
  • 14:54 arturo: `root@cloudcontrol1003:~# cp /etc/inputrc .inputrc` to solve some bash shortcut weirdness

2020-04-07

  • 20:57 andrewbogott: service sssd stop; rm -rf /var/lib/sss/db*; service sssd start on tools-sgebastion-08

2020-04-06

  • 22:39 andrewbogott: deleting bogus groups cn=b'project-bastion',ou=groups,dc=wikimedia,dc=org and cn=b'project-tools',ou=groups,dc=wikimedia,dc=org from ldap
  • 17:42 arturo: [codfw1dev] transferred DNS zone 57.15.185.in-addr.arpa. to the cloudinfra-codfw1dev project (T247972)
  • 17:39 arturo: [codfw1dev] `openstack zone create --email root@wmflabs.org --type PRIMARY --ttl 3600 --description "floating IPs subnet" 57.15.185.in-addr.arpa.` (T247972)
  • 16:23 arturo: restarting apache2 in cloudcontrol1003/1004 to pick up latest wmfkeystonehooks changes T249494

2020-04-02

  • 20:59 jeh: codfw1dev clear VM error states and start bastions, puppet master and database

2020-04-01

  • 16:27 arturo: [codfw1dev] enable puppet across the fleet to clean up the vxlan changes (T248881)

2020-03-31

  • 12:35 arturo: [codfw1dev] restarting VMs: designaterockytest14, bastion-codfw1dev-0[1,2] (T248881)
  • 12:34 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudvirt2001-dev (T248881)
  • 12:25 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudnet200[2,3]-dev (T248881)
  • 11:45 arturo: [codfw1dev] rebooting cloudvirt2003-dev to pick up latest kernel update. Otherwise modprobe is confused trying to load modules and openvswitch won't start (T248881)
  • 10:40 arturo: [codfw1dev] installing neutron-openvswitch-agent on cloudvirt2003-dev (T248881)
  • 10:09 arturo: [codfw1dev] reboot cloudnet2003-dev into linux 4.9 (was using 4.14 from a testing operation on 2020-03-10)

2020-03-27

  • 21:28 bd808: Created huggle.wmcloud.org Designate zone and allocated it to the huggle project
  • 19:51 jeh: start haproxy on cloudcontrol2003-dev.wikimedia.org

2020-03-26

  • 15:01 arturo: icinga downtime cloudvirt* cloudcontrol* cloudnet* lab* cloudstore*
  • 15:01 andrewbogott: beginning openstack upgrade window for T242766
  • 12:32 arturo: [codfw1dev] downgraded systemd, libsystemd0, udev and friends to the non-backports versions (T247013)

2020-03-25

  • 19:29 andrewbogott: dumping a bunch of VMs on cloudvirt1015 to see if it still crashes
  • 17:56 jeh: add labweb1002 back into the pool - completed horizon testing T240852
  • 17:09 jeh: depool labweb1002 for horizon testing T240852

2020-03-24

  • 19:41 jeh: switch cloudvirt1016 from maintenance to standard host aggregate T243327
  • 15:31 andrewbogott: restarting nova-conductor and nova-api on cloudcontrol1003 and cloudcontrol1004

2020-03-23

  • 21:41 jeh: restart neutron-l3-agent on cloudnet100[3,4] to pickup policy.yaml changes
  • 13:28 jeh: disable puppet on labweb100[1,2] to enable horizon event traces T240852
  • 10:26 arturo: restarting apache in both labweb1001/labweb1002 upon reports of returning 500s

2020-03-21

  • 14:23 andrewbogott: restarting apache2 on labweb1001 and 1002

2020-03-18

  • 19:17 andrewbogott: deleted a bunch of records from the pdns database on cloudservices1003/1004 which had a record name but the content (where an IP address should be) was NULL, e.g. m.wikidata.beta.wmflabs.org.
  • 10:55 arturo: [codfw1dev] deleting BGP agent, undoing changes we did for T245606

2020-03-14

  • 17:40 jeh: restart maintain-dbusers on labstore1004 T247654

2020-03-12

  • 22:29 bstorm_: running puppet across all dumps mounts to make sure active links are shifted to labstore1006

2020-03-10

  • 17:02 arturo: [codfw1dev] deleting address scopes, bad interaction with our custom NAT setup T247135
  • 13:55 arturo: [codfw1dev] rebooting cloudnet2003-dev into linux kernel 4.14 for testing stuff related to T247135

2020-03-09

  • 18:09 arturo: enabling puppet in cloudvirt1006, all services have been restored
  • 17:59 arturo: deleted the neutron bridge on cloudvirt1006, for testing stuff related to the queens upgrade
  • 17:58 arturo: stopped neutron-linuxbridge-agent and nova-compute in cloudvirt1006 for testing stuff related to the queens upgrade

2020-03-06

  • 14:54 andrewbogott: draining all instances off of cloudvirt1006 for T246908

2020-03-05

  • 14:24 arturo: [codfw1dev] we just enabled BGP session between cloudnet2xxx-dev and cr1-codfw (T245606)
  • 13:07 arturo: [codfw1dev] move the extra IP address for BGP in cloudnet200x-dev servers from eno2.2120 to the br-external bridge device (T245606)
  • 13:06 arturo: [codfw1dev] upgrade neutron-dynamic-routing packages in cloudnet200X-dev and cloudcontrol200X-dev servers to 11.0.0-2~bpo9+1 (T245606)

2020-03-04

  • 22:22 andrewbogott: upgrading designate on cloudservices1003/1004 to Queens
  • 22:09 andrewbogott: moving cloudvirt1006 into the maintenance aggregate for T246908
  • 21:37 bd808: Running wmcs-wikireplica-dns to add service names for ngwikimedia.*.db.svc.eqiad.wmflabs (T240772)
  • 21:14 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1009 (T246056)
  • 21:11 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1010 (T246056)
  • 21:08 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1011 (T246056)
  • 21:05 bd808: Running `sudo maintain-meta_p --all-databases --purge` on labsdb1002 (T246056)

2020-03-02

  • 16:54 arturo: [codfw1dev] deleted python3-os-ken debian package in cloudnet2003-dev which was installed by hand and had dependency issues

2020-02-29

  • 16:32 bstorm_: downtimed the smart alert on cloudvirt1009 until Monday since apparently predictive failures flap T244986

2020-02-26

  • 22:03 jeh: powering down cloudvirt1014 for hardware maintenance

2020-02-25

  • 16:08 andrewbogott: changing neutron's rabbitmq password because oslo is having trouble parsing some of the characters in the password
  • 15:26 andrewbogott: updated the cell_mapping record in the nova_api database to add the second rabbitmq server to the transport_url field
  • 15:26 andrewbogott: updated the cell_mapping record in the nova_api database to set the db uri to 'mysql+pymysql' -- this in response to a deprecation notice

2020-02-24

  • 12:16 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr2-codfw` (T245606)
  • 12:16 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr1-codfw` (T245606)
  • 12:09 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.187 --remote-as 65002 cr2-codfw` (T245606)
  • 12:09 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.186 --remote-as 65002 cr1-codfw` (T245606)
  • 12:06 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-delete 17b8c2a3-f0ce-4d50-a265-18ccac703c61` (T245606)
  • 10:59 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker bgppeer` (T245606)
  • 10:56 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.185 --remote-as 65002 bgppeer` (T245606)
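The peer setup above repeats a two-step pattern per router: `bgp-peer-create` to define the peer, then `bgp-speaker-peer-add` to attach it to the speaker. A sketch that renders the same command pairs from the peer data logged above (command rendering only; nothing here talks to neutron):

```python
# Peer data as recorded in the entries above.
PEERS = [
    ("cr1-codfw", "208.80.153.186", 65002),
    ("cr2-codfw", "208.80.153.187", 65002),
]

def peer_commands(speaker, peers):
    """Render the create/attach command pair for each router peer."""
    cmds = []
    for name, ip, remote_as in peers:
        cmds.append(
            f"neutron bgp-peer-create --peer-ip {ip} "
            f"--remote-as {remote_as} {name}"
        )
        cmds.append(f"neutron bgp-speaker-peer-add {speaker} {name}")
    return cmds

cmds = peer_commands("bgpspeaker", PEERS)
```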

2020-02-21

  • 12:48 arturo: [codfw1dev] running `root@cloudcontrol2001-dev:~# neutron bgp-speaker-network-add bgpspeaker wan-transport-codfw` (T245606)
  • 12:46 arturo: [codfw1dev] created bgpspeaker for AS64711 (T245606)
  • 12:42 arturo: [codfw1dev] run `sudo neutron-db-manage upgrade head` to upgrade the db schema for neutron bgp tables
  • 11:51 arturo: [codfw1dev] created a neutron subnet pool for each subnet object we have and manually updated the DB to associate them (T245606)
  • 11:49 arturo: [codfw1dev] rename neutron address scope `no-nat` to `bgp` (T245606)
  • 11:37 arturo: [codfw1dev] cleanup unused neutron subnet pools from previous address scope tests (T244851)

2020-02-20

  • 19:22 andrewbogott: updating designate pool config for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/572213/
  • 15:33 andrewbogott: migrating all VMs on cloudvirt1014 to cloudvirt1022
  • 13:35 arturo: [codfw1dev] disable puppet in cloudcontrol servers to hack neutron.conf for tests related to T245606
  • 13:33 arturo: [codfw1dev] disable puppet in cloudnet servers to hack neutron.conf for tests related to T245606

2020-02-18

  • 22:19 andrewbogott: transferred the tools.wmcloud.org. zone to the tools project
  • 22:16 andrewbogott: moved wmcloud.org dns domain to the cloud-infra project
  • 21:02 andrewbogott: adding .eqiad1.wikimedia.cloud records to all existing eqiad1 VMs, updating all eqiad1 internal pointer records to reference the new eqiad1.wikimedia.cloud fqdns.
  • 09:44 arturo: deleted DNS zone wmcloud.org and tried re-creating it

2020-02-14

  • 10:35 arturo: running `root@cloudcontrol2001-dev:~# designate server-create --name ns1.openstack.codfw1dev.wikimediacloud.org.` (T243766)
  • 10:32 arturo: running `root@cloudcontrol1004:~# designate server-create --name ns1.openstack.eqiad1.wikimediacloud.org.` (T243766)
  • 10:32 arturo: running `root@cloudcontrol1004:~# designate server-create --name ns0.openstack.eqiad1.wikimediacloud.org.` (T243766)

2020-02-12

  • 13:38 arturo: [codfw1dev] add reference to subnetpool to the instance subnet `MariaDB [neutron]> update subnets set subnetpool_id='d129650d-d4be-4fe1-b13e-6edb5565cb4a' where id = '7adfcebe-b3d0-4315-92fe-e8365cc80668';` (T244851)

2020-02-11

  • 13:46 arturo: [codfw1dev] creating some neutron objects to investigate T244851 (subnets, subnet pools, address scopes, ...)
  • 12:40 arturo: [codfw1dev] delete unknown address scope 'wmcs-v4-scope': `root@cloudcontrol2001-dev:~# openstack address scope delete 078cfd71-117b-4aac-9197-6ebbbb7dd3de` (T244851)
  • 12:40 arturo: [codfw1dev] delete unknown subnet pool 'cloudinstancesb-v4-pool0': `root@cloudcontrol2001-dev:~# openstack subnet pool delete d23a9b88-5c3d-4a53-ab88-053233a75365` (T244851)

2020-02-07

  • 18:11 jeh: shutdown cloudvirt1016 for hardware maintenance T241882

2020-02-06

  • 14:44 jeh: update apt packages on cloudvirt1015 T220853
  • 14:28 jeh: run hardware tests on cloudvirt1015 T220853

2020-01-28

  • 17:24 arturo: [codfw1dev] root@cloudcontrol2001-dev:~# designate server-create --name ns0.openstack.codfw1dev.wikimediacloud.org. (T243766)
  • 10:18 arturo: [codfw1dev] created DNS record `bastion-codfw1dev-01.codfw1dev.wmcloud.org A 185.15.57.2` (T242976, T229441)
  • 10:13 arturo: [codfw1dev] the zone `codfw1dev.wmcloud.org` now belongs to the `cloudinfra-codfw1dev` project (T242976)
  • 10:11 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack zone create --description "main DNS domain for public addresses" --email "root@wmflabs.org" --type PRIMARY --ttl 3600 codfw1dev.wmcloud.org.` (T242976 and T243766)
  • 09:53 arturo: restart apache2 in labweb1001/1002 because of horizon errors
  • 09:47 arturo: created DNS zone wmcloud.org in eqiad1 and transferred it to the cloudinfra project (T242976); right now its only use is to delegate the codfw1dev.wmcloud.org subdomain to designate in the other deployment

2020-01-27

  • 12:45 arturo: [codfw1dev] manually move the new domain to the `cloudinfra-codfw1dev` project clouddb2001-dev: `[designate]> update zones set tenant_id='cloudinfra-codfw1dev' where id = '4c75410017904858a5839de93c9e8b3d';` T243556
  • 12:44 arturo: [codfw1dev] `root@cloudcontrol2001-dev:~# openstack zone create --description "main DNS domain for VMs" --email "root@wmflabs.org" --type PRIMARY --ttl 3600 codfw1dev.wikimedia.cloud.` T243556

2020-01-24

  • 15:10 jeh: remove icinga downtime for cloudvirt1013 T241313
  • 12:52 arturo: repooling cloudvirt1013 after HW got fixed (T241313)

2020-01-21

  • 17:43 bstorm_: remounting /mnt/nfs/dumps-labstore1007.wikimedia.org/ on all dumps-mounting projects
  • 10:24 arturo: running `sudo systemctl restart apache2.service` in both labweb servers to try mitigating T240852

2020-01-15

  • 16:59 bd808: Changed the config for cloud-announce mailing list so that list admins do not get bounce unsubscribe notices

2020-01-14

  • 14:03 arturo: icinga downtime all cloudvirts for another 2h for fixing some icinga checks
  • 12:04 arturo: icinga downtime toolchecker for 2 hours for openstack upgrades T241347
  • 12:02 arturo: icinga downtime cloud* labs* hosts for 2 hours for openstack upgrades T241347
  • 04:26 andrewbogott: upgrading designate on cloudservices1003/1004

2020-01-13

  • 13:34 arturo: [codfw1dev] prevent neutron from allocating floating IPs from the wrong subnet by doing `neutron subnet-update --allocation-pool start=208.80.153.190,end=208.80.153.190 cloud-instances-transport1-b-codfw` (T242594)

2020-01-10

  • 13:27 arturo: cloudvirt1009: virsh undefine i-000069b6. This is tools-elastic-01 which is running on cloudvirt1008 (so, leaked on cloudvirt1009)

2020-01-09

  • 11:12 arturo: running `MariaDB [nova_eqiad1]> update quota_usages set in_use='0' where project_id='etytree';` (T242332)
  • 11:11 arturo: running `MariaDB [nova_eqiad1]> select * from quota_usages where project_id = 'etytree';` (T242332)
  • 10:32 arturo: ran `root@cloudcontrol1004:~# nova-manage project quota_usage_refresh --project etytree`

2020-01-08

  • 10:53 arturo: icinga downtime all cloudvirts for 30 minutes to re-create all canary VMs

2020-01-07

  • 11:12 arturo: icinga-downtime everything cloud* for 30 minutes to merge nova scheduler changes
  • 10:02 arturo: icinga downtime cloudvirt1009 for 30 minutes to re-create canary VM (T242078)

2020-01-06

  • 13:45 andrewbogott: restarting nova-api and nova-conductor on cloudcontrol1003 and 1004

2020-01-04

  • 16:34 arturo: icinga downtime cloudvirt1024 for 2 months because of hardware errors (T241884)

2019-12-31

  • 11:46 andrewbogott: I couldn't!
  • 11:40 andrewbogott: restarting cloudservices2002-dev to see if I can reproduce an issue I saw earlier

2019-12-24

  • 15:13 arturo: icinga downtime all the lab* fleet for nova password change for 1h
  • 14:39 arturo: icinga downtime all the cloud* fleet for nova password change for 1h

2019-12-23

  • 11:13 arturo: enable puppet in cloudcontrol1003/1004
  • 10:40 arturo: disable puppet in cloudcontrol1003/1004 while doing changes related to python-ldap

2019-12-22

  • 23:48 andrewbogott: restarting nova-conductor and nova-api on cloudcontrol1003 and 1004
  • 09:45 arturo: cloudvirt1013 is back (did it alone) T241313
  • 09:37 arturo: cloudvirt1013 is down for good. Apparently powered off. I can't even reach it via iLO

2019-12-20

  • 12:43 arturo: icinga downtime cloudmetrics1001 for 128 hours

2019-12-18

  • 12:55 arturo: [codfw1dev] created a new subnet neutron object to hold the new CIDR for floating IPs (cloud-codfw1dev-floating - 185.15.57.0/29) T239347

2019-12-17

  • 07:21 andrewbogott: deploying horizon/train to labweb1001/1002

2019-12-12

  • 06:11 arturo: schedule 4h downtime for labstores
  • 05:57 arturo: schedule 4h downtime for cloudvirts and other openstack components due to upgrade ops

2019-12-02

  • 06:28 andrewbogott: running nova-manage db sync on eqiad1
  • 06:27 andrewbogott: running nova-manage cell_v2 map_cell0 on eqiad1

2019-11-21

  • 16:07 jeh: created replica indexes and views for szywiki T237373
  • 15:48 jeh: creating replica indexes and views for shywiktionary T238115
  • 15:48 jeh: creating replica indexes and views for gcrwiki T238114
  • 15:46 jeh: creating replica indexes and views for minwiktionary T238522
  • 15:36 jeh: creating replica indexes and views for gewikimedia T236404

2019-11-18

  • 19:27 andrewbogott: repooling labsdb1011
  • 18:54 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1011 T238480
  • 18:44 andrewbogott: depooling labsdb1011 and killing remaining user queries T238480
  • 18:42 andrewbogott: repooled labsdb1009 and 1010 T238480
  • 18:19 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1010 T238480
  • 18:18 andrewbogott: depooling labsdb1010, killing remaining user queries
  • 17:46 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1009 T238480
  • 17:38 andrewbogott: depooling labsdb1009, killing remaining user queries
  • 16:54 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1012 T237509

2019-11-15

  • 20:04 andrewbogott: repool labsdb1011 (T237509)
  • 19:29 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1011
  • 19:25 andrewbogott: depooling labsdb1011, killing remaining queries
  • 19:25 andrewbogott: repooling labsdb1010
  • 18:59 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1012
  • 18:57 andrewbogott: running maintain-views --all-databases --replace-all --clean on labsdb1010
  • 18:54 andrewbogott: depooling labsdb1010, killing remaining user queries
  • 18:54 andrewbogott: depooled labsdb1009, ran maintain-views --clean --all-databases --replace-all, repooled
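Each run above follows the same cycle: depool one replica host, rebuild its views, repool it, so the pool never loses more than one backend at a time. A dry-run sketch of that ordering; `depool`/`repool` are placeholders, since the log does not show the actual depooling mechanism:

```python
def view_rebuild_plan(hosts):
    """One host at a time: depool, rebuild views, repool.

    `depool`/`repool` are placeholders for whatever mechanism was used;
    the maintain-views invocation matches the one logged above."""
    steps = []
    for host in hosts:
        steps.append(f"depool {host}")
        steps.append(
            f"{host}: maintain-views --all-databases --replace-all --clean"
        )
        steps.append(f"repool {host}")
    return steps

plan = view_rebuild_plan(["labsdb1009", "labsdb1010", "labsdb1011"])
```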

2019-11-11

  • 13:10 arturo: cloudweb2001-dev: disable puppet and redirect stderr in the loadExitNodes.php cron script to prevent cronspam while we investigate the cause of the issue (T237971)

2019-11-05

  • 11:59 arturo: icinga downtime for 1h cloudcontrol1004, cloudnet1003, cloudvirt1017/1020/1022 for PDU operations in the rack T227542

2019-11-04

  • 21:55 andrewbogott: deleting a ton of wikitech hiera pages that were either no-ops or refer to nonexistent VMs or prefixes

2019-10-31

  • 11:01 arturo: icinga-downtimed cloudvirt1030 and cloudservices1003 for 1h due to PDU upgrade operations T227543

2019-10-30

  • 22:43 jeh: reboot cloud-bootstrapvz-stretch to resolve bad bootstrapvz build

2019-10-29

  • 10:52 arturo: icinga downtime cloudvirt1001/1002/1024/1018/1012/1009/1015/1008 for 1h T227538

2019-10-25

  • 10:45 arturo: icinga downtime toolschecker for 1h to upgrade clouddb1002 mariadb (toolsdb secondary) (T236384, T236420)

2019-10-24

  • 12:30 arturo: starting cloudvirt1019, PDU operations ended (T227540)
  • 11:58 arturo: icinga downtime for 2h (T227540) cloudvirt1019
  • 11:15 arturo: poweroff cloudvirt1019 during the PDU operations (T227540)
  • 11:10 arturo: icinga downtime for 2h (T227540) toolschecker
  • 10:58 arturo: icinga downtime for 1h (T227540) cloudvirt100[3-7], cloudvirt1019, cloudvirt1016, cloudvirt1021, cloudvirt1013, cloudnet1004

2019-10-23

  • 09:23 arturo: cloudvirt1026 reboot ended OK
  • 09:12 arturo: rebooting cloudvirt1026 for kernel upgrade
  • 09:09 arturo: cloudvirt1025 reboot ended OK
  • 09:00 arturo: rebooting cloudvirt1025 for kernel upgrade
  • 08:51 arturo: icinga downtime cloudvirt1025/1026 for reboots

2019-10-18

  • 16:01 arturo: created the `eqiad1.wikimedia.cloud` DNS zone (T235846)
  • 14:27 andrewbogott: deleted a bunch of leaked VMS from earlier today from the admin-monitoring project. Fullstack leaks due to an api outage, maybe?
  • 10:44 arturo: doubled max_message_size from 40KB to 80KB in the cloud-admin mailing list. A simple email with a couple of quotes can go over the 40KB limit.

2019-10-16

  • 21:59 jeh: resync wiki replica tool and user accounts T235697
  • 09:40 arturo: reboot of cloudvirt1030 went fine
  • 09:28 arturo: reboot of cloudvirt1029 went fine
  • 09:28 arturo: rebooting cloudvirt1030 for kernel updates
  • 09:12 arturo: rebooting cloudvirt1029 for kernel updates
  • 09:11 arturo: reboot of cloudvirt1028 went fine
  • 09:00 arturo: rebooting cloudvirt1028 for kernel updates
  • 08:56 arturo: icinga downtime cloudvirt[1028-1030].eqiad.wmnet for 1h for reboots

2019-10-15

  • 13:30 jeh: creating indexes and views for banwiki T234770

2019-10-10

  • 18:55 bd808: Created indexes and views for nqowiki (T230543)
  • 11:59 arturo: network switch hardware is down affecting cloudvirt1025/1026 (T227536); VMs are supposed to be online but unreachable

2019-10-09

  • 10:44 arturo: cloudvirt1013 rebooted well
  • 10:32 arturo: cloudvirt1013 is rebooting
  • 10:32 arturo: cloudvirt1012 rebooted just fine (very slow, 35 VMs)
  • 10:21 arturo: cloudvirt1012 is rebooting
  • 10:19 arturo: cloudvirt1009 rebooted just fine (very slow though)
  • 10:07 arturo: cloudvirt1009 is rebooting
  • 10:06 arturo: cloudvirt1008 rebooted just fine (very slow though)
  • 09:58 arturo: cloudvirt1008 is rebooting
  • 09:52 arturo: icinga downtime toolschecker, paws, etc for 2h, because cloudvirt reboots

2019-10-07

  • 14:07 arturo: horizon is disabled for maintenance (T212302)
  • 14:00 arturo: starting scheduled maintenance: upgrading eqiad1 from openstack mitaka to newton

2019-10-02

  • 15:23 arturo: codfw1dev renaming net/subnet objects to a more modern naming scheme T233665
  • 12:49 arturo: codfw1dev delete all floating ip allocations in the deployment for mangling the network config for testing T233665
  • 12:47 arturo: codfw1dev deleting all VMs in the deployment for mangling the network config for testing T233665
  • 11:08 arturo: codfw1dev rebooting cloudnet2002-dev and cloudnet2003-dev for testing T233665
  • 10:31 arturo: codfw1dev: add cloudinstances2b-gw router to the l3 agent in cloudnet2003-dev
  • 09:59 arturo: codfw1dev: cleanup leftover "HA port tenant admin" in neutron (ports from missing servers)
  • 09:46 arturo: codfw1dev: cleanup leftover neutron agents

2019-09-30

  • 10:21 arturo: we installed ferm in every VM by mistake. Deleting it and forcing a puppet agent run to try to go back to a clean state.
  • 09:38 arturo: downtime toolschecker for 24h
  • 09:33 arturo: force update ferm cloud-wide (in all VMs) for T153468

2019-08-18

  • 10:39 arturo: rebooting cloudvirt1023 for new interface names configuration
  • 10:34 arturo: downtimed cloudvirt1023 for 2 days

2019-08-05

  • 17:17 bd808: Set downtime on gridengine and kubernetes webservice checks in icinga until 2019-09-02 (flaky tests)

2019-07-29

  • 20:14 bd808: Restarted maintain-kubeusers on tools-k8s-master-01 (T194859)

2019-07-25

  • 12:32 arturo: eqiad1/glance: debian-9.9-stretch image deprecates debian-9.8-stretch (T228983)
  • 09:59 arturo: (codfw1dev) drop missing glance images (T228972)
  • 09:32 arturo: (codfw1dev) deleting a bunch of VMs that were running in now missing hypervisors
  • 09:31 arturo: (codfw1dev) deleting a bunch of VMs in ERROR and SHUTDOWN state
  • 09:27 arturo: last log entry refers to the codfw1dev deployment
  • 09:27 arturo: cleanup `nova service-list` from old hypervisors (labtest*)
  • 09:23 arturo: refreshed nova DB grants in clouddb2001-dev for the codfw1dev deployment
  • 08:47 arturo: cleanup the cloud-announce pending emails (spam)

2019-07-23

  • 19:43 andrewbogott: restarting rabbitmq-server on cloudcontrol1003 and 1004

2019-07-22

  • 23:44 bd808: Restarted maintain-kubeusers on tools-k8s-master-01 (T228529)

2019-07-11

  • 22:07 bd808: Ran `sudo systemctl stop designate_floating_ip_ptr_records_updater.service` on cloudcontrol1003
  • 22:01 bd808: `sudo apt-get install python2.7-dbg` on cloudcontrol1003 to debug hung python process
  • 21:48 bd808: Ran `sudo systemctl stop designate_floating_ip_ptr_records_updater.service` on cloudcontrol1004

2019-06-25

  • 16:05 bstorm_: updated python3.4 to update4 wherever it was installed on Jessie VMs to prevent issues with broken update3.
  • 14:56 bstorm_: Updated python 3.4 on the labs-puppetmaster server

2019-06-03

  • 15:55 arturo: T221769 rebooting cloudservices1003 after bootstrapping is apparently completed

2019-05-28

  • 21:42 bstorm_: unmounting labstore1003-scratch on all cloud clients
  • 18:14 bstorm_: T209527 switched mounts from labstore1003 to cloudstore1008 for scratch

2019-05-20

  • 17:25 arturo: T223923 dropped compat-network config from /etc/network/interfaces in eqiad1/codfw1dev neutron nodes
  • 17:22 arturo: T223923 dropped br-compat bridges and vlan interfaces (1102 and 2102) in eqiad1/codfw1dev neutron nodes
  • 17:07 arturo: T223923 dropped compat-network configuration from the neutron database in eqiad1
  • 16:55 arturo: T223923 dropped compat-network configuration from the neutron database in codfw1dev

2019-05-15

  • 17:00 andrewbogott: touching /root/firstboot_done on all VMs that cumin can reach. This will prevent firstboot.sh from running a second time if/when any of these are rebooted. T223370
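firstboot.sh evidently uses `/root/firstboot_done` as a run-once marker: when the file exists, first-boot setup is skipped. A sketch of that guard pattern; only the marker-file idea and its path come from the entry, the rest is assumed:

```python
import tempfile
from pathlib import Path

def run_firstboot(marker: Path, setup) -> bool:
    """Run setup only when the marker file is absent, then create it."""
    if marker.exists():
        return False  # a previous boot already ran setup
    setup()
    marker.touch()
    return True

# Demo against a temporary marker; the real one lives at /root/firstboot_done.
with tempfile.TemporaryDirectory() as d:
    marker = Path(d) / "firstboot_done"
    first = run_firstboot(marker, lambda: None)   # setup runs, marker created
    second = run_firstboot(marker, lambda: None)  # skipped: marker exists
```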

2019-04-26

  • 15:51 arturo: andrew updated dns servers for the cloud-instances2-b-eqiad subnet in neutron: 208.80.154.143 and 208.80.154.24

2019-04-25

  • 11:14 arturo: T221760 increased size of conntrack table

2019-04-24

  • 12:54 arturo: T220051 puppet broken in every VM in Cloud VPS, fixing right now

2019-04-22

  • 11:14 arturo: create by hand /var/cache/labsaliaser/labs-ip-aliases.json in cloudservices2002-dev (T218575)

2019-04-16

  • 22:55 bd808: cloudcontrol2003-dev: added `exit 0` to /etc/cron.hourly/keystone to stop cron spam on partially configured cluster
  • 12:08 arturo: rebooting cloudvirt200[123]-dev because deep changes in config
  • 11:27 arturo: T219626 add DB grants for neutron and glance to clouddb2001-dev (codfw1dev)
  • 10:37 arturo: T219626 replace 208.80.153.75 with 208.80.153.59 in the clouddb2001-dev database (codfw1dev deployment)
  • 10:30 arturo: T219626 replace labtestcontrol2003 with cloudcontrol2001-dev in the clouddb2001-dev database (codfw1dev deployment)

2019-04-15

  • 13:08 arturo: T219626 add DB grants for keystone/nova/nova_api to clouddb2001-dev (codfw1dev)

2019-04-13

  • 18:25 bd808: Restarted nova-compute service on cloudvirt1015 (T220853)

2019-04-11

  • 12:00 arturo: T151704 deploying oidentd to cloudnet1xxx servers

2019-04-02

  • 19:52 andrewbogott: installed new base Stretch image; it has updated packages and runs apt-get dist-upgrade on first boot.

2019-03-29

  • 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
  • 00:00 bstorm_: T193264 Added osm.db.svc.eqiad.wmflabs to cloud DNS

2019-03-25

  • 00:40 bd808: Restarted maintain-dbusers on labstore1004. Process hung up on failed LDAP connection.

2019-03-21

  • 19:32 andrewbogott: restarting keystone on cloudcontrol1003

2019-03-15

  • 16:00 gtirloni: increased nscd cache size (T217280)

2019-03-14

  • 19:04 gtirloni: bstorm started nfsd on labstore1006 (T218341)
  • 16:42 gtirloni: published new debian-9.8 image (T218314)

2019-03-04

  • 19:37 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org across all VPS projects for T217473

2019-02-26

  • 12:46 gtirloni: shutdown toolsbeta-sgegrid-master (cronspam)

2019-02-25

  • 10:32 gtirloni: restarted nfsd on labstore1004

2019-02-21

  • 09:09 gtirloni: restarted uwsgi-labspuppetbackend.service on labpuppetmaster1001
  • 07:42 gtirloni: created project cloudstore
  • 07:36 gtirloni: deleted wmcs-nfs project

2019-02-20

  • 21:58 andrewbogott: silencing shinken and disabling puppet on shinken-02 for now

2019-02-19

  • 12:00 gtirloni: added nagios@icinga2001.wikimedia.org to cloud-admin-feed@ allowed senders

2019-02-18

  • 20:21 gtirloni: downtimed cloudvirt1020
  • 20:12 gtirloni: ran `labs-ip-alias-dump.py` on cloudservices/labservices servers

2019-02-15

  • 13:10 arturo: T216239 labvirt1019 has been drained
  • 12:22 arturo: T216239 draining labvirt1009 with a command like this: `root@cloudcontrol1004:~# wmcs-cold-migrate --region eqiad --nova-db nova 2c0cf363-c7c3-42ad-94bd-e586f2492321 labvirt1001`
  • 12:02 arturo: more nova service cleanups in the database (labvirts that were reallocated to eqiad1)
  • 11:34 arturo: T216190 cleanup from nova database `nova service-delete 35`
  • 03:50 andrewbogott: updated VPS base images for Jessie and Stretch, now featuring Stretch 9.7

2019-02-11

  • 18:13 gtirloni: cleaned old metrics data in labmon1001 T215417
  • 15:28 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1011
  • 14:18 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1010

2019-02-08

  • 14:56 gtirloni: running `maintain-views --all-databases --replace-all` on labsdb1009

2019-02-06

  • 11:47 gtirloni: downtimed labmon100{1,2} T215399
  • 00:17 bstorm_: T214106 deleted bstorm-test2 project to clean up

2019-02-05

  • 10:48 arturo: labmon1001 is now part of the 'eqiad1-r' region

2019-02-01

  • 09:54 arturo: moving canary1015-01 VM instance from cloudvirt1024 back to cloudvirt1015

2019-01-31

  • 12:44 arturo: T215012 depooling cloudvirt1015 and migrating all VMs to cloudvirt1024

2019-01-25

  • 20:11 gtirloni: deleted project yandex-proxy T212306
  • 20:11 gtirloni: deleted project T212306

2019-01-24

  • 11:50 arturo: T213925 modify subnet cloud-instances-transport1-b-eqiad1 to avoid floating IP allocations from here
  • 11:07 arturo: T214299 failover cloudnet1003 to cloudnet1004
  • 10:03 arturo: T214299 reimage cloudnet1004 to debian stretch
  • 09:51 arturo: T214299 failover cloudnet1004 to cloudnet1003

2019-01-22

  • 19:19 arturo: T214299 stretch cloudnet1003 is apparently all set
  • 18:40 arturo: T214299 manually deleted neutron agents from cloudnet1003 (must be added again after reimage, with new uuids)
  • 18:37 arturo: T214299 reimaging cloudnet1003 as debian stretch
  • 17:35 jbond42: starting roll out of apt package updates to
  • 14:41 gtirloni: T214369 deployed new jessie and stretch VM images

2019-01-21

  • 18:29 gtirloni: installed libguestfs-tools on cloudvirt1021

2019-01-16

  • 14:21 andrewbogott: stopping old VPS proxies in eqiad — T213540

2019-01-15

  • 14:20 andrewbogott: changing tools.wmflabs.org to point to tools-proxy-03 in eqiad1

2019-01-13

  • 20:00 andrewbogott: VPS proxies are now running in eqiad1 on proxy-01. Old VMs will wait a bit for deletion. T213540
  • 19:12 andrewbogott: moving the VPS proxy API backend to proxy-01.project-proxy.eqiad.wmflabs, as per T213540
  • 17:11 andrewbogott: moving all VPS dynamic proxies to proxy-eqiad1.wmflabs.org aka proxy-01.project-proxy.eqiad.wmflabs, as per T213540

2019-01-09

  • 22:21 bd808: neutron quota-update --tenant-id tools --port 256

2019-01-08

  • 18:59 bd808: Definitely did NOT delete uid=novaadmin,ou=people,dc=wikimedia,dc=org
  • 18:59 bd808: Deleted LDAP user uid=neutron,ou=people,dc=wikimedia,dc=org
  • 18:58 bd808: Deleted LDAP user uid=novaadmin,ou=people,dc=wikimedia,dc=org

2019-01-06

  • 22:03 bd808: Set floatingip quota of 60 for tools project in eqiad1-r region (T212360)

2018-12-20

  • 17:10 arturo: T207663 renumbered transport network in eqiad1

2018-12-05

  • 17:59 arturo: T207663 changed labtestn transport network addressing from private to public

2018-12-03

  • 13:25 arturo: T202886 create again PTR records after dnsleak.py fix

2018-11-30

  • 14:08 arturo: running dns leaks cleanup `root@cloudcontrol1003:~# /root/novastats/dnsleaks.py --delete`

2018-11-28

  • 17:33 gtirloni: deleted contintcloud project (T209644)

2018-11-27

  • 13:32 gtirloni: enabled DRBD stats collection on labstore100[4-5] T208446

2018-11-22

  • 07:12 gtirloni: deployed new debian-9.6-stretch image

2018-11-21

  • 10:48 arturo: re-created compat-net as not shared in labtestn to test stuff related to T209954

2018-11-16

  • 12:43 gtirloni: armed keyholder on labpuppetmaster1001/1002 after reboots
  • 12:08 gtirloni: rebooted labpuppetmaster1001 (T207377)
  • 11:57 gtirloni: rebooted labpuppetmaster1002 (T207377)

2018-11-14

  • 17:19 gtirloni: added cloudvirt1016 to scheduler pool (T209426)
  • 15:41 gtirloni: reimaging labvirt1016 as cloudvirt1016
  • 15:14 gtirloni: reset-failed systemd unit nova-scheduler on cloudcontrol1004
  • 13:52 gtirloni: rebooted labservices1002 after package upgrades (T207377)
  • 13:23 gtirloni: rebooted labstore2004 after package upgrades (T207377)
  • 13:20 gtirloni: rebooted labstore2003 after package upgrades (T207377)
  • 13:20 gtirloni: rebooted labstore2001/labstore2003 after package upgrades (T207377)
  • 12:08 gtirloni: rebooted labnet1002 after package upgrades
  • 12:01 gtirloni: rebooted labmon1002 after package upgrades
  • 11:41 gtirloni: rebooted labcontrol1002 after package upgrades
  • 11:15 gtirloni: rebooted cloudcontrol1004 after package upgrades

2018-11-09

  • 18:17 gtirloni: restarted neutron-linuxbridge-agent on cloudvirt1018/1023

2018-11-08

  • 11:00 gtirloni: Added novaproxy-02 to $CACHES
  • 10:50 gtirloni: Added cloudvirt1017 to eqiad1 region

2018-11-07

  • 13:49 arturo: T208733 moving labvirt1017 from main deployment to eqiad1 and renaming it to cloudvirt1017

2018-10-22

  • 16:24 arturo: T206261 another update to dmz_cidr in eqiad1
  • 10:26 arturo: changed dmz_cidr in eqiad1 again: VMs will connect to each other without NAT even when using floating IPs (T206261)

2018-10-19

  • 12:02 arturo: revert change in dmz_cidr in eqiad1 for now (T206261)
  • 11:16 arturo: change in dmz_cidr in eqiad1: VMs will connect to each other without NAT even when using floating IPs (T206261)
  • 10:14 arturo: new virt servers have been added to the eqiad1 deployment over the past two weeks: cloudvirt1018, cloudvirt1023, cloudvirt1024

2018-09-26

  • 10:40 arturo: T205524 all sorts of restarts in all neutron daemons
  • 10:20 arturo: T205524 stop/start all neutron agents in cloudnet1003.eqiad.wmnet
  • 10:13 arturo: T205524 restart all agents in cloudnet1004.eqiad.wmnet
  • 10:10 arturo: restart neutron-server in cloudcontrol1003, investigating T205524

2018-09-24

  • 10:57 arturo: trying to increase the floating IP allocation pool in eqiad1: of 185.15.56.0/25 we are using only 185.15.56.10-185.15.56.31 (unclear why); expanding to 185.15.56.2-185.15.56.126
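
As a sanity check on the range chosen above, Python's `ipaddress` module shows which host addresses the 185.15.56.0/25 block actually contains (an illustrative sketch, not part of the original log):

```python
import ipaddress

# Enumerate the usable host addresses in the /25 from the log entry.
net = ipaddress.ip_network("185.15.56.0/25")
hosts = list(net.hosts())  # excludes the network (.0) and broadcast (.127) addresses
print(hosts[0], hosts[-1], len(hosts))
```

The usable hosts run 185.15.56.1-185.15.56.126; the log starts the pool at .2, presumably leaving .1 for the gateway.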

2018-09-21

  • 17:18 bd808: Running `sudo maintain-meta_p --all-databases --purge` across labsdb10(09|10|11) for T201890

2018-09-17

  • 22:08 bd808: Granted gtirloni project roles of admin, projectadmin, and user

2018-09-12

  • 11:20 arturo: T202636 distributing default routes using classless-static-route for all VMs in main/labtest (dnsmasq/nova-network)
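
The classless-static-route mentioned above is DHCP option 121. In dnsmasq configuration it looks roughly like the fragment below; the destination and router addresses here are placeholders for illustration, not values taken from the log:

```
# dnsmasq: push routes via DHCP option 121 (classless-static-route).
# Each entry is a destination/prefix followed by the next-hop router.
# 10.68.16.1 is a placeholder gateway address.
dhcp-option=121,0.0.0.0/0,10.68.16.1
```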

2018-09-11

  • 16:52 arturo: again, restarted nova-network after killing all dnsmasq procs in labnet1001 for T202636
  • 16:08 arturo: restarted nova-network after killing all dnsmasq procs in labnet1001 for T202636
  • 10:53 arturo: T202636 creating all the compat-network configuration in neutron
  • 10:36 arturo: T202636 creating br-compat bridge in eqiad1 for the compat network
  • 10:33 arturo: T202636 manually reserve 10.68.23.253 (in nova-network)

2018-09-10

  • 22:46 andrewbogott: deleting all VMs on labvirt1019 and 1020 as prep for T204003

2018-08-30

  • 15:46 andrewbogott: restarting rabbitmq-server on cloudcontrol1003
  • 13:07 arturo: T202636 internal network routing now exists in labtest/labtestn so VMs can communicate with each other

2018-08-28

  • 11:04 arturo: T202549 eqiad1 databases are all now running on m5-master. MySQL has been cleaned off cloudcontrol100[3,4]

2018-08-23

  • 16:17 arturo: T188589 bstorm_ merged patch to reduce nova DB connection usage
  • 13:15 arturo: T202115 `root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.4,end=10.64.22.4 e4fb2771-a361-4add-ac4e-280cc300c59f`
  • 13:10 arturo: T202115 (was `{"start": "10.64.22.2", "end": "10.64.22.254"}` )
  • 13:08 arturo: T202115 `root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.254,end=10.64.22.254 e4fb2771-a361-4add-ac4e-280cc300c59f`
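
The two subnet updates above shrink the subnet's allocation pool to a single address (first .254, then .4) so neutron stops handing out IPs from the rest of the range. A small sketch that rebuilds those CLI invocations for review, using the subnet UUID and addresses from the log entries:

```python
# Subnet UUID as recorded in the T202115 log entries above.
SUBNET = "e4fb2771-a361-4add-ac4e-280cc300c59f"

def pin_pool(addr: str) -> str:
    """Return the neutron command that shrinks the subnet's
    allocation pool to the single address `addr`."""
    return (f"neutron subnet-update "
            f"--allocation-pool start={addr},end={addr} {SUBNET}")

print(pin_pool("10.64.22.254"))  # first pin; later moved to .4
print(pin_pool("10.64.22.4"))
```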

2018-08-22

  • 15:28 arturo: cleaned up local glance and keystone databases on cloudcontrol1003.wikimedia.org (already migrated to m5-master)
  • 15:27 arturo: cleaned up local keystone database on cloudcontrol1003.wikimedia.org (already migrated to m5-master)

2018-08-21

  • 15:39 andrewbogott: initial test message
  • 10:31 arturo: eqiad1: removed leftover HA port on labnet1004
  • 10:15 arturo: test

2018-05-07

  • 18:07 bstorm_: stopped the toolhistory job because it is totally broken and fills /tmp.

2018-02-09

  • 00:55 bd808: Added Arturo Borrero Gonzalez and Bstorm as project members
  • 00:54 bd808: Removed Yuvipanda at user request (T186289)