Nova Resource:Deployment-prep/SAL

2024-11-11

15:19 dhinus: revoke expired cert for deployment-poolcounter06.deployment-prep.eqiad.wmflabs
15:13 fnegri@cloudcumin1001: END (FAIL) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=99) on deployment-poolcounter06.deployment-prep.eqiad1.wikimedia.cloud
15:13 fnegri@cloudcumin1001: START - Cookbook wmcs.vps.refresh_puppet_certs on deployment-poolcounter06.deployment-prep.eqiad1.wikimedia.cloud

2024-08-13

23:59 bd808: Added BetaDevOpsBot as a service account with admin rights for OpenTofu automation tasks

2024-08-12

21:15 bd808: Added BryanDavis (self) as a project admin

2024-07-24

16:02 Southparkfan: moved sessionstorage/kask from sessionstorage04 to sessionstorage06 T370461

2024-07-23

16:55 Southparkfan: cancel kask maintenance, not going to perform switchover yet, see https://phabricator.wikimedia.org/T370461
16:11 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
16:10 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
16:09 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
16:07 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
16:05 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=99)
16:05 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:57 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:55 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:54 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:52 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:52 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:50 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:50 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:48 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:48 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:46 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:45 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:43 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:43 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:41 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:41 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:39 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:38 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:36 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:35 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:34 Southparkfan: starting kask maintenance - T370461
15:33 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:33 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:31 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:31 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:29 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:28 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:26 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:26 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:24 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:24 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:22 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:14 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:12 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:11 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:09 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance

2024-07-22

17:07 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
17:05 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
17:02 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
17:00 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:30 Southparkfan: remove deployment-push-notifications01 - T370459
15:11 Southparkfan: remove deployment-parsoid12 - T361386
15:10 Southparkfan: remove deployment-mwmaint02 T370582
15:02 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
15:00 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
15:00 Southparkfan: delete deployment-urldownloader03 T370466
14:57 Southparkfan: delete deployment-mediawiki11 and deployment-mediawiki12 (incl PuppetDB data + volumes) T361387
14:43 Southparkfan: fix /srv/git/operations/puppet yet again (T364492) via chown -R gitpuppet:gitpuppet on .git/, then use 'pgit' (gitpuppet wrapper) to reset to oot branch
14:10 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
14:08 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance
14:06 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0)
14:04 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance

2024-07-20

15:52 Southparkfan: add deployment-sessionstore05 (bookworm) - T370461
15:15 Southparkfan: add deployment-urldownloader04 (bookworm) - T370466
14:46 Southparkfan: deleted deployment-shellbox (buster) T370462
14:34 Southparkfan: add deployment-shellbox01 T370462
14:19 Southparkfan: err, adding deployment-mwmaint03, I meant - T370582
14:19 Southparkfan: add deployment-maint03, replacing the Buster instance T370582

2024-07-19

16:56 Southparkfan: switching trafficserver backends from mediawiki11 and 12 to 13 and 14 - T361387
14:47 Southparkfan: delete deployment-jobrunner04 (buster), replaced by 05 (bullseye) T370487
14:38 Southparkfan: remove floating IP for deployment-ircd03 T369919
11:03 Southparkfan: switched over cpjobqueue (running on deployment-changeprop-1) to deployment-jobrunner05 T370487
09:28 Southparkfan: create deployment-jobrunner05 with Bullseye image, T370487

2024-07-18

22:49 Southparkfan: deleted deployment-irc02 (buster), released its floating IP, deactivated & cleaned on puppetserver-1, removed irc-next.beta.wmcloud.org A RR - T369919
22:05 Southparkfan: add deployment-ircd03 (bullseye) with floating IP and irc-next.beta.wmcloud.org - T369919

2024-07-06

15:15 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance deployment-memc10
15:14 andrew@cloudcumin1001: START - Cookbook wmcs.vps.remove_instance for instance deployment-memc10
15:14 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance deployment-memc09
15:14 andrew@cloudcumin1001: START - Cookbook wmcs.vps.remove_instance for instance deployment-memc09
15:13 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance deployment-memc08
15:13 andrew@cloudcumin1001: START - Cookbook wmcs.vps.remove_instance for instance deployment-memc08

2024-06-25

13:27 taavi@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.migrate_project_to_ovs (exit_code=1)
13:14 taavi@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_project_to_ovs

2024-06-24

23:02 bd808: Removed matanya's "reader" right per T368330

2024-06-18

10:17 taavi@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.migrate_project_to_ovs (exit_code=1)
09:38 taavi: set deployment-db11 as writable after reboot
09:04 taavi@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_project_to_ovs

2024-06-07

11:14 pmiazga: proceeding with soft restart deployment-puppetserver-1
10:31 pmiazga: deployment-puppetserver-1 - in /srv/git/operations/puppet cherry-picked I477c4b to test wikivoyage.beta.wmcloud.org domain handling - T355281

2024-06-06

18:31 pmiazga: T355281 testing mediawiki-config patch Idcd9cd, executed `scap sync-world`
17:56 pmiazga: Executed "mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki pl wikivoyage plwikivoyage pl.wikivoyage.beta.wmcloud.org" to add a new wiki - polish wikivoyage on beta.wmcloud.org domain
17:49 pmiazga: T355281 updated DNS zones and hiera configs - added *.m.wikipedia.beta.wmcloud.org, *.wikivoyage.beta.wmcloud.org and *.m.wikivoyage.beta.wmcloud.org domains

2024-06-04

11:03 pmiazga: added beta.wmcloud.org and *.wikipedia.beta.wmcloud.org definitions to SNI section in deployment-acme-chief and lets-encrypt section in deployment-cache in hiera config on horizon.
10:34 pmiazga: Live debugging of puppet. Pulled Ifd37f0 to puppetserver-1. Additionally fixed ownership of /srv/git/operations/puppet to `gitpuppet:gitpuppet` to solve problems with git pull.

2024-06-02

away: T366415 removed upload.beta.wmflabs.org from hiera

2024-03-28

22:27 tgr: added toyofuku to deployment-prep

2024-03-14

21:31 andrewbogott: shutting down deployment-puppetdb03, deployment-puppetdb04, deployment-puppetmaster04. These have been replaced with new puppet infra and can be deleted in a couple of weeks if all is well.

2024-03-01

01:03 bd808: Added RLazarus as a project member

2023-11-30

17:09 tgr: added Dreamy_Jazz to members

2023-11-16

23:44 TheresNoTime: Add `DJackson` access, T351433

2023-10-27

21:26 tgr: set up deployment-rdb01 for redis (T340908)

2023-10-23

09:49 godog: turn off deployment-prometheus05 - T344974

2023-09-29

13:46 wm-bot2: dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
13:38 wm-bot2: dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
13:33 wm-bot2: dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
13:30 wm-bot2: dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console
12:33 wm-bot2: dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
12:31 wm-bot2: dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console

2023-08-21

15:30 godog: shut prometheus05 - T344582

2023-07-28

18:05 andrewbogott: removing ::lvs::realserver::realserver_ips: from hiera for deployment-restbase-bullseye.deployment-prep.eqiad1.wikimedia.cloud because it's preventing puppet from compiling
17:57 andrewbogott: deleting deployment-thcipriani.deployment-prep.eqiad1.wikimedia.cloud, no longer used according to Tyler
17:38 andrewbogott: deleting deployment-mdb02.deployment-prep.eqiad1.wikimedia.cloud, seems abandoned by a departed staff member
17:18 andrewbogott: but not fixing deployment-mdb02.deployment-prep.eqiad1.wikimedia.cloud because it seems like mariadb was never set up there in the first place
17:16 andrewbogott: fixed miscellaneous puppet issues on four or five hosts

2023-07-05

10:59 fabfur: shutting down and removing unused deployment-cache-text07 and deployment-cache-upload07

2023-06-28

09:54 fabfur: removed (text|upload) instance references from wgCdnServersNoPurge (T327742)

2023-06-27

11:36 fabfur: removing old (text|upload) instance references from hieradata (horizon)

2023-06-26

12:41 fabfur: switch floating IP from deployment-cache-text07 to deployment-cache-text08 (bullseye upgrade: T327742) (fix sec group)
12:25 fabfur: reverted floating IP switch
12:14 fabfur: switch floating IP from deployment-cache-text07 to deployment-cache-text08 (bullseye upgrade: T327742)

2023-06-22

11:02 fabfur: switch floating IP from deployment-cache-upload07 to deployment-cache-upload08 (bullseye upgrade: T327742)

2023-06-16

13:50 vgutierrez: replaced buster acme-chief[03,04] with bullseye acme-chief[05,06]

2023-04-23

16:55 Krinkle: Fix profile::tlsproxy::envoy::global_cert_name in Horizon for webperf host to use '%{facts.fqdn}' instead of performance.discovery.wmnet as the latter doesn't resolve / would be an invalid cert for https://deployment-webperf21, ref T291015

2023-02-24

15:27 andrewbogott: deleting long-shutoff stretch instance deployment-imagescaler03.deployment-prep.eqiad1.wikimedia.cloud -- T289883

2023-01-19

19:02 andrewbogott: deleting long-shutdown stretch instances: deployment-echostore01, deployment-ms-fe03, deployment-prometheus02

2023-01-18

15:31 andrewbogott: shutting down deployment-imagescaler03 as it is long-overdue for replacement. See T294148 for details.
10:10 arturo: bump trove quotas (T326674)

2023-01-09

16:30 wm-bot2: Increased quotas by 6 cores (T326568) - cookbook ran by arturo@nostromo

2022-11-23

22:42 urandom: accidentally deleted deployment-sessionstore04

2022-11-09

14:56 andrewbogott: fixed puppet breakage on several instances

2022-10-31

15:56 andrewbogott: shutting down deployment-echostore01, deployment-ms-be0[56], deployment-mdb01, deployment-prometheus02, deployment-wikifeeds01 as per https://phabricator.wikimedia.org/T306068

2022-10-17

09:41 wm-bot2: Increased quotas by 4 cores (T320932) - cookbook ran by arturo@nostromo

2022-09-21

09:46 andrewbogott: removed some stray whitespace in /var/lib/git/operations/puppet that was preventing rebase on deployment-puppetmaster04.deployment-prep.eqiad.wmflabs

2022-08-29

21:25 inflatador: ES6->7 upgrade in beta-cluster T315604

2022-06-24

20:52 taavi: added `denisse` as a member

2022-06-20

16:30 urbanecm: add sgimeno as a project member (Growth engineer with need for access)

2022-05-25

18:20 TheresNoTime: samtar@deployment-mwmaint02:~$ mwscript resetUserEmail.php --wiki=wikidatawiki Mahir256 [snip] T309230

2022-05-23

19:21 inflatador: Deleted deployment-elastic0[5-7] in favor of newer bullseye hosts T299797

2022-05-16

19:31 inflatador: bking@deployment-elastic07 halted deployment-elastic07 in beta ES cluster; will decom on Friday T299797
19:03 inflatador: bking@deployment-elastic06 halted deployment-elastic06 in beta ES cluster; will decom on Friday T299797

2022-05-14

20:25 urbanecm: add TheresNoTime (samtar) as a project member per request

2022-05-13

18:58 inflatador: bking@deployment-elastic05 halted deployment-elastic05 in beta ES cluster; will decom in 1 wk T299797

2022-05-12

22:09 inflatador: bking@deployment-elastic05 banned deployment-elastic05 from beta ES cluster in preparation for decom T299797

2021-11-08

09:32 majavah: Remove rvogel from project members per IRC request

2021-10-05

12:03 majavah: root@deployment-cache-text06:~# systemctl restart sssd # T286502

2021-07-28

17:53 andrewbogott: rebooting deployment-logstash03 as it's in an inconsistent config state

2021-05-10

14:04 CFisch_WMDE: Improve comment around ReferencePreviews beta cluster default (T271206)
14:04 CFisch_WMDE: Forward renamed config name for improved template search features (T277028)

2021-05-05

14:17 CFisch_WMDE: Disable ReferencePreviews beta mode on beta labs (T271206)

2021-05-03

13:55 CFisch_WMDE: enable new search features for the template dialog (T271802)

2021-04-20

07:19 CFisch_WMDE: enable changes to the descriptions in the VE transclusion dialog (T273425)
07:17 CFisch_WMDE: enable suggested values paramter in TemplateData and VisualEditor (T271825)

2021-04-13

17:00 halfak: failed deploy to ORES (connection to host failed)
16:57 halfak: deploying ores f08a3cb
07:46 awight: enable syntax highlighting line numbering on all namespaces (T267911)

2021-03-22

11:36 dcaro: Created subzone svc.deployment-prep.eqiad1.wikimedia.cloud. (T276624)
11:34 dcaro: Created subzone beta.wmcloud.org (T276624)

2021-03-10

10:16 arturo: briefly stopping deployment-puppetdb03 to disable VMX CPU flag

2021-03-09

13:32 arturo: hard-reboot deployment-db05 because issues related to T276922
12:34 arturo: briefly rebooting VM deployment-db05, we need to reboot its hypervisor cloudvirt1038 and failed to migrate to other

2021-03-01

14:41 andrewbogott: changed profile::redis::multidc::discovery from 'false' to "" to comply with strict typing in the deployment-memc puppet prefix.

2020-12-23

19:03 balloons: resized deployment-puppetdb03 to g2.cores2.ram4.disk40 (T270420)

2020-12-16

22:00 mutante: adjusted 'puppet prefix' deployment-jobrunner to use "role::beta::mediawiki::jobrunner" instead of "role::mediawiki::jobrunner" - goes together with gerrit:649707 - no instance currently exists called 'deployment-jobrunner'

2020-11-11

09:50 awight: metawiki: Promoting User:Jan Dittrich (WMDE) into centralnoticeadmin...

2020-11-09

22:18 awight: metawiki: Promoting User:Jan Dittrich into centralnoticeadmin...

2020-10-29

17:24 andrewbogott: signing pending puppet certs for deployment-mediawiki-07.deployment-prep.eqiad1.wikimedia.cloud and deployment-mediawiki-09.deployment-prep.eqiad1.wikimedia.cloud
17:23 andrewbogott: signing pending puppet certs for deployment-kafka* nodes
16:17 andrewbogott: removing jkroll as a project member; the registered email is invalid so probably this user is no longer involved

2020-10-07

10:43 godog: move swift settings out of horizon and into puppet's hieradata

2020-08-04

22:00 mdholloway: deleted deployment-chromium01

2020-07-20

19:35 halfak: deploying ores f3c44be

2020-07-16

17:15 andrewbogott: added "profile::java::egd_source: /dev/random" to project-wide hiera since lack of that setting seems to be breaking puppet in a lot of places

2020-07-14

15:28 bd808: Silenced prometheus alerts for 7d

2020-06-24

18:56 halfak: deploying ORES 1b87365

2020-06-23

17:15 bstorm_: to fix puppet on several hosts, setting profile::java::hardened_tls: false in project puppet on horizon
17:09 bstorm_: restarted postgresql on deployment-puppetdb03
17:05 bstorm_: restarted puppetdb.service on deployment-puppetdb03

2020-06-12

08:15 awight: Granted dewiki-beta sysop to Adamw via commandline
07:57 awight: Update QuickSurveys config

2020-06-08

18:38 hauskatze: Rebooting deployment-logstash0[2,3]

2020-05-07

12:34 mutante: removing role::labs::lvm::srv from deployment servers since this is now included in role:deployment_server and should neve have been a role in the first place
12:07 mutante: - puppet still broken on deployment_servers due to unrelated pre-existing issues, also no alerts about it in shinken
12:04 mutante: - puppet broken on deployment_servers - fix deployed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/594932

2020-04-20

19:48 halfak: Deploying ORES 514f94a
19:16 halfak: Deploying ORES ac2eb2f
18:52 halfak: Deploying ORES a5a5bce
15:41 halfak: Deploying ORES 5d977f4

2020-03-30

19:18 andrewbogott: restarting puppetdb on deployment-puppetdb03 and restarted apache2 on deployment-puppetmaster04 but puppet runs still fail everywhere

2020-02-14

21:53 andrewbogott: moving deployment-puppetdb02 and deployment-puppetmaster03 off of cloudvirt1014 (which will be drained next week anyway)

2020-02-07

16:57 halfak: deploying ores a6f4f14

2020-01-24

21:15 halfak: deploying ores 283f627

2020-01-23

21:50 halfak: deploying ores 039251f (reverting to last good state)
15:31 halfak: deploying ores 283f627a

2020-01-08

16:26 halfak: deploiying ores 039251f

2020-01-02

10:40 dcausse: created missing elastic indices: T241487

2019-12-27

12:24 hauskatze: Rebooting deployment-cpjobqueue
11:13 andrewbogott: migrating deployment-aqs03 to cloudvirt1009 in response to T241313

2019-12-18

19:13 halfak: deploying ores 80b1e62

2019-12-03

17:33 kevinbazira: deploying ores 6dd1fef

2019-11-11

18:28 MaxSem: Nuked HHVM and php7 tags on all beta wikis T75181

2019-11-08

17:46 MaxSem: Upgraded php7.2 on deployment-mwmaint01, was too old for MW

2019-10-31

20:29 tgr: importing a bunch of pages from production cswiki via importDump.php for T236823 (for reals now)
01:14 tgr: importing a bunch of pages from production cswiki via importDump.php for T236823

2019-09-25

23:01 andrewbogott: moving deployment-mwmaint01 and deployment-ircd to cloudvirt1021
15:15 andrewbogott: moving deployment-snapshot01 to cloudvirt1021
15:02 andrewbogott: moving deployment-dumps-puppetmaster02 to cloudvirt1021

2019-09-12

16:36 halfak: deploying ores 7d45b80

2019-08-25

16:33 Urbanecm: Revert 45831e6 locally on beta cluster to test possible root cause of T231162

2019-08-23

07:12 rxy: Applied SQL queries per phab:T231058#5433197

2019-08-06

17:09 accraze: deploying ores d08fa62

2019-08-05

22:24 accraze: deploying ores 4270244

2019-07-31

14:21 andrewbogott: moving deployment-sca02 to cloudvirt1030
12:59 andrewbogott: moving deployment-elastic05, deployment-kafka-main-2, deployment-mx02, deployment-webperf11 to new cloudvirts

2019-07-24

10:55 hauskatze: Dry-running extensions/AbuseFilter/maintenance/fixOldLogEntries.php refs. T228655

2019-07-11

20:30 mutante: add project member cdanis

2019-07-03

21:01 accraze: deploying ores 676f7ba

2019-06-25

15:51 awight: restart php7.2-fpm for wikidiff2 upgrade (T223391)

2019-06-05

19:34 andrewbogott: moving deployment-imagescaler03 to cloudvirt1029

2019-05-01

17:53 halfak: deploying ores:52e9759

2019-04-29

14:04 godog: add dsharpe user

2019-04-17

17:17 andrewbogott: cherry-picking https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504580/ to move off of soon-to-be-shutdown dns recursors

2019-03-25

14:46 mateusbs17: sunset deployment-maps03

2019-03-14

19:14 ebernhardson: restart logstash on deployment-logstash2 to re-read and re-create apifeatureusage template

2019-03-13

18:09 ebernhardson: restart elasticsearch on deployment-elastic* to deploy apifeature usage fix (T183156)

2019-03-06

15:58 andrewbogott: deleting deployment-prometheus01 on Filippo's advice

2019-02-26

22:52 ebernhardson: delete logstash logs in /var/log/logstash generated prior to 2019
22:51 ebernhardson: restart logstash on deployment-logstash2 while hacking around to see why apifeatureusage doesn't work

2019-02-18

11:45 arturo: manually start deployment-db03 per Krenair request

2019-02-11

10:47 godog: shut deployment-prometheus01, unused now

2019-02-06

22:34 shdubsh: Deploy node-exporter 0.17 T213708
14:12 godog: shut off deployment-prometheus01 - T215272
14:00 godog: switch beta-prometheus to deployment-prometheus02 - T215272

2019-02-05

20:07 ebernhardson: jobrunner port 9006 is firewalled, revert to 9005 and created T215339 to fix job queue in beta cluste
19:36 ebernhardson: Update profile::cpjobqueue::{jobrunner,videoscaler}_host in horizon hiera from port 9005 to 9006 to match new restrictions in gerrit.wikimedia.org/r/481866

2019-02-04

21:48 ebernhardson: restart logstash on deployment-logstash2

2019-01-31

12:05 arturo: VM instances deployment-deploy01,deployment-deploy02,deployment-fluorine02,deployment-kafka-jumbo-2,deployment-kafka-main-1,deployment-maps04,deployment-mcs01,deployment-mediawiki-09,deployment-memc04,deployment-ms-be03,deployment-ms-fe02,deployment-parsoid09,deployment-sca04,deployment-webperf12, were stopped briefly due to issue in hypervisor (T215012)

2019-01-08

19:53 mutante: adjusting puppet config on deployment-mwmaint01. remove "mediawiki_maintenance" role from "other classes" section and apply "mediawiki::maintenance" instead after role rename in gerrit:479131 for consistency with other mediawiki:: roles

2019-01-07

19:32 awight: T212530: ORES revscoring 2.3.0

2018-11-23

13:55 dcausse: restarted elasticsearch on all deployement-elastic0X nodes (search broken on the beta cluster)

2018-11-20

23:45 mutante: deployment-deploy01 edited /srv/deployment/iegreview/iegreview/.git/DEPLOY_HEAD - - replaced deployment-tin with deployment-deploy1 to fix scap cloning / puppet
23:01 mutante: deployment-deploy01 edited /srv/deployment/scholarships/scholarships/.git - replaced deployment-tin with deployment-deploy1 to fix scap / cloning of scholarships app
13:55 andrewbogott: deleting deployment-redis05 and deployment-redis06 as per Giuseppe, "we're not using the old jobqueue, we should remove those vms"

2018-11-14

18:48 andrewbogott: moving deployment-mediawiki-07 to labvirt1008
18:31 andrewbogott: moving deployment-chromium01 to labvirt1009
18:06 andrewbogott: moving deployment-mx02 to labvirt1003
18:05 andrewbogott: migrating deployment-snapshot01 to labvirt1001

2018-11-13

22:19 andrewbogott: moving deployment-urldownloader02 to labvirt1012
21:59 andrewbogott: moving deployment-deploy02 to another labvirt
21:55 andrewbogott: moving deployment-webperf12 to a new labvirt
21:50 andrewbogott: moving deployment-dumps-puppetmaster02 to a new labvirt
21:43 andrewbogott: moving deployment-elastic05 to a new labvirt to clear out labvirt1016
13:01 arturo: a puppet refactor for the aptly module may have caused some puppet issues. Should be solved now

2018-11-02

14:04 Krenair: made onimisionipe a projectadmin per request in -cloud

2018-11-01

14:48 andrewbogott: moving deployment-redis05 to labvirt1012
14:47 Krenair: shut off deployment-redis05 for migration to new physical host

2018-10-31

21:16 Krenair: remove horizon hiera config for deployment-redis0[56] to unbreak puppet and remove old redis0[12] instance IPs T208040
19:55 andrewbogott: moving deployment-elastic06 to labvirt1012
19:40 andrewbogott: moving deployment-cpjobqueue to labvirt1012 to help clear out labvirt1017
19:11 andrewbogott: moving deployment-kafka-jumbo-1 to labvirt1012 to help clear out labvirt1017
18:54 andrewbogott: moving deployment-kafka-main-2 to labvirt1012 to help clear out labvirt1017
13:23 godog: enable statsd reporting for swift

2018-10-22

00:27 Krenair: Added gtirloni as a member per T207474 - I imagine he'll want to get in to look at shinken-related things

2018-10-02

08:56 godog: bounce logstash

2018-09-23

01:05 andrewbogott: rebooted deployment-maps03; OOM and also T205195

2018-08-09

01:42 awight: T201518: ORES, fawiki wp10, misc updates

2018-07-16

21:10 awight: ran namespaceDupes.php on beta enwiki

2018-06-30

20:40 Krenair: ran git gc on deployment-tin:/srv/mediawiki to free up space

2018-06-12

17:40 halfak: deploying ores 36037b6

2018-06-11

21:47 halfak: deploying ores 6ee8775

2018-06-04

15:44 awight: ORES: Fix T194322

2018-05-19

10:56 Krenair: amended uncommitted changes into HEAD commit (notified author) so I can unbreak puppet updates, also removed my old POC secure redirect puppet patch

2018-05-09

22:53 awight: ORES: wheels fixups
21:31 awight: Bump ORES wheels
21:04 awight: ORES: drafttopic in beta

2018-04-25

18:55 awight: ORES: Revscoring 2.2.2

2018-04-20

00:46 awight: roll back ORES beta to master
00:08 awight: Push ORES git-lfs to look at stuff

2018-04-16

18:58 awight: Update ORES editquality; T185903

2018-04-13

00:53 awight: ORES: Test large file in LFS

2018-04-12

23:01 awight: Try gerrit-based submodules for ORES, T180627

2018-04-11

21:04 awight: ORES experiment with git-lfs, T180627

2018-04-09

19:50 awight: Redundant virtualenv for ORES
18:06 awight: Restore to ORES master branch
17:17 awight: Test git-lfs in ORES

2018-04-04

21:24 awight: Roll back beta ORES
20:12 awight: Try dsh scap config for ORES

2018-03-21

23:25 RoanKattouw: Created maps security group for port 6533; removed port 6533 from sca security group
23:22 bd808: Raised security group quota from 20 to 40

2018-03-19

16:57 awight: ORES beta service is restored.
16:45 awight: Put ORES-beta back onto master branch
16:43 awight: ORES-beta has been down since January.

2018-03-15

15:47 awight: ORES with git-lfs, scap config
00:21 awight: ORES with git-lfs

2018-03-14

23:30 awight: new ORES submodule, pre-git-lfs

2018-03-13

18:08 awight: Enable Extension:JADE, T176333

2018-03-04

02:13 Krenair: Regenerated captcha images for T164047

2018-02-09

00:59 bd808: Removed Yuvipanda at user request (T186289)

2018-01-29

23:24 awight: Experiment with versioned ORES venv, T181071

2018-01-24

23:14 Krenair: armed keyholder on deployment-cumin using deployment-puppetmaster02:/var/lib/git/labs/private/files/ssh/tin/cumin_rsa.passphrase - this seems to have fixed cumin

2018-01-11

17:50 tgr: added Groovier1 to project members for T158909

2018-01-01

23:46 Krenair: ran `mwscript extensions/ORES/maintenance/CheckModelVersions.php --wiki=sqwiki` on deployment-tin for T183862

2017-12-21

16:54 awight: Update ORES to eb0f776bb

2017-12-20

18:00 RoanKattouw: Importing dump from deployment-db03 on deployment-db04
15:31 RoanKattouw: Restarting dump again, failed due to lack of disk space
15:07 RoanKattouw: Dropped invalid view labswiki.updates, restarting dump
14:59 RoanKattouw: Dumping all databases on deployment-db03 so I can restore replication on deployment-db04. This may cause MediaWiki writes to fail while the dump runs

2017-12-19

20:10 RoanKattouw: (Earlier today) Depooled deployment-db04, it needs fixing after replication broke badly. It's out of sync with deployment-db03, where I manually fixed inconsistencies
18:11 awight: Update beta ORES service to f109792
17:00 awight: Disable ORES UI for beta wikidatawiki, T183266

2017-12-13

17:24 awight: Install aspell-is for ORES
17:06 awight: Deploy ORES service b67bba7

2017-12-11

19:04 andrewbogott: upgraded deployment-puppetmaster02 to puppet v4

2017-12-06

21:43 awight: Update ORES to 42cf532

2017-12-04

17:44 awight: ORES: Try enwiki models on simplewiki, T181848 (6baed71)

2017-11-30

19:00 bd808: Testing stashbot fix for double phab logging (T181731)
17:49 anomie: Finished running cleanupUsersWithNoId.php on Beta Cluster for T181731
16:58 anomie: Running cleanupUsersWithNoId.php on Beta Cluster, see T181731

2017-11-29

21:27 awight: Update ores submodule, for RevIdScorer statistics
14:32 chasemp: git pull on /var/lib/git/labs/private and resolve one merge conflict. (the root key file is too old here)

2017-11-28

17:42 awight: Remove stale ORES customizations for the beta cluster.

2017-10-24

17:59 madhuvishy: Ran `sudo cumin -b 5 --backend openstack "project:deployment-prep" "apt-get install git --yes"`

2017-10-04

13:19 andrewbogott: migrating 'deployment-kafka-jumbo-1' to labvirt1017

2017-09-14

19:37 tgr: updated PrivateSettings.php for T175868

2017-09-05

19:34 gilles: deployed PrivateSettings.php change to add Thumbor username to Swift configuration

2017-06-13

18:47 andrewbogott: root@deployment-salt02:~# salt "*" cmd.run "apt-get -y install facter"

2017-05-19

19:05 mutante: fixing role class config on deployment-phab* (remove role::phabricator::main, add role::phabricator_server in context prefix "deployment-phab. remove again from instance level for phab-01
18:40 mutante: deployment-phab01 still has puppet error "Could not find class role::phabricator::main" and that should simply be removed from it, but i can NOT find it in Horizon, i checked instance config, project config, the "Other" section, the "All classes" tab. Because it's gone. But how do i fix the instance config then?
18:39 mutante: applying role::phabricator_server on instance deployment-phab01 (it had error, could not find role::phabricator::main and the name changed in role/profile conversion)

2017-03-29

18:41 ebernhardson: upgrading elasticsearch and kibana to 5.1.2 on deployment-logstash2 to test puppet+integration prior to prod deployment

2017-03-27

17:02 ebernhardson: cherry pick https://gerrit.wikimedia.org/r/344964 to puppetmaster to test upgrade to logstash 5.x

2017-03-20

20:51 andrewbogott: migrating deployment-urldownloader to labvirt1013
20:45 andrewbogott: migrating deployment-pdf01 to labvirt1011
20:14 andrewbogott: migrating deployment-puppetmaster02 to a different labvirt

2017-03-15

09:10 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=hewiktionary
09:10 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=dewiktionary
09:08 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=enwiktionary
08:56 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=enwiktionary // (ParameterTypeException, T160503)
08:50 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=enwiktionary --site-group=wiktionary // (3 sites added)
08:49 addshore: addshore@deployment-tin mwscript extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=enwiktionary --force-protocol=https --load-from=https://deployment.wikimedia.beta.wmflabs.org/w/api.php
08:49 addshore: addshore@deployment-tin mwscript sql.php --wiki=enwiktionary "TRUNCATE sites; TRUNCATE site_identifiers;"
08:44 addshore: addshore@deployment-tin mwscript extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=enwiktionary --force-protocol=https
08:43 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=dewiktionary --site-group=wiktionary // (0 sites added)
08:43 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=enwiktionary --site-group=wiktionary // (1 site added)

2017-03-06

19:04 addshore: mwscript sql.php --wiki=aawiki "CREATE DATABASE cognate_wiktionary"

2017-03-01

19:09 addshore: "mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki he wiktionary hewiktionary he.wiktionary.beta.wmflabs.org" T158628

2017-02-02

00:52 tgr: added mhurd as member

2017-01-23

07:15 _joe_: cherry-picking the move of base to profile::base

2017-01-19

22:11 Krenair: added bunch of others to the same group per request. we should figure out how to make this process sane somehow
22:06 Krenair: added nuria to deploy-service group on deployment-tin

2017-01-17

17:51 urandom: re-enabling puppet on deployment-restbase02
17:47 urandom: re-enabling puppet on deployment-restbase01

2017-01-11

18:07 urandom: restarting restbase cassandra nodes
18:01 urandom: disabling puppet on restbase cassandra nodes to experiment with prometheus exporter

2017-01-08

05:20 Krenair: deployment-stream: live hacked /usr/lib/python2.7/dist-packages/socketio/handler.py a bit (added apostrophes) to try to make rcstream work

2017-01-04

21:30 mutante: deployment-cache-text-04 - running acme-setup command to debug .. Creating CSR /etc/acme/csr/beta_wmflabs_org.pem
21:26 Krenair: trying to troubleshoot puppet by stopping nginx then letting puppet start it
21:05 mutante: deployment-cache-text04 stopping nginx service, running puppet to debug dependency issue

2016-12-19

21:21 andrewbogott: and also python-functools32_3.2.3.2-3~bpo8+1_all.deb
21:20 andrewbogott: upgrading to python-jsonschema_2.5.1-5~bpo8+1_all.deb on deployment-eventlogging03
20:51 andrewbogott: upgrading to python-requests_2.12.3-1_all.deb ./python-urllib3_1.19.1-1_all.deb on deployment-mediawiki04 and deployment-tin

2016-12-04

15:26 Krenair: Found a git-sync-upstream cron on deployment-mx for some reason... commented for now, but wtf was this doing on a MX server?

2016-11-23

15:04 Krenair: fixed puppet on deployment-cache-text04 by manually enabling experimental apt repo, see T150660

2016-11-16

20:02 Krenair: mysql master back up, root identity is now unix socket based rather than password
19:57 Krenair: taking mysql master down to fix perms
07:52 Krenair: the new mysql root password for -db04 is at /tmp/newmysqlpass as well as in a new file in the puppetmaster's labs/private.git

2016-11-09

20:27 Krenair: removed default SSH access from production host 208.80.154.135, the old gallium IP

2016-11-03

05:04 Krenair: beginning to move the rest of beta to the new puppetmaster

2016-11-02

18:51 Krenair: armed keyholder on -tin and -mira
18:50 Krenair: started mysql on -db boxes to bring beta back online

2016-11-01

22:22 Krenair: started mysql on -db03 to hopefully pull us out of read-only mode
22:21 Krenair: started mysql on -db04
22:19 Krenair: stopped and started udp2log-mw on -fluorine02
22:00 Krenair: started moving nodes back to the new puppetmaster
02:55 Krenair: Managed to mess up the deployment-puppetmaster02 cert, had to move those nodes back

2016-10-31

20:57 Krenair: moving some nodes to deployment-puppetmaster02
16:57 bd808: Added Niharika29 as project member

2016-10-27

18:46 bd808: Testing dual page wiki logging by stashbot. (check #3)
18:36 bd808: Testing dual page wiki logging by stashbot. (second attempt)
18:14 bd808: Testing dual page wiki logging by stashbot.

2016-10-24

14:51 Krenair: T142288: Shut off -pdf02 and -conftool

2016-10-10

21:41 Krenair: restarted keyholder-proxy on -tin to make check_keyholder happy with the extra key that was active but unconfigured
21:11 Krenair: fixed puppet on -restbase01/-restbase02 by setting up deployment of cassandra/twcs on deployment-tin
20:56 Krenair: fixed puppet on -tin/-mira by restarting puppetmaster for base_path scap change
15:45 dcausse: deployment-elastic0[5-8]: reduce the number of replicas to 1 max for all indices

2016-10-03

15:40 Krenair: upgraded cache-upload04 to varnish4. hieradata is set on the prefix deployment-cache-upload

2016-09-28

22:33 Krenair: Rebooting deployment-ms-be01 - T146947, T141673

2016-09-26

23:13 Krenair: Rebooting deployment-aqs01 for T141673

2016-09-20

20:16 Krenair: enabled trusty-backports on deployment-puppetmaster

2016-09-14

15:21 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/310557/ on puppet master

2016-09-13

20:47 Krenair: Created SRV record _etcd._tcp.beta.wmflabs.org for etcd/confd

2016-09-11

20:35 Krenair: started cron service on deployment-salt02 again, seems it got killed Tue 2016-08-30 13:42:39 UTC - hopefully this will fix the puppet staleness alert

2016-09-08

02:25 Krenair: deployed the latest version of mediawiki/services/parsoid/deploy.git to get https://gerrit.wikimedia.org/r/#/c/309001/ see T144884

2016-08-30

23:20 Krenair: removed 'project_id' key from deployment-restbase02's metadata to fix compatibility with the new labsprojectfrommetadata code
18:09 yuvipanda: reboot deployment-kafka03 seems to be stuck

2016-08-19

00:39 Krenair: deployment-fluorine is now deployment-fluorine02 running jessie with the old precise packages shoehorned in

2016-08-12

19:20 Krenair: that fixed it, upload.beta is back up
19:14 Krenair: rebooting deployment-cache-upload04, it's stuck in https://phabricator.wikimedia.org/T141673 and varnish is no longer working there afaict, so trying to bring upload.beta.wmflabs.org back up

2016-08-02

14:02 gehel: rebooting deployment-elastic06 (unresponsive to SSH and Salt)
02:51 Krenair: https://deployment.wikimedia.beta.wmflabs, https://meta.wikimedia.beta.wmflabs, and their mobile variants now also have valid certs and TLS redirects.
01:12 Krenair: Proper SSL certificate up at https://upload.beta.wmflabs.org - HTTP has been changed to force TLS redirect.

2016-08-01

20:58 Krenair: deleted 2014/2015 files from deployment-stream:/var/log/diamond to get space on /var and stop it warning

2016-07-27

06:07 Tim: fixed broken puppet git checkout on deployment-puppetmaster, updated

2016-07-13

20:45 Krenair: RIP NFS

2016-07-11

23:24 Krenair: Unmounted /data/project (NFS) on all active hosts (mediawiki0[1-3], jobrunner01, tmh01), leaving just deployment-upload (shutoff, to schedule for deletion soon) - T64835

2016-07-09

00:46 Krenair: T64835: `mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php zerowiki --backend=local-multiwrite --private`
00:46 Krenair: T64835: `foreachwikiindblist "% all-labs.dblist - private.dblist" extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --backend=local-multiwrite`
00:46 Krenair: T64835: Live-hacked some temporary swift config in

2016-06-27

22:32 eberhardson: deployed gerrit.wikimedia.org/r/296279 to puppetmaster to test kibana4 role

2016-06-25

03:24 Krenair: Changed eventbus key in secrets (from being a symlink to eventlogging to being a new random key) so check_keyholder works again

2016-06-22

22:23 Krenair: Installed netpbm on all deployment-mediawiki* hosts to fix ProofreadPage thumbnailing. I wonder if we should include the puppet mediawiki::packages::multimedia class on these hosts really

2016-06-13

16:06 Krenair: Rebooted deployment-ircd, it was stuck somehow
13:53 yuvipanda: kicked deployment-salt via nova for Krenair
13:35 Krenair: Fixed puppet on -tin by symlinking eventbus key to eventlogging in -puppetmaster:/var/lib/git/labs/private/modules/secret/secrets/keyholder

2016-06-01

02:14 Krenair: Started redis-server on deployment-rcstream to stop MW hhvm.log spam

2016-05-09

15:39 andrewbogott: migrating deployment-flourine to labvirt1009

2016-05-03

01:42 Krenair: ran package updates on deployment-parsoid06 so that exim4 would start so puppet will run

2016-05-02

09:54 gehel: restart elasticsearch cluster to ensure multicast configuration is disabled (T110236)

2016-04-13

20:37 Krenair: doing the same with -redis02
20:26 Krenair: corrected deployment-cxserver03:/etc/puppet/puppet.conf puppetmaster to use .deployment-prep as part of dns name

2016-04-10

06:04 Krenair: deleted some large files under deployment-mediawiki01:/var/log/nutcracker to free up space on /

2016-04-09

16:08 Krenair: (same for -conf03, -sentry01, -redis01, -upload - some of these are now fully fixed and some are better than they were before)
15:59 Krenair: mostly fixed puppet on deployment-sca02 by changing /etc/puppet/puppet.conf to use project name as part of puppetmaster's hostname
15:56 Krenair: fixed broken /etc/puppet/puppet.conf on deployment-cache-text04 (it started with a copy of the file for the labs central puppetmaster and then had the correct version pointing to the project's puppetmaster)
15:47 Krenair: reenabled puppet on eventlogging04 as no reason was provided for disabling, first run successful

2016-03-30

13:35 Reedy: upgrade hhvm on deployment-mediawiki03 and reboot
12:16 gehel: restarting varnish on deployment-cache-text04

2016-03-29

13:40 Amir1: Added ores-related classes and roles

2016-03-25

20:23 Krenair: started redis-server on deployment-redis01
20:23 Krenair: repaired centralauth.spoofuser table on deployment-db1
20:23 Krenair: fiddled around with puppet on deployment-cache-text04 earlier to fix certs etc.
07:38 tgr: restarting memcached

2016-03-18

18:13 gehel: activating automatic deployment of portals (https://gerrit.wikimedia.org/r/#/c/276397/)

2016-03-08

02:26 ori: Updating HHVM on deployment-mediawiki02

2016-03-01

16:54 gehel: fixed a stalled rebase on deployment-puppetmaster:/var/lib/operations/puppet

2016-02-18

13:24 gehel: upgrading elasticsearch to 1.7.5 on cirrus-browser-bot

2016-02-17

23:57 mobrovac: added Ppchelko to the list of members

2016-02-15

09:16 gehel: re-enabling puppet on elastic05 (https://phabricator.wikimedia.org/T126891)

2016-02-12

16:33 gehel: starting to ship logs from elasticsearch to logstash (https://gerrit.wikimedia.org/r/#/c/269100/)

2016-02-11

15:16 gehel: fixed deployment-puppetmaster rebase conflict by removing commit 814f12bc - author is informed

2016-02-08

06:10 tgr: set $wgAuthenticationTokenVersion on beta cluster (test run for T124440)

2016-02-05

15:57 gehel: restarting deployment-elastic07
01:42 Tim: cherry-picked https://gerrit.wikimedia.org/r/#/c/268022/ to local puppet master as suggested by hashar. Seems to work.

2016-01-30

02:57 Krenair: Restarted varnish on cache-text04 for T125282

2015-12-02

00:31 tgr: updated rsvg on appserver to 2.40.11 - https://phabricator.wikimedia.org/T112421

2015-11-04

00:06 Krenair: Synchronized portals: https://gerrit.wikimedia.org/r/#/c/250851/

2015-10-09

21:51 ori: Accidentally clobbered /etc/init.d/mysql on deployment-db1, causing deployment-prep failures. Restored now.

2015-09-16

20:39 cscott: updated OCG to version 4032a596ce6eb442b02cc6ee9b79263b1eb23275

2015-09-14

19:18 cscott: updated OCG to version 5811056e28f2bc6408b6da96095352ab381bb11f
12:04 dcausse: restarting elasticsearch (deployment-elastic0[5-8]) to deploy new plugins

2015-08-25

14:42 andrewbogott: moving deployment-cache-mobile04 to labvirt1004

2015-08-12

20:45 urandom: restarted restbase on deployment-restbase01 (dead)

2015-08-05

14:33 godog: update deployment-restbase02 to openjdk8 T104887
14:18 godog: update deployment-restbase01 to openjdk8 T104887

June 29

13:17 dcausse: restarting Elasticsearch to pick up new plugin versions

June 23

13:31 cscott: fixed salt on deployment-pdf02, restarted OCG there.
05:44 cscott: stopped OCG service on deployment-pdf02, see https://phabricator.wikimedia.org/T103473
05:20 cscott: updated OCG to version d7c698d5bf730d34057945e912ac75dc542dd788 ; restarted service.
03:58 cscott: stopped OCG on beta; redis 2.8.x is causing the service to crash on startup.

June 22

21:58 andrewbogott: re-enabling puppet on deployment-videoscaler01 because no reason was given for disabling
20:42 cscott: updated OCG to version b482144f5bd8b427bcc64a3dd287247195aa1951

June 4

20:29 ori: upgrading hhvm-fss from 1.1.4 to 1.1.5, has fix for T101395

May 29

14:07 moritzm: upgrade java on deployment-restbase0[12] to the 7u79 security update

May 28

08:46 godog: test es-tool restart-fast on deployment-elastic05

May 27

21:15 AaronSchulz: populated jobqueue:aggregator:s-wikis:v2 with 1000 fake wiki keys for load testing
21:07 AaronSchulz: Deployed https://gerrit.wikimedia.org/r/#/c/208852/
21:07 AaronSchulz: Deleted 4G of logs on jobrunner01

May 24

18:39 YuviKTM: purged old logs kept on NFS

May 20

20:58 cscott: updated OCG to version ca4f64852de5b1de782b292b50038fbd2dd84266

May 18

15:17 andrewbogott: rebooting deployment-logstash1

May 15

20:50 andrewbogott: rebooted deployment-bastion due to inconsistent run state after suspend/resume

May 13

21:08 cscott: updated OCG to version c7c75e5b03ad9096571dc6dbfcb7022c924ccb4f

May 2

00:51 yuvipanda: created deployment-boomboom to test

April 29

21:03 andrewbogott: suspending and shrinking disks of many instances

April 28

20:57 YuviPanda: KILL KILL KILL DEPLOYMENT-LUCID-SALT WITH FIRE AND BRIMSTONE AND BAD THINGS

April 27

08:01 _joe_: installed hhvm 3.6 on deployment-mediawiki02

April 24

14:25 _joe_: installing hhvm 3.6.1 on mediawiki-deployment01

April 23

17:19 andrewbogott: rebooting deployment-parsoidcache02 because it seems troubled

April 22

12:48 andrewbogott: migrating to new labvirt nodes

April 21

08:33 _joe_: rollback installation of hhvm 3.6
08:09 _joe_: installing HHVM 3.6 and the corresponding extensions on deployment-mediawiki01

April 9

20:11 mutante: fixed apt sources lists on deployment-bastion (T95541)

March 30

22:33 Josve05a: manually start mysql on db1 and db2
21:57 YuviPanda: reboot all instances from virt1000

March 23

20:41 cscott: updated OCG to version 11f096b6e45ef183826721f5c6b0f933a387b1bb

March 18

13:45 mobrovac: added restbase security group
13:35 YuviPanda: made mobrovac projectadmin
13:34 YuviPanda: added mobrovac to project

March 16

18:46 manybubbles: upgraded Elasticsearch on deployment-logstash1

March 11

18:47 YuviPanda: created deployment-mediawiki03

February 27

11:12 YuviPanda: start mysql on deployment-db1

February 26

11:53 YuviPanda: created deployment-parsoid01-test to test patch to use role::parsoid on labs

February 18

13:04 _joe_: installed new version of the hhvm extensions packages

February 17

23:18 Krenair: Started mysql on deployment-db1; beta now appears much less broken than before

February 6

20:07 ^d: scratch that, I rebuilt it as precise. why did I do that?
20:03 ^d: rebuilt deployment-elastic05 with new partition scheme

February 5

12:48 YuviPanda: cherry-picking https://gerrit.wikimedia.org/r/188798 on scap on deployment-prep
12:28 YuviPanda: killed chown on deployment-bastion, running direclty on NFS server
12:13 YuviPanda: running time sudo chown -R www-data:www-data upload7/ on /data/project
12:10 YuviPanda: stopped jobrunner on jobrunner01
11:53 YuviPanda: running git-sync-upstream on deployment-salt to pick up latest ops/puppet changes
11:52 _joe_: converting the web user to www-data
11:44 YuviPanda: deleted mediawiki03 instance, holdover from security testing from long, long ago
11:41 YuviPanda: disabled puppet on mediawiki01, 02, jobrunner01, bastion and salt

February 4

13:56 YuviPanda: created deployment-jobrunner01, trusty instance
13:51 YuviPanda: deleted deployment-jobrunner01, trusty version coming up
11:35 YuviPanda: created instance deployment-mediawiki02
11:26 YuviPanda: deleted instance deployment-mediawiki02
06:37 YuviPanda: created deployment-mediawiki01 host
06:34 YuviPanda: killed deployment-mediawiki01 host. FOREEVERRR

February 2

13:37 yuvipanda: added mx record to beta.wmflabs.org, for https://phabricator.wikimedia.org/T88215 via LDAP

January 27

18:15 andrewbogott: upgrading libc6 on all instances from deployment-salt

January 20

02:30 YuviPanda: created deployment-mediawiki04 to test roles

January 7

16:25 YuviPanda: added milimetric to NDA sudo’ers groups

December 29

22:24 MaxSem: Created a DNS entry for m.wikidata.beta.wmflabs.org

December 22

12:40 _joe_: upgrading HHVM to the latest version

December 16

16:52 manybubbles: elasticsearch restart finished
16:48 mutante: deployment-db2 is down
16:48 manybubbles: restarting beta's elasticsearch servers to pick up a new version of a plugin. won't interfere with current downtime.

December 13

17:10 bd808: Many strange puppet and scap failures in beta that look to be related to DNS failures
16:03 bd808: Starting work on phab:T78076 to renumber apache users in beta

December 11

22:47 cscott: updated OCG to version bfc3812ef346c9f767135b339cedd123a1bcac98

December 6

05:05 ori: upgrade hhvm-tidy to 0.1-2

December 3

21:33 cscott: updated OCG to version 08e94b19c3f17e699d7e53d9605f65c58e17ea0e

December 2

17:09 _joe_: upgrading HHVM to its latest version
17:08 andrewbogott: this is a test message

December 1

21:50 cscott-split: updated OCG to version a06e7c186796a6ee5d5af81e93688520abdf2596

November 26

20:47 cscott: updated OCG to version 7d8f2b8bd496464041e3ef9c092732457cc8f7ef

November 24

15:16 YuviPanda: modified local hack to account for 47dcefb74dd4faf8afb6880ec554c7e087aa947b
14:58 YuviPanda: cherry-picked 3e45c538978710113e6e28e9d533bf8d18c159a6 and 9d4614a8a352c78505212fd6e9d2a7be6d2e4927 to deployment-salt puppetmaster, restoring local hacks

November 19

21:19 anomie: Cherry-picked https://gerrit.wikimedia.org/r/#/c/173336/3 to Beta

November 17

20:37 YuviPanda: cleaned out logs on deployment-bastion
16:48 YuviPanda: delete deployment-analytics01, a tortoise from an ancient time.
05:17 YuviPanda: force apt-get install -f to unstuck puppet
04:49 YuviPanda: clean up coredump on deployment-prep

November 16

00:38 YuviPanda: uncherrypick https://gerrit.wikimedia.org/r/#/c/173634/ because OMG CODE
00:14 YuviPanda: cherry-pick https://gerrit.wikimedia.org/r/#/c/173634/ on deployment-salt
00:01 YuviPanda: cherry-pick https://gerrit.wikimedia.org/r/#/c/173510/ on deployment-prep to make memc03 run puppet

November 14

20:02 anomie: Cherry-picking https://gerrit.wikimedia.org/r/#/c/173336/ for testing in logstash

November 13

10:11 YuviPanda: cherry pick https://gerrit.wikimedia.org/r/#/c/172967/1 to test https://bugzilla.wikimedia.org/show_bug.cgi?id=73263

November 12

18:16 YuviPanda: cherry picking https://gerrit.wikimedia.org/r/#/c/172776/ on labs puppetmaster to see if it fixes issues in the cache machines

November 11

17:13 cscott: removed old ocg cronjobs on deployment-pdf0x; see https://bugzilla.wikimedia.org/show_bug.cgi?id=73166

November 10

22:37 cscott: rsync'ed .git from pdf01 to pdf02 to resolve git-deploy issues on pdf02 (git fsck on pdf02 reported lots of errors)
21:41 cscott: updated OCG to version d9855961b18f550f62c0b20da70f95847a215805 (skipping deployment-pdf02)
21:39 cscott: deployment-pdf02 is not responding to git-deploy for OCG

November 5

06:14 ori: restarted hhvm on beta app servers

November 3

22:07 cscott: updated OCG to version 5834af97ae80382f3368dc61b9d119cef0fe129b

October 29

18:55 ori: upgraded hhvm on beta labs to 3.3.0+dfsg1-1+wm1

October 28

23:47 RoanKattouw: ...which was a no-op
23:46 RoanKattouw: Updating puppet repo on deployment-salt puppet master
21:36 RoanKattouw: Creating deployment-parsoid05 as a replacement for the totally broken deployment-parsoid04 (also as a trusty instance rather than precise)
21:06 RoanKattouw: Rebooting deployment-parsoid04, wasn't responding to ssh

October 27

20:23 cscott: updated OCG to version 60b15d9985f881aadaa5fdf7c945298c3d7ebeac

October 22

21:10 arlolra: updated OCG to version e977e2c8ecacea2b4dee837933cc2ffdc6b214cb

October 8

22:04 subbu: updated OCG to version def24eca

October 7

22:50 cscott: updated OCG to version c778ea8b898f8ad8c2b7ad9de78a75469e7ed061

October 6

23:13 YuviPanda: killed extra log files in deployment-bastion
21:44 cscott: updated OCG to version bbdf4c6400cfbbc6030114ad16e1a6f7025eab2c
15:36 cscott: updated OCG to version aee3712b352f51f96569de0bcccf3facf654e688

October 3

19:51 manybubbles: performing rolling restart of elasticsearch nodes to pick up preview of accelerated regex plugin for testing at larger-than-mylaptop-scale
14:02 manybubbles: rebuilding beta's simplewiki cirrus *index*
14:02 manybubbles: rebuilding beta's simplewiki cirrus inde

October 1

20:13 cscott: updated OCG to version 48c495e3656f528abe636ce0cd7562270505534f
16:40 bd808: Added Gilles to under_NDA sudoers group

September 30

22:00 bd808: Cleaned deleted instances out of salt and trebuchet redis
20:26 bd808: Converted deployment-rsync02 to use local puppet & salt masters
15:36 bd808: enabling puppet and forcing run on deployment-mediawiki03
15:34 bd808: enabling puppet and forcing run on deployment-mediawiki02
15:28 bd808: enabling puppet and forcing run on deployment-mediawiki01

September 29

22:45 Reedy: re-enabled beta-scap-eqiad
21:34 Reedy: disabled "beta-scap-eqiad" until things are fixed
21:24 Reedy: deleted l10n cache on deployment-rsync01 to attempt to run sync-common manually
21:22 Reedy: deployment-rsync01 hard drive is far too small
17:57 cscott: updated OCG to version 89d8f29a24295b05d0643abe976fea83b56575c9
06:58 ori: Configured Beta cluster to use redis for session storage
06:57 ori: Created deployment-redis02 and converted it to use local puppet & salt masters
05:23 ori: Created deployment-redis01 and converted it to use local puppet & salt masters

September 28

14:38 andrewbogott: cherry-picked https://gerrit.wikimedia.org/r/#/c/163464/ onto deployment-salt to fix a puppet compile failure.
14:38 andrewbogott: edited and re-cherry-picked roan's citoid patch into beta because the previous version was breaking puppet

September 26

06:34 cscott: updated OCG to version f3a6c1cbba118d4a5e1aa019937dc50159fc823d

September 25

22:48 RoanKattouw: Fixed permissions of deployment-bastion:/srv/deployment/mathoid/mathoid/.git/deploy (needed g+w)
11:36 _joe_: updated hhvm to fix most bugs, also cherry-picked https://gerrit.wikimedia.org/r/#/c/162839/

September 24

23:00 bd808: Updated bash with salt
20:52 cscott: updated OCG to version 48acb8a2031863e35fad9960e48af60a3618def9

September 23

20:14 cscott: updated OCG to version 1cf9281ec3e01d6cbb27053de9f2423582fcc156
17:37 AaronSchulz: Initialized bloom cache on betalabs, enabled it, and populated it for enwiki

September 22

16:08 ori: updating HHVM to 3.3.0-20140918+wmf1

September 20

14:43 andrewbogott: movingdeployment-pdf02 to virt1009
00:36 mutante: raised instance quota to 43

September 19

00:26 cscott: updated OCG to version ce16f7adb60d7c77409e2e11ba0e5d6cce6955d5

September 16

15:44 godog: testing scap change from https://gerrit.wikimedia.org/r/#/c/160668/
02:46 cscott: updated OCG to version 188a3c221d927bd0601ef5e1b0c0f4a9d1cdbd31

September 15

21:44 andrewbogott: migrating deployment-videoscaler01 to virt1002
21:41 andrewbogott: migrating deployment-sentry2 to virt1002
21:40 cscott: *skipped* deploy of OCG, due to deployment-salt issues
21:19 bd808: Added Matanya to under_NDA sudoers group (bug 70864)

September 12

12:24 _joe_: set up hiera, noop as expected

September 11

16:31 YuviPanda: Delete deployment-graphite instance
02:29 mutante: raised instance quota by 1 to 42

September 10

08:14 Krinkle: bits.beta.wmflabs.org is down with 503 Service Unavailable (http://bits.beta.wmflabs.org/en.wikipedia.beta.wmflabs.org/load.php)

September 9

20:08 cscott: updated OCG to version c9a2b4cf2502479eeabed07ab2de728695d96e46

September 7

23:48 bd808: Added John F. Lewis to under_NDA sudo policy (bug 70539)
23:29 bd808: Promoted John F. Lewis to project admin (bug 70539)
23:26 bd808: Added Jalexander as project member (bug 70539)

September 5

17:54 bd808: Purged varnish cache on deployment-cache-bits01 -- sudo varnishadm ban req.url '~' /
16:00 YuviPanda: unfuck puppet on deployment-salt, puppet is stupid and does not properly report failed events on last_run_summary.yaml if there's a syntax error or a resource conflict. So I've to read last_run_report and do things with *that* instead now
15:49 YuviPanda: deliberately fucking up puppet to see if icinga complains
09:52 _joe_: cherry-picked I6ec53da483bebfa375eba2383cbf60123ff1ce26, it work

September 4

16:06 bd808: Manually cleaned bogus LocalRenameUserJob jobs from redis
13:54 _joe_: stopped puppet on the appservers but mw03, testing an apache change
05:28 legoktm: stopping jobrunner on deployment-jobrunner01
05:22 legoktm: restarted jobrunner on deployment-jobrunner01
05:14 bd808: Bad jobs in job queue filled up /var on jobrunner01 and killed jobrunner script. Leaving down for now until I find out how to delete the bad jobs.
01:41 bd808: Killed old jobs-loop.sh processes on deployment-jobrunner01
01:24 bd808: Many jobrunner errors like "wikiversions-labs.cdb has no version entry for `amwiki`" with various wiki names
01:23 bd808|AWAY: Started jobrunner service manually on jobrunner01.
00:44 bd808: Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known when Trebuchet is running)
00:35 bd808: Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known)

September 3

15:02 bd808: _joe_ rolled out a new hhvm package ~5 hours ago
15:01 bd808: morebots is back thanks to petan
14:50 bd808: logmsgbot down apparently

September 2

15:34 bd808: False alarm. SSL is borked in beta and we know that
15:29 bd808: `curl -vL -H 'Host: en.wikipedia.beta.wmflabs.org' localhost` works from deployment-cache-text02
15:27 bd808: https://en.wikipedia.beta.wmflabs.org/ returning ERR_CONNECTION_REFUSED (is varnish down?)

August 29

22:56 bd808: Got puppet to run cleanly on deployment-mediawiki03. Should be ready for serving traffic.
22:39 bd808: Fixed a merge conflict in operations/puppet on deployment-salt
21:46 bd808: Forced install of "right version of libvips-tools on mediawiki03 `sudo apt-get install libvips-tools=7.38.5-2`
08:40 hashar: rebooting deployment-cache-mobile03 (kernel up)

August 28

21:32 bd808: Added "Greg Grossmeier" to UnderNDA sudoers group
17:12 bd808: Changed centralauth db to rename labswiki -> deploymentwiki
16:49 bd808: CentralAuth looks broken on http://deployment.wikimedia.beta.wmflabs.org/
16:49 bd808: Apache vhosts look good again
16:34 bd808: Restarted varnishes on deployment-cache-text02
16:13 andrewbogott: merging a patch that renames 'labswiki' to 'deploymentwiki'
09:21 hashar: resetting git repository in /data/project/apache/conf to point to the betaclusterbranch of operations/mediawiki-config.git discarded all local hacks in the process

August 27

23:03 hashar: Blacklisting the security audit IP again on deployment-cache bits01 mobile03 and text02
22:53 hashar: removed the blackhole ip route from deployment-cache-text02 and deployment-cache-mobile03
22:48 hashar: the IP is a known security audit. See Chris Steipp.
22:46 hashar: blackholed an IP address on deployment-cache-text02 and deployment-cache-mobile03 , it was causing hundred of requests per seconds and overloaded the beta cluster. Use route -n to find the IP
22:37 hashar: restarting udp2log-mw on deployment-bastion. It keeps crashing since fiarly recently
22:26 bd808: when restarting varnish on deployment-cache-text02, don't forget that there are 2 varnish services (varnish and varnish-frontend)
22:19 bd808: restarted varnish (again) on deployment-cache-text02
22:10 bd808: restarted varnish on deployment-cache-text02
16:22 bd808: killing `apt-get update` process running on deployment-bastion since Jun13
14:59 bd808: Resolved puppet git merge conflict on deployment-salt
14:49 bd808: Moved hhvm core dumps to /data/project/hhvm-cores
14:42 bd808: Root dirve full on deployment-mediawiki02; hhvm core files are the culprit

August 25

23:47 ori: stopping hhvm/apache on deployment-mediawiki02 to replace debug build of hhvm with release build
21:44 bd808: Deployed scap 116027f (Make sync-common update l10n cdb files by default)
18:30 ori: deployment-mediawiki02: cleared /tmp; running puppet
15:05 hashar: mediawiki02 rm /tmp/hhvm*.core . Filled as bug 69979
15:01 hashar: mediawiki02 rm /tmp/mw-cache-master/conf*
15:01 hashar: mediawiki02 has mw conf caches under /tmp/mw-cache-master/ and since that partition is filled up, that ends up with conf caches being null file
15:00 hashar: mediawiki02 rm /var/log/upstart/hhvm*
14:53 hashar: mediawiki02 : removed /var/lib/puppet/state/agent_catalog_run.lock
14:46 hashar: restarting udp2log-mw service on -bastion. It is stalled for some reason
14:42 hashar: on mediawiki02 , clearing out some /var/log/upstart/hhvm.* log files see bug 69976
14:34 hashar: mediawiki02 / partition is 100% full

August 22

20:21 hashar: udp2log are back in /data/project/logs . The udp2log-mw service went stall for some reason.
20:08 ori: ran 'git pull' on deployment-salt:/srv/var-lib/git/operations/puppet
19:59 hashar: restarting udp2log-mw service on deployment-bastion
19:59 hashar: bits yielding 503
00:41 bd808: cherry-picked scap change https://gerrit.wikimedia.org/r/#/c/155677/ for testing

August 21

21:49 bd808: Trebuchet happier after all the salt-minion restarts; still have deleted hosts showing in the expected minion list for scap deploys
21:01 twentyafterfour: Started salt-minion on deployment-redis01
21:01 bd808: Started salt-minon on deployment-upload
21:00 bd808: Started salt-minon on deployment-fluoride
21:00 bd808: Started salt-minon on deployment-db1
20:59 bd808: Started salt-minon on deployment-elastic01
20:59 twentyafterfour: Started salt-minion on deployment-eventlogging02
20:58 bd808: Started salt-minon on deployment-elastic02
20:58 bd808: Started salt-minon on deployment-elastic03
20:57 bd808: Started salt-minon on deployment-elastic04
20:57 bd808: Started salt-minon on deployment-analytics01
20:55 bd808: Started salt-minon on deployment-cache-upload02
20:54 bd808: Started salt-minon on deployment-memc04
20:54 bd808: Started salt-minon on deployment-parsoid04
20:49 bd808: Started salt-minon on deployment-memc05
20:48 bd808: Started salt-minon on deployment-db2
20:48 twentyafterfour: Started salt-minion on deployment-cache-text02
20:47 twentyafterfour: Started salt-minion on deployment-memc03
20:46 bd808: Started salt-minon on deployment-cxserver01
20:12 bd808: List of broken salt minions can be obtained with `sudo salt-run manage.down` on deployment-salt
19:55 bd808: Fixed salt on deployment-memc02
19:52 bd808: Salt minions are broken all over beta. Hung grain-ensure calls, hung test.ping calls, downed minions
19:50 bd808: Killed dozens of grain-ensure calls and started salt-minion on deployment-cache-mobile03
19:47 bd808: Killed hung salt-call and started salt-minion on deployment-cache-bits01
19:28 bd808: Deployed cherry-pick of Iea7217a for scap
19:27 bd808: Restarted salt-minion on deployment-jobrunner01 & deployment-videoscaler01
19:27 bd808: Killed rogue salt-master process on deployment-bastion
19:26 bd808: Deleted salt keys for retired apache0[12] minions
00:13 bd808: Upgraded elasticsearch to 1.3.2 on deployment-logstash1

August 19

16:11 hashar: deleted /usr/local/apache/common-local symlink, made it a directory and retriggered https://integration.wikimedia.org/ci/job/beta-scap-eqiad/17887/console
16:03 bd808: Removed local changes to /usr/local/apache/conf/wmflabs-logging.conf on deployment-mediawiki02; logs back to nfs share
15:52 bd808: Changed apache logging level from debug to notice on deployment-mediawiki02 in /usr/local/apache/conf/wmflabs-logging.conf
15:47 bd808: Changed apache logging level from debug to warn on deployment-mediawiki02
15:44 bd808: /var full on deployment-mediawiki02; deleting 572M /var/log/apache2/debug.log.1
15:03 hashar: Killed some stalled scap / rsync process on deployment-bastion that were preventing https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ from acquiring the lock.
14:17 hashar: huge rsync in progress on bastion
14:00 hashar: On bastion reverted the symlink on bastion and manually created directory /usr/local/apache/common-local
13:55 hashar_: On bastion, deleting /usr/local/apache/common-local and symlink it to /srv/common-local

August 18

22:22 ^d: dropped apache01/02 instances, unused and need the resources
18:23 manybubbles: finished upgrading elasticsearch in beta - everything seems ok so far
18:15 bd808: Restarted salt-minion on deployment-mediawiki01 & deployment-rsync01
18:15 bd808: Ran `sudo pkill python` on deployment-rsync01 to kill hundreds of grain-ensure processes
18:12 bd808: Ran `sudo pkill python` on deployment-mediawiki01 to kill hundreds of grain-ensure processes
18:10 manybubbles: finally restarting beta's elasticsearch servers now that they have new jars
17:56 bd808: Manually ran trebuchet fetches on deployment-elastic0*
17:49 bd808: Forcing puppet run on deployment-elastic01
17:47 godog: upgraded hhvm on mediawiki02 to 3.3-dev+20140728+wmf5
17:44 bd808: Trying to restart minions again with `salt '*' -b 1 service.restart salt-minion`
17:39 bd808: Restarting minions via `salt '*' service.restart salt-minion`
17:38 bd808: Restarted salt-master service on deployment-salt
17:19 bd808: 16:37 Restarted Apache and HHVM on deployment-mediawiki02 to pick up removal of /etc/php5/conf.d/mail.ini (logged in prod SAL by mistake)
16:59 manybubbles|lunc: upgrading Elasticsearch in beta to 1.3.2
16:11 bd808: Manually applied https://gerrit.wikimedia.org/r/#/c/141287/12/templates/mail/exim4.minimal.erb on deployment-mediawiki02 and restarted exim4 service
15:28 bd808: Puppet failing for deployment-mathoid due to duplicate definition error in trebuchet config
15:15 bd808: Reinstated puppet patch to depool deployment-mediawiki01 and forced puppet run on all deployment-cache-* hosts
15:04 bd808: Puppet run failing on deployment-mediawiki01 (apache won't start); Puppet disabled on deployment-mediawiki02 ('reason not specified') Probably needs to wait until Giuseppe is back from vacation for fixing.
15:00 bd808: Rebooting deployment-eventlogging02 via wikitech; console filling with OOM killer messages and puppet runs failing with "Cannot allocate memory - fork(2)"
14:29 bd808: Forced puppet run on deployment-cache-upload02
14:27 bd808: Forced puppet run on deployment-cache-text02
14:24 bd808: Forced puppet run on deployment-cache-mobile03
14:20 bd808: Forced puppet run on deployment-cache-bits01

August 17

22:58 bd808: Attempting to reboot deployment-cache-bits01.eqiad.wmflabs via wikitech
22:56 bd808: deployment-cache-bits01.eqiad.wmflabs not allowing ssh access and wikitech console full of OOM killer messages

August 15

21:57 legoktm: set $wgVERPsecret in PrivateSettings.php
21:42 hashSpeleology: Beta cluster database updates are broken due to CentralNotice. Fix up is 154231
20:57 hashSpeleology: deployment-rsync01 : deleting /usr/local/apache/common-local content. Then ln -s /srv/common-local /usr/local/apache/common-local as set by beta::common which is not applied on that host for some reason. bug 69590
20:55 hashSpeleology: puppet administratively disabled on mediawiki02 . Assuming some work in progress on that host. Leaving it untouched
20:54 hashSpeleology: puppet is proceeding on mediawiki01
20:52 hashSpeleology: attempting to unbreak mediawiki code update bug 69590 by cherry picking 154329
20:39 hashSpeleology: in case it is not in SAL. MediaWiki is no more synced to app server bug 69590
20:20 hashSpeleology: rebooting mediawiki01 , /var refuses to clear out and stick at 100% usage
20:16 hashSpeleology: cleaning up /var/log on deployment-mediawiki02
20:14 hashSpeleology: on deployment-mediawiki01 deleting /var/log/apache2/access.log.1
20:13 hashSpeleology: on deployment-mediawiki01 deleting /var/log/apache2/debug.log.1
20:13 hashSpeleology: bunch of instances have a full /var/log :-/
11:37 ori: deployment-cache-bits01 unresponsive; console shows OOMs: https://dpaste.de/LDRi/raw . rebooting
03:20 jeremyb: 02:46:37 UTC <ebernhardson> !log beta /dev/vda1 full. moved /srv-old to /mnt/srv-old and freed up 2.1G

August 14

12:23 hashar: manually rebased operations/puppet.git on puppetmaster

August 13

08:02 hashar: beta-code-update-eqiad is running again
07:57 hashar: fixing ownerships under /srv/scap-stage-dir/php-master/skins some files belong to root
07:55 hashar: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ is broken :-/

August 8

16:05 bd808: Fixed merge conflict that was preventing updates on puppet master

August 6

13:13 hashar: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ is running again
13:13 hashar: removed a bunch of local hack on deployment-bastion:/srv/scap-stage-dir/php-master . That causes the git repo to be dirty and prevents scap from achieving git pull there
12:08 hashar: Manually pruning whole text cache on deployment-cache-text02
12:07 hashar: Apache virtual hosts were not properly loaded on mediawiki02. I have hacked /etc/apache2/apache2.conf to make it Include Include /usr/local/apache/conf/all.conf (instead of main.conf which does not include everything)
08:43 hashar: prunning cache on deployment-cache-text02 / restarting varnish

August 2

08:53 swtaarrs: rebuilt and restarted hhvm on deployment-mediawiki02 with potential fix
05:17 swtaarrs: restarted hhvm on deployment-mediawiki0{1,2} to unwedge them

August 1

15:03 bd808: Updated cherry-pick of Iceb8f43
15:02 bd808: Cleaned up puppet repo on deployment-salt; merge conflicts with local Ia463120 hack; reapplied depool of deployment-mediawiki01
14:50 bd808: Restarted stuck hhvm on deployment-mediawiki02; apache had 89 children waiting for a response
13:27 godog: changed inplace bt-hhvm on deployment-mediawiki01/02 to also copy the binary
05:32 ori: depooled deployment-mediawiki02 to investigate HHVM lock-up by cherry-picking I7df8c5310 on beta.
00:40 ori: disabled puppet on deployment-mediawiki{01,02} and enabled verbose apache logging

July 31

22:41 bd808: Restarted hhvm on -mediawiki{01,02}. Brett looked at 01 before I did and said "it's the same as before"
20:09 cscott: updated OCG to version d2919c59eb09e09fc87777696411a070620aef45
19:59 hashar: Granted sudo right to cscott (under NDA). Will let him reboot OCG service
18:58 ori: re-enabled puppet on deployment-mediawiki{01,02}
10:41 hashar: Taking gdb traces of hhvm on mediawiki01 and mediawiki02. Restarting hhvm
05:08 bd808: HHVM hung on both boxes. Grabbed core and backtrace before restarting

July 30

19:59 bd808: Created local commit 7d56b79 in puppet to work around bugs in Ia463120718dceab087ad3f8e3f35917fa879f387
19:46 bd808: Restored prior /etc/hhvm/php.ini from puppet filebucket archive on deployment-mediawiki0[12]
19:32 bd808: Disabled puppet on deployment-mediawiki02 for the same reason
19:31 bd808: Disabled puppet on deployment-mediawiki01; Ori will look into hhvm config changes that were being applied
16:52 bd808: Fixed beta-scap-eqiad Jenkins job by correcting ssh problems in beta project
16:43 bd808: Fixed ssh to jobrunner01 and videoscaler01 by correcting unrelated puppet manifest problem and forcing run via salt.
16:00 bd808: Puppet runs on videoscaler01 and jobrunner01 failing for "Could not find dependency Ferm::Rule[bastion-ssh] for Ferm::Rule[deployment-bastion-scap-ssh]"
16:00 bd808: Puppet seems manually disabled on apache0[12].
15:59 bd808: Can't ssh to apache0[12], videoscaler01 and jobrunner01. Puppet not running on any of them. libnss-ldapd unattended update has broken /etc/nslcd.conf
15:23 bd808: Removed cherry-pick for Iac547efa83cf059a1276b6e279c3ebd4c7224b2c and updated cherry-pick for I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 to latest patch set.
15:05 bd808: Two cherry-picks in puppet conflicting with merged production changes: I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 and Iac547efa83cf059a1276b6e279c3ebd4c7224b2c (ori, twentyafterfour)
14:49 bd808: Started apache2 service on deployment-mediawiki01
14:16 hashar: rebooting hhvm
09:42 hashar: bastion had broken puppet because deployment_server and zuul both declare the same python packages 150501
09:40 hashar: restoring on puppetmaster modules/mediawiki/templates/apache/apache2.conf.erb which got deleted somehow
09:29 hashar: Rebooting apache01/02 to see whether it fix the ssh connection issue
09:27 hashar: manually started hhvm on mediawiki01
09:25 hashar: rebooting deployment-mediawiki01 hhvm process went zombie
09:23 hashar: restarting hhvm on mediawiki 01/02
09:05 hashar_: Beta scap script broken since 6:30am UTC https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

July 29

22:56 cscott: updated OCG to version aeb8623d6ebe41ae7c7e36c57844bd9ea8e6d595
21:02 bd808: Converted deployment-sentry2.eqiad.wmflabs to use beta salt/puppet master
19:14 hashar: Removed all jobs from queue, restarted slave agent. Update Jobs coming back
19:09 hashar: deployment-bastion jenkins slave is stuck. Beta cluster is no more updating code :-//
15:58 godog: restarted hhvm on deploymnet-mediawiki01
15:52 godog: restarted hhvm on deployment-mediawiki02
15:50 godog: installed libevent-dbg on deployment-mediawiki02 to capture an hhvm backtrace
15:17 bd808: _joe_ restarting hhvm on deployment-mediawiki01
15:00 bd808: Apache stuck with 65 children on both deployment-mediawiki servers
10:37 hashar: Restarted hhvm on mediawiki{01,02}

July 28

17:41 bd808: Updated hhvm to latest 3.3-dev+20140728 build on deployment-mediawiki0[12]
15:37 manybubbles: rebuilding elasticsearch indexes to build a weighted all field we'll try to use to improve performance
15:32 bd808: Restarted hhvm on deployment-mediawiki0[12]. All apache children were stuck waiting for hhvm to respond.
15:20 bd808: Restarted apache on deployment-mediawiki02. 65 children and non-responsive to requests. (same as mediawiki01)
15:18 bd808: Restarted apache on deployment-mediawiki01. 65 children and non-responsive to requests.
14:23 manybubbles: or not - looks like I can't!
14:22 manybubbles: reubilding cirrus search indexes to pick up a speed up all field
08:30 hashar: restarted varnish on deployment-cache-bits01 . Hoping to clear bits cache

July 25

18:29 bd808: Added twentyafterfour and several other WMF staff to under_NDA sudo group
17:15 bd808: Morebots is back!
16:38 bd808: pstree showed "hhvm─┬─271*[sh]" on deployment-mediawiki02
16:38 bd808: Killed apache2+hhvm and restarted on deployment-mediawiki0[12]
16:06 bd808: `tcpdump -n udp dst port 8324` shows packets leaving deployment-bastion for deployment-logstash1
16:00 bd808: Stopped udp2log and started udp2log-wm with no apparent effect
16:00 bd808: udp2log events not being sent from deployment-bastion to deployment-logstash1
15:49 bd808: Restarted logstash on deployment-logstash1
09:45 mwalker: rebasing puppet repo to get a ocg patch

July 24

16:09 bd808: Reverted MW config to re-enable luasandbox mode; back to luastandalone for now
15:44 bd808: Updated MW config to re-enable luasandbox mode
15:43 bd808: Updated hhvm-luasandbox to 2.0-3 and restarted hhvm instances
14:21 hashar: killed hhvm process on deployment-mediawiki01 and 02. init script does not work.
02:59 ori: promoted legoktm to project-admin

July 23

23:30 bd808: Running `find . -type d -exec chmod 777 {} +` in /data/project/upload7 to finx shared image dir permisisons
20:49 bd808: Changed config to run lua via external executable to avoid hhvm crashing bug
16:20 bd808: hhvm upgraded to 3.1+20140723-1+wmf1 on deployment-mediawiki0[12]
15:34 bd808: Reverted hhvm to 3.1+20140630-1+wm1 on deployment-mediawiki02
15:21 bd808: Upgraded hhvm to 3.1+20140630; seeing problems with luasandbox extension

July 22

14:26 hashar: upgrading varnish on deployment-cache-mobile03
14:22 hashar: upgrading varnish on deployment-cache-text02
14:02 hashar: rebooting deployment-cache-upload02 varnish not happy with memory mapping
13:51 hashar: rebooting bits varnish cache
13:43 hashar: rebased puppetmaster repo. Rebase got broken after 0317463 - beta: New script to restart apaches got merged in.
13:35 hashar: apt-get upgrade on deployment-cache-bits01 + varnish upgrade
09:28 hashar: Removing role::beta::natfix that is now handled by labs DNS and the class is removed with 146091

July 21

23:37 ori: Switched over beta cluster app servers to HHVM
21:27 bd808: Killed update.php jobs; Antoine will give jobs a longer timeout
21:23 bd808: Running update.php for simplewiki in screen
21:22 bd808: Running update.php for hewiki in screen
21:21 bd808: Running update.php for eswiki in screen
21:21 bd808: Running update.php for cawiki in screen
21:21 bd808: Running update.php for commonswiki in screen
21:18 hashar: Restarting upd2log-mw on deployment-bastion. There is a bunch of [python] <defunct> processes
17:32 bd808: Updated scap to 4871208 (+ cherry pick of I6a56b5e)
17:12 bd808: Hotfix for scap ssh host key checking to fix jenkins scap job
17:03 bd808: Testing scap change I40a891b via cherry-pick
10:25 hashar: on bastion, fixed some puppet dependency to have nutcracker to start with the proper configuration 148043
10:20 hashar: upgrading packages on deployment-bastion
10:19 hashar: deleted /var/lib/apt/lists/lock on bastion. Was prevent apt-get update from running
10:18 hashar: setting up nutcracker on deployment-bastion. It was installed but the puppet class to configure it was not being applied. Related Gerrit patches: 148041 and 148042
09:25 hashar: rebooting deployment-apache02
09:22 hashar: rebooting deployment-apache01.
00:27 ori: deployment-mediawiki01 & deployment-mediawiki02: configured for project-local puppet & salt masters

July 18

00:30 bd808: removed local l10nupdate user from deployment-jobrunner01 and deployment-videoscaler01
00:22 bd808: Killed stuck beta-update-databases-eqiad job ( stuck for over 60m waiting for executor; deadlock?)
00:21 ori: beta broke due to I433826423. app servers load prod apache confs from /etc/apache2/wikimedia. temp fix: locally hack apache2.conf to load /usr/local/apache2/conf/all.conf; disable puppet.

July 17

23:18 bd808: Puppet broken for deployment-bastion by labs specific logic in misc::deployment::vars.
19:01 mwalker: possibly breaking labs by cherry picking an apparmor patch that affects mysql https://gerrit.wikimedia.org/r/#/c/147027/

July 16

19:15 mwalker: updated puppet about 20 minutes ago for new ocg variables (now officially in production puppet instead of just cherry picked)

July 15

18:26 bd808: Removed local mwdeploy user from /etc/passwd on deployment-videoscaler01 and deployment-jobrunner01
16:59 bd808: scap failing to deploymnet-videoscaler01 and deploymnet-jobrunner01 due to other random failures now. Lots of strange permissions errors during rsync
16:37 bd808: scap failing to deploymnet-videoscaler01 and deploymnet-jobrunner01 due to ssh auth failures; likely a puppet config problem

July 10

22:37 bd808: Added Gergő Tisza and Yuvipanda as project admins

July 8

23:37 bd808: Updated Kibana to 0afda49 (latest upstream head)
17:03 greg-g: Added John F. Lewis to the project after his NDA was signed by Mark (RT 7722)

July 7

20:55 bd808: Killed stuck `apt-get update` job on deployment-jobrunner01 started on Jun17
20:20 bd808: Fixed puppet on deployment-analytics01 with manual apt-get commands.
20:08 bd808: Ran `apt-get dist-upgrade` on deployment-analytics01 to upgrade hadoop, hive, pig, etc which were failing to update via puppet.

July 4

02:28 RoanKattouw: Unbroke replication on deployment-db2, it's catching up now

July 3

18:59 legoktm: manually created centralauth.renameuser_status table
16:04 bd808: Updated scap to ff04431
09:24 hashar: Reindexed ElasticSearch index for cawiki/eswiki with: mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki {cawiki,eswiki} --batch-size=50
09:22 hashar: Blow up ElasticSearch indices for cawiki and eswiki with: mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php --wiki cawiki --startOver --indexType content && mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php --wiki cawiki --startOver --indexType general
09:10 hashar: used addwiki.php to create the wiki. manually triggered the Jenkins job that update the databases https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/2319/
09:06 hashar: Adding cawiki and eswiki for cxserver testing Ibbcbd4

July 2

07:49 hashar: cxserver being configured! 140723 by Kartik and Niklas \O/

July 1

15:46 bd808: Fixed git rebase conflict in operations/puppet on deployment-salt
13:29 manybubbles: rebuilding Cirrus search index in beta to pick up new configuration and cache warmers
11:20 hashar: Added Filippo Giunchedi to the project as an admin (WMF ops)

June 30

20:47 bd808: The state of puppet for beta is badly broken. I have hacked things to get puppet to apply on deployment-apache0[12] but puppet won't apply on deployment-bastion in part due to the same hacks.
18:48 bd808: Created symlink /apache -> /usr/local/apache on deployment-apache0[12] to fix docroot symlinks
18:09 bd808: Beta apaches are broken with latest puppet config applied. Working to correct.
18:08 bd808: Manually added symlink for /etc/apache/wmf on deployment-apache0[12]

June 26

12:48 YuviPanda: cherry picked https://gerrit.wikimedia.org/r/#/c/142228/ to puppetmaster, sending events to charcoal.wmflabs.org now with projectname \o/
09:46 YuviPanda: cherry-picked https://gerrit.wikimedia.org/r/#/c/142210/ on to puppetmaster
09:38 hashar: Granting sudo to YuviPanda

June 25

20:58 bd808: Fixed rebase conflict in operations/puppet.git on deployment-salt caused by cherry-picked vcl patch left over from varnish submodule usage

June 24

19:29 bd808: Manually updated operations/puppet checkout on deployment-salt to deal with varnish submodule change

June 19

22:47 bd808: Updated scap to 792a572
22:46 bd808: Trebuchet runs on deployment-videoscaler01 are succeeding but not showing up in the `git deploy report` output
22:40 bd808: Deleted /var/log/diamond/diamond.log on deployment-jobrunner01 because /var was full

June 18

16:55 bd808: Setup hourly cron as user bd808 on deployment-salt to test automatic update of puppet repo using ~bd808/git-sync-upstream script

June 17

20:36 bd808: Upgraded elasticsearch to version 1.2.1 on deployment-logstash1

June 16

21:16 bd808: Jenkins beta-scap-eqiad job broken because of missing puppet config on deployment-jobrunner01; needs role::beta::scap_target
20:36 bd808: Enabled puppet on deployment-jobrunner01 and forced a run
20:34 bd808: Puppet disabled on deployment-jobrunner01 since 2014-06-03; No SAL logs explaining why
20:19 bd808: Updated scap to 5adce72; trebuchet reported i-00000237 (deployment-videoscaler01) as not updating, but manual check shows it did sync properly
20:00 bd808: Deleted /var/lib/puppet/state/agent_catalog_run.lock on deployment-bastion after verifying that no puppet processes were running
19:55 bd808: Truncated /var/log/diamond/diamond.log and restarted diamond on deployment-bastion
19:36 bd808: /var/log/diamond is 787M of 1.2G total logs
19:29 bd808: /var 0% free on deployment-bastion; looking for things to clean-up

June 9

15:19 andrewbogott: doing a 'rebase origin' on deployment-salt, because it needs it.
15:10 andrewbogott: updating all instances to puppet 3 via a cherry-pick�� of https://gerrit.wikimedia.org/r/#/c/137898/ on deployment-salt

June 7

02:44 bd808: Restarted logstash on deployment-logstash1; last even logged at 2014-06-06T22:11:04

June 6

19:26 bblack: - synced labs/private on deployment-salt again
16:30 bd808: Rebooted deployment-salt
16:27 bd808: Made /var/log a symlink to /srv/var-log on deployment-salt
16:26 bblack: Updated labs/private.git on puppetmaster. brings in updated zero+netmapper password for beta
16:18 bd808: Changed from role::labs::lvm::biglogs to role::labs::lvm::srv on deployment-salt and made /var/lib a symlink to /srv/var-lib
15:45 bd808: /var on deployment-salt still at 97% full after moving logs; /var/lib is our problem
15:43 bd808: Archived deployment-salt:/var/log to /data/project/deployment-salt
15:40 bd808: Disabled puppet on deployment-salt to work on disk space issues
12:44 hashar: Updated labs/private.git on puppetmaster. Brings Brandon Black change "add labs copy of zerofetcher auth file" 137918
02:48 mwalker: added role::labs::lvm::biglogs to deployment-salt because it is out of room on /var and I don't know what I can delete
01:25 bd808: Live hacked /etc/apache2/wmf/hhvm.conf on apaches to allow them to start
00:30 bd808: `git stash`ed dirty dblist files found in /a/common on deployment-bastion

June 5

14:16 manybubbles: rebuild beta's jawiki's search index without kuromoji - it didn't help much anyway
14:14 manybubbles: recovered from busted elasticsearch - two problems: 1. I had an index that used the kuromoji plugin but I'd uninstalled it and 2. I had plugins for 1.2.1 but was trying to start 1.1.0. Solution: 1. delete the index and recreate it without kuromoji. 2. upgrade to 1.2.1 like I had planned on doing any way.
14:01 manybubbles: elasticsearch cluster got really angry in beta when I restarted some node - its like they aren't talking to eachother properly - trying to recover. once that is done I'll upgrade to 1.2.1 and that might fix it
13:59 hashar: deployment-elastic01 puppet was broken due to bug 63322 i.e. having some HTML garbage as ec2id which would be used as puppet certname
13:47 manybubbles: rolling restart of elasticsearch nodes in beta to pick up new kernel

June 4

20:46 bd808: Fixed file ownership on /data/project/apache/uncommon for beta-recompile-math-texvc-eqiad job
19:27 manybubbles: sorry, can't do that yet,
19:27 manybubbles: plugins deployed to beta - time to restart Elasticsearch in beta - should cause not interruption of service
19:01 manybubbles: deploying Elasticsearch 1.2.1 and some updated plugins to beta
17:11 bd808: Unwedged the jenkins jobs to updating beta by stopping the stuck db update job
16:27 bd808: Changed uid/git for files owned by l10nupdate user
09:50 mwalker: Reset salt caches by running `salt '*' state.clear_cache` from deployment-salt -- deployment-pdf01 now no longer reports errors when returning status for deployment

June 3

22:30 bd808: Deleted unused /data/project/apache/common-local on NFS share.

June 2

19:42 bd808: Updated scap to a7da355
05:14 bd808: Restarted logstash on deployment-logstash1; Last event logged at 2014-06-01T0722:56

May 30

21:45 bd808: Restarted uwsgi on deployment-graphite
18:43 bd808: Updated scap to c4204dd

May 29

21:07 bd808: mwalker cleaned up log spam from upstart on deployment-pdf01
20:59 bd808: /var full on deployment-pdf01
20:55 bd808: Restarted salt minion on deployment-pdf01 with `sudo salt 'i-00000396.eqiad.wmflabs' service.restart salt-minion`

May 28

17:53 bd808: Restarted logstash on deployment-logstash1; last event logged at 2014-05-28T12:11:37
16:56 bd808: Updated scap to fd7e538

May 27

19:08 bd808: Updated scap to 48c7e28
14:56 bd808: Updated scap to 9609e8d

May 23

16:32 bd808: Upgraded elasticsearch to 1.1.0 on deployment-logstash1
13:36 manybubbles: restarting elasticsearch on deployment-elastic01 to pick up some gc setting recommended by elasticsearch team

May 22

23:00 bd808: Added 20after4 as a project admin
22:59 bd808: Added matanya as a project memeber
21:38 bd808|LUNCH: Deployed scap 096cb3f

May 21

17:33 mwalker: converted deployment-pdf01 (i-00000396.eqiad.wmflabs) to use local puppet & salt master
14:50 bd808: restarted logstash on deployment-logstash1; getting really tired of these soft crashes
00:33 bd808: Puppet failing on deployment-videoscaler01 with duplicate definition of Class[Mediawiki::Jobrunner]
00:07 bd808: Fixed puppet for deployment-jobrunner01 using https://gerrit.wikimedia.org/r/#/c/134519/2

May 20

23:49 bd808: Fixed puppet for deployment-apache[12] using https://gerrit.wikimedia.org/r/#/c/134519/2
23:11 bd808: deployment-apache01 needs more work: "Could not set shell on user[mwdeploy]"
23:06 bd808: Fixing puppet config for upstream rename of role::applicationserver -> role::mediawiki
21:14 ori: Converted deployment-stream to use local puppet & salt masters
21:08 RoanKattouw: chown'ed /data/project/parsoid/parsoid.log from mwalker (?!?) to parsoid so Parsoid runs again
15:53 bd808: Deployed scap 7b6fc47 via trebuchet

May 19

14:34 bd808: Restarted logstash service on deployment-logstash1; it stopped logging new events at 10:37:13Z

May 16

21:20 manybubbles: restarting elasticsearch in beta to update some plugins
00:34 bd808: Updated EventLogging to I89819bd

May 15

22:14 bd808: Restarted logstash on deployment-logstash1 yet again; memory leak from invalid encoding bug
00:14 bd808: Disabled puppet on deployment-logstash1 to test a local logstash config change

May 14

23:33 bd808: Added irc input to logstash via I409fec9

May 13

09:28 bd808: Restarted logstash service on deployment-logstash1
09:28 bd808: Logstash events stop at 2014-05-11T18:36:35Z; Log file shows many "Failed parsing date from field" errors which probably triggered the known upstream memory leak bug

May 10

18:02 bd808: Restarted logstash on deployment-logstash1

May 9

12:06 hashar: Creating en_rtlwiki wiki [[bugzilla:50335|bug 50335]]

May 6

17:54 bd808: Restarted logstash on deployment-logstash1
17:53 bd808: Logstash in beta hasn't recorded any events since 2014-05-04T04:32:36.
15:33 manybubbles: rolling restart of Elasticsearch servers in beta to pick up new highlighter plugin to fix bugs found when we fixed hebrew analysis. and to implement phrase highlighting.

May 5

21:29 mwalker: ran puppetstoredconfigclean and revoked puppet and salt keys for i-00000339.eqiad.wmflabs (was pdf01)
21:24 mwalker: removing pdf01 instance -- labs just uses production mwlib which works just fine. I'll recreate this when I make the OCG test instance
20:57 manybubbles: deploying new plugin to Elasticsearch (swift)

May 3

18:10 mwalker: Updated kernel on deployment-pdf01 (manually set console=ttyS0 to match older installed kernels)
17:58 mwalker: Converted i-00000339.eqiad.wmflabs (deployment-pdf01) to use local puppet & salt masters
17:54 mwalker: signed salt key for i-00000339.eqiad.wmflabs (deployment-pdf01)
17:43 bd808: Added mwalker to under_NDA sudoers group

May 2

17:01 bd808: Switched scap to use scripts delivered by trebuchet

May 1

15:46 manybubbles: upgrading Elasticsearch highlighter via a rolling restart
00:56 bd808: Fixed empty PrivateSettings.php configuration file (which I also broke earlier)

April 28

16:12 manybubbles: upgrading highlighter plugin in Elasticsearch
15:43 bd808: Created empty /srv/scap-stage-dir/wmf-config/mwblocker.log file to stop missing file warnings in beta.

April 25

11:31 hashar: commonswiki-75388f96: 0.6183 19.5M SQL ERROR (ignored): Table 'commonswiki.revtag_type' doesn't exist (10.68.16.193)
11:30 hashar: Authentication is broken on the beta cluster. Well at least from commons.wikimedia.beta.wmflabs.org

April 23

19:34 ^demon|lunch: created zhwiki, ukwiki, ruwiki, kowiki, hiwiki, jawiki for testing
10:19 hashar: stopping udp2log and starting udp2log-mw instead (known old bug that prevents logging)

April 22

18:42 bd808: Rebooting deployment-bastion in a wild attempt to get the jenkins slave there working again
18:42 bd808: Rebooting deployment-bastion in a wild attempt to get the jenkins slave there working again

April 18

19:24 manybubbles: rebuilding Cirrus indexes to pick up auxiliary fields and smarter accent matching

April 16

18:56 hashar: Migrating memc04 and memc05 to self master/salt [[bugzilla:64010|bug 64010]]
13:13 manybubbles: done
13:10 manybubbles: rolling restart of Elasticsearch nodes in beta to make super sure it picked up new plugins
09:33 hashar: rebased puppetmaster

April 15

20:02 manybubbles: restarting elasticsearch in beta to pick up a plugin update - no downtime should occur
14:24 hashar: rebased puppetmaster

April 11

17:41 bd808: Tried to enable role::protoproxy::ssl::beta on deployment-cache-text02 but it failed to apply because /etc/ssl/certs/star.wmflabs.org.pem and /etc/ssl/private/star.wmflabs.org.key don't match.
03:59 bd808: sudo apt-get install mysql-client on deployment-bastion
03:54 bd808: Added legoktm as a project member
00:02 bd808: Enabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/

April 10

21:35 bd808: Running scap on deployment-bastion for the first time in eqiad
21:13 bd808: Disabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ to work on scap setup
14:52 hashar: Adding Tobias Gritschacher to the project so he can look at udp2log / apache logs whenever needed :-]

April 9

23:04 bd808: Re-enabled puppet on deployment-apache02 and forced a puppet run
21:39 bd808: Cherry-picked I8f77e0c into puppet and forced puppet run on deployment-bastion

April 8

17:53 manybubbles: rebuilding simplewiki's search index optimized for the new highlighter to check the size difference
05:34 Ryan_Lane: upgraded libssl on all nodes, restarted affected ssl servers
05:03 Ryan_Lane: upgraded libssl on all salt accessible nodes

April 5

11:19 hashar: Attempting to reenable SSL support with 124057

April 4

21:39 bd808: Restarted logstash; it stopped processing events again at 2014-04-04T19:56:46Z
17:31 bd808: Forced puppet run on deployment-cache-text02
17:29 bd808: Manually fixed puppet config on deployment-cache-text02 (the cert html error problem)
17:22 bd808: Rebooting deployment-cache-bits01
17:21 bd808: Forced puppet run on deployment-cache-bits01
16:15 manybubbles: Performing a rolling restart of Elasticsearch nodes to pick up a new plugin

April 3

17:32 bd808: Fixed certname in /etc/puppet/puppet.conf manually on deployment-bastion so puppet would run again.
15:33 bd808: Restarted logstash on deploymnet-logstash1; Stuck in a bad state due to jvm oom logged at 2014-04-03T12:03:43Z

April 2

17:54 manybubbles: done installing plugins on Elasticsearch in beta
14:10 hashar: Fixed database updating job https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ . It was not running on the proper node.
12:50 hashar: restarted parsoid daemon on deployment-parsoid04.eqiad.wmflabs. It also now log to /data/project/parsoid/parsoid.log
12:36 hashar: Manually deleting parsoid user/group on deployment-parsoid04. Will use the LDAP uid/gid instead.

April 1

21:38 hashar: Removed the Zuul triggers that updated beta cluster in PMTPA 123100.
19:49 bd808: Converted deployment-graphite.eqiad.wmflabs to use local puppet & salt masters
19:20 bd808: Deleting and re-creating deployment-graphite because I forgot to add the web security group
15:57 andrewbogott: shutting down all pmtpa instances
14:32 manybubbles: completed upgrade to Elasticsearch 1.1.0 and fixed deployment-elastic04.
13:32 hashar: Thumbs access more or less fixed
13:31 hashar: deployment-upload is rejecting connection on port 80. Applying role::beta::uploadservice from 122786
13:30 manybubbles: upgrading labs Elasticsearch to 1.1.0
13:06 hashar: Applying role::beta::natfix on deployment-upload.eqiad.wmflabs . Might let it access images from commons.wikimedia.beta.wmflabs.org ( ex: http://upload.beta.wmflabs.org/wikipedia/commons/thumb/4/43/Feed-icon.svg/16px-Feed-icon.svg.png yields: Error retrieving thumbnail from scaling server: couldn't connect to host commons.wikimedia.beta.wmflabs.org )
08:31 hashar: MediaWiki config paths tweaks for Math [[bugzilla:63331|bug 63331]] and Captchas [[bugzilla:63342|bug 63342]]
00:32 bd808: Converted deployment-graphite to use local puppet & salt masters

March 31

21:02 hashar: Making Parsoid daemon to write its logs to /data/project/parsoid/parsoid.log 122561
20:47 hashar: Puppet master is fixed. The certificates got badly messed up, had to regenerate them following the documentation "Regenerate Certificates for Puppet Master"
20:17 hashar: restarted parsoid daemon
20:00 hashar: stopped parsoid . It is killing the application servers
19:53 hashar: restarting both apaches
19:21 hashar: restarting job service on jobrunner01 to apply 122436
19:20 hashar: Unbreak puppetmaster on deployment-salt.eqiad.wmflabs
19:01 hashar: puppet master is broken :(
17:39 hashar: lowering # of jobs spawned by the jobrunner 122436
16:00 bd808: Restarted logstash service on deployment-logstash1; no new log events seen since 2014-03-28T10:57
15:58 bd808: Updated kibana on deployment-logstash1 to e317bc6
15:56 hashar_: Cluster slow because some CirrusSearch job is spamming simplewiki . Gotta find a way to throttle the number of jobs being run on jobrunner01 or add more apache boxes . It is transient anyway, might look at limiting the runs tonight
15:10 hashar_: Rebased puppet repository. Only one hack left: https://gerrit.wikimedia.org/r/#/c/119534/
14:20 hashar: deleting deployment-parsoidcache01 cache the hardway: stopping varnish, deleting files in /srv/vdb/ , starting varnish
14:05 hashar: shutdowning database and apache boxes for now.
14:03 hashar: shutdowning varnishes instances in pmtpa
13:56 hashar: Deleted deployment-cache-upload01 , replaced by deployment-cache-upload02
13:52 hashar: upload varnish cache working :-]
13:47 hashar: applying role::cache::upload to role-cache-upload02
13:37 hashar: migrating deployment-cache-upload02.eqiad.Wmflabs to self puppet/salt master
13:22 hashar: Creating deployment-cache-upload02 to replace deployment-cache-upload01 which was missing the security group "web"
11:30 hashar: Update DNS entries to point to EQIAD instances (aka switching beta cluster to eqiad)

March 28

16:18 hashar: rebased puppet on deployment-salt
15:39 hashar: Last log made to wrong project
15:39 hashar: deleting instance ntegration-selenium-driver no more needed. browsertests jobs should now be runnable on integration-slave1001 and integration-slave1002 (in eqiad)
10:54 hashar: deleting instance integration-debian-builder . That is breaking all debian-glue jobs. Will revisit later next week to get pbuilder/cowbuilder set up on the other eqiad slaves
08:48 hashar: deleting integration-slave-pbuilder. Unneeded (i need a coffee)
08:43 hashar: Created integration-slave-pbuilder on eqiad to replace pmtpa instance integration-debian-builder
00:23 bd808: `sudo chmod -R a+rwx /data/project/upload7`; We need to get this file permissions thing figured out

March 27

15:23 hashar: role::beta::natfix cant run on deployment-bastion.eqiad because the ferm rules conflicts with the Augeas rules coming from udp2log :-(
15:21 hashar: applying role::beta::natfix on deployment-bastion.eqiad
14:58 hashar: fixed up role::beta::natfix . Ferm is now being applied again on various application server instances 121378
13:58 hashar: rebased puppetmaster git repository, reapplied ottomata live hacks.
12:55 hashar: mediawiki l10n cache being rebuild!!!
12:54 hashar: Fixed permissions on eqiad bastion for /srv/scap . Others (such as mwdeploy) could not read / execute scap scripts
11:29 hashar: MediaWiki code and configuration are now self updating on EQIAD cluster via Jenkins jobs. First run: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/4/console
11:11 hashar: deleting job beta-code-update , replaced by datacenter variants beta-code-update-pmtpa and beta-code-update-eqiad
10:54 hashar: Deleting job beta-update-databases , replaced by datacenter variants beta-update-databases-pmtpa and beta-update-databases-eqiad

March 26

19:05 bd808: Added ottomata as a project member and admin
15:46 springle: deployment-db1 data loaded
14:45 bd808: created proxy https://logstash-beta.wmflabs.org for logstash instance
14:17 hashar: fixed up redis configuration in eqiad. Jobrunner is happy now: aawiki-504cd7d2: 0.9649 21.5M Creating a new RedisConnectionPool instance with id 627014d. 121060
14:05 hashar: udp2log functional on eqiad beta cluster \O/
13:55 hashar: stopping udp2log on eqiad bastion, starting udp2log-mw (really should fix that issue one day)
13:52 hashar: dropped some live hack on eqiad in /data/project/apache/common-local and ran git pull
13:14 hashar: Dropping enwikivoyage and dewikivoyage databases from sql02. Related changes are updating the Jenkins config: https://gerrit.wikimedia.org/r/#/c/121045/ and cleaning up the mw-config : https://gerrit.wikimedia.org/r/#/c/121047/
07:53 springle: installed mariadb via puppet on deployment-db1. no data yet

March 25

19:43 hashar: created jenkins slave deployment-bastion.eqiad
17:17 hashar: Created and validated job that updates Parsoid on the EQIAD beta cluster \O/

March 24

23:16 marktraceur: Touching all the MMV scripts because they're not getting invalidated or something
23:10 hashar: l10n cache got broken due to a PHP fatal error I introduced. It is back up now. Found out via https://integration.wikimedia.org/dashboard/
23:09 hashar: upgraded all pmtpa varnishes, ran puppet on all of them. all set!
22:57 hashar: restarting deployment-cache-upload04 , apparently stalled
22:48 hashar: upgrading varnish on all pmtpa caches.
22:47 hashar: apt-get upgrade varnish on deployment-cache-bits03
22:45 marktraceur: attempted restart of varnish on betalabs; seems to have failed, trying again
22:42 hashar: made marktraceur a project admin and granted sudo rights
22:39 marktraceur: Restarting betalabs varnish to workaround https://bugzilla.wikimedia.org/show_bug.cgi?id=63034
17:25 bd808: Converted deployment-db1.eqiad.wmflabs to use local puppet & salt masters
17:06 bd808: Changed rules in sql security group to use CIDR 10.0.0.0/8.
17:05 bd808: Changed rules in search security group to use CIDR 10.0.0.0/8.
17:05 bd808: Built deployment-elastic04.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
16:19 bd808: Built deployment-elastic03.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
16:08 bd808: Built deployment-elastic02.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
15:54 bd808: Built deployment-elastic01.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
10:31 hashar: migrated deployment-solr to self puppet/salt masters

March 21

09:29 hashar: l10ncache is now rebuild properly : https://integration.wikimedia.org/ci/job/beta-code-update/53508/console
09:23 hashar: fixing l10ncache on deplkoyment-bastion : chown -R l10nupdate:l10nupdate /data/project/apache/common-local/php-master/cache/l10n The l10nupdate UID/GID has been changed and are now in LDAP

March 20

23:46 bd808: Mounted secondary disk as /var/lib/elasticsearch on deployment-logstash1
23:46 bd808: Converted deployment-tin to use local puppet & salt masters
22:09 hashar: Migrated videoscaler01 to use self salt/puppet masters.
21:30 hashar: manually installing timidity-daemon on jobrunner01.eqiad so puppet can stop it and stop whining
21:00 hashar: migrate jobrunner01.eqiad.wmflabs to self puppet/salt masters
20:55 hashar: deleting deployment-jobrunner02 , lets start with a single instance for nwo
20:51 hashar: Creating deployment-jobrunner01 and 02 in eqiad.
15:47 hashar: fixed salt-minion service on deployment-cache-upload01 and deployment-cache-mobile03 by deleting /etc/salt/pki/minion/minion_master.pub
15:30 hashar: migrated deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs to use the salt/puppetmaster deployment-salt.eqiad.wmflabs.
15:30 hashar: deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs recovered!! /dev/vdb does not exist on eqiad which caused the instance to be stalled.
10:48 hashar: Stopped the simplewiki script. Would need to recreate the db from scratch instead
10:37 hashar: Cleaning up simplewiki by deleting most pages in the main namespace. Would free up some disk space. deleteBatch.php is running in a screen on deployment-bastion.pmtpa.wmflabs
10:08 hashar: applying role::labs::lvm::mnt on deployment-db1 to provide additional disk space on /mnt
09:39 hashar: convert all remaining hosts but db1 to use the local puppet and salt masters
04:40 springle: created deployment-db1 for mariadb master in eqiad

March 19

21:23 bd808: Converted deployment-cache-text02 to use local puppet & salt masters
20:21 hashar: migrating eqiad varnish caches to use xfs
17:58 bd808: Converted deployment-parsoid04 to use local puppet & salt masters
17:51 bd808: Converted deployment-eventlogging02 to use local puppet & salt masters
17:22 bd808: Converted deployment-cache-bits01 to use local puppet & salt masters; puppet:///volatile/GeoIP not found on deployment-salt puppetmaster
17:00 bd808: Converted deployment-apache02 to use local puppet & salt masters
16:49 bd808: Converted deployment-apache01 to use local puppet & salt masters
16:30 hashar: Varnish caches in eqiad are failing puppet because there is no /dev/vdb. Will figure it out tomorrow :-]
16:15 hashar: Applying role::logging::mediawiki::errors on deployment-fluoride.eqiad.wmflabs . It is not receiving anything yet though.
15:50 hashar: fixed upd2log-mw daemon not starting on eqiad bastion ( /var/log/udp2log belonged to wrong UID/GID)
15:49 hashar: deleted local user l10nupdate on deployment-bastion. It is in ldap now.

March 18

03:31 bd808: deployment-bastion now using deployment-salt as puppet master

March 17

15:02 hashar: Starting copying /data/project from ptmpa to eqiad
14:46 hashar: manually purging all commonswiki archived files (on beta of course)

March 14

14:47 hashar: changing uid/gid of mwdeploy which is now provisioned via LDAP (aka deleting local user and group on all instance + file permissions tweaks)

March 11

10:46 hashar: dropping some unused databases from deployment-sql instance.

March 10

11:09 hashar: Deleting http://simple.wikipedia.beta.wmflabs.org/wiki/MediaWiki:Robots.txt
09:54 hashar: Reducing memcached instances to 3GB ( 115617 ). Seems to fix writing to the EQIAD memcaches which only have 3GB
09:08 hashar: Restarted bits cache (CPU / mem overload)

March 6

09:07 hashar: restarted varnish and varnish-frontend on deployment-cache-text1

March 5

17:26 hashar: hacked in mwversioninuse to return "master=aawiki". Relaunched l10n job using mwdeploy user and then running mw-update-l10n
17:07 hashar: mwversioninuse gives a wmf branch instead of master. That breaks l10n messages update and the job https://integration.wikimedia.org/ci/job/beta-code-update/ . Root cause is the python based scap.

March 3

17:28 manybubbles: doing an Elasticsearch reindex on beta before I try another one in production

February 28

10:17 hashar: Puppet running on varnish upload cache after several months. Might break random things in the process :(

February 27

14:11 manybubbles: upgrading beta to Elasticsearch 1.0

February 26

20:44 hashar: Cleaning up commonswiki archived files with mwscript deleteArchivedFiles.php --wiki=commonswiki --delete
20:44 hashar: deleted all files from http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload (gwtoolset import test). Deleted File:Title_0* (Selenium tests).
15:06 hashar: deleted all thumbs from shared directory: /data/project/upload7/*/*/thumb/*
14:54 hashar: cleaning out 2013 archived logs.

February 25

08:42 hashar: Upgrading all varnishes.

February 24

23:36 MaxSem: Rolled back
23:25 hoo: recursively chowned extensions/MobileFrontend to mwdeploy:mwdeploy
23:21 hoo: chowned /data/project/apache/common-local/php-master/extensions/.git/modules/MobileFrontend/* to mwdeploy:mwdeploy
17:47 MaxSem: Investigating a mobile bug, might cause intermittent problems
17:36 MaxSem: Rebooted deployment-cache-mobile01 - was impossible to log into it though Varnish still worked

February 21

19:42 MaxSem: Adjusted read privs on /home/wikipedia/syslog/apache.log to allow fatalmonitor to work

February 19

16:24 hashar: -bastion : /etc/init.d/udp2log stop && /etc/init.d/udp2log-mw start (known bug)
16:23 hashar: rebooting -bastion
16:22 hashar: rebooting apache32 and apache33 breaking beta :-]

February 17

15:26 hashar: rebooting bits cache

February 11

21:55 manybubbles: update elasticsearch schema after recent changes. will run a links update as well

February 6

22:20 Krinkle: Manually ran changePassword.php to help someone (password reminder emails don't get sent)
14:43 hashar: restarting udp2log-mw on deployment-bastion. Logstash.wmflabs.org no more receiving fatals logs since Jan 31st

February 4

17:22 hashar: fixed up beta-parsoid-update job so Parsoid should be up to date again. The issue is that the multigit job pointed to a wrong host (ZUUL_URL should be zuul.eqiad.wmnet)
13:33 hashar: removing role::memcached from both apache servers
09:58 hashar: rebooting all varnish caches
09:57 hashar: Upgrading all varnish

February 3

16:59 hashar: upgrading varnish on deployment-parsoidcache3

January 30

19:35 hashar: deployment-cache-bits03 restarted gmond, leaked memory. Upgrading varnish
19:32 hashar: Canceled varnish package upgrade on deployment-cache-mobile01 , it runs a specific version ( 3.0.5plus~wmftest-wm1 ) instead of 3.0.3plus~rc1-wm29
19:30 hashar: upgrading varnish on deployment-cache-mobile01
19:29 hashar: upgrading varnish on deployment-cache-bits03
19:29 hashar: upgrading varnish on deployment-staging-cache-mobile02
19:28 hashar: upgrading varnish on deployment-cache-upload04
19:27 hashar: reenabling puppet on deployment-cache-mobile01
17:10 manybubbles: done reindexing beta. everything looks good
16:54 manybubbles: reindexing beta like we're going to do in production when the release train departs later today

January 28

17:10 hashar: added addshore and jhall to project so they can grep logs

January 27

15:17 hashar: applying role::beta::fatalmonitor puppet class on deployment-bastion bug 60046

January 23

19:38 hashar: VisualEditor was not being updated properly because some files belonged to root instead of mwdeploy. Ran chown -R mwdeploy:mwdeploy /data/project/apache/common-local/php-master/extensions/VisualEditor

January 16

20:54 manybubbles: turning elasticsearch's disk space aware allocator

January 15

21:14 manybubbles: finished updating to elasticsearch 0.90.10
08:48 andrewbogott: rebooted deployment-cache-text1

January 2

15:32 hashar: Migrated parsoid on deployment-parsoid2 to use mediawiki/services/parsoid out of a checkouts made in /srv/deployment/parsoid/{parsoid,deploy}. No job self updating it yet
15:00 manybubbles: finished upgrading Elasticsearch in beta. We're on 0.90.9 now.
14:07 hashar: running mw-update-l10n , it was broken because of https://gerrit.wikimedia.org/r/#/c/104741/ fixed up by https://gerrit.wikimedia.org/r/#/c/104953/
13:54 manybubbles: upgrading Elasticsearch servers in beta

December 26

18:54 manybubbles: performing in place index rebuild for wikis in beta after recent cirrus update

December 23

20:40 anomie: Restarting mw-job-runner service on deployment-jobrunner08, since jobs don't seem to be being run
20:03 anomie: Restarting apache on deployment-apache33 to see if that clears the odd errors going on

December 18

10:56 hashar: reenabling puppet on parsoid2 and deploying the new Parsoid upstart configuration 99656