Server Admin Log/Archive 40

From Wikitech

2020-04-30

  • 23:21 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: 9065650: Add project taglines (T249047) (duration: 01m 04s)
  • 23:19 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 9065650: Add project taglines (T249047) (duration: 01m 04s)
  • 23:17 urbanecm@deploy1001: Synchronized static/images/mobile/copyright/: SWAT: 9065650: Add project taglines (T249047) (duration: 01m 04s)
  • 23:14 urbanecm@deploy1001: Synchronized static/images/project-logos/: SWAT: 9065650: Add project taglines (T249047) (duration: 01m 05s)
  • 23:08 urbanecm@deploy1001: Synchronized static/images/project-logos/: SWAT: ae1424a: Logo wordmarks should not define fill color - opacity will be used (T251135) (duration: 01m 05s)
  • 23:05 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: cf5f7ff: Assign oathauth-verify-user to stewards (T251447) (duration: 01m 05s)
  • 20:13 shdubsh: test mtail rc35 upgrade on logstash1007 - T251466
  • 20:10 rzl: mcrouter certs re-renewed on puppetmaster1001, puppet enabled on mcrouter hosts
  • 20:05 jeh: cloudvirt1024 upgrade iDRAC firmware from 2.4.8 to 2.5.4 T241884
  • 20:04 rzl: Disabling puppet on all mcrouter hosts for cert renewal. This isn't strictly needed, as the certs from last time are still fine -- just testing the renewal script.
  • 19:42 jeh: reboot cloudvirt1024 for NIC firmware updates T241884
  • 19:25 shdubsh: test mtail rc35 upgrade on fermium - T251466
  • 17:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1091', diff saved to https://phabricator.wikimedia.org/P11104 and previous config saved to /var/cache/conftool/dbconfig/20200430-174057-marostegui.json
  • 17:37 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert "group1 and group2 wikis to 1.35.0-wmf.28"
  • 17:24 reedy@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: labs only (duration: 00m 58s)
  • 15:28 volans@deploy1001: Finished deploy [homer/deploy@56506db]: Release v0.2.1 (duration: 00m 21s)
  • 15:27 volans@deploy1001: Started deploy [homer/deploy@56506db]: Release v0.2.1
  • 15:11 marostegui: Create lag on es1021
  • 14:53 krinkle@deploy1001: Synchronized wmf-config/db-eqiad.php: I46d2b8 (duration: 00m 57s)
  • 14:38 krinkle@deploy1001: Synchronized wmf-config/db-codfw.php: I46d2b8 (duration: 00m 57s)
  • 14:24 vgutierrez: upgrade trafficserver to 8.0.7-1wm2 on cp[5006,5011]
  • 14:13 marostegui: Stop slave on es2020 for testing
  • 14:11 vgutierrez: rolling restart of ats-tls on text@esams - T249335
  • 14:01 vgutierrez: upgrade trafficserver to version 8.0.7-1wm2 on cp4025 and cp4031
  • 13:14 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:11 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:04 liw@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.30
  • 12:44 arturo: re-enable puppet in apt1001
  • 12:35 jbond42: rolling restart of php7.2-fpm on mw1* servers
  • 12:09 jbond42: rolling restart of thumbor service
  • 12:02 arturo: disable puppet in apt1001 to briefly test a reprepro pull filter before merging a proper patch
  • 11:59 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:56 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:56 jbond42: updating tiff on stretch
  • 11:54 arturo: running `aborrero@apt1001:~ $ sudo -i reprepro --delete clearvanished` to clean unused openstack components and packages (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/593223)
  • 11:30 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 83e1475: Uncoupling graphoid on testwiki (T242855) (duration: 01m 06s)
  • 11:07 marostegui: Deploy schema change on db1091
  • 11:07 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1091', diff saved to https://phabricator.wikimedia.org/P11099 and previous config saved to /var/cache/conftool/dbconfig/20200430-110721-marostegui.json
  • 11:05 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1084', diff saved to https://phabricator.wikimedia.org/P11098 and previous config saved to /var/cache/conftool/dbconfig/20200430-110539-marostegui.json
  • 11:05 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 6572e25: Enable transwiki import from wikidata, frwikisource and hiwikibooks in hiwikisource (T251485) (duration: 01m 12s)
  • 10:51 mutante: bromine,vega,miscweb[12]002: rm -rf /srv/org/wikimedia/TransparencyReport-private
  • 10:09 jayme: imported helm 2.12.2-4 to main for buster-wikimedia
  • 09:53 jayme: imported helm3 3.2.0-1+deb10u1 to main for buster-wikimedia
  • 09:47 kormat: reimaging db1077 for testing purposes T251392
  • 08:36 XioNoX: change blackhole term scope on all routers - T226742
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2089', diff saved to https://phabricator.wikimedia.org/P11097 and previous config saved to /var/cache/conftool/dbconfig/20200430-075211-marostegui.json
  • 07:31 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:29 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:50 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1111', diff saved to https://phabricator.wikimedia.org/P11096 and previous config saved to /var/cache/conftool/dbconfig/20200430-065044-marostegui.json
  • 06:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P11095 and previous config saved to /var/cache/conftool/dbconfig/20200430-065008-marostegui.json
  • 06:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2089', diff saved to https://phabricator.wikimedia.org/P11094 and previous config saved to /var/cache/conftool/dbconfig/20200430-064450-marostegui.json
  • 05:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1092', diff saved to https://phabricator.wikimedia.org/P11093 and previous config saved to /var/cache/conftool/dbconfig/20200430-051818-marostegui.json
  • 05:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1092', diff saved to https://phabricator.wikimedia.org/P11092 and previous config saved to /var/cache/conftool/dbconfig/20200430-051637-marostegui.json
  • 05:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P11091 and previous config saved to /var/cache/conftool/dbconfig/20200430-051506-marostegui.json
  • 05:13 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3318', diff saved to https://phabricator.wikimedia.org/P11090 and previous config saved to /var/cache/conftool/dbconfig/20200430-051329-marostegui.json
  • 05:02 marostegui: Restart x1 master finished - T250701
  • 05:00 marostegui: Restart x1 master (db1120) - T250701
  • 04:52 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1084', diff saved to https://phabricator.wikimedia.org/P11089 and previous config saved to /var/cache/conftool/dbconfig/20200430-045159-marostegui.json
  • 04:38 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1097:3314', diff saved to https://phabricator.wikimedia.org/P11088 and previous config saved to /var/cache/conftool/dbconfig/20200430-043803-marostegui.json
  • 00:26 twentyafterfour: phabricator update finished
  • 00:15 twentyafterfour: deploying phabricator update: https://phabricator.wikimedia.org/project/view/4620/
  • 00:11 eileen: process-control config revision is 1f31dd21c5

2020-04-29

  • 23:56 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T91649 Drop Sentry, Part II: Stop configuring it for production or Beta Cluster (duration: 01m 05s)
  • 23:51 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T91649 Drop Sentry, Part I: Stop loading it anywhere (duration: 01m 05s)
  • 23:37 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set Growth Screener survey sample rate to 0.1% and limit to anons only (T248421) (duration: 01m 05s)
  • 23:26 RoanKattouw: Ran updateArticleCount.php on trwikisource
  • 23:22 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set $wgArticleCount to any on trwikisource (duration: 01m 06s)
  • 21:59 bstorm_: upgrading RAID firmware on labsdb1011 T249188
  • 21:34 volker-e@deploy1001: Finished deploy [design/style-guide@c4956c3]: Deploy design/style-guide: (duration: 00m 08s)
  • 21:34 volker-e@deploy1001: Started deploy [design/style-guide@c4956c3]: Deploy design/style-guide:
  • 21:22 jforrester@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/GlobalBlocking/includes/api/ApiQueryGlobalBlocks.php: T251430 Unconditionally select gb_timestamp (duration: 01m 06s)
  • 21:19 jforrester@deploy1001: Synchronized php-1.35.0-wmf.30/extensions/GlobalBlocking/includes/api/ApiQueryGlobalBlocks.php: T251430 Unconditionally select gb_timestamp (duration: 01m 05s)
  • 21:17 jforrester@deploy1001: Synchronized php-1.35.0-wmf.30/extensions/Quiz/includes/Quiz.php: Don't crash if quiz attempts to include a bad title T251409 (duration: 01m 06s)
  • 20:11 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 20:09 andrew@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:46 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: (doc-only) Fix Phabricator task reference for jvwiki logo (duration: 01m 05s)
  • 19:34 addshore: repool wdqs2008
  • 19:02 addshore: depooling and stopping the updater on wdqs2008 for some query tests (wdqs-internal)
  • 17:56 joal@deploy1001: Finished deploy [analytics/refinery@6460d05] (thin): Regular analytics weekly train THIN [6460d05] (duration: 00m 08s)
  • 17:56 joal@deploy1001: Started deploy [analytics/refinery@6460d05] (thin): Regular analytics weekly train THIN [6460d05]
  • 17:54 elukey@cumin1001: END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0)
  • 17:44 elukey@cumin1001: START - Cookbook sre.presto.roll-restart-workers
  • 17:24 joal@deploy1001: Finished deploy [analytics/refinery@6460d05]: Regular analytics weekly train [6460d05] (duration: 77m 08s)
  • 16:55 jforrester@deploy1001: Synchronized php-1.35.0-wmf.30/extensions/MachineVision/src/Hooks.php: Fix hook handling for hook T251408 (duration: 01m 05s)
  • 16:54 jforrester@deploy1001: sync-file aborted: Fix hook handling for hook T251408 (duration: 00m 02s)
  • 16:53 jforrester@deploy1001: Synchronized php-1.35.0-wmf.30/includes/EditPage.php: EditPage::showHeader - only warn editing an old revision if it exists T251404 (duration: 01m 06s)
  • 16:45 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:07 joal@deploy1001: Started deploy [analytics/refinery@6460d05]: Regular analytics weekly train [6460d05]
  • 16:05 joal@deploy1001: Finished deploy [analytics/aqs/deploy@c87c8e2]: Analytics regular weekly deploy (duration: 06m 59s)
  • 15:58 joal@deploy1001: Started deploy [analytics/aqs/deploy@c87c8e2]: Analytics regular weekly deploy
  • 15:12 kormat@cumin1001: dbctl commit (dc=all): 'Repooling db2087 in s6 and s7 after reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11085 and previous config saved to /var/cache/conftool/dbconfig/20200429-151219-kormat.json
  • 14:56 sukhe: upload cescout 0.1.2-1 to apt.wm.o (buster)
  • 14:49 mdholloway: re-ran extension/MachineVision/maintenance/withholdImages.php on commonswiki
  • 14:39 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: MachineVision: Update image withholding term list (duration: 01m 06s)
  • 13:05 liw@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.30 (duration: 01m 04s)
  • 13:04 liw@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.30
  • 12:50 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:47 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:26 kormat@cumin1001: dbctl commit (dc=all): 'Depool db2087 for reimaging T250666', diff saved to https://phabricator.wikimedia.org/P11081 and previous config saved to /var/cache/conftool/dbconfig/20200429-122602-kormat.json
  • 11:58 mutante: running puppet on cp-ats - switching backends of wikiworkshop.org
  • 11:05 hoo@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add new properties to wmgWBRepoPreferredPageImagesProperties (T249811) (duration: 01m 18s)
  • 10:00 kormat: reimaging db2087 to buster T250666
  • 09:56 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1097:3314', diff saved to https://phabricator.wikimedia.org/P11080 and previous config saved to /var/cache/conftool/dbconfig/20200429-095629-marostegui.json
  • 09:55 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11079 and previous config saved to /var/cache/conftool/dbconfig/20200429-095545-marostegui.json
  • 09:55 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1103:3314', diff saved to https://phabricator.wikimedia.org/P11078 and previous config saved to /var/cache/conftool/dbconfig/20200429-095527-marostegui.json
  • 09:38 jbond42: puppet enabled fleetwide
  • 09:30 jbond42: disable puppet fleet wide for puppetdb upgrade
  • 09:30 jbond42: disable puppet for puppetdb upgrade
  • 09:10 vgutierrez: starting rolling restart of ats-tls to enable the TLS session ID based cache - T170567
  • 08:52 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
  • 08:49 mutante: gerrit1002 - gzipping gerrit.log.2020-04* files in /var/log/gerrit (T243808)
  • 08:45 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper
  • 08:32 vgutierrez: upgrade to ATS 8.1 on cp4032 - T249335
  • 08:31 vgutierrez: restart ats-tls on cp[3054,3064]
  • 08:08 moritzm: installing openldap security updates
  • 08:02 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11075 and previous config saved to /var/cache/conftool/dbconfig/20200429-080206-marostegui.json
  • 07:55 marostegui: Upgrade mysql on x1 master (without restarting) in preparation for tomorrow's upgrade - T250701
  • 07:54 _joe_: restarting php-fpm on mw1288 (workers die in SIGILL status)
  • 07:31 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11074 and previous config saved to /var/cache/conftool/dbconfig/20200429-073144-marostegui.json
  • 07:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11073 and previous config saved to /var/cache/conftool/dbconfig/20200429-071431-marostegui.json
  • 06:50 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:47 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:32 marostegui: stop mysql on db1105 for reimage
  • 06:22 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1105:3311 and 3312 for reimage', diff saved to https://phabricator.wikimedia.org/P11072 and previous config saved to /var/cache/conftool/dbconfig/20200429-062254-marostegui.json
  • 06:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1103:3314', diff saved to https://phabricator.wikimedia.org/P11071 and previous config saved to /var/cache/conftool/dbconfig/20200429-061941-marostegui.json
  • 06:17 vgutierrez: ats-tls restart on cp[3050,3058] - T249335
  • 06:07 vgutierrez: ats-tls restart on cp3064 - T249335
  • 05:47 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1114', diff saved to https://phabricator.wikimedia.org/P11070 and previous config saved to /var/cache/conftool/dbconfig/20200429-054733-marostegui.json
  • 00:56 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable Growth Study QuickSurvey on enwiki (with sample size 0, for testing) (T248421) (duration: 01m 10s)
  • 00:43 catrope@deploy1001: Finished scap: Update WikimediaMessages with new i18n messages for T248421 (duration: 55m 23s)

2020-04-28

  • 23:48 catrope@deploy1001: Started scap: Update WikimediaMessages with new i18n messages for T248421
  • 23:40 ejegg: updated payments-wiki from 8c896a8247 to afb84cc391
  • 21:55 ejegg: updated Payments IPN listener (Standalone SmashPig) from d80e4c5abd to 8c30ed7fe5
  • 20:57 cdanis@cumin1001: dbctl commit (dc=all): 's8 weights: -db1111, +db1099,db1101', diff saved to https://phabricator.wikimedia.org/P11069 and previous config saved to /var/cache/conftool/dbconfig/20200428-205739-cdanis.json
  • 18:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0639d9f: Allow bdwikimedia bureaucrats to revoke sysop flag (T251078) (duration: 01m 05s)
  • 18:03 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 07c28d1: GrowthExperiments: cswiki: Change manual of style to 5 pillars (T251290) (duration: 01m 05s)
  • 17:51 reedy@deploy1001: Synchronized php-1.35.0-wmf.30/extensions/OAuth: T251306 (duration: 01m 06s)
  • 17:13 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@678fb8e]: Update mobileapps to ff88022a (duration: 03m 23s)
  • 17:10 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@678fb8e]: Update mobileapps to ff88022a
  • 16:37 jforrester@deploy1001: Synchronized php-1.35.0-wmf.30/includes/Revision/RevisionStore.php: Follow-up If770120: Fix bad combination of type cast and ?? operator (duration: 01m 06s)
  • 16:36 volker-e@deploy1001: Finished deploy [design/style-guide@335122b]: Deploy design/style-guide: (duration: 00m 08s)
  • 16:36 volker-e@deploy1001: Started deploy [design/style-guide@335122b]: Deploy design/style-guide:
  • 15:49 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:42 mepps: updated payments-wiki from 45bf1734e0 to 8c896a8247
  • 15:34 vgutierrez: rolling restart of ats-tls on cp[3050,3052,3054,3056] - T249335
  • 15:28 ppchelko@deploy1001: Finished deploy [changeprop/deploy@2b87a75]: Switch off rules moved to k8s T248677 (duration: 01m 20s)
  • 15:27 ppchelko@deploy1001: Started deploy [changeprop/deploy@2b87a75]: Switch off rules moved to k8s T248677
  • 15:15 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:14 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:13 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 15:13 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 15:13 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:59 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:58 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 14:58 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:56 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:55 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:54 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 14:54 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 14:50 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:49 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:49 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 14:45 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:45 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:39 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 14:39 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 14:37 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:37 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
  • 14:35 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:35 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:33 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:28 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 14:25 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 14:25 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 14:25 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 14:23 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:23 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 14:23 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 14:21 moritzm: restarting KDC on krb1001 to pick up openssl update
  • 14:20 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 14:20 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 13:59 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 13:59 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 13:54 moritzm: installing idp-test2001.wikimedia.org
  • 13:53 vgutierrez: update ATS 8.1 on cp4026 - T249335
  • 13:48 hknust: holger@mwmaint1002 end (frwiki=success)
  • 13:44 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
  • 13:43 otto@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
  • 13:43 ottomata: enabling Kafka TLS for eventgate-main
  • 13:33 mutante: running puppet on cp-ats - switching backends of design.wikimedia.org and sitemaps.wikimedia.org
  • 13:30 hknust: Restarting uppercaseTitlesForUnicodeTransition.php as part of T219279 for frwiki
  • 13:26 jmm@cumin2001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 13:08 jmm@cumin2001: START - Cookbook sre.ganeti.makevm
  • 13:07 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 13:03 liw@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.30
  • 12:39 marostegui: Deploy schema change on dbstore1004:3314
  • 12:39 marostegui: Deploy schema change on db1102:3314
  • 12:35 marostegui: Temporarily change query killer from 300 seconds to 3600 on labsdb1010 T249188
  • 11:56 Lucas_WMDE: EU SWAT done
  • 11:55 Lucas_WMDE: lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=thwikibooks --fix | tee T251118-fix
  • 11:54 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Create a bunch of namespace aliases for thwikibooks (T251118) (duration: 01m 05s)
  • 11:52 liw@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.30 (duration: 48m 53s)
  • 11:45 marostegui: Deploy schema change on s8 eqiad master with replication T250071
  • 11:34 jmm@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 11:33 jmm@cumin1001: START - Cookbook sre.ganeti.makevm
  • 11:20 moritzm: updated ssacli/ssaducli for buster-wikimedia's thirdparty/hwraid component to 4.15-6.0
  • 11:04 liw@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.30
  • 10:48 liw@deploy1001: Pruned MediaWiki: 1.35.0-wmf.27 (duration: 12m 37s)
  • 10:48 _joe_: running heavy_page test on mw1407,9
  • 10:46 kormat@cumin1001: dbctl commit (dc=all): 'Repooling after reimaging to buster T250666', diff saved to https://phabricator.wikimedia.org/P11064 and previous config saved to /var/cache/conftool/dbconfig/20200428-104650-kormat.json
  • 10:43 hnowlan@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase-ssl,name=restbase2014.codfw.wmnet
  • 10:43 hnowlan@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase-backend,name=restbase2014.codfw.wmnet
  • 10:41 hnowlan@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase,name=restbase2014.codfw.wmnet
  • 10:40 XioNoX: remove unused policy-statements from routers
  • 10:39 ema: cp-text: upgrade purged to 0.9 and restart
  • 10:38 _joe_: running load.php test on mw1407,9
  • 10:34 _joe_: running main_page test on mw1407,9
  • 10:28 liw@deploy1001: Pruned MediaWiki: 1.35.0-wmf.30 (duration: 01m 27s)
  • 10:28 addshore: repool wdqs1007 (lag caught up)
  • 10:10 _joe_: starting benchmarks for light page on mw140{7,9}
  • 10:08 ema: upload purged 0.9 to buster-wikimedia
  • 10:05 liw: 1.35.0-wmf.30 was branched at ffc8e88 for T249962
  • 09:57 kormat@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:55 kormat@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:52 liw: starting branch cut for train
  • 09:35 addshore: depool wdqs1007 to catch up on lag a bit
  • 09:32 mutante: running puppet on cp-ats for backend config change
  • 09:23 elukey@cumin1001: END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0)
  • 09:20 kormat@cumin1001: dbctl commit (dc=all): 'Depool db2124 T250666', diff saved to https://phabricator.wikimedia.org/P11063 and previous config saved to /var/cache/conftool/dbconfig/20200428-092052-kormat.json
  • 09:12 elukey@cumin1001: START - Cookbook sre.presto.roll-restart-workers
  • 09:12 elukey@cumin1001: END (FAIL) - Cookbook sre.presto.roll-restart-workers (exit_code=99)
  • 09:12 elukey@cumin1001: START - Cookbook sre.presto.roll-restart-workers
  • 08:55 XioNoX: re-set lost licenses on asw2-a/b-eqiad
  • 08:40 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11060 and previous config saved to /var/cache/conftool/dbconfig/20200428-084041-marostegui.json
  • 08:36 dcausse: deleting wikidatawiki_content_1587076410 from cloudelastic
  • 08:30 _joe_: restarting php-fpm on mw1407 and mw1409 again, then running traffic on them for 1 hour.
  • 08:24 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11059 and previous config saved to /var/cache/conftool/dbconfig/20200428-082420-marostegui.json
  • 08:21 dcausse: restarting blazegraph on wdqs1007 (T242453)
  • 08:20 jynus@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:17 jynus@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:13 kormat: reimaging db2124 to buster T250666
  • 08:13 mutante: rsyncing transparency-report-private files from bromine to miscweb1002/2002. git-cloning was removed about a year ago but site still exists. need to figure out if it should be deleted (T188362 T247650)
  • 08:09 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 and 3312 after reimage', diff saved to https://phabricator.wikimedia.org/P11058 and previous config saved to /var/cache/conftool/dbconfig/20200428-080920-marostegui.json
  • 08:06 moritzm: installing qemu security updates
  • 07:52 _joe_: running benchmarks on mw1407 (LCStoreStaticArray) and mw1409 (LCStoreCDB) for T99740: restart php-fpm, pool for 5 minutes to warmup caches, then depool both servers.
  • 07:49 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:44 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:26 marostegui: Reimage db1105
  • 07:24 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1105:3311 and 3312 for reimage', diff saved to https://phabricator.wikimedia.org/P11057 and previous config saved to /var/cache/conftool/dbconfig/20200428-072416-marostegui.json
  • 06:35 marostegui: Deploy schema change on s3 master with replication for the wikis at T250071#6051598 - T250071
  • 06:06 marostegui: Deploy schema change on s4 codfw, this will generate lag on codfw - T250055
  • 05:57 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1112', diff saved to https://phabricator.wikimedia.org/P11056 and previous config saved to /var/cache/conftool/dbconfig/20200428-055719-marostegui.json
  • 05:52 marostegui: Reclone labsdb1011 from labsdb1012 - T249188
  • 05:42 marostegui: Restart labsdb1011 with innodb_purge_threads set to 10 - T249188
  • 05:35 marostegui: Deploy schema change on db1112
  • 05:34 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P11054 and previous config saved to /var/cache/conftool/dbconfig/20200428-053453-marostegui.json
  • 04:59 vgutierrez: depool and powercycle cp5012
  • 04:37 kart_: Updated cxserver to 2020-04-27-061703-production (T249852)
  • 04:34 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 04:22 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 04:18 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .

2020-04-27

  • 23:25 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Update logos for tiwiki and tiwiktionary (T150618, T249451) (duration: 00m 57s)
  • 23:20 catrope@deploy1001: Synchronized static/images/project-logos/: Update logos for tiwiki and tiwiktionary (T150618, T249451) (duration: 00m 58s)
  • 23:18 catrope@deploy1001: Synchronized dblists/visualeditor-nondefault.dblist: Enable VisualEditor by default on srwiki (T250878) (duration: 00m 57s)
  • 23:16 catrope@deploy1001: Synchronized wmf-config/config/srwiki.yaml: Enable VisualEditor by default on srwiki (T250878) (duration: 00m 58s)
  • 20:58 bearND: mobileapps deploy on canary failed due to timeouts, rolled back.
  • 20:56 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@99c350c]: Update mobileapps to 09cb7c2e (duration: 00m 52s)
  • 20:55 hknust: holger@mwmaint1002 END (enwiki=success, frwiki=fail) uppercaseTitlesForUnicodeTransition.php as part of T219279
  • 20:55 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@99c350c]: Update mobileapps to 09cb7c2e
  • 20:43 bearND: mobileapps deploy failed due to timeouts, rolled back.
  • 20:42 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@99c350c]: Update mobileapps to 09cb7c2e (duration: 06m 24s)
  • 20:35 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@99c350c]: Update mobileapps to 09cb7c2e
  • 20:28 hknust: holger@mwmaint1002 Restarting uppercaseTitlesForUnicodeTransition.php as part of T219279 for 2 wikis
  • 19:26 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 19:25 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 19:23 ppchelko@deploy1001: Finished deploy [changeprop/deploy@ecca66b]: Switch off rules moved to k8s T248677 (duration: 01m 22s)
  • 19:22 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 19:22 ppchelko@deploy1001: Started deploy [changeprop/deploy@ecca66b]: Switch off rules moved to k8s T248677
  • 19:21 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:20 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:20 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:06 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 19:05 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 19:05 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:04 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:50 James_F: Manually ran `scap pull` on mw1279.eqiad.wmnet as it flaked during deploy.
  • 18:48 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Ready wmgVisualEditorAllowExternalLinkPaste to set wgVisualEditorAllowExternalLinkPaste (duration: 01m 29s)
  • 18:48 James_F: Sync failure to mw1279.eqiad.wmnet (timeout)
  • 18:46 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: IS: Set wmgVisualEditorAllowExternalLinkPaste false everywhere except officewiki (duration: 01m 17s)
  • 18:25 Urbanecm: Run namespaceDupes.php for thwikisource (T251134)
  • 18:24 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 56a447e: Create several namespace aliases for thwikisource (T251134) (duration: 00m 58s)
  • 18:21 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/Kartographer/modules/: SWAT: 6cd2847: Do not use remove() on maplinks (T250620; T251053) (duration: 00m 58s)
  • 18:10 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 8b71f38: Remove use of `wgAllowImageMoving` (T245293) (duration: 00m 57s)
  • 18:04 otto@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: wgEventStreams: in beta, merge settings from production - T242122 (duration: 00m 56s)
  • 18:02 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: wgEventStreams: configure SearchSatisfaction - T249261 (duration: 00m 58s)
  • 17:34 ppchelko@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:31 ppchelko@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:46 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:07 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:02 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:54 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:45 gehel: restart wdqs-updater on all servers
  • 15:39 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:26 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1075', diff saved to https://phabricator.wikimedia.org/P11048 and previous config saved to /var/cache/conftool/dbconfig/20200427-152242-marostegui.json
  • 14:58 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1075 for schema change', diff saved to https://phabricator.wikimedia.org/P11047 and previous config saved to /var/cache/conftool/dbconfig/20200427-145851-marostegui.json
  • 14:50 jynus: setting default etherpadlite db on m1 to utf8mb4_bin
  • 14:50 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1078', diff saved to https://phabricator.wikimedia.org/P11046 and previous config saved to /var/cache/conftool/dbconfig/20200427-145010-marostegui.json
  • 14:46 vgutierrez: pool cp4026 running ATS 8.1.0 - T249335
  • 14:33 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:33 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:30 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:30 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:27 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:27 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:20 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1078 for schema change', diff saved to https://phabricator.wikimedia.org/P11045 and previous config saved to /var/cache/conftool/dbconfig/20200427-142006-marostegui.json
  • 13:56 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:55 vgutierrez: depool cp4026 and upgrade to ATS 8.1.0 - T249335
  • 13:53 vgutierrez: restart ats-tls on cp3056 - T249335
  • 13:52 mutante: decom'ing install1002 and install2002 - see install1003/2003 and apt1001/2001
  • 13:52 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 13:51 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:47 marostegui: Deploy schema change on s3 codfw, lag will show up - T250055
  • 13:46 marostegui: Drop img_deleted column from wikitech - T250055
  • 13:45 marostegui: Drop img_deleted column from s7 eqiad - T250055
  • 13:42 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 13:42 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 13:42 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 13:41 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:41 dzahn@cumin1001: END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
  • 13:41 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:38 _joe_: repooling both mw1407 and mw1409 for testing T99740
  • 13:30 _joe_: depooled mw1409 as well as mw1407 for further benchmarking, T99740
  • 13:28 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 13:28 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 13:10 elukey: roll restart elastic on cloudelastic-chi again to pick up new JVM settings - T231517
  • 13:09 marostegui: Deploy schema change on s7 codfw, lag will show up - T250055
  • 12:53 marostegui: Drop T248086_wb_terms from db1104 - T248086
  • 12:50 marostegui: Removed img_deleted from s1 (enwiki) T250055
  • 12:49 akosiaris: rolling back etherpad to 1.8.0
  • 12:45 akosiaris: upgrade etherpad to 1.8.3
  • 12:41 marostegui: Remove empty table T248086_wb_terms from wikidatawiki on s3 eqiad - T248086
  • 12:36 marostegui: Remove empty table T248086_wb_terms from wikidatawiki on s3 codfw master - T248086
  • 12:32 marostegui: Remove empty table T248086_wb_terms from wikidatawiki on s8 codfw master - T248086
  • 12:15 marostegui: Remove empty table T248086_wb_terms from commonswiki and testcommonswiki on s4 master - T248086
  • 12:06 ema: cp: upgrade purged to 0.8 T249583
  • 11:53 Lucas_WMDE: EU SWAT done
  • 11:49 hoo: Started Wikibase rebuildItemsPerSite on mwmaint1002 for wikidatawiki. Can be killed at any time, if necessary. (T249613)
  • 11:46 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable cross-project search on frwiktionary (T250724) (duration: 00m 57s)
  • 11:41 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Add transwiki import sources in zhwiki (T250972) (duration: 00m 57s)
  • 11:25 addshore: repool wdqs1007 T242453
  • 11:21 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Add two domains in wgCopyUploadsDomains (T250903, T250904) (duration: 00m 57s)
  • 11:21 _joe_: restarted php-fpm on mw1407 to pick up enlarged opcache values, T99740
  • 11:14 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 57s)
  • 11:13 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 58s)
  • 11:09 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [SDC] Enable constraints on production commons (duration: 00m 57s)
  • 11:08 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [SDC] Enable constraints on production commons (duration: 00m 58s)
  • 10:52 hoo: Running the pruneItemsPerSite on mwmaint1002 maintenance script for Wikidata (T249613)
  • 10:52 hoo@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/Wikibase: pruneItemsPerSite: Fix join_condition call signature (T249613) (duration: 01m 02s)
  • 10:49 hoo@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/Wikibase: pruneItemsPerSite: Fix join_condition call signature (T249613) (duration: 01m 01s)
  • 10:32 mutante: contint2001 - systemd status was degraded. icinga alerted. failed unit was jenkins. starting it failed with "address already in use". manually started without using systemctl? killed jenkins and started again with systemctl. T224591
  • 10:29 mutante: contint2001 - jenkins failed and can't start because address is already in use
  • 10:23 addshore: depool and restart wdqs1007 (deadlocks) T242453
  • 09:54 hoo@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/Wikibase: Add pruneItemsPerSite maintenance script (T249613) (duration: 01m 06s)
  • 09:34 jynus@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:34 jynus@cumin2001: START - Cookbook sre.hosts.decommission
  • 09:34 jynus@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:33 jynus@cumin2001: START - Cookbook sre.hosts.decommission
  • 09:33 jynus@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:32 jynus@cumin2001: START - Cookbook sre.hosts.decommission
  • 09:32 jynus@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:31 jynus@cumin2001: START - Cookbook sre.hosts.decommission
  • 09:25 marostegui: Stop MySQL on labsdb1012 to reclone labsdb1011 - T249188
  • 09:11 marostegui: Deploy schema change on s1 codfw, lag will show up - T250055
  • 08:52 moritzm: restarting cas on idp1001 to pick up Java 11 security update (will void active SSO sessions)
  • 08:26 marostegui: Deploy schema change on s5 codfw, lag will show up - T250055
  • 08:24 kormat: Truncating and optimizing parsercache for pc1010 and pc2010 T247787
  • 08:18 mutante: running puppet on all cp-ats
  • 08:15 godog: add 80G to prometheus global LV
  • 07:25 elukey: roll restart elastic-chi on cloudelastic100[1-4] to pick up the last JVM GC settings - T231517
  • 07:15 marostegui: Kill updateSpecialPages.php wikidatawiki --override --only=Fewestrevisions as it is causing lag - T238199
  • 07:14 elukey: powercycle an-worker1089 - unreachable via ssh, mgmt serial available, soft cpu lock events registered in dmesg
  • 06:59 elukey: force ifdown/ifup eno1 on analytics1052 - interface negotiated speed flapping
  • 06:42 moritzm: installing Java security updates on IDP hosts, will void current SSO sessions
  • 06:30 elukey@puppetmaster1001: conftool action : set/pooled=inactive; selector: name=mw1280.eqiad.wmnet
  • 06:22 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:19 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:00 marostegui: Stop MySQL on labsdb1011 for reimage - T249188
  • 05:58 moritzm: installing git security updates on jessie
  • 05:56 marostegui: Compress tables on db1104 - T232446
  • 05:53 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1104 for defragmentation - T232446', diff saved to https://phabricator.wikimedia.org/P11039 and previous config saved to /var/cache/conftool/dbconfig/20200427-055320-marostegui.json
  • 05:47 vgutierrez: rolling restart ats-tls in cp[1085,1089] and text@esams - T249335
  • 05:33 marostegui: Depool labsdb1011 T249188

2020-04-26

  • 18:08 elukey: powercycle puppetmaster1001 - mgmt serial console not usable, no ssh, racadm getsel doesn't show anything

2020-04-25

  • 10:23 addshore: going to restart and probably depool for a short time wdqs1005 as it is in a deadlock T242453
  • 05:52 _joe_: depooling mw1407 again, should not be serving traffic
  • 05:27 shdubsh: restart elasticsearch on logstash2022

2020-04-24

  • 21:25 cdanis@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=eqiad
  • 19:41 Amir1: applying T114117 on labswiki (wikitech)
  • 18:58 shdubsh: restart elasticsearch on logstash2021
  • 18:50 shdubsh: restart elasticsearch on logstash2020
  • 15:12 cdanis@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad
  • 15:08 addshore: depool and restart wdqs1006 to catch up with lag after deadlock T242453
  • 11:13 Amir1: apply T250071 on s10 (labswiki)

2020-04-23

  • 22:06 Urbanecm: Perform timing-out rename at enwiki Wikipedia talk:Introduction --> Wikipedia talk:Introduction (historical) using moveBatch.php (request)
  • 18:38 ejegg: updated payments-wiki from 1640f5e21e to 45bf1734e0

2020-04-22

  • 08:55 Urbanecm: Move User:Wikipedia:Introduction (historical) --> Wikipedia:Introduction (historical) at enwiki using moveBatch.php, on-wiki interface was timing out
  • 05:50 elukey@deploy1001: Finished deploy [analytics/refinery@30facc4]: Test of new scap settings (duration: 04m 42s)
  • 05:45 elukey@deploy1001: Started deploy [analytics/refinery@30facc4]: Test of new scap settings
  • 05:25 elukey@deploy1001: deploy aborted: log (duration: 00m 02s)
  • 05:24 elukey@deploy1001: Started deploy [analytics/refinery@30facc4]: log
  • 01:55 milimetric@deploy1001: Finished deploy [analytics/refinery@30facc4]: Analytics: another follow-up on the train, jar version bump (take 2, analytics1030 keeps failing) (duration: 00m 42s)
  • 01:54 milimetric@deploy1001: Started deploy [analytics/refinery@30facc4]: Analytics: another follow-up on the train, jar version bump (take 2, analytics1030 keeps failing)
  • 01:54 milimetric@deploy1001: Finished deploy [analytics/refinery@30facc4]: Analytics: another follow-up on the train, jar version bump (duration: 02m 54s)
  • 01:51 milimetric@deploy1001: Started deploy [analytics/refinery@30facc4]: Analytics: another follow-up on the train, jar version bump
  • 01:51 milimetric@deploy1001: deploy aborted: Analytics: another follow-up on the train, jar version bump (duration: 04m 08s)
  • 01:46 milimetric@deploy1001: Started deploy [analytics/refinery@30facc4]: Analytics: another follow-up on the train, jar version bump
  • 01:43 reedy@deploy1001: Synchronized wmf-config/CommonSettings.php: T209749 (duration: 01m 01s)

2020-04-21

  • 23:41 maryum: deploy complete for wdqs v0.3.23
  • 23:36 mstyles@deploy1001: Finished deploy [wdqs/wdqs@4e0d55f]: v0.3.23 (duration: 11m 35s)
  • 23:25 mstyles@deploy1001: Started deploy [wdqs/wdqs@4e0d55f]: v0.3.23
  • 23:19 maryum: begin deploy of WDQS v 0.3.23 on deploy1001
  • 22:41 eileen: process-control config revision is 6294adfbaa
  • 22:24 milimetric@deploy1001: Finished deploy [analytics/refinery@64c5ec4]: Analytics: tiny follow-up on weekly train [analytics/refinery@64c5ec4] (duration: 37m 05s)
  • 21:56 andrewbogott: rebooting cloudvirt1004, total raid controller failure
  • 21:50 urandom: bootstrapping restbase2014-c — T250050
  • 21:46 milimetric@deploy1001: Started deploy [analytics/refinery@64c5ec4]: Analytics: tiny follow-up on weekly train [analytics/refinery@64c5ec4]
  • 21:38 milimetric@deploy1001: Finished deploy [analytics/refinery@35781db]: Regular Analytics weekly train deploy [analytics/refinery@35781db] try 2 (analytics1030 failed with OSError the first time) (duration: 00m 13s)
  • 21:37 milimetric@deploy1001: Started deploy [analytics/refinery@35781db]: Regular Analytics weekly train deploy [analytics/refinery@35781db] try 2 (analytics1030 failed with OSError the first time)
  • 21:21 milimetric@deploy1001: Finished deploy [analytics/refinery@35781db]: Regular Analytics weekly train deploy [analytics/refinery@35781db] (duration: 16m 19s)
  • 21:05 milimetric@deploy1001: Started deploy [analytics/refinery@35781db]: Regular Analytics weekly train deploy [analytics/refinery@35781db]
  • 21:05 milimetric@deploy1001: Finished deploy [analytics/refinery@35781db] (thin): Regular Analytics weekly train deploy THIN [analytics/refinery@35781db] (duration: 00m 08s)
  • 21:05 milimetric@deploy1001: Started deploy [analytics/refinery@35781db] (thin): Regular Analytics weekly train deploy THIN [analytics/refinery@35781db]
  • 19:09 rzl: mcrouter certs renewed on puppetmaster1001 (again); puppet re-enabled on mcrouter hosts and will update certs naturally over the next 30m T248093
  • 19:02 urandom: bootstrapping restbase2014-b — T250050
  • 18:28 hoo: Updated the Wikidata property suggester with data from the 2020-04-06 JSON dump and applied the T132839 workarounds
  • 18:19 rzl: disabling puppet on all mcrouter hosts for cert renewal T248093
  • 17:19 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:16 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:49 urandom: bootstrapping restbase2014-a — T250050
  • 15:40 cmjohnson1: replacing mgmt switch on a6-eqiad T250652
  • 15:38 hashar: CI is back, patches would need to be rechecked by commenting "recheck" in Gerrit.
  • 15:32 hashar: Restarting Gerrit T250820 T246973
  • 15:26 hashar: CI / Zuul does not get any events for some reason :/
  • 14:59 volans@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:59 volans@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:51 hashar: contint2001: manually dropping /var/lib/docker (we now use /srv/docker )
  • 14:48 jbond42: restart haproxy on dns-auth
  • 14:48 hashar: restarting docker on contint2001
  • 14:47 volker-e@deploy1001: Finished deploy [design/style-guide@d101234]: Deploy design/style-guide: (duration: 00m 09s)
  • 14:47 volker-e@deploy1001: Started deploy [design/style-guide@d101234]: Deploy design/style-guide:
  • 14:45 jbond42: puppet enabled again
  • 14:40 moritzm: restarting apache on miscweb
  • 14:37 moritzm: restarting apache on netbox1001
  • 14:36 jbond42: disable puppet fleet wide to restart puppetmaster
  • 14:28 moritzm: installing OpenSSL security updates
  • 14:17 vgutierrez: rolling upgrade of ats to version 8.0.7-1wm1
  • 14:16 moritzm: installing OpenSSL updates on caches
  • 14:08 hashar: contint1001: rm /var/log/apache2/doc_* # service has been moved to doc1001.eqiad.wmnet
  • 13:43 vgutierrez: upload trafficserver 8.0.7-1wm1 to apt.wm.o (buster)
  • 13:11 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 13:10 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
  • 11:15 mutante: recreating cert for contint/integration to add integration.mediawiki.org in addition to integration.wikimedia.org
  • 11:06 mutante: https://integration.wikimedia.org now also using TLS between ATS and contint1001 using envoy (T210411)
  • 10:49 _joe_: mwdebug1001:~# iptables -A INPUT -s 10.64.32.208 -m statistic --mode random --probability 0.1 -j DROP (T240684)
  • 08:52 ema: purged: rolling restart with 4 frontend workers
  • 07:54 ema: cp3050: restart purged with 4 frontend workers
  • 07:47 kormat: dropping old data and optimizing tables on pc1010 and pc2010 T247787
  • 07:26 ema: cp4032: restart ats-tls and ats-be
  • 07:06 ema: cp4026: restart ats-tls and ats-be
  • 06:30 marostegui: Rename flagged* tables on mediawikiwiki on db1075 - T248298
  • 06:24 XioNoX: restore eqsin/ulsfo OSPF metric - T250653
  • 05:46 marostegui: Deploy schema change on s6 codfw master
  • 05:34 marostegui: Add db1095:3312, db1095:3320 to tendril - T250602
  • 05:32 moritzm: installing git security updates
  • 05:19 marostegui: Deploy schema change on s2 codfw - T250055
  • 05:09 vgutierrez: rolling restart of ats-tls to enable SSL_OP_PRIORITIZE_CHACHA

2020-04-20

  • 23:29 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Update project wordmarks and icons (T249047) (duration: 01m 01s)
  • 23:27 catrope@deploy1001: Synchronized static/images/mobile/: Update project wordmarks and icons (T249047) (duration: 01m 02s)
  • 23:14 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add media.api.aucklandmuseum.com to $wgCopyUploadsDomains (T250646) (duration: 01m 08s)
  • 21:11 mepps: update civicrm from 1224b080c1 to e8a0b5395d
  • food: updated fundraising python tools from a93eec292d to c96813eda4
  • 20:14 halfak@deploy1001: Finished deploy [ores/deploy@514f94a]: T250536 (duration: 14m 06s)
  • 20:00 halfak@deploy1001: Started deploy [ores/deploy@514f94a]: T250536
  • 19:53 addshore: pool wdqs1006 again (caught up)
  • 19:53 cdanis@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=eqiad
  • 19:45 jforrester@deploy1001: Synchronized wmf-config/PoolCounterSettings.php: Revert CirrusSearch-MoreLike pool counter numbers now rebuild is done (duration: 01m 01s)
  • 19:43 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: cirrus: Move more_like from codfw back to eqiad, rebuild complete (duration: 01m 03s)
  • 19:40 rzl: mcrouter certs renewed on puppetmaster1001; puppet re-enabled on mcrouter hosts and will update certs naturally over the next 30m T248093
  • 18:39 rzl: disabling puppet on all mcrouter hosts for cert renewal T248093
  • 18:38 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T248418 [testwiki] Force videojs-only mode for TimedMediaHandler (duration: 01m 01s)
  • 18:36 jforrester@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/MassMessage/includes/SpecialEditMassMessageList.php: T250710 Follow-up 95c772864: Fix RevisionRecord calls that differ from Revision (duration: 01m 02s)
  • 18:27 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Adjust Parsoid/VE disable comment for wikitechwiki (duration: 01m 02s)
  • 18:23 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 18:22 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 18:21 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 18:21 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 18:20 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Adjust dummy name of fake Parsoid extension to just 'Parsoid' (duration: 01m 01s)
  • 18:19 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 18:19 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 18:14 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T236104 Wait to update the globals cache file for opcache regeneration (duration: 01m 02s)
  • 18:11 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 02s)
  • 18:06 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 02s)
  • 18:05 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: DiscussionTools: EditAttemptStepSamplingRate increase for some wikis T250086 (duration: 01m 10s)
  • 15:33 Urbanecm: mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Victorgrigas /home/urbanecm/upload (T250687)
  • 15:10 marostegui: Upgrade db2079
  • 14:57 cdanis@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad
  • 14:54 addshore: restart blazegraph on wdqs1006
  • 14:53 addshore: depool wdqs1006 as it stopped updating
  • 14:28 marostegui: Upgrade db2096 (x1 codfw master)
  • 14:24 marostegui: Upgrade db2101
  • 14:18 marostegui: Upgrade dbstore1005
  • 14:17 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1127 after upgrade', diff saved to https://phabricator.wikimedia.org/P11025 and previous config saved to /var/cache/conftool/dbconfig/20200420-141711-marostegui.json
  • 14:13 marostegui: Upgrade db2131
  • 14:10 marostegui: Upgrade db1127
  • 14:10 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1127 for upgrade', diff saved to https://phabricator.wikimedia.org/P11023 and previous config saved to /var/cache/conftool/dbconfig/20200420-141017-marostegui.json
  • 14:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1089 after schema change', diff saved to https://phabricator.wikimedia.org/P11022 and previous config saved to /var/cache/conftool/dbconfig/20200420-140642-marostegui.json
  • 13:50 marostegui: Deploy schema change on codfw master - T250055
  • 13:30 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: Undeploying graphoid on beta (duration: 01m 07s)
  • 13:18 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1081 after schema change, restore db1097:3314 original weights', diff saved to https://phabricator.wikimedia.org/P11021 and previous config saved to /var/cache/conftool/dbconfig/20200420-131823-marostegui.json
  • 12:40 XioNoX: remove all disabled terms from cr2-eqiad
  • 12:31 XioNoX: remove all disabled BGP neighbors on cr2-esams
  • 12:11 mateusbs17: Running `REINDEX DATABASE gis` in maps2004.codfw.wmnet (which is depooled at the moment)
  • 11:41 mutante: puppetmaster - revoking cert for webserver-misc-apps.discovery.wmnet and recreating it with additional static microsite names (T247650)
  • 11:27 awight@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: Temporarily enable event oversampling for conflicts (T249616) (duration: 01m 00s)
  • 11:25 awight@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/TwoColConflict: SWAT: Configurable EditStepAttempt oversampling for conflicts (T249616) (duration: 01m 03s)
  • 11:05 mutante: rsyncing static-bugzilla files from bromine to miscweb1002 (T247650)
  • 11:02 mutante: bromine/vega: stop rsyncd which was removed from puppet
  • 10:49 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 57s)
  • 10:48 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 01m 03s)
  • 10:37 elukey: apt-get purge rsync on mwlog* after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/589600/
  • 10:08 XioNoX: uRPF, sample + discard in eqiad - T244147
  • 10:06 XioNoX: uRPF, sample + discard in eqord - T244147
  • 09:51 XioNoX: uRPF, sample + discard in dfw - T244147
  • 09:38 XioNoX: uRPF, sample + discard in ulsfo - T244147
  • 09:19 Urbanecm: Security deploy for T250594
  • 08:46 vgutierrez: restart ats-tls in cp3064 - T249335
  • 08:35 jayme: imported helmfile 0.66.0-1+deb10u1 to main for buster-wikimedia
  • 08:20 marostegui@cumin1001: dbctl commit (dc=all): 'Temporary pool db1097:3314 into API', diff saved to https://phabricator.wikimedia.org/P11019 and previous config saved to /var/cache/conftool/dbconfig/20200420-082019-marostegui.json
  • 08:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1081', diff saved to https://phabricator.wikimedia.org/P11018 and previous config saved to /var/cache/conftool/dbconfig/20200420-081911-marostegui.json
  • 08:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1089', diff saved to https://phabricator.wikimedia.org/P11017 and previous config saved to /var/cache/conftool/dbconfig/20200420-081623-marostegui.json
  • 08:14 marostegui: Remove img_deleted column from db1089 (enwiki), db1081 (commonswiki), db1111 (wikidatawiki) - T250055
  • 08:09 jynus: restarting s3 instance on db1095 to reduce its buffer pool T250602
  • 07:22 _joe_: restarting php-fpm on the eqiad appservers to pick up the new max_execution_time
  • 07:20 marostegui: Re add tl_namespace index to db1104 and db1092 - T250060
  • 06:45 moritzm: installing python2.7 security updates on jessie
  • 06:41 elukey: execute find -mtime +30 -delete in /var/log/airflow/scheduler on an-airflow1001 to free space
  • 06:25 moritzm: installing libxdmcp security updates on jessie
  • 06:16 moritzm: installing bash updates on jessie
  • 05:54 vgutierrez: rolling restart of ats-tls in cp[3052,3054,3056,3058,3060,4028,4029,4030,4031,4032] - T249335
  • 05:53 marostegui: Deploy schema change on s8 eqiad hosts T250060
  • 05:50 marostegui: Deploy schema change on s8 codfw - lag will show up T250060
  • 04:55 ariel@deploy1001: Finished deploy [dumps/dumps@b813c8a]: no private table dumps, check for existence of 7z,bz2 page content files before dumping, various unit tests (duration: 00m 04s)
  • 04:55 ariel@deploy1001: Started deploy [dumps/dumps@b813c8a]: no private table dumps, check for existence of 7z,bz2 page content files before dumping, various unit tests

2020-04-19

  • 16:19 reedy@deploy1001: Synchronized wmf-config/LabsServices.php: labs: Move RB traffic to new stretch host (duration: 01m 11s)
  • 16:05 vgutierrez: rolling restart of ats-tls in text@esams - T249335
  • 05:51 marostegui: Power back on db1140 T250602

2020-04-18

  • 22:50 addshore: pool wdqs1006 blazegraph caught up T242453
  • 20:30 cdanis@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=eqiad
  • 20:27 thcipriani: restart gerrit-replica
  • 16:40 dcausse: forcing replica count to 1 on some cloudelastic@chi indices
  • 15:13 Amir1: applying schema change of T139090 on labswiki (wikitech)
  • 14:03 cdanis@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad
  • 12:19 addshore: restarting blazegraph on wdqs1006 blazegraph stuck T242453
  • 12:15 addshore: depool wdqs1006 blazegraph stuck T242453
  • 06:07 XioNoX: change OSPF metrics to prefer ulsfo tunnel transport

2020-04-17

  • 19:33 Krinkle: Depool mw1407.eqiad.wmnet for opcache testing. Do not repool without first reverting https://gerrit.wikimedia.org/r/589674.
  • 19:32 Krinkle: Depool mw1407.eqiad.wmnet for opcache and LCStoreStaticArray testing. – T99740
  • 17:41 cmjohnson1: replacing network cable pc1009 T250257
  • 17:34 cmjohnson1: moving msw1 to msw-c racks mounted switch cable ports from port 49 to port 50
  • 17:22 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:22 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:15 Urbanecm: Revert recent email change of User:CPHL@SUL's email
  • 16:05 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 16:05 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 15:52 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 15:52 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 15:48 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 15:48 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 15:42 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 15:42 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 15:20 rzl: remove cronjobs from mwmaint1002 previously updated to systemd timers and erroneously left in crontab -- diffs: https://phabricator.wikimedia.org/P11012 T211250
  • 14:29 mutante: ganeti2001 - killed and restarted gnt-rapi process with the correct new key and cert
  • 14:19 cdanis: add peer AS29802 to cr2-eqdfw and cr2-esams
  • 14:01 mutante: netbox1001 - netbox_ganeti_eqiad_sync / systemd state fixed after gnt-rapi is running again on ganeti1003
  • 14:00 mutante: ganeti1003 - fixing gnt-rapi daemon not running
  • 13:54 mateusbs17: Running VACUUM FULL for gis DB in maps2004.codfw.wmnet (which is depooled at the moment)
  • 13:00 mutante: netbox1001 - sudo systemctl start netbox_ganeti_eqiad_sync (was failed)
  • 12:54 mutante: contint2001 /usr/local/sbin/build-envoy-config -c /etc/envoy ; restart envoyproxy; was not listening on admin port
  • 12:45 mutante: contint2001 - restart nagios-nrpe-server
  • 12:28 moritzm: copied kubernetes-client from stretch-wikimedia to buster-wikimedia T224591
  • 11:35 mutante: contint2001 - apt-get update, run puppet to install helm-diff
  • 11:33 jayme: imported helm-diff 2.11.0+3-2+deb10u1 to main for buster-wikimedia
  • 11:23 dzahn@cumin2001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
  • 11:23 dzahn@cumin2001: START - Cookbook sre.hosts.decommission
  • 11:22 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 11:21 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 11:20 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 11:20 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:17 _joe_: contint1001:~$ sudo systemctl restart envoyproxy.service
  • 10:16 _joe_: contint1001:~$ sudo /usr/local/sbin/build-envoy-config -c /etc/envoy
  • 10:07 kormat: change pc2010 to replicate from pc1010 T247787
  • 09:54 kormat: enabling replication from pc1007 to pc1010 T247787
  • 09:20 jayme: imported helm 2.12.2 to main for buster-wikimedia
  • 09:07 vgutierrez: disable KA between ats-tls and varnish-fe on cp1077 - T250258
  • 09:00 kormat: dropping wikidatawiki.wb_items_per_site_old table in eqiad (non-labs hosts) T250345
  • 08:15 kormat: dropping wikidatawiki.wb_items_per_site_old table in codfw T250345
  • 07:54 ema: cache_text: puppet run to stop vhtcpd and start purged T249325
  • 07:45 gehel: restart wdqs-updater on all nodes after deployment
  • 06:31 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1092 after compression', diff saved to https://phabricator.wikimedia.org/P11005 and previous config saved to /var/cache/conftool/dbconfig/20200417-063138-marostegui.json
  • 06:30 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1111 from API', diff saved to https://phabricator.wikimedia.org/P11004 and previous config saved to /var/cache/conftool/dbconfig/20200417-063038-marostegui.json
  • 06:26 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1092 after compression', diff saved to https://phabricator.wikimedia.org/P11003 and previous config saved to /var/cache/conftool/dbconfig/20200417-062642-marostegui.json
  • 06:19 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1092 after compression', diff saved to https://phabricator.wikimedia.org/P11002 and previous config saved to /var/cache/conftool/dbconfig/20200417-061907-marostegui.json
  • 06:04 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1092 after compression', diff saved to https://phabricator.wikimedia.org/P11001 and previous config saved to /var/cache/conftool/dbconfig/20200417-060419-marostegui.json

2020-04-16

  • 22:34 maryum: reindexing wikis that failed from previous reindex on mwmaint1002
  • 22:10 jforrester@deploy1001: Pruned MediaWiki: 1.35.0-wmf.26 (duration: 05m 26s)
  • 21:59 jforrester@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/FlaggedRevs/: T250439 Don't try to create a Revision with null (duration: 01m 02s)
  • 21:54 bsitzmann@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 21:51 bsitzmann@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 21:48 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 20:42 mstyles@deploy1001: Finished deploy [wdqs/wdqs@1fb52b3]: WDQS version 0.3.22 (duration: 11m 43s)
  • 20:30 mstyles@deploy1001: Started deploy [wdqs/wdqs@1fb52b3]: WDQS version 0.3.22
  • 20:01 maryum: "beginning deploy of WDQS 0.3.22"
  • 19:06 jforrester@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.28
  • 18:57 krinkle@deploy1001: Synchronized errorpages/404.php: I9fd5c99130c64 (duration: 01m 07s)
  • 17:52 XioNoX: rename/format asw-ulsfo interfaces to match future homer driven format
  • 16:51 herron: kafka-logging eqiad set retention.bytes=500000000000 on topic udp_localhost-warning T250133
  • 16:45 herron: kafka-logging eqiad set retention.bytes=500000000000 on topic udp_localhost-info T250133
  • 16:30 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:54 elukey: restart chi on cloudelastic1001 with -XX:NewRatio=3 - T231517
  • 15:26 akosiaris: truncate /var/log/ganeti/monitoring-daemon-error.log on ganeti1003, start again all ganeti daemons
  • 15:20 akosiaris: stop ganeti daemons on ganeti1003
  • 15:02 Urbanecm: mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Petri Gyula' '23eki' (T250387)
  • 14:51 hknust: holger@mwmaint1002 END (Fail) uppercaseTitlesForUnicodeTransition.php as part of T219279
  • 14:30 hknust: holger@mwmaint1002 Starting uppercaseTitlesForUnicodeTransition.php as part of T219279
  • 14:21 gehel@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
  • 14:17 hnowlan@deploy1001: Finished deploy [changeprop/deploy@354ae2d]: Enabling rules on k8s, disabling on scb (duration: 01m 12s)
  • 14:16 hnowlan@deploy1001: Started deploy [changeprop/deploy@354ae2d]: Enabling rules on k8s, disabling on scb
  • 14:14 dcausse: elastic (search cluster) reindexing commonswiki_content in codfw and eqiad (T246882)
  • 14:13 ema: cache: upgrade varnish to 5.1.3-1wm14 and rolling restart T249810
  • 13:40 XioNoX: rename/format asw2-esams interfaces to match future homer driven format
  • 13:36 kormat: Optimizing all tables on pc1010 T247787
  • 13:32 hashar: Restarting CI Jenkins for plugin upgrade T250377
  • 13:04 hnowlan@deploy1001: Finished deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules, again (duration: 00m 30s)
  • 13:04 hnowlan@deploy1001: Started deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules, again
  • 13:03 gehel@cumin1001: START - Cookbook sre.wdqs.data-reload
  • 12:54 gehel@cumin1001: END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
  • 12:54 gehel@cumin1001: START - Cookbook sre.wdqs.data-reload
  • 12:48 vgutierrez: pool cp1087
  • 12:44 jynus: test sal again
  • 11:29 elukey: restart atskafka on cp3050 after maintenance
  • 11:22 XioNoX: rename/format asw1-eqsin interfaces to match future homer driven format
  • 11:17 elukey: stop atskafka on cp3050 to re-create the topic atskafka_test_webrequest_text on Kafka Jumbo - T250347
  • 11:16 Urbanecm: EU SWAT done
  • 11:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: a105f38: Remove broken groupOverrides from amwikimedia (T249585) (duration: 01m 05s)
  • 11:12 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: 70ee5f6: Remove grants for tboverride and tboverride-account (T241114) (duration: 01m 06s)
  • 11:05 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 74ad793: Turn off direct account creations at Testwikidata (T250348; take II) (duration: 01m 04s)
  • 11:04 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 74ad793: Turn off direct account creations at Testwikidata (T250348) (duration: 01m 06s)
  • 11:03 urbanecm@deploy1001: sync-file aborted: SWAT: 74ad793: Turn off direct account creations at Testwikidata (duration: 00m 00s)
  • 10:54 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 10:45 hnowlan@deploy1001: Finished deploy [changeprop/deploy@354ae2d]: Testing rules moved to k8s (duration: 01m 16s)
  • 10:45 vgutierrez: upgrading ATS to version 8.0.7-rc0-1wm3 - T249335
  • 10:44 hnowlan@deploy1001: Started deploy [changeprop/deploy@354ae2d]: Testing rules moved to k8s
  • 10:44 vgutierrez: rolling restart of ats-tls to enable TLSv1.3 globally and disable the old TLS session cache - T170567
  • 10:35 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 10:35 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 10:31 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 10:22 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 09:33 elukey: restart atskafka on cp3050 to pick up snappy compression - T250347
  • 09:32 ema: cp2027: upgrade varnish to 5.1.3-1wm14 T249810
  • 09:17 ema: text@esams: stop vhtcpd, start purged T249325
  • 09:16 jynus: starting es backups on backup2002 T79922
  • 08:33 kormat: Disconnect pc1008 replication from pc1010 T247787
  • 08:22 ema: cp3050: upgrade purged to 0.7 T249583
  • 08:22 ema: upload purged 0.7 to buster-wikimedia T249583
  • 08:21 Urbanecm: Set email for Geraki@grwikimedia (T245911)
  • 08:18 kormat@deploy1001: Synchronized wmf-config/db-eqiad.php: Repool pc1008 as pc2 master T247787 (duration: 01m 08s)
  • 08:06 mutante: mw1396 - restarted php7.2-fpm - was: 503 Service Unavailable - header 'X-Powered-By: PHP/7.' not found on 'http://en.wikipedia.org:80/wiki/Main_Page'
  • 08:04 mutante: mw1396 - restarted apache
  • 07:50 vgutierrez: rolling update ats to version 8.0.7-rc0-1wm3 in cp[4026,4032,5006,5012] - T249335
  • 07:49 vgutierrez: upload trafficserver 8.0.7-rc0-1wm3 to apt.wm.o (buster) - T249335
  • 07:15 volker-e@deploy1001: Finished deploy [design/style-guide@2a7cc4a]: Deploy design/style-guide: (duration: 00m 08s)
  • 07:15 volker-e@deploy1001: Started deploy [design/style-guide@2a7cc4a]: Deploy design/style-guide:
  • 06:33 moritzm: installing apache-log4j1.2 security updates on jessie
  • 06:29 moritzm: installing icu security updates on jessie
  • 06:15 moritzm: installing git security updates on jessie
  • 05:43 marostegui@cumin1001: dbctl commit (dc=all): 'Reorganize s8 weights a little bit after the addition of the new host db1114', diff saved to https://phabricator.wikimedia.org/P10995 and previous config saved to /var/cache/conftool/dbconfig/20200416-054353-marostegui.json
  • 05:33 elukey: restart hadoop-yarn-nodemanager on an-worker108[4,5] - failed after GC OOM events (heavy spark jobs)

2020-04-15

  • 22:11 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/MachineVision: Fix: Initialize categories array for initial images (T250321) (duration: 01m 07s)
  • 21:48 maryum: removing duplicate indices from production ES clusters that were created when reindexing failed
  • 20:16 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@1907571]: Update mobileapps to ff34d0b5 (duration: 04m 57s)
  • 20:11 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@1907571]: Update mobileapps to ff34d0b5
  • 19:53 addshore: pool wdqs1006 caught up
  • 19:44 addshore: depool wdqs1006 to catch up on lag
  • 19:04 jforrester@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.28 (duration: 01m 05s)
  • 19:03 jforrester@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.28
  • 18:44 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: Idc81a885b2f3, T196309 (duration: 01m 07s)
  • 18:12 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: re-sync (duration: 01m 07s)
  • 18:10 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Fix GrowthExperiments helpdesk URL for frwiktionary (T235964) (duration: 01m 06s)
  • 16:08 volker-e@deploy1001: Finished deploy [design/style-guide@a4d5794]: Deploy design/style-guide: (duration: 00m 11s)
  • 16:08 volker-e@deploy1001: Started deploy [design/style-guide@a4d5794]: Deploy design/style-guide:
  • 15:46 ejegg: updated fundraising CiviCRM from 18d7567cd7 to 1224b080c1
  • 15:36 ema: cp2029,cp3050: upgrade purged to 0.6, restart varnish-fe T249583
  • 15:30 ema: upload purged 0.6 to buster-wikimedia T249583
  • 15:19 papaul: upgrading firmware on restbase2014
  • 14:36 vgutierrez: rolling upgrade to ATS 8.0.7-rc0-1wm2 on cp[3064,3065,2042,2041,1090,1089] - T249335
  • 14:32 jforrester@deploy1001: Synchronized wmf-config/ProductionServices.php: Drop 'parsoidphp' service, we use 'parsoid' now (duration: 01m 06s)
  • 14:27 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Use 'parsoid' service in lieu of 'parsoidphp' (duration: 01m 07s)
  • 14:25 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 06s)
  • 14:23 jforrester@deploy1001: Synchronized wmf-config/ProductionServices.php: Add 'parsoid' service to replace 'parsoidphp' (duration: 01m 06s)
  • 14:17 jforrester@deploy1001: Synchronized wmf-config/wikitech.php: Use MediaWikiServices::getAuthManager on wikitech (duration: 01m 06s)
  • 14:14 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T242912 Remove wgEnablePartialBlocks config, no longer read (duration: 01m 07s)
  • 14:12 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: wmgExtraLanguageNames: Remove 'smn', supported by core since 1.35.0-wmf.26 (duration: 01m 06s)
  • 14:10 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 06s)
  • 14:09 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T250181 T250183 Wikibase: Use false instead of database names for 'local' entity sources on test wikis (duration: 01m 06s)
  • 14:02 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 05s)
  • 14:01 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop defining wmgMobileFrontend and wmgMinervaNeue, unread (duration: 01m 06s)
  • 13:59 jforrester@deploy1001: Synchronized wmf-config/mobile.php: Stop reading wmgMobileFrontend and wmgMinervaNeue, always true (duration: 01m 06s)
  • 13:52 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Stop setting wgContentHandlerUseDB, now unread (duration: 01m 06s)
  • 13:32 ema: upload varnish_5.1.3-1wm14 to buster-wikimedia T249810
  • 13:26 jforrester@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/Flow/Hooks.php: T248727 Adjust to RevisionUndeleted hook now having (duration: 01m 04s)
  • 13:25 jforrester@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/LiquidThreads/classes/DeletionController.php: T248727 Adjust to RevisionUndeleted hook now having (duration: 01m 06s)
  • 13:23 jforrester@deploy1001: Synchronized php-1.35.0-wmf.28/includes/page/PageArchive.php: T248727 Fix RevisionUndeleted hook to add (duration: 01m 08s)
  • 13:23 kormat@cumin1001: dbctl commit (dc=all): 'Increase db1114's weight to 100% of target, and reduce db1104 slightly T250224', diff saved to https://phabricator.wikimedia.org/P10990 and previous config saved to /var/cache/conftool/dbconfig/20200415-132310-kormat.json
  • 13:10 hashar: contint2001: starting zuul-merger process # T224591
  • 12:49 kormat@cumin1001: dbctl commit (dc=all): 'Increase db1114's weight to 50% of target T250224', diff saved to https://phabricator.wikimedia.org/P10989 and previous config saved to /var/cache/conftool/dbconfig/20200415-124931-kormat.json
  • 12:41 vgutierrez: rolling upgrade to ATS 8.0.7-rc0-1wm2 in ulsfo and eqsin - T249335
  • 12:03 mutante: puppetmaster1001: revoking ganeti01.svc.eqiad.wmnet and ganeti01.svc.codfw.wmnet certificates. adding eqiad and codfw to cergen .yaml file, recreating ganeti certs
  • 11:27 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Deploy Welcome Survey to Serbian Wikipedia and French Wiktionary (T249956) (double-sync) (duration: 01m 03s)
  • 11:26 awight@deploy1001: sync-file aborted: SWAT: Deploy Welcome Survey to Serbian Wikipedia and French Wiktionary (T249956) (double-sync) (duration: 00m 02s)
  • 11:23 awight: EU SWAT complete
  • 11:22 awight@deploy1001: Synchronized php-1.35.0-wmf.28/extensions/TwoColConflict: SWAT: Flatten exit logging (T248601) (duration: 01m 09s)
  • 11:09 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Deploy Welcome Survey to Serbian Wikipedia and French Wiktionary (T249956) (duration: 01m 24s)
  • 10:57 marostegui: Deploy schema change on s8 codfw master - T250057
  • 10:25 ema: cp3050: varnish-frontend-restart to clear mbox lag and see how long it takes to show up T249583
  • 10:02 ema: upload purged 0.5 to buster-wikimedia T249583
  • 09:50 jynus@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 09:48 jynus@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:48 vgutierrez: disable KA between ats-tls and varnish-fe for POST requests on eqiad - T250258
  • 09:45 godog: force-run curator from logstash1008 - T250133
  • 09:43 kormat@cumin1001: dbctl commit (dc=all): 'Increase db1114's weight some more T250224', diff saved to https://phabricator.wikimedia.org/P10988 and previous config saved to /var/cache/conftool/dbconfig/20200415-094305-kormat.json
  • 09:08 elukey: restart druid brokers on druid100[4-6] - stuck after datasource deletion
  • 09:07 vgutierrez: repool cp1081
  • 08:54 kormat@cumin1001: dbctl commit (dc=all): 'Increase db1114's weight T250224', diff saved to https://phabricator.wikimedia.org/P10986 and previous config saved to /var/cache/conftool/dbconfig/20200415-085432-kormat.json
  • 08:54 vgutierrez: depool cp1081 for debugging purposes
  • 08:46 XioNoX: reset edac counters on scb1001
  • 08:43 dcausse: errata: elastic (search cluster) reindexing commonswiki_content on cloudelastic (T246882)
  • 08:42 dcausse: elastic (search cluster) reindex commmonswiki_content on cloudelastic (T246882)
  • 08:14 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1114 on s8 with low weight T250224', diff saved to https://phabricator.wikimedia.org/P10985 and previous config saved to /var/cache/conftool/dbconfig/20200415-081421-marostegui.json
  • 07:59 marostegui: Deploy schema change on s7 codfw master - T250057
  • 07:35 elukey: restart cloudelastic-chi on cloudelastic1002 to apply new jvm settings - T231517
  • 06:55 mutante: install1003 moving /srv/autoinstall to /root, running puppet, leaving a README file to point out it moved to apt1001
  • 06:47 marostegui: Deploy schema change on s6 codfw with replication - T250057
  • 06:43 marostegui: Deploy schema change on labtestwiki - T250057
  • 06:43 XioNoX: re-set asw2-c-eqiad's licenses
  • 06:42 marostegui: Deploy schema change on labswiki - T250057
  • 06:32 XioNoX: set uRPF log action back to log infra wide - T244147
  • 06:04 vgutierrez: update to ats 8.0.7-rc0-1wm2 on cp[5006,5012] - T249335
  • 05:49 moritzm: installing git security updates
  • 05:27 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:25 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:22 kart_: Update cxserver to 2020-04-13-094138-production (T239459, T249469)
  • 05:21 marostegui: Remove db1114 from tendril and zarcillo T250224
  • 05:17 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 05:13 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 05:11 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 05:07 marostegui: Remove db1114 from tendril - T250224

2020-04-14

  • 23:24 AndyRussG: re-enabled thank-you, omnimailing and new recurring charge jobs
  • 22:59 AndyRussG: disabled thank-you and omnimailing jobs
  • 22:59 AndyRussG: fundraising civicrm revision changed from 59e712ce8e to 18d7567cd7
  • 21:36 addshore: pool wdqs1006, it is caught up
  • 21:03 addshore: depool wdqs1006 to give it a chance to catch up on lag
  • 20:34 cdanis@cumin1001: dbctl commit (dc=all): 'tweak db1111 weight yet again', diff saved to https://phabricator.wikimedia.org/P10979 and previous config saved to /var/cache/conftool/dbconfig/20200414-203426-cdanis.json
  • 20:18 James_F: Adding Create-Signed-Tag right to wikimedia-ui-base group for wikimedia-ui-base repo
  • 20:14 marostegui@cumin1001: dbctl commit (dc=all): 'Change s8 weights', diff saved to https://phabricator.wikimedia.org/P10978 and previous config saved to /var/cache/conftool/dbconfig/20200414-201412-marostegui.json
  • 19:58 marostegui@cumin1001: dbctl commit (dc=all): 'reduce db1126 weight due to cpu issues', diff saved to https://phabricator.wikimedia.org/P10977 and previous config saved to /var/cache/conftool/dbconfig/20200414-195855-marostegui.json
  • 19:57 cdanis@cumin1001: dbctl commit (dc=all): '+db1111, -db1126', diff saved to https://phabricator.wikimedia.org/P10976 and previous config saved to /var/cache/conftool/dbconfig/20200414-195734-cdanis.json
  • 19:51 cdanis@cumin1001: dbctl commit (dc=all): 'more weight to db1104', diff saved to https://phabricator.wikimedia.org/P10975 and previous config saved to /var/cache/conftool/dbconfig/20200414-195100-cdanis.json
  • 19:47 cdanis@cumin1001: dbctl commit (dc=all): '+weight on db1104@s8', diff saved to https://phabricator.wikimedia.org/P10974 and previous config saved to /var/cache/conftool/dbconfig/20200414-194710-cdanis.json
  • 19:26 jforrester@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.28
  • 19:22 ebernhardson@deploy1001: Finished scap: wmf-config/PoolCounterSettings.php cirrus: increase pool counter size for traffic shift to codfw (duration: 21m 55s)
  • 19:00 ebernhardson@deploy1001: Started scap: wmf-config/PoolCounterSettings.php cirrus: increase pool counter size for traffic shift to codfw
  • 18:41 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:38 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:59 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 17:57 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:35 jforrester@deploy1001: Finished scap: Testwikis to php-1.35.0-wmf.28 and rebuild i18n cache for T247775 (duration: 42m 37s)
  • 17:26 ppchelko@deploy1001: Finished deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules, again (duration: 00m 56s)
  • 17:25 ppchelko@deploy1001: Started deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules, again
  • 17:23 ppchelko@deploy1001: deploy aborted: Rollback removing k8s rules, again (duration: 00m 05s)
  • 17:23 ppchelko@deploy1001: Started deploy [changeprop/deploy@354ae2d]: Rollback removing k8s rules, again
  • 17:12 ppchelko@deploy1001: Finished deploy [changeprop/deploy@354ae2d]: Remove rules enabled in k8s T248677 attempt 2 (duration: 00m 25s)
  • 17:12 ppchelko@deploy1001: Started deploy [changeprop/deploy@354ae2d]: Remove rules enabled in k8s T248677 attempt 2
  • 17:08 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:07 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:05 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:52 jforrester@deploy1001: Started scap: Testwikis to php-1.35.0-wmf.28 and rebuild i18n cache for T247775
  • 16:49 jforrester@deploy1001: sync aborted: testwikis wikis to 1.35.0-wmf.28 (duration: 00m 05s)
  • 16:49 jforrester@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.28
  • 16:38 akosiaris: stop all ganeti components (VMs are fine) on all ganeti2* hosts for key/cert rollover
  • 16:38 jforrester@deploy1001: Pruned MediaWiki: 1.35.0-wmf.25 (duration: 17m 20s)
  • 16:20 James_F: Scap cleaning 1.35.0-wmf.25 T247775
  • 16:07 ariel@deploy1001: Finished deploy [dumps/dumps@90cbab0]: fix listing of input files for 7z recompression, retry (duration: 00m 04s)
  • 16:06 ariel@deploy1001: Started deploy [dumps/dumps@90cbab0]: fix listing of input files for 7z recompression, retry
  • 16:06 ppchelko@deploy1001: Finished deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules (duration: 01m 20s)
  • 16:06 ejegg: disabled new recurring payments charge job
  • 16:05 ppchelko@deploy1001: Started deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules
  • 16:04 ariel@deploy1001: Finished deploy [dumps/dumps@90cbab0]: fix listing of input files for 7z recompression (duration: 00m 04s)
  • 16:04 ariel@deploy1001: Started deploy [dumps/dumps@90cbab0]: fix listing of input files for 7z recompression
  • 15:52 ema: cp3050: suspend purged testing, varnish-frontend-restart to clear mailbox lag T249583
  • 15:50 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:49 James_F: 1.35.0-wmf.28 was branched at ded5b87 for T247775
  • 15:47 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:19 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:17 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:15 vgutierrez: update to ats 8.0.7-rc0-1wm2 on cp[4026,4032] - T249335
  • 15:13 vgutierrez: upload trafficserver 8.0.7-rc0-1wm2 to apt.wm.o (buster) - T249335
  • 15:12 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:11 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 14:44 ppchelko@deploy1001: Finished deploy [changeprop/deploy@354ae2d]: Remove rules enabled in k8s T248677 (duration: 01m 58s)
  • 14:42 ppchelko@deploy1001: Started deploy [changeprop/deploy@354ae2d]: Remove rules enabled in k8s T248677
  • 14:34 godog: power down ms-be1023 - T249174
  • 14:33 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:33 filippo@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:33 filippo@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 14:33 filippo@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:15 elukey: enable TLS between weblog1001,mwlog2001.codfw.wmnet,mwlog1001 and Kafka Jumbo/Logging - T250147
  • 14:15 hashar: Rebasing mediawiki-config on deploy1001 for a deployment-prep config change ( https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/588706/ )
  • 14:12 ema: cp3050: resume purged testing T249583
  • 13:55 ema: upload purged 0.4 to buster-wikimedia T249583
  • 13:21 hashar: Starting zuul-merger on contint2001
  • 12:50 vgutierrez: Enable inbound TLSv1.3 in text@eqsin - T170567
  • 12:03 jbond42: upgrade haproxy on dns servers
  • 11:08 Urbanecm: EU SWAT done
  • 11:05 Urbanecm: Purge https://en.wikipedia.org/static/images/project-logos/cswiki*.png (T249173)
  • 11:04 urbanecm@deploy1001: Synchronized static/images/project-logos/: SWAT: 7da408e: Revert "Enable cswiki anniversary logo" (T249173) (duration: 01m 00s)
  • 11:01 jynus: resizing backup1001:/srv/databases to 40 TB
  • 10:55 XioNoX: set uRPF log action to syslog infra wide - T244147
  • 10:15 XioNoX: update prefix-list LVS-service-ips to add missing prefixes
  • 09:49 XioNoX: re-order aggregate routes to standardize order
  • 09:48 XioNoX: cleanup 2620:0:860::/46 and 208.80.152.0/22 aggregates from cr2-eqdfw - T246721
  • 09:47 XioNoX: cleanup 2620:0:860::/46 and 208.80.152.0/22 aggregates from cr2-eqord - T246721
  • 09:37 XioNoX: cleanup 2620:0:860::/46 and 208.80.152.0/22 aggregates from cr1/2-codfw - T246721
  • 09:17 XioNoX: add missing `routing-options rib inet6.0 aggregate defaults discard` where missing (cr3-knams, cr3-esams, cr2-eqord, cr2-eqdfw, cr1/2-eqiad/codfw)
  • 09:13 godog: add mwilliams to 'wmf' ldap group - T249844
  • 09:08 marostegui: Add kormat to ops and wmf ldap groups - T250134
  • 08:49 elukey: restart elastic-chi on cloudelastic1001 with -XX:NewSize=10G - T231517
  • 07:33 elukey: apply CMS GC settings to chi on cloudelastic1001 - T231517
  • 05:30 vgutierrez: rolling upgrade to ats 8.0.7-rc0-1wm1 in esams and eqiad
  • 05:01 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Repool pc2008 after upgrade (duration: 01m 00s)

2020-04-13

  • 23:24 mdholloway: re-ran extensions/MachineVision/maintenance/withholdImages.php on commonswiki
  • 23:14 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: MachineVision withholding list additions (T249939) (duration: 00m 59s)
  • 22:41 cdanis: repool codfw
  • 22:35 ebernhardson: restart elasticsearch_6@production-search-psi-eqiad on elastic1052 for excessive old gc over last few hours
  • 22:35 ebernhardson: restart elasticsearch_6@production-search-psi-eqiad on elastic1052
  • 22:08 cdanis: depool codfw
  • 21:43 mdholloway: ran extensions/MachineVision/maintenance/removeBlacklistedSuggestions.php on commonswiki (T249273)
  • 21:34 mdholloway: ran extensions/MachineVision/maintenance/removeBlacklistedSuggestions.php on testcommonswiki
  • 21:32 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.27/extensions/MachineVision: Add script to apply blacklist to current labels (T249273) (duration: 00m 58s)
  • 20:49 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: MachineVision blocklist update (T249895) (duration: 00m 59s)
  • 19:56 mdholloway: finished running extensions/MachineVision/maintenance/withholdImages.php on commonswiki (T249939)
  • 19:51 mdholloway: running extensions/MachineVision/maintenance/withholdImages.php on commonswiki
  • 19:41 mdholloway: ran extensions/MachineVision/maintenance/withholdImages.php on testcommonswiki
  • 19:37 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.27/extensions/MachineVision: Add support for WITHHOLD_ALL review state (T249939) (duration: 01m 23s)
  • 19:13 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: MachineVision: Add MachineVisionWithholdImageList config (T249939) (duration: 01m 03s)
  • 19:06 niedzielski: Morning SWAT done
  • 19:02 niedzielski@deploy1001: Synchronized php-1.35.0-wmf.27/skins/MinervaNeue: SWAT: Update the icon glyph (T249864) (duration: 01m 00s)
  • 18:49 niedzielski@deploy1001: Synchronized php-1.35.0-wmf.27/extensions/TwoColConflict: SWAT: Fix double HTML escaping of "copytext" lines in the diff (T249986) (duration: 01m 01s)
  • 17:01 XioNoX: sample before any other border-in terms in eqiad
  • 16:57 XioNoX: sample before any other border-in terms in esams
  • 16:50 XioNoX: sample before any other border-in terms in dfw
  • 16:46 XioNoX: sample before any other border-in terms in ulsfo
  • 16:36 XioNoX: sample before any other border-in terms in eqsin
  • 16:36 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 16:33 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 16:31 XioNoX: Sample all inbound v6 traffic on cr2-eqsin
  • 16:31 cmjohnson1: replacing msw-c6-eqiad
  • 16:30 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 15:56 marostegui: Deploy schema change on s4 codfw master - T250067
  • 12:12 vgutierrez: rolling upgrade to ats 8.0.7-rc0-1wm1 in eqsin and codfw
  • 11:58 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' .
  • 11:57 marostegui: Deploy schema change on eqiad s8 hosts - T250062
  • 11:53 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 11:53 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 11:53 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 11:53 marostegui: Deploy schema change on codfw master (lag will appear on codfw) - T250062
  • 11:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: efe2feb: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki (T248860; take II) (duration: 00m 58s)
  • 11:14 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: efe2feb: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki (T248860) (duration: 00m 58s)
  • 10:37 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 58s)
  • 10:36 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 01m 00s)
  • 10:24 mutante: depooled wdqs1004 by request because of high lag
  • 10:19 marostegui: Kill updateSpecialPages.php --only=Fewestrevisions for s8 in mwmaint1002, the vslow host is lagging and creating errors
  • 10:12 mutante: mwmaint1002 - sudo systemctl status mediawiki_job_translationnotifications-mediawikiwiki.service
  • 09:52 Urbanecm: Rename user account Gerakiw@grwikimedia to Geraki@grwikimedia (T245911)
  • 09:47 Urbanecm: mwscript createAndPromote.php --wiki=grwikimedia --force Gerakiw <redacted> (T245911)
  • 08:15 marostegui: Remove grants for haproxy@10.64.37.15 from labsdb hosts T231280
  • 07:50 vgutierrez: enable memory tracking in ats-tls on cp1085 - T249335
  • 07:43 marostegui: Compress db1092 T232446
  • 07:41 marostegui@cumin1001: dbctl commit (dc=all): 'Temporary pool db1111 in s8 API', diff saved to https://phabricator.wikimedia.org/P10964 and previous config saved to /var/cache/conftool/dbconfig/20200413-074158-marostegui.json
  • 07:40 vgutierrez: rolling upgrade to ats 8.0.7-rc0-1wm1 in ulsfo
  • 07:39 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1092 T232446', diff saved to https://phabricator.wikimedia.org/P10963 and previous config saved to /var/cache/conftool/dbconfig/20200413-073939-marostegui.json
  • 07:17 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1110 T249973', diff saved to https://phabricator.wikimedia.org/P10962 and previous config saved to /var/cache/conftool/dbconfig/20200413-071740-marostegui.json
  • 06:51 marostegui: Deploy schema changes on db1110 - T249973
  • 06:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1110 T249973', diff saved to https://phabricator.wikimedia.org/P10961 and previous config saved to /var/cache/conftool/dbconfig/20200413-065022-marostegui.json
  • 06:36 elukey: temporarily stopped puppet on restbase2014 to avoid attempts to start cassandra on each run - T250050
  • 06:23 vgutierrez: upgrade to ats 8.0.7-rc0-1wm1 on cp[4026,4032,5006,5012]
  • 06:20 vgutierrez: upload trafficserver 8.0.7-rc0-1wm1 to apt.wm.o (buster)
  • 05:25 vgutierrez: restart varnish-fe on cp3050

2020-04-12

  • 11:11 vgutierrez: restart ats-tls on cp5008.eqsin.wmnet - T249335
  • 10:18 elukey: restart wdqs-updater on wdqs1004 (logs show no reports from the past hours, the last ones were stack traces related to a JSON decode failure)
  • 06:59 dcausse: restarting blazegraph on wdqs1004 (T242453)
  • 06:35 elukey@puppetmaster1001: conftool action : set/pooled=no; selector: name=restbase1025.eqiad.wmnet
  • 06:32 elukey: powerdown restbase1025 - T250027
  • 06:21 elukey: powercycle restbase1025 (not reachable, serial console shows blank, racadm getsel reports errors with DIMM_B2)
  • 05:53 bblack: pushing https://gerrit.wikimedia.org/r/588134 to cache_text
  • 05:50 vgutierrez: restart ats-tls on cp[1077,1081,1083,1085].eqiad.wmnet - T249335

2020-04-11

  • 19:52 cdanis@cumin1001: dbctl commit (dc=all): 'slight deweight to db1111', diff saved to https://phabricator.wikimedia.org/P10960 and previous config saved to /var/cache/conftool/dbconfig/20200411-195235-cdanis.json
  • 17:35 cdanis@cumin1001: dbctl commit (dc=all): 's8: +weight db1111, -weight db1126', diff saved to https://phabricator.wikimedia.org/P10959 and previous config saved to /var/cache/conftool/dbconfig/20200411-173517-cdanis.json
  • 15:39 vgutierrez: restart ats-tls on cp[1077,1081,1083,1085].eqiad.wmnet - T249335
  • 09:30 elukey@cumin1001: END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0)
  • 09:20 elukey@cumin1001: START - Cookbook sre.presto.roll-restart-workers
  • 07:01 vgutierrez: restart ats-tls on cp[1079,1081,1083,1085].eqiad.wmnet - T249335

2020-04-10

  • 21:12 cdanis@cumin1001: dbctl commit (dc=all): 'db1111 seems overloaded', diff saved to https://phabricator.wikimedia.org/P10954 and previous config saved to /var/cache/conftool/dbconfig/20200410-211202-cdanis.json
  • 19:37 cdanis: cdanis@re0.cr1-codfw> clear bfd session address 208.80.153.220
  • 15:03 vgutierrez: restart ats-tls on cp1083 and cp1085 - T249335
  • 13:14 hashar@deploy1001: Finished deploy [zuul/deploy@4a69913]: (no justification provided) (duration: 00m 40s)
  • 13:14 hashar@deploy1001: Started deploy [zuul/deploy@4a69913]: (no justification provided)
  • 13:12 mutante: restarted and re-armed keyholder on deploy1001 to pick up changes for zuul scap deploy
  • 12:12 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 12:11 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 12:10 mutante: Creating VM people1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet with row=A vcpus=1 memory=2GB disk=80GB link=private. (T249907)
  • 12:10 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 12:10 mutante: Creating VM people1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet with row=A vcpus=1 memory=2GB disk=80GB link=private. This may take a few minutes.
  • 12:10 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 12:09 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 12:09 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 11:47 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'mathoid' for release 'canary' .
  • 11:47 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'mathoid' for release 'production' .
  • 11:44 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'mathoid' for release 'production' .
  • 11:39 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'mathoid' for release 'staging' .
  • 09:43 marostegui@cumin1001: dbctl commit (dc=all): 'Give more weight to db1089', diff saved to https://phabricator.wikimedia.org/P10953 and previous config saved to /var/cache/conftool/dbconfig/20200410-094359-marostegui.json
  • 09:31 marostegui@cumin1001: dbctl commit (dc=all): 'Give more weight to db1089', diff saved to https://phabricator.wikimedia.org/P10952 and previous config saved to /var/cache/conftool/dbconfig/20200410-093129-marostegui.json
  • 08:52 hashar@deploy1001: Finished deploy [zuul/deploy@4a69913]: (no justification provided) (duration: 00m 16s)
  • 08:51 hashar@deploy1001: Started deploy [zuul/deploy@4a69913]: (no justification provided)
  • 08:46 hashar@deploy1001: Finished deploy [zuul/deploy@5a0a03a]: (no justification provided) (duration: 02m 20s)
  • 08:44 hashar@deploy1001: Started deploy [zuul/deploy@5a0a03a]: (no justification provided)
  • 08:39 mutante: deploy1001 - keyholder disarm, keyholder arm
  • 08:32 mutante: fix comment in deployment ssh key for zuul to include the path to the key on deploy1001
  • 08:24 vgutierrez: update puppet compiler facts
  • 08:20 hashar@deploy1001: Finished deploy [integration/zuul/deploy@6c3ddad]: (no justification provided) (duration: 00m 11s)
  • 08:19 hashar@deploy1001: Started deploy [integration/zuul/deploy@6c3ddad]: (no justification provided)
  • 08:03 hashar@deploy1001: Finished deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided) (duration: 00m 05s)
  • 08:03 hashar@deploy1001: Started deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided)
  • 07:52 mutante: closing port 80 on phab hosts for caching servers
  • 07:37 ema: cp3050: back to vhtcpd for the holidays T249583
  • 07:00 mutante: sodium - sudo -u mirror ftpsync
  • 06:58 mutante: armed keyholder on deploy1001
  • 06:19 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:00 marostegui: Stop MySQL on pc1008 for upgrade

2020-04-09

  • 23:44 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 58s)
  • 23:27 catrope@deploy1001: Synchronized wmf-config/mobile.php: Drop fallback support for wgMobileFrontendLogo (T248500) (duration: 00m 58s)
  • 23:21 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Drop unused config for main page CSS (T243996) (duration: 00m 58s)
  • 23:17 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add extendedconfirmed group and protection level on jawiki (T249820) (duration: 00m 59s)
  • 22:01 sukhe: running initial metadb sync on cescout1001
  • 19:43 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 19:41 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 19:39 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 19:08 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.27 refs T247774
  • 19:01 longma: deploying 1.35.0-wmf.27 to all wikis
  • 17:50 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:50 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:50 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:50 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:40 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:24 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:18 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:39 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:32 XioNoX: disable down interfaces from fasw-c-codfw (mintaka)
  • 13:45 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:31 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:43 mlitn@deploy1001: Synchronized php-1.35.0-wmf.27/extensions/MachineVision/: [MachineVision] Fix statement creation from suggestion (duration: 01m 09s)
  • 12:31 ema: cp3051: upgrade varnish to 5.1.3-1wm13 once again, restart varnish-fe T249809
  • 11:57 XioNoX: offload more traffic from NTT eqiad - T249808
  • 11:20 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 587257|Enable ContentTranslation as a default tool in Slovenian WP (T248836), take II (duration: 01m 06s)
  • 11:19 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 587257|Enable ContentTranslation as a default tool in Slovenian WP (T248836) (duration: 01m 07s)
  • 10:50 vgutierrez: rolling upgrade to trafficserver 8.0.6-1wm7
  • 10:50 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:50 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 10:50 jmm@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 10:49 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 10:43 ema: repool cp3051 T249809
  • 10:30 ema: cp3051: re-enable transient storage limit, downgrade varnish to 5.1.3-1wm12 (no 0035-vbf_stp_condfetch_crash.patch) and restart varnish-fe T249809
  • 09:46 ema: cp3051: disable transient storage limit and restart varnish-fe T249809
  • 09:31 XioNoX: offload traffic from NTT eqiad - T249808
  • 07:56 mutante: contint2001 - a2dismod mpm_event - then run puppet to let it enable php_mod_7.3 (race condition like mentioned in https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206) (T224591)
  • 07:56 mutante: contint2001 - a2dismod mpm_event - then run puppet to let it enable php_mod_7.3 (race condition like mentioned in https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206)
  • 07:24 moritzm: synched jenkins 222.1 to apt.wikimedia.org (buster-wikimedia, thirdparty/ci) T224591
  • 07:12 marostegui: Repool labsdb1011
  • 07:10 XioNoX: switch urpf from log to syslog in ulsfo
  • 07:04 XioNoX: re-activate BGP to Zayo in eqiad
  • 06:59 vgutierrez: upgrade ats to version 8.0.6-1wm7 in cp[4026,4032,5006,5012]
  • 06:45 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:43 XioNoX: confirmed on one host that the change didn't break logstash. Re-enable Puppet on logstash hosts - T244147
  • 06:42 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:36 XioNoX: disabling puppet on logstash host for CR deploy - T244147
  • 06:30 XioNoX: push urpf log only to eqiad - T244147
  • 06:25 XioNoX: push urpf log only to eqsin - T244147
  • 06:21 XioNoX: push urpf log only to AMS - T244147
  • 05:40 vgutierrez: upgrade ats to version 8.0.6-1wm6 in cp[4025,4031,5005,5011] - T249335
  • 05:37 marostegui: Stop MySQL on pc2008 for upgrade to Buster and 10.4
  • 05:36 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Depool pc2008 for upgrade (duration: 01m 08s)
  • 05:08 marostegui: Deploy schema change on db1123
  • 05:07 vgutierrez: upload trafficserver 8.0.6-1wm6 to apt.wm.o (buster) - T249335

2020-04-08

  • 21:20 jforrester@deploy1001: Synchronized php-1.35.0-wmf.27/extensions/TemplateData/includes/TemplateDataHooks.php: Restore call to OutputPage::setupOOUI() (duration: 01m 07s)
  • 21:19 jforrester@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/TemplateData/includes/TemplateDataHooks.php: Restore call to OutputPage::setupOOUI() (duration: 01m 09s)
  • 20:09 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 20:09 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 20:06 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 20:06 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 20:04 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 20:04 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 19:51 gehel: restart wdqs-updater after deployment
  • 19:49 mstyles@deploy1001: Finished deploy [wdqs/wdqs@c2995eb]: WDQS version 0.3.21 (duration: 14m 37s)
  • 19:44 dpifke@deploy1001: Finished deploy [performance/navtiming@4acb04d]: Deploy new navtiming with First Input Delay metric https://phabricator.wikimedia.org/T238091 (duration: 00m 05s)
  • 19:44 dpifke@deploy1001: Started deploy [performance/navtiming@4acb04d]: Deploy new navtiming with First Input Delay metric https://phabricator.wikimedia.org/T238091
  • 19:35 mstyles@deploy1001: Started deploy [wdqs/wdqs@c2995eb]: WDQS version 0.3.21
  • 19:08 jhuneidi@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.27 refs T247774 (duration: 01m 06s)
  • 19:07 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.27 refs T247774
  • 19:02 longma: deploying 1.35.0-wmf.27 to group1
  • 18:37 jforrester@deploy1001: Synchronized php-1.35.0-wmf.27/skins/Vector: T248761: Revert moving indicators in DOM (duration: 01m 07s)
  • 18:17 reedy@deploy1001: Synchronized php-1.35.0-wmf.27/extensions/TemplateData/includes/TemplateDataHooks.php: T236809 (duration: 01m 06s)
  • 18:16 reedy@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/TemplateData/includes/TemplateDataHooks.php: T236809 (duration: 01m 10s)
  • 17:31 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:23 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:13 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:13 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:16 ema: cache_upload: rolling varnish-fe restarts to bump transient storage limit T185968
  • 15:21 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:19 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:11 ema: cp3051: param.set shortlived=0 to try to ease pressure on transient memory
  • 14:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1112 after schema change', diff saved to https://phabricator.wikimedia.org/P10947 and previous config saved to /var/cache/conftool/dbconfig/20200408-142341-marostegui.json
  • 14:14 jeh@deploy1001: Finished deploy [horizon/deploy@0d18f67]: update horizon submodule to enable server groups (duration: 03m 30s)
  • 14:10 jeh@deploy1001: Started deploy [horizon/deploy@0d18f67]: update horizon submodule to enable server groups
  • 13:40 mutante: stopped and masked zuul-merger service on contint2001 via puppet (T224591)
  • 13:30 ema: cp3050: stop vhtcpd, start purged T249583
  • 13:22 vgutierrez: enable inbound TLSv1.3 in text@ulsfo - T170567
  • 13:05 ema: purged 0.1 uploaded to buster-wikimedia T249583
  • 12:31 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: re-sync (duration: 01m 07s)
  • 12:29 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable GrowthExperiments suggested edits on uk, hu, hy, eu wikipedias (T247308) (duration: 01m 08s)
  • 12:17 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 584135: Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias (T238295) (duration: 01m 08s)
  • 12:09 tgr@deploy1001: Synchronized wmf-config/: SWAT: Enable GrowthExperiments on French Wiktionary (T235964) (duration: 01m 06s)
  • 11:56 tgr@deploy1001: Synchronized dblists/: SWAT: Enable GrowthExperiments on French Wiktionary (T235964) (duration: 01m 03s)
  • 11:48 mutante: logstash1009 - restarted logstash
  • 11:43 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable WikibaseQualityConstraints on test commons (T248117) (duration: 01m 05s)
  • 11:43 marostegui: Deploy schema change on db1112, this will generate lag on labs s3
  • 11:43 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P10942 and previous config saved to /var/cache/conftool/dbconfig/20200408-114315-marostegui.json
  • 11:39 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1078 after schema change', diff saved to https://phabricator.wikimedia.org/P10941 and previous config saved to /var/cache/conftool/dbconfig/20200408-113901-marostegui.json
  • 11:29 tgr@deploy1001: Synchronized wmf-config/: SWAT: Deploy GrowthExperiments on Serbian Wikipedia (T241181) (duration: 01m 06s)
  • 11:28 tgr@deploy1001: Synchronized dblists/: SWAT: Deploy GrowthExperiments on Serbian Wikipedia (T241181) (duration: 01m 17s)
  • 11:05 XioNoX: push urpf log only to codfw - T244147
  • 10:39 jbond42: restarting idp.wikimedia.org
  • 10:14 marostegui: Deploy schema change on db1078
  • 10:14 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1078 for schema change', diff saved to https://phabricator.wikimedia.org/P10940 and previous config saved to /var/cache/conftool/dbconfig/20200408-101431-marostegui.json
  • 09:30 jynus: stopping and removing db1095:s8 instance
  • 09:20 godog: upgrade grafana on cloudmetrics hosts - T244208
  • 09:17 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1075 after schema change', diff saved to https://phabricator.wikimedia.org/P10939 and previous config saved to /var/cache/conftool/dbconfig/20200408-091728-marostegui.json
  • 09:11 gehel: setting weight=10 for all pooled wdqs servers in codfw - T246343
  • 09:10 marostegui: Reload proxies on dbproxy1018 and dbproxy1019 to depool labsdb1011 - T249188 T248592
  • 09:07 gehel: pooling wdqs200[78] - new servers ready to go! - T246343
  • 08:46 marostegui: Rename wb_terms and recreate views on labsdb1009-labsdb1011 - T248592 T248086
  • 08:39 godog: upgrade grafana on grafana1002 - T244208
  • 08:17 _joe_: switching parsoid to envoy (take 2) in eqiad
  • 07:23 marostegui: Deploy schema change on db1075
  • 07:23 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1075 for schema change', diff saved to https://phabricator.wikimedia.org/P10937 and previous config saved to /var/cache/conftool/dbconfig/20200408-072331-marostegui.json
  • 06:31 marostegui: Deploy schema change on db1095:3313
  • 06:11 marostegui: Stop haproxy on dbproxy1011 - T231520
  • 05:44 vgutierrez: rolling upgrade ATS to 8.0.6-1wm6 in cp[5006,5012,3065,3064,2042,2041,1090,1089]
  • 05:34 marostegui: Deploy schema change on dbstore1004:3313
  • 05:33 _joe_: repooling wtp1025, with envoy and logging any error above 404 T249535
  • 04:36 vgutierrez: rolling restart of ats-tls - T249335

2020-04-07

  • 20:39 andrewbogott: correction: briefly downtiming ldap-eqiad-replica0 and ldap-eqiad-replica1. I'm trying to investigate a possible split-brain so going to turn ldap off on one, and then the other, to see if behavior changes
  • 20:37 andrewbogott: briefly downtiming serpens and seaborgium. I'm trying to investigate a possible split-brain so going to turn ldap off on one, and then the other, to see if behavior changes
  • 20:34 hoo: (Take 3) Temporarily modified dumpsgen's crontab on snapshot1008 so that the Wikidata RDF dumps start now (broke as a side effect of T249565)
  • 20:17 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.27 refs T247774
  • 20:09 jhuneidi@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.27 (duration: 60m 34s)
  • 20:08 hoo: (Take 2) Temporarily modified dumpsgen's crontab on snapshot1008 so that the Wikidata RDF dumps start now (broke as a side effect of T249565)
  • 19:45 hoo: Temporarily modified dumpsgen's crontab on snapshot1008 so that the Wikidata RDF dumps start now (broke as a side effect of T249565)
  • 19:13 XioNoX: push pfw firewall rules - T249650
  • 19:08 jhuneidi@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.27
  • 18:48 jhuneidi@deploy1001: Pruned MediaWiki: 1.35.0-wmf.24 (duration: 12m 44s)
  • 17:56 herron: increasing codfw.mediawiki.job.cirrusSearchElasticaWrite to 3 partitions T240702
  • 17:55 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (14.5/14.5h) retry (duration: 01m 02s)
  • 17:54 addshore: last sync stuck on sync-masters
  • 17:54 addshore@deploy1001: sync-file aborted: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (14.5/14.5h) (duration: 01m 16s)
  • 17:49 ppchelko@deploy1001: Started restart [cpjobqueue/deploy@83c93d1]: Try to make it notice new partitions T240702
  • 17:40 herron: increasing eqiad.mediawiki.job.cirrusSearchElasticaWrite to 3 partitions T240702
  • 16:24 longma: 1.35.0-wmf.27 was branched at e76ac29 for T247774
  • 16:16 hashar: restarting CI jenkins
  • 15:53 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:21 moritzm: installing idp-test2001
  • 15:20 XioNoX: enable uRPF loose mode (log only) on cr4-ulsfo - T244147
  • 15:17 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (12/14.5h) (duration: 01m 00s)
  • 15:10 ema: cp3052: stop purged, start vhtcpd T249583 T241232
  • 15:00 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:56 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (10/14.5h) (duration: 00m 55s)
  • 14:52 jeh: cloudvirt2003-dev: downtime in icinga and reboot to enable BIOS virtualization support T249453
  • 14:38 ema: cp3052: stop vhtcpd, start purged T249583
  • 14:35 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (8/14.5h) (duration: 00m 58s)
  • 14:25 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (4/14.5h) (duration: 00m 58s)
  • 14:15 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (2/14.5h) (duration: 00m 58s)
  • 14:08 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (1h) take 2 (duration: 00m 57s)
  • 13:57 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: REVERT T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (1h) (duration: 00m 58s)
  • 13:55 addshore@deploy1001: sync-file aborted: T249565 T249595 RejectParserCacheValue entries during wb_items_per_site drop incident (1h) (duration: 00m 29s)
  • 13:17 vgutierrez: restart ats-tls on cp3056 - T249335
  • 12:59 vgutierrez: restart ats-tls on cp3052 - T249335
  • 12:50 addshore: addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemsPerSite.php --wiki=wikidatawiki --file T249596-6.list > T249596-6.out # T249565
  • 12:43 addshore: addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemsPerSite.php --wiki=wikidatawiki --file T249596-5.list > T249596-5.out # T249565
  • 12:42 vgutierrez: restart ats-tls on cp3058 - T249335
  • 12:25 jmm@cumin2001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 12:06 addshore: addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemsPerSite.php --wiki=wikidatawiki --file T249596-4.list > T249596-4.out # T249565 T249596
  • 12:05 jmm@cumin2001: START - Cookbook sre.ganeti.makevm
  • 11:52 marostegui@cumin1001: dbctl commit (dc=all): 'repool db1126', diff saved to https://phabricator.wikimedia.org/P10932 and previous config saved to /var/cache/conftool/dbconfig/20200407-115228-marostegui.json
  • 11:51 marostegui@cumin1001: dbctl commit (dc=all): 'depool db1126', diff saved to https://phabricator.wikimedia.org/P10931 and previous config saved to /var/cache/conftool/dbconfig/20200407-115154-marostegui.json
  • 11:50 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1092, db1111, db1099:3318 after table rename', diff saved to https://phabricator.wikimedia.org/P10930 and previous config saved to /var/cache/conftool/dbconfig/20200407-115058-marostegui.json
  • 11:50 jynus: renaming wb_items_per_site_recovered to wb_items_per_site on s8
  • 11:45 jynus: stopping s8 replication on db1116:3318, db1095:3318, db2079
  • 11:42 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1092, db1111, db1099:3318 for table rename', diff saved to https://phabricator.wikimedia.org/P10929 and previous config saved to /var/cache/conftool/dbconfig/20200407-114258-marostegui.json
  • 11:36 Amir1: stopped the rebuild script (T249565)
  • 11:34 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: cleanup T203888, Remove old unused RejectParserCacheValue hook (duration: 00m 59s)
  • 11:09 marostegui: Deploy schema change on s3 codfw
  • 11:07 jynus: starting recovery on all s8 hosts
  • 10:45 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 10:41 addshore@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/Wikibase/repo/maintenance/rebuildItemsPerSite.php: T249565 T249596 Wikibase rebuildItemsPerSite.php script that allows lists of ids (duration: 01m 00s)
  • 10:27 jynus: starting recovery on db1099:3318
  • 09:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1119 after schema change', diff saved to https://phabricator.wikimedia.org/P10927 and previous config saved to /var/cache/conftool/dbconfig/20200407-095852-marostegui.json
  • 09:49 volans@deploy1001: Finished deploy [homer/deploy@887544c]: Release v0.2.0 (take 2) (duration: 00m 26s)
  • 09:49 volans@deploy1001: Started deploy [homer/deploy@887544c]: Release v0.2.0 (take 2)
  • 09:38 marostegui: Deploy schema change on db1119
  • 09:38 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1119 for schema change', diff saved to https://phabricator.wikimedia.org/P10926 and previous config saved to /var/cache/conftool/dbconfig/20200407-093820-marostegui.json
  • 09:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1134 after schema change', diff saved to https://phabricator.wikimedia.org/P10925 and previous config saved to /var/cache/conftool/dbconfig/20200407-093638-marostegui.json
  • 09:31 volans@deploy1001: Finished deploy [homer/deploy@b4522ad]: Release v0.2.0 (duration: 00m 16s)
  • 09:31 volans@deploy1001: Started deploy [homer/deploy@b4522ad]: Release v0.2.0
  • 09:29 volans@deploy1001: Finished deploy [homer/deploy@ac7a818]: Inject plugins (take 3) (duration: 03m 03s)
  • 09:26 volans@deploy1001: Started deploy [homer/deploy@ac7a818]: Inject plugins (take 3)
  • 09:19 marostegui: Deploy schema change on db1134
  • 09:18 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1134 for schema change', diff saved to https://phabricator.wikimedia.org/P10924 and previous config saved to /var/cache/conftool/dbconfig/20200407-091847-marostegui.json
  • 09:17 volans@deploy1001: Finished deploy [homer/deploy@a03d7cd]: Inject plugins (take 2) (duration: 00m 29s)
  • 09:17 volans@deploy1001: Started deploy [homer/deploy@a03d7cd]: Inject plugins (take 2)
  • 09:04 vgutierrez: testing ATS 8.0.6-1wm6 on cp4026 and cp4032
  • 08:58 volans@deploy1001: Finished deploy [homer/deploy@a03d7cd]: Inject plugins (duration: 04m 59s)
  • 08:53 volans@deploy1001: Started deploy [homer/deploy@a03d7cd]: Inject plugins
  • 08:46 XioNoX: enable uRPF loose mode (log only) on cr3-ulsfo v4 uplinks - T244147
  • 08:44 XioNoX: enable uRPF loose mode (log only) on cr3-ulsfo v6 uplinks - T244147
  • 08:42 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 08:37 mutante: decom ganeti VM miscweb1001 (stretch) - kept backup of old racktables files and db dump in /root/racktables on miscweb1002 (T247648)
  • 08:33 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 08:31 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 08:30 mutante: decom ganeti VM miscweb2001 (stretch)
  • 08:30 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 08:26 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1106 after schema change', diff saved to https://phabricator.wikimedia.org/P10923 and previous config saved to /var/cache/conftool/dbconfig/20200407-082607-marostegui.json
  • 08:17 moritzm: installing php5 security updates
  • 08:06 marostegui: Deploy schema change on db1106 (this will generate lag on s1 labs)
  • 08:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1106 for schema change', diff saved to https://phabricator.wikimedia.org/P10922 and previous config saved to /var/cache/conftool/dbconfig/20200407-080533-marostegui.json
  • 08:04 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1080 after schema change', diff saved to https://phabricator.wikimedia.org/P10921 and previous config saved to /var/cache/conftool/dbconfig/20200407-080443-marostegui.json
  • 07:52 _joe_: disabling puppet on mwdebug1002
  • 07:47 marostegui: Failover dbproxy1011 to dbproxy1019 - T231520
  • 07:43 marostegui: Deploy schema change on db1080
  • 07:43 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1080 for schema change', diff saved to https://phabricator.wikimedia.org/P10920 and previous config saved to /var/cache/conftool/dbconfig/20200407-074321-marostegui.json
  • 07:41 dcausse@deploy1001: Finished deploy [wdqs/wdqs@23495ae]: deploying wdqs 0.3.17 to wdqs2002: T249196 (duration: 01m 28s)
  • 07:40 dcausse@deploy1001: Started deploy [wdqs/wdqs@23495ae]: deploying wdqs 0.3.17 to wdqs2002: T249196
  • 07:39 _joe_: depooling wtp1025, used for debugging
  • 07:31 vgutierrez: enable parent proxies in ats-tls - T249335
  • 07:19 jynus: restarting s3 on db1095
  • 07:02 moritzm: updating linux-image-4.9.0-11-amd64 where applicable
  • 06:55 elukey@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 06:53 elukey@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 06:52 elukey@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 06:37 moritzm: installing ruby2.1 security updates
  • 06:32 jynus: stopping slave (s3) on db1095
  • 05:38 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix database name for repo in testwikidata (T249533), take II (duration: 00m 58s)
  • 05:37 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix database name for repo in testwikidata (T249533) (duration: 01m 00s)
  • 05:26 elukey@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 01:08 jforrester@deploy1001: Synchronized php-1.35.0-wmf.26/maintenance/: T157651 Remove sql.php from maintenance/ (duration: 00m 58s)
  • 01:06 jforrester@deploy1001: Synchronized php-1.35.0-wmf.26/autoload.php: T157651 Remove sql.php from autoloader (duration: 00m 58s)
  • 01:05 jforrester@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/Wikibase/repo/includes/Store/Sql/DatabaseSchemaUpdater.php: T208425 T249565 Follow-up a956c655: Only avoid dropping wb_items_per_site so prod can be merged (duration: 00m 58s)
  • 00:01 addshore@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/Wikibase/repo/includes/Store/Sql/DatabaseSchemaUpdater.php: Do not try to drop things when there's no wb_terms table T208425 T249565 cache bust (duration: 01m 01s)

2020-04-06

  • 23:59 addshore@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/Wikibase/repo/includes/Store/Sql/DatabaseSchemaUpdater.php: Do not try to drop things when there's no wb_terms table T208425 T249565 (duration: 00m 59s)
  • 23:31 Amir1: ladsgroup@mwmaint1002:/srv/mediawiki-staging/php-1.35.0-wmf.26$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemsPerSite.php --wiki=wikidatawiki
  • 23:26 Amir1: created wb_items_per_site
  • 19:05 elukey@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 19:03 elukey@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 19:00 elukey@cumin1001: END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
  • 18:58 elukey@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 18:57 elukey@cumin1001: END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
  • 18:51 elukey@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 18:42 elukey@cumin1001: END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
  • 18:22 Urbanecm: Morning SWAT done
  • 18:19 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 335a924: Enable Local upload on azbwiki (T248971; take II) (duration: 00m 58s)
  • 18:16 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 335a924: Enable Local upload on azbwiki (T248971) (duration: 00m 59s)
  • 16:54 elukey@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 16:52 _joe_: parsoid migrated to use envoy for TLS termination
  • 16:24 _joe_: switching parsoid-php to envoy for TLS termination
  • 15:45 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: MachineVision: Label blacklist updates (T249285) (duration: 00m 58s)
  • 15:36 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 15:04 elukey@cumin1001: END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
  • 14:59 addshore: deploy slot done
  • 14:55 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: TEST: Test commons: Define entity sources configuration T248664 (cache bust) (duration: 00m 57s)
  • 14:54 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: TEST: Test commons: Define entity sources configuration T248664 (duration: 00m 57s)
  • 14:50 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: TEST: Wikibase, entity source, use modern repoDatabase and interwikiPrefix T248664 (cache bust) (duration: 00m 57s)
  • 14:49 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: TEST: Wikibase, entity source, use modern repoDatabase and interwikiPrefix T248664 (duration: 00m 58s)
  • 14:42 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1118 after schema change', diff saved to https://phabricator.wikimedia.org/P10912 and previous config saved to /var/cache/conftool/dbconfig/20200406-144220-marostegui.json
  • 14:41 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: TEST: Wikibase client entity source config T248664 (cache bust) (duration: 00m 58s)
  • 14:40 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: TEST: Wikibase client entity source config T248664 (duration: 00m 59s)
  • 14:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1118 after schema change', diff saved to https://phabricator.wikimedia.org/P10911 and previous config saved to /var/cache/conftool/dbconfig/20200406-143755-marostegui.json
  • 14:30 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1118 after schema change', diff saved to https://phabricator.wikimedia.org/P10910 and previous config saved to /var/cache/conftool/dbconfig/20200406-143042-marostegui.json
  • 14:26 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1118 after schema change', diff saved to https://phabricator.wikimedia.org/P10909 and previous config saved to /var/cache/conftool/dbconfig/20200406-142607-marostegui.json
  • 14:24 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: TEST: Wikibase entity source config for testwikidatawiki T248664 (cachebust) (duration: 00m 58s)
  • 14:23 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: TEST: Wikibase entity source config for testwikidatawiki T248664 (duration: 00m 59s)
  • 14:09 elukey@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 14:07 elukey@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
  • 14:07 elukey@cumin1001: START - Cookbook sre.wdqs.data-transfer
  • 13:47 sukhe: upload cescout 0.1.1-1 to apt.wm.o (buster) - T247273
  • 13:26 elukey: reboot stat1008 as test to verify ROCm 3.3 upgrades
  • 13:22 elukey: stat1008 upgraded to ROCm 3.3 (enables Tensorflow 2.x)
  • 13:05 ema: cache: upgrade varnish to 5.1.3-1wm13, begin rolling varnish-fe restarts T249344
  • 13:03 marostegui: Deploy schema change on db1118
  • 13:03 jbond42: updating gnutls on buster
  • 13:03 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1118 for schema change', diff saved to https://phabricator.wikimedia.org/P10906 and previous config saved to /var/cache/conftool/dbconfig/20200406-130320-marostegui.json
  • 13:02 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1107 after schema change', diff saved to https://phabricator.wikimedia.org/P10905 and previous config saved to /var/cache/conftool/dbconfig/20200406-130255-marostegui.json
  • 12:59 Urbanecm: Creation of grwikimedia is done (T245911)
  • 12:59 urbanecm@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 22s)
  • 12:55 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 77b9ae9: Create grwikimedia (duration: 00m 58s)
  • 12:54 urbanecm@deploy1001: Synchronized static/images/project-logos/: 77b9ae9: Create grwikimedia (duration: 00m 58s)
  • 12:53 marostegui: Deploy schema change on db1107
  • 12:53 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 for schema change', diff saved to https://phabricator.wikimedia.org/P10904 and previous config saved to /var/cache/conftool/dbconfig/20200406-125308-marostegui.json
  • 12:52 urbanecm@deploy1001: Synchronized multiversion/MWMultiVersion.php: 77b9ae9: Create grwikimedia (duration: 00m 58s)
  • 12:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1089 after schema change', diff saved to https://phabricator.wikimedia.org/P10903 and previous config saved to /var/cache/conftool/dbconfig/20200406-125222-marostegui.json
  • 12:46 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: 77b9ae9: Create grwikimedia
  • 12:44 urbanecm@deploy1001: Synchronized dblists/: 77b9ae9: Create grwikimedia (duration: 00m 59s)
  • 12:37 XioNoX: Update eqiad analytics filters with new APT IPs
  • 12:27 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 12:21 marostegui: Deploy schema change on db1089
  • 12:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1089 for schema change', diff saved to https://phabricator.wikimedia.org/P10902 and previous config saved to /var/cache/conftool/dbconfig/20200406-122123-marostegui.json
  • 12:20 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1105:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P10901 and previous config saved to /var/cache/conftool/dbconfig/20200406-122058-marostegui.json
  • 12:14 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 12:08 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 12:04 godog: test grafana 6.7.2 upgrade on grafana2001 - T244208
  • 11:57 awight: EU swat complete
  • 11:53 awight@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/TwoColConflict: SWAT: gerrit:586309: Backport talk page and EventLogging changes (T248243, T249404) (duration: 00m 59s)
  • 11:52 elukey@cumin1001: END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
  • 11:48 elukey@cumin1001: START - Cookbook sre.aqs.roll-restart
  • 11:48 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Create account creator and rollback groups on yowiki (T249487) (duration: 00m 59s)
  • 11:32 awight@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/ContentTranslation: SWAT: Avoid failure on restoring draft with no categories (T249400) (duration: 01m 02s)
  • 11:25 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: double-syncing (duration: 00m 58s)
  • 11:24 marostegui: Deploy schema change on db1105:3311
  • 11:24 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1105:3311 for schema change', diff saved to https://phabricator.wikimedia.org/P10900 and previous config saved to /var/cache/conftool/dbconfig/20200406-112417-marostegui.json
  • 11:21 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1099:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P10899 and previous config saved to /var/cache/conftool/dbconfig/20200406-112123-marostegui.json
  • 11:18 elukey: import AMD ROCm 3.3 packages in buster-wikimedia (component thirdparty/rocm33) - T247082
  • 11:17 awight@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: cirrus: Increase commonswiki near match weight (T245642) (duration: 00m 59s)
  • 11:11 awight@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: Whitelist X-Wikimedia-Debug header for cross-wiki API requests (T249107) (duration: 00m 59s)
  • 10:51 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 58s)
  • 10:50 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 01m 12s)
  • 09:50 XioNoX: push pfw firewall policies - T249267
  • 09:40 marostegui: Deploy schema change on db1099:3311
  • 09:39 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3311 for schema change', diff saved to https://phabricator.wikimedia.org/P10898 and previous config saved to /var/cache/conftool/dbconfig/20200406-093944-marostegui.json
  • 09:11 ema: cp2027: upgrade varnish to 5.1.3-1wm13 and restart varnish-fe T249344
  • 09:08 ema: upload varnish 5.1.3-1wm13 to buster-wikimedia on apt1001.wm.org T249344
  • 08:55 ariel@deploy1001: Finished deploy [dumps/dumps@ae1e705]: add prefetch test, fix multistream index file download link (duration: 00m 09s)
  • 08:55 ariel@deploy1001: Started deploy [dumps/dumps@ae1e705]: add prefetch test, fix multistream index file download link
  • 08:54 elukey: bootstrap wdqs200[7,8] - T246343
  • 08:50 marostegui: Deploy schema change on db1139:3311
  • 08:18 _joe_: conversion of codfw api done
  • 08:07 marostegui: Deploy schema change on dbstore1003:3311
  • 07:54 vgutierrez: rolling restart of ats-tls to disable wmf-analytics log - T249335 T237993
  • 07:50 dcausse: search index: deleting stale index wikidatawiki_content_1585224806 on cloudelastic:9243
  • 07:49 _joe_: eqiad API migrated to envoy for local TLS termination, now starting codfw
  • 07:35 elukey: restart elasticsearch_6@cloudelastic-chi-eqiad on cloudelastic1003 as attempt to fix heavy GC runs (old gen) - T231517
  • 07:35 marostegui: Rename wb_terms on eqiad excluding labsdb1009, labsdb1010, labsdb1011 - T248086
  • 07:06 marostegui: Rename wb_terms on codfw - T248086
  • 06:45 XioNoX: delete BGP to AS25074 in amsix
  • 06:36 _joe_: converting the api servers to envoy for TLS in eqiad
  • 06:30 marostegui: Upgrade dbproxy1019 - T231520
  • 06:18 marostegui: Deploy schema change on s1 codfw master, this will generate lag on codfw
  • 05:54 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 05:54 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
  • 05:50 vgutierrez: ats-tls restart in cp3056, cp3058 and cp3062 - T249335
  • 05:45 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P10897 and previous config saved to /var/cache/conftool/dbconfig/20200406-054559-marostegui.json
  • 05:18 marostegui: Deploy schema change on db1079 (this will generate lag on s7 labs)
  • 05:17 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1079 for schema change', diff saved to https://phabricator.wikimedia.org/P10896 and previous config saved to /var/cache/conftool/dbconfig/20200406-051744-marostegui.json
  • 05:16 vgutierrez: Enable inbound TLSv1.3 in upload@eqiad - T170567
  • 05:16 vgutierrez: Enable TLS Session Tickets on eqiad - T245616
  • 05:03 vgutierrez: ats-tls restart in cp1075, cp1081 and cp1087 - T249335

2020-04-03

  • 21:17 andrewbogott: upgraded wikitech-static to 1.34.1
  • 17:58 mutante: rsync home dirs from install1002 to apt1001:/srv/home_install1002...
  • 15:43 ema: cp3061: restart varnish-fe T249344
  • 15:30 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:19 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:18 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:18 ema: cp3057: restart varnish-fe T249344
  • 14:37 hashar: Restarting Jenkins for a CSP parameter T245658
  • 14:07 vgutierrez: restart ats-tls on cp1087 - T249335
  • 14:01 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1090:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P10882 and previous config saved to /var/cache/conftool/dbconfig/20200403-140132-marostegui.json
  • 13:55 vgutierrez: restart ats-tls on cp1075 and cp1081 - T249335
  • 12:49 marostegui: Deploy schema change on db1090:3317
  • 12:49 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1090:3317 for schema change', diff saved to https://phabricator.wikimedia.org/P10881 and previous config saved to /var/cache/conftool/dbconfig/20200403-124908-marostegui.json
  • 12:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1136 after schema change', diff saved to https://phabricator.wikimedia.org/P10880 and previous config saved to /var/cache/conftool/dbconfig/20200403-124827-marostegui.json
  • 12:45 dcausse@deploy1001: Finished deploy [wdqs/wdqs@23495ae]: deploying wdqs 0.3.17 to wdqs1007: testing T249196 (duration: 00m 43s)
  • 12:44 dcausse@deploy1001: Started deploy [wdqs/wdqs@23495ae]: deploying wdqs 0.3.17 to wdqs1007: testing T249196
  • 12:27 marostegui: Deploy schema change on db1136
  • 12:27 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1136 for schema change', diff saved to https://phabricator.wikimedia.org/P10879 and previous config saved to /var/cache/conftool/dbconfig/20200403-122716-marostegui.json
  • 12:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1094 after schema change', diff saved to https://phabricator.wikimedia.org/P10878 and previous config saved to /var/cache/conftool/dbconfig/20200403-122259-marostegui.json
  • 12:00 marostegui: Deploy schema change on db1094
  • 12:00 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1094 for schema change', diff saved to https://phabricator.wikimedia.org/P10877 and previous config saved to /var/cache/conftool/dbconfig/20200403-115959-marostegui.json
  • 11:59 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1098:3317 after schema change', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20200403-115854-marostegui.json
  • 11:40 marostegui: Deploy schema change on db1098:3317
  • 11:40 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1098:3317 for schema change', diff saved to https://phabricator.wikimedia.org/P10875 and previous config saved to /var/cache/conftool/dbconfig/20200403-114004-marostegui.json
  • 11:37 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1101:3317 after schema change', diff saved to https://phabricator.wikimedia.org/P10874 and previous config saved to /var/cache/conftool/dbconfig/20200403-113717-marostegui.json
  • 10:38 marostegui: Deploy schema change on db1101:3317
  • 10:38 urbanecm@deploy1001: Synchronized static/images/project-logos/: 861b267: Enable cswiki anniversary logo (T249173) (duration: 01m 02s)
  • 10:37 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3317 for schema change', diff saved to https://phabricator.wikimedia.org/P10872 and previous config saved to /var/cache/conftool/dbconfig/20200403-103746-marostegui.json
  • 09:32 marostegui: Deploy schema change on db1116:3317
  • 08:43 marostegui: Deploy schema change on dbstore1003:3317
  • 07:57 marostegui: Deploy schema change on s7 codfw master, this will generate lag on codfw
  • 06:55 XioNoX: add fastnetmon 1.1.4 to buster-wikimedia - T240658
  • 06:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1126 after schema change', diff saved to https://phabricator.wikimedia.org/P10870 and previous config saved to /var/cache/conftool/dbconfig/20200403-062529-marostegui.json
  • 05:21 marostegui: Deploy schema change on db1126
  • 05:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1126 for schema change', diff saved to https://phabricator.wikimedia.org/P10869 and previous config saved to /var/cache/conftool/dbconfig/20200403-052115-marostegui.json
  • 00:42 catrope@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/FlaggedRevs/: Fix logic for determining if pending edits were null (T249277) (duration: 01m 00s)

2020-04-02

  • 23:53 hoo: Started Wikibase rebuildItemsPerSite on mwmaint1002 for wikidatawiki. Can be killed at any time, if necessary.
  • 23:09 catrope@deploy1001: Synchronized wmf-config/CommonSettings.php: Don't try to grant 'oathauth-enable' to '*' (part 2) (T248282) (duration: 00m 58s)
  • 19:53 jforrester@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/Translate/specials/SpecialExportTranslations.php: T249258: Revert 'Special:ExportTranslations: Disallow exporting huge groups' (duration: 00m 59s)
  • 19:38 ppchelko@deploy1001: Finished deploy [restbase/deploy@7923c1f]: Update CSP headers for mobileapps T248431 (duration: 15m 13s)
  • 19:35 jforrester@deploy1001: Synchronized php-1.35.0-wmf.26/includes/MovePage.php: T248789 MovePage: Use correct Title when creating the null revision (duration: 00m 59s)
  • 19:30 hashar: docker-pkg update on contint hosts
  • 19:30 hashar@deploy1001: Finished deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided) (duration: 00m 12s)
  • 19:29 hashar@deploy1001: Started deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided)
  • 19:23 ppchelko@deploy1001: Started deploy [restbase/deploy@7923c1f]: Update CSP headers for mobileapps T248431
  • 19:05 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.26 refs T247773
  • 19:00 longma: promoting all to 1.35.0-wmf.26
  • 18:39 jhuneidi@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.26 refs T247773 (duration: 01m 05s)
  • 18:38 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.26 refs T247773
  • 18:37 longma: rolling group1 to 1.35.0-wmf.26
  • 18:27 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/MobileFrontend/: SWAT: 4e2a092: EditorGateway: Fix handling of null sectionId (T249169) (duration: 01m 09s)
  • 18:22 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.26/extensions/VisualEditor/modules/ve-mw: SWAT: 94ded03: Fix issues with treating section "numbers" as integers (T248795; T248968; T249112) (duration: 01m 10s)
  • 17:49 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@7650fbe]: Update mobileapps to 61977bd7 (duration: 03m 21s)
  • 17:45 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@7650fbe]: Update mobileapps to 61977bd7
  • 16:53 joal@deploy1001: Finished deploy [analytics/refinery@5b254c8] (thin): Regular analytics weekly train THIN [analytics/refinery@5b254c8] (duration: 00m 08s)
  • 16:53 joal@deploy1001: Started deploy [analytics/refinery@5b254c8] (thin): Regular analytics weekly train THIN [analytics/refinery@5b254c8]
  • 16:49 jforrester@deploy1001: Synchronized php-1.35.0-wmf.26/includes/actions/Action.php: T249162 Partially revert 'WikiPage/Article split. Rely on Article inside Action' (duration: 01m 07s)
  • 16:44 joal@deploy1001: Finished deploy [analytics/refinery@5b254c8]: Regular analytics weekly train [analytics/refinery@5b254c8] (duration: 13m 50s)
  • 16:37 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 16:34 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 16:34 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 05s)
  • 16:33 jforrester@deploy1001: sync-file aborted: T249014 [siwiki] Change wgSitename to drop the ',' (duration: 00m 00s)
  • 16:32 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T249014 [siwiki] Change wgSitename to drop the ',' (duration: 01m 07s)
  • 16:30 joal@deploy1001: Started deploy [analytics/refinery@5b254c8]: Regular analytics weekly train [analytics/refinery@5b254c8]
  • 16:19 XioNoX: upgrade netflow4001's fastnetmon to 1.1.4 - T240658
  • 14:56 XioNoX: push new test switch config for cloudvirt2001 - T248425
  • 14:33 vgutierrez: Enable inbound TLSv1.3 in upload@codfw - T170567
  • 14:33 vgutierrez: Enable TLS Session tickets in codfw - T245616
  • 14:24 jbond42: updating bluez on ganeti and cloudvirt
  • 14:23 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1111 after schema change', diff saved to https://phabricator.wikimedia.org/P10865 and previous config saved to /var/cache/conftool/dbconfig/20200402-142338-marostegui.json
  • 14:18 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1111 after schema change', diff saved to https://phabricator.wikimedia.org/P10864 and previous config saved to /var/cache/conftool/dbconfig/20200402-141802-marostegui.json
  • 14:13 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1111 after schema change', diff saved to https://phabricator.wikimedia.org/P10863 and previous config saved to /var/cache/conftool/dbconfig/20200402-141335-marostegui.json
  • 14:11 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1111 after schema change', diff saved to https://phabricator.wikimedia.org/P10862 and previous config saved to /var/cache/conftool/dbconfig/20200402-141149-marostegui.json
  • 13:50 marostegui: Compress wbqc_constraints on testcommonswiki and commonswiki (empty tables) - T248967
  • 13:44 vgutierrez: update puppet compiler facts
  • 13:40 marostegui: Deploy schema change on db1111
  • 13:39 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1111 for schema change', diff saved to https://phabricator.wikimedia.org/P10861 and previous config saved to /var/cache/conftool/dbconfig/20200402-133956-marostegui.json
  • 13:32 gehel: OSM data reimport on maps2004 - T249086
  • 12:55 mutante: mw1390 - mw1399 - pooled and active but status "staged" in netbox, fixing to 'active'
  • 12:52 mutante: mw1297 - is pooled and serving traffic but status "staged" in netbox. set to "active"
  • 11:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1087 after schema change', diff saved to https://phabricator.wikimedia.org/P10858 and previous config saved to /var/cache/conftool/dbconfig/20200402-114020-marostegui.json
  • 11:06 mutante: decom planet1001 (T248863)
  • 10:56 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 10:55 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 10:19 marostegui: Deploy schema change on db1087, this will generate lag on s8 on wiki replicas
  • 10:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1087 for schema change', diff saved to https://phabricator.wikimedia.org/P10857 and previous config saved to /var/cache/conftool/dbconfig/20200402-101920-marostegui.json
  • 10:17 elukey: set up TLS encryption for all pmacct instances on netflow* to Kafka Jumbo
  • 10:17 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1104 after schema change', diff saved to https://phabricator.wikimedia.org/P10856 and previous config saved to /var/cache/conftool/dbconfig/20200402-101747-marostegui.json
  • 09:47 marostegui: Remove haproxy@10.64.37.14 from labsdb hosts - T231280 T248944
  • 09:44 gehel: CORRECTION: depool maps2004 for data reimport - T249086
  • 09:40 gehel: depool wdqs2004 for data reimport - T249086
  • 09:33 oblivian@deploy1001: Finished deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided) (duration: 00m 18s)
  • 09:32 oblivian@deploy1001: Started deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided)
  • 09:28 oblivian@deploy1001: Finished deploy [docker-pkg/deploy@4f86d77]: (no justification provided) (duration: 00m 09s)
  • 09:28 oblivian@deploy1001: Started deploy [docker-pkg/deploy@4f86d77]: (no justification provided)
  • 08:51 marostegui: Deploy schema change on db1104
  • 08:50 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1104 for schema change', diff saved to https://phabricator.wikimedia.org/P10854 and previous config saved to /var/cache/conftool/dbconfig/20200402-085057-marostegui.json
  • 08:50 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1092 after schema change', diff saved to https://phabricator.wikimedia.org/P10853 and previous config saved to /var/cache/conftool/dbconfig/20200402-085019-marostegui.json
  • 08:28 gehel: repooling wdqs1006 - caught up on lag
  • 08:22 vgutierrez: Enable inbound TLSv1.3 in upload@esams - T170567
  • 08:21 vgutierrez: Enable TLS Session tickets in esams - T245616
  • 07:45 moritzm: bounced ferm on ms-be1040
  • 07:27 marostegui: Deploy schema change on db1092
  • 07:27 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1092 for schema change', diff saved to https://phabricator.wikimedia.org/P10850 and previous config saved to /var/cache/conftool/dbconfig/20200402-072730-marostegui.json
  • 07:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1101:3318 after schema change', diff saved to https://phabricator.wikimedia.org/P10849 and previous config saved to /var/cache/conftool/dbconfig/20200402-072500-marostegui.json
  • 05:49 marostegui: Deploy schema change on db1101:3318
  • 05:49 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3318 for schema change', diff saved to https://phabricator.wikimedia.org/P10848 and previous config saved to /var/cache/conftool/dbconfig/20200402-054931-marostegui.json
  • 05:29 elukey: powercycle analytics1045 (host not responsive to ssh, weird chars showed in mgmt serial console)

2020-04-01

  • 22:44 volker-e@deploy1001: Finished deploy [design/style-guide@4bfe647]: Deploy design/style-guide: (duration: 00m 08s)
  • 22:43 volker-e@deploy1001: Started deploy [design/style-guide@4bfe647]: Deploy design/style-guide:
  • 22:02 volans: forcing logrotate on netflow2001 to compress yesterday's logs
  • 21:53 volans: force-rebooting ms-be1023, unresponsive - T249174
  • 21:50 volans: stopped and restarted kafkatee-webrequest.service on netflow2001, was in a restart loop
  • 19:48 marxarelli: rollback of 1.35.0-wmf.26 from group1 (T247773). blocked by T249162
  • 19:30 dduvall@deploy1001: rebuilt and synchronized wikiversions files: rollback 1.35.0-wmf.26 from group1
  • 19:21 dduvall@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.26 (duration: 01m 06s)
  • 19:20 dduvall@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.26
  • 19:18 marxarelli: promoting group1 to 1.35.0-wmf.26
  • 17:21 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕐☕ homer 'cr*eqord*' commit 'enable sampling on eqord Iac15379cc'
  • 16:54 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕐☕ homer 'cr*eqdfw*' commit 'enable sampling on eqdfw Iac15379cc'
  • 16:39 vgutierrez: pool cp2027 - T248816
  • 16:31 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:28 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:17 ariel@deploy1001: Finished deploy [dumps/dumps@21363c1]: page range prefetch fixup (duration: 00m 09s)
  • 16:17 ariel@deploy1001: Started deploy [dumps/dumps@21363c1]: page range prefetch fixup
  • 15:33 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 15:31 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 15:31 vgutierrez@cumin1001: END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
  • 15:31 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 15:29 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:29 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:27 vgutierrez: depool & decommission cp20[16,19,23,27] - T249125
  • 15:22 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1099:3318 after schema change', diff saved to https://phabricator.wikimedia.org/P10845 and previous config saved to /var/cache/conftool/dbconfig/20200401-152258-marostegui.json
  • 15:11 herron: performing kafka-main rolling restarts to pick up security updates
  • 14:52 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:50 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:49 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:49 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:46 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:46 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:43 vgutierrez: depool && decommission cp[2018,2020,2022,2024-2026].codfw.wmnet - T249115
  • 14:32 gehel: depooling wdqs1006 to allow catching up on lag
  • 14:30 vgutierrez: pool cp2042 - T248816
  • 14:16 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:13 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:09 XioNoX: remove AS-path prepending in esams
  • 13:47 XioNoX: remove AS-path prepending in eqsin
  • 13:39 vgutierrez: pool cp2041 - T248816
  • 13:34 mutante: sodium (mirror): sudo -u mirror ftpsync to get Debian mirror updated (Icinga says it's old)
  • 13:24 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:24 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:17 marostegui: Deploy schema change on db1099:3318
  • 13:17 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3318 for schema change', diff saved to https://phabricator.wikimedia.org/P10843 and previous config saved to /var/cache/conftool/dbconfig/20200401-131719-marostegui.json
  • 13:13 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 12:19 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 12:19 tgr@deploy1001: Synchronized wmf-config/config: SWAT: Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments (T248844) (duration: 01m 06s)
  • 12:18 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:18 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:17 tgr@deploy1001: Synchronized dblists/growthexperiments.dblist: SWAT: Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments (T248844) (duration: 01m 05s)
  • 12:17 XioNoX: restart nfacct on netflow4001 for kafka tls tests - T248980
  • 12:15 vgutierrez: depool & decommission cp2013 - T249088
  • 12:14 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: re-sync (duration: 01m 06s)
  • 12:12 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable password-reset-update on all other than Wikipedias (T245791) (duration: 01m 07s)
  • 12:09 marostegui: Deploy schema change on db1116:3318
  • 12:05 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [SDC] Revert enabling WikibaseQualityConstraints on Commons take 2 (duration: 01m 08s)
  • 12:04 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [SDC] Revert enabling WikibaseQualityConstraints on Commons (duration: 01m 05s)
  • 11:54 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 4968501: Restrict short URL management log to stewards (T221073; take II) (duration: 01m 05s)
  • 11:53 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 4968501: Restrict short URL management log to stewards (T221073) (duration: 01m 07s)
  • 11:48 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [SDC] Enable WikibaseQualityConstraints on Commons take II (duration: 01m 06s)
  • 11:44 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [SDC] Enable WikibaseQualityConstraints on Commons (duration: 01m 18s)
  • 11:20 cormacparle__: created table wbqc_constraints on commonswiki
  • 11:03 jbond42: install bluez update on ganeti-canary and cloudvirt/cloudcontrol-dev
  • 11:01 mutante: planet1001 - reinstall OS to test install_server switch, ATS switched to planet1002 earlier
  • 10:47 marostegui: Deploy schema change on dbstore1005:3318
  • 10:25 vgutierrez: pool cp2040 - T248816
  • 10:16 oblivian@puppetmaster1001: conftool action : set/pooled=yes:weight=1; selector: service=canary
  • 09:55 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 09:46 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 09:37 marostegui: Deploy schema change on s8 codfw, this will generate lag on codfw
  • 09:35 XioNoX: Update install servers IPs (dhcp helpers + firewall rules) - T224576
  • 09:34 mutante: install_servers: DHCP_relay in routers and TFTP server in DHCP server config have been switched from install1002/2002 to install1003/2003 - doing a test install, but if any issues report on T224576
  • 09:26 marostegui: last entry was for db2093
  • 09:26 marostegui: Downgrade mariadb package from 10.4.12-2 to 10.4.12-1
  • 09:09 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:07 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:05 mutante: planet - the backend server has been switched from planet1001 (stretch) to planet1002 (buster) - T247651
  • 08:46 mutante: deneb, boron: systemctl reset-failed to clear up systemd state alerts
  • 08:43 marostegui: Stop haproxy on dbproxy1010 T248944
  • 08:37 jynus: restart bacula at backup1001
  • 08:30 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 08:30 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 08:28 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:28 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:28 vgutierrez: depool & decommission cp2017 - T249084
  • 08:21 vgutierrez: pool cp2039 - T248816
  • 08:09 marostegui: Deploy schema change on db1138 (s4 primary master)
  • 08:06 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:04 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:13 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1121 after schema change', diff saved to https://phabricator.wikimedia.org/P10841 and previous config saved to /var/cache/conftool/dbconfig/20200401-071339-marostegui.json
  • 07:12 vgutierrez: pool cp2038 - T248816
  • 06:38 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 06:38 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 06:36 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:36 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:36 vgutierrez: depool & decommission cp2012 - T249080
  • 06:24 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:22 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:39 marostegui: Deploy schema change on db1121 (this will create lag on s4 labs)
  • 05:38 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1121 for schema change', diff saved to https://phabricator.wikimedia.org/P10840 and previous config saved to /var/cache/conftool/dbconfig/20200401-053827-marostegui.json
  • 00:39 reedy@deploy1001: Synchronized docroot/mediawiki.org/xml/: Update http and prot rel links to https, fix link to sitelist in MW Core (duration: 01m 06s)
  • 00:12 reedy@deploy1001: Synchronized docroot/mediawiki.org/xml/: Add export-0.11 (duration: 01m 05s)

2020-03-31

  • 22:23 marxarelli: group0 to 1.35.0-wmf.26 (T247773); no rise in error rates following redeployment
  • 22:13 dduvall@deploy1001: rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.26
  • 22:07 dduvall@deploy1001: rebuilt and synchronized wikiversions files: testwiki to php-1.35.0-wmf.26 (T247773)
  • 21:54 dduvall@deploy1001: sync aborted: testwiki to php-1.35.0-wmf.26 (T247773) (duration: 07m 31s)
  • 21:47 dduvall@deploy1001: Started scap: testwiki to php-1.35.0-wmf.26 (T247773)
  • 21:46 jforrester@deploy1001: Synchronized php-1.35.0-wmf.26/includes/user/UserNameUtils.php: T249045 Use wfMessage in UserNameUtils::isUsable for now (duration: 00m 58s)
  • 21:05 eileen: process-control config revision is f80d248113 - (catch up dedupe now off - fyi MBeat )
  • 20:59 hashar: contint1001: manually reverted /lib/systemd/system/jenkins.service
  • 20:51 hashar: Restarting Jenkins for new CSP rules # T245658
  • 20:26 dduvall@deploy1001: rebuilt and synchronized wikiversions files: rolling back 1.35.0-wmf.26 testwiki deployment following significant increase in error rate (cc T247773)
  • 20:14 marxarelli: correction: RequestContext::getLanguage errors are for testwiki deployment, pre group0
  • 20:08 marxarelli: a slew of "ErrorException from line 334 of /srv/mediawiki/php-1.35.0-wmf.26/includes/context/RequestContext.php: PHP Warning: Recursion detected in RequestContext::getLanguage" after group0 deployment (cc T247773)
  • 20:04 dduvall@deploy1001: Finished scap: testwiki to php-1.35.0-wmf.26 and rebuild l10n cache (duration: 142m 48s)
  • 19:20 ariel@deploy1001: Finished deploy [dumps/dumps@713c297]: more filelist methods cleanup, sort prefetch possible files properly (duration: 00m 04s)
  • 19:20 ariel@deploy1001: Started deploy [dumps/dumps@713c297]: more filelist methods cleanup, sort prefetch possible files properly
  • 18:08 ariel@deploy1001: Finished deploy [dumps/dumps@8376c62]: bring snapshot1010 up to date (duration: 00m 05s)
  • 18:07 ariel@deploy1001: Started deploy [dumps/dumps@8376c62]: bring snapshot1010 up to date
  • 17:42 dduvall@deploy1001: Started scap: testwiki to php-1.35.0-wmf.26 and rebuild l10n cache
  • 17:40 dduvall@deploy1001: Pruned MediaWiki: 1.35.0-wmf.23 (duration: 26m 51s)
  • 17:38 elukey: restart elasticsearch_6@cloudelastic-chi-eqiad.service on cloudelastic1001 to see if it recovers from a thrashing/gc state - T231517
  • 16:30 marxarelli: 1.35.0-wmf.26 was branched at bec758b for T247773
  • 16:24 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 00s)
  • 16:15 vgutierrez: pool cp2037 - T248816
  • 15:39 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:36 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 15:35 mutante: decom mw1254 through mw1258 (last remaining old servers in rack D5, depooled a while ago and average response time is again under 200ms) T247780
  • 15:33 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 15:29 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:29 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:28 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 15:27 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 15:27 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:27 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:26 vgutierrez: depool & decommission cp2010 - T249002
  • 15:15 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 58s)
  • 15:14 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T245794 Enable DiscussionTools as a beta feature on four wikis (duration: 01m 00s)
  • 15:05 cdanis: cr1-eqiad: commit flex-flow-sizing T248394
  • 15:01 cdanis: cr2-eqiad: commit flex-flow-sizing T248394
  • 14:43 vgutierrez: pool cp2036 - T248816
  • 14:21 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw125[4-8].eqiad.wmnet
  • 14:20 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:20 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:15 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1091 after schema change', diff saved to https://phabricator.wikimedia.org/P10834 and previous config saved to /var/cache/conftool/dbconfig/20200331-141459-marostegui.json
  • 14:10 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:05 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw125[4-8].eqiad.wmnet
  • 13:31 vgutierrez: Enable TLS Session tickets in eqsin - T245616
  • 13:05 XioNoX: update nat on pfw3-codfw - T248906
  • 13:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:49 _joe_: switching all appserver canaries to envoy
  • 12:46 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:45 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:45 marostegui: Deploy schema change on db1091
  • 12:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1091 for schema change', diff saved to https://phabricator.wikimedia.org/P10833 and previous config saved to /var/cache/conftool/dbconfig/20200331-124452-marostegui.json
  • 12:34 _joe_: transitioning mw1261 to envoy
  • 12:23 vgutierrez: rolling upgrade of ATS to version 8.0.6-1wm5 - T248938
  • 11:30 Lucas_WMDE: EU SWAT done
  • 11:30 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: Disable TwoColConflict talk page workflow (T230231), take II (duration: 00m 57s)
  • 11:29 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: Disable TwoColConflict talk page workflow (T230231) (duration: 00m 58s)
  • 11:11 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable ContentTranslation in Lithuanian Wikipedia as a default tool (T248179), take II (duration: 00m 59s)
  • 11:10 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable ContentTranslation in Lithuanian Wikipedia as a default tool (T248179) (duration: 01m 00s)
  • 10:46 _joe_: disabled puppet on canary appservers, potentially dangerous change ahead
  • 10:19 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1084 after schema change', diff saved to https://phabricator.wikimedia.org/P10831 and previous config saved to /var/cache/conftool/dbconfig/20200331-101953-marostegui.json
  • 10:03 XioNoX: add BGP to AS41327 in AMS-IX
  • 09:49 XioNoX: push homer diffs to mr1-eqsin
  • 09:36 XioNoX: push homer diffs to mr1-eqiad
  • 09:19 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:15 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:10 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 09:09 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 09:05 vgutierrez: upload trafficserver 8.0.5-1wm6 to apt.wm.o (buster) - T248938
  • 09:00 vgutierrez: depool & decommission cp2011 - T248950
  • 08:44 vgutierrez: pool cp2035 - T248816
  • 08:31 mutante: signed puppet cert for planet1002.eqiad.wmnet
  • 08:29 marostegui: Depool db1084 for schema change
  • 08:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1084 for schema change', diff saved to https://phabricator.wikimedia.org/P10829 and previous config saved to /var/cache/conftool/dbconfig/20200331-082904-marostegui.json
  • 08:27 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1081 after schema change', diff saved to https://phabricator.wikimedia.org/P10828 and previous config saved to /var/cache/conftool/dbconfig/20200331-082711-marostegui.json
  • 08:17 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 08:08 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 08:01 XioNoX: delete unused ROA for ARIN v4 prefixes - T235886
  • 07:49 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 07:49 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:17 vgutierrez: pool cp2034 - T248816
  • 07:16 marostegui: Deploy schema change on db1081
  • 07:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1081 for schema change', diff saved to https://phabricator.wikimedia.org/P10827 and previous config saved to /var/cache/conftool/dbconfig/20200331-071547-marostegui.json
  • 07:14 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1103:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P10826 and previous config saved to /var/cache/conftool/dbconfig/20200331-071401-marostegui.json
  • 06:48 marostegui: Deploy schema change on db1103:3314
  • 06:47 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1103:3314 for schema change', diff saved to https://phabricator.wikimedia.org/P10825 and previous config saved to /var/cache/conftool/dbconfig/20200331-064707-marostegui.json
  • 06:46 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1097:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P10824 and previous config saved to /var/cache/conftool/dbconfig/20200331-064627-marostegui.json
  • 05:55 marostegui: Drop nova and nova_api from m5 master (db1133) - T248313
  • 05:55 kart_: Updated cxserver to 2020-03-30-145349-production (T248578)
  • 05:55 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 05:54 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 05:53 vgutierrez: depool && decommission cp2007 - T248941
  • 05:48 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 05:46 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 05:46 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:46 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 05:26 marostegui: Deploy schema change on db1097:3314
  • 05:13 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1097:3314 for schema change', diff saved to https://phabricator.wikimedia.org/P10822 and previous config saved to /var/cache/conftool/dbconfig/20200331-051354-marostegui.json
  • 00:26 eileen: civicrm revision changed from cf2e2c11c3 to 524b162174, config revision is 708198a154

2020-03-30

  • 23:30 cdanis: cr3-esams: commit flex-flow-sizing T248394
  • 23:20 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 58s)
  • 23:19 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Alphabetize wikis in each GrowthExperiments settings (duration: 00m 58s)
  • 23:16 cdanis: cr2-esams: commit flex-flow-sizing T248394
  • 23:08 cdanis: cdanis@cr3-knams# commit comment "sensible flow table sizes T248394"
  • 22:56 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 58s)
  • 22:53 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Provide wmgSiteLogoIcon (duration: 00m 57s)
  • 22:52 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Set wmgSiteLogoIcon for each project family and four special wikis (duration: 00m 58s)
  • 22:50 jforrester@deploy1001: Synchronized wmf-config/mobile.php: Set wgMobileFrontendLogo from wgLogos['icon'] if set (duration: 00m 59s)
  • 22:37 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 57s)
  • 22:36 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Split wgLogos setting into wmgSiteLogo1x etc. (duration: 00m 59s)
  • 22:33 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Construct wgLogos in CommonSettings so that projects can inherit values (duration: 01m 02s)
  • 19:55 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 19:55 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:36 ejegg: updated payments listener (standalone SmashPig) from dc0c6b208b to d80e4c5abd
  • 15:32 vgutierrez: pool cp2033 - T248816
  • 15:25 jeh: add icinga 2h downtime and soft reset iDRAC on labstore1005.mgmt.eqiad.wmnet T247965
  • 14:58 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:57 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:57 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:55 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:53 vgutierrez: depool & decommission cp2008 - T248864
  • 14:23 vgutierrez: pool cp2032 - T248816
  • 14:17 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:01 vgutierrez: depool & decommission cp2006 - T248856
  • 13:57 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:45 vgutierrez: pool cp2031 - T248816
  • 13:09 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:07 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
  • 13:07 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 13:06 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:56 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 12:56 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 12:53 vgutierrez: depool & decommission cp2005 - T248848
  • 12:26 cdanis: cdanis@re0.cr2-codfw# set chassis fpc 5 inline-services flex-flow-sizing cdanis@re0.cr2-codfw# commit comment "flex-flow-sizing T248394"
  • 12:24 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 12:23 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 12:21 vgutierrez: depool & decommission cp2004 - T248824
  • 12:03 XioNoX: delete unused ROA for ARIN v6 prefixes - T235886
  • 11:59 XioNoX: delete unused ROAs for RIPE prefixes - T235886
  • 11:42 mutante: miscweb2002 - race condition with apache2 mpm and php7.3 module met - a2dismod mpm_event ; systemctl restart apache2 ; puppet agent -tv (also see T196968, https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206) T247887
  • 11:37 mutante: miscweb2002 - installed OS, added to puppet, added role and ... sed -i 's/tin.eqiad/deployment.eqiad/g' /srv/deployment/iegreview/iegreview-cache/.config (T247648)
  • 11:30 marostegui: Deploy schema change on dbstore1004:3314
  • 11:22 XioNoX: delete ARIN allocations from RIPE's IRR - T235886
  • 11:11 Urbanecm: EU SWAT done
  • 11:10 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: ac7e625: Add collections.nmnh.si.edu to $wgCopyUploadsDomains (T248659; take II) (duration: 00m 58s)
  • 11:09 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: ac7e625: Add collections.nmnh.si.edu to $wgCopyUploadsDomains (T248659) (duration: 00m 58s)
  • 11:08 vgutierrez: pool cp2030 - T248816
  • 11:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: c8c06f9: Add 3 additional namespaces and associated talk pages to trwiktionary (T248734; take II) (duration: 00m 59s)
  • 11:06 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: c8c06f9: Add 3 additional namespaces and associated talk pages to trwiktionary (T248734) (duration: 00m 59s)
  • 10:43 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 10:34 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 10:33 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 10:33 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 09:59 hoo: Temporarily modified dumpsgen's crontab on snapshot1008 so that the Wikidata JSON dumps start at 9:59 UTC today (T248612)
  • 09:56 hoo@deploy1001: Synchronized php-1.35.0-wmf.25/extensions/Wikibase/repo/maintenance/DumpEntities.php: DumpEntities: Fix DB group default override (T248612) (duration: 01m 02s)
  • 09:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:15 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:30 vgutierrez: pool cp2029 - T248816
  • 08:12 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:12 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 08:12 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 08:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:53 vgutierrez: depool & decommission cp2002 - T248818
  • 07:48 marostegui: Run cloudcontrol1003:~# wmcs-wikireplica-dns to promote dbproxy1018 to wikireplicas active proxy T231520
  • 07:40 marostegui: Replace dbproxy1010 with dbproxy1011 for wiki replicas, analytics - T231520
  • 07:28 marostegui: Deploy schema change on labswiki (wikitech) - T248333
  • 07:26 marostegui: Deploy schema change on s4 codfw, this will generate lag on codfw - T248333
  • 07:17 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 07:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.decommission
  • 07:10 vgutierrez: depool and decommission cp2001 - T248815
  • 06:52 vgutierrez: pool cp2028 - T247340
  • 06:29 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:28 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1074 after schema change', diff saved to https://phabricator.wikimedia.org/P10813 and previous config saved to /var/cache/conftool/dbconfig/20200330-062858-marostegui.json
  • 06:26 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:04 marostegui: Deploy schema change on db1074 with replication, this will generate lag on s2 labs
  • 06:03 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1074 for schema change', diff saved to https://phabricator.wikimedia.org/P10812 and previous config saved to /var/cache/conftool/dbconfig/20200330-060338-marostegui.json
  • 05:40 vgutierrez: pool cp2027 - T247340
  • 05:13 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:10 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 04:55 vgutierrez: Enable TLS Session tickets in ulsfo - T245616
  • 04:32 vgutierrez: upgrade ATS to version 8.0.6-1wm4 on ulsfo - T245616

2020-03-29

  • 08:24 elukey: powercycle elastic1059 - mgmt/serial console stuck, no ssh - racadm getsel shows a lot of OEM errors occurred, nothing specific

2020-03-28

  • 16:54 elukey: restart yarn on analytics1071
  • 12:05 vgutierrez: preemptive restart of ats-tls on cp1081 and cp3062 - T248736
  • 11:32 vgutierrez: restart ats-tls on cp1077 - T248736
  • 08:34 vgutierrez: pool cp1089
  • 08:30 vgutierrez: restarting ats-tls on cp1089

2020-03-27

  • 20:51 ejegg: updated payments-wiki from db618f429d to 1640f5e21e
  • 15:15 andrew@deploy1001: Finished deploy [horizon/deploy@33e67f9]: fix Identity->Projects with keystone Queens (duration: 03m 35s)
  • 15:12 andrew@deploy1001: Started deploy [horizon/deploy@33e67f9]: fix Identity->Projects with keystone Queens
  • 14:41 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1129 after schema change', diff saved to https://phabricator.wikimedia.org/P10807 and previous config saved to /var/cache/conftool/dbconfig/20200327-144125-marostegui.json
  • 14:22 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1129 for schema change', diff saved to https://phabricator.wikimedia.org/P10806 and previous config saved to /var/cache/conftool/dbconfig/20200327-142240-marostegui.json
  • 14:19 moritzm: updating linux-image-4.9.0-11-amd64 where applicable
  • 13:30 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1076 after schema change', diff saved to https://phabricator.wikimedia.org/P10805 and previous config saved to /var/cache/conftool/dbconfig/20200327-133022-marostegui.json
  • 13:07 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1076 for schema change', diff saved to https://phabricator.wikimedia.org/P10804 and previous config saved to /var/cache/conftool/dbconfig/20200327-130706-marostegui.json
  • 13:05 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1105:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P10803 and previous config saved to /var/cache/conftool/dbconfig/20200327-130542-marostegui.json
  • 12:49 Amir1: ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=labswiki --force "Ladsgroup" --interface-admin
  • 12:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1105:3312 for schema change', diff saved to https://phabricator.wikimedia.org/P10802 and previous config saved to /var/cache/conftool/dbconfig/20200327-122144-marostegui.json
  • 12:20 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1103:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P10801 and previous config saved to /var/cache/conftool/dbconfig/20200327-122058-marostegui.json
  • 12:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1103:3312 for schema change', diff saved to https://phabricator.wikimedia.org/P10800 and previous config saved to /var/cache/conftool/dbconfig/20200327-120234-marostegui.json
  • 11:54 hnowlan@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase-backend,name=restbase202[123].codfw.wmnet
  • 11:51 hnowlan@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase-ssl,name=restbase202[123].codfw.wmnet
  • 11:45 hnowlan@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase,name=restbase2023.codfw.wmnet
  • 11:44 hnowlan@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase,name=restbase2022.codfw.wmnet
  • 11:44 oblivian@puppetmaster1001: conftool action : edit; selector: dc=codfw,cluster=restbase,service=restbase-ssl,name=restbase202[1].codfw.wmnet
  • 11:44 hnowlan@puppetmaster1001: conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=restbase,service=restbase,name=restbase2021.codfw.wmnet
  • 10:55 mutante: revoke puppet cert webserver-misc-apps.discovery.wmnet and recreate with additional SANs for new VMs
  • 10:45 mutante: miscweb1002 - upload and unpack RackTables-0.21.4 (T247646 T247648)
  • 10:28 marostegui: Alter db2125 s2 to set page_restrictions to default NULL - T248333
  • 10:12 mutante: miscweb1002 - sed -i 's/tin.eqiad/deployment.eqiad/g' /srv/deployment/iegreview/iegreview-cache/.config T247648
  • 10:04 vgutierrez: upload trafficserver 8.0.6-1wm4 to apt.wm.o (buster) - T245616 T170567
  • 10:03 mutante: sodium - find /srv/mirrors/debian/ -user root -exec chown -h mirror:mirror {} \; (-h to also fix symbolic links); sudo -u mirror ftpsync (T248660)
  • 10:02 marostegui: Alter db2084:3315 enwikivoyage.page to set page_restrictions to default NULL - T248333
  • 10:01 marostegui: Alter db1096:3315 enwikivoyage.page to set page_restrictions to default NULL - T248333
  • 09:37 mutante: sodium - running ftpsync as user mirror (T248660)
  • 09:36 mutante: sodium fixing root owned files in /srv/mirrors/debian to be owned by mirror:mirror (T248660)
  • 09:32 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P10799 and previous config saved to /var/cache/conftool/dbconfig/20200327-093214-marostegui.json
  • 09:31 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1098:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P10798 and previous config saved to /var/cache/conftool/dbconfig/20200327-093106-marostegui.json
  • 07:58 marostegui: Deploy schema change on s2 codfw - this will generate lag on s2 codfw - T248333
  • 07:36 elukey: execute 'rm /etc/logrotate.d/ceph-common' on cloudvirt[1,2]* and cloudcontrol* to stop daily cronspam (file not in the puppet catalog anymore)
  • 07:32 moritzm: installing grub2 updates from Stretch point release
  • 07:23 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1130 after schema change', diff saved to https://phabricator.wikimedia.org/P10796 and previous config saved to /var/cache/conftool/dbconfig/20200327-072334-marostegui.json
  • 07:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1130 for schema change', diff saved to https://phabricator.wikimedia.org/P10795 and previous config saved to /var/cache/conftool/dbconfig/20200327-070224-marostegui.json
  • 07:00 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1082 after schema change', diff saved to https://phabricator.wikimedia.org/P10794 and previous config saved to /var/cache/conftool/dbconfig/20200327-070014-marostegui.json
  • 06:31 marostegui: Deploy schema change on db1082, this will generate lag on s5 labs
  • 06:30 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1082 for schema change', diff saved to https://phabricator.wikimedia.org/P10793 and previous config saved to /var/cache/conftool/dbconfig/20200327-063042-marostegui.json

2020-03-26

  • 23:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: ce63a4e: Enable wmgUseFooterContactLink for cswiki (T248584; take II) (duration: 00m 57s)
  • 23:05 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: ce63a4e: Enable wmgUseFooterContactLink for cswiki (T248584) (duration: 00m 58s)
  • 22:51 krinkle@deploy1001: Synchronized php-1.35.0-wmf.25/includes/user/UserRightsProxy.php: I9121f5aae (4/4) (duration: 00m 58s)
  • 22:50 krinkle@deploy1001: Synchronized php-1.35.0-wmf.25/includes/search/SearchMySQL.php: I9121f5aae (3/4) (duration: 00m 58s)
  • 22:48 krinkle@deploy1001: Synchronized php-1.35.0-wmf.25/includes/objectcache/SqlBagOStuff.php: I9121f5aae (2/4) (duration: 00m 58s)
  • 22:44 krinkle@deploy1001: Synchronized php-1.35.0-wmf.25/includes/jobqueue/jobs/RecentChangesUpdateJob.php: I9121f5aae (1/4) (duration: 01m 00s)
  • 22:05 ejegg: updated fundraising CiviCRM from f1cb23e809 to cf2e2c11c3
  • 21:43 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.25/extensions/MachineVision: Fix: Stop sorting label suggestions by Wikidata ID in ApiQueryImageLabels (duration: 01m 00s)
  • 21:34 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 21:32 cdanis: cdanis@re0.cr1-eqsin# set chassis afeb slot 0 inline-services flex-flow-sizing; cdanis@re0.cr1-eqsin# commit comment "flex-flow-sizing T248394"
  • 21:31 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:30 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 21:27 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:27 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@f34260c]: Update mobileapps to 3f30f20c (duration: 03m 07s)
  • 21:24 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@f34260c]: Update mobileapps to 3f30f20c
  • 21:15 cdanis: repool ulsfo
  • 21:12 cdanis: applied flow-table-size configuration to cr4-ulsfo which did not need a reboot to apply it T248394
  • 20:51 cdanis: cdanis@cr3-ulsfo> request system reboot
  • 20:36 cdanis: depool ulsfo
  • 16:52 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:50 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:43 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:40 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:34 XioNoX: stop exchanging full BGP view between eqiad and codfw - T246721
  • 16:19 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:18 XioNoX: stop advertising 208.80.152.0/22 from eqiad - T246721
  • 16:15 mutante: signing puppet cert for miscweb1002, installed buster, added insetup role (T247887)
  • 16:15 ebernhardson: set cloudelastic-chi wikidatawiki_content to 0 replicas while reindexing
  • 16:14 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:14 moritzm: rebooting mw2150 for some tests
  • 16:12 XioNoX: stop advertising 2620:0:860::/46 from eqiad - T246721
  • 16:12 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:11 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:11 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:10 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:58 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 15:53 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 15:51 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:51 moritzm: installing grub2 updates from Stretch point release
  • 15:49 XioNoX: start advertising 208.80.154.0/23 from eqiad - T246721
  • 15:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:48 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:46 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:40 XioNoX: start advertising 2620:0:861::/48 from eqiad - T246721
  • 15:20 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:17 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:15 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:12 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:10 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 15:02 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:02 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:02 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:02 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:02 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:02 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:02 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:02 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:01 mutante: T247887 - create Ganeti VM miscweb1002.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row C with 1 vCPUs, 2GB of RAM, 20GB of disk in the private network.
  • 15:01 aborrero@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:01 aborrero@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:59 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 14:59 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 14:59 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 14:50 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 14:47 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 14:26 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 14:23 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:56 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1110 after schema change', diff saved to https://phabricator.wikimedia.org/P10787 and previous config saved to /var/cache/conftool/dbconfig/20200326-135625-marostegui.json
  • 13:29 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1110 for schema change', diff saved to https://phabricator.wikimedia.org/P10786 and previous config saved to /var/cache/conftool/dbconfig/20200326-132940-marostegui.json
  • 13:01 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1097:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P10785 and previous config saved to /var/cache/conftool/dbconfig/20200326-130122-marostegui.json
  • 12:57 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: eventgate-main to use envoy T244843 (duration: 01m 07s)
  • 12:33 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1097:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P10784 and previous config saved to /var/cache/conftool/dbconfig/20200326-123302-marostegui.json
  • 12:31 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1096:3315 after schema change', diff saved to https://phabricator.wikimedia.org/P10783 and previous config saved to /var/cache/conftool/dbconfig/20200326-123157-marostegui.json
  • 12:25 mutante: analytics1028 - performing a puppet change on every run (all other hosts doing this were fixed just recently)
  • 12:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1096:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P10782 and previous config saved to /var/cache/conftool/dbconfig/20200326-121859-marostegui.json
  • 11:38 awight: EU SWAT done
  • 11:37 awight@deploy1001: Synchronized php-1.35.0-wmf.25/extensions/TwoColConflict: SWAT: Two hotfixes for guided tour (T248465) (duration: 01m 07s)
  • 11:25 mutante: sodium - running ftpsync to get Debian mirror in sync
  • 11:23 dcausse@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T231517: [cirrus] force cloudelastic replica count to 1 (duration: 01m 05s)
  • 11:21 dcausse@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T231517: [cirrus] force cloudelastic replica count to 1 (duration: 01m 06s)
  • 11:12 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.25/extensions/ContentTranslation/modules/ui/mw.cx.ui.Categories.js: SWAT: 1ea6bad: Allow publishing to continue even with broken categories (T248302) (duration: 01m 07s)
  • 11:06 urbanecm@deploy1001: Synchronized wmf-config/throttle.php: SWAT: d1bb0b1: Removed expired throttle.php entries (duration: 01m 09s)
  • 11:00 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 10:58 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 10:54 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 10:16 XioNoX: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - T207753
  • 09:50 elukey: reboot stat1008 - gpu + drivers in a weird state after multiple tests
  • 09:00 XioNoX: push v4 conditional advertising on cr3-knams - T236785
  • 08:44 marostegui: Deploy schema change on s5 codfw, lag will show up on codfw - T248333
  • 08:27 XioNoX: troubleshot v6 conditional advertisement from cr3-knams - T236785
  • 07:58 XioNoX: remove BGP session to AS8001 in eqiad (down and not replying to email)
  • 07:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1085 after schema change', diff saved to https://phabricator.wikimedia.org/P10781 and previous config saved to /var/cache/conftool/dbconfig/20200326-074033-marostegui.json
  • 07:31 marostegui: Deploy schema change on db1085, lag will appear on s6 on labs
  • 07:30 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1085 for schema change', diff saved to https://phabricator.wikimedia.org/P10780 and previous config saved to /var/cache/conftool/dbconfig/20200326-073048-marostegui.json
  • 07:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1093 after schema change', diff saved to https://phabricator.wikimedia.org/P10779 and previous config saved to /var/cache/conftool/dbconfig/20200326-070746-marostegui.json
  • 06:59 marostegui: Deploy schema change on db1093
  • 06:59 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1093 for schema change', diff saved to https://phabricator.wikimedia.org/P10778 and previous config saved to /var/cache/conftool/dbconfig/20200326-065929-marostegui.json
  • 06:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1088 after schema change', diff saved to https://phabricator.wikimedia.org/P10777 and previous config saved to /var/cache/conftool/dbconfig/20200326-065814-marostegui.json
  • 06:48 marostegui: Deploy schema change on db1088
  • 06:47 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1088 for schema change', diff saved to https://phabricator.wikimedia.org/P10776 and previous config saved to /var/cache/conftool/dbconfig/20200326-064748-marostegui.json
  • 06:46 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1098:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P10775 and previous config saved to /var/cache/conftool/dbconfig/20200326-064648-marostegui.json
  • 06:39 marostegui: Deploy schema change on db1098:3316
  • 06:38 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1098:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P10774 and previous config saved to /var/cache/conftool/dbconfig/20200326-063844-marostegui.json
  • 06:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P10773 and previous config saved to /var/cache/conftool/dbconfig/20200326-063633-marostegui.json
  • 06:26 marostegui: Deploy schema change on db1096:3316
  • 06:26 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1096:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P10772 and previous config saved to /var/cache/conftool/dbconfig/20200326-062631-marostegui.json
  • 06:22 marostegui: Rename nova and nova_api tables on db1117:3325 - T248313
  • 00:06 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable Special:Investigate on testwiki (T247645) (duration: 03m 14s)

2020-03-25

  • 23:49 catrope@deploy1001: Synchronized wmf-config/CommonSettings.php: Add investigate to $wgAvailableRights (T247645) (duration: 03m 16s)
  • 23:42 catrope@deploy1001: Synchronized php-1.35.0-wmf.25/extensions/CheckUser/: Retry because mw1251 timed out, and it is a proxy (duration: 03m 15s)
  • 23:38 catrope@deploy1001: Synchronized php-1.35.0-wmf.25/extensions/CheckUser/: Add new investigate right (T247645) (duration: 03m 17s)
  • 22:21 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 22:21 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 22:16 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 22:16 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 22:10 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 22:10 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 22:05 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 22:05 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 22:05 rlazarus: updating eventgate-logging-external to envoy 1.13.1 T246868
  • 22:00 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 22:00 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 21:59 ppchelko@deploy1001: Finished deploy [restbase/deploy@a1c3be4] (dev-cluster): Remove experimental PCS endpoints (duration: 02m 57s)
  • 21:56 ppchelko@deploy1001: Started deploy [restbase/deploy@a1c3be4] (dev-cluster): Remove experimental PCS endpoints
  • 21:54 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 21:54 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 21:46 urandom: dropping unused Cassandra keyspaces -- T248018
  • 21:45 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 21:44 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 21:44 rlazarus: updating eventgate-analytics-external to envoy 1.13.1 T246868
  • 21:39 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:39 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 21:27 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:27 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 21:16 rlazarus: holding off on updating eventgate-analytics until EU time, to check on unexpected helmfile diffs T246868
  • 21:11 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:11 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 21:10 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:10 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 21:07 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:07 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 21:07 rlazarus: updating eventgate-analytics to envoy 1.13.1 T246868
  • 20:36 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 20:32 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 20:22 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 20:22 rlazarus: updating cxserver to envoy 1.13.1 T246868
  • 20:19 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' .
  • 20:19 rlazarus: updating citoid to envoy 1.13.1 T246868
  • 20:16 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 20:16 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 20:01 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 20:01 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 19:36 hasharDinner: Jenkins restarted on all machines
  • 19:30 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 19:30 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 19:29 rlazarus: updating eventstreams to envoy 1.13.1 T246868
  • 19:28 twentyafterfour: group1 looks good after deploying wmf.25 refs T233873
  • 19:27 hashar: upgrading Jenkins # T248122
  • 19:26 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.25 refs T233873
  • 19:26 twentyafterfour: scap sync-proxies failed on mw1251
  • 18:53 ppchelko@deploy1001: Finished deploy [restbase/deploy@a1c3be4]: Add restbase202[123] T244178 (duration: 14m 00s)
  • 18:39 ppchelko@deploy1001: Started deploy [restbase/deploy@a1c3be4]: Add restbase202[123] T244178
  • 18:39 ppchelko@deploy1001: Finished deploy [restbase/deploy@777b881]: Remove experimental PCS endpoints (duration: 14m 28s)
  • 18:24 ppchelko@deploy1001: Started deploy [restbase/deploy@777b881]: Remove experimental PCS endpoints
  • 18:21 tgr@deploy1001: Synchronized php-1.35.0-wmf.25/extensions/GrowthExperiments/modules/homepage/: re-sync, mw1251 failed (duration: 03m 18s)
  • 18:13 tgr@deploy1001: Synchronized php-1.35.0-wmf.25/extensions/GrowthExperiments/modules/homepage/: SWAT: Mentorship module: Update for root screen refactor (T248422) (duration: 03m 23s)
  • 18:06 ppchelko@deploy1001: Finished deploy [changeprop/deploy@4bdf55b]: Stop rerendering experimental PCS endpoints (duration: 01m 40s)
  • 18:05 ppchelko@deploy1001: Started deploy [changeprop/deploy@4bdf55b]: Stop rerendering experimental PCS endpoints
  • 17:43 mvolz@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' .
  • 17:38 mvolz@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' .
  • 17:33 mvolz@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' .
  • 16:50 moritzm: installing python-bleach security updates
  • 16:47 moritzm: updated jenkins packages on apt.wikimedia.org to 2.222.1
  • 16:33 rzl@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
  • 16:32 sukhe: upload cescout 0.1.0-1 to apt.wm.o (buster) - T247273
  • 16:17 rzl@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
  • 16:15 rzl@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
  • 16:07 rlazarus: updating blubberoid to envoy 1.13.1 T246868
  • 15:21 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2115 after reimage to Buster', diff saved to https://phabricator.wikimedia.org/P10767 and previous config saved to /var/cache/conftool/dbconfig/20200325-152148-marostegui.json
  • 15:14 moritzm: installing deneb.codfw.wmnet T248165
  • 14:51 cdanis: repool codfw T248394
  • 14:46 mutante: closed port 80 for caching servers on misc backends https://gerrit.wikimedia.org/r/q/topic:%22applayer-tls%22+(status:open%20OR%20status:merged) as final step per service on T210411
  • 14:39 mutante: static microsites (annual.wikimedia.org, research.wikimedia.org, static-bugzilla etc). closed port 80 for caching servers, finalizing switch to https behind caching servers
  • 14:08 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:06 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:53 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 13:48 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:26 _joe_: cumin A:puppetmaster 'apt-get -y install puppet-common'
  • 13:03 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:02 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:56 marostegui: Deploy schema change on db1139:3316
  • 12:45 marostegui: Stop MySQL on db2115 for reimage to buster
  • 11:50 cdanis: cr1-codfw: `set chassis fpc 5 inline-services flex-flow-sizing` and `request chassis fpc restart slot 5` T248394
  • 11:46 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2115 for upgrade', diff saved to https://phabricator.wikimedia.org/P10763 and previous config saved to /var/cache/conftool/dbconfig/20200325-114655-marostegui.json
  • 11:39 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 11:37 mutante: decom mw1250 - mw1253
  • 11:37 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 11:35 cdanis: depool codfw for router maintenance T248394
  • 11:33 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 11:32 mutante: decom mw1232 - mw1235
  • 11:31 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 11:27 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw125[0-3].eqiad.wmnet
  • 11:26 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw123[2-5].eqiad.wmnet
  • 11:22 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:22 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:21 Urbanecm: EU SWAT done
  • 11:21 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:21 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:20 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw123[2-5].eqiad.wmnet
  • 11:20 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw125[0-3].eqiad.wmnet
  • 11:19 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: 59412db: Add gwtoolset to available rights to allow granting to global groups (duration: 01m 07s)
  • 11:12 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: 7b8d7c5: TwoColConflict: Limited default deployment CommonSettings.php (T244863) (duration: 01m 06s)
  • 11:10 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 81cda0f: TwoColConflict: Limited default deployment InitialiseSettings.php (T244863; take II) (duration: 01m 06s)
  • 11:08 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 81cda0f: TwoColConflict: Limited default deployment InitialiseSettings.php (T244863) (duration: 01m 17s)
  • 11:08 jynus@cumin1001: dbctl commit (dc=all): 'Reduce db1091 load, increase main traffic on all other s4 instances', diff saved to https://phabricator.wikimedia.org/P10762 and previous config saved to /var/cache/conftool/dbconfig/20200325-110821-jynus.json
  • 10:55 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1137', diff saved to https://phabricator.wikimedia.org/P10761 and previous config saved to /var/cache/conftool/dbconfig/20200325-105503-marostegui.json
  • 10:39 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1137', diff saved to https://phabricator.wikimedia.org/P10760 and previous config saved to /var/cache/conftool/dbconfig/20200325-103938-marostegui.json
  • 10:37 XioNoX: change aggregate policy for 2620:0:862::/48 on cr3-knams - T236785
  • 10:19 XioNoX: change aggregate policy for v4 prefixes on cr2-eqdfw - T236785
  • 10:04 oblivian@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 10:04 oblivian@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 09:56 XioNoX: change aggregate policy for 2620:0:860::/46 on cr2-eqdfw - T236785
  • 09:54 vgutierrez: Enable inbound TLSv1.3 on upload@eqsin - T170567
  • 09:27 jmm@cumin2001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 09:23 vgutierrez: upgrade ATS to 8.0.6-1wm3 on upload@eqsin - T170567
  • 09:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1137', diff saved to https://phabricator.wikimedia.org/P10759 and previous config saved to /var/cache/conftool/dbconfig/20200325-091421-marostegui.json
  • 09:02 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1137', diff saved to https://phabricator.wikimedia.org/P10758 and previous config saved to /var/cache/conftool/dbconfig/20200325-090227-marostegui.json
  • 08:55 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:53 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:38 marostegui: Reimage db1137
  • 08:18 marostegui: Reboot db1117 for full-upgrade
  • 08:15 oblivian@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 08:15 oblivian@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 08:14 _joe_: upgrading all eventgate-main to envoy 1.13.1 T246868
  • 08:12 marostegui: Stop all mysql daemons on db1117
  • 07:50 oblivian@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 07:50 oblivian@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 07:42 XioNoX: reboot scs-eqsin for CPU usage
  • 07:20 jmm@cumin2001: START - Cookbook sre.ganeti.makevm
  • 07:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1137 for upgrade', diff saved to https://phabricator.wikimedia.org/P10757 and previous config saved to /var/cache/conftool/dbconfig/20200325-070946-marostegui.json
  • 06:57 marostegui: Deploy schema change on db2129 (s6 codfw master)
  • 06:15 marostegui: Rename tables on db1133 (m5 master) nova_api database - T248313
  • 06:13 marostegui: Remove grants 'nova'@'208.80.154.23' on nova.* - T248313

2020-03-24

  • 20:53 cdanis: repool eqsin
  • 20:52 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Don't hard-set wgTmhUseBetaFeatures to true, let it vary by wiki (duration: 01m 07s)
  • 20:50 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 07s)
  • 20:49 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set wgTmhUseBetaFeatures to vary by wiki (duration: 01m 06s)
  • 20:35 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: Attempt #2: group0 wikis to 1.35.0-wmf.25 refs T233873
  • 20:32 twentyafterfour@deploy1001: Synchronized wmf-config: Now touch and sync again because of settings cache race condition. refs T248409 (duration: 00m 59s)
  • 20:31 cdanis: rebooting cr2-eqsin T248394
  • 20:30 twentyafterfour@deploy1001: Synchronized wmf-config: Now sync InitialiseSettings* refs T248409 (duration: 00m 59s)
  • 20:28 twentyafterfour@deploy1001: Synchronized wmf-config/CommonSettings.php: sync CommonSettings before InitialiseSettings refs T248409 (duration: 00m 58s)
  • 20:27 volans: force rebooting analytics1044 from console, host down and unreachable (ping, ssh, console)
  • 20:26 cdanis: commit flow-table-size on cr2-eqsin T248394
  • 20:19 cdanis: eqsin depooled for router maintenance at 16:15
  • 19:29 twentyafterfour@deploy1001: scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
  • 19:29 twentyafterfour: rolling back to wmf.24 due to high error rate refs T233873
  • 19:28 twentyafterfour@deploy1001: scap failed: average error rate on 7/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
  • 18:49 gehel: repooling wdqs1006, caught up on lag
  • 17:12 hashar@deploy1001: Finished scap: testwiki to 1.35.0-wmf.25 and rebuild l10n cache # T233873 (duration: 77m 52s)
  • 17:10 ebernhardson: update cloudelastic-chi replica counts from 2 to 1 T231517
  • 16:41 moritzm: installing linux-perf updates on stretch
  • 16:31 moritzm: installing linux-perf-4.19 updates on buster
  • 15:58 mutante: installing OS on otrs1001.eqiad.wmnet (T248028)
  • 15:55 hashar@deploy1001: Started scap: testwiki to 1.35.0-wmf.25 and rebuild l10n cache # T233873
  • 15:35 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:31 hashar@deploy1001: Pruned MediaWiki: 1.35.0-wmf.22 (duration: 02m 02s)
  • 15:29 hashar@deploy1001: Pruned MediaWiki: 1.35.0-wmf.21 (duration: 24m 00s)
  • 15:17 hashar: Cleaning old MediaWiki deployments # T233873
  • 15:03 hashar: Applied patches to 1.35.0-wmf.25 # T233873
  • 14:59 hashar: scap prep 1.35.0-wmf.25 # T233873
  • 14:55 gehel: depooling wdqs1006 to catch up on lag
  • 14:28 marostegui: Deploy schema change on db2117 (s6)
  • 14:26 hashar: Branching wmf/1.35.0-wmf.25 # T233873
  • 13:22 moritzm: installing glib2.0 updates from Stretch point release
  • 13:04 moritzm: installing mariadb-10.1 updates from Stretch point release (client/tools/libraries as packaged by Debian, different from wmf-mariadb)
  • 12:16 Urbanecm: mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Toroid~huwiki' 'Toroidt' (T248371)
  • 12:10 Urbanecm: mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Erika Greenberg' 'Copperqueen' (T248371)
  • 11:57 Urbanecm: mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Romy merdeka' 'Romy_Dwi_Laksono' (T248371)
  • 11:55 marostegui: Deploy schema change on db2087 db2089 db2097
  • 11:34 Urbanecm: EU SWAT done
  • 11:29 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: e28c819: Enable visualeditor on hewiktionary by default (T248311; take II) (duration: 00m 59s)
  • 11:28 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: e28c819: Enable visualeditor on hewiktionary by default (T248311) (duration: 00m 59s)
  • 11:25 urbanecm@deploy1001: Synchronized dblists/visualeditor-nondefault.dblist: SWAT: e28c819: Enable visualeditor on hewiktionary by default (T248311) (duration: 01m 03s)
  • 10:08 gehel: restart blazegraph and updater on wdqs1004
  • 09:41 marostegui: Deploy schema change on db2076 (s6)
  • 08:39 marostegui: Rename nova database tables on db1133 (m5 master) - T248313
  • 08:25 marostegui: Rename wikidatawiki.wb_terms on db1104 - T248086
  • 07:33 elukey: restart update-openstack-mirror.service on sodium
  • 06:55 marostegui: Reboot dbproxy1018
  • 06:42 marostegui: Reboot dbproxy1019
  • 06:16 marostegui: Create empty database testreduce on m5 master T245408
  • 06:01 marostegui@cumin1001: dbctl commit (dc=all): 'Set db1087, vslow s8, with weight 1 as it originally had', diff saved to https://phabricator.wikimedia.org/P10753 and previous config saved to /var/cache/conftool/dbconfig/20200324-060133-marostegui.json

2020-03-23

  • 21:50 krinkle@deploy1001: Synchronized docroot/noc/css/vector.css: I627a0ddba5 (duration: 01m 02s)
  • 21:39 mholloway-shell@deploy1001: Finished deploy [recommendation-api/deploy@26aa5c3]: Update recommendation-api to 3141cb6 (duration: 03m 21s)
  • 18:45 Urbanecm: Morning SWAT done
  • 18:41 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0e535b1: InitialiseSettings - clean up groupOverrides layout / spacing (T231178; take II) (duration: 00m 59s)
  • 18:39 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0e535b1: InitialiseSettings - clean up groupOverrides layout / spacing (T231178) (duration: 01m 00s)
  • 18:35 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 6ca1593: wgCopyUploadsDomains: Fix supremecourt.gov (T248146; take II) (duration: 00m 59s)
  • 18:33 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 6ca1593: wgCopyUploadsDomains: Fix supremecourt.gov (T248146) (duration: 01m 00s)
  • 18:32 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.24/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: SWAT: cbda0e5: ApiVisualEditorEdit: Fix handling of minor parameter (T248257) (duration: 01m 00s)
  • 18:24 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: 212114e: Don't try to grant `oathauth-enable` to `*` (T248282) (duration: 00m 59s)
  • 18:19 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0c12fc2: wgCopyUploadsDomains: Add supremecourt.gov (T248146, take II) (duration: 00m 59s)
  • 18:18 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0c12fc2: wgCopyUploadsDomains: Add supremecourt.gov (T248146) (duration: 01m 00s)
  • 18:18 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:18 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 5eb70ac: Add configuration variable $wgRestAPIAdditionalRouteFiles (T247997; take II) (duration: 00m 59s)
  • 18:14 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 5eb70ac: Add configuration variable $wgRestAPIAdditionalRouteFiles (T247997) (duration: 01m 00s)
  • 18:09 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:09 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:08 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:05 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:05 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 17:57 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:57 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 16:31 ema: upload atskafka 0.5 to buster-wikimedia T237993
  • 15:59 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Re-enable client side error logging for group0 and hawwiki - T226986 (take 2) (duration: 00m 59s)
  • 15:56 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Re-enable client side error logging for group0 and hawwiki - T226986 (duration: 01m 00s)
  • 15:32 moritzm: installing mariadb-10.1 updates from Stretch point release (client/tools/libraries as packaged by Debian, different from wmf-mariadb)
  • 15:24 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:13 moritzm: installing freetype updates from Stretch point release
  • 15:04 otto@deploy1001: Synchronized php-1.35.0-wmf.24/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: clientError: Changes event fields (T226986) (take 2) (duration: 00m 59s)
  • 15:00 jynus@cumin1001: dbctl commit (dc=all): 'Remove db1089 for special groups (rc)', diff saved to https://phabricator.wikimedia.org/P10749 and previous config saved to /var/cache/conftool/dbconfig/20200323-150046-jynus.json
  • 15:00 otto@deploy1001: Synchronized php-1.35.0-wmf.24/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: clientError: Changes event fields (T226986) (duration: 01m 01s)
  • 14:46 jynus@cumin1001: dbctl commit (dc=all): 'Finish doubling db1107 main s1 traffic', diff saved to https://phabricator.wikimedia.org/P10748 and previous config saved to /var/cache/conftool/dbconfig/20200323-144612-jynus.json
  • 14:40 jynus@cumin1001: dbctl commit (dc=all): 'Increase db1107 main s1 traffic by 50%', diff saved to https://phabricator.wikimedia.org/P10747 and previous config saved to /var/cache/conftool/dbconfig/20200323-144005-jynus.json
  • 14:35 jynus@cumin1001: dbctl commit (dc=all): 'remove db1107 from special groups', diff saved to https://phabricator.wikimedia.org/P10746 and previous config saved to /var/cache/conftool/dbconfig/20200323-143536-jynus.json
  • 14:28 elukey@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:28 elukey@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 14:25 elukey@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:25 elukey@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 14:13 elukey@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 14:13 elukey@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 13:54 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:40 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Temporarily disable client side error logging for a deploy - T226986 (duration: 01m 01s)
  • 13:33 moritzm: installing python-cryptography updates from Stretch point release
  • 12:27 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:25 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:41 tgr@deploy1001: Synchronized php-1.35.0-wmf.24/extensions/OAuth/includes/frontend/specialpages/SpecialMWOAuthManageMyGrants.php: SWAT: Get consumerKey from consumerId not from acceptanceId (T247531) (duration: 01m 01s)
  • 11:32 ema: cp1081: restart prometheus-trafficserver-tls-exporter.service
  • 11:27 elukey: upload oozie 4.3.0-3 to thirdparty/bigtop14 on wikimedia-stretch - T244499
  • 10:37 jbond42: switch idp1001 to tlsproxy::envoy profile
  • 08:07 marostegui: Start m1 and m2 on db1117
  • 08:04 marostegui: Stop m1 and m2 on db1117 to transfer them to db1077 - this will trigger dbproxies IRC alert
  • 08:03 moritzm: installing python-cryptography bug fix updates from Stretch point release
  • 07:46 marostegui: Stop MySQL on db1077 (non used) for 10.4 upgrade and gtid_domain_id on multisource T149418

2020-03-22

  • 23:19 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: T248274 (duration: 01m 19s)
  • 04:37 gehel@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)

2020-03-20

  • 23:16 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 23:04 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 21:06 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 21:04 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 21:01 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 20:59 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 20:59 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 20:59 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 20:57 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 20:56 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 20:55 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 20:53 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 20:41 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw124[4-9].eqiad.wmnet
  • 20:40 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:40 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:40 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:40 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:40 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:39 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:37 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw123[0-1].eqiad.wmnet
  • 20:32 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw122[7-9].eqiad.wmnet
  • 20:18 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw124[4-9].eqiad.wmnet
  • 20:18 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw123[0-1].eqiad.wmnet
  • 20:18 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw122[7-9].eqiad.wmnet
  • 15:44 hashar@deploy1001: Synchronized php-1.35.0-wmf.24/includes/ActorMigration.php: Avoid upsert() log warning spam in ActorMigration due to unique key array format - T248147 (duration: 01m 01s)
  • 13:34 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 13:33 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 13:33 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 12:16 marostegui@cumin1001: dbctl commit (dc=all): 'Decrease db1087, vslow host weight in main, given that the CPU across s8 is now doing a lot better', diff saved to https://phabricator.wikimedia.org/P10741 and previous config saved to /var/cache/conftool/dbconfig/20200320-121628-marostegui.json
  • 11:52 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 11:10 elukey: upload oozie 4.3.0-2 packages to thirdparty/bigtop14 on wikimedia-stretch
  • 10:56 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 10:56 hnowlan@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 10:34 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 10:29 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 10:13 dcausse: repooling wdqs1006
  • 09:28 moritzm: rolling restart of FPM on mw1261-mw1265 for freetype update
  • 08:59 moritzm: installing freetype bugfix updates from stretch point release
  • 08:47 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool es1017', diff saved to https://phabricator.wikimedia.org/P10739 and previous config saved to /var/cache/conftool/dbconfig/20200320-084730-marostegui.json
  • 08:33 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1017', diff saved to https://phabricator.wikimedia.org/P10738 and previous config saved to /var/cache/conftool/dbconfig/20200320-083334-marostegui.json
  • 07:59 XioNoX: reorder LVS BGP neighbors and add descriptions - https://gerrit.wikimedia.org/r/576320
  • 07:48 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1017', diff saved to https://phabricator.wikimedia.org/P10737 and previous config saved to /var/cache/conftool/dbconfig/20200320-074816-marostegui.json
  • 07:46 elukey: upload hadoop_2.8.5-2 (and related debs) to thirdparty/bigtop14 on wikimedia-stretch (manually rebuilt via docker after patch backports from upstream)
  • 07:32 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1017', diff saved to https://phabricator.wikimedia.org/P10736 and previous config saved to /var/cache/conftool/dbconfig/20200320-073205-marostegui.json
  • 07:26 marostegui: Restart mysql on es1017 for upgrade - T239791
  • 07:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1017 for update T239791', diff saved to https://phabricator.wikimedia.org/P10735 and previous config saved to /var/cache/conftool/dbconfig/20200320-070945-marostegui.json
  • 07:09 marostegui@cumin1001: dbctl commit (dc=all): 'Promote es1014 to es3 master, this is a NOOP T239791', diff saved to https://phabricator.wikimedia.org/P10734 and previous config saved to /var/cache/conftool/dbconfig/20200320-070922-marostegui.json

2020-03-19

  • 22:15 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@794f099]: Update mobileapps to 99869f45 (duration: 05m 13s)
  • 22:10 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@794f099]: Update mobileapps to 99869f45
  • 19:14 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.24
  • 18:30 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.24/extensions/Wikibase/lib/includes/Store/ByIdDispatchingEntityInfoBuilder.php: Fix 'max' to Int32EntityId::MAX conversion (T247985), part II (duration: 01m 07s)
  • 18:24 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.24/extensions/Wikibase/data-access/src/SingleEntitySourceServices.php: Fix 'max' to Int32EntityId::MAX conversion (T247985), part I (duration: 01m 08s)
  • 17:47 mutante: releases/releases-jenkins - closed firewall hole to port 80 for caching servers - kept it open just for envoy from the backends - ATS speaks https to them meanwhile
  • 16:54 hashar@deploy1001: Synchronized php-1.35.0-wmf.24/extensions/RelatedArticles: Do not register "" as a style path, that breaks ResourceLoader - T248090 (duration: 01m 07s)
  • 16:01 jeh@deploy1001: Finished deploy [horizon/deploy@ad60c2b]: update horizon designate-dashboard submodule (duration: 03m 31s)
  • 15:57 jeh@deploy1001: Started deploy [horizon/deploy@ad60c2b]: update horizon designate-dashboard submodule
  • 15:19 andrew@deploy1001: deploy aborted: modest css change for the hiera editing dialog (take two -- I consistently forget to rebase before doing this) (duration: 00m 00s)
  • 14:54 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 14:52 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 14:48 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 14:48 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 13:32 hashar@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.24 (duration: 01m 07s)
  • 13:31 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.24
  • 13:11 marostegui: Rename testwikidatawiki.wb_terms on db1078 - T248086
  • 12:33 XioNoX: push frack fw policies T248004
  • 11:43 Lucas_WMDE: EU SWAT done
  • 11:40 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.24/includes/OutputPage.php: SWAT: OutputPage: Fix warning when setting wgUserNewMsgRevisionId (T248049) (duration: 01m 08s)
  • 11:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: e277d29: trwiki: Grant interface editors editprotected & editsemiprotected (T247672; take II) (duration: 01m 08s)
  • 11:13 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: e277d29: trwiki: Grant interface editors editprotected & editsemiprotected (T247672) (duration: 01m 07s)
  • 10:47 ema: upload atskafka 0.4 to buster-wikimedia T237993
  • 10:24 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.24/skins/Vector/skin.json: skins.vector.styles.legacy needs to define legacy feature (T247566) (duration: 01m 08s)
  • 10:01 ema: cp: rolling ats-tls-restart to apply log format changes T248067 T237993
  • 09:26 marostegui: m2 maintenance window done T246098
  • 09:03 akosiaris: restart gerrit on gerrit1001 T246098
  • 09:02 akosiaris: restart otrs-daemon, apache on mendelevium T246098
  • 09:01 akosiaris: restart recommendation-api on scb T246098
  • 09:00 marostegui: Restart m2 primary database master - T246098
  • 08:48 dcausse: depooling wdqs1006 to help it catch up on lag
  • 08:43 dcausse: restarting blazegraph on wdqs1006 (T242453)
  • 07:54 moritzm: installing cups updates from Stretch point release
  • 07:48 moritzm: installing libjaxen-java security updates from Stretch point release
  • 07:07 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Update pc1008 spare situation T247787 (duration: 01m 09s)
  • 06:49 elukey: execute 'sudo rm /etc/logrotate.d/ceph-common' on cloudvirt-dev and cloudcontrol-dev to stop daily cronspam
  • 06:46 marostegui: Deploy schema change on testcommonswiki.globalimagelinks (empty table) on the s4 master T243987
  • 06:33 marostegui: Upgrade db1132 without restarting T246098
  • 00:39 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.24 refs T233872
  • 00:31 twentyafterfour@deploy1001: Synchronized php-1.35.0-wmf.24/skins/Vector/includes/templates/index.mustache: deploy https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/581116 which reverts https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/581054 refs T248010 (duration: 01m 07s)
  • 00:18 eileen: civicrm revision changed from a1b2cbeac1 to 1c477ff07f, config revision is 37232d8460

2020-03-18

  • 23:31 twentyafterfour@deploy1001: Synchronized php-1.35.0-wmf.23/includes/TemplateParser.php: sync https://gerrit.wikimedia.org/r/c/mediawiki/core/+/581114/ refs T248010 (duration: 01m 07s)
  • 23:26 twentyafterfour@deploy1001: Synchronized php-1.35.0-wmf.24/includes/TemplateParser.php: sync https://gerrit.wikimedia.org/r/c/mediawiki/core/+/581115/ (duration: 01m 08s)
  • 22:22 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 22:18 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 21:56 Krinkle: krinkle@mw1385: scap pull # clean up AdHoc debugging for T248010
  • 21:16 brennen@deploy1001: Synchronized php-1.35.0-wmf.24/skins/Vector/includes/templates/index.mustache: Change master template to force cache invalidation of partials (duration: 01m 06s)
  • 21:11 brennen@deploy1001: Synchronized php-1.35.0-wmf.23/skins/Vector/includes/templates/index.mustache: Change master template to force cache invalidation of partials (duration: 01m 15s)
  • 20:04 volans@cumin1001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 19:58 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 19:49 hashar@deploy1001: rebuilt and synchronized wikiversions files: Ensure fleet wide consistency
  • 19:21 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 19:21 mutante: shutting down (decom cookbook) elnath.codfw.wmnet (T188544)
  • 19:20 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 19:15 fdans@deploy1001: Finished deploy [analytics/refinery@549f6a4]: deploying analytics refinery (duration: 15m 02s)
  • 19:11 hashar: 1.35.0-wmf.24 is on hold: too many blockers
  • 19:00 fdans@deploy1001: Started deploy [analytics/refinery@549f6a4]: deploying analytics refinery
  • 18:32 Lucas_WMDE: Morning SWAT done
  • 18:30 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:27 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: Update linter whitelist w/ parsoid11's IP address (T246833) (beta-only) (duration: 01m 04s)
  • 18:20 Lucas_WMDE: scap pull on mwdebug1001, attempting to fix mismatched wikiversions alert
  • 18:14 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: Add beta configuration for Wikibase reference formatting (T247416) (duration: 01m 08s)
  • 18:13 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:13 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/Wikibase.php: SWAT: Add beta configuration for Wikibase reference formatting (T247416), take II (duration: 01m 07s)
  • 18:11 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:11 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/Wikibase.php: SWAT: Add beta configuration for Wikibase reference formatting (T247416) (duration: 01m 07s)
  • 16:43 mutante: wtp1025 - Icinga alerted it's running out of disk - 'apt-get clean' lowered disk usage from 97% to 91%
  • 16:00 hashar@deploy1001: Finished scap: testwiki to 1.35.0-wmf.24 and rebuild l10n cache - T233872 (duration: 61m 23s)
  • 14:58 hashar@deploy1001: Started scap: testwiki to 1.35.0-wmf.24 and rebuild l10n cache - T233872
  • 14:41 vgutierrez: disable TLS session tickets in ulsfo - T245616 T170567
  • 14:29 godog: add debug to icinga2001 - T247538
  • 14:28 _joe_: restarted php-fpm on mw1283, was throwing SIGILL
  • 14:17 marostegui: Rename wb_terms on codfw hosts: s8 (wikidatawiki - db2081), s3 (testwikidatawiki - db2109), s4 (commonswiki, testcommonswiki - db2106) T208425
  • 14:06 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.23
  • 11:59 hashar@deploy1001: Synchronized php-1.35.0-wmf.24/includes/objectcache/ObjectCache.php: objectcache: Restore keyspace for LocalServerCache service - T247562 (duration: 01m 07s)
  • 11:57 hashar@deploy1001: Synchronized php-1.35.0-wmf.23/includes/objectcache/ObjectCache.php: objectcache: Restore keyspace for LocalServerCache service - T247562 (duration: 01m 10s)
  • 11:42 marostegui@cumin1001: dbctl commit (dc=all): 'Decrease db1087, vslow host weight in main, given that the CPU across s8 is now doing a lot better', diff saved to https://phabricator.wikimedia.org/P10715 and previous config saved to /var/cache/conftool/dbconfig/20200318-114259-marostegui.json
  • 11:17 ema: upload atskafka 0.3 to buster-wikimedia T237993
  • 11:16 kart_: EU Mid-day SWAT done
  • 11:11 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 579893|Enable ContentTranslation as a default tool in Malay, Azerbaijani and Estonian WPs (T246622, T246628, T246629), take II (duration: 01m 07s)
  • 11:10 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 579893|Enable ContentTranslation as a default tool in Malay, Azerbaijani and Estonian WPs (T246622, T246628, T246629) (duration: 01m 07s)
  • 10:58 _joe_: setting num_retries=0 on mw2224 for eventgate-analytics in envoy (T247484)
  • 10:58 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop writing to old term store (wb_terms table) in wikidata (T208425), take II (duration: 01m 06s)
  • 10:55 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop writing to old term store (wb_terms table) in wikidata (T208425) (duration: 01m 08s)
  • 10:52 _joe_: setting num_retries=0, idle_timeout=5s on mw2223 for eventgate-analytics in envoy (T247484)
  • 10:48 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop writing to old term store in testwikidatawiki (T208425), take II (duration: 01m 07s)
  • 10:45 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop writing to old term store in testwikidatawiki (T208425) (duration: 01m 07s)
  • 10:33 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Read from the new term store everywhere (T219123), take II (duration: 01m 07s)
  • 10:31 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Read from the new term store everywhere (T219123) (duration: 01m 07s)
  • 10:14 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Read from the new term store everywhere (T219123), take II (duration: 01m 07s)
  • 10:12 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Read from the new term store everywhere (T219123) (duration: 01m 08s)
  • 09:43 vgutierrez: enabling inbound TLSv1.3 in upload@ulsfo - T170567
  • 09:18 vgutierrez: enabling inbound TLSv1.3 in cp4026 - T170567
  • 08:44 marostegui: Start replication pc1008 from pc1010 to get some of the new keys so it is not fully empty - T247787
  • 08:14 vgutierrez: upgrade ATS to 8.0.6-1wm3 in ulsfo - T170567
  • 07:55 moritzm: installing remaining libxslt security updates
  • 07:40 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: eventgate-analytics to use envoy everywhere (duration: 01m 10s)
  • 07:08 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:05 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:31 marostegui: Reboot pc1008 to try to get its RAID redone - T247787
  • 00:31 Amir1: foreachwikiindblist medium deleteEqualMessages.php --delete (T247562)
  • 00:10 crusnov@deploy1001: Finished deploy [netbox/deploy@14256f9]: netbox 2.7.10 upgrade (duration: 02m 29s)
  • 00:08 crusnov@deploy1001: Started deploy [netbox/deploy@14256f9]: netbox 2.7.10 upgrade
  • 00:07 crusnov@deploy1001: Finished deploy [netbox/deploy@14256f9]: netbox 2.7.10 upgrade (duration: 01m 17s)
  • 00:06 crusnov@deploy1001: Started deploy [netbox/deploy@14256f9]: netbox 2.7.10 upgrade

2020-03-17

  • 22:49 Amir1: warming up cache for Q80M to Q88M for new term store on db1111, db1126, db1104, db1092 (T219123)
  • 22:17 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@0adead4]: Update mobileapps to ec6fd6e (duration: 06m 08s)
  • 22:11 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@0adead4]: Update mobileapps to ec6fd6e
  • 21:54 Krinkle: krinkle@mw2170$ disable-puppet (Testing for T99740)
  • 21:15 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Enable Depicts counting (again) (T247874) (duration: 01m 07s)
  • 21:10 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Enable Depicts counting (T247874) (duration: 01m 07s)
  • 20:50 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/WikimediaEditorTasks: Fix revert counting for non-language-specific counters, take 2 (T244974) (duration: 01m 12s)
  • 20:33 mutante: boron - systemctl start docker-reporter-k8s-images ; systemctl start docker-reporter-releng-images
  • 20:31 mutante: boron - had degraded systemd state in Icinga - systemctl start docker-reporter-base-images
  • 19:54 mutante: miscweb1001 - restarted ferm, reverted live hack
  • 19:53 ppchelko@deploy1001: Finished deploy [restbase/deploy@8db09ed]: Various PCS endpoints additions and fixes T247295 T247096 T244175 (duration: 14m 31s)
  • 19:51 mutante: miscweb1001 - testing if ferm 80 firewall hole is needed for envoy, temp. disabled puppet, restarted ferm
  • 19:38 ppchelko@deploy1001: Started deploy [restbase/deploy@8db09ed]: Various PCS endpoints additions and fixes T247295 T247096 T244175
  • 19:01 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q80M (T219123), take II (duration: 01m 06s)
  • 19:00 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q80M (T219123) (duration: 01m 07s)
  • 18:53 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.24/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: Do not lock rows when there's no term returned (T247553 T246898), To catch the train (duration: 01m 08s)
  • 18:50 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:45 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 18:45 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 18:41 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 18:39 mutante: removing mw1238 through mw1243 - decom with cookbook (T247780 T245099)
  • 18:38 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 18:38 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 18:37 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 18:35 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw123[8-9].eqiad.wmnet
  • 18:35 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw124[0-3].eqiad.wmnet
  • 18:29 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:01 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@b6bff94]: Update mobileapps to 3c73ca3 (duration: 06m 06s)
  • 18:00 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:58 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:56 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.23/languages/LanguageConverter.php: languages: Don't assume in LanguageConverter (T235360) (duration: 01m 07s)
  • 17:55 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@b6bff94]: Update mobileapps to 3c73ca3
  • 17:55 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:53 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw124[0-3].eqiad.wmnet
  • 17:53 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw123[89].eqiad.wmnet
  • 17:52 Amir1: warming up cache for Q70M to Q80M for new term store on db1111, db1126, db1104, db1092 (T219123)
  • 17:46 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: Do not lock rows when there's no term returned (T247553 T246898) (duration: 01m 07s)
  • 17:42 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:40 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:37 ejegg: updated payments-wiki from 86ce0361f9 to 72856949a1
  • 17:30 bearND: mobileapps deploy failed on canary, rolled back
  • 17:29 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@266e6da]: Update mobileapps to 6370784 (duration: 04m 00s)
  • 17:25 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@266e6da]: Update mobileapps to 6370784
  • 17:24 elukey@deploy1001: Finished deploy [analytics/superset/deploy@3f3ddcb]: Upgrade PyHive to 0.6.2 (duration: 00m 43s)
  • 17:24 elukey@deploy1001: Started deploy [analytics/superset/deploy@3f3ddcb]: Upgrade PyHive to 0.6.2
  • 17:18 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
  • 17:17 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1280.eqiad.wmnet
  • 17:10 jynus: purging some old rows on pc1010 in a screen session to buy some time T247788
  • 16:56 mutante: mw1280 - scap pull - had ancient mw version due to downtime
  • 16:46 mutante: mw1280 back after long downtime due to broken RAM, added back into puppet (T240187)
  • 16:36 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:36 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:36 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:36 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:56 brennen@deploy1001: rebuilt and synchronized wikiversions files: Reverting All wikis to 1.35.0-wmf.23
  • 15:52 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 15:52 brennen@deploy1001: sync-wikiversions aborted: All wikis to 1.35.0-wmf.23 (duration: 05m 16s)
  • 15:51 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 15:50 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 15:44 brennen@deploy1001: sync-wikiversions aborted: All wikis to 1.35.0-wmf.23 (duration: 03m 49s)
  • 15:36 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 15:36 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 15:23 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 15:11 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 15:01 hashar: scap prep 1.35.0-wmf.24 and applying security patches # T233872
  • 15:00 otto@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 14:57 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 14:44 dcausse: wdqs1010 (test server) is running a data-reload cookbook (and is probably taking longer than the expected downtime)
  • 14:38 hashar: mediawiki/core git push 68bc9300dc:wmf/1.35.0-wmf.24 to catch up with a change that got merged while branch is being cut # T233872
  • 14:29 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q70M (T219123), take II (duration: 01m 04s)
  • 14:28 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q70M (T219123) (duration: 01m 10s)
  • 14:24 marostegui: Stop mysql and restart pc1008 T247787
  • 14:23 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 14:21 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 14:14 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: Store item terms as late as possible to avoid deadlocks (T247553 T246898) (duration: 01m 07s)
  • 14:13 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 14:12 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 14:09 herron@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:07 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:07 herron@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:06 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:03 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 13:41 hashar: Branching 1.35.0-wmf.24 # T233872
  • 13:30 godog: stop puppet and turn on debug on icinga2001 - T247538
  • 12:06 cdanis@cumin1001: END (PASS) - Cookbook sre.network.cf (exit_code=0)
  • 12:06 cdanis@cumin1001: START - Cookbook sre.network.cf
  • 11:46 godog: test pinning icinga to a subset of cpu on icinga1001
  • 11:16 akosiaris: T242461 undeploy restrouter. The service is unused and, per the task, will not be used after all
  • 11:16 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' .
  • 11:15 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'restrouter' for release 'production' .
  • 11:15 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' .
  • 10:56 XioNoX: add extra prepend to LG export filter
  • 10:41 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.cf (exit_code=0)
  • 10:41 ayounsi@cumin1001: START - Cookbook sre.network.cf
  • 10:41 ayounsi@cumin1001: END (PASS) - Cookbook sre.network.cf (exit_code=0)
  • 10:40 ayounsi@cumin1001: START - Cookbook sre.network.cf
  • 10:40 jbond42: sec update for libgraphicsmagick on maps
  • 10:20 godog: bounce squid on install1003 T247759
  • 10:07 _joe_: sudo cumin -b2 -s 50 'A:mw-jobrunner' 'restart-php7.2-fpm' T247622
  • 10:03 Amir1: warming up cache for Q60M to Q70M for new term store on db1111, db1126, db1104, db1092 (T219123)
  • 10:02 ema: create kafka topic atskafka_test_webrequest_text T247497
  • 09:57 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)
  • 09:55 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q60M (T219123), take II (duration: 01m 05s)
  • 09:54 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q60M (T219123) (duration: 01m 09s)
  • 09:27 elukey@cumin1001: START - Cookbook sre.hadoop.roll-restart-workers
  • 09:21 ema: cp: rolling varnish-frontend-restart to decrease memory usage and apply transient storage limits T185968
  • 09:09 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)
  • 08:39 elukey@cumin1001: START - Cookbook sre.hadoop.roll-restart-workers
  • 00:57 krinkle@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/Wikibase/lib/includes/Formatters/: Ic77b2c6b33a, T247458 (duration: 01m 12s)

2020-03-16

  • 23:14 tzatziki: reset email for "MNadrofsky (WMF)" on SUL and officewiki
  • 20:58 mutante: mw1223 power down
  • 20:54 mutante: powercycling mw1223
  • 20:52 mutante: 5 old API appservers in eqiad removed
  • 20:45 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 20:43 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 20:42 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw122[1-6].eqiad.wmnet
  • 20:37 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 20:35 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 20:04 mutante: depool (yes->no) mw1221 - mw1226 (T247780)
  • 20:04 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw122[1-6].eqiad.wmnet
  • 19:28 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@f5600d6]: Update mobileapps to 8a6e403 (duration: 06m 48s)
  • 19:26 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 19:24 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 19:23 jynus: stop replication at pc1010 at pos pc1007-bin.080617:259138670
  • 19:21 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@f5600d6]: Update mobileapps to 8a6e403
  • 19:11 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Pool pc1010 instead of pc1008 as pc1008 is overloaded (duration: 01m 06s)
  • 18:38 krinkle@deploy1001: Synchronized wmf-config/: I2c3217 (duration: 01m 07s)
  • 18:36 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: no-op, courtesy of opcache (duration: 01m 06s)
  • 18:34 krinkle@deploy1001: Synchronized docroot/noc/: I2c3217fb3 (duration: 01m 07s)
  • 18:18 mforns@deploy1001: Finished deploy [analytics/refinery@1681b92]: deploying refinery to add forgotten artifacts for v0.0.118 (duration: 13m 01s)
  • 18:05 mforns@deploy1001: Started deploy [analytics/refinery@1681b92]: deploying refinery to add forgotten artifacts for v0.0.118
  • 17:08 Amir1: warming up cache for Q50M to Q60M for new term store on db1111, db1126, db1104, db1092 (T219123)
  • 17:06 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q50M (T219123), take II (duration: 01m 08s)
  • 17:03 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q50M (T219123) (duration: 01m 06s)
  • 16:54 gehel: repooling wdqs1005
  • 16:52 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Enforce Content Security Policy if wmgUseCSP is set T244124 (duration: 01m 06s)
  • 16:50 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 07s)
  • 16:48 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set wmgUseCSP false everywhere T244124 (duration: 01m 07s)
  • 16:34 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: I498e2ebd8c9 (duration: 01m 07s)
  • 16:33 krinkle@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: I498e2ebd8c9 (no-op) (duration: 01m 07s)
  • 16:30 krinkle@deploy1001: Synchronized wmf-config/wgConf.php: I870122f946d (duration: 01m 07s)
  • 16:22 rlazarus: copied envoyproxy_1.13.1-1 from buster-wikimedia to stretch-wikimedia
  • 16:21 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: I08af45e2e47 (duration: 01m 07s)
  • 16:14 krinkle@deploy1001: Synchronized wmf-config/wgConf.php: Ie9002d9095ee (duration: 01m 08s)
  • 15:04 akosiaris: T234181 upload apertium-recursive_0.0.1-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main
  • 15:04 akosiaris: T234181 upload apertium-anaphora_0.0.4-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main
  • 15:02 moritzm: rolling restart of FPM/apache on netmon* to pick up libxslt security updates
  • 14:22 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q40M (T219123), take II (duration: 01m 06s)
  • 14:22 Amir1: warming up cache for Q40M to Q50M for new term store on db1111, db1126, db1104, db1092 (T219123)
  • 14:18 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q40M (T219123) (duration: 01m 07s)
  • 14:16 moritzm: rolling restart of FPM on mw1261-mw1265 to pick up libxslt security updates
  • 14:15 Amir1: ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --from-id 87500000 --to-id 87767570 --batch-size=10 --sleep=5 (T219123)
  • 14:05 moritzm: installing libxslt security updates
  • 13:49 ema: upload atskafka 0.1 to buster-wikimedia T237993
  • 13:42 gehel: restarting blazegraph on wdqs1007
  • 13:30 gehel: depooling wdqs1005 to catch up on lag
  • 12:43 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool es1015', diff saved to https://phabricator.wikimedia.org/P10706 and previous config saved to /var/cache/conftool/dbconfig/20200316-124309-marostegui.json
  • 12:09 Amir1: warming up cache for Q35M to Q40M for new term store on db1111, db1126, db1104, db1092 (T219123)
  • 12:09 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q35M (T219123), take II (duration: 01m 07s)
  • 12:05 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set up read new term store up to Q35M (T219123) (duration: 01m 08s)
  • 11:52 XioNoX: manually fix prometheus squid exporter on install1003
  • 11:04 Amir1: ... for Q30M-Q35M of the new term store
  • 11:04 Amir1: Warming up InnoDB buffer pool cache in db1111, db1126, db1104, db1092 (T219123)
  • 10:55 Amir1: warming up db1026 for up to Q35M for the new term store (T219123)
  • 10:47 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10705 and previous config saved to /var/cache/conftool/dbconfig/20200316-104723-marostegui.json
  • 10:45 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: "Set term store to WRITE_BOTH for all of Wikidata" (T219123), take II (duration: 01m 07s)
  • 10:43 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: "Set term store to WRITE_BOTH for all of Wikidata" (T219123) (duration: 01m 13s)
  • 10:40 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10704 and previous config saved to /var/cache/conftool/dbconfig/20200316-104002-marostegui.json
  • 10:36 elukey: roll restart of recommendation service on scb* as an attempt to fix the flapping alerts - T247732
  • 10:28 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10703 and previous config saved to /var/cache/conftool/dbconfig/20200316-102829-marostegui.json
  • 10:17 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10702 and previous config saved to /var/cache/conftool/dbconfig/20200316-101707-marostegui.json
  • 10:10 marostegui: Stop mysql for upgrade on es1015 T239791
  • 10:02 Amir1: start of ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=50 --sleep=0 --file=15march2217-holes-nulls.list on screen (T219123)
  • 09:32 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1015 for upgrade and restart T239791', diff saved to https://phabricator.wikimedia.org/P10701 and previous config saved to /var/cache/conftool/dbconfig/20200316-093228-marostegui.json
  • 09:30 marostegui@cumin1001: dbctl commit (dc=all): 'Promote es1011 to es2 master, this is a NOOP T239791', diff saved to https://phabricator.wikimedia.org/P10700 and previous config saved to /var/cache/conftool/dbconfig/20200316-093048-marostegui.json
  • 08:16 marostegui: Review and enable events on recently migrated 10.4 hosts - T247728
  • 08:02 ema: cp4025 restart trafficserver-tls to clear 'tls process restarted' alert T241593 T185968
  • 07:57 moritzm: installing libxslt security updates
  • 07:52 ema: cp4025: restart varnish-fe to clear 'child restarted' alert T185968
  • 07:47 moritzm: installing lxml security updates
  • 07:14 moritzm: installing libgd2 security updates on jessie
  • 06:54 moritzm: removing some library packages from jessie/stretch after labstore1006/1007 dist-upgrade to buster
  • 06:38 _joe_: restart envoy with 10 requests per connection on mw2231, T247484

2020-03-15

  • 23:20 jynus: removed oldest snapshots on dbprov1001
  • 13:27 dcausse: restarting blazegraph on wdqs1005 T242453
  • 07:01 marostegui: Restart logrotate on db1107

2020-03-14

  • 08:33 elukey: run kafka preferred-replica-election on kafka-jumbo1001 - T247561
  • 08:32 elukey: run systemctl restart systemd-timedated.service on stat1008
  • 01:06 mutante: planet1001 - copying /etc/apt/sources.list from planet2001 to planet1001 - apt-get update - apt-get install openssh-server T247592

2020-03-13

  • 23:12 bstorm_: rebooting labstore1006 for upgrade to stretch T224583
  • 22:49 herron@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:45 herron@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:27 bstorm_: rebooting labstore1006 T224583
  • 22:21 bstorm_: downtimed labstore1006 for upgrades T224583
  • 20:02 mutante: stat1005 - ip link set en01 down ; ip link set en01 up (T247561)
  • 19:30 bstorm_: rebooting labstore1007 for upgrade to buster T224583
  • 18:51 shdubsh: test increase fs.inotify.max_user_watches on prometheus2004
  • 17:58 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:21 mutante: removed squid from install1002/install2002 (formerly webproxy.(eqiad|codfw).wmnet until 2 days ago, replaced by install1003/install2003) T224576
  • 17:20 elukey@cumin1001: END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0)
  • 17:09 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:08 elukey@cumin1001: START - Cookbook sre.kafka.roll-restart-mirror-maker
  • 17:00 krinkle@deploy1001: Synchronized dblists/: If4d17082f, Iadba5b01b, Ibe16d5f09 (duration: 01m 07s)
  • 16:58 krinkle@deploy1001: Synchronized wmf-config/config/: Ibe16d5f09 (duration: 01m 10s)
  • 16:51 bstorm_: rebooting labstore1007 for stretch upgrade T224583
  • 16:37 krinkle@deploy1001: Synchronized wmf-config/config/: If4d17082f, Iadba5b01b (duration: 01m 11s)
  • 16:18 herron@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:15 herron@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:04 bstorm_: rebooting labstore1007 for first cycle of upgrades T224583
  • 16:02 elukey: powercycle kafka-jumbo1006 after switch port changed - T247561
  • 15:28 _joe_: switch envoy logging to debug on mw2231
  • 14:57 cdanis: T247586 cdanis@grafana1002.eqiad.wmnet: sudo systemctl restart apache2.service
  • 12:48 Urbanecm: Password reset for SUL User:FuduBot (T247601)
  • 12:16 akosiaris@deploy1001: Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 01m 16s)
  • 10:26 moritzm: installing python-werkzeug security updates
  • 10:09 vgutierrez: upload trafficserver 8.0.6-1wm3 to apt.wm.o (buster) - T245616
  • 09:55 _joe_: running puppet across appservers to switch to http for eventgate-analytics T247484
  • 09:17 moritzm: installing perl updates from Stretch point release
  • 06:16 vgutierrez: triggering OCSP response updates in eqiad,codfw and ulsfo - T247584
  • 06:12 vgutierrez: triggering OCSP response updates in eqsin - T247584
  • 06:05 vgutierrez: triggering OCSP response updates in esams - T247584
  • 00:20 shdubsh: reload prometheus@ops on prometheus1003
  • 00:08 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw215[8-9].codfw.wmnet
  • 00:08 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw216[0-9].codfw.wmnet
  • 00:08 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw217[1-2].codfw.wmnet
  • 00:04 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:04 dzahn@cumin1001: START - Cookbook sre.hosts.downtime

2020-03-12

  • 23:58 shdubsh: reload prometheus@ops on prometheus1004
  • 23:42 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw217[1-2].codfw.wmnet
  • 23:41 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw216[0-9].codfw.wmnet
  • 23:40 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw215[89].codfw.wmnet
  • 23:26 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw215[89].codfw.wmnet
  • 23:25 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2178.codfw.wmnet
  • 23:21 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw216[0-6].codfw.wmnet
  • 22:45 krinkle@deploy1001: Synchronized multiversion/: I403a9890a9 (duration: 01m 07s)
  • 22:44 krinkle@deploy1001: Synchronized dblists/: I403a9890a9 (duration: 01m 09s)
  • 22:41 mforns@deploy1001: Finished deploy [analytics/refinery@906bd1e]: deploying refinery together with refinery-source v0.0.118 (duration: 12m 20s)
  • 22:28 mforns@deploy1001: Started deploy [analytics/refinery@906bd1e]: deploying refinery together with refinery-source v0.0.118
  • 22:15 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 22:15 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 22:09 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 22:07 bstorm_: moving all nfs traffic off labstore1007 and to labstore1006 for upgrades T224583
  • 22:06 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 22:05 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 22:02 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 22:02 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 21:47 mutante: doc1001 - had to manually run "/usr/local/sbin/build-envoy-config -c /etc/envoy/" to get envoy tls_terminator_443 listener into the config or envoy would not listen on 443 (T210411)
  • 21:19 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 21:19 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 21:06 foks: remove one file for legal compliance
  • 20:49 ottomata: kafka-jumbo1006 - stopping kafka and powercycling - T247561
  • 20:15 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert "all wikis to 1.35.0-wmf.23"
  • 20:11 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.23
  • 20:10 mutante: revoking puppet cert for doc.discovery.wmnet, re-creating with doc.wikimedia.org as SAN
  • 20:09 eileen: civicrm revision changed from a301076871 to a1b2cbeac1, config revision is 37232d8460
  • 19:46 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Revert "Set term store to WRITE_BOTH for all of Wikidata", take II (duration: 01m 06s)
  • 19:45 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Revert "Set term store to WRITE_BOTH for all of Wikidata" (duration: 01m 08s)
  • 19:20 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 19:18 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:53 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:51 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:43 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:40 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:34 ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: cirrus: Start Glent m0 AB test (duration: 01m 07s)
  • 18:31 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: re-sync InitialiseSettings.php (duration: 01m 08s)
  • 18:29 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Set term store to WRITE_BOTH for all of Wikidata (T219123) (duration: 01m 07s)
  • 18:23 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Switch kowiki to use ORES for suggested edits topics (duration: 01m 08s)
  • 18:19 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:48 elukey: increase via 'kadmin.local modprinc -maxlife 2d $user' the max ticket lifetime of all Kerberos user principals on krb1001's KDC (changes will be propagated to codfw automatically)
  • 17:48 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:35 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 17:32 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 17:17 elukey: execute modprinc -maxlife 2d krbtgt/WIKIMEDIA via kadmin.local on krb1001 (will be propagated to 2001 automatically)
  • 17:12 volans@cumin1001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 17:06 volans@cumin1001: START - Cookbook sre.dns.netbox
  • 17:03 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 17:03 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 16:53 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 16:53 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
  • 16:41 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:39 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:37 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:36 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:28 volans: restarting icinga, acting up on command file (frack awol and downtimes)
  • 16:20 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 16:20 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 16:15 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:13 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:07 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:07 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:53 rlazarus: uploading envoyproxy_1.13.1-1 (upgrade from 1.12.2) T246868
  • 14:51 elukey: restart kpropd daemon on krb2001
  • 14:26 volans@cumin2001: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
  • 14:23 volans@cumin2001: START - Cookbook sre.dns.netbox
  • 14:07 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:35 mvolz@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' .
  • 13:26 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 13:26 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 13:21 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:56 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:33 volans@cumin2001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 12:29 volans@cumin2001: START - Cookbook sre.dns.netbox
  • 12:00 tarrow: EU SWAT done
  • 12:00 tarrow@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/TwoColConflict: SWAT: Detect whether an edit came from VisualEditor (T245722) (duration: 01m 10s)
  • 11:42 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 11:42 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 11:39 volans@cumin2001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 11:38 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 11:38 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 11:37 volans@cumin2001: START - Cookbook sre.dns.netbox
  • 11:23 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 11:23 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 11:09 elukey: roll restart of krb-kdc on krb1001/krb2001 to pick up new ticket lifetime settings (10h -> 48h)
  • 11:09 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 11:09 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 11:05 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 11:05 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 11:02 volans@cumin2001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 10:59 volans@cumin2001: START - Cookbook sre.dns.netbox
  • 10:58 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 10:58 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 10:39 volans@cumin2001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 10:39 volans@cumin2001: START - Cookbook sre.dns.netbox
  • 10:29 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 10:28 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 10:28 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 10:13 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 10:13 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 09:58 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
  • 09:58 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 08:55 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: switch ores to use envoy (duration: 01m 08s)
  • 08:36 addshore: start "rebuild" of Q87 -> 87.5 million for T219123
  • 08:27 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q87.5 million, was 87 (T219123) cache bust (duration: 01m 08s)
  • 08:26 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q87.5 million, was 87 (T219123) (duration: 01m 12s)
  • 08:12 elukey: push new install/webproxy terms for analytics-in4/6 to cr1/cr2-eqiad
  • 07:28 kart_: Updated cxserver charts to 0.0.13
  • 07:26 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 07:24 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 07:22 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 06:14 kart_: Updated cxserver to 2020-03-12-041806-production and added sectionmapping db config (T246316, T243430, T202276)
  • 06:11 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 06:08 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 06:03 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 01:51 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/WikimediaEditorTasks: Revert 'Fix revert counting for non-language-specific counters' (T247479) (duration: 01m 08s)
  • 01:13 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@4e2ea09]: resolve deadlock in bulk_daemon (duration: 10m 05s)
  • 01:03 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@4e2ea09]: resolve deadlock in bulk_daemon
  • 00:56 ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/CirrusSearch/includes/Maintenance/Reindexer.php: wait around for counts to match up in reindexer before giving up (duration: 01m 08s)
  • 00:53 ebernhardson: wmf.23 cirrussearch: wait around for counts to match before giving up
  • 00:52 ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/CirrusSearch/includes/Maintenance/Reindexer.php: (no justification provided) (duration: 01m 12s)
  • 00:23 mutante: switching webproxy.eqiad.wmnet / webproxy.codfw.wmnet to install[12]003 (squids on buster)
  • 00:16 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Disable depicts counter due to code revert (T244974), take 2 (duration: 01m 07s)
  • 00:14 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Disable depicts counter due to code revert (T244974) (duration: 01m 07s)
  • 00:00 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/WikimediaEditorTasks: Revert 'Fix revert counting for non-language-specific counters' (T247479) (duration: 01m 07s)

2020-03-11

  • 23:52 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Enable depicts counter (T244974) (Simon says) (duration: 01m 07s)
  • 23:51 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Enable depicts counter (T244974) (duration: 01m 07s)
  • 23:51 cdanis@cumin1001: END (PASS) - Cookbook sre.network.cf (exit_code=0)
  • 23:51 cdanis@cumin1001: START - Cookbook sre.network.cf
  • 23:42 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/WikimediaEditorTasks: Fix revert counting for non-language-specific counters (duration: 01m 08s)
  • 23:40 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/WikimediaEditorTasks: Fix revert counting for non-language-specific counters (duration: 01m 11s)
  • 23:18 krinkle@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: I91b3a18317af (duration: 01m 08s)
  • 22:39 volans@cumin2001: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
  • 22:39 volans@cumin2001: START - Cookbook sre.dns.netbox
  • 22:28 mutante: depooled mw2167 through mw2172 - rack C3 (T247018)
  • 22:27 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw217[012].codfw.wmnet
  • 22:26 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw216[789].codfw.wmnet
  • 22:16 James_F: Purged trwiki logos from ATS/Varnish for T247445
  • 22:15 jforrester@deploy1001: Synchronized static/images/project-logos/: [trwiki] Restore pre-unblocking celebration logo versions T247445 (duration: 01m 09s)
  • 21:42 ebernhardson: stop all mjolnir-kafka-bulk-daemons in eqiad except 1 to assist debugging
  • 21:33 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@2726268]: Downgrade kafka_python to 1.4.3 (duration: 05m 45s)
  • 21:27 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@2726268]: Downgrade kafka_python to 1.4.3
  • 20:53 cdanis@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 20:52 cdanis@cumin2001: START - Cookbook sre.hosts.decommission
  • 20:26 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.23 (duration: 01m 03s)
  • 20:25 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.23
  • 20:09 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:06 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:03 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:00 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:53 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:36 ejegg: updated payments-wiki from 03765b53de to 86ce0361f9
  • 18:36 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:33 elukey@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:25 volans: temporarily disabled puppet on A:dns-auth to deploy g/578506 T233183
  • 18:24 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 07s)
  • 18:22 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop setting wmgParsoidVariant, no longer read T229015 (duration: 01m 07s)
  • 18:21 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Stop using wmgParsoidVariant, no longer varied T229015 (duration: 01m 08s)
  • 17:53 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 16:53 moritzm: removed cas-2020-03-09.log and cas-2020-03-10.log on idp2001 (huge logs due to some debug log level for tracking down a performance issue)
  • 16:36 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:25 liw: restarting Zuul to clear queues (in collab with James F)
  • 14:49 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:41 volans: installed spicerack to 0.0.32-1 on cumin[12]001
  • 14:25 akosiaris@deploy1001: Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 01m 11s)
  • 14:24 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 14:23 akosiaris@deploy1001: sync aborted: wmf-config/ProductionServices.php (duration: 02m 42s)
  • 14:22 volans: uploaded spicerack_0.0.32-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
  • 14:21 akosiaris: switch mediawiki to talk to eventgate-analytics via envoy
  • 14:21 akosiaris@deploy1001: Started scap: wmf-config/ProductionServices.php
  • 14:20 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:18 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:09 akosiaris: T239779 upload apertium-swe-nor_0.3.1-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main
  • 14:08 akosiaris: T239779 upload apertium-swe-dan_0.8.1-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main
  • 14:08 akosiaris: T239779 upload apertium-nno-nob_1.3.0-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main
  • 14:08 akosiaris: T239779 upload apertium-dan-nor_1.4.1-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main
  • 13:01 thcipriani: restarting gerrit unstuck the zuul server (T246973)
  • 12:54 thcipriani: restarting gerrit to try to fix thread deadlock on zuul (cf: T246973 )
  • 12:43 akosiaris: disconnect+connect jenkins from gearman server.
  • 12:38 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 12:38 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 12:32 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 12:32 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 12:23 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 12:23 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 12:00 Lucas_WMDE: EU SWAT done
  • 12:00 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: SWAT (prod no-op): Don't use TwoColConflict as beta feature on labs (T247292), take II (duration: 01m 07s)
  • 11:59 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: SWAT (prod no-op): Don't use TwoColConflict as beta feature on labs (T247292) (duration: 01m 09s)
  • 11:56 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/WikibaseCirrusSearch/: SWAT: Wrap property EntitySearchHelper in PropertyDataTypeSearchHelper (duration: 01m 05s)
  • 11:48 vgutierrez: restarting ats-backend on cp2004
  • 11:25 moritzm: restarting slapd on serpens/seaborgium to pick up libidn security updates
  • 11:21 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 11:16 _joe_: restarting zuul and zuul-merger on contint1001, they're stuck
  • 11:11 moritzm: restarting exim on MXes to pick up libidn security updates
  • 11:03 marostegui@cumin1001: dbctl commit (dc=all): 'Give normal 100 weight to es3 old masters - T246072', diff saved to https://phabricator.wikimedia.org/P10685 and previous config saved to /var/cache/conftool/dbconfig/20200311-110334-marostegui.json
  • 10:59 marostegui: Remove Mostrevisions from mwmaint1002 T239072
  • 10:42 vgutierrez: pool ncredir5002 - T243391
  • 10:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly give weight to es3 old masters - T246072', diff saved to https://phabricator.wikimedia.org/P10684 and previous config saved to /var/cache/conftool/dbconfig/20200311-103802-marostegui.json
  • 10:34 moritzm: restarting Apache on graphite*, kibana, netmon* to pick up libidn security updates
  • 09:53 moritzm: installing postgresql-9.6 security updates on maps*
  • 09:46 vgutierrez: depool and reimage ncredir5002 with buster - T243391
  • 09:43 marostegui: Finish es3 maintenance window T246072
  • 09:29 marostegui: Disconnect replication on all es3 hosts T246072
  • 09:18 marostegui: Set es1017 (es3 master) in read only on mysql T246072
  • 09:09 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Set es3 as RO - T246072 (duration: 01m 08s)
  • 09:06 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Set es3 as RO - T246072 (duration: 01m 08s)
  • 09:01 moritzm: restarting Apache on puppetboard, people.wikimedia.org, webperf*, bromine, miscweb* to pick up libidn security updates
  • 08:40 moritzm: installing libidn security updates
  • 08:33 moritzm: installing libvpx security updates
  • 08:10 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: switch wdqs-internal to use envoy (duration: 01m 21s)
  • 07:38 marostegui: fixcopyrightwiki_p views from labs hosts T246055
  • 01:40 ejegg: restarted recurring donation charge jobs
  • 01:27 ejegg: restarted fundraising orphan donation rectifier jobs
  • 01:20 ejegg: updated fundraising CiviCRM from c4b81b19b0 to a301076871
  • 01:19 ejegg: disabled orphan rectifier jobs for upgrade
  • 00:24 eileen: civicrm revision changed from 35651da117 to c4b81b19b0, config revision is 71c8cda115
  • 00:16 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw2375.codfw.wmnet
  • 00:15 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw237[0246].codfw.wmnet
  • 00:15 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw236[68].codfw.wmnet
  • 00:14 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw23[66-76].codfw.wmnet

2020-03-10

  • 23:53 volker-e@deploy1001: Finished deploy [design/style-guide@8eb1daf]: Deploy design/style-guide: (duration: 00m 05s)
  • 23:53 volker-e@deploy1001: Started deploy [design/style-guide@8eb1daf]: Deploy design/style-guide:
  • 23:50 ejegg: disabled recurring donation charge jobs for upgrade
  • 23:48 mutante: mw2376 - systemctl start apache2
  • 23:45 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw2376.codfw.wmnet
  • 23:45 ebernhardson: start in-place reindex procedure on kowiki against eqiad and codfw
  • 23:44 ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/CirrusSearch/includes/Maintenance/Reindexer.php: (no justification provided) (duration: 01m 07s)
  • 23:42 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:39 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 23:38 ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/CirrusSearch/includes/Maintenance/Reindexer.php: cirrus: Wait around after a refresh before counting docs (duration: 01m 08s)
  • 23:37 mutante: mw2366 - systemctl start nutcracker
  • 23:11 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:11 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 23:07 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw237[135].codfw.wmnet
  • 23:05 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw236[579].codfw.wmnet
  • 23:05 krinkle@deploy1001: Synchronized wmf-config/wgConf.php: Ib5473af6 (duration: 01m 07s)
  • 23:02 krinkle@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: Ib5473af6 (duration: 01m 07s)
  • 22:58 krinkle@deploy1001: Synchronized multiversion/MWMultiVersion.php: Ib5473af6 (duration: 01m 07s)
  • 22:31 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw236[0-5].codfw.wmnet
  • 22:28 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw235[0-9].codfw.wmnet
  • 22:12 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw237[0-4].codfw.wmnet
  • 22:12 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw236[0-9].codfw.wmnet
  • 22:11 mutante: mw2359 sudo systemctl start php7.2-fpm_check_restart
  • 22:09 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw235[0-9].codfw.wmnet
  • 21:58 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@dda3d28]: re-sync latest version to trigger scap scripts on new elastic nodes in codfw (duration: 02m 15s)
  • 21:56 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@dda3d28]: re-sync latest version to trigger scap scripts on new elastic nodes in codfw
  • 21:51 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@dda3d28]: re-sync latest version to trigger scap scripts on new elastic nodes in codfw (duration: 00m 23s)
  • 21:51 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@dda3d28]: re-sync latest version to trigger scap scripts on new elastic nodes in codfw
  • 21:38 volker-e@deploy1001: Finished deploy [design/style-guide@8eb1daf]: Deploy design/style-guide: (duration: 00m 07s)
  • 21:38 volker-e@deploy1001: Started deploy [design/style-guide@8eb1daf]: Deploy design/style-guide:
  • 21:29 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:26 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:49 brennen@deploy1001: rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.23
  • 20:39 brennen@deploy1001: Finished scap: testwiki to php-1.35.0-wmf.23 and rebuild l10n cache (duration: 163m 37s)
  • 20:29 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:29 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:11 eileen: civicrm revision changed from 10506a9644 to 3de711ed49, config revision is 2d7b926c1d
  • 20:10 eileen: process-control config revision is 2d7b926c1d
  • 19:41 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 19:40 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 19:38 mutante: gerrit1001 - /var/log/syslog empty and 2 rsyslogd procs running, killing one of them, stopping the other, letting puppet run
  • 19:37 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:34 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 19:34 volker-e@deploy1001: Finished deploy [design/style-guide@62bf7c6]: Deploy design/style-guide: (duration: 00m 06s)
  • 19:34 volker-e@deploy1001: Started deploy [design/style-guide@62bf7c6]: Deploy design/style-guide:
  • 19:32 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 19:31 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 19:30 brennen: scap-cdb-rebuild currently at 29%; at present rate wmf.23 will roll to group0 a bit after the official window
  • 19:29 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 19:26 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 19:22 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 19:19 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 19:12 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 19:12 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 19:09 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 19:04 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 19:04 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 19:00 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:00 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 18:56 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 18:56 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 18:39 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 18:39 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 18:36 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 18:36 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 18:33 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 18:33 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 17:55 brennen@deploy1001: Started scap: testwiki to php-1.35.0-wmf.23 and rebuild l10n cache
  • 17:34 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@88b3e14]: Update predictions dag with new cli parameters (duration: 01m 00s)
  • 17:33 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@88b3e14]: Update predictions dag with new cli parameters
  • 17:33 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 00s)
  • 17:31 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [nlwiki] Enable WikiLove T247286 (duration: 00m 59s)
  • 17:27 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@6c2ee13]: Update mobileapps to 304fb43 (duration: 08m 09s)
  • 17:25 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:25 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 17:24 James_F: Ran mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=nlwiki wikilove for T247286
  • 17:23 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:23 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 17:20 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:20 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 17:19 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:19 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@6c2ee13]: Update mobileapps to 304fb43
  • 17:18 brennen: 1.35.0-wmf.23 was branched at 8e3738c for T233871
  • 16:50 brennen: starting branch cut for wmf/1.35.0-wmf.23 - T233871
  • 16:22 volker-e@deploy1001: Finished deploy [design/style-guide@14bb669]: Deploy design/style-guide: (duration: 00m 08s)
  • 16:21 volker-e@deploy1001: Started deploy [design/style-guide@14bb669]: Deploy design/style-guide:
  • 16:15 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@d182ca7]: Build airflow venvs from stat1007 (duration: 00m 45s)
  • 16:15 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@d182ca7]: Build airflow venvs from stat1007
  • 16:05 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: switch termbox to use envoy (duration: 00m 59s)
  • 15:48 vgutierrez: re-enabling session id based caching on ulsfo (along with tls session tickets) - T245616
  • 14:48 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2121 - T246604', diff saved to https://phabricator.wikimedia.org/P10677 and previous config saved to /var/cache/conftool/dbconfig/20200310-144817-root.json
  • 14:42 akosiaris: T233700 upload apertium-fra-cat_1.7.0-1+wmf1_amd64.changes to apt.wikimedia.org/jessie-wikimedia main
  • 14:35 akosiaris@cumin1001: conftool action : set/weight=10; selector: dc=codfw,service=eventstreams,name=scb.*
  • 14:35 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=codfw,service=eventstreams,name=scb.*
  • 14:35 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=eqiad,service=eventstreams,name=scb.*
  • 14:34 akosiaris@cumin1001: conftool action : set/weight=8; selector: dc=eqiad,service=eventstreams,name=scb.*
  • 14:15 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:12 vgutierrez: Switch to TLS session tickets on ulsfo - T245616
  • 14:12 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:00 vgutierrez: reboot cp4026 - T245616
  • 14:00 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: switch echostore to use envoy (duration: 00m 57s)
  • 13:52 marostegui: Stop mysql on db2121 for reimage to buster T246604
  • 13:46 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2121 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10676 and previous config saved to /var/cache/conftool/dbconfig/20200310-134648-marostegui.json
  • 13:45 akosiaris@cumin1001: conftool action : set/pooled=inactive; selector: dc=eqiad,service=eventstreams,name=kubernetes.*
  • 13:44 akosiaris@cumin1001: conftool action : set/pooled=no; selector: dc=eqiad,service=eventstreams,name=kubernetes.*
  • 13:41 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable Mediawiki client side error logging on hawwiki (take 2) - T246030 (duration: 00m 57s)
  • 13:40 akosiaris: bump eventstreams on scb1003 to force users to reconnect, hoping more connections will make it to kubernetes hosts
  • 13:35 akosiaris: pool all kubernetes hosts in eqiad for eventstreams. weight=2 which means ~20% of requests are going to be served by kubernetes
  • 13:34 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=eqiad,service=eventstreams,name=kubernetes.*
  • 13:34 akosiaris@cumin1001: conftool action : set/weight=2; selector: dc=eqiad,service=eventstreams,name=kubernetes.*
  • 13:31 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable Mediawiki client side error logging on hawwiki - T246030 (duration: 00m 58s)
  • 13:29 akosiaris: T202360 upload apertium-oci-fra_0.3.0-1+wmf1_amd64.changes to apt.wikimedia.org/jessie-wikimedia main
  • 13:25 gehel@cumin1001: START - Cookbook sre.wdqs.data-reload
  • 13:23 gehel@cumin1001: END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
  • 13:23 gehel@cumin1001: START - Cookbook sre.wdqs.data-reload
  • 13:22 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:19 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:17 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:17 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:16 vgutierrez: upgrade ATS on ulsfo to 8.0.6-1wm2 - T245616
  • 13:16 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:15 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:13 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:10 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:05 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:02 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:01 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:00 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:56 vgutierrez: upload trafficserver 8.0.6-1wm2 to apt.wm.o (buster) - T245616
  • 11:41 Lucas_WMDE: EU SWAT done
  • 11:40 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/EventLogging/: SWAT: Make BackgroundQueue more aware of page unload flow (T246382, T244874) (duration: 00m 58s)
  • 11:30 oblivian@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'echostore' for release 'production' .
  • 11:27 oblivian@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'echostore' for release 'production' .
  • 11:26 marostegui: Restart mysqld exporter on db2125 to see if the collection errors decrease from 30 T247290
  • 11:21 lucaswerkmeister-wmde@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/DiscussionTools/: SWAT: controller: apply ve.fixBase to the parsed Parsoid response (T245781) (duration: 00m 59s)
  • 09:38 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 09:37 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' .
  • 09:36 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
  • 09:34 marostegui: es5 deployment window finished T246072
  • 09:31 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
  • 09:30 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' .
  • 09:29 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
  • 09:27 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Enable es5 as new writable external store section - T246072 (duration: 00m 57s)
  • 09:26 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Enable es5 as new writable external store section - T246072 (duration: 00m 58s)
  • 09:25 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Enable es5 as new writable external store section - T246072 (duration: 00m 59s)
  • 09:21 akosiaris: update blubberoid, cxserver, citoid to push the TLS resources changes T244843
  • 09:21 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 09:21 akosiaris: update blubberoid, cxserver, citoid to push the TLS resources changes
  • 09:20 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' .
  • 09:19 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
  • 09:04 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Add es5 to the available es sections, not in use yet - T246072 (duration: 00m 59s)
  • 09:03 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Add es5 to the available es sections, not in use yet - T246072 (duration: 01m 01s)
  • 09:00 marostegui: Start es5 deployment window T246072
  • 08:50 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool es1012', diff saved to https://phabricator.wikimedia.org/P10673 and previous config saved to /var/cache/conftool/dbconfig/20200310-085001-marostegui.json
  • 08:25 marostegui@cumin1001: dbctl commit (dc=all): 'Promote es1012 back to es1 master, this is a NOOP T239791', diff saved to https://phabricator.wikimedia.org/P10671 and previous config saved to /var/cache/conftool/dbconfig/20200310-082552-marostegui.json
  • 08:25 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1012', diff saved to https://phabricator.wikimedia.org/P10670 and previous config saved to /var/cache/conftool/dbconfig/20200310-082525-marostegui.json
  • 05:36 vgutierrez: restart ats-be on cp4032 to clean up the restart alert - T247232
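
Note on the 13:35 eventstreams entry above ("weight=2 which means ~20% of requests"): the arithmetic behind that kind of estimate can be sketched as below. This is only an illustration, not the load balancer's actual algorithm. It assumes traffic is split in proportion to each pooled host's weight. The host counts (4 and 4) and the scb weight of 8 are hypothetical values chosen to reproduce the ~20% figure; the log does not state the host counts or the scb weight in effect at 13:35. The function name `pool_share` is likewise made up for this sketch.

```python
def pool_share(pools, name):
    """Return the fraction of requests expected to reach pool `name`.

    `pools` maps a pool name to (host_count, per_host_weight).
    Assumes requests are distributed proportionally to weight.
    """
    total = sum(count * weight for count, weight in pools.values())
    if total == 0:
        raise ValueError("no pooled capacity")
    count, weight = pools[name]
    return count * weight / total


# Hypothetical numbers: host counts and the scb weight are assumptions,
# not values taken from the log.
pools = {
    "kubernetes": (4, 2),  # weight=2, as in the 13:34 conftool action
    "scb": (4, 8),         # assumed weight
}

share = pool_share(pools, "kubernetes")
print(f"kubernetes share: {share:.0%}")
```

With different host counts on either side the share shifts accordingly, so the ~20% in the log entry should be read as an estimate.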

2020-03-09

  • 23:21 catrope@deploy1001: Synchronized wmf-config/throttle.php: Remove expired throttle exemptions (duration: 01m 00s)
  • 23:15 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Create Define/Define talk: namespace on scowiki (duration: 01m 00s)
  • 16:20 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: revert: switch eventgate-analytics to use envoy (duration: 00m 59s)
  • 16:15 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: switch eventgate-analytics to use envoy (duration: 01m 05s)
  • 16:11 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 15:46 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1012', diff saved to https://phabricator.wikimedia.org/P10668 and previous config saved to /var/cache/conftool/dbconfig/20200309-154627-marostegui.json
  • 15:35 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1012', diff saved to https://phabricator.wikimedia.org/P10667 and previous config saved to /var/cache/conftool/dbconfig/20200309-153515-marostegui.json
  • 15:29 marostegui: Upgrade mysql on es1012 T239791
  • 15:24 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2125 - T246604', diff saved to https://phabricator.wikimedia.org/P10666 and previous config saved to /var/cache/conftool/dbconfig/20200309-152427-marostegui.json
  • 15:18 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: switch mathoid to use envoy (duration: 00m 59s)
  • 15:17 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1012 T239791', diff saved to https://phabricator.wikimedia.org/P10665 and previous config saved to /var/cache/conftool/dbconfig/20200309-151751-marostegui.json
  • 15:14 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 15:13 marostegui@cumin1001: dbctl commit (dc=all): 'Promote es1016 to es1 master, this is a NOOP T239791', diff saved to https://phabricator.wikimedia.org/P10664 and previous config saved to /var/cache/conftool/dbconfig/20200309-151310-marostegui.json
  • 15:13 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 15:12 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 15:06 marostegui: Restart mysql on db1116 (the previous one was db1102) for upgrade
  • 14:57 marostegui: Restart mysql for upgrade
  • 14:56 hoo: Updated the Wikidata property suggester with data from the 2020-03-02 JSON dump and applied the T132839 workarounds
  • 14:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1121 T239791', diff saved to https://phabricator.wikimedia.org/P10663 and previous config saved to /var/cache/conftool/dbconfig/20200309-145232-marostegui.json
  • 14:48 marostegui: Restart and upgrade mysql on db1121 T239791
  • 14:47 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1121 T239791', diff saved to https://phabricator.wikimedia.org/P10662 and previous config saved to /var/cache/conftool/dbconfig/20200309-144752-marostegui.json
  • 14:41 godog: roll restart logstash in codfw / eqiad - T226986
  • 13:52 oblivian@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' .
  • 13:49 oblivian@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' .
  • 12:30 akosiaris: upload apertium 3.6.1, cg3 1.3.1, lttoolbox 3.5.1, apertium-lex-tools 0.2.3 to apt.wikimedia.org/jessie-wikimedia main. T234182
  • 12:06 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: switch sessionstore to use envoy permanently (duration: 00m 59s)
  • 11:25 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: test: switch sessionstore to use envoy again (duration: 00m 57s)
  • 11:10 Amir1: EU SWAT is done
  • 11:09 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add `fkv` Kven to $wmgExtraLanguageNames (T167259), take II (duration: 00m 57s)
  • 11:08 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add `fkv` Kven to $wmgExtraLanguageNames (T167259) (duration: 00m 59s)
  • 10:58 vgutierrez: upload pystemd 0.7.0-1wm1 to apt.wm.o (buster) - T245616
  • 10:46 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 58s)
  • 10:45 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 58s)
  • 10:34 moritzm: install spamassassin security updates on fermium/lists.wikimedia.org
  • 10:32 moritzm: install spamassassin security updates on mendelevium/ticket.wikimedia.org
  • 10:26 oblivian@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' .
  • 10:12 moritzm: installing openjdk-7 security updates
  • 10:04 vgutierrez: disable parent proxies globally on ats-tls - T244464
  • 10:00 moritzm: installing php5 security updates
  • 09:51 gehel: pooling new elastic20[55-60] servers - T246975
  • 09:48 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: re-revert: switch sessionstore to use envoy (duration: 00m 35s)
  • 09:39 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: re-try: switch sessionstore to use envoy (duration: 00m 58s)
  • 09:14 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: revert switch sessionstore to use envoy (duration: 00m 58s)
  • 09:08 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: switch sessionstore to use envoy (duration: 01m 00s)
  • 09:07 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:04 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:37 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2125 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10658 and previous config saved to /var/cache/conftool/dbconfig/20200309-083711-marostegui.json
  • 08:36 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2126', diff saved to https://phabricator.wikimedia.org/P10657 and previous config saved to /var/cache/conftool/dbconfig/20200309-083653-marostegui.json
  • 08:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2126 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10656 and previous config saved to /var/cache/conftool/dbconfig/20200309-082118-marostegui.json
  • 07:46 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2114 after reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10655 and previous config saved to /var/cache/conftool/dbconfig/20200309-074629-marostegui.json
  • 07:31 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:29 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:13 marostegui: Stop MySQL on db2114 to upgrade to buster
  • 07:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2114 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10654 and previous config saved to /var/cache/conftool/dbconfig/20200309-070937-marostegui.json
  • 05:34 vgutierrez: restart ats-tls, ats-be and varnish-fe on cp3053 to clean up daemon restart alerts - T247195

2020-03-08

  • 17:58 elukey: restart hadoop-yarn-nodemanager on an-worker1087
  • 17:17 reedy@deploy1001: Synchronized wmf-config/CommonSettings.php: Add wmgDisableAccountCreation (duration: 00m 56s)
  • 17:15 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add wmgDisableAccountCreation (duration: 00m 59s)
  • 05:16 thcipriani: restart gerrit-replica as it's OOM T247182

2020-03-07

  • 12:48 reedy@deploy1001: Synchronized wmf-config/throttle.php: T247149 (duration: 01m 07s)
  • 01:35 reedy@deploy1001: Synchronized wmf-config/throttle.php: tidy up (duration: 00m 56s)

2020-03-06

  • 23:50 mutante: install1003/2003 - starting DHCP servers and letting puppet stop them again to clear systemd state
  • 23:04 mutante: signing puppet certs for install1003/install2003, initial puppet runs
  • 22:33 reedy@deploy1001: Synchronized wmf-config/interwiki-labs.php: T247091 (duration: 00m 57s)
  • 22:09 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@18f13e4]: update to python3.7, ship articletopic propagation (duration: 00m 36s)
  • 22:08 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@18f13e4]: update to python3.7, ship articletopic propagation
  • 20:23 ebernhardson: post-deploy restart mjolnir bulk and msearch daemons across eqiad and codfw
  • 20:07 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@dda3d28]: Re-deploy python3.7 upgrade (duration: 05m 14s)
  • 20:02 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@dda3d28]: Re-deploy python3.7 upgrade
  • 19:57 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 19:56 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 19:48 mutante: re-creating install1003 and install2003 with same specs as before but public IP (T244390)
  • 19:47 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 19:46 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 19:46 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 19:46 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 18:54 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 18:53 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 18:52 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 18:52 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 18:46 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 18:44 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 18:07 mutante: sudo -i cumin -b 15 'mw23[25-34].codfw.wmnet' 'sudo -u dzahn scap pull'
  • 18:05 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw233[0-4].codfw.wmnet
  • 18:05 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw232[5-9].codfw.wmnet
  • 18:05 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw233[0-4].codfw.wmnet
  • 18:04 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw232[5-9].codfw.wmnet
  • 17:42 krinkle@deploy1001: Synchronized wmf-config/wgConf.php: I260bafdb8e (no-op) (duration: 01m 00s)
  • 17:28 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:26 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:23 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:23 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:54 reedy@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/WikimediaMaintenance/dumpInterwiki.php: T247097 (duration: 01m 00s)
  • 16:40 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:40 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:40 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:40 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:11 moritzm: installing libtimedate-perl updates from Stretch point release
  • 15:07 reedy@deploy1001: Synchronized langlist-labs: T247091 (duration: 01m 05s)
  • 14:53 elukey@cumin1001: END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
  • 14:50 elukey@cumin1001: START - Cookbook sre.aqs.roll-restart
  • 14:44 XioNoX: add cloud-out4 firewall filter in codfw - T246887
  • 11:56 akosiaris: T238658. kubernetes1001 pooled for eventstreams, weight=1 which should account for 2.1% of traffic
  • 11:51 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: dc=eqiad,service=eventstreams,name=kubernetes1001.*
  • 11:50 akosiaris@cumin1001: conftool action : set/weight=1; selector: dc=eqiad,service=eventstreams,name=kube.*
  • 10:21 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 10:16 elukey@cumin1001: END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0)
  • 10:10 moritzm: rolling restart of Exim on mx* to pick up libidn security updates
  • 10:06 elukey@cumin1001: START - Cookbook sre.presto.roll-restart-workers
  • 10:06 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1074', diff saved to https://phabricator.wikimedia.org/P10648 and previous config saved to /var/cache/conftool/dbconfig/20200306-100628-marostegui.json
  • 10:03 moritzm: rolling restart of labweb* to pick up libidn security updates
  • 09:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2084:3314, db2084:3315 after reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10647 and previous config saved to /var/cache/conftool/dbconfig/20200306-095407-marostegui.json
  • 09:52 moritzm: rolling restart of slapd on LDAP replicas to pick up libidn security updates
  • 09:51 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1074', diff saved to https://phabricator.wikimedia.org/P10646 and previous config saved to /var/cache/conftool/dbconfig/20200306-095115-marostegui.json
  • 09:46 elukey@cumin1001: END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
  • 09:45 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:43 elukey@cumin1001: START - Cookbook sre.aqs.roll-restart
  • 09:42 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:21 marostegui: Stop MySQL on db2084:3315, db2084:3314 for reimage T246604
  • 09:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2084:3314, db2084:3315 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10645 and previous config saved to /var/cache/conftool/dbconfig/20200306-092103-marostegui.json
  • 09:20 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1074', diff saved to https://phabricator.wikimedia.org/P10644 and previous config saved to /var/cache/conftool/dbconfig/20200306-092026-marostegui.json
  • 09:12 moritzm: rolling restart of mw canaries to pick up libidn security updates
  • 09:03 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1074', diff saved to https://phabricator.wikimedia.org/P10643 and previous config saved to /var/cache/conftool/dbconfig/20200306-090328-marostegui.json
  • 09:00 moritzm: installing libidn security updates
  • 08:56 moritzm: rolling restart of kartotherian/tilerator/tileratorui to pick up OpenJPEG security updates
  • 08:56 marostegui: Stop MySQL on db1074 for upgrade T239791
  • 08:56 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1074 for upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10642 and previous config saved to /var/cache/conftool/dbconfig/20200306-085435-marostegui.json
  • 08:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1113:3315, db1113:3316 after upgrade - T239791', diff saved to https://phabricator.wikimedia.org/P10641 and previous config saved to /var/cache/conftool/dbconfig/20200306-085332-marostegui.json
  • 08:47 marostegui: Stop mysql for db1113:3315, db1113:3316 for upgrade T239791
  • 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1113:3315, db1113:3316 for upgrade - T239791', diff saved to https://phabricator.wikimedia.org/P10640 and previous config saved to /var/cache/conftool/dbconfig/20200306-084439-marostegui.json
  • 08:41 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1078 T246604', diff saved to https://phabricator.wikimedia.org/P10639 and previous config saved to /var/cache/conftool/dbconfig/20200306-084141-marostegui.json
  • 08:29 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2085:3311, db2085:3318 after reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10638 and previous config saved to /var/cache/conftool/dbconfig/20200306-082858-marostegui.json
  • 08:25 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1078 T246604', diff saved to https://phabricator.wikimedia.org/P10637 and previous config saved to /var/cache/conftool/dbconfig/20200306-082549-marostegui.json
  • 08:19 moritzm: installing openjpeg2 security updates
  • 08:11 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:09 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:50 marostegui: Stop MySQL on db2085:3311, db2085:3318 for reimage to buster T246604
  • 07:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2085:3311, db2085:3318 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10636 and previous config saved to /var/cache/conftool/dbconfig/20200306-074427-marostegui.json
  • 07:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1078 T246604', diff saved to https://phabricator.wikimedia.org/P10635 and previous config saved to /var/cache/conftool/dbconfig/20200306-073707-marostegui.json
  • 07:05 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1078 T246604', diff saved to https://phabricator.wikimedia.org/P10634 and previous config saved to /var/cache/conftool/dbconfig/20200306-070538-marostegui.json
  • 06:48 marostegui@cumin1001: dbctl commit (dc=all): 'Install 10.4 instead of 10.3 on db1078', diff saved to https://phabricator.wikimedia.org/P10633 and previous config saved to /var/cache/conftool/dbconfig/20200306-064800-marostegui.json
  • 01:38 mutante: added 9 more appservers to codfw pool split between appserver and API appservers, weight 15 (like all in codfw) T247021
  • 01:37 mutante: added 9 more appservers to codfw pool
  • 01:34 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw230[1-9].codfw.wmnet
  • 01:34 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw230[1-9].codfw.wmnet
  • 01:01 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:58 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 00:33 cdanis: repool esams T246338
  • 00:19 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:19 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 00:02 cdanis: T246338 depool esams for router maintenance

2020-03-05

  • 23:55 mutante: pooled mw2290 - noticed it was the only API appserver in codfw not pooled but did not see why, fine in Icinga and no open tickets/SAL
  • 23:55 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw2290.codfw.wmnet
  • 23:30 rzl@cumin1001: conftool action : set/pooled=yes; selector: name=mw1413.eqiad.wmnet
  • 23:27 rzl@cumin1001: conftool action : set/weight=30; selector: name=mw1413.eqiad.wmnet
  • 23:26 rlazarus: mw1413 test-reimage completed successfully, pooling
  • 23:03 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:01 rzl@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:50 mutante: added 8 new appservers to pool in eqiad
  • 22:50 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw139[0-2].eqiad.wmnet
  • 22:47 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw138[5-9].eqiad.wmnet
  • 22:47 dzahn@cumin1001: conftool action : set/weight=30; selector: name=mw138[5-9].eqiad.wmnet
  • 22:46 dzahn@cumin1001: conftool action : set/weight=30; selector: name=mw139[0-2].eqiad.wmnet
  • 22:46 dzahn@cumin1001: conftool action : set/weight=20; selector: name=mw139[0-2].eqiad.wmnet
  • 22:44 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw138[5-9]eqiad.wmnet
  • 22:42 dzahn@cumin1001: conftool action : set/weight=20; selector: name=mw139[0-2]eqiad.wmnet
  • 22:41 dzahn@cumin1001: conftool action : set/weight=20; selector: name=mw138[5-9]eqiad.wmnet
  • 22:41 rlazarus: reimaging mw1413 (new appserver, not pooled) to test https://gerrit.wikimedia.org/r/c/576464
  • 22:40 mutante: [cumin1001:~] $ sudo -i cumin -b 15 'mw13[85-92].eqiad.wmnet' 'sudo -u dzahn scap pull'
  • 22:40 rzl@cumin1001: conftool action : set/pooled=yes; selector: name=mw14(0[5-9]|1[0-2]).eqiad.wmnet
  • 22:40 rzl@cumin1001: conftool action : set/weight=30; selector: name=mw14(0[5-9]|1[0-2]).eqiad.wmnet
  • 22:40 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:40 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:38 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 58s)
  • 22:36 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [ukwikinews] Add HD logos (duration: 00m 59s)
  • 22:35 eileen: civicrm revision changed from 62e62e107c to 10506a9644, config revision is 734a7bfadd
  • 22:34 jforrester@deploy1001: Synchronized static/images/project-logos/: [ukwikinews] Provide HD logos (duration: 00m 59s)
  • 22:27 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 56s)
  • 22:25 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [fawikivoyage] Add custom logos (duration: 00m 58s)
  • 22:22 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 59s)
  • 22:21 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Use HD logos at arwikibooks, cawikibooks, and plwikivoyage (duration: 00m 59s)
  • 22:17 jforrester@deploy1001: Synchronized static/images/project-logos/: Provide HD logos for arwikibooks, cawikibooks, and plwikivoyage (duration: 01m 00s)
  • 22:14 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 59s)
  • 22:12 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Use HD logos at bnwikibooks, bnwikisource, and ukwikivoyage (duration: 00m 59s)
  • 22:10 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:09 jforrester@deploy1001: Synchronized static/images/project-logos/: Provide HD logos for bnwikibooks, bnwikisource, and ukwikivoyage (duration: 01m 00s)
  • 22:07 rzl@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:05 rzl@cumin1001: END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97)
  • 22:05 rzl@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:01 jforrester@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: Stop loading four old logo dblists (duration: 00m 59s)
  • 21:40 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:40 rzl@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:39 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: MachineVision: Update label blacklist (once more for good measure) (duration: 00m 57s)
  • 20:37 mholloway-shell@deploy1001: Synchronized wmf-config/InitialiseSettings.php: MachineVision: Update label blacklist (duration: 00m 59s)
  • 20:20 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:18 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:18 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:17 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:46 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: re-sync for bug 236104 (duration: 00m 56s)
  • 19:45 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Switch GrowthExperiments topic search to ORES (T240517) (duration: 00m 58s)
  • 19:40 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 19:39 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 19:10 ebernhardson@deploy1001: Synchronized wmf-config/SearchSettingsForWikibase.php: (no justification provided) (duration: 00m 57s)
  • 18:33 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:32 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:30 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:30 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:28 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:27 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 18:27 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:26 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:11 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:09 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:05 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:02 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 17:50 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:50 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 17:44 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 17:41 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 17:40 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 17:38 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 17:37 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: I8f0d82164, Iaac7cbfbb9 (no-op) (duration: 00m 59s)
  • 17:32 elukey: run homer on cumin1001 to apply https://gerrit.wikimedia.org/r/576873 on cr1/cr2-eqiad
  • 17:27 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:27 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 17:24 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:24 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 17:19 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 17:15 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 17:14 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 17:14 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 17:11 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 17:09 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:58 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 16:58 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 16:55 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1078 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10631 and previous config saved to /var/cache/conftool/dbconfig/20200305-165555-marostegui.json
  • 16:55 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 16:54 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 16:50 krinkle@deploy1001: Synchronized dblists/: I22a3c2 (duration: 00m 57s)
  • 16:43 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1078 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10630 and previous config saved to /var/cache/conftool/dbconfig/20200305-164319-marostegui.json
  • 16:22 marostegui: Restart tendril/dbtree database
  • 16:18 _joe_: repooling mw1394
  • 16:12 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1078 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10629 and previous config saved to /var/cache/conftool/dbconfig/20200305-161222-marostegui.json
  • 16:01 elukey: depool mw1394
  • 16:01 Krinkle: mw1394 (api_appserver) is fatalling search-related api requests due to "Elastic down?"
  • 15:28 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 15:28 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 15:26 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 15:26 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 15:24 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 15:24 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 15:19 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1078 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10627 and previous config saved to /var/cache/conftool/dbconfig/20200305-151858-marostegui.json
  • 15:18 _joe_: fixing the envoy installation on mw1394-1404, running scap pull
  • 15:15 XioNoX: add SNMP community to Juniper devices
  • 15:01 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 15:01 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:55 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:55 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:52 moritzm: copied hpssacli to thirdparty/hwraid for buster-wikimedia (current Gen 10 releases are named ssaducli now, but retain the old package (which only uses libc anyway) for backwards compat with gen9 on Buster)
  • 14:45 moritzm: copied hpssaducli to thirdparty/hwraid for buster-wikimedia (current releases are named ssaducli now, but retain the old package (which only uses libc anyway) for backwards compat)
  • 14:45 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 14:45 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 14:25 XioNoX: push BGP to Cloud on cr2-codfw - T245606
  • 14:13 Urbanecm: Password reset for SUL User:Yezi Brook (T246988)
  • 14:09 XioNoX: push BGP to Cloud on cr1-codfw - T245606
  • 14:05 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:05 liw@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.22
  • 14:03 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:03 XioNoX: set all eqiad/codfw PDUs, cord W thresholds to 3440 - T245655
  • 13:54 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:51 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:50 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:49 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:48 marostegui: Stop MySQL on db1078 for reimage - T246604
  • 13:47 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1078 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10623 and previous config saved to /var/cache/conftool/dbconfig/20200305-134701-marostegui.json
  • 13:26 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 13:24 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:56 addshore: stop that cache warming ....
  • 12:52 addshore: START warm cache for db1111 & db1126 for Q30-32 million (100k batch selects, 30s sleep) T219123 (pass 1)
  • 12:06 Amir1: the property terms removal is finished. 312K rows deleted (T225054)
  • 11:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2109 after reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10622 and previous config saved to /var/cache/conftool/dbconfig/20200305-115322-marostegui.json
  • 11:45 Amir1: deleting property terms from wb_terms in wikidatawiki (T225054)
  • 11:43 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop writing to the old term store for properties (T219301 T225054), take II (duration: 01m 04s)
  • 11:42 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop writing to the old term store for properties (T219301 T225054) (duration: 01m 04s)
  • 11:29 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/Wikibase: Schedule 1 CleanTermsIfUnusedJob per ID to clean (T244115 T246898) (duration: 01m 08s)
  • 11:25 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/Cognate: Exit undelete hook early if revision not found (T245869) (duration: 01m 04s)
  • 11:20 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q87 million, was 86 (T219123) cache bust (duration: 01m 03s)
  • 11:19 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q87 million, was 86 (T219123) (duration: 01m 04s)
  • 11:10 vgutierrez: Disable parent proxies on ats-tls in ulsfo - T244464
  • 11:06 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q30M for the new term store everywhere (was Q25M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 01m 04s)
  • 11:04 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q30M for the new term store everywhere (was Q25M) + warm db1126 & db1111 caches (T219123) (duration: 01m 05s)
  • 11:04 jbond42: small update to PCC https://gerrit.wikimedia.org/r/c/operations/software/puppet-compiler/+/576663
  • 10:50 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:48 hnowlan@deploy1001: Synchronized multiversion/MWScript.php: T244549: enable running MWScript with phpdbg (duration: 01m 04s)
  • 10:48 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:18 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: Switch parsoid calls to use envoy as a proxy (duration: 01m 07s)
  • 10:14 vgutierrez: Enable keep alive between ats-tls and varnish-fe globally - T244464
  • 10:12 marostegui: Stop MySQL on db2109 for reimage - T246604
  • 10:11 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2109 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10621 and previous config saved to /var/cache/conftool/dbconfig/20200305-101111-marostegui.json
  • 10:11 addshore: START warm cache for db1111 & db1126 for Q25-30 million T219123 (pass 2 today)
  • 09:53 hashar: Restarting Zuul, it no longer processes Gerrit events due to a thread stuck waiting on Gerrit. T246973
  • 08:50 addshore: START warm cache for db1111 & db1126 for Q25-30 million T219123 (pass 1 today)
  • 08:12 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1103:3312 db1103:3314 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10619 and previous config saved to /var/cache/conftool/dbconfig/20200305-081227-marostegui.json
  • 07:33 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1103:3312 db1103:3314 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10618 and previous config saved to /var/cache/conftool/dbconfig/20200305-073319-marostegui.json
  • 07:19 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1103:3312 db1103:3314 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10617 and previous config saved to /var/cache/conftool/dbconfig/20200305-071915-marostegui.json
  • 06:56 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1103:3312 db1103:3314 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10616 and previous config saved to /var/cache/conftool/dbconfig/20200305-065603-marostegui.json
  • 06:48 elukey: restart yarn on analytics1074 (GC overhead, traces of network errors with datanodes)
  • 06:37 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:34 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:20 marostegui: Stop MySQL on db1103:3312 and db1103:3314 for reimage T246604
  • 06:18 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1103:3312 db1103:3314 for reimage T246604', diff saved to https://phabricator.wikimedia.org/P10615 and previous config saved to /var/cache/conftool/dbconfig/20200305-061811-marostegui.json
  • 04:22 krinkle@deploy1001: Synchronized dblists/commonsuploads.dblist: Idb69b82f5 (duration: 01m 04s)
  • 03:23 krinkle@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/WikimediaMaintenance/dumpInterwiki.php: Iec6da824cca (duration: 01m 04s)
  • 03:10 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:09 krinkle@deploy1001: Synchronized php-1.35.0-wmf.22/includes/SiteConfiguration.php: I723133e68, I2b90e8e9b0 (duration: 01m 05s)
  • 03:08 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:00 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:57 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:46 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:43 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:42 krinkle@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: Ib2aaf6540d85 (duration: 01m 04s)
  • 02:37 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:35 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:33 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: I52bb7024384 (no-op) (duration: 01m 04s)
  • 02:30 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Ia5b125 (duration: 01m 05s)
  • 02:23 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:20 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:13 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:11 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:03 twentyafterfour: phabricator deployment done
  • 00:57 twentyafterfour: deploying phabricator-extensions tag release/2020-03-04/1 ( https://phabricator.wikimedia.org/source/phab-extensions/history/wmf%252Fstable/;release/2020-03-04/1 )
  • 00:45 tgr@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/GrowthExperiments/modules/homepage/: SWAT: Adjust topic UX (T244421) (duration: 01m 05s)
  • 00:41 tgr@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/SearchStrategy/SearchStrategy.php: SWAT: Newcomer tasks: Set search sort to random for ORES based topics (T242476) (duration: 01m 04s)
  • 00:27 krinkle@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 01m 02s)
  • 00:24 krinkle@deploy1001: Synchronized dblists/: I4fb3d14ed86 (duration: 01m 04s)
  • 00:14 krinkle@deploy1001: update-interwiki-cache aborted: Update interwiki cache (duration: 00m 31s)
  • 00:08 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: cirrus: use 2 shards for commonswiki_content (duration: 01m 04s)
  • 00:06 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: cirrus: Backend configuration for glent m0 ab test (duration: 01m 04s)

2020-03-04

  • 23:30 krinkle@deploy1001: Synchronized src/: Ic344b48a1f8 - creates StaticSiteConfiguration.php (build-only) (duration: 01m 03s)
  • 23:26 reedy@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/CirrusSearch/includes/: T245303 (duration: 01m 02s)
  • 23:01 eileen: process-control config revision is 734a7bfadd
  • 22:59 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:56 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:55 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:54 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:53 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 22:53 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:51 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 22:51 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:17 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 22:13 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 22:12 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 22:10 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:10 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:09 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:09 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:02 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 22:00 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 21:46 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 21:46 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 21:46 urbanecm@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 25s)
  • 21:43 urbanecm@deploy1001: Synchronized dblists/special.dblist: 8decd01: Add gewikimedia to special wikis (duration: 01m 06s)
  • 21:35 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 21:28 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 21:16 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 21:16 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 20:54 eileen: process-control config revision is 21eb2e891f
  • 20:29 otto@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Include required url in mediawiki/client/error event (T246030) (duration: 01m 05s)
  • 19:58 otto@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: SWAT: Fix callback parameters for client error logging (T246030) (duration: 01m 07s)
  • 19:52 shdubsh: restart logstash on logstash2005 -- testing field type mismatch mitigation
  • 18:43 mutante: starting new DHCP servers to confirm they work and letting puppet immediately stop them again to clear systemd status
  • 18:30 mutante: notebook1003 - restarted nagios-nrpe-server
  • 17:41 addshore: stop item term rebuild at Q60345318 as I generate more lists (T219123)
  • 16:50 otto@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics-external
  • 16:49 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:45 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:05 ottomata: destroying unused eventgate-main 'main' and eventgate-analytics 'analytics' helm releases - T245203
  • 16:02 addshore: addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --sleep 1 --batch-size=50 # T244115
  • 15:47 liw@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.22 (duration: 01m 03s)
  • 15:46 liw@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.22
  • 15:29 vgutierrez: upgrading ATS to version 8.0.6 on eqiad
  • 15:11 thcipriani: restarting zuul
  • 14:55 akosiaris@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics-external
  • 14:44 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:41 filippo@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:25 vgutierrez: upgrading ATS to version 8.0.6 on codfw
  • 14:19 liw@deploy1001: rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.35.0-wmf.21"
  • 14:18 akosiaris: cleanup old LVS eventgate services. T245203
  • 14:13 addshore: cache warming stopped on db1126 and db1111
  • 14:08 liw@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.22 (duration: 01m 04s)
  • 14:07 liw@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.22
  • 13:47 addshore: START warm cache for db1111 & db1126 for Q25-30 million T219123 (pass 3)
  • 13:33 godog: disable puppet on install1002 to test partman on theemin
  • 13:19 vgutierrez: upgrading ATS to version 8.0.6 on esams
  • 13:14 marostegui: Drop fixcopyrightwiki from sanitarium hosts (db1112, db2074) to avoid getting the data alert - T246055
  • 12:55 urbanecm@deploy1001: Synchronized wmf-config/throttle.php: 37db2a1: Add new throttle rule for WikiGap Göteborg 2020-03-06 (T246888) (duration: 01m 04s)
  • 12:23 XioNoX: add flowspec rule on cr3-knams - T243482
  • 12:20 Urbanecm: EU SWAT done
  • 12:19 moritzm: installing 4.9.210-1~deb8u1 kernel on jessie hosts (no reboots, just the upgrade)
  • 12:19 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/GrowthExperiments/includes/HelpPanel/QuestionStore.php: SWAT: d495f4c: Replace loadRevisionFromId which has been removed in I0c8fe834da79c (duration: 01m 06s)
  • 12:14 urbanecm@deploy1001: Synchronized wmf-config/throttle.php: SWAT: 1fa9dda: IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) (T246832) (duration: 01m 06s)
  • 12:11 moritzm: imported linux-meta 1.23 to apt.wikimedia.org for jessie-wikimedia
  • 12:04 urbanecm@deploy1001: Synchronized wmf-config/throttle.php: SWAT: 85a5c05: Add throttle exempt for 2020-03-07 GenderGap Event (T246813) (duration: 01m 05s)
  • 11:51 addshore: START warm cache for db1111 & db1126 for Q25-30 million T219123 (pass 2)
  • 11:19 vgutierrez: upgrading ATS to version 8.0.6 on eqsin
  • 11:01 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q86 million, was 84 (T219123) cache bust (duration: 01m 03s)
  • 11:00 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q86 million, was 84 (T219123) (duration: 01m 04s)
  • 10:52 vgutierrez: upgrading ATS to version 8.0.6 on ulsfo
  • 10:41 addshore: START warm cache for db1111 & db1126 for Q25-30 million T219123 (pass 1)
  • 10:38 vgutierrez: upload trafficserver 8.0.6-1wm1 to apt.wm.o (buster)
  • 10:38 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q25M for the new term store everywhere (was Q20M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 01m 04s)
  • 10:36 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q25M for the new term store everywhere (was Q20M) + warm db1126 & db1111 caches (T219123) (duration: 01m 05s)
  • 10:20 marostegui: Remove es2 eqiad and codfw from zarcillo.masters table - T246072
  • 10:10 marostegui: Update shards table to set es2 display=0 - T246072
  • 10:05 marostegui: es2 maintenance window over T246072
  • 09:59 marostegui@cumin1001: dbctl commit (dc=all): 'Give some weight to es2 master es1015 and es2016, now standalone - T246072', diff saved to https://phabricator.wikimedia.org/P10609 and previous config saved to /var/cache/conftool/dbconfig/20200304-095919-marostegui.json
  • 09:55 marostegui: Reset replication on es2 hosts - T246072
  • 09:44 moritzm: installing python-bleach security updates
  • 09:43 marostegui: Set es1015 (es2 master) on read_only - T246072
  • 09:38 addshore: START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 3 today)
  • 09:21 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Set es2 as RO - T246072 (duration: 01m 04s)
  • 09:13 _joe_: removing nginx from servers where it was just used for service proxying.
  • 09:09 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Set es2 as RO - T246072 (duration: 01m 14s)
  • 08:58 akosiaris: release Giant Puppet Lock across the fleet. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/ has made its way to all PoPs and most of codfw without issues, will make it to the rest of the fleet in the next 30 minutes
  • 08:54 addshore: START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 2 today)
  • 08:45 akosiaris: running puppet on first mw host after merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/, mw2269, rescheduling icinga checks as well
  • 08:41 akosiaris: running puppet on first es host after merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/, es2019, rescheduling icinga checks as well (correction)
  • 08:41 akosiaris: running puppet on first es host after merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/, db2019, rescheduling icinga checks as well
  • 08:41 akosiaris: running puppet on first db host after merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/, db2086, rescheduling icinga checks as well
  • 08:13 addshore: START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 1 today)
  • 07:37 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1098:3316 and db1098:3317 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10608 and previous config saved to /var/cache/conftool/dbconfig/20200304-073721-marostegui.json
  • 07:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1098:3316 and db1098:3317 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10607 and previous config saved to /var/cache/conftool/dbconfig/20200304-071443-marostegui.json
  • 07:00 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1098:3316 and db1098:3317 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10606 and previous config saved to /var/cache/conftool/dbconfig/20200304-070048-marostegui.json
  • 06:45 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1098:3316 and db1098:3317 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10605 and previous config saved to /var/cache/conftool/dbconfig/20200304-064520-marostegui.json
  • 06:30 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:28 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:22 cdanis: ✔️ cdanis@prometheus2004.codfw.wmnet ~ 🕝☕ sudo systemctl restart prometheus@ops
  • 06:21 cdanis: ✔️ cdanis@prometheus2004.codfw.wmnet ~ 🕝☕ sudo systemctl reload prometheus@ops
  • 06:10 marostegui: Stop MySQL on db1098:3316, db1098:3317 for upgrade - T246604
  • 01:56 mutante: mw2178 - systemctl reset-failed to clear (CRITICAL: Status of the systemd unit php7.2-fpm_check_restart)
  • 01:55 mutante: mw2290 - systemctl reset-failed to clear (CRITICAL: Status of the systemd unit php7.2-fpm_check_restart)
  • 01:48 mutante: mw1315 - restarted php-fpm and apache (was alerting in Icinga with 503 for 12 hours), log showed failed coredumps, restarts recovered it
  • 01:31 mutante: ganeti2003 - DRAC reset failed with "ipmi_cmd_cold_reset: BMC busy"
  • 01:30 mutante: ganeti2003 - mgmt interface stopped responding on SSH, resetting DRAC via bmc-device from the host
  • 00:25 ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.21/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: [cirrus] Match fallback config key with the one used in cirrus (duration: 01m 03s)
  • 00:23 ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: [cirrus] Match fallback config key with the one used in cirrus (duration: 01m 04s)
  • 00:15 ebernhardson@deploy1001: Synchronized wmf-config/SearchSettingsForWikibase.php: [cirrus] move similarity settings to IS.php (duration: 01m 05s)
  • 00:13 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [cirrus] move similarity settings to IS.php (duration: 01m 04s)
  • 00:06 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [cirrus] configure wgCirrusSearchMaxShardsPerNode per cluster (duration: 01m 05s)
  • 00:06 ebernhardson: post-deployment restart mjolnir-kafka-bulk-daemon across eqiad and codfw
  • 00:05 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@1c97543]: Bump mjolnir to master: Revert stream gzip decompression (duration: 05m 25s)
  • 00:00 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@1c97543]: Bump mjolnir to master: Revert stream gzip decompression

2020-03-03

  • 21:48 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 04s)
  • 21:46 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [wikidatawiki] Note that MostRevisions and MostLinked have been disabled (duration: 01m 05s)
  • 21:33 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 21:33 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 21:13 thcipriani@deploy1001: Synchronized php-1.35.0-wmf.22/includes/Defines.php: Update MW_VERSION to 1.35.0-wmf.22 (duration: 01m 06s)
  • 20:59 vgutierrez: Starting pybal on lvs1013
  • 20:54 vgutierrez: rebooting lvs1013
  • 20:44 joal@deploy1001: Finished deploy [analytics/refinery@264c7ec] (thin): Regular weekly analytics deploy (duration: 00m 07s)
  • 20:44 joal@deploy1001: Started deploy [analytics/refinery@264c7ec] (thin): Regular weekly analytics deploy
  • 20:43 joal@deploy1001: Finished deploy [analytics/refinery@264c7ec]: Regular (duration: 13m 05s)
  • 20:42 vgutierrez: stopping pybal on lvs1013
  • 20:42 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 20:42 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 20:30 joal@deploy1001: Started deploy [analytics/refinery@264c7ec]: Regular
  • 20:28 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 20:28 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 19:50 mutante: cloudmetrics1002 - removed port 8080 from apache's ports.conf and restarted the service (cloudmetrics1001 did not have this)
  • 19:38 jforrester@deploy1001: Synchronized php-1.35.0-wmf.22/extensions/AbuseFilter: T213006 T246539: Minor fixes for the updateVarDumps script (duration: 01m 05s)
  • 19:19 jforrester@deploy1001: Synchronized php-1.35.0-wmf.21/extensions/WikimediaEvents/includes/WikimediaEventsHooks.php: T246030 T226986: Set wgWMEClientErrorIntakeURL in onResourceLoaderGetConfigVars (duration: 01m 05s)
  • 19:02 James_F: Manually purged vecwiki logos from Varnish for T246808
  • 19:01 jforrester@deploy1001: Synchronized static/images/project-logos/: T246808 [vecwiki] Update project logo with temporary 20k branding (duration: 01m 10s)
  • 18:58 mutante: generating new certs for grafana-labs/graphite-labs
  • 18:22 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:22 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 18:21 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:19 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:17 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:17 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:17 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:16 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:14 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:11 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:11 robh: updating firmware on scs-oe16-esams via T174475
  • 18:10 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:06 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:02 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:02 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 18:01 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:55 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:51 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:50 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:47 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:46 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:45 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:44 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:40 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:36 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:32 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:30 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 17:30 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:29 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:28 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:27 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:22 cmjohnson@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 17:20 cmjohnson@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:20 hnowlan@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
  • 17:17 hnowlan@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
  • 17:17 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:17 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:16 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:14 otto@deploy1001: Started restart [changeprop/deploy@e2fe8ca]: Restart to pick up new LVS TLS port for eventgate T242224
  • 17:14 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:13 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:06 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 17:02 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:02 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:02 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 17:01 cmjohnson@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:58 vgutierrez: Re-enable BGP in lvs1013 - T245984
  • 16:51 bblack: lvs5003 - restart pybal, back to normal operations
  • 16:51 hnowlan@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' .
  • 16:51 krinkle@deploy1001: Synchronized multiversion/MWWikiversions.php: I9d658ff41b78 (duration: 01m 04s)
  • 16:50 hnowlan@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
  • 16:49 hnowlan@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
  • 16:49 bblack: reload icinga config on icinga1001
  • 16:48 krinkle@deploy1001: Synchronized wmf-config/import.php: I9d658ff41b78 (duration: 01m 03s)
  • 16:47 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:47 otto@deploy1001: Started restart [restbase/deploy@bfdd342]: Restart to pick up new LVS TLS port for eventgate T242224
  • 16:47 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:45 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 16:44 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:35 otto@deploy1001: Started restart [restbase/deploy@bfdd342] (dev-cluster): Restart (dev-cluster) to pick up new LVS TLS port for eventgate T242224
  • 16:34 krinkle@deploy1001: Synchronized multiversion/MWWikiversions.php: I8815be - T169821 (duration: 01m 04s)
  • 16:32 vgutierrez: reimage lvs1013 with buster - T245984
  • 16:28 bblack: stopping pybal on lvs5003 to test the new icinga checks (will cause a BGP alert, among others)
  • 16:17 Pchelolo: restart restbase on 2009 for T242224
  • 16:14 ottomata: switching restbase & change prop to new eventgate-main LVS TLS ports
  • 16:13 vgutierrez: Re-enable BGP in lvs1014 - T245984
  • 16:05 vgutierrez: Starting pybal on lvs2009 - T246686
  • 16:04 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1096:3315 and db1096:3316 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10597 and previous config saved to /var/cache/conftool/dbconfig/20200303-160433-marostegui.json
  • 15:59 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:56 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:55 liw@deploy1001: Finished scap: group0 to 1.35.0-wmf.22 (duration: 24m 29s)
  • 15:49 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1096:3315 and db1096:3316 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10596 and previous config saved to /var/cache/conftool/dbconfig/20200303-154913-marostegui.json
  • 15:47 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:45 vgutierrez: Stopping pybal on lvs2009 to let lvs2010 get its traffic - T246686
  • 15:45 mutante: wtp1025 - scap pull as user cscott - testing sudo privs issue
  • 15:44 vgutierrez: reimage lvs1014 with buster - T245984
  • 15:43 mutante: wtp1025 - scap pull
  • 15:35 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:34 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:31 vgutierrez: Re-enable BGP in lvs1015 - T245984
  • 15:31 liw@deploy1001: Started scap: group0 to 1.35.0-wmf.22
  • 15:30 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:22 liw@deploy1001: Finished scap: testwiki to php-1.35.0-wmf.22 and rebuild l10n cache (duration: 71m 23s)
  • 15:20 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 15:18 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1096:3315 and db1096:3316 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10595 and previous config saved to /var/cache/conftool/dbconfig/20200303-151805-marostegui.json
  • 15:15 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:15 elukey@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0)
  • 15:13 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:07 marostegui@cumin1001: dbctl commit (dc=all): 'Decrease a bit the weight for db1126', diff saved to https://phabricator.wikimedia.org/P10594 and previous config saved to /var/cache/conftool/dbconfig/20200303-150712-marostegui.json
  • 15:00 vgutierrez: reimage lvs1015 with buster - T245984
  • 14:52 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1096:3315 and db1096:3316 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10591 and previous config saved to /var/cache/conftool/dbconfig/20200303-145230-marostegui.json
  • 14:44 vgutierrez@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:43 vgutierrez@cumin2001: START - Cookbook sre.hosts.decommission
  • 14:43 vgutierrez: running the decommission cookbook against lvs2001.codfw.wmnet - T246779
  • 14:42 vgutierrez: replace lvs2001 with lvs2007 - T196560
  • 14:41 addshore: START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 2)
  • 14:29 vgutierrez: update puppet compiler facts
  • 14:28 ottomata: postponing LVS for eventgate-analytics-external until tomorrow
  • 14:13 ottomata: beginning procedure to add LVS and discovery for eventgate-analytics-external - T233629
  • 14:12 elukey@cumin1001: START - Cookbook sre.elasticsearch.rolling-restart
  • 14:11 liw@deploy1001: Started scap: testwiki to php-1.35.0-wmf.22 and rebuild l10n cache
  • 14:08 liw@deploy1001: Pruned MediaWiki: 1.35.0-wmf.20 (duration: 15m 54s)
  • 14:08 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:06 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:55 zpapierski@deploy1001: Finished deploy [wdqs/wdqs@8da3ae6]: (no justification provided) (duration: 10m 48s)
  • 13:54 vgutierrez: reimage lvs1016 with buster - T245984
  • 13:52 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:50 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:45 zpapierski@deploy1001: Started deploy [wdqs/wdqs@8da3ae6]: (no justification provided)
  • 13:42 zpapierski@deploy1001: Finished deploy [wdqs/wdqs@8da3ae6]: (no justification provided) (duration: 00m 26s)
  • 13:41 zpapierski@deploy1001: Started deploy [wdqs/wdqs@8da3ae6]: (no justification provided)
  • 13:34 vgutierrez: Re-enable BGP in lvs3005 - T245984
  • 13:19 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
  • 13:14 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
  • 13:13 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
  • 13:13 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:12 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:12 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:12 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
  • 13:11 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:10 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:02 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:01 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 13:00 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:48 hnowlan@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
  • 12:45 addshore: START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 1)
  • 12:38 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q20M for the new term store everywhere (was Q15M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 00m 56s)
  • 12:37 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q20M for the new term store everywhere (was Q15M) + warm db1126 & db1111 caches (T219123) (duration: 00m 55s)
  • 12:35 addshore: END warm cache for db1111 & db1126 for Q15-20 million T219123 (pass 3) (finished it early)
  • 12:19 urbanecm@deploy1001: Synchronized wmf-config/throttle.php: SWAT: 7b48737: Throttle rule for Czech Wikigap (T246356) (duration: 00m 56s)
  • 12:14 urbanecm@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 10s)
  • 12:12 urbanecm@deploy1001: update-interwiki-cache aborted: Update interwiki cache (duration: 00m 01s)
  • 12:12 urbanecm@deploy1001: update-interwiki-cache aborted: Update interwiki cache (duration: 00m 00s)
  • 12:10 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 575974|ContentTranslation: Add URL campaign for WikiGapFinder (T246335), take II (duration: 00m 56s)
  • 12:09 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 575974|ContentTranslation: Add URL campaign for WikiGapFinder (T246335) (duration: 00m 56s)
  • 12:06 addshore: START warm cache for db1111 & db1126 for Q15-20 million T219123 (pass 3)
  • 12:03 liw: cutting branch for 1.35.0-wmf.22 train
  • 11:54 jbond42: disable puppet in order to add netbox hiera backend
  • 11:09 moritzm: installing Java security updates on an-airflow, an-launcher and an-presto*
  • 11:08 addshore: START warm cache for db1111 & db1126 for Q15-20 million T219123 (pass 2)
  • 11:04 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:01 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:01 elukey@cumin1001: END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0)
  • 10:49 elukey@cumin1001: START - Cookbook sre.kafka.roll-restart-mirror-maker
  • 10:47 elukey@cumin1001: END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0)
  • 10:47 vgutierrez@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 10:46 vgutierrez: running the decommission cookbook against lvs2002 - T246756
  • 10:46 vgutierrez@cumin2001: START - Cookbook sre.hosts.decommission
  • 10:44 vgutierrez: replace lvs2002 with lvs2008 - T196560
  • 10:14 addshore: START warm cache for db1111 & db1126 for Q15-20 million T219123 (pass 1)
  • 10:10 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q15M for the new term store everywhere (was Q12M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 00m 56s)
  • 10:09 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q15M for the new term store everywhere (was Q12M) + warm db1126 & db1111 caches (T219123) (duration: 00m 56s)
  • 10:06 addshore: END warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 4)
  • 10:05 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:57 marostegui: es4 deployment window finished
  • 09:43 vgutierrez: reimage lvs3005 with buster - T245984
  • 09:36 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Enable es4 as new writable external store section - T246072 (duration: 00m 56s)
  • 09:35 addshore: START warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 4)
  • 09:33 marostegui@deploy1001: sync-file aborted: Enable es4 as new writable external store section - T246072 (duration: 00m 02s)
  • 09:33 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Enable es4 as new writable external store section - T246072 (duration: 00m 56s)
  • 09:32 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Enable es4 as new writable external store section - T246072 (duration: 00m 57s)
  • 09:10 marostegui@deploy1001: Synchronized wmf-config/db-codfw.php: Add es4 to the available es sections, not in use yet - T246072 (duration: 00m 57s)
  • 09:07 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Add es4 to the available es sections, not in use yet - T246072 (duration: 00m 57s)
  • 08:53 addshore: START warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 3)
  • 08:36 elukey@cumin1001: START - Cookbook sre.kafka.roll-restart-brokers
  • 08:25 addshore: addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=25 --sleep=1 --file=27feb1125-40to50-holes # T219123
  • 08:13 addshore: START warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 2)
  • 08:11 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
  • 08:08 addshore: addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=25 --sleep=1 --file=27feb1125-30to40-holes # T219123
  • 08:05 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper
  • 07:55 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
  • 07:48 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper
  • 07:45 addshore: START warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 1)
  • 07:41 vgutierrez: Re-enable BGP in lvs3006 - T245984
  • 07:39 elukey@cumin1001: END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0)
  • 07:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:09 elukey@cumin1001: START - Cookbook sre.druid.roll-restart-workers
  • 07:09 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:07 elukey@cumin1001: END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0)
  • 07:07 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:51 vgutierrez: reimage lvs3006 with buster - T245984
  • 06:49 marostegui: Stop MySQL on db1096:3315,3316 for reimage - T246604
  • 06:43 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1096:3315, db1096:3316 for reimage T246604', diff saved to https://phabricator.wikimedia.org/P10587 and previous config saved to /var/cache/conftool/dbconfig/20200303-064316-marostegui.json
  • 06:41 elukey@cumin1001: START - Cookbook sre.druid.roll-restart-workers
  • 06:33 vgutierrez: Switch from globalsign to LE as unified cert vendor on eqiad - T230687
  • 06:25 vgutierrez: Switch from globalsign to LE as unified cert vendor on codfw - T230687
  • 02:57 mutante: manually running "Ancientpages" cron on s3 (T243599)
  • 02:39 mutante: manually running updateSpecialPages.php maintenance cron on s8 for AncientPages to confirm it was fixed by gerrit:574726 a few days ago (T243599)
  • 00:53 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T240055 Update special scandium configuration to load from /srv/parsoid-testing (duration: 00m 58s)
  • 00:36 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 56s)
  • 00:35 jforrester@deploy1001: Synchronized dblists/mobilemainpagelegacy.dblist: T32405 Drop legacy main page special casing on select projects (duration: 00m 56s)
  • 00:31 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 56s)
  • 00:28 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T242030 Enable lead paragraph in user namespace on nlwiki (duration: 00m 56s)
  • 00:15 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet
  • 00:14 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T240055: Point the Parsoid cluster at the train version of Parsoid, not a special check-out (duration: 00m 56s)
  • 00:13 jforrester@deploy1001: sync aborted: wmf-config/CommonSettings.php T240055: Point the Parsoid cluster at the train version of Parsoid, not a special check-out (duration: 00m 03s)
  • 00:13 jforrester@deploy1001: Started scap: wmf-config/CommonSettings.php T240055: Point the Parsoid cluster at the train version of Parsoid, not a special check-out
  • 00:05 mutante: wtp1025 - scap pull

2020-03-02

  • 23:58 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=wtp1025.eqiad.wmnet
  • 23:43 krinkle@deploy1001: Synchronized docroot/noc/: Idc2671 (duration: 00m 56s)
  • 23:42 krinkle@deploy1001: Synchronized multiversion/: Idc2671 (duration: 00m 57s)
  • 23:41 krinkle@deploy1001: Synchronized src/: Idc2671 (duration: 00m 56s)
  • 23:36 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: I1bdc5e359476 (duration: 00m 56s)
  • 21:10 XioNoX: re-number AMS-IX peer 64271
  • 20:22 otto@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: Enable Mediawiki client side error logging on group0 wikis - T246030 (duration: 00m 57s)
  • 20:12 krinkle@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: I1ef0589 (duration: 00m 58s)
  • 20:01 otto@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: Enable Mediawiki client side error logging in beta - T246030 (duration: 00m 56s)
  • 19:35 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - allow eventgate-analytics-external to produce error events - T233629 (duration: 00m 56s)
  • 19:17 ppchelko@deploy1001: Synchronized wmf-config/CommonSettings.php: Enable REST run jobs endpoint on jobrunners T244770 (duration: 00m 56s)
  • 19:15 ppchelko@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: Enable REST run jobs endpoint on jobrunners T244770 (duration: 00m 56s)
  • 19:04 otto@deploy1001: Synchronized wmf-config/ProductionServices.php: Use new LVS port for EventBus for eventgate-main on all wikis - T245203 (duration: 00m 56s)
  • 19:03 otto@deploy1001: Synchronized wmf-config/LabsServices.php: Use new LVS port for EventBus for eventgate-main on all wikis - T245203 (duration: 00m 56s)
  • 19:01 otto@deploy1001: Synchronized wmf-config/CommonSettings.php: Use new LVS port for EventBus for eventgate-main on all wikis - T245203 (duration: 00m 57s)
  • 18:58 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Use new LVS port for EventBus for eventgate-main on all wikis - T245203 (duration: 00m 56s)
  • 18:53 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Use new LVS port for EventBus for eventgate-main on group1 wikis - T245203 (duration: 00m 57s)
  • 18:45 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Use new LVS port for EventBus for eventgate-main on group0 wikis - T245203 (duration: 00m 57s)
  • 18:43 otto@deploy1001: Synchronized wmf-config/CommonSettings.php: Use new LVS port for EventBus for eventgate-main on group0 wikis - T245203 (duration: 00m 56s)
  • 18:41 XioNoX: remove BGP to lvs2004/5/6 on cr1/2-codfw
  • 18:41 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@8195b6f]: Bump python to 3.7, python-kafka to 1.4.7 (duration: 04m 04s)
  • 18:41 otto@deploy1001: Synchronized wmf-config/ProductionServices.php: Use new LVS port for EventBus for eventgate-main on group0 wikis - T245203 (duration: 00m 57s)
  • 18:39 otto@deploy1001: Synchronized wmf-config/LabsServices.php: Use new LVS port for EventBus for eventgate-main on group0 wikis - T245203 (duration: 00m 58s)
  • 18:38 ottomata: using new eventgate-main LVS ports for eventbus on group0 wikis - T245203
  • 18:37 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@8195b6f]: Bump python to 3.7, python-kafka to 1.4.7
  • 18:35 XioNoX: add BGP to lvs2008 on cr1/2-codfw
  • 18:02 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:59 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 17:50 vgutierrez: starting pybal on lvs2009 - T246686
  • 17:41 mutante: notebook1003 df: /mnt/hdfs: Input/output error | systemctl restart nagios-nrpe-server (T224682)
  • 17:40 mutante: notebook1003 systemctl restart nagios-nrpe-server
  • 17:04 vgutierrez: Stopping pybal on lvs2009 to let lvs2010 get its traffic - T246686
  • 16:20 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:20 moritzm: installing netty-3.9 security updates
  • 16:18 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:57 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:54 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:34 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 350 to 400 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10583 and previous config saved to /var/cache/conftool/dbconfig/20200302-153416-marostegui.json
  • 15:30 vgutierrez: reimage lvs3007 with buster - T245984
  • 15:27 vgutierrez@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 15:26 vgutierrez@cumin2001: START - Cookbook sre.hosts.decommission
  • 15:26 vgutierrez: running the decommission cookbook against lvs2004 - T246669
  • 15:20 otto@deploy1001: Synchronized wmf-config/ProductionServices.php: Use new LVS port for EventBus+monolog for eventgate-analytics - T245203 (duration: 00m 56s)
  • 15:20 ottomata: Use new LVS port for EventBus+monolog for eventgate-analytics - T245203
  • 15:11 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 300 to 350 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10582 and previous config saved to /var/cache/conftool/dbconfig/20200302-151149-marostegui.json
  • 14:51 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 250 to 300 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10581 and previous config saved to /var/cache/conftool/dbconfig/20200302-145130-marostegui.json
  • 14:42 vgutierrez: Re-enable BGP in lvs5001 - T245984
  • 14:40 marostegui@cumin1001: dbctl commit (dc=all): 'Give weight to es4 and es5 unused eqiad slaves T246072', diff saved to https://phabricator.wikimedia.org/P10579 and previous config saved to /var/cache/conftool/dbconfig/20200302-144033-marostegui.json
  • 14:39 marostegui@cumin1001: dbctl commit (dc=all): 'Give weight to es4 and es5 unused codfw slaves T246072', diff saved to https://phabricator.wikimedia.org/P10578 and previous config saved to /var/cache/conftool/dbconfig/20200302-143915-marostegui.json
  • 14:38 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q12M for the new term store everywhere (was Q10M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 00m 56s)
  • 14:37 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q12M for the new term store everywhere (was Q10M) + warm db1126 & db1111 caches (T219123) (duration: 00m 58s)
  • 14:37 vgutierrez@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:36 vgutierrez@cumin2001: START - Cookbook sre.hosts.decommission
  • 14:36 vgutierrez: running the decommission cookbook against lvs2005 - T246666
  • 14:20 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 200 to 250 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10577 and previous config saved to /var/cache/conftool/dbconfig/20200302-142017-marostegui.json
  • 14:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:17 addshore: START warm cache for db1111 & db1126 for Q10-12 million T219123 (pass 3)
  • 14:15 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:05 vgutierrez: update puppet compiler facts
  • 13:58 addshore: START warm cache for db1111 & db1126 for Q10-12 million T219123 (pass 2)
  • 13:55 vgutierrez: Switch from globalsign to LE as unified cert vendor on ulsfo - T230687
  • 13:53 vgutierrez: Switch from globalsign to LE as unified cert vendor on cp4026 - T230687
  • 13:48 vgutierrez: reimage lvs5001 with buster - T245984
  • 13:33 kart_: Update cxserver to 2020-03-02-115344-production: Reverting T246319
  • 13:30 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 13:28 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 13:26 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 13:18 elukey: roll restart Hadoop master daemons on an-master100[1,2] for openjdk upgrades
  • 13:11 addshore: START warm cache for db1111 & db1126 for Q10-12 million T219123 (pass 1)
  • 13:08 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q10M for the new term store for clients (was Q8M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 00m 55s)
  • 13:07 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q10M for the new term store for clients (was Q8M) + warm db1126 & db1111 caches (T219123) (duration: 00m 56s)
  • 12:58 Urbanecm: Deploy security fix for T229731
  • 12:16 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 8280f81: Set cswiki and cywiki to use custom minerva logo again (T246535): take II (duration: 00m 57s)
  • 12:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 8280f81: Set cswiki and cywiki to use custom minerva logo again (T246535) (duration: 00m 58s)
  • 12:09 oblivian@deploy1001: Synchronized wmf-config/ProductionServices.php: Switch search to use envoy as a proxy (duration: 00m 56s)
  • 11:54 vgutierrez: enable BGP in lvs5002 - T245984
  • 11:44 addshore: START warm cache for db1111 & db1126 for Q8-10 million T219123 (pass 2)
  • 11:41 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 57s)
  • 11:40 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 57s)
  • 11:39 jbond42: enable strict_hostname_checking on the puppet masters https://gerrit.wikimedia.org/r/c/operations/puppet/+/575220
  • 11:12 kart_: Update cxserver to 2020-02-28-043702-production (T246319)
  • 11:07 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:05 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:04 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:02 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 11:01 addshore: START warm cache for db1111 & db1126 for Q8-10 million T219123 (pass 1)
  • 10:55 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q8M for the new term store for clients (was Q6M) + warm db1126 & db1111 caches (T219123) (duration: 00m 58s)
  • 10:35 vgutierrez: reimage lvs5002 with buster - T245984
  • 10:34 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 150 to 200 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10576 and previous config saved to /var/cache/conftool/dbconfig/20200302-103445-marostegui.json
  • 10:22 addshore: START warm cache for db1111 & db1126 for Q6-8 million T219123 (pass 2)
  • 09:59 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 100 to 150 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10575 and previous config saved to /var/cache/conftool/dbconfig/20200302-095921-marostegui.json
  • 09:58 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1119 after upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10574 and previous config saved to /var/cache/conftool/dbconfig/20200302-095841-marostegui.json
  • 09:52 moritzm: installing remaining curl security updates
  • 09:51 addshore: START warm cache for db1111 & db1126 for Q6-8 million T219123 (pass 1)
  • 09:50 elukey: powercycle an-worker1083 (no ssh, mgmt console available but tty not really usable, CPU soft lockups reported)
  • 09:46 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119 after upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10573 and previous config saved to /var/cache/conftool/dbconfig/20200302-094633-marostegui.json
  • 09:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119 after upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10572 and previous config saved to /var/cache/conftool/dbconfig/20200302-093848-marostegui.json
  • 09:38 moritzm: installing openssh updates for jessie
  • 09:34 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 80 to 100 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10571 and previous config saved to /var/cache/conftool/dbconfig/20200302-093449-marostegui.json
  • 09:27 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1119 after upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10570 and previous config saved to /var/cache/conftool/dbconfig/20200302-092743-marostegui.json
  • 09:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1119 T239791', diff saved to https://phabricator.wikimedia.org/P10569 and previous config saved to /var/cache/conftool/dbconfig/20200302-091947-marostegui.json
  • 09:12 addshore: warm cache for db1111 for Q0-6 million T219123 T246447 (pass 2)
  • 08:54 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 50 to 80 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10568 and previous config saved to /var/cache/conftool/dbconfig/20200302-085420-marostegui.json
  • 08:44 moritzm: installing openssh updates for stretch
  • 08:33 addshore: warm cache for db1111 for Q0-6 million T219123 T246447
  • 08:14 addshore: resume item term table rebuild script (from Q54 mill) T219123
  • 08:07 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 30 to 50 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10567 and previous config saved to /var/cache/conftool/dbconfig/20200302-080721-marostegui.json
  • 07:22 vgutierrez: upgrading NICs FW on lvs2008 - T196560 T203194
  • 07:21 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 10 to 30 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10566 and previous config saved to /var/cache/conftool/dbconfig/20200302-072118-marostegui.json
  • 07:10 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:08 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:45 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 1 to 10 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10565 and previous config saved to /var/cache/conftool/dbconfig/20200302-064522-marostegui.json
  • 06:42 marostegui: Enable events on db1111 T246447
  • 06:24 marostegui@cumin1001: dbctl commit (dc=all): 'Add db1111 to s8 with minimal weight to check grants and any other issues T246447', diff saved to https://phabricator.wikimedia.org/P10564 and previous config saved to /var/cache/conftool/dbconfig/20200302-062435-marostegui.json
  • 06:04 marostegui: Re-add db1111 to s8 in tendril and zarcillo - T246447

2020-03-01

  • 17:54 marostegui: Start replication on db1111 new host on s8 - T246447
  • 17:45 marostegui@cumin1001: dbctl commit (dc=all): 'Reduce main traffic weight for db1087 as dumps are running ', diff saved to https://phabricator.wikimedia.org/P10563 and previous config saved to /var/cache/conftool/dbconfig/20200301-174536-marostegui.json
  • 16:08 reedy@deploy1001: scap failed: average error rate on 5/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
  • 06:02 ariel@deploy1001: Finished deploy [dumps/dumps@8376c62]: refactor page content jobs, prefetch, and output file listings: see T246465 (duration: 00m 04s)
  • 06:02 ariel@deploy1001: Started deploy [dumps/dumps@8376c62]: refactor page content jobs, prefetch, and output file listings: see T246465

2020-02-29

  • 12:37 reedy@deploy1001: Synchronized wmf-config/config/viwiki.yaml: T246511 (duration: 00m 56s)
  • 12:35 reedy@deploy1001: Synchronized wikiversions-labs.json: T246511 (duration: 00m 56s)
  • 12:34 reedy@deploy1001: Synchronized dblists/all-labs.dblist: T246511 (duration: 00m 57s)

2020-02-28

  • 21:31 mutante: using planet1001 to manually hack APT sources to test new apt1001.wikimedia.org
  • 20:29 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 20:26 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 19:01 milimetric@deploy1001: Finished deploy [analytics/refinery@0fc392f] (thin): Hotfix: going back to a safe version of geo udf (duration: 00m 07s)
  • 19:01 milimetric@deploy1001: Started deploy [analytics/refinery@0fc392f] (thin): Hotfix: going back to a safe version of geo udf
  • 19:01 milimetric@deploy1001: Finished deploy [analytics/refinery@0fc392f]: Hotfix: going back to a safe version of geo udf (duration: 13m 06s)
  • 18:47 milimetric@deploy1001: Started deploy [analytics/refinery@0fc392f]: Hotfix: going back to a safe version of geo udf
  • 16:25 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:22 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:05 oblivian@puppetmaster1001: conftool action : set/pooled=yes:weight=1; selector: cluster=kibana,service=kibana-next
  • 15:54 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:51 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:39 moritzm: installing libperl4-corelibs-perl updates from Stretch point release
  • 15:36 elukey@deploy1001: Finished deploy [analytics/refinery@28fa2fc]: fix for refinery-drop-older-than - part 2 (duration: 13m 40s)
  • 15:24 marostegui: Stop replication on db1077 from db1111 (its master) - T246447
  • 15:22 elukey@deploy1001: Started deploy [analytics/refinery@28fa2fc]: fix for refinery-drop-older-than - part 2
  • 14:17 gehel: rolling restart of elasticsearch/eqiad for JVM upgrade completed
  • 14:16 gehel@cumin1001: END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0)
  • 14:15 elukey@deploy1001: Finished deploy [analytics/refinery@2db36f4]: Fix refinery-drop-older-than script (duration: 14m 01s)
  • 14:10 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 100 to 300', diff saved to https://phabricator.wikimedia.org/P10558 and previous config saved to /var/cache/conftool/dbconfig/20200228-141035-marostegui.json
  • 14:01 elukey@deploy1001: Started deploy [analytics/refinery@2db36f4]: Fix refinery-drop-older-than script
  • 13:58 gehel@cumin1001: START - Cookbook sre.elasticsearch.rolling-restart
  • 13:32 marostegui: Reset idrac from db1114
  • 12:11 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 12:06 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 11:57 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 11:04 gehel@cumin1001: END (ERROR) - Cookbook sre.elasticsearch.rolling-restart (exit_code=97)
  • 10:53 jynus: labsdb1009-12 prometheus metrics restored after 90 minutes of unscheduled unavailability
  • 10:27 gehel@cumin1001: START - Cookbook sre.elasticsearch.rolling-restart
  • 10:15 gehel@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-restart (exit_code=99)
  • 10:13 gehel@cumin1001: START - Cookbook sre.elasticsearch.rolling-restart
  • 10:01 gehel@cumin1001: END (FAIL) - Cookbook sre.elasticsearch.rolling-restart (exit_code=99)
  • 09:59 gehel@cumin1001: START - Cookbook sre.elasticsearch.rolling-restart
  • 09:59 gehel: starting rolling restart of elasticsearch/eqiad for JVM upgrade
  • 09:36 marostegui@cumin1001: dbctl commit (dc=all): 'Remove db1101:3318 from vslow,dump', diff saved to https://phabricator.wikimedia.org/P10555 and previous config saved to /var/cache/conftool/dbconfig/20200228-093653-marostegui.json
  • 09:26 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1087 into vslow,dump as it was there originally', diff saved to https://phabricator.wikimedia.org/P10554 and previous config saved to /var/cache/conftool/dbconfig/20200228-092631-marostegui.json
  • 09:24 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1087 after moving labs hosts back under it', diff saved to https://phabricator.wikimedia.org/P10553 and previous config saved to /var/cache/conftool/dbconfig/20200228-092453-marostegui.json
  • 09:21 jynus: removed leftover labs prometheus target files from ops at prometheus1003, prometheus1004
  • 08:44 moritzm: installing openssh updates from buster point release
  • 08:44 addshore: END warming wikidata term cache on db1126 for Q6-8 million T219123 (pass2 today)
  • 08:30 moritzm: installing mariadb-10.3 update from buster point release (just client-side libs and tools, no mysqlds)
  • 08:24 moritzm: installing cups updates from buster point release
  • 08:22 marostegui: Stop db1087 and db2079 in sync
  • 08:22 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1087 to move labs hosts back under it', diff saved to https://phabricator.wikimedia.org/P10551 and previous config saved to /var/cache/conftool/dbconfig/20200228-082213-marostegui.json
  • 08:12 addshore: START warming wikidata term cache on db1126 for Q6-8 million T219123 (pass2 today) (pass1 just finished)
  • 08:05 moritzm: installing systemd bugfix update from Buster point release
  • 07:38 addshore: START warming wikidata term cache on db1126 for Q6-8 million T219123 (pass1 today)
  • 07:31 moritzm: installing gnutls28 bugfix update from Buster point release
  • 06:40 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1084 - T245621', diff saved to https://phabricator.wikimedia.org/P10550 and previous config saved to /var/cache/conftool/dbconfig/20200228-064037-marostegui.json
  • 06:25 marostegui@cumin1001: dbctl commit (dc=all): '75% of original weight to db1084 - T245621', diff saved to https://phabricator.wikimedia.org/P10549 and previous config saved to /var/cache/conftool/dbconfig/20200228-062536-marostegui.json
  • 06:04 mutante: rsyncing APT repo and firmware data from install1002 to apt2001
  • 05:58 mutante: apt2001 - signed puppet cert, initial run after OS install, rsyncing repo data, not in use yet
  • 01:25 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 00m 56s)
  • 01:19 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T196466 [wikitech] Remove the 'shell' user right from assignment and rights lists (duration: 00m 58s)
  • 01:15 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 01:05 James_F: Running mwscript emptyUserGroup.php --wiki=labswiki shell for T196466

2020-02-27

  • 23:53 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 23:10 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set (duration: 00m 56s)
  • 23:07 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 00m 56s)
  • 23:04 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Merge wgMinervaCustomLogos into wgLogos, take 2 (duration: 00m 56s)
  • 23:01 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Only try to set wgLogos['wordmark'] if not already done (duration: 00m 58s)
  • 22:49 James_F: Manually `scap pull`ed on mw1349 and mw1351 as they were emitting odd errors.
  • 22:06 milimetric@deploy1001: Finished deploy [analytics/aqs/deploy@5a67e6e]: AQS: Minor fix take 3 (duration: 07m 24s)
  • 21:59 milimetric@deploy1001: Started deploy [analytics/aqs/deploy@5a67e6e]: AQS: Minor fix take 3
  • 21:53 effie: depool mw1262, suspecting it might have overloaded logstash
  • 21:51 milimetric@deploy1001: Finished deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix take 2 (duration: 02m 59s)
  • 21:50 shdubsh: start elasticsearch on logstash1010
  • 21:48 milimetric@deploy1001: Started deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix take 2
  • 21:43 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Roll back to setting wgMinervaCustomLogos (duration: 00m 33s)
  • 21:42 jforrester@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: Use the four dblists again (duration: 00m 33s)
  • 21:40 jforrester@deploy1001: Synchronized dblists/: Re-establish dblists everywhere (duration: 00m 33s)
  • 21:39 jforrester@deploy1001: scap failed: average error rate on 11/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
  • 21:25 jforrester@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: Touch the dblists list (duration: 00m 56s)
  • 21:22 jforrester@deploy1001: Scap failed!: 8/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 21:19 jforrester@deploy1001: Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 21:19 jforrester@deploy1001: Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 21:16 milimetric@deploy1001: Finished deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix (duration: 02m 30s)
  • 21:14 jforrester@deploy1001: Synchronized multiversion/MWWikiversions.php: Drop references to four dblists to canaries too (duration: 00m 55s)
  • 21:13 milimetric@deploy1001: Started deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix
  • 21:13 jforrester@deploy1001: Synchronized dblists/: Add back the deleted dblists to make the canaries quiet (duration: 00m 56s)
  • 21:11 jforrester@deploy1001: sync-file aborted: Drop references to four dblists (duration: 00m 05s)
  • 21:11 jforrester@deploy1001: Synchronized multiversion/MWWikiversions.php: Drop references to four dblists (duration: 00m 35s)
  • 21:10 jforrester@deploy1001: Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 21:07 jforrester@deploy1001: Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 21:04 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 00m 56s)
  • 21:02 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Merge wgMinervaCustomLogos into wgLogos (duration: 00m 57s)
  • 20:32 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 20:30 andrew@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:26 andrew@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 20:26 andrew@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:22 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 20:22 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 20:21 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 20:21 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 20:16 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 20:16 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 20:14 effie: pool mw1262
  • 20:07 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 20:07 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 20:05 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.21 refs T233869
  • 20:00 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 20:00 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 19:46 mutante: Welcome new deployers Thalia Chan, Moriel Schottlender and Dayllan Maza (Anti-Harassment-Tools team)
  • 19:38 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 19:26 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 19:21 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: once more for good measure (duration: 01m 03s)
  • 19:20 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable articletopic: search keyword in CirrusSearch (T240559) (duration: 01m 05s)
  • 19:17 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 19:17 effie: depool mw1262
  • 19:17 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 19:17 mutante: ganeti2001 - removing VM apt2001 to re-create it after IP change
  • 19:13 milimetric@deploy1001: Finished deploy [analytics/refinery@357ff5c] (thin): Refinery using 0.0.115 (duration: 00m 07s)
  • 19:12 milimetric@deploy1001: Started deploy [analytics/refinery@357ff5c] (thin): Refinery using 0.0.115
  • 19:06 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:51 elukey: upgrade prometheus-mcrouter-exporter to 0.1.0+git20200227-1 on hosts
  • 18:48 milimetric@deploy1001: Finished deploy [analytics/refinery@357ff5c]: Refinery using 0.0.115 (duration: 10m 11s)
  • 18:43 mutante: adding parse2* machines to puppet
  • 18:37 milimetric@deploy1001: Started deploy [analytics/refinery@357ff5c]: Refinery using 0.0.115
  • 18:31 volans: restarting icinga on icinga1001, command file randomly discarding commands
  • 18:21 addshore: END warming wikidata term cache on db1126 for Q6-8 million T219123 (pass1) (will do 2 more passes tomorrow)
  • 18:20 elukey: upload prometheus-mcrouter-exporter 0.1.0+git20200227-1 to stretch-wikimedia
  • 17:52 addshore: resume item migration script at Q50 million T219123 (batch size of 100, 1s sleep)
  • 17:49 ebernhardson: delete commonswiki_file_1582685980 from cloudelastic-chi, reindex failed and commonswiki_file_first is still primary
  • 17:41 effie: enable puppet on thumbor*
  • 17:40 effie: stop and mask all nginx on thumbor*
  • 17:34 volans@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:34 volans@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:33 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 17:31 addshore: START warming wikidata term cache on db1126 for Q6-8 million T219123 (pass1)
  • 17:31 vgutierrez: (from 17:03) reimage lvs5003 with buster - T245984
  • 17:30 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 17:30 jynus@cumin1001: dbctl commit (dc=all): 'Repool db1087 at 20% T232446', diff saved to https://phabricator.wikimedia.org/P10547 and previous config saved to /var/cache/conftool/dbconfig/20200227-173017-jynus.json
  • 17:20 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q6M for the new term store for clients (was Q4M) + warm db1126 caches (T219123) cache bust (duration: 01m 04s)
  • 17:19 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q6M for the new term store for clients (was Q4M) + warm db1126 caches (T219123) (duration: 01m 04s)
  • 17:18 addshore: (relog FROM 5:11) END warming wikidata term cache on db1126 for Q4-6 million T219123 (pass2)
  • 16:55 vgutierrez: re-enable BGP in lvs4005 - T245984
  • 16:50 volans: temporarily decommented external check for icinga2001. Restarting Icinga on icinga2001
  • 16:49 addshore: START warming wikidata term cache on db1126 for Q4-6 million T219123 (pass2)
  • 16:49 addshore: END warming wikidata term cache on db1126 for Q4-6 million T219123 (pass1)
  • 16:39 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:36 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:27 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:24 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:23 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:22 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:21 moritzm: installing wget security updates on jessie
  • 16:20 vgutierrez: reimage lvs4005 with buster - T245984
  • 16:12 papaul: rebooting parse2009 to clear memory error
  • 16:11 Urbanecm: foreachwiki extensions/AbuseFilter/maintenance/fixOldLogEntries.php --verbose started (T228655)
  • 16:10 Urbanecm: mwscript extensions/AbuseFilter/maintenance/fixOldLogEntries.php --wiki=mediawikiwiki --verbose (T228655)
  • 16:10 vgutierrez: re-enable BGP in lvs4006 - T245984
  • 16:09 addshore: begin warming wikidata term cache on db1126 for Q4-6 million T219123
  • 16:08 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q4M for the new term store for clients (was Q2M) + warm db1126 caches (T219123) cache bust (duration: 01m 04s)
  • 16:05 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reading up to Q4M for the new term store for clients (was Q2M) + warm db1126 caches (T219123) (duration: 01m 04s)
  • 16:05 moritzm: installing python3.7 security updates on Buster
  • 16:02 effie: disable puppet on thumbor*
  • 15:59 moritzm: installing e2fsck security updates on buster
  • 15:56 moritzm: installing python-django updates (packaged Debian version)
  • 15:52 moritzm: installing python-pysaml security updates
  • 15:37 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:35 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:32 reedy@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/ConfirmEdit/includes/auth/CaptchaPreAuthenticationProvider.php: T245280 (duration: 01m 04s)
  • 15:31 reedy@deploy1001: Synchronized php-1.35.0-wmf.21/extensions/ConfirmEdit/includes/auth/CaptchaPreAuthenticationProvider.php: T245280 (duration: 01m 05s)
  • 15:29 moritzm: restarting mw canaries to pick up curl update
  • 15:23 moritzm: installing curl security updates on stretch/buster
  • 15:17 vgutierrez: reimage lvs4006 with buster - T245984
  • 15:03 jynus@cumin1001: dbctl commit (dc=all): 'Repool db1084 at 50% T245621', diff saved to https://phabricator.wikimedia.org/P10542 and previous config saved to /var/cache/conftool/dbconfig/20200227-150302-jynus.json
  • 14:55 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:35 vgutierrez: reimage lvs4007 with buster - T245984
  • 14:09 urbanecm@deploy1001: Synchronized wmf-config/throttle.php: 7e3a57a: Increase arwiki WikiGap throttle lift to 400 accounts (T246092) (duration: 01m 05s)
  • 13:28 _joe_: installing envoy in eqiad too
  • 13:13 cdanis: s/camping/clamping/
  • 13:11 XioNoX: esams/knams rollback tcp-mss camping and prepending
  • 13:07 _joe_: restarting envoy, after chowning the log files, on all codfw mw servers where it was installed
  • 13:06 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q2M (was Q8M) again (T219123) ?cachebust (duration: 01m 03s)
  • 13:05 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q2M (was Q8M) again (T219123) (duration: 01m 03s)
  • 13:03 _joe_: re-stopped puppet on codfw
  • 12:56 XioNoX: delete specific tcp-mss on cr2-eqiad:equinix (will cause an interface flap) - T244610
  • 12:41 XioNoX: bump BGP prefix-limit on all routers - T246110
  • 12:38 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8M (was Q6M) again (T219123) ?cachebust (duration: 01m 03s)
  • 12:36 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8M (was Q6M) again (T219123) (duration: 01m 04s)
  • 12:27 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6M (was Q2M) again (T219123) cachebust? (duration: 01m 17s)
  • 12:24 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6M (was Q2M) again (T219123) (duration: 01m 45s)
  • 12:20 vgutierrez@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 12:19 vgutierrez@cumin2001: START - Cookbook sre.hosts.decommission
  • 12:18 vgutierrez@cumin2001: END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
  • 12:18 vgutierrez@cumin2001: START - Cookbook sre.hosts.decommission
  • 12:14 vgutierrez: replace lvs2003 with lvs2009 - T196560 T245984 T246334
  • 12:11 Urbanecm: EU SWAT done
  • 12:06 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: daee105: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons (T246330; take II) (duration: 01m 04s)
  • 12:05 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: daee105: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons (T246330) (duration: 01m 05s)
  • 11:48 vgutierrez: run decommission script against lvs2006.codfw.wmnet - T246329
  • 11:47 vgutierrez@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 11:47 vgutierrez@cumin2001: START - Cookbook sre.hosts.decommission
  • 11:45 jynus@cumin1001: dbctl commit (dc=all): 'Repool db1084 at 10% T245621', diff saved to https://phabricator.wikimedia.org/P10538 and previous config saved to /var/cache/conftool/dbconfig/20200227-114542-jynus.json
  • 11:35 addshore: pause item migration script at Q50 million T219123
  • 11:02 vgutierrez: start pybal on lvs2003 - T196560 T245984
  • 10:58 vgutierrez: stop pybal on lvs2003 to let lvs2010 take the traffic for a little bit - T196560 T245984
  • 10:54 vgutierrez: replacing lvs2006 with lvs2010 - T196560 T245984
  • 09:35 jynus: upgrade and restart db1084 T246323
  • 09:03 jynus@cumin1001: dbctl commit (dc=all): 'Depool db1098 (s6 & s7)', diff saved to https://phabricator.wikimedia.org/P10536 and previous config saved to /var/cache/conftool/dbconfig/20200227-090344-jynus.json
  • 08:26 jynus: killed SpecialFewestRevisions::reallyDoQuery long running query on db1101:s8, causing lag
  • 08:14 jynus@cumin1001: dbctl commit (dc=all): 'Depool db1098 at 50%', diff saved to https://phabricator.wikimedia.org/P10535 and previous config saved to /var/cache/conftool/dbconfig/20200227-081449-jynus.json
  • 03:52 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:50 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:31 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:28 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:27 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:26 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:53 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:50 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:47 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:27 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:24 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:23 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:22 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:45 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:43 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:37 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:34 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:27 XioNoX: re-enable BGP to telia in esams
  • 01:13 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:10 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:56 cdanis: repool esams 🙌 😎
  • 00:52 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:49 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:42 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:39 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:21 jforrester@deploy1001: Synchronized w/extract2.php: T239975: Use Article::getPage()->getTouched(), not Article::getTouched (duration: 01m 04s)
  • 00:17 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 04s)
  • 00:15 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T232140: Merge definition of wgLogos and wgLogo (duration: 01m 04s)
  • 00:13 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T232140: Stop setting wgLogoHD from wgLogos (duration: 01m 05s)
  • 00:02 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 03s)
  • 00:01 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T246212 Stop setting wgULSLanguageDetection in IS, set in CS (duration: 01m 05s)

2020-02-26

  • 23:59 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T246212 Set wgULSLanguageDetection false in CS (duration: 01m 04s)
  • 23:55 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 04s)
  • 23:54 James_F: jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T246193 Stop setting wgAllowTitlesInSVG, never read (and this was default anyway) (duration: 01m 05s)
  • 23:19 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:16 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:16 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:15 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:58 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 22:58 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:58 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 22:58 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:51 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 22:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 22:48 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:47 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:44 foks: removing one file for legal compliance
  • 22:27 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 22:25 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:19 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 22:16 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:52 Urbanecm: Password reset for User:Joax (T242941)
  • 21:28 mutante: ganeti - shutting apt2001 down again
  • 21:17 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Decrease the reads for term store for clients down to Q2Mio (T219123), take II (duration: 01m 04s)
  • 21:16 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Decrease the reads for term store for clients down to Q2Mio (T219123) (duration: 01m 04s)
  • 21:15 mutante: ganeti - re-starting apt2001 which is mysteriously broken and "half up" ..as in you can't ssh to it and don't get console but it does cause icinga alerts
  • 20:35 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.21/extensions/Wikibase/lib/includes/Store/Sql/Terms: SWAT: Do prefetching entity ids on batches of 20 entity per query (T246159) (duration: 01m 04s)
  • 20:20 jhuneidi@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.21 refs T233869 (duration: 01m 04s)
  • 20:19 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.21 refs T233869
  • 20:18 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 20:10 XioNoX: add BGP to AS4780 in Equinix Palo Alto
  • 20:09 XioNoX: add BGP to AS8859 in AMS-IX
  • 20:00 Amir1: Morning SWAT is done
  • 19:58 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q6Mio (T219123), take II (duration: 01m 04s)
  • 19:56 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q6Mio (T219123) (duration: 01m 02s)
  • 18:09 bstorm_: downtimed labstore1004/5, cloudstore1008/9 and cloudbackup1001/2 for merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/571821
  • 18:05 mutante: phab1001 - manually running community_metrics and project_changes scripts (crons) (T244677)
  • 17:49 Amir1: setting cache type of mwdebug1001 to LCStoreStaticArray, this would break group1 and group2 in that node (T99740)
  • 17:42 XioNoX: remove ns2 redirect to eqiad on cr3-knams
  • 17:40 XioNoX: re-enable transits on cr3-esams
  • 17:09 robh: cr2-esams work done, cr3-esams linecard swap starting now via T245825
  • 16:40 robh: please note cr2-esams work is ongoing via T246009 and its downtime is expected
  • 16:00 jynus: deploy new grants to phabricator stats user to database on m3 T246105
  • 15:51 jynus: starting s2, s3 eqiad backup source data check; expect increase read traffic on db1095:3313, db1140:3312, db1078, db1090:3312 T244958
  • 15:25 addshore: addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=50 --sleep=1 --file=20to30holes-25feb2229 # T219123
  • 15:19 volans@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 15:17 volans@cumin1001: START - Cookbook sre.hosts.decommission
  • 14:54 volans@cumin2001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
  • 14:54 volans@cumin2001: START - Cookbook sre.hosts.decommission
  • 14:51 volans@cumin2001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 14:46 volans@cumin2001: START - Cookbook sre.ganeti.makevm
  • 14:19 volans@cumin2001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 14:19 volans@cumin2001: START - Cookbook sre.hosts.decommission
  • 14:12 volans@cumin2001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 14:11 volans@cumin2001: START - Cookbook sre.hosts.decommission
  • 14:05 gehel: restart of elasticsearch on cloudelastic for JVM upgrade completed
  • 14:03 XioNoX: deactivate BGP to AS23930 on cr1-eqsin, will re-enable when their technical issues are fixed and they notify us
  • 14:00 elukey: run apt-get clean on notebook1004 to free some space - T224682
  • 13:46 XioNoX: ganeti2001:~$ sudo gnt-instance shutdown apt2001.wikimedia.org - T224576
  • 12:26 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:26 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:24 kartik@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: 416973|ContentTranslation: Set cookieDomain for Production (duration: 01m 04s)
  • 12:11 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 574469|Enable CX out of beta in eu, sw, and ta Wikipedias (T245446, T245447, T245448) take II (duration: 01m 05s)
  • 12:10 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 574469|Enable CX out of beta in eu, sw, and ta Wikipedias (T245446, T245447, T245448) (duration: 01m 15s)
  • 12:05 volans: uploaded spicerack_0.0.31-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
  • 11:45 jbond42: changing uid/gid of reprepro affects release[12]001/install[12]002
  • 11:05 moritzm: rolling out remaining PHP 7.0 security updates
  • 10:57 elukey@cumin1001: END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)
  • 10:52 moritzm: installing clamav security updates on mendelevium (ticket.wikimedia.org)
  • 10:03 elukey: upgrade prometheus-mcrouter-exporter 0.1.0+git20200225-1 to all cumin alias parsoid/deployment-servers/mw-maintenance
  • 09:54 elukey: upgrade prometheus-mcrouter-exporter 0.1.0+git20200225-1 to all cumin alias all-mw-eqiad
  • 09:37 elukey@cumin1001: START - Cookbook sre.hadoop.roll-restart-workers
  • 09:34 elukey: roll restart the Hadoop Analytics workers for openjdk upgrades
  • 09:32 elukey: upgrade prometheus-mcrouter-exporter 0.1.0+git20200225-1 to all cumin alias all-mw-codfw
  • 09:18 gehel: restarting elasticsearch on cloudelastic for JVM upgrade
  • 08:51 elukey: upload prometheus-mcrouter-exporter 0.1.0+git20200225-1 to stretch-wikimedia
  • 08:38 elukey: upgrade prometheus-mcrouter-exporter on mwdebug1001 to test the new version
  • 06:19 marostegui: Stop MySQL and poweroff db1084 for BBU replacement - T245647
  • 06:17 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10530 and previous config saved to /var/cache/conftool/dbconfig/20200226-061710-marostegui.json
  • 06:16 marostegui@cumin1001: dbctl commit (dc=all): 'Restore es1017 (master) original weight (0) T243963', diff saved to https://phabricator.wikimedia.org/P10529 and previous config saved to /var/cache/conftool/dbconfig/20200226-061640-marostegui.json
  • 06:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1084 for BBU replacement - T245647', diff saved to https://phabricator.wikimedia.org/P10528 and previous config saved to /var/cache/conftool/dbconfig/20200226-060906-marostegui.json
  • 05:41 kart_: Updated cxserver to 2020-02-24-110149-production (T227183)
  • 05:35 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 05:31 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 05:29 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 01:15 ejegg: updated payments-wiki from c3ca3ad6a7 to bfae734204
  • 00:48 eileen: civicrm revision changed from bec2d6ad9f to 62e62e107c, config revision is c0ef31e2fd
  • 00:21 James_F: Manually purged https://de.wikipedia.org/w/index.php?title=Hans-Werner_Sahm&action=history from mwmaint1002
  • 00:15 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 03s)
  • 00:15 James_F: SWAT complete.
  • 00:14 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T242381 Set Vector skin version defaults so they can be changed on Beta Cluster (duration: 01m 04s)
  • 00:09 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 03s)
  • 00:08 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T245792 Enable password-reset-update on Wikivoyages and Wiktionaries (duration: 01m 04s)
  • 00:08 ebernhardson: resume writes from mediawiki to cloudelastic

2020-02-25

  • 23:51 XioNoX: cr2-esams> request chassis fpc slot 0 offline - T246009
  • 23:38 ebernhardson: pause mediawiki writes to cloudelastic to let old gc on cloudelastic1001-chi recover
  • 23:30 mutante: notebook1004 - disk full once again (T232068)
  • 23:28 mutante: adding mw2366 through mw2376 to site
  • 22:17 jhuneidi@deploy1001: Synchronized php-1.35.0-wmf.21/includes/Defines.php: Update MW_VERSION to 1.35.0-wmf.21 (duration: 01m 04s)
  • 22:17 mutante: scandium restarting php7.2-fpm
  • 22:15 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 22:15 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 21:29 jhuneidi@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.21 refs T233869
  • 21:19 jhuneidi@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.21 refs T233869 (duration: 75m 21s)
  • 20:42 eileen: process-control config revision is c0ef31e2fd
  • 20:32 eileen: process-control config revision is e17d104c73 slow down delete deleted contacts
  • 20:28 tzatziki: reset password for ClioCJS
  • 20:25 tzatziki: changing email address for ClioCJS
  • 20:25 mutante: apt.wikimedia.org (current install* and new apt* roles) - going ECDSA-only and removing RSA certificate from nginx config - to support buster without having to maintain patched nginx for duplicate ssl_stapling_file directive - at the cost of slightly reduced back-compat on the public repo (T242602)
  • 20:24 mutante: apt.wikimedia.org (current install* and new apt* roles) - going ECDSA-only and removing RSA certificate from nginx config - to support buster without having to maintain patched nginx for duplicate ssl_stapling_file directive - at the cost of slightly reduced back-compat on the public repo (T224576)
  • 20:18 eileen: process-control config revision is e17d104c73
  • 20:04 jhuneidi@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.21 refs T233869
  • 20:01 jhuneidi@deploy1001: Pruned MediaWiki: 1.35.0-wmf.19 (duration: 14m 35s)
  • 19:58 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 19:55 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:54 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 19:52 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:47 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:47 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 19:45 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 19:44 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:39 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:31 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:30 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:26 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:26 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 19:23 longma: 1.35.0-wmf.21 was branched at ed65726 for T233869
  • 19:20 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 19:20 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 18:03 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Decrease the reads for term store for clients back to Q2Mio (T219123), take II (duration: 00m 56s)
  • 18:01 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Decrease the reads for term store for clients back to Q2Mio (T219123) (duration: 00m 56s)
  • 18:00 jynus@cumin1001: dbctl commit (dc=all): 'increase s8 special replica weight', diff saved to https://phabricator.wikimedia.org/P10520 and previous config saved to /var/cache/conftool/dbconfig/20200225-180016-jynus.json
  • 17:21 jynus@cumin1001: dbctl commit (dc=all): 'increase es1019 load to 50% T243963', diff saved to https://phabricator.wikimedia.org/P10519 and previous config saved to /var/cache/conftool/dbconfig/20200225-172133-jynus.json
  • 17:15 vgutierrez: restart ats-tls on cp1075 - T244538
  • 17:10 ejegg: restarted new Ingenico recurring donation charge job
  • 17:02 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q6Mio (T219123), take II (duration: 00m 55s)
  • 17:01 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 17:01 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 17:01 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' .
  • 17:01 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 17:00 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q6Mio (T219123) (duration: 00m 56s)
  • 16:45 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' .
  • 16:38 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q4Mio (T219123), take II (duration: 00m 56s)
  • 16:36 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q4Mio (T219123) (duration: 00m 56s)
  • 16:25 vgutierrez: enable BGP in lvs2009 - T196560 T245984
  • 16:17 godog: restart debmonitor / puppetboard - T245512
  • 16:17 moritzm: installing pillow security updates
  • 16:09 vgutierrez: update puppet compiler facts
  • 16:08 XioNoX: add BGP to lvs2009 on cr1/2-codfw
  • 16:02 jynus@cumin1001: dbctl commit (dc=all): 'repool es1019 with low load after maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10516 and previous config saved to /var/cache/conftool/dbconfig/20200225-160215-jynus.json
  • 16:00 ejegg: restarted legacy Ingenico recurring donation charge job
  • 15:59 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q2Mio (T219123), take II (duration: 00m 55s)
  • 15:58 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:58 ejegg: updated Fundraising CiviCRM from 88c72e39ca to bec2d6ad9f
  • 15:58 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q2Mio (T219123) (duration: 00m 56s)
  • 15:56 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:36 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q1Mio (T219123), take II (duration: 00m 55s)
  • 15:34 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q1Mio (T219123) (duration: 00m 56s)
  • 15:16 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q512K (T219123), take II (duration: 00m 55s)
  • 15:15 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q512K (T219123) (duration: 00m 56s)
  • 15:06 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q256K (T219123), take II (duration: 00m 55s)
  • 15:02 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q256K (T219123) (duration: 00m 56s)
  • 14:46 godog: roll restart netbox uwsgi - T245511
  • 14:40 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:39 bblack@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:39 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:39 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:37 bblack@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:35 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/Wikibase/lib: wbterms: only select entity terms that are requested (T246005) (duration: 01m 02s)
  • 14:30 vgutierrez: restart pybal with BGP enabled on lvs2010 - T245984 T196560
  • 14:20 vgutierrez: update puppet compiler facts
  • 14:16 bblack: dns1002 - start reimage - T241770
  • 14:15 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Reinstate wgULSLanguageDetection setting (T246071) (duration: 01m 03s)
  • 14:14 XioNoX: add bgp session to 10.192.49.7 (lvs2010) on cr1/cr2-codfw
  • 14:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:01 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:42 godog: roll-restart logstash in eqiad/codfw - T227080
  • 13:28 Urbanecm: mwscript updateSpecialPages.php --wiki=enwiki --override --only=Mostcategories
  • 13:00 Urbanecm: Run mwscript updateSpecialPages.php --wiki=enwiki --override --only=Uncategorizedcategories, cron didn't do that for several months (T246063)
  • 12:51 marostegui: Stop mysql on es1019 - T243963
  • 12:49 bblack: dns1002 - shutdown for hardware work after confirming drain of live requests - T241770
  • 12:46 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1019 for on-site maintenance - T243963', diff saved to https://phabricator.wikimedia.org/P10512 and previous config saved to /var/cache/conftool/dbconfig/20200225-124650-marostegui.json
  • 12:44 bblack: dns1002 - downtimed, disabled puppet, and depool (stop BGP adverts) for hardware work - T241770
  • 12:33 Urbanecm: Run mwscript updateSpecialPages.php --wiki=enwiki --override --only=Wantedtemplates, cron didn't do that for several months (T246063)
  • 12:32 marostegui@cumin1001: dbctl commit (dc=all): 'Increase traffic on db1107 for 10.4 on special groups 10 -> 50 - T242702', diff saved to https://phabricator.wikimedia.org/P10511 and previous config saved to /var/cache/conftool/dbconfig/20200225-123222-marostegui.json
  • 12:14 urbanecm@deploy1001: Synchronized wmf-config/throttle.php: SWAT: 1f58d9a: New throttle rule for arwiki WikiGap (T246092) (duration: 00m 56s)
  • 12:10 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: cdde3a2: db90d22 (T245525, T243359) (duration: 00m 58s)
  • 10:11 volans: re-enabling puppet on A:swift-be-eqiad
  • 09:31 volans: re-enabling puppet on A:swift-be-codfw
  • 09:30 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:30 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:10 addshore: addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=50 --sleep=1 --file=10to20holes-24feb1345 # T219123
  • 09:09 addshore: addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=50 --sleep=1 --file=10to20holes-24feb1345
  • 08:23 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 08:22 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:53 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1107 for 10.4 testing in main API and special groups - T242702', diff saved to https://phabricator.wikimedia.org/P10510 and previous config saved to /var/cache/conftool/dbconfig/20200225-075304-marostegui.json
  • 06:57 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 to analyze recentchanges table - T242702', diff saved to https://phabricator.wikimedia.org/P10508 and previous config saved to /var/cache/conftool/dbconfig/20200225-065741-marostegui.json
  • 06:02 marostegui: Move labsdb1010 under db2094:3318 - T232446
  • 02:59 ejegg: updated Fundraising CiviCRM from b9d1acdb6d to 88c72e39ca
  • 01:12 jforrester@deploy1001: Synchronized wmf-config/interwiki.php: T238803: Update interwiki cache (duration: 00m 56s)
  • 00:59 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T238803: Drop ability to load SkinPerPage, EUCopyrightCampaign, and EUCopyrightCampaignSkin (duration: 00m 56s)
  • 00:53 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T238803: Remove all IS config related to the fixcopyrightwiki wiki (duration: 00m 55s)
  • 00:51 James_F: Ran `DELETE FROM globalimagelinks WHERE gil_wiki='fixcopyrightwiki';` - one row removed T238803
  • 00:51 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Stop trying to read wmgUseSkinPerPage or wmgUseEUCopyrightCampaign (duration: 00m 55s)
  • 00:48 James_F: Confirmed no SUL entries for fixcopyrightwiki as expected T238803
  • 00:47 jforrester@deploy1001: Synchronized static/images/project-logos/: T238803: Remove fixcopyrightwiki project logos (duration: 00m 56s)
  • 00:46 ejegg: updated Fundraising CiviCRM from 87b13fd3b5 to b9d1acdb6d
  • 00:46 jforrester@deploy1001: Synchronized dblists/: T238803: Remove fixcopyrightwiki from dblists in general (duration: 00m 58s)
  • 00:45 jforrester@deploy1001: rebuilt and synchronized wikiversions files: T238803: Remove fixcopyrightwiki from wikiversions
  • 00:43 jforrester@deploy1001: Synchronized dblists/all.dblist: T238803: Remove fixcopyrightwiki from all.dblist (duration: 00m 56s)
  • 00:39 jforrester@deploy1001: Scap failed!: Call to mwscript eval.php stderr: not empty
  • 00:38 ejegg: disabled recurring donation charge jobs for CiviCRM update
  • 00:27 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop setting wgMaxGeneratedPPNodeCount or wgParserConf::preprocessorClass, never read (duration: 00m 56s)
  • 00:23 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T245983 Read wmgApprovedContentSecurityPolicyDomains for CSP (duration: 00m 56s)
  • 00:21 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T245983 Set wmgApprovedContentSecurityPolicyDomains (duration: 00m 57s)

2020-02-24

  • 22:58 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 22:38 XioNoX: redirect ns2 to authdns1001
  • 22:34 mutante: stat1007 sudo systemctl reset-failed to clear Icinga alerts about reportupdater-pingback.service
  • 22:22 XioNoX: disable transits on cr3-esams
  • 21:43 ppchelko@deploy1001: Finished deploy [cpjobqueue/deploy@f87bdd9]: Take service name into account for consumer group name T244387 (duration: 01m 14s)
  • 21:42 ppchelko@deploy1001: Started deploy [cpjobqueue/deploy@f87bdd9]: Take service name into account for consumer group name T244387
  • 21:37 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 21:28 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 21:26 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
  • 21:23 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 21:05 eileen: civicrm revision changed from fffc215e75 to 87b13fd3b5, config revision is 561ae21f77
  • 20:58 XioNoX: test flowspec BGP config on cr3-knams
  • 20:32 XioNoX: load new FW policies on pfw3-eqiad/codfw - T246036
  • food: updated Fundraising CiviCRM from 426e3547ca to fffc215e75
  • 20:03 eileen: civicrm revision changed from c086fd4e0b to 426e3547ca, config revision is 561ae21f77
  • 20:02 mutante: installing OS on new ganeti VMs apt1001 and apt2001.wikimedia.org for buster APT repos
  • 19:07 jforrester@deploy1001: Synchronized multiversion/MWConfigCacheGenerator.php: Changes here are only used in tests right now, but keep line numbers sync'ed (duration: 00m 56s)
  • 18:46 mutante: deploying cluster apache config change - adds gr.wikimedia.org vhost and refreshes apache2
  • 17:10 jforrester@deploy1001: Synchronized wmf-config/flaggedrevs.php: Sync doc-only change; should be a no-op (duration: 00m 57s)
  • 16:16 jynus: reloading ferm on ms-be2028 DNS query timed out
  • 16:11 jynus: reloading ferm on ms-be2043 DNS query timed out
  • 16:02 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Revert: Increase the reads for term store for clients for up to Q256K (T219123), take II (duration: 00m 56s)
  • 15:57 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Revert: Increase the reads for term store for clients for up to Q256K (T219123) (duration: 00m 56s)
  • 15:30 moritzm: updated component/jdk8 to 8u242-b08-1~deb10u1 (forward port of latest Java 8 security update)
  • 15:21 marostegui@cumin1001: dbctl commit (dc=all): 'Reduce weight for db1126, increase it a bit for db1101:3318', diff saved to https://phabricator.wikimedia.org/P10498 and previous config saved to /var/cache/conftool/dbconfig/20200224-152132-marostegui.json
  • 15:05 marostegui: Deploy schema change on db1086 (s7 master) with replication - T245925
  • 14:59 marostegui: read_only=0 on es1020 (es4) and es1023 (es5) - unused new external store masters - T245806
  • 14:56 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q256K (T219123), take II (duration: 00m 55s)
  • 14:55 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q256K (T219123) (duration: 00m 57s)
  • 14:47 andrew@deploy1001: Finished deploy [horizon/deploy@dab0ca0]: modest css change for the hiera editing dialog (take two -- I consistently forget to rebase before doing this) (duration: 03m 33s)
  • 14:44 andrew@deploy1001: Started deploy [horizon/deploy@dab0ca0]: modest css change for the hiera editing dialog (take two -- I consistently forget to rebase before doing this)
  • 14:43 andrew@deploy1001: Finished deploy [horizon/deploy@a8f2ea9]: modest css change for the hiera editing dialog (duration: 00m 12s)
  • 14:43 andrew@deploy1001: Started deploy [horizon/deploy@a8f2ea9]: modest css change for the hiera editing dialog
  • 14:42 marostegui: Compress innodb on wb_terms on db1087 - T232446
  • 14:03 _joe_: depooling esams (authdns-update)
  • 13:51 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q120K (T219123), take II (duration: 00m 55s)
  • 13:48 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q120K (T219123) (duration: 00m 56s)
  • 13:30 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q60K (T219123), take II (duration: 00m 56s)
  • 13:28 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q60K (T219123) (duration: 00m 56s)
  • 13:18 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q30K (T219123), take II (duration: 00m 56s)
  • 13:17 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase the reads for term store for clients for up to Q30K (T219123) (duration: 00m 56s)
  • 13:05 urbanecm@deploy1001: Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 18s)
  • 13:01 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Disallow crats to (un)assign flow-bot group on enwiki (T245716) (duration: 00m 56s)
  • 12:59 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Disallow crats to (un)assign flow-bot group on enwiki (T245716) (duration: 00m 56s)
  • 12:48 jdrewniak@deploy1001: Synchronized portals: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 56s)
  • 12:47 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 56s)
  • 12:38 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add definitions for redirect badges (T235420), take II, the cache issue (duration: 00m 56s)
  • 12:37 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add definitions for redirect badges (T235420) (duration: 00m 56s)
  • 12:23 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/Wikibase/client/includes: SWAT: Use formatter cache in client LUA label lookups (T245740) (duration: 00m 56s)
  • 12:19 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/WikimediaMaintenance/dumpInterwiki.php: dumpInterwiki: Respect comments in dblists (T244906) (duration: 00m 56s)
  • 12:12 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 574265|CX: Adjust MT threshold for Telugu WP to 70% (T244769) (duration: 00m 56s)
  • 12:05 XioNoX: re-enable deactivated BGP sessions from ulsfo to office - T239893
  • 12:02 vgutierrez: reimage pybal-test2001 as buster - T224570 T245984
  • 11:49 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 55s)
  • 11:45 jdrewniak@deploy1001: Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: Bumping portals to master (563985) (duration: 00m 57s)
  • 11:27 vgutierrez: upload pybal 1.15.8 to apt.wm.o (buster) - T245984
  • 11:06 volans: restarted ferm on ms-be2046
  • 11:02 marostegui: Move labsdb1009, labsdb1011 and labsdb1012 (labsdb1010 is currently delayed, will be done later) to replicate under codfw for a few days while we alter wb_terms on db1087 - T232446
  • 10:59 effie: upgrading scap in eqiad and codfw - T245530
  • 10:55 volans: restarted ferm on ms-be2016, had failed with DNS query for 'ms-be2056.codfw.wmnet' failed: query timed out
  • 10:41 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/Wikibase: Add metric for recording cache hits in StatsdRecordingSimpleCache (T244260) (duration: 01m 04s)
  • 10:34 godog: onboard netbox to logging pipeline
  • 10:12 marostegui: Stop db1087 and db2079 in sync - T232446
  • 10:10 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1087 for compression and place db1101:3318 into vslow,dump - T232446', diff saved to https://phabricator.wikimedia.org/P10493 and previous config saved to /var/cache/conftool/dbconfig/20200224-101030-marostegui.json
  • 09:21 godog: bounce ferm on ms-be2023, it had failed (no entries in journald)
  • 09:08 elukey: update puppet compiler's facts
  • 08:40 marostegui@cumin1001: dbctl commit (dc=all): 'Add instances to es5 eqiad - T245806', diff saved to https://phabricator.wikimedia.org/P10492 and previous config saved to /var/cache/conftool/dbconfig/20200224-084027-marostegui.json
  • 08:34 marostegui@deploy1001: Synchronized wmf-config/etcd.php: Add es4 and es5 (unused new external store sections) to etcd - T245806 (duration: 00m 58s)
  • 08:29 marostegui: Temporarily put es1020 (es4) and es1023 (es5) on RO on a mysql level - T245806
  • 08:28 marostegui@cumin1001: dbctl commit (dc=all): 'Add instances to es5 codfw - T245806', diff saved to https://phabricator.wikimedia.org/P10491 and previous config saved to /var/cache/conftool/dbconfig/20200224-082848-marostegui.json
  • 08:07 marostegui@cumin1001: dbctl commit (dc=all): 'Add instances to es4 eqiad - T245806', diff saved to https://phabricator.wikimedia.org/P10490 and previous config saved to /var/cache/conftool/dbconfig/20200224-080708-marostegui.json
  • 08:01 marostegui@cumin1001: dbctl commit (dc=all): 'Add instances to es4 codfw - T245806', diff saved to https://phabricator.wikimedia.org/P10489 and previous config saved to /var/cache/conftool/dbconfig/20200224-080128-marostegui.json
  • 07:31 cdanis: dbctl: edit es4/es5 sections in eqiad (flavor & master & min_replicas fields) T245806
  • 07:30 cdanis: dbctl: (and min_replicas field) T245806
  • 07:29 cdanis: dbctl: edit es4/es5 sections in codfw (flavor & master fields) T245806
  • 07:12 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1107 for 10.4 testing in special slaves group with weight 10 - T242702', diff saved to https://phabricator.wikimedia.org/P10488 and previous config saved to /var/cache/conftool/dbconfig/20200224-071201-marostegui.json
  • 07:03 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1107 for 10.4 testing in main and API - T242702', diff saved to https://phabricator.wikimedia.org/P10487 and previous config saved to /var/cache/conftool/dbconfig/20200224-070337-marostegui.json
  • 06:40 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1101:3318 after removing partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10486 and previous config saved to /var/cache/conftool/dbconfig/20200224-064044-marostegui.json
  • 06:33 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3318 after removing partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10485 and previous config saved to /var/cache/conftool/dbconfig/20200224-063258-marostegui.json
  • 06:22 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3318 after removing partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10484 and previous config saved to /var/cache/conftool/dbconfig/20200224-062226-marostegui.json
  • 06:01 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3318 after removing partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10483 and previous config saved to /var/cache/conftool/dbconfig/20200224-060118-marostegui.json
  • 05:57 marostegui: Repool labsdb1011 - T245797

2020-02-23

  • 16:52 elukey: powercycle mw1372 - no mgmt console, no ssh
  • 15:17 Urbanecm: mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user='𐰇𐱅𐰚𐰤' /home/urbanecm/T245950 (T245950)

2020-02-22

  • 03:41 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 03:37 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 02:17 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 02:16 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 02:13 mutante: ganeti - removing instances apt1001/apt2001 again, starting over
  • 01:53 mutante: starting new ganeti VMs apt1001 and apt2001 for OS install (WIP, not prod)
  • 01:03 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:01 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:45 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:43 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:41 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:39 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:21 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:19 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:18 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:15 pt1979@cumin2001: START - Cookbook sre.hosts.downtime

2020-02-21

  • 23:26 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 23:24 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 23:05 andrewbogott: updated (?) wikitech-static to 1.34.0
  • 22:01 sbassett@deploy1001: Finished scap: Deploy security fix for T232932 (duration: 05m 35s)
  • 21:56 sbassett@deploy1001: Started scap: Deploy security fix for T232932
  • 21:53 andrew@deploy1001: Finished deploy [horizon/deploy@a8f2ea9]: added a warning about the public git history to the hiera edit panel -- take two (duration: 03m 41s)
  • 21:49 andrew@deploy1001: Started deploy [horizon/deploy@a8f2ea9]: added a warning about the public git history to the hiera edit panel -- take two
  • 21:45 andrew@deploy1001: Finished deploy [horizon/deploy@13ca90a]: added a warning about the public git history to the hiera edit panel (duration: 00m 11s)
  • 21:45 andrew@deploy1001: Started deploy [horizon/deploy@13ca90a]: added a warning about the public git history to the hiera edit panel
  • 21:23 mutante: LDAP - added ldickinson to wmf
  • 21:23 mutante: LDAP - added dduvall to archiva-deployers
  • 21:22 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 21:20 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:15 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 21:12 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:00 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 20:58 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 20:52 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 20:50 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 20:38 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 20:36 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 20:29 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 20:27 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:34 XioNoX: re-enable GRE tunnels on cr3-esams - T245825
  • 15:55 XioNoX: add gobgpd to buster-wikimedia repo
  • 15:51 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 15:06 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
  • 13:38 reedy@deploy1001: Synchronized php-1.35.0-wmf.20/includes/resourceloader/ResourceLoaderSkinModule.php: T245778 T245182 T232140 (duration: 01m 00s)
  • 12:29 mark: cr3-esams: Shutdown GRE tunnels over Telia
  • 12:27 akosiaris: repool mathoid at eqiad, test complete
  • 12:27 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mathoid
  • 12:20 moritzm: rebooting boron
  • 12:20 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:20 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:17 moritzm: bumped memory for boron.eqiad.wmnet to 16G
  • 12:04 mark: cr3-esams: request chassis fpc offline slot 1
  • 11:57 mark: Disabled Telia transit on cr3-esams
  • 11:57 mark: Set VRRP prio cost to 50 on cr3-esams to make it backup VRRP
  • 11:48 elukey: restart varnishkafka-webrequest on cp3052 (stuck in timeouts to kafka, analytics alarms raised)
  • 11:47 elukey: restart varnishkafka-webrequest on cp3056/cp3058/cp3054/cp3064 (stuck in timeouts to kafka, analytics alarms raised)
  • 11:39 elukey: restart varnishkafka on cp3057 (stuck in timeouts to kafka, analytics alarms raised)
  • 11:21 godog: bounce logstash on logstash1023 - see if can catch up with elastic7 kafka lag
  • 11:14 elukey: reboot stat1005 - GPU blocked at 100% after issue with tensorflow
  • 09:18 akosiaris: depool mathoid in eqiad for a test
  • 09:18 akosiaris@puppetmaster1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid
  • 08:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10473 and previous config saved to /var/cache/conftool/dbconfig/20200221-085405-marostegui.json
  • 08:34 fdans@deploy1001: Finished deploy [analytics/refinery@4d56021]: deploying refinery (duration: 14m 55s)
  • 08:19 fdans@deploy1001: Started deploy [analytics/refinery@4d56021]: deploying refinery
  • 08:02 akosiaris: disable mod_remoteip on otrs host, following merge of https://gerrit.wikimedia.org/r/573877
  • 06:58 marostegui: Stop MySQL on labsdb1012 to clone labsdb1011 - T245797
  • 06:34 marostegui: Stop mysql on es1024 to clone es1025 - T243052
  • 05:57 marostegui: Start MySQL on labsdb1011 without replication - T245797
  • 05:44 marostegui: Reload haproxy on dbproxy1010, dbproxy1011, dbproxy18 - T245797
  • 02:53 bstorm_: depooled labsdb1011 and set weight 10 on labsdb1009 vs 3 on labsdb1010 T245797
  • 02:43 ejegg: updated Fundraising CiviCRM from a6b222c19f to c086fd4e0b
  • 02:27 bstorm_: stopped mariadb on labsdb1011 because it keeps crashing anyway
  • 01:05 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Sync Beta-Cluster-only change to CommonSettings now we're sure we won't revert (duration: 00m 56s)
  • 01:04 andrew@deploy1001: Finished deploy [horizon/deploy@13ca90a]: Remove guided puppet config mode; this gets us back to working with latest puppet packages. (duration: 03m 32s)
  • 01:01 andrew@deploy1001: Started deploy [horizon/deploy@13ca90a]: Remove guided puppet config mode; this gets us back to working with latest puppet packages.

2020-02-20

  • 23:50 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T245787 [nlwiki] Add noindex for NS_USER and NS_USER_TALK (duration: 00m 56s)
  • 23:46 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Stop setting wgVectorPrintLogo for back-compat., not read since wmf.19 (duration: 00m 56s)
  • 23:45 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw232[0-4].codfw.wmnet
  • 23:45 mutante: gerrit1002 - test VM - rebooting for new disk
  • 23:33 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw231[7-9].codfw.wmnet
  • 23:33 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw232[0-4].codfw.wmnet
  • 23:32 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw231[7-9].codfw.wmnet
  • 23:32 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw2381[7-9].codfw.wmnet
  • 23:25 mutante: ganeti1003 - adding another virtual 20G disk to gerrit1002 (T243808)
  • 23:14 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:12 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 23:04 jforrester@deploy1001: Synchronized php-1.35.0-wmf.20/includes/pager/IndexPager.php: IndexPager: Limit offset params to the max of the indices available (duration: 00m 56s)
  • 23:01 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:59 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:28 ebernhardson: restart mjolnir-kafka-bulk-daemon across eqiad
  • 22:28 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:28 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:28 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@8908dd1]: daemons: Install stack printing signal handler on SIGUSR1 (duration: 05m 05s)
  • 22:23 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@8908dd1]: daemons: Install stack printing signal handler on SIGUSR1
  • 21:06 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T245780 [mediawikiwiki] Deny the 'flow-hide' right to logged out and non-autoconfirmed users (duration: 00m 56s)
  • 20:07 James_F: Train 1.35.0-wmf.20 provisionally looks OK on all wikis. Closing T233868.
  • 20:04 jforrester@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.20
  • 19:55 twentyafterfour: hotfix deployed
  • 19:51 twentyafterfour: deploying phabricator hotfix: https://phabricator.wikimedia.org/rPHEX2f36eee7ce67eb0c09e9bb0e79b42fc3b41d3597 for T244165
  • 19:33 bblack: codfw+ulsfo repooled in geodns
  • 18:20 fdans@deploy1001: Finished deploy [analytics/refinery@e05ae16]: deploying refinery (duration: 11m 31s)
  • 18:08 fdans@deploy1001: Started deploy [analytics/refinery@e05ae16]: deploying refinery
  • 17:38 bblack: pushed codfw+ulsfo geodns depool
  • 16:45 jynus: stop, upgrade and restart dbprov2002
  • 16:26 jynus: stop, upgrade and restart dbprov1002
  • 16:23 moritzm: installing Java security updates on Hadoop/Kafka Jumbo/AQS/Druid
  • 16:16 jynus: stop, upgrade and restart db1140
  • 16:12 moritzm: installing postgres security updates on netboxdb*
  • 16:03 fdans@deploy1001: Finished deploy [analytics/aqs/deploy@125cffa]: deploying aqs, third time is the charm (duration: 06m 15s)
  • 15:57 fdans@deploy1001: Started deploy [analytics/aqs/deploy@125cffa]: deploying aqs, third time is the charm
  • 15:40 marostegui: Poweroff es2022 T245714
  • 15:32 fdans@deploy1001: Finished deploy [analytics/aqs/deploy@95a7999]: deploying aqs (duration: 00m 48s)
  • 15:32 fdans@deploy1001: Started deploy [analytics/aqs/deploy@95a7999]: deploying aqs
  • 15:23 fdans@deploy1001: Finished deploy [analytics/aqs/deploy@cbc3241]: deploying aqs (duration: 04m 06s)
  • 15:19 fdans@deploy1001: Started deploy [analytics/aqs/deploy@cbc3241]: deploying aqs
  • 14:38 Urbanecm: [dry-run; mwmaint1002] foreachwiki extensions/AbuseFilter/maintenance/fixOldLogEntries.php --dry-run --verbose (T228655)
  • 12:53 moritzm: installing PHP updates on matomo1001/piwik
  • 12:28 moritzm: installing PHP 7.0 security updates
  • 12:11 Urbanecm: EU SWAT done
  • 12:09 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 728d739: Configure logo for ngwikimedia (T242416) (duration: 01m 04s)
  • 12:05 urbanecm@deploy1001: Synchronized static/images/project-logos/: SWAT: 64240e1: Add logos for ngwikimedia (T242416) (duration: 01m 04s)
  • 11:19 jmm@puppetmaster1001: conftool action : set/pooled=inactive; selector: name=mw1280.eqiad.wmnet
  • 11:08 moritzm: installing boost update from Buster point release
  • 10:51 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1084 after crash - T245621', diff saved to https://phabricator.wikimedia.org/P10468 and previous config saved to /var/cache/conftool/dbconfig/20200220-105117-marostegui.json
  • 10:12 Reedy: created $wikidb.blobs_cluster27 on es1023 - T245720
  • 10:08 Reedy: created $wikidb.blobs_cluster26 on es1020 - T245720
  • 10:08 reedy@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/WikimediaMaintenance/storage/make-all-blobs: (no justification provided) (duration: 01m 04s)
  • 09:42 reedy@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/WikimediaMaintenance/storage/make-all-blobs: (no justification provided) (duration: 01m 03s)
  • 09:27 reedy@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/WikimediaMaintenance/storage/make-all-blobs: (no justification provided) (duration: 01m 01s)
  • 09:12 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1084 after crash - T245621', diff saved to https://phabricator.wikimedia.org/P10467 and previous config saved to /var/cache/conftool/dbconfig/20200220-091233-marostegui.json
  • 09:02 akosiaris: restart etherpad-lite on etherpad1002 T244238
  • 09:00 marostegui: Restart m1 database master db1135 (etherpad will not be available for around 1 minute) - T244238
  • 08:40 jynus: disable puppet and stop bacula service T244238
  • 08:35 marostegui: Upgrade mysql on db1135 without restart T244238
  • 07:47 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q15k (was Q10k) (T225057) - in case of cache issues (duration: 01m 03s)
  • 07:46 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q15k (was Q10k) (T225057) (duration: 01m 03s)
  • 07:26 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q10k (was Q8k) (T225057) - in case of cache issue (duration: 01m 01s)
  • 07:25 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q10k (was Q8k) (T225057) (duration: 01m 03s)
  • 07:17 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8000 (T225057) - in case of cache issue (duration: 01m 03s)
  • 07:15 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8000 (T225057) (duration: 01m 03s)
  • 07:01 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6000 (T225057) - extra sync for cache issue (duration: 01m 04s)
  • 07:00 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6000 (T225057) (duration: 01m 06s)
  • 06:46 vgutierrez: test trafficserver 8.0.6-rc1 in cp30[64,65]
  • 06:24 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1084 after crash - T245621', diff saved to https://phabricator.wikimedia.org/P10466 and previous config saved to /var/cache/conftool/dbconfig/20200220-062445-marostegui.json
  • 06:17 marostegui: Repool labsdb1011
  • 06:12 marostegui: Remove partitions from db1101:3318 - T239453
  • 06:12 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3318 to remove revision partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10465 and previous config saved to /var/cache/conftool/dbconfig/20200220-061213-marostegui.json
  • 06:10 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1099:3318 this host already had the partitions removed - T239453', diff saved to https://phabricator.wikimedia.org/P10464 and previous config saved to /var/cache/conftool/dbconfig/20200220-061019-marostegui.json
  • 06:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3318 to remove revision partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10463 and previous config saved to /var/cache/conftool/dbconfig/20200220-060914-marostegui.json
  • 05:59 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1087 on s8, db1099:3318 back to its original weight', diff saved to https://phabricator.wikimedia.org/P10462 and previous config saved to /var/cache/conftool/dbconfig/20200220-055943-marostegui.json
  • 00:22 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Allow non-autoconfirmed users to propose OAuth apps (T213760) (duration: 01m 04s)
  • 00:16 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable password-reset (requireemail pref) on test WD and Commons (T245660) (duration: 01m 03s)

2020-02-19

  • 23:39 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw138[0-3].eqiad.wmnet
  • 23:38 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw137[4-9].eqiad.wmnet
  • 23:36 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1363.eqiad.wmnet
  • 23:28 jforrester@deploy1001: Synchronized wmf-config/PoolCounterSettings.php: cirrus: Reduce CirrusSearch-MoreLike cache workers and queue back to normal (duration: 01m 03s)
  • 23:26 dzahn@cumin1001: conftool action : set/weight=30; selector: name=mw138[0-3].eqiad.wmnet
  • 23:26 dzahn@cumin1001: conftool action : set/weight=30; selector: name=mw137[4-9].eqiad.wmnet
  • 23:25 dzahn@cumin1001: conftool action : set/weight=30; selector: name=mw1363.eqiad.wmnet
  • 23:23 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: cirrus: redirect more_like from codfw back to eqiad (duration: 01m 04s)
  • 23:13 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:10 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 23:10 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:10 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 23:09 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:09 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:57 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@c16c63a]: articletopic thresholding for ores scores and eventgate port update (duration: 00m 57s)
  • 22:56 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@c16c63a]: articletopic thresholding for ores scores and eventgate port update
  • 22:54 robh: cp3050 & cp3051 returned to service via T243167
  • 22:49 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:49 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 22:42 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Set wgServer to protocol-relative for Wikitech and Test Wikitech (duration: 01m 05s)
  • 22:37 robh: taking cp3050 & cp3051 offline for firmware update via T243167
  • 22:23 mutante: phabricator - upgrading PHP packages
  • 22:14 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw231([0-6]).codfw.wmnet
  • 22:12 dzahn@cumin1001: conftool action : set/weight=15; selector: name=mw231([0-6]).codfw.wmnet
  • 22:11 rzl@cumin1001: conftool action : set/pooled=yes; selector: name=mw13(6[4-9]|7[0-3]|84).eqiad.wmnet
  • 22:10 rzl@cumin1001: conftool action : set/weight=30; selector: name=mw13(6[4-9]|7[0-3]|84).eqiad.wmnet
  • 22:08 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw2314.codfw.wmnet
  • 21:58 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:58 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 21:54 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:52 rzl@cumin1001: START - Cookbook sre.hosts.downtime
  • 21:48 bblack: all authdns servers - upgrade to gdnsd-3.2.2
  • 21:39 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:39 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 21:36 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:36 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 21:35 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:35 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 21:35 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:35 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 21:32 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:32 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 21:31 rzl@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:29 rzl@cumin1001: START - Cookbook sre.hosts.downtime
  • 21:23 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:23 dzahn@cumin1001: START - Cookbook sre.hosts.downtime
  • 20:55 eileen: civicrm revision changed from 52c68911c6 to a6b222c19f, config revision is 561ae21f77
  • 20:15 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/Wikibase/lib: Fix statsd metric for StatsdMissRecordingSimpleCache (wb_terms work) (duration: 01m 06s)
  • 20:13 rzl@cumin1001: conftool action : set/weight=30; selector: name=mw13(5[6-9]|6[0-2]).eqiad.wmnet
  • 20:12 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.19/extensions/Wikibase/lib: Fix statsd metric for StatsdMissRecordingSimpleCache (wb_terms work) (duration: 01m 06s)
  • 20:10 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.19/extensions/Wikibase/lib: Fix statsd metric for StatsdMissRecordingSimpleCache (wb_terms work) (duration: 01m 05s)
  • 20:05 jforrester@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.20 (duration: 01m 03s)
  • 20:04 jforrester@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.20
  • 20:02 rzl@cumin1001: conftool action : set/pooled=yes; selector: name=mw13(5[6-9]|6[0-2]).eqiad.wmnet
  • 20:02 rzl@cumin1001: conftool action : set/weight=10; selector: name=mw13(5[6-9]|6[0-2]).eqiad.wmnet
  • 19:54 rlazarus: scap pull on new api servers mw13[56-62]
  • 19:50 mutante: generating mcrouter certs for new codfw mw appservers
  • 19:39 mutante: initial puppet run on new hosts mw231*
  • 19:31 jforrester@deploy1001: Synchronized php-1.35.0-wmf.19/skins/MinervaNeue/includes/MinervaHooks.php: T245162 Check title value before proceeding to check if user page (duration: 01m 04s)
  • 19:27 jforrester@deploy1001: Synchronized php-1.35.0-wmf.20/skins/MinervaNeue/includes/MinervaHooks.php: T245162 Check title value before proceeding to check if user page (duration: 01m 04s)
  • 19:21 jforrester@deploy1001: Synchronized dblists/mobilemainpagelegacy.dblist: T244577 [metawiki] Disable MobileFrontend mainpage special casing (duration: 01m 04s)
  • 19:18 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T244369 [trwiki] Enable the WikidataPageBanner extension (duration: 01m 05s)
  • 19:11 jforrester@deploy1001: Synchronized php-1.35.0-wmf.20/includes/resourceloader/dependencystore/SqlModuleDependencyStore.php: T245570 resourceloader: fix SqlDependencyModuleStore::setMulti() to use upsert() (duration: 01m 01s)
  • 18:45 bblack: dns4001 - upgraded to gdnsd-3.2.2
  • 18:44 bblack: reprepro: upload gdnsd 3.2.2-1~wmf1 to buster-wikimedia
  • 18:39 mutante: mwmaint1002 - sudo systemctl reset-failed to clear systemd alerts
  • 18:38 mutante: mwmaint1002 - removing Icinga ACK for systemd state - comments for it were from HHVM removal in Oct 2019
  • 18:26 mutante: phab2001 - upgraded ssh-server, kept locally modified config; apt autoremove removes python3-debconf
  • 18:23 mutante: phab2001 - installing package upgrades, incl. openssh, PHP version
  • 18:22 mutante: phab2001 - upgrading mariadb client package versions
  • 18:19 mutante: removing problem ACK from Icinga alerts for wikitech-static MediaWiki version. comments were about things in 2019
  • 17:48 robh: cp1089 cp1090 returned to service via T243167
  • 17:40 jynus: starting data check between db1078 and db1140:3313 T244958
  • 17:39 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q4000 (T225057) (just in case of cache issue) (duration: 01m 04s)
  • 17:26 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q4000 (T225057) (duration: 01m 01s)
  • 17:14 ema: cp4026: repool after probe Connection:keep-alive experiment revert https://gerrit.wikimedia.org/r/573337
  • 17:12 robh: cp1088 returned to service, cp1089 & cp1090 offline for firmware update via T243167
  • 16:44 papaul: replacing ps1-a8-codfw; mgmt in rack A8 will go down
  • 16:37 otto@deploy1001: Finished deploy [analytics/refinery@e23918a]: Updating eventgate-analytics port (T245203) and also eventlogging whitelist (duration: 12m 27s)
  • 16:32 ema: depool cp4026, 5xx
  • 16:24 otto@deploy1001: Started deploy [analytics/refinery@e23918a]: Updating eventgate-analytics port (T245203) and also eventlogging whitelist
  • 16:13 marostegui: Depool labsdb1011 to help replication to catch up
  • 16:05 elukey: Update analytics-in4 filter term eventgate for T245203 on cr1/cr2 eqiad
  • 15:48 ariel@deploy1001: Finished deploy [dumps/dumps@b42acb5]: fix temp stub generation, add pagerangeinfo cache, some unit tests (duration: 00m 03s)
  • 15:48 ariel@deploy1001: Started deploy [dumps/dumps@b42acb5]: fix temp stub generation, add pagerangeinfo cache, some unit tests
  • 14:59 marostegui: Stop mysql on es2021 - T243052
  • 14:31 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:29 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:29 marostegui: Data checksum on db1084 T245621
  • 14:07 marostegui: Upgrade and reboot db1084 - T245621
  • 14:02 marostegui: Start mysql on db1084 without replication - T245621
  • 13:53 jbond42: disable puppet to upgrade postgresql
  • 13:30 jynus@cumin1001: dbctl commit (dc=all): 'Depool db1084, lots of connection errors', diff saved to https://phabricator.wikimedia.org/P10458 and previous config saved to /var/cache/conftool/dbconfig/20200219-133057-jynus.json
  • 12:25 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Start reading for the new term store for clients up to Q2000 (T225057), take II, the cache issue (duration: 01m 04s)
  • 12:22 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Start reading for the new term store for clients up to Q2000 (T225057) (duration: 01m 06s)
  • 11:56 volans: better splay of periodic scripts that interact with Netbox - T244291
  • 11:43 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:41 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:08 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/Wikibase/lib/includes/Store: Get rid of useless metrics in EntityTermLookupBase (T245592) (duration: 01m 04s)
  • 11:06 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.19/extensions/Wikibase/lib/includes/Store: Get rid of useless metrics in EntityTermLookupBase (T245592) (duration: 01m 12s)
  • 11:01 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:58 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 10:58 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 10:58 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:58 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:58 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:45 jynus: upgrading mariadb client on cumin hosts
  • 10:38 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2089:3315, db2089:3316 after new package testing', diff saved to https://phabricator.wikimedia.org/P10457 and previous config saved to /var/cache/conftool/dbconfig/20200219-103806-marostegui.json
  • 10:26 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:24 marostegui@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:17 jynus: stopping db2089 mariadb@s5
  • 10:12 jiji@cumin1001: conftool action : set/weight=30; selector: dc=eqiad,cluster=appserver,service=apache2,name=mw135[0-5]*.eqiad.wmnet
  • 10:12 jiji@cumin1001: conftool action : set/weight=30; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw135[0-5]*.eqiad.wmnet
  • 10:11 jiji@cumin1001: conftool action : set/weight=30; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1349.eqiad.wmnet
  • 10:11 jiji@cumin1001: conftool action : set/weight=30; selector: dc=eqiad,cluster=appserver,service=apache2,name=mw1349.eqiad.wmnet
  • 10:09 moritzm: updated tftpboot environment for stretch-bootif for the 9.12 point release T241359
  • 09:53 jynus: stopping and upgrading db1140 instances
  • 09:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2089:3315, db2089:3316 for new package testing', diff saved to https://phabricator.wikimedia.org/P10455 and previous config saved to /var/cache/conftool/dbconfig/20200219-095139-marostegui.json
  • 09:51 marostegui: Depool db2089:3315, db2089:3316 for new package testing
  • 09:49 akosiaris: T245516. Deploy mathoid chart version 0.0.27, removing logstash gelf configuration
  • 09:46 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'mathoid' for release 'production' .
  • 09:43 vgutierrez: test trafficserver 8.0.6-rc1 in cp40[26,32]
  • 09:34 _joe_: cleared opcache on mw1313
  • 09:34 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'canary' .
  • 09:34 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' .
  • 09:33 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'mathoid' for release 'staging' .
  • 08:54 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 08:53 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
  • 08:50 marostegui: Remove dbproxy1007 grants from m2 - T231280
  • 08:41 marostegui: Remove wikiadmin2 user from s7 - T243512
  • 08:23 Urbanecm: run mwscript deleteEqualMessages.php cswiki --delete
  • 08:14 godog: roll restart swift proxies - T244776
  • 07:02 marostegui: Remove wikiadmin2 user from es2 - T243512
  • 06:57 marostegui@cumin1001: dbctl commit (dc=all): 'Increase API weight for db1107 50 -> 100 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10454 and previous config saved to /var/cache/conftool/dbconfig/20200219-065726-marostegui.json
  • 06:35 marostegui: Compress watchlist_expiry table on s3 (this will take hours as I have left a 60-second sleep between tables) - T245358
  • 06:17 marostegui: Compress new and empty watchlist_expiry table - T245358
  • 01:34 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:32 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:28 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1353.eqiad.wmnet
  • 01:27 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:24 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:23 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1354.eqiad.wmnet
  • 01:22 mutante: mw1353 - restarted apache (some race condition on new installs, 5 other servers did not have the issue)
  • 01:17 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1355.eqiad.wmnet
  • 01:16 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1350.eqiad.wmnet
  • 01:16 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1351.eqiad.wmnet
  • 01:15 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1352.eqiad.wmnet
  • 01:14 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw1355.eqiad.wmnet
  • 01:14 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw1354.eqiad.wmnet
  • 01:14 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw1350.eqiad.wmnet
  • 01:14 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw1353.eqiad.wmnet
  • 01:14 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw1351.eqiad.wmnet
  • 01:14 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw1352.eqiad.wmnet
  • 01:03 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:01 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T240728 Fix Latin Wikipedia (VICIPÆDIA) wordmark and set size correctly (duration: 01m 06s)
  • 01:01 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:45 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:43 James_F: Manually purged https://en.wikipedia.org/images/mobile/copyright/wikipedia-wordmark-la.svg and .png from Varnish for T240728
  • 00:41 jforrester@deploy1001: Synchronized static/images/mobile/copyright/: T240728 Sync logo images (duration: 01m 04s)
  • 00:40 mutante: mw1351 through mw1355 - initial puppet runs - new appservers
  • 00:36 niharika29@deploy1001: Synchronized static/images/mobile/copyright/: Remove unnecessary id from wordmark (duration: 01m 03s)
  • 00:34 niharika29@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Adjust MT Threshold for Assamese to 70% - T245509 (duration: 01m 04s)
  • 00:24 niharika29@deploy1001: Synchronized php-1.35.0-wmf.19/extensions/WikimediaEvents/: Follow up on authevents statsd changes in I7612b68fe (duration: 01m 03s)
  • 00:21 niharika29@deploy1001: Synchronized wmf-config/logging.php: Update authmanager-statsd channel name (duration: 01m 03s)
  • 00:16 eileen: civicrm revision changed from 8c77e9e915 to 52c68911c6, config revision is 561ae21f77
  • 00:10 niharika29@deploy1001: Synchronized wmf-config/logging.php: Make the logstash and authmanager-statsd Monolog handlers compatible (duration: 01m 04s)
  • 00:08 mutante: creating mcrouter certs for mw1350

2020-02-18

  • 23:56 mutante: mw1349 - scap pull
  • 23:55 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1349.eqiad.wmnet
  • 23:54 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw1349.eqiad.wmnet
  • 23:34 maryum: running reindex on mwmaint1002 - T194448
  • 23:28 maryum: running reindex for wikimedia wikis
  • 23:14 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:12 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw2151.wmnet
  • 23:12 dzahn@cumin1001: conftool action : set/weight=10; selector: name=mw2150.wmnet
  • 23:12 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:58 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: cirrus: Enable ores_articletopics field creation for all wikis (extra sync for T236104) (duration: 01m 04s)
  • 22:54 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: cirrus: Enable ores_articletopics field creation for all wikis (duration: 01m 03s)
  • 22:52 chaomodus: completed upgrading Netbox to 2.7.4 T244291
  • 22:51 crusnov@deploy1001: Finished deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part3) (duration: 00m 11s)
  • 22:51 crusnov@deploy1001: Started deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part3)
  • 22:49 crusnov@deploy1001: Finished deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part2) (duration: 01m 19s)
  • 22:48 crusnov@deploy1001: Started deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part2)
  • 22:46 crusnov@deploy1001: Finished deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (duration: 01m 19s)
  • 22:45 crusnov@deploy1001: Started deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291
  • 22:38 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T244185 Raise minimum log level for 'OAuth' from DEBUG to INFO (duration: 01m 04s)
  • 22:30 chaomodus: Upgrading Netbox to 2.7.4
  • 21:56 bblack@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:54 bblack@cumin1001: START - Cookbook sre.hosts.downtime
  • 21:26 XioNoX: rollback tcp-mss clamping in eqiad/eqord
  • 21:07 jeh: power down and set Icinga downtime on cloudvirt1022 T243536
  • 21:07 jeh: power down and set Icinga downtime on cloudvirt1022 T241884
  • 20:54 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enabling EventStreamConfig extension on metawiki - T242122 (duration: 01m 03s)
  • 20:47 ppchelko@deploy1001: Finished deploy [changeprop/deploy@e2fe8ca]: respect service name in consumer group T244387 (duration: 07m 59s)
  • 20:45 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enabling EventStreamConfig extension on testwiki - T242122 (duration: 01m 04s)
  • 20:39 ppchelko@deploy1001: Started deploy [changeprop/deploy@e2fe8ca]: respect service name in consumer group T244387
  • 20:06 jforrester@deploy1001: Synchronized php-1.35.0-wmf.19/includes/libs/StatusValue.php: T245155 StatusValue: Fix __toString() to not choke on special parameters (duration: 01m 04s)
  • 20:03 jforrester@deploy1001: rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.20 T233868
  • 19:52 jforrester@deploy1001: Finished scap: testwiki to 1.35.0-wmf.20 and re-build l10n cache T233868 (duration: 61m 01s)
  • 19:41 papaul: shutting down dns2001 for 10G card troubleshooting
  • 19:30 James_F: Running `foreachwiki sql.php php-1.35.0-wmf.19/maintenance/archives/patch-watchlist_expiry.sql` for T244631
  • 18:51 jforrester@deploy1001: Started scap: testwiki to 1.35.0-wmf.20 and re-build l10n cache T233868
  • 18:49 jforrester@deploy1001: Pruned MediaWiki: 1.35.0-wmf.18 (duration: 15m 29s)
  • 18:25 James_F: Running `scap prep` for 1.35.0-wmf.20 ref. T233868
  • 18:01 James_F: 1.35.0-wmf.20 was branched at c664b4f for T233868
  • 18:01 marxarelli: completed promotion of 1.35.0-wmf.19 to all wikis (T233867)
  • 17:52 dduvall@deploy1001: rebuilt and synchronized wikiversions files: Re-roll all wikis to 1.35.0-wmf.19 (T233867)
  • 17:47 marxarelli: re-rolling wmf.19 to all wikis (T233867) with eyes particularly on (T245202)
  • 17:28 bblack: cp3 (esams edge) - revert GRE MTU mitigations - T232602
  • 17:00 papaul: resetting ps1-a8-codfw, see T245164
  • 16:34 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 16:32 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 16:12 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 16:11 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 16:09 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 16:08 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 16:03 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
  • 16:02 ottomata: deploying new 'canary' and 'production' releases for eventgate-main. (These releases use a new nodePort, and so will not be active until LVS is modified. The old 'main' release and nodePort is left as is.) - T242861
  • 16:02 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
  • 15:51 bblack: dns2001 - shutdown for hw/reimage work - T242017
  • 15:47 bblack: dns2001 - stopping bgp to drain service for hw/reimage work - T242017
  • 15:41 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 15:40 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 15:36 jynus: stopping db1140:s3 instance
  • 15:35 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 15:34 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 15:34 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 15:14 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 15:08 vgutierrez@puppetmaster1001: conftool action : set/weight=100; selector: dc=eqiad,cluster=cache_text,service=ats-be,name=cp1089.eqiad.wmnet
  • 15:04 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:56 bblack: esams repooled in DNS
  • 14:54 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:54 ottomata: deploying new 'canary' and 'production' releases for eventgate-analytics. (These releases use a new nodePort, and so will not be active until LVS is modified. The old 'analytics' release and nodePort is left as is.) - T242861
  • 14:47 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 14:47 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 14:39 XioNoX: remove cr2-esams VRRP handicap - T243080
  • 14:34 XioNoX: restore default esams-eqiad link cost - T243080
  • 14:33 XioNoX: re-enable cr2-esams BGP transit/peering - T243080
  • 14:31 XioNoX: cr2-esams - request chassis routing-engine master switch - T243080
  • 14:29 XioNoX: re-disable cr2-esams BGP group IX4 - T243080
  • 14:14 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/DiscussionTools: wmf.18: Add config option and query parameter to control loading (duration: 01m 11s)
  • 14:02 cdanis: depool esams
  • 14:01 XioNoX: re-enable cr2-esams BGP group IX4 - T243080
  • 13:55 marostegui@cumin1001: dbctl commit (dc=all): 'Increase API weight for db1107 25 -> 50 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10448 and previous config saved to /var/cache/conftool/dbconfig/20200218-135525-marostegui.json
  • 13:44 XioNoX: installing OS on cr2-esams:re0 - T243080
  • 13:39 XioNoX: cr2-esams - request chassis routing-engine master switch - T243080
  • 13:37 XioNoX: deactivate peering/transit on cr2-esams - T243080
  • 13:24 XioNoX: reboot cr2-esams:re1 (backup) - T243080
  • 13:23 XioNoX: bump cost of eqiad-esams transport - T243080
  • 13:10 XioNoX: fail vrrp master to cr3-esams - T243080
  • 12:58 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 12:55 Amir1: EU SWAT done
  • 12:53 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Add DiscussionTools to four wikis in hidden mode (T244870), take II (duration: 01m 03s)
  • 12:52 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Add DiscussionTools to four wikis in hidden mode (T244870) (duration: 01m 04s)
  • 12:45 XioNoX: remove graceful-switchover and nonstop-routing from cr2-esams - T243080
  • 12:36 XioNoX: push new Junos to cr2-esams:re1 (backup RE, noop) - T243080
  • 12:22 ladsgroup@deploy1001: Synchronized wmf-config/Wikibase.php: SWAT: Wikibase: added config variables to configure entity sources (T242087), Part II (duration: 01m 03s)
  • 12:20 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Wikibase: added config variables to configure entity sources (T242087), Part I, take II (the cache issue) (duration: 01m 04s)
  • 12:18 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Wikibase: added config variables to configure entity sources (T242087), Part I (duration: 01m 06s)
  • 12:14 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Start reading for the new term store for clients up to Q1000 (T225057) (duration: 01m 05s)
  • 12:06 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 4b193dd: Increase Commons linkpurge rate limit for patrollers (T245214) (duration: 01m 31s)
  • 11:51 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:48 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:47 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 11:43 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:41 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:35 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 11:27 jynus: reenabling prometheus exporter metadata user for prometheus1003
  • 11:10 jynus: temp. disabling prometheus exporter metadata user for prometheus1003
  • 10:49 marostegui@cumin1001: dbctl commit (dc=all): 'Increase API weight for db1107 15 -> 25 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10445 and previous config saved to /var/cache/conftool/dbconfig/20200218-104958-marostegui.json
  • 09:27 gehel: re-enable puppet on mw* - T222321
  • 09:13 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1107 after temporary change optimizer options - T245489', diff saved to https://phabricator.wikimedia.org/P10444 and previous config saved to /var/cache/conftool/dbconfig/20200218-091343-marostegui.json
  • 09:09 gehel: disabling puppet on mw* to deploy apache config change - T222321
  • 09:07 volans: rm /var/log/exim4/paniclog on cumin1001 to clear last week's OOM error
  • 08:59 marostegui: Remove wikiadmin2 grants from es1 T243512
  • 08:59 marostegui: Remove wikiadmin2 grants from es1
  • 08:57 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1107 after temporary change optimizer options', diff saved to https://phabricator.wikimedia.org/P10443 and previous config saved to /var/cache/conftool/dbconfig/20200218-085713-marostegui.json
  • 08:23 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1107 after temporary change optimizer options - T245489', diff saved to https://phabricator.wikimedia.org/P10442 and previous config saved to /var/cache/conftool/dbconfig/20200218-082306-marostegui.json
  • 08:09 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1107 after temporary change optimizer options - T245489', diff saved to https://phabricator.wikimedia.org/P10441 and previous config saved to /var/cache/conftool/dbconfig/20200218-080952-marostegui.json
  • 08:08 marostegui: Restart MySQL to pick up optimizer_switch changes - T245489
  • 08:06 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 to temporary change optimizer options - T245489', diff saved to https://phabricator.wikimedia.org/P10440 and previous config saved to /var/cache/conftool/dbconfig/20200218-080623-marostegui.json
  • 07:34 elukey: powercycle analytics1065 (crashed hours ago, no mgmt console available, no ssh)
  • 06:39 marostegui: Remove wikiadmin2 from pc1007, pc1008, pc1009 and pc1010 T243512
  • 06:38 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight for db1107 100 -> 200 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10439 and previous config saved to /var/cache/conftool/dbconfig/20200218-063819-marostegui.json
  • 06:27 marostegui: Stop haproxy on dbproxy1007 - T245385
  • 06:25 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1107 with weight 100 and weight 10 in API for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10438 and previous config saved to /var/cache/conftool/dbconfig/20200218-062459-marostegui.json
  • 06:09 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 06:08 marostegui@cumin1001: START - Cookbook sre.hosts.decommission

2020-02-17

  • 19:56 cdanis: finish enabling TCP-MSS clamping in eqiad
  • 19:49 cdanis: s/no-op//
  • 19:49 cdanis: no-op enable TCP-MSS clamping on eqord and eqiad
  • 19:33 cdanis: no-op enable flowspec change on cr2-eqord and cr2-eqiad
  • 18:25 elukey: restart kafka on kafka-jumbo1001 to pick up new openjdk updates
  • 17:25 bblack: GRE MTU mitigations applied to esams cp hosts only - T232602
  • 15:55 ayounsi@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 15:50 ayounsi@cumin1001: START - Cookbook sre.ganeti.makevm
  • 15:48 ayounsi@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
  • 15:48 ayounsi@cumin1001: START - Cookbook sre.ganeti.makevm
  • 15:44 cdanis: ✔️ cdanis@icinga1001.wikimedia.org ~ 🕥☕ sudo systemctl restart ircecho
  • 14:31 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10422 and previous config saved to /var/cache/conftool/dbconfig/20200217-143146-marostegui.json
  • 14:17 ema: reprepro includedeb buster-wikimedia ~ema/cadvisor_0.35.0+ds1-4_amd64.deb T183146
  • 12:34 XioNoX: add test flowspec rules to cr3-knams
  • 12:34 moritzm: installing postgresql-9.4 security updates
  • 12:27 vgutierrez: reboot acmechief instances (kernel upgrade)
  • 10:31 jynus: dropping all databases from db1140:3313
  • 10:22 marostegui@cumin1001: dbctl commit (dc=all): 'db1107 increase API weight from 10 to 15 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10420 and previous config saved to /var/cache/conftool/dbconfig/20200217-102218-marostegui.json
  • 10:20 vgutierrez: rolling restart of ats-tls and varnish-fe on ulsfo to enable KA between them - T244464
  • 10:00 moritzm: installing Linux 4.9.210 kernels on stretch systems
  • 09:10 godog: correction, +100G
  • 09:09 godog: +10G to prometheus/ops fs on prometheus eqiad - T245361
  • 09:06 godog: +50G to prometheus/ops fs on prometheus eqiad - T245361
  • 07:22 marostegui: Stop haproxy on dbproxy1002 - T245384

2020-02-15

  • 01:01 cdanis: ✔️ cdanis@an-coord1001.eqiad.wmnet ~ 🕗🍺 sudo systemctl restart hive-server2.service ; sudo systemctl restart hive-metastore.service

2020-02-14

  • 23:42 XenoRyet: updated civicrm from cf86495d44 to 8c77e9e915
  • 21:01 volker-e@deploy1001: Finished deploy [design/style-guide@1928c00]: Deploy design/style-guide: (duration: 00m 09s)
  • 21:01 volker-e@deploy1001: Started deploy [design/style-guide@1928c00]: Deploy design/style-guide:
  • 20:21 reedy@deploy1001: Synchronized wmf-config/CommonSettings.php: Prevent some logspam T245280 (duration: 01m 05s)
  • 19:27 XenoRyet: updated civicrm from 55b2afb6eb to cf86495d44
  • 19:10 jforrester@deploy1001: Synchronized php-1.35.0-wmf.19/extensions/Wikibase: T245062 Prevent invalid term languages from cached PrefetchingTermLookup (duration: 01m 09s)
  • 17:37 jforrester@deploy1001: Unlocked for deployment [ALL REPOSITORIES]: Testing T245062 fix on mwdebug1001 (duration: 03m 05s)
  • 17:33 jforrester@deploy1001: Locking from deployment [ALL REPOSITORIES]: Testing T245062 fix on mwdebug1001 (planned duration: 60m 00s)
  • 16:11 moritzm: installing git-lfs updates from Buster 10.3 point update
  • 15:55 moritzm: uploaded pypuppetdb 0.3.3-2~wmf+deb10u1 to apt.wikimedia.org
  • 15:55 bblack: (log(n))
  • 15:54 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2086:3318 T239453', diff saved to https://phabricator.wikimedia.org/P10414 and previous config saved to /var/cache/conftool/dbconfig/20200214-155443-marostegui.json
  • 15:52 moritzm: uploaded pypuppetdb 0.3.3-2~wmf+deb9u1 to apt.wikimedia.org
  • 15:46 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Resync InitialiseSettings to try and pick up previously deployed cirrus query routing changes (duration: 01m 05s)
  • 15:42 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:42 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:32 effie: restart mc-gp* for updates
  • 15:17 bd808: Toil reduction: !log messages now work from the SRE team's Freenode channel.
  • 13:50 gehel: restart relforge for JVM upgrade - T245120
  • 10:35 vgutierrez: revert ats 8.0.6-rc0 experiment on cp40[26,32]
  • 10:14 vgutierrez: rolling restart of ats-be to enable TLSv1.3 against origin servers - T170567
  • 09:34 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10409 and previous config saved to /var/cache/conftool/dbconfig/20200214-093456-marostegui.json
  • 09:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:32 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:25 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:25 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:25 volans: manually absented /usr/local/bin/apt2xml on the 5 hosts with puppet disabled
  • 09:15 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:15 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:12 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:12 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:05 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:05 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 09:05 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:05 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:46 moritzm: installing 4.19.98 kernel update on Buster systems
  • 08:06 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1107 with weight 100 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10408 and previous config saved to /var/cache/conftool/dbconfig/20200214-080600-marostegui.json
  • 06:51 vgutierrez: updating puppet compiler facts
  • 01:27 dpifke@deploy1001: Finished deploy [performance/navtiming@2eec00a]: (no justification provided) (duration: 00m 05s)
  • 01:27 dpifke@deploy1001: Started deploy [performance/navtiming@2eec00a]: (no justification provided)
  • 00:53 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T245202 cirrus: Move all more_like traffic to codfw (duration: 01m 02s)
  • 00:51 jforrester@deploy1001: Synchronized wmf-config/PoolCounterSettings.php: T245202 cirrus: Increase the pool counter limits a bit (duration: 01m 05s)

2020-02-13

  • 22:13 jeh: running filesystem tests on cloudvirt1024 T241884
  • 21:42 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 21:41 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 21:40 jbond42: refresh facts on compilers
  • 21:38 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 21:37 otto@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 21:35 ottomata: deploying production and canary releases for eventgate-logging-external (and destroying the 'logging-external' release) (safe because eventgate-logging-external is not in use) - T245203
  • 21:29 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
  • 21:28 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
  • 20:33 marxarelli: rollback to group1 due to 500 spike (2k/min) (T233867)
  • 20:32 dduvall@deploy1001: rebuilt and synchronized wikiversions files: (no justification provided)
  • 20:30 marxarelli: varnish 500 spike. rolling back
  • 20:20 gehel: restarting blazegraph + updater on wdqs2006
  • 20:19 dduvall@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.19
  • 19:44 jforrester@deploy1001: Synchronized php-1.35.0-wmf.19/includes/api/ApiRollback.php: T245159 ApiRollback: Properly deal with UserIdentity (duration: 01m 04s)
  • 19:20 jforrester@deploy1001: Synchronized php-1.35.0-wmf.19/includes/resourceloader/ResourceLoaderSkinModule.php: T245182 ResourceLoaderSkinModule: Don't hard-deprecate wgLogoHD just now (duration: 01m 03s)
  • 19:17 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T219534 Add new MLR models for Cirrus on zh/ja/kowiki (duration: 01m 03s)
  • 19:10 moritzm: installing e2fsprogs security updates
  • 18:48 bblack: ns1.wikimedia.org - re-routing back to authdns2001 instead of dns2002 on cr[12]-codfw - T242017
  • 18:38 bblack: authdns2001 - reboot - T242017
  • 18:36 bblack: ns1.wikimedia.org - re-routing from authdns2001 to dns2002 on cr[12]-codfw - T242017
  • 18:09 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: I9d0c8af3c577 (duration: 01m 06s)
  • 18:00 krinkle@deploy1001: Synchronized wmf-config/etcd.php: Iae1f45896 (duration: 01m 06s)
  • 17:59 volans: downtimed mgmt in eqiad for 1h
  • 17:58 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: Iae1f45896 (duration: 01m 08s)
  • 17:49 krinkle@deploy1001: Synchronized wmf-config/etcd.php: Ibfca686f681 (duration: 01m 06s)
  • 17:41 krinkle@deploy1001: Synchronized wmf-config/etcd.php: Iefff596955e (duration: 01m 08s)
  • 17:40 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: Iefff596955e (duration: 01m 06s)
  • 17:35 krinkle@deploy1001: Synchronized wmf-config/etcd.php: I2e4fb0 (duration: 01m 06s)
  • 17:32 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: I2e4fb0 (duration: 01m 06s)
  • 17:10 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: No-op (code style only) deploy sync (duration: 01m 07s)
  • 17:09 jforrester@deploy1001: sync aborted: wmf-config/CommonSettings.php No-op (code style only) deploy sync (duration: 00m 04s)
  • 17:09 jforrester@deploy1001: Started scap: wmf-config/CommonSettings.php No-op (code style only) deploy sync
  • 16:32 robh: ps1-a8-codfw.mgmt.codfw.wmnet firmware upgraded via T245164
  • 16:28 papaul: rebooting elastic2043 for firmware upgrade
  • 16:22 gehel: canceled the restart of elastic2043 - T243715
  • 16:21 gehel: restarting elastic2043 - T243715
  • 16:10 _joe_: depooling/repooling mw1240
  • 16:02 _joe_: pooled mw1238 again
  • 15:59 _joe_: depooling mw1238 for analysis
  • 15:42 vgutierrez: rolling restart of ats-be on esams - T170567
  • 15:38 vgutierrez: disable allow_half_open on ats-tls @ cp4031 - T236458
  • 15:27 vgutierrez: turning on TLSv1.3 between ats-be and applayer in cp30[51-52] - T170567
  • 15:22 jforrester@deploy1001: Synchronized php-1.35.0-wmf.19/extensions/WikibaseMediaInfo/resources/: UBN fix: Force non-value to be undefined (duration: 01m 06s)
  • 14:51 vgutierrez: test TLSv1.3 between ats-be and applayer in cp3050 - T170567
  • 14:47 XioNoX: re-image rpki2001 - T244585
  • 14:33 XioNoX: add routinator_0.6.4_amd64.deb to buster-wikimedia apt repo
  • 14:27 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10405 and previous config saved to /var/cache/conftool/dbconfig/20200213-142735-marostegui.json
  • 14:24 XioNoX: re-enable ping offload in esams - T244584
  • 13:31 XioNoX: disable ping offload in esams - T244584
  • 13:24 XioNoX: re-enable ping offload in eqiad - T244584
  • 13:06 XioNoX: disable ping offload in eqiad - T244584
  • 13:03 XioNoX: re-enable ping offload in codfw - T244584
  • 13:00 vgutierrez: pool cp10[75,76] running buster - T242093
  • 12:51 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:49 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:49 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:47 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:34 Amir1: EU SWAT is done
  • 12:30 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Read and write more in the new term store, take II, the cache issue (T219123 T225055) (duration: 01m 03s)
  • 12:29 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Read and write more in the new term store (duration: 01m 03s)
  • 12:29 vgutierrez: depool cp10[75,76] and reimage as buster - T242093
  • 12:28 vgutierrez: pool cp10[77,78] running buster - T242093
  • 12:20 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert: Triple the factor of WDQS lag to maxlag for Wikidata (T244722) (duration: 01m 04s)
  • 12:18 XioNoX: re-image ping2001 to buster - T244584
  • 12:16 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 1c81925: Create Test Custodians group at Beta Wikiversity (T240438) (duration: 01m 07s)
  • 12:13 XioNoX: disable ping offload in codfw
  • 12:13 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:13 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:13 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0f035e4: Update wgAvailableRights declaration of autoreviewprotected (T230103) (duration: 01m 03s)
  • 12:11 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:08 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 176b0e8: Grant autopatrol to azwiki patrollers (T244338) (duration: 01m 05s)
  • 11:53 vgutierrez: depool cp10[77,78] and reimage as buster - T242093
  • 11:52 vgutierrez: pool cp10[79,80] running buster - T242093
  • 11:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:37 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:37 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:35 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:18 vgutierrez: rolling upgrade of ATS to version 8.0.5-1wm16 fleet wide - T244464
  • 11:16 vgutierrez: depool cp10[79,80] and reimage as buster - T242093
  • 11:12 ema: A:cp re-enable puppet, leave it to cron to apply wikimedia-common/wikimedia-frontend VCL merge T241239
  • 11:08 vgutierrez: upload trafficserver 8.0.5-1wm16 to apt.wm.o (buster) - T244464
  • 11:02 vgutierrez: pool cp10[81,82] running buster - T242093
  • 10:59 ema: cp4021 (cache_upload): apply wikimedia-common/wikimedia-frontend VCL merge T241239
  • 10:49 ema: cp4027 (cache_text): apply wikimedia-common/wikimedia-frontend VCL merge T241239
  • 10:46 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:44 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:44 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:41 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:23 vgutierrez: removing /root/.ssh/known_hosts in cumin1001
  • 10:21 vgutierrez: pool cp10[83,84] running buster - T242093
  • 10:08 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:06 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:06 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 09:45 vgutierrez: depool cp10[83,84] and reimage as buster - T242093
  • 09:45 vgutierrez: pool cp10[85,86] running buster - T242093
  • 09:10 moritzm: installing Java security updates on elastic* and relforge*
  • 08:59 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight for db1107 50 -> 100 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10403 and previous config saved to /var/cache/conftool/dbconfig/20200213-085957-marostegui.json
  • 08:57 gehel: restart elasticsearch on elastic2051 - JVM upgrade
  • 08:21 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:18 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 08:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 08:15 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:57 moritzm: installing Java security updates on Hadoop, Kafka/Jumbo, AQS and Druid canaries
  • 07:57 vgutierrez: depool cp10[85,86] and reimage as buster - T242093
  • 07:53 moritzm: rolling restart of restbase-dev to pick up Java security update
  • 07:49 vgutierrez: pool cp10[87,88] running buster - T242093
  • 07:49 vgutierrez: testing ATS 8.0.5-1wm16 + KA between ats-tls and varnish-fe in cp4031 - T244464
  • 07:47 moritzm: installing Java security updates on stat/SWAP hosts
  • 07:28 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1107 with weight 50 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10402 and previous config saved to /var/cache/conftool/dbconfig/20200213-072839-marostegui.json
  • 07:26 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:24 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:23 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:21 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:03 vgutierrez: depool cp10[87,88] and reimage as buster - T242093
  • 07:02 vgutierrez: pool cp10[89,90] running buster - T242093
  • 06:49 vgutierrez: pool cp20[02,05] running buster - T242093
  • 06:36 marostegui: Upgrade and compress db1087, this will generate lag on s8 on the wiki replicas - T232446
  • 06:35 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1087 for compression - T232446', diff saved to https://phabricator.wikimedia.org/P10401 and previous config saved to /var/cache/conftool/dbconfig/20200213-063535-marostegui.json
  • 06:34 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:33 marostegui@cumin1001: dbctl commit (dc=all): 'Pool db1099:3318 into vslow for s8 T239453', diff saved to https://phabricator.wikimedia.org/P10400 and previous config saved to /var/cache/conftool/dbconfig/20200213-063334-marostegui.json
  • 06:32 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:32 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1099:3318, db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10399 and previous config saved to /var/cache/conftool/dbconfig/20200213-063207-marostegui.json
  • 06:30 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:28 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:26 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1099:3318, db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10398 and previous config saved to /var/cache/conftool/dbconfig/20200213-062642-marostegui.json
  • 06:25 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:23 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:22 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:21 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1099:3318, db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10397 and previous config saved to /var/cache/conftool/dbconfig/20200213-062148-marostegui.json
  • 06:19 vgutierrez: testing a new build of ATS 8.0.6 in cp40[26,32]
  • 06:19 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:12 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1099:3318, db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10396 and previous config saved to /var/cache/conftool/dbconfig/20200213-061219-marostegui.json
  • 06:11 vgutierrez: depool cp10[89,90] and reimage as buster - T242093
  • 06:04 vgutierrez: depool cp20[02,05] and reimage as buster - T242093
  • 06:04 vgutierrez: pool cp20[01,08] running buster - T242093
  • 06:02 twentyafterfour: set phabricator read-only to false
  • 06:01 twentyafterfour: set phabricator read-only
  • 06:00 marostegui: Start phabricator maintenance T244566
  • 05:55 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:53 marostegui: Upgrade db1128 without restarting mysql - T244566
  • 05:52 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:50 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:47 marostegui: Silence m3 hosts for maintenance - T244566
  • 05:38 vgutierrez: depool cp2008 and reimage as buster - T242093
  • 05:37 vgutierrez: pool cp2011 running buster - T242093
  • 05:35 vgutierrez: depool cp2001 and reimage as buster - T242093
  • 05:34 vgutierrez: pool cp2004 running buster - T242093
  • 05:30 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:28 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:28 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 05:25 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 05:09 vgutierrez: depool cp20[04,11] and reimage as buster - T242093
  • 03:57 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:57 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 03:54 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:52 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:52 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:52 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:32 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 03:30 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:28 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:27 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:06 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:04 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:02 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:00 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:44 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:41 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 02:35 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 02:33 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:22 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:20 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:18 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:17 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:10 twentyafterfour: no apparent problems with phabricator upgrade, all done
  • 01:01 twentyafterfour: starting phabricator deploy, momentary downtime expected while apache restarts
  • 00:58 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:56 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:56 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:54 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:45 niharika29@deploy1001: Synchronized wmf-config/throttle.php: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon - T244488 (duration: 01m 07s)
  • 00:36 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:34 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:32 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:31 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:11 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:08 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 00:08 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:06 pt1979@cumin2001: START - Cookbook sre.hosts.downtime

2020-02-12

  • 23:46 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:44 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:43 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:41 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:11 XioNoX: deactivate BGP to office's router1 while it's under maintenance
  • 21:59 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 21:58 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 21:57 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:53 chaomodus: restart nagios-nrpe-service on cumin1001 after it had oomed
  • 21:51 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:51 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:47 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:18 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' .
  • 21:18 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' .
  • 21:10 marxarelli: completed group1 to 1.35.0-wmf.19
  • 21:00 dduvall@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.19 (duration: 01m 03s)
  • 20:59 dduvall@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.19
  • 20:49 krinkle@deploy1001: Synchronized wmf-config/CommonSettings.php: T232563 - Remove SERVER_SOFTWARE override (duration: 01m 03s)
  • 20:39 krinkle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T72470 - Disable wgLegacyJavaScriptGlobals on svwiki (duration: 01m 08s)
  • 19:53 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Don't use hex escapes in the name of cawiki (duration: 01m 04s)
  • 19:47 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T243503 [itwiki] Move assignment of 'mover' group from sysops to bureaucrats (duration: 01m 02s)
  • 19:42 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T243509 [zh_classicalwiki] Enable new user message for auto-created accounts (duration: 01m 03s)
  • 19:38 James_F: Ran mwscript maintenance/namespaceDupes.php --wiki=mywiki --fix and mwscript maintenance/namespaceDupes.php --wiki=mywiktionary --fix on mwmaint1002
  • 19:37 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T244980 Localise $wgMetaNamespace for mywiki and mywiktionary (duration: 01m 03s)
  • 19:30 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T244205 [newiki] Set local timezone to Kathmandu (duration: 01m 03s)
  • 19:19 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T241883 [fywiktionary] Set a local wgSitename (duration: 01m 03s)
  • 19:12 jforrester@deploy1001: Synchronized wmf-config/throttle-analyze.php: Replace deprecated IP class with IPUtils (no-op sync) (duration: 01m 03s)
  • 18:31 mutante: irc2001 - manually run the "${v6_token_cmd} && ${v6_flush_dyn_cmd}" commands from interface::add_ip6_mapped to debug 'Interface::Add_ip6_mapped[main]/Augeas[ens5_v6_token]: Could not evaluate: Saving failed' but it does not reproduce the puppet error ... (T244719)
  • 17:57 jforrester@deploy1001: Synchronized php-1.35.0-wmf.19/includes/pager/IndexPager.php: T244941 IndexPager: Cast properties passed to implode to arrays (duration: 01m 03s)
  • 17:27 jeh: upgrade RAID firmware on cloudvirt1024 to 25.5.6.0009 T241884
  • 17:22 bblack: ns1.wikimedia.org - re-route back to original authdns2001 destination
  • 17:11 brennen: restarting jenkins for updates
  • 17:09 vgutierrez: disabling KA between ats-tls and varnish-fe on cp4031 - T244464
  • 17:01 vgutierrez: rolling back cp4026 and cp4032 to trafficserver 8.0.5-1wm15
  • 17:00 vgutierrez: depool cp40[26,32]
  • 16:53 bblack@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:52 vgutierrez: pool cp20[06,14] running buster - T242093
  • 16:51 bblack@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:49 moritzm: installing openjpeg2 security updates
  • 16:10 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:08 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:07 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:05 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:56 vgutierrez: Enable KA and disable parent proxies on cp4031 - T244464
  • 15:50 vgutierrez: depool cp20[06,14] and reimage as buster - T242093
  • 15:49 volans: spicerack upgraded to 0.0.30-1 on both cumin hosts
  • 15:48 vgutierrez: pool cp20[07,17] running buster - T242093
  • 15:46 bblack: authdns2001 - shutting down for hardware work - T242017
  • 15:40 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:39 jeh: clearing foreign drive RAID configuration on cloudvirt1024 T241884
  • 15:37 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:32 marostegui: Disable event handler for db1095 RAID check on icinga - T244958
  • 15:28 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:26 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:25 jeh: upgrade BIOS firmware on cloudvirt1024 to 2.4.8 T241884
  • 15:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:02 vgutierrez: depool cp20[07,17] and reimage as buster - T242093
  • 14:34 XioNoX: repool eqsin
  • 14:31 moritzm: reimage logstash2026 to test new standard RAID0 partman recipe
  • 14:00 vgutierrez: pool cp20[10,18] running buster - T242093
  • 13:55 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10393 and previous config saved to /var/cache/conftool/dbconfig/20200212-135514-marostegui.json
  • 13:39 akosiaris: revert sessionstore on mw1331, mw1348 so that it times out instead of returning TCP RSTs. Testing for T243106
  • 13:36 XioNoX: re-enable transit/peering on cr1-eqsin - T244944
  • 13:26 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:24 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:23 akosiaris: mangle sessionstore on mw1331, mw1348 so that it times out instead of returning TCP RSTs. Testing for T243106
  • 13:23 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:22 XioNoX: cr1-eqsin RE failover (final) - T244944
  • 13:21 marostegui: Restart wikibugs as phab comments aren't showing up on irc - T241109
  • 13:20 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:18 jynus: setting up db1140 under maintenance (upgrade, reboot, disable alerts)
  • 13:15 vgutierrez: disabling KA between ats-tls and varnish-fe on cp4031 - T244464
  • 13:10 moritzm: upgrading debdeploy fleet-wide to 0.0.99.13
  • 13:08 moritzm: uploaded libapache2-mod-auth-cas 1.2-1~deb8u1 for jessie-wikimedia to apt.wikimedia.org
  • 13:05 vgutierrez: depool cp20[10,18] and reimage as buster - T242093
  • 13:05 vgutierrez: pool cp20[12,20] running buster - T242093
  • 12:55 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:53 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:53 XioNoX: cr1-eqsin RE failover - T244944
  • 12:50 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:35 vgutierrez: depool cp20[12,20] and reimage as buster - T242093
  • 12:34 vgutierrez: pool cp20[13,22] running buster - T242093
  • 12:26 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:24 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:21 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Triple the factor of WDQS lag to maxlag for Wikidata (T244722), take II, the cache issue (duration: 01m 03s)
  • 12:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:19 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Triple the factor of WDQS lag to maxlag for Wikidata (T244722) (duration: 01m 04s)
  • 12:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:12 kartik@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 571412|Enable ContentTranslation out of beta in bs and mk WPs (T244139, T244140) (duration: 01m 15s)
  • 12:08 vgutierrez: depool cp2013 and reimage as buster - T242093
  • 12:06 vgutierrez: pool cp2016 running buster - T242093
  • 12:01 vgutierrez: depool cp20[16,22] and reimage as buster - T242093
  • 11:57 vgutierrez: pool cp20[19,24] running buster - T242093
  • 11:53 akosiaris: mangle sessionstore on mw1331 so that it is unreachable. Testing for T243106
  • 11:49 vgutierrez: repooling cp40[26,32]
  • 11:39 vgutierrez: pool cp3050 running buster - T242093
  • 11:37 vgutierrez: depooling cp[4026,4032]
  • 11:35 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:33 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:32 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:30 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:18 vgutierrez: depool cp2024 and reimage as buster - T242093
  • 11:17 vgutierrez: pool cp2025 running buster - T242093
  • 11:16 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:15 vgutierrez: depool cp2016 and reimage as buster - T242093
  • 11:14 vgutierrez: pool cp2019 running buster - T242093
  • 11:11 moritzm: reimage logstash2026 to test new standard RAID0 partman recipe
  • 11:05 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:00 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:50 vgutierrez: depool cp3050 and reimage as buster - T242093
  • 10:49 vgutierrez: pool cp30[51,52] running buster - T242093
  • 10:45 vgutierrez: depool cp20[19,25] and reimage as buster - T242093
  • 10:42 vgutierrez: pool cp2026 running buster - T242093
  • 10:36 vgutierrez: pool cp2023 running buster - T242093
  • 10:34 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:34 moritzm: bouncing ferm on ganeti1016, failed to start after boot
  • 10:32 vgutierrez: Enable KA between ats-tls and varnish-fe on cp4031 - T244464
  • 10:31 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:31 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:29 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:29 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:27 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:25 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:23 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:12 vgutierrez: testing trafficserver 8.0.6-rc0 in cp40[26,32]
  • 10:06 vgutierrez: depool cp20[23,26] and reimage as buster - T242093
  • 10:01 vgutierrez: depool cp30[51-52] and reimage as buster - T242093
  • 09:38 ema: cp: rolling ats-tls-restart to enable analytics logging T237993
  • 09:26 ema: cp4027: ats-tls-restart to enable analytics logging to pipe T237993
  • 09:25 moritzm: rolling restart of cassandra on restbase-dev to pick up Java security updates
  • 09:17 marostegui: Failover m2 master dbproxy from dbproxy1007 to dbproxy1013 - T202367
  • 09:13 elukey@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 09:11 marostegui: Upgrade and reboot dbproxy1013 before making it master - T202367
  • 08:55 elukey@cumin1001: START - Cookbook sre.ganeti.makevm
  • 08:46 phedenskog@deploy1001: Finished deploy [performance/navtiming@9bbbb58]: (no justification provided) (duration: 00m 05s)
  • 08:46 phedenskog@deploy1001: Started deploy [performance/navtiming@9bbbb58]: (no justification provided)
  • 08:38 marostegui: Restart wikibugs as it doesn't show phab comments on irc - T241109
  • 08:21 moritzm: installing mesa security updates
  • 07:28 vgutierrez: pool cp30[53-54] running buster - T242093
  • 07:18 oblivian@puppetmaster1001: conftool action : set/weight=30; selector: dc=eqiad,pool=appserver,name=mw132[3-4].*
  • 07:16 oblivian@puppetmaster1001: conftool action : set/weight=20; selector: dc=eqiad,pool=appserver,service=nginx,name=mw12[3-5].*
  • 07:02 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1107 with weight 20 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10391 and previous config saved to /var/cache/conftool/dbconfig/20200212-070250-marostegui.json
  • 06:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:52 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:51 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:50 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:46 marostegui: Redact ngwikimedia on db1124:3313 and db2094:3313 T240772
  • 06:22 vgutierrez: depool cp30[53-54] and reimage as buster - T242093
  • 06:18 marostegui@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 06:17 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
  • 06:16 marostegui@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 06:16 marostegui@cumin1001: START - Cookbook sre.hosts.decommission
  • 01:48 XioNoX: disabling peering session on cr1-eqsin (they're flapping otherwise)
  • 00:44 jforrester@deploy1001: Synchronized php-1.35.0-wmf.19/includes/page/ImageHistoryPseudoPager.php: T244937 ImageHistoryPseudoPager: Update doQuery() for IndexPager changes (duration: 01m 03s)
  • 00:38 XioNoX: reboot cr1-eqsin
  • 00:33 XioNoX: commit full on cr1-eqsin - T243080
  • 00:21 reedy@deploy1001: Synchronized wmf-config/CommonSettings.php: rm wgKartographerIconServer (duration: 01m 02s)
  • 00:20 reedy@deploy1001: Synchronized wmf-config/CommonSettings-labs.php: rm wgKartographerIconServer (duration: 01m 03s)
  • 00:16 eileen: civicrm revision changed from ee9edf8137 to 55b2afb6eb, config revision is 561ae21f77

2020-02-11

  • 22:04 XioNoX: switchover RE mastership back re0 on cr1-eqsin - T243080
  • 21:50 XioNoX: reboot re0:cr1-eqsin (backup) - T243080
  • 21:45 cdanis: repool eqiad
  • 21:37 bblack@cumin1001: conftool action : set/pooled=yes; selector: name=^cp107.*
  • 21:36 bblack@cumin1001: conftool action : set/pooled=yes; selector: name=^cp108.*
  • 21:36 bblack: re-pooling all cp10xx in eqiad
  • 21:32 XioNoX: switchover RE mastership on cr1-eqsin - T243080
  • 21:14 robh: cp1067 powered back into service post firmware update via T243167
  • 21:11 cdanis: depool eqiad
  • 21:01 marxarelli: completed group0 to 1.35.0-wmf.19 (T233867)
  • 20:57 robh: cp108[45] returned to service, depooling cp108[67] for firmware update via T243167
  • 20:54 dduvall@deploy1001: rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.19
  • 20:53 mutante: gerrit - moving gerrit db_pass from private module passwords to private hieradata
  • 20:51 XioNoX: reboot backup RE on cr1-eqsin - T243080
  • 20:38 robh: depooling cp108[45] for firmware update via T243167
  • 20:32 dduvall@deploy1001: Finished scap: testwiki to php-1.35.0-wmf.19 and rebuild l10n cache (duration: 37m 31s)
  • 20:19 volker-e@deploy1001: Finished deploy [design/style-guide@dd8e6de]: Deploy design/style-guide: (duration: 00m 02s)
  • 20:19 volker-e@deploy1001: Started deploy [design/style-guide@dd8e6de]: Deploy design/style-guide:
  • 20:18 volker-e@deploy1001: Finished deploy [design/style-guide@dd8e6de]: Deploy design/style-guide: (duration: 00m 03s)
  • 20:18 volker-e@deploy1001: Started deploy [design/style-guide@dd8e6de]: Deploy design/style-guide:
  • 20:08 XioNoX: depool eqsin for router upgrade - T243080
  • 20:01 volker-e@deploy1001: Finished deploy [design/style-guide@dd8e6de]: Deploy design/style-guide: (duration: 00m 04s)
  • 20:01 volker-e@deploy1001: Started deploy [design/style-guide@dd8e6de]: Deploy design/style-guide:
  • 19:55 dduvall@deploy1001: Started scap: testwiki to php-1.35.0-wmf.19 and rebuild l10n cache
  • 19:43 dduvall@deploy1001: Pruned MediaWiki: 1.35.0-wmf.16 (duration: 01m 48s)
  • 19:42 dduvall@deploy1001: Pruned MediaWiki: 1.35.0-wmf.15 (duration: 01m 51s)
  • 19:38 dduvall@deploy1001: Pruned MediaWiki: 1.35.0-wmf.14 (duration: 02m 08s)
  • 19:36 dduvall@deploy1001: Pruned MediaWiki: 1.35.0-wmf.11 (duration: 10m 53s)
  • 19:35 marxarelli: running `scap clean --delete` for old wmf branches wmf.11, wmf.14, wmf.15, wmf.16 (T233867)
  • 19:03 volans: uploaded spicerack_0.0.30-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
  • 19:00 Urbanecm: Create User:Ammarpad on ngwikimedia and promote to sysop, bureaucrat (T240771)
  • 18:48 jforrester@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.18
  • 18:43 twentyafterfour: getting ready to deploy wmf.18 refs T233866
  • 18:42 greg-g: restarting stashbot
  • 18:35 bblack: ns1.wikimedia.org - changing static route destination on cr[12]-codfw from authdns2001 to dns2002 - T242017
  • 18:33 Urbanecm: Create ngwikimedia is done (T240771)
  • 18:26 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Create ngwikimedia (T240771) (duration: 01m 03s)
  • 18:24 urbanecm@deploy1001: Synchronized multiversion/MWMultiVersion.php: Create ngwikimedia (T240771) (duration: 01m 06s)
  • 18:21 urbanecm@deploy1001: rebuilt and synchronized wikiversions files: Create ngwikimedia (T240771)
  • 18:20 dpifke@deploy1001: Finished deploy [performance/navtiming@b471b64]: (no justification provided) (duration: 00m 05s)
  • 18:20 dpifke@deploy1001: Started deploy [performance/navtiming@b471b64]: (no justification provided)
  • 18:19 urbanecm@deploy1001: Synchronized dblists/: Create ngwikimedia (T240771) (duration: 01m 06s)
  • 17:57 bblack: reboot dns2002 post-reimaging
  • 17:13 vgutierrez: Disable KA on cp4031 - T244464
  • 16:49 vgutierrez: pool cp3055 running buster - T242093
  • 16:43 vgutierrez: repooling cp4031
  • 16:38 vgutierrez: depooling cp4031 for some KA tests
  • 16:25 vgutierrez: pool cp3056 running buster - T242093
  • 16:23 bblack: dns2002 - shutting down for hardware work and reinstall - T242017
  • 16:21 bblack: dns2002 - stopping bird adverts to depool service for T242017
  • 16:20 bblack: dns2002 - downtimed in icinga for T242017
  • 16:07 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:05 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:38 vgutierrez: depool cp3056 and reimage as buster - T242093
  • 15:36 vgutierrez: pool cp3058 running buster - T242093
  • 15:29 otto@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: Configuring test.event stream in beta, no-op in prod - T242122 (duration: 01m 08s)
  • 15:24 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:24 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:14 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:12 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:58 vgutierrez: depool cp3055 and reimage as buster - T242093
  • 14:56 vgutierrez: pool cp3057 running buster - T242093
  • 14:52 moritzm: pruning old CAS logs (predating the current logger config for /var/log/cas/*) from idp1001/idp2001
  • 14:38 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:35 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:21 Amir1: ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=labswiki --force "Ladsgroup" --custom-groups checkuser
  • 14:20 vgutierrez: restart varnish-fe on cp4031 - T244464
  • 14:07 vgutierrez: depool cp3057 and cp3058 and reimage as buster - T242093
  • 13:52 vgutierrez: pool cp3059 and cp3060 running buster - T242093
  • 13:03 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10382 and previous config saved to /var/cache/conftool/dbconfig/20200211-130343-marostegui.json
  • 12:56 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:34 Amir1: EU SWAT is done
  • 12:31 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:29 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:28 ladsgroup@deploy1001: Synchronized wmf-config/Wikibase.php: SWAT: Fix typo in the config name (T244697), take II, cache (duration: 01m 06s)
  • 12:26 ladsgroup@deploy1001: Synchronized wmf-config/Wikibase.php: SWAT: Fix typo in the config name (T244697) (duration: 01m 05s)
  • 12:12 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Stop reading for the new term store as the default of client wikis (T244697), Second round, cache issue (duration: 01m 07s)
  • 12:10 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Stop reading for the new term store as the default of client wikis (T244697) (duration: 01m 11s)
  • 12:04 vgutierrez: depool cp3059 and cp3060 and reimage as buster - T242093
  • 11:59 vgutierrez: repool cp3061 and cp3062 running buster - T242093
  • 11:26 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:24 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:20 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:20 vgutierrez: ats-tls effectively reusing connections between ats-tls and varnish-fe on cp4031 - T244464
  • 11:18 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:56 vgutierrez: depool cp3062 and reimage as buster - T242093
  • 10:54 vgutierrez: repool cp3064 running buster - T242093
  • 10:51 vgutierrez: depool cp3061 and reimage as buster - T242093
  • 10:50 vgutierrez: repool cp5006 and cp3063 running buster - T242093
  • 10:30 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:28 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:27 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:25 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:25 mvolz@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' .
  • 10:23 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:23 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:18 mvolz@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' .
  • 10:11 mvolz@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' .
  • 10:07 vgutierrez: rolling restart of ats-tls in ulsfo - T244464
  • 09:57 vgutierrez: depool cp3063 and cp3064 and reimage as buster - T242093
  • 09:52 vgutierrez: depool cp5006 and reimage as buster - T242093
  • 09:52 vgutierrez: pool cp5007 running buster - T242093
  • 08:38 marostegui@cumin1001: dbctl commit (dc=all): 'Increase db1107 weight from 10 to 11', diff saved to https://phabricator.wikimedia.org/P10380 and previous config saved to /var/cache/conftool/dbconfig/20200211-083812-marostegui.json
  • 08:25 marostegui: Upgrade db1095:3312, db1095:3313
  • 08:22 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool es1013 after upgrade', diff saved to https://phabricator.wikimedia.org/P10379 and previous config saved to /var/cache/conftool/dbconfig/20200211-082204-marostegui.json
  • 08:14 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1013 after upgrade', diff saved to https://phabricator.wikimedia.org/P10378 and previous config saved to /var/cache/conftool/dbconfig/20200211-081421-marostegui.json
  • 08:13 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 5 to 10 for db1107 - T242702', diff saved to https://phabricator.wikimedia.org/P10377 and previous config saved to /var/cache/conftool/dbconfig/20200211-081319-marostegui.json
  • 08:04 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1013 after upgrade', diff saved to https://phabricator.wikimedia.org/P10376 and previous config saved to /var/cache/conftool/dbconfig/20200211-080458-marostegui.json
  • 07:57 akosiaris: T242705 systemctl stop uwsgi-ores on ores2001.
  • 07:54 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 07:54 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:53 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1013 after upgrade', diff saved to https://phabricator.wikimedia.org/P10375 and previous config saved to /var/cache/conftool/dbconfig/20200211-075358-marostegui.json
  • 07:47 marostegui: Upgrade es1013 - T239791
  • 07:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1013 - T239791', diff saved to https://phabricator.wikimedia.org/P10374 and previous config saved to /var/cache/conftool/dbconfig/20200211-074358-marostegui.json
  • 07:23 vgutierrez: depool cp5007 and reimage as buster - T242093
  • 07:22 vgutierrez: pool cp5001 and cp5008 running buster - T242093
  • 07:21 marostegui: Remove partitions from db2086:3318 - T239453
  • 07:19 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2086:3318 T239453', diff saved to https://phabricator.wikimedia.org/P10373 and previous config saved to /var/cache/conftool/dbconfig/20200211-071936-marostegui.json
  • 07:16 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2085:3318 T239453', diff saved to https://phabricator.wikimedia.org/P10372 and previous config saved to /var/cache/conftool/dbconfig/20200211-071639-marostegui.json
  • 07:07 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1107 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10371 and previous config saved to /var/cache/conftool/dbconfig/20200211-070720-marostegui.json
  • 07:01 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:59 marostegui: Stop haproxy on dbproxy1001 - T244463
  • 06:59 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 06:58 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:57 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 06:48 marostegui: Remove grants in m1 for dbproxy1001 - T231280
  • 06:25 vgutierrez: depool cp5001 & cp5008 and reimage as buster - T242093
  • 06:18 marostegui: Failover m1-master from dbproxy1014 to dbproxy1012 - T202367
  • 00:26 ebernhardson@deploy1001: Synchronized php-1.35.0-wmf.18/skins/MinervaNeue: SWAT: Revert: Reduce userContributions icon code (duration: 01m 06s)
  • 00:20 ebernhardson@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Give NS_HELP same weight as NS_MAIN in search on wikitech (duration: 01m 06s)
  • 00:15 ebernhardson@deploy1001: Synchronized wmf-config/: SWAT: Enable SpecialMute page on all wikis (duration: 01m 06s)

2020-02-10

  • 23:30 robh: cp108[23] returned to service via T243167
  • 23:28 legoktm: restarting zuul
  • 23:26 reedy@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/OATHAuth/src/Key/TOTPKey.php: T244308 (duration: 01m 04s)
  • 23:25 reedy@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/OATHAuth/src/Key/TOTPKey.php: T244308 (duration: 01m 07s)
  • 23:06 robh: cp108[01] returned to service, cp108[23] offline for bios update via T243167
  • 22:50 chasemp: phab1001:~# sudo /srv/phab/phabricator/bin/bulk make-silent --id 2164
  • 22:45 sbassett@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add authevents as monolog channel (duration: 01m 06s)
  • 22:43 robh: cp107[789] returned to service, cp108[01] offline for bios update via T243167
  • 22:42 robh: cp107[89] returned to service, cp108[01] offline for bios update via T243167
  • 21:58 robh: cp107[56] returned to service, cp107[78] offline for bios update via T243167
  • 21:43 arlolra: Updated Parsoid to 612106d2 (T244412, T244413, T242746, T235273, T235307, T238845, T204618, T240054)
  • 21:38 robh: cp1075 & cp1076 offline for bios updates per T243167
  • 21:36 robh: cp1075 and cp1076 going offline for bios updates. This will cause a bit of cp irc icinga noise, but no paging. Not putting into maint mode, as there is no way to maint mode the noisiest check (which checks all backends and thus shouldn't be disabled)
  • 21:33 arlolra@deploy1001: Finished deploy [parsoid/deploy@d2d4870]: Updating Parsoid to 612106d2 (duration: 10m 26s)
  • 21:32 XioNoX: clamp tcp-mss on cr2-eqiad:xe-3/3/3
  • 21:23 arlolra@deploy1001: Started deploy [parsoid/deploy@d2d4870]: Updating Parsoid to 612106d2
  • 21:12 halfak@deploy1001: Finished deploy [ores/deploy@a6f4f14]: T242705 (duration: 12m 18s)
  • 21:00 halfak@deploy1001: Started deploy [ores/deploy@a6f4f14]: T242705
  • 20:55 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/MachineVision: MachineVision: Fix page id parsing from imageinfo results (T244752) (duration: 01m 11s)
  • 20:14 mholloway-shell@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/MachineVision: MachineVision: Fix page id parsing from imageinfo results (T244752) (duration: 01m 15s)
  • 19:31 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit:570393 Config: Session Store: Switch group0 and group1 to kask-session T243106 (duration: 01m 06s)
  • 19:28 mutante: Gerrit - added eevans to 'wmf-deployment' group (T244508)
  • 19:12 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T242122 Load new EventStreamConfig extension if so configured (duration: 01m 06s)
  • 19:07 jforrester@deploy1001: Scap failed!: Call to mwscript eval.php stderr: not empty
  • 19:06 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T242122 Set default of wmgUseEventStreamConfig false everywhere (duration: 01m 06s)
  • 18:39 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.18 refs T233866 (duration: 01m 05s)
  • 18:38 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.18 refs T233866
  • 18:25 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.18 refs T233867
  • 18:21 twentyafterfour: MediaWiki train: finally moving forward with group0 wikis to 1.35.0-wmf.18 refs T233866
  • 17:52 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T244561 Set Kartographer servers to Wikimedia servers (duration: 01m 06s)
  • 16:48 moritzm: installing libexif security updates on jessie
  • 16:22 vgutierrez: pooling cp5002 and cp5009 running buster - T242093
  • 15:45 XioNoX: push outbound flowspec support to core routers
  • 15:45 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1107 after first day of 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10366 and previous config saved to /var/cache/conftool/dbconfig/20200210-154552-marostegui.json
  • 15:41 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:41 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 15:33 godog: roll restart cassandra on session* to apply logging changes - T242585
  • 15:23 moritzm: uploading debdeploy 0.0.99.13 to apt.wikimedia.org
  • 15:22 godog: roll restart cassandra on restbase* to apply logging changes - T242585
  • 15:19 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:19 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:19 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 15:19 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:06 marostegui: Reload haproxy on dbproxy1017 and dbproxy1017 - T244209
  • 15:04 twentyafterfour@deploy1001: Finished scap: full scap sync prior to wmf.18 rollout (duration: 20m 13s)
  • 15:04 godog: roll restart cassandra on maps* to apply logging changes - T242585
  • 15:03 vgutierrez: rolling restart of ats-tls - T240950
  • 15:00 marostegui: Restart mysql on m5 master (wikitech will go down) - T244209
  • 14:52 vgutierrez: rolling restart of ats-tls in ulsfo - T244464
  • 14:46 vgutierrez: depool cp5002 and cp5009 and reimage as buster - T242093
  • 14:44 twentyafterfour@deploy1001: Started scap: full scap sync prior to wmf.18 rollout
  • 14:42 vgutierrez: repool cp5003 and cp5010 running buster - T242093
  • 14:41 marostegui: Full-upgrade db1133 (without restarting mysql) - T244209
  • 14:40 twentyafterfour: MediaWiki Train: Running a full scap to prepare for moving forward to 1.35.0-wmf.18 ( T233866 )
  • 14:32 marostegui: Downtime m5 hosts for the upcoming maintenance - T244209
  • 14:19 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:17 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 14:17 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:11 XioNoX: remove TCP-MSS clamping on cr3-knams
  • 13:48 vgutierrez: depool cp5003 and reimage as buster - T242093
  • 13:47 vgutierrez: pooling cp5004 with buster - T242093
  • 13:46 vgutierrez: depool cp5010 and reimage as buster - T242093
  • 13:45 vgutierrez: pooling cp5011 with buster - T242093
  • 13:28 godog: roll restart cassandra on aqs to apply logging changes - T242585
  • 13:03 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Wikibase: Revert "wbterms: Set default for the term store to read new" (T244529) (duration: 01m 00s)
  • 13:03 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:00 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:59 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:58 Urbanecm: EU SWAT is done
  • 12:58 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:56 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 989c9f8: Revert "Revert "Remove handler deleted from the MachineVision extension"" (duration: 00m 58s)
  • 12:51 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 989c9f8: Revert "Revert "Remove handler deleted from the MachineVision extension"" (duration: 00m 59s)
  • 12:49 urbanecm@deploy1001: Finished scap: SWAT: 799224f: 137a40e (T241242; T243974) (duration: 20m 18s)
  • 12:30 vgutierrez: depool cp5004 and reimage as buster - T242093
  • 12:29 vgutierrez: pooling cp5005 with buster - T242093
  • 12:28 urbanecm@deploy1001: Started scap: SWAT: 799224f: 137a40e (T241242; T243974)
  • 12:23 vgutierrez: pooling ncredir1001 with buster - T243391
  • 12:18 _joe_: running puppet, scap pull on mwdebug1001
  • 12:17 vgutierrez: upload trafficserver 8.0.5-1wm15 to apt.wm.o (buster) - T244538
  • 12:08 vgutierrez@cumin1001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 12:08 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:06 vgutierrez: testing ats 8.0.5-1-wm15 on cp4032 - T244538
  • 12:06 urbanecm@deploy1001: Synchronized wmf-config/throttle.php: SWAT: 014405a: Add throttle rules for OSU Editathon and workshop for cawiki, remove expired ones (T244608, T244645) (duration: 01m 03s)
  • 11:57 vgutierrez: depool ncredir1001 and reimage as buster - T243391
  • 11:57 vgutierrez: pooling ncredir1002 with buster - T243391
  • 11:43 vgutierrez: pooling cp4027 with buster - T242093
  • 11:38 vgutierrez: depool ncredir1002 and reimage as buster - T243391
  • 11:31 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:29 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:22 vgutierrez: depooling cp5011 and cp5005 & reimage as buster - T242093
  • 11:07 vgutierrez: depool cp4027 & reimage as buster - T242093
  • 11:07 vgutierrez: pooling ncredir2001 with buster - T243391
  • 11:03 vgutierrez: pooling cp4028 with buster - T242093
  • 10:50 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:48 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:47 godog: remove old logs from /var/log/swift on swift hosts
  • 10:31 vgutierrez: depool ncredir2001 and reimage as buster - T243391
  • 10:26 vgutierrez: depool cp4028 & reimage as buster - T242093
  • 10:14 moritzm: installing sudo security updates for buster
  • 08:53 vgutierrez: pooling cp4029 with buster - T242093
  • 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Increase weight from 1 to 5 for db1107 - T242702', diff saved to https://phabricator.wikimedia.org/P10364 and previous config saved to /var/cache/conftool/dbconfig/20200210-084446-marostegui.json
  • 08:43 vgutierrez: pooling ncredir2002 with buster - T243391
  • 08:34 effie: rolling restart php-fpm on labweb[1001-1002].wikimedia.org,mw*.eqiad.wmnet,scandium.eqiad.wmnet, wtp[1025-1048].eqiad.wmnet
  • 08:32 effie: update php-apcu on eqiad - T236800
  • 08:29 effie: rolling restart php-fpm on cloudweb2001-dev.wikimedia.org,mw[2135-2147,2150-2212,2214-2290].codfw.wmnet,wtp[2001-2020].codfw.wmnet
  • 08:23 effie: update php-apcu on codfw - T236800
  • 07:58 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 07:56 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 07:54 moritzm: updating d-i netinst image for Stretch 9.12 point release (which bumped the kernel ABI)
  • 07:29 moritzm: updating d-i netinst image for Buster 10.3 point release (which bumped the kernel ABI)
  • 07:09 elukey: restore mw1347's mcrouter settings to its default (proxy threads 10 -> 5)
  • 07:01 marostegui@cumin1001: dbctl commit (dc=all): 'Place db1107 - MariaDB 10.4 on s1 with minimal weight - T242702', diff saved to https://phabricator.wikimedia.org/P10363 and previous config saved to /var/cache/conftool/dbconfig/20200210-070140-marostegui.json
  • 06:55 vgutierrez: depool ncredir2002 and reimage as buster - T243391
  • 06:53 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool es1019', diff saved to https://phabricator.wikimedia.org/P10362 and previous config saved to /var/cache/conftool/dbconfig/20200210-065326-marostegui.json
  • 06:51 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1091 T232446', diff saved to https://phabricator.wikimedia.org/P10361 and previous config saved to /var/cache/conftool/dbconfig/20200210-065135-marostegui.json
  • 06:47 vgutierrez: depool cp4029 & reimage as buster - T242093
  • 06:45 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1019', diff saved to https://phabricator.wikimedia.org/P10360 and previous config saved to /var/cache/conftool/dbconfig/20200210-064553-marostegui.json
  • 06:45 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1091 T232446', diff saved to https://phabricator.wikimedia.org/P10359 and previous config saved to /var/cache/conftool/dbconfig/20200210-064458-marostegui.json
  • 06:39 marostegui: Compress db1124:3318 - this will generate lag on s8 wiki replicas - T232446
  • 06:37 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1091 T232446', diff saved to https://phabricator.wikimedia.org/P10358 and previous config saved to /var/cache/conftool/dbconfig/20200210-063716-marostegui.json
  • 06:23 marostegui: Remove partitions from db1099:3311, db1099:3318 T239453
  • 06:21 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3318 T239453', diff saved to https://phabricator.wikimedia.org/P10357 and previous config saved to /var/cache/conftool/dbconfig/20200210-062112-marostegui.json
  • 06:18 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10356 and previous config saved to /var/cache/conftool/dbconfig/20200210-061822-marostegui.json
  • 06:16 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1091 T232446', diff saved to https://phabricator.wikimedia.org/P10355 and previous config saved to /var/cache/conftool/dbconfig/20200210-061656-marostegui.json

2020-02-09

  • 05:11 cdanis: T238305 hardreset cp3051

2020-02-08

  • 19:12 _joe_: set cpufreq governor to performance on mw1328
  • 17:04 _joe_: restarted php7.2-fpm on mw1332
  • 16:53 Urbanecm: mwscript resetAuthenticationThrottle.php --wiki=enwiki --signup --ip 12.24.27.50
  • 16:47 gjg@deploy1001: Synchronized wmf-config/throttle.php: SWAT: Editathon in Charolette (duration: 00m 58s)
  • 00:05 Jeff_Green: switched payments.wikimedia.org to codfw datacenter due to T244610

2020-02-07

  • 22:20 jeh: ceph: round 2 OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718
  • 20:47 mutante: OS install on new install_server VMs worked on second attempt, issues are gone. signed puppet certs for install1003.eqiad.wmnet, install2003.codfw.wmnet, initial puppet runs (T224576)
  • 20:42 jeh: ceph: OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718
  • 20:32 mutante: ganeti: attempting to reinstall install1003 which failed last time
  • 17:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10350 and previous config saved to /var/cache/conftool/dbconfig/20200207-173850-marostegui.json
  • 17:36 twentyafterfour@deploy1001: Synchronized wmf-config/InitialiseSettings.php: sync InitializeSettings again for lols refs T233866 (duration: 01m 03s)
  • 17:32 twentyafterfour@deploy1001: Synchronized wmf-config/InitialiseSettings.php: sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570929 refs T233866 (duration: 01m 02s)
  • 17:25 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool es1019 after on-site maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10349 and previous config saved to /var/cache/conftool/dbconfig/20200207-172541-marostegui.json
  • 17:22 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: roll back all wikis to 1.35.0-wmf.16 refs T233866
  • 17:19 marostegui: Start MySQL on es1019 after onsite maintenance T243963
  • 16:46 filippo@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 16:38 filippo@cumin1001: START - Cookbook sre.ganeti.makevm
  • 16:13 XioNoX: remove MSS clamping from eqiad/eqord/knams/esams
  • 16:05 andrew@deploy1001: Finished deploy [horizon/deploy@bc777d6]: Fix for T243422 (duration: 03m 45s)
  • 16:04 vgutierrez: pooling cp4030 with buster - T242093
  • 16:03 bblack: removing GRE MTU mitigations from cp[135]xxx - T232602
  • 16:01 andrew@deploy1001: Started deploy [horizon/deploy@bc777d6]: Fix for T243422
  • 15:50 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:48 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:25 vgutierrez: depool & reimage cp4030 as buster - T242093
  • 15:21 vgutierrez: pooling cp4031 with buster - T242093
  • 15:20 vgutierrez: pooling ncredir3001 running buster - T243391
  • 15:18 marostegui: Restart all instances on db1124 and db1125 to pick up a new replication filter - T240094
  • 15:11 marostegui: Restart all instances on db2094 and db2095 to pick up a new replication filter - T240094
  • 14:56 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:53 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:43 hoo@deploy1001: Synchronized wmf-config/Wikibase.php: REVERT: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 40s)
  • 14:43 Amir1: ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=zhwiki --force "Amir Sarabadani (WMDE)" --sysop (T244578)
  • 14:40 hoo@deploy1001: Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 14:38 hoo@deploy1001: Synchronized wmf-config/Wikibase.php: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 20s)
  • 14:33 vgutierrez: depool and reimage ncredir3001 as buster - T243391
  • 14:32 vgutierrez: depool & reimage cp4031 as buster - T242093
  • 14:23 vgutierrez: pooling ncredir3002 running buster - T243391
  • 13:26 vgutierrez: pooling cp4021 with buster - T242093
  • 13:05 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:03 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 12:51 vgutierrez: depool and reimage ncredir3002 as buster - T243391
  • 12:42 vgutierrez: depool & reimage cp4021 as buster - T242093
  • 12:08 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 12:08 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:58 akosiaris@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:57 akosiaris@cumin1001: START - Cookbook sre.hosts.downtime
  • 11:25 vgutierrez: pooling ncredir5001 running buster - T243391
  • 11:24 vgutierrez: pooling cp4022 with buster - T242093
  • 11:09 akosiaris: undo wikifeeds experiments
  • 11:07 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 10:42 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 10:40 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:37 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:36 akosiaris: conduct experiments with stopping/starting uwsgi-ores on ores2001 T242705
  • 10:24 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 10:23 vgutierrez: depool and reimage ncredir5001 as buster - T243391
  • 10:14 vgutierrez: depool & reimage cp4022 as buster - T242093
  • 10:02 akosiaris: increase capacity for wikifeeds by 50% T244535
  • 10:02 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 10:01 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 09:53 ema: A:mw: increase keepalive_requests from 100 to 200 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ T241145
  • 09:09 godog: roll restart cassandra instance on restbase-dev
  • 09:03 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 09:03 godog: restart cassandra on restbase-dev1004 to test logging pipeline onboard
  • 09:01 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 08:59 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 08:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1090:3312, db1090:3317', diff saved to https://phabricator.wikimedia.org/P10343 and previous config saved to /var/cache/conftool/dbconfig/20200207-085846-marostegui.json
  • 08:54 marostegui: Upgrade db1090:3312, db1090:3317
  • 08:54 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 for upgrade', diff saved to https://phabricator.wikimedia.org/P10342 and previous config saved to /var/cache/conftool/dbconfig/20200207-085432-marostegui.json
  • 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10341 and previous config saved to /var/cache/conftool/dbconfig/20200207-084447-marostegui.json
  • 08:44 moritzm: installing libexif security updates
  • 08:21 akosiaris: deploy https://gerrit.wikimedia.org/r/570726 T244535 to avoid CPU throttling of wikifeeds
  • 08:21 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 07:53 marostegui@cumin1001: dbctl commit (dc=all): 'Increase base weight for db1126', diff saved to https://phabricator.wikimedia.org/P10340 and previous config saved to /var/cache/conftool/dbconfig/20200207-075323-marostegui.json
  • 07:52 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10339 and previous config saved to /var/cache/conftool/dbconfig/20200207-075234-marostegui.json
  • 07:48 marostegui: Remove revision partitions from db2085:3318 T239453
  • 07:45 marostegui@cumin1001: dbctl commit (dc=all): 'Fullyy repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10338 and previous config saved to /var/cache/conftool/dbconfig/20200207-074511-marostegui.json
  • 07:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2085:3318 T239453', diff saved to https://phabricator.wikimedia.org/P10337 and previous config saved to /var/cache/conftool/dbconfig/20200207-074407-marostegui.json
  • 07:42 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10336 and previous config saved to /var/cache/conftool/dbconfig/20200207-074258-marostegui.json
  • 07:31 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1101:3317 T239453', diff saved to https://phabricator.wikimedia.org/P10335 and previous config saved to /var/cache/conftool/dbconfig/20200207-073130-marostegui.json
  • 07:30 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10334 and previous config saved to /var/cache/conftool/dbconfig/20200207-073026-marostegui.json
  • 06:38 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10333 and previous config saved to /var/cache/conftool/dbconfig/20200207-063831-marostegui.json
  • 06:34 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10332 and previous config saved to /var/cache/conftool/dbconfig/20200207-063402-marostegui.json
  • 06:31 elukey: force a puppet run on all ores[12] nodes
  • 06:27 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10331 and previous config saved to /var/cache/conftool/dbconfig/20200207-062731-marostegui.json
  • 06:26 marostegui: Reboot db1107 for update - T242702
  • 06:25 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1126 T232446', diff saved to https://phabricator.wikimedia.org/P10330 and previous config saved to /var/cache/conftool/dbconfig/20200207-062502-marostegui.json
  • 06:23 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10329 and previous config saved to /var/cache/conftool/dbconfig/20200207-062345-marostegui.json
  • 06:20 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1105:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10328 and previous config saved to /var/cache/conftool/dbconfig/20200207-062043-marostegui.json
  • 04:49 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 04:46 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 04:16 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 04:14 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 04:13 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 04:11 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:51 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:49 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 03:42 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 03:40 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:27 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 01:25 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 01:24 robh: eqsin pdu work ongoing starting now. ps1-603 swapping per T242250
  • 00:13 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:11 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 00:09 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 00:08 pt1979@cumin2001: START - Cookbook sre.hosts.downtime

2020-02-06

  • 23:44 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:42 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:37 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:35 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:25 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T244133 [cswikisource] Enable VisualEditor in the Edice namespace (duration: 01m 07s)
  • 23:22 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T159711 T161365 T164435 [nlwiki] Enable VisualEditor in the Project namespace (duration: 01m 08s)
  • 23:21 pt1979@cumin2001: END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
  • 23:19 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:15 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:13 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 23:10 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T244405 Don't trying to assign to if it's unset (duration: 01m 07s)
  • 22:50 jforrester@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/VisualEditor: T242184 Change tags method so anon edits will go through (duration: 01m 08s)
  • 22:42 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:40 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:39 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:38 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:18 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 22:15 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 22:13 mutante: turning mw2271 and mw2163 into canary appservers for codfw, this adds mediawiki-testers shell users and removes scap sql scripts, rest stays as is (T242606)
  • 21:54 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:52 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:40 twentyafterfour: train blocked due to serious incident related to deploying the latest branch. Incident documentation: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200206-mediawiki refs T233866
  • 21:30 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:27 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 21:05 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 21:03 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 20:52 akosiaris: restart all wikifeeds pods
  • 20:48 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
  • 20:45 akosiaris: restart restbase on restbase1027
  • 20:32 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: (no justification provided)
  • 20:30 twentyafterfour: sync-wikiversions --force
  • 20:30 twentyafterfour@deploy1001: Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org)
  • 20:25 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.18 refs T233866
  • 19:45 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T244405 Set wgLogoHD before adding wordmark (duration: 01m 06s)
  • 19:36 bblack: re-pool cp1075 (eqiad text)
  • 19:33 addshore: SWAT done!
  • 19:32 addshore@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/WikibaseLexemeCirrusSearch: T244479 Update namespace for PrefetchingTermLookup & fix tests (duration: 01m 06s)
  • 19:31 bblack: depool cp1075 (eqiad text) for minor experimentation
  • 19:29 addshore@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)
  • 19:28 addshore@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Babel/includes/Babel.php: T243713 Timeout for meta api call from 10 to 2 seconds. (duration: 01m 07s)
  • 19:25 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 2.IS (duration: 01m 06s)
  • 19:23 addshore@deploy1001: Synchronized wmf-config/CommonSettings.php: Fix incorrect spellings of "RESTBase" in config variables (2/2) 1.CS (duration: 01m 07s)
  • 19:23 cdanis: manual puppet run on netflow1001 looked good; ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "run-puppet-agent --enable 'rollout of I60692f0e8 T237587 cdanis'"
  • 19:22 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 19:20 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Fix incorrect spellings of "RESTBase" in config variables (1/2) (duration: 01m 06s)
  • 19:20 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 19:14 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395, sync again for luck (duration: 01m 06s)
  • 19:12 cdanis: ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "disable-puppet 'rollout of I60692f0e8 T237587 cdanis'"
  • 19:10 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation everywhere T243395 (duration: 01m 07s)
  • 19:05 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 10s)
  • 19:01 moritzm: restarting exim on mendelevium to pick up cyrus-sasl security updates
  • 18:58 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:56 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:55 moritzm: restarting apache on tungsten/dbmonitor to pick up cyrus-sasl security updates
  • 18:53 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950 (duration: 06m 27s)
  • 18:46 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@8e15868]: Update mobileapps to ceeb950
  • 18:36 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:34 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 18:06 pt1979@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 18:04 pt1979@cumin2001: START - Cookbook sre.hosts.downtime
  • 17:32 herron: set performance cpu scaling governor on maps*
  • 16:49 vgutierrez: pooling ncredir5002 running buster - T243391
  • 16:38 vgutierrez: pooling cp4023 with buster - T242093
  • 16:36 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic (duration: 00m 19s)
  • 16:35 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@524be2b]: airflow: Update ores data transfer from drafttopic -> articletopic
  • 16:35 XioNoX: remove AS prepending in esams/knams
  • 16:31 bblack: lvs1013 - restart pybal for dual bgp session config - T180069
  • 16:30 bblack: lvs1014 - restart pybal for dual bgp session config - T180069
  • 16:30 bblack: lvs1015 - restart pybal for dual bgp session config - T180069
  • 16:29 bblack: lvs1016 - restart pybal for dual bgp session config - T180069
  • 16:28 moritzm: restarting apache on bromine to pick up SASL security updates
  • 16:24 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:22 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:22 moritzm: installing cyrus-sasl2 security updates on jessie
  • 16:20 bblack: lvs2001 - restart pybal for dual bgp session config - T180069
  • 16:19 bblack: lvs2002 - restart pybal for dual bgp session config - T180069
  • 16:19 bblack: lvs2003 - restart pybal for dual bgp session config - T180069
  • 16:07 vgutierrez: depool and reimage ncredir5002 as buster - T243391
  • 16:07 bblack: lvs4005 - restart pybal for dual bgp session config - T180069
  • 16:06 bblack: lvs4006 - restart pybal for dual bgp session config - T180069
  • 16:06 bblack: lvs4007 - restart pybal for dual bgp session config - T180069
  • 16:03 vgutierrez: depool & reimage cp4023 as buster - T242093
  • 16:03 vgutierrez: pooling cp4024 with buster - T242093
  • 15:59 akosiaris: repool eventgate-analytics/eqiad. Experiment proved the failover wouldn't cause (on its own) a problem. Experiment done.
  • 15:58 akosiaris@cumin1001: conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 15:57 halfak@deploy1001: Finished deploy [ores/deploy@50a101a]: T242705 (duration: 04m 35s)
  • 15:56 vgutierrez: pooling ncredir4001 running buster - T243391
  • 15:55 moritzm: installing qemu security updates
  • 15:54 bblack: lvs5001 - restart pybal for dual bgp session config - T180069
  • 15:53 bblack: lvs5002 - restart pybal for dual bgp session config - T180069
  • 15:53 halfak@deploy1001: Started deploy [ores/deploy@50a101a]: T242705
  • 15:52 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:52 bblack: lvs5003 - restart pybal for dual bgp session config - T180069
  • 15:50 moritzm: installing python-ecdsa security updates
  • 15:50 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:41 moritzm: installing jsoup security updates
  • 15:30 vgutierrez: depool & reimage ncredir4001 as buster - T243391
  • 15:29 vgutierrez: depool & reimage cp4024 as buster - T242093
  • 15:28 vgutierrez: pooling ncredir4002 running buster - T243391
  • 15:27 moritzm: installing sudo security updates on jessie
  • 15:23 vgutierrez: pooling cp4025 with buster - T242093
  • 15:14 ema: A:mw-api: force puppet run to increase keepalive_requests from 100 to 200 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570670/ T241145
  • 15:09 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:07 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 14:59 godog: extend graphite1004 / graphite2003 fs +200G
  • 14:56 vgutierrez: depool and reimage ncredir4002 as buster - T243391
  • 14:46 vgutierrez: depool & reimage cp4025 as buster - T242093
  • 14:16 akosiaris: 20mins in with eventgate-analytics/eqiad depooled from discovery, no issues yet.
  • 14:14 ema: run puppet on mw-api-canary to revert nginx keepalive_requests bump T241145
  • 13:55 marostegui: Stop MySQL on es1019, upgrade and poweroff for on-site maintenance - T243963
  • 13:54 akosiaris@cumin1001: conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics
  • 13:53 akosiaris: depool eqiad eventgate-analytics for testing purposes. Requests will flow to codfw, monitoring https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now for issues.
  • 13:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool es1019 for onsite maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10321 and previous config saved to /var/cache/conftool/dbconfig/20200206-135157-marostegui.json
  • 13:45 XioNoX: rollback deactivate BGP transits on cr3-knams
  • 13:34 elukey: repool mw1347 with mcrouter running with 10 proxy threads (was: 5)
  • 13:31 XioNoX: reboot cr3-knams
  • 13:31 elukey: depool mw1347 to test some mcrouter settings
  • 13:27 XioNoX: deactivate BGP transits on cr3-knams
  • 13:22 vgutierrez: Enable server session sharing on ats-tls in cp4031 - T244464
  • 13:10 XioNoX: rollback: deactivate BGP transits on cr2-eqsin
  • 13:00 XioNoX: reboot cr2-eqsin for sw upgrade
  • 13:00 addshore: SWAT done
  • 13:00 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: resync REVERT Enable EntitySourceBasedFederation for group1 (duration: 01m 07s)
  • 12:59 XioNoX: deactivate BGP transits on cr2-eqsin
  • 12:58 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: REVERT Enable EntitySourceBasedFederation for group1 T243395, due to T244479 (duration: 01m 07s)
  • 12:52 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group1 T243395 (duration: 01m 06s)
  • 12:46 addshore@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Babel: REVERT Fetch central babel information over SQL query, not API (T243726) (duration: 01m 07s)
  • 12:44 addshore@deploy1001: sync-file aborted: Fetch central babel information over SQL query, not API (T243726) (duration: 01m 04s)
  • 12:40 vgutierrez: pooling cp3065 - T242093
  • 12:39 addshore@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable EntitySourceBasedFederation for group0 T243395 (duration: 01m 07s)
  • 12:34 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Re-enable delayed new upload jobs for MachineVision extension (duration: 01m 08s)
  • 12:26 cparle@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Remove handler deleted from the MachineVision extension (duration: 01m 05s)
  • 12:25 XioNoX: remove full-duplex statement from eqsin Tata link (not supported on Junos 18, as 10G is full duplex anyway)
  • 12:24 cparle@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/MachineVision: Use the wbsetclaim API to add depicts statements (duration: 01m 09s)
  • 12:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 5e1cbb2: Enable CX in te, kn, gu, mr and pawiki as a default tool (T243271, T243272, T243273, T243274, T243275) (duration: 01m 09s)
  • 11:41 akosiaris: upgrade etherpad-lite on etherpad1002 to 1.8.0-1
  • 11:38 kart_: Updated cxserver to 2020-02-05-051751-production (T244230, T234323)
  • 11:35 kartik@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:33 akosiaris: upload etherpad-lite_1.8.0-1 to apt.wikimedia.org buster-wikimedia/main
  • 11:31 kartik@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
  • 11:28 kartik@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
  • 11:14 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 11:11 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 10:21 akosiaris: undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348". no effect observed
  • 10:20 akosiaris: undo "switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348"
  • 10:19 vgutierrez: Enabling HTTP keepalive between ats-tls and varnish-frontend on cp4031 - T244464
  • 10:00 vgutierrez: depool and reimage cp3065 as buster - T242093
  • 09:59 vgutierrez: upload trafficserver 8.0.5-1wm14 to apt.wm.o (buster) - T242093
  • 09:08 dcausse@deploy1001: Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b (duration: 11m 41s)
  • 08:56 dcausse@deploy1001: Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b
  • 08:45 dcausse@deploy1001: Finished deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet (duration: 00m 29s)
  • 08:44 dcausse@deploy1001: Started deploy [wdqs/wdqs@4306c64]: deploying wdqs 0.3.14-SNAPSHOT and gui 5a1af3b to wdqs1010.eqiad.wmnet
  • 08:23 marostegui: Reboot dbproxy1012 and dbproxy1014 for upgrade
  • 08:18 dcausse: restarting blazegraph on wdqs1006: T242453
  • 08:17 akosiaris: switchover selectively eventgate-analytics.discovery.wmnet to codfw for mw1331 and mw1348 to
  • 06:59 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1101:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10319 and previous config saved to /var/cache/conftool/dbconfig/20200206-065906-marostegui.json
  • 06:52 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1098:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10318 and previous config saved to /var/cache/conftool/dbconfig/20200206-065238-marostegui.json
  • 06:46 elukey: run puppet on all ores[12]* nodes
  • 02:49 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 02:42 mutante: ganeti - Creating new VM named install2003.codfw.wmnet in codfw with row=A vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390)
  • 02:39 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 02:30 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 02:21 mutante: ganeti - Creating new VM named install1003.eqiad.wmnet in eqiad with row=C vcpu=1 memory=1 gigabytes disk=20 gigabytes link=private (T244390)
  • 02:20 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm

2020-02-05

  • 23:30 ebernhardson: delete search indices duplicated on multiple clusters for: hywwiki, chrwiktionary, gcrwiki, mnwwiki, noboard_chapterswikimedia, nqowiki, nrmwiki, outreachwiki and srnwiki
  • 23:08 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@a51f927]: Update mobileapps to a7928fa (duration: 10m 48s)
  • 22:57 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@a51f927]: Update mobileapps to a7928fa
  • 22:07 mutante: Gerrit - added ppchelko to 'wmf-deployment' Gerrit group (he is already in deployment admin group) (T244389)
  • 21:37 arlolra@deploy1001: Finished deploy [parsoid/deploy@01d9d3d]: Updating Parsoid to 74730a3 (duration: 03m 07s)
  • 21:33 arlolra@deploy1001: Started deploy [parsoid/deploy@01d9d3d]: Updating Parsoid to 74730a3
  • 21:31 mutante: killing and restarting wikibugs, it was reporting each update twice
  • 20:51 joal@deploy1001: Finished deploy [analytics/refinery@a47f0d5] (thin): Analytics regular weekly deploy (duration: 00m 07s)
  • 20:51 joal@deploy1001: Started deploy [analytics/refinery@a47f0d5] (thin): Analytics regular weekly deploy
  • 20:51 joal@deploy1001: Finished deploy [analytics/refinery@a47f0d5]: Analytics regular weekly deploy (duration: 13m 28s)
  • 20:50 mutante: ores1004 - systemctl start celery-ores-worker
  • 20:45 twentyafterfour@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.18 refs T233866 (duration: 01m 07s)
  • 20:44 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.18 refs T233866
  • 20:37 joal@deploy1001: Started deploy [analytics/refinery@a47f0d5]: Analytics regular weekly deploy
  • 20:34 dzahn@cumin1001: conftool action : set/weight=25; selector: name=mw1269.eqiad.wmnet
  • 20:25 dzahn@cumin1001: conftool action : set/weight=25; selector: name=mw1267.eqiad.wmnet
  • 20:25 mutante: mw1267 restarting php7.2-fpm
  • 20:21 joal@deploy1001: Finished deploy [analytics/hdfs-tools/deploy@714e2d0]: Deploy bug fix version (duration: 00m 08s)
  • 20:21 joal@deploy1001: Started deploy [analytics/hdfs-tools/deploy@714e2d0]: Deploy bug fix version
  • 20:09 twentyafterfour: Preparing to deploy wmf/1.35.0-wmf.18 to group1 wikis refs T233866
  • 20:09 moritzm: installing git security updates for jessie
  • 20:00 moritzm: installing unzip security updates
  • 19:44 mutante: LDAP - added spramduya to wmf group (T243802)
  • 19:38 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Clean up VisualEditor settings (duration: 01m 07s)
  • 19:38 ebernhardson: restart mjolnir-kafka-bulk-daemon across eqiad, daemons appear stuck and not reading new messages
  • 19:19 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T238029 Enable InukaPageView logging on production Wikipedias (duration: 01m 07s)
  • 19:15 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: Sync back revert of 975b4bbb9 (duration: 01m 06s)
  • 19:10 jforrester@deploy1001: scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
  • 18:35 vgutierrez: pooling cp5012 - T242093
  • 18:23 vgutierrez: rebooting cp5012 - T242093
  • 18:21 elukey: restart memcached on mc1025 with 8 threads (rollback - revert https://gerrit.wikimedia.org/r/#/c/570370/, run puppet, restart memcached)
  • 17:51 mutante: ganeti1017 - rebooting (not in use yet)
  • 17:34 reedy@deploy1001: Synchronized php-1.35.0-wmf.18/languages/: T244300 (duration: 01m 13s)
  • 17:33 reedy@deploy1001: Synchronized php-1.35.0-wmf.18/includes/: T244300 (duration: 01m 14s)
  • 16:53 urandom: Sessionstore deployment (mediawiki-config) is done
  • 16:37 ppchelko@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: gerrit:569678 Config: Enable sessionstore on group0 and 1 T243106 (duration: 01m 08s)
  • 16:25 jforrester@deploy1001: Synchronized wmf-config/CommonSettings.php: T232140 Restore wgLogoHD to wikis without a MinervaCustomLogos defined (duration: 01m 09s)
  • 16:07 elukey: update puppet compiler's facts
  • 15:54 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 15:52 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 15:29 effie: restart php-fpm on canaries - T236800
  • 15:24 effie: Rollout php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 to api, app and jobrunner canaries - T236800
  • 15:15 vgutierrez: depooling & reimaging cp5012 as buster - T242093
  • 15:12 ema: cp: unset Accept-Encoding from ats-be requests to applayer T242478
  • 14:35 vgutierrez: updating acme-chief to version 0.24 - T244236
  • 14:32 _joe_: restarting mcrouter at nice -19 on mw1331 for testing effects of that change
  • 14:30 vgutierrez: upload acme-chief 0.24 to apt.wm.o (buster) - T244236
  • 14:26 XioNoX: push initial flowspec config to all routers
  • 14:23 vgutierrez: pooling cp5006 - T242093
  • 14:13 ema: cp1075: back to leaving Accept-Encoding as it is due to unrelated applayer issues T242478
  • 13:46 marostegui: Decrease buffer pool size on db1107 for testing - T242702
  • 13:45 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:43 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:42 akosiaris: undo the manually set 10.2.1.42 eventgate-analytics.discovery.wmnet in /etc/hosts for mw1331, mw1348. Verify hypothesis that this should cause increased latency. Restart php-fpm
  • 13:41 ema: cp1075: unset Accept-Encoding on origin server requests T242478
  • 13:39 Amir1: EU SWAT is done
  • 13:38 ema: cp: disable puppet and merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570311/ T242478
  • 13:35 XioNoX: rollback traffic steering off cr2-eqord
  • 13:29 akosiaris: manually set 10.2.1.42 eventgate-analytics.discovery.wmnet in /etc/hosts for mw1331, mw1348. Verify hypothesis that this should cause increased latency
  • 13:25 XioNoX: reboot cr2-eqord for software upgrade - yaaaaa
  • 13:24 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Wikibase/lib/includes/Store/CachingPropertyInfoLookup.php: SWAT: Cache PropertyInfoLookup internally (T243955) (duration: 01m 07s)
  • 13:17 XioNoX: increase ospf cost for cr2-eqord links
  • 13:16 vgutierrez: upload acme-chief 0.23 to apt.wm.o (buster) - T244236
  • 13:15 XioNoX: disable transit/peering BGP sessions on cr2-eqord
  • 13:15 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/Wikibase/lib/includes/Store/CachingPropertyInfoLookup.php: SWAT: Cache PropertyInfoLookup internally (T243955) (duration: 01m 07s)
  • 13:10 XioNoX: rollback: disable transit/peering BGP sessions on cr2-eqdfw
  • 13:08 vgutierrez: depooling & reimaging cp5006 as buster - T242093
  • 13:03 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 5cc2b70: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (T232140) (duration: 01m 06s)
  • 13:01 XioNoX: reboot cr2-eqdfw for software upgrade
  • 13:00 Amir1: SWAT needs more time
  • 12:55 XioNoX: disable transit/peering BGP sessions on cr2-eqdfw
  • 12:50 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: d450288: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (T232140) (duration: 01m 07s)
  • 12:48 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 5cc2b70: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (T232140) (duration: 01m 07s)
  • 12:32 awight@deploy1001: Synchronized php-1.35.0-wmf.18/extensions/Cite: SWAT: Revert follow standardization (T240858) (duration: 01m 13s)
  • 10:53 akosiaris: rolling restart of all pods on kubernetes staging cluster to make sure everything is fine after the upgrade
  • 10:50 akosiaris: T244335 upgrade kubernetes-node on kubestage1002.eqiad.wmnet to 1.13.12
  • 10:43 ema: cp4028: varnish-frontend-restart T243634
  • 10:24 akosiaris: T244335 upgrade kubernetes-master on neon.eqiad.wmnet (staging)
  • 10:24 effie: Upload php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 - T236800
  • 10:10 Urbanecm: Run mwscript deleteEqualMessages.php --delete to delete GrowthExperiments' message overrides (cswiki, viwiki, arwiki, kowiki)
  • 09:57 akosiaris: upload kubernetes 1.13.12 to apt.wikimedia.org stretch-wikimedia/main T244335
  • 09:51 effie: install libmemcached-tools on mc-gp* servers - T240684
  • 09:05 ema: add individual FortiGate IPs hitting ulsfo (currently cp4028) to vcl blocked_nets -- trying to identify problematic traffic T243634
  • 07:02 marostegui: Replay s1 traffic on db1107 (10.4) T242702
  • 06:32 elukey: force a puppet run on ores* hosts
  • 06:12 marostegui: Remove partitions from revision table db1098:3317 - T239453
  • 06:09 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1098:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10312 and previous config saved to /var/cache/conftool/dbconfig/20200205-060942-marostegui.json
  • 06:09 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2085:3311, db2086:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10311 and previous config saved to /var/cache/conftool/dbconfig/20200205-060911-marostegui.json
  • 02:38 cdanis: T243634 ✔️ cdanis@cp4030.ulsfo.wmnet ~ 🕤🍺 sudo varnish-frontend-restart

2020-02-04

  • 22:35 twentyafterfour@deploy1001: rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.18 refs T233866
  • 22:13 twentyafterfour@deploy1001: Finished scap: testwikis wikis to 1.35.0-wmf.18 refs T233866 (duration: 32m 03s)
  • 22:03 cdanis@cumin2001: conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
  • 21:41 twentyafterfour@deploy1001: Started scap: testwikis wikis to 1.35.0-wmf.18 refs T233866
  • 21:29 twentyafterfour: preparing the new mediawiki branch for deployment to test wikis
  • 20:31 shdubsh: restart kartotherian on maps2001
  • 20:24 shdubsh: temporarily enable access logs on maps2001
  • 20:20 twentyafterfour: branching mediawiki to wmf/1.35.0-wmf.18 from commit 054dd94e97d6 - train blockers should be added as subtasks under T233866
  • 20:06 marxarelli: temporarily holding 1.35.0-wmf.18 [T233866] branch cut and train due to concurrent maps prod issues
  • 19:15 mutante: cp3065 - powercycling
  • 18:45 cdanis@cumin2001: conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
  • 17:57 cdanis: ✔️ cdanis@mw1272.eqiad.wmnet ~ 🕐☕ sudo restart-php7.2-fpm
  • 17:41 akosiaris: reenable kartotherian on maps100*
  • 17:34 oblivian@cumin1001: conftool action : set/weight=15; selector: cluster=appserver,service=nginx,dc=eqiad,name=mw12[3-5].*
  • 17:13 _joe_: restarting php-fpm on mw126[1-3]
  • 17:11 _joe_: restarting php-fpm on mw1266-9
  • 17:10 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.16/includes/filerepo/file/ForeignDBFile.php: gerrit: 570089, ongoing incident (duration: 01m 04s)
  • 17:07 _joe_: restarted php-fpm on mw1265 with 80 workers (the default)
  • 17:07 _joe_: restarted php-fpm on mw1264 with 240 workers
  • 16:52 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/Wikibase: fix for the recent outage (duration: 01m 21s)
  • 16:02 ema: cp: rolling ats-backend-restart to unset Accept-Encoding before sending origin server requests T242478
  • 14:23 akosiaris@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 14:18 akosiaris: deploy new wikifeeds chart that is consistent with the current scaffolding approach. No code deploy though.
  • 14:17 akosiaris@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 14:16 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 14:07 XioNoX: repool ulsfo
  • 14:03 elukey@cumin1001: END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
  • 14:00 elukey@cumin1001: START - Cookbook sre.aqs.roll-restart
  • 13:36 XioNoX: restart cr3-ulsfo for software upgrade
  • 13:23 vgutierrez: upgrading acme-chief to version 0.22 - T240614
  • 13:10 vgutierrez: uploaded acme-chief 0.22 to apt.wm.o (buster) - T240614
  • 13:09 XioNoX: restart cr4-ulsfo for upgrade
  • 12:49 XioNoX: depool ulsfo for routers upgrade
  • 10:35 ema: cp4032: varnish-frontend-restart T243634
  • 09:08 vgutierrez: manually refreshing OCSP stapling response for non-canonical-redirects-3 - T243948
  • 09:07 marostegui: Upgrade s3 codfw master db2105 - T239791
  • 08:56 marostegui: Deploy schema change on enwiki eqiad host by host - T243804
  • 08:46 marostegui: Deploy schema change on enwiki codfw - T243804
  • 08:16 marostegui: Deploy schema change on testwiki - T243804
  • 08:13 marostegui: Deploy schema change on test2wiki - T243804
  • 07:36 marostegui: Upgrade Mariadb on db1107 from 10.4.11 to 10.4.12 T242702
  • 07:15 marostegui: Compress db1126 - T232446
  • 07:14 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1126 - T232446', diff saved to https://phabricator.wikimedia.org/P10302 and previous config saved to /var/cache/conftool/dbconfig/20200204-071420-marostegui.json
  • 07:09 marostegui: Compress db1091 - T232446
  • 07:08 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1091 - T232446', diff saved to https://phabricator.wikimedia.org/P10301 and previous config saved to /var/cache/conftool/dbconfig/20200204-070804-marostegui.json
  • 07:05 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1105:3311, db2086:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10300 and previous config saved to /var/cache/conftool/dbconfig/20200204-070533-marostegui.json
  • 06:48 elukey: force a puppet run on all ores[12] nodes
  • 00:14 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [enwiki] Add Commons as an import source T242884 (duration: 00m 57s)
  • 00:09 mutante: gerrit1002 - replaced ens5 with ens6 in /etc/network/interfaces (IP and row had changed in the past, needed manual fix after reboot and now came back) ; mkfs.ext4 /dev/vdb on new additional 10GB disk. (T239151 T243983)
  • 00:06 jforrester@deploy1001: Synchronized dblists/visualeditor-nondefault.dblist: [nlwiki] Enable VisualEditor by default for all users T161365 (duration: 00m 58s)
  • 00:05 mutante: gerrit1002 - attempt to manually fix /etc/network interfaces , add IP on interface, reboot
  • 00:03 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Configure remainder of testwikis group for kask-session T243106 (duration: 00m 58s)
  • 00:02 volans: depool, varnish-frontend-restart, pool on cp4029 (~242k fds) - T243634

2020-02-03

  • 23:34 mutante: rebooting gerrit1002 (test VM)
  • 23:26 mutante: ganeti1003 - sudo gnt-instance modify --disk add:size=10G gerrit1002.wikimedia.org (T239151 T243983)
  • 23:24 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.16
  • 23:21 mutante: gerrit1002 - deleting gerrit.log and gerrit.json files from January to free about 4GB of space (T239151 T243983)
  • 23:12 XioNoX: removing AS15542 from esams
  • 22:18 andrew@deploy1001: Finished deploy [horizon/deploy@8bffc7d]: Fix for T243355 (duration: 03m 29s)
  • 22:14 andrew@deploy1001: Started deploy [horizon/deploy@8bffc7d]: Fix for T243355
  • 22:13 mutante: rebooting ganeti1010, ganeti1011 and other new ganeti machines to pickup microcode mitigations, for some reason the previous reboots did not do it. rescheduled service check on icinga for ganeti1010 and now it recovered (T228924)
  • 22:05 mutante: ganeti1010 - rebooting host to clear microcode mitigations CPU alert
  • 21:39 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert "group2 wikis to 1.35.0-wmf.15"
  • 21:33 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.16
  • 21:28 brennen@deploy1001: Synchronized php-1.35.0-wmf.16/includes/TemplateParser.php: Syncing https://gerrit.wikimedia.org/r/c/mediawiki/core/+/569643 for T243548 (duration: 01m 08s)
  • 21:14 halfak@deploy1001: Finished deploy [ores/deploy@50a101a]: T243451 (duration: 12m 47s)
  • 21:01 halfak@deploy1001: Started deploy [ores/deploy@50a101a]: T243451
  • 20:43 mutante: doc1001 - sudo chown -R doc-uploader:doc-uploader /srv/docroot/
  • 20:19 XioNoX: reactivate L3 only LB in esams/knams
  • 20:19 XioNoX: remove test flowspec rule from cr3-knams
  • 20:13 mutante: doc1001 - re-enabled puppet after merging gerrit:569620 - Git::Clone[integration/docroot]/File[/srv/docroot]/mode: mode changed '2775' to '0755' - Profile::Doc/File[/srv/docroot/org/wikimedia/doc]/group: group changed 'doc-uploader' to 'wikidev', mode changed '0775' to '0755'. needs another follow-up (T237707)
  • 19:27 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: [officewiki] Enable VisualEditor desktop section editing (duration: 01m 07s)
  • 19:21 Urbanecm: Morning SWAT done
  • 19:20 urbanecm@deploy1001: Synchronized wmf-config/InterwikiSortOrders.php: SWAT: 7b53a52: Add gcr, mnw and szy to InterwikiSortOrders (duration: 01m 11s)
  • 19:19 mutante: doc1001 - chown -R doc-uploader:doc-uploader /srv/docroot ; temp. disabled puppet (T237707)
  • 19:09 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 7bb6a12: Configure remainder of testwikis group for kask-transition (T243106) (duration: 01m 14s)
  • 18:58 mutante: < bblack> !log doc1001: chown -R nobody:wikidev /srv/docroot | < mutante> !doc1001 sudo -u doc-uploader chmod g+w /srv/docroot/org/wikimedia/doc | https://gerrit.wikimedia.org/r/c/operations/puppet/+/484304 | (T237707)
  • 18:44 bblack: doc1001: chown -R nobody:wikidev /srv/docroot
  • 18:34 brennen: edited /srv/mediawiki-staging/wikiversions.json on deploy1001; scap pull and scap wikiversions-compile on mwdebug1002; revert wikiversions changes on deploy1001.
  • 18:25 mholloway-shell@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 18:23 mholloway-shell@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' .
  • 18:17 mholloway-shell@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' .
  • 16:52 eevans@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' .
  • 16:48 eevans@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' .
  • 16:38 eevans@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' .
  • 15:38 XioNoX: rollback: add debug on eqiad-knams link interfaces - T240659
  • 15:33 XioNoX: add debug on eqiad-knams link interfaces - T240659
  • 14:59 moritzm: restarting exim on phab* to pick up libidn security update
  • 14:55 moritzm: restarting superset on an-tool1004/1005 to pick up libidn security update
  • 14:44 moritzm: restarting apache on an-tool*. cloudmetrics*, logstash*, grafana1002 to pick up libidn security update
  • 14:21 moritzm: restarting slapd on ldap-corp* to pick up libidn2 security updates
  • 14:18 cdanis: T243634 ✔️ cdanis@cp4031.ulsfo.wmnet ~ 🕤☕ sudo varnish-frontend-restart
  • 13:58 moritzm: installing libidn2 security updates
  • 13:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:32 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:32 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:32 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:31 moritzm: rebooting ganeti1009 - ganeti1022 to pick up microcode update T228924
  • 12:58 XioNoX: deactivate v6 BGP to AS25596
  • 12:57 moritzm: installing spamassassin security updates
  • 12:53 Urbanecm: Previous message should be "EU SWAT done"
  • 12:52 Urbanecm: Morning SWAT done
  • 12:52 Urbanecm: Purge https://en.wikipedia.org/static/images/project-logos/zh_classicalwiki*.png (T243509)
  • 12:51 urbanecm@deploy1001: Synchronized static/images/project-logos/: SWAT: af0b745: Update logo for zh_classical wiki (T243509) (duration: 01m 06s)
  • 12:45 urbanecm@deploy1001: Synchronized dblists/mobilemainpagelegacy.dblist: SWAT: e9387b2: Disable MobileFrontend Mainpage special casing on frwiktionary (T241888) (duration: 01m 05s)
  • 12:40 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 5f13c19: Add minerva custom logo for la.wiki (T240728; 2/2) (duration: 01m 06s)
  • 12:37 urbanecm@deploy1001: Synchronized static/images/mobile/copyright/: SWAT: 5f13c19: Add minerva custom logo for la.wiki (T240728; 1/2) (duration: 01m 06s)
  • 12:35 moritzm: installing openjpeg2 security updates
  • 12:32 Urbanecm: Purge https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-szl.svg (T233104)
  • 12:30 urbanecm@deploy1001: Synchronized static/images/mobile/copyright/: SWAT: 76e67cd: e266e25: Add wordmarks for szlwiki and etwiki (T233104, T230379) (duration: 01m 06s)
  • 12:29 urbanecm@deploy1001: Synchronized static/images/mobile/copyright/: SWAT: 76e67cd: e266e25: Add static wordmarks for szlwiki and etwiki (T233104, T230379) (duration: 01m 06s)
  • 12:25 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 32e0356: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains (T243118) (duration: 01m 07s)
  • 12:20 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 6c48af8: Assign editautopatrolprotected to hewiki patrollers (T243665) (duration: 01m 06s)
  • 12:14 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 6b497e7: Wikidata - enable TaintedRefs (T241989) (duration: 01m 06s)
  • 12:09 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0c0ef87: Add wgImportSources for hiwikibooks (T244022) (duration: 01m 05s)
  • 12:07 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: Remove $wgImgAuthDetails=true (T153459) (duration: 01m 36s)
  • 11:38 ema: powercycle cp3057 T244127 T238305
  • 10:24 godog: temp disable puppet on cp hosts as precaution for https://gerrit.wikimedia.org/r/c/operations/puppet/+/563977
  • 10:08 moritzm: installing sudo security updates on stretch

2020-02-02

  • 19:25 effie: restart varnish on cp4028
  • 08:48 effie: reboot host analytics1061 - T244081

2020-02-01

  • 18:17 effie: pool scb2003, no need for host to stay depooled - T244069
  • 17:46 cdanis: T243634 ✔️ cdanis@cp4030.ulsfo.wmnet ~ 🕐☕ sudo varnish-frontend-restart
  • 17:27 effie: depool scb2003 T244069
  • 16:51 effie: pool mw1273
  • 16:50 effie: pool scb2003
  • 16:30 elukey: powerup analytics1073 (attempt to see if it was only a kernel-related crash) - T244064
  • 16:16 effie: poweroff analytics1073 - T244064
  • 16:16 effie: poweroff analytics1073 - /T244064
  • 16:16 effie: poweroff analytics1073
  • 13:00 effie: depool scb2003
  • 12:21 effie: depool mw1273
  • 01:03 eileen: process-control config revision is c3c8bde761
  • 00:50 eileen: civicrm revision changed from fcc5673ee7 to ee9edf8137, config revision is 2a61da0ace

2020-01-31

  • 22:25 eileen: civicrm revision changed from ac730a6bcb to fcc5673ee7, config revision is 2a61da0ace
  • 22:14 bstorm_: repooled labsdb1011 now that view work is done
  • 22:00 eileen: process-control config revision is 2a61da0ace disabled process-control
  • 21:59 bstorm_: depooled labsdb1011
  • 21:32 bstorm_: updated views on labsdb1010
  • 21:22 bstorm_: updated views on labsdb1009
  • 21:21 bstorm_: updated actor views on labsdb1012
  • 18:17 bblack: repool cp4032 (buster)
  • 18:17 bblack@cumin1001: conftool action : set/pooled=yes; selector: name=cp4032.ulsfo.wmnet
  • 18:14 bblack: repool cp4029
  • 18:13 bblack: restarted ats-tls and varnish-fe on cp4029
  • 18:05 bblack: depool varnish-fe on cp4029
  • 18:03 bblack: depool ats-tls on cp4029
  • 16:59 marostegui: Re-enable notifications on the dbstore1005:3318 check T243871
  • 09:18 addshore: addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --sleep 4 --batch-size=25 # In a screen for T219301
  • 03:22 mutante: powercycling crashed cp3063
  • 01:09 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@322ee4c]: Update mobileapps to 3eec28d (duration: 06m 53s)
  • 01:02 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@322ee4c]: Update mobileapps to 3eec28d
  • 00:41 mutante: contint1001/contint2001 - upgrading jenkins to 2.219
  • 00:36 mutante: releases2001: upgrading jenkins to 2.219; install1002: import jenkins 2.219 into jessie-wikimedia APT repo
  • 00:31 mutante: importing jenkins 2.219 to stretch-wikimedia APT repo; releases1001: upgrading jenkins to 2.219

2020-01-30

  • 19:37 mutante: copying /var/log/apache2 to /root on all eqiad mw appservers to preserve logs
  • 18:07 vgutierrez: depool cp4032 and perform a rolling restart of varnish-fe at cp4027-cp4031 - T243634
  • 17:51 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/Wikibase/lib/includes/Store/Sql/Terms/FingerprintableEntityTermStoreTrait.php: wbterms: Fix incorrect deletion of rows in findActuallyUnusedTermIds (T243944) (duration: 01m 06s)
  • 17:49 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/Wikibase/repo/maintenance/rebuildItemTerms.php: wbterms: Write only to the new term store in rebuildItemTerms (T243944) (duration: 01m 09s)
  • 17:03 vgutierrez: repooling cp4032 - T243634
  • 17:02 vgutierrez: restarting varnish-frontend on cp4031 before it crashes - T243634
  • 16:26 vgutierrez: manually refreshing OCSP stapling response for non-canonical-redirects-3 - T243948
  • 12:22 arturo: add prometheus 2.7.1+ds-3+k8s+buster to buster-wikimedia T238096 (basically a rebuild from stretch)
  • 06:23 vgutierrez: restarting varnish-frontend on cp4030 before it crashes - T243634
  • 06:21 vgutierrez: depool cp4032 - T243634
  • 05:12 vgutierrez: restarting varnish-frontend and repooling cp4029 - T243634
  • 05:00 vgutierrez: depooling cp4029

2020-01-29

  • 23:37 marostegui: Remove partitions from db2087:3317 - T239453
  • 18:17 XioNoX: move knams netflow sampling to cr3-knams
  • 17:19 krinkle@deploy1001: Synchronized wmf-config/etcd.php: Ice8dad2 (duration: 01m 10s)
  • 01:11 vgutierrez: varnish-frontend restarted on cp4031
  • 01:09 vgutierrez: repool cp4031
  • 01:05 marostegui: Disable notifications for dbstore1005:3318 slave lag - T243871
  • 01:03 vgutierrez: depool cp4031
  • 00:35 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1097:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10289 and previous config saved to /var/cache/conftool/dbconfig/20200129-003507-marostegui.json
  • 00:22 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1097:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10288 and previous config saved to /var/cache/conftool/dbconfig/20200129-002203-marostegui.json

2020-01-28

  • 23:53 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1097:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10287 and previous config saved to /var/cache/conftool/dbconfig/20200128-235336-marostegui.json
  • 23:46 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1097:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10286 and previous config saved to /var/cache/conftool/dbconfig/20200128-234601-marostegui.json
  • 23:42 marostegui@cumin1001: dbctl commit (dc=all): 'Start repooling db1084 with its original weight', diff saved to https://phabricator.wikimedia.org/P10285 and previous config saved to /var/cache/conftool/dbconfig/20200128-234219-marostegui.json
  • 23:40 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1121 T232446', diff saved to https://phabricator.wikimedia.org/P10284 and previous config saved to /var/cache/conftool/dbconfig/20200128-234037-marostegui.json
  • 15:06 addshore: Start addshore@mwmaint1002:~$ ./T219123.sh # Taking over from @ladsgroup for T219123
  • 09:59 effie: rolling restart mobileapps in codfw
  • 02:05 mutante: gerrit1002 - gzipping a bunch of /var/log/gerrit/ log files (T243808)

2020-01-27

  • 23:40 eileen: civicrm revision changed from fbd5c35fb0 to ac730a6bcb, config revision is 837b9d0703
  • 23:10 vgutierrez: rolling restart of varnish-frontend in cp4026 and cp4027
  • 23:06 filippo@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 23:06 filippo@cumin1001: START - Cookbook sre.hosts.downtime
  • 23:01 _joe_: restart apache on gerrit
  • 22:58 vgutierrez: restarting gerrit service
  • 22:01 vgutierrez: restarting varnish-fe on cp4028
  • 19:16 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2085:3311 - T239453', diff saved to https://phabricator.wikimedia.org/P10277 and previous config saved to /var/cache/conftool/dbconfig/20200127-191614-marostegui.json
  • 19:15 marostegui: Remove partitions from db2085 enwiki - T239453
  • 13:58 vgutierrez: repooling cp4030 - T243634
  • 13:54 vgutierrez: restarting varnish-fe on cp4030 - T243634
  • 13:54 vgutierrez: repooling cp4029 - T243634
  • 13:36 vgutierrez: restarting varnish-fe on cp4029 - T243634
  • 12:10 Amir1: ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --from-id 1860 --to-id 1860 (T243705)
  • 03:29 gehel: restarting blazegraph on wdqs100[57]

2020-01-26

  • 21:45 akosiaris: repool maps1003
  • 21:45 akosiaris@cumin1001: conftool action : set/pooled=yes; selector: name=maps1003.*
  • 21:42 akosiaris: test depool maps1003
  • 21:42 akosiaris@cumin1001: conftool action : set/pooled=no; selector: name=maps1003.*
  • 21:38 vgutierrez: powercycling cp3051 - T238305
  • 21:23 akosiaris: restart kartotherian on maps1002
  • 21:19 vgutierrez: restart varnish-fe and ats-tls on cp3056
  • 21:02 bblack: ats-tls-restart on cp3064
  • 20:51 bblack: esams text caches: reverting earlier sysctl mitigations
  • 18:11 volans: shutdown elastic2043 - T243715
  • 18:01 volans: depooled elastic2043 - T243715
  • 18:01 volans@cumin1001: conftool action : set/pooled=inactive; selector: name=elastic2043.codfw.wmnet
  • 17:28 elukey: restart varnishkafka-webrequest on cp3064
  • 17:25 elukey: restart varnishkafka-webrequest on cp3056
  • 17:03 bblack: reduce /proc/sys/net/ipv4/tcp_max_syn_backlog to 8192 on esams text caches
  • 16:55 bblack: reduce /proc/sys/net/ipv4/tcp_synack_retries to 1 on esams text caches
  • 16:42 cdanis: ✔️ cdanis@cp4030.ulsfo.wmnet ~ 🕦☕ sudo depool
  • 16:38 bblack: applying GRE MTU mitigation from T232602 to all cp1, cp3, cp5 cache nodes
  • 15:43 XioNoX: 3*prepend in esams/knams
  • 15:26 elukey: repool deployed
  • 15:24 elukey: repool esams
  • 15:01 cdanis: deployed
  • 15:00 cdanis: depool esams
  • 14:56 XioNoX: enabling netflow sampling on the knams-esams links (esams side)
  • 11:25 effie: restarted tilerator and tileratorui on maps1002
  • 11:23 effie: restarted tilerator and tileratorui on maps1001
  • 10:38 effie: deployed
  • 10:37 effie: Pool esams back
  • 01:12 cdanis: deployed
  • 01:12 cdanis: depool esams with new geo-maps-esams-offline

2020-01-25

  • 12:49 Urbanecm: Run mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=mediawikiwiki --logwiki=metawiki TokyVrpns Mike20LCN (T243668)
  • afk: restarting gerrit-replica

2020-01-24

  • 22:31 mutante: ganeti1003 - sudo gnt-instance remove etherpad1001.eqiad.wmnet (T224580)
  • 22:21 mutante: shutting down etherpad1001 - service fully migrated to etherpad1002 - running decom cookbook on ganeti VM (T224580)
  • 22:20 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 22:19 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 21:18 cdanis: ✔️ cdanis@cp4029.ulsfo.wmnet ~ 🕟🍵 sudo depool
  • 17:54 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Clean up CheckUser config (duration: 01m 09s)
  • 15:43 gehel: restart blazegraph + updater on wdqs1007 (seems stuck, known issue)
  • 15:33 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 14:28 vgutierrez: uploaded mtail 3.0.0~rc5-1~bpo9+1wmf2 to apt.wm.o (buster) - T243591
  • 14:26 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:24 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 14:23 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 13:16 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
  • 11:09 moritzm: purged stale grafana package from grafana1001, caused systemd unit failure
  • 11:04 effie: restart php-fpm on mw1238-mw1239
  • 09:29 akosiaris: disable and mask etherpad-lite on etherpad1002 to avoid corruption issues. T224580
  • 08:42 marostegui: Remove wikiadmin2 user from pc2XXX codfw hosts T243512
  • 08:17 moritzm: installing python-apt security updates
  • 07:19 _joe_: force run puppet on all esams cache nodes, for mitigation of T243313
  • 06:37 marostegui: Stop replication on db1107
  • 06:12 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2085 after memory replacement T243148', diff saved to https://phabricator.wikimedia.org/P10256 and previous config saved to /var/cache/conftool/dbconfig/20200124-061228-marostegui.json
  • 01:24 mutante: running puppet on cp-text_ulsfo
  • 00:46 mutante: cp4032 - starting varnishmtail.service
  • 00:36 catrope@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/CentralNotice/resources/ext.centralNotice.display/hide.js: T240802 (duration: 01m 05s)
  • 00:34 catrope@deploy1001: Synchronized php-1.35.0-wmf.15/extensions/CentralNotice/resources/ext.centralNotice.display/hide.js: T240802 (duration: 01m 07s)
  • 00:33 mutante: cp4032 - starting varnishmtail.service which was failed
  • 00:32 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Bump Parsoid/PHP cluster memory_limit again (T239806, T236833) (duration: 01m 05s)

2020-01-23

  • 21:08 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 20:30 brennen@deploy1001: rebuilt and synchronized wikiversions files: Revert "group2 wikis to 1.35.0-wmf.15"
  • 20:29 brennen: reverting group2 to 1.35.0-wmf.15
  • 20:10 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.16
  • 20:00 Urbanecm: Morning SWAT done
  • 19:56 mlitn@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Add 3d-patents page to wgForceUIMsgAsContentMsg (duration: 01m 08s)
  • 19:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 2d8f773: Use editeditorprotected for protecting pages for editors (T230103) (duration: 01m 05s)
  • 19:10 urbanecm@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/WikimediaMessages/extension.json: SWAT: 23a6f8e: InukaPageView: update schema version (T238029) (duration: 01m 05s)
  • 19:06 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 629b5fc: Add *.eso.org to the wgCopyUploadsDomains (T243423) (duration: 01m 06s)
  • 19:03 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 18:59 mutante: ganeti1003 - creating new VM etherpad1002.eqiad.wmnet with 1GB RAM and 10GB disk, row C, private link (T243475)
  • 18:58 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 18:54 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:47 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:40 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Stop setting wgWikimediaMessagesPartialBlockBanner, never read T240300 (duration: 01m 06s)
  • 18:35 rlazarus: etcd main cluster switchover complete, eqiad is now read-write
  • 18:28 otto@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
  • 18:27 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:22 vgutierrez: pooling cp4032 running buster - T242093
  • 18:15 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
  • 18:05 robh@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 18:05 robh@cumin1001: START - Cookbook sre.hosts.decommission
  • 18:03 robh@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 18:03 robh@cumin1001: START - Cookbook sre.hosts.decommission
  • 18:02 robh@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 18:01 robh@cumin1001: START - Cookbook sre.hosts.decommission
  • 17:59 robh@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 17:59 robh@cumin1001: START - Cookbook sre.hosts.decommission
  • 17:53 robh@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 17:53 robh@cumin1001: START - Cookbook sre.hosts.decommission
  • 17:52 _joe_: running systemctl reset-failed on conf1005 to clear useless alerts
  • 17:33 marostegui: Poweroff db2085:3311 and db2085:3318 for maintenance - T243148
  • 17:33 jforrester@deploy1001: Synchronized static/images/project-logos: [trwiki] Tweak logo versions T242977 (duration: 01m 07s)
  • 17:00 akosiaris@deploy1001: helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 16:59 akosiaris@deploy1001: helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 16:58 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 16:56 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 16:51 akosiaris@deploy1001: helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
  • 16:27 vgutierrez: depool cp4032 and reimage as buster - T242093
  • 16:26 vgutierrez: pooling cp4026 running buster - T242093
  • 16:02 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/Wikibase/data-access/src/EntitySourceDefinitions.php: EntitySourceDefinitions::getEntityTypeToSourceMapping fix for sub entities (T242415 T214557) (duration: 01m 08s)
  • 16:00 rlazarus: Starting etcd main cluster switchover from codfw to eqiad
  • 15:45 vgutierrez: restarting high-traffic1 && high-traffic2 primary LVSs - T236120 T238625
  • 15:32 vgutierrez: restarting secondary LVSs - T236120 T238625
  • 15:22 moritzm: mask uwsgi.service on debmonitor2001 T222874
  • 15:06 vgutierrez@puppetmaster1001: conftool action : set/weight=1; selector: name=cp4026.ulsfo.wmnet,service=nginx
  • 14:39 vgutierrez@puppetmaster1001: conftool action : set/weight=1; selector: service=ats-tls,name=cp4026.ulsfo.wmnet
  • 14:17 marostegui: Remove wikiadmin2 user from codfw x1 hosts - T243512
  • 13:34 vgutierrez@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:32 vgutierrez@cumin1001: START - Cookbook sre.hosts.downtime
  • 13:19 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:19 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:50 Amir1: EU SWAT is done
  • 12:49 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Set EntitySourceBasedFederation true for testwiki (T243395) (duration: 01m 06s)
  • 12:47 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Set EntitySourceBasedFederation true for testwiki (T243395) (duration: 01m 05s)
  • 12:46 Urbanecm: Run renameRestrictions.php 'autopatrol' 'editautopatrolprotected' for all Serbian wikis (T230103)
  • 12:44 Urbanecm: mwscript renameRestrictions.php --wiki=hewiki 'autopatrol' 'editautopatrolprotected' (T230103)
  • 12:44 Urbanecm: mwscript renameRestrictions.php --wiki=etwiki 'autopatrol' 'editautopatrolprotected' (T230103)
  • 12:41 urbanecm@deploy1001: Synchronized wmf-config/flaggedrevs.php: SWAT: 0c2fb70: Use editautopatrolprotected right for pages protected for autopatrollers (3/3; T230103) (duration: 01m 05s)
  • 12:39 urbanecm@deploy1001: Synchronized wmf-config/CommonSettings.php: SWAT: 0c2fb70: Use editautopatrolprotected right for pages protected for autopatrollers (2/3; T230103) (duration: 01m 08s)
  • 12:35 Urbanecm: mwscript renameRestrictions.php --wiki=ckbwiki 'autopatrol' 'editautopatrolprotected' (T230103)
  • 12:33 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0c2fb70: Use editautopatrolprotected right for pages protected for autopatrollers; fixing broken cache (T230103) (duration: 01m 04s)
  • 12:31 twentyafterfour: Deploying hotfix for T243479, restarting php7.3-fpm on phab1003
  • 12:31 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 0c2fb70: Use editautopatrolprotected right for pages protected for autopatrollers (T230103) (duration: 01m 06s)
  • 12:15 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 01m 04s)
  • 12:14 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 01m 06s)
  • 12:10 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Move CX out of beta for af, is, lv and ne WPs (T242011 T242012 T242014 T242016) (duration: 01m 05s)
  • 12:08 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Move CX out of beta for af, is, lv and ne WPs (T242011 T242012 T242014 T242016) (duration: 01m 08s)
  • 11:37 jbond42: updating order in resolve search list https://gerrit.wikimedia.org/r/c/operations/puppet/+/566567
  • 10:25 vgutierrez: depooling and reimaging cp4026 as buster - T242093
  • 09:13 moritzm: installing xen updates (only pulled in via deps, otherwise unused)
  • 08:46 marostegui: Stop mysql on es2024 to "clone" es2025 - T243052
  • 06:05 marostegui: Remove partitions from db1097:3314 - T239453
  • 06:03 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1097:3314 - T239453', diff saved to https://phabricator.wikimedia.org/P10248 and previous config saved to /var/cache/conftool/dbconfig/20200123-060308-marostegui.json
  • 05:59 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1103:3314 - T239453', diff saved to https://phabricator.wikimedia.org/P10247 and previous config saved to /var/cache/conftool/dbconfig/20200123-055919-marostegui.json
  • 05:55 marostegui: Compress some tables on db1124:3318, this might generate lag on s8 labs - T232446
  • 01:40 jforrester@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/AbuseFilter/includes/AFComputedVariable.php: T243469 When no registration date is recorded, use 2008-01-15 (duration: 01m 08s)
  • 01:37 twentyafterfour: Phabricator deployment completed with no apparent issues.
  • 01:27 twentyafterfour: Deploying phabricator update tagged release/2020-01-23/1
  • 00:41 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: resync (duration: 01m 07s)
  • 00:40 RoanKattouw: Deployment freeze lifted

2020-01-22

  • 23:46 James_F: <RoanKattouw> T236104 happened again, and this time I'm leaving it broken so I can investigate. Please don't do any MW deployments (use scap) for now
  • 23:31 eileen: civicrm revision changed from 036b742316 to fbd5c35fb0, config revision is 74a355670a
  • 23:28 eileen: civicrm revision changed from 7595104180 to 036b742316, config revision is 74a355670a
  • 23:14 eileen: civicrm revision changed from c74092ad63 to 7595104180, config revision is 74a355670a
  • 23:06 XioNoX: configure flowspec on cr3-knams
  • 22:39 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable homepage on ukwiki, huwiki, hywiki (T238320, T231720, T230478, T230676) (duration: 01m 05s)
  • 22:30 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable help panel on ukwiki, huwiki, hywiki (T238319, T231720, T230478, T230676) (duration: 01m 04s)
  • 22:19 catrope@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/CodeReview/: T243337 (duration: 01m 06s)
  • 22:13 catrope@deploy1001: Finished scap: i18n changes for SWAT: Special page aliases for GrowthExperiments (T230676); messages for machinevision-tester group (T243440); fix namespace names for atj (T243125) (duration: 40m 48s)
  • 21:32 catrope@deploy1001: Started scap: i18n changes for SWAT: Special page aliases for GrowthExperiments (T230676); messages for machinevision-tester group (T243440); fix namespace names for atj (T243125)
  • 21:28 arlolra: Updated Parsoid to 7390988 (T242513, T243008, T241146)
  • 21:18 arlolra@deploy1001: Finished deploy [parsoid/deploy@e8610ff]: Updating Parsoid to 7390988 (duration: 08m 28s)
  • 21:10 arlolra@deploy1001: Started deploy [parsoid/deploy@e8610ff]: Updating Parsoid to 7390988
  • 20:07 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.16 (duration: 01m 05s)
  • 20:06 brennen@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.16
  • 19:46 catrope@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/WikimediaMessages/: Remove temporary partial block banner (T240300) (duration: 01m 06s)
  • 19:45 catrope@deploy1001: Synchronized php-1.35.0-wmf.15/extensions/WikimediaMessages/: Remove temporary partial block banner (T240300) (duration: 01m 10s)
  • 19:43 gehel: restart tilerator / kartotherian on maps* servers
  • 19:36 catrope@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/WikimediaEvents/: InukaPageView: update schema version (T238029) (duration: 01m 07s)
  • 19:26 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable UnderstandingFirstDay on ukwiki, huwiki, hywiki (T238294) (duration: 01m 06s)
  • 17:46 arturo: forcing by hand the first sync on sodium for openstack packages (T238820)
  • 16:40 vgutierrez: removing nginx from the caching cluster
  • 16:26 moritzm: installing tiff security updates for buster
  • 16:21 vgutierrez: copied prometheus-trafficserver-exporter from stretch to buster on apt.w.o - T242093
  • 16:13 XioNoX: update logging target for pfw3-eqiad - T243343
  • 16:07 XioNoX: update logging target for pfw3-codfw - T243343
  • 15:43 vgutierrez: uploaded vhtcpd 0.1.2-2 to apt.w.o (buster) - T242093
  • 15:38 marostegui: Compress wikidatawiki.wbt_text wikidatawiki.wbt_text_in_lang on db1124:3318 (this might cause lag on s8 labs) - T232446
  • 15:29 vgutierrez: uploaded fifo-log-demux 0.6.1 to apt.w.o (buster) - T242093
  • 14:54 papaul: FW upgrade on db2085
  • 14:53 vgutierrez: copied python3-logstash to apt.w.o (buster) - T242093
  • 14:50 vgutierrez: copied python3-file-read-backwards to apt.w.o (buster) - T242093
  • 14:39 marostegui: Stop MySQL on db2085:3311 and db2085:3318 for onsite maintenance - T243148
  • 14:18 akosiaris: upload etherpad-lite_1.7.5-3 to apt.wikimedia.org buster-wikimedia/main T224580
  • 13:07 Amir1: EU SWAT is over
  • 13:03 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert: Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 01m 05s)
  • 13:02 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert: Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 01m 05s)
  • 12:59 effie: restart nrpe on notebook1003
  • 12:57 hoo: Updated the Wikidata property suggester with data from the 2020-01-06 JSON dump and applied the T132839 workarounds
  • 12:51 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 01m 05s)
  • 12:50 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 01m 06s)
  • 12:47 jbond42: disable puppet fleet-wide - upgrade jdk on puppetdb
  • 12:46 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.15/extensions/WikibaseQualityConstraints: Better dependency injection of base URI in ConstraintParameterParser (T241972) (duration: 01m 05s)
  • 12:43 ladsgroup@deploy1001: scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
  • 12:36 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/WikibaseQualityConstraints: Better dependency injection of base URI in ConstraintParameterParser (T241972) (duration: 01m 14s)
  • 12:35 effie: enable puppet and restart mtail on mw* and wtp*
  • 12:30 vgutierrez: uploaded trafficserver 8.0.5-1wm13 to apt.w.o (buster) - T242093
  • 12:17 effie: Disable puppet on mw* and wtp* to merge 563206
  • 12:15 jmm@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 12:14 jmm@cumin1001: START - Cookbook sre.hosts.decommission
  • 11:40 moritzm: restarting apache on puppetboard/graphite/webperf to pick up OpenLDAP update
  • 11:38 cormacparle__: disabled wikitech 2fa for Cparle
  • 11:16 moritzm: restarting exim on MXes to pick up new openldap
  • 11:04 moritzm: restarting mw canaries to pick up openldap update
  • 10:09 marostegui: Stop MySQL on es2023 to "clone" es2024 - T243052
  • 10:04 moritzm: installing openldap security updates on stretch
  • 08:45 moritzm: upload prometheus-etherpad-exporter 0.2 to buster-wikimedia T224580
  • 08:27 marostegui: Stop MySQL on es2021 to "clone" es2023 - T243052
  • 06:16 marostegui: Remove partitions from db1103:3314 - T239453
  • 06:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1103:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10242 and previous config saved to /var/cache/conftool/dbconfig/20200122-061522-marostegui.json
  • 06:14 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2091:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10241 and previous config saved to /var/cache/conftool/dbconfig/20200122-061429-marostegui.json
  • 01:13 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: resync, the last sync only took on half the appservers (duration: 01m 05s)
  • 00:31 catrope@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enable topics in suggested edits on cswiki, kowiki, arwiki, viwiki (duration: 01m 05s)
  • 00:26 catrope@deploy1001: Synchronized php-1.35.0-wmf.15/extensions/GrowthExperiments/: SWAT for T242811, T242052 (duration: 01m 05s)

2020-01-21

  • 20:09 brennen@deploy1001: rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.16
  • 19:59 mutante: puppet-compilers: syncing facts from puppetmasters to 3 compiler instances
  • 19:55 XioNoX: restart mr1-esams for software upgrade - T242097
  • 19:46 ppchelko@deploy1001: Finished deploy [cpjobqueue/deploy@1ca3071]: Add separate rule for machine vision jobs T241072 (duration: 01m 11s)
  • 19:45 ppchelko@deploy1001: Started deploy [cpjobqueue/deploy@1ca3071]: Add separate rule for machine vision jobs T241072
  • 19:40 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 19:39 andrew@cumin1001: START - Cookbook sre.hosts.decommission
  • 19:39 XioNoX: mr1-esams> request system software add /var/tmp/junos-srxsme-18.2R3-S2... - T242097
  • 19:39 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 19:38 andrew@cumin1001: START - Cookbook sre.hosts.decommission
  • 19:22 XioNoX: cr3-knams# set routing-options ppm no-delegate-processing - T240659
  • 19:01 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 19:00 andrew@cumin1001: START - Cookbook sre.hosts.decommission
  • 19:00 andrew@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
  • 18:59 andrew@cumin1001: START - Cookbook sre.hosts.decommission
  • 18:50 brennen@deploy1001: Finished scap: testwiki to php-1.35.0-wmf.16 and rebuild l10n cache (duration: 30m 27s)
  • 18:19 brennen@deploy1001: Started scap: testwiki to php-1.35.0-wmf.16 and rebuild l10n cache
  • 17:45 XioNoX: add dwisehaupt user to pfw/fasw - T242758
  • 17:44 ebernhardson@deploy1001: Finished deploy [search/mjolnir/deploy@986769c]: bulk_daemon: Treat model exists as unrecoverable failure (duration: 05m 42s)
  • 17:39 ebernhardson@deploy1001: Started deploy [search/mjolnir/deploy@986769c]: bulk_daemon: Treat model exists as unrecoverable failure
  • 17:37 bstorm_: re-exported NFS from labstore1006/7
  • 17:33 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@ae77f9d]: Deploy ores_drafttopics dag (duration: 00m 22s)
  • 17:32 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@ae77f9d]: Deploy ores_drafttopics dag
  • 17:20 brennen: starting branch cut for T233864
  • 17:08 XioNoX: restart pfw3-eqiad for software upgrade
  • 16:45 XioNoX: install software upgrade on pfw3a-eqiad (primary, no restart yet)
  • 16:35 XioNoX: install software upgrade on pfw3b-eqiad (secondary, no restart yet)
  • 16:15 vgutierrez: copied prometheus-varnishkafka-exporter from stretch to buster on apt.w.o - T242093
  • 16:02 vgutierrez: uploaded libvmod-tbf 2.0.91-2wm to apt.w.o (buster) - T242093
  • 14:57 vgutierrez: uploaded libvmod-re2 1.3.1-3 to apt.w.o (buster) - T242093
  • 14:56 vgutierrez: uploaded libvmod-netmapper 1.7-3 to apt.w.o (buster) - T242093
  • 14:39 moritzm: stopping/masking tor on torrelay1001 T243288
  • 14:38 effie: Rolling restart all eqiad mw api servers
  • 14:37 vgutierrez: uploaded varnish-modules 0.12-1+wmf2 to apt.w.o (buster) - T242093
  • 14:36 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 14:36 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 14:36 _joe_: restart pybal on low-traffic eqiad to pick up new configuration
  • 14:33 cdanis@cumin2001: conftool action : set/weight=30; selector: cluster=api_appserver,dc=eqiad,service=apache2,name=mw13.*
  • 14:33 cdanis@cumin2001: conftool action : set/weight=30; selector: cluster=api_appserver,dc=eqiad,service=nginx,name=mw13.*
  • 14:30 cdanis@cumin2001: conftool action : set/weight=15; selector: cluster=api_appserver,dc=eqiad,service=nginx,name=mw12[23].*
  • 14:24 _joe_: restarting pybal on lvs low-traffic in codfw
  • 14:02 oblivian@puppetmaster1001: conftool action : set/weight=10:pooled=yes; selector: service=kubesvc,cluster=kubernetes
  • 13:24 marostegui: Clean up some gerrit grants on db1132 (m2 master) T233714
  • 13:00 mvolz@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'zotero' for release 'production' .
  • 12:29 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 00m 58s)
  • 12:28 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 01m 00s)
  • 12:21 mvolz@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'zotero' for release 'production' .
  • 12:19 vgutierrez: upgrading pybal on esams and eqiad - T169765
  • 12:12 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 00m 59s)
  • 12:07 ladsgroup@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Set useEntitySourceBasedFederation to true for Wikidata (T241972) (duration: 01m 12s)
  • 11:56 vgutierrez: upgrading pybal on eqsin and codfw - T169765
  • 11:54 vgutierrez: restarting pybal instances on eqsin
  • 11:52 _joe_: restarting etcd on conf2003 to test new pybal reconnection. Issues expected for pybal in eqsin, but not in ulsfo
  • 11:44 jbond42: importing puppet-master packages to component/puppet5
  • 11:39 mvolz@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' .
  • 11:24 vgutierrez: Updating pybal to 1.15.7 on ulsfo load balancers - T169765
  • 11:23 vgutierrez: uploaded pybal 1.15.7 to apt.w.o (stretch) - T169765
  • 11:22 mvolz@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' .
  • 10:47 mvolz@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' .
  • 10:40 mvolz@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' .
  • 10:38 akosiaris@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' .
  • 10:36 godog: roll-restart thumbor after https://gerrit.wikimedia.org/r/c/operations/puppet/+/566069
  • 10:05 volans@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 10:05 volans@cumin2001: START - Cookbook sre.hosts.downtime
  • 07:34 oblivian@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' .
  • 07:29 oblivian@deploy1001: helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' .
  • 07:23 _joe_: adding TLS to citoid in production
  • 07:20 oblivian@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' .
  • 06:28 marostegui: Remove the following users from phabricator database: 'phadmin'@'10.64.48.21' 'phuser'@'10.64.48.21' 'phstats'@'10.64.48.21' 'phmanifest'@'10.64.48.21' T238957
  • 06:19 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db1087', diff saved to https://phabricator.wikimedia.org/P10233 and previous config saved to /var/cache/conftool/dbconfig/20200121-061932-marostegui.json
  • 06:19 marostegui: Aborted upgrade on db1087 (wiki dumps are running)
  • 06:18 marostegui: Upgrade db1087
  • 06:17 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1087 for upgrade', diff saved to https://phabricator.wikimedia.org/P10232 and previous config saved to /var/cache/conftool/dbconfig/20200121-061756-marostegui.json
  • 06:05 marostegui: Stop replication on db1107
  • 05:58 marostegui: Stop MySQL on es2021 to clone es2022 - T243052
  • 05:52 marostegui: Remove partitions from db2091:3314 - T239453
  • 05:51 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2091:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10231 and previous config saved to /var/cache/conftool/dbconfig/20200121-055149-marostegui.json
  • 05:50 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2084:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10230 and previous config saved to /var/cache/conftool/dbconfig/20200121-055023-marostegui.json

2020-01-20

  • 16:14 Urbanecm: Change email assigned to User:Sadsadas (T243222)
  • 15:28 mholloway-shell@deploy1001: Finished deploy [mobileapps/deploy@2a1f493]: Update mobileapps to 1848cf5 (duration: 05m 55s)
  • 15:22 mholloway-shell@deploy1001: Started deploy [mobileapps/deploy@2a1f493]: Update mobileapps to 1848cf5
  • 15:20 vgutierrez: rolling upgrade of ats to version 8.0.5-1wm12 - T242620 T242778
  • 15:03 vgutierrez: uploaded trafficserver 8.0.5-1wm12 to apt.wm.o (stretch) - T242620 T242778
  • 13:06 jbond42_: add SSL validation to conftool/etcd expected no-op (https://gerrit.wikimedia.org/r/c/operations/puppet/+/566009)
  • 12:45 vgutierrez: uploaded varnishkafka 1.0.14-1 to apt.wm.o (buster) - T242093
  • 12:25 elukey@cumin1001: END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0)
  • 12:18 elukey@cumin1001: START - Cookbook sre.zookeeper.roll-restart-zookeeper
  • 12:09 moritzm: removing actinium in Ganeti T224551
  • 12:08 jmm@cumin2001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 12:07 jmm@cumin2001: START - Cookbook sre.hosts.decommission
  • 11:43 moritzm: removing alsafi in Ganeti T224551
  • 11:41 jmm@cumin2001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 11:40 jmm@cumin2001: START - Cookbook sre.hosts.decommission
  • 11:32 jbond42_: reverting until joe's change is finished - add SSL validation to conftool/etcd expected no-op (https://gerrit.wikimedia.org/r/c/operations/puppet/+/561817)
  • 11:30 jbond42_: add SSL validation to conftool/etcd expected no-op (https://gerrit.wikimedia.org/r/c/operations/puppet/+/561817)
  • 11:14 vgutierrez: deploying wikiworkshop TLS certificate on the text cluster - T242374
  • 10:06 moritzm: removing alcyone/aluminium in Ganeti T224551
  • 10:04 jmm@cumin2001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 10:04 jmm@cumin2001: START - Cookbook sre.hosts.decommission
  • 10:01 jmm@cumin2001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 10:01 jmm@cumin2001: START - Cookbook sre.hosts.decommission
  • 09:44 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1129', diff saved to https://phabricator.wikimedia.org/P10225 and previous config saved to /var/cache/conftool/dbconfig/20200120-094445-marostegui.json
  • 09:36 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1129', diff saved to https://phabricator.wikimedia.org/P10224 and previous config saved to /var/cache/conftool/dbconfig/20200120-093603-marostegui.json
  • 09:26 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1129', diff saved to https://phabricator.wikimedia.org/P10223 and previous config saved to /var/cache/conftool/dbconfig/20200120-092642-marostegui.json
  • 09:19 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1129', diff saved to https://phabricator.wikimedia.org/P10222 and previous config saved to /var/cache/conftool/dbconfig/20200120-091929-marostegui.json
  • 09:08 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1094', diff saved to https://phabricator.wikimedia.org/P10221 and previous config saved to /var/cache/conftool/dbconfig/20200120-090850-marostegui.json
  • 09:06 marostegui: Upgrade db1129
  • 09:06 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P10220 and previous config saved to /var/cache/conftool/dbconfig/20200120-090617-marostegui.json
  • 09:05 moritzm: restarting CAS to pick up Java security updates
  • 09:03 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P10219 and previous config saved to /var/cache/conftool/dbconfig/20200120-090336-marostegui.json
  • 09:01 moritzm: installing Java security updates on an-conf*
  • 08:55 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P10218 and previous config saved to /var/cache/conftool/dbconfig/20200120-085537-marostegui.json
  • 08:51 marostegui: Upgrade db1139:3311 db1139:3316
  • 08:49 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P10217 and previous config saved to /var/cache/conftool/dbconfig/20200120-084908-marostegui.json
  • 08:44 marostegui: Upgrade db1094
  • 08:44 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1094', diff saved to https://phabricator.wikimedia.org/P10216 and previous config saved to /var/cache/conftool/dbconfig/20200120-084408-marostegui.json
  • 08:10 marostegui: Compare data on db2085:3318 - T243148
  • 08:07 ema: powercycle cp3061 T238305
  • 07:15 marostegui: Remove partitions from revision on db2084:3314 T239453
  • 07:15 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2084 T239453', diff saved to https://phabricator.wikimedia.org/P10215 and previous config saved to /var/cache/conftool/dbconfig/20200120-071513-marostegui.json
  • 07:10 marostegui: Stop MySQL on es2020 to clone es2021 - T243052
  • 06:09 marostegui: Stop replication on db1107
  • 06:08 marostegui: Compress db1121 - T232446
  • 06:08 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db1121, pool db1084 into vslow T232446', diff saved to https://phabricator.wikimedia.org/P10214 and previous config saved to /var/cache/conftool/dbconfig/20200120-060759-marostegui.json

2020-01-19

  • 12:02 marostegui@cumin1001: dbctl commit (dc=all): 'Depool db2085:3311, db2085:3318 T243148', diff saved to https://phabricator.wikimedia.org/P10210 and previous config saved to /var/cache/conftool/dbconfig/20200119-120236-marostegui.json
  • 11:20 elukey: restart-php-fpm on mw2181 to rule out temporary php-related issues in codfw
  • 00:46 cdanis: T238305 cp3053.mgmt /admin1-> racadm serveraction hardreset

2020-01-18

  • off: upgraded spicerack to 0.0.29 on cumin hosts
  • 09:00 dcausse: repool wdqs1007 (T242453)
  • 07:05 marostegui: Remove partitions from enwiki.revision on db2085 T239453
  • 04:15 cdanis: cp3065.mgmt: /admin1-> racadm serveraction hardreset T238305

2020-01-17

  • 21:56 urandom: bootstrapping restbase2023-c — T243000
  • 21:17 dzahn@cumin1001: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
  • 20:40 dzahn@cumin1001: START - Cookbook sre.ganeti.makevm
  • 20:07 urandom: bootstrapping restbase2023-b — T243000
  • 20:01 bblack: reset bgp peerings with gfiber on cr2-eqiad
  • 19:14 mutante: gerrit - switching operations/debs/hhvm to READONLY mode and adding ARCHIVED to description (T237038)
  • 18:19 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
  • 18:18 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
  • 17:15 urandom: bootstrapping restbase2023-a — T243000
  • 16:33 marostegui: Stop replication on db1107
  • 16:25 ebernhardson@deploy1001: Finished deploy [wikimedia/discovery/analytics@938d253]: Move weekly elasticsearch transfer to airflow (duration: 00m 21s)
  • 16:25 ebernhardson@deploy1001: Started deploy [wikimedia/discovery/analytics@938d253]: Move weekly elasticsearch transfer to airflow
  • 14:31 urandom: bootstrapping restbase2022-c — T243000
  • 14:09 awight@deploy1001: Synchronized php-1.35.0-wmf.15/extensions/Cite: UBN backport: Fix for nested #tag:references and empty name (T242437) (duration: 00m 57s)
  • 14:03 awight: beginning Friday deployment for UBN, T242437
  • 13:38 moritzm: masking squid3 on old URL downloaders T224551
  • 13:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:35 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:35 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 13:35 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 13:35 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 12:55 effie: Upgrade netmon* to php 7.2.26 and restart - T241222
  • 11:48 moritzm: upgrading PHP 7.2 on netmon* (also apache restart for SSL update)
  • 11:13 elukey: restart nginx on analytics tool hosts to pick up openssl updates
  • 11:05 moritzm: restarting apache on matomo1001 to pick up SSL updates
  • 11:04 XioNoX: Running homer to remove decom cloud vlans in eqiad/codfw - T240670
  • 11:01 XioNoX: delete vlan cloud-instances1-b-eqiad from asw2-b-eqiad - T240670
  • 10:43 moritzm: restarting apache on miscweb* to pick up SSL updates
  • 10:39 moritzm: restarting apache on puppetboard* to pick up SSL updates
  • 10:32 moritzm: installing remaining OpenSSL 1.0.2 updates
  • 09:25 jmm@cumin2001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
  • 09:25 jmm@cumin2001: START - Cookbook sre.hosts.downtime
  • 08:58 marostegui@cumin1001: dbctl commit (dc=all): 'Repool db2103', diff saved to https://phabricator.wikimedia.org/P10202 and previous config saved to /var/cache/conftool/dbconfig/20200117-085808-marostegui.json
  • 07:51 marostegui@cumin1001: dbctl commit (dc=all): 'Fully repool db1082', diff saved to https://phabricator.wikimedia.org/P10201 and previous config saved to /var/cache/conftool/dbconfig/20200117-075125-marostegui.json
  • 07:46 marostegui@cumin1001: dbctl commit (dc=all): 'Slowly repool db1082', diff saved to