Release Engineering/SAL

From Wikitech

2017-12-13

2017-12-12

  • 18:22 mdholloway: deployed mobileapps@5b8796d to BC
  • 15:34 addshore: deploy zuul for parameter_functions update
  • 15:27 addshore: unblocked beta scaps and files syncs on jenkins

2017-12-11

2017-12-09

2017-12-08

2017-12-07

  • 20:53 Hauskatze: maurelio@deployment-tin:~$ mwscript namespaceDupes.php --wiki=dewiki --fix
  • 20:25 Hauskatze: maurelio@deployment-tin:~$ mwscript namespaceDupes.php --wiki=deploymentwiki --fix --add-prefix=Broken/
  • 20:18 Hauskatze: deployment-prep maurelio@deployment-tin:~$ mwscript cleanupSpam.php --wiki=deploymentwiki --delete *.loginidol.org
  • 17:56 mdholloway: deployed mobileapps@71f581c to beta cluster
  • 10:09 hashar: integration: sudo cumin --force 'name:integration-slave-jessie-100*' /usr/local/sbin/run-puppet-agent | https://gerrit.wikimedia.org/r/395961
  • 10:06 hashar: integration: unbroke puppet on some permanent slaves. Had been broken since Nov 29th ~19:50 UTC | https://gerrit.wikimedia.org/r/#/c/395961/
  • 09:48 hashar: CI: removed Wikidata from configuration, replaced by Wikibase. wmf/* and REL branches are going to be broken though | https://gerrit.wikimedia.org/r/395704 | T181838

2017-12-06

  • 21:43 awight: Update ORES to 42cf532
  • 17:54 gehel: logstash upgrade on deployment-logstash2 completed, 5 minutes of logs lost during upgrade - T178412
  • 17:26 gehel: upgrading ELK on deployment-logstash2 - T178412
  • 16:48 Hauskatze: Ran cleanupSpam.php on deploymentwiki
  • 10:03 hashar: docker push wmfreleng/npm:v2017.12.06.09.55 wmfreleng/npm-stretch:v2017.12.06.09.55 wmfreleng/npm-test:v2017.12.06.09.55 wmfreleng/npm-test-stretch:v2017.12.06.09.55 !!! wmfreleng/npm-browser-test:v2017.12.06.09.55 | https://gerrit.wikimedia.org/r/#/c/395555/
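The wmfreleng image pushes above all follow a date-based tag scheme (vYYYY.MM.DD.HH.MM) alongside a moving `latest` tag. A minimal dry-run sketch of that scheme, assuming an example image name and only echoing the push commands rather than executing them:

```shell
# Sketch of the date-based tag scheme (vYYYY.MM.DD.HH.MM) used for the
# wmfreleng images above; the push commands are echoed, not executed.
img=wmfreleng/npm      # example image name, taken from the entry above
tag="v$(date -u +%Y.%m.%d.%H.%M)"
echo "docker push $img:$tag"
echo "docker push $img:latest"
```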

2017-12-05

  • 12:18 hashar: deployment-videoscaler01: rm /var/log/hhvm/* /var/log/apache2/* . Restarted apache2/hhvm/syslog
  • 12:16 hashar: integration: sudo cumin --force '*' 'apt-get clean'
  • 12:16 hashar: deployment-prep: sudo cumin --force '*' 'apt-get clean'
  • 12:15 hashar: deployment-videoscaler01: apt-get clean to free up disk space
  • 08:51 hashar: jenkins: adding global property FORCE_COLOR=1 to https://integration.wikimedia.org/ci/configure . That forces webdriver.io to emit color in the Jenkins console when not using a TTY
  • 06:37 kart_: Updated cxserver to 1693bcf
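The two `apt-get clean` entries above run the same command fleet-wide via cumin, once per project. A dry-run sketch of that sweep, which only prints the per-project cumin invocation (the project list is taken from the two log entries; nothing actually runs):

```shell
# Dry-run sketch of the `apt-get clean` sweep logged above: print the
# cumin invocation for each project instead of executing it.
plan=$(for project in integration deployment-prep; do
  echo "$project: sudo cumin --force '*' 'apt-get clean'"
done)
echo "$plan"
```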

2017-12-04

  • 17:44 awight: ORES: Try enwiki models on simplewiki, T181848 (6baed71)

2017-12-03

  • 21:27 legoktm: legoktm@integration-slave-jessie-1001:/srv/jenkins-workspace/workspace$ sudo rm -rf * # to clear out full /srv

2017-12-01

  • 21:15 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/394655
  • 13:46 godog: deployment-prep bounce elasticsearch on logstash2 to test jmx_exporter
  • 11:55 hashar: updating *npm-browser-node-6-docker jobs to use a new container based on Stretch with Chromium/Firefox | https://gerrit.wikimedia.org/r/#/c/394340/ | T179360
  • 10:08 hashar: docker push wmfreleng/npm-browser-test-stretch:v2017.11.30.21.30 && docker push wmfreleng/npm-browser-test-stretch:latest | https://gerrit.wikimedia.org/r/#/c/394340/ | T179360
  • 08:40 hashar: rebased operations/puppet on deployment-prep and integration puppetmasters
  • 08:40 hashar: deployment-prep: removed a hack to puppetmaster environments/future/environment.conf containing: parser = future \n manifest = $confdir/manifests\n
  • 08:38 hashar: integration: removed a hack to puppetmaster environments/future/environment.conf containing: parser = future \n manifest = $confdir/manifests\n

2017-11-30

  • 23:08 addshore: turned beta-scap-eqiad back on
  • 23:03 addshore: reload zuul to deploy Revert "Use gate-and-submit-swat for mediawiki-config" [integration/config] - https://gerrit.wikimedia.org/r/394484
  • 22:58 addshore: also reloaded with hashar Switch ArticlePlaceholder to npm-browser-test & Remove mwgate-npm-node-6-jessie
  • 22:57 addshore: reloaded zuul for Use gate-and-submit-swat for mediawiki-config [integration/config] - https://gerrit.wikimedia.org/r/394464
  • 21:05 hashar: docker push wmfreleng/npm-stretch:v2017.11.30.21.03 && docker push wmfreleng/npm-stretch:latest && docker push wmfreleng/npm-test-stretch:v2017.11.30.21.03 && docker push wmfreleng/npm-test-stretch:latest | https://gerrit.wikimedia.org/r/#/c/394338/ | T179360
  • 20:50 addshore: temp disable beta-scap-eqiad so that it doesn't block me doing my own scaps
  • 18:59 bd808: Testing stashbot fix for double phab logging (T181731)
  • 17:49 anomie: Finished running cleanupUsersWithNoId.php on Beta Cluster for T181731
  • 16:58 anomie: Running cleanupUsersWithNoId.php on Beta Cluster, see T181731

2017-11-29

  • 21:27 awight: Update ores submodule, for RevIdScorer statistics
  • 21:17 awight: deployment-prep Verbose logging for ORES Celery
  • 14:32 chasemp: git pull on /var/lib/git/labs/private and resolve one merge conflict. (the root key file is too old here)
  • 09:18 hashar: gerrit: forcing replication: ssh -p 29418 hashar@gerrit.wikimedia.org replication start operations/software/druid_exporter # T181219
  • 09:14 hashar: github: created wikimedia/operations-debs-contenttranslation-apertium-crh-tur and wikimedia/operations-debs-prometheus-openldap-exporter
  • 09:08 hashar: github: created repo operations-software-druid_exporter | T181219
  • 03:56 legoktm: deleted all workspaces on integration-slave-jessie-1003 /srv ran out of space
  • 03:23 Krinkle: Jenkins jobs for mediawiki-core-php55lint consistently failing on integration-slave-jessie ("git: stderr: error: failed to write..")
  • 00:02 halfak: deploy-prep awight enabled ORES service
  • 00:01 halfak: deploy-prep awight disabled ORES service

2017-11-28

  • 17:42 awight: Remove stale ORES customizations for the beta cluster.
  • 17:31 awight: Remove beta cluster customizations for ORES

2017-11-27

2017-11-24

  • 08:16 hashar: pooling integration-slave-docker-1003 again | T179378
  • 08:14 hashar: nodepool: Image snapshot-ci-jessie-1511510623 in wmflabs-eqiad is ready
  • 08:13 hashar: upgrading blubber on contint2001
  • 08:03 hashar: nodepool: manually rebuilding snapshot-ci-jessie

2017-11-23

2017-11-22

2017-11-21

2017-11-20

  • 15:57 hashar: gerrit: deleted operations/network-diagrams (mostly empty, no changes; created back in 2012)
  • 15:03 hashar: integration: pass all environment variables to the docker run commands | https://gerrit.wikimedia.org/r/#/c/390432/ | T177684
  • 10:06 hashar: nodepool: manually deleted left over instances ci-jessie-wikimedia-894187 and ci-jessie-wikimedia-894188 . Jenkins fails to ssh to them and they were left ready for 72 hours.
  • 10:05 hashar: deployment-phab : set hiera 'phabricator_cluster_search: []' trying to unblock puppet and soft rebooted the instance | T180935
  • 09:39 hashar: deployment-prep added missing key between_bytes_timeout to cache::app_def_be_opts for deployment-cache-text04 and deployment-cache-upload04 | T180935
  • 09:29 hashar: deployment-tin: apt-mark hold scap | the apt-repo on deployment-tin is out of date | T180935

2017-11-16

2017-11-15

2017-11-13

2017-11-10

2017-11-09

2017-11-08

  • 13:43 Reedy: ran apt-get clean|autoclean on deployment-mediawiki04 to free up some space

2017-11-07

  • 18:45 twentyafterfour: cowboy-committed and pushed rMSCAc1f2ac2 to hopefully unbreak `scap deploy` in beta
  • 17:56 legoktm: integration-slave-jessie-1003 /srv full, legoktm@integration-slave-jessie-1003:/srv/jenkins-workspace/workspace$ sudo rm -rf mwgate-* mediawiki-*
  • 17:27 hashar: Image snapshot-ci-jessie-1510074928 in wmflabs-eqiad is ready - T179772
  • 17:15 hashar: Updating Nodepool snapshot to get php5.5-zip - T179772
  • 16:15 hashar: Created portalsbuilder in Gerrit, generated a ssh key pair for it and stored in Jenkins credentials store - T179694
  • 15:15 hashar: Created VPS account "PortalsBuilder" - T179694

2017-11-06

  • 23:49 thcipriani: ssh-keyscan deployment-videoscaler01.deployment-prep.eqiad.wmflabs >> /etc/ssh/ssh_known_hosts
  • 22:29 hashar: killed stuck npm Docker containers on integration-slave-docker-1002 (due to T176747 ). Pooled the instance back, the slowness it experienced is probably not related to labvirt CPU usage ( T179378 )
  • 20:35 Amir1: deploy ores:93e8846 in beta cluster
  • 16:02 thcipriani: Reloading zuul to deploy https://gerrit.wikimedia.org/r/#/c/388546/ and https://gerrit.wikimedia.org/r/#/c/389463/

2017-11-03

  • 13:51 hashar: pooled integration-slave-docker-1004 and integration-slave-docker-1007
  • 13:30 hashar: Unpool integration-slave-docker-1002 and integration-slave-docker-1003 . They are slow CPU wise, most probably due to the underlying labvirt being CPU starved. - T179378
  • 12:38 hashar: T179593 generate doc for cumin@v1.2.2 : contint1001$ zuul enqueue-ref --trigger gerrit --pipeline publish --project operations/software/cumin --ref refs/tags/v1.2.2 --newrev f745387
  • 11:20 hashar: generate doc for cumin@v1.2.2 : contint1001$ zuul enqueue-ref --trigger gerrit --pipeline publish --project operations/software/cumin --ref refs/tags/v1.2.2
  • 11:17 addshore: zuul reload for zuul: add noop jobs for new analytics/wmde/WDCM-* repos [integration/config] - https://gerrit.wikimedia.org/r/388423
  • 11:17 hashar: generate doc for cumin ( T179593 ) : contint1001$ zuul enqueue --trigger gerrit --pipeline postmerge --project operations/software/cumin --change 388261,2
  • 02:04 legoktm: integration-slave-jessie-1004 deleted mwgate-php55lint (5.2GB) and mediawiki-core-php55lint (2.5GB) workspaces due to low disk space in /srv
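The workspace deletions above (here and on 2017-12-03) follow a recurring pattern: when /srv fills on a Jenkins slave, the large `mwgate-*` and `mediawiki-*` workspaces are removed. A hypothetical dry-run sketch of that cleanup against a throwaway workspace root, echoing what would be deleted without removing anything:

```shell
# Hypothetical sketch of the low-disk workspace cleanup above: build a
# throwaway workspace root, then echo which directories a
# `rm -rf mwgate-* mediawiki-*` style pass would delete. Nothing is removed.
ws=$(mktemp -d)        # stand-in for /srv/jenkins-workspace/workspace
mkdir -p "$ws/mwgate-php55lint" "$ws/mediawiki-core-php55lint" "$ws/keep-me"
doomed=""
for d in "$ws"/mwgate-* "$ws"/mediawiki-*; do
  [ -d "$d" ] || continue
  doomed="$doomed $d"
  echo "would run: sudo rm -rf $d"
done
```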

2017-11-02

  • 22:30 halfak: deploying ores-deploy 82a13ae
  • 16:58 addshore: reloaded zuul to deploy https://gerrit.wikimedia.org/r/387960
  • 13:02 hashar: gerrit: marked mediawiki/extensions/WikibaseJavaScriptApi.git read-only - T178226
  • 12:17 hashar: gerrit: created wikibase/javascript-api inheriting from wikibase.git - T178226
  • 07:05 legoktm: mwext-VisualEditor-publish got stuck for 15 hours, deleted a job in jenkins to kick it

2017-11-01

2017-10-31

2017-10-30

  • 10:56 hashar: deployment-logstash2 removed puppet class role::labs::lvm::mnt, replacing with role::labs::lvm::srv . /srv is already mounted. Unmounting /mnt and restarting elasticsearch - T178722
  • 09:55 hashar: gerrit: deleted graphs/shared.git unused / empty repo
  • 09:27 hashar: gerrit: deleted /nfsd.git (unused / no changes, created on October 4th 2016)
  • 09:22 hashar: gerrit: prefix mediawiki/extensions/AutomaticBoardWelcome description with '[ARCHIVED] ' - T179196
  • 09:21 hashar: gerrit: prefix mediawiki/extensions/AWS description with '[ARCHIVED] ' - T174864

2017-10-28

  • 09:12 Krenair: fixed puppet on deployment-kafka01 by installing ldap-utils

2017-10-27

  • 13:11 godog: provision deployment-redis{03,04} with stretch - T148637
  • 13:06 hashar: zuul enqueue --trigger gerrit --pipeline postmerge --project wikidata/query/rdf --change 383791,15

2017-10-26

2017-10-24

  • 17:59 madhuvishy: Ran `sudo cumin -b 5 --backend openstack "project:deployment-prep" "apt-get install git --yes"`
  • 11:19 elukey: removed several roles mistakenly applied to puppet prefix deployment-aqs in Horizon (causing puppet failures for AQS nodes)
  • 08:35 hashar: beta: cherry pick https://gerrit.wikimedia.org/r/#/c/386077/4 "hieradata for varnish caches" - T178841

2017-10-23

  • 20:29 Krinkle: Puppet still failing, now with: "Error 400 on SERVER: Could not find data item cache::fe_transient_gb in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/cache/text.pp:12 on node deployment-cache-text04.deployment-prep.eqiad.wmflabs"
  • 20:29 Krinkle: Previous edit failed. Horizon saved the field as blank. Presumably because the class is unknown in the current version of puppet manifests it has. Strange that it normalises in this way.
  • 20:28 Krinkle: Edit horizon "Other classes" config for deployment-prep/deployment-cache-text04. Rename role::prometheus::varnish_exporter to profile::prometheus::varnish_exporter
  • 20:13 Krinkle: Puppet run still failing on Beta cluster varnish: "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::prometheus::varnish_exporter"
  • 09:29 hashar: fab docker_pull_image:wmfreleng/tox
  • 09:26 hashar: docker push wmfreleng/tox:v2017.10.23.09.05 && docker push wmfreleng/tox:latest - https://gerrit.wikimedia.org/r/385950

2017-10-20

  • 10:00 elukey: cherry pick https://gerrit.wikimedia.org/r/#/c/385339 to the operations/puppet git repo on puppetmaster02
  • 03:34 Krinkle: Beta Cluster varnish (text04) has not had a Puppet run for over 10 days (15165 minutes ago). Error: " puppet-agent: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::prometheus::varnish_exporter for deployment-cache-text04 .. Not using cache on failed catalog .. Could not retrieve catalog; skipping run"

2017-10-19

  • 11:21 zeljkof: Reloading Zuul to deploy 26f4ff5

2017-10-18

  • 18:32 greg-g: MaxSem ran `foreachwiki extensions/LoginNotify/maintenance/migratePreferences.php` on deployment-prep
  • 09:14 dcausse: deployment-prep: upgrading elasticsearch to 5.5.2
  • 08:41 hashar: deployment-mediawiki07: install --owner=nutcracker -d /var/run/nutcracker && systemctl start nutcracker # T178457
  • 08:38 hashar: deployment-videoscaler01: install --owner=nutcracker -d /var/run/nutcracker && systemctl start nutcracker # T178457
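The two entries above recreate a missing tmpfs run directory for nutcracker with install(1) before starting the service. A sketch of the same directory-creation step, using a temp base and dropping the `--owner` flag so it runs unprivileged (the logged command was `install --owner=nutcracker -d /var/run/nutcracker`):

```shell
# Sketch of recreating a service's run directory with install(1), per the
# entries above. Temp path and no --owner so it is runnable unprivileged;
# the real invocation in the log targets /var/run/nutcracker as root.
base=$(mktemp -d)
install -d "$base/nutcracker"
ls -ld "$base/nutcracker"
```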

2017-10-17

  • 22:08 addshore: replaced integration-slave-docker-c2-m4-d40-1005 with integration-slave-docker-1005 T178409
  • 21:48 addshore: added slave integration-slave-docker-1006 (1x 4GB ram executor)
  • 21:47 addshore: delete wmfreleng/mediawiki-extensions-phan from docker hub
  • 14:05 addshore: deleted slave integration-slave-docker-1004
  • 13:35 addshore: swapped integration-slave-docker-1004 for integration-slave-docker-c2-m4-d40-1004 (So we have more 4GB executors)
  • 09:45 addshore: reload zuul for https://gerrit.wikimedia.org/r/384673
  • 08:55 addshore: delete unused mwext-php70-phan-jessie-docker 'project' in jenkins UI
  • 08:54 addshore: reload zuul for https://gerrit.wikimedia.org/r/384614

2017-10-16

2017-10-14

2017-10-13

  • 16:34 Amir1: ladsgroup@deployment-tin:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyInfo.php --wiki=wikidatawiki (T177857)
  • 13:41 zeljkof: Reloading Zuul to deploy b5b1dc2
  • 10:43 zeljkof: Reloading Zuul to deploy 320f065

2017-10-11

  • 19:59 hashar: deployment-prep: deploying jobrunner to catchup with changes.
  • 18:19 hashar: beta: rebased puppet master due to a conflict with b3c6968b3c
  • 15:32 _joe_: removing deployment-pdf01, T177931
  • 08:33 hashar: Image snapshot-ci-jessie-1507710117 in wmflabs-eqiad is ready
  • 08:22 hashar: nodepool: refreshing Jessie snapshot after some puppet patches got merged

2017-10-10

  • 17:51 Amir1: add "Ladsgroup" to oversight members in enwiki in beta cluster to test T177705
  • 16:29 Amir1: adding "Ladsgroup" to admins in wikidatawiki in beta cluster

2017-10-09

  • 13:26 hashar: Upgraded Jenkins to 2.73.1 earlier today
  • 08:53 hashar: hard restart integration-slave-docker-1001 via horizon. It is deadlocked somehow. - T177749

2017-10-06

  • 13:22 hashar: Jenkins: adding Maven-3.0.5 to the tool configuration https://integration.wikimedia.org/ci/configureTools/
  • 11:58 hashar: Jenkins: installed Warnings plugin
  • 11:54 hashar: Jenkins: removing the Violations plugin. It is not used.
  • 09:22 hashar: integration: purged bunch of old containers: sudo cumin 'name:slave-docker' 'yes | docker container prune'

2017-10-05

  • 19:15 hasharAway: rebooting integration-slave-docker-1002 to catch up with the kernel upgrade and pooling it back in Jenkins - T177039
  • 19:11 hasharAway: rebooting integration-slave-jessie-1002 to catch up with the kernel upgrade and pooling it back in Jenkins - T177039
  • 13:16 hashar: Image snapshot-ci-jessie-1507208677 in wmflabs-eqiad is ready
  • 11:47 hashar: Refreshing Nodepool Jessie snapshot to get java 8 by default - T162828
  • 10:56 hashar: integration: unbreak the puppet master. Was stuck due to a cherry pick that needed a rebase
  • 05:56 legoktm: deploying https://gerrit.wikimedia.org/r/382361
  • 04:16 legoktm: deploying https://gerrit.wikimedia.org/r/382354

2017-10-04

  • 13:19 andrewbogott: migrating 'deployment-kafka-jumbo-1' to labvirt1017

2017-10-03

2017-10-02

2017-09-30

  • 14:28 zeljkof: Reloading Zuul to deploy c08a3ad

2017-09-29

  • 19:45 hashar: Deleting integration-slave-jessie-php55
  • 17:34 zeljkof: Reloading Zuul to deploy 0e26c86
  • 16:42 zeljkof: Reloading Zuul to deploy 09445b8
  • 15:10 zeljkof: Reloading Zuul to deploy 7f66813
  • 14:15 tabbycat: maurelio@deployment-tin:~$ mwscript cleanupSpam.php --wiki=deploymentwiki *.logininput.org ( testing w/o delete T176206 / 7f842058602c )
  • 14:10 tabbycat: maurelio@deployment-tin:~$ mwscript cleanupSpam.php --wiki=deploymentwiki *.loginpartner.org --delete ( testing T176206 / 7f842058602c )
  • 13:00 hashar: github: created https://github.com/wikimedia/integration-quibble for gerrit replication
  • 12:53 hashar: gerrit: marked labs/tools/grrrit archived
  • 09:53 addshore: addshore@integration-slave-docker-1001:~$ sudo docker ps --filter "status=exited" | grep 'weeks ago' | awk '{print $1}' | xargs --no-run-if-empty sudo docker rm
  • 09:53 addshore: addshore@integration-slave-docker-1001:~$ sudo docker ps --filter "status=exited" | grep 'months ago' | awk '{print $1}' | xargs --no-run-if-empty sudo docker rm
  • 09:40 addshore: marking integration-slave-docker-1001 as online - T177039
  • 09:33 addshore: rebooting integration-slave-docker-1001
  • 09:10 addshore: wm-ci-docker-push mediawiki-phpcs:v2017.09.29.09.08 & latest https://gerrit.wikimedia.org/r/381413
  • 05:59 legoktm: marking integration-slave-docker-1001 as offline - T177039
  • 00:19 mutante: releases1001 - created user for "no_justification", dropped pass in home dir
  • 00:12 mutante: jenkins now configured and running at https://releases.wikimedia.org/ci/ (T164030) - but needs additional admin users and puppet is still disabled for temp hack fix
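The two 09:53 entries above filter `docker ps` for exited containers weeks or months old and feed the IDs to `docker rm`. The same grep/awk chain, rewritten against canned `docker ps` output so it is testable without a Docker daemon (the column layout below is illustrative, not real daemon output):

```shell
# The exited-container sweep above, run against canned `docker ps` output;
# real usage pipes `sudo docker ps --filter "status=exited"` into the
# same grep/awk chain instead of this here-string.
ps_out='abc123  img  "cmd"   3 weeks ago   Exited (0)
def456  img  "cmd"   2 hours ago   Exited (0)
789ghi  img  "cmd"   4 months ago  Exited (1)'
old=$(echo "$ps_out" | grep -E 'weeks ago|months ago' | awk '{print $1}')
echo "$old"    # these IDs would go to `xargs --no-run-if-empty docker rm`
```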

2017-09-28

2017-09-27

2017-09-26

2017-09-25

  • 17:07 mutante: Greg is now a contint-admin
  • 12:36 addshore: addshore@integration-saltmaster:~$ sudo salt -v '*slave-docker*' cmd.run 'sudo docker rmi wmfreleng/operations-puppet:0.0.1 wmfreleng/operations-puppet:0.1.0'
  • 12:30 addshore: Reloading Zuul to deploy Refactor 'operations-puppet-tests-docker' into macros for easy reuse [integration/config] - https://gerrit.wikimedia.org/r/379959
  • 09:12 moritzm: added deployment-mediawiki07 to deployment-prep (stretch-based app server, WIP)

2017-09-24

2017-09-22

  • 21:55 tabbycat: Granted Greg G. 'staff' global rights on the beta cluster per request
  • 20:37 hashar: Image snapshot-ci-jessie-1506112074 in wmflabs-eqiad is ready
  • 20:28 hashar: updating nodepool image for jessie [2/x]
  • 20:03 hasharAway: Updating nodepool image for jessie
  • 17:22 addshore: docker push docker.io/wmfreleng/tox:v2017.09.22.17.16 & latest # (From current master)
  • 15:24 hashar: Restarted Jenkins (out of memory)
  • 10:06 hashar: deployment-salt02 migrated hiera config from wikitech to horizon. Removed the class role::deployment::salt_masters
  • 08:44 hashar: Upgraded docker on integration-slave-docker-1001 and integration-slave-docker-1002 - T176267
  • 07:13 greg-g: some jsduck jobs are running now, serially, for the backlogged queue. Unsure of starved jobs (integration-config-qa, pywikibot-beta-cluster, etc)
  • 07:04 greg-g: deleting stuck mediawiki-core-jsduck-publish jobs in Jenkins UI
  • 06:57 greg-g: pinged an opsen, hopefully they'll restart zuul shortly
  • 06:45 greg-g: Zuul is stuck, no jobs are processing

2017-09-21

2017-09-20

  • 15:46 addshore: reloading zuul for https://gerrit.wikimedia.org/r/#/c/379250/
  • 13:59 addshore: docker push docker.io/wmfreleng/mediawiki-phan:v2017.09.20.13.49 & latest # built from master
  • 13:59 addshore: docker push docker.io/wmfreleng/composer:v2017.09.20.13.44 & latest # built from master
  • 13:59 addshore: docker push docker.io/wmfreleng/zuul-cloner:v2017.09.20.13.44 & latest # built from master
  • 13:59 addshore: docker push docker.io/wmfreleng/php-mediawiki:v2017.09.20.13.43 & latest # built from master
  • 13:59 addshore: docker push docker.io/wmfreleng/php:v2017.09.20.13.40 & latest # built from master
  • 13:07 tabbycat: deployment-prep Ran cleanupSpam.php on deploymentwiki. Further testing with regards to ongoing development and updating of the script.
  • 11:53 addshore: Reloading Zuul (Testing)

2017-09-19

  • 17:26 legoktm: removed rights from User:Sau226 on beta cluster due to block of account used for browser tests
  • 09:13 tabbycat: Re-run previous script and it worked this time, see https://deployment.wikimedia.beta.wmflabs.org/wiki/Template_talk:Rotate/en
  • 09:11 tabbycat: Ran mwscript cleanupSpam.php on the beta cluster, but it didn't work (looks like it is not fetching the domains properly)

2017-09-18

2017-09-17

  • 18:59 addshore: Reloading Zuul to deploy archiving of 2 extensions

2017-09-14

  • 19:37 tgr: updated PrivateSettings.php for T175868
  • 10:38 elukey: cherry-pick https://gerrit.wikimedia.org/r/#/c/377753/7 on deployment-prep's puppetmaster02 to test it on the new kafka jumbo instances
  • 10:35 hashar: CI puppet master: added class geoip::data::package and parameters: puppetmaster::geoip::fetch_private: false puppetmaster::geoip::use_proxy: false - T175864

2017-09-13

  • 10:13 addshore: docker push docker.io/wmfreleng/operations-puppet:v2017.09.13.09.23 (#d693f74c9b3404220a2ad2934f526d4f4455914b)
  • 09:25 hashar: Deleting integration-slave-trusty-1003 and integration-slave-trusty-1001 - T175696
  • 09:14 hashar: nodepool: openstack image delete image-ci-trusty - T175696
  • 07:49 hashar: Jenkins: removing the Ubuntu JDK from https://integration.wikimedia.org/ci/configureTools/
  • 07:40 hashar: jenkins: on nodes, removing the labels phpflavor-* as they are no longer needed - T161882

2017-09-12

  • 20:35 hashar: pooling integration-slave-jessie-1003 and integration-slave-jessie-1004
  • 19:40 hashar: hacked integration-slave-jessie hosts to ship them php5.5
  • 18:49 hasharAway: nodepool: deleted image image-ci-trusty_old_20170804 Keeping image-ci-trusty just in case
  • 14:57 hashar: Deleted all left over jenkins jobs having ci-trusty-wikimedia label. - T161882
  • 14:46 hashar: provisioning integration-slave-jessie-1003 and integration-slave-jessie-1004 to move php55lint to them. NOT READY YET - T161882
  • 14:05 hashar: Deleting integration-slave-trusty-1004 - T161882
  • 13:09 hashar: nodepool: deleting alien instance: openstack server delete ci-jessie-wikimedia-815477
  • 11:09 hashar: Image snapshot-ci-jessie-1505213295 in wmflabs-eqiad is ready
  • 10:48 hashar: nodepool: force updating jessie image to grab php5.5-luasandbox - T161882 T174972

2017-09-11

  • 23:27 thcipriani: restarting jenkins
  • 22:38 legoktm: deploying https://gerrit.wikimedia.org/r/377361
  • 12:47 hashar: Nodepool: refreshing jessie snapshot to get php5.5-luasandbox installed

2017-09-10

  • 01:44 bd808: nodepool running steadily again, but has been heavily throttled to hopefully prevent another weekend thundering herd of doom failure for the OpenStack backend

2017-09-09

  • 22:15 bd808: `sudo journalctl -u nodepool --since today --no-pager` shows many LaunchStatusException failures.

2017-09-07

  • 13:02 hashar: nodepool: Image snapshot-ci-jessie-1504788047 in wmflabs-eqiad is ready | T174972
  • 11:58 hashar: nodepool: updating snapshot-ci-jessie to add php5.5-redis | T161882 T174972
  • 11:10 addshore: Reloading Zuul to deploy "Add gate-submit jobs for analytics/wmde/* repos"
  • 02:44 legoktm: deploying https://gerrit.wikimedia.org/r/376460

2017-09-06

2017-09-05

2017-09-04

  • 15:59 zeljkof: Reloading Zuul to deploy ca1c6ec
  • 12:21 hashar: Image snapshot-ci-jessie-1504527142 in wmflabs-eqiad is ready
  • 11:37 hashar: nodepool: refreshing jessie snapshot
  • 10:03 addshore: Reloading Zuul to deploy mwext-php70-phan-jessie-docker experimental job
  • 00:42 legoktm: legoktm@contint1001:/srv/zuul/git/mediawiki/libs$ sudo -u zuul rm -rf XMPReader

2017-09-02

  • 08:32 legoktm: rm -rf /var/logs/kafka on deployment-kafka01 to free up disk space

2017-08-31

2017-08-30

  • 12:49 hashar: gerrit: marked wikimedia/communications/WMBlog as read-only - T172372

2017-08-29

  • 15:39 hashar: Created integration-slave-jessie-php55 to try out a php5.5 package on Jessie - T161882
  • 15:06 hashar: nodepool: deleting alien instance: openstack server delete ci-jessie-wikimedia-793795
  • 08:45 hashar: Restarting Jenkins for openjdk update
  • 08:11 hashar: refreshing all Jenkins jobs with a newer version of JJB

2017-08-28

2017-08-25

  • 15:11 zeljkof: Reloading Zuul to deploy b6704e2

2017-08-24

2017-08-22

2017-08-21

  • 18:18 mutante: addshore is now a contint-admin

2017-08-18

2017-08-16

2017-08-15

2017-08-14

  • 09:46 TabbyCat: maurelio@deployment-tin:/srv/mediawiki/dblists$ expanddblist flow-computed > /home/maurelio/flow-test.dblist (to test expanddblist for a patch I am working on)

2017-08-11

  • 20:25 addshore: added mediawiki::maintenance::wikidata to deployment-tin

2017-08-07

  • 15:11 thcipriani: restarting jenkins for plugin update

2017-08-06

  • 13:28 TabbyCat: Ran mwscript extensions/WikimediaMaintenance/dumpInterwiki.php deploymentwiki on the beta cluster

2017-08-04

2017-08-03

  • 12:02 hashar: Added integration-slave-docker-1004 to the pool of jenkins slaves - T150502
  • 10:12 hashar: gerrit: marked wikimedia/communications/WP-Victor read-only and [ARCHIVED] - T107430
  • 04:50 SMalyshev: update cherry-pick for https://gerrit.wikimedia.org/r/#/c/299825/8 on deployment-puppetmaster02.deployment-prep.eqiad.wmflabs

2017-08-02

  • 22:08 MaxSem: Running rebuildall.php on beta ruwiki
  • 20:17 bearND: Update mobileapps to 2d8e8f6
  • 11:31 hashar: Image snapshot-ci-jessie-1501673225 in wmflabs-eqiad is ready T169602
  • 10:51 hashar: Image snapshot-ci-jessie-1501670727 in wmflabs-eqiad is ready - T169602
  • 09:02 hashar: Regenerating Nodepool Jessie image from scratch to get rid of tox 1.9.2 installed under /usr/local - T169602
  • 08:44 hashar: Image snapshot-ci-jessie-1501662758 in wmflabs-eqiad is ready - T169602
  • 08:42 hashar: - T169602
  • 08:32 hashar: Regenerating Nodepool jessie image to upgrade tox from 1.9.2 to 2.5.0 - T169602

2017-08-01

  • 15:45 hashar: Image snapshot-ci-jessie-1501601670 in wmflabs-eqiad is ready && purging old instances T161861
  • 15:44 hashar: Debug: Executing '/usr/bin/npm install -g npm@3.8.3' - T161861
  • 15:34 hashar: Refreshing nodepool Jessie image to bump npm from 2.x to 3.8.x T161861
  • 10:12 hashar: Stopped Zuul / CI for mass mediawiki extension changes

2017-07-28

  • 21:11 MaxSem: Dropped table wikigrok_questions from beta enwiki
  • 12:19 zeljkof: Reloading Zuul to deploy 47a07e0
  • 00:17 Krinkle: Testing job insertion on beta cluster from deployment-tin triggers PHP Notice: Undefined index: uuid in EventBus/JobQueueEventBus.php:102, PHP Notice: Undefined index: sha1 in EventBus/JobQueueEventBus.php:99

2017-07-26

  • 21:35 Reedy: kill two long running update.php jobs on deployment-tin
  • 13:39 zeljkof: Reloading Zuul to deploy 8787b4b
  • 12:04 zeljkof: Reloading Zuul to deploy 79781d8
  • 11:39 zeljkof: Reloading Zuul to deploy 723ab49
  • 11:31 hashar: realign installed debian packages on integration-slave-jessie-1001 and integration-slave-jessie-1002 - T171724
  • 09:25 hashar: deployment-tin deleting temporary l10n cache from July 19th 20:09 at /tmp/scap_l10n_3608512748 1.5G
  • 09:24 hashar: deployment-cache-upload04 deployment-cache-text04 upgraded logster 0.0.10-1~jessie1 -> 0.0.10-2~jessie1 - T171318

2017-07-25

2017-07-24

  • 21:56 bearND: Update mobileapps to b608ec8
  • 15:03 hashar: Added webperformance Jenkins slave https://integration.wikimedia.org/ci/computer/webperformance/ with a single executor - T166756
  • 14:57 hashar: recreating integration-webperf instance as simply "webperformance". Same 2CPU / 2GB RAM / 40G disk - T166756
  • 14:40 hashar: Booting integration-webperf instance 2CPU / 2GB RAM / 40G disk. Intended to host webperformance long running jobs . T166756
  • 11:02 hashar: Removing profile::swift::storage::labs class from deployment-ms-be03 and deployment-ms-be04 to let puppet run. Reapplying it after. - T171174 T171454
  • 10:59 hashar: Removing class from deployment-trending01 to let puppet run. Reapplying it after. - T171174
  • 10:54 hashar: Removing classes from deployment-sca02 and deployment-sca03 to let puppet run. Reapplying it after. - T171174
  • 10:32 hashar: Removing profile::etcd from deployment-conf03 to let puppet run. Reapplying it after. - T171174
  • 10:12 hashar: Removing role::mathoid from deployment-mathoid to let puppet run. Reapplying it after. - T171174
  • 10:09 hashar: Removing role::changeprop from deployment-changeprop to let puppet run. Reapplying it after. - T171174
  • 10:06 hashar: Removing role::ocg from deployment-mcs01 to let puppet run. Reapplying it after. - T171174
  • 10:02 hashar: Removing role::mobileapps from deployment-mcs01 to let puppet run. Reapplying it after. - T171174

2017-07-21

2017-07-20

  • 16:42 hashar: How to fix ssh access on beta cluster instances: https://phabricator.wikimedia.org/T171174#3456966
  • 15:30 hashar: deployment-prep : removing project wide puppet classes from https://horizon.wikimedia.org/project/puppet/ All are role::eventlogging::analytics::*
  • 15:08 hashar: removed profile::recommendation_api from deployment-sca01 to try to fix the ssh access for mobrovac T171173 T171174
  • 14:57 zeljkof: reloading Zuul to deploy 80b9d85
  • 14:31 hashar: deployment-prep: manually cleaned out the puppet master configuration. It was all screwed up. Notably I removed bits about the puppetdb
  • 10:20 zeljkof: Reloading Zuul to deploy 80b9d85
  • 09:17 hashar: Spawning and pooling integration-slave-docker-1003 as replacement to integration-slave-docker-1000 (broken) - T150502
  • 09:03 hashar: Restoring castor by updating all jobs to point to castor02 ( https://gerrit.wikimedia.org/r/366524 ) Starts with a cold cache :( - T171148
  • 08:53 hashar: Created castor02.integration.eqiad.wmflabs with puppet role role::ci::castor::server and adding it to Jenkins. Will then update the Jenkins jobs to point to it - T171148
  • 08:00 hashar: Disabled castor entirely via https://gerrit.wikimedia.org/r/366520 . The instance is broken - T171148
  • 07:55 hashar: Refreshing all Jenkins jobs defined in JJB in order to then disable castor entirely for T171148
  • 07:09 _joe_: rebooting castor, jobs are failing, and no one seems able to login
  • 07:05 _joe_: adding myself to projectadmins for integration, trying to troubleshoot castor
  • 01:38 thcipriani: scap on beta was failing because during the ldap downtime puppet created a shadow mwdeploy user, fixed using vipw and vigr

2017-07-19

  • 14:43 hashar: Jenkins: uploaded a patched android-emulator plugin for T150623 and restarting Jenkins
  • 13:55 hashar: Jenkins: added JDK "Debian - OpenJdk 7" with JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
  • 12:54 hashar: Gerrit: created repo integration/jenkinsci/android-emulator-plugin.git owned by access group integration-jenkinsci-android-emulator-plugin which has Mholloway - T170904

2017-07-18

  • 16:26 halfak: manually restarted uwsgi-ores and celery-ores-worker on deployment-sca03
  • 16:19 halfak: manually installed "aspell-el" on deployment-sca03 (work around for ongoing puppet issues)
  • 09:04 hashar: deleted integration-slave-trusty-1006
  • 03:57 twentyafterfour: Fixed deployment-imagescaler01 by cherry-picking https://gerrit.wikimedia.org/r/#/c/365891/ on deployment-puppetmaster02

2017-07-17

2017-07-14

  • 20:16 Amir1: cpan[1]> install LWP::UserAgent on tin

2017-07-13

  • 17:04 thcipriani: restarting jenkins for updates

2017-07-12

  • 20:07 bearND: Update mobileapps to d30dae2
  • 18:19 greg-g: where "things" == nodepool instance delete/creation
  • 18:18 greg-g: things are back to a bad state, chase etc investigating
  • 17:52 greg-g: nodepool is back to making instances and running jobs, thanks Cloud team
  • 17:22 greg-g: CI is backed up, only one nodepool instance running for the last long while, many stuck in building
  • 00:35 legoktm: deploying https://gerrit.wikimedia.org/r/364628

2017-07-11

2017-07-09

  • 01:15 Amir1: ladsgroup@deployment-tin:~$ mwscript extensions/ORES/maintenance/CheckModelVersions.php --wiki=enwiki (T170026, T165716)

2017-07-07

2017-07-06

  • 17:28 thcipriani: committed changes to modules/kafkatee on deployment-puppetmaster02 since having them uncommitted broke git-sync-upstream
  • 16:20 hashar: Deleting Nodepool snapshot snapshot-ci-jessie-1499350442 - faulty php7.0-sqlite package that breaks phan jobs - T169904
  • 15:29 hashar: deployment-cache-upload04 manually ran apt-get upgrade to downgrade ldap-utils and libldap-2.4-2 (caused puppet failure)
  • 14:14 hashar: regenerating mediawiki-core-qunit-selenium-jessie jenkins job
  • 12:05 hashar: deployment-prep created Web proxy for recommendation-api-beta.wmflabs.org -> http://10.68.20.183:9632 (deployment-sca01) for schana
  • 02:38 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/363519

2017-07-04

  • 14:10 hashar: manually upgraded apache2 on deployment-puppetmaster02 see T159254
  • 13:33 hashar: beta cluster puppet is broken: Error: Could not send report: Connection refused - connect(2) for "deployment-puppetmaster02.deployment-prep.eqiad.wmflabs" port 8140
  • 09:28 hashar: gerrit: marking read-only mediawiki/extensions/Nonlinear - T169519

2017-07-03

2017-06-30

  • 08:16 hashar: Gerrit: changing repos to read-only: analytics/kraken analytics/kraken/deploy analytics/vagrant/kraken - T169303

2017-06-29

2017-06-28

  • 15:55 hashar: beta: git gc mediawiki repos in /srv/mediawiki-staging
  • 15:47 hashar: beta: git -C /srv/deployment/ores/deploy/submodules/editquality gc (saving 380MBytes)
  • 15:33 hashar: running git gc under /srv/mediawiki-staging
  • 14:43 hashar: pypi.python.org is back again - T169091
  • 14:33 elukey: running alter tables on the EL database in deployment-eventlogging03.deployment-prep.eqiad.wmflabs
  • 14:06 hashar: pypi.python.org has an issue with its CDN . That would affect any CI jobs relying on tox/python - See https://status.python.org for updates and T169091
  • 14:04 hashar: pypi.python.org has an issue with its CDN . That would affect any CI jobs relying on tox/python - See https://status.python.org for updates
  • 10:06 hashar: Unblocked beta cluster jenkins job. Have been stalled for a while
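A minimal sketch of what the `git gc` runs above do (the 15:47 one reclaimed ~380 MB): loose objects get repacked into a single packfile. This uses a throwaway repo, not the real /srv/mediawiki-staging or ores checkouts.

```shell
# Hypothetical illustration of the `git gc` housekeeping, in a temp repo.
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=sal@example -c user.name=sal \
    commit -q --allow-empty -m init
git -C "$repo" gc --quiet
# After gc, objects live in a packfile rather than as loose files:
ls "$repo"/.git/objects/pack/ | grep -c '\.pack$'   # -> 1
```

On large deployment checkouts the same command collapses thousands of loose objects accumulated by frequent fetches.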

2017-06-27

  • 22:58 Amir1: cherry-picking gerrit:360891/3
  • 22:42 Amir1: cherry-picking gerrit:360891/2
  • 21:58 Amir1: mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php --wiki wikidatawiki --new-data-type 'string' --property-id P34
  • 18:31 hashar: Image snapshot-ci-jessie-1498587497 in wmflabs-eqiad is ready - T169004
  • 18:18 hashar: Regenerating Jessie nodepool image to hopefully bring back hhvm-tidy package - T169004
  • 17:39 Amir1: running mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php --wiki wikidatawiki --new-data-type 'external-id' --property-id P34

2017-06-26

  • 22:24 halfak: deploying ores-prod-deploy:82dfd56 to beta (note: T168099)
  • 22:20 halfak: deploying ores-prod-deploy:82dfd56 to beta
  • 20:33 bearND: Update mobileapps to 0b05026
  • 18:44 hashar: nodepool image-delete 1636 # Deletes snapshot-ci-trusty-1498491445, which lacks nodejs while we still need it.
  • 18:23 twentyafterfour: renamed previously active image to 'image-ci-trusty_bad_20170626'
  • 18:22 twentyafterfour: reverted nodepool image-ci-trusty to previous version 'image-ci-trusty-old_20170626'
  • 15:41 hashar: Image snapshot-ci-trusty-1498491445 in wmflabs-eqiad is ready
  • 15:34 hashar: Rebuilding nodepool image for trusty and regenerating snapshots
  • 09:19 hashar: gerrit: marked wikimedia/bugzilla/* repos read-only

2017-06-24

  • 06:02 legoktm: deployment-flourine02 /srv partition is alerting on low disk space but once logs get automatically gzip'd it should be fine

2017-06-23

  • 20:59 hasharAway: deployment-db03 reinstall ldap-utils, libldap-2.4-2 2.4.44+dfsg-4~bpo8+1 > 2.4.41+dfsg-1+wmf1
  • 20:54 hasharAway: apt-get upgrade deployment-elastic06

2017-06-22

  • 19:02 Amir1: cherry-picking gerrit:360891/1 (T163922)
  • 13:35 hashar: Gerrit: adding Bearloga (Mikhail Popov) to the 'search' group . That also makes him an owner to wikimedia/discovery/* - T168588
  • 13:35 hashar: Gerrit: adding Bearloga (Mikhail Popov) to the 'search' group . That also makes him an owner to wikimedia/discovery/*
  • 08:18 hashar: deployment-prep: removed /etc/apt/preferences.d/puppet.pref which was pinning puppet packages to jessie-backports and hence 4.8.x! - T168511
  • 08:16 hashar: deployment-prep: removed /etc/apt/preferences.d/puppet.pref which was pinning puppet packages to jessie-backports and hence 4.8.x!
  • 08:12 hashar: deployment-prep: upgraded puppet to 3.8.5 on all instances

2017-06-21

  • 20:03 bearND: Update mobileapps to 21f771d
  • 19:54 hashar: deployment-tin stopped keyholder and armed it
  • 19:25 hashar: hard rebooting deployment-db04
  • 19:20 hashar: hard rebooting deployment-db03
  • 18:52 hashar: Removing /etc/apt/sources.list.d/wikimedia_mariadb.list (content: deb http://apt.wikimedia.org/wikimedia precise-wikimedia mariadb )
  • 18:51 hashar: fixing up apt config on deployment-db03 and deployment-db04 / upgrade packages and kernel / reboot
  • 17:02 hashar: upgrading kernel and puppet on deployment-mcs01 deployment-restbase01 and deployment-restbase02 - T168541
  • 17:00 hashar: upgrading kernel and puppet on deployment-changeprop and deployment-conf03 - T168541
  • 16:56 hashar: upgrading kernel and puppet on deployment-aqs01 deployment-aqs02 and deployment-aqs03 - T168541
  • 16:38 hashar: rebooting deployment-cache-upload04 and deployment-cache-text-04 - T168541
  • 16:29 hashar: upgrading deployment-apertium02 and deployment-eventlogging04 - T168541
  • 16:23 hashar: upgrade and reboot deployment-prometheus01
  • 16:11 hashar: rebooting deployment-ms-fe02
  • 16:11 hashar: rebooting deployment-ms-be04
  • 16:09 hashar: rebooting deployment-ms-be03
  • 16:03 hashar: upgrading deployment-ms-fe02 deployment-ms-be03 and deployment-ms-be04
  • 15:57 hashar: apt-get upgrade and reboot of deployment-memc04 and deployment-memc05
  • 15:52 hashar: rebooting deployment-etcd-01
  • 15:48 hashar: apt-get upgrade deployment-etcd-01
  • 15:35 hashar: deployment-prep changing Varnish director for citoid from citoid.wmflabs.org to citoid-beta.wmflabs.org ( via https://horizon.wikimedia.org/project/prefixpuppet/ ) - T168519
  • 14:41 hashar: deployment-tmh01 is down for some reason
  • 14:21 hashar: deployment-prep: force running puppet on all instances
  • 14:17 hashar: finally fixed puppet on deployment-prep !
  • 14:02 hashar: deployment-puppetmaster (cd /etc/puppet && ln -s /var/lib/git/operations/puppet/manifests && ln -s /var/lib/git/operations/puppet/modules)
  • 13:26 hashar: deployment-prep: puppet master got erroneously upgraded to puppet 4.8. Rolled it back to 3.8, which failed, and then back to 3.7!
  • 12:47 hashar: broke deployment-prep puppet master while upgrading it :(
  • 12:28 hashar: deployment-imagescaler01 removed puppetmaster and puppetmaster-common packages
  • 12:04 hashar: apt-get dist-upgrade on deployment-mediawiki hosts
  • 11:59 hashar: armed keyholder on deployment-tin and deployment-mira
  • 11:15 hashar: deployment-cache-text04 : apt-get dist-upgrade
  • 11:12 hashar: varnish fails on deployment-cache-text04
  • 11:08 hashar: deployment-prep : rebooting deployment-tin deployment-mira deployment-cache-text04 deployment-cache-upload04
  • 11:00 hashar: deployment-prep apt-get upgrade and reboot all hosts
  • 10:21 hashar: deployment-zotero01 apt-get upgrade and rebooted
  • 09:59 hashar: integration: removing swift / python-swift from integration-puppetmaster01
  • 09:57 hashar: Upgrading puppet 3.7.2 .. 3.8.5 on integration-slave-docker-1001 and integration-slave-docker-1002
  • 09:39 hashar: integration: deleting unused swift and swift-storage-01 instances
  • 09:38 hashar: Upgrading/rebooting all instances in the integration project to catch up with Linux kernel upgrades

2017-06-20

  • 19:25 hashar: Nodepool rate being bumped from 1 query per 6 seconds to 1 query per 5 seconds ( https://gerrit.wikimedia.org/r/#/c/358601/ )
  • afk: deployment-tin stuck on post-merge queue for the past 13 hours, unstuck now

2017-06-19

2017-06-18

  • 19:26 Reedy: Re-enabled beta-update-databases-eqiad as wikidatawiki takes < 10 minutes T168036 T167981
  • 19:25 Reedy: A lot of items on beta wikidatawiki deleted T168036 T167981

2017-06-16

  • 23:41 Reedy_: also deleting a lot of Property:P* pages on beta wikidatawiki T168106
  • 22:55 Reedy: deleting Q100000-Q200000 on beta wikidatawiki T168106
  • 19:04 Reedy: disabled beta-update-databases-eqiad because it's not doing much useful atm
  • 14:56 zeljkof: Reloading Zuul to deploy 18a50a7
  • 14:40 hashar: integration-slave-jessie-1001 apt-get upgrade to downgrade python-pbr to 0.8.2 as pinned since T153877. /usr/bin/unattended-upgrade magically upgraded it for some reason
  • 06:49 Reedy: script is up to `Processed up to page 336425 (Q235372)`... hopefully it's finished by morning
  • 03:13 Reedy: running `mwscript extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki=wikidatawiki` in screen as root on deployment-tin for T168036
  • 03:10 Reedy: running `mwscript extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki` in screen as root on deployment-tin for T168036
  • 02:23 Reedy: cherry-picked https://gerrit.wikimedia.org/r/#/c/354932/ onto beta puppetmaster

2017-06-15

  • 16:34 RainbowSprinkles: deployment-prep: Disabled database updates for a while, running it by hand
  • 10:39 hashar: apt-get upgrade on deployment-tin
  • 00:52 thcipriani: deployment-tin jenkins agent borked for 4 hours, should be fixed now

2017-06-14

2017-06-13

  • 22:05 hashar: Zuul restarted manually from a terminal on contint1001. It does not have any statsd configuration, so we will miss metrics for a bit until it is restarted properly.
  • 21:13 hashar: Gracefully restarting Zuul
  • 20:37 hashar: Restarting Nodepool. Apparently confused in pool tracking and spawning too many Trusty nodes (7 instead of 4)
  • 20:31 hashar: Nodepool: deleted a bunch of Trusty instances. It had scheduled a lot of them, which were taking slots in the pool. Better to spawn jessie nodes instead, since demand for them is high
  • 20:19 hashar: deployment-prep: added Polishdeveloper to the "importer" global group. https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:GlobalUserRights/Polishdeveloper - T167823
  • 18:47 andrewbogott: root@deployment-salt02:~# salt "*" cmd.run "apt-get -y install facter"
  • 18:46 andrewbogott: using salt to "apt-get -y install facter" on all deployment-prep instances
  • 18:38 andrewbogott: restarting apache2 on deployment-puppetmaster02
  • 18:37 andrewbogott: doing a git fetch and rebase for deployment-puppetmaster02
  • 17:00 elukey: hacking apache on mediawiki05 to test rewrite rules
  • 16:04 Amir1: cherry-picked 357985/4 on puppetmaster
  • 15:59 halfak: deployed ores-prod-deploy:862aea9
  • 13:47 hashar: nodepool force running puppet for: lower min-ready for trusty [puppet] - https://gerrit.wikimedia.org/r/356466
  • 10:53 elukey: rolling restart of all kafka brokers to pick up the new zookeper change (only deployment-zookeeper02 available)
  • 10:36 elukey: delete deployment-zookeeper01 (old trusty instance, replaced with a jessie one)
  • 09:50 elukey: big refactoring for zookeeper merged in operations/puppet - https://gerrit.wikimedia.org/r/#/c/354449 - ping the Analytics team for any issue

2017-06-12

  • 14:22 hashar: Image snapshot-ci-trusty-1497276913 in wmflabs-eqiad is ready
  • 14:15 hashar: Nodepool: regenerating Trusty images to confirm that removal of keystone admin_token is a noop for nodepool - T165211
  • 12:44 hashar: Image snapshot-ci-jessie-1497270581 in wmflabs-eqiad is ready
  • 12:30 hashar: nodepool: refreshing Jessie snapshot to upgrade HHVM from 3.12 to 3.18 - T167493 T165074
  • 08:47 hashar: deployment-prep : salt -v '*' cmd.run 'apt-get clean'

2017-06-09

2017-06-07

  • 17:49 elukey: forced /usr/local/bin/git-sync-upstream manually on puppetmaster02
  • 17:30 elukey: manually fixed rebase issue for operations/puppet on puppetmaster02 (empty commit due to the change for scap3 and jobrunners)
  • 09:33 elukey: restart kafka brokers to pick up the new zookeeper settings
  • 09:00 elukey: adding deployment-zookeeper02.eqiad.wmflabs to Hiera:deployment-prep
  • 08:43 gehel: upgrading kibana to v5.3.3 on deployment-logstash2
  • 08:35 gehel: rolling back to kibana 5.3.2, incompatible elasticsearch version
  • 08:28 gehel: upgrading kibana to v5.4.1 on deployment-logstash2

2017-06-06

  • 14:34 hashar: deleting buildlog.integration.eqiad.wmflabs; it was meant to receive Jenkins logs in ElasticSearch. We are experimenting with relforge1001.eqiad.wmnet now - T78705
  • 12:37 hashar: Removing HHVM from permanent Trusty slaves
  • 10:44 elukey: running eventlogging_cleaner.py (https://gerrit.wikimedia.org/r/#/c/356383/) on eventlogging to test the cleaning of old events
  • 09:24 hashar: Deleting deployment-phab02 instance. Has been shut off since April 23rd - T167090
  • 07:51 hashar_: Fixed puppet on deployment-aqs instances

2017-06-05

  • 15:38 elukey: manually hacking deployment-jobrunner02.deployment-prep.eqiad.wmflabs to test a new config

2017-06-02

  • 19:51 hashar: integration: granted ebernhardson sudo
  • 12:12 hashar: jenkins: rebuild logstash plugin from HEAD of master for jenkins 2 back compat. logstash-1.2.0-4-gbcbc19e - T78705

2017-06-01

  • 20:14 bearND: Update mobileapps to c4dc72d
  • 20:12 mdholloway: killed the running emulator processes on integration-slave-jessie-android to get it booting again following yesterday's gerrit outage
  • 13:39 hashar: Gerrit: change integration.git project to "Rebase if Necessary" with "Allow content merges" - T131008
  • 13:10 hashar: Gerrit allow content merge for integration/config ( https://gerrit.wikimedia.org/r/#/admin/projects/integration/config ) - T131008
  • 08:03 hashar: Purged all mysql bin files from deployment-db03 ( rm -fR /srv/sqldata/T166060 ) - T166060

2017-05-31

  • 20:21 hashar: Jenkins: upgrading git-client-plugin 2.4.5..2.4.6 T166557
  • 07:50 hashar: deployment-db04: mysql> set global expire_logs_days = 7 - to expire bin logs faster (instead of 30 days) - T166060
  • 07:49 hashar: deployment-db03: mysql> set global expire_logs_days = 7 - to expire bin logs faster (instead of 30 days) - T166060

2017-05-30

  • 22:08 hasharAway: Changed integration/config.git submit type from "Fast forward only" to "Rebase if Necessary" T131008

2017-05-29

  • 14:44 elukey: reverted previous config on redis01
  • 14:36 elukey: set redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" -p 6381 config set tcp-keepalive 300 on redis01 as test (rollback: redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" -p 6381 config set tcp-keepalive 0)
  • 10:22 hashar: force refreshed Nodepool Trusty images. Was stuck somehow
  • 10:06 hashar: deployment-tin rm -fR /usr/src/hhvm T166492
  • 09:51 hashar: deployment-tin: rm /var/lib/l10nupdate/caches/cache-master/*.json T166492
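The 14:36 redis entry above buries a handy trick: a PCRE lookbehind (`grep -Po`) extracts the value following `masterauth ` from the config file, so the secret is never typed on the command line. A self-contained illustration with a made-up config file and password (the real path is /etc/redis/tcp_6379.conf):

```shell
# Illustration of the lookbehind extraction; file and password are invented.
set -e
conf=$(mktemp)
printf 'port 6381\nmasterauth s3cret\ntcp-keepalive 0\n' > "$conf"
pass=$(grep -Po '(?<=masterauth ).*' "$conf")
echo "$pass"   # -> s3cret
```

Note `-P` (Perl regexes) is a GNU grep extension; on BSD grep the equivalent would need `sed` or `awk`.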

2017-05-26

  • 09:20 elukey: installing hhvm_3.18.2+dfsg-1+wmf4+exp1_amd64.deb on jobrunner02
  • 07:20 elukey: hacking on jobrunner02 in deployment-prep
  • 01:28 bearND: Update mobileapps to db6493c

2017-05-25

  • 19:46 hashar: deployment-tin manually cleaning disk space
  • 16:44 elukey: restored hhvm on jobrunner02
  • 16:03 bearND: Update mobileapps to 946fe1f
  • 10:33 elukey: manual install of hhvm_3.18.2+dfsg-1+wmf4+exp1_amd64.deb on jobrunner02 to test a fix for the Redis.php lib
  • 02:46 RainbowSprinkles: running `mwscript extensions/Flow/maintenance/FlowUpdateUserWiki.php --wiki=enwiki` in a screen on deployment-tin, probably going to take all night

2017-05-24

  • 16:04 hashar: rebooting integration-slave-trusty-1003 to catch up with kernel upgrade
  • 12:22 hashar: deployment-prep: finished rebase of puppet.git
  • 10:19 hashar: deployment-prep rebased puppet repo with: git rebase -X theirs
  • 10:10 hashar: deployment-prep : resetting puppet master to last known snapshot snapshot-20170523T0010 . All cherry picks got deleted
  • 10:09 hashar: deployment-etcd-01: fixed puppet run
  • 08:38 moritzm: updated puppet on deployment-puppetmaster02 to 3.8.5-2~bpo8+2
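The 10:19 entry's `git rebase -X theirs` deserves a note: during a rebase, "theirs" is the commit being replayed, so conflicts are resolved in favour of the local cherry-picks rather than upstream. A sketch in a temp repo (standing in for the puppet.git checkout):

```shell
# Sketch of `git rebase -X theirs`: on conflict, keep the replayed commit's side.
set -e
repo=$(mktemp -d)
g() { git -C "$repo" -c user.email=sal@example -c user.name=sal "$@"; }
g init -q
echo base > "$repo/f"; g add f; g commit -qm base
g branch -m main
g branch cherry                      # branch carrying a local change
echo upstream > "$repo/f"; g commit -qam upstream
g checkout -q cherry
echo cherry > "$repo/f"; g commit -qam cherry
g rebase -q -X theirs main           # conflict auto-resolved to "cherry"
cat "$repo/f"   # -> cherry
```

This is why the rebase completed without conflict prompts, at the cost of silently discarding upstream's conflicting hunks.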

2017-05-23

  • 16:55 RainbowSprinkles: there was no data
  • 16:55 RainbowSprinkles: dropped flow_ext_ref from commonswiki on beta. schema migration is busted, going to let it recreate table
  • 08:20 hashar: Updating Nodepool snapshot-ci-trusty
  • 08:19 hashar: Regenerated Nodepool base image for Trusty. Got rid of hhvm from it

2017-05-22

  • 12:11 greg-g: ran git prune and rm'd the gc.log file
  • 11:40 greg-g: gjg@deployment-tin:/srv/mediawiki/.git/gc.log has warning: There are too many unreachable loose objects; run 'git prune' to remove them.
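A small reproduction of the remedy in the entries above: `git prune` deletes unreachable loose objects, which is what the gc.log warning asks for. Run here in a temp repo, not the real /srv/mediawiki/.git:

```shell
# Create an unreachable loose object, then prune it away.
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
blob=$(echo orphan | git -C "$repo" hash-object -w --stdin)
git -C "$repo" cat-file -e "$blob"        # object exists, nothing references it
git -C "$repo" prune --expire=now
if git -C "$repo" cat-file -e "$blob" 2>/dev/null; then echo kept; else echo pruned; fi
# -> pruned
```

Without `--expire=now`, prune only removes unreachable objects older than the default grace period (two weeks).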

2017-05-21

  • 12:05 Reedy: deployment-tin is back online
  • 10:41 Reedy: disabled jenkins on deployment-tin again
  • 09:10 greg-g: beta-update-database-eqiad has been hitting the timelimit since May 19th
  • 09:02 Reedy: brought deployment-tin back online a while ago

2017-05-20

  • 09:10 greg-g: executers are running again
  • 09:02 greg-g: All executers in Jenkins are "offline" including the permament ones

2017-05-19

  • 19:05 mutante: fixing role class config on deployment-phab* (remove role::phabricator::main, add role::phabricator_server in context prefix "deployment-phab"); remove again from instance level for phab-01
  • 18:40 mutante: deployment-phab01 still has puppet error "Could not find class role::phabricator::main" and that should simply be removed from it, but i can NOT find it in Horizon, i checked instance config, project config, the "Other" section, the "All classes" tab. Because it's gone. But how do i fix the instance config then?
  • 18:39 mutante: applying role::phabricator_server on instance deployment-phab01 (it had error, could not find role::phabricator::main and the name changed in role/profile conversion)

2017-05-15

  • 10:46 addshore: enabled beta-code-update-eqiad for some testing
  • 10:38 addshore: temporarily disabled beta-code-update-eqiad for some testing

2017-05-13

  • 20:31 bd808: Deleted stuck mediawiki-core-doxygen-publish job. Jenkins had it marked for a particular nodepool instance that was offline.

2017-05-12

  • 13:12 hashar: Trying to refresh Nodepool Jessie image. Should get HHVM pinned to 'experimental' component => 3.12.x

2017-05-11

  • 20:43 hashar: nodepool: deleted today's jessie image snapshot. It comes with HHVM 3.18, which segfaults with MediaWiki/PHPUnit. Rolled back to snapshot-ci-jessie-1494425642 from 30 hours ago. T165074
  • 12:57 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/353282/

2017-05-10

  • 20:28 bearND: Update mobileapps to 75b135e
  • 18:32 mutante: deployment-tin/mira: the role class name change was due to https://gerrit.wikimedia.org/r/#/c/344728/ which moved deployment::server into the profile/role structure. Both instances are configured accordingly now; the remaining issue with "id_rsa.bromine" should be unrelated
  • 18:28 mutante: deployment-mira: configured puppet in Horizon, removed "role::deployment::server", used the correct new name "role::deployment_server" (moved to profile). (A bit tricky, because in Horizon it then seems to disappear from the "others" section, but the "all" tab still shows the class names)
  • 18:12 mutante: deployment-tin: puppet run now ok, except ":Upload/File[/var/lib/releases/.ssh/id_rsa.bromine.eqiad.wmnet]: Could not evaluate:" this should be an unrelated issue
  • 18:05 mutante: deployment-tin: configure to use role::deployment_server (instead of deployment::server), for some reason now Horizon shows _nothing_ under "other classes" where this was before
  • 17:58 mutante: deployment-tin: deleting puppet lock file (claimed it was running, but also hadn't run in > 900 min), looking at fixing the deployment::server role name change
  • 15:26 elukey: refresh cherry pick gerrit/352582 on puppet master (rebase -i to remove, then cherry pick)
  • 14:34 elukey: cherry pick gerrit/352582 to puppet master
  • 12:35 hashar: deployment-prep: git -C /srv/mediawiki-staging/php-master/extensions rm --cached SemanticFormsInputs
  • 08:04 hashar: merging 'composer test' into mwext-testextension-* jobs https://gerrit.wikimedia.org/r/#/c/352160/ - T161895
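The cherry-pick workflow in the 14:34 and 15:26 entries (apply a change to the puppet master, later drop it via `rebase -i` and re-pick) boils down to replaying one commit onto the tracking branch. A sketch in a temp repo; the branch names and the change are invented, and the real change (gerrit/352582) is not reproduced here:

```shell
# Hypothetical cherry-pick of a single commit onto the deployed branch.
set -e
repo=$(mktemp -d)
g() { git -C "$repo" -c user.email=sal@example -c user.name=sal "$@"; }
g init -q
echo base > "$repo/f"; g add f; g commit -qm base
g branch -m production
g checkout -qb change
echo fix > "$repo/f"; g commit -qam fix
fix=$(g rev-parse HEAD)
g checkout -q production
g cherry-pick "$fix" >/dev/null      # replay the change onto production
cat "$repo/f"   # -> fix
```

Refreshing a pick to a newer patchset then means removing the old commit (interactive rebase) before picking the new one, as logged at 15:26.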

2017-05-09

  • 12:44 hashar: deployment-ircd upgrading puppet 3.7.2 => 3.8.5
  • 12:19 hashar: Unbroke puppet on deployment-irc and deployment-urldownloader . Both choked on a ruby one-liner, fixed via https://gerrit.wikimedia.org/r/#/c/336840/

2017-05-08

2017-05-06

2017-05-05

2017-05-04

  • 10:47 hashar: puppet ca destroy deployment-zookeeper01.eqiad.wmflabs
  • 10:46 hashar: puppet ca destroy deployment-ores-redis-02.deployment-prep.eqiad.wmflabs (no such instance)
  • 10:46 hashar: puppet ca sign deployment-ores-redis-02.deployment-prep.eqiad.wmflabs
  • 10:39 hashar: Removing puppetmaster: puppetmaster.thumbor.eqiad.wmflabs from deployment-imagescaler01 - T153319
  • 10:37 hashar: deployment-prep: force recompilation of puppet.conf : salt -v '*' cmd.run 'echo >> /etc/puppet/puppet.conf.d/10-main.conf' - T153319
  • 10:37 hashar: deployment-prep: force recompilation of puppet.conf : salt -v '*' cmd.run 'echo >> /etc/puppet/puppet.conf.d/10-main.conf'
  • 10:31 hashar: deployment-phab01 / deployment-imagescaler01 rm /etc/puppet/puppet.conf.d/10-self.conf - T153319
  • 10:29 hashar: Unbroke puppet on deployment-imagescaler01 and removing role::puppetmaster::self - T153319
  • 10:16 hashar: Unbroke puppet on deployment-phab01 - T153319
  • 07:30 hashar: deployment-prep: adding TTO (This, that and the other) as a project member to grant shell access - T163887

2017-05-03

  • 17:39 mdholloway: (this concerns integration-slave-jessie-android)
  • 17:37 mdholloway: enabled automatic Android component installation for the Android Gradle plugin, rebuilt the SDK, and deleted the old one
  • 15:54 hashar: Granted sudo right for Niedzielski accounts on Android CI slave. Already has it with the other labs account Sniedzielski - T164388
  • 15:38 hashar: Granted mdholloway (mobile team) full sudo access on integration labs project so he can reach integration-slave-jessie-android - T164388

2017-05-02

  • 21:14 hashar: Manually cancelled a few mediawiki-core-jsduck-publish and mediawiki-core-doxygen-publish jobs in the Jenkins build queue. They seemed to deadlock Jenkins somehow :(
  • 19:59 hashar: Regenerate jobs selenium-GettingStarted from JJB - T164296
  • 19:51 hashar: Jenkins: rolling back Performance plugin from 2.2 to 2.0 due to an exception / failure to find a junit xml file. T164296
  • 19:02 hashar: Added multichill ( https://github.com/multichill ) to the Wikimedia Github organization
  • 10:21 godog: bounce varnish and varnish-frontend on deployment-cache-upload04
  • 10:16 godog: upgrade scap on deployment-tin to overcome AttributeError: Lock instance has no attribute 'get_lock_excuse'
  • 09:41 godog: flip deployment-cache-upload04 to deployment-ms-fe02 - T162247
  • 08:17 hashar: Reconfigured all Jenkins jobs via jjb

2017-05-01

2017-04-27

  • 18:18 urandom: deployment-prep: restarting cassandra-metrics-collector on deployment-restbase0[1-2]
  • 07:26 Amir1: cherry-picking 348184/4 (T161563)

2017-04-26

  • 23:36 urandom: removing r/350485 from deployment-prep
  • 21:53 urandom: cherry-picking r/350485 to deployment-prep
  • 20:20 bearND: Update mobileapps to 14bd4a5
  • 15:24 godog: add new deployment-ms-be0[34] backends to swift in deployment-prep - T162247

2017-04-25

2017-04-22

  • 20:17 hashar: Added FlorianSW to Github organization "wikimedia" (no team though)

2017-04-21

  • 12:25 hashar: T104048 zuul enqueue --trigger gerrit --pipeline postmerge --project AhoCorasick --change 345433,1
  • 09:32 hashar: Zuul: deploying "Decouple repos from mediawiki gate queue" 7a79f752363a / T107529
  • 09:30 elukey: hack reverted on tin and scap pull performed on jobrunner02

2017-04-20

  • 17:09 elukey: reverted hack on deployment-tin (apparently no effects on the jobrunner)
  • 16:41 elukey: temporary disable puppet on deployment-tin to remove jobrunner02 from scap dsh; manually enable persistent connection between it and rdb redis hosts

2017-04-19

  • 16:34 hashar: deleted nodepool alien ci-jessie-wikimedia-613597
  • 09:20 hashar: apt-get upgrade deployment-tin deployment-mira
  • 09:16 hashar: apt-get upgrade on deployment-mx deployment-redis01 deployment-redis02 deployment-cache-text04
  • 02:58 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/348896

2017-04-18

  • 14:29 hashar: unbreaking integration puppetmaster. Broke it when upgrading the puppet package :(
  • 14:09 hashar: integration: upgrade puppet on Jessie permanent slaves 3.7.2 -> 3.8.5 (and add ruby-rgen). Done via: salt -v '*' pkg.upgrade
  • 13:17 elukey: upgrade deployment-jobrunner02 to hhvm 3.18.2+wmf2 - T162354
  • 10:07 godog: upgrade swift to 2.2.0 on deployment-ms*

2017-04-14

  • 12:29 hashar: Delete integration-c1 instance (32GB RAM) on labvirt1004. It was used as a workaround for T161006
  • 08:17 hashar: beta: cherry picking again 348184/4 'service: use gzip for logging in uwsgi' for T161563
  • 08:03 hashar: beta: resetting puppetmaster to last good tag snapshot-20170414T0030 A cherry pick for T161563 end up dropping three patches which broke other parts of the infrastructure
  • 07:52 hashar_: Puppet failing on deployment-tin and deployment-mira . Some patches have been dropped from the puppet master :-((
  • 00:59 Amir1: three cherry-picks failed to merge, skipped them 93dad5b 92c7d0b 21d60a4
  • 00:45 Amir1: cherry-picking 348184/1 (T161563)

2017-04-13

2017-04-12

  • 15:14 hashar: rm -fR /mnt/home/jenkins-deploy/.android/build-cache/* # T162635
  • 14:56 hashar: integration-slave-jessie-1001 : mv /mnt/home/jenkins-deploy/.android-sdk /mnt/home/jenkins-deploy/.android-sdk.T162635.back for T162635
  • 14:54 hashar: integration-slave-jessie-1002 : mv /mnt/home/jenkins-deploy/.android-sdk /mnt/home/jenkins-deploy/.android-sdk.T162635.back for T162635
  • 10:37 hashar: Jenkins email-ext plugin got upgraded. Some Groovy templating might be blocked and need to be reviewed/approved via https://integration.wikimedia.org/ci/scriptApproval/
  • 08:52 hashar: Cancelled a bunch of mediawiki-core-doxygen-publish jobs that were keeping the queue busy / deadlocking builds. Should be moved to poll SCM instead ( T115755 )

2017-04-11

  • 15:59 hashar: integration-config-tox-jessie job is broken due to the JJB upgrade
  • 15:40 hashar: Upgraded JJB to latest master 4f77324f with a couple cherrypicks on top of that. 022738f8...edebce7f T162674
  • 15:36 hashar: Updating selenium-* jobs configuration for the performance plugin due to JJB upgrade T162674
  • 15:24 hashar: Adding parameter ZUUL_VOTING to all Jenkins jobs due to JJB upgrade T162674
  • 15:13 hashar: Forced updated jenkins-job-builder 86478421...022738f8 - T162674
  • 13:44 hashar: Forced updated jenkins-job-builder 1639a86e...86478421 - T162674
  • 13:44 hashar: Updating all Jenkins jobs using the git plugin due to JJB change cdfeb7b - T162674
  • 12:35 hashar: Force updated jenkins-job-builder from 1.5.0 to 1.6.0 and bumped python-jenkins to 0.4.14. 6fcaf39b...1639a86e - T162674
  • 12:35 hashar: Force updated jenkins-job-builder from 1.5.0 to 1.6.0 and bumped python-jenkins to 0.4.14. 6fcaf39b...1639a86e
  • 10:41 hashar: Enable webdriver.io browser tests for MediaWiki core - https://gerrit.wikimedia.org/r/#/c/324719/ - T139740
  • 09:50 hashar: Regenerating MediaWiki doxygen documentations for all 1.23.x releases.
  • 08:55 hashar: Retriggering MediaWiki doxygen publishing job for 1.26.0 - T162506 : zuul enqueue-ref --trigger gerrit --pipeline publish --project mediawiki/core --ref refs/tags/1.26.0 --newrev 981ec62

2017-04-10

  • 21:17 hashar: marked a nodepool node online manually. The instance was up but Jenkins failed to reach it due to some SEVERE: I/O error in channel
  • 20:52 hashar: integration-slave-jessie-1001 : cleaning up /tmp: sudo find /tmp -path '/tmp/android-tmp-robo*' -delete # T162635
  • 20:49 hashar: integration-slave-jessie-1002 : cleaning up /tmp: sudo find /tmp -path '/tmp/android-tmp-robo*' -delete # T162635
  • 20:08 bearND: Update mobileapps to 1695900
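The 20:52/20:49 cleanup entries use a `find -path ... -delete` pattern worth spelling out: `-delete` implies depth-first traversal (files before their directories), and `*` in a `-path` pattern also matches `/`. A sketch against a temp directory instead of the real /tmp:

```shell
# Illustration of the /tmp cleanup pattern; directory names are invented.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/android-tmp-robo42"
touch "$tmp/android-tmp-robo42/emulator.log" "$tmp/unrelated.txt"
# Everything under android-tmp-robo* matches the -path glob and is removed;
# unrelated files are left alone.
find "$tmp" -path "$tmp/android-tmp-robo*" -delete
ls "$tmp"   # -> unrelated.txt
```

`sudo` is needed on shared slaves, as in the original entries, because the stale trees belong to the jenkins-deploy user.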

2017-04-06

  • 16:36 halfak: staging ores:554ea12
  • 12:23 hashar: Image snapshot-ci-trusty-1491480759 in wmflabs-eqiad is ready
  • 12:13 hashar: Updating Nodepool Trusty image to let Linux overcommit memory ( https://gerrit.wikimedia.org/r/#/c/346634/ )

2017-04-05

  • 13:34 ema: testing possible fix for T162035 on deployment-ms-fe01

2017-04-04

  • 21:29 hashar: contint1001 : rm -fR /srv/zuul/git/mediawiki/services/graphoid/deploy due to T157818
  • 21:26 hashar: contint2001 : rm -fR /srv/zuul/git/mediawiki/services/graphoid/deploy due to T157818
  • 20:58 hashar: integration: purging precise cow images from integration-slave-jessie-1001 and integration-slave-jessie-1002 ( https://gerrit.wikimedia.org/r/#/c/345836/ )
  • 20:58 hashar: rebased integration puppet master
  • 20:02 legoktm: deploying https://gerrit.wikimedia.org/r/346348

2017-04-03

  • 20:43 bearND: Update mobileapps to fdd4e31
  • 20:39 hashar: Nodepool: holding instance ci-trusty-wikimedia-597386 in an attempt to debug exploding Wikibase/Scribunto memory usage T125050
  • 20:37 hashar: jenkins: disabled/reenabled gearman plugin to unlock the beta cluster related jobs
  • 09:17 hashar: deployment-jobrunner02 : cherry picked a monkey patch for Redis::close() to prevent it from sending QUIT command ( https://gerrit.wikimedia.org/r/#/c/346117/ ) - T125735

2017-04-01

  • 09:48 Sagan: puppet on deployment-tin looks like it is not running properly

2017-03-29

  • 23:51 Krinkle: Free up space on integration-slave-jessie-1001 by removing old /srv/jenkins-workspace and /srv/pbuilder dirs
  • 19:57 thcipriani: added --force flag for scap in beta-scap-eqiad temporarily
  • 18:41 ebernhardson: upgrading elasticsearch and kibana to 5.1.2 on deployment-logstash2 to test puppet+integration prior to prod deployment
  • 15:18 hashar: Delete a 32GB instance integration-ci - T161006

2017-03-28

  • 19:53 hashar: Populating package manager cache of oojs-ui-npm-run-jenkins-node-6-jessie by manually triggering a build with ZUUL_PIPELINE=postmerge T155483
  • 19:34 hashar: Migrate oojs/ui to just run 'npm jenkins' https://gerrit.wikimedia.org/r/345203 / T155483
  • 16:05 halfak: deployed ores:18beebf (T160638)
  • 13:22 gehel: restarting elasticsearch on deployment-elastic05 to reload log4j configuration
  • 10:28 hashar: Jenkins: installing Android Lint plugin 2.4 - T161305
  • 07:42 hashar: nodepool cleared a couple alien instances

2017-03-27

  • 17:02 ebernhardson: cherry pick https://gerrit.wikimedia.org/r/344964 to puppetmaster to test upgrade to logstash 5.x
  • 11:10 hashar: Image snapshot-ci-jessie-1490612363 in wmflabs-eqiad is ready
  • 10:59 hashar: Updating Nodepool Jessie image to include PhantomJS (take two) - T137112
  • 10:58 hashar: Image snapshot-ci-jessie-1490611594 in wmflabs-eqiad is ready
  • 10:47 hashar: Updating Nodepool Jessie image to include PhantomJS - T137112
  • 10:20 hashar: Restarting Jenkins to drop the Throttle Concurrent Builds plugin - T158596

2017-03-25

  • 10:46 Amir1: deleting deployment-ores-redis (T160762)
  • 10:39 Amir1: changing ores redis address to deployment-ores-redis-01 (T160762)
  • 10:02 Amir1: deleted deployment-ores-redis-02

2017-03-24

  • 21:34 Amir1: launching deployment-ores-redis-02 (T160762)

2017-03-23

  • 16:07 mobrovac: restbase deploying 752ca4b7
  • 15:52 hashar: Deleting integration-slave-trusty-1011 m1.large. One less permanent slave to take care of
  • 14:02 hashar: deployment-ms-be01 and deployment-ms-be02 : Lower Swift replicator on, upgrade package, reboot hosts. T160990

2017-03-22

  • 09:45 hashar: beta: purging all Linux kernel from Swift instances
  • 08:48 hashar: deployment-ms-be01: swift-init reload all - T160990
  • 08:45 hashar: deployment-ms-be01: swift-init reload container - T160990
  • 08:43 hashar: deployment-ms-be01: swift-init reload object - T160990

2017-03-21

  • 16:47 halfak: halfak@deployment-ores-redis:~$ redis-cli -h deployment-ores-redis.deployment-prep.eqiad.wmflabs -p 6380 -a areallysecretpassword flushall (T160762)
  • 16:07 Amir1: ladsgroup@deployment-ores-redis:~$ redis-cli -h deployment-ores-redis.deployment-prep.eqiad.wmflabs -p 6380 -a areallysecretpassword flushall (T160762)
  • 11:27 hashar: integration: purging old packages on permanent slaves, mostly old kernels: apt-get autoremove --purge
  • 09:06 hashar: CI deploying config hack "High priority test pipeline"  : https://gerrit.wikimedia.org/r/343318 - T160667

2017-03-20

  • 20:51 andrewbogott: migrating deployment-urldownloader to labvirt1013
  • 20:45 andrewbogott: migrating deployment-pdf01 to labvirt1011
  • 20:14 andrewbogott: migrating deployment-puppetmaster02 to a different labvirt
  • 20:09 bearND: Update mobileapps to c0ab01d
  • 08:51 hashar: Jenkins: depooling / deleting Precise instances.

2017-03-17

  • 14:08 hashar: salt -v '*precise*' cmd.run 'puppet agent --disable "Pending shutdown on March 20th - T158652"'

2017-03-16

2017-03-15

  • 20:29 bearND: Update mobileapps to bb8fcf2
  • 19:02 niedzielski: Reloading Zuul to deploy f1c9073
  • 15:55 Reedy: Removed hhvm statcache cherrypick from beta puppetmaster
  • 11:09 elukey: Restore prod version of memcached on deployment-memc04 after experiment (I installed a new version a while ago)
  • 10:22 elukey: created instances deployment-aqs0[23] to have better testing for the AQS beta environment
  • 09:10 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=hewiktionary
  • 09:10 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=dewiktionary
  • 09:08 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=enwiktionary
  • 08:56 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=enwiktionary // (ParameterTypeException, T160503)
  • 08:50 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=enwiktionary --site-group=wiktionary // (3 sites added)
  • 08:49 addshore: addshore@deployment-tin mwscript extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=enwiktionary --force-protocol=https --load-from=https://deployment.wikimedia.beta.wmflabs.org/w/api.php
  • 08:49 addshore: addshore@deployment-tin mwscript sql.php --wiki=enwiktionary "TRUNCATE sites; TRUNCATE site_identifiers;"
  • 08:44 addshore: addshore@deployment-tin mwscript extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=enwiktionary --force-protocol=https
  • 08:43 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=dewiktionary --site-group=wiktionary // (0 sites added)
  • 08:43 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=enwiktionary --site-group=wiktionary // (1 site added)
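The 08:44–08:49 entries above rebuild the beta sites tables from scratch: truncate them, then repopulate from the live API before re-running populateCognateSites.php. A minimal local sketch of that truncate-then-repopulate pattern, with a flat file standing in for the `sites` table (site names are illustrative, not from the real database):

```shell
# Flat file stands in for the `sites` table on deployment-tin.
sites=$(mktemp)
echo "stale-entry" > "$sites"        # leftover row from an earlier populate run
: > "$sites"                         # equivalent of: sql.php "TRUNCATE sites;"
for key in enwiktionary dewiktionary hewiktionary; do
    echo "$key" >> "$sites"          # equivalent of re-running populateSitesTable.php
done
cat "$sites"
```

The point of the ordering is that stale rows are gone before the repopulation starts, so nothing from the earlier broken run survives.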

2017-03-14

  • 19:22 thcipriani: removed alien nodepool instance via: openstack server delete ci-jessie-wikimedia-566503
  • 10:15 hashar: Added Niedzielski to integration.
  • 09:54 hashar: Jenkins: dropping Sniedzielski's more specific permissions. Account is already in the wmf ldap group

2017-03-13

  • 13:19 hashar: Depooled Precise instances from Jenkins (T158652), leaving the instances up for now.
  • 11:38 hashar: Deleting php53lint jobs. Replacing them with php55 equivalents
  • 09:39 hashar: upgrading puppet on deployment-pdf01
  • 09:30 hashar: Removing old kernel packages from deployment-pdf01 to free up disk space
  • 08:55 hashar: Deleting deployment-copper. It fails puppet due to broken OpenStack metadata at http://169.254.169.254/openstack/2015-10-15/meta_data.json and is no longer needed (per elukey)

2017-03-10

2017-03-09

  • 16:20 gehel: upgrading elasticsearch on deployment-prep to v5.1.2
  • 09:39 hashar: deployment-prep: rebasing puppet master. Got stalled due to a submodule update apparently

2017-03-08

2017-03-07

  • 22:39 hashar: upgrading jenkins02.ci-staging to jenkins 2.x
  • 15:26 hashar: ci-staging, enabling puppet master auto signing ( puppetmaster::autosigner: true )
  • 08:25 hashar: Image snapshot-ci-jessie-1488874660 in wmflabs-eqiad is ready (Chromium 55->56 among others) - T153038
  • 08:16 hashar: Pushing new Jessie image: image-jessie-20170306T224719Z.qcow2

2017-03-06

  • 19:03 addshore: mwscript sql.php --wiki=aawiki "CREATE DATABASE cognate_wiktionary"
  • 16:03 hashar: Jenkins upgrading "Git client plugin" 1.19.6 to 2.3.0

2017-03-02

  • 20:47 hashar: deployment-prep: restarted apache/puppet master. Maybe that will fix ssh_known_hosts being emptied from time to time T159332
  • 19:32 thcipriani: snapshot-ci-jessie updated for nodepool
  • 19:15 thcipriani: running: nodepool image-update wmflabs-eqiad snapshot-ci-jessie to manually update the ci-jessie snapshot for nodepool
  • 18:26 godog: integration update composer on '*slave*'
  • 11:52 hashar: gerrit: killed a stalled connection: dd511e52 Feb-27 07:11 git-receive-pack '/mediawiki/services/zotero/translators'
  • 09:53 hashar: Image snapshot-ci-jessie-1488447340 in wmflabs-eqiad is ready
  • 09:29 hashar: Image snapshot-ci-trusty-1488446586 in wmflabs-eqiad is ready
  • 09:18 hashar: upgrading composer on permanent slaves for T125343 : salt -v '*slave*' cmd.run 'cd /srv/deployment/integration/composer && git pull'
  • 09:16 hashar: upgrade composer to 1.1.0 https://gerrit.wikimedia.org/r/#/c/339645/
  • 08:40 elukey: upgrading apache2 on deployment-mediawiki* - latest debian DSA, introduces https://httpd.apache.org/docs/2.4/mod/core.html#httpprotocoloptions (risk of HTTP 400 responses regression, contact elukey or moritzm if you see any issue)

2017-03-01

  • 19:09 addshore: "mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki he wiktionary hewiktionary he.wiktionary.beta.wmflabs.org" T158628
  • 17:11 hashar: cleaned out Jenkins security matrix to drop users that are no more used/inexistent -- T69027
  • 14:13 hashar: deployment-prep: on deployment-tin removed empty dir /etc/ssh/userkeys/root.d; it causes puppet noise
  • 12:21 hashar: deployment-prep cleaning out git repos on deployment-tin
  • 10:00 legoktm: deployed https://gerrit.wikimedia.org/r/340280 to slaves
  • 04:28 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/340465
  • 01:03 Reedy: beta-scap-eqiad giving Host key verification failed

2017-02-28

  • 19:43 thcipriani: deployment-puppetmaster02 puppetmaster running again, apache2 was refusing to start with: Invalid command 'SSLOpenSSLConfCmd' -- installed apache from wmf repo instead of debian fixed it
  • 08:36 hashar: nodepool deleted alien instances 541585 541586 and 541587

2017-02-27

  • 21:36 bearND: Update mobileapps to c924126

2017-02-25

  • 03:50 MaxSem: deployment-prep Deleted January logs from deployment-fluorine02, was running out of space

2017-02-24

2017-02-23

  • 18:35 greg-g: 18:29 < chasemp> !log labnodepool1001:~# service nodepool restart
  • 09:27 hashar: Clearing skins from testextension jobs T117710 salt -v '*slave*' cmd.run 'rm -fR /srv/jenkins-workspace/workspace/mwext-testextension*/src/skins/*'
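The salt invocation above fans a glob rm out to every slave. A local simulation of the same glob (a temp dir stands in for /srv/jenkins-workspace/workspace; the job names are illustrative) shows that only the mwext-testextension* workspaces are touched:

```shell
ws=$(mktemp -d)    # stands in for /srv/jenkins-workspace/workspace
mkdir -p "$ws/mwext-testextension-hhvm/src/skins/Vector" \
         "$ws/mwext-testextension-php55/src/skins/MonoBook" \
         "$ws/mediawiki-core-phpunit/src/skins/Vector"
# same glob as the salt cmd.run above, applied locally
rm -fR "$ws"/mwext-testextension*/src/skins/*
```

Note the glob must sit outside the quotes (`"$ws"/mwext-testextension*`): quoting the whole path would suppress expansion and remove nothing.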

2017-02-22

  • 20:58 hashar: Deleted jenkins job pplint-HEAD. Fully replaced by rake / puppet-syntax gem - T154894
  • 20:54 hashar: Deleted jenkins job erblint-HEAD. Fully replaced by rake / puppet-syntax gem - T154894

2017-02-20

  • 14:53 hashar: integration: applying role::ci::slave::saucelabs to saucelabs-01
  • 12:50 hashar: integration-slave-jessie-1001 downgraded cowbuilder to 0.73 from jessie to match integration-slave-jessie-1002

2017-02-17

  • 14:07 hashar: integration: deleting "repository" instance. No time to figure out how to ship Sonatype Nexus to it. T147635

2017-02-16

  • 18:34 greg-g: chase restarted nodepool, the daemon crashed
  • 18:32 greg-g: no active nodepool instances listed in Jenkin's view: https://integration.wikimedia.org/ci/ but zuul has plenty to do https://integration.wikimedia.org/zuul/
  • 16:56 hashar: integration: provisioned browsertests-1001 with role::ci::slaves::browsertests . Added it to Jenkins with label BrowserTests
  • 16:33 halfak: deploying ores:e9bbda3
  • 16:30 hashar: integration: created browsertests-1001 intended to run the daily browser tests later on

2017-02-15

  • 15:47 hashar: Zuul: reducing the gate-and-submit minimum number of changes to process from the wrong value of 12 down to 2. In case of repeating failures the old value would keep jobs running and prevent cancelling jobs for up to 11 changes!

2017-02-14

  • 14:38 hashar: Updating castor-save publish job to properly capture composer cache on Jessie ( it is in ~/.composer/cache for some reason) T156359

2017-02-13

2017-02-10

2017-02-09

2017-02-08

  • 22:26 mdholloway: mobileapps deployed 0efa7b8 in the beta cluster
  • 14:14 hashar: integration-slave-jessie-1001 upgrading cowbuilder
  • 09:20 hashar: deployment-fluorine02 upgraded packages, deleted old files from /srv/mw-log/archive

2017-02-07

  • 17:49 halfak: deploying ores 7c80636
  • 09:02 hashar: Hard rebooting integration-slave-jessie-1001 . I messed up with the DHCP client :(

2017-02-06

  • 21:31 bearND: Update mobileapps to 034a391

2017-02-04

  • 21:37 halfak: deploying ores 7c80636
  • 21:24 halfak: deploying ores 691b340

2017-02-03

  • 11:09 hashar: beta: removed old kernels from deployment-redis02 to free up disk space
  • 10:42 hashar: Image ci-jessie-wikimedia-1486115643 in wmflabs-eqiad is ready T156923
  • 10:12 hashar: Image ci-jessie-wikimedia-1486115643 in wmflabs-eqiad is ready T156923
  • 09:54 hashar: Regenerate Nodepool Jessie snapshot. Would get a new HHVM version T156923

2017-02-02

  • 21:56 hashar: integration-slave-jessie-1001: wiping /srv/pbuilder/base-trusty-amd64.cow; it was not properly provisioned, causing builds to fail (e.g. lack of /etc/hosts). Running puppet to reprovision it (poke T156651)
  • 16:26 Amir1: deploying 9fd75a1 ores in beta
  • 16:17 hashar: integration-slave-jessie-1001: wiping /srv/pbuilder/base-trusty-i386.cow/; it was not properly provisioned, causing builds to fail (e.g. lack of /etc/hosts). Running puppet to reprovision it (poke T156651)
  • 14:15 hashar: Nodepool: deleted the Jessie image being built (image id 1322) to prevent a faulty HHVM version from being added. T156923
  • 00:52 tgr: added mhurd as member

2017-02-01

  • 21:43 bearND: Update mobileapps to e48a88c
  • 18:51 thcipriani: nodepool delete-image 1320 per T156923
  • 14:53 gehel: deployment-elastic* fully migrated to Jessie and /srv as data partition - T151326
  • 14:52 gehel: killing test node deployment-elastic08 - T151326
  • 14:32 gehel: shutting down and reimaging deployment-elastic07 - T151326
  • 14:06 gehel: shutting down and reimaging deployment-elastic06 - T151326
  • 13:34 gehel: shutting down and reimaging deployment-elastic05 - T151326
  • 13:29 gehel: starting deployment-elastic* migration to jessie and moving data partition to /srv (T151326 / T151328)
  • 13:18 moritzm: upgraded deployment-prep to hhvm 3.12.12

2017-01-31

2017-01-26

2017-01-24

  • 11:04 hashar: Deleting integration-publisher (Precise) replaced by integration-publishing (Jessie). T156064 T143349

2017-01-23

  • 23:41 bearND: Update mobileapps to 66ef3c2
  • 21:05 hashar: Created integration-publishing Jessie instance 10.68.23.254 with puppet class role::ci::publisher::labs . Meant to replace Precise instance integration-publisher T156064
  • 12:45 hashar: Image ci-jessie-wikimedia-1485174573 in wmflabs-eqiad is ready | should no longer spawn varnish on boot
  • 09:02 hashar: Archiving Gerrit project wikidata/gremlin marking it read-only T155829
  • 07:15 _joe_: cherry-picking the move of base to profile::base

2017-01-21

  • 21:20 hashar: integration: updating slave scripts for https://gerrit.wikimedia.org/r/#/c/333389/
  • 21:08 bd808: Puppet failures on deployment-restbase0[12] seem to be some sort of hang of the Puppet process itself. Run prints "Finished catalog run in 2n.nn seconds" but Puppet doesn't terminate for about a minute longer. The only state change logged is cassandra-metrics-collector service start.

2017-01-20

  • 10:14 hashar: puppet fails on "integration" labs instances due to an attempt to unmount the non-existent NFS /home. Filed T155820
  • 09:18 hashar: beta: reset workspace of /srv/mediawiki-staging/php-master/extensions/reCaptcha it had a .gitignore local hack for some reason
  • 09:05 hashar: integration restarted mysql on trusty permanent slaves T141450 T155815 salt -v '*trusty*' cmd.run 'service mysql start'

2017-01-19

  • 22:11 Krenair: added a bunch of others to the same group per request. We should figure out how to make this process sane somehow
  • 22:06 Krenair: added nuria to deploy-service group on deployment-tin
  • 16:56 hashar: rebased puppet master on integration and deployment-prep Trivial conflict between https://gerrit.wikimedia.org/r/#/c/312523/ and a lint change
  • 09:36 hashar: Nuking workspaces of all mwext-testextension-hhvm-composer* jobs. Lame attempt for T155600. salt -v '*slave*' cmd.run 'rm -fR /srv/jenkins-workspace/workspace/mwext-testextension-hhvm-composer*'

2017-01-18

  • 10:49 hashar: Disconnected/connected Jenkins Gearman client. The beta cluster builds had a deadlock.
  • 10:39 hashar: Image ci-jessie-wikimedia-1484735445 in wmflabs-eqiad is ready (add python-conftool to hopefully have puppet rspec pass on https://gerrit.wikimedia.org/r/#/c/332475/ )

2017-01-17

  • 21:47 urandom: deployment-prep restarting Cassandra on deployment-restbase02
  • 21:46 urandom: deployment-prep restarting Cassandra on deployment-restbase01
  • 19:02 thcipriani: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/#/c/332534/
  • 18:25 thcipriani: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/#/c/332521/
  • 18:07 urandom: deployment-prep restarting Cassandra on deployment-restbase01
  • 17:50 urandom: re-enabling puppet on deployment-restbase02
  • 17:47 urandom: re-enabling puppet on deployment-restbase01
  • 10:32 hashar: Refreshing all jobs in Jenkins 'jenkins-jobs --conf jenkins_jobs.ini update config/jjb'

2017-01-16

2017-01-12

2017-01-11

  • 18:07 urandom: restarting restbase cassandra nodes
  • 18:01 urandom: disabling puppet on restbase cassandra nodes to experiment with prometheus exporter

2017-01-10

2017-01-08

  • 05:20 Krenair: deployment-stream: live hacked /usr/lib/python2.7/dist-packages/socketio/handler.py a bit (added apostrophes) to try to make rcstream work

2017-01-07

  • 10:17 Amir1: ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)

2017-01-06

  • 16:31 hashar: Nodepool Image ci-jessie-wikimedia-1483719758 in wmflabs-eqiad is ready
  • 16:24 hashar: Nodepool Image ci-trusty-wikimedia-1483719370 in wmflabs-eqiad is ready
  • 04:56 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/330843

2017-01-05

2017-01-04

  • 21:29 mutante: deployment-cache-text-04 - running acme-setup command to debug .. Creating CSR /etc/acme/csr/beta_wmflabs_org.pem
  • 21:26 Krenair: trying to troubleshoot puppet by stopping nginx then letting puppet start it
  • 21:05 mutante: deployment-cache-text04 stopping nginx service, running puppet to debug dependency issue
  • 09:41 hashar: integration: pruning /srv/pbuilder/aptcache/ on Jessie perm slaves

2017-01-02

  • 11:22 hashar: Nodepool Image ci-jessie-wikimedia-1483355768 in wmflabs-eqiad is ready
  • 11:17 hashar: Jessie images have the wrong python-pbr version ( T153877 ) causing zuul-cloner to fail. Refreshing image
  • 10:02 hashar: Nodepool Image ci-jessie-wikimedia-1483350885 in wmflabs-eqiad is ready
  • 09:57 hashar: Nodepool Image ci-trusty-wikimedia-1483350368 in wmflabs-eqiad is ready

2016-12-27

2016-12-26

  • 12:09 hashar: beta: restarted varnish.service and varnish-frontend.service on deployment-cache-text04

2016-12-24

2016-12-23

2016-12-22

  • 22:11 thcipriani: disable production l10nupdate for deployment freeze

2016-12-21

  • 05:57 Krinkle: Jenkins "Collapsing Console Sections" for PHPUnit was broken since "-d zend.enable_gc=0" was added to phpunit.php invocation. Updated pattern in Jenkins system configuration.

2016-12-19

2016-12-16

  • 22:34 legoktm: deploying https://gerrit.wikimedia.org/r/327202
  • 14:33 hashar: Nodepool Image ci-jessie-wikimedia-1481897950 in wmflabs-eqiad is ready
  • 14:25 hashar: Nodepool Image ci-trusty-wikimedia-1481897961 in wmflabs-eqiad is ready
  • 14:19 hashar: Refreshing Nodepool images. The snapshots were broken due to mariadb-client failing to upgrade
  • 13:45 hashar: integration / contintcloud : remove security rules of labs projects that allowed gallium (phased out) T95757
  • 13:44 hashar: integration / contintcloud : update security rules of labs projects to allow contint2001
  • 13:15 hashar: integration: update sudo policy for debian-glue to keep the env variable SHELL_ON_FAILURE (for https://gerrit.wikimedia.org/r/#/c/327720/ )
  • 10:15 hashar: integration: apt-get upgrade on all permanent slaves
  • 10:13 hashar: integration-slave-docker-1000 changed docker::version from no more existent '1.12.3-0~jessie' to simply 'present'. Will have to manually upgrade it from now on. T153419
  • 10:04 hashar: deployment-puppetmaster02 updated puppet repo. Was stalled due to a bump of the mariadb submodule

2016-12-15

  • 21:00 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/324368
  • 19:23 marxarelli: Manually rebasing and re-applying cherry picks for operations/puppet on integration-puppetmaster01.eqiad.wmflabs
  • 16:08 hashar: deployment-phab02 : apt-get upgrade T147818
  • 14:48 Amir1: ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)
  • 11:41 zeljkof: Reloading Zuul to deploy 327473

2016-12-14

  • 12:38 elukey: created deployment-copper on deployment-prep as temporary test

2016-12-13

2016-12-09

2016-12-08

2016-12-07

  • 15:04 hashar: Image ci-trusty-wikimedia-1481122712 in wmflabs-eqiad is ready T117418
  • 02:29 matt_flaschen: foreachwikiindblist FlowFixInconsistentBoards complete
  • 02:27 matt_flaschen: Started (foreachwikiindblist flow.dblist extensions/Flow/maintenance/FlowFixInconsistentBoards.php) 2>&1 | tee FlowFixInconsistentBoards_2016-12-06.txt on deployment-tin
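The 02:27 entry wraps the maintenance run in a subshell so that `2>&1 | tee` captures both stdout and stderr in the transcript file. A minimal reproduction of that logging pattern (echo stands in for foreachwikiindblist; a temp file stands in for the real transcript path):

```shell
log=$(mktemp)
# subshell groups both streams, 2>&1 merges stderr into stdout, tee saves a copy
# while still printing to the terminal
( echo "fixed board 1"; echo "warning: inconsistent board" >&2 ) 2>&1 | tee "$log"
```

Without the surrounding `( … )`, the `2>&1` would apply only to the last command in the pipeline's left-hand side, so warnings from earlier commands would bypass the log.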

2016-12-06

  • 21:20 hashar: Image ci-jessie-wikimedia-1481058839 in wmflabs-eqiad is ready T113342
  • 21:13 hashar: Refreshed the Nodepool Jessie snapshot, which boots 3 times faster. Will help get nodes available faster T113342
  • 16:33 hashar: Nodepool imported a new Jessie image 'jessie-T113342' with some network configuration hotfix. Will use for debugging. T113342
  • 09:08 Reedy: running foreachwiki update.php on beta

2016-12-05

  • 20:43 hashar: Image ci-jessie-wikimedia-1480969940 in wmflabs-eqiad is ready (includes trendingedits::packages, which explicitly defines the installation of librdkafka-dev)
  • 09:52 elukey: add https://gerrit.wikimedia.org/r/#/c/324642/ to the deployment-prep's puppet master to test nutcracker
  • 09:39 hashar: beta-update-databases-eqiad fails due to CONTENT_MODEL_FLOW_BOARD not registered on the wiki. T152379
  • 08:44 hashar: Image ci-jessie-wikimedia-1480926961 in wmflabs-eqiad is ready T113342
  • 08:35 hashar: Pushing new Jessie image to Nodepool that supposedly boots 3x faster T113342

2016-12-04

  • 15:25 Krenair: Found a git-sync-upstream cron on deployment-mx for some reason... commented for now, but wtf was this doing on a MX server?

2016-12-03

2016-12-02

  • 14:40 hashar: added Tobias Gritschacher to Gerrit "integration" group so he can +2 patches on integration/* repositories \O/

2016-12-01

2016-11-30

  • 17:22 gehel: restart of logstash on deployment-logstash2 - upgrade to Java 8 - T151325
  • 17:11 gehel: rolling restart of deployment-elastic0* - upgrade to Java 8 - T151325
  • 11:22 hashar: Gerrit: hid mediawiki/extensions/JsonData/JsonSchema; empty since 2013
  • 11:20 hashar: Gerrit made mediawiki/extensions/GuidedTour/guiders read-only (per README.md, no more used)
  • 11:18 hashar: Gerrit: mediawiki/extensions/CentralNotice/BannerProxy.git; empty since 2014

2016-11-29

  • 15:23 hashar: Image ci-jessie-wikimedia-1480432368 in wmflabs-eqiad is ready
  • 14:30 hashar: Image ci-trusty-wikimedia-1480429423 in wmflabs-eqiad is ready T151879
  • 14:24 hashar: Refreshing Nodepool Trusty snapshot to get php5-xsl installed T151879

2016-11-28

2016-11-26

  • 16:15 Reedy: killed /srv/jenkins-workspace/workspace/mediawiki-core-*/src and /srv/jenkins-workspace/workspace/mwext-*/src from integration slaves to get rid of borked MW dirs
  • 15:51 Reedy: deleted /srv/jenkins-workspace/workspace/mediawiki-core-code-coverage/src on integration-slave-trusty-1006 to force a reclone
  • 14:14 Reedy: moved old /srv/mediawiki-staging/php-master to /tmp/php-master, recloned MW Core, copied in LocalSettings, skins, vendor and extensions. T151676. scap sync-dir running
  • 13:05 Reedy: marked deployment-tin as offline due to T151670

2016-11-24

2016-11-23

  • 15:04 Krenair: fixed puppet on deployment-cache-text04 by manually enabling experimental apt repo, see T150660
  • 10:57 hashar: Terminating deployment-apertium01 again T147210

2016-11-22

  • 19:31 hashar: beta: rebased puppet master
  • 19:30 hashar: beta: dropping cherry pick for the PDF render by mobrovac ( https://gerrit.wikimedia.org/r/#/c/305256/ ). Got merged
  • 08:29 hashar: Deleting shut off instances: integration-puppetmaster , deployment-puppetmaster , deployment-pdf02 , deployment-conftool - T150339

2016-11-21

2016-11-19

2016-11-18

2016-11-17

  • 22:07 mutante: re-enabled puppet on contint1001 after live Apache fix
  • 11:34 hasharLunch: Deleted instance deployment-apertium01 . Was Trusty and lacked packages, replaced by a Jessie one ages ago. T147210

2016-11-16

  • 20:53 elukey: restored apache2 config on deployment-mediawiki06
  • 20:28 elukey: temporary increasing verbosity of mod_rewrite on deployment-mediawiki06 as test
  • 20:02 Krenair: mysql master back up, root identity is now unix socket based rather than password
  • 19:57 Krenair: taking mysql master down to fix perms
  • 13:02 hashar: Restarted HHVM on deployment-mediawiki05; it was not honoring requests T150849
  • 12:24 hashar: beta: created dewiktionary table on the Database slave. Restarted replication with START SLAVE; T150834 T150764
  • 10:39 hashar: Removing revert b47ce21 from deployment-tin and reenabling jenkins job. https://gerrit.wikimedia.org/r/321857 will get it fixed
  • 10:26 hashar: Reverting mediawiki/core b47ce21 on beta cluster T150833
  • 09:51 hashar: marking deployment-tin offline so I can live hack mediawiki code / scap for T150833 and T15034
  • 09:12 hashar: deployment-mediawiki04 stopping hhvm
  • 08:59 hashar: beta database update broken with: MediaWiki 1.29.0-alpha Updater\n\nYour composer.lock file is up to date with current dependencies!
  • 07:52 Krenair: the new mysql root password for -db04 is at /tmp/newmysqlpass as well as in a new file in the puppetmaster's labs/private.git
  • 06:34 twentyafterfour: restarting hhvm on deployment-mediawiki04
  • 06:33 Amir1: ladsgroup@deployment-mediawiki05:~$ sudo service hhvm restart
  • 06:30 mutante: restarting hhvm on deployment-mediawiki06

2016-11-15

  • 16:03 hasharAway: adding thcipriani to the labs "git" project maintained by paladox

2016-11-14

  • 08:16 Amir1: cherry-picking 321096/3 in beta puppetmaster

2016-11-12

  • 14:02 Amir1: cherry-picked gerrit change 321096/2 in puppetmaster

2016-11-11

2016-11-10

  • 09:33 hashar: Image ci-jessie-wikimedia-1478770026 in wmflabs-eqiad is ready
  • 09:26 hashar: Regenerate Nodepool base image for Jessie and refreshing snapshot image

2016-11-09

  • 20:27 Krenair: removed default SSH access from production host 208.80.154.135, the old gallium IP
  • 16:34 Reedy: deployment-tin no longer offline, jenkins running jobs now
  • 16:11 Reedy: marking deployment-tin.eqiad as offline to test -labs -> beta config rename

2016-11-08

  • 10:23 hashar: refreshing all jenkins jobs to clear out potential live hacks I made but can't remember which jobs I applied them to

2016-11-07

  • 14:01 gilles: Pointing deployment-imagescaler01.eqiad.wmflabs' puppet to puppetmaster.thumbor.eqiad.wmflabs

2016-11-04

  • 13:20 hashar: gerrit: created mediawiki/extensions/PageViewInfo.git and renamed user group extension-WikimediaPageViewInfo to extension-PageViewInfo T148775
  • 12:57 hashar: Image ci-jessie-wikimedia-1478263647 in wmflabs-eqiad is ready (bring in java for maven projects)
  • 12:49 dcausse: deployment-prep reloading nginx on deployment-elastic0[5-7] to fix ssl cert issue
  • 09:28 hashar: Delete integration-slave-jessie-1003 , only have a few jobs running on permanent Jessie slaves - T148183
  • 09:26 hashar: Delete zuul-dev-jessie.integration.eqiad.wmflabs was for testing Zuul on Jessie and it works just fine on contint1001 :] T148183
  • 09:25 hashar: Delete integration-slave-trusty-1012 one less permanent slave since some load has been moved to Nodepool T148183
  • 09:24 hashar: Delete integration-slave-trusty-1016 not pooled in Jenkins anymore T148183

2016-11-03

  • 15:05 Amir1: deploy 0caa589 in ores to deployment-sca03
  • 14:52 Amir1: deploying ores 0caa589 in deployment-sca03
  • 11:32 hashar: deployment-apertium01 manually cleared puppet.conf
  • 11:29 hashar: deployment-apertium01 fails puppet due to wrong certificate, bah
  • 07:22 Krenair: fiddled with jenkins jobs in mediawiki-core-doxygen-publish to try to get stuff moving in the postmerge queue again
  • 05:04 Krenair: beginning to move the rest of beta to the new puppetmaster
  • 01:53 mutante: followed instructions at https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock
  • 01:53 mutante: disabling and re-enabling gearman, zuul is not working and could be gearman deadlock

2016-11-02

  • 22:06 hashar: hello stashbot
  • 18:51 Krenair: armed keyholder on -tin and -mira
  • 18:50 Krenair: started mysql on -db boxes to bring beta back online
  • 10:54 hashar: Image ci-jessie-wikimedia-1478083637 in wmflabs-eqiad is ready
  • 10:47 hashar: Force refresh Nodepool snapshot for Jessie so it gets doxygen included T119140

2016-11-01

  • 22:22 Krenair: started mysql on -db03 to hopefully pull us out of read-only mode
  • 22:21 Krenair: started mysql on -db04
  • 22:19 Krenair: stopped and started udp2log-mw on -fluorine02
  • 22:10 hashar: Armed keyholder on deployment-tin. Instance had 20 minutes uptime and apparently keyholder does not self-arm
  • 22:00 Krenair: started moving nodes back to the new puppetmaster
  • 02:55 Krenair: Managed to mess up the deployment-puppetmaster02 cert, had to move those nodes back

2016-10-31

  • 20:57 Krenair: moving some nodes to deployment-puppetmaster02
  • 16:57 bd808: Added Niharika29 as project member

2016-10-27

  • 20:51 hashar: reboot integration-puppetmaster01
  • 18:50 bd808: stashbot has replaced qa-morebots in this channel as the sole bot handling !log messages
  • 18:46 bd808: Testing dual page wiki logging by stashbot. (check #3)
  • 18:36 bd808: !log deployment-prep Testing dual page wiki logging by stashbot. (second attempt)
  • 18:14 bd808: !log deployment-prep Testing dual page wiki logging by stashbot.
  • 10:30 hashar: integration: on Trusty slaves, remove jenkins-deploy from KVM which is only needed for Android testing for T149294: salt -v '*slave-trusty*' cmd.run 'deluser jenkins-deploy kvm'
  • 10:29 hashar: integration: on Trusty slaves, remove jenkins-deploy from KVM which is only needed for Android testing: salt -v '*slave-trusty*' cmd.run 'groupdeluser jenkins-deploy kvm'
  • 10:25 hashar: integration: purge Android packages from Trusty slaves for T149294 : salt -v '*slave-trusty*' cmd.run 'apt-get --yes remove --purge gcc-multilib lib32z1 lib32stdc++6 qemu'

2016-10-25

2016-10-24

  • 16:19 andrewbogott: upgrading deployment-puppetmaster to puppet 3.8.5 packages
  • 09:14 hashar: rebasing integration puppet master

2016-10-21

  • 09:42 gehel: decommission of deployment-elastic08 - T147777

2016-10-20

2016-10-14

  • 21:13 matt_flaschen: Ran START SLAVE to restart replication after columns created directly on replica were deleted.
  • 20:53 bd808: Dropped lu_local_id, lu_global_id from replica db which were added improperly
  • 20:37 matt_flaschen: Applied CentralAuth's patch-lu_local_id.sql migration for T148111, to sql --write
  • 20:09 bd808: Applied CentralAuth's patch-lu_local_id.sql migration for T148111
  • 11:30 dcausse: deployment-prep running sudo update-ca-certificates --fresh on deployment-ton to fix curl error code 60 in cirrus maint script (T145609)

2016-10-13

  • 21:21 hashar: Deleted CI slaves integration-slave-jessie-1004 integration-slave-jessie-1005 integration-slave-trusty-1013 integration-slave-trusty-1014 integration-slave-trusty-1017 integration-slave-trusty-1018
  • 20:12 hashar: Switching composer-hhvm / composer-php55 to Nodepool https://gerrit.wikimedia.org/r/#/c/306727/ T143938
  • 16:23 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 16:06 godog: add settings to duplicate traffic to thumbor in beta and restart swift-proxy
  • 16:03 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315648/ on deployment-puppetmaster
  • 15:35 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 14:38 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315234/5 on deployment-puppetmaster
  • 14:34 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 14:32 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315234/4 on deployment-puppetmaster
  • 14:32 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 14:27 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315234/ on deployment-puppetmaster
  • 14:22 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 13:42 gilles: Cherry picking https://gerrit.wikimedia.org/r/#/c/315248/ on deployment-puppetmaster

2016-10-12

2016-10-11

  • 21:35 hasharAway: Force pushed Zuul patchqueue 5628f95...fc6a118 HEAD -> patch-queue/debian/precise-wikimedia
  • 14:37 hashar: Mysql was down on Precise slaves. Apparently rebooted 17 days ago and I guess mysql does not spawn on boot. Restarted mysql on all Precise via: salt -v '*slave-precise*' cmd.run 'start mysql'
  • 09:35 godog: reboot deployment-imagescaler01 to enable memory cgroup
  • 08:29 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/#/c/313387/ Filter out refs/meta/config from all pipelines T52389
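The MySQL restart above was driven through salt across all Precise slaves. A sketch that just builds the salt command line used in the entry, so the targeting glob and remote command are visible; running it for real needs the salt master.

```shell
# Build the salt invocation from the log entry: -v for verbose,
# a glob targeting the Precise CI slaves, and cmd.run for the command.
build_salt_cmd() {
  printf "salt -v '%s' cmd.run '%s'" "$1" "$2"
}
cmd=$(build_salt_cmd '*slave-precise*' 'start mysql')
echo "$cmd"
```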

2016-10-10

  • 15:45 dcausse: deployment-prep deployment-elastic0[5-8]: reduce the number of replicas to 1 max for all indices

2016-10-07

  • 20:10 hashar: Created repository.integration.eqiad.wmflabs to play/Test Sonatype Nexus
  • 20:10 hashar: rebooting integration-puppetmaster01
  • 07:55 hashar: Upgrading Nodepool image for Jessie

2016-10-06

  • 14:45 hashar: deployment-mira disarmed/rearmed keyholder in an attempt to clear a Shinken alarm
  • 12:16 hashar: Jenkins slave deployment-tin.eqiad: removing label "deployment-tin.eqiad"; it already has "BetaClusterBastion" and all jobs are bound to that

2016-10-05

  • 19:33 andrewbogott: removing mediawiki::conftool from deployment-mediawiki04, deployment-mediawiki06, deployment-mediawiki05

2016-10-04

  • 19:43 andrewbogott: removed contint::slave_scripts and associated files from deployment-sca01 and deployment-sca02
  • 16:22 bd808: Restarted puppetmaster process on deployment-puppetmaster
  • 16:20 bd808: deployment-puppetmaster: removing cherry-pick of https://gerrit.wikimedia.org/r/#/c/305256/; conflicts with upstream changes
  • 15:01 godog: shutdown deployment-poolcounter02, replaced by deployment-poolcounter04 - T123734
  • 09:03 hashar: Regenerating configuration of all Jenkins job due to https://gerrit.wikimedia.org/r/#/c/313306/
  • 01:14 twentyafterfour: New scap command line autocompletions are now installed on deployment-tin and deployment-mira refs T142880

2016-10-03

  • 22:40 thcipriani: manual rebase on deployment-puppetmaster:/var/lib/git/operations/puppet
  • 22:05 thcipriani: reapplied beta::deployaccess to mediawiki servers
  • 21:42 cscott: updated OCG to version 0bf27e3452dfdc770317f15793e93e6e89c7865a
  • 21:36 cscott: starting OCG deploy
  • 13:43 hashar: Added integration-slave-trusty-1014 back in the pool
  • 13:41 hashar: Tip of the day: to reboot an instance and bypass molly-guard: /sbin/reboot
  • 13:39 hashar: integration-slave-trusty-1014 upgrading packages, clean up and rebooting it
  • 13:37 hashar: marked integration-slave-trusty-1014 offline. Cant run job / get stuck somehow
  • 10:21 godog: add role::prometheus::node_exporter to classes in hiera:deployment-prep T144502

2016-10-01

  • 09:41 hashar: beta: shutdown deployment-db1 and deployment-db2 . Databases have been migrated to other hosts T138778

2016-09-29

2016-09-28

  • 23:56 MaxSem: Deleted varnish cache files on deployment-cache-upload04 to free up space, disk full
  • 21:48 hasharAway: deployment-tin: service nscd restart
  • 21:43 hasharAway: beta cluster update database is broken :/ Filed T146947 about it
  • 21:25 hasharAway: deployment-tin: sudo -H -u www-data php5 /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=commonswiki --quick
  • 21:18 hasharAway: https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/ is broken for unknown reason :(
  • 20:48 hasharAway: Deleted deployment-tin02 via Horizon. Replaced by deployment-tin
  • 20:19 hasharAway: restarted keyholder on deployment-tin
  • 20:11 hasharAway: Switch Jenkins slave deployment-mira.eqiad to deployment-tin.eqiad
  • 20:09 hasharAway: deployment-tin: keyholder arm
  • 20:08 hasharAway: deployment-tin for instance in `grep deployment /etc/dsh/group/mediawiki-installation`; do ssh-keyscan `dig +short $instance` >> /etc/ssh/ssh_known_hosts; done;
  • 19:49 hasharAway: Dropping deployment-tin02, replacing it with deployment-tin which has been rebuilt on Jessie T144006
  • 12:44 hashar: Can't finish up the switch to deployment-tin, puppet still does not pass due to weird clone issues ...
  • 11:48 hashar: Deleting deployment-tin Trusty instance and recreating one with the same hostname as Jessie; meant to replace deployment-tin02 T144006
  • 10:44 hashar: CI updating all mwext-Wikibase* jenkins jobs for https://gerrit.wikimedia.org/r/#/c/313056/ T142158
  • 10:43 hashar: Updating slave scripts for "Disable garbage collection for mw-phpunit.sh" https://gerrit.wikimedia.org/r/313051 T142158
  • 08:31 hashar: Reloading Zuul to deploy dc2ada37
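The known_hosts seeding loop in the 20:08 entry above can be sketched as a small function. Hedged version: it prints the per-host ssh-keyscan commands instead of running them (the real loop resolves each host with `dig +short` first and appends to /etc/ssh/ssh_known_hosts as root); the sample group file stands in for /etc/dsh/group/mediawiki-installation.

```shell
# Print the ssh-keyscan commands for every deployment host listed
# in a dsh group file.
seed_known_hosts() {
  grep deployment "$1" | while read -r instance; do
    echo "ssh-keyscan $instance >> /etc/ssh/ssh_known_hosts"
  done
}
# Demo against a temporary group file with two hosts from the log.
group=$(mktemp)
printf 'deployment-mediawiki04\ndeployment-tin\n' > "$group"
cmds=$(seed_known_hosts "$group")
echo "$cmds"
rm -f "$group"
```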

2016-09-27

2016-09-26

  • 23:58 bd808: Started udp2log-mw on deployment-fluorine02 for T146723
  • 11:35 hashar: deployment-salt02 : autoremoving a bunch of java related packages
  • 11:31 hashar: rebooting deployment-salt02 has a kernel soft lock while hitting the disk
  • 11:24 hashar: beta: mass upgrading all debian packages on all instances
  • 10:32 hashar: beta: on deployment-pdf01 rm -fR /home/cscott/tmp/npm*
  • 10:29 hashar: deployment-pdf01 apt-get upgrade / cleaning files left over etc
  • 10:28 hashar: beta: on deployment-pdf01 rm -fR /home/cscott/.npm/ T145343

2016-09-24

  • 20:08 hashar: deployment-tin is shutdown. Replaced by Jessie deployment-tin02
  • 20:02 hashar: deployment-mira: ssh-keyscan deployment-tin02.deployment-prep.eqiad.wmflabs >> /etc/ssh/ssh_known_hosts
  • 20:00 hashar: beta: dropping deployment-tin (ubuntu) replaced by deployment-tin02 (jessie). Primary is still deployment-mira (https://gerrit.wikimedia.org/r/#/c/312654/ T144578 )

2016-09-23

2016-09-22

  • 19:29 hasharAway: switching Jenkins slaves workspace from /mnt/jenkins-workspace to /srv/jenkins-workspace (actually the same dir/inode on the filesystem)
  • 01:52 legoktm: deploying https://gerrit.wikimedia.org/r/312158

2016-09-21

  • 18:22 yuvipanda: shutting down integration-puppetmaster
  • 17:26 yuvipanda: cherry-pick https://gerrit.wikimedia.org/r/#/c/312044/ on deployment-puppetmaster
  • 16:41 hashar: deployment-tin02 initial provisioning is complete. Gotta add it as a deployment server via a puppet.git patch
  • 16:01 hashar: deployment-tin02 applied puppet classes beta::autoupdater, beta::deployaccess, role::deployment::server, role::labs::lvm::srv
  • 15:32 hashar: spawned deployment-tin02
  • 14:55 hashar: removed the CI puppet class from deployment-sca01 and deployment-sca02 . Stopped services using /srv , unmounted /srv, removed it from /etc/fstab
  • 14:27 hashar: deployment-sca01 and deployment-sca02 are now broken. The CI puppet class mounts /srv, which ends up being only 500 MBytes
  • 14:08 hashar: deployment-mira adding puppet class beta::autoupdater
  • 14:06 hashar: Enabling Jenkins slave deployment-mira
  • 14:05 hashar: deployment-mira seems ready for action and is the primary deployment server. Enabling jenkins to it
  • 11:25 hashar: removing Jenkins slave deployment-tin , deployment-mira is the new deployment master T144578
  • 10:58 hashar: Changing Jenkins slaves home dir for deployment-sca01 and deployment-sca02 from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy
  • 10:57 hashar: Changing Jenkins slaves home dir for deployment-tin and deployment-mira from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy
  • 10:10 hashar: deployment-mira removing "role::labs::lvm::srv" duplicate with role::ci::slave::labs::common
  • 10:07 hashar: Making deployment-mira a Jenkins slave by applying puppet class role::ci::slave::labs::common T144578
  • 10:05 hashar: Arming keyholder on deployment-mira
  • 09:43 hashar: beta: switching master deployment server from deployment-tin to deployment-mira
  • 09:34 hashar: From Hiera:deployment-prep remove bit already in puppet: "scap::deployment_server": deployment-tin.deployment-prep.eqiad.wmflabs
  • 08:55 moritzm: remove mira from deployment-prep (replaced by deployment-mira)
  • 08:37 hashar: beta: manually rebased puppetmaster
  • 08:11 elukey: terminated jobrunner01 and removed from deployment-prep's scap dsh list
  • 07:19 legoktm: deploying https://gerrit.wikimedia.org/r/311927

2016-09-20

  • 21:49 hashar: Deleting deployment-mira02; /srv was too small. Replaced by deployment-mira
  • 20:54 hashar: from deployment-tin for T144578, accept ssh host key of deployment-mira : sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs
  • 20:47 hashar: Creating deployment-mira instance with flavor c8.m8.s60 (8 cpu, 8G RAM and 60G disk) T144578
  • 19:00 thcipriani: cherry-picked https://gerrit.wikimedia.org/r/#/c/311760/ to deployment-puppetmaster to fix failing beta-scap-eqiad job, had to manually start rsync, puppet failed to start
  • 18:38 hashar: on tin: `sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira02.deployment-prep.eqiad.wmflabs` - T144006
  • 18:33 hashar: on deployment-mira02 ran `sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs` per T144006
  • 18:01 marxarelli: deployed mediawiki-config changes on beta cluster. back in read/write mode using new database instances
  • 17:37 marxarelli: deployment-db04 restored from backup and replication started
  • 16:54 marxarelli: upgraded package and data to mariadb 10 on deployment-db03
  • 16:31 marxarelli: cherry picking operations/puppet patches (T138778) to deployment-puppetmaster
  • 16:30 moritzm: rebooting deployment-mira02
  • 16:23 marxarelli: applied innodb transaction logs to deployment-db1 backup and successfully restored on deployment-db03
  • 15:47 marxarelli: completed innobackupex on deployment-db1. copying backup to deployment-db03 for restoration
  • 14:54 hashar: beta: cherry picking fix up for the jobrunner logging https://gerrit.wikimedia.org/r/#/c/311702/ and https://gerrit.wikimedia.org/r/311719 T146040
  • 14:44 marxarelli: entering read-only mode on beta cluster
  • 14:27 elukey: stopped puppet, jobrunner and jobchron on deployment-jobrunner01
  • 14:20 marxarelli: disabling beta cluster jenkins jobs in preparation for data migration (T138778)
  • 13:07 godog: add deployment-prometheus01 instance T53497
  • 11:20 elukey: applied beta::deployaccess, role::labs::lvm::srv, role::mediawiki::jobrunner to jobrunner02
  • 10:45 elukey: created deployment-jobrunner02 in deployment-prep
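Several entries above (20:54, 18:38, 18:33) use the same keyholder-proxied ssh to accept a new deploy target's host key as jenkins-deploy. A sketch that only prints the command, since it needs the keyholder agent socket on a real deployment server; the target hostname is an example from the log.

```shell
# Build the keyholder-proxied ssh command: run as jenkins-deploy with
# the keyholder proxy socket as the ssh agent, connecting as mwdeploy.
accept_host_key() {
  printf 'sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@%s' "$1"
}
cmd=$(accept_host_key deployment-mira.deployment-prep.eqiad.wmflabs)
echo "$cmd"
```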

2016-09-19

  • 22:01 legoktm: shutdown integration-puppetmaster
  • 21:29 yuvipanda: regenerated client certs only on integration-puppetmaster01, seems ok now
  • 20:46 yuvipanda: re-enable puppet everywhere
  • 20:43 yuvipanda: enable puppet and run on integration-slave-trusty-1003.eqiad.wmflabs
  • 20:41 yuvipanda: accidentally deleted /var/lib/puppet/ssl on integration-puppetmaster01 as well, causing it to lose keys. Reprovision by pointing to labs puppetmaster
  • 20:34 yuvipanda: rm -rf /var/lib/puppet/ssl on all integration nodes
  • 20:34 yuvipanda: copied /etc/puppet/puppet.conf from integration-trusty-slave-1001 to all integration
  • 20:25 yuvipanda: delete /etc/puppet/puppet.conf.d/10-self.conf and /var/lib/puppet/ssl on integration-slave-trusty-1001
  • 20:20 yuvipanda: re-enabled puppet on integration-slave-trusty-1001
  • 20:08 yuvipanda: reset puppetmaster of integration-puppetmaster01 to be labs puppetmaster
  • 20:03 yuvipanda: disable puppet across integration project, moving puppetmasters
  • 19:49 legoktm: T144951: enabled role::puppetmaster::standalone role on integration-puppetmaster01
  • 19:33 legoktm: T144951: created integration-puppetmaster01 instance using m1.small and Debian jessie
  • 15:11 hashar: beta: updating jobrunner service 0dc341f..a0e8216

2016-09-17

2016-09-16

  • 21:03 hashar: deployment-tin did a git gc on /srv/deployment/ores . That freed up disk space and cleared an alarm on co-master mira02
  • 21:00 hashar: deleted deployment-parsoid05
  • 20:52 hashar: fixed puppet on deployment-parsoid05 . Temporary instance; will delete it later to clear out shinken.wmflabs.org
  • 20:27 hashar: beta: force running puppet in batches of 4 instances: salt --batch 4 -v 'deployment-*' cmd.run 'puppet agent -tv'
  • 20:13 hashar: beta: restarted puppetmaster
  • 20:07 hashar: beta: salt -v '*' cmd.run 'rm -fR /var/lib/puppet/client/ssl/'
  • 20:07 hashar: beta: stopping puppetmaster, rm -f /var/lib/puppet/server/ssl/ca/signed/*
  • 19:53 hashar: beta: created instance "deployment-parsoid05". Should be deleted later; that is merely to purge the hostname from Shinken ( http://shinken.wmflabs.org/host/deployment-parsoid05 )
  • 11:42 hashar: beta: apt-get upgrade on deployment-jobrunner01
  • 11:36 hashar: apt-get upgrade on deployment-tin , bring in a new hhvm version and others
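The 20:07–20:27 entries above are one puppet CA reset: stop the master, wipe the signed certs, wipe the client certs fleet-wide, restart, then force agent runs in batches. Collected here as one hedged sequence, printed rather than executed (it needs root on the beta puppetmaster and salt access to the project).

```shell
# Emit the CA reset steps from the log entries, in order.
ca_reset_steps() {
  cat <<'EOF'
service puppetmaster stop
rm -f /var/lib/puppet/server/ssl/ca/signed/*
salt -v '*' cmd.run 'rm -fR /var/lib/puppet/client/ssl/'
service puppetmaster start
salt --batch 4 -v 'deployment-*' cmd.run 'puppet agent -tv'
EOF
}
steps=$(ca_reset_steps)
echo "$steps"
```

The `--batch 4` in the last step throttles the forced agent runs to four instances at a time, which keeps the puppetmaster from being overwhelmed right after its restart.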

2016-09-15

  • 22:29 legoktm: sudo salt '*precise*' cmd.run 'service mysql start', all mysqls were down
  • 16:45 godog: install xenial kernel on deployment-zotero01 and reboot T145793
  • 16:18 hashar: prometheus enabled on all beta cluster instance. Does not support Precise hence puppet will fail on the last two Precise instances deployment-db1 and deployment-db2 until they are migrated to Jessie T138778
  • 15:53 godog: add role::prometheus::node_exporter to classes in hiera:deployment-prep T144502
  • 15:10 hashar: beta: Applying puppet class role::prometheus::node_exporter to mira02 just like mira. That is for godog
  • 15:08 hashar: T144006 Disabled Jenkins job beta-scap-eqiad. On mira02 rm -fR /srv/* . Applying puppet for role::labs::lvm::srv
  • 15:05 hashar: T144006 Applying class role::labs::lvm::srv to mira02 (it is out of disk space :D )
  • 14:45 hashar: T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mira02.deployment-prep.eqiad.wmflabs
  • 14:44 hashar: T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki05.deployment-prep.eqiad.wmflabs
  • 12:33 elukey: added base::firewall, beta::deployaccess, mediawiki::conftool, role::mediawiki::appserver to mediawiki05
  • 12:20 elukey: terminate mediawiki02 to create mediawiki05
  • 10:48 hashar: beta: cherry picking moritzm patch https://gerrit.wikimedia.org/r/#/c/310793/ "Also handle systemd in keyholder script" T144578
  • 09:33 hashar: T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki06.deployment-prep.eqiad.wmflabs
  • 09:10 elukey: executed git pull and then git rebase -i on deployment puppet master
  • 08:52 elukey: terminated mediawiki03 and created mediawiki06
  • 08:45 elukey: removed mediawiki03 from puppet with https://gerrit.wikimedia.org/r/#/c/310749/
  • 02:36 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/310701

2016-09-14

  • 21:37 hashar: integration: setting "ulimit -c 2097152" on all slaves due to Zend PHP segfaulting T142158
  • 14:31 hashar: Added otto to integration labs project
  • 13:28 gehel: upgrading deployment-logstash2 to elasticsearch 2.3.5 - T145404
  • 09:27 hashar: Deleting deployment-mediawiki01 , replaced by deployment-mediawiki04 T144006
  • 07:19 legoktm: sudo salt '*trusty*' cmd.run 'service mysql start', it was down on all trusty slaves
  • 07:17 legoktm: mysql just died on a bunch of slaves (trusty-1013, 1012, 1001)

2016-09-13

  • 17:02 marxarelli: re-enabling beta cluster jenkins jobs following maintenance window
  • 16:59 marxarelli: aborting beta cluster db migration due to time constraints and ops outage. will reschedule
  • 15:34 marxarelli: disabled beta jenkins builds while in maintenance mode
  • 15:18 marxarelli: starting 2-hour read-only maintenance window for beta cluster migration
  • 10:06 hashar: beta: manually updated jobrunner install on deployment-jobrunner01 and deployment-tmh01 then reloaded the services with: service jobchron reload
  • 10:02 hashar: Trebuchet is broken for /srv/deployment/jobrunner/jobrunner ; can't reach the deploy minions somehow. Did the update manually
  • 10:00 hashar: Upgrading beta cluster jobrunner to catch up with upstream b952a7c..0dc341f merely picking up a trivial log change ( https://gerrit.wikimedia.org/r/#/c/297935/ )
  • 09:40 hashar: Unpooled deployment-mediawiki01 from scap and varnish. Shutting down instance. T144006
  • 09:02 hashar: on deployment-tin, accepted mediawiki04 host key for jenkins-deploy user : sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs T144006
  • 08:26 hashar: mwdeploy@deployment-mediawiki04 manually accepted ssh host key of deployment-tin T144006
  • 08:17 hashar: beta: manually accepted ssh host key for deployment-mediawiki04 as user mwdeploy on deployment-tin and mira T144006
  • 07:46 gehel: upgrading elasticsearch to 2.3.5 on deployment-elastic0? - T145404

2016-09-12

  • 14:41 elukey: applied base::firewall, beta::deployaccess, mediawiki::conftool, role::mediawiki::appserver to deployment-mediawiki04.deployment-prep.eqiad.wmflabs (Debian jessie instance) - T144006
  • 12:50 gehel: rolling back upgrading elasticsearch to 2.4.0 on deployment-elastic05 - T145058
  • 12:03 gehel: upgrading elasticsearch to 2.4.0 on deployment-elastic0? - T145058
  • 12:01 hashar: Gerrit: made analytics-wmde group to be owned by themselves
  • 11:57 hashar: Gerrit: added ldap/wmde as an included group of the 'wikidata' group. Asked by and demoed to addshore

2016-09-11

2016-09-09

  • 20:53 thcipriani: testing scap 3.2.5-1 on beta cluster
  • 11:08 hashar: Added git tag for latest versions of mediawiki/selenium and mediawiki/ruby/api
  • 09:30 legoktm: Image ci-jessie-wikimedia-1473412532 in wmflabs-eqiad is ready
  • 08:53 legoktm: added phpflavor-php70 label to integration-slave-jessie-100[1-5]
  • 08:49 legoktm: deploying https://gerrit.wikimedia.org/r/309048

2016-09-08

  • 21:33 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/309413 " Inject PHP_BIN=php5 for php53 jobs"
  • 20:00 hashar: nova delete ci-jessie-wikimedia-369422 (was stuck in deleting state)
  • 19:49 hashar: Nodepool, deleting instances that Nodepool lost track of (from nodepool alien-list)
  • 19:47 hashar: nodepool can't delete: ci-jessie-wikimedia-369422 (delete | 2.24 hours). Stuck in task_state=deleting  :(
  • 19:46 hashar: Nodepool looping over some tasks since 17:45 ( https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=21&fullscreen )
  • 19:26 legoktm: repooled integration-slave-jessie-1005 now that php7 testing is done
  • 19:19 hashar: integration: salt -v '*' cmd.run 'cd /srv/deployment/integration/slave-scripts; git pull' | https://gerrit.wikimedia.org/r/308931
  • 19:12 hashar: integration: salt -v '*' cmd.run 'cd /srv/deployment/integration/slave-scripts; git pull' | https://gerrit.wikimedia.org/r/309272
  • 17:08 legoktm: deleted integration-jessie-lego-test01
  • 16:50 legoktm: deleted integration-aptly01
  • 10:03 hashar: Delete Jenkins job https://integration.wikimedia.org/ci/job/mwext-VisualEditor-sync-gerrit/ that has been left behind. It is no longer needed. T51846 T86659
  • 10:02 hashar: Delete mwext-VisualEditor-sync-gerrit job, already got removed by ostriches in 139d17c8f1c4bcf2bb761e13a6501e4d85684066 . The issue in Gerrit (T51846) has been fixed. Poke T86659 , one less job on slaves.

2016-09-07

  • 20:44 matt_flaschen: Re-enabled beta-code-update-eqiad .
  • 20:35 hashar: Updated security group for deployment-prep labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323
  • 20:30 hashar: Updated security group for contintcloud and integration labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323
  • 20:14 matt_flaschen: Temporarily disabled https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ to test live revert of aa0f6ea
  • 16:09 hashar: Nodepool back in action. Had to manually delete some instances in labs
  • 15:58 hashar: Restarting Nodepool . Lost state when labnet got moved T144945
  • 13:13 hashar: Image ci-jessie-wikimedia-1473253681 in wmflabs-eqiad is ready , has php7 packages. T144872
  • 11:53 hashar: Force refreshing Nodepool jessie snapshot to get PHP7 included T144872
  • 11:03 hashar: integration: cherry pick https://gerrit.wikimedia.org/r/#/c/308955/ "contint: prefer our bin/php alternative" T144872
  • 10:55 hashar: integration: dropped PHP7 cherry pick from puppet master. https://gerrit.wikimedia.org/r/#/c/308918/ has been merged. Pushing it to the fleet of permanent Jessie slaves. T144872
  • 10:37 hashar: beta: cleaning up salt-keys on deployment-salt02 . Bunch of instances got deleted
  • 09:41 hashar: Moving rake jobs back to Nodepool ( T143938 ) with https://gerrit.wikimedia.org/r/#/c/306723/ and https://gerrit.wikimedia.org/r/#/c/306724/
  • 05:57 legoktm: deploying https://gerrit.wikimedia.org/r/308932 https://gerrit.wikimedia.org/r/299697
  • 05:26 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/308918/ onto integration-puppetmaster with a hack that has it only apply to integration-slave-jessie-1005
  • 04:59 legoktm: added Krenair to integration project to help debug puppet stuff
  • 04:35 legoktm: depooled integration-slave-jessie-1005 in jenkins so I can test puppet stuff on it

2016-09-06

  • 13:58 hashar: Qunit jobs should be all fine again now. T144802
  • 13:46 hashar: nodepool.SnapshotImageUpdater: Image ci-jessie-wikimedia-1473169259 in wmflabs-eqiad is ready T144802
  • 13:20 hashar: Rebuilding Nodepool Jessie image to hopefully include libapache-mod-php5 and restore qunit jobs behavior T144802
  • 10:37 hashar: gerrit: mark apps/android/commons hidden since it is now community maintained on GitHub. Will avoid confusion. T127678
  • 09:11 hashar: nodepool.SnapshotImageUpdater: Image ci-trusty-wikimedia-1473152801 in wmflabs-eqiad is ready
  • 09:06 hashar: nodepool.SnapshotImageUpdater: Image ci-jessie-wikimedia-1473152393 in wmflabs-eqiad is ready
  • 09:00 hashar: Trying to refresh Nodepool Jessie image. Image properties had been dropped; should fix it

2016-09-05

  • 14:08 hashar: Refreshing Nodepool base images for Trusty and Jessie. Managed to build new ones after T143769

2016-09-02

2016-09-01

2016-08-31

  • 23:40 bd808: forced puppet run on deployment-salt02. Had not run automatically for 8 hours
  • 23:36 bd808: Deleted /data/scratch on integration-slave-trusty-1016 to fix puppet
  • 23:32 bd808: Deleted /data/scratch on integration-slave-trusty-1013 to fix puppet
  • 23:22 bd808: Deleted /data/scratch on integration-slave-trusty-1012 to fix puppet
  • 23:19 bd808: Deleted /data/scratch on integration-slave-trusty-1011 to fix puppet
  • 23:15 bd808: Deleted /data/scratch on integration-slave-precise-1012 to fix puppet
  • 23:11 bd808: Deleted /data on integration-slave-precise-1011 to fix puppet
  • 23:08 bd808: Deleted /data on integration-slave-jessie-1001 to fix puppet
  • 23:04 bd808: Deleted empty /data, /data/project, and /data/scratch on integration-puppetmaster to fix puppet
  • 22:59 bd808: Deleted empty /data, /data/project, and /data/scratch on integration-publisher to fix puppet
  • 01:44 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/307670
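The repeated /data/scratch cleanups above are the same operation applied host by host; they can be sketched as one loop. Hosts are the trusty slaves from the log entries, and the deletion is printed rather than executed.

```shell
# Emit the cleanup command for each affected slave.
hosts='integration-slave-trusty-1016 integration-slave-trusty-1013 integration-slave-trusty-1012 integration-slave-trusty-1011'
plan=$(for h in $hosts; do echo "ssh $h sudo rm -rf /data/scratch"; done)
echo "$plan"
```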

2016-08-30

  • 23:31 yuvipanda: cherry-picking https://gerrit.wikimedia.org/r/#/c/307656/ fixed puppet on the elasticsearch machines!
  • 22:29 yuvipanda: in lieu of blood sacrifice, restart puppetmaster on deployment-puppetmaster
  • 21:44 yuvipanda: use clush to fix puppet.conf of all clients, realize also accidentally set a client's puppet.conf for the server, recover server's old conf file from a cat in shell history, restore, breathe sigh of relief
  • 21:37 yuvipanda: sudo takes like 15s each time, is there no god?
  • 21:36 yuvipanda: managed to get vim into a state where I can not quit it, probably recording a macro. I hate computers
  • 21:16 yuvipanda: deployment-pdf01 fixed manually
  • 21:15 yuvipanda: deployment-pdf02 has proper ssl certs mysteriously without me doing anything
  • 21:06 yuvipanda: moved deployment-db[12], deployment-stream to not use role::puppet::self, attempting to semi-automate rest
  • 20:52 yuvipanda: cherry-picked appropriate patch on deployment-puppetmaster for T120159, did https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep/host/deployment-puppetmaster&oldid=818847 to make sure the puppetmaster allows connections from elsewhere
  • 19:48 legoktm: deploying https://gerrit.wikimedia.org/r/306710
  • 19:13 bd808: Fixed puppet runs on deployment-sca0[12] with cherry-pick of https://gerrit.wikimedia.org/r/#/c/307561
  • 18:57 bd808: Duplicate declaration: File[/srv/deployment] is already declared in file /etc/puppet/modules/contint/manifests/deployment_dir.pp:14; cannot redeclare at /etc/puppet/modules/service/manifests/deploy/common.pp:12 on node deployment-sca01.deployment-prep.eqiad.wmflabs
  • 18:40 bd808: Puppet busted on deployment-aqs01 -- Could not find data item analytics_hadoop_hosts in any Hiera data file and no default supplied at /etc/puppet/manifests/role/aqs.pp:46
  • 12:59 hashar: beta: revert master branch to origin. Ran scap and enabled again beta-code-update-eqiad job.
  • 12:55 hashar: Running scap on beta cluster via https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/117786/console T143889
  • 12:53 hashar: Cherry picking https://gerrit.wikimedia.org/r/#/c/307501/ on beta cluster for T143889
  • 12:51 hashar: disabling https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ to cherry pick a revert patch

2016-08-29

  • 07:56 hashar: hard rebooting integration-slave-trusty-1012 via horizon and restarting puppet manually
  • 07:50 hashar: integration-slave-trusty-1013 puppet.conf certname was set to 'undef' breaking puppet

2016-08-27

  • 20:51 hashar: integration: tweak sudo policy for jenkins-deploy running cowbuilder: env_keep+=DEB_BUILD_OPTIONS
  • 20:24 hashar: Manually installing jenkins-debian-glue 0.17.0 on integration-slave-jessie-1004 and integration-slave-jessie-1005 ( T142891 ) . That is to support PBUILDER_USENETWORK T141114
  • 20:05 hashar: Jenkins added global env variable BUILD_TIMEOUT set to 30 for T144094

2016-08-26

  • 22:29 legoktm: deploying https://gerrit.wikimedia.org/r/307025
  • 08:15 Amir1: restart uwsgi-ores and celery-ores-worker in deployment-sca03 (T143567)
  • 08:11 hashar: beta-scap-eqiad job is back in operation. Was blocked on logstash not being reachable. T143982
  • 08:10 hashar: deployment-logstash2 is back after a hard reboot. T143982
  • 08:07 hashar: rebooting deployment-logstash02 via Horizon. Kernel hang apparently T143982
  • 08:00 hashar: beta-scap-eqiad failing; investigating
  • 07:54 Amir1: cherry-picked 306839/1 into deployment-puppetmaster
  • 00:28 twentyafterfour: restarted puppetmaster service on deployment-puppetmaster

2016-08-25

  • 23:15 Amir1: cherry-picked 306839/1 into puppetmaster
  • 20:10 hashar: Delete integration-slave-trusty-1023 with label AndroidEmulator. The Android job has been migrated to a new Jessie based instance via T138506
  • 19:05 hashar: hard rebooting integration-raita via Horizon
  • 16:04 hashar: fixing puppet.conf on integration-slave-trusty-1013 it mysteriously considered itself as the puppetmaster
  • 16:02 hashar: integration restarted puppetmaster service
  • 08:28 hashar: beta update database fixed
  • 08:28 hashar: beta cluster update database failed due to: "Your composer.lock file is up to date with current dependencies!" Probably a race condition with ongoing scap.

2016-08-24

  • 15:14 halfak: deploying ores d00171
  • 09:50 hashar: deployment-redis02: fixed AOF file /srv/redis/deployment-redis02-6379.aof and restarted the redis instance; should fix T143655 and might help T142600
  • 09:43 hashar: T143655 stopping redis 6379 on deployment-redis02 : initctl stop redis-instance-tcp_6379
  • 09:38 hashar: deployment-redis02 initctl stop redis-instance-tcp_6379 && initctl start redis-instance-tcp_6379 | That did not fix it magically though T143655
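The three entries above amount to: stop the instance, repair the append-only file, start it again. A hedged sketch; the log does not name the repair tool, so `redis-check-aof` (the standard Redis AOF repair utility) is an assumption here, while the upstart job name and AOF path are from the entries. All commands are printed rather than run.

```shell
# Emit the stop / repair / start sequence for a broken Redis AOF.
repair_aof() {
  printf '%s\n' \
    'initctl stop redis-instance-tcp_6379' \
    "redis-check-aof --fix $1" \
    'initctl start redis-instance-tcp_6379'
}
plan=$(repair_aof /srv/redis/deployment-redis02-6379.aof)
echo "$plan"
```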

2016-08-23

2016-08-22

  • 23:40 legoktm: updating slave_scripts on all slaves

2016-08-18

  • 22:03 bd808: deployment-fluorine02: Hack 'datasets:x:10003:997::/home/datasets:/bin/bash' into /etc/passwd for T117028
  • 20:30 MaxSem: Restarted hhvm on appservers for wikidiff2 upgrades
  • 19:03 MaxSem: Upgrading hhvm-wikidiff2 in beta cluster
  • 16:53 legoktm: deploying https://gerrit.wikimedia.org/r/#/c/305532/

2016-08-17

  • 22:28 legoktm: deploying https://gerrit.wikimedia.org/r/305408
  • 21:33 cscott: updated OCG to version e3e0fd015ad8fdbf9da1838c830fe4b075c59a29
  • 21:28 bd808: restarted salt-minion on deployment-pdf02
  • 21:26 bd808: restarted salt-minion on deployment-pdf01
  • 21:15 cscott: starting OCG deploy to beta
  • 14:10 gehel: upgrading elasticsearch to 2.3.4 on deployment-logstash2.deployment-prep.eqiad.wmflabs
  • 13:28 gehel: upgrading elasticsearch to 2.3.4 on deployment-elastic*.deployment-prep + JVM upgrade

2016-08-16

  • 23:10 thcipriani: max_servers at 6, seeing 6 allocated instances, still seeing 403 already used 10 of 10 instances :((
  • 22:37 thcipriani: restarting nodepool, bumping max_servers to match up with what openstack seems willing to allocate (6)
  • 09:06 Amir1: removing ores-related-cherry-picked commits from deployment-puppetmaster

2016-08-15

  • 21:30 thcipriani: update scap on beta to 3.2.3-1 bugfix release
  • 02:30 bd808: Forced a zuul restart -- https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart
  • 02:23 bd808: Lots and lots of "AttributeError: 'NoneType' object has no attribute 'name'" errors in /var/log/zuul/zuul.log
  • 02:21 bd808: nodepool delete 301068
  • 02:20 bd808: nodepool delete 301291
  • 02:20 bd808: nodepool delete 301282
  • 02:19 bd808: nodepool delete 301144
  • 02:11 bd808: nodepool delete 299641
  • 02:11 bd808: nodepool delete 278848
  • 02:08 bd808: Aug 15 02:07:48 labnodepool1001 nodepoold[24796]: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
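The run of `nodepool delete` entries above is one cleanup pass: after the quota error, each stuck instance id (found via `nodepool list`) is deleted in turn. Sketched as a loop that only prints the commands, since nodepool is not available outside the CI master; the ids are the ones from the log.

```shell
# Print a nodepool delete command for each stuck instance id.
cleanup_nodes() {
  for id in "$@"; do
    echo "nodepool delete $id"
  done
}
plan=$(cleanup_nodes 301068 301291 301282 301144 299641 278848)
echo "$plan"
```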

2016-08-13

2016-08-12

2016-08-10

2016-08-09

2016-08-08

  • 23:33 Tim: deleted instance deployment-depurate01
  • 16:19 bd808: Manually cleaned up root@logstash02 cronjobs related to logstash03
  • 14:39 Amir1: deploying d00159c for ores in sca03
  • 10:14 Amir1: deploying 616707c into sca03 (for ores)

2016-08-07

  • 12:01 hashar: Nodepool: can't spawn instances due to: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
  • 12:01 hashar: nodepool: deleted servers stuck in "used" states for roughly 4 hours (using: nodepool list , then nodepool delete <id>)
  • 11:54 hashar: Nodepool: can't spawn instances due to: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
  • 11:54 hashar: nodepool: deleted servers stuck in "used" states for roughly 4 hours (using: nodepool list , then nodepool delete <id>)

2016-08-06

  • 12:31 Amir1: restarting uwsgi-ores and celery-ores-worker in deployment-sca03
  • 12:28 Amir1: cherry-picked 303356/1 into the puppetmaster
  • 12:00 Amir1: restarting uwsgi-ores and celery-ores-worker in deployment-sca03

2016-08-05

2016-08-04

  • 20:07 marxarelli: Running jenkins-jobs update config/ 'selenium-*' to deploy https://gerrit.wikimedia.org/r/#/c/302775/
  • 17:03 legoktm: jstart -N qamorebots /usr/lib/adminbot/adminlogbot.py --config ./confs/qa-logbot.py

2016-08-01

  • 20:28 thcipriani: restarting deployment-ms-be01, not responding to ssh, mw-fe01 requests timing out
  • 08:28 Amir1: deploying fedd675 to ores in sca03

2016-07-29

2016-07-28

  • 21:46 hashar_: integration: change sudo policy for jenkins-deploy to help on T141538 : env_keep+=WORKSPACE
  • 12:18 hashar: installed 2.1.0-391-gbc58ea3-wmf1jessie1 on zuul-dev-jessie.integration.eqiad.wmflabs T140894
  • 12:18 hashar: installed 2.1.0-391-gbc58ea3-wmf1jessie1 on zuul-dev-jessie.integration.eqiad.wmflabs
  • 09:46 hashar: Nodepool: Image ci-trusty-wikimedia-1469698821 in wmflabs-eqiad is ready
  • 09:35 hashar: Regenerated Nodepool image for Trusty. The snapshot failed while upgrading grub-pc for some reason. Noticed with thcipriani yesterday

2016-07-27

  • 16:13 hashar: salt -v '*slave-trusty*' cmd.run 'service mysql start' ( was missing on integration-slave-trusty-1011.integration.eqiad.wmflabs )
  • 14:03 hashar: upgraded zuul on gallium via dpkg -i /root/zuul_2.1.0-391-gbc58ea3-wmf1precise1_amd64.deb (revert is zuul_2.1.0-151-g30a433b-wmf4precise1_amd64.deb )
  • 12:43 hashar: restarted Jenkins for some trivial plugins updates
  • 12:35 hashar: hard rebooting integration-slave-trusty-1011 from Horizon. ssh lost, no log in Horizon.
  • 09:46 hashar: manually triggered debian-glue on all operations/debs repos that had no jenkins-bot vote. Via zuul enqueue on gallium and list fetched from "gerrit query --current-patch-set 'is:open NOT label:verified=2,jenkins-bot project:^operations/debs/.*'|egrep '(ref|project):'"
  • 06:21 Tim: created instance deployment-depurate01 for testing of role::html5depurate
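The `gerrit query … | egrep` one-liner in the 09:46 entry extracts just the project and ref of each open change. A sketch of that filtering step on canned `gerrit query --current-patch-set`-style text (the change, project, and ref below are invented):

```shell
# Gerrit query output in the shape produced by `gerrit query
# --current-patch-set` (this sample change is made up):
gerrit_sample='change Iaaaa0000
  project: operations/debs/adminbot
  branch: master
  currentPatchSet:
    number: 2
    ref: refs/changes/11/300011/2'

# Keep only the project: and ref: lines -- enough information to
# re-enqueue the change with `zuul enqueue`.
printf '%s\n' "$gerrit_sample" | egrep '(ref|project):'
```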

2016-07-26

  • 20:13 hashar: Zuul deployed https://gerrit.wikimedia.org/r/301093 which adds 'debian-glue' job on all of operations/debs/ repos
  • 18:10 ostriches: zuul: reloading to pick up config change
  • 12:49 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/300827/ on deployment-puppetmaster
  • 11:59 legoktm: also pulled in I73f01f87b06b995bdd855628006225879a17fee5
  • 11:59 legoktm: deploying https://gerrit.wikimedia.org/r/301109
  • 11:37 hashar: rebased integration puppetmaster git repo
  • 11:31 hashar: enable puppet agent on integration-puppetmaster . Had it disabled while hacking on https://gerrit.wikimedia.org/r/#/c/300830/
  • 08:42 hashar: T141269 On integration-slave-trusty-1018 , deleting workspace that has a corrupt git: rm -fR /mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm*
  • 01:08 Amir1: deployed ores a291da1 in sca03, ores-beta.wmflabs.org works as expected

2016-07-25

  • 22:45 legoktm: restarting zuul due to depends-on lockup
  • 14:24 godog: bounce puppetmaster on deployment-puppetmaster
  • 13:17 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/300827/ on deployment-puppetmaster

2016-07-23

  • 20:06 bd808: Cleanup jobrunner01 logs via -- sudo logrotate --force /etc/logrotate.d/mediawiki_jobrunner
  • 20:03 bd808: Deleted jobqueues in redis with no matching wikis: ptwikibooks, labswiki
  • 19:20 bd808: jobrunner01 spamming /var/log/mediawiki with attempts to process jobs for wiki=labswiki
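The forced rotation at 20:06 applies whatever policy `/etc/logrotate.d/mediawiki_jobrunner` defines. Purely as an illustration, a typical logrotate stanza for a runner log might look like the fragment below; the path and limits are assumptions, not the actual production config:

```
/var/log/mediawiki/jobrunner.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}
```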

2016-07-22

  • 20:26 hashar: T141114 upgraded jenkins-debian-glue from v0.13.0 to v0.17.0 on integration-slave-jessie-1001 and integration-slave-jessie-1002
  • 19:07 thcipriani: beta-cluster has successfully used a canary for mediawiki deployments
  • 16:53 thcipriani: bumping scap to v.3.2.1 on deployment-tin to test canary deploys, again
  • 16:46 thcipriani: rolling back scap version to v.3.2.0
  • 16:38 thcipriani: bumping scap to v.3.2.1 on deployment-tin to test canary deploys
  • 13:02 hashar: zuul rebased patch queue on tip of upstream branch and force pushed branch. c3d2810...4ddad4e HEAD -> patch-queue/debian/precise-wikimedia (forced update)
  • 10:32 hashar: Jenkins restarted and it pooled both integration-slave-jessie-1002 and integration-slave-trusty-1018
  • 10:23 hashar: Jenkins has some random deadlock. Will probably reboot it
  • 10:17 hashar: Jenkins can't ssh / add slaves integration-slave-jessie-1002 or integration-slave-trusty-1018 . Apparently due to some Jenkins deadlock in the ssh slave plugin :-/ Lame way to solve it: restart Jenkins
  • 10:10 hashar: rebooting integration-slave-jessie-1002 and integration-slave-trusty-1018 . Hang somehow
  • 10:06 hashar: T141083 salt -v '*slave-trusty*' cmd.run 'service mysql start'
  • 09:55 hashar: integration-slave-trusty-1001 service mysql start

2016-07-21

  • 16:11 hashar: Updated our JJB fork cherry picking f74501e781f by madhuvishy. Was made to support the maven release plugin. Branch bump is 10f2bcd..6fcaf39
  • 16:04 hashar: integration/zuul.git: updated upstream branch to bc58ea34125f11eb353abc3e5b96ac1efad06141; finally caught up with upstream \O/
  • 15:13 hashar: integration/zuul.git: updated upstream branch: 06770a85fcff810fc3e1673120710100fc7b0601:upstream
  • 14:03 hashar: integration/zuul.git bumping upstream branch: git push d34e0b4:upstream
  • 03:18 greg-g: had to do https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update twice, seems to be back
  • 00:13 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/299825/ to deployment-puppetmaster so wdqs nginx log parsing can be tested

2016-07-20

  • 13:55 hashar: beta: switching job beta-scap-eqiad to use 'scap sync' per https://gerrit.wikimedia.org/r/#/c/287951/ (poke thcipriani )
  • 12:47 hashar: integration: enabled unattended upgrade on all instances by adding contint::packages::apt to https://wikitech.wikimedia.org/wiki/Hiera:Integration
  • 10:28 hashar: beta dropped salt-key on deployment-salt02 for the three instances: deployment-upload.deployment-prep.eqiad.wmflabs , deployment-logstash3.deployment-prep.eqiad.wmflabs and deployment-ores-web.deployment-prep.eqiad.wmflabs
  • 10:26 hashar: beta: rebased puppetmaster git repo. "Parsoid: Move to service::node" has weird conflict https://gerrit.wikimedia.org/r/#/c/298436/
  • 10:15 hashar: beta: removing puppet cherry pick of https://gerrit.wikimedia.org/r/#/c/258979/ "mediawiki: add conftool-specifc credentials and scripts" abandoned/superseded and caused a conflict
  • 08:17 hashar: deployment-fluorine : deleting a puppet lock file /var/lib/puppet/state/agent_catalog_run.lock (created at 2016-07-18 19:58:46 UTC)
  • 01:53 legoktm: deploying https://gerrit.wikimedia.org/r/299930

2016-07-18

  • 20:56 thcipriani: Deleted deployment-fluorine:/srv/mw-log/archive/*-201605* freed 30 GB
  • 15:00 hashar: Upgraded Zuul on the Precise slaves to zuul_2.1.0-151-g30a433b-wmf4precise1
  • 12:10 hashar: (restarted qa-morebots)
  • 12:10 hashar: Enabling puppet again on integration-slave-precise-1002 , removing Zuul-server config and adding the slave back in Jenkins pool

2016-07-16

  • 23:19 paladox: testing morebots

2016-07-15

  • 08:34 hashar: Unpooling integration-slave-precise-1002 will use it as a zuul-server test instance temporarily

2016-07-14

  • 18:54 ebernhardson: deployment-prep manually edited elasticsearch.yml on deployment-elastic05 and restarted to get it listening on eth0. Still looking into why puppet wrote out wrong config file
  • 09:05 Amir1: rebooting deployment-ores-redis
  • 08:29 Amir1: deploying 0e9555f to ores-beta (sca03)

2016-07-13

  • 16:05 urandom: Installing Cassandra 2.2.6-wmf1 on deployment-restbase0[1-2].deployment-prep.eqiad.wmflabs : T126629
  • 13:58 hashar: T137525 reverted Zuul back to zuul_2.1.0-95-g66c8e52-wmf1precise1_amd64.deb . It could not connect to Gerrit reliably
  • 13:46 hashar: T137525 Stopped zuul that ran in a terminal (with -d). Started it with the init script.
  • 11:37 hashar: apt-get upgrade on deployment-mediawiki02
  • 08:33 hashar: removing deployment-parsoid05 from the Jenkins slaves T140218

2016-07-12

  • 20:29 hashar: integration: force running unattended upgrade on all instances: salt --batch 4 -v '*' cmd.run 'unattended-upgrade' . That upgrades diamond and hhvm among others. imagemagick-common has a prompt though
  • 20:22 hashar: CI force running puppet on all instances: salt --batch 5 -v '*' puppet.run
  • 20:04 hashar: Maybe fix unattended upgrade on the CI slaves via https://gerrit.wikimedia.org/r/298568
  • 16:43 Amir1: deploying f472f65 to ores-beta
  • 10:11 hashar: Github created repos operations-debs-contenttranslation-apertium-mk-en and operations-docker-images-toollabs-images for Gerrit replication

2016-07-11

  • 14:24 hashar: Removing ZeroMQ config from the Jenkins jobs. It is now enabled globally. T139923
  • 10:16 hashar: T136188: on Trusty slaves, upgrading Chromium from v49 to v51: salt -v '*slave-trusty-*' cmd.run 'apt-get -y install chromium-browser chromium-chromedriver chromium-codecs-ffmpeg-extra'
  • 10:13 hashar: T136188: salt -v '*slave-trusty*' cmd.run 'rm /etc/apt/preferences.d/chromium-*'
  • 10:09 hashar: Unpinning Chromium v49 from the Trusty slaves and upgrading to v51 for T136188
  • 09:34 zeljkof: Enabled ZMQ Event Publisher on all Jobs in Jenkins

2016-07-09

2016-07-08

2016-07-07

  • 21:41 MaxSem: Chowned php-master/vendor back to jenkins-deploy
  • 13:10 hashar: deleting integration-slave-trusty-1024 and integration-slave-trusty-1025 to free up some RAM. We have enough permanent Trusty slaves. T139535
  • 02:43 MaxSem: started redis-server on deployment-stream
  • 01:14 bd808: Restarted logstash on deployment-logstash2
  • 01:13 MaxSem: Leaving my hacks for the night to collect data, if needed revert with cd /srv/mediawiki-staging/php-master/vendor && sudo git reset --hard HEAD && sudo chown -hR jenkins-deploy:wikidev .
  • 00:50 bd808: Rebooting deployment-logstash3.eqiad.wmflabs; console full of hung process messages from kernel
  • 00:27 MaxSem: Initialized ORES on all wikis where it's enabled, was causing job failures
  • 00:13 MaxSem: Debugging a fatal in betalabs, might cause syncs to fail

2016-07-06

  • 20:30 hashar: beta: restarted mysql on both db1 and db2 so it takes into account the --syslog setting T119370
  • 20:08 hashar: beta: on db1 and db2 move the MariaDB 'syslog' setting under [mysqld_safe] section. Cherry picked https://gerrit.wikimedia.org/r/#/c/296713/3 and reloaded mysql on both instances. T119370
  • 14:54 hashar: Image ci-jessie-wikimedia-1467816381 in wmflabs-eqiad is ready T133779
  • 14:47 hashar_: attempting to refresh ci-jessie-wikimedia image to get librdkafka-dev included for T133779

2016-07-05

  • 21:54 hasharAway: CI has drained the gate-and-submit queue
  • 21:37 hasharAway: Nodepool: nodepool delete a few instances that would never spawn / have been stuck for ~ 40 minutes

2016-07-04

  • 18:58 hashar: Upgrading arcanist on permanent CI slaves since xhpast was broken T137770
  • 12:50 yuvipanda: migrating deployment-tin to labvirt1011

2016-07-03

  • 13:10 paladox: phabricator Update phab-01 and phab-05 (phab-02) and phab-03 to fix a security bug in phabricator (Did the update last night but forgot to log it)
  • 12:04 jzerebecki: reloading zuul for 7e6a2e2..13ea50f

2016-07-02

  • 13:38 jzerebecki: reloading zuul for 15127b2..7e6a2e2

2016-06-30

  • 10:31 hashar: Deleting integration-slave-trusty-1015 . Can not bring up mysql T138074 and the ssh slave connection would not hold anyway. Must be broken somehow
  • 10:04 hashar: Attempting to refresh Nodepool image for Jessie ( ci-jessie-wikimedia ). Been stalled for 284 hours (12 days)
  • 09:36 hashar: Trusty is missing the package arcanist ... :(
  • 09:35 hashar: Attempting to refresh Nodepool image for Trusty ( ci-trusty-wikimedia ). Been stalled for 283 hours (12 days)

2016-06-28

  • 21:33 halfak: deploying ores beec291
  • 21:15 halfak: deploying ores 6979a98

2016-06-27

  • 22:32 eberhardson: deployment-prep deployed gerrit.wikimedia.org/r/296279 to puppetmaster to test kibana4 role
  • 19:41 bd808: Rebooting deployment-logstash3.eqiad.wmflabs via wikitech. Console log full of blocked kworker messages, ssh non-responsive, and blocking logstash records being recorded.
  • 18:20 thcipriani: deployment-puppetmaster.deployment-prep:/var/lib/git/labs/private modules/secret/secrets/keyholder keys conflicts resolved
  • 18:09 bd808: Git repo at deployment-puppetmaster.deployment-prep:/var/lib/git/labs/private is behind upstream due to multiple modules/secret/secrets/keyholder local files that would be overwritten by upstream changes.

2016-06-24

2016-06-23

  • 13:58 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/295691
  • 12:13 hashar: Deleting integration-saltmaster and recreating it with Jessie T136410
  • 10:14 hashar: T137807 Upgrading Jenkins TAP Plugin
  • 08:55 hashar: integration: rebased puppet master by dropping a conflicting/obsolete patch
  • 08:28 hashar: fixing puppet cert on deployment-cache-text04

2016-06-17

  • 10:35 jzerebecki: offlined integration-slave-trusty-1015 T138074
  • 10:06 hashar: Refreshed Nodepool Trusty image
  • 10:02 hashar: Refreshed Nodepool Jessie image

2016-06-14

  • 14:22 hashar: T136971 on tin MediaWiki 1.28.0-wmf.6, from 1.28.0-wmf.6, successfully checked out. Applying security patches
  • 11:21 hashar: T137797 Created Gerrit repository operations/debs/geckodriver to package https://github.com/mozilla/geckodriver

2016-06-13

  • 21:11 hashar: https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1015/ put offline. Jenkins can't ssh / pool it for some reason
  • 20:07 hashar: beta: update.php / database update finally pass!
  • 19:55 hashar: T137615 deployment-db2, **eswiki** > CREATE INDEX echo_notification_event ON echo_notification (notification_event);
  • 19:22 hashar: T137615 deployment-db2, enwiki > CREATE INDEX echo_notification_event ON echo_notification (notification_event);
  • 10:37 hashar: Restarted puppetmaster on integration-puppetmaster (memory leak / can not fork: no memory)
  • 10:35 hashar: T137561 salt -v '*trusty*' cmd.run "cd /root/ && dpkg -i firefox_46.0.1+build1-0ubuntu0.14.04.3_amd64.deb"
  • 10:23 hashar: Hard reboot integration-slave-trusty-1015
  • 08:30 hashar: Beta: `mwscript extensions/Echo/maintenance/removeInvalidTargetPage.php --wiki=enwiki` for T137615

2016-06-10

2016-06-09

  • 18:49 hashar: restarting nutcracker on deployment-mediawiki02
  • 16:53 hashar: rebuild Nodepool trusty image ci-trusty-wikimedia-1465490962
  • 16:37 hashar: Manually deleting old zuul references on scandium.eqiad.wmnet . Running in a screen
  • 16:32 hashar: rebuild Nodepool jessie image ci-jessie-wikimedia-1465489579
  • 16:03 hashar: Restarting Nodepool

2016-06-08

  • 02:56 legoktm: / on gallium is read-only
  • 02:47 legoktm: disabling/enabling gearman in jenkins because everything is stuck

2016-06-07

  • 19:28 hashar: Nodepool has troubles spawning instances probably due to ongoing (?) labs maintenance
  • 14:56 hashar: Restarting Jenkins to upgrade Rebuilder plugin with https://github.com/jenkinsci/rebuild-plugin/pull/34 (sort out parameters not being reinjected)
  • 09:02 hashar: Upgrading Jenkins IRC plugin 2.25..2.27 and instant messaging plugin 1.34..1.35 . The former should fix a deadlock when shutting down Jenkins | T96183

2016-06-06

  • 19:26 hasharAway: Regenerating Nodepool snapshots for Trusty and Jessie
  • 13:04 hashar: Migrated all qunit jobs to Nodepool T136301 has the related Gerrit changes
  • 10:05 hashar: migrating mediawiki-core-qunit job to Nodepool instances https://gerrit.wikimedia.org/r/#/c/291322/ T136301

2016-06-04

  • 00:09 Krinkle: krinkle@integration-slave-trusty-1017:~$ sudo rm -rf /mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/Babel (T86730)

2016-06-03

  • 19:18 hashar: Image ci-jessie-wikimedia-1464981111 in wmflabs-eqiad is ready Zend 5.x for qunit | T136301
  • 15:17 hashar: refreshed Nodepool Trusty image due to some imagemagick upgrade issue. Image ci-trusty-wikimedia-1464966671 in wmflabs-eqiad is ready
  • 10:40 hashar: scandium (zuul merger): rm -fR /srv/ssd/zuul/git/mediawiki/extensions/Collection T136930

2016-06-02

  • 12:10 hashar: Upgraded Zuul upstream code being 66c8e52..30a433b package is 2.1.0-151-g30a433b-wmf1precise1

2016-06-01

  • 17:49 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/292186
  • 16:45 tgr: enabling AuthManager on beta cluster
  • 15:20 legoktm: deploying https://gerrit.wikimedia.org/r/292153
  • 14:44 twentyafterfour: jenkins restart completed
  • 14:36 twentyafterfour: restarting jenkins to install "single use slave" plugin (jenkins will restart when all builds are finished)
  • 13:49 hashar: Beta : clearing temporary files under /data/project/upload7 (mainly wikimedia/commons/temp )
  • 10:29 hashar: Upgraded Linux kernel on deployment-salt02 T136411
  • 10:14 hashar: beta: salt-key -d deployment-salt.deployment-prep.eqiad.wmflabs T136411
  • 09:16 hashar: Enabling puppet again on Trusty slaves. Chromium is now properly pinned to version 49 ( https://gerrit.wikimedia.org/r/#/c/291116/3 | T136188 )
  • 08:55 hashar: integration slaves : salt -v '*' pkg.upgrade

2016-05-31

  • 20:24 bd808: Reloading zuul to pick up I58f878f3fd19dfa21a46a52464575cb06aacbb22

2016-05-30

  • 18:39 hashar: Upgraded our Jenkins Job Builder fork to 1.5.0 + a couple of cherry picks: cd63874...10f2bcd
  • 12:53 hashar: Upgrading Zuul 1cc37f7..66c8e52 T128569
  • 08:04 ori: zuul is back up but jobs which were enqueued are gone
  • 07:50 ori: restarting jenkins on gallium, too
  • 07:49 ori: restarted zuul-merger service on gallium
  • 07:44 ori: Disconnecting and then reconnecting Gearman from Jenkins did not appear to do anything; going to depool / repool nodes.
  • 07:42 ori: Temporarily disconnecting Gearman from Jenkins, per <https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues>

2016-05-28

  • 04:43 ori: depooling integration-slave-trusty-1015 to profile phpunit runs

2016-05-27

  • 19:29 hasharAway: Refreshed Nodepool images
  • 18:13 thcipriani: restarting zuul for deadlock
  • 18:00 thcipriani: Reloading Zuul to deploy I0c3aeacf92d430ad1272f5f00e7fb7182b8a05bf
  • 02:55 bd808: Deleted deployment-fluorine:/srv/mw-log/archive/*-20160[34]* logs; freed 26G

2016-05-26

  • 22:23 hashar: salt -v '*trusty*' cmd.run 'puppet agent --disable "Chromium needs to be v49. See T136188"'
  • 21:47 hashar: integration-slave-trusty-1015 still on Chromium 50 .. T136188
  • 21:42 hashar: downgrading chromium-browser on integration-slave-1015 T136188
  • 09:24 jzerebecki: reloading zuul for d38ad0a..6798539
  • 07:48 gehel: deployment-prep upgrading elasticsearch to 2.3.3 and restarting (T133124)
  • 07:36 dcausse: deployment-prep elastic: updating cirrussearch warmers (T133124)
  • 07:31 gehel: deployment-prep deploying new elasticsearch plugins (T133124)

2016-05-25

  • 22:38 Amir1: running puppet agent manually on sca01
  • 16:26 hashar: 2016-05-25 16:24:35,491 INFO nodepool.image.build.wmflabs-eqiad.ci-trusty-wikimedia: Notice: /Stage[main]/Main/Package[ruby-jsduck]/ensure: ensure changed 'purged' to 'present' T109005
  • 15:07 hashar: g++ added to Jessie and Trusty Nodepool instances | T119143
  • 14:12 hashar: Regenerating Nodepool snapshot to include g++ which is required by some NodeJS native modules T119143
  • 10:58 hashar: Updating Nodepool ci-jessie-wikimedia snapshot image to get netpbm package installed into it. T126992 https://gerrit.wikimedia.org/r/290651
  • 09:30 hashar: Clearing git-sync-upstream script on integration-slave-trusty-1013 and integration-slave-trusty-1017. That is only supposed to be on the puppetmaster
  • 09:15 hashar: Fixed resolv.conf on integration-slave-trusty-1013 and force running puppet to catch up with change since May 16 19:52
  • 09:11 hashar: restarting puppetmaster on integration-puppetmaster ( memory leak / can not fork)

2016-05-24

  • 07:03 mobrovac: rebooting deployment-tin, can't log in

2016-05-23

  • 19:35 hashar: killed all mysqld process on Trusty CI slaves
  • 15:49 thcipriani: beta code update not running, disconnect-reconnect dance resulted in: [05/23/16 15:48:39] [SSH] Authentication failed.
  • 14:32 jzerebecki: offlined integration-slave-trusty-1004 because it can't connect to mysql T135997
  • 13:32 hashar: Upgrading Jenkins git plugins and restarting Jenkins
  • 11:01 hashar: Upgrading hhvm on Trusty slaves. Brings in hhvm compiled against libicu52 instead of libicu48
  • 09:12 _joe_: deployment-prep: all hhvm hosts in beta upgraded to run on the newer libicu; now running updateCollation.php (T86096)
  • 09:11 hashar: Image ci-jessie-wikimedia-1463994307 in wmflabs-eqiad is ready
  • 09:01 hashar: Image ci-trusty-wikimedia-1463993508 in wmflabs-eqiad is ready
  • 08:56 _joe_: deployment-prep: starting upgrade of HHVM to a version linked to libicu52, T86096
  • 08:54 hashar: Regenerating Nodepool image manually. Broke over the weekend due to an hhvm/libicu transition. Should get pip 8.1.x now

2016-05-20

2016-05-19

  • 16:47 thcipriani: deployment-tin jenkins worker seems to be back online after some prodding
  • 16:41 thcipriani: beta-code-update eqiad hung for past few hours
  • 15:16 hashar: Restarted zuul-merger daemons on both gallium and scandium : file descriptors leaked
  • 11:59 hashar: CI: salt -v '*' cmd.run 'pip install --upgrade pip==8.1.2'
  • 11:54 hashar: Upgrading pip on CI slaves from 7.0.1 to 8.1.2 https://gerrit.wikimedia.org/r/#/c/289639/
  • 10:15 hashar: puppet broken on deployment-tin : Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter trusted_group on node deployment-tin.deployment-prep.eqiad.wmflabs

2016-05-18

  • 13:16 Amir1: deploying a05e830 to ores nodes (sca01 and ores-web)
  • 12:46 urandom: (re)cherry-picking c/284078 to deployment-prep
  • 11:36 hashar: Restarted qa-morebots
  • 11:36 hashar: Marked mediawiki/core/vendor repository as hidden in Gerrit. It got moved to mediawiki/vendor including the whole history. Settings page: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/core/vendor

2016-05-13

  • 14:39 thcipriani: remove shadow l10nupdate user from deployment-tin and mira in beta
  • 10:20 hashar: Put integration-slave-trusty-1004 offline. Ssh/passwd is borked T135217
  • 09:59 hashar: Deleting non nodepool mediawiki PHPUnit jobs for T135001 (mediawiki-phpunit-hhvm mediawiki-phpunit-parsertests-hhvm mediawiki-phpunit-parsertests-php55 mediawiki-phpunit-php55)
  • 04:06 thcipriani|afk: changed ownership of mwdeploy public keys after the shadow mwdeploy user removal
  • 03:47 thcipriani|afk: ldap failure has created a shadow mwdeploy user on beta, deleted using vipw

2016-05-12

  • 22:53 bd808: Started dead mysql on integration-slave-precise-1011

2016-05-11

  • 21:05 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/288128 #T134946
  • 20:26 hashar: integration-slave-trusty-1016 is back up after reboot
  • 20:15 hashar: rebooting integration-slave-trusty-1016 unreachable somehow
  • 16:43 hashar: Reduced number of executors on Trusty instances from 3 to 2. Memory gets exhausted, causing the tmpfs to drop files and thus MW jobs to fail randomly.
  • 13:33 hashar: Added contint::packages::php to Nodepool images T119139
  • 12:59 hashar: Dropping texlive and its dependencies from gallium.
  • 12:52 hashar: deleted integration-dev
  • 12:51 hashar: creating integration-dev instance to hopefully have Shinken clean itself
  • 11:42 hashar: rebooting deployment-aqs01 via wikitech T134981
  • 10:46 hashar: beta/ci puppetmaster : deleting old tags in /var/lib/git/operations/puppet and repacking the repos
  • 08:49 hashar: Deleting instances deployment-memc02 and deployment-memc03 (Precise instances, migrated to Jessie) #T134974
  • 08:43 hashar: Beta: switching memcached to new Jessie servers by cherry picking https://gerrit.wikimedia.org/r/#/c/288156/ and running puppet on mw app servers #T134974
  • 08:20 hashar: Creating deployment-memc04 and deployment-memc05 to switch beta cluster memcached to Jessie. m1.medium with security policy "cache" T134974
  • 01:44 matt_flaschen: Created Flow-specific External Store tables (blobs_flow1) on all wiki databases on Beta Cluster: T128417

2016-05-10

  • 19:17 hashar: beta / CI purging old Linux kernels: salt -v '*' cmd.run 'dpkg -l|grep ^rc|awk "{ print \$2 }"|grep linux-image|xargs dpkg --purge'
  • 17:34 cscott: updated OCG to version b0c57a1c6890e9fa1f2c3743fc14cb6a7f244fc3
  • 16:44 bd808: Cleaned up 8.5G of pbuilder tmp output on integration-slave-jessie-1001 with `sudo find /mnt/pbuilder/build -maxdepth 1 -type d -mtime +1 -exec rm -r {} \+`
  • 16:35 bd808: https://integration.wikimedia.org/ci/job/debian-glue failure on integration-slave-jessie-1001 due to /mnt being 100% full
  • 14:20 hashar: deployment-puppetmaster mass cleaned packages/service/users etc T134881
  • 13:54 moritzm: restarted zuul-merger on scandium for openssl update
  • 13:52 moritzm: restarting zuul on gallium for openssl update
  • 13:51 moritzm: restarted apache and zuul-merger on gallium for openssl update
  • 13:48 hashar: deployment-puppetmaster : dropping role::ci::jenkins_access role::ci::slave::labs and role::ci::slave::labs::common T134881
  • 13:46 hashar: Deleting Jenkins slave deployment-puppetmaster T134881
  • 13:45 hashar: Change https://integration.wikimedia.org/ci/job/beta-build-deb/ job to use label selector "DebianGlue && DebianJessie" instead of "BetaDebianRepo" T134881
  • 13:33 hashar: Migrating all debian glue jobs to Jessie permanent slaves T95545
  • 13:30 hashar: Adding integration-slave-jessie-1002 in Jenkins. it is all puppet compliant
  • 12:59 thcipriani|afk: triggering puppet run on scap targets in beta for https://gerrit.wikimedia.org/r/#/c/287918/ cherry pick
  • 09:07 hashar: fixed puppet.conf on deployment-cache-text04
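The kernel-purge one-liner at 19:17 chains `dpkg -l` through grep/awk into `dpkg --purge`. A sketch of just the filtering pipeline, run on canned `dpkg -l`-style output (the package rows below are invented; "rc" marks packages removed with config files remaining):

```shell
# Canned `dpkg -l`-style output (invented package data):
dpkg_l_sample='ii  bash                           4.3    amd64  GNU Bourne Again SHell
rc  linux-image-3.13.0-83-generic  3.13   amd64  Linux kernel image
rc  linux-image-3.13.0-85-generic  3.13   amd64  Linux kernel image
rc  some-removed-pkg               1.0    amd64  unrelated removed package'

# Same filter as the one-liner: keep rc-state rows, take the package name
# column, keep only kernel images. These names would then be handed to
# `xargs dpkg --purge`.
printf '%s\n' "$dpkg_l_sample" \
  | grep '^rc' \
  | awk '{ print $2 }' \
  | grep linux-image
```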

2016-05-09

  • 20:58 hashar: Unbroke puppet on integration-raita.integration.eqiad.wmflabs . Puppet was blocked because role::ci::raita was no more. Fixed by rebasing https://gerrit.wikimedia.org/r/#/c/208024 T115330
  • 20:13 hashar: beta: salt -v '*' cmd.run 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia' # T134808
  • 20:06 hashar: CI, removing ganglia configuration entirely via: salt -v '*' cmd.run 'rm -fRv /etc/ganglia' # T134808
  • 20:04 hashar: CI, removing ganglia configuration entirely via: salt -v '*' cmd.run 'dpkg --purge ganglia-monitor' # T134808
  • 16:32 jzerebecki: reloading zuul for 3e2ab56..d663fd0
  • 15:39 andrewbogott: migrating deployment-fluorine to labvirt1009
  • 15:39 hashar: Adding label contintLabsSlave to integration-slave-jessie1001 and integration-slave-jessie1002
  • 15:26 hashar: Creating integration-slave-jessie-1001 T95545

2016-05-06

  • 19:45 urandom: Restart cassandra-metrics-collector on deployment-restbase0[1-2]
  • 19:41 urandom: Rebasing 02ae1757 on deployment-puppetmaster : T126629

2016-05-05

  • 22:09 MaxSem: Promoted Yurik and Jgirault to sysops on beta enwiki. Through shell because logging in is broken for me.

2016-05-04

  • 21:28 cscott: deployed puppet FQDN domain patch for OCG: https://gerrit.wikimedia.org/r/286068 and restarted ocg on deployment-pdf0[12]
  • 15:03 hashar: beta-scap: deployment-tin.deployment-prep.eqiad.wmflabs Name or service not known
  • 12:24 hashar: deleting Jenkins job mediawiki-core-phpcs , replaced by Nodepool version mediawiki-core-phpcs-trusty T133976
  • 12:11 hashar: beta: restarted nginx on varnish caches ( systemctl restart nginx.service ) since they were not listening on port 443 #T134362
  • 11:07 hashar: restarted CI puppetmaster (out of memory leak)
  • 10:57 hashar: CI: mass upgrading deb packages
  • 10:53 hashar: beta: clearing out leftover apt conf that points to unreachable web proxy : salt -v '*' cmd.run "find /etc/apt -name '*-proxy' -delete"
  • 10:48 hashar: Manually fixing nginx upgrade on deployment-cache-text04 and deployment-cache-upload04 see T134362 for details
  • 09:27 hashar: deployment-cache-text04 systemctl stop varnish-frontend.service . To clear out all the stuck CLOSE_WAIT connections T134346
  • 08:33 hashar: fixed puppet on deployment-cache-text04 (race condition generating puppet.conf )
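The apt-proxy cleanup at 10:53 uses a name-matched `find … -delete`. The same pattern can be exercised safely against a throwaway directory instead of /etc/apt (the file names below are invented):

```shell
# Build a scratch directory standing in for /etc/apt (invented file names).
aptdir=$(mktemp -d)
touch "$aptdir/80-proxy" "$aptdir/90-proxy" "$aptdir/sources.list"

# Delete only files whose names end in -proxy; everything else survives.
find "$aptdir" -name '*-proxy' -delete

ls "$aptdir"
```

The salt invocation in the log simply fans this same `find` out across all beta instances.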

2016-05-03

  • 23:21 bd808: Changed "Maximum Number of Retries" for ssh agent launch in jenkins for deployment-tin from "0" to "10"
  • 23:01 twentyafterfour: rebooting deployment-tin
  • 23:00 bd808: Jenkins agent on deployment-tin not spawning; investigating
  • 20:02 hashar: Restarting Jenkins
  • 16:49 hashar: Notice: /Stage[main]/Contint::Packages::Python/Package[pypy]/ensure: ensure changed 'purged' to 'present' | T134235
  • 16:46 hashar: Refreshing Nodepool Jessie image to have it include pypy | T134235 poke @jayvdb
  • 14:49 mobrovac: deployment-tin rebooting it
  • 14:25 hashar: beta salt -v '*' pkg.upgrade
  • 14:19 hashar: beta: added unattended upgrade to Hiera::deployment-prep
  • 13:30 hashar: Restarted nslcd on deployment-tin , pam was refusing authentication for some reason
  • 13:29 hashar: beta: got rid of a leftover Wikidata/Wikibase patch that broke scap salt -v 'deployment-tin*' cmd.run 'sudo -u jenkins-deploy git -C /srv/mediawiki-staging/php-master/extensions/Wikidata/ checkout -- extensions/Wikibase/lib/maintenance/populateSitesTable.php'
  • 13:23 hashar: deployment-tin force upgraded HHVM from 3.6 to 3.12
  • 09:42 hashar: adding puppet class contint::slave_scripts to deployment-sca01 and deployment-sca02 . Ships multigit.sh T134239
  • 09:31 hashar: Deleting CI slave deployment-cxserver03 , added deployment-sca01 and deployment-sca02 in Jenkins. T134239
  • 09:28 hashar: deployment-sca01 removing puppet lock /var/lib/puppet/state/agent_catalog_run.lock and running puppet again
  • 09:26 hashar: Applying puppet class role::ci::slave::labs::common on deployment-sca01 and deployment-sca02 (cxserver and parsoid being migrated T134239 )
  • 03:33 kart_: Deleted deployment-cxserver03, replaced by deployment-sca0x

2016-05-02

  • 21:27 cscott: updated OCG to version b775e612520f9cd4acaea42226bcf34df07439f7
  • 21:26 hashar: Nodepool is acting just fine: Demand from gearman: ci-trusty-wikimedia: 457 | <AllocationRequest for 455.0 of ci-trusty-wikimedia>
  • 21:25 hashar: restarted qa-morebots "2016-05-02 21:22:23,599 ERROR: Died in main event loop"
  • 21:23 hashar: gallium: enqueued 488 jobs directly in Gearman. That is to test https://gerrit.wikimedia.org/r/#/c/286462/ ( mediawiki/extensions to hhvm/zend5.5 on Nodepool). Progress /home/hashar/gerrit-286462.log
  • 20:14 hashar: MediaWiki phpunit jobs to run on Nodepool instances \O/
  • 16:41 urandom: Forcing puppet run and restarting Cassandra on deployment-restbase0[1-2] : T126629
  • 16:40 urandom: Cherry-picking https://gerrit.wikimedia.org/r/operations/puppet refs/changes/78/284078/12 to deployment-puppetmaster : T126629
  • 16:24 urandom: Restarat Cassandra on deployment-restbase0[1-2] : T126629
  • 16:21 urandom: forcing puppet run on deployment-restbase0[1-2] : T126629
  • 16:21 urandom: cherry-picking latest refs/changes/78/284078/11 onto deployment-puppetmaster : T126629
  • 09:44 hashar: On zuul-merger instances (gallium / scandium), cleared out pywikibot/core working copy ( rm -fR /srv/ssd/zuul/git/pywikibot/core/ ) T134062

2016-04-30

  • 18:31 Amir1: deploying d4f63a3 from github.com/wiki-ai/ores-wikimedia-config into targets in beta cluster via scap3

2016-04-29

  • 16:37 jzerebecki: restarting zuul for 4e9d180..ebb191f
  • 15:45 hashar: integration: deleting integration-trusty-1026 and cache-rsync . Maybe that will clear them up from Shinken
  • 15:14 hashar: integration: created 'cache-rsync' and 'integration-trusty-1026' , attempting to get Shinken to deprovision them

2016-04-28

  • 22:03 urandom: deployment-restbase01 upgrade to 2.2.6 complete : T126629
  • 21:56 urandom: Stopping Cassandra on deployment-restbase01, upgrading package to 2.2.6, and forcing puppet run : T126629
  • 21:55 urandom: Snapshotting Cassandra tables on deployment-restbase01 (name = 1461880519833) : T126629
  • 21:55 urandom: Snapshotting Cassandra tables on deployment-restbase01 : T126629
  • 21:52 urandom: Forcing puppet run on deployment-restbase02 : T126629
  • 21:51 urandom: Cherry picking operations/puppet refs/changes/78/284078/10 to puppmaster : T126629
  • 20:46 urandom: Starting Cassandra on deployment-restbase02 (now v2.2.6) : T126629
  • 20:41 urandom: Re-enable puppet and force run on deployment-restbase02 : T126629
  • 20:38 urandom: Halting Cassandra on deployment-restbase02, masking systemd unit, and upgrading package(s) to 2.2.6 : T126629
  • 20:37 urandom: Snapshotting Cassandra tables on deployment-restbase02 (snapshot name = 1461875833996) : T126629
  • 20:37 urandom: Snapshotting Cassandra tables on deployment-restbase02 : T126629
  • 20:33 urandom: Cassandra on deployment-restbase01.deployment-prep started : T126629
  • 20:25 urandom: Restarting Cassandra on deployment-restbase01.deployment-prep : T126629
  • 20:14 urandom: Re-enable puppet on deployment-restbase01.deployment-prep, and force a run : T126629
  • 20:12 urandom: cherry-picking https://gerrit.wikimedia.org/r/#/c/284078/ to deployment-puppetmaster : T126629
  • 20:06 urandom: Disabling puppet on deployment-restbase0[1-2].deployment-prep : T126629
  • 14:43 hashar: Rebuild Nodepool Jessie image. Comes with hhvm
  • 12:52 hashar: Puppet is happy on deployment-changeprop
  • 12:47 hashar: apt-get upgrade deployment-changeprop (outdated exim package)
  • 12:42 hashar: Rebuild Nodepool Trusty instance to include the PHP wrapper script T126211

2016-04-27

  • 23:57 thcipriani: nodepool instances running again after an openstack rabbitmq restart by andrewbogott
  • 22:51 duploktm: also ran openstack server delete ci-jessie-wikimedia-85342
  • 22:42 legoktm: nodepool delete 85342
  • 22:41 matt_flaschen: Deployed https://gerrit.wikimedia.org/r/#/c/285765/ to enable External Store everywhere on Beta Cluster
  • 22:38 legoktm: stop/started nodepool
  • 22:36 thcipriani: I don't have permission to restart nodepool
  • 22:35 thcipriani: restarting nodepool
  • 22:18 matt_flaschen: Deployed https://gerrit.wikimedia.org/r/#/c/282440/ to switch Beta Cluster to use External Store for new testwiki writes
  • 21:00 hashar: thcipriani downgraded git plugins successfully (we wanted to rule out their upgrade for some weird issue)
  • 20:13 cscott: updated OCG to version e39e06570083877d5498da577758cf8d162c1af4
  • 14:10 hashar: restarting Jenkins
  • 14:09 hashar: Jenkins upgrading credential plugin 1.24 > 1.27 And Credentials binding plugin 1.6 > 1.7
  • 14:07 hashar: Jenkins upgrading git plugin 2.4.1 > 2.4.4
  • 14:01 hashar: Jenkins upgrading git client plugin 1.19.1 > 1.19.6
  • 13:13 jzerebecki: reloading zuul for 81a1f1a..0993349
  • 11:43 hashar: fixed puppet on deployment-cache-text04 T132689
  • 10:38 hashar: Rebuild Image ci-trusty-wikimedia-1461753210 in wmflabs-eqiad is ready
  • 09:43 hashar: tmh01.deployment-prep.eqiad.wmflabs denies the mwdeploy user, breaking https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

2016-04-26

  • 20:45 hashar: Regenerating Nodepool Jessie snapshot to include composer and HHVM | T128092
  • 20:23 jzerebecki: reloading zuul for eb480d8..81a1f1a
  • 19:25 jzerebecki: reload zuul for 4675213..eb480d8
  • 19:25 jzerebecki: 4675213..eb480d8
  • 14:18 hashar: Applied security patches to 1.27.0-wmf.22 | T131556
  • 12:39 hashar: starting cut of 1.27.0-wmf.22 branch ( poke ostriches )
  • 10:29 hashar: restored integration/phpunit on CI slaves due to https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/ failing
  • 09:11 hashar: CI is back up!
  • 08:20 hashar: shutoff instance castor, does not seem to be able to start again :( | T133652
  • 08:12 hashar: hard rebooting castor instance | T133652
  • 08:10 hashar: soft rebooting castor instance | T133652
  • 08:06 hashar: CI jobs deadlocked due to castor being unavailable | https://phabricator.wikimedia.org/T133652
  • 00:46 thcipriani: temporary keyholder fix in place in beta
  • 00:18 thcipriani: beta-scap-eqiad failure due to bad keyholder-auth.d fingerprints

2016-04-25

  • 20:58 cscott: updated OCG to version 58a720508deb368abfb7652e6a8c7225f95402d2
  • 19:46 hashar: Nodepool now has a couple trusty instances intended to experiment with Zend 5.5 / HHVM migration . https://phabricator.wikimedia.org/T133203#2236625
  • 13:34 hashar: Nodepool is attempting to create a Trusty snapshot with name ci-trusty-wikimedia-1461591203 | T133203
  • 13:15 hashar: openstack image create --file /home/hashar/image-trusty-20160425T124552Z.qcow2 ci-trusty-wikimedia --disk-format qcow2 --property show=true # T133203
  • 10:38 hashar: Refreshing Nodepool Jessie snapshot based on new image
  • 10:35 hashar: Refreshed Nodepool Jessie image ( image-jessie-20160425T100035Z )
  • 09:24 hashar: beta / scap failure filled as T133521
  • 09:20 hashar: Keyholder / mwdeploy ssh keys have been messed up on beta cluster somehow :-(
  • 08:47 hashar: mwdeploy@deployment-tin has lost ssh host keys file :(

2016-04-24

  • 17:14 jzerebecki: reloading e06f1fe..672fc84

2016-04-22

2016-04-21

  • 19:07 thcipriani: scap version testing should be done, puppet should no longer be disabled on hosts
  • 18:02 thcipriani: disabling puppet on scap targets to test scap_3.1.0-1+0~20160421173204.70~1.gbp6706e0_all.deb

2016-04-20

  • 22:28 thcipriani: rolling back scap version in beta, legit failure :(
  • 21:52 thcipriani: testing new scap version in beta on deployment-tin
  • 17:54 thcipriani: Reloading Zuul to deploy gerrit:284494
  • 13:58 hashar: Stopping HHVM on CI slaves by cherry picking a couple puppet patches | T126594
  • 13:33 hashar: salt -v '*trusty*' cmd.run 'rm /usr/lib/x86_64-linux-gnu/hhvm/extensions/current' # Cleanup on CI slaves for T126658
  • 13:27 hashar: Restarted integration puppet master service (out of memory / mem leak)

2016-04-17

2016-04-16

  • 14:21 Krenair: restarted qa-morebots per request
  • 14:18 Krenair: <jzerebecki> !log reloading zuul for 3f64dbd..c6411a1

2016-04-13

2016-04-12

  • 19:47 bd808: Cleaned up large hhbc cache file on deployment-mediawiki03 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
  • 19:47 bd808: Cleaned up large hhbc cache file on deployment-mediawiki02 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
  • 19:46 bd808: Cleaned up large hhbc cache file on deployment-mediawiki01 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
  • 19:10 Amir1: manually rebooted deployment-ores-web
  • 19:08 Amir1: manually cherry-picked 282992/2 into puppetmaster
  • 17:05 Amir1: ran puppet agent manually on sca01 in the /srv directory
  • 11:34 hashar: Jenkins upgrading "Script Security Plugin" from 1.17 to 1.18.1 https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-04-11

2016-04-11

  • 21:23 csteipp: deployed and reverted oath
  • 20:30 thcipriani: relaunched slave-agent on integration-slave-trusty-1025, back online
  • 20:19 thcipriani: integration-slave-trusty-1025 horizon console filled with INFO: task jbd2/vda1-8:170 blocked for more than 120 seconds. rebooting
  • 20:13 thcipriani: killing stuck jobs, marking integration-slave-trusty-1025 as offline temporarily
  • 14:42 thcipriani: deployment-mediawiki01 disk full :(

2016-04-08

  • 22:46 matt_flaschen: Created blobs1 table for all wiki DBs on Beta Cluster
  • 14:34 hashar: Image ci-jessie-wikimedia-1460125717 in wmflabs-eqiad is ready; adds package 'unzip' | T132144
  • 12:49 hashar: Image ci-jessie-wikimedia-1460119481 in wmflabs-eqiad is ready; adds package 'zip' | T132144
  • 09:30 hashar: Removed label hasAndroidSdk from gallium . That prevents that slave from sometimes running the job apps-android-commons-build
  • 08:42 hashar: Rebased puppet master and fixed conflict with https://gerrit.wikimedia.org/r/#/c/249490/

2016-04-07

  • 20:16 hashar: deployment-mediawiki02.deployment-prep.eqiad.wmflabs , cleared up random left over stuff / big logs etc
  • 20:08 hashar: deployment-mediawiki02.deployment-prep.eqiad.wmflabs / is full

2016-04-05

  • 23:56 marxarelli: Removed cherry-pick and rebased /var/lib/git/operations/puppet on integration-puppetmaster after merge of https://gerrit.wikimedia.org/r/#/c/281706/
  • 21:58 marxarelli: Restarting puppetmaster on integration-puppetmaster
  • 21:53 marxarelli: Cherry picked https://gerrit.wikimedia.org/r/#/c/281706/ on integration-puppetmaster and applying on integration-slave-trusty-1014
  • 10:32 hashar: gallium removing texlive
  • 10:29 hashar: gallium removing libav / ffmpeg. No longer needed since jobs no longer run on that server

2016-04-04

  • 17:30 greg-g: Phabricator going down in about 10 minutes to hopefully address the overheating issue: T131742
  • 10:06 hashar: integration: salt -v '*-slave*' cmd.run 'rm /usr/local/bin/grunt; rm -fR /usr/local/lib/node_modules/grunt-cli' | T124474
  • 10:04 hashar: integration: salt -v '*-slave*' cmd.run 'npm -g uninstall grunt-cli' | T124474
  • 03:15 greg-g: Phabricator is down

2016-04-03

2016-04-02

  • 22:58 Amir1: added local hack to puppetmaster to make scap3 provider more verbose
  • 19:46 hashar: Upgrading Jenkins Gearman plugin to v2.0; brings in diff registration for faster updates of the Gearman server
  • 14:39 Amir1: manually added 281170/5 to beta puppetmaster
  • 14:22 Amir1: manually added 281161/1 to beta puppetmaster
  • 11:31 Reedy: deleted archived logs older than 30 days from deployment-fluorine

2016-04-01

  • 22:16 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/281046
  • 21:13 hashar: Image ci-jessie-wikimedia-1459544873 in wmflabs-eqiad is ready
  • 20:57 hashar: Refreshing Nodepool snapshot to hopefully get npm 2.x installed T124474
  • 20:37 hashar: Added Luke081515 as a member of deployment-prep (beta cluster) labs project
  • 20:31 hashar: Dropping grunt-cli from the permanent slaves. People can have it installed by listing it in their package.json devDependencies https://gerrit.wikimedia.org/r/#/c/280974/
  • 14:06 hashar: integration: removed sudo policy permitting sudo as any member of the project for any member of the project, which included jenkins-deploy user
  • 14:05 hashar: integration: removed sudo policy permitting sudo as root for any member of the project, which included jenkins-deploy user
  • 11:23 bd808: Freed 4.5G on deployment-fluorine:/srv/mw-log by deleting wfDebug.log
  • 04:00 Amir1: manually rebooted deployment-sca01
  • 00:16 csteipp: created oathauth_users table on centralauth db in beta

2016-03-31

  • 21:19 legoktm: deploying https://gerrit.wikimedia.org/r/280756
  • 13:52 hashar: rebasing integration puppetmaster (it had a merge commit)
  • 01:40 Krinkle: Purge npm cache in integration-slave-trusty-1015:/mnt/home/jenkins-deploy/.npm was corrupted around March 23 19:00 for unknown reasons (T130895)

2016-03-30

  • 19:32 twentyafterfour: deleted some nutcracker and hhvm log files on deployment-mediawiki01 to free space
  • 15:37 hashar: Gerrit has trouble sending emails T131189
  • 13:48 Reedy: deployment-prep Make that deployment-tmh01
  • 13:48 Reedy: deployment-prep upgrade hhvm on deployment-mediawiki01 and reboot
  • 13:35 Reedy: deployment-prep upgrade hhvm on deployment-mediawiki03 and reboot
  • 12:16 gehel: deployment-prep restarting varnish on deployment-cache-text04
  • 11:04 Amir1: cherry-picked 280413/1 in beta puppetmaster, manually running puppet agent in deployment-ores-web
  • 10:22 Amir1: cherry-picking 280403 to beta puppetmaster and manually running puppet agent in deployment-ores-web

2016-03-29

  • 23:22 marxarelli: running jenkins-jobs update config/ 'mwext-donationinterfacecore125-testextension-zend53' to deploy https://gerrit.wikimedia.org/r/#/c/280261/
  • 19:52 Amir1: manually updated puppetmaster, deleted SSL cert key in deployment-ores-web in VM, running puppet agent manually
  • 02:20 jzerebecki: reloading zuul for 46923c8..c0937ee

2016-03-26

  • 22:38 jzerebecki: reloading zuul for 2d7e050..46923c8

2016-03-25

  • 23:55 marxarelli: deleting instances integration-slave-trusty-1002 and integration-slave-trusty-1005
  • 23:54 marxarelli: deleting jenkins nodes integration-slave-trusty-1002 and integration-slave-trusty-1005
  • 23:41 marxarelli: completed rolling manual deploy of https://gerrit.wikimedia.org/r/#/c/279640/ to trusty slaves
  • 23:27 marxarelli: starting rolling offline/remount/online of trusty slaves to increase tmpfs size
  • 23:22 marxarelli: pooled new trusty slaves integration-slave-trusty-1024 and integration-slave-trusty-1025
  • 23:13 jzerebecki: reloading zuul for 0aec21d..2d7e050
  • 22:14 marxarelli: creating new jenkins node for integration-slave-trusty-1024
  • 22:11 marxarelli: rebooting integration-slave-trusty-{1024,1025} before pooling as replacements for trusty-1002 and trusty-1005
  • 21:06 marxarelli: repooling integration-slave-trusty-{1005,1002} to help with load while replacement instances are provisioning
  • 16:59 marxarelli: depooling integration-slave-trusty-1002 until its DNS resolution issue can be fixed. still investigating disk space issue

2016-03-24

  • 16:39 thcipriani: restarted rsync service on deployment-tin
  • 13:45 thcipriani|afk: rearmed keyholder on deployment-tin
  • 04:41 Krinkle: beta-update-databases-eqiad and beta-scap-eqiad stuck for over 8 hours (IRC notifier plugin deadlock)
  • 03:28 Krinkle: beta-mediawiki-config-update-eqiad has been stuck 'queued' for over 5 hours.

2016-03-23

  • 23:00 Krinkle: rm-rf integration-slave-trusty-1013:/mnt/home/jenkins-deploy/tmpfs/jenkins-2/karma-54925082/ (bad permissions, caused Karma issues)
  • 19:02 legoktm: restarted zuul

2016-03-22

2016-03-21

  • 21:55 hashar: zuul: almost all MediaWiki extensions migrated to run the npm job on Nodepool (with Node.js 4.3) T119143 . All tested. Will check the overnight build results tomorrow
  • 20:28 hashar: Mass running npm-node-4.3 jobs against MediaWiki extensions to make sure they all pass ( https://gerrit.wikimedia.org/r/#/c/278004/ | T119143 )
  • 17:40 elukey: executed git rebase --interactive on deployment-puppetmaster.deployment-prep.eqiad.wmflabs to remove https://gerrit.wikimedia.org/r/#/c/278713/
  • 15:46 elukey: hacked manually the cdh puppet submodule on deployment-puppetmaster.deployment-prep.eqiad.wmflabs - please let me know if interfere with anybody's tests
  • 14:24 elukey: executed git submodule update --init on deployment-puppetmaster.deployment-prep.eqiad.wmflabs
  • 11:25 elukey: beta: cherry picked https://gerrit.wikimedia.org/r/#/c/278713/ to test an updated to the cdh module (analytics)
  • 11:13 hashar: beta: rebased puppet master which had a conflict on https://gerrit.wikimedia.org/r/#/c/274711/ which got merged meanwhile (saves Elukey )
  • 11:02 hashar: beta: added Elukey (wikimedia ops) to the project as member and admin

2016-03-19

  • 13:04 hashar: Jenkins: added ldap-labs-codfw.wikimedia.org as a fallback LDAP server T130446

2016-03-18

  • 17:16 jzerebecki: reloading zuul for e33494f..89a9659

2016-03-17

  • 21:10 thcipriani: updating scap on deployment-tin to test D133
  • 18:31 cscott: updated OCG to version c1a8232594fe846bd2374efd8f7c20d7e97ac449
  • 09:34 hashar: deployment-jobrunner01 deleted /var/log/apache/*.gz T130179
  • 09:04 hashar: Upgrading hhvm and related extensions on jobrunner01 T130179

2016-03-16

2016-03-15

  • 15:17 jzerebecki: added wikidata.beta.wmflabs.org in https://wikitech.wikimedia.org/wiki/Special:NovaAddress to deployment-cache-text04.deployment-prep.eqiad.wmflabs
  • 14:19 hashar: Image ci-jessie-wikimedia-1458051246 in wmflabs-eqiad is ready T124447
  • 14:14 hashar: Refreshing Nodepool snapshot images so they get a fresh copy of slave-scripts T124447
  • 14:08 hashar: Deploying slave script change https://gerrit.wikimedia.org/r/#/c/277508/ "npm-install-dev.py: Use config.dev.yaml instead of config.yaml" for T124447

2016-03-14

  • 22:18 greg-g: new jobs weren't processing in Zuul, lego fixed it and blamed Reedy
  • 20:13 hashar: Updating Jenkins jobs mwext-Wikibase-* so they no longer rely on --with-phpunit ( ping @hoo https://gerrit.wikimedia.org/r/#/c/277330/ )
  • 17:03 Krinkle: Doing full Zuul restart due to deadlock (T128569)
  • 10:18 moritzm: re-enabled systemd unit for logstash on deployment-logstash2

2016-03-11

  • 22:42 legoktm: deploying https://gerrit.wikimedia.org/r/276901
  • 19:41 legoktm: legoktm@integration-slave-trusty-1001:/mnt/jenkins-workspace/workspace$ sudo rm -rf mwext-Echo-testextension-* # because it was broken

2016-03-10

  • 20:22 hashar: Nodepool Image ci-jessie-wikimedia-1457641052 in wmflabs-eqiad is ready
  • 20:19 hashar: Refreshing Nodepool to include the 'varnish' package T128188
  • 20:05 hashar: apt-get upgrade integration-slave-jessie1001 (bring in ffmpeg update and nodejs among other things)
  • 12:22 hashar: Nodepool Image ci-jessie-wikimedia-1457612269 in wmflabs-eqiad is ready
  • 12:18 hashar: Nodepool: rebuilding image to get mathoid/graphoid packages included (hopefully) T119693 T128280

2016-03-09

  • 17:56 bd808: Cleaned up git clone state in deployment-tin.deployment-prep:/srv/mediawiki-staging/php-master and queued beta-code-update-eqiad to try again (T129371)
  • 17:48 bd808: Git clone at deployment-tin.deployment-prep:/srv/mediawiki-staging/php-master in a completely horrible state. Investigating
  • 17:22 bd808: Fixed https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/4452/
  • 17:19 bd808: Manually cleaning up broken rebase in deployment-tin.deployment-prep:/srv/mediawiki-staging
  • 16:27 bd808: Removed cherry-pick of https://gerrit.wikimedia.org/r/#/c/274696 ; manually cleaned up systemd unit and restarted logstash on deployment-logstash2
  • 14:59 hashar: Image ci-jessie-wikimedia-1457535250 in wmflabs-eqiad is ready T129345
  • 14:57 hashar: Rebuilding snapshot image to get Xvfb enabled at boot time T129345
  • 13:04 moritzm: cherrypicked patch to deployment-prep which provides a systemd unit for logstash
  • 10:52 hashar: Image ci-jessie-wikimedia-1457520493 in wmflabs-eqiad is ready
  • 10:29 hashar: Nodepool: created new image and refreshing snapshot in attempt to get Xvfb running T129320 T128090

2016-03-08

  • 23:42 legoktm: running CentralAuth's checkLocalUser.php --verbose=1 --delete=1 on deployment-tin for T115198
  • 21:33 hashar: Nodepool Image ci-jessie-wikimedia-1457472606 in wmflabs-eqiad is ready
  • 19:23 hashar: Zuul inject DISPLAY https://gerrit.wikimedia.org/r/#/c/273269/
  • 16:03 hashar: Image ci-jessie-wikimedia-1457452766 is ready T128090
  • 15:59 hashar: Nodepool: refreshing snapshot image to ship browsers+Xvfb for T128090
  • 14:27 hashar: Mass refreshed CI slave-scripts 1d2c60d..e27c292
  • 13:38 hashar: Rebased integration puppet master. Dropped a make-wmf-branch patch and the one for raita role
  • 11:26 hashar: Nodepool: created new snapshot to set puppet $::labsproject : ci-jessie-wikimedia-1457436175 hoping to fix hiera lookup T129092
  • 02:51 ori: deployment-prep Updating HHVM on deployment-mediawiki01
  • 02:27 ori: deployment-prep Updating HHVM on deployment-mediawiki02
  • 01:50 Krinkle: integration-saltmater: salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky' (T117710)
  • 01:50 Krinkle: integration-saltmater: salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer/src/skins/BlueSky'

2016-03-07

  • 21:03 hashar: Nodepool upgraded to 0.1.1-wmf.4 , it no longer waits 1 minute before deleting a used node | T118573
  • 20:05 hashar: Upgrading Nodepool from 0.1.1-wmf3 to 0.1.1-wmf.4 with andrewbogott | T118573

2016-03-06

2016-03-04

  • 19:31 hashar: Nodepool Image ci-jessie-wikimedia-1457119603 in wmflabs-eqiad is ready - T128846
  • 13:29 hashar: Nodepool Image ci-jessie-wikimedia-1457097785 in wmflabs-eqiad is ready
  • 08:42 hashar: CI deleting integration-slave-precise-1001 (2 executors). It is not in labs DNS, which causes a bunch of issues, and there is no need for the capacity anymore. T128802
  • 02:49 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/274889
  • 00:11 Krinkle: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"

2016-03-03

  • 23:37 legoktm: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
  • 22:34 legoktm: mysql not running on integration-slave-precise-1002, manually starting (T109704)
  • 22:30 legoktm: mysql not running on integration-slave-precise-1011, manually starting (T109704)
  • 22:19 legoktm: mysql not running on integration-slave-precise-1012, manually starting (T109704)
  • 22:07 legoktm: deploying https://gerrit.wikimedia.org/r/274821
  • 21:58 Krinkle: Reloading Zuul to deploy (EventLogging and AdminLinks) https://gerrit.wikimedia.org/r/274821 /
  • 18:49 thcipriani: killing deployment-bastion since it is no longer used
  • 14:23 hashar: https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1011/ is out of disk space

2016-03-02

2016-03-01

  • 23:10 Krinkle: Updated Jenkins configuration to also support php5 and hhvm for Console Sections detection of "PHPUnit"
  • 17:05 hashar: gerrit: set accounts inactive for Eloquence and Mgrover. Former WMF employees whose mail bounced back
  • 16:41 hashar: Restarted Jenkins
  • 16:32 hashar: Bunch of Jenkins jobs got stalled because I have killed threads in Jenkins to unblock integration-slave-trusty-1003 :-(
  • 12:14 hashar: integration-slave-trusty-1003 is back online
  • 12:13 hashar: Might have killed the proper Jenkins thread to unlock integration-slave-trusty-1003
  • 12:03 hashar: Jenkins can not pool back integration-slave-trusty-1003; Jenkins master has a bunch of blocking threads piling up with hudson.plugins.sshslaves.SSHLauncher.afterDisconnect() locked somehow
  • 11:41 hashar: Rebooting integration-slave-trusty-1003 (does not reply to salt / ssh)
  • 10:34 hashar: Image ci-jessie-wikimedia-1456827861 in wmflabs-eqiad is ready
  • 10:24 hashar: Refreshing Nodepool snapshot instances
  • 10:22 hashar: Refreshing Nodepool base image to speed instances boot time (dropping open-iscsi package https://gerrit.wikimedia.org/r/#/c/273973/ )

2016-02-29

  • 16:23 hashar: salt -v '*slave*' cmd.run 'rm -fR /mnt/jenkins-workspace/workspace/mwext*jslint' T127362
  • 16:17 hashar: Deleting all mwext-.*-jslint jobs from Jenkins. Paladox has migrated all of them to jshint/jsonlint generic jobs T127362
  • 16:16 hashar: Deleting all mwext-.*-jslint jobs from Jenkins. Paladox has migrated all of them to jshint/jsonlint generic jobs
  • 09:46 hashar: Jenkins installing Yaml Axis Plugin 0.2.0

2016-02-28

  • 01:30 Krinkle: Rebooting integration-slave-precise-1012 – Might help T109704 (MySQL not running)

2016-02-26

  • 15:14 jzerebecki: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'" T128191
  • 15:14 jzerebecki: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
  • 14:44 hashar: (since it started, don't be that scared!)
  • 14:44 hashar: Nodepool has triggered 40 000 instances
  • 11:53 hashar: Restarted memcached on deployment-memc02 T128177
  • 11:53 hashar: memcached process on deployment-memc02 seems to have a nice leak of socket usages (from lost) and plainly refuses connections (bunch of CLOSE_WAIT) T128177
  • 11:53 hashar: memcached process on deployment-memc02 seems to have a nice leak of socket usages (from lost) and plainly refuses connections (bunch of CLOSE_WAIT)
  • 11:40 hashar: deployment-memc04 find /etc/apt -name '*proxy' -delete (prevented apt-get update)
  • 11:26 hashar: beta: salt -v '*' cmd.run 'apt-get -y install ruby-msgpack' . I am tired of seeing puppet debug messages: "Debug: Failed to load library 'msgpack' for feature 'msgpack'"
  • 11:24 hashar: puppet keep restarting nutcracker apparently T128177
  • 11:20 hashar: Memcached error for key "enwiki:flow_workflow%3Av2%3Apk:63dc3cf6a7184c32477496d63c173f9c:4.8" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY

2016-02-25

  • 22:38 hashar: beta: maybe deployment-jobrunner01 is processing jobs a bit faster now. Seems like hhvm went wild
  • 22:23 hashar: beta: jobrunner01 had apache/hhvm killed somehow .... Blame me
  • 21:56 hashar: beta: stopped jobchron / jobrunner on deployment-jobrunner01 and restarting them by running puppet
  • 21:49 hashar: beta did a git-deploy of jobrunner/jobrunner hoping to fix puppet run on deployment-jobrunner01 and apparently it did! T126846
  • 11:21 hashar: deleting workspace /mnt/jenkins-workspace/workspace/browsertests-Wikidata-WikidataTests-linux-firefox-sauce on slave-trusty-1015
  • 10:08 hashar: Jenkins upgraded T128006
  • 01:44 legoktm: deploying https://gerrit.wikimedia.org/r/273170
  • 01:39 legoktm: deploying https://gerrit.wikimedia.org/r/272955 (undeployed) and https://gerrit.wikimedia.org/r/273136
  • 01:37 legoktm: deploying https://gerrit.wikimedia.org/r/273136
  • 00:31 thcipriani: running puppet on beta to update scap to latest packaged version: sudo salt -b '10%' -G 'deployment_target:scap/scap' cmd.run 'puppet agent -t'
  • 00:20 thcipriani: deployment-tin not accepting jobs for some time, ran through https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update, is back now

2016-02-24

  • 19:55 legoktm: legoktm@deployment-tin:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki
  • 18:30 bd808: "configuration file '/etc/nutcracker/nutcracker.yml' syntax is invalid"
  • 18:27 bd808: nutcracker dead on mediawiki01; investigating
  • 17:20 hashar: Deleted Nodepool instances so new ones get to use the new snapshot ci-jessie-wikimedia-1456333979
  • 17:12 hashar: Refreshing nodepool snapshot. Been stale since Feb 15th T127755
  • 17:01 bd808: https://wmflabs.org/sal/releng missing SAL data since 2016-02-20T20:19 due to bot crash; needs to be backfilled from wikitech data (T127981)
  • 16:43 hashar: sal on elastic search is stale https://phabricator.wikimedia.org/T127981
  • 15:07 hasharAW: beta app servers have lost access to memcached due to bad nutcracker conf | T127966
  • 14:41 hashar: beta: we have a lost a memcached server 11:51am UTC

2016-02-23

  • 22:45 thcipriani: deployment-puppetmaster is in a weird rebase state
  • 22:25 legoktm: running sync-common manually on deployment-mediawiki02
  • 09:59 hashar: Deleted a bunch of mwext-.*-jslint jobs that are no more in used (migrated to either 'npm' or 'jshint' / 'jsonlint' )

2016-02-22

  • 22:06 bd808: Restarted puppetmaster service on deployment-puppetmaster to "fix" error "invalid byte sequence in US-ASCII"
  • 17:46 jzerebecki: ssh integration-slave-trusty-1017.eqiad.wmflabs 'sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/.git/config.lock'
  • 16:47 gehel: deployment-prep upgrading deployment-logstash2 to elasticsearch 1.7.5
  • 10:26 gehel: deployment-prep upgrading elastic-search to 1.7.5 on deployment-elastic0[5-8]

2016-02-20

  • 20:19 Krinkle: beta-code-update-eqiad job repeatedly stuck at "IRC notifier plugin"
  • 19:29 Krinkle: beta-code-update-eqiad broken because deployment-tin:/srv/mediawiki-staging/php-master/extensions/MobileFrontend/includes/MobileFrontend.hooks.php was modified on the server without commit
  • 19:22 Krinkle: Various beta-mediawiki-config-update-eqiad jobs have been stuck 'queued' for > 24 hours

2016-02-19

2016-02-18

2016-02-17

2016-02-16

  • 23:22 yuvipanda: new instances on deployment-prep no longer get NFS because of https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&type=revision&diff=311783&oldid=311781
  • 23:18 hashar: jenkins@gallium find /var/lib/jenkins/config-history/nodes -maxdepth 1 -type d -name 'ci-jessie*' -exec rm -vfR {} \;
  • 23:17 hashar: Jenkins accepting slave creations again. Root cause is /var/lib/jenkins/config-history/nodes/ has reached the 32k inode limit.
  • 23:14 hashar: Jenkins: Could not create rootDir /var/lib/jenkins/config-history/nodes/ci-jessie-wikimedia-34969/2016-02-16_22-40-23
  • 23:02 hashar: Nodepool can not authenticate with Jenkins anymore. Thus it can not add slaves it spawned.
  • 22:56 hashar: contint: Nodepool instances pool exhausted
  • 21:14 andrewbogott: deployment-logstash2 migration finished
  • 20:49 jzerebecki: reloading zuul for 3bf7584..67fec7b
  • 19:58 andrewbogott: migrating deployment-logstash2 to labvirt1010
  • 19:00 hashar: tin: checking out mw 1.27.0-wmf.14
  • 15:23 hashar: integration-make-wmfbranch : /mnt/make-wmf-branch mount now has gid=wikidev and group setgid (i.e. mode 2775)
  • 15:20 hashar: integration-make-wmfbranch : change tmpfs to /mnt/make-wmf-branch (from /var/make-wmf-branch )
  • 11:30 jzerebecki: T117710 integration-saltmaster:~# salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer/src/skins/BlueSky'
  • 09:52 hashar: will cut the wmf branches this afternoon starting around 14:00 CET

2016-02-15

2016-02-14

2016-02-13

  • 06:42 bd808: restarted nutcracker on deployment-mediawiki01
  • 06:32 bd808: jobrunner on deployment-jobrunner01 enabled after reverting changes from T87928 that caused T126830
  • 05:51 bd808: disabled jobrunner process on jobrunner01; queue full of jobs broken by T126830
  • 05:31 bd808: trebuchet clone of /srv/jobrunner/jobrunner broken on jobrunner01; failing puppet runs
  • 05:25 bd808: jobrunner process on deployment-jobrunner01 badly broken; investigating
  • 05:20 bd808: Ran https://phabricator.wikimedia.org/P2273 on deployment-jobrunner01.deployment-prep.eqiad.wmflabs; freed ~500M; disk utilization still at 94%

2016-02-12

  • 23:54 hashar: beta cluster broken since 20:30 UTC https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor haven't looked
  • 17:36 hashar: salt -v '*slave-trusty*' cmd.run 'apt-get -y install texlive-generic-extra' # T126422
  • 17:32 hashar: adding texlive-generic-extra on CI slaves by cherry picking https://gerrit.wikimedia.org/r/#/c/270322/ - T126422
  • 17:19 hashar: getting rid of integration-dev; it is broken somehow
  • 17:10 hashar: Nodepool back at spawning instances. contintcloud has been migrated in wmflabs
  • 16:51 thcipriani: running sudo salt '*' -b '10%' deploy.fixurl to fix deployment-prep trebuchet urls
  • 16:31 hashar: bd808 added support for saltbot to update tasks automagically!!!! T108720
  • 03:10 yurik: attempted to sync graphoid from gerrit 270166 from deployment-tin, but it wouldn't sync. Tried to git pull sca02, submodules wouldn't pull

2016-02-11

  • 22:53 thcipriani: shutting down deployment-bastion
  • 21:28 hashar: pooling back slaves 1001 to 1006
  • 21:18 hashar: re enabling hhvm service on slaves ( https://phabricator.wikimedia.org/T126594 ) Some symlink is missing and only provided by the upstart script grrrrrrr https://phabricator.wikimedia.org/T126658
  • 20:52 legoktm: deploying https://gerrit.wikimedia.org/r/270098
  • 20:35 hashar: depooling the six recent slaves: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so cannot open shared object file
  • 20:29 hashar: pooling integration-slave-trusty-1004 integration-slave-trusty-1005 integration-slave-trusty-1006
  • 20:14 hashar: pooling integration-slave-trusty-1001 integration-slave-trusty-1002 integration-slave-trusty-1003
  • 19:35 marxarelli: modifying deployment server node in jenkins to point to deployment-tin
  • 19:27 thcipriani: running sudo salt -b '10%' '*' cmd.run 'puppet agent -t' from deployment-salt
  • 19:27 twentyafterfour: Keeping notes on the ticket: https://phabricator.wikimedia.org/T126537
  • 19:24 thcipriani: moving deployment-bastion to deployment-tin
  • 17:59 hashar: recreated instances with proper names: integration-slave-trusty-{1001-1006}
  • 17:52 hashar: Created integration-slave-trusty-{1019-1026} as m1.large (note 1023 is an exception it is for Android). Applied role::ci::slave , lets wait for puppet to finish
  • 17:42 Krinkle: Currently testing https://gerrit.wikimedia.org/r/#/c/268802/ in Beta Labs
  • 17:27 hashar: Depooling all the ci.medium slaves and deleting them.
  • 17:27 hashar: I tried. The ci.medium instances are too small and MediaWiki tests really need 1.5GBytes of memory :-(
  • 16:00 hashar: rebuilding integration-dev https://phabricator.wikimedia.org/T126613
  • 15:27 Krinkle: Deploy Zuul config change https://gerrit.wikimedia.org/r/269976
  • 11:46 hashar: salt -v '*' cmd.run '/etc/init.d/apache2 restart' might help for Wikidata browser tests failing
  • 11:32 hashar: disabling hhvm service on CI slaves ( https://phabricator.wikimedia.org/T126594 , cherry picked both patches )
  • 10:50 hashar: reenabled puppet on CI. All transitioned to a 128MB tmpfs (was 512MB)
  • 10:16 hashar: pooling back integration-slave-trusty-1009 and integration-slave-trusty-1010 (tmpfs shrunken)
  • 10:06 hashar: disabling puppet on all CI slaves. Trying to lower tmpfs 512MB to 128MB ( https://gerrit.wikimedia.org/r/#/c/269880/ )
  • 02:45 legoktm: deploying https://gerrit.wikimedia.org/r/269853 https://gerrit.wikimedia.org/r/269893
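Several entries in this section run fleet-wide commands through salt's batch mode (e.g. `salt -b '10%' '*' cmd.run 'puppet agent -t'`) so only a fraction of the hosts execute at once. A minimal sketch of that batching idea, using a plain shell loop as a stand-in for salt; the function and host names are illustrative, not a real salt invocation:

```shell
#!/bin/sh
# Sketch of salt's --batch behaviour: process hosts in fixed-size
# groups so only part of the fleet runs the command at a time.
run_in_batches() {
    batch_size=$1; shift
    while [ $# -gt 0 ]; do
        group=""
        i=0
        while [ $# -gt 0 ] && [ $i -lt "$batch_size" ]; do
            group="$group $1"; shift; i=$((i + 1))
        done
        echo "batch:$group"   # real salt would run cmd.run on this group here
    done
}

run_in_batches 2 host1 host2 host3 host4 host5
```

With a percentage batch (`-b '10%'`), salt computes the group size from the number of matched minions; the fixed size here keeps the sketch simple.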

2016-02-10

  • 23:54 hashar_: depooling Trusty slaves that only have 2GB of RAM, which is not enough. https://phabricator.wikimedia.org/T126545
  • 22:55 hashar_: gallium: find /var/lib/jenkins/config-history/config -type f -wholename '*/2015*' -delete ( https://phabricator.wikimedia.org/T126552 )
  • 22:34 Krinkle: Zuul is back up and processing Gerrit events, but jobs are still queued indefinitely. Jenkins is not accepting new jobs
  • 22:31 Krinkle: Full restart of Zuul. Seems Gearman/Zuul got stuck. All executors were idling. No new Gerrit events processed either.
  • 21:22 legoktm: cherry-picking https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster again
  • 21:17 hashar: CI dust has settled. Krinkle and I have pooled a lot more Trusty slaves to accommodate the overload caused by switching to php55 (jobs run on Trusty)
  • 21:08 hashar: pooling trusty slaves 1009, 1010, 1021, 1022 with 2 executors (they are ci.medium)
  • 20:38 hashar: cancelling mediawiki-core-jsduck-publish and mediawiki-core-doxygen-publish jobs manually. They will catch up on next merge
  • 20:34 Krinkle: Pooled integration-slave-trusty-1019 (new)
  • 20:28 Krinkle: Pooled integration-slave-trusty-1020 (new)
  • 20:24 Krinkle: created integration-slave-trusty-1019 and integration-slave-trusty-1020 (ci1.medium)
  • 20:18 hashar: created integration-slave-trusty-1009 and 1010 (trusty ci.medium)
  • 20:06 hashar: creating integration-slave-trusty-1021 and integration-slave-trusty-1022 (ci.medium)
  • 19:48 greg-g: that cleanup was done by apergos
  • 19:48 greg-g: did cleanup across all integration slaves, some were very close to out of room. results: https://phabricator.wikimedia.org/P2587
  • 19:43 hashar: Dropping slaves Precise m1.large integration-slave-precise-1014 and integration-slave-precise-1013 , most load shifted to Trusty (php53 -> php55 transition)
  • 18:20 Krinkle: Creating a Trusty slave to support increased demand following MediaWiki php53(precise)>php55(trusty) bump
  • 16:06 jzerebecki: reloading zuul for 41a92d5..5b971d1
  • 15:42 jzerebecki: reloading zuul for 639dd40..41a92d5
  • 14:12 jzerebecki: recover a bit of disk space: integration-saltmaster:~# salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/*WikibaseQuality*'
  • 13:46 jzerebecki: reloading zuul for 639dd40
  • 13:15 jzerebecki: reloading zuul for 3be81c1..e8e0615
  • 08:07 legoktm: deploying https://gerrit.wikimedia.org/r/269619
  • 08:03 legoktm: deploying https://gerrit.wikimedia.org/r/269613 and https://gerrit.wikimedia.org/r/269618
  • 06:41 legoktm: deploying https://gerrit.wikimedia.org/r/269607
  • 06:34 legoktm: deploying https://gerrit.wikimedia.org/r/269605
  • 02:59 legoktm: deleting 14GB broken workspace of mediawiki-core-php53lint from integration-slave-precise-1004
  • 02:37 legoktm: deleting /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer on trusty-1017, it had a skin cloned into it
  • 02:26 legoktm: queuing mwext jobs server-side to identify failing ones
  • 02:21 legoktm: deploying https://gerrit.wikimedia.org/r/269582
  • 01:03 legoktm: deploying https://gerrit.wikimedia.org/r/269576
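The recurring "reloading zuul for &lt;old&gt;..&lt;new&gt;" entries deploy a Zuul configuration change identified by a git commit range. A dry-run sketch of that routine, printing the steps an operator would run; the checkout path and service name are assumptions for illustration, not the exact production layout:

```shell
#!/bin/sh
# Dry-run sketch of "reloading zuul for <old>..<new>": update the
# config checkout, review the range being deployed, reload the service.
# /etc/zuul/wikimedia and the service name are illustrative assumptions.
zuul_reload_plan() {
    range=$1
    echo "git -C /etc/zuul/wikimedia pull"
    echo "git -C /etc/zuul/wikimedia log --oneline $range"
    echo "sudo service zuul reload"
}

zuul_reload_plan 41a92d5..5b971d1
```

A reload (as opposed to the full restarts also logged in this section) re-reads the layout without dropping queued jobs.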

2016-02-09

  • 23:17 legoktm: deploying https://gerrit.wikimedia.org/r/269551
  • 23:02 legoktm: gracefully restarting zuul
  • 22:57 legoktm: deploying https://gerrit.wikimedia.org/r/269547
  • 22:29 legoktm: deploying https://gerrit.wikimedia.org/r/269540
  • 22:18 legoktm: re-enabling puppet on all CI slaves
  • 22:02 legoktm: reloading zuul to see if it'll pick up the new composer-php53 job
  • 21:53 legoktm: enabling puppet on just integration-slave-trusty-1012
  • 21:52 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ onto integration-puppetmaster
  • 21:50 legoktm: disabling puppet on all trusty/precise CI slaves
  • 21:40 legoktm: deploying https://gerrit.wikimedia.org/r/269533
  • 17:49 marxarelli: disabled/enabled gearman in jenkins, connection works this time
  • 17:49 marxarelli: performed stop/start of zuul on gallium to restore zuul and gearman
  • 17:45 marxarelli: "Failed: Unable to Connect" in jenkins when testing gearman connection
  • 17:40 marxarelli: killed old zuul process manually and restarted service
  • 17:39 marxarelli: restart of zuul fails as well. old process cannot be killed
  • 17:38 marxarelli: reloading zuul fails with "failed to kill 13660: Operation not permitted"
  • 16:06 bd808: Deleted corrupt integration-slave-precise-1003:/mnt/jenkins-workspace/workspace/mediawiki-core-php53lint/.git
  • 15:11 hashar: mira: /srv/mediawiki-staging/multiversion/checkoutMediaWiki 1.27.0-wmf.13 php-1.27.0-wmf.13
  • 14:51 hashar: ./make-wmf-branch -n 1.27.0-wmf.13 -o master
  • 14:50 hashar: pooling back integration-slave-precise1001 - 1004. Manually fetched git repos in workspace for mediawiki core php53
  • 14:49 hashar: make-wmf-branch instance: created a local ssh key pair and set the config to use User: hashar
  • 14:13 hashar: pooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ Mysql is back .. Blame puppet
  • 14:12 hashar: depooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ Mysql is gone somehow
  • 14:04 hashar: Manually git fetching mediawiki-core in /mnt/jenkins-workspace/workspace/mediawiki-core-php53lint of slaves precise 1001 to 1004 (git on Precise is remarkably too slow)
  • 13:28 hashar: salt '*trusty*' cmd.run 'update-alternatives --set php /usr/bin/hhvm'
  • 13:28 hashar: salt '*precise*' cmd.run 'update-alternatives --set php /usr/bin/php5'
  • 13:18 hashar: salt -v --batch=3 '*slave*' cmd.run 'puppet agent -tv'
  • 13:15 hashar: removing https://gerrit.wikimedia.org/r/#/c/269370/ from CI puppet master
  • 13:14 hashar: slaves recurse infinitely doing /bin/bash -eu /srv/deployment/integration/slave-scripts/bin/mw-install-mysql.sh then loop over /bin/bash /usr/bin/php maintenance/install.php --confpath /mnt/jenkins-workspace/workspace/mediawiki-core-qunit/src --dbtype=mysql --dbserver=127.0.0.1:3306 --dbuser=jenkins_u2 --dbpass=pw_jenkins_u2 --dbname=jenkins_u2_mw --pass testpass TestWiki WikiAdmin https://phabricator.wikimedia.org/T126327
  • 12:46 hashar: Mass testing php loop of death: salt -v '*slave*' cmd.run 'timeout 2s /srv/deployment/integration/slave-scripts/bin/php --version'
  • 12:40 hashar: mass rebooting CI slaves from wikitech
  • 12:39 hashar: salt -v '*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
  • 12:33 hashar: all slaves dying due to PHP looping
  • 12:02 legoktm: re-enabling puppet on all trusty/precise slaves
  • 11:20 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster
  • 11:20 legoktm: enabling puppet just on integration-slave-trusty-1012
  • 11:13 legoktm: disabling puppet on all *(trusty|precise)* slaves
  • 10:26 hashar: pooling in integration-slave-trusty-1018
  • 03:19 legoktm: deploying https://gerrit.wikimedia.org/r/269359
  • 02:53 legoktm: deploying https://gerrit.wikimedia.org/r/238988
  • 00:39 hashar: gallium edited /usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py and modified: replication_timeout = 300 -> replication_timeout = 10
  • 00:37 hashar: live hacking Zuul code to have it stop sleeping() on force merge
  • 00:36 hashar: killing zuul
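The two 13:28 salt one-liners above pin what `/usr/bin/php` resolves to per distribution: HHVM on Trusty, Zend PHP 5 on Precise. A sketch of that mapping, with the actual `update-alternatives` calls shown only as comments since they need root and a configured alternatives group:

```shell
#!/bin/sh
# Per-distro php alternative, as pinned by the 13:28 entries.
php_alternative_for() {
    case $1 in
        trusty)  echo /usr/bin/hhvm ;;  # update-alternatives --set php /usr/bin/hhvm
        precise) echo /usr/bin/php5 ;;  # update-alternatives --set php /usr/bin/php5
        *)       return 1 ;;
    esac
}

php_alternative_for trusty
```

This is why the same job can run different PHP engines depending on which slave picks it up.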

2016-02-08

2016-02-06

  • 18:34 jzerebecki: reloading zuul for bdb2ed4..46ccca9

2016-02-05

  • 13:30 hashar: beta: cleaning out /data/project/logs/archive, which dates from the pre-logstash era. We have apparently not logged this way since May 2015
  • 13:29 hashar: beta: deleting /data/project/swift-disk, created in August 2014 and unused since June 2015. Was a failed attempt at bringing swift to beta
  • 13:27 hashar: beta: reclaiming disk space from extensions.git. On bastion: find /srv/mediawiki-staging/php-master/extensions/.git/modules -maxdepth 1 -type d -print -execdir git gc \;
  • 13:03 hashar: integration-slave-trusty-1011 went out of disk space. Did some brute clean up and git gc.
  • 05:21 Tim: configured mediawiki-extensions-qunit to only run on integration-slave-trusty-1017, did a rebuild and then switched it back
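The disk-space cleanups above lean on `git gc`, which repacks loose objects into packfiles (the 13:27 entry runs it over every submodule's module directory). A self-contained demo on a throwaway repository, safe to run anywhere git is installed:

```shell
#!/bin/sh
# Demo of reclaiming space with git gc: after gc, reachable objects
# live in a packfile instead of loose files under .git/objects.
set -e
repo=$(mktemp -d)
git -C "$repo" init --quiet
git -C "$repo" -c user.email=demo@example.org -c user.name=demo \
    commit --quiet --allow-empty -m demo
git -C "$repo" gc --quiet                 # repack loose objects
ls "$repo"/.git/objects/pack/*.pack       # at least one packfile now exists
```

On the real extensions checkout the win comes from deduplicating thousands of loose objects accumulated by repeated fetches.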

2016-02-04

  • 22:08 jzerebecki: reloading zuul for bed7be1..f57b7e2
  • 21:51 hashar: salt-key -d integration-slave-jessie-1001.eqiad.wmflabs
  • 21:50 hashar: salt-key -d integration-slave-precise-1011.eqiad.wmflabs
  • 00:57 bd808: Got deployment-bastion processing Jenkins jobs again via instructions left by my past self at https://phabricator.wikimedia.org/T72597#747925
  • 00:43 bd808: Jenkins agent on deployment-bastion.eqiad doing the trick where it doesn't pick up jobs again

2016-02-03

  • 22:24 bd808: Manually ran sync-common on deployment-jobrunner01.eqiad.wmflabs to pick up wmf-config changes that were missing (InitializeSettings, Wikibase, mobile)
  • 17:43 marxarelli: Reloading Zuul to deploy previously undeployed Icd349069ec53980ece2ce2d8df5ee481ff44d5d0 and Ib18fe48fe771a3fe381ff4b8c7ee2afb9ebb59e4
  • 15:12 hashar: apt-get upgrade deployment-sentry2
  • 15:03 hashar: redeployed rcstream/rcstream on deployment-stream by using git-deploy on deployment-bastion
  • 14:55 hashar: upgrading deployment-stream
  • 14:42 hashar: pooled back integration-slave-trusty-1015 Seems ok
  • 14:35 hashar: manually triggered a bunch of browser tests jobs
  • 11:40 hashar: apt-get upgrade deployment-ms-be01 and deployment-ms-be02
  • 11:32 hashar: fixing puppet.conf on deployment-memc04
  • 11:09 hashar: restarting beta cluster puppetmaster just in case
  • 11:07 hashar: beta: apt-get upgrade on deployment-cache* hosts and checking puppet
  • 10:59 hashar: integration/beta: deleting /etc/apt/apt.conf.d/*proxy files. There is no need for them; in fact, the web proxy is not reachable from labs
  • 10:53 hashar: integration: switched puppet repo back to 'production' branch, rebased.
  • 10:49 hashar: various beta cluster have puppet errors ..
  • 10:46 hashar: integration-slave-trusty-1013 heading to out of disk space on /mnt ...
  • 10:42 hashar: integration-slave-trusty-1016 out of disk space on /mnt ...
  • 03:45 bd808: Puppet failing on deployment-fluorine with "Error: Could not set uid on user[datasets]: Execution of '/usr/sbin/usermod -u 10003 datasets' returned 4: usermod: UID '10003' already exists"
  • 03:44 bd808: Freed 28G by deleting deployment-fluorine:/srv/mw-log/archive/*2015*
  • 03:42 bd808: Ran deployment-bastion.deployment-prep:/home/bd808/cleanup-var-crap.sh and freed 565M

2016-02-02

  • 18:32 marxarelli: Reloading Zuul to deploy If1f3cb60f4ccb2c1bca112900dbada03a8588370
  • 17:42 marxarelli: cleaning mwext-donationinterfacecore125-testextension-php53 workspace on integration-slave-precise-1013
  • 17:06 ostriches: running sync-common on mw2051 and mw1119
  • 09:38 hashar: Jenkins is fully up and operational
  • 09:33 hashar: restarting Jenkins
  • 08:47 hashar: pooling back integration-slave-precise1011 , puppet run got fixed ( https://phabricator.wikimedia.org/T125474 )
  • 03:48 legoktm: deploying https://gerrit.wikimedia.org/r/267828
  • 03:29 legoktm: deploying https://gerrit.wikimedia.org/r/266941
  • 00:42 legoktm: due to T125474
  • 00:42 legoktm: marked integration-slave-precise-1011 as offline
  • 00:39 legoktm: precise-1011 slave hasn't had a puppet run in 6 days

2016-02-01

  • 23:53 bd808: Logstash working again; I applied a change to the default mapping template for Elasticsearch that ensures that fields named "timestamp" are indexed as plain strings
  • 23:46 bd808: Elasticsearch index template for beta logstash cluster making crappy guesses about syslog events; dropped 2016-02-01 index; trying to fix default mappings
  • 23:09 bd808: HHVM logs causing rejections during document parse when inserting in Elasticsearch from logstash. They contain a "timestamp" field that looks like "Feb 1 22:56:39" which is making the mapper in Elasticsearch sad.
  • 23:04 bd808: Elasticsearch on deployment-logstash2 rejecting all documents with 400 status. Investigating
  • 22:50 bd808: Copying deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log to /srv for debugging later
  • 22:48 bd808: deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log is 11G of fail!
  • 22:46 bd808: root partition on deployment-logstash2 full
  • 22:43 bd808: No data in logstash since 2016-01-30T06:55:37.838Z; investigating
  • 15:33 hashar: Image ci-jessie-wikimedia-1454339883 in wmflabs-eqiad is ready
  • 15:01 hashar: Refreshing Nodepool image. Might have npm/grunt properly set up
  • 03:15 legoktm: deploying https://gerrit.wikimedia.org/r/267630

2016-01-31

  • 13:35 hashar: Jenkins IRC bot started failing at Jan 30 01:04:00 2016 for whatever reason.... Should be fine now
  • 13:33 hashar: cancelling/aborting jobs that are stuck while reporting to IRC (mostly browser tests and beta cluster jobs)
  • 13:32 hashar: Jenkins jobs are being blocked because they can no more report back to IRC :-(((
  • 13:28 hashar: Jenkins jobs are being blocked because they can no more report back to IRC :-(((

2016-01-30

  • 12:46 hashar: integration-slave-jessie-1001 : fixed puppet.conf server name and ran puppet

2016-01-29

  • 18:43 thcipriani: updated scap on beta
  • 16:44 thcipriani: deployed scap updates on beta
  • 11:58 _joe_: upgraded hhvm to 3.6 wm8 in deployment-prep

2016-01-28

  • 23:22 MaxSem: Updated portals on betalabs to master
  • 22:23 hashar: salt '*slave-precise*' cmd.run 'apt-get install php5-ldap' ( https://phabricator.wikimedia.org/T124613 ) will need to be puppetized
  • 18:17 thcipriani: cleaning npm cache on slave machines: salt -v '*slave*' cmd.run 'sudo -i -u jenkins-deploy -- npm cache clean'
  • 18:12 thcipriani: running npm cache clean on integration-slave-precise-1011 sudo -i -u jenkins-deploy -- npm cache clean
  • 15:25 hashar: apt-get upgrade deployment-sca01 and deployment-sca02
  • 15:09 hashar: fixing puppet.conf hostname on deployment-upload deployment-conftool deployment-tmh01 deployment-zookeeper01 and deployment-urldownloader
  • 15:06 hashar: fixing puppet.conf hostname on deployment-upload.deployment-prep.eqiad.wmflabs and running puppet
  • 15:00 hashar: Running puppet on deployment-memc02 and deployment-elastic07 . It is catching up with lot of changes
  • 14:59 hashar: fixing puppet hostnames on deployment-elastic07
  • 14:59 hashar: fixing puppet hostnames on deployment-memc02
  • 14:55 hashar: Deleted salt keys deployment-pdf01.eqiad.wmflabs and deployment-memc04.eqiad.wmflabs (obsolete, entries with '.deployment-prep.' are already there)
  • 07:38 jzerebecki: reload zuul for 4951444..43a030b
  • 05:55 jzerebecki: doing https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update
  • 03:49 mobrovac: deployment-prep re-enabled puppet on deployment-restbase0x
  • 02:49 mobrovac: deployment-prep deployment-restbase01 disabled puppet to set up cassandra for
  • 02:27 mobrovac: deployment-prep recreating deployment-restbase01 for T125003
  • 02:23 mobrovac: deployment-prep deployment-restbase02 disabled puppet to recreate deployment-restbase01 for T125003
  • 01:42 mobrovac: deployment-prep recreating deployment-sca02 for T125003
  • 01:28 mobrovac: deployment-prep recreating deployment-sca01 for T125003
  • 00:36 mobrovac: deployment-prep re-imaging deployment-mathoid for T125003
  • 00:02 jzerebecki: integration-slave-trusty-1016:~$ sudo -i rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/Donate

2016-01-27

  • 23:49 jzerebecki: integration-slave-precise-1011:~$ sudo -i /etc/init.d/salt-minion restart
  • 23:46 jzerebecki: work around https://phabricator.wikimedia.org/T117710 : salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky'
  • 21:19 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf (should be no-op after yesterday's deploy)
  • 10:29 hashar: triggered bunch of browser tests, deployment-redis01 was dead/faulty
  • 10:08 hashar: mass restarting redis-server process on deployment-redis01 (for https://phabricator.wikimedia.org/T124677 )
  • 10:07 hashar: mass restarting redis-server process on deployment-redis01
  • 09:00 hashar: beta: commenting out "latency-monitor-threshold 100" parameter from any /etc/redis/redis.conf we have ( https://phabricator.wikimedia.org/T124677 ). Puppet will not reapply it unless distribution is Jessie
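Commenting the parameter out rather than deleting it keeps the line greppable for later re-enabling. A sketch of that edit on a scratch copy of the config (GNU sed assumed; the real /etc/redis/redis.conf was changed on the instances themselves):

```shell
#!/bin/sh
# Comment out latency-monitor-threshold while leaving other
# directives untouched, demonstrated on a scratch file.
set -e
conf=$(mktemp)
printf 'latency-monitor-threshold 100\nmaxmemory 2gb\n' > "$conf"
sed -i 's/^latency-monitor-threshold/# &/' "$conf"   # & = the matched text
cat "$conf"
```

As the entry notes, puppet will not reapply the directive unless the distribution is Jessie, so the hand edit sticks on the older hosts.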

2016-01-26

  • 16:51 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf
  • 12:14 hashar: Added Jenkins IRC bot (wmf-insecte) to #wikimedia-perf for https://gerrit.wikimedia.org/r/#/c/265631/
  • 09:30 hashar: restarting Jenkins to upgrade the gearman plugin with https://review.openstack.org/#/c/271543/
  • 04:18 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build (27 hours after the last time I did that)

2016-01-25

  • 18:59 twentyafterfour: started redis-server on deployment-redis01 by commenting out latency-monitor-threshold from the redis.conf
  • 15:22 hashar: CI: fixing kernels not upgrading via: rm /boot/grub/menu.lst ; update-grub -y (i.e.: regenerate the Grub menu from scratch)
  • 14:21 hashar: integration-slave-trusty-1015.integration.eqiad.wmflabs is gone. I have failed the kernel upgrade / grub update
  • 01:35 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build

2016-01-24

2016-01-22

  • 23:58 legoktm: removed skins from mwext-qunit workspace on trusty-1013 slave
  • 23:34 legoktm: rm -rf /mnt/jenkins-workspace/workspace/mediawiki-phpunit-php53 on slave precise 1012
  • 22:45 legoktm: deploying https://gerrit.wikimedia.org/r/265864
  • 22:27 hashar: rebooted all CI slaves using OpenStackManager
  • 22:09 hashar: rebooting deployment-redis01 (kernel upgrade)
  • 21:22 hashar: Image ci-jessie-wikimedia-1453497269 in wmflabs-eqiad is ready (with node 4.2 for https://phabricator.wikimedia.org/T119143 )
  • 21:14 hashar: updating nodepool snapshot based on new image
  • 21:12 hashar: rebuilding nodepool reference image
  • 20:04 hashar: Image ci-jessie-wikimedia-1453492820 in wmflabs-eqiad is ready
  • 20:00 hashar: Refreshing nodepool image to hopefully get Nodejs 4.2.4 https://phabricator.wikimedia.org/T124447 https://gerrit.wikimedia.org/r/#/c/265802/
  • 16:32 hashar: Nuked corrupted git repo on integration-slave-precise-1012 /mnt/jenkins-workspace/workspace/mediawiki-extensions-php53
  • 12:23 hashar: beta: reinitialized keyholder on deployment-bastion. The proxy apparently had no identity
  • 09:32 hashar: beta cluster Jenkins jobs have been stalled for 9 hours and 25 minutes. Disabling/re-enabling the Gearman plugin to remove the deadlock

2016-01-21

  • 21:41 hashar: restored role::mail::mx on deployment-mx
  • 21:36 hashar: dropping role::mail::mx from deployment-mx to let puppet run
  • 21:33 hashar: rebooting deployment-jobrunner01 / kernel upgrade / /tmp is only 1MBytes
  • 21:19 hashar: fixing up deployment-jobrunner01 /tmp and / disks are full
  • 19:57 thcipriani: ran REPAIR TABLE globalnames; on centralauth db
  • 19:48 legoktm: deploying https://gerrit.wikimedia.org/r/265552
  • 19:39 legoktm: deploying jjb changes for https://gerrit.wikimedia.org/r/264990
  • 19:25 legoktm: deploying https://gerrit.wikimedia.org/r/265546
  • 01:59 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions/SpellingDictionary$ rm -r modules/jquery.uls && git rm modules/jquery.uls
  • 01:00 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git pull && git submodule update --init --recursive
  • 00:57 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git reset HEAD SpellingDictionary
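Read bottom-up, the three entries above recover a broken extensions checkout: unstage the accidental SpellingDictionary change, re-sync submodules, then remove the stray nested repo. A demo of the unstage step (`git reset HEAD <path>`) on a scratch repository; the path name just mirrors the entry:

```shell
#!/bin/sh
# git reset HEAD <path> removes a path from the index without
# touching the working tree, demonstrated on a throwaway repo.
set -e
r=$(mktemp -d)
git -C "$r" init --quiet
git -C "$r" -c user.email=demo@example.org -c user.name=demo \
    commit --quiet --allow-empty -m base
echo x > "$r/SpellingDictionary"                 # stands in for the staged path
git -C "$r" add SpellingDictionary
git -C "$r" reset --quiet HEAD SpellingDictionary  # same shape as the 00:57 entry
git -C "$r" diff --cached --quiet                # exits 0: index matches HEAD again
echo "index clean"
```

With the index clean, `git pull && git submodule update --init --recursive` (the 01:00 entry) can bring the submodules back in sync.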

2016-01-20

  • 20:05 hashar: beta: sudo find /data/project/upload7/math -type f -delete (probably some old leftovers)
  • 19:50 hashar: beta: on commons ran deleteArchivedFile.php : Nuked 7130 files
  • 19:49 hashar: beta : foreachwiki deleteArchivedRevisions.php -delete
  • 19:26 hasharAway: Nuked all files from http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload
  • 19:19 hasharAway: beta: sudo find /data/project/upload7/*/*/temp -type f -delete
  • 19:14 hasharAway: beta: sudo rm /data/project/upload7/*/*/lockdir/*
  • 18:57 hasharAway: beta cluster code has been stalled for roughly 2h30
  • 18:55 hasharAway: disconnecting Gearman plugin to remove deadlock for beta cluster jobs
  • 17:06 hashar: clearing files from beta-cluster to prepare for Swift migration. python pwb.py delete.py -family:betacommons -lang:en -cat:'GWToolset Batch Upload' -verbose -putthrottle:0 -summary:'Clearing out old batched upload to save up disk space for Swift migration'

2016-01-19

2016-01-17

2016-01-16

2016-01-15

  • 12:17 hashar: restarting Jenkins for plugins updates
  • 02:49 bd808: Trying to fix submodules in deployment-bastion:/srv/mediawiki-staging/php-master/extensions for T123701

2016-01-14

2016-01-13

  • 21:06 hashar: beta cluster code is up to date again. Got delayed by roughly 4 hours.
  • 20:55 hashar: unlocked Jenkins jobs for beta cluster by disabling/reenabling Jenkins Gearman client
  • 10:15 hashar: beta: fixed puppet on deployment-elastic06 . Was still using cert/hostname without .deployment-prep. .... Mass update occurring.

2016-01-12

2016-01-11

  • 22:24 hashar: Deleting old references on Zuul-merger for mediawiki/core : /usr/share/python/zuul/bin/python /home/hashar/zuul-clear-refs.py --until 15 /srv/ssd/zuul/git/mediawiki/core
  • 22:21 hashar: gallium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
  • 22:21 hashar: scandium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
  • 07:35 legoktm: deploying https://gerrit.wikimedia.org/r/263319

2016-01-07

2016-01-06

  • 21:13 thcipriani: kicking integration puppetmaster, weird node unable to find definition.
  • 21:11 jzerebecki: on scandium: sudo -u zuul rm -rf /srv/ssd/zuul/git/mediawiki/services/mathoid
  • 21:04 legoktm: ^ on gallium
  • 21:04 legoktm: manually deleted /srv/ssd/zuul/git/mediawiki/services/mathoid to force zuul to re-clone it
  • 20:17 hashar: beta: dropped a few more /etc/apt/apt.conf.d/*-proxy files. The web proxy is no longer reachable from labs
  • 09:44 hashar: CI/beta: deleting all git tags from /var/lib/git/operations/puppet and doing git repack
  • 09:39 hashar: restoring puppet hacks on beta cluster puppetmaster.
  • 09:35 hashar: beta/CI: salt -v '*' cmd.run 'rm -v /etc/apt/apt.conf.d/*-proxy' https://phabricator.wikimedia.org/T122953
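The salt one-liner above removes apt proxy drop-ins fleet-wide. The same cleanup on a single host, run here against a scratch directory so it is safe to execute (the real target is /etc/apt/apt.conf.d; the file names are illustrative):

```shell
#!/bin/sh
# Remove only the *-proxy drop-ins, leaving unrelated apt config alone.
set -e
d=$(mktemp -d)                               # stand-in for /etc/apt/apt.conf.d
touch "$d/80wikimedia-proxy" "$d/99local"    # one proxy file, one unrelated
rm -v "$d"/*-proxy                           # only the proxy drop-in goes
ls "$d"
```

The glob keeps the deletion narrow; anything not matching `*-proxy` survives.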

2016-01-05

2016-01-04

2016-01-02

  • 03:17 yurik: purged varnishs on deployment-cache-text04

2016-01-01

  • 22:17 bd808: No nodepool ci-jessie-* hosts seen in Jenkins interface and rake-jessie jobs backing up

Archive