Release Engineering/SAL/Archive 1
2015-12-30
- 00:13 bd808: rake-jessie jobs running again which will hopefully clear the large zuul backlog
- 00:12 bd808: nodepool restarted by andrewbogott when no ci-jessie-* slaves seen in Jenkins
2015-12-29
- 21:57 bd808: Updated zuul with https://gerrit.wikimedia.org/r/#/c/261114/
- 21:51 bd808: Updated zuul with https://gerrit.wikimedia.org/r/#/c/261163/
- 21:42 bd808: Updated zuul with https://gerrit.wikimedia.org/r/#/c/261322/
- 21:32 bd808: Updated zuul with https://gerrit.wikimedia.org/r/#/c/261577/
- 19:53 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/261476/ to integration-puppetmaster for testing
- 19:51 bd808: Fixed git remote of integration-puppetmaster.integration:/var/lib/git/labs/private to use https instead of old ssh method
2015-12-26
- 21:41 hashar: integration: getting rid of $wgHTTPProxy https://gerrit.wikimedia.org/r/261096 (no longer needed)
- 20:34 hashar: integration: cherry picked puppet patches https://gerrit.wikimedia.org/r/#/c/208024/ (raita role) and https://gerrit.wikimedia.org/r/#/c/204528/ (mysql on tmpfs)
- 10:07 hashar: no clue what is going on and I am traveling. Will look later tonight
- 10:07 hashar: restarted Xvfb on trusty-1011 and rebooted trusty-1015. mediawiki-extensions-qunit randomly fails on some hosts ( https://phabricator.wikimedia.org/T122449 ) :(
- 10:00 hashar: restarted CI puppetmaster
2015-12-23
- 23:37 marxarelli: Reloading Zuul to deploy I39b9f292e95363addf8983eec5d08a0af527a163
- 23:15 marxarelli: Reloading Zuul to deploy I78727ce68b45f3a6305291e6e1e596b62069fc21
2015-12-22
- 23:31 Krinkle: (when npm jobs run) - sudo rm -rf /mnt/home/jenkins-deploy/.npm at integration-slave-trusty-1015 (due to cache corruption)
- 21:13 ostriches: jenkins: kicking gearman connection, nothing is being processed from zuul queue
- 17:00 hashar: If in doubt, restart Jenkins.
- 10:06 hashar: Restarting Jenkins
- 09:58 hashar: Delete integration-zuul-debian-glue-* files. Leftover from an experiment
- 09:57 hashar: deleted cdb-* Jenkins jobs. Repo uses generic jobs
2015-12-21
- 20:06 hashar: Downgrading Jenkins plugin from 1.24 to 1.21
- 19:01 marxarelli: Purging TMPDIR contents on idle integration slaves
- 18:43 marxarelli: Updating slave scripts on all integration slaves to deploy I4edf7099acfeb0f06ea2042902bef03097137d6e
- 18:31 legoktm: same thing on 1015
- 18:28 legoktm: deleted some large npm directories from tmpfs on 1017 due to tmpfs being full
- 13:04 hashar: restarting cxserver on deployment-cxserver03
- 10:48 hashar: Banned testing-shinken- bot (useless duplicate notifications)
2015-12-19
- 22:18 Krinkle: sudo rm -rf /mnt/home/jenkins-deploy/tmpfs/jenkins* on integration-slave-precise-1014
- 22:18 Krinkle: sudo rm -rf /mnt/home/jenkins-deploy/tmpfs/jenkins* on integration-slave-precise-1012
- 22:04 Krinkle: sudo rm -rf /mnt/home/jenkins-deploy/tmpfs/jenkins* on integration-slave-precise-1013
- 22:04 Krinkle: sudo rm -rf /mnt/home/jenkins-deploy/tmpfs/jenkins* on integration-slave-precise-1011
- 01:23 bd808: Added Reedy as a projectadmin in the integration project
- 01:09 bd808: Cleared tmpfs on integration-slave-trusty-101[26] with variant of salt command from T120824 and marked as online again
- 01:02 Reedy: releng integration-slave-trusty-101[26] marked as offline due to chmod related errors
2015-12-18
- 22:39 hashar: salt -v '*slave*' cmd.run 'find /mnt/home/jenkins-deploy/tmpfs -user www-data -delete'
- 22:20 legoktm: deploying https://gerrit.wikimedia.org/r/260028
- 21:30 jzerebecki: salt --show-timeout '*slave*' cmd.run 'rm -fR /mnt/home/jenkins-deploy/tmpfs/jenkins-?/*'
- 16:43 hashar: Nodepool: Image ci-jessie-wikimedia-1450456713 in wmflabs-eqiad is ready
- 16:42 hashar: Nodepool instances now have Zuul 2.1.0-60-g1cc37f7-wmf4jessie1 finally
- 16:15 hashar: Image ci-jessie-wikimedia-1450455076 in wmflabs-eqiad is ready ( still has the wrong Zuul version grr)
- 16:11 hashar: Nodepool force refreshing image to make sure zuul is up to date (should be 2.1.0-60-g1cc37f7-wmf4jessie1 )
- 15:43 hashar: salt '*slave*' cmd.run 'rm -fR /srv/deployment/integration/mediawiki-tools-codesniffer' https://phabricator.wikimedia.org/T66371
- 15:37 hashar: Deleting mediawiki/tools/codesniffer.git branch wmf-deploy (was 358c2e7bdec269cec999af89e3412951bb463dc0 )
- 15:19 hashar: Created Github repo https://github.com/wikimedia/thumbor-video-engine
- 10:04 hashar: rechecking mediawiki/core REL branches ( REL1_26 https://gerrit.wikimedia.org/r/#/c/247327/ ) ( REL1_25 https://gerrit.wikimedia.org/r/#/c/247337/ ) ( REL1_24 https://gerrit.wikimedia.org/r/#/c/179886/ ) ( https://gerrit.wikimedia.org/r/#/c/143591/ | REL1_23 )
- 09:18 hashar: killing Zuul
- 09:16 hashar: mass cancelling jobs for changes that got force merged
2015-12-17
- 23:07 legoktm: sudo salt --show-timeout '*slave*' cmd.run 'rm -fR /mnt/home/jenkins-deploy/tmpfs/jenkins-?/*'
- 23:05 legoktm: marked integration-slave-trusty-1012 as offline due to tmpfs issues
- 22:58 legoktm: marked integration-slave-trusty-1014 as offline due to tmpfs issues
- 22:19 hashar: Nodepool: all nodes are on ci-jessie-wikimedia-1450388384
- 22:04 hashar: Nodepool: openstack image delete ci-jessie-wikimedia_old_20151210
- 21:49 hashar: milestone, Nodepool has spawned 14500 instances so far
- 21:48 hashar: Image ci-jessie-wikimedia-1450388384 in wmflabs-eqiad is ready
- 21:39 hashar: refreshing nodepool snapshot to get the latest zuul right now ( https://wikitech.wikimedia.org/wiki/Nodepool#Manually_generate_a_new_snapshot )
- 16:51 hashar: Added PdfHandler to extension-gate https://gerrit.wikimedia.org/r/#/c/259714/
- 13:46 hashar: Build example pass ( https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/43270/ https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/23552/ )
- 13:42 hashar: Added CirrusSearch to extension-gate https://gerrit.wikimedia.org/r/259679
- 12:16 hashar: beta: salt -v '*' cmd.run 'apt-get clean'
- 12:07 hashar: doing cleanup maintenance on deployment-bastion git repo under /srv/mediawiki-staging : git remote update --prune ; git gc ; git pack-refs
- 12:04 hashar: beta-scap fixed all by itself
- 11:14 hashar: beta-scap chokes on Copying to deployment-bastion.deployment-prep.eqiad.wmflabs from deployment-bastion.eqiad.wmflabs | Started rsync common
- 11:13 hashar: beta-scap-eqiad broken ( rsync: rename failed for "/srv/mediawiki/php-master/cache/gitinfo/info-extensions-AJAXPoll.json" (from php-master/cache/gitinfo/.~tmp~/info-extensions-AJAXPoll.json): No such file or directory (2) )
2015-12-16
- 23:23 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/259616
- 20:47 thcipriani: update scap on deployment-prep
- 16:41 hashar: updating operations-mw-config-phpunit to have it process submodules
- 14:43 hashar: beta puppetmaster is rebased and up-to-date (upstream at d3e1f70 )
- 14:41 hashar: beta puppetmaster: drop cherry pick 02c2006 RESTBase: Switch to service::node -- got merged
- 13:18 hashar: Gerrit created https://github.com/wikimedia/thumbor-svg-engine | https://phabricator.wikimedia.org/T121635
- 13:18 hashar: Gerrit created https://github.com/wikimedia/operations-debs-bloomd | https://phabricator.wikimedia.org/T121635
- 13:15 hashar: Gerrit: on mediawiki/services/mathoid, force pushed the gh-pages branch from Github to the Gerrit repo. Attempting to fix Gerrit replication issue ( https://phabricator.wikimedia.org/T121635 )
2015-12-15
- 22:57 hashar: On scandium created /srv/ssd/zuul/git/wikimedia/fundraising/crm repo manually. Namespace conflict with wikimedia/fundraising/crm/civicrm.git prevented zuul-merger from cloning the crm repo
- 16:33 hashar: cleared /mnt/home/jenkins-deploy/tmpfs/jenkins-2 from integration-slave-trusty-1017 and added it back to the pool
- 13:45 hashar: reverted composer upgrade on CI with https://gerrit.wikimedia.org/r/#/c/259241/
- 13:37 hashar: bumping composer on CI to 1.0.0-alpha11 https://gerrit.wikimedia.org/r/#/c/258933/
- 08:29 hashar: stopping zuul-merger on gallium for maintenance
- 06:19 legoktm: marked integration-slave-trusty-1017 as offline due to tmpfs issue
- 02:38 Krinkle: beta-mediawiki-config-update-eqiad jobs have been stuck on 'queued' for the past 3 hours
- 02:35 Krinkle: Ran 'sudo rm -rf /mnt/home/jenkins-deploy/tmpfs/jenk*' on ci slaves via salt
- 00:57 thcipriani: marking integration-slave-trusty-1012 offline, strange zuul.cloner behavior.
2015-12-14
- 20:15 hashar: scandium restarted zuul-merger
- 15:12 hashar: Stopping zuul-merger daemon on scandium. It lost its disk somehow earlier "DISK CRITICAL - /srv/ssd is not accessible: No such file or directory" https://phabricator.wikimedia.org/T121400#1877725
- 14:09 hashar: beta and integration: killing redis-servers on Ubuntu instances so they are properly tracked by upstart/puppet ( https://phabricator.wikimedia.org/T121396 )
- 12:59 hashar: dist-upgrade of all CI slaves
2015-12-13
- 23:30 bd808: Ran deployment-bastion:~bd808/cleanup-var-crap.sh and freed 846M on /var
- 21:11 legoktm: deploying https://gerrit.wikimedia.org/r/258784
2015-12-11
- 22:18 hashar: Stopped zuul merger on gallium to have phabricator/extensions populated on scandium (namespacing issue). Restarted zuul-merger on gallium once done.
- 22:14 hashar: On Zuul merger, nuking /srv/ssd/zuul/git/phabricator/extensions so zuul-merger can properly clone phabricator/extensions.git (dir exists because of phabricator/extensions/Sprint.git among others )
- 21:55 hashar: Reloading Zuul to deploy 385ddd9dd906865e7e61c3c5ea85eae0bb522c8d
- 15:14 jzerebecki: ssh integration-slave-trusty-1017.eqiad.wmflabs 'sudo -u jenkins-deploy rm -rf /mnt/home/jenkins-deploy/tmpfs/jenkins-1'
- 15:04 jzerebecki: jenkins-deploy@integration-slave-precise-1011:/mnt/jenkins-workspace/workspace/mwext-Wikibase-client-tests-mysql-zend/src/extensions/Wikibase$ rm .git/refs/heads/mw1.21-wmf6.lock
- 10:45 hashar: salt --show-timeout '*slave*' cmd.run 'rm -fR /mnt/home/jenkins-deploy/tmpfs/jenkins-?/*'
- 01:12 legoktm: deploying https://gerrit.wikimedia.org/r/258395
2015-12-10
- 19:30 legoktm: marked integration-slave-trusty-1011 as offline, all jobs failing due to tmpfs/lesscache permission denied errors
- 15:10 hashar: deleted all nodepool snapshot image ci-jessie-wikimedia-1449740024
- 15:06 hashar: Image ci-jessie-wikimedia-1449759571 in wmflabs-eqiad is ready
- 14:59 hashar: Refreshing Nodepool snapshot ( doc is https://wikitech.wikimedia.org/wiki/Nodepool#Manually_generate_a_new_snapshot )
- 14:59 hashar: New image id is 82a708eb-fd1a-4320-a054-6f1d4a319caa
- 14:57 hashar: created new Nodepool base image and pushing it to labs
- 09:40 hashar: Image ci-jessie-wikimedia-1449740024 in wmflabs-eqiad is ready "etcd got upgraded in the snapshot image: etcd (2.0.10-1 => 2.2.1+dfsg-1)"
- 09:34 hashar: Updating Nodepool snapshot image ; setup_node.sh now runs apt-get upgrade ( https://gerrit.wikimedia.org/r/#/c/257940/ )
- 01:27 legoktm: deploying https://gerrit.wikimedia.org/r/258090
2015-12-09
- 23:41 thcipriani: deleted deployment-kafka03: it doesn't seem to be in use yet and cannot be accessed via salt or ssh by root or anyone
- 17:50 hashar: salt-key --delete deployment-sentry2.eqiad.wmflabs ( already have deployment-sentry2.deployment-prep.eqiad.wmflabs )
- 16:19 hashar: Image ci-jessie-wikimedia-1449677602 in wmflabs-eqiad is ready ( comes with python-etcd )
- 16:15 hashar: Refreshing nodepool snapshots will hopefully grab python-etcd ( https://gerrit.wikimedia.org/r/257906 )
- 16:05 hashar: Image ci-jessie-wikimedia-1449676603 in wmflabs-eqiad is ready
- 15:56 hashar: refreshing nodepool snapshot instance, need a new etcd version
- 14:06 hashar: integration-slave-trusty-1011: sudo rm -fR /mnt/home/jenkins-deploy/tmpfs/jenkins-0 ( https://phabricator.wikimedia.org/T120824 )
- 12:57 hashar: Upgrading Jenkins Gearman plugin to grab upstream patch https://review.openstack.org/#/c/252768/ 'fix registration for jenkins master' should be noop
2015-12-08
- 20:31 hashar: beta cluster instances switching to new ldap configuration
- 20:31 hashar: beta: rebased operations/puppet and locally fixed a conflict
- 19:32 hashar: LDAP got migrated. We might have mwdeploy local users that got created on beta cluster instances :(
- 19:29 hashar: beta: aborted rebase on puppetmaster.
- 14:18 Krinkle: Removed integration-slave-trusty-1012:/mnt/home/jenkins-deploy/tmpfs/jenkins-2 which was left behind by a job. Caused other jobs to fail due to lack of permission to chmod/rm-rf this dir.
- 11:48 hashar: beta: salt-key --delete deployment-cxserver03.eqiad.wmflabs
- 11:42 hashar: running puppet on deployment-restbase01. Catching up on a lot of changes
- 11:23 hashar: puppet catching up a lot of changes on deployment-cache-mobile04 and deployment-cache-text04
- 11:20 hashar: beta: rebased puppet.git on puppetmaster
- 10:58 hashar: dropped deployment-cache-text04 puppet SSL certificates
- 10:44 hashar: beta: deployment-cache-text04 upgrading openssl libssl1.0.0
- 10:42 hashar: beta: fixing salt on bunch of hosts. There are duplicate process on a few of them. Fix up is: killall salt-minion && rm /var/run/salt-minion.pid && /etc/init.d/salt-minion start
- 10:32 hashar: beta: salt-key --delete=deployment-cache-upload04.eqiad.wmflabs (missing 'deployment-prep' subdomain)
- 10:31 hashar: beta: puppet being fixed on memc04 sentry2 cache-upload04 cxserver03 db1
- 10:27 hashar: beta: fixing puppet.conf on a bunch of hosts. The [agent] server = deployment-puppetmaster.eqiad.wmflabs is wrong, missing the 'deployment-prep' subdomain
- 10:23 hashar: beta: salt-key --delete=i-000005d2.eqiad.wmflabs
2015-12-07
- 16:16 hashar: Nodepool no longer listens for Jenkins events over ZeroMQ. No TCP connection established on port 8888
- 16:09 hashar: Nodepool no longer notices when Jenkins slaves go offline. This delays deletions and repooling significantly. Investigating
- 15:22 hashar: labs DNS had some issue. all solved now.
- 13:46 hashar: Reloading Jenkins configuration from disk following up mass deletions of jobs directly on gallium
- 13:41 hashar: deleting a bunch of unmanaged Jenkins jobs (no longer in JJB / no longer in Zuul)
- 04:24 bd808: The ip address in jenkins for ci-jessie-wikimedia-10306 now belongs to an instance named future-wikipedia.reading-web-staging.eqiad.wmflabs (obviously the config is wrong)
- 04:12 bd808: ci-jessie-wikimedia-10306 down and blocking many zuul queues
2015-12-04
- 19:24 MaxSem: bumped portals
- 09:15 hashar: salt --show-timeout '*' cmd.run 'rm -fR /mnt/jenkins-workspace/workspace/mwext-qunit/src/skins/*' ( https://phabricator.wikimedia.org/T120349 )
2015-12-03
- 23:53 marxarelli: Reloading Zuul to deploy If60f720995dfc7859e53cf33043b5a21b1a4b085
- 23:39 jzerebecki: reloading zuul for c078000..f934379
- 17:46 jzerebecki: reloading zuul for e4d3745..c078000
- 16:25 jzerebecki: reloading zuul for 58b5486..e4d3745
- 11:01 hashar: reenabled puppet on integration slaves
- 10:13 hashar: integration disabling puppet agent to test xvfb https://gerrit.wikimedia.org/r/#/c/256643/
- 08:52 hashar: apt-get upgrade integration-raita.integration.wmflabs.org
2015-12-02
- 11:16 hashar: configure wmf-insecte to join #wikimedia-android-ci ( https://gerrit.wikimedia.org/r/#/c/254905/3/jjb/mobile.yaml,unified )
- 11:06 hashar: restarting nodepool
- 11:05 hashar: manually refreshed nodepool snapshot ( Image ci-jessie-wikimedia-1449053701 in wmflabs-eqiad is ready ) while investigating for https://phabricator.wikimedia.org/T120076
- 09:24 hashar: keyholder rearmed (hopefully); doc at https://wikitech.wikimedia.org/wiki/Keyholder
- 09:19 hashar: beta-scap-eqiad is broken: Permission denied (publickey).
2015-12-01
- 16:06 hashar: split mediawiki core parser tests under Zend to their own job https://gerrit.wikimedia.org/r/#/c/256006/
- 14:55 hashar: salt --show-timeout '*' cmd.run 'cd /srv/deployment/integration/slave-scripts; git pull'
- 14:49 hashar: mw-phpunit.sh error is fixed via https://gerrit.wikimedia.org/r/256222
- 14:36 hashar: bin/mw-phpunit.sh: line 31: phpunit_args[@]: unbound variable
- 10:37 hashar: kicking puppetmaster on integration-puppetmaster : out of memory
- 10:30 hashar: Upgrading Zuul on Trusty and Jessie labs slaves to 2.1.0-60-g1cc37f7-wmf4...
2015-11-29
- 23:23 bd808: updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/255916/ to PS2
- 05:34 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/255916/ for testing
- 05:25 bd808: trebuchet is wack and not getting returner results from any hosts; see T119765
- 05:23 bd808: updated scap to 1879fd4 Add sync-l10n command for l10nupdate
- 05:22 bd808: stashed uncommitted scap3 changes found on deployment-bastion
2015-11-27
- 19:25 bd808: Trebuchet returners not reporting to redis from sca0[12]
- 19:04 bd808: restarted salt-minion on deployment-sca02
- 19:03 bd808: stopped 2 salt-minion processes on deployment-sca01; started one back up
2015-11-26
- 22:35 hashar: Upgraded Zuul on labs Precise hosts to zuul_2.1.0-60-g1cc37f7-wmf4precise1 ( https://phabricator.wikimedia.org/T119741 )
- 14:10 hashar: Kicking stupid salt on integration box. We need a new orchestration system
- 14:09 hashar: upgrading zuul on labs to 2.1.0-60-g1cc37f7-wmf3 ( https://review.openstack.org/#/c/249207/2 https://phabricator.wikimedia.org/T97106 )
- 14:04 hashar: upgrading zuul on labs to 2.1.0-60-g1cc37f7-wmf3 ( https://review.openstack.org/#/c/249207/2 https://phabricator.wikimedia.org/T97106 )
2015-11-25
- 19:53 legoktm: ran mwscript sql.php --wiki=enwiki --wikidb=wikishared /srv/mediawiki-staging/php-master/extensions/Echo/db_patches/echo_unread_wikis.sql
2015-11-24
- 21:16 hashar: bumping our JJB fork to 4eb33b2. Pinning jenkins==0.4.8 with https://gerrit.wikimedia.org/r/255182
- 15:28 hashar: stopping zuul-merger on gallium for some tricky change
- 15:27 jzerebecki: reloading zuul for 89ae39e..06ae234
- 10:34 hashar: made mediawiki/extensions/SyntaxHighlighter readonly. A single commit copy-pasted from some wiki page
2015-11-23
- 22:32 hashar: Updating parsoidsvc-source-npm-* jobs https://gerrit.wikimedia.org/r/254946
- 22:31 hashar: Updating parsoidsvc-deploy-npm-* jobs https://gerrit.wikimedia.org/r/254946
- 17:13 hashar: deleting old Nodepool snapshots. Current one is ci-jessie-wikimedia-1448296278
- 15:25 hashar: Image ci-jessie-wikimedia-1448292050 in wmflabs-eqiad is ready
- 15:02 hashar: updating rake-jessie job to use cached repos under /srv/git (for nodepool)
- 14:42 hashar: regenerating the nodepool images and snapshot. 51-git-mirror-ownership did not run because of missing executable bit
- 14:31 hashar: deleted obsolete nodepool instances so nodepool replenish the pool with new image
- 14:27 hashar: Image ci-jessie-wikimedia-1448288646 in wmflabs-eqiad is ready
- 14:24 hashar: refreshing nodepool snapshot
- 14:23 hashar: pushing new disk image to labs for Nodepool
- 10:17 hashar: added Jcrespo (Jaime) to the beta cluster project as an admin + sudo rights
- 03:28 bd808: Freed 800M on deployment-bastion by running /home/bd808/cleanup-var-crap.sh
2015-11-22
- 21:15 Krinkle: Zuul queue stuck as of 3 hours ago. Jenkins unresponsive or 503 over HTTP.
2015-11-20
- 21:59 bd808: Removed shadow l10nupdate user from /etc/passwd & /etc/shadow on mira.deployment-prep.eqiad.wmflabs
- 14:33 hashar: restarted qa-morebots
- 14:33 hashar: Jenkins: deleting all phpcs-HEAD jobs https://phabricator.wikimedia.org/T90943
2015-11-13
- 21:11 hashar: zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/tools/releng --change 200240,1
- 21:06 legoktm: deploying https://gerrit.wikimedia.org/r/252287
- 20:28 hashar: on gallium backing up /srv/org/wikimedia/integration/{doc,cover} to /home/hashar/{doc,cover}.tar.gz
- 20:22 hashar: integration mass updating publish related jobs ( https://gerrit.wikimedia.org/r/#/c/251442/ )
- 14:44 hashar: deleted https://integration.wikimedia.org/ci/job/mwext-VisualEditor-qunit/ (subset of mediawiki-extensions-qunit https://gerrit.wikimedia.org/r/252130 )
- 14:10 hashar: Upgrading Zuul on all CI slaves to zuul_2.1.0-60-g1cc37f7-wmf2<DISTRO>1 (refreshed the zuul-cloner --no-hard-links patch) https://phabricator.wikimedia.org/T118340
- 13:07 hashar: force rebased integration/zuul patch-queue/debian/precise-wikimedia e667fdd...ac7616a
- 13:05 hashar: force pushed to integration/zuul patch-queue/debian/precise-wikimedia 7d4b2d3...e667fdd
2015-11-12
- 22:03 legoktm: rm -rf mediawiki-extensions-qunit workspace on trusty-1012
2015-11-11
- 17:58 jzerebecki: reloading zuul for e81ee95..acbd851
- 00:02 thcipriani: deployment-puppetmaster:/var/lib/git/operations/puppet was left mid-rebase with conflicts in the portals apache config. Fixed the conflicts, added the file, and continued the rebase.
2015-11-10
- 19:36 jzerebecki: reloading zuul for 5009316..e81ee95
- 17:08 jzerebecki: https://phabricator.wikimedia.org/T117710 ssh integration-slave-trusty-1012.eqiad.wmflabs 'sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky'
2015-11-09
- 23:19 hashar: put https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1015/ back on
- 23:11 hashar: Made https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1015/ offline because Chromium/XVFB is unreachable somehow, causing issues for https://integration.wikimedia.org/ci/job/mwext-VisualEditor-qunit/
- 12:56 hashar: restarting Jenkins to refresh the cli-shutdown.groovy script -- https://gerrit.wikimedia.org/r/251935 (https://phabricator.wikimedia.org/T118064)
- 11:09 hashar: Upgrading Jenkins from LTS 1.609.3 to LTS 1.625.1 https://phabricator.wikimedia.org/T118157
2015-11-07
- 20:52 jzerebecki: reloading zuul for a2951c3..ccea029
2015-11-06
- 23:02 jzerebecki: reloading zuul for 583e87e..a2951c3
- 11:41 hashar: Image ci-jessie-wikimedia-1446809868 in wmflabs-eqiad is ready
- 11:38 hashar: refreshing Nodepool snapshot again. We need libgnutls28-dev ( https://gerrit.wikimedia.org/r/251485 https://phabricator.wikimedia.org/T117955 )
- 10:52 hashar: Refreshing Nodepool snapshot image and deleting obsolete instances
- 10:43 hashar: integration fixed puppet run libcurl-dev -> libcurl4-gnutls-dev ( https://gerrit.wikimedia.org/r/251479 https://phabricator.wikimedia.org/T117955 )
- 08:59 hashar: nodepool image-update wmflabs-eqiad ci-jessie-wikimedia // for libcurl-dev https://gerrit.wikimedia.org/r/#/c/251432/
2015-11-05
- 22:50 legoktm: deploying https://gerrit.wikimedia.org/r/251418
- 19:44 legoktm: deploying https://gerrit.wikimedia.org/r/251313
- 19:16 legoktm: deploying https://gerrit.wikimedia.org/r/251305
- 18:23 marxarelli: All deployment-db1 tables appear OK
- 18:10 marxarelli: Running mysqlcheck to verify databases on deployment-db1 after https://phabricator.wikimedia.org/T117881
- 16:39 hashar: deployment-db1 instance is down. I guess beta cluster is dead now.
- 15:44 hashar: kicking puppetmaster on integration
2015-11-04
- 23:18 marxarelli: Running `jenkins-jobs update config/ 'mw-tools-scap-tox-doc-publish'` to test Ib4753ad493115682d902cf15136199fd2083b8e5
- 21:41 jzerebecki: reloading zuul for 8db8417..d49e21a
- 19:33 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/246773
- 17:07 jzerebecki: reloading zuul for 0f620bd..b3a8e99
- 10:21 jzerebecki: reloading zuul for b8808f8..0f620bd
- 04:15 bd808: Handy cleanup script for when /var is full on deployment-bastion -- https://phabricator.wikimedia.org/P2273
- 04:01 bd808: sudo rm /var/log/*.??.gz on deployment-bastion
- 04:00 bd808: sudo rm /var/log/apache2/*.??.gz on deployment-bastion
- 03:58 bd808: sudo rm /var/log/atop.log.? on deployment-bastion
- 03:52 bd808: sudo rm /var/log/account/pacct.?* on deployment-bastion for the usual reason
2015-11-03
- 22:48 thcipriani: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/250847
- 22:37 ostriches: deployment-bastion: scap now pointing to Phab repo instead of Gerrit.
- 22:05 bd808: applied ::beta::deployaccess on deployment-bastion via Special:NovaInstance
- 21:59 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/250837/ and forced puppet run on deployment-bastion
- 13:20 hashar: restarting Jenkins to apply updated plugins
- 13:09 hashar: Jenkins upgrading a few more plugins
- 13:07 hashar: Upgrading Jenkins plugin "Green Ball" from 1.14 to 1.15. Seems to fix a potential deadlock on jenkins start ( https://issues.jenkins-ci.org/browse/JENKINS-28422 )
2015-11-02
- 20:48 hashar: Upgraded Zuul on Trusty slaves to /root/zuul_2.1.0-60-g1cc37f7-wmf1trusty1_amd64.deb
- 20:00 legoktm: deploying https://gerrit.wikimedia.org/r/250487
- 16:27 hashar: beta: restart salt-minion on a bunch of instances
- 16:15 hashar: integration-slave-trusty-1014 upgrading Zuul to /root/zuul_2.1.0-60-g1cc37f7-wmf1trusty1_amd64.deb . In case of trouble unpool it from https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1014/
- 13:32 hashar: Refreshing Nodepool snapshot to be based on new image image-jessie-20151030T221648Z.qcow2 : git -C /etc/nodepool/wikimedia/ pull && nodepool image-update wmflabs-eqiad ci-jessie-wikimedia
- 13:31 hashar: openstack image create --file image-jessie-20151030T221648Z.qcow2 ci-jessie-wikimedia --disk-format qcow2 --property show=true
- 13:31 hashar: openstack image set --name ci-jessie-wikimedia_old_20151102 ci-jessie-wikimedia
- 13:30 hashar: updating Nodepool base image on wmflabs to get https://gerrit.wikimedia.org/r/#/c/250148/ (set hostname on debian hosts)
- 12:27 hashar: Upgraded zuul-cloner on Precise slaves: 2.0.0-327-g3ebedde-wmf3precise1 -> /root/zuul_2.1.0-60-g1cc37f7-wmf1precise1_amd64.deb
- 12:24 hashar: unpooling integration-slave-precise1013 and upgrading zuul-cloner 2.0.0-327-g3ebedde-wmf3precise1 -> /root/zuul_2.1.0-60-g1cc37f7-wmf1precise1_amd64.deb
- 10:59 hashar: Bump integration/zuul upstream branch from 3ebedde to 1cc37f7
- 10:52 hashar: Restarting Jenkins to upgrade Gearman plugin from 0.1.2 to 0.1.3 "Send node labels back on build completion"
2015-11-01
- 22:18 legoktm: deploying https://gerrit.wikimedia.org/r/250362
2015-10-31
- 21:32 bd808: Ran `sudo rm /var/log/account/pacct.?` on deployment-bastion; freed ~425M
2015-10-30
- 18:20 marxarelli: deployment-db2 replication recovered after slave stop/reset/set master position/start
- 18:17 marxarelli: stopping/resetting slave on deployment-db2 to fix replicate after relay log corruption
- 11:35 hashar: had the varnish stats collector emit to labmon1001.eqiad.wmnet instead of production statsd ( cherry picked https://gerrit.wikimedia.org/r/#/c/249490/ )
2015-10-29
- 17:33 legoktm: deploying https://gerrit.wikimedia.org/r/249223
- 16:50 legoktm: deploying https://gerrit.wikimedia.org/r/249766
- 15:27 bd808: removed shadow mwdeploy account from /etc/passwd on mira.deployment-prep.eqiad.wmflabs
- 10:21 hashar: restarting Jenkins (java upgrade)
- 03:18 thcipriani|afk: broken again. looks like /srv/mediawiki-staging on mira should be owned by mwdeploy
- 02:50 thcipriani|afk: hooray, fixed!
- 02:40 thcipriani|afk: beta-scap-eqiad failing due to rsync-created mira:/srv/mediawiki-staging/.~tmp~ directory being owned by mwdeploy but with a uid of 993 instead of 603 (local mwdeploy)
- 00:12 MaxSem: Manually fixed permissions on mw-config/portals, reinitialized submodule and synced
2015-10-28
- 19:06 legoktm: deploying https://gerrit.wikimedia.org/r/249476
- 16:00 hashar: for integration/zuul.git , created branch labs-tox-deployment to be used to deploy Zuul with pip on labs instances
- 15:38 hashar: no matter, NFS is under maintenance
- 15:38 hashar: rebooting deployment-parsoid05 . seems NFS is flappy
- 10:01 hashar: applying role::cache::parsoid to deployment-cache-parsoid05
- 09:58 hashar: Deleting deployment-parsoidcache02 (Trusty) 10.68.16.145 to be replaced with deployment-cache-parsoid05 10.68.20.102 (Jessie)
- 09:54 hashar: beta: deleting deployment-cache-parsoid04 not enough disk space for /srv/ ( https://phabricator.wikimedia.org/T103660 )
- 05:26 legoktm: deploying https://gerrit.wikimedia.org/r/249349
- 03:43 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/249341
2015-10-27
- 20:42 hashar: Nuking Nodepool instances using the previous snapshot ( ci-jessie-wikimedia-1445955240 )
- 20:40 hashar: Nodepool snapshot ci-jessie-wikimedia-1445977928 generated. Includes /usr/bin/rake ( puppet: https://gerrit.wikimedia.org/r/#/c/249219/ )
- 20:31 hashar: Generating new Nodepool snapshot ( https://wikitech.wikimedia.org/wiki/Nodepool#Manually_generate_a_new_snapshot ) to have 'rake' included ( puppet: https://gerrit.wikimedia.org/r/#/c/249219/ )
- 15:08 ostriches: deploying master of scap to beta
2015-10-26
- 23:04 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/249015
- 18:37 legoktm: deploying https://gerrit.wikimedia.org/r/248929
- 18:22 MaxSem: Cherrypicked https://gerrit.wikimedia.org/r/#/c/248374/ on beta
- 16:03 hashar: reenabling puppet on integration-slave-trusty-1013
- 15:13 hashar: Disabling puppet on trusty-1013 to apply the slave scripts local hack https://gerrit.wikimedia.org/r/#/c/248883/ . Should fix some weird qunit failure ( https://phabricator.wikimedia.org/T116565 )
- 05:11 legoktm: deploying https://gerrit.wikimedia.org/r/248811 & https://gerrit.wikimedia.org/r/248812
2015-10-24
- 05:58 marxarel_: deployment-db2 data restored, replication working
- 03:54 marxarelli: restoring deployment-db2 again ...
- 03:18 marxarelli: finished restoring data on deployment-db2. replication is working once again
- 02:15 marxarelli: restoring data on deployment-db2
- 00:20 legoktm: deploying https://gerrit.wikimedia.org/r/248576
- 00:15 legoktm: deploying https://gerrit.wikimedia.org/r/248562
2015-10-23
- 23:05 marxarelli: restoring deployment-db2 from dump
- 23:04 legoktm: deploying https://gerrit.wikimedia.org/r/248551
- 22:18 marxarelli: dump of deployment-db1 failed due to "View 'labswiki.bounce_records' references invalid table(s)"
- 21:37 marxarelli: dumping databases on deployment-db1 for restore of deployment-db2
- 21:13 marxarelli: deployment-db1 binlog deployment-db1-bin.000062 appears corrupt
- 20:54 marxarelli: deployment-db2 shows slave io but slave sql failed on duplicate key
- 18:59 twentyafterfour: deleted atop.log.* files on deployment-bastion. when are we going to enlarge /var on this instance. grr
- 18:58 marxarelli: Killed mysql process 15840440 on account of its gargantuan temp file filling up /mnt
2015-10-22
- 10:36 hashar: integration: cherry picked https://gerrit.wikimedia.org/r/#/c/244748/ "contint: install npm/grunt-cli with npm" , giving it a try one host a time
- 10:31 hashar: integration disabling puppet salt --show-timeout --timeout=10 '*' cmd.run 'puppet agent --disable "install npm/grunt-cli via puppet https://gerrit.wikimedia.org/r/#/c/244748/"'
- 10:05 hashar: salt-key -d deployment-logstash2.eqiad.wmflabs
- 10:05 hashar: salt-key -d deployment-urldownloader.eqiad.wmflabs
- 10:04 hashar: integration: clean up downloaded apt packages which are filling /var/cache/apt/archives on a few instances salt --show-timeout '*' cmd.run 'apt-get clean'
- 10:03 hashar: beta: clean up downloaded apt packages which are filling /var/cache/apt/archives on a few instances (ex: 4GBytes on mediawiki02) salt --show-timeout '*' cmd.run 'apt-get clean'
- 10:01 hashar: beta-cluster: I have deleted some incorrect salt minions with: salt-key -d i-0000*
2015-10-19
- 14:51 hashar: Adding CirrusSearch to the extensions gate ( https://gerrit.wikimedia.org/r/#/c/247280/ )
- 14:38 hashar: Adding PdfHandler to the extensions gate ( https://gerrit.wikimedia.org/r/#/c/247278/ )
- 14:26 hashar: Adding TimedMediaHandler to the extensions gate ( https://gerrit.wikimedia.org/r/#/c/247273/ )
- 14:05 hashar: Adding MwEmbedSupport to the extensions gate ( https://gerrit.wikimedia.org/r/#/c/247271/ )
- 13:40 hashar: Adding Cite to the extensions gate ( https://gerrit.wikimedia.org/r/#/c/247266/ )
- 13:18 hashar: Adding Elastica to the extensions gate ( https://gerrit.wikimedia.org/r/#/c/247264/ )
2015-10-16
- 20:51 hashar: Restarting Jenkins to remove potential dead locks before the week-end
- 20:34 hashar: cancelled a bunch of https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-publish/ jobs. We keep rebuilding over and over REL* merged changes
- 20:24 hashar: disconnected / reconnected a bunch of trusty slaves. Seems some node executors were disabled/deadlocked
- 12:56 hashar: Added Cards and RelatedArticles to the shared jobs mediawiki-extensions-* https://gerrit.wikimedia.org/r/#/c/246818/
2015-10-15
- 22:28 legoktm: deploying https://gerrit.wikimedia.org/r/246785
- 20:33 SMalyshev: cherry-picked https://gerrit.wikimedia.org/r/#/c/240888/1 to deployment-puppetmaster.eqiad.wmflabs to test portal deployment
- 04:34 bd808: freed another 258M on deployment-bastion by forcing an early rotation of /var/log/account/pacct and deleting the archived copy
- 04:19 bd808: Freed 290M on deployment-bastion:/var by deleting old pacct files
2015-10-14
- 18:02 legoktm: deploying https://gerrit.wikimedia.org/r/246276
- 16:22 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/246256
- 13:02 hashar: Adjusting Jenkins SMTP server from polonium.wikimedia.org to mx1001.wikimedia.org
2015-10-13
- 03:21 bd808: Updated scap to 13c2af4 (Fix sync-dblist to go with dblist moves to folder)
2015-10-12
- 08:54 hashar: zuul-merger process leaked file descriptors and ended up unable to open any more files. Fixed by restarting the service on gallium. https://phabricator.wikimedia.org/T115243
- 08:44 hashar: Zuul CI in trouble. zuul-merger can't apply patches anymore https://phabricator.wikimedia.org/T115243
- 02:43 Krenair: fixed puppet on deployment-conf03 by running several manual apt-get commands
- 00:38 bd808: fixed puppet on deployment-restbase01 by running several manual apt-get and dpkg commands; had to downgrade zsh from 5.1.1-1 (unstable) to 5.0.7-5 (stable)
- 00:27 bd808: puppet failing on deployment-restbase01 due to corrupt apt config state
2015-10-11
- 14:57 legoktm: deploying https://gerrit.wikimedia.org/r/244192
2015-10-09
- 21:57 greg-g: 21:51 < ori> !log deployment-prep Accidentally clobbered /etc/init.d/mysql on deployment-db1, causing deployment-prep failures. Restored now
- 21:55 twentyafterfour: deployment-db1 has a running mysqld again, shinken reports recovery.
- 21:41 twentyafterfour: ori broke mariadb on deployment-db1 :-P
- 20:29 hashar: beta cluster parsoid now runs from /parsoid.git && npm install (was from /deploy.git previously). In case of troubles poke subbu and see revert instructions on https://phabricator.wikimedia.org/T92871
- 20:16 hashar: Parsoid on beta is broken. Busy installing npm dependencies
- 20:09 hashar: switching Parsoid on beta to install dependencies with npm (instead of /deploy) https://phabricator.wikimedia.org/T92871 for subbu
- 14:54 hashar: added Geodata as a dependency to the wikibase jobs ( https://gerrit.wikimedia.org/r/#/c/244489/ )
2015-10-07
- 15:49 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/241984 on deployment-puppetmaster and forced puppet run on deployment-logstash2
- 15:48 bd808: Made [LOCAL HACK] commit on deployment-puppetmaster for l10nupdate cache location change
- 15:45 bd808: Uncommitted local change to /var/lib/git/operations/puppet/modules/scap/files/l10nupdate-1 on deployment-puppetmaster. Looks like something Krenair was working on.
- 13:57 hashar: fixing up apt on deployment-cache-parsoid04 as well as salt install https://phabricator.wikimedia.org/T114755
- 13:52 hashar: Upgrading packages on deployment-cache-{text04,mobile04,upload04} , downgrading salt-common and salt-minion in the process ( https://phabricator.wikimedia.org/T114755 )
- 13:43 hashar: Cleaning up http://debian.saltstack.com/debian/ jessie-saltstack/main from some beta-cluster instances | https://phabricator.wikimedia.org/T114755
- 10:09 hashar: deleted angry-caching-proxy.integration.eqiad.wmflabs | was an experiment no more use for it
- 00:16 Krinkle: Restored normal mysql replication on deployment-beta
2015-10-06
- 23:39 Krinkle: Messing with mysql slave replication on deployment-db2 to investigate CentralAuth bug with AaronSchulz
- 13:14 hashar: mass apt-get upgrade on beta cluster ( salt '*' pkg.upgrade )
- 13:09 hashar: Upgrades are now done on a daily basis thanks to unattended upgrade ( https://gerrit.wikimedia.org/r/243925 )
- 13:08 hashar: Mass dist-upgrade of all jenkins slaves
- 04:04 legoktm: deploying https://gerrit.wikimedia.org/r/243828
2015-10-05
- 22:20 Krenair: deployment-prep deleted deployment-bastion:/var/log/account files to free up some space on /var
- 22:10 Krenair: deployment-prep Renamed deployment-bastion:/srv/l10nupdate/mediawiki/extensions/BlueSpiceExtensions to OldBlueSpiceExtensions, was not in a clean state for l10nupdate
- 22:00 Krenair: deployment-prep Doing a manual run of l10nupdate
- 21:58 Krenair: deployment-prep Fixed l10nupdate - I broke it with https://gerrit.wikimedia.org/r/#/c/228126/ - created https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep/host/deployment-bastion
- 20:25 hashar: Image ci-jessie-wikimedia-1444076296 in wmflabs-eqiad is ready. It provides bundler/ruby
- 20:18 hashar: nodepool image-update wmflabs-eqiad ci-jessie-wikimedia
- 15:28 hashar: integration-slave-jessie-1001 ssh reachable
- 15:28 hashar: integration-slave-jessie-1001 pass puppet again. Caused by a conflict with the 'zip' package, solved by cherry picking https://gerrit.wikimedia.org/r/#/c/243674/
- 15:21 jzerebecki: reloading zuul for d944bbb..1272657
- 14:48 hashar: integration-slave-jessie-1001 fails ssh auth with: error: AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup returned status 1
- 14:45 hashar: rebooting integration-slave-jessie-1001.integration.eqiad.wmflabs , does not reply to ssh
- 14:34 hashar: nodepool image-update wmflabs-eqiad ci-jessie-wikimedia
- 13:45 hashar: lets funk it https://soundcloud.com/professorkliq/wire-flashing-lights?in=professorkliq/sets/wire-and-flashing-lights-ep
- 13:40 hashar: stopping puppetmaster on integration. Out of memory
2015-10-04
- 21:39 Krenair: Changed MW private DB settings in beta to set $wgDBadmin* properly so the `sql` command works again
2015-10-03
- 18:29 jzerebecki: reloading zuul for 4aa5d22..d609dad
- 01:05 Krinkle: Clean up spurious remotes/origin/review/markahershberger/207722 branch in mediawiki/core
2015-10-02
- 08:45 hashar: restarting Nodepool to take into account changes made to the logging configuration https://gerrit.wikimedia.org/r/#/c/240986/
- 08:41 hashar: integration-dev : rm -fR /etc/ferm/ (ferm firewall system is not installed)
2015-10-01
- 16:30 valhallasw`cloud: test again
- 16:30 valhallasw`cloud: hello qa-morebots
- 14:50 hashar: do I log
- 13:27 hashar: deleted from Jenkins all remaining pyflakes/pep8 jobs but operations-puppet-pep8
- 10:55 hashar: Enabled color for wmf-insecte ( https://phabricator.wikimedia.org/T64573 )
2015-09-30
- 22:19 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/241984 to deployment-puppetmaster
- 22:18 bd808: Fixed broken puppet configuration on deployment-logstash2
- 21:06 hashar: We went from 20 to 28 executors for Trusty based jobs or a 40% increase! Some stats at https://integration.wikimedia.org/ci/label/UbuntuTrusty/load-statistics?type=min
- 19:18 hashar: Pooling back Jenkins trusty slaves 1014 and 1017. They were failing qunit for some reason but I can no longer reproduce https://phabricator.wikimedia.org/T113489
2015-09-29
- 22:36 Krinkle: Added support for "Install MediaWiki" and "PHPUnit" to the Jenkins section detection
- 20:33 greg-g: 20:05 < ottomata> !beta eventlogging fixed, now using etcd for shared ip_hash token
- 19:46 ottomata: eventlogging in beta is down, trying to set up etcd for shared ip hash tokens, on it...
- 18:11 jzerebecki: reloading zuul for 1cb1c8c..6c9c684
2015-09-28
- 23:37 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/241960
- 22:05 hashar: hashar@integration-slave-trusty-1016:~$ sudo rm /mnt/jenkins-workspace/workspace/mwext-qunit/src/vendor/.git/config.lock
- 20:08 hashar: Blessed EBernhardson with beta cluster project admin rights
- 12:35 hashar: Cleaning old references in Zuul merger git repositories. On gallium as user zuul: find /srv/ssd/zuul/git/ -type d -name .git -print -exec /home/hashar/zuul-clear-refs.py --until 16 {} \;
2015-09-27
- 07:22 legoktm: deploying https://gerrit.wikimedia.org/r/241500
2015-09-26
- 22:26 legoktm: deploying https://gerrit.wikimedia.org/r/241312
- 19:04 hashar: restarting Jenkins. Just in case :-D
- 09:14 hashar: Nodepool / tox jobs seems to have survived the european night \O/
- 09:01 hashar: cleaning dupe workspaces: salt '*slave-trusty*' cmd.run 'rm -fR /mnt/jenkins-workspace/workspace/*@?'
- 09:01 hashar: cleaning tox*-trusty jobs from Trusty workspaces: salt '*slave-trusty*' cmd.run 'rm -fR /mnt/jenkins-workspace/workspace/tox*trusty*'
- 09:00 hashar: Disk space issue on trusty CI slaves, probably due to the mediawiki composer jobs recently introduced.
2015-09-25
- 17:18 legoktm: deploying https://gerrit.wikimedia.org/r/241086
- 16:41 legoktm: deploying https://gerrit.wikimedia.org/r/241071
- 15:00 hashar: deleted Jenkins jobs *tox*-trusty ( https://gerrit.wikimedia.org/r/#/c/241051/ )
- 14:17 hashar: nodepool: regenerated new image ci-jessie-wikimedia-1443190440 to add libssl-dev ( https://gerrit.wikimedia.org/r/#/c/241037/ )
- 13:30 hashar: stopped nodepool. Instances can't be deleted on labs due to an ongoing issue
- 09:48 hashar_: Updated qa-morebots SAL link from https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL to https://tools.wmflabs.org/sal/releng
- 09:45 hashar_: phasing out tox.*trusty jobs in favor of tox.*jessie jobs on Nodepool instances. python3.4 is available there, which was the main use for -trusty
- 08:26 hashar: Migrating tox.*jessie jobs to Nodepool instances | https://gerrit.wikimedia.org/r/#/c/240705/
- 02:31 legoktm: deploying https://gerrit.wikimedia.org/r/240954
- 00:59 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/240938
- 00:01 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/240934
2015-09-24
- 19:39 hashar: nodepool new snapshot image ci-jessie-wikimedia-1443123518
- 19:37 hashar: nodepool: openstack image set --name ci-jessie-wikimedia_old_20150924 ci-jessie-wikimedia
- 16:12 hashar: openstack image delete ci-jessie-wikimedia && openstack image set --name ci-jessie-wikimedia ci-jessie-wikimedia_old_20150924
- 16:10 hashar: nodepool image ci-jessie-wikimedia-1443109895 fails to acquire network. Reverting to previous image.
- 15:51 hashar: nodepool new snapshot image is ci-jessie-wikimedia-1443109895
- 15:50 hashar: pushing new reference image to nodepool to include etcd package. Followed doc from https://wikitech.wikimedia.org/wiki/Nodepool#Publish_on_labs
- 06:09 greg-g: deployment-bastion is getting close to having a filled up /var again: https://phabricator.wikimedia.org/T91354
2015-09-23
- 23:26 thcipriani: Reloading Zuul to deploy Iaa33b2002e898d68fe78f236e225c6e4bac61679
- 22:38 thcipriani: Reloading Zuul to deploy I5db99dee0318b5854a05ea68870c57fb658f42e3
- 22:06 ostriches: added tyler to gerrit admin group
- 22:05 marxarelli: Reloading zuul to deploy I55c30dd85444f84d21972b39a6237bab86518f13
- 22:05 ostriches: added dan to gerrit admin group
- 21:58 hashar: Added thcipriani to the Gerrit 'integration' group
- 21:58 hashar: Added marxarelli and twentyafterfour (wm.org emails) to the Gerrit 'integration' group
- 21:43 marxarelli: Reloading zuul to deploy I02685b2f0f4215904e4a364e2f8597d1ab47d657
- 20:58 marxarelli: Reloading zuul to deploy I4925da63b0aae22b3e809522466623d7958bfa24
- 20:23 hashar: mwext-mw-selenium fails due to missing avconv https://phabricator.wikimedia.org/T113520
- 18:36 hashar: reapplying labels contintLabsSlave UbuntuTrusty phpflavor-hhvm on https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1014/ . https://phabricator.wikimedia.org/T113489
- 16:00 hashar: unpooling integration-slave-trusty-1014 and integration-slave-trusty-1017 . They fail both https://integration.wikimedia.org/ci/job/mediawiki-core-qunit/ and https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/
- 13:52 hashar: dist upgrading integration-slave-trusty-1014 and integration-slave-trusty-1017 to make sure we get the latest versions from apt.wm.org
- 12:31 hashar: pooling integration-slave-trusty-1014 and integration-slave-trusty-1017 in Jenkins
- 11:16 hashar: applying contint::slave::labs to integration-slave-trusty-1014 and integration-slave-trusty-1017
- 11:02 hashar: restarted integration puppetmaster. Newly created instances fail with: Error: Could not request certificate: Connection refused - connect(2)
- 05:32 Krinkle: beta-scap-eqiad is back up
- 05:26 Krinkle: Graceful restarting of Jenkins
- 05:20 Krinkle: beta-scap-eqiad and others stuck in Jenkins queue for over 6 hours. deployment-bastion.eqiad is idle but not accepting jobs.
2015-09-22
- 17:05 legoktm: deploying https://gerrit.wikimedia.org/r/238985
- 16:48 hashar: restarting nodepool to take into account logging config change (dropping apscheduler bucket)
2015-09-21
- 19:30 hashar: lets get rid of PhantomJS https://phabricator.wikimedia.org/T113279 :D
- 18:21 thcipriani: deployed up-to-date scap in deployment-prep
- 12:37 jzerebecki: gallium:~$ sudo -u jenkins-slave rm -rf /srv/org/wikimedia/integration/cover/\$DOC_PROJECT/
2015-09-18
- 22:34 greg-g: we missed log entries on the wikitech log from the 15th, see https://tools.wmflabs.org/sal/releng?p=0&q=&d=2015-09-18
- 22:33 greg-g: test test
- 22:18 marxarelli: stopping eventlogging services on deployment-eventlogging02 and truncating logs
- 22:09 marxarelli: deployment-eventlogging02:/var full due to massive eventlogging_processor-server-side-0.log.1 ("(MainThread) Unable to process" errors)
- 17:26 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/239395
2015-09-17
- 22:40 legoktm: deploying https://gerrit.wikimedia.org/r/239281
- 18:16 legoktm: deploying https://gerrit.wikimedia.org/r/239146
- 16:09 hashar: rebase integration puppetmaster
2015-09-16
- 22:53 legoktm: deploying https://gerrit.wikimedia.org/r/238847
- 14:50 hashar: Move integration-jjb-config-diff Jenkins job to Nodepool instances. https://gerrit.wikimedia.org/r/#/c/238752/ and https://phabricator.wikimedia.org/T112750
2015-09-15
- 18:09 legoktm: deploying https://gerrit.wikimedia.org/r/238506
- 13:57 hashar: Force pushed https://github.com/legoktm/tools-ci to gerrit integration/config.git under branch attic/legoktm/tools-ci https://phabricator.wikimedia.org/T111758
2015-09-14
- 19:08 jzerebecki: reloading zuul for bd97ce4..4abd32e
- 18:25 hashar: deleted integration-zuul-server . Was used to play-test the zuul .deb package. Not needed anymore.
- 18:22 jzerebecki: reloading zuul for 72d41fb..bd97ce4
- 16:45 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/238176
- 08:42 hashar: rebased integration puppetmaster 61870d1..8cf247f
2015-09-12
- 19:48 jzerebecki: reloading zuul for c7f17b5..a813107
2015-09-11
- 21:44 hashar: self note: test unifier should be named "Pinailleur"
- 21:03 hashar: Nodepool finally built a snapshot instance and booted another instance out of it. Please welcome ci-jessie-wikimedia-39, the first disposable slave with production status. https://integration.wikimedia.org/ci/computer/ci-jessie-wikimedia-39/
2015-09-10
- 22:28 Krenair: cherry-picked https://gerrit.wikimedia.org/r/#/c/237523/ on deployment-puppetmaster
- 09:52 hashar: Gerrit made mediawiki/extensions/skins readonly per https://phabricator.wikimedia.org/T62927
- 09:13 hashar: triaging https://phabricator.wikimedia.org/tag/wikimedia-git-or-gerrit/board/
- 07:39 hashar: Repacked zuul-merger git repos on gallium in /srv/ssd/zuul/git
2015-09-09
- 18:14 bd808: restarted nutcracker on mediawiki02
- 14:29 aude: stashed local changes to dumpInterwiki in WikimediaMaintenance on deployment-bastion
- 08:31 hashar: removed cherry pick https://gerrit.wikimedia.org/r/#/c/233413/ "elasticsearch: ensure /var/run subdir exists" from integration puppetmaster https://phabricator.wikimedia.org/T109497
- 08:28 hashar: rebooting integration-slave-precise-1014 to see whether elasticsearch runs properly https://phabricator.wikimedia.org/T109497
- 08:27 hashar: upgraded elastic search from 1.6.0 to 1.7.1 on Jenkins Precise slaves https://phabricator.wikimedia.org/T109497
- 08:20 hashar: attempting to git deploy integration/config from tin
2015-09-08
- 21:11 legoktm: deploying https://gerrit.wikimedia.org/r/236943
- 19:28 bd808: removed cherry-pick of https://gerrit.wikimedia.org/r/#/c/197655/ due to rebase conflict
2015-09-07
- 17:01 jzerebecki: reloading zuul for 1b53ca9..4d9d237
- 08:13 hashar: Jenkins upgraded to latest LTS ( https://phabricator.wikimedia.org/T111326 )
- 07:32 hashar: deleted instance nodepool-t105406 ( https://phabricator.wikimedia.org/T105406 is resolved )
2015-09-04
- 22:06 jzerebecki: reloading zuul for 1219cc3..1b53ca9
- 21:16 jzerebecki: reloading zuul for 3740b19..1219cc3
- 14:45 hashar: Jenkins upgrade on Monday Sep. 7th at 8:00am UTC ( https://phabricator.wikimedia.org/T111326#1606556 )
- 12:57 hashar: Compiling Textual v5.2.0 to support latest bd808 theme ( https://github.com/bd808/Textual-Theme-bd808.git )
2015-09-03
- 22:59 marxarelli: Rebooting integration-dev.eqiad.wmflabs
- 19:10 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/235783
- 18:45 jzerebecki: reloading zuul for 451f8aa..fd482b7
- 15:17 hashar: stopping nodepool on labnodepool1001.eqiad.wmnet not ready yet
- 15:07 hashar: deleting /mnt/jenkins-workspace/workspace/mediawiki-extensions-qunit on integration-slave-trusty-1016 ( https://phabricator.wikimedia.org/T111369 )
- 15:03 hashar: rebased operations/puppet on integration puppetmaster
- 14:55 hashar: restarting puppetmaster on integration-puppetmaster
- 00:51 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/235656
- 00:08 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/235641
2015-09-02
- 18:21 jzerebecki: reloading zuul for e0e78c7..3a1658e
- 18:15 jzerebecki: reloaded zuul for 327dcb6..e0e78c7
- 16:34 hashar_: nodepool .deb package bumped to 0.1.1 . Pending upload to apt.wikimedia.org
- 16:34 hashar_: nodepool database backend has been setup by ops (thank you Jaime)
- 16:33 hashar_: ping
- 15:09 hashar: bumping operations/debs/nodepool upstream branch from 0.1.0 to 0.1.1 ( 462cbe9..3c635ec )
2015-09-01
- 21:08 hashar: marxarelli properly built a CI image using diskimage-builder \O/
- 20:01 dapatrick: Starting scans/spidering on deployment-mediawiki03
- 01:12 James_F: Re-restarting grrrit-wm rolled back to 2f5de55ff75c3c268decfda7442dcdd62df0a42d
- 00:54 James_F: Restarted grrrit-wm with I7eb67e3482 as well as I48ed549dc2b.
- 00:33 James_F: Didn't work, rolled back grrrit-wm to 2f5de55ff75c3c268decfda7442dcdd62df0a42d.
- 00:29 James_F: Restarted grrrit-wm for I48ed549dc2b.
2015-08-31
- 15:13 jzerebecki: did https://phabricator.wikimedia.org/T109007#1537572
2015-08-30
- 20:53 hashar: beta-scap-eqiad failing due to mwdeploy not being able to ssh to other hosts. Attempted to add the ssh key again following https://phabricator.wikimedia.org/T109007#1537572 which fixed it
2015-08-29
- 01:01 bd808: Deleted local mwdeploy user on deployment-tmh01 that was causing scap failures
- 00:21 bd808: stopping and starting jobrunner and jobchron on deployment-tmh01
2015-08-28
- 23:40 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/234699/
- 20:17 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/234599 to setup new tmh01 as scap target
- 20:15 bd808: restored 3 cherry picks that were lost when rebuilding the ops/puppet git repo
- 20:07 bd808: deployment-puppetmaster has only one cherry-pick; looks like maybe dcausse dropped the prior stack when working on Icc95ac8
- 18:17 bd808: Cleaned up some puppet groups for deployment-prep that no longer exist in ops/puppet
- 18:03 bd808: Building deployment-tmh01.deployment-prep.eqiad.wmflabs to replace deployment-videoscaler01
- 18:01 bd808: Nope, I deleted deployment-videoscaler01
- 18:01 bd808: Deleted deployment-urldownloader.deployment-prep.eqiad.wmflabs
- 16:53 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/234569
- 11:39 hashar: gallium: rm -fR /srv/org/wikimedia/integration/cover/mediawiki-core/master/php2 . This way https://integration.wikimedia.org/cover/mediawiki-core/ redirects to the coverage report (thanks Krinkle)
- 11:37 hashar: deleting https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-2 (same)
- 10:43 hashar: pooling back integration-slave-trusty-1016. Was once depooled for debugging purposes and repooled ( https://phabricator.wikimedia.org/T110054 ) but apparently a Jenkins restart did not pool it back again :/
- 00:56 thcipriani: sudo keyholder arm on deployment-bastion fixed beta-scap-eqiad
2015-08-27
- 20:37 marxarelli: Reloading Zuul to deploy If273fceb4134e5f3e38db8361f1a355f9fcfee3a
- 12:52 hashar: cleaning up old workspaces from jobs that are now throttled to one per node (ex: sudo salt '*slave*' cmd.run 'rm -fR /mnt/jenkins-workspace/workspace/mediawiki*@?' )
- 08:17 moritzm: enabled base::firewall on deployment-mediawiki0[1-3]
- 02:54 matt_flaschen: Manual UPDATE for enwiki DB on Beta Cluster to work around earlier ref_src_wiki update.php problem.
- 02:42 matt_flaschen: Manually fixed index on flow_ext_ref for cawiki, en_rtlwiki, enwiki, hewiki, metawiki, and testwiki on Beta Cluster due to https://gerrit.wikimedia.org/r/#/c/234162/
- 00:14 marxarelli: Reloading Zuul to deploy Iaab45d659df4b817a0dd27a7ccde17d71f630aaa
2015-08-26
- 23:39 bd808: Updated scap to a7ec319 (Use configured bin_dir to find refreshCdbJsonFiles)
- 23:32 Krenair: Re-armed keyholder on deployment-bastion
- 21:51 matt_flaschen: To fix https://gerrit.wikimedia.org/r/#/c/233952/1 on Beta, manually ran: while read line; do echo "Starting $line\n"; echo 'ALTER TABLE flow_wiki_ref DROP COLUMN ref_src_wiki;' | sql --write "$line"; echo "Finished $line\n"; done < /srv/mediawiki/all-labs.dblist
- 16:39 bd808: marked https://integration.wikimedia.org/ci/computer/integration-slave-precise-1014/ offline for git clone problems
- 16:16 marxarelli: stopping udp2log on deployment-flourine
- 16:14 marxarelli: udp2log is mostly "egrep: writing output: Broken pipe"
- 16:10 marxarelli: disk space at 97% on deployment-flourine, mainly due to 15G /var/log/udp2log/udp2log.log
- 16:01 bd808: sudo rm -rf integration-slave-precise-1014:/mnt/jenkins-workspace/workspace/mediawiki-core-phplint/.git
- 09:57 hashar: Bumping our JJB mirror a3aef64..f01628c Required for the Android Emulator plugin support ( https://phabricator.wikimedia.org/T110307 )
- 07:39 hashar_: puppet is back in action on beta cluster
- 07:38 hashar_: enabling puppet agent on deployment-puppetmaster. It was disabled with no reason given
- 07:24 hashar_: reset beta cluster puppet master to origin/production . We have lost any cherry pick that might have existed
- 07:16 hashar_: started puppetmaster on deployment-puppetmaster
- 07:11 hashar: puppet fails on most beta cluster instances :-(
2015-08-25
- 23:48 thcipriani: stopping puppetmaster and disabling puppet runs on deployment-puppetmaster until we get a chance to diagnose/rebuild (tomorrow)
- 23:47 thcipriani: deployment-puppetmaster showing signs of a corrupt disk "error: object file .git/objects/cc/026ba0cdc872490ef6a616b2bac4bb829639cd is empty" shutting it off for now.
- 23:43 thcipriani: reboot deployment-puppetmaster unreachable from other vms (labvirt1007 thing, probably)
- 15:10 hashar: unpooling and deleting integration-slave-trusty-1014 integration-slave-trusty-1017 and integration-slave-precise-1014 . They are most probably corrupted ( https://phabricator.wikimedia.org/T110052#1571184 )
- 15:04 hashar: soft rebooting integration-slave-trusty-1014 (ssh dead)
- 15:02 hashar: trashing workspaces on integration-slave-trusty-1014 and integration-slave-trusty-1017 ( https://phabricator.wikimedia.org/T110052#1571184 )
- 14:55 hashar: dropping all workspaces from integration-slave-precise-1014 . Some .git repos in workspaces might be corrupted
- 11:47 hashar: Upgraded a bunch of Jenkins plugins
- 10:02 hashar: deleted job https://integration.wikimedia.org/ci/job/browsertests-UploadWizard-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-8-sauce/ Was disabled and no longer in our JJB config
- 08:52 hashar: pooling back integration-slave-precise-1014 , integration-slave-trusty-1014 and integration-slave-trusty-1017 . labvirt1007 was out of disk space ( https://phabricator.wikimedia.org/T110052 )
- 00:58 bd808: Updated tin:/srv/deployment/integration/slave-scripts to b287e93 (Revert "Run mw-install-mysql.sh with statement tracing") and synced via trebuchet
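A sketch of the 15:02 workspace trashing, assuming the workspace path used in other entries of this log and that the slaves were depooled in Jenkins first; the salt target glob is an assumption:
 # run from integration-saltmaster; wipes every job workspace on the two corrupted trusty slaves
 sudo salt 'integration-slave-trusty-101[47]*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/*'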
2015-08-24
- 23:36 bd808: Updated tin:/srv/deployment/integration/slave-scripts to a2cdf4f (Run mw-install-mysql.sh with statement tracing) and synced via trebuchet
- 21:51 tgr: migrating OAuth to metawiki on beta - T108648
- 20:51 bd808: Restarted puppetmaster on integration-puppetmaster
- 20:47 bd808: Ran sudo apt-get install hhvm hhvm-dev on integration-slave-trusty-1016 to get hhvm 3.6.5+dfsg1-1+wm3
- 18:34 matt_flaschen: Completed on Beta Cluster: foreachwiki populateContentModel.php --ns=all --table=page | tee populateContentModel_page_table_all_namespaces_all_wikis_2015-08-24.log
- 18:24 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/233480
- 18:23 matt_flaschen: Running on Beta Cluster: foreachwiki populateContentModel.php --ns=all --table=page | tee populateContentModel_page_table_all_namespaces_all_wikis_2015-08-24.log
- 15:23 hashar: hard rebooting integration-slave-trusty-1017
- 15:19 hashar: hard rebooting integration-slave-trusty-1014
- 15:16 hashar: apt-get upgrade all Trusty slaves
- 15:15 hashar: apt-get upgrade on gallium
- 15:12 hashar: integration-slave-precise-1011 and integration-slave-precise-1014 went offline due to elasticsearch not starting properly. https://phabricator.wikimedia.org/T109497
- 15:06 hashar: upgrading packages on precise slaves
- 09:17 hashar: recreating Jenkins job mwext-PronunciationRecording-jslint (still triggered by Zuul somehow)
2015-08-23
- 00:37 legoktm: deploying https://gerrit.wikimedia.org/r/233178
2015-08-22
- 00:08 marxarelli: Reloading Zuul to deploy I86fcd065f7a77e293a8882fa3ad2c20fffe4b092
2015-08-21
- 18:31 marxarelli: Deployed I4975a99f1c7d22c2f99f4557de4d7a081a5300f4 to integration slaves
- 17:59 marxarelli: Updated 29 Jenkins jobs with minor changes to bundler execution and MW LocalSettings.php (closing php tag) (I2f32701dd9478857ed5a2fb1bfbe13e134d7b27c)
- 17:55 marxarelli: Running `jenkins-jobs update` to deploy I2f32701dd9478857ed5a2fb1bfbe13e134d7b27c
2015-08-19
- 16:35 marxarelli: Verified that elasticsearch and mysql are now running after the reinstall and a manual puppet run
- 16:32 marxarelli: Running apt-get install --reinstall elasticsearch to re-create missing /var/run/elasticsearch (and possibly others) directory
- 16:29 marxarelli: Investigating stopped mysql on integration-slave-precise-1014
- 16:21 marxarelli: Reloading Zuul to deploy I6815cd66169ee8f6fbb5ea394e3a10ce6b6e7609
- 16:17 marxarelli: Reloading Zuul to deploy I48ab39e330ebc71266b72cae8449cc2f6da495fe
- 14:51 jzerebecki: reload zuul for 700f380..0384ff5
2015-08-18
- 18:33 jzerebecki: (mysql wasn't started as puppet never got to that point)
- 18:32 jzerebecki: /etc/init.d/elasticsearch start was looping endlessly because /var/run/elasticsearch/ did not exist even though it is part of the debian package elasticsearch which was installed. fixed the issue on this instance by: integration-slave-precise-1013:~# apt-get install --reinstall elasticsearch
- 16:32 jzerebecki: offlined integration-slave-precise-1013 : Fails to connect to mysql. /etc/init.d/mysql start fails.
- 16:00 jzerebecki: reloading zuul for 6486889..700f380
2015-08-17
- 22:18 legoktm: running schema change for gerrit:202344 on beta
- 19:19 legoktm: freeing up disk space on 1012
- 19:15 legoktm: [11:45:39] <legoktm> !log freeing up disk space on 1017
- 19:15 legoktm: restarted qa-morebots
2015-08-15
- 00:38 marxarelli: Reloading Zuul to deploy I9ef82b6d3ea7d83de8e4a67c9715ccf335c00b88
2015-08-14
- 18:52 thcipriani: disconnect/reconnect for deployment-bastion jenkins slave; leftover stalled jobs went away
- 18:43 greg-g: killed some of the queued jobs (beta-scap etc) via clicking on the red X
- 18:42 thcipriani: disconnected and reconnected deployment-bastion jenkins slave
- 16:14 ostriches: fixed deployment-cache-upload04
- 05:21 bd808: varnish-fe on deployment-cache-upload04.deployment-prep.eqiad.wmflabs not starting because nginx isn't starting because ssl cert is missing. No port 80 listener to serve images
2015-08-13
- 23:27 legoktm: deploying https://gerrit.wikimedia.org/r/230256
- 21:47 bd808: triggered beta-scap-eqiad jenkins job
- 21:47 bd808: Primed keyholder agent via `sudo -u keyholder env SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa`
- 20:21 cscott: deployed I2e792ca14a35a79e7846b0ed03a36adf55fe338f to zuul (and reloaded)
- 19:28 cscott: deployed 0c0f6e936bacfffde432ecf1e53f73f037ca6c42 to zuul (and jenkins)
- 17:43 marxarelli: Reloading Zuul to deploy I0159f6dba5e187bfc5fe2b680408f35aca6ca2fe
2015-08-12
- 22:16 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/231179/ (Disable authentication for Kibana)
- 21:01 marxarelli: Reloading Zuul to deploy I11bcac5b35a8f36cf3eb43caf7b792de6105a501 and I4bec54d445cb41cba3d6f5d9bd74ffe823b2c7ad
- 20:46 urandom: restarted restbase on deployment-restbase01 (dead)
- 18:58 bd808: Applied https://gerrit.wikimedia.org/r/#/c/231049/ via cherry-pick
- 16:50 bd808: Fixed puppet merge conflict
2015-08-11
- 21:54 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/230922/ for testing
- 11:36 hashar: Fixed puppet on integration-slave-trusty-1017. The puppet.conf server name was just 'integration-puppetmaster.'; appended the subdomain/domain to it.
- 02:02 legoktm: deleted beta-recompile-math-texvc-eqiad from jenkins
- 02:01 legoktm: deploying https://gerrit.wikimedia.org/r/229376
- 01:27 legoktm: deploying https://gerrit.wikimedia.org/r/230619
- 00:45 bd808: Logstash properly creating new indices again and logs are being collected
- 00:42 bd808: Fixed default mapping for logstash indices to have 0 replicas
- 00:40 bd808: Crap! logstash errors are my fault. I updated the default index mapping and neglected to correct the replica count. Missing all data from 2015-08-08 to now
- 00:36 bd808: Elasticsearch index for logstash-2015.08.10 missing/corrupt
- 00:34 bd808: Restarted elasticsearch on deployment-logstash2
- 00:32 bd808: Started logstash on deployment-logstash2; process had died from OOM
- 00:02 marxarelli: clearing disk space on integration-slave-trusty-1011, integration-slave-trusty-1012, integration-slave-trusty-1013
2015-08-10
- 23:57 marxarelli: clearing disk space on integration-slave-trusty-1016 with `find /mnt/jenkins-workspace/workspace -mindepth 1 -maxdepth 1 -type d -mtime +15 -exec rm -rf {} \;`
- 23:57 marxarelli: clearing disk space on integration-slave-trusty-1014 with `find /mnt/jenkins-workspace/workspace -mindepth 1 -maxdepth 1 -type d -mtime +15 -exec rm -rf {} \;`
- 23:53 marxarelli: clearing disk space on integration-slave-trusty-1014
- 23:04 bd808: updated scap to a404a39: Build wikiversions.php in addition to wikiversions.cdb
- 22:51 bd808: testing https://gerrit.wikimedia.org/r/#/c/230679 via cherry-pick to /srv/deployment/scap/scap
- 18:39 legoktm: deploying https://gerrit.wikimedia.org/r/230591
2015-08-08
- 00:57 legoktm: deploying https://gerrit.wikimedia.org/r/230255
2015-08-07
- 18:50 bd808: updated scap: removed cherry-pick of I3d2b4e7 and updated to latest HEAD
- 17:55 bd808: logstash-beta.wmflabs.org working again; broken since Ib10deb5 was merged
- 17:52 bd808: Removed stale cherry-pick of Ib10deb5b4e42d440c5deff0897e714174f3e38fe that was breaking puppet rebase
2015-08-06
- 22:53 jzerebecki: reloading zuul for e17f502..af1f5d1
- 17:01 bd808: fixed rebase conflict with logstash cherry-picks
2015-08-05
- 20:43 thcipriani: marking integration-slave-precise-1012 back online, elasticsearch hung up on starting because /var/run/elasticsearch wasn't a directory :(
- 20:00 thcipriani: restarting integration puppetmaster; memory usage way high
- 00:12 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/229287
2015-08-04
- 18:02 bd808: Upgraded elasticsearch to 1.7.1 on deployment-logstash2
- 18:01 bd808: Upgraded logstash to 1.5.3 on deployment-logstash2
- 17:57 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/227175/21
- 17:49 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/227175/
- 17:48 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/226991/
- 00:30 marxarelli: Reloading Zuul to deploy I18fa56da9a27a4efeb061ceba773c8b50bc4a8f4
2015-08-02
- 04:30 legoktm: deploying https://gerrit.wikimedia.org/r/228606
- 03:00 legoktm: deploying https://gerrit.wikimedia.org/r/228601
- 02:39 legoktm: deploying https://gerrit.wikimedia.org/r/228600
- 01:58 legoktm: deploying https://gerrit.wikimedia.org/r/228596
- 01:53 legoktm: deploying https://gerrit.wikimedia.org/r/222760
- 01:28 legoktm: deploying https://gerrit.wikimedia.org/r/226753
- 01:20 legoktm: deploying https://gerrit.wikimedia.org/r/228507
- 00:36 legoktm: deploying https://gerrit.wikimedia.org/r/228583
2015-08-01
- 23:27 legoktm: deploying https://gerrit.wikimedia.org/r/228492
2015-07-31
- 01:32 jzerebecki: reload zuul for 83a30e5..f2d2517
2015-07-30
- 23:50 bd808: upgraded nutcracker to 0.4.1-1+wm2~precise1 on deployment-bastion
- 21:48 legoktm: deploying https://gerrit.wikimedia.org/r/228155
- 17:09 ostriches: cleaned up /var space on deployment-videoscaler01
- 09:17 hashar: apt-get upgrade on all Trusty slaves
- 09:13 hashar_: integration: upgrading Zuul package on Precise/Trusty instances ( https://phabricator.wikimedia.org/T106499 )
2015-07-29
- 23:55 marxarelli: clearing disk space on integration-slave-trusty-1012 with `find /mnt/jenkins-workspace/workspace -mindepth 1 -maxdepth 1 -type d -mtime +15 -exec rm -rf {} \;`
- 18:15 bd808: upgraded nutcracker on deployment-jobrunner01
- 18:14 bd808: upgraded nutcracker on deployment-videoscaler01
- 18:08 bd808: rm deployment-fluorine:/a/mw-log/archive/*-201506*
- 18:08 bd808: rm deployment-fluorine:/a/mw-log/archive/*-201505*
- 18:02 bd808: rm deployment-videoscaler01:/var/log/atop.log.?*
- 16:49 thcipriani: lots of "Error connecting to 10.68.16.193: Can't connect to MySQL server on '10.68.16.193'" deployment-db1 seems up and functional :(
- 16:27 thcipriani: deployment-prep login timeouts, tried restarting apache, hhvm, and nutcracker on mediawiki{01..03} (restart sketch at the end of this day's entries)
- 14:38 bblack: cherry-picked https://gerrit.wikimedia.org/r/#/c/215624 (updated to PS8) into deployment-puppetmaster ops/puppet
- 14:28 bblack: cherry-picked https://gerrit.wikimedia.org/r/#/c/215624 into deployment-puppetmaster ops/puppet
- 12:38 hashar_: salt minions are back somehow
- 12:36 hashar_: salt on deployment-salt is missing most of the instances :-(((
- 03:00 ostriches: deployment-bastion: please please someone rebuild me to not have a stupid 2G /var partition
- 03:00 ostriches: deployment-bastion: purged a bunch of atop and pacct logs, and apt cache...clogging up /var again.
- 02:34 legoktm: deploying https://gerrit.wikimedia.org/r/227640
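A sketch of the 16:27 restarts across the beta app servers, borrowing the '*-mediawiki*' salt targeting used in other entries of this log; service names are the ones named in that entry:
 # run from deployment-salt (assumed); restarts the web stack on each mediawiki instance
 salt '*-mediawiki*' cmd.run 'service apache2 restart; service hhvm restart; service nutcracker restart'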
2015-07-28
- 23:43 marxarelli: running `jenkins-jobs update config/ 'mwext-mw-selenium'` to deploy I7afa07e9f559bffeeebaf7454cc6b39a37e04063
- 21:05 bd808: upgraded nutcracker on mediawiki03
- 21:04 bd808: upgraded nutcracker on mediawiki02
- 21:01 bd808: upgraded nutcracker on mediawiki01
- 19:49 jzerebecki: reloading zuul b1b2cab..b02830e
- 11:18 hashar: Assigning label "BetaClusterBastion" to https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/
- 11:12 hashar: Jenkins jobs for the beta cluster ended up stuck again. Found a workaround by removing the Jenkins label on deployment-bastion node and reinstating it. Seems to get rid of the deadlock ( ref: https://phabricator.wikimedia.org/T72597#1487801 )
- 09:50 hashar: deployment-apertium01 is back! The ferm rules were outdated / not maintained by puppet, dropped ferm entirely (removal sketch at the end of this day's entries).
- 09:40 hashar: rebooting deployment-apertium01 to ensure its ferm rules are properly loaded on boot ( https://phabricator.wikimedia.org/T106658 )
- 00:46 legoktm: deploying https://gerrit.wikimedia.org/r/227383
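A sketch of the 09:50 ferm removal on deployment-apertium01, assuming the stock Debian package and init script names:
 # stop the firewall and purge the package so the stale rules are not reloaded on boot
 service ferm stop
 apt-get purge -y ferm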
2015-07-27
- 23:04 marxarelli: running `jenkins-jobs update config/ 'browsertests-*'` to deploy I3c61ff4089791375e21aadfa045d503dfd73ca0e
- 13:26 hashar: Precise slaves had faulty elasticsearch: apt-get install --reinstall elasticsearch
- 13:21 hashar: puppet stalled on Precise Jenkins slaves :-(
- 08:52 hashar: upgrading packages on Precise slaves
- 08:49 hashar: rebooting all Trusty jenkins slaves
- 08:39 hashar: upgrading python-pip on Trusty from 1.5.4-1ubuntu1 to 1.5.4-1ubuntu3 . Fix up pip silently removing system packages ( https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=771794 )
- 08:12 hashar: On CI slaves, bumping HHVM from 3.6.1+dfsg1-1+wm3 to 3.6.5+dfsg1-1+wm1
- 08:11 hashar: apt-get upgrade Trusty Jenkins slaves
2015-07-24
- 17:35 marxarelli: updating integration slave scripts from integration-saltmaster to deploy I6906fadede546ce2205797da1c6b267aed586e17
- 17:17 marxarelli: running `jenkins-jobs update config/ 'mediawiki-selenium-integration' 'mwext-mw-selenium'` to deploy Ib289d784c7b3985bd4823d967fbc07d5759dc756
- 17:05 marxarelli: running `jenkins-jobs update config/ 'mediawiki-selenium-integration'` to deploy and test Ib289d784c7b3985bd4823d967fbc07d5759dc756
- 17:04 hashar: integration-saltmaster, in a screen : salt -b 1 '*slave*' cmd.run '/usr/local/sbin/puppet-run'|tee hashar-massrun.log
- 17:04 hashar: cancelled last command
- 17:03 hashar: integration-saltmaster : salt -b 1 '*slave*' cmd.run '/usr/local/sbin/puppet-run' & && disown && exit
- 16:55 hashar: Might have fixed the puppet/pip mess on CI slaves by creating a symlink from /usr/bin/pip to /usr/local/bin/pip ( https://gerrit.wikimedia.org/r/#/c/226729/1..2/modules/contint/manifests/packages/python.pp,unified )
- 16:36 hashar: puppet on Jenkins slaves might have some intermittent issues due to pip installation https://gerrit.wikimedia.org/r/226729
- 15:29 hashar: removing pip obsolete download-cache setting ( https://gerrit.wikimedia.org/r/#/c/226730/ )
- 15:27 hashar: upgrading pip to 7.1.0 via pypi ( https://gerrit.wikimedia.org/r/#/c/226729/ ). Revert plan is to un-cherry-pick the patch on the puppetmaster and run: pip uninstall pip
- 12:46 hashar: Jenkins: switching gearman plugin from our custom compiled 0.1.1-9-g08e9c42-change_192429_2 to upstream 0.1.2. They are actually the exact same versions.
- 08:40 hashar: upgrading zuul to zuul_2.0.0-327-g3ebedde-wmf3precise1 to fix a regression ( https://phabricator.wikimedia.org/T106531 )
- 08:39 hashar: upgrading zuul
2015-07-23
- 23:03 marxarelli: running `jenkins-jobs update config/ 'browsertests-*'` to deploy I2d0f83d0c6a406d46627578cb8db0706d1b8655d
- 16:38 marxarelli: Reloading Zuul to deploy I96b6218a208f133209452c71bcf01a1088305aea
- 15:39 urandom: applied wip logstash & cassandra changes (https://gerrit.wikimedia.org/r/#/c/226025/) to deployment-prep
- 13:24 hashar: apt-get upgrade integration-puppetmaster and rebooting it
- 13:23 hashar: integration puppetmaster in bad shape: Warning: Error 400 on SERVER: Cannot allocate memory - fork(2)
- 10:58 hashar: beta : salt '*' cmd.run 'rm /etc/apt/apt.conf.d/20auto-upgrades.ucf-dist'
- 10:52 hashar: Beta cluster puppetmaster is now deployment-puppetmaster.deployment-prep.eqiad.wmflabs . Migrated all instances (solves https://phabricator.wikimedia.org/T106649 )
- 10:30 hashar: regenerated puppet cert on deployment-salt , the old puppetmaster now a puppet client
- 10:23 hashar: running apt-get upgrade on deployment-parsoidcache02
- 09:32 hashar: puppet broken on deployment-fluorine : Error: Could not request certificate: Neither PUB key nor PRIV key:: header too long
- 08:39 hashar: Disabling puppet agent on ALL beta cluster instances
- 08:18 hashar: creating deployment-puppetmaster m1.medium :D
- 01:57 jzerebecki: reconnected slave and needed to kill a few pending beta jobs, works again
- 01:50 jzerebecki: trying https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update
- 01:09 legoktm: beta-mediawiki-config-update-eqiad jobs stuck
- 00:41 jzerebecki: clean up doc dir after job changes gallium:~$ sudo -iu jenkins-slave rm -r /srv/org/wikimedia/doc/MobileFrontend/master/{app-0c945a27f43452df695771ddb60b3d14.js,data-500abda2bcb0df13609e38707dfa7f4e.js,eg-iframe.html,extjs,favicon.ico,index.html,member-icons,output,resources,source,styles-3eba09980fa05ead185cb17d9c0deb0f.css}
- 00:14 jzerebecki: reloading zuul 369e6eb..73dc1f6 for https://gerrit.wikimedia.org/r/#/c/223527/
2015-07-22
- 10:24 hashar: Upgrading Zuul on Jenkins Precise slaves to zuul_2.0.0-327-g3ebedde-wmf2precise1_amd64.deb
- 09:32 hashar_: Reupgrading Zuul to zuul_2.0.0-327-g3ebedde-wmf2precise1_amd64.deb with an approval fix ( https://gerrit.wikimedia.org/r/#/c/226274/ ) for gate-and-submit no longer matching Code-Review+2 events ( https://phabricator.wikimedia.org/T106436 )
2015-07-21
- 22:54 greg-g: 22:50 < chasemp> "then git reset --hard 9588d0a6844fc9cc68372f4bf3e1eda3cffc8138 in /etc/zuul/wikimedia"
- 22:53 greg-g: 22:47 < chasemp> service zuul stop && service zuul-merger stop && sudo apt-get install zuul=2.0.0-304-g685ca22-wmf1precise1
- 21:48 greg-g: Zuul not responding
- 20:23 hasharConfcall: Zuul no longer reports back to Gerrit due to an error with the Gerrit label
- 20:10 hasharConfcall: Zuul restarted with 2.0.0-327-g3ebedde-wmf2precise1
- 19:48 hasharConfcall: Upgrading Zuul to zuul_2.0.0-327-g3ebedde-wmf2precise1 Previous version failed because python-daemon was too old, now shipped in the venv https://phabricator.wikimedia.org/T106399
- 15:04 hashar: upgraded Zuul on gallium from zuul_2.0.0-306-g5984adc-wmf1precise1_amd64.deb to zuul_2.0.0-327-g3ebedde-wmf1precise1_amd64.deb . now uses python-daemon 2.0.5
- 13:37 hashar: upgraded Zuul on gallium from zuul_2.0.0-304-g685ca22-wmf1precise1 to zuul_2.0.0-306-g5984adc-wmf1precise1 . Uses a new version of GitPython
- 02:15 bd808: upgraded to elasticsearch-1.7.0.deb on deployment-logstash2
2015-07-20
- 16:55 thcipriani: restarted puppetmaster on deployment-salt, was acting whacky
2015-07-17
- 21:45 hashar: upgraded nodepool to 0.0.1-104-gddd6003-wmf4 . That fix graceful stop via SIGUSR1 and let me complete the systemd integration
- 20:03 hashar: stopping Zuul to get rid of a faulty registered function "build:Global-Dev Dashboard Data". Job is gone already.
2015-07-16
- 16:08 hashar_: kept nodepool stopped on labnodepool1001.eqiad.wmnet because it spams the cron log
- 10:27 hashar: fixing puppet on deployment-bastion. Stalled since July 7th - https://phabricator.wikimedia.org/T106003
- 10:26 hashar: deployment-bastion: apt-get upgrade
- 02:34 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/224313 for scap testing
2015-07-15
- 20:53 bd808: Added JanZerebecki as deployment-prep root
- 17:53 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/224829/
- 16:10 bd808: sudo rm -rf /tmp/scap_l10n_* on deployment-bastion
- 15:33 bd808: root (/) is full on deployment-bastion, trying to figure out why
- 14:39 bd808: mkdir mira.deployment-prep:/home/l10nupdate because puppet's managehome flag doesn't seem to be doing that :(
- 05:00 bd808: created mira.deployment-prep.eqiad.wmflabs to begin testing multi-master scap
2015-07-14
- 00:45 bd808: /srv/deployment/scap/scap on deployment-mediawiki02 had corrupt git cache info; moved to scap-corrupt and forced a re-sync
- 00:41 bd808: trebuchet deploy of scap to mediawiki02 failed. investigating
- 00:41 bd808: Updated scap to d7db8de (Don't assume current l10n cache files are .cdb)
2015-07-13
- 20:44 thcipriani: there might be some failures; puppetmaster refused to stop as usual, had to kill the pid and restart
- 20:39 thcipriani: restarting puppetmaster on deployment-salt, seeing weird errors on instances
- 10:24 hashar: pushed mediawiki/ruby/api tags for versions 0.4.0 and 0.4.1
- 10:12 hashar: deployment-prep: killing puppetmaster
- 10:06 hashar: integration: kicking puppet master. It is stalled somehow
2015-07-11
- 04:35 bd808: Updated /var/lib/git/labs/private to latest upstream
- 03:54 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/224219/
- 03:54 bd808: fixed rebase conflict with "Enable firejail containment for zotero" by removing stale cherry-pick
July 10
- 16:12 hashar: nodepool puppetization going on :-D
- 03:01 legoktm: deploying https://gerrit.wikimedia.org/r/223992
July 9
- 22:16 hashar: integration: pulled labs/private.git : dbef45d..d41010d
July 8
- 23:17 bd808: Kibana functional again. Imported some dashboards from prod instance.
- 22:48 marxarelli: cherry-picked https://gerrit.wikimedia.org/r/#/c/223691/ on integration-puppetmaster
- 22:33 bd808: about half of the indices on deployment-logstash2 lost. I assume it was caused by shard rebalancing to logstash1 that I didn't notice before I shut it down and deleted it :(
- 22:32 bd808: Upgraded elasticsearch on logstash2 to 1.6.0
- 22:00 bd808: Kibana messed up. Half of the logstash elasticsearch indices are gone from deployment-logstash2
- 21:05 legoktm: deployed https://gerrit.wikimedia.org/r/223669
- 11:47 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/223530
- 09:26 hashar: upgraded plugins on jenkins and restarting it
July 7
- 23:58 bd808: updated scap to 303e72e (Increment deployment stats after sync-wikiversions)
- 21:23 bd808: deleted instance deployment-logstash1
- 20:48 marxarelli: cherry-picking https://gerrit.wikimedia.org/r/#/c/158016/ on deployment-salt
- 20:07 bd808: Forced puppet run on deployment-restbase01; run picked up changes that should have been applied yesterday, not sure why puppet wasn't running from cron properly
- 19:58 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/223391/
- 18:51 bd808: restarted puppetmaster on deployment-salt to pick up logging config changes
- 18:14 bd808: Changed role::protoproxy::ssl::beta to role::tlsproxy::ssl::beta for deployment-cache-*
- 18:10 bd808: puppet broken on deployment-cache-* by https://gerrit.wikimedia.org/r/#/c/222124/
- 15:45 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/223301/
July 6
- 23:34 marxarelli: Reloading Zuul to deploy I33ac72e7df498e58f0e25d8c59f167d13eae06cf
- 23:24 bd808: restarted nutcracker on deployment-mediawiki01
- 21:32 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/223184/ to deployment-salt
- 20:57 bd808: restarted puppetmaster on deployment-salt
- 20:55 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/223172/ for testing
- 20:50 hashar: removing lanthanum from Jenkins slave configuration. Server is gone ( https://phabricator.wikimedia.org/T86658 )
- 20:34 hashar: lanthanum: deleting gerrit replicas under /srv/ssd/gerrit
- 20:32 hashar: Gerrit: reloading replication plugin: gerrit plugin reload replication
- 14:08 hashar: Disconnected lanthanum Jenkins slave. Being phased out https://phabricator.wikimedia.org/T86658
July 3
- 14:07 hashar: adding puppetmaster::certcleaner class to integration and beta puppetmaster
- 14:03 hashar: rebased puppetmaster on integration project
- 13:59 hashar: removing puppetmaster::autosigner from integration-puppetmaster
- 13:58 hashar: removing puppetmaster::autosigner from deployment-salt. It is now automatic per https://gerrit.wikimedia.org/r/#/c/220306/
- 13:55 hashar: restarted puppetmaster on deployment-salt
- 05:20 legoktm: deploying https://gerrit.wikimedia.org/r/222539
- 01:18 legoktm: deploying https://gerrit.wikimedia.org/r/166074
- 00:41 legoktm: deploying https://gerrit.wikimedia.org/r/222503
July 2
- 10:07 hashar: adding mobrovac to the integration project so he can ssh to slaves and sudo as jenkins-deploy user
July 1
- 15:44 hashar: Kunal's awesome dashboard for repos https://www.mediawiki.org/wiki/User:Legoktm/ci
- 15:34 hashar: https://integration.wikimedia.org/ci/job/mediawiki-core-phpcs-HEAD/ is fixed. populated the git repos manually
- 15:21 hashar: manually populating mediawiki/core on Precise instances for the mediawiki-core-phpcs-HEAD job using: git config remote.origin.url https://gerrit.wikimedia.org/r/p/mediawiki/core followed by git fetch
- 15:14 hashar: https://integration.wikimedia.org/ci/job/mediawiki-core-phpcs-HEAD/ broken while cloning mediawiki/core :-(
- 10:47 hashar: puppet fixed by restarting the puppet master
- 10:41 hashar: restarting Jenkins
- 10:40 hashar: upgrading Jenkins gearman plugin from 0.1.1-8-gf2024bd to 0.1.1-9-g08e9c42-change_192429_2 https://phabricator.wikimedia.org/T72597#1416913
- 10:38 hashar: restarted puppetmaster on integration
- 10:36 hashar: Error: /Stage[main]/Ldap::Client::Utils/File[/usr/local/sbin/archive-project-volumes]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/ldap/scripts/archive-project-volumes
- 10:36 hashar: integration: puppet now fails on instances :-/
- 10:29 hashar: rebased puppet.git on integration-puppetmaster. Autoupdater was blocked by a couple 3-way merges.
June 30
- 10:07 hashar: deployment-bastion sudo -u l10nupdate bash -c 'cd /srv/l10nupdate/mediawiki/extension && git submodule foreach git gc'
- 09:43 hashar: deployment-bastion sudo -u jenkins-deploy bash -c 'cd /srv/mediawiki-staging/php-master/extensions && git submodule foreach git gc'
- 09:40 hashar: deployment-bastion sudo -u l10nupdate bash -c 'cd /srv/l10nupdate/mediawiki/core/.git && git gc'
- 09:39 hashar: deployment-bastion: sudo -u l10nupdate bash -c 'cd /srv/l10nupdate/mediawiki/extensions/.git && git gc'
- 09:38 hashar: deployment-bastion sudo -u jenkins-deploy bash -c 'cd /srv/mediawiki-staging/php-master/extensions/.git && git gc'
- 09:31 hashar: beta: running git gc on deployment-bastion Trebuchet directories. As trebuchet: find /srv/deployment/*/*/.git -type d -name .git -print -exec bash -c 'cd {} && git gc' \;
- 07:09 legoktm: deploying https://gerrit.wikimedia.org/r/221835
June 29
- 23:19 bd808: Moved logstash irc bot from logstash1 to logstash2
- 22:25 legoktm: deploying https://gerrit.wikimedia.org/r/221749
- 18:08 thcipriani: restarted nutcracker on beta cluster salt '*-mediawiki*' cmd.run 'service nutcracker restart'
- 10:42 hashar: manually rebasing integration-puppetmaster git repo
- 10:24 hashar: restarted puppetmaster on deployment-salt
- 10:23 hashar: puppet master stalled due to: [ldap-yaml-enc.p] <defunct> . Killing it
- 10:21 hashar: sees beta cluster puppetmaster is suffering from some random issue
June 27
- 02:42 legoktm: deploying https://gerrit.wikimedia.org/r/221343 & https://gerrit.wikimedia.org/r/221344
- 02:36 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/221342
- 02:22 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/221338
- 02:15 legoktm: deploying https://gerrit.wikimedia.org/r/#/c/221337/
- 01:56 legoktm: deploying https://gerrit.wikimedia.org/r/221333 & https://gerrit.wikimedia.org/r/221334
- 01:42 legoktm: deploying https://gerrit.wikimedia.org/r/221331
- 01:36 legoktm: deploying https://gerrit.wikimedia.org/r/221330
- 01:28 legoktm: deploying https://gerrit.wikimedia.org/r/221329
- 01:13 legoktm: deploying https://gerrit.wikimedia.org/r/221328
- 00:15 legoktm: deploying https://gerrit.wikimedia.org/r/221316 & https://gerrit.wikimedia.org/r/221318
June 26
- 22:39 marxarelli: Reloading Zuul to deploy I3deec5e5a7ce7eee75268d0546eafb3e4145fdc7
- 22:20 marxarelli: Reloading Zuul to deploy I7affe14e878d5c1fc4bcb4dfc7f2d1494cd795b7
- 21:45 legoktm: deploying https://gerrit.wikimedia.org/r/221295
- 21:21 marxarelli: running `jenkins-jobs update` to deploy I7affe14e878d5c1fc4bcb4dfc7f2d1494cd795b7
- 18:46 marxarelli: running `jenkins-jobs update '*bundle*'` to deploy Icb31cf57bee0483800b41a2fb60d236fcd2d004e
June 25
- 23:38 legoktm: deploying https://gerrit.wikimedia.org/r/221001
- 21:21 thcipriani: updated deployment-salt to match puppet by rm /var/lib/git/operations/puppet/modules/cassandra per godog's instructions
- 19:09 hashar: purged all WikidataQuality workspaces. Got renamed to WikibaseQuality*
- 14:22 jzerebecki: reloading zuul for https://gerrit.wikimedia.org/r/#/c/220737/2
- 14:20 jzerebecki: killing a fellow's idle shell zuul@gallium:~$ kill 13602
- 11:03 hashar: Rebooting integration-raita and integration-vmbuilder-trusty
- 11:01 hashar: Unmounting /data/project and /home NFS mounts from integration-raita and integration-vmbuilder-trusty https://phabricator.wikimedia.org/T90610
- 10:45 hashar: deployment-sca02 deleted /var/lib/puppet/state/agent_catalog_run.lock from June 5th
- 08:57 hashar: Fixed puppet "Can't dup Symbol" on deployment-pdf01 by deleting puppet, /var/lib/puppet and reinstalling it from scratch https://phabricator.wikimedia.org/T87197
- 08:39 hashar: apt-get upgrade deployment-salt
- 08:08 hashar: deployment-pdf01 deleted /var/log/ocg/ content. Last entry is from July 25th 2014 and puppet complains with e[/var/log/ocg]: Not removing directory; use 'force' to override
- 08:04 hashar: apt-get upgrade deployment-pdf01
- 06:37 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/220712
- 06:33 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/220705
June 24
- 19:31 hashar: rebooting deployment-cache-upload02
- 19:28 hashar: fixing DNS puppet etc on deployment-cache-upload02
- 19:24 hashar: rebooting deployment-zookeeper to get rid of the /home NFS https://phabricator.wikimedia.org/T102169
- 19:06 hashar: beta: salt 'i-00*' cmd.run "echo 'domain integration.eqiad.wmflabs\nsearch integration.eqiad.wmflabs eqiad.wmflabs\nnameserver 208.80.154.20\noptions timeout:5' > /etc/resolv.conf"
- 19:06 hashar: fixing DNS / puppet and salt on i-000008d5.eqiad.wmflabs i-000002de.eqiad.wmflabs i-00000958.eqiad.wmflabs
- 15:35 hashar: integration-dev recovered! puppet hasn't run for ages but caught up with changes
- 15:13 hashar: removed /var/lib/puppet/state/agent_catalog_run.lock on integration-dev
- 09:52 hashar: Java 6 removed from gallium / lanthanum and CI labs slaves.
- 09:18 hashar: getting rid of java 6 on CI machines ( https://phabricator.wikimedia.org/T103491 )
- 07:58 hashar: Bah, puppet re-enabled NFS on deployment-parsoidcache02 for some reason
- 07:57 hashar: disabling NFS on deployment-parsoidcache02
- 00:38 marxarelli: reloading zuul to deploy https://gerrit.wikimedia.org/r/#/c/219513/
- 00:32 marxarelli: running `jenkins-jobs update` to create 'mwext-MobileFrontend-mw-selenium' with I7affe14e878d5c1fc4bcb4dfc7f2d1494cd795b7
- 00:20 marxarelli: running `jenkins-jobs update` to create 'mediawiki-selenium-integration' with I7affe14e878d5c1fc4bcb4dfc7f2d1494cd795b7
June 23
- 23:29 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/220350
- 21:34 bd808: updated scap to 33f3002 (Ensure that the minimum batch size used by cluster_ssh is 1)
- 19:53 legoktm: deleted broken renames from centralauth.renameuser_status on beta cluster
- 18:28 jzerebecki: zuul reload for https://gerrit.wikimedia.org/r/#/c/219778/4
- 16:33 bd808: updated scap to da64a65 (Cast pid read from file to an int)
- 16:20 bd808: updated scap to 947b93f (Fix reference to _get_apache_list)
- 12:24 hashar: rebooting integration-labvagrant (stuck)
- 00:07 legoktm: deploying https://gerrit.wikimedia.org/r/220020
June 22
- 22:23 legoktm: deploying https://gerrit.wikimedia.org/r/219603
- 21:47 bd808: scap emitting soft failures due to missing python-netifaces on deployment-videoscaler01; should be fixed by a current puppet run
- 21:37 bd808: Updated scap to 81b7c14 (Move dsh group file names to config)
- 14:58 hashar: disabled sshd MAC/KEX hardening on beta (was https://gerrit.wikimedia.org/r/#/c/219828/ )
- 14:32 hashar: restarting Jenkins
- 14:30 hashar: Reenable sshd MAC/KEX hardening on beta by cherry picking https://gerrit.wikimedia.org/r/#/c/219828/
- 13:17 moritzm: activated firejail service containment for graphoid, citoid and mathoid in deployment-sca
- 11:07 hashar: fixing puppet on integration-zuul-server
- 10:29 hashar: rebooted deployment-kafka02 to get rid of /home NFS share
- 10:25 hashar: fixed puppet.conf on deployment-urldownloader
- 10:20 hashar: enabled puppet agent on deployment-urldownloader
- 10:05 hashar: removing puppet lock on deployment-elastic07 ( rm /var/lib/puppet/state/agent_catalog_run.lock )
- 09:40 hashar: fixed puppet certificates on integration-lightslave-jessie-1002 by deleting the SSL certs
- 09:31 hashar: can't reach integration-lightslave-jessie-1002 , probably NFS related
- 09:22 hashar: upgrading Jenkins gearman plugin from 0.1.1 to latest master (f2024bd).
June 21
- 02:40 legoktm_: deploying https://gerrit.wikimedia.org/r/219401
June 20
- 03:12 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/219449
June 19
- 18:39 thcipriani: running `salt -b 2 '*' cmd.run 'puppet agent -t'` from deployment salt to remount /data/projects
- 18:36 thcipriani: added role::deployment::repo_config to deployment-prep hiera, to be removed after patched in ops/puppet
- 16:48 thcipriani: primed keyholder on deployment-bastion
- 15:35 hashar: nodepool manages to boot instances and ssh to them. Now attempting to add them as slaves in Jenkins!
June 17
- 20:43 legoktm: deploying https://gerrit.wikimedia.org/r/219021
- 18:56 legoktm: deploying https://gerrit.wikimedia.org/r/218981
- 16:40 legoktm: deploying https://gerrit.wikimedia.org/r/218938 & https://gerrit.wikimedia.org/r/218939
- 14:16 jzerebecki: deploying zuul config ca3bd69..00eb921
- 13:53 jzerebecki: applying https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock
- 13:00 jzerebecki: done
- 12:32 jzerebecki: also needed to kill a few beta jobs, like https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update says. now proceeding with https://gerrit.wikimedia.org/r/#/c/214603/8
- 12:23 jzerebecki: before doing that actually trying https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Jenkins_execution_lock to try to unlock https://integration.wikimedia.org/ci/computer/deployment-bastion.eqiad/
- 12:17 jzerebecki: changing many jenkins jobs while deploying https://gerrit.wikimedia.org/r/#/c/214603/8
June 16
- 15:55 bd808: Resolved rebase conflicts on deployment-salt caused by code review changes of https://gerrit.wikimedia.org/r/#/c/216325 prior to merge
- 13:05 hashar: upgrading HHVM on CI trusty slaves https://phabricator.wikimedia.org/T102616 salt -v -t 30 --out=json -C 'G@oscodename:trusty and *slave*' pkg.install pkgs='["hhvm","hhvm-dev","hhvm-fss","hhvm-luasandbox","hhvm-tidy","hhvm-wikidiff2"]'
- 11:45 hashar: integration-slave-trusty-1021 downgrading hhvm plugins to match hhvm 3.3.1
- 11:42 hashar: integration-slave-trusty-1021 downgrading hhvm, hhvm-dev from 3.3.6 to 3.3.1
- 11:19 hashar: rebooting integration-dev, unreachable
- 11:09 hashar: apt-get upgrade on integration-slave-trusty-1021
- 08:19 hashar: rebooting integration-slave-jessie-1001, unreachable
June 15
- 23:39 legoktm: deploying https://gerrit.wikimedia.org/r/218549
- 23:22 legoktm: deploying https://gerrit.wikimedia.org/r/218527
- 21:10 bd808: Put cherry-picks of https://gerrit.wikimedia.org/r/#/c/216325/ and https://gerrit.wikimedia.org/r/#/c/216337/ back on deployment-salt
- 19:59 hashar: manually rebased puppet repo on integration-puppetmaster (some patch got merged)
- 17:21 legoktm: deploying https://gerrit.wikimedia.org/r/218391
- 15:02 hashar: rebooting integration-slave-jessie-1001.integration.eqiad.wmflabs (unresponsive)
- 14:37 hashar: rebooting integration-dev since it is unresponsive
- 13:22 hashar: cleaned integration-puppetmaster certificate
- 13:09 hashar: deleting integration-saltmaster puppet cert
June 13
- 07:53 legoktm: deploying https://gerrit.wikimedia.org/r/217997
- 03:42 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/217993
- 01:11 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/217982
June 12
- 21:29 jzerebecki: reloading zuul with 9ceb1ea..3b862a7 for https://gerrit.wikimedia.org/r/#/c/176377/3
- 21:22 legoktm: deploying https://gerrit.wikimedia.org/r/217448
- 19:02 jzerebecki: done
- 19:00 jzerebecki: doing https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock
- 14:56 jzerebecki: reloaded zuul 38c009d..6753a47
June 11
- 14:44 hashar: deployment-prep and integration labs projects got migrated out of ec2id. Flawless / self-maintaining task thanks to Andrew B.!
- 14:38 hashar: integration-saltmaster: salt-key --accept-all --yes
- 14:30 hashar: rebasing puppetmaster on integration-puppetmaster ca27502..c409503
- 14:28 hashar: rebasing puppetmaster on deployment-salt ca27502..c409503
- 14:28 hashar: cert madness on integration and deployment-prep ( https://gerrit.wikimedia.org/r/#/c/202924/ )
- 10:44 hashar: operations-dns-lint can't be migrated yet until we figure out a solution to provide some missing GeoIP file https://phabricator.wikimedia.org/T98737
- 10:33 hashar: integration: pooling https://integration.wikimedia.org/ci/computer/integration-lightslave-jessie-1002/ with labels DebianJessie and contintLabsSlave. Does not have Zuul package installed though.
- 10:27 hashar: integration: do not install zuul on light slaves (i.e.: integration-lightslave-jessie-1002 ). Jessie does not have a zuul package yet https://gerrit.wikimedia.org/r/#/c/217476/1
- 10:03 hashar: integration: cherry picked https://gerrit.wikimedia.org/r/#/c/217466/1 and https://gerrit.wikimedia.org/r/#/c/217467/1 and applied role::ci::slave::labs::light to integration-lightslave-jessie-1002
- 09:41 hashar: Hiera:Integration change puppet master from 'integration-puppetmaster' to 'integration-puppetmaster.integration.eqiad.wmflabs' https://phabricator.wikimedia.org/T102108
- 09:20 hashar: creating integration-lightslave-jessie-1002, an m1.small (1CPU) instance that will be a very basic Jenkins slave. The reason is role::ci::slave::labs includes too many things which are not ready for Jessie yet ( https://phabricator.wikimedia.org/T94836 ). Will let us migrate operations-dns-lint to it since prod switched to Jessie (https://phabricator.wikimedia.org/T98003)
June 10
- 20:18 legoktm: deploying https://gerrit.wikimedia.org/r/217277
- 10:42 hashar: restarted jobchron/jobrunner on deployment-jobrunner01
- 10:42 hashar: manually nuked and repopulated jobqueue:aggregator:s-wikis:v2 on deployment-redis01. It now only contains entries from all-labs.dblist
- 09:46 hashar: deployment-videoscaler restarted jobchron
- 08:19 mobrovac: reboot deployment-restbase01 due to ssh problems
June 9
- 22:13 thcipriani: are we back?
- 17:31 twentyafterfour: Branching 1.26wmf9
- 17:10 hashar: restart puppet master on deployment-salt. Was overloaded with wait I/O since roughly 1am UTC
- 16:56 hashar: restarted puppetmaster on deployment-salt
June 8
- 14:12 hashar: clearing disk space on trusty 1011 and 1012
- 08:56 hashar: rebooted trusty-1013 trusty-1015 ( https://phabricator.wikimedia.org/T101658 ) and repooled them in Jenkins
- 08:48 hashar: rebooting integration-slave-trusty-1012 (stalled can't login)
- 04:30 legoktm: deploying https://gerrit.wikimedia.org/r/216520
- 00:40 legoktm: deploying https://gerrit.wikimedia.org/r/216600
June 7
- 20:43 Krinkle: Rebooting integration-slave-trusty-1015 to see if it comes back so we can inspect logs (T101658)
- 20:16 Krinkle: Per Yuvi's advice, disabled "Shared project storage" (/data/project NFS mount) for the integration project. Mostly unused. Two existing directories were archived to /home/krinkle/integration-nfs-data-project/
- 17:51 Krinkle: integration-slave-trusty-1012, trusty-1013 and 1015 unresponsive to pings or ssh. Other trusty slaves still reachable.
June 6
- 21:05 legoktm: deploying https://gerrit.wikimedia.org/r/216500
June 5
- 23:55 bd808: added deployment-logstash2 host and told the cluster to move all logstash data there (sketch at the end of this day's entries)
- 21:22 bd808: restarted puppetmaster on deployment-salt ("Could not request certificate: Error 500 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">")
- 21:17 hashar: Pooled in mediawiki-extensions-qunit, which runs qunit tests with karma across multiple extensions. https://gerrit.wikimedia.org/r/#/c/216132/ . https://phabricator.wikimedia.org/T99877
- 19:45 thcipriani: set use_dnsmasq: false on Hiera:Integration
- 19:40 hashar: refreshed Jenkins jobs mediawiki-extensions-hhvm and mediawiki-extensions-zend with https://gerrit.wikimedia.org/r/#/c/216100/3 (refactoring)
- 18:56 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/216182
- 18:52 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/216159
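A sketch of how the 23:55 data move could have been driven, assuming the Elasticsearch 1.x cluster-settings API; the excluded IP is a placeholder, and the actual mechanism may simply have been adding the new node and letting shards rebalance:
 # exclude the old logstash node from allocation so its shards drain to deployment-logstash2
 curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{"transient":{"cluster.routing.allocation.exclude._ip":"10.68.x.x"}}'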
June 4
- 18:06 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/214501
- 16:50 legoktm: deploying https://gerrit.wikimedia.org/r/215935
- 15:07 hashar: integration-jessie-slave1001 : upgrading salt from 2014.1.13 to 2014.7.5
- 14:58 thcipriani: running sudo salt '*' cmd.run 'sed -i "s/GlobalSign_CA.pem/ca-certificates.crt/" /etc/ldap/ldap.conf' on integration-saltmaster
- 14:54 hashar: integration-jessie-slave1001 : running dpkg --configure -a
- 09:26 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/215870
June 3
- 23:31 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/209991
- 20:49 hashar: restarted zuul entirely to remove some stalled jobs
- 20:47 marxarelli: Reloading Zuul to deploy I96649bc92a387021a32d354c374ad844e1680db2
- 20:28 hashar: Restarting Jenkins to release a deadlock
- 20:22 hashar: deployment-bastion Jenkins slave is stalled again :-( No code update happening on beta cluster
- 18:50 thcipriani: change use_dnsmasq: false for deployment-prep
- 18:24 thcipriani: updating deployment-salt puppet in prep for use_dnsmasq=false
- 11:58 kart_: Cherry-picked 213840 to test logstash
- 10:08 hashar: Update JJB fork again f966521..4135e14 . Will remove the http notification to zuul {{bug:T93321}}. REFRESHING ALL JOBS!
- 10:03 hashar: Further updated JJB fork c7231fe..f966521
- 09:10 hashar: Refreshing almost all jenkins jobs to take into account the Jenkins Git plugin upgrade https://phabricator.wikimedia.org/T101105
- 03:07 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/215571
June 2
- 20:58 bd808: redis-cli srem "deploy:scap/scap:minions" i-000002f4.eqiad.wmflabs
- 20:54 bd808: deleted unused deployment-rsync01 instance
- 20:49 bd808: Updated scap to 62d5cb2 (Lint JSON files)
- 20:40 marxarelli: cherry-picked https://gerrit.wikimedia.org/r/#/c/208024/ on integration-puppetmaster
- 20:38 marxarelli: manually rebased operations/puppet on integration-puppetmaster to fix empty commit from cherry-pick
- 17:01 hashar: updated JJB fork to e3199d9..c7231fe
- 15:16 hashar: updated integration/jenkins-job-builder to e3199d9
- 13:16 hashar: restarted deployment-salt
June 1
- 08:18 hashar: Jenkins: upgrading git plugin from 1.5.0 to latest
May 31
- 21:31 legoktm: deploying https://gerrit.wikimedia.org/r/214982
- 20:50 legoktm: deploying https://gerrit.wikimedia.org/r/214939
- 00:59 legoktm: deployed https://gerrit.wikimedia.org/r/214889
May 29
- 22:45 legoktm: deploying https://gerrit.wikimedia.org/r/214775
- 19:48 legoktm: deleting corrupt mwext-qunit@2 workspace on integration-slave-trusty-1017
- 17:21 legoktm: deploying https://gerrit.wikimedia.org/r/214652 and https://gerrit.wikimedia.org/r/214653
May 28
- 20:50 bd808: Ran "del jobqueue:aggregator:h-ready-queues:v2" on deployment-redis01
- 13:46 hashar: upgrading Jenkins git plugin from 1.4.6+wmf1 to 1.7.1 bug T100655 and restarting Jenkins
May 27
- 15:09 hashar: Jenkins slaves are all back up. Root cause was an ssh algorithm in their sshd that is not supported by Jenkins' embedded jsch lib (compatibility sketch at the end of this day's entries).
- 14:30 hashar: manually rebasing puppet git on deployment-salt (stalled)
- 14:27 hashar: restarting deployment-salt / some process is 100% wa/IO
- 13:38 hashar: restarted integration puppetmaster (memory leak)
- 13:35 hashar: integration-puppetmaster apparently out of memory
- 13:30 hashar: All Jenkins slaves are disconnected due to some ssh error. CI is down.
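A hypothetical sshd_config fragment illustrating the kind of compatibility fix implied by the 15:09 root-cause note; the algorithm lists are examples only, not the values actually deployed:
 # keep at least one KEX and MAC that Jenkins' bundled jsch client can negotiate (example values)
 KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group14-sha1
 MACs hmac-sha2-512,hmac-sha2-256,hmac-sha1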
May 24
- 10:27 duh: deploying https://gerrit.wikimedia.org/r/213218
May 23
- 21:43 legoktm: deploying https://gerrit.wikimedia.org/r/212960
May 20
- 17:19 thcipriani|afk: add --fail to curl inside mwext-Wikibase-qunit jenkins job
- 15:59 bd808: Applied role::beta::puppetmaster on deployment-salt to get Puppet logstash reports back
May 19
- 02:54 bd808: Primed keyholder agent via `sudo -u keyholder env SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa`
- 02:40 Krinkle: deployment-bastion.eqiad magically back online and catching up jobs, though failing due to T99644
- 02:36 Krinkle: Jenkins is unable to launch slave agent on deployment-bastion.eqiad. Using "Jenkins Script Console" throws HTTP 503.
- 02:30 Krinkle: Various beta-mediawiki-config-update-eqiad jobs have been stuck for over 13 hours.
May 12
- 15:18 hashar: downgrading hhvm on CI slaves
- 15:10 hashar: mediawiki-phpunit-hhvm Jenkins job is broken due to an hhvm upgrade bug T98876
- 00:48 bd808: beta cluster central syslog going to logstash rather than deployment-bastion (see https://gerrit.wikimedia.org/r/#/c/210253)
- 00:36 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/210253/
- 00:16 legoktm: deploying https://gerrit.wikimedia.org/r/210251
May 11
- 22:50 legoktm: deploying https://gerrit.wikimedia.org/r/210219
- 22:29 bd808: removed duplicate local group l10nupdate from deployment-bastion that was shadowing the ldap group of the same name (cleanup sketch at the end of this day's entries)
- 22:24 bd808: removed duplicate local group mwdeploy from deployment-bastion that was shadowing the ldap group of the same name
- 22:15 bd808: Removed role::logging::mediawiki from deployment-bastion
- 20:55 legoktm: deleted operations-puppet-tox-py27 workspace on integration-slave-precise-1012, it was corrupt (fatal: loose object b48ccc3ef5be2d7252eb0f0f417f1b5b7c23fd5f (stored in .git/objects/b4/8ccc3ef5be2d7252eb0f0f417f1b5b7c23fd5f) is corrupt)
- 13:54 hashar: Jenkins: removing label hasContintPackages from production slaves, it is no more needed :)
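A sketch of the 22:29/22:24 duplicate-group cleanup, assuming the duplicates lived in /etc/group and the LDAP entries should win:
 # show the local entry shadowing LDAP, then remove it (l10nupdate shown; mwdeploy handled the same way)
 grep '^l10nupdate:' /etc/group
 groupdel l10nupdate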
May 9
- 00:10 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/209830 to deployment-bastion:/srv/deployment/scap/scap and deployed with trebuchet
May 8
- 23:59 bd808: Created /data/project/logs/WHERE_DID_THE_LOGS_GO.txt to point folks to the right places
- 23:54 bd808: Switched MediaWiki debug logs to deployment-fluorine:/srv/mw-log
- 20:05 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/209801
- 18:15 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/209769/
- 05:14 bd808: apache2 access logs now only locally on instances in /var/log/apache2/other_vhosts_access.log; error log in /var/log/apache2.log and still relayed to deployment-bastion and logstash (works like production now)
- 04:49 bd808: Symbolic link not allowed or link target not accessible: /srv/mediawiki/docroot/bits/static/master/extensions
- 04:47 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/209680/
May 7
- 20:48 bd808: Updated kibana to bb9fcf6 (Merge remote-tracking branch 'upstream/kibana3')
- 18:00 greg-g: brought deployment-bastion.eqiad back online in Jenkins (after Krinkle disconnected it some hours ago). Jobs are processing
- 16:05 bd808: Updated scap to 5d681af (Better handling for php lint checks)
- 14:05 Krinkle: deployment-bastion.eqiad has been stuck for 10 hours.
- 14:05 Krinkle: For two days now, Jenkins has always returned the Wikimedia 503 error page after logging in. The login session itself is fine.
- 05:02 legoktm: slaves are going up/down likely due to automated labs migration script
May 6
- 15:13 bd808: Updated scap to 57036d2 (Update statsd events)
May 5
- 19:06 jzerebecki: integration-slave-trusty-1015:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-qunit/src/node_modules
- 15:42 legoktm: deploying https://gerrit.wikimedia.org/r/208975 & https://gerrit.wikimedia.org/r/208976
- 04:36 legoktm: deploying https://gerrit.wikimedia.org/r/208899
- 04:04 legoktm: deploying https://gerrit.wikimedia.org/r/208889,90,91,92
May 4
- 23:50 hashar: restarted Jenkins (deadlock with deployment-bastion)
- 23:49 hashar: restarted Jenkins
- 22:50 hashar: Manually retriggering last change of operations/mediawiki-config.git with: zuul enqueue --trigger gerrit --pipeline postmerge --project operations/mediawiki-config --change 208822,1
- 22:49 hashar: restarted Zuul to clear out a bunch of operations/mediawiki-config.git jobs
- 22:20 hashar: restarting Jenkins from gallium :/
- 22:18 thcipriani: jenkins restarted
- 22:12 thcipriani: preparing jenkins for shutdown
- 21:59 hashar: disconnected reconnected Jenkins Gearman client
- 21:41 thcipriani: deployment-bastion still not accepting jobs from jenkins
- 21:35 thcipriani: disconnecting deployment-bastion and reconnecting, again
- 20:54 thcipriani: marking node deployment-bastion offline due to stuck jenkins execution lock
- 19:03 legoktm: deploying https://gerrit.wikimedia.org/r/208339
- 17:46 bd808: integration-slave-precise-1014 died trying to clone mediawiki/core.git with "fatal: destination path 'src' already exists and is not an empty directory."
May 2
- 06:53 legoktm: deploying https://gerrit.wikimedia.org/r/208366
- 06:45 legoktm: deploying https://gerrit.wikimedia.org/r/208364
- 05:49 legoktm: deploying https://gerrit.wikimedia.org/r/208358
- 05:25 legoktm: deploying https://gerrit.wikimedia.org/r/207132
- 04:18 legoktm: deploying https://gerrit.wikimedia.org/r/208342 and https://gerrit.wikimedia.org/r/208340
- 03:56 legoktm: reset mediawiki-extensions-hhvm workspace on integration-slave-trusty-1015 (bad .git lock)
April 30
- 19:26 Krinkle: Repooled integration-slave-trusty-1013. IP unchanged.
- 19:00 Krinkle: Depooled integration-slave-trusty-1013 for labs maintenance (per andrewbogott)
- 14:17 hashar: Jenkins: properly downgraded IRC plugin from 2.26 to 2.25
- 13:40 hashar: Jenkins: downgrading IRC plugin from 2.26 to 2.25
- 12:09 hashar: restarting Jenkins https://phabricator.wikimedia.org/T96183
April 29
- 17:15 thcipriani: removed l10nupdate user from /etc/passwd on deployment-bastion
- 15:00 hashar: Instances are being moved out from labvirt1005 which has some faulty memory. List of instances at https://phabricator.wikimedia.org/T97521#1245217
- 14:25 hashar: upgrading zuul on integration-slave-precise-1011 for https://phabricator.wikimedia.org/T97106
- 14:11 hashar: rebooting integration-saltmaster stalled.
- 13:11 hashar: Rebooting deployment-parsoid05 via wikitech interface.
- 13:02 hashar: labvirt1005 seems to have hardware issue. Impacts a bunch of beta cluster / integration instances as listed on https://phabricator.wikimedia.org/T97521#1245217
- 12:22 hashar: deployment-parsoid05 slow down is https://phabricator.wikimedia.org/T97421 . Running apt-get upgrade and rebooting it but its slowness issue might be with the underlying hardware
- 12:13 hashar: killing puppet on deployment-parsoid05; it eats all CPU for some reason
- 02:40 legoktm: deploying https://gerrit.wikimedia.org/r/207363 and https://gerrit.wikimedia.org/r/207368
April 28
- 23:37 hoo: Ran foreachwiki extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --load-from 'http://meta.wikimedia.beta.wmflabs.org/w/api.php' --force-protocol http (because some sites are http only, although the sitematrix claims otherwise)
- 23:33 hoo: Ran foreachwiki extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --load-from 'http://meta.wikimedia.beta.wmflabs.org/w/api.php' to fix all sites tables
- 23:18 hoo: Ran mysql> INSERT INTO sites (SELECT * FROM wikidatawiki.sites); on enwikinews to populate the sites table
- 23:18 hoo: Ran mysql> INSERT INTO sites (SELECT * FROM wikidatawiki.sites); on testwiki to populate the sites table
- 17:48 James_F: Restarting grrrit-wm for config change.
- 16:24 bd808: Updated scap to ef15380 (Make scap localization cache build $TMPDIR aware)
- 15:42 bd808: Freed 5G on deployment-bastion by deleting abandoned /tmp/scap_l10n_* directories
- 14:01 marxarelli: reloading zuul to deploy https://gerrit.wikimedia.org/r/#/c/206967/
- 00:17 greg-g: after the 3rd or so time doing it (while on the Golden Gate Bridge, btw) it worked
- 00:11 greg-g: still nothing...
- 00:10 greg-g: after disconnecting, marking temp offline, bringing back online, and launching slave agent: "Slave successfully connected and online"
- 00:07 greg-g: deployment-bastion is idle, yet we have 3 pending jobs waiting for an executer on it - will disconnect/reconnect it in Jenkins
April 27
- 21:45 bd808: Manually triggered beta-mediawiki-config-update-eqiad for zuul build df1e789c726ad4aae60d7676e8a4fc8a2f6841fb (retrigger sketch at the end of this day's entries)
- 21:20 bd808: beta-scap-eqiad job green again after adding a /srv/ disk to deployment-jobrunner01
- 21:08 bd808: Applied role::labs::lvm::srv on deployment-jobrunner01 and forced puppet run
- 21:08 bd808: Deleted deployment-jobrunner01:/srv/* in preparation for applying role::labs::lvm::srv
- 21:06 bd808: deployment-jobrunner01 missing role::labs::lvm::srv
- 21:00 bd808: Root partition full on deployment-jobrunner01
- 20:53 bd808: removed mwdeploy user from deployment-bastion:/etc/passwd
- 20:15 Krinkle: Relaunched Gearman connection
- 19:53 Krinkle: Jenkins unable to re-create Gearman connection. (HTTP 503 error from /configure). Have to force restart Jenkins
- 17:32 Krinkle: Relaunch slave agent on deployment-bastion
- 17:31 Krinkle: Jenkins slave deployment-bastion deadlock waiting for executors
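One way the 21:45 manual trigger could be reproduced from the command line, mirroring the zuul enqueue invocation recorded under May 4 in this log; the change number is a placeholder and the actual trigger may well have been done from the Jenkins UI instead:
 # re-enqueue a merged mediawiki-config change so the postmerge pipeline runs the update job again
 zuul enqueue --trigger gerrit --pipeline postmerge --project operations/mediawiki-config --change NNNNNN,1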
April 26
- 06:09 thcipriani|afk: rm scap l10n files from /tmp on deployment-bastion; root partition 100% again...
April 25
- 16:00 thcipriani|afk: manually ran logrotate on deployment-jobrunner01, root partition at 100%
- 15:16 thcipriani|afk: clear /tmp/scap files on deployment-bastion, root partition at 100%
April 24
- 18:01 thcipriani: ran sudo chown -R mwdeploy:mwdeploy /srv/mediawiki on deployment-bastion to fix beta-scap-eqiad, hopefully
- 17:26 thcipriani: remove deployment-prep from domain in /etc/puppet/puppet.conf on deployment-stream, puppet now OK
- 17:20 thcipriani: rm stale lock on deployment-rsync01, puppet fine
- 17:10 thcipriani: gzip /var/log/account/pacct.0 on deployment-bastion: ought to revisit logrotate on that instance (example stanza at the end of this day's entries).
- 17:00 thcipriani: rm stale /var/lib/puppet/state/agent_catalog_run.lock on deployment-kafka02
- 09:56 hashar: restarted mysql on both deployment-db1 and deployment-db2. The service is apparently not started on instance boot. https://phabricator.wikimedia.org/T96905
- 09:08 hashar: beta: manually rebased operations/puppet.git
- 08:43 hashar: Enabling puppet on deployment-eventlogging02.eqiad.wmflabs bug T96921
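A hypothetical logrotate stanza for the process-accounting log mentioned at 17:10; file location, frequency and retention are assumptions, not the actual configuration (it would live in e.g. /etc/logrotate.d/pacct):
 /var/log/account/pacct {
     weekly
     rotate 4
     compress
     missingok
     notifempty
 }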
April 23
- 06:11 Krinkle: Running git-cache-update inside screen on integration-slave-trusty-1021 at /mnt/git (invocation sketch at the end of this day's entries)
- 06:11 Krinkle: integration-slave-trusty-1021 stays depooled (see T96629 and T96706)
- 04:35 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/206044 and https://gerrit.wikimedia.org/r/206072
- 00:29 bd808: cherry-picked and applied https://gerrit.wikimedia.org/r/#/c/205969/ (logstash: Convert $::realm switches to hiera)
- 00:17 bd808: beta cluster fatal monitor full of "Bad file descriptor: AH00646: Error writing to /data/project/logs/apache-access.log"
- 00:03 bd808: cleaned up redis leftovers on deployment-logstash1
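A sketch of the 06:11 invocation, assuming git-cache-update is on the PATH of the user running it; the detached named screen session is an assumption about how it was started:
 # start the git cache refresh in a named, detached screen session under /mnt/git
 cd /mnt/git
 screen -dmS git-cache-update git-cache-update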
April 22
- 23:57 bd808: cherry-picked and applied https://gerrit.wikimedia.org/r/#/c/205968 (remove redis from logstash)
- 23:33 bd808: reset deployment-salt:/var/lib/git/operations/puppet HEAD to production; forced update with upstream; re-cherry-picked I46e422825af2cf6f972b64e6d50040220ab08995
- 23:28 bd808: deployment-salt:/var/lib/git/operations/puppet in detached HEAD state; looks to be for cherry pick of I46e422825af2cf6f972b64e6d50040220ab08995 ?
- 21:40 thcipriani: restarted mariadb on deployment-db{1,2}
- 20:20 thcipriani: gzipped /var/log/pacct.0 on deployment-bastion
- 19:50 hashar: zuul/jenkins are back up (blame Jenkins)
- 19:40 hashar: reenabling Jenkins gearman client
- 19:30 hashar: Gearman went back. Reenabling Jenkins as a Gearman client
- 19:27 hashar: Zuul gearman is stalled. Disabling Jenkins gearman client to free up connections
- 17:58 Krinkle: Creating integration-slave-trusty-1021 per T96629 (using ci1.medium type)
- 14:34 hashar: beta: failures on instances are due to them being moved to different openstack compute nodes (virt***)
- 13:51 jzerebecki: integration-slave-trusty-1015:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-qunit/src/node_modules
- 12:48 hashar: beta: Andrew B. starting to migrate beta cluster instances on new virt servers
- 11:34 hashar: integration: apt-get upgrade on integration-slave-trusty* instances
- 11:31 hashar: integration: Zuul package has been uploaded for Trusty! Deleting the .deb from /home/hashar/
April 21
- 09:27 hashar: Nodepool created its first instance ever! :)
- 01:51 legoktm: deploying https://gerrit.wikimedia.org/r/205494
April 20
- 23:34 legoktm: deploying https://gerrit.wikimedia.org/r/205465
- 19:20 legoktm: mediawiki-extensions-hhvm workspace on integration-slave-trusty-1011 had bad lock file, wiping
- 16:10 hashar: deployment-salt kill -9 of puppetmaster processes
- 16:08 hashar: deployment-salt: killed git-sync-upstream; its netcat to labmon1001.eqiad.wmnet 8125 was eating all memory
- 16:04 hashar: beta: manually rebasing operations/puppet on deployment-salt . Might have killed some live hack in the process :/
- 13:58 hashar: In Gerrit, hidden integration/jenkins-job-builder-config and integration/zuul-config historical repositories. Suggested by addshore on {{bug:T96522}}
- 03:39 legoktm: deploying https://gerrit.wikimedia.org/r/205174
April 19
- 06:12 legoktm: deploying https://gerrit.wikimedia.org/r/205076
April 18
- 05:18 legoktm: deploying https://gerrit.wikimedia.org/r/204995
- 03:09 Krinkle: Finished set up of integration-slave-trusty-1017. Pooled.
April 17
- 17:52 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/204812
- 17:45 Krinkle: Creating integration-slave-trusty-1017
- 16:29 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/204791
- 16:00 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/204783
- 12:42 hashar: restarting Jenkins
- 12:38 hashar: Switching zuul on lanthanum.eqiad.wmnet to the Debian package version
- 12:14 hashar: Switching Zuul scheduler on gallium.wikimedia.org to the Debian package version
- 12:12 hashar: Jenkins: enabled plugin "ZMQ Event Publisher" and publishing all jobs result on TCP port 8888
- 05:37 legoktm: deploying https://gerrit.wikimedia.org/r/204706
- 01:11 Krinkle: Repool integration-slave-precise-1013 and integration-slave-trusty-1015 (live hack with libeatmydata enabled for mysql; T96308)
April 16
- 22:08 Krinkle: Rebooting integration-slave-precise-1013 (depooled; experimenting with libeatmydata)
- 22:07 Krinkle: Rebooted integration-slave-trusty-1015 (experimenting with libeatmydata)
- 18:31 Krinkle: Rebooting integration-slave-precise-1012 and integration-slave-trusty-1012
- 17:57 Krinkle: Repooled instances. Conversion of mysql.datadir to tmpfs complete.
- 17:22 Krinkle: Gracefully depool integration slaves to deploy https://gerrit.wikimedia.org/r/#/c/204528/ (T96230)
- 14:35 thcipriani: running dpkg --configure -a on deployment-bastion to correct puppet failures
April 15
- 23:21 Krinkle: beta-update-databases-eqiad stuck waiting for executors on a node that has plenty of executors available
- 21:15 hashar: Jenkins browser test jobs sometime deadlock because of the IRC notification plugin https://phabricator.wikimedia.org/T96183
- 20:34 hashar: hard restarting Jenkins
- 19:24 Krinkle: Aborting browser tests jobs. Stuck for over 5 hours.
- 19:24 Krinkle: Aborting beta-scap-eqiad. Has been stuck for 2 hours on "Notifying IRC" after "Connection time out" from scap.
- 08:22 hashar: restarted Jenkins
- 08:20 hashar: Exception in thread "RequestHandlerThread[#2]" java.lang.OutOfMemoryError: Java heap space
- 08:16 hashar: Jenkins process went wild, keeping all CPUs busy on gallium
April 14
- 20:43 legoktm: starting SULF on beta cluster
- 20:42 marktraceur: stopping all beta jobs, aborting running (and stuck) beta DB update, kicking bastion, to try and get beta to update
- 19:49 Krinkle: All systems go.
- 19:48 Krinkle: Jenkins configuration panel won't load ("Loading..." stays indefinitely; "Uncaught TypeError: Cannot convert to object at prototype.js:195")
- 19:46 Krinkle: Jenkins restarted. Relaunching Gearman
- 19:42 Krinkle: Jenkins still unable to obtain Gearman connection. (HTTP 503 error from /configure). Have to force restart Jenkins.
- 19:42 Krinkle: deployment-bastion jobs were stuck. marktraceur cancelled queue and relaunched slave. Now processing again.
- 15:27 Krinkle: puppetmaster: Re-apply I05c49e5248cb operations/puppet patch to re-fix T91524. Somehow the patch got lost.
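  Roughly how a lost cherry-pick gets re-applied on a labs puppetmaster (the clone path mirrors the deployment-salt layout noted elsewhere in this log; the change ref is a placeholder to look up in Gerrit for I05c49e5248cb):
      cd /var/lib/git/operations/puppet
      sudo git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/NN/NNNNNN/P   # placeholder ref
      sudo git cherry-pick FETCH_HEAD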
- 08:46 hashar: does qa-morebots work?
April 13
- 20:14 Krinkle: Restarting Zuul, Jenkins and aborting all builds. Everything got stuck following NFS outage in labs
- 19:28 Krinkle: Restarting Zuul, Jenkins and aborting all builds. Everything crashed following NFS outage in labs
- 17:01 legoktm: deploying https://gerrit.wikimedia.org/r/203858
- 13:56 Krinkle: Delete old integration-slave1001...1004 (T94916)
- 10:43 hashar: reducing number of executors on Precise instances from 5 to 4 and on Trusty instances from 6 to 4. The Jenkins scheduler tends to assign the unified jobs to the same slave, which overloads a single slave while others idle.
- 10:43 hashar: reducing number of executors from 5 to 4
- 08:46 hashar: jenkins removed #wikimedia-qa IRC channel from the global configuration
- 08:42 hashar: kill -9 jenkins because it was stuck in some deadlock related to the IRC plugin :(
- 08:34 zeljkof: restarting stuck Jenkins
April 12
- 23:58 bd808: sudo ln -s /srv/l10nupdate/mediawiki /var/lib/l10nupdate/mediawiki on deployment-bastion
- 23:11 greg-g: 0 bytes left on /var on deployment-bastion
April 11
- 23:13 legoktm: deploying https://gerrit.wikimedia.org/r/203628
- 22:58 legoktm: deploying https://gerrit.wikimedia.org/r/203619 & https://gerrit.wikimedia.org/r/203626
- 06:13 legoktm: deployed https://gerrit.wikimedia.org/r/203520
- 05:49 legoktm: deploying https://gerrit.wikimedia.org/r/203519 https://gerrit.wikimedia.org/r/203516 https://gerrit.wikimedia.org/r/203518
April 10
- 13:50 Krinkle: Pool integration-slave-precise-1012..integration-slave-precise-1014
- 11:43 hashar: Filed https://phabricator.wikimedia.org/T95675 to migrate "Global-Dev Dashboard Data" to JJB/Zuul
- 11:40 Krinkle: Deleting various jobs from Jenkins that can be safely deleted (no longer in jjb-config). Will report the others to T91410 for inspection.
- 11:29 Krinkle: Fixed job "Global-Dev Dashboard Data" to be restricted to node "gallium" because it fails to connect to gp.wmflabs.org from lanthanum in about half of the builds.
- 11:26 Krinkle: Re-established Gearman connection from Jenkins
- 11:20 Krinkle: Jenkins unable to re-establish Gearman connection. Full restart.
- 10:39 Krinkle: Deleting the old integration1401...integration1405 instances. They've been depooled for 24h and their replacements are OK. This is to free up quota to create new Precise instances.
- 10:35 Krinkle: Creating integration-slave-precise-1012...integration-slave-precise-1014
- 10:31 Krinkle: Pool integration-slave-precise-1011
- 09:02 hashar: integration: Refreshed Zuul packages under /home/hashar
- 08:57 Krinkle: Fixed puppet failure for missing Zuul package on integration-dev by applying patch-integration-slave-trusty.sh
April 9
- 19:50 legoktm: deployed https://gerrit.wikimedia.org/r/202932
- 17:20 Krinkle: Creating integration-slave-precise-1011
- 17:11 Krinkle: Depool integration-slave1402...integration-slave1405
- 16:52 Krinkle: Pool integration-slave-trusty-1011...integration-slave-trusty-1016
- 16:00 hashar: integration-slave-jessie-1001 recreated. Applying role::ci::slave::labs to it, which should also bring in the package builder role under /mnt/pbuilder
- 15:32 thcipriani: added mwdeploy_rsa to keyholder agent.sock via chmod 400 /etc/keyholder.d/mwdeploy_rsa && SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa && chmod 440 /etc/keyholder.d/mwdeploy_rsa; permissions in puppet may be wrong?
- 14:24 hashar: deleting integration-slave-jessie-1001: extended disk is too small
- 13:14 hashar: integration-zuul-packaged applied role::labs::lvm::srv
- 13:01 hashar: integration-zuul-packaged applied zuul::merger and zuul::server
- 12:59 Krinkle: Creating integration-slave-trusty-1011 - integration-slave-trusty-1016
- 12:40 hashar: spurts out Permission denied (publickey).
- 12:39 hashar: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ is still broken :-(
- 12:31 hashar: beta: reset hard of operations/puppet repo on the puppetmaster since it has been stalled for 9+days https://phabricator.wikimedia.org/T95539
- 10:46 hashar: repacked extensions in deployment-bastion staging area: find /mnt/srv/mediawiki-staging/php-master/extensions -maxdepth 2 -type f -name .git -exec bash -c 'cd `dirname {}` && pwd && git repack -Ad && git gc' \;
- 10:31 hashar: deployment-bastion has a lock file remaining /mnt/srv/mediawiki-staging/php-master/extensions/.git/refs/remotes/origin/master.lock
- 09:55 hashar: restarted Zuul to clear out some stalled jobs
- 09:35 Krinkle: Pooled integration-slave-trusty-1010
- 08:59 hashar: rebooted deployment-bastion and cleared some files under /var/
- 08:51 hashar: deployment-bastion is out of disk space on /var/ :(
- 08:50 hashar: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ timed out after 30 minutes while trying to git pull
- 08:50 hashar: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ job stalled for some reason
- 06:15 legoktm: deploying https://gerrit.wikimedia.org/r/202998
- 06:02 legoktm: deploying https://gerrit.wikimedia.org/r/202992
- 05:11 legoktm: deleted core dumps from integration-slave1002, /var had filled up
- 04:36 legoktm: deploying https://gerrit.wikimedia.org/r/202938
- 00:32 legoktm: deploying https://gerrit.wikimedia.org/r/202279
April 8
- 21:56 legoktm: deploying https://gerrit.wikimedia.org/r/202930
- 21:15 legoktm: deleting non-existent jobs' workspaces on labs slaves
- 19:09 Krinkle: Re-establishing Gearman-Jenkins connection
- 19:00 Krinkle: Restarting Jenkins
- 19:00 Krinkle: Jenkins Master unable to re-establish Gearman connection
- 19:00 Krinkle: Zuul queue is not being distributed properly. Many slaves are idling waiting to receive builds but not getting any.
- 18:29 Krinkle: Another attempt at re-creating the Trusty slave pool (T94916)
- 18:07 legoktm: deploying https://gerrit.wikimedia.org/r/202289 and https://gerrit.wikimedia.org/r/202445
- 18:01 Krinkle: Jobs for Precise slaves are not starting. Stuck in Zuul as 'queued'. Disconnected and restarted slave agent on them. Queue is back up now.
- 17:36 legoktm: deployed https://gerrit.wikimedia.org/r/180418
- 13:32 hashar: Disabled Zuul install based on git clone / setup.py by cherry picking https://gerrit.wikimedia.org/r/#/c/202714/ . Installed the Zuul debian package on all slaves
- 13:31 hashar: integration: running apt-get upgrade on Trusty slaves
- 13:30 hashar: integration: upgrading python-gear and python-six on Trusty slaves
- 12:43 hasharLunch: Zuul is back and it is nasty
- 12:24 hasharLunch: killed zuul on gallium :/
April 7
- 16:26 Krinkle: git-deploy: Deploying integration/slave-scripts 4c6f541
- 12:57 hashar: running apt-get upgrade on integration-slave-trusty* hosts
- 12:45 hashar: recreating integration-slave-trusty-1005
- 12:26 hashar: deleting integration-slave-trusty-1005: it was provisioned with role::ci::website instead of role::ci::slave::labs
- 12:11 hashar: retriggering a bunch of browser tests hitting beta.wmflabs.org
- 12:07 hashar: Puppet being fixed, it is finishing the installation of integration-slave-trusty-*** hosts
- 12:03 hashar: Browser tests against beta cluster were all failing due to an improper DNS resolver being applied on CI labs instances bug T95273. Should be fixed now.
- 12:00 hashar: running puppet on all integration machines and resigning puppet client certs
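  One way the per-node resigning can be done with the Puppet 3 tooling (the slave name is illustrative):
      # on integration-puppetmaster: remove the node's old certificate
      sudo puppet cert clean integration-slave-trusty-1011.eqiad.wmflabs
      # on the slave: drop local SSL state and submit a new signing request
      sudo rm -rf /var/lib/puppet/ssl
      sudo puppet agent --test
      # back on the puppetmaster: sign the new request
      sudo puppet cert sign integration-slave-trusty-1011.eqiad.wmflabs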
- 11:31 hashar: integration-puppetmaster is back and operational with local puppet client working properly.
- 11:28 hashar: restored /etc/puppet/fileserver.conf
- 11:08 hashar: dishing out puppet SSL configuration on all integration nodes. Can't figure it out, so let's restart from scratch
- 10:52 hashar: made puppetmaster certname = integration-puppetmaster.eqiad.wmflabs instead of the ec2 id :(
- 10:49 hashar: manually hacking integration-puppetmaster /etc/puppet/puppet.conf config file which is missing the [master] section
- 09:37 hashar: integration project has been switched to a new labs DNS resolver ( https://lists.wikimedia.org/pipermail/labs-l/2015-April/003585.html ) . It is missing the dnsmasq hack to resolve beta cluster URLs to the instance IP instead of the public IP. Causes a wide range of jobs to fail.
- 01:25 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/202300
April 6
- 23:19 bd808: Updated scap to f9b9a82 (Remove exotic unicode from ascii logo)
- 22:34 legoktm: deployed https://gerrit.wikimedia.org/r/202229
- 20:55 legoktm: deploying https://gerrit.wikimedia.org/r/202233
- 20:46 legoktm: deploying https://gerrit.wikimedia.org/r/202225
- 17:37 legoktm: deploying https://gerrit.wikimedia.org/r/201032
- 12:38 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/201984 https://gerrit.wikimedia.org/r/202020 https://gerrit.wikimedia.org/r/202026
- 04:20 legoktm: deploying https://gerrit.wikimedia.org/r/201669
April 5
- 11:13 Krinkle: New integration-slave-trusty-1001..1005 must remain unpooled. Provisioning failed. details at https://phabricator.wikimedia.org/T94916#1180522
- 10:48 Krinkle: Puppet on integration-puppetmaster has been failing for the past 2 days: "Failed when searching for node i-0000063a.eqiad.wmflabs: You must set the 'external_nodes' parameter to use the external node terminus" (= integration-dev.eqiad.wmflabs)
- 10:22 Krinkle: Creating integration-slave-trusty-1001-1005 per T94916.
April 3
- 23:47 greg-g: for Krinkle 23:31 "Finished npm upgrade on trusty slaves."
- 23:08 Krinkle: Finished npm upgrade on precise slaves. Rolling trusty slaves now.
- 22:55 bd808: Updated scap to a1a5235 (Add a logo banner to scap)
- 21:31 Krinkle: Upgrading npm from v2.4.1 to v2.7.6 (rolling, slave by slave graceful)
- 21:11 ^d: puppet re-enabled on staging-palladium, running fine again
- 21:05 Krinkle: Delete unfinished/unpooled instances integration-slave-precise-1011-1014. (T94916)
- 14:49 hashar: integration-slave-jessie-1001 : manually installed jenkins-debian-glue Debian packages. It is pending upload by ops to apt.wikimedia.org bug T95006
- 12:56 hashar: installed zuul_2.0.0-304-g685ca22-wmf1precise1_amd64.deb on integration-slave-precise-101* instances
- 12:56 hashar: installed zuul_2.0.0-304-g685ca22-wmf1precise1_amd64.deb on integration-slave-precise-1011.eqiad.wmflabs
- 12:35 hashar: Switching Jessie slave from role::ci::slave::labs::common to role::ci::slave::labs which will bring a whole lot of packages and break
- 12:28 hashar: integration-slave-jessie-1001 applying role::ci::slave::labs::common to pool it as a very basic Jenkins slave
- 12:19 hashar: enabled puppetmaster::autosigner on integration-puppetmaster
- 11:58 hashar: Applied role::ci::slave::labs on integration-slave-precise-101[1-4] that Timo created earlier
- 11:58 hashar: Cherry picked a couple patches to fix puppet Package[] definitions issues
- 11:49 hashar: made integration-puppetmaster self-update its puppet clone
- 11:42 hashar: recreating integration-slave-precise-1011 stalled with a puppet oddity related to Package['gdb'] defined twice bug T94917
- 11:30 hashar: integration-puppetmaster migrated down to Precise
- 11:23 hashar: rebooting integration-publisher: can't ssh to it
- 10:37 hashar: disabled some hiera configuration related to puppetmaster.
- 10:22 hashar: Created instance i-00000a4a with image "ubuntu-12.04-precise" and hostname i-00000a4a.eqiad.wmflabs.
- 10:21 hashar: downgrading integration-puppetmaster from Trusty to Precise https://phabricator.wikimedia.org/T94927
- 05:42 legoktm: deploying https://gerrit.wikimedia.org/r/200744
- 03:58 Krinkle: Jobs were throwing NOT_RECOGNISED. Relaunched Gearman. Jobs are now happy again.
- 03:51 Krinkle: Jenkins is unable to re-establish Gearman connection. Have to force restart Jenkins master.
- 03:42 Krinkle: Reloading Jenkins config repaired the broken references. However Jenkins is still unable to make new references properly. New builds are 404'ing the same way.
- 03:26 Krinkle: Reloading Jenkins configuration from disk
- 03:18 Krinkle: Build metadata exists properly at /var/lib/jenkins/jobs/:jobname/builds/:nr, but the "last*Build" symlinks are outdated.
- 03:12 Krinkle: As of 03:03, recent builds are mysteriously missing their entry in Jenkins. They show up on the dashboard when running, but their build log is never published (URL is 404). E.g. https://integration.wikimedia.org/ci/job/integration-docroot-deploy/105 and https://integration.wikimedia.org/ci/job/jshint/239
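  One way to inspect the stale "last*Build" symlinks by hand (job name and build number taken from the entries above; reloading the configuration from disk, as done at 03:26, is the usual fix):
      cd /var/lib/jenkins/jobs/jshint/builds
      ls -l last*Build                      # see which build numbers the links still point at
      sudo ln -sfn 239 lastSuccessfulBuild  # repoint to the newest build directory if needed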
- 02:47 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/201644
- 00:31 greg-g: rm'd .gitignore in /srv/mediawiki-staging/php-master/skins due to https://gerrit.wikimedia.org/r/#/c/200307/ clashing with a local untracked version
April 2
- 22:56 Krinkle: New integration-slave-precise-101x are unfinished and must remain depooled. See T94916.
- 22:53 Krinkle: Most puppet failures blocking T94916 may be caused by the fact that integration-puppetmaster was inadvertently changed to Trusty; the Trusty puppetmaster version is not yet supported by ops
- 21:41 Krinkle: It seems integration-slave-jessie-1001 has role::ci::slave::labs::common instead of role::ci::slave::labs. Intentional?
- 21:25 Krinkle: Re-creating integration-dev-slave-precise in preparation of re-creating precise slaves
- 14:51 hashar: applying role::ci::slave::labs::common on integration-slave-jessie-1001
- 14:49 hashar: integration: nice thing, newly created instances are automatically made to point to integration-puppetmaster via hiera! Just have to sign the certificate on the master using: puppet ca list ; puppet ca sign i-000xxxx.eqiad.wmflabs
- 14:42 hashar: Created integration-slave-jessie-1001 to try out CI slave on Jessie (phab:T94836)
- 14:11 hashar: reduced integration-slave1004 executors from 6 to 5 to make it on par with the other precise slaves
- 14:10 hashar: integration-slave100[1-4] are now using Zuul provided by a Debian package as of https://gerrit.wikimedia.org/r/#/c/195272/ PS 16
- 14:04 hashar: uninstalled the pip-installed zuul version from Precise labs slaves by doing: pip uninstall zuul && rm /usr/local/bin/zuul* . Switching them all to a Debian package
- 13:45 hashar: pooling back integration-slave1001 and 1002 which are using zuul-cloner provided by a debian package
- 13:35 hashar: reloading Jenkins configuration files from disk to make it aware of a change manually applied to most jobs' config.xml files for https://gerrit.wikimedia.org/r/#/c/201451/
- 13:01 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/201458
- 12:19 hashar: preventing jobs from running on integration-slave1001 by replacing its label with 'DoNotLabelThisSlaveHashar'. Going to install the Zuul debian package on it
- 09:37 hashar: rebooting integration-zuul-server; homedir seems to be stalled/missing
- 08:12 hashar: upgrading packages on integration-dev
- 05:14 greg-g: and right when I log'd that, things seem to be recovering
- 05:12 greg-g: the shinken alerts about beta cluster issues are due to wmflabs having issues.
April 1
- 07:17 Krinkle: Creating integration-slave1410 as test. Will re-create our pool later today.
- 06:26 Krinkle: Apply puppetmaster::autosigner to integration-puppetmaster
- 05:51 legoktm: deleting non-existent job workspaces from integration slaves
- 05:42 Krinkle: Free up space on integration-slave1001-1004 by removing obsolete phplint and qunit workspaces
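  The sort of cleanup this usually means, using the workspace path seen elsewhere in this log (globs are illustrative):
      sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/*-phplint*
      sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/*-qunit*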
- 02:05 Krinkle: Restarting Jenkins again..
- 01:35 legoktm: started zuul on gallium
- 01:00 Krinkle: Restarting Jenkins
- 01:00 Krinkle: Jenkins is unable to start Gearman connection (HTTP 503);
- 01:00 Krinkle: Force restarted Zuul, didn't help
- 00:55 Krinkle: Jenkins stuck. Builds are queued in Zuul but nothing is sent to Jenkins.
March 31
- 21:00 greg-g: puppet-compiler02: This node is offline because Jenkins failed to launch the slave agent on it.
- 20:15 legoktm: deploying https://gerrit.wikimedia.org/r/200926
- 18:48 legoktm: DEPLOYING https://gerrit.wikimedia.org/r/200327
- 15:44 thcipriani: primed keyholder on deployment-bastion to ensure jenkins-deploy can ssh
- 12:25 hashar: qa-morebots is back
March 30
- 22:58 legoktm: 1001-1003 were depooled, restarted and repooled. 1004 is depooled and restarted
- 22:40 legoktm: rebooting precise jenkins slaves
- 21:40 greg-g: Beta Cluster is down due to WMF Labs issues, being taken care of now (by Coren and Yuvi)
- 19:53 legoktm: deleted core dumps from integration-slave1001
- 19:11 legoktm: deploying https://gerrit.wikimedia.org/r/200646
- 16:29 jzerebecki: another damaged git repo integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-qunit/src/vendor/
- 16:07 jzerebecki: removing workspaces of deleted jobs integration-slave100{1,2,3,4}:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-{client,repo,repo-api}-tests{,@*}
- 15:14 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-repo-api-tests-sqlite
- 15:05 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-repo-api-tests-mysql/src/extensions/cldr
- 14:36 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-*-tests{,@*}
- 13:06 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-client-tests@*
- 13:05 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-client-tests
March 29
- 07:29 legoktm: deploying https://gerrit.wikimedia.org/r/#/c/200333/
- 07:07 legoktm: deploying https://gerrit.wikimedia.org/r/#/c/200332/
- 03:51 legoktm: deploying https://gerrit.wikimedia.org/r/200330
- 03:09 legoktm: deploying https://gerrit.wikimedia.org/r/#/c/200329/
- 00:10 legoktm: deploying https://gerrit.wikimedia.org/r/#/c/200323/
March 28
- 04:02 bd808: manually updated beta-code-update-eqiad job to remove sudo to mwdeploy; needs associated jjb change for T94261
March 27
- 23:28 bd808: applied beta::autoupdater directly to deployment-bastion via wikitech interface
- 23:21 bd808: Duplicate declaration: Git::Clone[operations/mediawiki-config] is already declared in file /etc/puppet/modules/beta/manifests/autoupdater.pp:46; cannot redeclare at /etc/puppet/modules/scap/manifests/master.pp:22
- 23:01 bd808: restarted puppetmaster
- 22:52 hashar: integration: jzerebecki addition and sudo policy tracked for history purpose as bug T94280
- 22:52 bd808: chown -R l10nupdate:wikidev /srv/mediawiki-staging/php-master/cache/l10n
- 22:44 bd808: deployment-bastion: chown -R jenkins-deploy:wikidev /srv/mediawiki-staging/
- 22:41 bd808: forcing puppet run on deployment-bastion
- 22:41 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/200248/ and https://gerrit.wikimedia.org/r/#/c/199988/
- 22:40 hashar: integration: created sudo policy allowing members to run any command as jenkins-deploy on all hosts.
- 22:40 hashar: added jzerebecki to the integration labs project as a normal member
- 22:34 hashar: integration-slave1001 rm -fR mwext-Wikibase-repo-api-tests/src/vendor
- 21:13 greg-g: things be better
- 20:56 greg-g: Beta Cluster is down, known
- 18:50 marxarelli: running `jenkins-jobs update` to update 'browsertests-UploadWizard-*' with Id33ffde07f0c15e153d52388cf130be4c59b4559
- 17:50 legoktm: deleted core dumps from integration-slave1002
- 17:48 legoktm: marked integration-slave1002 as offline, /var filled up
- 05:42 legoktm: marked integration-slave1001 as offline due to https://phabricator.wikimedia.org/T94138
March 26
- 23:47 legoktm: deploying https://gerrit.wikimedia.org/r/200069
- 19:22 bd808: Manually added missing !log entries from 2015-03-25 from my bouncer logs
- 17:14 greg-g: jobs appear to be processing according to zuul, the Jenkins UI just takes forever to load, apparently
- 17:12 greg-g: "Please wait while Jenkins is getting ready to work"
- 17:08 greg-g: 0:07 < robh> kill -9 and restarted per instructions
- 16:53 greg-g: Still.... "Please wait while Jenkins is restarting..."
- 16:49 greg-g: "Please wait while Jenkins is restarting..."
- 16:39 greg-g: going to do a safe-restart of Jenkins https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Restart_all_of_Jenkins
- 16:38 greg-g: nothing executing on deployment-bastion, that is
- 16:38 greg-g: same, nothing executing
- 16:37 greg-g: did that checklist once, jobs still not executing, doing again
- 16:32 greg-g: I'll start going through the checklist at https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update
- 16:30 hashar: deadlock on deployment-bastion slave. Someone needs to restart Jenkins :(
- 13:25 hashar: yamllint job fixed by altering the label https://gerrit.wikimedia.org/r/#/c/199876/
- 13:17 hashar: Changes blocked because there is nothing able to run yamllint ( zuul-gearman.py status|grep build:yamllint , shows 8 jobs pending and no worker available)
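  The check quoted above, plus its raw Gearman admin equivalent (assuming the default Gearman port 4730 on the Zuul server and a netcat that honours -q):
      zuul-gearman.py status | grep build:yamllint
      echo status | nc -q 1 localhost 4730 | grep build:yamllint
      # columns: function name, queued jobs, running jobs, registered workers (0 workers = nothing can run the job)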
March 25
- 23:23 bd808: chown -R jenkins-deploy:project-deployment-prep /srv/mediawiki-staging/php-master/cache/gitinfo
- 23:14 bd808: chown -R l10nupdate:project-deployment-prep /srv/mediawiki-staging/php-master/cache/l10n
- 23:04 bd808: chown -R mwdeploy:project-deployment-prep /srv/mediawiki-staging
- 22:58 bd808: File permissions in deployment-bastion:/srv/mediawiki-staging are a mix of mwdeploy:mwdeploy, mwdeploy:project-deployment-prep, and jenkins-deploy:project-deployment-prep
- 21:52 legoktm: deploying https://gerrit.wikimedia.org/r/199736
- 18:49 legoktm: deploying https://gerrit.wikimedia.org/r/196745
- 15:13 bd808: Updated scap to include 4a63a63 (Copy l10n CDB files to rebuildLocalisationCache.php tmp dir)
- 03:44 legoktm: deploying https://gerrit.wikimedia.org/r/199555 and https://gerrit.wikimedia.org/r/199559
- 00:52 Krinkle: Restarted Jenkins-Gearman connection
- 00:50 Krinkle: Jenkins is unable to start Gearman connection (HTTP 503); Restarting Jenkins.
- 00:32 legoktm: disabling/enabling gearman in jenkins
March 24
- 23:32 Krinkle: Force restart Zuul
- 22:25 hashar: marked gallium and lanthanum slaves as temp offline, then back. Seems to have cleared some Jenkins internal state and resumed the build
- 21:55 bd808: Ran trebuchet for scap to keep cherry-pick of I01b24765ce26cf48d9b9381a476c3bcf39db7ab8 on top of active branch; puppet was forcing back to prior trebuchet sync tag
- 21:42 hashar: Reconfigured mediawiki-core-code-coverage
- 21:22 hashar: Zuul gate is deadlocked for up to half an hour due to change being force merged :(
- 21:15 hashar: beta: deleted untracked file /srv/mediawiki-staging/php-master/extensions/.gitignore . That fixed the Jenkins job https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/
- 20:31 twentyafterfour: sudo ln -s /srv/l10nupdate/ /var/lib/
- 20:31 twentyafterfour: sudo mv /var/lib/l10nupdate/ /srv/
- 20:28 bd808: deployment-bastion -- rm -r pacct.1.gz pacct.2.gz pacct.3.gz pacct.4.gz pacct.5.gz pacct.6.gz
- 20:24 bd808: Deleted junk in deployment-bastion:/tmp
- 18:57 legoktm: deploying https://gerrit.wikimedia.org/r/199305
- 18:25 legoktm: deploying https://gerrit.wikimedia.org/r/199216
- 17:06 legoktm: deploying https://gerrit.wikimedia.org/r/199273
- 11:23 hashar: beta-scap-eqiad keeps regenerating l10n cache https://phabricator.wikimedia.org/T93737
- 08:35 hashar: restarting Jenkins for some plugins upgrades
- 08:07 legoktm: deployed https://gerrit.wikimedia.org/r/199190
- 07:21 legoktm: deploying https://gerrit.wikimedia.org/r/199205
- 07:17 legoktm: deploying https://gerrit.wikimedia.org/r/199204
- 07:08 legoktm: deploying https://gerrit.wikimedia.org/r/199201
- 06:46 legoktm: freed ~6G on lanthanum by deleting mediawiki-extensions-zend* workspaces
- 05:04 legoktm: deleting workspaces of jobs that no longer exist in jjb on lanthanum
- 04:11 legoktm: deploying https://gerrit.wikimedia.org/r/198792
- 03:14 Krinkle: Deleting old job workspaces on gallium not touched since 2013
- 02:42 Krinkle: Restarting Zuul, wikimedia-fundraising-civicrm is stuck as of 46min ago waiting for something already merged
- 02:32 legoktm: toggling gearman off/on in jenkins
- 01:47 twentyafterfour: deployed scap/scap-sync-20150324-014257 to beta cluster
- 00:23 Krinkle: Restarted Zuul
March 23
- 23:18 hasharDinner: Stopping Jenkins for an upgrade
- 23:16 legoktm: deleting mwext-*-lint* workspaces on gallium, shouldn't be needed
- 23:11 legoktm: deleting mwext-*-qunit* workspaces on gallium, shouldn't be needed
- 23:07 legoktm: deleting mwext-*-lint workspaces on gallium, shouldn't be needed
- 23:00 legoktm: lanthanum is now online again, with 13G free disk space
- 22:58 legoktm: deleting mwext-*-qunit* workspaces on lanthanum, shouldn't be needed any more
- 22:54 legoktm: deleting mwext-*-qunit-mobile workspaces on lanthanum, shouldn't be needed any more
- 22:48 legoktm: deleting mwext-*-lint workspaces on lanthanum, shouldn't be needed any more
- 22:45 legoktm: took lanthanum offline in jenkins
- 20:59 bd808: Last log copied from #wikimedia-labs
- 20:58 bd808: 20:41 cscott deployment-prep updated OCG to version 11f096b6e45ef183826721f5c6b0f933a387b1bb
- 19:28 YuviPanda: created staging-rdb01.eqiad.wmflabs
- 19:19 YuviPanda: disabled puppet on staging-palladium to test a puppet patch
- 18:41 legoktm: deploying https://gerrit.wikimedia.org/r/198762
- 13:11 hashar: and I restarted qa-morebots a minute or so ago (see https://wikitech.wikimedia.org/wiki/Morebots#Example:_restart_the_ops_channel_morebot )
- 13:11 hashar: Jenkins: deleting unused jobs mwext-.*-phpcs-HEAD and mwext-.*-lint
March 21
- 17:53 legoktm: deployed https://gerrit.wikimedia.org/r/198503
- 00:02 Krinkle: Reestablished Jenkins-Gearman connection
March 20
- 23:08 marxarelli: Reloading Zuul to deploy I693ea49572764c96f5335127902404167ca86487
- 22:50 marxarelli: Running `jenkins-jobs update` to create job mediawiki-vagrant-bundle17-yard-publish
- 19:00 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/198276
- 17:17 Krinkle: Reloading Zuul to deploy I5edff10a4f0
- 12:32 mobrovac: deployment-salt ops/puppet: un-cherry-picked I48b1a139b02845c94c85cd231e54da67c62512c9
- 12:30 mobrovac: deployment-prep disabled puppet on deployment-restbase[1,2] until https://gerrit.wikimedia.org/r/#/c/197662/ is merged
- 08:36 mobrovac: deployment-salt ops/puppet: cherry-picking I48b1a139b02845c94c85cd231e54da67c62512c9
- 04:57 legoktm: deployed https://gerrit.wikimedia.org/r/198184
- 00:21 legoktm: deployed https://gerrit.wikimedia.org/r/198161
- 00:14 legoktm: deployed https://gerrit.wikimedia.org/r/198160
March 19
- 23:59 legoktm: deployed https://gerrit.wikimedia.org/r/198154
- 21:48 hashar: Jenkins: depooled/repooled lanthanum slave, it was no longer processing any jobs.
- 14:09 hashar: Further updated our JJB fork to upstream commit 4bf020e07, which is version 1.1.0-3
- 13:22 hashar: refreshed our JJB fork 7ad4386..8928b66 . No difference in our jobs.
- 11:25 hashar: refreshing configuration of all beta* jenkins jobs
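  Roughly the jenkins-jobs invocation that implies, run from the CI config checkout (path and glob are illustrative):
      jenkins-jobs update jjb/ 'beta-*'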
- 06:18 legoktm: deployed https://gerrit.wikimedia.org/r/197860 & https://gerrit.wikimedia.org/r/197858
- 05:20 legoktm: deleting 'mediawiki-ruby-api-bundle-*' 'mediawiki-selenium-bundle-*' 'mwext-*-bundle-*' jobs
- 05:06 legoktm: deployed https://gerrit.wikimedia.org/r/197853
- 00:57 Krinkle: Reloading Zuul to deploy Ie1d7bf114b34f9
March 18
- 17:52 legoktm: deployed https://gerrit.wikimedia.org/r/197674 and https://gerrit.wikimedia.org/r/197675
- 17:27 legoktm: deployed https://gerrit.wikimedia.org/r/197651
- 15:20 hashar: setting gallium # of executors from 5 back to 3. When jobs run on it, they slow down the Zuul scheduler and merger!
- 15:06 legoktm: deployed https://gerrit.wikimedia.org/r/194990
- 02:02 bd808: Updated scap to I58e817b (Improved test for content preceeding <?php opening tag)
- 01:48 marxarelli: memory usage, swap, io wait seem to be back to normal on deployment-salt and kill/start of puppetmaster
- 01:45 marxarelli: kill 9'd puppetmaster processes on deployment-salt after repeated attempts to stop
- 01:28 marxarelli: restarting salt master on deployment-salt
- 01:20 marxarelli: deployment-salt still unresponsive, lots of io wait (94%) + swapping
- 00:32 marxarelli: seeing heavy swapping on deployment-salt; puppet processes using 250M+ memory each
March 17
- 21:42 YuviPanda: recreated staging-sca01, let’s wait and see if it just automagically configures itself :)
- 21:40 YuviPanda: deleted staging-sca01 because why not :)
- 17:52 Krinkle: Reloading Zuul to deploy I206c81fe9bb88feda6
- 16:28 bd808: Updated scap to include I61dcf7ae6d52a93afc6e88d3481068f09a45736d (Run rebuildLocalisationCache.php as www-data)
- 16:25 bd808: chown -R trebuchet:wikidev && chmod -R g+rwX deployment-bastion:/srv/deployment/scap/scap
- 16:16 YuviPanda: created staging-sca01
- 14:39 hashar: me versus debian packaging tool chain http://xkcd.com/1168/
- 09:24 hashar: deleted operations-puppet-validate
- 09:21 hashar: deleted mwext-Wikibase-lint job, not triggered anymore
March 16
- 21:55 legoktm: deployed https://gerrit.wikimedia.org/r/197213
- 21:25 legoktm: deployed https://gerrit.wikimedia.org/r/#/c/196095/
- 18:50 legoktm: deployed https://gerrit.wikimedia.org/r/197109
- 18:38 legoktm: deployed https://gerrit.wikimedia.org/r/196743 & https://gerrit.wikimedia.org/r/196746
- 18:24 legoktm: deleted rcstream-* jobs
- 18:11 legoktm: deployed https://gerrit.wikimedia.org/r/197094
- 10:02 hashar: restarting Jenkins
- 02:00 legoktm: deleting all 'mwext-*-composer-*' jobs that should never have been used
March 15
- 07:39 legoktm: deleting non-generic, unused *-rubylint1.9.3lint & *-ruby2.0lint jobs
- 00:56 Krinkle: Reload Zuul to deploy Idb2f15a94a67
March 14
- 03:52 legoktm: deployed https://gerrit.wikimedia.org/r/196540
March 13
- 01:49 legoktm: deleted a bunch of unused *-tox-* jobs
- 01:03 legoktm: deployed https://gerrit.wikimedia.org/r/191063 & https://gerrit.wikimedia.org/r/196505
- 00:17 Krinkle: Reloading Zuul to deploy I46c60d520
March 12
- 23:34 Krinkle: Depooling integration-slave1402 to play with T92351
- 20:26 Krinkle: Re-established Gearman connection from Zuul due to deadlock
- 17:39 YuviPanda: killed deployment-rsync01, wasn't being used for anything discernible, and that's not how proxies work in prod
- 15:31 Krinkle: Reloading Zuul to deploy Ia289ebb0
- 15:22 Krinkle: Fix Jenkins UI (was stuck in German)
- 15:05 YuviPanda: jenkins loves german again
- 07:11 YuviPanda: scap still failing on beta, I'll check when I'm back from lunch
- 07:11 YuviPanda: rebooted puppetmaster, was dead
March 11
- 19:47 legoktm: deployed https://gerrit.wikimedia.org/r/195990
- 15:11 Krinkle: Jenkins UI in German, again
- 14:05 Krinkle: Jenkins web dashboard is in German
- 11:02 hashar: created integration-zuul-packaged.eqiad.wmflabs to test out the Zuul debian package
- 09:07 hashar: Deleted refs/heads/labs branch in integration/zuul.git
- 09:01 hashar: https://gerrit.wikimedia.org/r/#/c/195287/
- 09:01 hashar: made Zuul clone on labs to use the master branch instead of the labs one. There is no point in keeping separate ones anymore
March 10
- 15:22 apergos: after update of salt in deployment-prep, git deploy restart is likely broken. details: https://phabricator.wikimedia.org/T92276
- 14:50 Krinkle: Browsertest job was stuck for > 10hrs. Jobs should not be allowed to run that long.
March 9
- 23:57 legoktm: deployed https://gerrit.wikimedia.org/r/195486
- 22:49 Krinkle: Reloading Zuul to deploy I229d24c57d90ef
- 20:37 legoktm: doing the gearman shuffle dance thing
- 19:42 Krinkle: Reloading Zuul to deploy I48cb4db87
- 19:35 Krinkle: Delete integration-slave1010
- 19:31 Krinkle: Restarted slave agent on gallium
- 19:30 Krinkle: Re-established Gearman connection from Jenkins
March 8
- 17:40 Krinkle: Delete integration-slave1006, integration-slave1007 and integration-slave1008
- 00:06 legoktm: deployed https://gerrit.wikimedia.org/r/195072
March 7
- 22:10 legoktm: deployed https://gerrit.wikimedia.org/r/195069
- 14:44 Krinkle: Depool integration-slave1008 and integration-slave1010 (not deleting yet, just in case)
- 14:43 Krinkle: Depool integration-slave1006 and integration-slave1007 (not deleting yet, just in case)
- 14:41 Krinkle: Pool integration-slave1404
- 14:35 Krinkle: Reloading Zuul to deploy I864875aa4acc
- 06:28 Krinkle: Reloading Zuul to deploy I8d7e0bd315c4fc2
- 04:53 Krinkle: Reloading Zuul to deploy I585b7f026
- 04:51 Krinkle: Pool integration-slave1403
- 03:55 Krinkle: Pool integration-slave1402
- 03:31 Krinkle: Reloading Zuul to deploy I30131a32c7f1
- 02:59 James_F: Pushed Ib4f6e9 and Ie26bb17 to grrrit-wm and restarted
- 02:54 Krinkle: Reloading Zuul to deploy Ia82a0d45ac431b5
March 6
- 23:30 Krinkle: Pool integration-slave1401
- 22:24 Krinkle: Re-establishing Gearman connection from Jenkins (deployment-bastion was deadlocked)
- 22:16 Krinkle: beta-scap-eqiad has been waiting for 50 minutes for an executor on deployment-bastion.eqiad (which has 5/5 slots idle)
- 21:36 Krinkle: Provisioning integration-slave1401 - integration-slave1404
- 20:14 legoktm: deployed https://gerrit.wikimedia.org/r/194939 for reals this time
- 20:12 legoktm: deployed https://gerrit.wikimedia.org/r/194939
- 18:22 ^d: staging: set has_ganglia to false in hiera
- 16:57 legoktm: deployed https://gerrit.wikimedia.org/r/194892
- 16:40 Krinkle: Jenkins auto-depooled integration-slave1008 due to low /tmp space. Purged /tmp/npm-* to bring back up.
- 16:27 Krinkle: Delete integration-slave1005
- 09:17 hasharConf: Jenkins: upgrading and restarting. Wish me luck.
- 06:29 Krinkle: Re-creating integration-slave1401 - integration-slave1404
- 02:21 legoktm: deployed https://gerrit.wikimedia.org/r/194340
- 02:12 Krinkle: Pooled integration-slave1405
- 01:52 legoktm: deployed https://gerrit.wikimedia.org/r/194461
March 5
- 22:01 Krinkle: Reloading Zuul to deploy I97c1d639313b
- 21:15 hashar: stopping Jenkins
- 21:08 hashar: killing browser tests running
- 20:48 Krinkle: Re-establishing Gearman connection from Jenkins
- 20:44 Krinkle: Deleting integration-slave1201-integration-slave1204, and integration-slave1401-integration-slave1404.
- 20:18 Krinkle: Finished creation and provisioning of integration-slave1405
- 19:34 legoktm: deploying https://gerrit.wikimedia.org/r/194461, lots of new jobs
- 18:50 Krinkle: Re-creating integration-slave1405
- 17:52 twentyafterfour: pushed wmf/1.25wmf20 branch to submodule repos
- 16:18 greg-g: now there are jobs running on the zuul status page
- 16:16 greg-g: getting "/zuul/status.json: Service Temporarily Unavailable" after the zuul restart
- 16:12 ^d: restarted zuul
- 16:06 greg-g: jenkins doesn't have anything queued and is processing jobs apparently, not sure why zuul is showing two jobs queued for almost 2 hours (one with all tests passing, the other with nothing tested yet)
- 16:04 greg-g: not sure it helped
- 16:02 greg-g: about to disconnect/reconnect gearman per https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues
- 00:34 legoktm: deployed https://gerrit.wikimedia.org/r/194421
March 4
- 17:34 Krinkle: Depooling all new integration-slave12xx and integration-slave14xx instances again (See T91524)
- 17:11 Krinkle: Pooled integration-slave1201, integration-slave1202, integration-slave1203, integration-slave1204
- 17:06 Krinkle: Pooled integration-slave1402, integration-slave1403, integration-slave1404, integration-slave1405
- 16:56 Krinkle: Pooled integration-slave1401
- 16:26 Krinkle: integration-slave12xx and integration-slave14xx are now provisioned. Old slaves will be depooled later and eventually deleted.
March 3
- 22:00 hashar: reboot integration-puppetmaster in case it solves a NFS mount issue
- 20:33 legoktm: manually created centralauth.users_to_rename table
- 18:28 Krinkle: Lots of Jenkins builds are stuck even though they're "Finished". All services look up. (Filed T91430.)
- 17:18 Krinkle: Reloading Zuul to deploy Icad0a26dc8 and Icac172b16
- 15:39 hashar: cancelled logrotate update of all jobs since that seems to kill the Jenkins/Zuul gearman connection. Probably because all jobs are registered on each config change.
- 15:31 hashar: updating all jobs in Jenkins based on PS2 of https://gerrit.wikimedia.org/r/194109
- 10:56 hashar: Created instance i-000008fb with image "ubuntu-14.04-trusty" and hostname i-000008fb.eqiad.wmflabs.
- 10:52 hashar: deleting integration-puppetmaster to recreate it with a new image {bug|T87484} . Will have to reapply I5335ea7cbfba33e84b3ddc6e3dd83a7232b8acfd and I30e5bfeac398e0f88e538c75554439fe82fcc1cf
- 03:47 Krinkle: git-deploy: Deploying integration/slave-scripts 05a5593..1e64ed9
- 01:11 marxarelli: gzip'd /var/log/account/pacct.0 on deployment-bastion to free space
March 2
- 21:35 twentyafterfour: <Krenair> (per #mediawiki-core, have deleted the job queue key in redis, should get regenerated. also cleared screwed up log and restarted job runner service)
- 15:39 Krinkle: Removing /usr/local/src/zuul from integration-slave12xx and integration-slave14xx to let puppet re-install zuul-cloner (T90984)
- 13:39 Krinkle: integration-slave12xx and integration-slave14xx instances still depooled due to T90984
February 27
- 21:58 Krinkle: Ragekilled all queued jobs related to beta and force restarted Jenkins slave agent on deployment-bastion.eqiad
- 21:56 Krinkle: Job beta-update-databases-eqiad and node deployment-bastion.eqiad have been stuck for the past 4 hours
- 21:49 marxarelli: Reloading Zuul to deploy I273270295fa5a29422a57af13f9e372bced96af1 and I81f5e785d26e21434cd66dc694b4cfe70c1fa494
- 18:08 Krenair: Kicked deployment-bastion node in jenkins to try to fix jobs
- 06:42 legoktm: deployed https://gerrit.wikimedia.org/r/193057
- 01:01 Krinkle: Keeping all integration-slave12xx and slave14xx instances depooled.
- 00:53 Krinkle: Finished provisioning of integration-slave12xx and slave14xx instances. Initial testing failed due to "/usr/local/bin/zuul-cloner: No such file or directory"
February 26
- 23:24 Krinkle: integration-puppetmaster /var disk is full (1.8 of 1.9GB) - /var/log/puppet/reports is 1.1GB - purging
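  A sketch of the purge (the seven-day cutoff is illustrative):
      du -sh /var/log/puppet/reports
      sudo find /var/log/puppet/reports -name '*.yaml' -mtime +7 -delete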
- 23:23 Krinkle: Puppet failing on new instances due to "Error 400 on SERVER: cannot generate tempfile `/var/lib/puppet/yaml/"
- 13:27 Krinkle: Provisioning the new integration-slave12xx and integration-slave14xx instances
- 05:05 legoktm: deployed https://gerrit.wikimedia.org/r/192980
- 03:48 Krinkle: Creating integration-slave1201,02,03,04 and integration-slave1401,02,03,04,05 per T74011 (not yet setup/provisioned, keep depooled)
- 03:39 Krinkle: Cleaned up and re-pooled integration-slave1006 (was depooled since yesterday)
- 03:39 Krinkle: Cleaned up and re-pooled integration-slave1007 and integration-slave1008 (was auto-depooled by Jenkins)
- 01:54 Krinkle: integration-slave1007 and integration-slave1008 were auto-depooled due to main disk (/ and its /tmp) being < 900 MB free
- 01:20 legoktm: actually deployed https://gerrit.wikimedia.org/r/192772 this time
- 01:16 legoktm: deployed https://gerrit.wikimedia.org/r/192772
February 25
- 23:55 Krinkle: Re-established Jenkins-Gearman connection
- 23:54 Krinkle: Zuul queue is growing. Nothing is added to its dashboard. Jenkins executors all idle. Gearman deadlock?
- 20:38 legoktm: deployed https://gerrit.wikimedia.org/r/192564
- 20:18 legoktm: deployed https://gerrit.wikimedia.org/r/192267
- 17:22 ^d: reloading zuul to pick up utfnormal jobs
- 02:15 Krinkle: integration-slave1006 has <700MB free disk space (including /tmp)
February 24
- 18:41 marxarelli: Running `jenkins-jobs update` to create browsertests-CentralAuth-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce
- 17:55 Krinkle: It seems xdebug was enabled on integration slaves running trusty. This makes errors in build logs incomprehensible.
February 21
- 03:01 Krinkle: Reloading Zuul to deploy I3bcd3d17cb886740bd67b33b573aa25972ddb574
February 20
- 07:25 Krinkle: Finished setting up integration-slave1010 and added it to Jenkins slave pool
- 00:54 Krinkle: Setting up integration-slave1010 (replacement for integration-slave1009)
February 19
- 23:13 bd808: added Thcipriani to under_NDA sudoers group; WMF staff
- 19:45 Krinkle: Destroying integration-slave1009 and re-imaging
- 19:02 bd808: VICTORY! deployment-bastion jenkins slave unstuck
- 19:01 bd808: toggling gearman plugin in jenkins admin console
- 18:58 bd808: took deployment-bastion jenkins connection offline and online 5 times; gearman plugin still stuck
- 18:41 bd808: cleaned up mess in /tmp on integration-slave1008
- 18:38 bd808: brought integration-slave1007 back online
- 18:37 bd808: cleaned up mess in /tmp on integration-slave1007
- 18:29 bd808: restarting jenkins because I messed up and disabled gearman plugin earlier
- 16:30 bd808: disconnected and reconnected deployment-bastion.eqiad again
- 16:28 bd808: reconnected deployment-bastion.eqiad to jenkins
- 16:28 bd808: disconnected deployment-bastion.eqiad from jenkins
- 16:27 bd808: killed all pending jobs for deployment-bastion.eqiad
- 16:26 bd808: disconnected deployment-bastion.eqiad from jenkins
- 16:20 legoktm: updated phpunit for https://gerrit.wikimedia.org/r/188398
February 18
- 23:50 marxarelli: Reloading Zuul to deploy Id311d632e5032ed153277ccc9575773c0c8f30f1
- 23:37 marxarelli: Running `jenkins-jobs update` to create mediawiki-vagrant-bundle17-cucumber job
- 23:15 marxarelli: Running `jenkins-jobs update` to update mediawiki-vagrant-bundle17 jobs
- 22:56 marxarelli: Reloading Zuul to deploy I3b71f4dc484d5f9ac034dc1050faf3ba6f321752
- 22:42 marxarelli: running `jenkins-jobs update` to create mediawiki-vagrant-bundle17 jobs
- 22:13 hashar: saving Jenkins configuration at https://integration.wikimedia.org/ci/configure to reset the locale
- 16:41 bd808: beta-scap-eqiad job fixed after manually rebuilding git clones of scap/scap on rsync01 and videoscaler01
- 16:39 bd808: rebuilt corrupt deployment-videoscaler01:/srv/deployment/scap/scap
- 16:36 bd808: rebuilt corrupt deployment-rsync01:/srv/deployment/scap/scap
- 16:26 bd808: scap failures only from deployment-videoscaler01 and deployment-rsync01
- 16:25 bd808: scap failing with "ImportError: cannot import name cli" after latest update; investigating
- 16:23 bd808: redis-cli srem 'deploy:scap/scap:minions' i-0000059b.eqiad.wmflabs i-000007f8.eqiad.wmflabs i-0000022e.eqiad.wmflabs i-0000044e.eqiad.wmflabs i-000004ba.eqiad.wmflabs
- 16:16 bd808: 5 deleted instances in trebuchet redis cache for salt/salt repo
- 16:16 bd808: updated scap to 7c64584 (Add universal argument to ignore ssh_auth_sock)
- 16:14 bd808: scap clone on deployment-mediawiki02 corrupt; git fsck did not fix; will delete and refetch
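  What delete-and-refetch looks like for a Trebuchet-managed clone (the origin URL is whatever `git remote -v` reports beforehand, not something to guess):
      cd /srv/deployment/scap
      (cd scap && git remote -v)          # record the configured origin first
      sudo mv scap scap.corrupt
      sudo git clone <origin-url> scap    # re-clone from the recorded origin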
- 01:41 bd808: fixed git rebase conflict on deployment-salt caused by outdated cherry-pick; cherry-picks are merged now so reset to tracking origin/production
February 17
- 17:47 hashar: beta cluster is mostly down because the instance supporting the main database (deployment-db1) is down. The root cause is an outage on the labs infra
- 03:43 Krinkle: Depooled integration-slave1009 (Debugging T89180)
- 03:38 Krinkle: Depooled integration-slave1009
February 14
- 00:55 marxarelli: gzip'd /var/log/account/pacct.0 on deployment-bastion
- 00:02 bd808: Stopped udp2log and started udp2log-mw on deployment-bastion
February 13
- 23:25 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/190231/ to deployment-salt for testing
- 14:03 Krinkle: Jenkins UI stuck in Spanish. Resetting configuration.
- 13:05 Krinkle: Reloading Zuul to deploy I0eaf2085576165b
February 12
- 11:11 hashar: changed passwords of selenium users.
- 10:41 hashar: Removing MEDIAWIKI_PASSWORD* global env variables from Jenkins configuration bug T89226
February 11
- 19:39 Krinkle: Jenkins UI is stuck in French. Resetting..
- 17:56 greg-g: hashar saved Jenkins global configuration at https://integration.wikimedia.org/ci/configure to hopefully reset the web interface default locale
- 09:57 hashar: restarting Jenkins to upgrade the Credentials plugin
- 09:25 hashar: bunch of puppet failures since 8:00am UTC. Seems to be DNS timeouts.
February 10
- 09:18 hashar: reenabling puppet-agent on deployment-salt. Was disabled with no reason or SAL entry.
- 06:32 Krinkle: Fix lanthanum:/srv/ssd/jenkins-slave/workspace/mediawiki-extensions-zend@3/src/extensions/Flow/.git/config.lock
- 00:50 bd808: Updated integration/slave-scripts to "Load extensions using wfLoadExtensions() if possible" (b532a9a)
February 9
- 22:40 Krinkle: Various mediawiki-extensions-zend builds are jammed half-way through phpunit execution (filed T89050)
- 21:31 hashar: Deputized legoktm to the Gerrit 'integration' group. Brings +2 on integration/* repos.
- 20:38 hashar: reconnected jenkins slave agents 1006 1007 and 1008
- 20:37 hashar: deleted /tmp on integration slaves 1006 1007 and 1008. Filled with npm temp directories
- 15:51 hashar: integration : allowed ssh from gallium 208.80.154.135/32 to the instances
- 09:20 hashar: starting puppet agent on integration-puppetmaster
February 7
- 16:23 hashar: puppet is broken on integration project for some reason. No clue what is going on :-( bug T88960
- 16:19 hashar: restarted puppetmaster on integration-puppetmaster.eqiad.wmflabs
- 00:42 Krinkle: Jenkins is alerting for integration-slave1006, integration-slave1007 and integration-slave1008 having low /tmp space free (< 0.8GB)
February 6
- 22:40 Krinkle: Installed dsh on integration-dev
- 05:46 Krinkle: Reloading Zuul to deploy I096749565 and I405bea9d3e
- 01:35 Krinkle: Upgraded all integration slaves to npm v2.4.1
February 5
- 13:11 hasharAway: restarted Zuul server to clear out stalled jobs
- 12:25 hashar: Upgrading puppet-lint from 0.3.2 to 1.1.0 on all repositories. All jobs are non-voting besides mediawiki-vagrant-puppetlint-lenient, which passes just fine with 1.1.0
- 03:21 Krinkle: Reloading Zuul to deploy I08a524ea195c
- 00:22 marxarelli: Reloaded Zuul to deploy Iebdd0d2ddd519b73b1fc5e9ce690ecb59da9b2db
February 4
- 10:43 hashar: beta-scap-eqiad job is broken because mwdeploy can no longer ssh from deployment-bastion to deployment-mediawiki01. Filed as bug T88529
- 10:30 hashar: piok
February 3
- 13:55 hashar: ElasticSearch /var/log/ filling up is bug T88280
- 09:15 hashar: Running puppet on deployment-eventlogging02 has been stalled for 3d15h. No log :-(
- 09:08 hashar: cleaning /var/log on deployment-elastic06 and deployment-elastic07
- 00:44 Krinkle: Restarting Jenkins-Gearman connection
February 2
- 21:39 Krinkle: Deployed I94f65b56368 and reloading Zuul
January 31
- 20:31 hashar: canceling a bunch of browser test jobs that are deadlocked waiting for SauceLabs. The HTTP request has no timeout; bug T88221
January 29
- 01:39 James_F: Restarting Jenkins because deployment-bastion.eqiad isn't depooling even after restart.
- 00:47 Krenair: running instructions at https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update
- 00:26 Krinkle: integration-slave1007 rm -rf /mnt/jenkins-workspace/workspace/oojs*
- 00:19 Krinkle: Jenkins slave on deployment-bastion.eqiad has been stuck for the past 5 hours
January 28
- 22:53 Krinkle: integration-slave1007: rm -rf /mnt/jenkins-workspace/workspace/mwext-DonationInterface-np*
- 22:43 Krinkle: /srv/deployment/integration/slave-scripts got corrupted by puppet on labs slaves. No longer has the appropriate permission flags.
- 16:52 marktraceur: restarting nginx on deployment-upload so beta images might work again
January 27
- 18:54 Krinkle: integration-slave1007: rm -rf mwext-VisualEditor-* workspaces
January 26
- 23:22 bd808: rm integration-slave1006:/mnt/jenkins-workspace/workspace/mediawiki-phpunit-hhvm/src/.git/HEAD.lock (file was timestamped Jan 22 23:55)
- 21:06 bd808: I just merged a scap change that probably will break the beta-recompile-math-texvc-eqiad job -- https://gerrit.wikimedia.org/r/#/c/186808/
January 24
- 01:05 hashar: restarting Jenkins (deadlock on deployment-bastion slave)
January 20
- 18:50 Krinkle: Reconfigure Jenkins default language back to 'en' as it was set to Turkish
January 17
- 20:20 James_F: Brought deployment-bastion.eqiad back online, but without effect AFAICS.
- 20:19 James_F: Marking deployment-bastion.eqiad as temporarily offline to try to fix the backlog.
January 16
- 23:26 bd808: cherry-picked https://gerrit.wikimedia.org/r/#/c/185570/ to fix puppet errors on deployment-prep
- 12:43 _joe_: added hhvm.pcre_cache_type = "lru" to beta hhvm config
- 12:32 _joe_: installing the new HHVM package on mediawiki hosts
- 11:59 akosiaris: removed ferm from all beta hosts via salt
January 15
- 17:06 greg-g: turned off the beta-scap-eqiad jenkins job due to the persistent failing (https://phabricator.wikimedia.org/T86901) and the impending labs outage
- 14:50 hashar: beta-scap-eqiad broken since ~ 7:52am UTC. Depends on mwdeploy user homedir to be fixed in LDAP https://phabricator.wikimedia.org/T86903
- 10:55 hashar: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ is broken since roughly 7:52am UTC.
January 14
- 23:22 mutante: cherry-picked I1e5f9f7bcbbe6c4 on deployment-bastion
- 20:37 hashar: Restarting Zuul
- 20:36 hashar: Zuul: applied Ori's patch to fix a git lock contention in zuul-cloner (bug T86730). Tagged wmf-deploy-20150114-1
- 16:58 greg-g: rm -rf'd the Wikigrok checkout in integration-slave1006:/mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions to (hopefully) fix https://phabricator.wikimedia.org/T86730
- 14:56 anomie: Cherry-pick https://gerrit.wikimedia.org/r/#/c/173336/11/ to Beta Labs
- 02:05 bd808: There is some kind of race / conflict with the mediawiki-extensions-hhvm; I cleaned up the same error for a different extension yesterday
- 02:04 bd808: integration-slave1006 IOError: Lock for file '/mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/WikiGrok/.git/config' did already exist, delete '/mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/WikiGrok/.git/config.lock' in case the lock is illegal
January 13
- 22:37 hashar: Restarted Zuul, deadlocked waiting for Gerrit
- 21:38 ori: deployment-prep upgraded nutcracker on mw1/mw2 to 0.4.0+dfsg-1+wm1
- 17:49 hashar: If Zuul status page ( https://integration.wikimedia.org/zuul/ ) shows a lot of changes with completed jobs and the number of results growing, Zuul is deadlocked waiting for Gerrit. Have to restart it on gallium.wikimedia.org with /etc/init.d/zuul restart
- 17:43 hashar: Restarted deadlocked Zuul , which drops ALL events. Reason is Gerrit lost connection with its database which is not handled by Zuul . See https://wikitech.wikimedia.org/wiki/Incident_documentation/20150106-Zuul
- 17:32 James_F: No effect from restarting Gearman. Getting Timo to restart Zuul.
- 17:30 James_F: No effect. Restarting Gearman.
- 17:26 James_F: Trying a shutdown/re-enable of Jenkins.
- 13:59 YuviPanda: running scap via jenkins, hitting buttons on https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
- 13:58 YuviPanda: scap failed
- 13:58 YuviPanda: running scap, because why not
- 13:58 YuviPanda: modified PrivateSettings.php to make it use wikiadmin user rather than mw user
- 13:51 YuviPanda: created user wikiadmin on deployment-db1
- 04:31 James_F: Zuul now appears fixed.
- 04:29 marktraceur: FORCE RESTART ZUUL (James_F told me to)
- 04:28 marktraceur: Attempting graceful zuul restart
- 04:26 marktraceur: Reloaded zuul to see if it will help
- 04:24 James_F: Took the gallium Jenkins slave offline, disconnected and relaunched; no effect.
- 04:19 James_F: Disabled and re-enabled Gearman, no effect.
- 04:15 James_F: Flagged and unflagged Jenkins for restart, no effect.
- 04:10 James_F: Jenkins/zuul/whatever not working, investigating.
- 01:12 marxarelli: Added twentyafterfour as an admin to the integration project
- 01:08 bd808: Added Dduvall as an admin in the integration project
- 00:55 bd808: zuul is plugged up because a gate-and-submit job failed on integration-slave1006 (ZeroBanner clone problem) and then the patch was force merged
- 00:48 bd808: deleted integration-slave1006:/mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/ZeroBanner to try and clear the git clone problem there
- 00:35 bd808: git clone failure in https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/131/console blocking merge of core patch
January 12
- 21:17 hashar: qa-morebots moved from #wikimedia-qa to #wikimedia-releng bug T86053
- 20:57 greg-g: yuvi removed webserver:php5-mysql role from deployment-sentry2, thus getting puppet onit to unfail
- 20:57 greg-g: test-qa
- 11:41 hashar: foo
- 10:28 hashar: Removing Jenkins IRC notifications from #wikimedia-qa , please switch to #wikimedia-releng
- 09:06 hashar: Tweak Zuul configuration to pin python-daemon <= 2.0 and deploying tag wmf-deploy-20150112-1. bug T86513
January 8
- 19:21 Krinkle: Force restart Zuul
- 19:21 Krinkle: Gearman is back up but Zuul itself still stuck (no longer processing new events, doing "Updating information for .." for the same three jobs over and over again)
- 19:08 Krinkle: Relaunched Gearman from Jenkins manager
- 19:05 Krinkle: Zuul/Gearman stuck
- 18:26 YuviPanda: purged nscd cache on all deployment-prep hosts
- 16:34 Krinkle: Reload Zuul to deploy I9bed999493feb715
- 14:58 hashar: contintcloud labs project has been created! bug T86170. Added Krinkle and 20after4 as project admins.
- 14:44 hashar: on gallium and lanthanum, pushing integration/jenkins.git which would: 1b6a290 - Upgrade JSHint from v2.5.6 to 2.5.11
January 7
- 10:57 hashar: Taught Jenkins configuration about Java 8. Name: "Ubuntu - OpenJdk 8" JAVA_HOME: /usr/lib/jvm/java-8-openjdk-amd64/ . Only available on Trusty slaves though
- 10:56 hashar: installed openjdk 8 on CI Trusty labs slaves https://phabricator.wikimedia.org/T85964
- 10:34 hashar: varnish text cache is back up. Had to delete /etc/varnish and reinstall varnish from scratch + rerun puppet.
- 10:25 hashar: deleting /etc/varnish on deployment-cache-text02 and running puppet
- 10:24 hashar: beta varnish text cache is broken. The vcl refuses to load because of undefined probes
- 10:01 hashar: restarted deployment-cache-mobile03 and deployment-cache-text02
- 09:49 hashar: rebooting deployment-cache-bits01
- 00:41 Krinkle: rm -rf slave-scripts and re-cloning from integration/jenkins.git on all slaves (under sudo, just like puppet originally did); git-status and jshint both work fine now (see the sketch after this day's entries)
- 00:40 Krinkle: Permissions of deployment/integration/slave-scripts on labs slaves are all screwed up (git-status says files are dirty, but when run as root git-status is clean and jshint also works fine via sudo)
- 00:29 Krinkle: Tried reconnecting Gearman, relaunching slave agents. Force-restarting Zuul now.
- 00:15 Krinkle: Permissions in deployment/integration/slave-scripts on integration-slave1003 are screwed up as well
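A sketch of the re-clone from the 00:41 entry above. The clone URL and target path are assumptions inferred from the repository name and the /srv/deployment/integration/slave-scripts path used elsewhere in this log:

    # On each slave, as root (mirroring what puppet originally did):
    cd /srv/deployment/integration
    sudo rm -rf slave-scripts
    sudo git clone https://gerrit.wikimedia.org/r/integration/jenkins slave-scripts  # assumed clone URL
    cd slave-scripts && sudo git status   # should now report a clean tree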
January 6
- 22:13 hashar: jshint complains with: Error: Cannot find module './lib/node' :-(
- 22:12 hashar: integration-slave1005 chmod -R go+r /srv/deployment/integration/slave-scripts
- 22:08 hashar: integration-slave1007 chmod -R go+r /srv/deployment/integration/slave-scripts . cscott mentioned build failures of parsoidsvc-jslint which could not read /srv/deployment/integration/slave-scripts/tools/node_modules/jshint/src/cli.js
- 02:29 ori: qdel -f'd qa-morebots and started a new instance
December 22
- 20:06 bd808: Saved settings in https://integration.wikimedia.org/ci/configure to get jenkins ui language back to english from korean
December 21
- 08:31 Krinkle: /var on integration-slave1005 had 93% of 2GB full. Removed some large items in /var/cache/apt/archives that seemed unneeded and don't exist on other slaves.
December 19
- 23:01 greg-g: Krinkle restarted Gearman, which got the jobs to flow again
- 20:51 Krinkle: integration-slave1005 (new Ubuntu Trusty instance) is now pooled
- 18:51 Krinkle: Re-created and provisioning integration-slave1005 (UbuntuTrusty)
- 18:23 bd808: redis input to logstash stuck; restarted service
- 18:16 bd808: ran `apt-get dist-upgrade` on logstash01
- 18:02 bd808: removed local mwdeploy user & group from videoscaler01
- 18:01 bd808: deployment-videoscaler01 has mysteriously acquired a local mwdeploy user instead of the ldap one
- 17:58 bd808: forcing puppet run on deployment-videoscaler01
- 07:24 Krinkle: Restarting Gearman connection to Jenkins
- 07:24 Krinkle: Attempt #5 at re-creating integration-slave1001. Completed provisioning per Setup instructions. Pooled.
- 05:33 Krinkle: Rebasing integration-puppetmaster with latest upstream operations/puppet (5 local patches) and labs/private
- 00:06 bd808: restored local commit with ssh keys for scap to deployment-salt
December 18
- 23:57 bd808: temporarily disabled jenkins scap job
- 23:56 bd808: killed some ancient screen sessions on deployment-bastion
- 23:53 bd808: Restarted udp2log-mw on deployment-bastion
- 23:53 bd808: Restarted salt-minion on deployment-bastion
- 23:47 bd808: Updated scap to latest HEAD version
- 21:57 Krinkle: integration-slave1005 is not ready. It's incompletely setup due to https://phabricator.wikimedia.org/T84917
- 19:29 marxarelli: restarted puppetmaster on deployment-salt
- 19:29 marxarelli: seeing "Could not evaluate: getaddrinfo: Temporary failure in name resolution" in the deployment-* puppet logs
- 14:17 hashar: deleting instance deployment-parsoid04 and removing it from Jenkins
- 14:08 hashar: restarted varnish backend on parsoidcache02
- 14:00 hashar: parsoid05 seems happy: curl http://localhost:8000/_version: {"name":"parsoid","version":"0.2.0-git","sha":"d16dd2db6b3ca56e73439e169d52258214f0aeb2"}
- 14:00 hashar: parsoid05 seems happy: curl http://localhost:8000/_version
- 13:56 hashar: applying latest changes of Parsoid on parsoid05 via: zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/services/parsoid --change 180671,2
- 13:56 hashar: parsoid05: disabling puppet, stopping parsoid, rm -fR /srv/deployment/parsoid ; rerunning the Jenkins beta-parsoid-update-eqiad to hopefully recreate everything properly
- 13:52 hashar: making parsoid05 a Jenkins slave to replace parsoid04
- 13:24 hashar: apt-get upgrade on parsoidcache02 and parsoid04
- 13:23 hashar: updated labs/private on puppet master to fix a puppet dependency cycle with sudo-ldap
- 13:19 hashar: rebased puppetmaster repo
- 12:53 hashar: re-enqueuing the last merged Parsoid change in the Zuul postmerge pipeline in order to trigger the beta-parsoid-update-eqiad job properly (see the sketch after this day's entries): zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/services/parsoid --change 180671,2
- 12:52 hashar: deleting the workspace for the beta-parsoid-update-eqiad jenkins job on deployment-parsoid04. Some files belonged to root, which prevented the job from proceeding
- 09:13 hashar: enabled MediaWiki core 'structure' PHPUnit tests for all extensions. Will require folks to fix their incorrect AutoLoader and ResourceLoader entries. 180496 bug T78798
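The 12:53 entry above re-triggers the post-merge job by re-enqueuing the last merged Parsoid change; spelled out with the exact arguments from that entry:

    # Run on the Zuul server: re-enqueue change 180671, patchset 2 into the
    # postmerge pipeline as if Gerrit had just emitted the merge event.
    zuul enqueue --trigger gerrit \
        --pipeline postmerge \
        --project mediawiki/services/parsoid \
        --change 180671,2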
December 17
- 21:02 hashar: cancelled all browser tests, suspecting them of deadlocking Jenkins somehow :(
December 16
- 17:17 bd808: git-sync-upstream runs cleanly on deployment-salt again!
- 17:16 bd808: removed cherry pick of Ib2a0401a7aa5632fb79a5b17c0d0cef8955cf990 (-2 by _joe_; replaced by Ibcad98a95413044fd6c5e9bd3c0a6fb486bd5fe9)
- 17:15 bd808: removed cherry pick of I3b6e37a2b6b9389c1a03bd572f422f898970c5b4 (modified in gerrit by bd808 and not repicked; merged)
- 17:15 bd808: removed cherry pick of I08c24578596506a1a8baedb7f4a42c2c78be295a (-2 by _joe_ in gerrit; replaced by Iba742c94aa3df7497fbff52a856d7ba16cf22cc7)
- 17:13 bd808: removed cherry pick of I6084f49e97c855286b86dbbd6ce8e80e94069492 (merged by Ori with a change)
- 17:09 bd808: trying to fix it without losing important changes
- 17:08 bd808: deployment-salt:/var/lib/git/operations/puppet is a rebase hell of cherry-picks that don't apply
- 13:51 hashar: deleting integration-slave1001 and recreating it. It is blocked on boot and we can't console on it https://phabricator.wikimedia.org/T76250
December 15
- 23:24 Krinkle: integration-slave1001 isn't coming back (T76250), building integration-slave1005 as its replacement.
- 12:53 YuviPanda: manually restarted diamond on all betalabs hosts, to see if that is why metrics aren’t being sent anymore
- 09:41 hashar: deleted hhvm core files in /var/tmp/core from both mediawiki01 and mediawiki02 task T1259 and task T71979
December 13
- 18:51 bd808: Running chmod -R g+s /data/project/upload7 on deployment-mediawiki02 (full permission-reset sequence sketched after this day's entries)
- 18:25 bd808: Running chmod -R u=rwX,g=rwX,o=rX /data/project/upload7 from deployment-mediawiki02
- 18:16 bd808: chown done for /data/project/upload7
- 17:51 bd808: Running chown -R apache:apache on /data/project/upload7 from deployment-mediawiki02
- 17:11 bd808: Labs DNS seems to be flaking out badly and causing random scap and puppet failures
- 16:58 bd808: restarted puppetmaster on deployment-salt
- 16:31 bd808: apache user renumbered on deployment-mediawiki03
- 16:23 bd808: apache and hhvm restarted on beta app servers following apache user renumber
- 16:09 bd808: apache and hhvm stopped on beta app server tier. All requests expected to return 503 from varnish
- 16:03 bd808: Starting work on phab:T78076 to renumber apache users in beta
- 08:21 YuviPanda|zzz: forcing puppet run on all deployment-prep hosts
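The 17:51 to 18:51 entries above amount to a single permission-reset pass over the shared upload directory; a consolidated sketch, run from deployment-mediawiki02 as in the entries (sudo added here as an assumption):

    cd /data/project/upload7
    sudo chown -R apache:apache .        # 17:51 entry
    sudo chmod -R u=rwX,g=rwX,o=rX .     # 18:25 entry
    sudo chmod -R g+s .                  # 18:51 entry: new files inherit the group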
December 12
- 22:38 bd808: Fixed scap by deleting /srv/mediawiki/~tmp~ on deployment-rsync01
- 22:27 hashar: Creating 1300 Jenkins jobs to run extension PHPUnit tests under either the HHVM or Zend PHP flavor.
- 18:35 bd808: Added puppet config to record !log messages in logstash
- 17:32 bd808: forcing puppet runs on deployment-mediawiki0[12]; hiera settings specific to beta were not applied on the hosts leading to all kinds of problems
- 17:12 bd808: restarted hhvm on deployment-mediawiki0[12] and purged hhbc database
- 17:00 bd808: restarted apache2 on deployment-mediawiki01
- 16:59 bd808: restarted apache2 on deployment-mediawiki02
December 11
- 22:13 hashar: Adding chrismcmahon to the 'integration' Gerrit group so he can +2 changes made to integration/config.git
- 21:47 hashar: Jenkins: re-adding integration-slave1009 to the pool of slaves
- 19:45 bd808|LUNCH: I got nerd sniped into looking at beta. Major personal productivity failure.
- 19:43 bd808|LUNCH: nslcd log noise is probably a red herring -- https://access.redhat.com/solutions/58684
- 19:39 bd808|LUNCH: lots of nslcd errors in syslog on deployment-rsync01 which may be causing scap failures
- 07:45 YuviPanda: shut up shinken-wm
December 10
- 22:17 bd808: restarted logstash on logstash1001. redis event queue not being processed
- 10:30 hashar: Adding hhvm on Trusty slaves, using depooled integration-slave1009 as the main work area
December 9
- 16:33 bd808: restarted puppetmaster to pick up changes to custom functions
- 16:19 bd808: forced install of sudo-ldap across beta with: salt '*' cmd.run 'env SUDO_FORCE_REMOVE=yes DEBIAN_FRONTEND=noninteractive apt-get -y install sudo-ldap'
December 8
- 23:45 bd808: deleted hhvm core on mediawiki01
- 23:43 bd808: Ran `apt-get clean` on deployment-mediawiki01
December 5
- 22:21 bd808: 1.1G free on deployment-mediawiki02:/var after removing a lot of crap from logs and /var/tmp/cores
- 22:06 bd808: /var full on deployment-mediawiki02 again :(((
- 10:50 hashar: applying mediawiki::multimedia class on contint slaves ( https://phabricator.wikimedia.org/T76661 | https://gerrit.wikimedia.org/r/#/c/177770/ )
- 01:01 bd808: Deleted a ton of jeprof.*.heap files from deployment-mediawiki02:/
- 00:54 YuviPanda: cleared out pngs from mediawiki02 to kill low space warning
- 00:53 YuviPanda: mediawiki02 instance is low on space, /tmp has lots of... pngs?
December 4
- 22:48 YuviPanda: manually rebased puppet on deployment-prep
- 00:29 bd808: deleted instance "udplog"
December 3
- 19:11 bd808: Cleaned up legacy jobrunner scripts on deployment-jobrunner01 (/etc/default/mw-job-runner /etc/init.d/mw-job-runner /usr/local/bin/jobs-loop.sh)
December 2
- 23:39 bd808: Cause of full disk on deployment-mediawiki01 was an hhvm core file; fixed now
- 23:35 bd808: /var full on deployment-mediawiki01
- 11:27 hashar: deleting /srv/vdb/varnish* files on all varnish instances ( https://phabricator.wikimedia.org/T76091 )
- 10:23 hashar: restarted parsoid on deployment-parsoid05
- 05:26 Krinkle: integration-slave1001 has been down since the failed reboot on 28 November 2014. Still unreachable over ssh and no Jenkins slave agent.
December 1
- 18:54 bd808: Got jenkins updates working again by taking deployment-bastion node offline, killing waiting jobs and bringing it back online again.
- 18:51 bd808: updates in beta stuck with the "Waiting for next available executor" deadlock again
- 17:59 bd808: Testing rsyslog event forwarding to logstash via puppet cherry-pick
November 27
- 12:28 hashar: enabled puppet master autoupdate by setting puppetmaster_autoupdate: true in Hiera:Integration . https://phabricator.wikimedia.org/T75878
- 12:28 hashar: rebased integration puppetmaster : 5d35de4..1a5ebee
- 00:32 bd808: Testing local hack on deployment-salt to switch order of heira backends
- 00:16 bd808: Testing a proposed puppet patch to allow pointing hhvm logs back to deployment-bastion
November 26
- 00:51 bd808: cherry-picked patch for redis logstash input from MW 175896
- 00:50 bd808: Restored puppet cherry-picks from reflog [phab:T75947]
November 25
- 23:45 hashar: Fixed upload cache on beta cluster, the Varnish backend had a mmap SILO error that prevented the backend from starting. https://phabricator.wikimedia.org/T75922
- 21:05 bd808: Running `sudo find . -type d ! -perm -o=w -exec chmod 0777 {} +` to fix upload permissions
- 18:01 legoktm: cleared out renameuser_status table (old broken global merges)
- 18:00 legoktm: 4086 rows deleted from localnames, 3929 from localuser
- 17:59 legoktm: clearing out localnames/localuser where wikis don't exist on beta
- 17:10 legoktm: ran migratePass0.php on all wikis
- 17:09 legoktm: ran checkLocalUser.php --delete on all wikis
- 17:08 legoktm: PHP Notice: Undefined index: wmgExtraLanguageNames in /mnt/srv/mediawiki/php-master/includes/SiteConfiguration.php on line 307
- 17:07 legoktm: ran checkLocalNames.php --delete on all wikis
- 04:37 jgage: restarted jenkins at 20:31
November 24
- 17:24 greg-g: stupid https
- 16:40 bd808|deploy: My problem with en.wikipedia.beta.wmflabs.org was caused by a forceHTTPS cookie being set in my browser and redirecting to the broken https endpoint
- 16:33 bd808|deploy: scap fixed by reverting bad config patch; still looking into failures from en.wikipedia.beta.wmflabs.org
- 16:27 bd808: Looking at scap crash
- 15:18 YuviPanda: restored local hacks + fixed 'em to account for 47dcefb74dd4faf8afb6880ec554c7e087aa947b on deployment-salt puppet repo, puppet failures recovering now
November 21
- 17:06 bd808: deleted salt keys for deleted instances: i-00000289, i-0000028a, i-0000028b, i-0000028e, i-000002b7, i-000006ad
- 15:57 hashar: fixed puppet cert on deployment-restbase01
- 15:50 hashar: deployment-sca01 regenerating puppet CA for deployment-sca01
- 15:34 hashar: Regenerated puppet master certificate on deployment-salt. It needs to be named deployment-salt.eqiad.wmflabs, not i-0000015c.eqiad.wmflabs. Puppet agent works on deployment-salt now (see the sketch after this day's entries)
- 15:19 hashar: I have revoked the deployment-salt certificates. All puppet agent are thus broken!
- 15:01 hashar: deployment-salt cleaning certs with puppet cert clean
- 14:52 hashar: manually switching restbase01 puppet master from virt1000 to deployment-salt.eqiad.wmflabs
- 14:50 hashar: deployment-restbase01 has some puppet error: Error 400 on SERVER: Must provide non empty value. on node i-00000727.eqiad.wmflabs . That is due to puppet pickle() function being given an empty variable
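A sketch of the certificate cleanup described in the 14:50 to 15:34 entries above, using standard puppet 3 cert commands; the exact flags used at the time are not recorded, so treat this as an approximation:

    # On deployment-salt (the beta puppetmaster):
    sudo puppet cert clean i-0000015c.eqiad.wmflabs          # drop the old, wrongly named cert
    sudo puppet cert generate deployment-salt.eqiad.wmflabs  # re-issue under the proper name
    sudo service puppetmaster restart

    # On an affected agent such as deployment-restbase01:
    sudo rm -rf /var/lib/puppet/ssl   # discard state signed by the revoked CA
    sudo puppet agent --test          # request a new cert, then sign it on the master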
November 20
- 15:25 hashar: 15:01 Restarted Jenkins AND Zuul. Beta cluster jobs are still deadlocked.
- 13:21 hashar: for integration, set puppet master report retention to 360 minutes ( https://wikitech.wikimedia.org/wiki/Hiera:Integration , see https://bugzilla.wikimedia.org/show_bug.cgi?id=73472#c14 )
- 13:20 hashar: rebased puppet master on integration project
- 13:20 hashar: rebased puppet master
November 19
- 21:27 bd808: Ran `GIT_SSH=/var/lib/git/ssh git pull --rebase` in deployment-salt:/srv/var-lib/git/labs/private
November 18
- 15:32 hashar: Deleting job https://integration.wikimedia.org/ci/job/mediawiki-vendor-integration/ replaced by mediawiki-phpunit. Clearing out workspaces bug 73515
November 17
- 09:24 YuviPanda: moved *old* /var/log/eventlogging into /home/yuvipanda so puppet can run without bitching
- 04:57 YuviPanda: cleaned up coredump on mediawiki02 on deployment-prep
November 14
- 21:03 marxarelli: loaded and re-saved jenkins configuration to get it back to english
- 17:27 bd808: /var full on deployment-mediawiki02. Adjusted ~bd808/cleanup-hhvm-cores for core found in /var/tmp/core rather than the expected /var/tmp/hhvm
- 11:14 hashar: Recreated a labs Gerrit setup on integration-zuul-server . Available from http://integration.wmflabs.org/gerrit/ using OpenID for authentication.
November 13
- 11:13 hashar: apt-get upgrade / maintenance on all slaves
- 11:02 hashar: bringing back integration-slave1008 to the pool. The label had a typo. https://integration.wikimedia.org/ci/computer/integration-slave1008/
November 12
- 21:03 hashar: Restarted Jenkins due to a deadlock with deployment-bastion slave
November 9
- 16:51 bd808: Running `chmod -R =rwX .` in /data/project/upload7
November 8
- 08:06 YuviPanda: that fixed it
- 08:04 YuviPanda: disabling/enabling gearman
November 6
- 23:43 bd808: https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-qunit-mobile/ happier after I deleted the clone of mw/core that was somehow corrupted
- 21:01 cscott: bounced zuul, jobs seem to be running again
- 20:58 cscott: about to restart zuul as per https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues
- 00:53 bd808: HHVM not installed on integration-slave1009? "/srv/deployment/integration/slave-scripts/bin/mw-run-phpunit-hhvm.sh: line 42: hhvm: command not found" -- https://integration.wikimedia.org/ci/job/mediawiki-core-regression-hhvm-master/2542/console
November 5
- 16:14 bd808: Updated scap to include Ic4574b7fed679434097be28c061927ac459a86fc (Revert "Make scap restart HHVM")
October 31
- 17:13 godog: bouncing zuul in jenkins as per https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues
October 30
- 16:34 hashar: cleared out /var/ on integration-puppetmaster
- 16:34 bd808: Upgraded kibana to v3.1.1
- 15:54 hashar: Zuul: merging in https://review.openstack.org/#/c/128921/3 which should fix jobs being stuck in queue on merge/gearman failures. bug 72113
- 15:45 hashar: Upgrading Zuul reference copy from upstream c9d11ab..1f4f8e1
- 15:43 hashar: Going to upgrade Zuul and monitor the result over the next hour.
October 29
- 22:58 bd808: Stopped udp2log and started udp2log-mw on deployment-bastion
- 19:46 bd808: Logging seems broken following merge of https://gerrit.wikimedia.org/r/#/c/119941/24. Investigating
October 28
- 21:39 bd808: RoanKattouw creating deployment-parsoid05 as a replacement for the totally broken deployment-parsoid04
October 24
- 13:36 hashar: That bumps hhvm on contint from 3.3.0-20140925+wmf2 to 3.3.0-20140925+wmf3
- 13:36 hashar: apt-get upgrade on Trusty Jenkins slaves
October 23
- 22:43 hashar: Jenkins resumed activity. Beta cluster code is being updated
- 21:36 hashar: Jenkins: disconnected / reconnected slave node deployment-bastion.eqiad
October 22
- 20:54 bd808: Enabled puppet on deployment-logstash1
- 09:07 hashar: Jenkins: upgrading gearman-plugin from 0.0.7-1-g3811bb8 to 0.1.0-1-gfa5f083, i.e. bringing us to the latest version + 1 commit
October 21
- 21:10 hashar: contint: refreshed slave-scripts 0b85d48..8c3f228 sqlite files will be cleared out after 20 minutes (instead of 60 minutes) bug 71128
- 20:51 cscott: deployment-prep _joe_ promises to fix this properly tomorrow am
- 20:51 cscott: deployment-prep turned off puppet on deployment-pdf01, manually fixed broken /etc/ocg/mw-ocg-service.js
- 20:50 cscott: deployment-prep updated OCG to version 523c8123cd826c75240837c42aff6301032d8ff1
- 10:55 hashar: deleted salt master key on deployment-elastic{06,07}, restarted salt-minion and reran puppet. It is now passing on both instances \O/
- 10:48 hashar: rerunning puppet manually on deployment-elastic{06,07}
- 10:48 hashar: beta: signing puppet cert for deployment-elastic{06,07}. On deployment-salt ran: puppet ca sign i-000006b6.eqiad.wmflabs; puppet ca sign i-000006b7.eqiad.wmflabs
- 09:29 hashar: disregard that: deployment-logstash1 has a puppet agent error but it is simply because the agent is disabled ('debugging logstash config')
- 09:28 hashar: deployment-logstash1 disk full
October 20
- 17:41 bd808: Disabled redis input plugin and restarted logstash on deployment-logstash1
- 17:39 bd808: Disabled puppet on deployment-logstash1 for some live hacking of logstash config
- 15:27 apergos: upgraded salt-master on virt1000 (master for labs)
October 17
- 22:34 subbu: live fixed bad logger config in /srv/deployment/parsoid/deploy/conf/wmf/betalabs.localsettings.js and verified that parsoid doesn't crash anymore -- fix now on gerrit and being merged
- 20:48 hashar: qa-morebots is back
- 20:30 hashar: beta: switching Parsoid config file to the one in mediawiki/services/parsoid/deploy.git instead of the puppet maintained config file https://gerrit.wikimedia.org/r/#/c/166610/ for subbu. Parsoid seems happy :)
- hashar: qa-morebots disappeared :( bug 72179
- hashar: deployment-logstash1 unlocking puppet by deleting left over /var/lib/puppet/state/agent_catalog_run.lock
- hashar: logstash1 instance being filled up is bug 72175 probably caused by the Diamond collector spamming /server-status?auto
- hashar: deployment-logstash1 deleting files under /var/log/apache2/ ; gotta file a bug to prevent the access log from filling the partition
October 16
- 06:14 apergos: updated remaining beta instances to salt-minion 2014.1.11 from salt ppa
October 15
- 12:56 apergos: updated i-000002f4, i-0000059b, i-00000504, i-00000220 salt-minion to 2014.1.11
- 12:20 apergos: updated salt-master and salt-minion on the deployment-salt host _only_ to 2014.1.11 (using salt ppa for now)
- 01:08 Krinkle: Pooled integration-slave1009
- 01:00 Krinkle: Setting up integration-slave1009 (bug 72014 fixed)
- 01:00 Krinkle: integration-publisher and integration-zuul-server were rebooted by me yesterday. Seems they only show up in graphite now. Maybe they were shut down or had puppet stuck.
October 14
- 21:00 JohnLewis: icinga says deployment-sca01 is good (yay)
- 20:42 JohnLewis: deleted and recreated deployment-sca01 (still needs puppet set up)
- 20:24 JohnLewis: rebooted deployment-sca01
- 09:26 hashar: renamed deployment-cxserver02 node slaves to 03 and updated the ip address
- 06:49 Krinkle: Did a slow-rotating graceful depool/reboot/repool of all integration slaves over the past hour to debug problems whilst waiting for puppet to unblock and set up new slaves.
- 06:43 Krinkle: Keeping the new integration-slave1009 unpooled because setup could not be completed due to bug 72014.
- 06:43 Krinkle: Pooled integration-slave1004
- 05:40 Krinkle: Setting up integration-slave1004 and integration-slave1009 (bug 71873 fixed)
October 10
- 20:53 Krinkle: Deleted integration-slave1004 and integration-slave1009. When bug 71873 is fixed, they'll need to be re-created.
- 19:11 Krinkle: integration-slave1004 (new instance, not set up yet) was broken (bug 71741). The bug seems fixed for new instances, so I deleted and re-created it. Will set it up as a Precise instance and pool it.
- 19:09 Krinkle: integration-slave1009 (new instance) remains unpooled as it is not yet fully set up (bug 71874). See Nova_Resource:Integration/Setup
October 9
- 20:17 bd808: rebooted deployment-sca01 via wikitech ui
- 20:16 bd808: deployment-sca01 dead -- Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
- 19:44 bd808: added role::deployment::test to deployment-rsync01 and deployment-mediawiki03 for trebuchet testing
- 19:07 bd808: updated scap to include 8183d94 (Fix "TypeError bufsize must be an integer")
- 09:34 hashar: migrating deployment-cxserver02 to beta cluster puppet and salt masters
- 09:22 hashar: Renamed Jenkins slave deployment-cxserver01 to deployment-cxserver02 and updated IP. It is marked offline until the instance is ready and has the relevant puppet classes applied.
- 09:19 hashar: deleting deployment-cxserver01 (borked since virt1005 outage) creating deployment-cxserver02 to replace it bug 71783
October 7
- 19:19 bd808: ^d deleted all files/directories in gallium:/var/lib/jenkins-slave/tmpfs
- 18:24 bd808: /var/lib/jenkins-slave/tmpfs full (100%) on gallium
- 11:54 Krinkle: The new integration-slave1009 must remain unpooled because Setup failed (puppet unable to mount /mnt, bug 71874) - see also Nova Resource:Integration/Setup
- 11:53 Krinkle: Deleted integration-slave1004 because bug 71741
- 10:16 hashar: beta: apt-get upgraded all instances besides the lucid one.
- 09:57 hashar: beta: deleting old occurrences of /etc/apt/preferences.d/puppet_base_2.7
- 09:53 hashar: apt-get upgrade on all beta cluster instances
- 09:34 Krinkle: Rebase integration-puppetmaster on latest operations-puppet (patches: I7163fd38bcd082a1, If2e96bfa9a1c46)
- 09:32 Krinkle: Apply I44d33af1ce85 instead of Ib95c292190d on integration-puppetmaster (remove php5-parsekit package)
- 09:28 hashar: upgrading php5-fss on both beta-cluster and integration instances. bug 66092 https://rt.wikimedia.org/Ticket/Display.html?id=7213
- 08:55 Krinkle: Building additional contint slaves in labs (integration-slave1004 with precise and integration-slave1009 with trusty)
- 08:21 Krinkle: Reload Zuul to deploy 5e905e7c9dde9f47482d
October 3
- 22:53 bd808: Had to stop and start zuul due to NoConnectedServersError("No connected Gearman servers") in zuul.log on gallium
- 22:34 bd808|deploy: Merged Ie731eaa7e10548a947d983c0539748fe5a3fe3a2 (Regenerate autoloader) to integration/phpunit for bug 71629
- 14:01 manybubbles: rebuilding beta's simplewiki cirrus index
- 08:24 hashar: deployment-bastion clearing up /var/log/account a bit bug 69604. Puppet patch pending :]
October 2
- 19:42 bd808: Updated scap to include eff0d01 Fix format specifier for error message
- 11:58 hashar: Migrated all mediawiki-core-regression* jobs to Zuul cloner bug 71549
- 11:57 hashar: Migrated all mediawiki-core-regression* jobs to Zuul cloner
October 1
- 20:57 bd808: hhvm servers broken because of I5f9b5c4e452e914b33313d0774fb648c1cdfe7ad
- 17:29 bd808: Stopped service udp2log and started service udp2log-mw on deployment-bastion
- 16:21 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/163078/ into scap for beta. hhvm will be restarted on each scap. Keep your eyes open for weird problems like 503 responses that this may cause.
- 14:14 hashar: rebased contint puppetmaster
September 30
- 23:47 bd808: jobrunner using outdated ip address for redis01. Testing patch to use hostname rather than hardcoded ip
- 21:45 bd808: jobrunner not running. ebernhardson is debugging.
- 21:38 bd808: /srv on rsync01 now has 3.2G of free space and should be fine for quite a while again.
- 21:37 bd808: I figured out the disk space problem on rsync01 (just as I was ready to replace it with rsync02). The old /srv/common-local directory was still there, which doubled the disk utilization. /srv/mediawiki is the correct sync dir now following prod changes.
- 21:15 bd808: local l10nupdate users on bastion, mediawiki01 and rsync01
- 21:06 bd808: Local mwdeploy user on deployment-bastion making things sad
- 20:36 bd808: lots and lots of "file has vanished" errors from rsync. Not sure why
- 20:35 bd808: Initial puppet run with role::beta::rsync_slave applied on rsync02 failed spectacularly in /Stage[main]/Mediawiki::Scap/Exec[fetch_mediawiki] stage
- 20:02 bd808: Started building deployment-rsync02 to replace deployment-rsync01
- 19:59 bd808|LUNCH: /srv partition on deployment-rsync01 full again. We need a new rsync server with more space
- 17:44 bd808: Updated scap to 064425b (Remove restart-nutcracker and restart-twemproxy scripts)
- 16:08 bd808: Occasional memcached-serious errors in beta from something trying to connect to the default memcached port (11211) rather than the nutcracker port (11212).
- 15:58 bd808: scap happy again after fixing rogue group/user on rsync01 \o/ Not sure why they were created but likely an ldap hiccup during a puppet run
- 15:56 bd808: removed local group/user mwdeploy on deployment-rsync01
- 15:54 bd808: Local mwdeploy (gid=996) shadowing ldap group gid=603(mwdeploy) on deployment-rsync01
- 15:49 bd808: apt-get dist-upgrade fixed hhvm on deployment-mediawiki03
- 15:45 hashar: Updating our Jenkins job builder fork 686265a..ee80dbc (no job changed)
- 15:44 bd808: scap failing in beta due to "Permission denied (publickey)" talking to deployment-rsync01.eqiad.wmflabs
- 15:39 bd808: hhvm not starting after puppet run on deployment-mediawiki03. Investigating.
- 15:36 bd808: enabling puppet and forcing run on deployment-mediawiki03
- 15:34 bd808: enabling puppet and forcing run on deployment-mediawiki02
- 15:29 bd808: puppet showed no changes on mediawiki01‽
- 15:27 bd808: enabling puppet and forcing run on deployment-mediawiki01
- 15:13 bd808: Fixed logstash by installing http://packages.elasticsearch.org/logstash/1.4/debian/pool/main/l/logstash-contrib/logstash-contrib_1.4.2-1-efd53ef_all.deb
- 15:02 bd808: Logstash doesn't bundle the prune filter by default any more -- http://logstash.net/docs/1.4.2/filters/prune
- 14:59 bd808: Logstash rules need to be adjusted for latest upstream version: "Couldn't find any filter plugin named 'prune'"
- 12:37 hashar: Fixed some file permissions under deployment-bastion:/srv/mediawiki-staging/php-master/vendor/.git ; some files belonged to root instead of mwdeploy
- 00:34 bd808: Updated kibana to latest upstream head 8653aba
September 29
- 14:22 hashar: apt-get upgrade and reboot of all integration-slaveXX instances
- 14:07 hashar: updated puppetmaster labs/private on both integration and beta cluster projects ( a41fcdd..84f0906 )
- 08:57 hashar: rebased puppetmaster
September 26
- 22:16 bd808: Deleted deployment-mediawiki04 (i-000005ba.eqiad.wmflabs) and removed from salt and trebuchet
- 07:50 hashar: Pooled back integration-slave1006 , was removed because of bug 71314
- 07:41 hashar: Updated our Jenkins Job Builder fork 2d74b16..686265a
September 25
- 23:35 bd808: Done messing with puppet repo. Replaced 2 local commits with proper gerrit cherry picks. Removed a cherry-pick that had been rearranged and merged. Removed a cherry-pick that had been abandoned in gerrit.
- 23:10 bd808: removed cherry-pick of abandoned https://gerrit.wikimedia.org/r/#/c/156223/; if beta wikis stop working this would be a likely culprit
- 22:36 bd808: Trying to reduce the number of untracked changes in puppet repo. Expect some short term breakage.
- 22:21 bd808: cleaned up puppet repo with `git rebase origin/production; git submodule update --init --recursive`
- 22:18 bd808: puppet repo on deployment-salt out of whack. I will try to fix.
- 08:15 hashar: beta: puppetmaster rebased
- 08:10 hashar: beta: dropped a patch that reverted OCG LVS configuration ( https://gerrit.wikimedia.org/r/#/c/146860/ ), it has been fixed by https://gerrit.wikimedia.org/r/#/c/148371/
- 08:04 hashar: attempting to rebase beta cluster puppet master. Currently at 74036376
September 24
- 15:30 hashar_: install additional fonts on jenkins slaves for browser screenshots ( https://gerrit.wikimedia.org/r/#/c/162604/ and https://bugzilla.wikimedia.org/69535 )
- 09:57 hashar_: upgraded Zuul on all integration labs instances
- 09:33 hashar_: Jenkins switched mwext-UploadWizard-qunit back to Zuul cloner by applying pending change 161459
- 09:19 hashar_: Upgrading Zuul to f0e3688. Cherry-picked https://review.openstack.org/#/c/123437/1 , which fixes bug 71133 (Zuul cloner: fails on extension jobs against a wmf branch)
September 23
- 23:08 bd808: Jenkins and deployment-bastion talking to each other again after six (6!) disconnect, cancel jobs, reconnect cycles
- 22:53 greg-g: The dumb "waiting for executors" bug is https://bugzilla.wikimedia.org/show_bug.cgi?id=70597
- 22:51 bd808: Jenkins stuck trying to update database in beta again with the dumb "waiting for executors" bug/problem
September 22
- 16:09 bd808: Ori updating HHVM to 3.3.0-20140918+wmf1 (from deployment-prep SAL)
- 09:37 hashar_: Jenkins: deleting old mediawiki extensions jobs (rm -fR /var/lib/jenkins/jobs/*testextensions-master). They are no longer triggered and are superseded by the *-testextension jobs.
September 20
- 21:30 bd808: Deleted /var/log/atop.* on deployment-bastion to free some disk space in /var
- 21:29 bd808: Deleted /var/log/account/pacct.* on deployment-bastion to free some disk space in /var
September 19
- 21:16 hashar: puppet is broken on Trusty integration slaves because they try to install the non-existent package php-parsekit. WIP; will get it sorted out eventually.
- 14:57 hashar: Jenkins friday deploy: migrate all MediaWiki extension qunit jobs to Zuul cloner.
September 17
- 12:20 hashar: upgrading jenkins 1.565.1 -> 1.565.2
September 16
- 16:36 bd808: Updated scap to 663f137 (Check php syntax with parallel `php -l`)
- 04:01 jeremyb: deployment-mediawiki02: salt was broken with a msgpack exception. `mv -v /var/cache/salt{,.old} && service salt-minion restart` fixed it; also did `salt-call saltutil.sync_all` (see the sketch after this day's entries)
- 04:00 jeremyb: deployment-mediawiki02: (/run was 99%)
- 03:59 jeremyb: deployment-mediawiki02: rm -rv /run/hhvm/cache && service hhvm restart
- 00:51 jeremyb: deployment-pdf01 removed base::firewall (ldap via wikitech)
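The 03:59 to 04:01 entries above fix a salt minion throwing msgpack exceptions and an hhvm whose cache had filled /run; the same steps as one sequence (sudo added here as an assumption):

    # On deployment-mediawiki02:
    sudo mv -v /var/cache/salt{,.old}   # move the corrupted salt cache aside
    sudo service salt-minion restart
    sudo salt-call saltutil.sync_all    # re-sync modules and grains from the master

    # /run was 99% full because of hhvm's cache:
    sudo rm -rv /run/hhvm/cache
    sudo service hhvm restart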
September 15
- 22:53 jeremyb: deployment-pdf01: pkill -f grain-ensure
- 21:36 bd808: Trying to fix salt with `salt '*' service.restart salt-minion`
- 21:32 bd808: only hosts responding to salt in beta are deployment-mathoid, deployment-pdf01 and deployment-stream
- 21:29 bd808: salt calls failing in beta with errors like "This master address: 'salt' was previously resolvable but now fails to resolve!"
- 20:18 hashar: restarted salt-master
- 19:50 hashar: killed a bunch of `python /usr/local/sbin/grain-ensure contains ...` and `/usr/bin/python /usr/bin/salt-call --out=json grains.append deployment_target scap` commands on deployment-bastion
- 18:57 hashar: scap breakage due to ferm is logged as https://bugzilla.wikimedia.org/show_bug.cgi?id=70858
- 18:48 hashar: https://gerrit.wikimedia.org/r/#/c/160485/ tweaked a default ferm configuration file which caused puppet to reload ferm. It ended up with rules that prevent ssh from other hosts, thus breaking rsync \O/
- 18:37 hashar: beta-scap-eqiad job is broken since ~17:20 UTC https://integration.wikimedia.org/ci/job/beta-scap-eqiad/21680/console || rsync: failed to connect to deployment-bastion.eqiad.wmflabs (10.68.16.58): Connection timed out (110)
September 13
- 01:07 bd808: Moved /srv/scap-stage-dir to /srv/mediawiki-staging; put a symlink in as a failsafe
- 00:31 bd808: scap staging dir needs some TLC on deployment-bastion; working on it
- 00:30 bd808: Updated scap to I083d6e58ecd68a997dd78faabe60a3eaf8dfaa3c
September 12
- 01:28 ori: services promoted User:Catrope to projectadmin
September 11
- 20:59 spagewmf: https://integration.wikimedia.org/ci/ is down with 503 errors
- 16:13 bd808: Now that scap is pointed to labmon1001.eqiad.wmnet the deployment-graphite.eqiad.wmflabs host can probably be deleted; it never really worked anyway
- 16:12 bd808: Updated scap to include I0f7f5cae72a87f68d861340d11632fb429c557b9
- 15:09 bd808: Updated hhvm-luasandbox to latest version on mediawiki03 and verified that mediawiki0[12] were already updated
- 15:01 bd808: Fixed incorrect $::deployment_server_override var on deployment-videoscaler01; deployment-bastion.eqiad.wmflabs is correct and deployment-salt.eqiad.wmflabs is not
- 10:05 ori: deployment-prep upgraded luasandbox and hhvm across the cluster
- 08:41 spagewmf: deployment-mediawiki01/02 are not getting latest code
- 05:10 bd808: Reverted cherry-pick of I621d14e4b75a8415b16077fb27ca956c4de4c4c3 in scap; not the actual problem
- 05:02 bd808: Cherry-picked I621d14e4b75a8415b16077fb27ca956c4de4c4c3 to scap to try and fix l10n update issue
September 10
- 19:38 bd808: Fixed beta-recompile-math-texvc-eqiad job on deployment-bastion
- 19:38 bd808: Made /usr/local/apache/common-local a symlink to /srv/mediawiki on deployment-bastion
- 19:37 bd808: Deleted old /srv/common-local on deployment-videoscaler01
- 19:32 bd808: Killed jobs-loop.sh tasks on deployment-jobrunner01
- 19:30 bd808: Removed old mw-job-runner cron job on deployment-jobrunner01
- 19:19 bd808: Deleted /var/log/account/pacct* and /var/log/atop.log.* on deployment-jobrunner01 to make some temporary room in /var
- 19:14 bd808: Deleted /var/log/mediawiki/jobrunner.log and restarted jobrunner on deployment-jobrunner01:
- 19:11 bd808: /var full on deployment-jobrunner01
- 19:05 bd808: Deleted /srv/common-local on deployment-jobrunner01
- 19:04 bd808: Changed /usr/local/apache/common-local symlink to point to /srv/mediawiki on deployment-jobrunner01
- 19:03 bd808: w00t!!! scap jobs is green again -- https://integration.wikimedia.org/ci/job/beta-scap-eqiad/20965/
- 19:00 bd808: sync-common finished on deployment-jobrunner01; trying Jenkins scap job again
- 18:53 bd808: Removed symlink and made /srv/mediawiki a proper directory on deployment-jobrunner01; running sync-common to populate (full sequence sketched after this day's entries)
- 18:45 bd808: Made /srv/mediawiki a symlink to /srv/common-local on deployment-jobrunner01
- 10:20 jeremyb: deployment-bastion /var at 97%, freed up ~500MB. apt-get clean && rm -rv /var/log/account/pacct*
- 10:17 jeremyb: deployment-bastion good puppet run
- 10:16 jeremyb: deployment-salt had an oom-kill recently, and some box (maybe master, maybe client?) had a disk fill up
- 10:15 jeremyb: deployment-mediawiki0[12] both had good puppet runs
- 10:15 jeremyb: deployment-salt started puppetmaster && puppet run
- 10:14 jeremyb: deployment-bastion killed puppet lock
- 03:04 bd808: Ori made puppet changes that moved the MediaWiki install dir to /srv/mediawiki (https://gerrit.wikimedia.org/r/#/c/159431/). I didn't see that in SAL so I'm adding it here.
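The 18:45 to 19:05 entries above walk deployment-jobrunner01 from a wrongly pointed symlink to a real /srv/mediawiki directory populated by sync-common; a sketch of that sequence (sudo and the exact invocation of sync-common are assumptions):

    # On deployment-jobrunner01:
    sudo rm /srv/mediawiki                  # was a symlink to /srv/common-local (18:45)
    sudo mkdir /srv/mediawiki               # make it a real directory (18:53)
    sync-common                             # repopulate from the staging host (18:53 to 19:00)
    sudo ln -sfn /srv/mediawiki /usr/local/apache/common-local   # 19:04
    sudo rm -rf /srv/common-local           # old tree no longer needed (19:05)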
September 9
- 03:06 bd808: Restarted jenkins agent on deployment-bastion twice to resolve executor deadlock (bug 70597)
September 7
- 07:00 jeremyb: testing 1,2,3