Nova Resource:Deployment-prep/SAL

From Wikitech

2017-06-13

  • 18:47 andrewbogott: root@deployment-salt02:~# salt "*" cmd.run "apt-get -y install facter"

2017-05-19

  • 19:05 mutante: fixing role class config on deployment-phab* (remove role::phabricator::main, add role::phabricator_server in context prefix "deployment-phab"); remove again from instance level for phab-01
  • 18:40 mutante: deployment-phab01 still has puppet error "Could not find class role::phabricator::main" and that should simply be removed from it, but I can NOT find it in Horizon; I checked instance config, project config, the "Other" section, the "All classes" tab. Because it's gone. But how do I fix the instance config then?
  • 18:39 mutante: applying role::phabricator_server on instance deployment-phab01 (it had error, could not find role::phabricator::main and the name changed in role/profile conversion)

2017-03-29

  • 18:41 ebernhardson: upgrading elasticsearch and kibana to 5.1.2 on deployment-logstash2 to test puppet+integration prior to prod deployment

2017-03-20

  • 20:51 andrewbogott: migrating deployment-urldownloader to labvirt1013
  • 20:45 andrewbogott: migrating deployment-pdf01 to labvirt1011
  • 20:14 andrewbogott: migrating deployment-puppetmaster02 to a different labvirt

2017-03-15

  • 09:10 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=hewiktionary
  • 09:10 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=dewiktionary
  • 09:08 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=enwiktionary
  • 08:56 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=enwiktionary // (ParameterTypeException, T160503)
  • 08:50 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=enwiktionary --site-group=wiktionary // (3 sites added)
  • 08:49 addshore: addshore@deployment-tin mwscript extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=enwiktionary --force-protocol=https --load-from=https://deployment.wikimedia.beta.wmflabs.org/w/api.php
  • 08:49 addshore: addshore@deployment-tin mwscript sql.php --wiki=enwiktionary "TRUNCATE sites; TRUNCATE site_identifiers;"
  • 08:44 addshore: addshore@deployment-tin mwscript extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=enwiktionary --force-protocol=https
  • 08:43 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=dewiktionary --site-group=wiktionary // (0 sites added)
  • 08:43 addshore: addshore@deployment-tin mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=enwiktionary --site-group=wiktionary // (1 site added)
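
The ordering that matters in the runs above is: sites table first, then Cognate sites, then Cognate pages. A dry-run sketch of that sequence (it only prints the mwscript invocations, nothing is executed against MediaWiki; the wiki names are examples):

```shell
#!/bin/sh
# Dry-run sketch: print, per wiki, the maintenance scripts in the order
# used above. Nothing touches MediaWiki; this is illustration only.
populate_cognate() {
    wiki="$1"
    echo "mwscript extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=$wiki --force-protocol=https"
    echo "mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=$wiki --site-group=wiktionary"
    echo "mwscript extensions/Cognate/maintenance/populateCognatePages.php --wiki=$wiki"
}

for wiki in enwiktionary dewiktionary hewiktionary; do
    populate_cognate "$wiki"
done
```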

2017-03-06

  • 19:04 addshore: mwscript sql.php --wiki=aawiki "CREATE DATABASE cognate_wiktionary"

2017-03-01

  • 19:09 addshore: "mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=aawiki he wiktionary hewiktionary he.wiktionary.beta.wmflabs.org" T158628

2017-02-02

  • 00:52 tgr: added mhurd as member

2017-01-23

  • 07:15 _joe_: cherry-picking the move of base to profile::base

2017-01-19

  • 22:11 Krenair: added a bunch of others to the same group per request. We should figure out how to make this process sane somehow
  • 22:06 Krenair: added nuria to deploy-service group on deployment-tin

2017-01-17

  • 17:51 urandom: re-enabling puppet on deployment-restbase02
  • 17:47 urandom: re-enabling puppet on deployment-restbase01

2017-01-11

  • 18:07 urandom: restarting restbase cassandra nodes
  • 18:01 urandom: disabling puppet on restbase cassandra nodes to experiment with prometheus exporter

2017-01-08

  • 05:20 Krenair: deployment-stream: live hacked /usr/lib/python2.7/dist-packages/socketio/handler.py a bit (added apostrophes) to try to make rcstream work

2017-01-04

  • 21:30 mutante: deployment-cache-text-04 - running acme-setup command to debug .. Creating CSR /etc/acme/csr/beta_wmflabs_org.pem
  • 21:26 Krenair: trying to troubleshoot puppet by stopping nginx then letting puppet start it
  • 21:05 mutante: deployment-cache-text04 stopping nginx service, running puppet to debug dependency issue

2016-12-19

  • 21:21 andrewbogott: and also python-functools32_3.2.3.2-3~bpo8+1_all.deb
  • 21:20 andrewbogott: upgrading to python-jsonschema_2.5.1-5~bpo8+1_all.deb on deployment-eventlogging03
  • 20:51 andrewbogott: upgrading to python-requests_2.12.3-1_all.deb ./python-urllib3_1.19.1-1_all.deb on deployment-mediawiki04 and deployment-tin

2016-12-04

  • 15:26 Krenair: Found a git-sync-upstream cron on deployment-mx for some reason... commented for now, but wtf was this doing on a MX server?

2016-11-23

  • 15:04 Krenair: fixed puppet on deployment-cache-text04 by manually enabling experimental apt repo, see T150660

2016-11-16

  • 20:02 Krenair: mysql master back up, root identity is now unix socket based rather than password
  • 19:57 Krenair: taking mysql master down to fix perms
  • 07:52 Krenair: the new mysql root password for -db04 is at /tmp/newmysqlpass as well as in a new file in the puppetmaster's labs/private.git
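
For reference, the socket-auth switch logged at 20:02 maps to a single MariaDB statement. This sketch only prints the command; the statement syntax is assumed for MariaDB's unix_socket plugin and was not verified against the version running on the beta db hosts:

```shell
# Dry-run sketch: print the mysql invocation that would move root from
# password auth to unix-socket auth on MariaDB. Illustrative only.
sql="ALTER USER 'root'@'localhost' IDENTIFIED VIA unix_socket"
echo "sudo mysql -e \"$sql; FLUSH PRIVILEGES;\""
```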

2016-11-09

  • 20:27 Krenair: removed default SSH access from production host 208.80.154.135, the old gallium IP

2016-11-03

  • 05:04 Krenair: beginning to move the rest of beta to the new puppetmaster

2016-11-02

  • 18:51 Krenair: armed keyholder on -tin and -mira
  • 18:50 Krenair: started mysql on -db boxes to bring beta back online

2016-11-01

  • 22:22 Krenair: started mysql on -db03 to hopefully pull us out of read-only mode
  • 22:21 Krenair: started mysql on -db04
  • 22:19 Krenair: stopped and started udp2log-mw on -fluorine02
  • 22:00 Krenair: started moving nodes back to the new puppetmaster
  • 02:55 Krenair: Managed to mess up the deployment-puppetmaster02 cert, had to move those nodes back

2016-10-31

  • 20:57 Krenair: moving some nodes to deployment-puppetmaster02
  • 16:57 bd808: Added Niharika29 as project member

2016-10-27

  • 18:46 bd808: Testing dual page wiki logging by stashbot. (check #3)
  • 18:36 bd808: Testing dual page wiki logging by stashbot. (second attempt)
  • 18:14 bd808: Testing dual page wiki logging by stashbot.

2016-10-24

  • 14:51 Krenair: T142288: Shut off -pdf02 and -conftool

2016-10-10

  • 21:41 Krenair: restarted keyholder-proxy on -tin to make check_keyholder happy with the extra key that was active but unconfigured
  • 21:11 Krenair: fixed puppet on -restbase01/-restbase02 by setting up deployment of cassandra/twcs on deployment-tin
  • 20:56 Krenair: fixed puppet on -tin/-mira by restarting puppetmaster for base_path scap change
  • 15:45 dcausse: deployment-elastic0[5-8]: reduce the number of replicas to 1 max for all indices

2016-10-03

  • 15:40 Krenair: upgraded cache-upload04 to varnish4. hieradata is set on the prefix deployment-cache-upload

2016-09-28

  • 22:33 Krenair: Rebooting deployment-ms-be01 - T146947, T141673

2016-09-26

  • 23:13 Krenair: Rebooting deployment-aqs01 for T141673

2016-09-20

  • 20:16 Krenair: enabled trusty-backports on deployment-puppetmaster

2016-09-13

  • 20:47 Krenair: Created SRV record _etcd._tcp.beta.wmflabs.org for etcd/confd

2016-09-11

  • 20:35 Krenair: started cron service on deployment-salt02 again, seems it got killed Tue 2016-08-30 13:42:39 UTC - hopefully this will fix the puppet staleness alert

2016-08-30

  • 23:20 Krenair: removed 'project_id' key from deployment-restbase02's metadata to fix compatibility with the new labsprojectfrommetadata code
  • 18:09 yuvipanda: rebooting deployment-kafka03, it seems to be stuck

2016-08-19

  • 00:39 Krenair: deployment-fluorine is now deployment-fluorine02 running jessie with the old precise packages shoehorned in

2016-08-12

  • 19:20 Krenair: that fixed it, upload.beta is back up
  • 19:14 Krenair: rebooting deployment-cache-upload04, it's stuck in https://phabricator.wikimedia.org/T141673 and varnish is no longer working there afaict, so trying to bring upload.beta.wmflabs.org back up

2016-08-01

  • 20:58 Krenair: deleted 2014/2015 files from deployment-stream:/var/log/diamond to get space on /var and stop it warning

2016-07-27

  • 06:07 Tim: fixed broken puppet git checkout on deployment-puppetmaster, updated

2016-07-13

  • 20:45 Krenair: RIP NFS

2016-07-11

  • 23:24 Krenair: Unmounted /data/project (NFS) on all active hosts (mediawiki0[1-3], jobrunner01, tmh01), leaving just deployment-upload (shutoff, to schedule for deletion soon) - T64835

2016-07-09

  • 00:46 Krenair: T64835: `mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php zerowiki --backend=local-multiwrite --private`
  • 00:46 Krenair: T64835: `foreachwikiindblist "% all-labs.dblist - private.dblist" extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --backend=local-multiwrite`
  • 00:46 Krenair: T64835: Live-hacked some temporary swift config in
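
The dblist expression in the foreachwikiindblist call above, "% all-labs.dblist - private.dblist", is a set difference: all labs wikis minus the private ones. The same difference can be reproduced locally with comm(1); the file contents here are made up for illustration:

```shell
#!/bin/sh
set -e
# Toy dblists standing in for the real ones; comm(1) needs sorted input.
printf 'enwiki\nprivatewiki\nzerowiki\n' > all-labs.dblist
printf 'privatewiki\n' > private.dblist
# Lines only in all-labs.dblist = the wikis the loop would cover.
comm -23 all-labs.dblist private.dblist
```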

2016-06-27

  • 22:32 ebernhardson: deployed gerrit.wikimedia.org/r/296279 to puppetmaster to test kibana4 role

2016-06-25

  • 03:24 Krenair: Changed eventbus key in secrets (from being a symlink to eventlogging to being a new random key) so check_keyholder works again

2016-06-22

  • 22:23 Krenair: Installed netpbm on all deployment-mediawiki* hosts to fix ProofreadPage thumbnailing. I wonder if we should include the puppet mediawiki::packages::multimedia class on these hosts really

2016-06-13

  • 16:06 Krenair: Rebooted deployment-ircd, it was stuck somehow
  • 13:53 yuvipanda: kicked deployment-salt via nova for Krenair
  • 13:35 Krenair: Fixed puppet on -tin by symlinking eventbus key to eventlogging in -puppetmaster:/var/lib/git/labs/private/modules/secret/secrets/keyholder

2016-06-01

  • 02:14 Krenair: Started redis-server on deployment-rcstream to stop MW hhvm.log spam

2016-05-09

  • 15:39 andrewbogott: migrating deployment-fluorine to labvirt1009

2016-05-03

  • 01:42 Krenair: ran package updates on deployment-parsoid06 so that exim4 would start so puppet will run

2016-05-02

  • 09:54 gehel: restart elasticsearch cluster to ensure multicast configuration is disabled (T110236)

2016-04-13

  • 20:37 Krenair: doing the same with -redis02
  • 20:26 Krenair: corrected deployment-cxserver03:/etc/puppet/puppet.conf puppetmaster to use .deployment-prep as part of dns name

2016-04-10

  • 06:04 Krenair: deleted some large files under deployment-mediawiki01:/var/log/nutcracker to free up space on /

2016-04-09

  • 16:08 Krenair: (same for -conf03, -sentry01, -redis01, -upload - some of these are now fully fixed and some are better than they were before)
  • 15:59 Krenair: mostly fixed puppet on deployment-sca02 by changing /etc/puppet/puppet.conf to use project name as part of puppetmaster's hostname
  • 15:56 Krenair: fixed broken /etc/puppet/puppet.conf on deployment-cache-text04 (it started with a copy of the file for the labs central puppetmaster and then had the correct version pointing to the project's puppetmaster)
  • 15:47 Krenair: reenabled puppet on eventlogging04 as no reason was provided for disabling, first run successful

2016-03-30

  • 13:35 Reedy: upgrade hhvm on deployment-mediawiki03 and reboot
  • 12:16 gehel: restarting varnish on deployment-cache-text04

2016-03-29

  • 13:40 Amir1: Added ores-related classes and roles

2016-03-25

  • 20:23 Krenair: started redis-server on deployment-redis01
  • 20:23 Krenair: repaired centralauth.spoofuser table on deployment-db1
  • 20:23 Krenair: fiddled around with puppet on deployment-cache-text04 earlier to fix certs etc.
  • 07:38 tgr: restarting memcached

2016-03-08

  • 02:26 ori: Updating HHVM on deployment-mediawiki02

2016-03-01

  • 16:54 gehel: fixed a stalled rebase on deployment-puppetmaster:/var/lib/operations/puppet

2016-02-18

  • 13:24 gehel: upgrading elasticsearch to 1.7.5 on cirrus-browser-bot

2016-02-17

  • 23:57 mobrovac: added Ppchelko to the list of members

2016-02-11

  • 15:16 gehel: fixed deployment-puppetmaster rebase conflict by removing commit 814f12bc - author is informed

2016-02-08

  • 06:10 tgr: set $wgAuthenticationTokenVersion on beta cluster (test run for T124440)

2016-01-30

  • 02:57 Krenair: Restarted varnish on cache-text04 for T125282

2015-10-09

  • 21:51 ori: Accidentally clobbered /etc/init.d/mysql on deployment-db1, causing deployment-prep failures. Restored now.

2015-09-16

  • 20:39 cscott: updated OCG to version 4032a596ce6eb442b02cc6ee9b79263b1eb23275

2015-09-14

  • 19:18 cscott: updated OCG to version 5811056e28f2bc6408b6da96095352ab381bb11f
  • 12:04 dcausse: restarting elasticsearch (deployment-elastic0[5-8]) to deploy new plugins

2015-08-25

  • 14:42 andrewbogott: moving deployment-cache-mobile04 to labvirt1004

2015-08-12

  • 20:45 urandom: restarted restbase on deployment-restbase01 (dead)

2015-08-05

  • 14:33 godog: update deployment-restbase02 to openjdk8 T104887
  • 14:18 godog: update deployment-restbase01 to openjdk8 T104887

June 29

  • 13:17 dcausse: restarting Elasticsearch to pick up new plugin versions

June 23

  • 13:31 cscott: fixed salt on deployment-pdf02, restarted OCG there.
  • 05:44 cscott: stopped OCG service on deployment-pdf02, see https://phabricator.wikimedia.org/T103473
  • 05:20 cscott: updated OCG to version d7c698d5bf730d34057945e912ac75dc542dd788 ; restarted service.
  • 03:58 cscott: stopped OCG on beta; redis 2.8.x is causing the service to crash on startup.

June 22

  • 21:58 andrewbogott: re-enabling puppet on deployment-videoscaler01 because no reason was given for disabling
  • 20:42 cscott: updated OCG to version b482144f5bd8b427bcc64a3dd287247195aa1951

June 4

  • 20:29 ori: upgrading hhvm-fss from 1.1.4 to 1.1.5, has fix for T101395

May 29

  • 14:07 moritzm: upgrade java on deployment-restbase0[12] to the 7u79 security update

May 28

  • 08:46 godog: test es-tool restart-fast on deployment-elastic05

May 27

  • 21:15 AaronSchulz: populated jobqueue:aggregator:s-wikis:v2 with 1000 fake wiki keys for load testing
  • 21:07 AaronSchulz: Deployed https://gerrit.wikimedia.org/r/#/c/208852/
  • 21:07 AaronSchulz: Deleted 4G of logs on jobrunner01

May 24

  • 18:39 YuviKTM: purged old logs kept on NFS

May 20

  • 20:58 cscott: updated OCG to version ca4f64852de5b1de782b292b50038fbd2dd84266

May 18

  • 15:17 andrewbogott: rebooting deployment-logstash1

May 15

  • 20:50 andrewbogott: rebooted deployment-bastion due to inconsistent run state after suspend/resume

May 13

  • 21:08 cscott: updated OCG to version c7c75e5b03ad9096571dc6dbfcb7022c924ccb4f

May 2

  • 00:51 yuvipanda: created deployment-boomboom to test

April 29

  • 21:03 andrewbogott: suspending and shrinking disks of many instances

April 28

  • 20:57 YuviPanda: KILL KILL KILL DEPLOYMENT-LUCID-SALT WITH FIRE AND BRIMSTONE AND BAD THINGS

April 27

  • 08:01 _joe_: installed hhvm 3.6 on deployment-mediawiki02

April 24

  • 14:25 _joe_: installing hhvm 3.6.1 on mediawiki-deployment01

April 23

  • 17:19 andrewbogott: rebooting deployment-parsoidcache02 because it seems troubled

April 22

  • 12:48 andrewbogott: migrating to new labvirt nodes

April 21

  • 08:33 _joe_: rollback installation of hhvm 3.6
  • 08:09 _joe_: installing HHVM 3.6 and the corresponding extensions on deployment-mediawiki01

April 9

  • 20:11 mutante: fixed apt sources lists on deployment-bastion (T95541)

March 30

  • 22:33 Josve05a: manually start mysql on db1 and db2
  • 21:57 YuviPanda: reboot all instances from virt1000

March 23

  • 20:41 cscott: updated OCG to version 11f096b6e45ef183826721f5c6b0f933a387b1bb

March 18

  • 13:45 mobrovac: added restbase security group
  • 13:35 YuviPanda: made mobrovac projectadmin
  • 13:34 YuviPanda: added mobrovac to project

March 16

  • 18:46 manybubbles: upgraded Elasticsearch on deployment-logstash1

March 11

  • 18:47 YuviPanda: created deployment-mediawiki03

February 27

  • 11:12 YuviPanda: start mysql on deployment-db1

February 26

  • 11:53 YuviPanda: created deployment-parsoid01-test to test patch to use role::parsoid on labs

February 18

  • 13:04 _joe_: installed new version of the hhvm extensions packages

February 17

  • 23:18 Krenair: Started mysql on deployment-db1; beta now appears much less broken than before

February 6

  • 20:07 ^d: scratch that, I rebuilt it as precise. why did I do that?
  • 20:03 ^d: rebuilt deployment-elastic05 with new partition scheme

February 5

  • 12:48 YuviPanda: cherry-picking https://gerrit.wikimedia.org/r/188798 on scap on deployment-prep
  • 12:28 YuviPanda: killed chown on deployment-bastion, running directly on NFS server
  • 12:13 YuviPanda: running time sudo chown -R www-data:www-data upload7/ on /data/project
  • 12:10 YuviPanda: stopped jobrunner on jobrunner01
  • 11:53 YuviPanda: running git-sync-upstream on deployment-salt to pick up latest ops/puppet changes
  • 11:52 _joe_: converting the web user to www-data
  • 11:44 YuviPanda: deleted mediawiki03 instance, holdover from security testing from long, long ago
  • 11:41 YuviPanda: disabled puppet on mediawiki01, 02, jobrunner01, bastion and salt

February 4

  • 13:56 YuviPanda: created deployment-jobrunner01, trusty instance
  • 13:51 YuviPanda: deleted deployment-jobrunner01, trusty version coming up
  • 11:35 YuviPanda: created instance deployment-mediawiki02
  • 11:26 YuviPanda: deleted instance deployment-mediawiki02
  • 06:37 YuviPanda: created deployment-mediawiki01 host
  • 06:34 YuviPanda: killed deployment-mediawiki01 host. FOREEVERRR

January 27

  • 18:15 andrewbogott: upgrading libc6 on all instances from deployment-salt

January 20

  • 02:30 YuviPanda: created deployment-mediawiki04 to test roles

January 7

  • 16:25 YuviPanda: added milimetric to NDA sudo’ers groups

December 29

  • 22:24 MaxSem: Created a DNS entry for m.wikidata.beta.wmflabs.org

December 22

  • 12:40 _joe_: upgrading HHVM to the latest version

December 16

  • 16:52 manybubbles: elasticsearch restart finished
  • 16:48 mutante: deployment-db2 is down
  • 16:48 manybubbles: restarting beta's elasticsearch servers to pick up a new version of a plugin. won't interfere with current downtime.

December 13

  • 17:10 bd808: Many strange puppet and scap failures in beta that look to be related to DNS failures
  • 16:03 bd808: Starting work on phab:T78076 to renumber apache users in beta

December 11

  • 22:47 cscott: updated OCG to version bfc3812ef346c9f767135b339cedd123a1bcac98

December 6

  • 05:05 ori: upgrade hhvm-tidy to 0.1-2

December 3

  • 21:33 cscott: updated OCG to version 08e94b19c3f17e699d7e53d9605f65c58e17ea0e

December 2

  • 17:09 _joe_: upgrading HHVM to its latest version
  • 17:08 andrewbogott: this is a test message

December 1

  • 21:50 cscott-split: updated OCG to version a06e7c186796a6ee5d5af81e93688520abdf2596

November 26

  • 20:47 cscott: updated OCG to version 7d8f2b8bd496464041e3ef9c092732457cc8f7ef

November 24

  • 15:16 YuviPanda: modified local hack to account for 47dcefb74dd4faf8afb6880ec554c7e087aa947b
  • 14:58 YuviPanda: cherry-picked 3e45c538978710113e6e28e9d533bf8d18c159a6 and 9d4614a8a352c78505212fd6e9d2a7be6d2e4927 to deployment-salt puppetmaster, restoring local hacks

November 17

  • 20:37 YuviPanda: cleaned out logs on deployment-bastion
  • 16:48 YuviPanda: delete deployment-analytics01, a tortoise from an ancient time.
  • 05:17 YuviPanda: force apt-get install -f to unstuck puppet
  • 04:49 YuviPanda: clean up coredump on deployment-prep

November 10

  • 22:37 cscott: rsync'ed .git from pdf01 to pdf02 to resolve git-deploy issues on pdf02 (git fsck on pdf02 reported lots of errors)
  • 21:41 cscott: updated OCG to version d9855961b18f550f62c0b20da70f95847a215805 (skipping deployment-pdf02)
  • 21:39 cscott: deployment-pdf02 is not responding to git-deploy for OCG

November 5

  • 06:14 ori: restarted hhvm on beta app servers

November 3

  • 22:07 cscott: updated OCG to version 5834af97ae80382f3368dc61b9d119cef0fe129b

October 29

  • 18:55 ori: upgraded hhvm on beta labs to 3.3.0+dfsg1-1+wm1

October 28

  • 23:47 RoanKattouw: ...which was a no-op
  • 23:46 RoanKattouw: Updating puppet repo on deployment-salt puppet master
  • 21:36 RoanKattouw: Creating deployment-parsoid05 as a replacement for the totally broken deployment-parsoid04 (also as a trusty instance rather than precise)
  • 21:06 RoanKattouw: Rebooting deployment-parsoid04, wasn't responding to ssh

October 27

  • 20:23 cscott: updated OCG to version 60b15d9985f881aadaa5fdf7c945298c3d7ebeac

October 22

  • 21:10 arlolra: updated OCG to version e977e2c8ecacea2b4dee837933cc2ffdc6b214cb

October 8

  • 22:04 subbu: updated OCG to version def24eca

October 7

  • 22:50 cscott: updated OCG to version c778ea8b898f8ad8c2b7ad9de78a75469e7ed061

October 6

  • 23:13 YuviPanda: killed extra log files in deployment-bastion
  • 21:44 cscott: updated OCG to version bbdf4c6400cfbbc6030114ad16e1a6f7025eab2c
  • 15:36 cscott: updated OCG to version aee3712b352f51f96569de0bcccf3facf654e688

October 3

  • 19:51 manybubbles: performing rolling restart of elasticsearch nodes to pick up preview of accelerated regex plugin for testing at larger-than-mylaptop-scale
  • 14:02 manybubbles: rebuilding beta's simplewiki cirrus *index*
  • 14:02 manybubbles: rebuilding beta's simplewiki cirrus inde

October 1

  • 20:13 cscott: updated OCG to version 48c495e3656f528abe636ce0cd7562270505534f
  • 16:40 bd808: Added Gilles to under_NDA sudoers group

September 30

  • 22:00 bd808: Cleaned deleted instances out of salt and trebuchet redis
  • 20:26 bd808: Converted deployment-rsync02 to use local puppet & salt masters
  • 15:36 bd808: enabling puppet and forcing run on deployment-mediawiki03
  • 15:34 bd808: enabling puppet and forcing run on deployment-mediawiki02
  • 15:28 bd808: enabling puppet and forcing run on deployment-mediawiki01

September 29

  • 22:45 Reedy: re-enabled beta-scap-eqiad
  • 21:34 Reedy: disabled "beta-scap-eqiad" until things are fixed
  • 21:24 Reedy: deleted l10n cache on deployment-rsync01 to attempt to run sync-common manually
  • 21:22 Reedy: deployment-rsync01 hard drive is far too small
  • 17:57 cscott: updated OCG to version 89d8f29a24295b05d0643abe976fea83b56575c9
  • 06:58 ori: Configured Beta cluster to use redis for session storage
  • 06:57 ori: Created deployment-redis02 and converted it to use local puppet & salt masters
  • 05:23 ori: Created deployment-redis01 and converted it to use local puppet & salt masters

September 28

  • 14:38 andrewbogott: cherry-picked https://gerrit.wikimedia.org/r/#/c/163464/ onto deployment-salt to fix a puppet compile failure.
  • 14:38 andrewbogott: edited and re-cherry-picked roan's citoid patch into beta because the previous version was breaking puppet

September 26

  • 06:34 cscott: updated OCG to version f3a6c1cbba118d4a5e1aa019937dc50159fc823d

September 25

  • 22:48 RoanKattouw: Fixed permissions of deployment-bastion:/srv/deployment/mathoid/mathoid/.git/deploy (needed g+w)
  • 11:36 _joe_: updated hhvm to fix most bugs, also cherry-picked https://gerrit.wikimedia.org/r/#/c/162839/

September 24

  • 23:00 bd808: Updated bash with salt
  • 20:52 cscott: updated OCG to version 48acb8a2031863e35fad9960e48af60a3618def9

September 23

  • 20:14 cscott: updated OCG to version 1cf9281ec3e01d6cbb27053de9f2423582fcc156
  • 17:37 AaronSchulz: Initialized bloom cache on betalabs, enabled it, and populated it for enwiki

September 22

  • 16:08 ori: updating HHVM to 3.3.0-20140918+wmf1

September 20

  • 14:43 andrewbogott: moving deployment-pdf02 to virt1009
  • 00:36 mutante: raised instance quota to 43

September 19

  • 00:26 cscott: updated OCG to version ce16f7adb60d7c77409e2e11ba0e5d6cce6955d5

September 15

  • 21:44 andrewbogott: migrating deployment-videoscaler01 to virt1002
  • 21:41 andrewbogott: migrating deployment-sentry2 to virt1002
  • 21:40 cscott: *skipped* deploy of OCG, due to deployment-salt issues
  • 21:19 bd808: Added Matanya to under_NDA sudoers group (bug 70864)

September 12

  • 12:24 _joe_: set up hiera, noop as expected

September 11

  • 16:31 YuviPanda: Delete deployment-graphite instance
  • 02:29 mutante: raised instance quota by 1 to 42

September 9

  • 20:08 cscott: updated OCG to version c9a2b4cf2502479eeabed07ab2de728695d96e46

September 7

  • 23:48 bd808: Added John F. Lewis to under_NDA sudo policy (bug 70539)
  • 23:29 bd808: Promoted John F. Lewis to project admin (bug 70539)
  • 23:26 bd808: Added Jalexander as project member (bug 70539)

September 5

  • 17:54 bd808: Purged varnish cache on deployment-cache-bits01 -- sudo varnishadm ban req.url '~' /
  • 16:00 YuviPanda: unfuck puppet on deployment-salt, puppet is stupid and does not properly report failed events on last_run_summary.yaml if there's a syntax error or a resource conflict. So I've to read last_run_report and do things with *that* instead now
  • 15:49 YuviPanda: deliberately fucking up puppet to see if icinga complains
  • 09:52 _joe_: cherry-picked I6ec53da483bebfa375eba2383cbf60123ff1ce26, it works
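
The cache purge at 17:54 uses varnishadm's ban with a regex over req.url; matching '~' / hits every URL, i.e. a full wipe. A dry-run helper that prints narrower, prefix-scoped bans instead (the helper name is made up; nothing is sent to varnish):

```shell
# Dry-run sketch: print a varnishadm ban limited to a URL prefix instead
# of the full-wipe "~ /" used above. Illustration only.
ban_prefix() {
    echo "sudo varnishadm ban req.url '~' '^$1'"
}
ban_prefix /static/
ban_prefix /w/load.php
```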

September 4

  • 16:06 bd808: Manually cleaned bogus LocalRenameUserJob jobs from redis
  • 13:54 _joe_: stopped puppet on the appservers but mw03, testing an apache change
  • 05:28 legoktm: stopping jobrunner on deployment-jobrunner01
  • 05:22 legoktm: restarted jobrunner on deployment-jobrunner01
  • 05:14 bd808: Bad jobs in job queue filled up /var on jobrunner01 and killed jobrunner script. Leaving down for now until I find out how to delete the bad jobs.
  • 01:41 bd808: Killed old jobs-loop.sh processes on deployment-jobrunner01
  • 01:24 bd808: Many jobrunner errors like "wikiversions-labs.cdb has no version entry for `amwiki`" with various wiki names
  • 01:23 bd808|AWAY: Started jobrunner service manually on jobrunner01.
  • 00:44 bd808: Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known when Trebuchet is running)
  • 00:35 bd808: Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known)

September 3

  • 15:02 bd808: _joe_ rolled out a new hhvm package ~5 hours ago
  • 15:01 bd808: morebots is back thanks to petan
  • 14:50 bd808: logmsgbot down apparently

September 2

  • 15:34 bd808: False alarm. SSL is borked in beta and we know that
  • 15:29 bd808: `curl -vL -H 'Host: en.wikipedia.beta.wmflabs.org' localhost` works from deployment-cache-text02
  • 15:27 bd808: https://en.wikipedia.beta.wmflabs.org/ returning ERR_CONNECTION_REFUSED (is varnish down?)
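
The 15:29 check above bypasses DNS by sending the site's Host header straight at a cache box. A helper that prints the equivalent curl line for any site/backend pair (dry run; the names are examples and nothing is fetched):

```shell
# Dry-run sketch: print the Host-header curl check from the entry above
# for an arbitrary site and backend.
check_backend() {
    site="$1"; backend="$2"
    echo "curl -vL -H 'Host: $site' http://$backend/"
}
check_backend en.wikipedia.beta.wmflabs.org localhost
```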

August 29

  • 22:56 bd808: Got puppet to run cleanly on deployment-mediawiki03. Should be ready for serving traffic.
  • 22:39 bd808: Fixed a merge conflict in operations/puppet on deployment-salt
  • 21:46 bd808: Forced install of "right" version of libvips-tools on mediawiki03: `sudo apt-get install libvips-tools=7.38.5-2`
  • 08:40 hashar: rebooting deployment-cache-mobile03 (kernel up)

August 28

  • 21:32 bd808: Added "Greg Grossmeier" to UnderNDA sudoers group
  • 17:12 bd808: Changed centralauth db to rename labswiki -> deploymentwiki
  • 16:49 bd808: CentralAuth looks broken on http://deployment.wikimedia.beta.wmflabs.org/
  • 16:49 bd808: Apache vhosts look good again
  • 16:34 bd808: Restarted varnishes on deployment-cache-text02
  • 16:13 andrewbogott: merging a patch that renames 'labswiki' to 'deploymentwiki'
  • 09:21 hashar: resetting git repository in /data/project/apache/conf to point to the betacluster branch of operations/mediawiki-config.git; discarded all local hacks in the process

August 27

  • 23:03 hashar: Blacklisting the security audit IP again on deployment-cache bits01 mobile03 and text02
  • 22:53 hashar: removed the blackhole ip route from deployment-cache-text02 and deployment-cache-mobile03
  • 22:48 hashar: the IP is a known security audit. See Chris Steipp.
  • 22:46 hashar: blackholed an IP address on deployment-cache-text02 and deployment-cache-mobile03; it was causing hundreds of requests per second and overloading the beta cluster. Use route -n to find the IP
  • 22:37 hashar: restarting udp2log-mw on deployment-bastion. It keeps crashing since fairly recently
  • 22:26 bd808: when restarting varnish on deployment-cache-text02, don't forget that there are 2 varnish services (varnish and varnish-frontend)
  • 22:19 bd808: restarted varnish (again) on deployment-cache-text02
  • 22:10 bd808: restarted varnish on deployment-cache-text02
  • 16:22 bd808: killing `apt-get update` process running on deployment-bastion since Jun13
  • 14:59 bd808: Resolved puppet git merge conflict on deployment-salt
  • 14:49 bd808: Moved hhvm core dumps to /data/project/hhvm-cores
  • 14:42 bd808: Root drive full on deployment-mediawiki02; hhvm core files are the culprit

August 25

  • 23:47 ori: stopping hhvm/apache on deployment-mediawiki02 to replace debug build of hhvm with release build
  • 21:44 bd808: Deployed scap 116027f (Make sync-common update l10n cdb files by default)
  • 18:30 ori: deployment-mediawiki02: cleared /tmp; running puppet
  • 15:05 hashar: mediawiki02 rm /tmp/hhvm*.core . Filled as bug 69979
  • 15:01 hashar: mediawiki02 rm /tmp/mw-cache-master/conf*
  • 15:01 hashar: mediawiki02 has mw conf caches under /tmp/mw-cache-master/, and since that partition is filled up, the conf caches end up as null files
  • 15:00 hashar: mediawiki02 rm /var/log/upstart/hhvm*
  • 14:53 hashar: mediawiki02 : removed /var/lib/puppet/state/agent_catalog_run.lock
  • 14:46 hashar: restarting udp2log-mw service on -bastion. It is stalled for some reason
  • 14:42 hashar: on mediawiki02 , clearing out some /var/log/upstart/hhvm.* log files see bug 69976
  • 14:34 hashar: mediawiki02 / partition is 100% full

August 22

  • 20:21 hashar: udp2log are back in /data/project/logs . The udp2log-mw service stalled for some reason.
  • 20:08 ori: ran 'git pull' on deployment-salt:/srv/var-lib/git/operations/puppet
  • 19:59 hashar: restarting udp2log-mw service on deployment-bastion
  • 19:59 hashar: bits yielding 503
  • 00:41 bd808: cherry-picked scap change https://gerrit.wikimedia.org/r/#/c/155677/ for testing

August 21

  • 21:49 bd808: Trebuchet happier after all the salt-minion restarts; still have deleted hosts showing in the expected minion list for scap deploys
  • 21:01 twentyafterfour: Started salt-minion on deployment-redis01
  • 21:01 bd808: Started salt-minion on deployment-upload
  • 21:00 bd808: Started salt-minion on deployment-fluoride
  • 21:00 bd808: Started salt-minion on deployment-db1
  • 20:59 bd808: Started salt-minion on deployment-elastic01
  • 20:59 twentyafterfour: Started salt-minion on deployment-eventlogging02
  • 20:58 bd808: Started salt-minion on deployment-elastic02
  • 20:58 bd808: Started salt-minion on deployment-elastic03
  • 20:57 bd808: Started salt-minion on deployment-elastic04
  • 20:57 bd808: Started salt-minion on deployment-analytics01
  • 20:55 bd808: Started salt-minion on deployment-cache-upload02
  • 20:54 bd808: Started salt-minion on deployment-memc04
  • 20:54 bd808: Started salt-minion on deployment-parsoid04
  • 20:49 bd808: Started salt-minion on deployment-memc05
  • 20:48 bd808: Started salt-minion on deployment-db2
  • 20:48 twentyafterfour: Started salt-minion on deployment-cache-text02
  • 20:47 twentyafterfour: Started salt-minion on deployment-memc03
  • 20:46 bd808: Started salt-minion on deployment-cxserver01
  • 20:12 bd808: List of broken salt minions can be obtained with `sudo salt-run manage.down` on deployment-salt
  • 19:55 bd808: Fixed salt on deployment-memc02
  • 19:52 bd808: Salt minions are broken all over beta. Hung grain-ensure calls, hung test.ping calls, downed minions
  • 19:50 bd808: Killed dozens of grain-ensure calls and started salt-minion on deployment-cache-mobile03
  • 19:47 bd808: Killed hung salt-call and started salt-minion on deployment-cache-bits01
  • 19:28 bd808: Deployed cherry-pick of Iea7217a for scap
  • 19:27 bd808: Restarted salt-minion on deployment-jobrunner01 & deployment-videoscaler01
  • 19:27 bd808: Killed rogue salt-master process on deployment-bastion
  • 19:26 bd808: Deleted salt keys for retired apache0[12] minions
  • 00:13 bd808: Upgraded elasticsearch to 1.3.2 on deployment-logstash1

August 19

  • 16:11 hashar: deleted /usr/local/apache/common-local symlink, made it a directory and retriggered https://integration.wikimedia.org/ci/job/beta-scap-eqiad/17887/console
  • 16:03 bd808: Removed local changes to /usr/local/apache/conf/wmflabs-logging.conf on deployment-mediawiki02; logs back to nfs share
  • 15:52 bd808: Changed apache logging level from debug to notice on deployment-mediawiki02 in /usr/local/apache/conf/wmflabs-logging.conf
  • 15:47 bd808: Changed apache logging level from debug to warn on deployment-mediawiki02
  • 15:44 bd808: /var full on deployment-mediawiki02; deleting 572M /var/log/apache2/debug.log.1
  • 15:03 hashar: Killed some stalled scap / rsync process on deployment-bastion that were preventing https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ from acquiring the lock.
  • 14:17 hashar: huge rsync in progress on bastion
  • 14:00 hashar: On bastion, reverted the symlink and manually created directory /usr/local/apache/common-local
  • 13:55 hashar_: On bastion, deleted /usr/local/apache/common-local and symlinked it to /srv/common-local
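The common-local back-and-forth above (symlink dropped, real directory restored so rsync/scap can populate it) can be sketched on throwaway paths; all names below are stand-ins, not the real instance paths:

```shell
ROOT=$(mktemp -d)                                       # scratch tree standing in for /
mkdir -p "$ROOT/srv/common-local"
ln -s "$ROOT/srv/common-local" "$ROOT/common-local"     # the symlinked state
rm "$ROOT/common-local"                                 # remove the symlink only, not its target
mkdir "$ROOT/common-local"                              # recreate as a real directory
```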

August 18

  • 22:22 ^d: dropped apache01/02 instances; they were unused and we need the resources
  • 18:23 manybubbles: finished upgrading elasticsearch in beta - everything seems ok so far
  • 18:15 bd808: Restarted salt-minion on deployment-mediawiki01 & deployment-rsync01
  • 18:15 bd808: Ran `sudo pkill python` on deployment-rsync01 to kill hundreds of grain-ensure processes
  • 18:12 bd808: Ran `sudo pkill python` on deployment-mediawiki01 to kill hundreds of grain-ensure processes
  • 18:10 manybubbles: finally restarting beta's elasticsearch servers now that they have new jars
  • 17:56 bd808: Manually ran trebuchet fetches on deployment-elastic0*
  • 17:49 bd808: Forcing puppet run on deployment-elastic01
  • 17:47 godog: upgraded hhvm on mediawiki02 to 3.3-dev+20140728+wmf5
  • 17:44 bd808: Trying to restart minions again with `salt '*' -b 1 service.restart salt-minion`
  • 17:39 bd808: Restarting minions via `salt '*' service.restart salt-minion`
  • 17:38 bd808: Restarted salt-master service on deployment-salt
  • 17:19 bd808: 16:37 Restarted Apache and HHVM on deployment-mediawiki02 to pick up removal of /etc/php5/conf.d/mail.ini (logged in prod SAL by mistake)
  • 16:59 manybubbles|lunc: upgrading Elasticsearch in beta to 1.3.2
  • 16:11 bd808: Manually applied https://gerrit.wikimedia.org/r/#/c/141287/12/templates/mail/exim4.minimal.erb on deployment-mediawiki02 and restarted exim4 service
  • 15:28 bd808: Puppet failing for deployment-mathoid due to duplicate definition error in trebuchet config
  • 15:15 bd808: Reinstated puppet patch to depool deployment-mediawiki01 and forced puppet run on all deployment-cache-* hosts
  • 15:04 bd808: Puppet run failing on deployment-mediawiki01 (apache won't start); Puppet disabled on deployment-mediawiki02 ('reason not specified') Probably needs to wait until Giuseppe is back from vacation for fixing.
  • 15:00 bd808: Rebooting deployment-eventlogging02 via wikitech; console filling with OOM killer messages and puppet runs failing with "Cannot allocate memory - fork(2)"
  • 14:29 bd808: Forced puppet run on deployment-cache-upload02
  • 14:27 bd808: Forced puppet run on deployment-cache-text02
  • 14:24 bd808: Forced puppet run on deployment-cache-mobile03
  • 14:20 bd808: Forced puppet run on deployment-cache-bits01

August 17

  • 22:58 bd808: Attempting to reboot deployment-cache-bits01.eqiad.wmflabs via wikitech
  • 22:56 bd808: deployment-cache-bits01.eqiad.wmflabs not allowing ssh access and wikitech console full of OOM killer messages

August 15

  • 21:57 legoktm: set $wgVERPsecret in PrivateSettings.php
  • 21:42 hashSpeleology: Beta cluster database updates are broken due to CentralNotice. Fix up is 154231
  • 20:57 hashSpeleology: deployment-rsync01 : deleting /usr/local/apache/common-local content. Then ln -s /srv/common-local /usr/local/apache/common-local as set by beta::common which is not applied on that host for some reason. bug 69590
  • 20:55 hashSpeleology: puppet administratively disabled on mediawiki02 . Assuming some work in progress on that host. Leaving it untouched
  • 20:54 hashSpeleology: puppet is proceeding on mediawiki01
  • 20:52 hashSpeleology: attempting to unbreak mediawiki code update bug 69590 by cherry picking 154329
  • 20:39 hashSpeleology: in case it is not in SAL: MediaWiki is no longer synced to the app servers bug 69590
  • 20:20 hashSpeleology: rebooting mediawiki01, /var refuses to clear out and sticks at 100% usage
  • 20:16 hashSpeleology: cleaning up /var/log on deployment-mediawiki02
  • 20:14 hashSpeleology: on deployment-mediawiki01 deleting /var/log/apache2/access.log.1
  • 20:13 hashSpeleology: on deployment-mediawiki01 deleting /var/log/apache2/debug.log.1
  • 20:13 hashSpeleology: bunch of instances have a full /var/log :-/
  • 11:37 ori: deployment-cache-bits01 unresponsive; console shows OOMs: https://dpaste.de/LDRi/raw . rebooting
  • 03:20 jeremyb: 02:46:37 UTC <ebernhardson> !log beta /dev/vda1 full. moved /srv-old to /mnt/srv-old and freed up 2.1G

August 14

  • 12:23 hashar: manually rebased operations/puppet.git on puppetmaster

August 13

August 8

  • 16:05 bd808: Fixed merge conflict that was preventing updates on puppet master

August 6

  • 13:13 hashar: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ is running again
  • 13:13 hashar: removed a bunch of local hacks on deployment-bastion:/srv/scap-stage-dir/php-master. They left the git repo dirty and prevented scap from completing git pull there
  • 12:08 hashar: Manually pruning whole text cache on deployment-cache-text02
  • 12:07 hashar: Apache virtual hosts were not properly loaded on mediawiki02. I have hacked /etc/apache2/apache2.conf to make it Include /usr/local/apache/conf/all.conf (instead of main.conf, which does not include everything)
  • 08:43 hashar: pruning cache on deployment-cache-text02 / restarting varnish

August 2

  • 08:53 swtaarrs: rebuilt and restarted hhvm on deployment-mediawiki02 with potential fix
  • 05:17 swtaarrs: restarted hhvm on deployment-mediawiki0{1,2} to unwedge them

August 1

  • 15:03 bd808: Updated cherry-pick of Iceb8f43
  • 15:02 bd808: Cleaned up puppet repo on deployment-salt; merge conflicts with local Ia463120 hack; reapplied depool of deployment-mediawiki01
  • 14:50 bd808: Restarted stuck hhvm on deployment-mediawiki02; apache had 89 children waiting for a response
  • 13:27 godog: changed inplace bt-hhvm on deployment-mediawiki01/02 to also copy the binary
  • 05:32 ori: depooled deployment-mediawiki02 to investigate HHVM lock-up by cherry-picking I7df8c5310 on beta.
  • 00:40 ori: disabled puppet on deployment-mediawiki{01,02} and enabled verbose apache logging

July 31

  • 22:41 bd808: Restarted hhvm on -mediawiki{01,02}. Brett looked at 01 before I did and said "it's the same as before"
  • 20:09 cscott: updated OCG to version d2919c59eb09e09fc87777696411a070620aef45
  • 19:59 hashar: Granted sudo right to cscott (under NDA). Will let him reboot OCG service
  • 18:58 ori: re-enabled puppet on deployment-mediawiki{01,02}
  • 10:41 hashar: Taking gdb traces of hhvm on mediawiki01 and mediawiki02. Restarting hhvm
  • 05:08 bd808: HHVM hung on both boxes. Grabbed core and backtrace before restarting

July 30

  • 19:59 bd808: Created local commit 7d56b79 in puppet to work around bugs in Ia463120718dceab087ad3f8e3f35917fa879f387
  • 19:46 bd808: Restored prior /etc/hhvm/php.ini from puppet filebucket archive on deployment-mediawiki0[12]
  • 19:32 bd808: Disabled puppet on deployment-mediawiki02 for the same reason
  • 19:31 bd808: Disabled puppet on deployment-mediawiki01; Ori will look into hhvm config changes that were being applied
  • 16:52 bd808: Fixed beta-scap-eqiad Jenkins job by correcting ssh problems in beta project
  • 16:43 bd808: Fixed ssh to jobrunner01 and videoscaler01 by correcting unrelated puppet manifest problem and forcing run via salt.
  • 16:00 bd808: Puppet runs on videoscaler01 and jobrunner01 failing for "Could not find dependency Ferm::Rule[bastion-ssh] for Ferm::Rule[deployment-bastion-scap-ssh]"
  • 16:00 bd808: Puppet seems manually disabled on apache0[12].
  • 15:59 bd808: Can't ssh to apache0[12], videoscaler01 and jobrunner01. Puppet not running on any of them. libnss-ldapd unattended update has broken /etc/nslcd.conf
  • 15:23 bd808: Removed cherry-pick for Iac547efa83cf059a1276b6e279c3ebd4c7224b2c and updated cherry-pick for I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 to latest patch set.
  • 15:05 bd808: Two cherry-picks in puppet conflicting with merged production changes: I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 and Iac547efa83cf059a1276b6e279c3ebd4c7224b2c (ori, twentyafterfour)
  • 14:49 bd808: Started apache2 service on deployment-mediawiki01
  • 14:16 hashar: rebooting hhvm
  • 09:42 hashar: bastion had broken puppet because deployment_server and zuul both declare the same python packages 150501
  • 09:40 hashar: restoring on puppetmaster modules/mediawiki/templates/apache/apache2.conf.erb which got deleted somehow
  • 09:29 hashar: Rebooting apache01/02 to see whether it fix the ssh connection issue
  • 09:27 hashar: manually started hhvm on mediawiki01
  • 09:25 hashar: rebooting deployment-mediawiki01 hhvm process went zombie
  • 09:23 hashar: restarting hhvm on mediawiki 01/02
  • 09:05 hashar_: Beta scap script broken since 6:30am UTC https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

July 29

  • 22:56 cscott: updated OCG to version aeb8623d6ebe41ae7c7e36c57844bd9ea8e6d595
  • 21:02 bd808: Converted deployment-sentry2.eqiad.wmflabs to use beta salt/puppet master
  • 19:14 hashar: Removed all jobs from queue, restarted slave agent. Update Jobs coming back
  • 19:09 hashar: deployment-bastion jenkins slave is stuck. Beta cluster is no longer updating code :-//
  • 15:58 godog: restarted hhvm on deployment-mediawiki01
  • 15:52 godog: restarted hhvm on deployment-mediawiki02
  • 15:50 godog: installed libevent-dbg on deployment-mediawiki02 to capture an hhvm backtrace
  • 15:17 bd808: _joe_ restarting hhvm on deployment-mediawiki01
  • 15:00 bd808: Apache stuck with 65 children on both deployment-mediawiki servers
  • 10:37 hashar: Restarted hhvm on mediawiki{01,02}

July 28

  • 17:41 bd808: Updated hhvm to latest 3.3-dev+20140728 build on deployment-mediawiki0[12]
  • 15:37 manybubbles: rebuilding elasticsearch indexes to build a weighted all field we'll try to use to improve performance
  • 15:32 bd808: Restarted hhvm on deployment-mediawiki0[12]. All apache children were stuck waiting for hhvm to respond.
  • 15:20 bd808: Restarted apache on deployment-mediawiki02. 65 children and non-responsive to requests. (same as mediawiki01)
  • 15:18 bd808: Restarted apache on deployment-mediawiki01. 65 children and non-responsive to requests.
  • 14:23 manybubbles: or not - looks like I can't!
  • 14:22 manybubbles: rebuilding cirrus search indexes to pick up a speed-up to the all field
  • 08:30 hashar: restarted varnish on deployment-cache-bits01 . Hoping to clear bits cache

July 25

  • 18:29 bd808: Added twentyafterfour and several other WMF staff to under_NDA sudo group
  • 17:15 bd808: Morebots is back!
  • 16:38 bd808: pstree showed "hhvm─┬─271*[sh]" on deployment-mediawiki02
  • 16:38 bd808: Killed apache2+hhvm and restarted on deployment-mediawiki0[12]
  • 16:06 bd808: `tcpdump -n udp dst port 8324` shows packets leaving deployment-bastion for deployment-logstash1
  • 16:00 bd808: Stopped udp2log and started udp2log-wm with no apparent effect
  • 16:00 bd808: udp2log events not being sent from deployment-bastion to deployment-logstash1
  • 15:49 bd808: Restarted logstash on deployment-logstash1
  • 09:45 mwalker: rebasing puppet repo to get a ocg patch

July 24

  • 16:09 bd808: Reverted MW config to re-enable luasandbox mode; back to luastandalone for now
  • 15:44 bd808: Updated MW config to re-enable luasandbox mode
  • 15:43 bd808: Updated hhvm-luasandbox to 2.0-3 and restarted hhvm instances
  • 14:21 hashar: killed hhvm process on deployment-mediawiki01 and 02. init script does not work.
  • 02:59 ori: promoted legoktm to project-admin

July 23

  • 23:30 bd808: Running `find . -type d -exec chmod 777 {} +` in /data/project/upload7 to fix shared image dir permissions
  • 20:49 bd808: Changed config to run lua via external executable to avoid hhvm crashing bug
  • 16:20 bd808: hhvm upgraded to 3.1+20140723-1+wmf1 on deployment-mediawiki0[12]
  • 15:34 bd808: Reverted hhvm to 3.1+20140630-1+wm1 on deployment-mediawiki02
  • 15:21 bd808: Upgraded hhvm to 3.1+20140630; seeing problems with luasandbox extension
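The 23:30 permission fix can be sketched on a scratch tree (a stand-in for /data/project/upload7); `-type d` limits the chmod to directories so file modes are left untouched, and `+` batches many directories into each chmod invocation:

```shell
UPLOAD=$(mktemp -d)                          # stand-in for /data/project/upload7
mkdir -p "$UPLOAD/thumb/a/ab"
touch "$UPLOAD/thumb/a/ab/demo.png"
chmod 755 "$UPLOAD/thumb" "$UPLOAD/thumb/a" "$UPLOAD/thumb/a/ab"
find "$UPLOAD" -type d -exec chmod 777 {} +  # directories only, batched
```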

July 22

  • 14:26 hashar: upgrading varnish on deployment-cache-mobile03
  • 14:22 hashar: upgrading varnish on deployment-cache-text02
  • 14:02 hashar: rebooting deployment-cache-upload02 varnish not happy with memory mapping
  • 13:51 hashar: rebooting bits varnish cache
  • 13:43 hashar: rebased puppetmaster repo. Rebase broke after 0317463 (beta: New script to restart apaches) got merged in.
  • 13:35 hashar: apt-get upgrade on deployment-cache-bits01 + varnish upgrade
  • 09:28 hashar: Removing role::beta::natfix that is now handled by labs DNS and the class is removed with 146091

July 21

  • 23:37 ori: Switched over beta cluster app servers to HHVM
  • 21:27 bd808: Killed update.php jobs; Antoine will give jobs a longer timeout
  • 21:23 bd808: Running update.php for simplewiki in screen
  • 21:22 bd808: Running update.php for hewiki in screen
  • 21:21 bd808: Running update.php for eswiki in screen
  • 21:21 bd808: Running update.php for cawiki in screen
  • 21:21 bd808: Running update.php for commonswiki in screen
  • 21:18 hashar: Restarting udp2log-mw on deployment-bastion. There are a bunch of [python] <defunct> processes
  • 17:32 bd808: Updated scap to 4871208 (+ cherry pick of I6a56b5e)
  • 17:12 bd808: Hotfix for scap ssh host key checking to fix jenkins scap job
  • 17:03 bd808: Testing scap change I40a891b via cherry-pick
  • 10:25 hashar: on bastion, fixed some puppet dependency to have nutcracker to start with the proper configuration 148043
  • 10:20 hashar: upgrading packages on deployment-bastion
  • 10:19 hashar: deleted /var/lib/apt/lists/lock on bastion. It was preventing apt-get update from running
  • 10:18 hashar: setting up nutcracker on deployment-bastion. It was installed but the puppet class to configure it was not being applied. Related Gerrit patches: 148041 and 148042
  • 09:25 hashar: rebooting deployment-apache02
  • 09:22 hashar: rebooting deployment-apache01.
  • 00:27 ori: deployment-mediawiki01 & deployment-mediawiki02: configured for project-local puppet & salt masters
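Deleting /var/lib/apt/lists/lock blind is risky if apt is actually running; a more careful sketch (throwaway path, `flock` from util-linux assumed available) removes the file only when nothing holds it:

```shell
LOCK=/tmp/demo-apt-lists.lock        # stand-in for /var/lib/apt/lists/lock
touch "$LOCK"                        # simulate a stale, unheld lock file
if flock -n "$LOCK" true; then       # lock acquired: no process holds it
    rm -f "$LOCK"
    echo "stale lock removed"
else
    echo "lock is held; leaving it alone"
fi
```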

July 18

  • 00:30 bd808: removed local l10nupdate user from deployment-jobrunner01 and deployment-videoscaler01
  • 00:22 bd808: Killed stuck beta-update-databases-eqiad job (stuck for over 60m waiting for an executor; deadlock?)
  • 00:21 ori: beta broke due to I433826423. app servers load prod apache confs from /etc/apache2/wikimedia. temp fix: locally hack apache2.conf to load /usr/local/apache2/conf/all.conf; disable puppet.

July 17

  • 23:18 bd808: Puppet broken for deployment-bastion by labs specific logic in misc::deployment::vars.
  • 19:01 mwalker: possibly breaking labs by cherry picking an apparmor patch that affects mysql https://gerrit.wikimedia.org/r/#/c/147027/

July 16

  • 19:15 mwalker: updated puppet about 20 minutes ago for new ocg variables (now officially in production puppet instead of just cherry picked)

July 15

  • 18:26 bd808: Removed local mwdeploy user from /etc/passwd on deployment-videoscaler01 and deployment-jobrunner01
  • 16:59 bd808: scap failing to deployment-videoscaler01 and deployment-jobrunner01 due to other random failures now. Lots of strange permissions errors during rsync
  • 16:37 bd808: scap failing to deployment-videoscaler01 and deployment-jobrunner01 due to ssh auth failures; likely a puppet config problem

July 10

  • 22:37 bd808: Added Gergő Tisza and Yuvipanda as project admins

July 8

  • 23:37 bd808: Updated Kibana to 0afda49 (latest upstream head)
  • 17:03 greg-g: Added John F. Lewis to the project after his NDA was signed by Mark (RT 7722)

July 7

  • 20:55 bd808: Killed stuck `apt-get update` job on deployment-jobrunner01 started on Jun17
  • 20:20 bd808: Fixed puppet on deployment-analytics01 with manual apt-get commands.
  • 20:08 bd808: Ran `apt-get dist-upgrade` on deployment-analytics01 to upgrade hadoop, hive, pig, etc which were failing to update via puppet.

July 4

  • 02:28 RoanKattouw: Unbroke replication on deployment-db2, it's catching up now

July 3

  • 18:59 legoktm: manually created centralauth.renameuser_status table
  • 16:04 bd808: Updated scap to ff04431
  • 09:24 hashar: Reindexed ElasticSearch index for cawiki/eswiki with: mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki {cawiki,eswiki} --batch-size=50
  • 09:22 hashar: Blow up ElasticSearch indices for cawiki and eswiki with: mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php --wiki cawiki --startOver --indexType content && mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php --wiki cawiki --startOver --indexType general
  • 09:10 hashar: used addwiki.php to create the wiki. Manually triggered the Jenkins job that updates the databases https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/2319/
  • 09:06 hashar: Adding cawiki and eswiki for cxserver testing Ibbcbd4

July 2

  • 07:49 hashar: cxserver being configured! 140723 by Kartik and Niklas \O/

July 1

  • 15:46 bd808: Fixed git rebase conflict in operations/puppet on deployment-salt
  • 13:29 manybubbles: rebuilding Cirrus search index in beta to pick up new configuration and cache warmers
  • 11:20 hashar: Added Filippo Giunchedi to the project as an admin (WMF ops)

June 30

  • 20:47 bd808: The state of puppet for beta is badly broken. I have hacked things to get puppet to apply on deployment-apache0[12] but puppet won't apply on deployment-bastion in part due to the same hacks.
  • 18:48 bd808: Created symlink /apache -> /usr/local/apache on deployment-apache0[12] to fix docroot symlinks
  • 18:09 bd808: Beta apaches are broken with latest puppet config applied. Working to correct.
  • 18:08 bd808: Manually added symlink for /etc/apache/wmf on deployment-apache0[12]

June 26

June 25

  • 20:58 bd808: Fixed rebase conflict in operations/puppet.git on deployment-salt caused by cherry-picked vcl patch left over from varnish submodule usage

June 24

  • 19:29 bd808: Manually updated operations/puppet checkout on deployment-salt to deal with varnish submodule change

June 19

  • 22:47 bd808: Updated scap to 792a572
  • 22:46 bd808: Trebuchet runs on deployment-videoscaler01 are succeeding but not showing up in the `git deploy report` output
  • 22:40 bd808: Deleted /var/log/diamond/diamond.log on deployment-jobrunner01 because /var was full

June 18

  • 16:55 bd808: Setup hourly cron as user bd808 on deployment-salt to test automatic update of puppet repo using ~bd808/git-sync-upstream script

June 17

  • 20:36 bd808: Upgraded elasticsearch to version 1.2.1 on deployment-logstash1

June 16

  • 21:16 bd808: Jenkins beta-scap-eqiad job broken because of missing puppet config on deployment-jobrunner01; needs role::beta::scap_target
  • 20:36 bd808: Enabled puppet on deployment-jobrunner01 and forced a run
  • 20:34 bd808: Puppet disabled on deployment-jobrunner01 since 2014-06-03; No SAL logs explaining why
  • 20:19 bd808: Updated scap to 5adce72; trebuchet reported i-00000237 (deployment-videoscaler01) as not updating, but manual check shows it did sync properly
  • 20:00 bd808: Deleted /var/lib/puppet/state/agent_catalog_run.lock on deployment-bastion after verifying that no puppet processes were running
  • 19:55 bd808: Truncated /var/log/diamond/diamond.log and restarted diamond on deployment-bastion
  • 19:36 bd808: /var/log/diamond is 787M of 1.2G total logs
  • 19:29 bd808: /var 0% free on deployment-bastion; looking for things to clean-up
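The /var triage above (find the biggest logs, then truncate the live ones rather than delete them, so daemons holding open file descriptors keep working) can be sketched on a scratch directory; the paths are stand-ins:

```shell
VARLOG=$(mktemp -d)                                          # stand-in for /var/log
dd if=/dev/zero of="$VARLOG/diamond.log" bs=1024 count=512 2>/dev/null
dd if=/dev/zero of="$VARLOG/auth.log" bs=1024 count=1 2>/dev/null
du -k "$VARLOG"/* | sort -rn | head -n 3                     # biggest offenders first
truncate -s 0 "$VARLOG/diamond.log"                          # reclaim space in place
```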

June 9

  • 15:19 andrewbogott: doing a 'rebase origin' on deployment-salt, because it needs it.
  • 15:10 andrewbogott: updating all instances to puppet 3 via a cherry-pick of https://gerrit.wikimedia.org/r/#/c/137898/ on deployment-salt

June 7

  • 02:44 bd808: Restarted logstash on deployment-logstash1; last event logged at 2014-06-06T22:11:04

June 6

  • 19:26 bblack: synced labs/private on deployment-salt again
  • 16:30 bd808: Rebooted deployment-salt
  • 16:27 bd808: Made /var/log a symlink to /srv/var-log on deployment-salt
  • 16:26 bblack: Updated labs/private.git on puppetmaster. brings in updated zero+netmapper password for beta
  • 16:18 bd808: Changed from role::labs::lvm::biglogs to role::labs::lvm::srv on deployment-salt and made /var/lib a symlink to /srv/var-lib
  • 15:45 bd808: /var on deployment-salt still at 97% full after moving logs; /var/lib is our problem
  • 15:43 bd808: Archived deployment-salt:/var/log to /data/project/deployment-salt
  • 15:40 bd808: Disabled puppet on deployment-salt to work on disk space issues
  • 12:44 hashar: Updated labs/private.git on puppetmaster. Brings Brandon Black change "add labs copy of zerofetcher auth file" 137918
  • 02:48 mwalker: added role::labs::lvm::biglogs to deployment-salt because it is out of room on /var and I don't know what I can delete
  • 01:25 bd808: Live hacked /etc/apache2/wmf/hhvm.conf on apaches to allow them to start
  • 00:30 bd808: `git stash`ed dirty dblist files found in /a/common on deployment-bastion
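Stashing dirty files so a later pull or rebase can proceed, as was done for the dblist files in /a/common, can be sketched in a throwaway repo (all names below are illustrative):

```shell
REPO=$(mktemp -d)                      # throwaway repo standing in for /a/common
cd "$REPO"
git init -q
git -c user.email=sal@example.org -c user.name=sal \
    commit -q --allow-empty -m init
echo "testwiki" > all.dblist           # a dirty, uncommitted dblist change
git add all.dblist
git -c user.email=sal@example.org -c user.name=sal stash -q   # park the change
git status --porcelain                 # prints nothing: working tree is clean
```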

June 5

  • 14:16 manybubbles: rebuilt beta's jawiki search index without kuromoji - it didn't help much anyway
  • 14:14 manybubbles: recovered from busted elasticsearch - two problems: 1. I had an index that used the kuromoji plugin but I'd uninstalled it and 2. I had plugins for 1.2.1 but was trying to start 1.1.0. Solution: 1. delete the index and recreate it without kuromoji. 2. upgrade to 1.2.1 like I had planned on doing anyway.
  • 14:01 manybubbles: elasticsearch cluster got really angry in beta when I restarted some node - it's like they aren't talking to each other properly - trying to recover. Once that is done I'll upgrade to 1.2.1 and that might fix it
  • 13:59 hashar: deployment-elastic01 puppet was broken due to bug 63322 i.e. having some HTML garbage as ec2id which would be used as puppet certname
  • 13:47 manybubbles: rolling restart of elasticsearch nodes in beta to pick up new kernel

June 4

  • 20:46 bd808: Fixed file ownership on /data/project/apache/uncommon for beta-recompile-math-texvc-eqiad job
  • 19:27 manybubbles: sorry, can't do that yet,
  • 19:27 manybubbles: plugins deployed to beta - time to restart Elasticsearch in beta - should cause no interruption of service
  • 19:01 manybubbles: deploying Elasticsearch 1.2.1 and some updated plugins to beta
  • 17:11 bd808: Unwedged the jenkins jobs to updating beta by stopping the stuck db update job
  • 16:27 bd808: Changed uid/gid for files owned by l10nupdate user
  • 09:50 mwalker: Reset salt caches by running `salt '*' state.clear_cache` from deployment-salt -- deployment-pdf01 now no longer reports errors when returning status for deployment

June 3

  • 22:30 bd808: Deleted unused /data/project/apache/common-local on NFS share.

June 2

  • 19:42 bd808: Updated scap to a7da355
  • 05:14 bd808: Restarted logstash on deployment-logstash1; Last event logged at 2014-06-01T07:22:56

May 30

  • 21:45 bd808: Restarted uwsgi on deployment-graphite
  • 18:43 bd808: Updated scap to c4204dd

May 29

  • 21:07 bd808: mwalker cleaned up log spam from upstart on deployment-pdf01
  • 20:59 bd808: /var full on deployment-pdf01
  • 20:55 bd808: Restarted salt minion on deployment-pdf01 with `sudo salt 'i-00000396.eqiad.wmflabs' service.restart salt-minion`

May 28

  • 17:53 bd808: Restarted logstash on deployment-logstash1; last event logged at 2014-05-28T12:11:37
  • 16:56 bd808: Updated scap to fd7e538

May 27

  • 19:08 bd808: Updated scap to 48c7e28
  • 14:56 bd808: Updated scap to 9609e8d

May 23

  • 16:32 bd808: Upgraded elasticsearch to 1.1.0 on deployment-logstash1
  • 13:36 manybubbles: restarting elasticsearch on deployment-elastic01 to pick up some gc setting recommended by elasticsearch team

May 22

  • 23:00 bd808: Added 20after4 as a project admin
  • 22:59 bd808: Added matanya as a project member
  • 21:38 bd808|LUNCH: Deployed scap 096cb3f

May 21

  • 17:33 mwalker: converted deployment-pdf01 (i-00000396.eqiad.wmflabs) to use local puppet & salt master
  • 14:50 bd808: restarted logstash on deployment-logstash1; getting really tired of these soft crashes
  • 00:33 bd808: Puppet failing on deployment-videoscaler01 with duplicate definition of Class[Mediawiki::Jobrunner]
  • 00:07 bd808: Fixed puppet for deployment-jobrunner01 using https://gerrit.wikimedia.org/r/#/c/134519/2

May 20

  • 23:49 bd808: Fixed puppet for deployment-apache[12] using https://gerrit.wikimedia.org/r/#/c/134519/2
  • 23:11 bd808: deployment-apache01 needs more work: "Could not set shell on user[mwdeploy]"
  • 23:06 bd808: Fixing puppet config for upstream rename of role::applicationserver -> role::mediawiki
  • 21:14 ori: Converted deployment-stream to use local puppet & salt masters
  • 21:08 RoanKattouw: chown'ed /data/project/parsoid/parsoid.log from mwalker (?!?) to parsoid so Parsoid runs again
  • 15:53 bd808: Deployed scap 7b6fc47 via trebuchet

May 19

  • 14:34 bd808: Restarted logstash service on deployment-logstash1; it stopped logging new events at 10:37:13Z

May 16

  • 21:20 manybubbles: restarting elasticsearch in beta to update some plugins
  • 00:34 bd808: Updated EventLogging to I89819bd

May 15

  • 22:14 bd808: Restarted logstash on deployment-logstash1 yet again; memory leak from invalid encoding bug
  • 00:14 bd808: Disabled puppet on deployment-logstash1 to test a local logstash config change

May 14

  • 23:33 bd808: Added irc input to logstash via I409fec9

May 13

  • 09:28 bd808: Restarted logstash service on deployment-logstash1
  • 09:28 bd808: Logstash events stop at 2014-05-11T18:36:35Z; Log file shows many "Failed parsing date from field" errors which probably triggered the known upstream memory leak bug

May 10

  • 18:02 bd808: Restarted logstash on deployment-logstash1

May 9

  • 12:06 hashar: Creating en_rtlwiki wiki [[bugzilla:50335|bug 50335]]

May 6

  • 17:54 bd808: Restarted logstash on deployment-logstash1
  • 17:53 bd808: Logstash in beta hasn't recorded any events since 2014-05-04T04:32:36.
  • 15:33 manybubbles: rolling restart of Elasticsearch servers in beta to pick up a new highlighter plugin to fix bugs found when we fixed hebrew analysis, and to implement phrase highlighting.

May 5

  • 21:29 mwalker: ran puppetstoredconfigclean and revoked puppet and salt keys for i-00000339.eqiad.wmflabs (was pdf01)
  • 21:24 mwalker: removing pdf01 instance -- labs just uses production mwlib which works just fine. I'll recreate this when I make the OCG test instance
  • 20:57 manybubbles: deploying new plugin to Elasticsearch (swift)

May 3

  • 18:10 mwalker: Updated kernel on deployment-pdf01 (manually set console=ttyS0 to match older installed kernels)
  • 17:58 mwalker: Converted i-00000339.eqiad.wmflabs (deployment-pdf01) to use local puppet & salt masters
  • 17:54 mwalker: signed salt key for i-00000339.eqiad.wmflabs (deployment-pdf01)
  • 17:43 bd808: Added mwalker to under_NDA sudoers group

May 2

  • 17:01 bd808: Switched scap to use scripts delivered by trebuchet

May 1

  • 15:46 manybubbles: upgrading Elasticsearch highlighter via a rolling restart
  • 00:56 bd808: Fixed empty PrivateSettings.php configuration file (which I also broke earlier)

April 28

  • 16:12 manybubbles: upgrading highlighter plugin in Elasticsearch
  • 15:43 bd808: Created empty /srv/scap-stage-dir/wmf-config/mwblocker.log file to stop missing file warnings in beta.
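Creating the empty placeholder file can be done in one step with `install -D`, which also creates any missing parent directories; a sketch with a scratch path standing in for /srv/scap-stage-dir:

```shell
STAGE=$(mktemp -d)                    # stand-in for /srv/scap-stage-dir
install -D -m 0644 /dev/null "$STAGE/wmf-config/mwblocker.log"
ls -l "$STAGE/wmf-config/mwblocker.log"
```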

April 25

  • 11:31 hashar: commonswiki-75388f96: 0.6183 19.5M SQL ERROR (ignored): Table 'commonswiki.revtag_type' doesn't exist (10.68.16.193)
  • 11:30 hashar: Authentication is broken on the beta cluster. Well at least from commons.wikimedia.beta.wmflabs.org

April 23

  • 19:34 ^demon|lunch: created zhwiki, ukwiki, ruwiki, kowiki, hiwiki, jawiki for testing
  • 10:19 hashar: stopping udp2log and starting udp2log-mw instead (known old bug that prevents logging)

April 22

  • 18:42 bd808: Rebooting deployment-bastion in a wild attempt to get the jenkins slave there working again

April 18

  • 19:24 manybubbles: rebuilding Cirrus indexes to pick up auxiliary fields and smarter accent matching

April 16

  • 18:56 hashar: Migrating memc04 and memc05 to self-hosted puppet/salt masters [[bugzilla:64010|bug 64010]]
  • 13:13 manybubbles: done
  • 13:10 manybubbles: rolling restart of Elasticsearch nodes in beta to make super sure it picked up new plugins
  • 09:33 hashar: rebased puppetmaster

April 15

  • 20:02 manybubbles: restarting elasticsearch in beta to pick up a plugin update - no downtime should occur
  • 14:24 hashar: rebased puppetmaster

April 11

  • 17:41 bd808: Tried to enable role::protoproxy::ssl::beta on deployment-cache-text02 but it failed to apply because /etc/ssl/certs/star.wmflabs.org.pem and /etc/ssl/private/star.wmflabs.org.key don't match.
  • 03:59 bd808: sudo apt-get install mysql-client on deployment-bastion
  • 03:54 bd808: Added legoktm as a project member
  • 00:02 bd808: Enabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/

April 10

April 9

  • 23:04 bd808: Re-enabled puppet on deployment-apache02 and forced a puppet run
  • 21:39 bd808: Cherry-picked I8f77e0c into puppet and forced puppet run on deployment-bastion

April 8

  • 17:53 manybubbles: rebuilding simplewiki's search index optimized for the new highlighter to check the size difference
  • 05:34 Ryan_Lane: upgraded libssl on all nodes, restarted affected ssl servers
  • 05:03 Ryan_Lane: upgraded libssl on all salt accessible nodes

April 5

  • 11:19 hashar: Attempting to reenable SSL support with 124057

April 4

  • 21:39 bd808: Restarted logstash; it stopped processing events again at 2014-04-04T19:56:46Z
  • 17:31 bd808: Forced puppet run on deployment-cache-text02
  • 17:29 bd808: Manually fixed puppet config on deployment-cache-text02 (the cert html error problem)
  • 17:22 bd808: Rebooting deployment-cache-bits01
  • 17:21 bd808: Forced puppet run on deployment-cache-bits01
  • 16:15 manybubbles: Performing a rolling restart of Elasticsearch nodes to pick up a new plugin

April 3

  • 17:32 bd808: Fixed certname in /etc/puppet/puppet.conf manually on deployment-bastion so puppet would run again.
  • 15:33 bd808: Restarted logstash on deployment-logstash1; Stuck in a bad state due to jvm oom logged at 2014-04-03T12:03:43Z

April 2

  • 17:54 manybubbles: done installing plugins on Elasticsearch in beta
  • 14:10 hashar: Fixed database updating job https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ . It was not running on the proper node.
  • 12:50 hashar: restarted parsoid daemon on deployment-parsoid04.eqiad.wmflabs. It also now logs to /data/project/parsoid/parsoid.log
  • 12:36 hashar: Manually deleting parsoid user/group on deployment-parsoid04. Will use the LDAP uid/gid instead.

April 1

  • 21:38 hashar: Removed the Zuul triggers that updated beta cluster in PMTPA 123100.
  • 19:49 bd808: Converted deployment-graphite.eqiad.wmflabs to use local puppet & salt masters
  • 19:20 bd808: Deleting and re-creating deployment-graphite because I forgot to add the web security group
  • 15:57 andrewbogott: shutting down all pmtpa instances
  • 14:32 manybubbles: completed upgrade to Elasticsearch 1.1.0 and fixed deployment-elastic04.
  • 13:32 hashar: Thumbs access more or less fixed
  • 13:31 hashar: deployment-upload is rejecting connection on port 80. Applying role::beta::uploadservice from 122786
  • 13:30 manybubbles: upgrading labs Elasticsearch to 1.1.0
  • 13:06 hashar: Applying role::beta::natfix on deployment-upload.eqiad.wmflabs . Might let it access images from commons.wikimedia.beta.wmflabs.org ( ex: http://upload.beta.wmflabs.org/wikipedia/commons/thumb/4/43/Feed-icon.svg/16px-Feed-icon.svg.png yields: Error retrieving thumbnail from scaling server: couldn't connect to host commons.wikimedia.beta.wmflabs.org )
  • 08:31 hashar: MediaWiki config paths tweaks for Math [[bugzilla:63331|bug 63331]] and Captchas [[bugzilla:63342|bug 63342]]
  • 00:32 bd808: Converted deployment-graphite to use local puppet & salt masters

March 31

  • 21:02 hashar: Made the Parsoid daemon write its logs to /data/project/parsoid/parsoid.log 122561
  • 20:47 hashar: Puppet master is fixed. The certificates got badly messed up, had to regenerate them following the documentation "Regenerate Certificates for Puppet Master"
  • 20:17 hashar: restarted parsoid daemon
  • 20:00 hashar: stopped parsoid . It is killing the application servers
  • 19:53 hashar: restarting both apaches
  • 19:21 hashar: restarting job service on jobrunner01 to apply 122436
  • 19:20 hashar: Unbreak puppetmaster on deployment-salt.eqiad.wmflabs
  • 19:01 hashar: puppet master is broken :(
  • 17:39 hashar: lowering # of jobs spawned by the jobrunner 122436
  • 16:00 bd808: Restarted logstash service on deployment-logstash1; no new log events seen since 2014-03-28T10:57
  • 15:58 bd808: Updated kibana on deployment-logstash1 to e317bc6
  • 15:56 hashar_: Cluster slow because some CirrusSearch job is spamming simplewiki. Gotta find a way to throttle the number of jobs being run on jobrunner01 or add more apache boxes. It is transient anyway; might look at limiting the runs tonight
  • 15:10 hashar_: Rebased puppet repository. Only one hack left: https://gerrit.wikimedia.org/r/#/c/119534/
  • 14:20 hashar: deleting deployment-parsoidcache01 cache the hard way: stopping varnish, deleting files in /srv/vdb/, starting varnish
  • 14:05 hashar: shutting down database and apache boxes for now.
  • 14:03 hashar: shutting down varnish instances in pmtpa
  • 13:56 hashar: Deleted deployment-cache-upload01 , replaced by deployment-cache-upload02
  • 13:52 hashar: upload varnish cache working :-]
  • 13:47 hashar: applying role::cache::upload to deployment-cache-upload02
  • 13:37 hashar: migrating deployment-cache-upload02.eqiad.wmflabs to self puppet/salt master
  • 13:22 hashar: Creating deployment-cache-upload02 to replace deployment-cache-upload01 which was missing the security group "web"
  • 11:30 hashar: Update DNS entries to point to EQIAD instances (aka switching beta cluster to eqiad)
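
The 14:20 entry above wipes a varnish cache "the hard way": stop the daemon, delete the on-disk storage, start it again. A runnable sketch of that pattern, using a scratch directory in place of the real /srv/vdb so it can be exercised anywhere; the service commands and the cache file name are assumptions shown as comments/stand-ins, not taken verbatim from the host:

```shell
#!/bin/sh
# Demo of the stop / delete / start cache-wipe sequence against a
# scratch directory. On the real cache host the path would be
# /srv/vdb/ and the rm would be bracketed by service stop/start.
CACHE="$(mktemp -d)/vdb"
mkdir -p "$CACHE"
touch "$CACHE/varnish_storage.bin"   # stand-in for the cache files

# service varnish stop               # (real cache host only)
rm -rf "$CACHE"/*                    # wipe the persistent cache files
# service varnish start             # (real cache host only)
```

After the wipe the directory still exists but is empty, so varnish rebuilds its storage from scratch on the next start.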

March 28

  • 16:18 hashar: rebased puppet on deployment-salt
  • 15:39 hashar: Last log made to wrong project
  • 15:39 hashar: deleting instance integration-selenium-driver, no longer needed. browsertests jobs should now be runnable on integration-slave1001 and integration-slave1002 (in eqiad)
  • 10:54 hashar: deleting instance integration-debian-builder . That is breaking all debian-glue jobs. Will revisit later next week to get pbuilder/cowbuilder set up on the other eqiad slaves
  • 08:48 hashar: deleting integration-slave-pbuilder. Unneeded (i need a coffee)
  • 08:43 hashar: Created integration-slave-pbuilder on eqiad to replace pmtpa instance integration-debian-builder
  • 00:23 bd808: `sudo chmod -R a+rwx /data/project/upload7`; We need to get this file permissions thing figured out

March 27

  • 15:23 hashar: role::beta::natfix cant run on deployment-bastion.eqiad because the ferm rules conflicts with the Augeas rules coming from udp2log :-(
  • 15:21 hashar: applying role::beta::natfix on deployment-bastion.eqiad
  • 14:58 hashar: fixed up role::beta::natfix . Ferm is now being applied again on various application server instances 121378
  • 13:58 hashar: rebased puppetmaster git repository, reapplied ottomata live hacks.
  • 12:55 hashar: mediawiki l10n cache being rebuilt!!!
  • 12:54 hashar: Fixed permissions on eqiad bastion for /srv/scap . Others (such as mwdeploy) could not read / execute scap scripts
  • 11:29 hashar: MediaWiki code and configuration are now self updating on EQIAD cluster via Jenkins jobs. First run: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/4/console
  • 11:11 hashar: deleting job beta-code-update , replaced by datacenter variants beta-code-update-pmtpa and beta-code-update-eqiad
  • 10:54 hashar: Deleting job beta-update-databases , replaced by datacenter variants beta-update-databases-pmtpa and beta-update-databases-eqiad

March 26

  • 19:05 bd808: Added ottomata as a project member and admin
  • 15:46 springle: deployment-db1 data loaded
  • 14:45 bd808: created proxy https://logstash-beta.wmflabs.org for logstash instance
  • 14:17 hashar: fixed up redis configuration in eqiad. Jobrunner is happy now: aawiki-504cd7d2: 0.9649 21.5M Creating a new RedisConnectionPool instance with id 627014d. 121060
  • 14:05 hashar: udp2log functional on eqiad beta cluster \O/
  • 13:55 hashar: stopping udp2log on eqiad bastion, starting udp2log-mw (really should fix that issue one day)
  • 13:52 hashar: dropped some live hack on eqiad in /data/project/apache/common-local and ran git pull
  • 13:14 hashar: Dropping enwikivoyage and dewikivoyage databases from sql02. Related changes are updating the Jenkins config: https://gerrit.wikimedia.org/r/#/c/121045/ and cleaning up the mw-config : https://gerrit.wikimedia.org/r/#/c/121047/
  • 07:53 springle: installed mariadb via puppet on deployment-db1. no data yet

March 25

  • 19:43 hashar: created jenkins slave deployment-bastion.eqiad
  • 17:17 hashar: Created and validated job that updates Parsoid on the EQIAD beta cluster \O/

March 24

  • 23:16 marktraceur: Touching all the MMV scripts because they're not getting invalidated or something
  • 23:10 hashar: l10n cache got broken due to a PHP fatal error I introduced. It is back up now. Found out via https://integration.wikimedia.org/dashboard/
  • 23:09 hashar: upgraded all pmtpa varnishes, ran puppet on all of them. all set!
  • 22:57 hashar: restarting deployment-cache-upload04 , apparently stalled
  • 22:48 hashar: upgrading varnish on all pmtpa caches.
  • 22:47 hashar: apt-get upgrade varnish on deployment-cache-bits03
  • 22:45 marktraceur: attempted restart of varnish on betalabs; seems to have failed, trying again
  • 22:42 hashar: made marktraceur a project admin and granted sudo rights
  • 22:39 marktraceur: Restarting betalabs varnish to workaround https://bugzilla.wikimedia.org/show_bug.cgi?id=63034
  • 17:25 bd808: Converted deployment-db1.eqiad.wmflabs to use local puppet & salt masters
  • 17:06 bd808: Changed rules in sql security group to use CIDR 10.0.0.0/8.
  • 17:05 bd808: Changed rules in search security group to use CIDR 10.0.0.0/8.
  • 17:05 bd808: Built deployment-elastic04.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
  • 16:19 bd808: Built deployment-elastic03.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
  • 16:08 bd808: Built deployment-elastic02.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
  • 15:54 bd808: Built deployment-elastic01.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
  • 10:31 hashar: migrated deployment-solr to self puppet/salt masters

March 21

  • 09:29 hashar: l10ncache is now rebuilt properly: https://integration.wikimedia.org/ci/job/beta-code-update/53508/console
  • 09:23 hashar: fixing l10ncache on deployment-bastion : chown -R l10nupdate:l10nupdate /data/project/apache/common-local/php-master/cache/l10n The l10nupdate UID/GID have been changed and are now in LDAP

March 20

  • 23:46 bd808: Mounted secondary disk as /var/lib/elasticsearch on deployment-logstash1
  • 23:46 bd808: Converted deployment-tin to use local puppet & salt masters
  • 22:09 hashar: Migrated videoscaler01 to use self salt/puppet masters.
  • 21:30 hashar: manually installing timidity-daemon on jobrunner01.eqiad so puppet can stop it and stop whining
  • 21:00 hashar: migrate jobrunner01.eqiad.wmflabs to self puppet/salt masters
  • 20:55 hashar: deleting deployment-jobrunner02, let's start with a single instance for now
  • 20:51 hashar: Creating deployment-jobrunner01 and 02 in eqiad.
  • 15:47 hashar: fixed salt-minion service on deployment-cache-upload01 and deployment-cache-mobile03 by deleting /etc/salt/pki/minion/minion_master.pub
  • 15:30 hashar: migrated deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs to use the salt/puppetmaster deployment-salt.eqiad.wmflabs.
  • 15:30 hashar: deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs recovered!! /dev/vdb does not exist on eqiad, which caused the instances to stall.
  • 10:48 hashar: Stopped the simplewiki script. Would need to recreate the db from scratch instead
  • 10:37 hashar: Cleaning up simplewiki by deleting most pages in the main namespace. Would free up some disk space. deleteBatch.php is running in a screen on deployment-bastion.pmtpa.wmflabs
  • 10:08 hashar: applying role::labs::lvm::mnt on deployment-db1 to provide additional disk space on /mnt
  • 09:39 hashar: convert all remaining hosts but db1 to use the local puppet and salt masters
  • 04:40 springle: created deployment-db1 for mariadb master in eqiad
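
The 15:47 entry above fixed stuck salt minions by deleting /etc/salt/pki/minion/minion_master.pub: a minion caches the master's public key and refuses to talk to a re-keyed master until that cached copy is removed. A sketch of the fix, using a scratch directory so it runs without a salt install; the restart command is an assumption shown as a comment:

```shell
#!/bin/sh
# Simulate the stale cached master key, then apply the fix: delete the
# cached key so the minion re-fetches it from the (re-keyed) master.
PKI="$(mktemp -d)/etc/salt/pki/minion"
mkdir -p "$PKI"
echo "stale master key" > "$PKI/minion_master.pub"   # the bad cached copy

rm -f "$PKI/minion_master.pub"   # drop the stale cached key
# service salt-minion restart    # (real host; minion re-keys on reconnect)
```

On the real hosts the same two steps (delete the cached key, restart the minion) brought salt back.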

March 19

  • 21:23 bd808: Converted deployment-cache-text02 to use local puppet & salt masters
  • 20:21 hashar: migrating eqiad varnish caches to use xfs
  • 17:58 bd808: Converted deployment-parsoid04 to use local puppet & salt masters
  • 17:51 bd808: Converted deployment-eventlogging02 to use local puppet & salt masters
  • 17:22 bd808: Converted deployment-cache-bits01 to use local puppet & salt masters; puppet:///volatile/GeoIP not found on deployment-salt puppetmaster
  • 17:00 bd808: Converted deployment-apache02 to use local puppet & salt masters
  • 16:49 bd808: Converted deployment-apache01 to use local puppet & salt masters
  • 16:30 hashar: Varnish caches in eqiad are failing puppet because there is no /dev/vdb. Will figure it out tomorrow :-]
  • 16:15 hashar: Applying role::logging::mediawiki::errors on deployment-fluoride.eqiad.wmflabs . It is not receiving anything yet though.
  • 15:50 hashar: fixed udp2log-mw daemon not starting on eqiad bastion (/var/log/udp2log belonged to the wrong UID/GID)
  • 15:49 hashar: deleted local user l10nupdate on deployment-bastion. It is in ldap now.

March 18

  • 03:31 bd808: deployment-bastion now using deployment-salt as puppet master

March 17

  • 15:02 hashar: Starting to copy /data/project from pmtpa to eqiad
  • 14:46 hashar: manually purging all commonswiki archived files (on beta of course)

March 14

  • 14:47 hashar: changing uid/gid of mwdeploy which is now provisioned via LDAP (aka deleting local user and group on all instances + file permissions tweaks)

March 11

  • 10:46 hashar: dropping some unused databases from deployment-sql instance.

March 10

March 6

  • 09:07 hashar: restarted varnish and varnish-frontend on deployment-cache-text1

March 5

  • 17:26 hashar: hacked in mwversioninuse to return "master=aawiki". Relaunched l10n job using mwdeploy user and then running mw-update-l10n
  • 17:07 hashar: mwversioninuse gives a wmf branch instead of master. That breaks l10n messages update and the job https://integration.wikimedia.org/ci/job/beta-code-update/ . Root cause is the python based scap.
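
The 17:26 workaround above "hacked in mwversioninuse to return master=aawiki". A hypothetical reconstruction of that kind of shim: shadow the helper on $PATH with a stub that always reports the master branch, so the l10n job stops picking up a wmf branch. The stub's name and its one-line output are taken from the log entry; everything else (the PATH shadowing mechanism) is an assumption:

```shell
#!/bin/sh
# Shadow the mwversioninuse helper with a stub that always reports
# the master branch for aawiki.
SHIM="$(mktemp -d)"
cat > "$SHIM/mwversioninuse" <<'EOF'
#!/bin/sh
echo "master=aawiki"
EOF
chmod +x "$SHIM/mwversioninuse"
export PATH="$SHIM:$PATH"   # stub now wins over the real helper

mwversioninuse              # prints: master=aawiki
```

With the stub in place, re-running the l10n job as mwdeploy and then mw-update-l10n (as in the entry) sees only the master branch.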

March 3

  • 17:28 manybubbles: doing an Elasticsearch reindex on beta before I try another one in production

February 28

  • 10:17 hashar: Puppet running on varnish upload cache after several months. Might break random things in the process :(

February 27

  • 14:11 manybubbles: upgrading beta to Elasticsearch 1.0

February 26

  • 20:44 hashar: Cleaning up commonswiki archived files with mwscript deleteArchivedFiles.php --wiki=commonswiki --delete
  • 20:44 hashar: deleted all files from http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload (gwtoolset import test). Deleted File:Title_0* (Selenium tests).
  • 15:06 hashar: deleted all thumbs from shared directory: /data/project/upload7/*/*/thumb/*
  • 14:54 hashar: cleaning out 2013 archived logs.

February 25

  • 08:42 hashar: Upgrading all varnishes.

February 24

  • 23:36 MaxSem: Rolled back
  • 23:25 hoo: recursively chowned extensions/MobileFrontend to mwdeploy:mwdeploy
  • 23:21 hoo: chowned /data/project/apache/common-local/php-master/extensions/.git/modules/MobileFrontend/* to mwdeploy:mwdeploy
  • 17:47 MaxSem: Investigating a mobile bug, might cause intermittent problems
  • 17:36 MaxSem: Rebooted deployment-cache-mobile01 - was impossible to log into it though Varnish still worked

February 21

  • 19:42 MaxSem: Adjusted read privs on /home/wikipedia/syslog/apache.log to allow fatalmonitor to work

February 19

  • 16:24 hashar: -bastion : /etc/init.d/udp2log stop && /etc/init.d/udp2log-mw start (known bug)
  • 16:23 hashar: rebooting -bastion
  • 16:22 hashar: rebooting apache32 and apache33, breaking beta :-]

February 17

  • 15:26 hashar: rebooting bits cache

February 11

  • 21:55 manybubbles: update elasticsearch schema after recent changes. will run a links update as well

February 6

  • 22:20 Krinkle: Manually ran changePassword.php to help someone (password reminder emails don't get sent)
  • 14:43 hashar: restarting udp2log-mw on deployment-bastion. logstash.wmflabs.org had not been receiving fatal logs since Jan 31st

February 4

  • 17:22 hashar: fixed up beta-parsoid-update job so Parsoid should be up to date again. The issue is that the multigit job pointed to a wrong host (ZUUL_URL should be zuul.eqiad.wmnet)
  • 13:33 hashar: removing role::memcached from both apache servers
  • 09:58 hashar: rebooting all varnish caches
  • 09:57 hashar: Upgrading all varnish

February 3

  • 16:59 hashar: upgrading varnish on deployment-parsoidcache3

January 30

  • 19:35 hashar: deployment-cache-bits03: restarted gmond, which had leaked memory. Upgrading varnish
  • 19:32 hashar: Canceled varnish package upgrade on deployment-cache-mobile01 , it runs a specific version ( 3.0.5plus~wmftest-wm1 ) instead of 3.0.3plus~rc1-wm29
  • 19:30 hashar: upgrading varnish on deployment-cache-mobile01
  • 19:29 hashar: upgrading varnish on deployment-cache-bits03
  • 19:29 hashar: upgrading varnish on deployment-staging-cache-mobile02
  • 19:28 hashar: upgrading varnish on deployment-cache-upload04
  • 19:27 hashar: reenabling puppet on deployment-cache-mobile01
  • 17:10 manybubbles: done reindexing beta. everything looks good
  • 16:54 manybubbles: reindexing beta like we're going to do in production when the release train departs later today

January 28

  • 17:10 hashar: added addshore and jhall to project so they can grep logs

January 27

  • 15:17 hashar: applying role::beta::fatalmonitor puppet class on deployment-bastion bug 60046

January 23

  • 19:38 hashar: VisualEditor was not being updated properly because some files belonged to root instead of mwdeploy. Ran chown -R mwdeploy:mwdeploy /data/project/apache/common-local/php-master/extensions/VisualEditor

January 16

  • 20:54 manybubbles: turning on elasticsearch's disk-space-aware allocator

January 15

  • 21:14 manybubbles: finished updating to elasticsearch 0.90.10
  • 08:48 andrewbogott: rebooted deployment-cache-text1

January 2

  • 15:32 hashar: Migrated parsoid on deployment-parsoid2 to use mediawiki/services/parsoid out of checkouts made in /srv/deployment/parsoid/{parsoid,deploy}. No job self-updating it yet
  • 15:00 manybubbles: finished upgrading Elasticsearch in beta. We're on 0.90.9 now.
  • 14:07 hashar: running mw-update-l10n , it was broken because of https://gerrit.wikimedia.org/r/#/c/104741/ fixed up by https://gerrit.wikimedia.org/r/#/c/104953/
  • 13:54 manybubbles: upgrading Elasticsearch servers in beta

December 26

  • 18:54 manybubbles: performing in place index rebuild for wikis in beta after recent cirrus update

December 23

  • 20:40 anomie: Restarting mw-job-runner service on deployment-jobrunner08, since jobs don't seem to be running
  • 20:03 anomie: Restarting apache on deployment-apache33 to see if that clears the odd errors going on

December 18

  • 10:56 hashar: reenabling puppet on parsoid2 and deploying the new Parsoid upstart configuration 99656