Nova Resource:Tools/SAL/Archive 1

December 23

  • 06:00 YuviPanda: tools-uwsgi-01 randomly went to SHUTOFF state, rebooting from virt1000

December 22

  • 07:43 YuviPanda: increased RAM and Cores quota for tools

December 19

  • 16:38 YuviPanda: puppet disabled on tools-webproxy because urlproxy.lua is handhacked to remove stupid syntax errors that got merged.
  • 12:00 YuviPanda|brb: created tools-static, static http server
  • 07:07 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

December 17

  • 22:38 YuviPanda: touched /data/project/repo/Packages so tools-webproxy stops complaining about that not existing and never running apt-get

December 12

  • 14:08 scfc_de: Ran Puppet on all hosts to fix puppet-run issue.

December 11

  • 07:58 YuviPanda: rebooted tools-login, wasn’t responsive.

December 8

  • 00:15 YuviPanda: killed all db and tools-webproxy aliases in /etc/hosts for tools-webproxy, since otherwise puppet fails because ec2id thinks we’re not in labs because hostname -d is empty because we set /etc/hosts to resolve IP directly to tools-webproxy

December 7

  • 06:31 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
  • 06:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).

December 2

  • 21:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
  • 21:30 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 26

  • 19:26 YuviPanda: created tools-webgrid-05 on trusty to set up a working webnode for trusty

November 25

  • 06:53 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 24

  • 14:02 YuviPanda: rebooting tools-login, OOM'd
  • 02:51 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 22

  • 19:05 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 17

  • 20:40 YuviPanda: cleaned out /tmp on tools-login

November 16

  • 21:31 matanya: back to normal
  • 21:27 matanya: "Could not resolve hostname bastion.wmflabs.org"

November 15

  • 07:24 YuviPanda|zzz: move coredumps from tools-webgrid-04 to /home/yuvipanda

November 14

  • 20:23 YuviPanda: cleared out coredumps on tools-webgrid-01 to free up space
  • 18:26 YuviPanda: cleaned out core dumps on tools-webgrid
  • 16:55 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM).

November 13

  • 21:11 YuviPanda: disable puppet on tools-dev to check shinken
  • 21:00 scfc_de: qmod -cq continuous@tools-exec-09,continuous@tools-exec-11,continuous@tools-exec-13,continuous@tools-exec-14,mailq@tools-exec-09,mailq@tools-exec-11,mailq@tools-exec-13,mailq@tools-exec-14,task@tools-exec-06,task@tools-exec-09,task@tools-exec-11,task@tools-exec-13,task@tools-exec-14,task@tools-exec-15,webgrid-lighttpd@tools-webgrid-01,webgrid-lighttpd@tools-webgrid-02,webgrid-lighttpd@tools-webgrid-04 (fallout from /var being full).
  • 20:38 YuviPanda: didn't actually stop puppet, need more patches
  • 20:38 YuviPanda: stopping puppet on tools-dev to test shinken
  • 15:30 scfc_de: tools-exec-06, tools-webgrid-01: rm -f /var/tmp/core/*.
  • 13:31 scfc_de: tools-exec-09, tools-exec-11, tools-exec-13, tools-exec-14, tools-exec-15, tools-webgrid-02, tools-webgrid-04: rm -f /var/tmp/core/*.

November 12

  • 22:07 StupidPanda: enabled puppet on tools-exec-07
  • 21:47 StupidPanda: removed coredumps from tools-webgrid-04 to reclaim space
  • 21:45 StupidPanda: removed coredump from tools-webgrid-01 to reclaim space
  • 20:31 YuviPanda: disabling puppet on tools-exec-07 to test shinken

November 7

  • 13:56 scfc_de: tools-submit, tools-webgrid-04: rm -f /var/log/exim4/paniclog (OOM around the time of the filesystem outage).

November 6

  • 13:21 scfc_de: tools-dev: Gzipped /var/log/account/pacct.0 (804111872 bytes); looks like root had his own bigbrother instance running on tools-dev (multiple invocations of webservice per second).

November 5

  • 19:15 mutante: exec nodes have p7zip-full now
  • 10:07 YuviPanda: cleaned out pacct and atop logs on tools-login

November 4

  • 19:50 mutante: apt-get clean on tools-login, and gzipped some logs

November 1

  • 12:51 scfc_de: Removed log files in /var/log/diamond older than five weeks (pdsh -f 1 -g tools sudo find /var/log/diamond -type f -mtime +35 -ls -delete).
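
The find invocation above can be previewed before anything is deleted. A minimal sketch, assuming GNU find and GNU touch; the function name list_old_logs and the scratch file names are hypothetical, and -delete is deliberately left out so the command only lists candidates:

```shell
#!/bin/sh
# list_old_logs DIR DAYS: print regular files under DIR whose mtime is
# more than DAYS days old. Dry run only: nothing is removed here; the
# logged command adds -ls -delete once the list looks right.
list_old_logs() {
    find "$1" -type f -mtime "+$2" -print
}

# Demonstration in a scratch directory (hypothetical file names).
demo_dir=$(mktemp -d)
touch -d '40 days ago' "$demo_dir/archive.log.old"   # GNU touch date syntax
touch "$demo_dir/diamond.log"                        # fresh file, not listed
list_old_logs "$demo_dir" 35
rm -rf "$demo_dir"
```

Only the 40-day-old file should be listed; the cutoff of 35 days matches the "five weeks" in the entry.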

October 30

  • 14:37 YuviPanda: cleaned out pacct and atop logs on tools-dev
  • 06:18 paravoid: killed a "vi" process belonging to user icelabs and running for two days saturating the I/O network bandwidth, and rm'ed a 3.5T(!) .final_mg.txt.swp

October 27

  • 16:06 scfc_de: tools-mail: Killed -HUP old queue runners and restarted exim4; probably the source of paniclog's "re-exec of exim (/usr/sbin/exim4) with -Mc failed: No such file or directory".
  • 15:36 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Recreated (empty) /var/log/apache2 and /var/log/upstart.

October 26

  • 12:35 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/account.
  • 12:33 scfc_de: tools-trusty: Went through shadowed /var and rebooted.
  • 12:31 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/exim4, started exim4 and ran queue.

October 24

  • 20:31 andrewbogott: moved tools-exec-12, tools-shadow and tools-mail to virt1006

October 23

  • 22:55 Coren: reboot tools-shadow, upstart seems hosed

October 14

  • 23:22 YuviPanda|zzz: removed stale puppet lockfile and ran puppet manually on tools-exec-07

October 11

  • 15:31 andrewbogott: rebooting tools-master, stab in the dark
  • 06:01 YuviPanda: restarted gridengine-master on tools-master

October 4

  • 18:31 scfc_de: tools-mail: Deleted /usr/local/bin/collect_exim_stats_via_gmetric and root's crontab; clean-up for Ic9e0b5bb36931aacfb9128cfa5d24678c263886b

October 2

  • 17:59 andrewbogott: added Ryan back to tools admins because that turned out to not have anything to do with the bounce messages
  • 17:32 andrewbogott: removing ryan lane from tools admins, because his email in ldap is defunct and I get bounces every time something goes wrong in tools

September 28

  • 14:45 andrewbogott: rebased /var/lib/git/operations/puppet on toolsbeta-puppetmaster3

September 25

  • 14:43 YuviPanda: cleaned up ghost /var/log (from before biglogs mount) that was taking up space, /var space situation better now

September 17

  • 21:40 andrewbogott: caused a brief auth outage while messing with codfw ldap

September 15

  • 11:00 YuviPanda: tested CPU monitoring on tools-exec-12 by running stress, seems to work

September 13

  • 20:52 yuvipanda: cleaned out rotated log files on tools-webproxy

September 12

  • 21:54 jeremyb: [morebots] booted all bots, reverted to using systemwide (.deb) codebase

September 8

  • 16:08 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM @ 2014-09-07 15:13:59)

September 5

  • 22:22 scfc_de: Deleted stale nginx entries for "rightstool" and "svgcheck"
  • 22:20 scfc_de: Stopped 12 webservices for tool "meta" and started one
  • 18:50 scfc_de: geohack's lighttpd dumped core and left an entry in Redis behind; tools-webproxy: "DEL prefix:geohack"; geohack: "webservice start"

September 4

  • 19:47 lokal-profil: local-heritage Renamed two swedish tables

September 2

  • 04:31 scfc_de: "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all hosts in support of bug #70076

August 23

  • 17:44 scfc_de: qmod -cq task@tools-exec-07 (job #2796555, "11  : before job")

August 21

  • 20:05 scfc_de: Deployed release 1.0.11 of jobutils and miscutils

August 15

  • 16:45 legoktm: fixed grrrit-wm
  • 16:36 legoktm: restarting grrrit-wm

August 14

  • 22:36 scfc_de: Removed again jobs in error state due to LDAP with "for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then qdel "$JOBID"; fi; done"; cf. also bug #69529
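
The sed filter used in the loop above can be tried without access to the grid. A minimal sketch, assuming GNU sed (for `\+`) and input whose job ID starts at the beginning of the line, as the logged expression expects; the function name eqw_job_ids and the sample lines are made up for illustration and are not real qstat output:

```shell
#!/bin/sh
# eqw_job_ids: read qstat-style lines on stdin and print the leading job ID
# of every line that contains "Eqw" (same expression as in the log entry).
eqw_job_ids() {
    sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p'
}

# Hypothetical sample input: one job in error state, one running.
printf '%s\n' \
    '1234567 0.30000 cron-job tools.example Eqw 08/14/2014 22:00:00' \
    '1234568 0.30000 webservice tools.example r 08/14/2014 21:00:00' |
    eqw_job_ids
```

Note that the expression matches "Eqw" anywhere on the line, which is presumably why the logged loop re-checks each candidate with qstat -j before calling qdel.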

August 12

  • 03:32 scfc_de: tools-exec-08, tools-exec-wmt, tools-webgrid-02, tools-webgrid-03, tools-webgrid-04: Removed stale "apt-get update" processes to get Puppet working again

August 2

  • 16:39 scfc_de: tools.mybot's crontab uses qsub without -M, added that as a temporary measure and will inform user later
  • 16:36 scfc_de: Manually rerouted mails for tools.mybot@tools-submit.eqiad.wmflabs

August 1

  • 22:41 scfc_de: Deleted all jobs in "E" state that were caused by an LDAP failure at ~ 2014-07-30 07:00Z ("can't get password entry for user [...]")

July 24

  • 20:53 scfc_de: Set SGE "mailer" parameter again for bug #61160
  • 14:51 scfc_de: Removed ignored file /etc/apt/preferences.d/puppet_base_2.7 on all hosts

July 21

  • 18:39 scfc_de: Removed stale Redis entries for currentevents, misc2svg, osm4wiki, wp-signpost, wscredits and yadfa
  • 18:38 scfc_de: Restarted webservice for stewardbots because it wasn't in Redis
  • 18:33 scfc_de: Stopped eight (!) webservices of tools.bookmanagerv2 and started one again

July 18

  • 14:29 scfc_de: admin: Set up .bigbrotherrc for toolhistory
  • 13:24 scfc_de: Made tools-webgrid-04 a grid submit host
  • 12:58 scfc_de: Made tools-webgrid-03 a grid submit host

July 16

  • 22:41 YuviPanda: reloaded nginx on tools-webproxy to pick up https://gerrit.wikimedia.org/r/#/c/146466/3
  • 15:18 scfc_de: replagstats OOMed four hours after start on May 6th; with ganglia.wmflabs.org down, not restarting
  • 15:14 scfc_de: Restarted toolhistory with 350 MBytes; OOMed June 1st

July 15

  • 11:31 scfc_de: Started webservice for sulinfo; stopped at 2014-06-29 18:31:04

July 14

  • 20:40 andrewbogott: on tools-login
  • 20:39 andrewbogott: manually deleted /var/lib/apt/lists/lock, forcing apt to update

July 13

  • 13:13 scfc_de: tools-exec-13: Moved /var/log around, reboot, iptables-restore & reenabled queues
  • 13:11 scfc_de: tools-exec-12: Moved /var/log around, reboot & iptables-restore

July 12

  • 17:57 scfc_de: tools-exec-11: Stopping apache2 service; no clue how it got there
  • 17:53 scfc_de: tools-exec-11: Moved log files around, rebooted, restored iptables and reenabled queue ("qmod -e {continuous,task}@tools-exec-11...")
  • 13:00 scfc_de: tools-exec-11, tools-exec-13: qmod -r continuous@tools-exec-1[13].eqiad.wmflabs in preparation of reboot
  • 12:58 scfc_de: tools-exec-11, tools-exec-13: Disabled queues in preparation of reboot
  • 11:58 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: mkdir -m 2750 /var/log/exim4 && chown Debian-exim:adm /var/log/exim4; I'll file a bug why the directory wasn't created later

July 11

  • 11:59 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: cp -f /data/project/.system/hosts /etc/hosts

July 10

  • 20:35 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: iptables-restore /data/project/.system/iptables.conf
  • 16:00 YuviPanda: manually removed mariadb remote repo from tools-exec-12 instance, won't be added to new instances (puppet patch was merged)
  • 01:33 YuviPanda|zzz: tools-exec-11 and tools-exec-13 have been added to the @general hostgroup

July 9

  • 23:14 YuviPanda: applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13
  • 23:09 YuviPanda: created tools-exec-13 with precise
  • 23:08 YuviPanda: created tools-exec-12 as trusty by accident, will keep on standby for testing
  • 23:07 YuviPanda: created tools-exec-12
  • 23:06 YuviPanda: created tools-exec-11
  • 19:23 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again
  • 14:12 scfc_de: tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later
  • 13:33 scfc_de: tools-exec-cyberbot: Freed 402398 inodes ...
  • 12:50 scfc_de: tools-exec-cyberbot: "find /tmp -maxdepth 1 -type f -name \*cyberbotpeachy.cookies\* -mtime +30 -delete" as a first step
  • 12:40 scfc_de: tools-exec-cyberbot: Root partition has run out of inodes
  • 12:34 scfc_de: tools-exec-gift: Forgot to log yesterday: The problems were due to overload (load >> 150); SGE shouldn't have allowed that
  • 12:28 YuviPanda: cleaned out old diamond archive logs on tools-master
  • 12:28 YuviPanda: cleaned out old diamond archive logs on tools-webgrid-04
  • 12:25 YuviPanda: cleaned out old diamond archive logs from tools-exec-08

July 8

  • 20:57 scfc_de: tools-exec-gift: Puppet hangs due to "apt-get update" not finishing in time; manual runs of the latter take forever
  • 19:52 scfc_de: tools-exec-wmt, tools-shadow: Removed stale Puppet lock files and reran manually (handy: "sudo find /var/lib/puppet/state -maxdepth 1 -type f -name agent_catalog_run.lock -ls -ok rm -f \{\} \; -exec sudo puppet agent apply -tv \;")
  • 18:09 scfc_de: tools-webgrid-03, tools-webgrid-04: killall -TERM gmond (bug #64216)
  • 17:57 scfc_de: tools-exec-08, tools-exec-09, tools-webgrid-02, tools-webgrid-03: Removed stale Puppet lock files and reran manually
  • 17:26 scfc_de: tools-tcl-test: Rebooted because system said so
  • 17:04 YuviPanda: webservice start on tools.meetbot since it seemed down
  • 14:55 YuviPanda: cleaned out old diamond archive logs on tools-webproxy
  • 13:39 scfc_de: tools-login: rm -f /var/log/exim4/paniclog ("daemon: fork of queue-runner process failed: Cannot allocate memory")

July 6

  • 12:09 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog after I20afa5fb2be7d8b9cf5c3bf4018377d0e847daef got merged

July 5

July 4

  • 08:51 scfc_de: tools-exec-08 (some hours ago): rm -f /var/log/diamond/* && restart diamond
  • 00:02 scfc_de: tools-master: rm -f /var/log/diamond/* && restart diamond

July 3

  • 16:59 Betacommand: Coren: It may take a while though; what the catscan queries was blocking is a DDL query changing the schema and that pauses replication.
  • 16:58 Betacommand: Coren: transactions over 30ks killed; the DB should start catching up soon.
  • 14:37 Betacommand: replication for enwiki is halted; current lag is at 9876

July 2

  • 00:21 YuviPanda: restarted diamond on almost all nodes to stop sending nfs stats, some still need to be flushed
  • 00:21 YuviPanda: restarted diamond on all exec nodes to stop sending nfs stats

July 1

  • 23:09 legoktm: tools-pywikibot started the webservice, don't know why it wasn't running
  • 21:08 scfc_de: Reset queues in error state again
  • 17:51 YuviPanda: tools-exec-04 removed stale pid file and force puppet run
  • 16:07 YuviPanda: applied biglogs to tools-exec-02 and rejigged things
  • 15:54 YuviPanda: tools-exec-02 removed stale puppet pid file, forcing run
  • 15:51 Coren: adjusted resource limits for -exec-07 to match the smaller instance size.
  • 15:50 Coren: created logfile disk for -exec-07 by hand (smaller instance)
  • 01:53 YuviPanda: tools-exec-10 applied biglogs, moved logs around, killed some old diamond logs
  • 01:41 YuviPanda: tools-exec-03 restarted diamond, atop, exim4, ssh to pick up new log partition
  • 01:40 YuviPanda: tools-exec-03 applied biglogs, moved logs around, killed some old diamond logs
  • 01:34 scfc_de: tools-exec-03, tools-exec-10: Removed /var/log/diamond/diamond.log, restarted diamond and bzip2'ed /var/log/diamond/*.log.2014*

June 30

  • 22:10 YuviPanda: ran webservice start for enwp10
  • 22:06 YuviPanda: stale lockfile in tools-login as well, removing and forcing puppet run
  • 22:01 YuviPanda: removed stale lockfile for puppet, forcing run
  • 19:58 YuviPanda|food: added tools-webgrid-04 to webgrid queue, had to start portgranter manually
  • 17:43 YuviPanda: created tools-webgrid-04, applying webnode role and running puppet
  • 17:27 YuviPanda: created tools-webgrid-03 and added it to the queue

June 29

  • 19:45 scfc_de: magnustools: "webservice start"
  • 18:24 YuviPanda: rebooted tools-webgrid-02. Could not ssh, was dead

June 28

  • 21:07 YuviPanda: removed alias for tools-webproxy and tools.wmflabs.org from /etc/hosts on tools-webproxy

June 21

  • 20:09 scfc_de: Created tool mediawiki-mirror (yuvipanda + Nemo_bis) and chown'ed & chmod o-w /shared/mediawiki

June 20

  • 21:01 scfc_de: tools-webgrid-tomcat: Added to submit host list with "qconf -as" for bug #66882
  • 14:47 scfc_de: Restarted webservice for mono; cf. bug #64219

June 16

  • 23:50 scfc_de: Shut down diamond services and removed log files on all hosts

June 15

  • 17:12 YuviPanda: deleted tools-mongo. MongoDB pre-allocates db files, and so allocating one db to every tool fills up the disk *really* quickly, even with 0 data. Their non preallocating version is 'not meant for production', so putting on hold for now
  • 16:50 scfc_de: qmod -cq cyberbot@tools-exec-cyberbot.eqiad.wmflabs
  • 16:48 scfc_de: tools-exec-cyberbot: rm -f /var/log/diamond/diamond.log && restart diamond
  • 16:48 scfc_de: tools-exec-cyberbot: No DNS entry (again)

June 13

  • 22:59 YuviPanda: "sudo -u ineditable -s" to force creation of homedir, since the user was unable to login before. /var/log/auth.log had no record of their attempts, but now seems to work. straange

June 10

  • 21:51 scfc_de: Restarted diamond service on all Tools hosts to actually free the disk space :-)
  • 21:36 scfc_de: Deleted /var/log/diamond/diamond.log on all Tools hosts to free up space on /var

June 3

  • 17:50 Betacommand: Brief network outage. source: It's not clearly determined yet; we aborted the investigation to rollback and restore service. As far as we can tell, there is something subtly wrong with the switch configuration of LACP.

June 2

  • 20:15 YuviPanda: create instance tools-trusty-test to test nginx proxy on trusty
  • 19:00 scfc_de: zoomviewer: Set TMPDIR to /data/project/zoomviewer/var/tmp and ./webwatcher.sh; cannot see *any* temporary files being created anywhere, though. iipsrv.fcgi however has TMPDIR set as planned.

May 27

  • 18:49 wm-bot: petrb: temporarily hardcoding tools-exec-cyberbot to /etc/hosts so that host resolution works
  • 10:36 scfc_de: tools-webgrid-01: removed all files of tools.zoomviewer in /tmp
  • 10:22 scfc_de: tools-webgrid-01: /tmp was full, removed files of tools.zoomviewer older than five days
  • 07:52 wm-bot: petrb: restarted webservice of tool admin in order to purge that huge access.log

May 25

  • 14:27 scfc_de: tools-mail: "rm -f /var/log/exim4/paniclog" to leave only relay_domains errors

May 23

  • 14:14 andrewbogott: rebooting tools-webproxy so that services start logging again
  • 14:10 andrewbogott: applying role::labs::lvm::biglogs on tools-webproxy because /var/log was full and causing errors

May 22

  • 02:45 scfc_de: tools-mail: Enabled role::labs::lvm::biglogs, moved data around & rebooted.
  • 02:36 scfc_de: tools-mail: Removed all jsub notifications from hazard-bot from queue.
  • 01:46 scfc_de: hazard-bot: Disabled minutely cron job github-updater
  • 01:36 scfc_de: tools-mail: Freezing all messages to Yahoo!: "421 4.7.1 [TS03] All messages from 208.80.155.162 will be permanently deferred; Retrying will NOT succeed. See http://postmaster.yahoo.com/421-ts03.html"
  • 01:12 scfc_de: tools-mail: /var is full

May 20

  • 18:34 YuviPanda: back to homerolled nginx 1.5 on proxy, newer versions causing too many issues

May 16

  • 17:01 scfc_de: tools-webgrid-02: rm -f /tmp/core (tools.misc2svg, May 13 06:10, 3861106688)

May 14

  • 16:31 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis
  • 00:23 Betacommand: 503's related to bug 65179

May 13

  • 20:36 YuviPanda: restarting redis on tools-webproxy fixed 503s
  • 20:36 valhallasw: redis failed, causing tools-webproxy to throw 503's
  • 19:09 marktraceur: Restarted grrrit because it had a stupid nick

May 10

  • 14:50 YuviPanda: upgraded nginx to 1.7.0 on tools-webproxy to get SPDY/3.1

May 9

  • 13:16 scfc_de: Cleared error state of queues {continuous,mailq,task}@tools-exec-06 and webgrid-lighttpd; no obvious or persistent causes

May 6

  • 19:31 scfc_de: replagstats fixed; Ganglia graphs are now under the virtual host "tools-replags"
  • 17:53 scfc_de: Don't think replagstats is really working ...
  • 16:40 scfc_de: Moved ~scfc/bin/replagstats to ~tools.admin/bin/ and enabled as a continuous job (cf. also bug #48694).

April 28

  • 11:51 YuviPanda: pywikibugs Deployed bf1be7b

April 27

  • 13:34 scfc_de: Restarted webservice for geohack and moved {access,error}.log to {access,error}.log.1

April 24

  • 23:39 YuviPanda: restarted grrrit-wm, not greg-g. greg-g does not survive restarts and hence care must be taken to make sure he is not.
  • 23:38 YuviPanda: restarted greg-g after cherry-picking aec09a6 for auth of IRC bot
  • 23:33 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/129610
  • 13:07 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (relay_domains bug)

April 20

  • 14:27 scfc_de: tools-redis: Set role::labs::lvm::mnt and $lvm_mount_point=/var/lib, moved the data around and rebooted
  • 14:08 scfc_de: tools-redis: /var is full
  • 08:59 legoktm: grrrit-wm: 2014-04-20T08:28:15.889Z - error: Caught error in redisClient.brpop: Redis connection to tools-redis:6379 failed - connect ECONNREFUSED
  • 08:48 legoktm: Your job 438884 ("lolrrit-wm") has been submitted
  • 08:47 legoktm: [01:28:28] * grrrit-wm has quit (Remote host closed the connection)

April 13

April 12

  • 23:51 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("unknown named domain list "+relay_domains"")

April 11

April 10

  • 18:20 scfc_de: tools-webgrid-01, tools-webgrid-02: "kill -HUP" all php-cgis that are not (grand-)children of lighttpd processes

April 8

  • 05:06 Ryan_Lane: restart nginx on tools-proxy-test
  • 05:03 Ryan_Lane: upgraded libssl on all nodes

April 4

  • 15:48 Coren: Moar powar!!1!one: added two exec nodes (-09 -10) and one webgrid node (-02)
  • 11:11 scfc_de: Set /data/project/.system/config/wikihistory.workers to 20 on apper's request

March 30

  • 18:16 scfc_de: Removed empty directories /data/project/{d930913,sudo-test{,-2},testbug{,2,3}}: Corresponding service groups don't exist (anymore)
  • 18:13 scfc_de: Removed /data/project/backup: Only empty dynamic-proxy backup files of January 3rd and earlier

March 29

  • 10:14 wm-bot: petrb: disabled 1 job in cron in -login of user tools.tools-info which was killing login server

March 28

  • 11:53 wm-bot: petrb: did the same on -mail server (removed /var/log/exim4/paniclog) so that we don't get spam every day
  • 11:51 wm-bot: petrb: removed content of /var/log/exim4/paniclog
  • 11:49 wm-bot: petrb: disabled default vimrc which everybody hates on -login

March 21

  • 16:50 scfc_de: tools-login: pkill -u tools.bene (OOM)
  • 16:13 scfc_de: rmdir /home/icinga (totally empty, "drwxr-xr-x 2 nemobis 50383 4096 Mär 17 16:42", perhaps artifact of mass migration?)
  • 15:49 scfc_de: sudo cp -R /etc/skel /home/csroychan && sudo chown -R csroychan.wikidev /home/csroychan; that should close [[bugzilla:62132]]
  • 15:15 scfc_de: sudo cp -R /etc/skel /home/annabel && sudo chown -R annabel.wikidev /home/annabel
  • 15:14 scfc_de: sudo chown -R torin8.wikidev /home/torin8

March 20

  • 18:36 scfc_de: Pointed tools-dev.wmflabs.org at tools-dev.eqiad.wmflabs; cf. [[Bugzilla:62883]]

March 5

  • 13:57 wm-bot: petrb: test

March 4

  • 22:35 wm-bot: petrb: uninstalling it from -login too
  • 22:32 wm-bot: petrb: uninstalling apache2 from tools-dev it has nothing to do there

March 3

  • 19:20 wm-bot: petrb: shutting down almost all services on webserver-02 in order to make system useable and finish upgrade
  • 19:17 wm-bot: petrb: upgrading all packages on webserver-02
  • 19:15 petan: rebooting webserver-01 which is totally dead
  • 19:07 wm-bot: petrb: restarting apache on webserver-02 it complains about OOM but the server has more than 1.5g memory free
  • 19:03 wm-bot: petrb: switched local-svg-map-maker to webserver-02 because 01 is not accessible to me, hence I can't debug that
  • 16:44 scfc_de: tools-webserver-03: Apache was swamped by request for /guc. "webservice start" for that, and pkill -HUP -u local-guc.
  • 12:54 scfc_de: tools-webserver-02: Rebooted, apache2/error.log told of OOM, though more than 1G free memory.
  • 12:50 scfc_de: tools-webserver-03: Rebooted, scripts were timing out
  • 12:42 scfc_de: tools-webproxy: Rebooted; wasn't accessible by ssh.

March 1

  • 03:42 Coren: disabled puppet in pmtpa tool labs

February 28

  • 14:46 wm-bot: petrb: extending /usr on tools-dev by 800mb
  • 00:26 scfc_de: tools-webserver-02: Rebooted; inaccessible via ssh, http said "500 Internal Server Error"

February 27

  • 15:28 scfc_de: chmod g-w ~fsainsbu/.forward

February 25

  • 22:48 rdwrer: Lol, so, something happened with grrrit-wm earlier and nobody logged any of it. It was yoyoing, Yuvi killed it, then aude did something and now it's back.

February 23

  • 20:46 scfc_de: morebots: labs HUPped to reconnect to IRC

February 21

  • 17:32 scfc_de: tools-dev: mount -t nfs -o nfsvers=3,ro labstore1.pmtpa.wmnet:/publicdata-project /public/datasets; automount seems to have been stuck
  • 15:24 scfc_de: tools-webserver-03: Rebooted, wasn't accessible by ssh and apparently no access to /public/datasets either

February 20

  • 21:23 scfc_de: tools-login: Disabled crontab for local-rezabot and left a message at User talk:Reza#Running bots on tools-login, etc. (fa:بحث_کاربر:Reza1615 is write-protected)
  • 20:15 scfc_de: tools-login: Disabled crontab for local-chobot and left a message at ko:사용자토론:ChongDae#Running bots on tools-login, etc.
  • 10:42 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list", cf. [[bugzilla:61583]])
  • 10:30 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
  • 10:28 scfc_de: Reset error status of task@tools-exec-09 ("can't get password entry for user 'local-voxelbot'"); "getent passwd local-voxelbot" works on tools-exec-09, possibly a glitch

February 19

  • 20:21 scfc_de: morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots
  • 19:14 scfc_de: tools-login: Disabled crontab and pkill -HUP -u fatemi127

February 18

  • 11:42 scfc_de: tools-mail: Rerouted queued mail (@tools-login.pmtpa.wmflabs => @tools.wmflabs.org)
  • 11:34 scfc_de: tools-exec-08: Rebooted due to not responding on ssh and SGE
  • 10:39 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list" => probably artifacts from Coren's LDAP changes)
  • 10:37 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)

February 14

  • 23:54 legoktm: restarting grrrit-wm since it disappeared
  • 08:19 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)

February 13

  • 13:11 scfc_de: Deleted old job of user veblenbot stuck in error state
  • 13:08 scfc_de: Deleted old jobs of user v2 stuck in error state
  • 10:49 scfc_de: tools-login: Commented out local-shuaib-bot's crontab with a pointer to Tools/Help

February 12

  • 07:51 wm-bot: petrb: removed /data/project/james/adminstats/wikitools per request from james on irc

February 11

  • 15:47 scfc_de: Restarted webservice for geohack
  • 13:02 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
  • 13:00 scfc_de: Killed -HUP local-hawk-eye-bot's jobs; one was hanging with a stale NFS handle on tools-exec-05

February 10

  • 23:16 Coren: rebooting webproxy (braindead autofs)

February 9

February 6

February 4

January 31

  • 03:43 scfc_de: Cleaned up all exim queues
  • 01:26 scfc_de: chmod g-w ~{bgwhite,daniel,euku,fale,henna,hydriz,lfaraone}/.forward (test: sudo find /home -mindepth 2 -maxdepth 2 -type f -name .forward -perm /g=w -ls)
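
The test command quoted in the entry above can be exercised safely on a scratch tree. A minimal sketch, assuming GNU find (for `-perm /g=w`); the function name group_writable_forwards and the user directories alice and bob are hypothetical:

```shell
#!/bin/sh
# group_writable_forwards HOME_ROOT: list .forward files that sit directly
# inside a home directory under HOME_ROOT and have the group-write bit set.
group_writable_forwards() {
    find "$1" -mindepth 2 -maxdepth 2 -type f -name .forward -perm /g=w -print
}

# Demonstration with two made-up home directories.
demo_home=$(mktemp -d)
mkdir "$demo_home/alice" "$demo_home/bob"
touch "$demo_home/alice/.forward" "$demo_home/bob/.forward"
chmod 664 "$demo_home/alice/.forward"   # group-writable: reported
chmod 644 "$demo_home/bob/.forward"     # not group-writable: skipped
group_writable_forwards "$demo_home"
rm -rf "$demo_home"
```

After running chmod g-w on every reported file, the same command should print nothing.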

January 30

  • 21:48 scfc_de: chmod g-w ~fluff/.forward
  • 21:40 scfc_de: local-betabot: Added "-M" option to crontab's qsub call and rerouted queued mail (freeze, exim -Mar, exim -Mmd, thaw)
  • 18:33 scfc_de: tools-exec-04: puppetd --enable (apparently disabled sometime around 2014-01-16?!)
  • 17:25 scfc_de: tools-exec-06: mv -f /etc/init.d/nagios-nrpe-server{.dpkg-dist,} (nagios-nrpe-server didn't start because start-up script tried to "chown icinga" instead of "chown nagios")

January 28

  • 04:27 scfc_de: tools-webproxy: Blocked Phonifier

January 25

  • 05:37 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (OOM)

January 24

  • 01:07 scfc_de: tools-db: Removed /var/lib/mysql2, set expire_logs_days to 1 day
  • 00:11 scfc_de: tools-db: and restarted mysqld
  • 00:11 scfc_de: tools-db: Moved 4.2 GBytes of the oldest binlogs to /var/lib/mysql2/

January 23

  • 19:24 legoktm: restarting grrrit-wm now https://gerrit.wikimedia.org/r/#/c/109116/
  • 19:23 legoktm: ^ was for grrrit-wm
  • 19:23 legoktm: re-committed password to local repo, not sure why that wasn't committed already

January 21

  • 17:41 scfc_de: tools-exec-09: iptables-restore /data/project/.system/iptables.conf

January 20

  • 07:02 andrewbogott: merged a lint patch to the gridengine module. Should be a noop

January 16

  • 17:11 scfc_de: tools-exec-09: "iptables-restore /data/project/.system/iptables.conf" after reboot

January 15

  • 13:36 scfc_de: After reboot of tools-exec-09, all continuous jobs were successfully restarted ("Rr"); task jobs (1974113, 2188472) failed ("19  : before writing exit_status")
  • 13:27 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
  • 08:54 andrewbogott: rebooted tools-exec-09
  • 08:32 andrewbogott: rebooted tools-db

January 14

  • 15:10 scfc_de: tools-login: pkill -u local-mlwikisource: Freed 1 GByte of memory
  • 14:58 scfc_de: tools-login: Disabled local-mlwikisource's crontab with explanation
  • 13:57 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (out of memory errors on 2014-01-10)

January 10

January 9

January 8

  • 13:44 scfc_de: Cleared error states of continuous@tools-exec-05, task@tools-exec-05, task@tools-exec-09

January 7

  • 18:59 scfc_de: tools-login, tools-mail: rm -f /var/log/exim4/paniclog (apparently some artifacts of the LDAP failure)

January 6

  • 14:06 YuviPanda: deleted instance tools-mc, didn't know it had come back from the dead

January 1

  • 13:24 scfc_de: tools-exec-02, tools-master, tools-shadow, tools-webserver-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
  • 11:27 scfc_de: tools-webserver-01, tools-webserver-01: rm -f /var/log/exim4/paniclog; out of memory errors
  • 11:18 scfc_de: Emptied /{data/project,home}/.snaplist as the snapshots themselves are not available

December 27

  • 07:39 legoktm: grrrit-wm restart didn't really work.
  • 07:38 legoktm: restarting grrrit-wm, for some reason it reconnected and lost its cloak

December 23

  • 18:30 marktraceur: restart grrrit-wm for subbu

December 21

  • 06:50 scfc_de: tools-exec-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update

December 19

  • 17:22 marktraceur: deploying grrrit config change

December 17

  • 23:19 legoktm: rebooted grrrit-wm with new config stuffs

December 14

  • 18:13 marktraceur: restarting grrrit-wm to fix its nickname
  • 13:17 scfc_de: tools-exec-08: Purged packages libapache2-mod-suphp and suphp-common (probably remnants from when the host was misconfigured as a webserver)
  • 13:09 scfc_de: tools-dev, tools-login, tools-mail, tools-webserver-01, tools-webserver-02: rm /var/log/exim4/paniclog (mostly out of memory errors)

December 4

  • 22:15 Coren: tools-exec-01 rebooted to fix the autofs issue; will return to rotation shortly.
  • 16:33 Coren: rebooting webproxy with new kernel settings to help against the DDOS

December 1

  • 14:05 Coren: underlying virtualization hardware rebooted; tools-master and friends coming back up.

November 25

  • 21:03 YuviPanda: created tools-proxy-test instance to play around with the dynamicproxy
  • 12:16 wm-bot: petrb: deswapping -login (swapoff -a && swapon -a)

November 24

  • 07:19 paravoid: disabled crontab for user avocato on tools-login, see above
  • 07:17 paravoid: pkill -u avocato on tools-login, multiple /home/avocato/pywikipedia/redirect.py DoSing the bastion

November 14

  • 09:12 ori-l: Added aude to lolrrit-wm maintainers group

November 13

  • 22:36 andrewbogott: removed 'imagescaler' class from tools-login because that class hasn't existed for a year. Which, a year ago is before that instance even existed so what the heck?

November 3

  • 16:49 ori-l: grrrit-wm stopped receiving events. restarted it; didn't help. then restarted gerrit-to-redis, which seems to have fixed it.

November 1

  • 16:11 wm-bot: petrb: restarted terminator daemon on -login to sort out memory issues caused by heavy mysql client by elbransco

October 23

  • 15:19 Coren: deleted tools-tyrant and tools-exec-cyberbot (cleanup of obsoleted instances)

October 20

  • 18:52 wm-bot: petrb: everything looks better
  • 18:51 wm-bot: petrb: restarting apache server on tools-webproxy
  • 18:49 wm-bot: petrb: installed links on -dev and going to investigate what is wrong with apaches, documentation, Coren, please update it

October 15

  • 21:03 Coren: labs-login rebooted to fix the ownership/take issue with success.

October 10

  • 09:49 addshore: tools-webserver-01 is getting a 500 Internal Server Error again

September 23

  • 06:44 YuviPanda: remove unpuppetized install of openjdk-6 packages causing problems in -dev (for bug: 54444)
  • 05:15 legoktm: logging a log to test the log logging
  • 05:13 legoktm: logging a log to test the log logging

September 11

  • 09:39 wm-bot: petrb: started toolwatcher

August 24

  • 18:00 wm-bot: petrb: freed 1600mb of ram by killing yasbot processes on -login
  • 17:59 wm-bot: petrb: killing all python processes of yasbot on -login, this bot needs to run on grid, -login is constantly getting OOM because of this bot

August 23

  • 12:17 wm-bot: petrb: test
  • 12:15 wm-bot: petrb: making pv from /dev/vdb on new nodes
  • 11:49 wm-bot: petrb: syncing packages of -login with exec nodes
  • 11:48 petan: someone installed firefox on exec nodes, should investigate / remove

August 22

  • 01:24 scfc_de: tools-webserver-03: Installed python-oursql

August 20

  • 23:00 scfc_de: Opened port 3000 for intra-Labs traffic in execnode security group for YuviPanda's proxy experiments

August 19

  • 09:52 wm-bot: petrb: deleting fatestwiki tool, requested by creator

August 16

  • 00:16 scfc_de: tools-exec-01 doesn't come up again even after repeat reboots

August 15

  • 15:14 scfc_de: tools-webserver-01: Simplified /usr/local/bin/php-wrapper
  • 14:31 scfc_de: tools-webserver-01: "dpkg --configure -a" on apt-get's advice
  • 14:24 scfc_de: chmod 644 ~magnus/.forward
  • 03:07 scfc_de: tools-webproxy: Temporarily serving 403s to AhrefsBot/bingbot/Googlebot/PaperLiBot/TweetmemeBot/YandexBot until they reread robots.txt
  • 02:02 scfc_de: robots.txt: "Disallow: /"

August 11

  • 03:14 scfc_de: tools-mc: Purged memcached

August 10

  • 02:36 scfc_de: Disabled terminatord on tools-login and tools-dev
  • 02:24 scfc_de: chmod g-w ~whym/.forward

August 6

  • 19:26 scfc_de: Set up basic robots.txt to exclude Geohack to see how that affects traffic
  • 02:09 scfc_de: tools-mail: Enabled rudimentary Ganglia monitoring in root's crontab

August 5

  • 20:32 scfc_de: chmod g-w ~ladsgroup/.forward

August 2

  • 23:45 scfc_de: tools-dev: Installed dialog for testing

August 1

  • 19:57 scfc_de: Created new instance tools-redis with redis_maxmemory = "7GB"
  • 19:56 scfc_de: Added redis_maxmemory to wikitech Puppet variables

July 31

  • 10:50 HenriqueCrang: ptwikis added graph with mobile edits

July 30

  • 19:08 scfc_de: tools-webproxy: Purged popularity-contest and ubuntu-standard
  • 07:32 wm-bot: petrb: deleted local-addbot jobs
  • 02:01 scfc_de: tools-webserver-01: Symlinked /usr/local/bin/{job,jstart,jstop,jsub} to /usr/bin; were obsolete versions.

July 29

  • 15:15 scfc_de: tools-webserver-01: rm /var/log/exim4/paniclog
  • 15:10 scfc_de: Purged popularity-contest from tools-webserver-01.
  • 02:40 scfc_de: Restarted toolwatcher on tools-login.
  • 02:11 scfc_de: Reboot tools-login, was not responsive

July 25

  • 23:37 Ryan_Lane: added myself to lolrrit-wm tool
  • 12:06 wm-bot: petrb: test
  • 07:11 wm-bot: petrb: created /var/log/glusterfs/bricks/ to stop rotatelogs from complaining about it being missing

July 20

  • 15:19 petan: rebooting tools-redis

July 19

  • 07:06 petan: instances were rebooted for unknown reasons
  • 00:42 helderwiki: it works! :-)
  • 00:41 legoktm: test

July 10

  • 18:04 wm-bot: petrb: installing mysqltcl on grid
  • 18:01 wm-bot: petrb: installing tclodbc on grid

July 5

  • 19:38 AzaToth: test
  • 19:36 AzaToth: test for example
  • 18:23 Coren: brief outage of webproxy complete (back to business!)
  • 18:13 Coren: brief outage of webproxy (rollback 2.4 upgrade)

July 3

  • 13:44 scfc_de: Set "HostbasedAuthentication yes" and "EnableSSHKeysign yes" in tools-dev's /etc/ssh/ssh_config
  • 12:58 petan: rebooting -mc, it's apparently OOM dying

July 2

  • 16:24 wm-bot: petrb: installed maria to all nodes so we can connect to db even from sge
  • 12:19 wm-bot: petrb: installing packages -- libmediawiki-api-perl libdatetime-format-strptime-perl libbot-basicbot-perl libdatetime-format-duration-perl

July 1

  • 18:39 wm-bot: petrb: started toolwatcher on -login
  • 14:22 wm-bot: petrb: installing following packages on grid: libdata-dumper-simple-perl libhtml-html5-entities-perl libirc-utils-perl libtask-weaken-perl libobject-pluggable-perl libpoe-component-syndicator-perl libpoe-filter-ircd-perl libsocket-getaddrinfo-perl libpoe-component-irc-perl libxml-simple-perl
  • 12:05 wm-bot: petrb: starting toolwatcher
  • 11:40 wm-bot: petrb: tools is back o/
  • 09:42 wm-bot: petrb: installing python-zmq python-matplotlib @ dev
  • 03:33 scfc_de: Rebooted tools-login apparently out of memory and not responding to ssh

June 30

  • 17:58 scfc_de: Set ssh_hba to yes on tools-exec-06
  • 17:13 scfc_de: Installed python-matplotlib and python-zmq on tools-login for YuviPanda

June 26

  • 21:16 Coren: +Tim Landscheidt to project admins, local-admin
  • 14:23 wm-bot: petrb: updating several packages on -login
  • 13:43 wm-bot: petrb: killing old instance of redis: Jun15 ? 00:06:49 /usr/bin/redis-server /etc/redis/redis.conf
  • 13:42 wm-bot: petrb: restarting redis
  • 13:28 wm-bot: petrb: running puppet on -mc
  • 13:27 wm-bot: petrb: adding ::redis role to tools-mc - if anything will break, YuviPanda did it :P
  • 09:35 wm-bot: petrb: updated status.php to version which display free vmem as well

June 25

  • 12:34 wm-bot: petrb: installing php5-mcrypt on exec and web

June 24

  • 15:45 wm-bot: petrb: changed colors of root prompt productions vs testing
  • 07:57 wm-bot: petrb: 50527 4186 22830 1 Jun23 pts/41 00:08:54 python fill2.py eats 48% of ram on -login

June 19

  • 12:17 wm-bot: petrb: increasing limit on mysql connections

June 17

  • 17:34 wm-bot: petrb: /var/spool/cron/crontabs/ has -rw------- 1 8006 crontab 1176 Apr 11 14:07 local-voxelbot fixing

June 16

  • 21:23 Coren: 1.0.3 deployed (jobutils, misctools)

June 15

  • 21:40 wm-bot: petrb: there is no lvm on -db which we need as hell - therefore no swap either nor storage for binary logs :( I got a feeling that mysql will die oom soonish
  • 21:39 wm-bot: petrb: db has 5% free RAM eeeek
  • 18:36 wm-bot: root: removed lot of ?audit? logs from exec-04 they were eating too much storage
  • 18:23 wm-bot: petrb: temporarily disabling /tmp on exec-04 in order to set up lvm
  • 18:23 wm-bot: petrb: exec-04 96% / usage, creating a new volume
  • 12:33 wm-bot: petrb: installing redis on tools-mc

June 14

  • 12:35 wm-bot: petrb: updating logsplitter to new version

June 13

  • 21:59 wm-bot: petrb: replaced logsplitter on both apache servers with far more powerful c++ version thus saving a lot of resources on both servers
  • 12:43 wm-bot: petrb: tools-webserver-01 is running quite expensive python job (currently eating almost 1gb of ram) it may need to be fixed or moved to separate webserver, adding swap to prevent machine die OOM
  • 12:22 wm-bot: petrb: killing process 31187 sort -T./enwiki/target -t of user local-enwp10 for same reason as previous one
  • 12:21 wm-bot: petrb: killing process 31190 sort -T./enwiki/target of user local-enwp10 for same reason as previous one
  • 12:17 wm-bot: petrb: killing process 31186 31185 69 Jun11 pts/32 1-13:14:41 /usr/bin/perl ./bin/catpagelinks.pl ./enwiki/target/main_pages_sort_by_ids.lst ./enwiki/target/pagelinks_main_sort_by_ids.lst because it seems to be a bot running on login server eating too many resources

June 11

  • 07:36 wm-bot: petrb: installed libdigest-crc-perl

June 10

  • 13:05 wm-bot: petrb: installing libcrypt-gcrypt-perl
  • 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix !b 49383
  • 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix become afcbot 49383
  • 08:44 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix become afcbot 49383
  • 08:25 wm-bot: petrb: fixing missing packages on exec nodes

June 9

  • 20:44 wm-bot: petrb: moved logs on -login to separate storage

June 8

  • 21:24 wm-bot: petrb: installing python-imaging-tk on grid
  • 21:20 wm-bot: petrb: installing python-tk
  • 21:16 wm-bot: petrb: installing python-flickrapi on grid
  • 21:16 wm-bot: petrb: installing
  • 16:49 wm-bot: petrb: turned off wmf style of vi on tools-dev feel free to slap me :o or do cat /etc/vim/vimrc.local >> .vimrc if you love it
  • 15:33 wm-bot: petrb: grid is overloaded, needs to be either enlarged or jobs calmed down :o
  • 09:55 wm-bot: petrb: backporting tcl 8.6 from debian
  • 09:38 wm-bot: petrb: update python requests to version 1.2.3.1

June 7

  • 15:29 Coren: Deleted no-longer-needed tools-exec-cg node (spun off to its own project)

June 5

  • 09:52 wm-bot: petrb: on -dev
  • 09:52 wm-bot: petrb: moving /usr to separate volume expect problems :o
  • 09:41 wm-bot: petrb: moved /var/log to separate volume on -dev
  • 09:31 wm-bot: petrb: houston we have problem, / on dev is 94%
  • 09:28 wm-bot: petrb: installed openjdk7 on -dev
  • 09:00 wm-bot: petrb: removing wd-terminator service
  • 08:39 wm-bot: petrb: started toolwatcher
  • 07:04 wm-bot: petrb: installing maven on -dev

June 4

  • 14:49 wm-bot: petrb: installing sbt in order to fix b48859
  • 13:28 wm-bot: petrb: installing csh on cluster
  • 08:37 wm-bot: petrb: installing python-memcache on exec nodes

June 3

  • 21:40 Coren: Rebooting -login; it's thrashing. Will keep an eye on it.
  • 14:15 wm-bot: petrb: removing popularity contest
  • 14:11 wm-bot: petrb: removing /etc/logrotate.d/glusterlogs on all servers to fix logrotate daemon
  • 09:43 wm-bot: petrb: syncing packages on exec nodes to avoid troubles with missing libs on some etc

June 2

  • 08:39 wm-bot: petrb: installing ack-grep everywhere per yuvipanda and irc

June 1

  • 20:57 wm-bot: petrb: installed this to exec nodes because it was on some and not on others cpp-4.4 cpp-4.5 cython dbus dosfstools ed emacs23 ftp gcc-4.4-base iptables iputils-tracepath ksh lsof ltrace lshw mariadb-client-5.5 nano python-dbus python-egenix-mxdatetime python-egenix-mxtools python-gevent python-greenlet strace telnet time -y
  • 20:42 wm-bot: petrb: installing wikitools cluster wide
  • 20:40 wm-bot: petrb: installing oursql cluster wide
  • 10:46 wm-bot: petrb: created new instance for experiments with sasl memcache tools-mc

May 31

  • 19:17 petan: deleting xtools project (requested by Cyberpower678)
  • 17:24 wm-bot: petrb: removing old kernels from -dev because / is almost full
  • 17:17 wm-bot: petrb: installed lsof to -dev
  • 15:55 wm-bot: petrb: installed subversion to exec nodes 4 legoktm
  • 15:47 wm-bot: petrb: replacing mysql with maria on exec nodes
  • 15:46 wm-bot: petrb: replacing mysql with maria on exec nodes
  • 15:14 wm-bot: petrb: installing default-jre in order to satisfy its dependencies
  • 15:13 wm-bot: petrb: installing /data/project/.system/deb/all/sbt.deb to -dev in order to test it
  • 13:04 wm-bot: petrb: installing bashdb on tools and -dev
  • 12:27 wm-bot: petrb: removing project local-jimmyxu - per request on irc
  • 10:54 wm-bot: petrb: killing process 3060 on -login (mahdiz 3060 1964 88 May30 ? 21:32:51 /bin/nano /tmp/crontab.Ht3bSO/crontab) it takes max cpu and doesn't seem to be attached

May 30

  • 12:24 wm-bot: petrb: deleted job 1862 from queue (error state)
  • 08:26 wm-bot: petrb: updated sql command

May 29

  • 21:05 wm-bot: petrb: running sudo apt-get install php5-gd

May 28

  • 20:00 wm-bot: petrb: installing p7zip-full to -dev and -login

May 27

  • 08:46 wm-bot: petrb: changed config of mysql to use /mnt as path to save binary logs, this however requires server to be restarted

May 24

  • 08:44 petan: setting up lvm on new exec nodes because it is more flexible and allows us to change the size of volumes on the fly
  • 08:28 petan: created 2 more exec nodes, setting up now...

May 23

  • 09:20 wm-bot: petrb: process 27618 on -login is constantly eating 100% of cpu, changing priority to 20

May 22

  • 20:54 wm-bot: petrb: changing ownership of /data/project/bracketbot/ to local-bracketbot
  • 14:28 labs-logs-bottie: petrb: installed netcat as well
  • 14:28 labs-logs-bottie: petrb: installed telnet to -dev
  • 14:02 Coren: tools-webserver-02 now live; / and /cluebot/ moved there

May 21

  • 20:27 labs-logs-bottie: petrb: uploaded hosts to -dev

May 19

  • 13:40 labs-logs-bottie: petrb: killing that nano process seems to be some hang and unattached anyway
  • 12:59 labs-logs-bottie: petrb: changed priority of nano process to 19
  • 12:55 labs-logs-bottie: petrb: local-hawk-eye-bot /bin/nano /tmp/crontab.d4JhUj/crontab eat too much cpu
  • 12:50 petan: nvm previous line
  • 12:50 labs-logs-bottie: petrb: vul alias viewuserlang

May 14

  • 21:22 labs-logs-bottie: petrb: created a separate volume for /tmp on login so that temp files do not fragment root fs and it does not get filled up by them, it also makes it easier to track filesystem usage
  • 13:16 Coren: reboot -dev, need to test kernel upgrade

May 10

  • 15:08 Coren: create tools-webserver-02 for Apache 2.4 experimentation

May 9

  • 04:12 Coren: added -exec-03 and -exec-04. Moar power!!1!

May 6

  • 19:59 Coren: made tools-dev.wmflabs.org public
  • 08:04 labs-logs-bottie: petrb: created a small swap on -login so that users can not bring it to OOM so easily and so that unused memory blocks can be swapped out in order to use the remaining memory more effectively
  • 08:00 labs-logs-bottie: petrb: making lvm from unused disk from /mnt on -login so that we can eventually use it somewhere if needed

May 4

  • 17:50 labs-logs-bottie: petrb: foobar as well
  • 17:47 labs-logs-bottie: petrb: removing project flask-stub using rmtool
  • 15:33 labs-logs-bottie: petrb: fixing missing db user for local-stub
  • 12:51 labs-logs-bottie: petrb: creating mysql accounts by hand for alchimista and fubar

May 2

  • 20:49 labs-logs-bottie: petrb: uploaded motd to exec-N as well, with information which server users connected to

May 1

  • 16:59 labs-logs-bottie: petrb: fixed invalid permissions on /home

April 27

  • 18:54 labs-logs-bottie: petrb: installing pymysql using pip on whole grid because it is needed for greenrosseta (for some reason it is better than python-mysql package)

April 26

  • 23:55 Coren: reboot to finish security updates
  • 08:00 labs-logs-bottie: petrb: patching qtop
  • 07:57 labs-logs-bottie: petrb: added tools-dev to admin host list so that qtop works and fixing the bug of qtop
  • 07:28 labs-logs-bottie: petrb: installing GE tools to -dev so that we can develop new j|q* stuff there

April 25

  • 19:00 Coren: Maintenance over; systems restarted and should be working.
  • 18:18 labs-logs-bottie: petrb: we are getting in troubles with memory on tools-db there is only less than 20% free memory
  • 18:01 Coren: Begin maintenance (login disabled)
  • 13:21 petan: removing local-wikidatastats from ldap

April 24

  • 13:17 labs-logs-bottie: petrb: sudo chown local-peachy PeachyFrameworkLogo.png
  • 11:37 labs-logs-bottie: petrb: created new project stats and cloned acl from wikidatastats, which is supposed to be deleted
  • 11:32 legoktm: wikidatastats attempting to install limn
  • 11:15 labs-logs-bottie: petrb: installing npm to -login instance
  • 07:34 petan: creating project wikidatastats for legoktm addshore and yuvipandianablah :P

April 23

  • 13:32 labs-logs-bottie: petrb: changing permissions of cyberbot and peachy to 775 so that it is easier to use them
  • 12:14 labs-logs-bottie: petrb: qtop on -dev
  • 12:12 labs-logs-bottie: petrb: removed part of motd from login server that got there in a mysterious way

April 19

  • 22:38 Coren: reboot -login, all done with the NFS config. yeay.
  • 17:13 Coren: (final?) reboot of -login with the new autofs configuration
  • 16:24 Coren: (rebooted -login)
  • 16:24 Coren: autofs + gluster = fail
  • 14:45 Coren: reboot -login (NFS mount woes)

April 15

  • 22:29 Coren: also a test; note how said bot knows its place.  :-)
  • 22:14 andrewbogott: this is a test of labs-morebots.
  • 21:49 andrewbogott: this is a test
  • 15:41 labs-logs-bottie: petrb: installing p7zip everywhere
  • 08:00 labs-logs-bottie: petrb: installing dev packages needed for YuviPanda on login box

April 11

  • 22:39 Coren: rebooted tools-puppet-test (no end-user impact): hung filesystem prevents login
  • 07:42 labs-logs-bottie: petrb: removed reboot information from motd

April 10

  • 21:42 labs-logs-bottie: petrb: reverting the change
  • 21:35 labs-logs-bottie: petrb: inserting /lib to /etc/ld.so.conf in order to fix the bug with gcc / ubuntu see irc logs (22:30 GMT)
  • 21:22 labs-logs-bottie: petrb: installing jobutils.deb on login
  • 20:30 labs-logs-bottie: petrb: installing some dev tools to -dev
  • 20:23 petan: created -dev instance for various purposes

April 8

  • 14:07 labs-logs-bottie: petrb: ongrid apt-get install mono-complete
  • 13:50 labs-logs-bottie: local-afcbot: unable to run mono applications: The assembly mscorlib.dll was not found or could not be loaded.

April 4

  • 14:40 labs-logs-bottie: petrb: trying to convert afcbot to new service group local-afcbot

April 2

  • 16:04 labs-logs-bottie: petrb: installed log to /home/petrb/bin/ and testing it
  • 15:55 petan: patched /usr/local/bin/qdisplay so that it can display jobs per node properly
  • 15:54 petan: giving sudo to Petrb in order to update qdisplay

March 28

  • 15:44 Coren: reboot (still unactivated) tools-shadow

March 26

  • 18:17 Coren: Doubled the size of the compute grid! (added tools-exec-02 to the grid)

March 21

  • 23:30 Coren: turned on interpretation of .py as CGI by default on tools-webserver-* to parallel .php
  • 16:15 Coren: Added tools-login.wmflabs.org public IP for the tools-login instance and allowed incoming ssh to it.

March 19

  • 14:21 Coren: reboot cycle (all instances) to apply security updates

March 13

  • 14:04 Coren: restarted webserver: relax AllowOverride options

March 11

  • 15:47 Coren: enabled X forwarding for qmon. Also, installed qmon.
  • 13:17 Coren: added python-requests (1.0, from pip)

March 7

  • 20:41 Coren: tools' php errors now sent to ~/php_errors.log
  • 19:31 Coren: access.log now split by tools (in tool homedir)
  • 16:15 Coren: can haz database (support for user/tool databases in place)

March 6

  • 20:25 Coren: tools-db installed mariadb-server from official repo
  • 19:50 Coren: created tools-db instance for a (temporary) mysql install

March 5

  • 21:45 Coren: rejiggered the webproxy config to be smarter about paths not leading to specific tools

February 26

  • 23:49 Coren: Original node structure: created tools-{master,exec-01,webserver-01,webproxy} instances
  • 18:39 Coren: Created tools-puppet-test for dev and testing of tools' puppet classes.
  • 01:52 Coren: created instance tools-login (primary login/dev instance)
  • 01:52 Coren: created sudo policies and security groups (skeletal)
  • 01:08 Coren: Creation of the new project for preproduction deployment of the current (preliminary) plan mw:Wikimedia Labs/Tool Labs/Design