Nova Resource:Tools/SAL/Archive 1
December 23
- 06:00 YuviPanda: tools-uwsgi-01 randomly went to SHUTOFF state, rebooting from virt1000
December 22
- 07:43 YuviPanda: increased RAM and Cores quota for tools
December 19
- 16:38 YuviPanda: puppet disabled on tools-webproxy because urlproxy.lua has been hand-hacked to remove stupid syntax errors that got merged.
- 12:00 YuviPanda|brb: created tools-static, static http server
- 07:07 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
December 17
- 22:38 YuviPanda: touched /data/project/repo/Packages so tools-webproxy stops complaining about the file not existing and never running apt-get
December 12
- 14:08 scfc_de: Ran Puppet on all hosts to fix puppet-run issue.
December 11
- 07:58 YuviPanda: rebooted tools-login, wasn’t responsive.
December 8
- 00:15 YuviPanda: killed all db and tools-webproxy aliases in /etc/hosts on tools-webproxy, since otherwise puppet fails: ec2id thinks we’re not in labs because hostname -d is empty, which happens because /etc/hosts resolves the IP directly to tools-webproxy
December 7
- 06:31 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
- 06:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
December 2
- 21:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
- 21:30 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
November 26
- 19:26 YuviPanda: created tools-webgrid-05 on trusty to set up a working webnode for trusty
November 25
- 06:53 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
November 24
- 14:02 YuviPanda: rebooting tools-login, OOM'd
- 02:51 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
November 22
- 19:05 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
November 17
- 20:40 YuviPanda: cleaned out /tmp on tools-login
November 16
- 21:31 matanya: back to normal
- 21:27 matanya: "Could not resolve hostname bastion.wmflabs.org"
November 15
- 07:24 YuviPanda|zzz: move coredumps from tools-webgrid-04 to /home/yuvipanda
November 14
- 20:23 YuviPanda: cleared out coredumps on tools-webgrid-01 to free up space
- 18:26 YuviPanda: cleaned out core dumps on tools-webgrid
- 16:55 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM).
November 13
- 21:11 YuviPanda: disable puppet on tools-dev to check shinken
- 21:00 scfc_de: qmod -cq continuous@tools-exec-09,continuous@tools-exec-11,continuous@tools-exec-13,continuous@tools-exec-14,mailq@tools-exec-09,mailq@tools-exec-11,mailq@tools-exec-13,mailq@tools-exec-14,task@tools-exec-06,task@tools-exec-09,task@tools-exec-11,task@tools-exec-13,task@tools-exec-14,task@tools-exec-15,webgrid-lighttpd@tools-webgrid-01,webgrid-lighttpd@tools-webgrid-02,webgrid-lighttpd@tools-webgrid-04 (fallout from /var being full).
- 20:38 YuviPanda: didn't actually stop puppet, need more patches
- 20:38 YuviPanda: stopping puppet on tools-dev to test shinken
- 15:30 scfc_de: tools-exec-06, tools-webgrid-01: rm -f /var/tmp/core/*.
- 13:31 scfc_de: tools-exec-09, tools-exec-11, tools-exec-13, tools-exec-14, tools-exec-15, tools-webgrid-02, tools-webgrid-04: rm -f /var/tmp/core/*.
November 12
- 22:07 StupidPanda: enabled puppet on tools-exec-07
- 21:47 StupidPanda: removed coredumps from tools-webgrid-04 to reclaim space
- 21:45 StupidPanda: removed coredump from tools-webgrid-01 to reclaim space
- 20:31 YuviPanda: disabling puppet on tools-exec-07 to test shinken
November 7
- 13:56 scfc_de: tools-submit, tools-webgrid-04: rm -f /var/log/exim4/paniclog (OOM around the time of the filesystem outage).
November 6
- 13:21 scfc_de: tools-dev: Gzipped /var/log/account/pacct.0 (804111872 bytes); looks like root had his own bigbrother instance running on tools-dev (multiple invocations of webservice per second).
November 5
- 19:15 mutante: exec nodes have p7zip-full now
- 10:07 YuviPanda: cleaned out pacct and atop logs on tools-login
November 4
- 19:50 mutante: ran apt-get clean on tools-login and gzipped some logs
November 1
- 12:51 scfc_de: Removed log files in /var/log/diamond older than five weeks (pdsh -f 1 -g tools sudo find /var/log/diamond -type f -mtime +35 -ls -delete).
October 30
- 14:37 YuviPanda: cleaned out pacct and atop logs on tools-dev
- 06:18 paravoid: killed a "vi" process belonging to user icelabs and running for two days saturating the I/O network bandwidth, and rm'ed a 3.5T(!) .final_mg.txt.swp
October 27
- 16:06 scfc_de: tools-mail: Killed -HUP old queue runners and restarted exim4; probably the source of paniclog's "re-exec of exim (/usr/sbin/exim4) with -Mc failed: No such file or directory".
- 15:36 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Recreated (empty) /var/log/apache2 and /var/log/upstart.
October 26
- 12:35 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/account.
- 12:33 scfc_de: tools-trusty: Went through shadowed /var and rebooted.
- 12:31 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/exim4, started exim4 and ran queue.
October 24
- 20:31 andrewbogott: moved tools-exec-12, tools-shadow and tools-mail to virt1006
October 23
- 22:55 Coren: reboot tools-shadow, upstart seems hosed
October 14
- 23:22 YuviPanda|zzz: removed stale puppet lockfile and ran puppet manually on tools-exec-07
October 11
- 15:31 andrewbogott: rebooting tools-master, stab in the dark
- 06:01 YuviPanda: restarted gridengine-master on tools-master
October 4
- 18:31 scfc_de: tools-mail: Deleted /usr/local/bin/collect_exim_stats_via_gmetric and root's crontab; clean-up for Ic9e0b5bb36931aacfb9128cfa5d24678c263886b
October 2
- 17:59 andrewbogott: added Ryan back to tools admins because that turned out to not have anything to do with the bounce messages
- 17:32 andrewbogott: removing Ryan Lane from tools admins, because his email in LDAP is defunct and I get bounces every time something goes wrong in tools
September 28
- 14:45 andrewbogott: rebased /var/lib/git/operations/puppet on toolsbeta-puppetmaster3
September 25
- 14:43 YuviPanda: cleaned up ghost /var/log (from before biglogs mount) that was taking up space, /var space situation better now
September 17
- 21:40 andrewbogott: caused a brief auth outage while messing with codfw ldap
September 15
- 11:00 YuviPanda: tested CPU monitoring on tools-exec-12 by running stress, seems to work
September 13
- 20:52 yuvipanda: cleaned out rotated log files on tools-webproxy
September 12
- 21:54 jeremyb: [morebots] booted all bots, reverted to using systemwide (.deb) codebase
September 8
- 16:08 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM @ 2014-09-07 15:13:59)
September 5
- 22:22 scfc_de: Deleted stale nginx entries for "rightstool" and "svgcheck"
- 22:20 scfc_de: Stopped 12 webservices for tool "meta" and started one
- 18:50 scfc_de: geohack's lighttpd dumped core and left an entry in Redis behind; tools-webproxy: "DEL prefix:geohack"; geohack: "webservice start"
September 4
- 19:47 lokal-profil: local-heritage: renamed two Swedish tables
September 2
- 04:31 scfc_de: "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all hosts in support of bug #70076
August 23
- 17:44 scfc_de: qmod -cq task@tools-exec-07 (job #2796555, "11 : before job")
August 21
- 20:05 scfc_de: Deployed release 1.0.11 of jobutils and miscutils
August 15
- 16:45 legoktm: fixed grrrit-wm
- 16:36 legoktm: restarting grrrit-wm
August 14
- 22:36 scfc_de: Removed again jobs in error state due to LDAP with "for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then qdel "$JOBID"; fi; done"; cf. also bug #69529
August 12
- 03:32 scfc_de: tools-exec-08, tools-exec-wmt, tools-webgrid-02, tools-webgrid-03, tools-webgrid-04: Removed stale "apt-get update" processes to get Puppet working again
August 2
- 16:39 scfc_de: tools.mybot's crontab uses qsub without -M, added that as a temporary measure and will inform user later
- 16:36 scfc_de: Manually rerouted mails for tools.mybot@tools-submit.eqiad.wmflabs
August 1
- 22:41 scfc_de: Deleted all jobs in "E" state that were caused by an LDAP failure at ~ 2014-07-30 07:00Z ("can't get password entry for user [...]")
July 24
- 20:53 scfc_de: Set SGE "mailer" parameter again for bug #61160
- 14:51 scfc_de: Removed ignored file /etc/apt/preferences.d/puppet_base_2.7 on all hosts
July 21
- 18:39 scfc_de: Removed stale Redis entries for currentevents, misc2svg, osm4wiki, wp-signpost, wscredits and yadfa
- 18:38 scfc_de: Restarted webservice for stewardbots because it wasn't in Redis
- 18:33 scfc_de: Stopped eight (!) webservices of tools.bookmanagerv2 and started one again
July 18
- 14:29 scfc_de: admin: Set up .bigbrotherrc for toolhistory
- 13:24 scfc_de: Made tools-webgrid-04 a grid submit host
- 12:58 scfc_de: Made tools-webgrid-03 a grid submit host
July 16
- 22:41 YuviPanda: reloaded nginx on tools-webproxy to pick up https://gerrit.wikimedia.org/r/#/c/146466/3
- 15:18 scfc_de: replagstats OOMed four hours after start on May 6th; with ganglia.wmflabs.org down, not restarting
- 15:14 scfc_de: Restarted toolhistory with 350 MBytes; OOMed June 1st
July 15
- 11:31 scfc_de: Started webservice for sulinfo; stopped at 2014-06-29 18:31:04
July 14
- 20:40 andrewbogott: (the apt lock removal at 20:39 was) on tools-login
- 20:39 andrewbogott: manually deleted /var/lib/apt/lists/lock, forcing apt to update
July 13
- 13:13 scfc_de: tools-exec-13: Moved /var/log around, reboot, iptables-restore & reenabled queues
- 13:11 scfc_de: tools-exec-12: Moved /var/log around, reboot & iptables-restore
July 12
- 17:57 scfc_de: tools-exec-11: Stopping apache2 service; no clue how it got there
- 17:53 scfc_de: tools-exec-11: Moved log files around, rebooted, restored iptables and reenabled queue ("qmod -e {continuous,task}@tools-exec-11...")
- 13:00 scfc_de: tools-exec-11, tools-exec-13: qmod -r continuous@tools-exec-1[13].eqiad.wmflabs in preparation for reboot
- 12:58 scfc_de: tools-exec-11, tools-exec-13: Disabled queues in preparation for reboot
- 11:58 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: mkdir -m 2750 /var/log/exim4 && chown Debian-exim:adm /var/log/exim4; I'll file a bug later about why the directory wasn't created
July 11
- 11:59 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: cp -f /data/project/.system/hosts /etc/hosts
July 10
- 20:35 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: iptables-restore /data/project/.system/iptables.conf
- 16:00 YuviPanda: manually removed mariadb remote repo from tools-exec-12 instance, won't be added to new instances (puppet patch was merged)
- 01:33 YuviPanda|zzz: tools-exec-11 and tools-exec-13 have been added to the @general hostgroup
July 9
- 23:14 YuviPanda: applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13
- 23:09 YuviPanda: created tools-exec-13 with precise
- 23:08 YuviPanda: created tools-exec-12 as trusty by accident, will keep on standby for testing
- 23:07 YuviPanda: created tools-exec-12
- 23:06 YuviPanda: created tools-exec-11
- 19:23 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again
- 14:12 scfc_de: tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later
- 13:33 scfc_de: tools-exec-cyberbot: Freed 402398 inodes ...
- 12:50 scfc_de: tools-exec-cyberbot: "find /tmp -maxdepth 1 -type f -name \*cyberbotpeachy.cookies\* -mtime +30 -delete" as a first step
- 12:40 scfc_de: tools-exec-cyberbot: Root partition has run out of inodes
- 12:34 scfc_de: tools-exec-gift: Forgot to log yesterday: The problems were due to overload (load >> 150); SGE shouldn't have allowed that
- 12:28 YuviPanda: cleaned out old diamond archive logs on tools-master
- 12:28 YuviPanda: cleaned out old diamond archive logs on tools-webgrid-04
- 12:25 YuviPanda: cleaned out old diamond archive logs from tools-exec-08
July 8
- 20:57 scfc_de: tools-exec-gift: Puppet hangs due to "apt-get update" not finishing in time; manual runs of the latter take forever
- 19:52 scfc_de: tools-exec-wmt, tools-shadow: Removed stale Puppet lock files and reran manually (handy: "sudo find /var/lib/puppet/state -maxdepth 1 -type f -name agent_catalog_run.lock -ls -ok rm -f \{\} \; -exec sudo puppet agent apply -tv \;")
- 18:09 scfc_de: tools-webgrid-03, tools-webgrid-04: killall -TERM gmond (bug #64216)
- 17:57 scfc_de: tools-exec-08, tools-exec-09, tools-webgrid-02, tools-webgrid-03: Removed stale Puppet lock files and reran manually
- 17:26 scfc_de: tools-tcl-test: Rebooted because system said so
- 17:04 YuviPanda: webservice start on tools.meetbot since it seemed down
- 14:55 YuviPanda: cleaned out old diamond archive logs on tools-webproxy
- 13:39 scfc_de: tools-login: rm -f /var/log/exim4/paniclog ("daemon: fork of queue-runner process failed: Cannot allocate memory")
July 6
- 12:09 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog after I20afa5fb2be7d8b9cf5c3bf4018377d0e847daef got merged
July 5
- 22:36 YuviPanda: cleared diamond archive logs on a bunch of machines, submitted patch to get rid of archive logs
- 22:17 YuviPanda: changed grid scheduling config, set weight_priority to 0.1 from 0.0 for https://bugzilla.wikimedia.org/show_bug.cgi?id=67555
July 4
- 08:51 scfc_de: tools-exec-08 (some hours ago): rm -f /var/log/diamond/* && restart diamond
- 00:02 scfc_de: tools-master: rm -f /var/log/diamond/* && restart diamond
July 3
- 16:59 Betacommand: Coren: It may take a while though; what the catscan queries were blocking is a DDL query changing the schema, and that pauses replication.
- 16:58 Betacommand: Coren: transactions over 30ks killed; the DB should start catching up soon.
- 14:37 Betacommand: replication for enwiki is halted; current lag is at 9876
July 2
- 00:21 YuviPanda: restarted diamond on almost all nodes to stop sending nfs stats, some still need to be flushed
- 00:21 YuviPanda: restarted diamond on all exec nodes to stop sending nfs stats
July 1
- 23:09 legoktm: tools-pywikibot: started the webservice; don't know why it wasn't running
- 21:08 scfc_de: Reset queues in error state again
- 17:51 YuviPanda: tools-exec-04: removed stale pid file and forced puppet run
- 16:07 YuviPanda: applied biglogs to tools-exec-02 and rejigged things
- 15:54 YuviPanda: tools-exec-02: removed stale puppet pid file, forcing run
- 15:51 Coren: adjusted resource limits for -exec-07 to match the smaller instance size.
- 15:50 Coren: created logfile disk for -exec-07 by hand (smaller instance)
- 01:53 YuviPanda: tools-exec-10: applied biglogs, moved logs around, killed some old diamond logs
- 01:41 YuviPanda: tools-exec-03: restarted diamond, atop, exim4, ssh to pick up new log partition
- 01:40 YuviPanda: tools-exec-03: applied biglogs, moved logs around, killed some old diamond logs
- 01:34 scfc_de: tools-exec-03, tools-exec-10: Removed /var/log/diamond/diamond.log, restarted diamond and bzip2'ed /var/log/diamond/*.log.2014*
June 30
- 22:10 YuviPanda: ran webservice start for enwp10
- 22:06 YuviPanda: stale lockfile in tools-login as well, removing and forcing puppet run
- 22:01 YuviPanda: removed stale lockfile for puppet, forcing run
- 19:58 YuviPanda|food: added tools-webgrid-04 to webgrid queue, had to start portgranter manually
- 17:43 YuviPanda: created tools-webgrid-04, applying webnode role and running puppet
- 17:27 YuviPanda: created tools-webgrid-03 and added it to the queue
June 29
- 19:45 scfc_de: magnustools: "webservice start"
- 18:24 YuviPanda: rebooted tools-webgrid-02. Could not ssh, was dead
June 28
- 21:07 YuviPanda: removed alias for tools-webproxy and tools.wmflabs.org from /etc/hosts on tools-webproxy
June 21
- 20:09 scfc_de: Created tool mediawiki-mirror (yuvipanda + Nemo_bis) and chown'ed & chmod o-w /shared/mediawiki
June 20
- 21:01 scfc_de: tools-webgrid-tomcat: Added to submit host list with "qconf -as" for bug #66882
- 14:47 scfc_de: Restarted webservice for mono; cf. bug #64219
June 16
- 23:50 scfc_de: Shut down diamond services and removed log files on all hosts
June 15
- 17:12 YuviPanda: deleted tools-mongo. MongoDB pre-allocates db files, so allocating one db to every tool fills up the disk *really* quickly, even with 0 data. Their non-preallocating version is 'not meant for production', so putting it on hold for now
- 16:50 scfc_de: qmod -cq cyberbot@tools-exec-cyberbot.eqiad.wmflabs
- 16:48 scfc_de: tools-exec-cyberbot: rm -f /var/log/diamond/diamond.log && restart diamond
- 16:48 scfc_de: tools-exec-cyberbot: No DNS entry (again)
June 13
- 22:59 YuviPanda: "sudo -u ineditable -s" to force creation of homedir, since the user was unable to log in before. /var/log/auth.log had no record of their attempts, but now it seems to work. Strange
June 10
- 21:51 scfc_de: Restarted diamond service on all Tools hosts to actually free the disk space :-)
- 21:36 scfc_de: Deleted /var/log/diamond/diamond.log on all Tools hosts to free up space on /var
June 3
- 17:50 Betacommand: Brief network outage. source: It's not clearly determined yet; we aborted the investigation to rollback and restore service. As far as we can tell, there is something subtly wrong with the switch configuration of LACP.
June 2
- 20:15 YuviPanda: created instance tools-trusty-test to test nginx proxy on trusty
- 19:00 scfc_de: zoomviewer: Set TMPDIR to /data/project/zoomviewer/var/tmp and ./webwatcher.sh; cannot see *any* temporary files being created anywhere, though. iipsrv.fcgi however has TMPDIR set as planned.
May 27
- 18:49 wm-bot: petrb: temporarily hardcoding tools-exec-cyberbot in /etc/hosts so that host resolution works
- 10:36 scfc_de: tools-webgrid-01: removed all files of tools.zoomviewer in /tmp
- 10:22 scfc_de: tools-webgrid-01: /tmp was full, removed files of tools.zoomviewer older than five days
- 07:52 wm-bot: petrb: restarted webservice of tool admin in order to purge that huge access.log
May 25
- 14:27 scfc_de: tools-mail: "rm -f /var/log/exim4/paniclog" to leave only relay_domains errors
May 23
- 14:14 andrewbogott: rebooting tools-webproxy so that services start logging again
- 14:10 andrewbogott: applying role::labs::lvm::biglogs on tools-webproxy because /var/log was full and causing errors
May 22
- 02:45 scfc_de: tools-mail: Enabled role::labs::lvm::biglogs, moved data around & rebooted.
- 02:36 scfc_de: tools-mail: Removed all jsub notifications from hazard-bot from queue.
- 01:46 scfc_de: hazard-bot: Disabled minutely cron job github-updater
- 01:36 scfc_de: tools-mail: Freezing all messages to Yahoo!: "421 4.7.1 [TS03] All messages from 208.80.155.162 will be permanently deferred; Retrying will NOT succeed. See http://postmaster.yahoo.com/421-ts03.html"
- 01:12 scfc_de: tools-mail: /var is full
May 20
- 18:34 YuviPanda: back to home-rolled nginx 1.5 on proxy; newer versions causing too many issues
May 16
- 17:01 scfc_de: tools-webgrid-02: rm -f /tmp/core (tools.misc2svg, May 13 06:10, 3861106688)
May 14
- 16:31 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis
- 00:23 Betacommand: 503's related to bug 65179
May 13
- 20:36 YuviPanda: restarting redis on tools-webproxy fixed 503s
- 20:36 valhallasw: redis failed, causing tools-webproxy to throw 503's
- 19:09 marktraceur: Restarted grrrit because it had a stupid nick
May 10
- 14:50 YuviPanda: upgraded nginx to 1.7.0 on tools-webproxy to get SPDY/3.1
May 9
- 13:16 scfc_de: Cleared error state of queues {continuous,mailq,task}@tools-exec-06 and webgrid-lighttpd; no obvious or persistent causes
May 6
- 19:31 scfc_de: replagstats fixed; Ganglia graphs are now under the virtual host "tools-replags"
- 17:53 scfc_de: Don't think replagstats is really working ...
- 16:40 scfc_de: Moved ~scfc/bin/replagstats to ~tools.admin/bin/ and enabled as a continuous job (cf. also bug #48694).
April 28
- 11:51 YuviPanda: pywikibugs: deployed bf1be7b
April 27
- 13:34 scfc_de: Restarted webservice for geohack and moved {access,error}.log to {access,error}.log.1
April 24
- 23:39 YuviPanda: restarted grrrit-wm, not greg-g. greg-g does not survive restarts and hence care must be taken to make sure he is not.
- 23:38 YuviPanda: restarted greg-g after cherry-picking aec09a6 for auth of IRC bot
- 23:33 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/129610
- 13:07 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (relay_domains bug)
April 20
- 14:27 scfc_de: tools-redis: Set role::labs::lvm::mnt and $lvm_mount_point=/var/lib, moved the data around and rebooted
- 14:08 scfc_de: tools-redis: /var is full
- 08:59 legoktm: grrrit-wm: 2014-04-20T08:28:15.889Z - error: Caught error in redisClient.brpop: Redis connection to tools-redis:6379 failed - connect ECONNREFUSED
- 08:48 legoktm: Your job 438884 ("lolrrit-wm") has been submitted
- 08:47 legoktm: [01:28:28] * grrrit-wm has quit (Remote host closed the connection)
April 13
- 14:20 scfc_de: Restarted webservice for wikihistory to see if the change to PHP_FCGI_MAX_REQUESTS increases reliability
- 14:17 scfc_de: tools-webgrid-01, tools-webgrid-02: Set PHP_FCGI_MAX_REQUESTS to 500 in /usr/local/bin/lighttpd-starter per http://redmine.lighttpd.net/projects/1/wiki/docs_performancefastcgi#Why-is-my-PHP-application-returning-an-error-500-from-time-to-time
April 12
- 23:51 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("unknown named domain list "+relay_domains"")
April 11
- 16:21 scfc_de: tools-login: Killed -HUP process consuming 2.6 GByte; cf. wikitech:User talk:Ralgis#Welcome to Tool Labs
April 10
- 18:20 scfc_de: tools-webgrid-01, tools-webgrid-02: "kill -HUP" all php-cgis that are not (grand-)children of lighttpd processes
April 8
- 05:06 Ryan_Lane: restart nginx on tools-proxy-test
- 05:03 Ryan_Lane: upgraded libssl on all nodes
April 4
- 15:48 Coren: Moar powar!!1!one: added two exec nodes (-09 -10) and one webgrid node (-02)
- 11:11 scfc_de: Set /data/project/.system/config/wikihistory.workers to 20 on apper's request
March 30
- 18:16 scfc_de: Removed empty directories /data/project/{d930913,sudo-test{,-2},testbug{,2,3}}: Corresponding service groups don't exist (anymore)
- 18:13 scfc_de: Removed /data/project/backup: Only empty dynamic-proxy backup files of January 3rd and earlier
March 29
- 10:14 wm-bot: petrb: disabled one cron job of user tools.tools-info on -login, which was killing the login server
March 28
- 11:53 wm-bot: petrb: did the same on -mail server (removed /var/log/exim4/paniclog) so that we don't get spam every day
- 11:51 wm-bot: petrb: removed content of /var/log/exim4/paniclog
- 11:49 wm-bot: petrb: disabled the default vimrc on -login, which everybody hates
March 21
- 16:50 scfc_de: tools-login: pkill -u tools.bene (OOM)
- 16:13 scfc_de: rmdir /home/icinga (totally empty, "drwxr-xr-x 2 nemobis 50383 4096 Mär 17 16:42", perhaps artifact of mass migration?)
- 15:49 scfc_de: sudo cp -R /etc/skel /home/csroychan && sudo chown -R csroychan.wikidev /home/csroychan; that should close [[bugzilla:62132]]
- 15:15 scfc_de: sudo cp -R /etc/skel /home/annabel && sudo chown -R annabel.wikidev /home/annabel
- 15:14 scfc_de: sudo chown -R torin8.wikidev /home/torin8
March 20
- 18:36 scfc_de: Pointed tools-dev.wmflabs.org at tools-dev.eqiad.wmflabs; cf. [[Bugzilla:62883]]
March 5
- 13:57 wm-bot: petrb: test
March 4
- 22:35 wm-bot: petrb: uninstalling it from -login too
- 22:32 wm-bot: petrb: uninstalling apache2 from tools-dev; it has nothing to do there
March 3
- 19:20 wm-bot: petrb: shutting down almost all services on webserver-02 in order to make the system usable and finish the upgrade
- 19:17 wm-bot: petrb: upgrading all packages on webserver-02
- 19:15 petan: rebooting webserver-01 which is totally dead
- 19:07 wm-bot: petrb: restarting apache on webserver-02; it complains about OOM, but the server has more than 1.5g of memory free
- 19:03 wm-bot: petrb: switched local-svg-map-maker to webserver-02 because 01 is not accessible to me, hence I can't debug that
- 16:44 scfc_de: tools-webserver-03: Apache was swamped by requests for /guc. "webservice start" for that, and pkill -HUP -u local-guc.
- 12:54 scfc_de: tools-webserver-02: Rebooted; apache2/error.log reported OOM, though more than 1G of memory was free.
- 12:50 scfc_de: tools-webserver-03: Rebooted, scripts were timing out
- 12:42 scfc_de: tools-webproxy: Rebooted; wasn't accessible by ssh.
March 1
- 03:42 Coren: disabled puppet in pmtpa tool labs
February 28
- 14:46 wm-bot: petrb: extending /usr on tools-dev by 800mb
- 00:26 scfc_de: tools-webserver-02: Rebooted; inaccessible via ssh, http said "500 Internal Server Error"
February 27
- 15:28 scfc_de: chmod g-w ~fsainsbu/.forward
February 25
- 22:48 rdwrer: Lol, so, something happened with grrrit-wm earlier and nobody logged any of it. It was yoyoing, Yuvi killed it, then aude did something and now it's back.
February 23
- 20:46 scfc_de: morebots: labs HUPped to reconnect to IRC
February 21
- 17:32 scfc_de: tools-dev: mount -t nfs -o nfsvers=3,ro labstore1.pmtpa.wmnet:/publicdata-project /public/datasets; automount seems to have been stuck
- 15:24 scfc_de: tools-webserver-03: Rebooted, wasn't accessible by ssh and apparently no access to /public/datasets either
February 20
- 21:23 scfc_de: tools-login: Disabled crontab for local-rezabot and left a message at User talk:Reza#Running bots on tools-login, etc. (fa:بحث_کاربر:Reza1615 is write-protected)
- 20:15 scfc_de: tools-login: Disabled crontab for local-chobot and left a message at ko:사용자토론:ChongDae#Running bots on tools-login, etc.
- 10:42 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list", cf. [[bugzilla:61583]])
- 10:30 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
- 10:28 scfc_de: Reset error status of task@tools-exec-09 ("can't get password entry for user 'local-voxelbot'"); "getent passwd local-voxelbot" works on tools-exec-09, possibly a glitch
February 19
- 20:21 scfc_de: morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots
- 19:14 scfc_de: tools-login: Disabled crontab and pkill -HUP -u fatemi127
February 18
- 11:42 scfc_de: tools-mail: Rerouted queued mail (@tools-login.pmtpa.wmflabs => @tools.wmflabs.org)
- 11:34 scfc_de: tools-exec-08: Rebooted due to not responding on ssh and SGE
- 10:39 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list" => probably artifacts from Coren's LDAP changes)
- 10:37 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
February 14
- 23:54 legoktm: restarting grrrit-wm since it disappeared
- 08:19 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
February 13
- 13:11 scfc_de: Deleted old job of user veblenbot stuck in error state
- 13:08 scfc_de: Deleted old jobs of user v2 stuck in error state
- 10:49 scfc_de: tools-login: Commented out local-shuaib-bot's crontab with a pointer to Tools/Help
February 12
- 07:51 wm-bot: petrb: removed /data/project/james/adminstats/wikitools per request from james on irc
February 11
- 15:47 scfc_de: Restarted webservice for geohack
- 13:02 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
- 13:00 scfc_de: Killed -HUP local-hawk-eye-bot's jobs; one was hanging with a stale NFS handle on tools-exec-05
February 10
- 23:16 Coren: rebooting webproxy (braindead autofs)
February 9
- 18:14 legoktm: restarting grrrit-wm, it keeps joining and quitting
- 04:27 legoktm: rebooting grrrit-wm - https://gerrit.wikimedia.org/r/#/c/112308
February 6
- 22:50 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/111889
February 4
- 20:38 legoktm: restarting grrrit-wm: 'Send mediawiki/extension/Thanks to -corefeatures' https://gerrit.wikimedia.org/r/111257
January 31
- 03:43 scfc_de: Cleaned up all exim queues
- 01:26 scfc_de: chmod g-w ~{bgwhite,daniel,euku,fale,henna,hydriz,lfaraone}/.forward (test: sudo find /home -mindepth 2 -maxdepth 2 -type f -name .forward -perm /g=w -ls)
January 30
- 21:48 scfc_de: chmod g-w ~fluff/.forward
- 21:40 scfc_de: local-betabot: Added "-M" option to crontab's qsub call and rerouted queued mail (freeze, exim -Mar, exim -Mmd, thaw)
- 18:33 scfc_de: tools-exec-04: puppetd --enable (apparently disabled sometime around 2014-01-16?!)
- 17:25 scfc_de: tools-exec-06: mv -f /etc/init.d/nagios-nrpe-server{.dpkg-dist,} (nagios-nrpe-server didn't start because start-up script tried to "chown icinga" instead of "chown nagios")
January 28
- 04:27 scfc_de: tools-webproxy: Blocked Phonifier
January 25
- 05:37 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (OOM)
January 24
- 01:07 scfc_de: tools-db: Removed /var/lib/mysql2, set expire_logs_days to 1 day
- 00:11 scfc_de: tools-db: and restarted mysqld
- 00:11 scfc_de: tools-db: Moved 4.2 GBytes of the oldest binlogs to /var/lib/mysql2/
January 23
- 19:24 legoktm: restarting grrrit-wm now https://gerrit.wikimedia.org/r/#/c/109116/
- 19:23 legoktm: ^ was for grrrit-wm
- 19:23 legoktm: re-committed password to local repo, not sure why that wasn't committed already
January 21
- 17:41 scfc_de: tools-exec-09: iptables-restore /data/project/.system/iptables.conf
January 20
- 07:02 andrewbogott: merged a lint patch to the gridengine module. Should be a noop
January 16
- 17:11 scfc_de: tools-exec-09: "iptables-restore /data/project/.system/iptables.conf" after reboot
January 15
- 13:36 scfc_de: After reboot of tools-exec-09, all continuous jobs were successfully restarted ("Rr"); task jobs (1974113, 2188472) failed ("19 : before writing exit_status")
- 13:27 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
- 08:54 andrewbogott: rebooted tools-exec-09
- 08:32 andrewbogott: rebooted tools-db
January 14
- 15:10 scfc_de: tools-login: pkill -u local-mlwikisource: Freed 1 GByte of memory
- 14:58 scfc_de: tools-login: Disabled local-mlwikisource's crontab with explanation
- 13:57 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (out of memory errors on 2014-01-10)
January 10
- 10:41 legoktm: grrrit-wm: restarting https://gerrit.wikimedia.org/r/106670
- 09:00 legoktm: grrrit-wm: setting up #mediawiki-feed, https://gerrit.wikimedia.org/r/106555
January 9
- 18:26 legoktm: rebased grrrit-wm on origin/master since fetching gerrit was failing
- 18:21 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/#/c/106501/
January 8
- 13:44 scfc_de: Cleared error states of continuous@tools-exec-05, task@tools-exec-05, task@tools-exec-09
January 7
- 18:59 scfc_de: tools-login, tools-mail: rm -f /var/log/exim4/paniclog (apparently some artifacts of the LDAP failure)
January 6
- 14:06 YuviPanda: deleted instance tools-mc, didn't know it had come back from the dead
January 1
- 13:24 scfc_de: tools-exec-02, tools-master, tools-shadow, tools-webserver-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
- 11:27 scfc_de: tools-webserver-01, tools-webserver-01: rm -f /var/log/exim4/paniclog; out of memory errors
- 11:18 scfc_de: Emptied /{data/project,home}/.snaplist as the snapshots themselves are not available
December 27
- 07:39 legoktm: grrrit-wm restart didn't really work.
- 07:38 legoktm: restarting grrrit-wm; for some reason it reconnected and lost its cloak
December 23
- 18:30 marktraceur: restart grrrit-wm for subbu
December 21
- 06:50 scfc_de: tools-exec-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
December 19
- 17:22 marktraceur: deploying grrrit config change
December 17
- 23:19 legoktm: rebooted grrrit-wm with new config stuffs
December 14
- 18:13 marktraceur: restarting grrrit-wm to fix its nickname
- 13:17 scfc_de: tools-exec-08: Purged packages libapache2-mod-suphp and suphp-common (probably remnants from when the host was misconfigured as a webserver)
- 13:09 scfc_de: tools-dev, tools-login, tools-mail, tools-webserver-01, tools-webserver-02: rm /var/log/exim4/paniclog (mostly out of memory errors)
December 4
- 22:15 Coren: tools-exec-01 rebooted to fix the autofs issue; will return to rotation shortly.
- 16:33 Coren: rebooting webproxy with new kernel settings to help against the DDoS
December 1
- 14:05 Coren: underlying virtualization hardware rebooted; tools-master and friends coming back up.
November 25
- 21:03 YuviPanda: created tools-proxy-test instance to play around with the dynamicproxy
- 12:16 wm-bot: petrb: deswapping -login (swapoff -a && swapon -a)
November 24
- 07:19 paravoid: disabled crontab for user avocato on tools-login, see above
- 07:17 paravoid: pkill -u avocato on tools-login, multiple /home/avocato/pywikipedia/redirect.py DoSing the bastion
November 14
- 09:12 ori-l: Added aude to lolrrit-wm maintainers group
November 13
- 22:36 andrewbogott: removed 'imagescaler' class from tools-login because that class hasn't existed for a year. Which, a year ago is before that instance even existed so what the heck?
November 3
- 16:49 ori-l: grrrit-wm stopped receiving events. restarted it; didn't help. then restarted gerrit-to-redis, which seems to have fixed it.
November 1
- 16:11 wm-bot: petrb: restarted terminator daemon on -login to sort out memory issues caused by a heavy mysql client run by elbransco
October 23
- 15:19 Coren: deleted tools-tyrant and tools-exec-cyberbot (cleanup of obsoleted instances)
October 20
- 18:52 wm-bot: petrb: everything looks better
- 18:51 wm-bot: petrb: restarting apache server on tools-webproxy
- 18:49 wm-bot: petrb: installed links on -dev; going to investigate what is wrong with the apaches. Coren, please update the documentation
October 15
- 21:03 Coren: labs-login rebooted; this successfully fixed the ownership/take issue.
October 10
- 09:49 addshore: tools-webserver-01 is getting a 500 Internal Server Error again
September 23
- 06:44 YuviPanda: remove unpuppetized install of openjdk-6 packages causing problems in -dev (for bug: 54444)
- 05:15 legoktm: logging a log to test the log logging
- 05:13 legoktm: logging a log to test the log logging
September 11
- 09:39 wm-bot: petrb: started toolwatcher
August 24
- 18:00 wm-bot: petrb: freed 1600mb of ram by killing yasbot processes on -login
- 17:59 wm-bot: petrb: killing all python processes of yasbot on -login, this bot needs to run on grid, -login is constantly getting OOM because of this bot
August 23
- 12:17 wm-bot: petrb: test
- 12:15 wm-bot: petrb: making pv from /dev/vdb on new nodes
- 11:49 wm-bot: petrb: syncing packages of -login with exec nodes
- 11:48 petan: someone installed firefox on exec nodes, should investigate / remove
August 22
- 01:24 scfc_de: tools-webserver-03: Installed python-oursql
August 20
- 23:00 scfc_de: Opened port 3000 for intra-Labs traffic in execnode security group for YuviPanda's proxy experiments
August 19
- 09:52 wm-bot: petrb: deleting fatestwiki tool, requested by creator
August 16
- 00:16 scfc_de: tools-exec-01 doesn't come up again even after repeated reboots
August 15
- 15:14 scfc_de: tools-webserver-01: Simplified /usr/local/bin/php-wrapper
- 14:31 scfc_de: tools-webserver-01: "dpkg --configure -a" on apt-get's advice
- 14:24 scfc_de: chmod 644 ~magnus/.forward
- 03:07 scfc_de: tools-webproxy: Temporarily serving 403s to AhrefsBot/bingbot/Googlebot/PaperLiBot/TweetmemeBot/YandexBot until they reread robots.txt
- 02:02 scfc_de: robots.txt: "Disallow: /"
August 11
- 03:14 scfc_de: tools-mc: Purged memcached
August 10
- 02:36 scfc_de: Disabled terminatord on tools-login and tools-dev
- 02:24 scfc_de: chmod g-w ~whym/.forward
August 6
- 19:26 scfc_de: Set up basic robots.txt to exclude Geohack to see how that affects traffic
- 02:09 scfc_de: tools-mail: Enabled rudimentary Ganglia monitoring in root's crontab
August 5
- 20:32 scfc_de: chmod g-w ~ladsgroup/.forward
August 2
- 23:45 scfc_de: tools-dev: Installed dialog for testing
August 1
- 19:57 scfc_de: Created new instance tools-redis with redis_maxmemory = "7GB"
- 19:56 scfc_de: Added redis_maxmemory to wikitech Puppet variables
July 31
- 10:50 HenriqueCrang: ptwikis added graph with mobile edits
July 30
- 19:08 scfc_de: tools-webproxy: Purged popularity-contest and ubuntu-standard
- 07:32 wm-bot: petrb: deleted local-addbot jobs
- 02:01 scfc_de: tools-webserver-01: Symlinked /usr/local/bin/{job,jstart,jstop,jsub} to /usr/bin; were obsolete versions.
July 29
- 15:15 scfc_de: tools-webserver-01: rm /var/log/exim4/paniclog
- 15:10 scfc_de: Purged popularity-contest from tools-webserver-01.
- 02:40 scfc_de: Restarted toolwatcher on tools-login.
- 02:11 scfc_de: Reboot tools-login, was not responsive
July 25
- 23:37 Ryan_Lane: added myself to lolrrit-wm tool
- 12:06 wm-bot: petrb: test
- 07:11 wm-bot: petrb: created /var/log/glusterfs/bricks/ to stop rotatelogs from complaining about it being missing
July 20
- 15:19 petan: rebooting tools-redis
July 19
- 07:06 petan: instances were rebooted for unknown reasons
- 00:42 helderwiki: it works! :-)
- 00:41 legoktm: test
July 10
- 18:04 wm-bot: petrb: installing mysqltcl on grid
- 18:01 wm-bot: petrb: installing tclodbc on grid
July 5
- 19:38 AzaToth: test
- 19:36 AzaToth: test for example
- 18:23 Coren: brief outage of webproxy complete (back to business!)
- 18:13 Coren: brief outage of webproxy (rollback 2.4 upgrade)
July 3
- 13:44 scfc_de: Set "HostbasedAuthentication yes" and "EnableSSHKeysign yes" in tools-dev's /etc/ssh/ssh_config
- 12:58 petan: rebooting -mc; it's apparently dying from OOM
July 2
- 16:24 wm-bot: petrb: installed maria to all nodes so we can connect to db even from sge
- 12:19 wm-bot: petrb: installing packages -- libmediawiki-api-perl libdatetime-format-strptime-perl libbot-basicbot-perl libdatetime-format-duration-perl
July 1
- 18:39 wm-bot: petrb: started toolwatcher on -login
- 14:22 wm-bot: petrb: installing following packages on grid: libdata-dumper-simple-perl libhtml-html5-entities-perl libirc-utils-perl libtask-weaken-perl libobject-pluggable-perl libpoe-component-syndicator-perl libpoe-filter-ircd-perl libsocket-getaddrinfo-perl libpoe-component-irc-perl libxml-simple-perl
- 12:05 wm-bot: petrb: starting toolwatcher
- 11:40 wm-bot: petrb: tools is back o/
- 09:42 wm-bot: petrb: installing python-zmq and python-matplotlib on -dev
- 03:33 scfc_de: Rebooted tools-login; apparently out of memory and not responding to ssh
June 30
- 17:58 scfc_de: Set ssh_hba to yes on tools-exec-06
- 17:13 scfc_de: Installed python-matplotlib and python-zmq on tools-login for YuviPanda
June 26
- 21:16 Coren: +Tim Landscheidt to project admins, local-admin
- 14:23 wm-bot: petrb: updating several packages on -login
- 13:43 wm-bot: petrb: killing old instance of redis: Jun15 ? 00:06:49 /usr/bin/redis-server /etc/redis/redis.conf
- 13:42 wm-bot: petrb: restarting redis
- 13:28 wm-bot: petrb: running puppet on -mc
- 13:27 wm-bot: petrb: adding ::redis role to tools-mc - if anything will break, YuviPanda did it :P
- 09:35 wm-bot: petrb: updated status.php to a version which displays free vmem as well
June 25
- 12:34 wm-bot: petrb: installing php5-mcrypt on exec and web
June 24
- 15:45 wm-bot: petrb: changed colors of root prompt to distinguish production vs. testing
- 07:57 wm-bot: petrb: 50527 4186 22830 1 Jun23 pts/41 00:08:54 python fill2.py eats 48% of ram on -login
June 19
- 12:17 wm-bot: petrb: increasing limit on mysql connections
June 17
- 17:34 wm-bot: petrb: /var/spool/cron/crontabs/ has -rw------- 1 8006 crontab 1176 Apr 11 14:07 local-voxelbot; fixing
June 16
- 21:23 Coren: 1.0.3 deployed (jobutils, misctools)
June 15
- 21:40 wm-bot: petrb: there is no LVM on -db, which we badly need; therefore no swap either, nor storage for binary logs :( I have a feeling that mysql will die OOM soonish
- 21:39 wm-bot: petrb: db has 5% free RAM eeeek
- 18:36 wm-bot: root: removed a lot of "audit" logs from exec-04; they were eating too much storage
- 18:23 wm-bot: petrb: temporarily disabling /tmp on exec-04 in order to set up lvm
- 18:23 wm-bot: petrb: exec-04 96% / usage, creating a new volume
- 12:33 wm-bot: petrb: installing redis on tools-mc
June 14
- 12:35 wm-bot: petrb: updating logsplitter to new version
June 13
- 21:59 wm-bot: petrb: replaced logsplitter on both apache servers with a far more powerful C++ version, saving a lot of resources on both servers
- 12:43 wm-bot: petrb: tools-webserver-01 is running a quite expensive python job (currently eating almost 1 GB of RAM); it may need to be fixed or moved to a separate webserver. Adding swap to prevent the machine from dying OOM
- 12:22 wm-bot: petrb: killing process 31187 sort -T./enwiki/target -t of user local-enwp10 for the same reason as the previous one
- 12:21 wm-bot: petrb: killing process 31190 sort -T./enwiki/target of user local-enwp10 for the same reason as the previous one
- 12:17 wm-bot: petrb: killing process 31186 31185 69 Jun11 pts/32 1-13:14:41 /usr/bin/perl ./bin/catpagelinks.pl ./enwiki/target/main_pages_sort_by_ids.lst ./enwiki/target/pagelinks_main_sort_by_ids.lst because it seems to be a bot running on login server eating too many resources
June 11
- 07:36 wm-bot: petrb: installed libdigest-crc-perl
June 10
- 13:05 wm-bot: petrb: installing libcrypt-gcrypt-perl
- 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix !b 49383
- 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix become afcbot 49383
- 08:44 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix become afcbot 49383
- 08:25 wm-bot: petrb: fixing missing packages on exec nodes
June 9
- 20:44 wm-bot: petrb: moved logs on -login to separate storage
June 8
- 21:24 wm-bot: petrb: installing python-imaging-tk on grid
- 21:20 wm-bot: petrb: installing python-tk
- 21:16 wm-bot: petrb: installing python-flickrapi on grid
- 21:16 wm-bot: petrb: installing
- 16:49 wm-bot: petrb: turned off the WMF style of vi on tools-dev; feel free to slap me :o, or do cat /etc/vim/vimrc.local >> .vimrc if you love it
- 15:33 wm-bot: petrb: grid is overloaded, needs to be either enlarged or jobs calmed down :o
- 09:55 wm-bot: petrb: backporting tcl 8.6 from debian
- 09:38 wm-bot: petrb: update python requests to version 1.2.3.1
June 7
- 15:29 Coren: Deleted no-longer-needed tools-exec-cg node (spun off to its own project)
June 5
- 09:52 wm-bot: petrb: on -dev
- 09:52 wm-bot: petrb: moving /usr to separate volume expect problems :o
- 09:41 wm-bot: petrb: moved /var/log to separate volume on -dev
- 09:31 wm-bot: petrb: Houston, we have a problem: / on dev is 94%
- 09:28 wm-bot: petrb: installed openjdk7 on -dev
- 09:00 wm-bot: petrb: removing wd-terminator service
- 08:39 wm-bot: petrb: started toolwatcher
- 07:04 wm-bot: petrb: installing maven on -dev
June 4
- 14:49 wm-bot: petrb: installing sbt in order to fix b48859
- 13:28 wm-bot: petrb: installing csh on cluster
- 08:37 wm-bot: petrb: installing python-memcache on exec nodes
June 3
- 21:40 Coren: Rebooting -login; it's thrashing. Will keep an eye on it.
- 14:15 wm-bot: petrb: removing popularity contest
- 14:11 wm-bot: petrb: removing /etc/logrotate.d/glusterlogs on all servers to fix logrotate daemon
- 09:43 wm-bot: petrb: syncing packages on exec nodes to avoid trouble with libs missing on some of them
June 2
- 08:39 wm-bot: petrb: installing ack-grep everywhere per yuvipanda and irc
June 1
- 20:57 wm-bot: petrb: installed this to exec nodes because it was on some and not on others cpp-4.4 cpp-4.5 cython dbus dosfstools ed emacs23 ftp gcc-4.4-base iptables iputils-tracepath ksh lsof ltrace lshw mariadb-client-5.5 nano python-dbus python-egenix-mxdatetime python-egenix-mxtools python-gevent python-greenlet strace telnet time -y
- 20:42 wm-bot: petrb: installing wikitools cluster wide
- 20:40 wm-bot: petrb: installing oursql cluster wide
- 10:46 wm-bot: petrb: created new instance for experiments with sasl memcache tools-mc
May 31
- 19:17 petan: deleting xtools project (requested by Cyberpower678)
- 17:24 wm-bot: petrb: removing old kernels from -dev because / is almost full
- 17:17 wm-bot: petrb: installed lsof to -dev
- 15:55 wm-bot: petrb: installed subversion to exec nodes 4 legoktm
- 15:47 wm-bot: petrb: replacing mysql with maria on exec nodes
- 15:46 wm-bot: petrb: replacing mysql with maria on exec nodes
- 15:14 wm-bot: petrb: installing default-jre in order to satisfy its dependencies
- 15:13 wm-bot: petrb: installing /data/project/.system/deb/all/sbt.deb to -dev in order to test it
- 13:04 wm-bot: petrb: installing bashdb on tools and -dev
- 12:27 wm-bot: petrb: removing project local-jimmyxu - per request on irc
- 10:54 wm-bot: petrb: killing process 3060 on -login (mahdiz 3060 1964 88 May30 ? 21:32:51 /bin/nano /tmp/crontab.Ht3bSO/crontab) it takes max cpu and doesn't seem to be attached
May 30
- 12:24 wm-bot: petrb: deleted job 1862 from queue (error state)
- 08:26 wm-bot: petrb: updated sql command
May 29
- 21:05 wm-bot: petrb: running sudo apt-get install php5-gd
May 28
- 20:00 wm-bot: petrb: installing p7zip-full to -dev and -login
May 27
- 08:46 wm-bot: petrb: changed config of mysql to use /mnt as path to save binary logs, this however requires server to be restarted
May 24
- 08:44 petan: setting up lvm on new exec nodes because it is more flexible and allows us to change the size of volumes on the fly
- 08:28 petan: created 2 more exec nodes, setting up now...
May 23
- 09:20 wm-bot: petrb: process 27618 on -login is constantly eating 100% of cpu, changing priority to 20
May 22
- 20:54 wm-bot: petrb: changing ownership of /data/project/bracketbot/ to local-bracketbot
- 14:28 labs-logs-bottie: petrb: installed netcat as well
- 14:28 labs-logs-bottie: petrb: installed telnet to -dev
- 14:02 Coren: tools-webserver-02 now live; / and /cluebot/ moved there
May 21
- 20:27 labs-logs-bottie: petrb: uploaded hosts to -dev
May 19
- 13:40 labs-logs-bottie: petrb: killing that nano process; it seems to be hung and is unattached anyway
- 12:59 labs-logs-bottie: petrb: changed priority of nano process to 19
- 12:55 labs-logs-bottie: petrb: local-hawk-eye-bot /bin/nano /tmp/crontab.d4JhUj/crontab eats too much cpu
- 12:50 petan: nvm previous line
- 12:50 labs-logs-bottie: petrb: vul alias viewuserlang
May 14
- 21:22 labs-logs-bottie: petrb: created a separate volume for /tmp on login so that temp files do not fragment root fs and it does not get filled up by them, it also makes it easier to track filesystem usage
- 13:16 Coren: reboot -dev, need to test kernel upgrade
May 10
- 15:08 Coren: create tools-webserver-02 for Apache 2.4 experimentation
May 9
- 04:12 Coren: added -exec-03 and -exec-04. Moar power!!1!
May 6
- 19:59 Coren: made tools-dev.wmflabs.org public
- 08:04 labs-logs-bottie: petrb: created a small swap on -login so that users cannot bring it to OOM so easily, and so that unused memory blocks can be swapped out, letting the remaining memory be used more effectively
- 08:00 labs-logs-bottie: petrb: making an LVM volume from the unused disk at /mnt on -login so that we can eventually use it somewhere if needed
May 4
- 17:50 labs-logs-bottie: petrb: foobar as well
- 17:47 labs-logs-bottie: petrb: removing project flask-stub using rmtool
- 15:33 labs-logs-bottie: petrb: fixing missing db user for local-stub
- 12:51 labs-logs-bottie: petrb: creating mysql accounts by hand for alchimista and fubar
May 2
- 20:49 labs-logs-bottie: petrb: uploaded motd to exec-N as well, with information about which server users are connected to
May 1
- 16:59 labs-logs-bottie: petrb: fixed invalid permissions on /home
April 27
- 18:54 labs-logs-bottie: petrb: installing pymysql using pip on the whole grid because it is needed for greenrosseta (for some reason it is better than the python-mysql package)
April 26
- 23:55 Coren: reboot to finish security updates
- 08:00 labs-logs-bottie: petrb: patching qtop
- 07:57 labs-logs-bottie: petrb: added tools-dev to admin host list so that qtop works and fixing the bug of qtop
- 07:28 labs-logs-bottie: petrb: installing GE tools to -dev so that we can develop new j|q* stuff there
April 25
- 19:00 Coren: Maintenance over; systems restarted and should be working.
- 18:18 labs-logs-bottie: petrb: we are running into memory trouble on tools-db; there is less than 20% free memory
- 18:01 Coren: Begin maintenance (login disabled)
- 13:21 petan: removing local-wikidatastats from ldap
April 24
- 13:17 labs-logs-bottie: petrb: sudo chown local-peachy PeachyFrameworkLogo.png
- 11:37 labs-logs-bottie: petrb: created new project stats and cloned acl from wikidatastats, which is supposed to be deleted
- 11:32 legoktm: wikidatastats attempting to install limn
- 11:15 labs-logs-bottie: petrb: installing npm to -login instance
- 07:34 petan: creating project wikidatastats for legoktm addshore and yuvipandianablah :P
April 23
- 13:32 labs-logs-bottie: petrb: changing permissions of cyberbot and peachy to 775 so that it is easier to use them
- 12:14 labs-logs-bottie: petrb: qtop on -dev
- 12:12 labs-logs-bottie: petrb: removed part of motd from login server that got there in a mysterious way
April 19
- 22:38 Coren: reboot -login, all done with the NFS config. yeay.
- 17:13 Coren: (final?) reboot of -login with the new autofs configuration
- 16:24 Coren: (rebooted -login)
- 16:24 Coren: autofs + gluster = fail
- 14:45 Coren: reboot -login (NFS mount woes)
April 15
- 22:29 Coren: also a test; note how said bot knows its place. :-)
- 22:14 andrewbogott: this is a test of labs-morebots.
- 21:49 andrewbogott: this is a test
- 15:41 labs-logs-bottie: petrb: installing p7zip everywhere
- 08:00 labs-logs-bottie: petrb: installing dev packages needed for YuviPanda on login box
April 11
- 22:39 Coren: rebooted tools-puppet-test (no end-user impact): hung filesystem prevents login
- 07:42 labs-logs-bottie: petrb: removed reboot information from motd
April 10
- 21:42 labs-logs-bottie: petrb: reverting the change
- 21:35 labs-logs-bottie: petrb: adding /lib to /etc/ld.so.conf in order to fix the bug with gcc / ubuntu; see irc logs (22:30 GMT)
- 21:22 labs-logs-bottie: petrb: installing jobutils.deb on login
- 20:30 labs-logs-bottie: petrb: installing some dev tools to -dev
- 20:23 petan: created -dev instance for various purposes
April 8
- 14:07 labs-logs-bottie: petrb: ongrid apt-get install mono-complete
- 13:50 labs-logs-bottie: local-afcbot: unable to run mono applications: The assembly mscorlib.dll was not found or could not be loaded.
April 4
- 14:40 labs-logs-bottie: petrb: trying to convert afcbot to new service group local-afcbot
April 2
- 16:04 labs-logs-bottie: petrb: installed log to /home/petrb/bin/ and testing it
- 15:55 petan: patched /usr/local/bin/qdisplay so that it can display jobs per node properly
- 15:54 petan: giving sudo to Petrb in order to update qdisplay
March 28
- 15:44 Coren: reboot (still unactivated) tools-shadow
March 26
- 18:17 Coren: Doubled the size of the compute grid! (added tools-exec-02 to the grid)
March 21
- 23:30 Coren: turned on interpretation of .py as CGI by default on tools-webserver-* to parallel .php
- 16:15 Coren: Added tools-login.wmflabs.org public IP for the tools-login instance and allowed incoming ssh to it.
March 19
- 14:21 Coren: reboot cycle (all instances) to apply security updates
March 13
- 14:04 Coren: restarted webserver: relax AllowOverride options
March 11
- 15:47 Coren: enabled X forwarding for qmon. Also, installed qmon.
- 13:17 Coren: added python-requests (1.0, from pip)
March 7
- 20:41 Coren: tools' php errors now sent to ~/php_errors.log
- 19:31 Coren: access.log now split by tools (in tool homedir)
- 16:15 Coren: can haz database (support for user/tool databases in place)
March 6
- 20:25 Coren: tools-db installed mariadb-server from official repo
- 19:50 Coren: created tools-db instance for a (temporary) mysql install
March 5
- 21:45 Coren: rejiggered the webproxy config to be smarter about paths not leading to specific tools
February 26
- 23:49 Coren: Original node structure: created tools-{master,exec-01,webserver-01,webproxy} instances
- 18:39 Coren: Created tools-puppet-test for dev and testing of tools' puppet classes.
- 01:52 Coren: created instance tools-login (primary login/dev instance)
- 01:52 Coren: created sudo policies and security groups (skeletal)
- 01:08 Coren: Creation of the new project for preproduction deployment of the current (preliminary) plan mw:Wikimedia Labs/Tool Labs/Design