bayes
This page contains historical information. It may be outdated or unreliable.
Location: pmtpa
Status: This device has been decommissioned.
bayes was a Sun server in Tampa, used by ezachte for metrics scripts written in R. (Do not touch right before metrics meetings.)
Use ILOM for bayes.mgmt.
Via Erik Zachte:
About the scripts: monthly processing consists of two steps:

1) SquidCountArchive.pl: collect counts into csv files
2) SquidReportArchive.pl: generate reports from these csv files

Ad 1) The script parses the logs twice:

- First to collect ip address frequencies (mainly for edit stats: if an ip address occurs twice in a 1:1000 sample it is most likely a bot, even if the agent string does not say so; a few false positives are unavoidable).
- Then to collect all other counts.

Input is expected in /a/squid/archive (the folder structure on bayes follows locke in this). Output goes to /a/ezachte/yyyy-mm/yyyy-mm-dd/public and [same]/private (folder 'private' contains files which we should not publish out of privacy concerns, e.g. ip address counts). For a detailed description of the csv files and a list of reports, see http://www.mediawiki.org/wiki/Wikistats/TrafficReports

As I said in an earlier mail, resources on locke are at a premium; we had some issues over the year (server overload -> many messages got lost). That is why I now run on a copy of the files on bayes, as follows:

 cd /a/squid/archive
 rsync -v --bwlimit=4096 [your id]@locke.wikimedia.org:/a/squid/archive/sampled-1000.log-201112*.gz .

How to run the scripts:

1) nice perl SquidCountArchive.pl -d 2012/12/01-2012/12/31

(-d takes any date range; processing takes about an hour per log file)

The script starts by making sure all log data for a full day are available in the archive folder.
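The ip-frequency pass from step 1 can be sketched as follows. This is a minimal illustration, assuming sampled log lines start with the client ip address; the function names and sample data are invented here, and the real parsing lives in SquidCountArchive.pl.

```python
from collections import Counter

def ip_frequencies(lines):
    """Count how often each client ip appears in the 1:1000 sampled log.

    Assumes the ip is the first whitespace-separated field (an assumption;
    real squid log lines carry more fields).
    """
    return Counter(line.split()[0] for line in lines if line.strip())

def likely_bots(freqs, threshold=2):
    """Addresses seen `threshold` or more times in a 1:1000 sample
    are most likely bots, even if their agent string says otherwise."""
    return {ip for ip, n in freqs.items() if n >= threshold}

# Fabricated sample lines, for illustration only:
sample = [
    "10.0.0.1 GET /wiki/Foo",
    "10.0.0.2 GET /wiki/Bar",
    "10.0.0.1 GET /wiki/Baz",
    "10.0.0.3 GET /wiki/Qux",
]
print(likely_bots(ip_frequencies(sample)))  # {'10.0.0.1'}
```

As the text notes, this heuristic accepts a few false positives as unavoidable; a threshold of two appearances in a 1:1000 sample is deliberately aggressive.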
To verify this, it checks the head and tail of each gz file for the first and last timestamp (and caches this for reuse; as the files are compressed, this takes a while).

Note that EzLib.pm is expected in /home/ezachte; you can modify this in the first lines of SquidCountArchive.pl. (This needs to change of course; it was a workaround for the lack of access to shared perl folders on several dissimilar servers.)

Here is a list of the csv files per day with non-sensitive data (again, see http://www.mediawiki.org/wiki/Wikistats/TrafficReports for details):

 -rw-r--r-- 1 ezachte wikidev 280k Jan 15 17:46 SquidDataAgents.csv.bz2
 -rw-r--r-- 1 ezachte wikidev 400k Jan 15 17:46 SquidDataBanners.csv
 -rw-r--r-- 1 ezachte wikidev  87M Jan 15 17:46 SquidDataBinaries.csv  (left out for now due to size; it is new, and only used to estimate the daily number of downloaded images)
 -rw-r--r-- 1 ezachte wikidev 363k Jan 15 17:46 SquidDataClients.csv
 -rw-r--r-- 1 ezachte wikidev 906k Jan 15 17:46 SquidDataClientsByWiki.csv
 -rw-r--r-- 1 ezachte wikidev 2.2k Jan 15 17:46 SquidDataCountriesSaves.csv
 -rw-r--r-- 1 ezachte wikidev 144k Jan 15 17:46 SquidDataCountriesViews.csv
 -rw-r--r-- 1 ezachte wikidev 1.8M Jan 15 17:46 SquidDataCountriesViewsTimed.csv
 -rw-r--r-- 1 ezachte wikidev 186k Jan 15 17:46 SquidDataCrawlers.csv
 -rw-r--r-- 1 ezachte wikidev 2.3k Jan 15 17:46 SquidDataExtensions.csv
 -rw-r--r-- 1 ezachte wikidev 2.3k Jan 15 17:46 SquidDataGoogleBots.csv
 -rw-r--r-- 1 ezachte wikidev 1.9k Jan 15 17:46 SquidDataImages.csv
 -rw-r--r-- 1 ezachte wikidev  39k Jan 15 17:46 SquidDataIndexPhp.csv
 -rw-r--r-- 1 ezachte wikidev 6.5k Jan 15 17:46 SquidDataLanguages.csv
 -rw-r--r-- 1 ezachte wikidev 2.2k Jan 15 17:46 SquidDataMethods.csv
 -rw-r--r-- 1 ezachte wikidev  18k Jan 15 17:46 SquidDataOpSys.csv
 -rw-r--r-- 1 ezachte wikidev 461k Jan 15 17:46 SquidDataOrigins.csv
 -rw-r--r-- 1 ezachte wikidev 2.4M Jan 15 17:46 SquidDataRequests.csv
 -rw-r--r-- 1 ezachte wikidev  61k Jan 15 17:46 SquidDataRequestsM.csv
 -rw-r--r-- 1 ezachte wikidev 1.9k Jan 15 17:46 SquidDataRequestsWap.csv
 -rw-r--r-- 1 ezachte wikidev 129k Jan 15 17:46 SquidDataScripts.csv
 -rw-r--r-- 1 ezachte wikidev  32k Jan 15 17:46 SquidDataSearch.csv
 -rw-r--r-- 1 ezachte wikidev 4.7k Jan 15 17:46 SquidDataSkins.csv

These are the sensitive files:

 -rw-r--r-- 1 ezachte wikidev  77k Jan 15 17:46 DebugSquidDataErrDoNotPublish.txt
 -rw-r--r-- 1 ezachte wikidev 4.1M Jan 15 17:46 DebugSquidDataOutDoNotPublish.txt
 -rw-r--r-- 1 ezachte wikidev 942k Jan 15 17:46 DebugSquidDataOutDoNotPublish2.txt
 -rw-r--r-- 1 ezachte wikidev 1.3M Jan 15 17:46 SquidDataEditsSavesDoNotPublish.txt.bz2
 -rw-r--r-- 1 ezachte wikidev 162k Jan 15 16:37 SquidDataIpFrequenciesDoNotPublish.csv.bz2  (the only one used at the reporting step; some others are for visualizations or trace only)
 -rw-r--r-- 1 ezachte wikidev 583k Jan 15 17:46 SquidDataReferersDoNotPublish.txt
 -rw-r--r-- 1 ezachte wikidev 8.4M Jan 15 17:46 SquidDataViewsVizDoNotPublish-20111101.gz

Ad 2) There are four modes for the reporting script:

2a) -m 201112 (-m for month): reads all daily csv files for a given month and produces all non-geo-related reports
2b) -c (-c for countries): produces geo-related reports plus historical trends
2c) -c -q 2011Q4 (-q for quarter): produces a quarterly report (essentially similar to the yearly one, with a stricter filter)
2d) -w (-w for Wikipedia): once a year or so, collects new country metrics (population, internet usage, flags) from several Wikipedia articles into SquidReportCountryMetaInfo.csv
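The per-file check from step 1 (reading the head and tail of each compressed log for its first and last timestamp, and caching the result because gzip files must be streamed in full) could be sketched like this in Python. The timestamp field position and the json cache layout are assumptions for illustration; the real logic lives in SquidCountArchive.pl.

```python
import gzip
import json
import os
import tempfile

def first_last_lines(path):
    """Return the first and last line of a gzip file.

    gzip offers no random access, so the whole file is streamed once;
    this is why the result is worth caching for reuse.
    """
    first = last = None
    with gzip.open(path, "rt") as f:
        for line in f:
            if first is None:
                first = line
            last = line
    return first, last

def head_tail_timestamps(path, cache_path, field=2):
    """First and last timestamp of a sampled log, with a simple json cache.

    `field` is an assumption about where the timestamp sits in a log line.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if path not in cache:
        first, last = first_last_lines(path)
        cache[path] = [first.split()[field], last.split()[field]]
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return tuple(cache[path])

# Demo with a tiny fabricated log (the real files live in /a/squid/archive):
tmp = tempfile.mkdtemp()
log = os.path.join(tmp, "sampled-1000.log-20111201.gz")
with gzip.open(log, "wt") as f:
    f.write("sq1 1 2011-12-01T00:00:01 GET /a\n"
            "sq1 2 2011-12-01T23:59:59 GET /b\n")
print(head_tail_timestamps(log, os.path.join(tmp, "ts.json")))
# ('2011-12-01T00:00:01', '2011-12-01T23:59:59')
```

A second call with the same cache file skips decompression entirely, which matters when a day's worth of gz files has to be validated before an hour-long counting run.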