bayes

From Wikitech
This page contains historical information. It may be outdated or unreliable.
Location: pmtpa
Status: This device has been decommissioned.

bayes was a Sun server in Tampa, used by ezachte for metrics scripts written in R. (Do not touch right before metrics meetings)

Use ILOM for bayes.mgmt.

Via Erik Zachte:

About the scripts:

Monthly processing consists of two steps:

1) SquidCountArchive.pl : collect counts into csv files
2) SquidReportArchive.pl: generate reports from these csv files

Ad 1)

One script parses the logs twice:
 - First to collect IP address frequencies (mainly for edit stats: if an IP
address occurs twice in a 1:1000 sample, it is most likely a bot, even if the
agent string does not say so; a few false positives are unavoidable)
 - Then to collect all other counts
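
The two-pass approach described above can be sketched as follows. This is a minimal Python illustration, not the Perl code of SquidCountArchive.pl. The position of the client IP in each log line (`ip_field`), the function names, the `threshold` parameter and the sample lines are all assumptions made for the example.

```python
from collections import Counter

def count_ip_frequencies(lines, ip_field=4):
    """Pass 1: count how often each client IP occurs in the sampled log."""
    freq = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > ip_field:
            freq[fields[ip_field]] += 1
    return freq

def likely_bots(freq, threshold=2):
    """IPs seen at least `threshold` times in a 1:1000 sample are probable bots."""
    return {ip for ip, n in freq.items() if n >= threshold}

def count_requests(lines, bots, ip_field=4):
    """Pass 2: collect other counts, here only bot versus non-bot requests."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= ip_field:
            continue  # skip malformed lines
        counts["bot" if fields[ip_field] in bots else "other"] += 1
    return counts

# Hypothetical sample lines; real squid log lines carry more fields.
sample = [
    "sq1 1 2011-12-01T00:00:01 0.001 10.0.0.1 GET",
    "sq1 2 2011-12-01T00:00:02 0.001 10.0.0.2 GET",
    "sq1 3 2011-12-01T00:00:03 0.001 10.0.0.1 GET",
]
freq = count_ip_frequencies(sample)
bots = likely_bots(freq)
totals = count_requests(sample, bots)
```

The first pass has to finish before the second can start, because whether a request counts as bot traffic depends on the frequency of its IP across the whole sample.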

I expect input in /a/squid/archive (the folder structure on bayes follows that
of locke)
Output goes to /a/ezachte/yyyy-mm/yyyy-mm-dd/public and [same]/private
(the folder 'private' contains files which we should not publish because of
privacy concerns, e.g. IP address counts)
For a detailed description of the csv files and a list of reports, see
http://www.mediawiki.org/wiki/Wikistats/TrafficReports
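
As an illustration of the output layout, the following Python sketch builds the public and private folder names for a given day. The function name `output_dirs` and its `root` parameter are hypothetical; only the path pattern comes from the description above.

```python
from datetime import date
from pathlib import PurePosixPath

def output_dirs(day, root="/a/ezachte"):
    """Return the (public, private) output folders for one day of counts."""
    base = PurePosixPath(root) / day.strftime("%Y-%m") / day.strftime("%Y-%m-%d")
    return base / "public", base / "private"

public_dir, private_dir = output_dirs(date(2011, 12, 1))
```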

As I said in an earlier mail: resources on locke are at a premium, and we had
some issues over the year (server overload -> many messages got lost).
That is why I now run on a copy of the files on bayes, as follows.

cd /a/squid/archive
rsync -v --bwlimit=4096 [your id]@locke.wikimedia.org:/a/squid/archive/sampled-1000.log-201112*.gz .

How to run my scripts:
1) nice perl SquidCountArchive.pl -d 2012/12/01-2012/12/31
(-d = any date range)  (takes about an hour per log file)
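
For reference, a date range in the `yyyy/mm/dd-yyyy/mm/dd` form shown above could be expanded into individual days as in the Python sketch below. How the Perl script actually parses `-d` is not documented here, so the function name and the error handling are assumptions.

```python
from datetime import datetime, timedelta

def days_in_range(spec):
    """Expand 'yyyy/mm/dd-yyyy/mm/dd' into a list of dates, both ends included."""
    start_text, end_text = spec.split("-")
    start = datetime.strptime(start_text, "%Y/%m/%d").date()
    end = datetime.strptime(end_text, "%Y/%m/%d").date()
    if end < start:
        raise ValueError("end date precedes start date")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

days = days_in_range("2012/12/01-2012/12/31")
```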

The script starts by making sure that all log data for a full day are available
in the archive folder. To this end it checks the head and tail of each gz file
for the first and last timestamp (and caches the result for reuse; as the files
are compressed, this takes a while).
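
The completeness check can be sketched roughly as below, in Python rather than Perl. The timestamp position (`ts_field`), the ISO timestamp format, the JSON cache file and all function names are assumptions for illustration; the real script may work differently. The coverage test is deliberately conservative and does not detect gaps between files.

```python
import gzip
import json
import os

def first_last_timestamps(path, ts_field=2):
    """Return the timestamps of the first and last usable lines of a gzipped log.

    Reaching the last line means decompressing the whole file, which is slow.
    """
    first = last = None
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) <= ts_field:
                continue  # skip malformed lines
            if first is None:
                first = fields[ts_field]
            last = fields[ts_field]
    return first, last

def cached_timestamps(path, cache_file, ts_field=2):
    """Look up (first, last) in a JSON cache keyed by file name; scan on a miss."""
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as handle:
            cache = json.load(handle)
    key = os.path.basename(path)
    if key not in cache:
        cache[key] = list(first_last_timestamps(path, ts_field))
        with open(cache_file, "w", encoding="utf-8") as handle:
            json.dump(cache, handle)
    return tuple(cache[key])

def covers_day(spans, day):
    """True if the logs start before `day` and end after it (day as 'yyyy-mm-dd')."""
    firsts = [first for first, _ in spans if first]
    lasts = [last for _, last in spans if last]
    if not firsts or not lasts:
        return False
    # ISO dates compare correctly as strings.
    return min(firsts)[:10] < day < max(lasts)[:10]
```

A caller would collect `cached_timestamps(...)` for every gz file in the archive folder and pass the resulting list to `covers_day` for each day in the requested range.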

Note that EzLib.pm is expected in /home/ezachte; you can modify this in the
first lines of SquidCountArchive.pl
(This needs to change of course; it was a workaround for the lack of access to
shared perl folders on several dissimilar servers)

Here is a list of csv files per day with non-sensitive data: (again see
http://www.mediawiki.org/wiki/Wikistats/TrafficReports for details)

-rw-r--r-- 1 ezachte wikidev 280k Jan 15 17:46 SquidDataAgents.csv.bz2
-rw-r--r-- 1 ezachte wikidev 400k Jan 15 17:46 SquidDataBanners.csv
-rw-r--r-- 1 ezachte wikidev  87M Jan 15 17:46 SquidDataBinaries.csv (left out
for now because of its size; it is new and only used to estimate the daily
number of downloaded images)
-rw-r--r-- 1 ezachte wikidev 363k Jan 15 17:46 SquidDataClients.csv
-rw-r--r-- 1 ezachte wikidev 906k Jan 15 17:46 SquidDataClientsByWiki.csv
-rw-r--r-- 1 ezachte wikidev 2.2k Jan 15 17:46 SquidDataCountriesSaves.csv
-rw-r--r-- 1 ezachte wikidev 144k Jan 15 17:46 SquidDataCountriesViews.csv
-rw-r--r-- 1 ezachte wikidev 1.8M Jan 15 17:46 SquidDataCountriesViewsTimed.csv
-rw-r--r-- 1 ezachte wikidev 186k Jan 15 17:46 SquidDataCrawlers.csv
-rw-r--r-- 1 ezachte wikidev 2.3k Jan 15 17:46 SquidDataExtensions.csv
-rw-r--r-- 1 ezachte wikidev 2.3k Jan 15 17:46 SquidDataGoogleBots.csv
-rw-r--r-- 1 ezachte wikidev 1.9k Jan 15 17:46 SquidDataImages.csv
-rw-r--r-- 1 ezachte wikidev  39k Jan 15 17:46 SquidDataIndexPhp.csv
-rw-r--r-- 1 ezachte wikidev 6.5k Jan 15 17:46 SquidDataLanguages.csv
-rw-r--r-- 1 ezachte wikidev 2.2k Jan 15 17:46 SquidDataMethods.csv
-rw-r--r-- 1 ezachte wikidev  18k Jan 15 17:46 SquidDataOpSys.csv
-rw-r--r-- 1 ezachte wikidev 461k Jan 15 17:46 SquidDataOrigins.csv
-rw-r--r-- 1 ezachte wikidev 2.4M Jan 15 17:46 SquidDataRequests.csv
-rw-r--r-- 1 ezachte wikidev  61k Jan 15 17:46 SquidDataRequestsM.csv
-rw-r--r-- 1 ezachte wikidev 1.9k Jan 15 17:46 SquidDataRequestsWap.csv
-rw-r--r-- 1 ezachte wikidev 129k Jan 15 17:46 SquidDataScripts.csv
-rw-r--r-- 1 ezachte wikidev  32k Jan 15 17:46 SquidDataSearch.csv
-rw-r--r-- 1 ezachte wikidev 4.7k Jan 15 17:46 SquidDataSkins.csv

These are the sensitive files:

-rw-r--r-- 1 ezachte wikidev  77k Jan 15 17:46 DebugSquidDataErrDoNotPublish.txt
-rw-r--r-- 1 ezachte wikidev 4.1M Jan 15 17:46 DebugSquidDataOutDoNotPublish.txt
-rw-r--r-- 1 ezachte wikidev 942k Jan 15 17:46 DebugSquidDataOutDoNotPublish2.txt
-rw-r--r-- 1 ezachte wikidev 1.3M Jan 15 17:46 SquidDataEditsSavesDoNotPublish.txt.bz2
-rw-r--r-- 1 ezachte wikidev 162k Jan 15 16:37 SquidDataIpFrequenciesDoNotPublish.csv.bz2
(the only one used at the reporting step; some of the others are for
visualizations or tracing only)
-rw-r--r-- 1 ezachte wikidev 583k Jan 15 17:46 SquidDataReferersDoNotPublish.txt
-rw-r--r-- 1 ezachte wikidev 8.4M Jan 15 17:46 SquidDataViewsVizDoNotPublish-20111101.gz

Ad 2)

The reporting script has four modes:

2a) -m 201112 (-m for month) reads all daily csv files for a given month and
produces all non-geo-related reports
2b) -c (-c for countries) produces geo-related reports plus historical
trends
2c) -c -q 2011Q4 (-q for quarter) produces a quarterly report (essentially
similar to the yearly one, with a stricter filter)
2d) -w (-w for Wikipedia), run once a year or so, collects new country metrics
(population, internet usage, flags) from several Wikipedia articles into
SquidReportCountryMetaInfo.csv
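
The four modes could be dispatched along the lines of the Python sketch below. It only mirrors the flags listed above. The mode names, the precedence between flags and the use of `argparse` are assumptions, not a description of how SquidReportArchive.pl parses its arguments.

```python
import argparse

def build_parser():
    """Declare the four flags described above (help texts are paraphrased)."""
    parser = argparse.ArgumentParser(prog="SquidReportArchive")
    parser.add_argument("-m", metavar="YYYYMM", help="month: all non-geo reports")
    parser.add_argument("-c", action="store_true", help="countries: geo reports and trends")
    parser.add_argument("-q", metavar="YYYYQn", help="quarter, used together with -c")
    parser.add_argument("-w", action="store_true", help="refresh country meta info")
    return parser

def select_mode(args):
    """Map parsed flags to a hypothetical mode name."""
    if args.w:
        return "country_meta_info"   # 2d
    if args.c and args.q:
        return "quarterly_geo"       # 2c
    if args.c:
        return "geo"                 # 2b
    if args.m:
        return "monthly"             # 2a
    raise SystemExit("no mode selected")

parser = build_parser()
mode = select_mode(parser.parse_args(["-c", "-q", "2011Q4"]))
```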