Analytics/Archive/Webstatscollector
- More current information may be available (or not) at Analytics/Pageviews.
Webstatscollector was a set of services that computed unsampled, hourly, per-page pageview counts from the udp2log stream. It powered pagecounts-raw until 2015-01-01. Webstatscollector itself was turned off on 2015-01-29.
The output is made available through pagecounts-raw at dumps.wikimedia.org, and ends up powering, for example, http://stats.grok.se/ and parts of http://stats.wikimedia.org/.
Architecture
Webstatscollector is made up of two separate processes:
- filter, and
- collector.
filter is run by udp2log (currently on oxygen). All udp2log logs are piped through filter. filter has hardcoded URL matches; requests matching them are not counted as pageviews.
The output of filter is then piped into log2udp which sends logs over to the collector process on gadolinium.
As of 2014-08-27, collector continuously writes data at about 15 MB/s. To avoid stressing the disks, collector is run in a tmpfs directory, so those writes go to memory rather than to disk. That tmpfs directory's "dumps" subdirectory (where the aggregated, hourly files get written) is a symlink to a real disk, so the aggregated, hourly files survive a reboot.
As of 2014-09-02 the collector's per-page Berkeley database grows up to an on-disk size of ~1GB for the busy hours. The per-project database is orders of magnitude smaller.
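The collector's bookkeeping described above can be pictured with a small sketch. This is not the code from collector.c. It is a minimal Python illustration that assumes each filtered record has already been reduced to a hypothetical (project, page) pair, and it uses in-memory counters where the real collector uses Berkeley databases.

```python
from collections import Counter


def aggregate_hour(records):
    """Count one hour of (project, page) records.

    `records` is a hypothetical input shape chosen for this sketch:
    an iterable of (project, page) tuples, one per counted request.
    """
    per_page = Counter()     # stands in for the per-page database
    per_project = Counter()  # stands in for the per-project database
    for project, page in records:
        per_page[(project, page)] += 1
        per_project[project] += 1
    return per_page, per_project
```

The per-project counter has one entry per project while the per-page counter has one entry per distinct page, which is consistent with the per-project database being far smaller.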
Relevant sources
- filter is built from webstatscollector's filter.c
- collector is built from webstatscollector's collector.c
- log2udp is built from udplog's srcmisc/log2udp.cpp
Updating and Deploying a New Version
webstatscollector is puppetized and installed using a (hacky) .deb package.
Build a new package by running:
version=0.2
rm -f ../webstatscollector_$version.orig.tar.gz && tar -cvzf ../webstatscollector_$version.orig.tar.gz . && debuild -us -uc
Put this .deb into our apt repository following the instructions here: https://wikitech.wikimedia.org/wiki/Reprepro#Importing_packages.
Then, upgrade the package and restart the binaries:
On oxygen:
apt-get update
apt-get install webstatscollector
# restart udp2log with the new filter binary
service udp2log restart
On gadolinium:
apt-get update
apt-get install webstatscollector
# restart the collector process
service webstats-collector restart
Done!
Used Page View definition
The flow diagram at https://phabricator.wikimedia.org/diffusion/ANME/browse/master/pageviews/webstatscollector/pageview_definition.png illustrates the pageview definition used.
As webstatscollector throws away requests coming from WMF internal IP addresses, it needs to be fed the requests arriving at the SSL endpoints to count SSL traffic.
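The rule about internal addresses can be sketched as follows. The network ranges below are placeholders (a private range and a documentation range), not the actual WMF internal networks, and is_counted is a hypothetical helper name rather than a function from filter.c.

```python
import ipaddress

# Placeholder ranges for illustration only; NOT the real WMF internal networks.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def is_counted(client_ip):
    """Return False for requests whose client address is internal."""
    addr = ipaddress.ip_address(client_ip)
    return not any(addr in net for net in INTERNAL_NETWORKS)
```

Under a rule like this, any request that reaches the filter with an internal client address is dropped.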
Requests to mobile sites get aggregated across projects. So "en.mw" does not refer to "pageviews to the mobile site of enwiki", but to "pageviews of mobile sites of counted English wikis". For example, requests to both http://en.m.wikivoyage.org/wiki/Main_Page and http://en.m.wikipedia.org/wiki/Main_Page get counted towards en.mw.
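The mobile aggregation can be illustrated with a short sketch. It assumes mobile hosts have the form `<lang>.m.<project>.org`, as in the two URLs above. The function name mobile_project_code is hypothetical, and the sketch does not model which wikis are actually counted.

```python
from urllib.parse import urlsplit


def mobile_project_code(url):
    """Return '<lang>.mw' for a mobile-site URL, or None for anything else."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Assumed host shape: <lang>.m.<project>.org. The project label is
    # ignored on purpose, which is what merges all mobile projects of a
    # language into one code.
    if len(labels) == 4 and labels[1] == "m":
        return labels[0] + ".mw"
    return None
```

With this sketch, both example URLs above map to the same code, en.mw, while a desktop URL such as http://en.wikipedia.org/wiki/Main_Page yields None.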