- More current information may be available (or not) at Analytics/Pageviews.
Webstatscollector was a set of services that computed unsampled, hourly, per-page pageview counts from udp2log. It powered pagecounts-raw until 2015-01-01. Webstatscollector itself was turned off on 2015-01-29.
Webstatscollector is made up of two separate processes:
- filter, and
- collector.
filter is run by udp2log (currently on oxygen). All udp2log logs are piped through filter. filter has hardcoded URL patterns; requests matching them are not counted as pageviews.
The output of filter is then piped into log2udp which sends logs over to the collector process on gadolinium.
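The filtering step described above can be sketched roughly as follows. This is only an illustration: the function name filter_sketch and the two exclusion patterns are hypothetical stand-ins (the real matches are hardcoded in filter.c), and each input line is assumed to be a bare URL rather than a full udp2log line.

```shell
# Hypothetical stand-in for filter: reads lines on stdin and drops those
# matching an exclusion pattern. The patterns are illustrative only; the
# real ones are hardcoded in webstatscollector's filter.c.
filter_sketch() {
    # "|| true" keeps the exit status at 0 even when every line is dropped.
    grep -v -E 'Special:|action=raw' || true
}

# Three sample requests; only the first one passes the sketch filter.
printf '%s\n' \
    'http://en.wikipedia.org/wiki/Main_Page' \
    'http://en.wikipedia.org/wiki/Special:Search' \
    'http://en.wikipedia.org/w/index.php?title=Foo&action=raw' \
    | filter_sketch
```

In the real setup the surviving lines would then be piped into log2udp, which is not reproduced here.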
As of 2014-08-27, collector continuously writes data at about 15MB/s. Because that write rate would stress physical disks, collector is run in a tmpfs directory, so its writes go to tmpfs rather than to the disks. That tmpfs directory's “dumps” subdirectory (where the aggregated, hourly files get written) is a symlink to a real disk, so the aggregated, hourly files survive a reboot.
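The directory layout described above can be illustrated with a small sketch. The paths and file name are assumptions: two temporary directories stand in for the tmpfs mount and the real disk, so the example runs without root and does not reflect the actual paths on gadolinium.

```shell
# Stand-ins for the two storage locations (hypothetical paths).
work_dir=$(mktemp -d)      # would be the tmpfs directory collector runs in
persist_dir=$(mktemp -d)   # would be a directory on a real disk

# "dumps" inside the working directory is a symlink to persistent storage.
ln -s "$persist_dir" "$work_dir/dumps"

# A file written through the symlink ends up on the persistent side,
# while everything else in work_dir would stay in tmpfs.
echo "hourly aggregate" > "$work_dir/dumps/example-hourly-file"
ls "$persist_dir"
```

With this layout, only the small aggregated files touch the real disk; the high-volume intermediate writes stay in the working directory.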
As of 2014-09-02 the collector's per-page Berkeley database grows up to an on-disk size of ~1GB for the busy hours. The per-project database is orders of magnitude smaller.
- filter is built from webstatscollector's filter.c
- collector is built from webstatscollector's collector.c
- log2udp is built from udplog's srcmisc/log2udp.ccp
Updating and Deploying a New Version
webstatscollector is puppetized and installed using a (hacky) .deb package.
Build a new package by running:
version=0.2
rm -f ../webstatscollector_$version.orig.tar.gz && tar -cvzf ../webstatscollector_$version.orig.tar.gz . && debuild -us -uc
Put this .deb into our apt repository following the instructions here: https://wikitech.wikimedia.org/wiki/Reprepro#Importing_packages.
Then, upgrade the package and restart the binaries:
apt-get update
apt-get install webstatscollector
# restart udp2log with the new filter binary
service udp2log restart

apt-get update
apt-get install webstatscollector
# restart the collector process
service webstats-collector restart
Used Page View definition
The flow diagram at https://phabricator.wikimedia.org/diffusion/ANME/browse/master/pageviews/webstatscollector/pageview_definition.png illustrates the pageview definition used.
As webstatscollector throws away requests coming from WMF internal IP addresses, it needs to be fed the requests arriving at the SSL endpoints to count SSL traffic.
Requests to mobile sites get aggregated across projects. So “en.mw” does not refer to “PageViews to the mobile site of enwiki”, but to “PageViews of mobile sites of counted English wikis”. So requests to each of
http://en.m.wikipedia.org/wiki/Main_Page will get counted towards