Analytics/Archive/Webstatscollector

This page contains historical information. It may be outdated or unreliable.
More current information may be available (or not) at Analytics/Pageviews.

Webstatscollector was a set of services that computed unsampled, hourly, per-page pageview counts from udp2log. It powered pagecounts-raw until 2015-01-01. Webstatscollector itself was turned off on 2015-01-29.

The output is made available through pagecounts-raw at dumps.wikimedia.org and powers, for example, http://stats.grok.se/ and parts of http://stats.wikimedia.org/.

Architecture

Webstatscollector architecture

Webstatscollector is made up of two separate processes:

  • filter, and
  • collector.

filter is run by udp2log (currently on oxygen). All udp2log logs are piped through filter. filter has hardcoded URL matches; requests that match them are not counted as pageviews.

The output of filter is then piped into log2udp which sends logs over to the collector process on gadolinium.

As of 2014-08-27, collector continuously writes data at about 15MB/s. To keep that load off the physical disks, collector runs in a tmpfs directory. That tmpfs directory's “dumps” subdirectory, where the aggregated hourly files are written, is a symlink to a real disk, so the aggregated hourly files survive a reboot.
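The directory layout described above can be sketched roughly as follows. This is only an illustration: the function name setup_collector_dir and both directory arguments are hypothetical, and the tmpfs mount itself is left as a comment because it needs root and its actual options are not documented here.

```shell
# Sketch only: function name and both paths are hypothetical.
setup_collector_dir() {
    work_dir=$1        # directory the collector runs in (tmpfs on a real host)
    persistent_dir=$2  # directory on a real disk for the hourly files

    mkdir -p "$work_dir" "$persistent_dir" || return 1

    # On a real host, work_dir would first be mounted as tmpfs (needs root), e.g.:
    #   mount -t tmpfs tmpfs "$work_dir"

    # Make "dumps" inside the working directory point at the persistent directory,
    # so files written there land on the real disk.
    ln -sfn "$persistent_dir" "$work_dir/dumps"
}
```

With this layout, anything written under the working directory's dumps link ends up in the persistent directory, while all other collector files stay in memory.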

As of 2014-09-02, the collector's per-page Berkeley database grows to an on-disk size of about 1GB during busy hours. The per-project database is orders of magnitude smaller.
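To illustrate the kind of per-page aggregation the collector performs for each hour, here is a minimal sketch. It is not the collector's implementation: awk arrays stand in for the Berkeley databases, the function name aggregate_hour is hypothetical, and the three-column input format (project, page title, response size in bytes) is an assumption made for the example.

```shell
# Sketch only: the input format is assumed, not the real wire format.
aggregate_hour() {
    # stdin:  "<project> <page_title> <bytes>", one line per counted request
    # stdout: "<project> <page_title> <count> <total_bytes>", one line per page
    awk '{
        key = $1 " " $2
        count[key] += 1      # number of requests for this page
        bytes[key] += $3     # summed response size for this page
    }
    END {
        for (key in count) print key, count[key], bytes[key]
    }' | sort
}
```

For example, feeding it two requests for "en Main_Page" and one for "de Hauptseite" produces one output line per page, each carrying the request count and the summed bytes.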

Relevant sources

Updating and Deploying a New Version

webstatscollector is puppetized and installed using a (hacky) .deb package.

Build a new package by running:

 version=0.2
 rm -f ../webstatscollector_$version.orig.tar.gz && tar -cvzf ../webstatscollector_$version.orig.tar.gz . && debuild -us -uc

Put this .deb into our apt repository following the instructions here: https://wikitech.wikimedia.org/wiki/Reprepro#Importing_packages.

Then, upgrade the package and restart the binaries:

On oxygen:

 apt-get update
 apt-get install webstatscollector
 # restart udp2log with the new filter binary
 service udp2log restart

On gadolinium:

 apt-get update
 apt-get install webstatscollector
 # restart the collector process
 service webstats-collector restart

Done!

Used Page View definition

The flow diagram at https://phabricator.wikimedia.org/diffusion/ANME/browse/master/pageviews/webstatscollector/pageview_definition.png illustrates the pageview definition that was used.

As webstatscollector throws away requests coming from WMF internal IP addresses, it needs to be fed the requests arriving at the SSL endpoints to count SSL traffic.

Requests to mobile sites get aggregated across projects. So “en.mw” does not refer to “PageViews to mobile site of enwiki”, but to “PageViews of mobile sites of counted English wikis”. For example, requests to both http://en.m.wikivoyage.org/wiki/Main_Page and http://en.m.wikipedia.org/wiki/Main_Page get counted towards en.mw.
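The cross-project aggregation of mobile requests can be sketched as a mapping from hostname to project code. This is an illustration only: the function name mobile_project_code is hypothetical, and the hostname pattern is an assumption that does not reproduce the filter's actual rules about which wikis are counted.

```shell
# Sketch only: the hostname pattern is an assumption, not the filter's real rule.
mobile_project_code() {
    host=$1
    case "$host" in
        *.m.*.org)
            # Keep only the language prefix and drop the project name,
            # so en.m.wikivoyage.org and en.m.wikipedia.org both give en.mw.
            printf '%s.mw\n' "${host%%.*}"
            ;;
        *)
            # Not a mobile hostname under this assumed pattern.
            return 1
            ;;
    esac
}
```

Because the project part of the hostname is discarded, the resulting code cannot be used to tell mobile Wikipedia traffic apart from mobile Wikivoyage traffic.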