Analytics/Data Lake/Traffic/Pageviews

See also the pageviews API, available since the end of 2015.

The pageviews dataset has per-article and per-project view counts for all Wikimedia Foundation projects since May 2015. It filters out as many spiders and bots as we can detect. In its domain_code column, explained below, it separates access through the desktop, mobile, and zero sites. The Pageview definition explains how we filter and count pageviews.

This stream is owned by the Analytics Team.

Contained data

(If you are familiar with the pagecounts-raw dataset, you might want to look at the differences between those two datasets right away.)


Disambiguating abbreviations ending in “.m”

There are two ways for an abbreviation to end in .m: either the domain is a whitelisted project on wikimedia.org (like commons.wikimedia.org being abbreviated to commons.m), or the domain is the mobile site of Wikipedia (like en.m.wikipedia.org being abbreviated to en.m).

Since the whitelisted wikimedia.org projects (see abbreviation table above) never match a language code on wikipedia, the mapping between domain name and abbreviation is bijective.

While this solution requires an if for the edge case of "Summing up pageviews across all mobile sites", it stays compatible with pagecounts-raw's abbreviations while keeping the concept and semantics of abbreviating domain names. It also makes it easier to automate comparisons between this dataset and TSVs (like sampled-1000) or Hive data.
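The "if" mentioned above can be sketched as follows. This is a minimal illustration, not the Analytics Team's code: the names WIKIMEDIA_ORG_PROJECTS and is_mobile_site are hypothetical, and the whitelist contains only "commons" because that is the only whitelisted project named in this section. The full list would have to come from the abbreviation table.

```python
# Hypothetical, incomplete whitelist: only "commons" is confirmed by the text.
# Extend it from the abbreviation table before relying on the result.
WIKIMEDIA_ORG_PROJECTS = {"commons"}


def is_mobile_site(domain_code):
    """Return True if the abbreviation denotes a mobile site."""
    parts = domain_code.split(".")
    # A mobile abbreviation carries an "m" component after the first part.
    if "m" not in parts[1:]:
        return False
    # Edge case: "<project>.m" for a whitelisted wikimedia.org project
    # (e.g. commons.m for commons.wikimedia.org) is not a mobile site.
    if len(parts) == 2 and parts[0] in WIKIMEDIA_ORG_PROJECTS:
        return False
    return True
```

With this predicate, summing pageviews across all mobile sites reduces to adding up the view counts of the lines whose abbreviation satisfies is_mobile_site.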

Differences to the pagecounts-raw dataset

The format of this dataset is the same as that of the pagecounts-raw dataset. However, this dataset filters out traffic that is very likely spider or bot traffic, and it also includes traffic to the mobile site and the zero site. For example, the line

 de.m.voy Berlin 176 314159

is for the mobile site page de.m.wikivoyage.org/wiki/Berlin, and the line

 ms.zero Cinta_Elysa 4 32944

is for the zero site page ms.zero.wikipedia.org/wiki/Cinta_Elysa.
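Lines like the ones above can be split into their four space-separated fields. The sketch below assumes the field order used by pagecounts-raw (domain abbreviation, page title, view count, total response size). Only domain_code is named in this document; the other field names, the PageviewLine type and the parse_line helper are illustrative assumptions.

```python
from typing import NamedTuple


class PageviewLine(NamedTuple):
    domain_code: str          # abbreviated domain, e.g. "de.m.voy"
    page_title: str           # title with underscores, e.g. "Cinta_Elysa"
    count_views: int          # assumed name for the third field
    total_response_size: int  # assumed name for the fourth field


def parse_line(line):
    """Parse one line; raises ValueError unless it has exactly four fields."""
    domain_code, page_title, views, size = line.split()
    return PageviewLine(domain_code, page_title, int(views), int(size))
```

For example, parse_line("de.m.voy Berlin 176 314159").count_views evaluates to 176.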

Requests to the mobile site and requests from mobile devices or apps

“Mobile site” refers to the mobile site itself (URLs having .m. before the wikipedia.org, … in the URL), not to device identification. Note, however, that mobile phones and tablets are redirected to the mobile sites by default.

Also, traffic from mobile apps is not singled out.

Requests to the zero site and Wikipedia Zero requests

Wikipedia Zero requests can (depending on the setup for the Wikipedia Zero partner) hit either

  • mobile site (having “.m.” in the unabbreviated domain name), or
  • zero site (having “.zero.” in the unabbreviated domain name).

Hence, aggregating all lines that have “.zero” in the domain abbreviation, like

 ms.zero Cinta_Elysa 4 32944

does not yield the total volume of Wikipedia Zero traffic; it only gives the total volume of traffic to the zero site. The bigger part of Wikipedia Zero traffic goes to the mobile site. However, the mobile site sees both Wikipedia Zero and non-Wikipedia Zero traffic, so there is no way to compute the “total volume of Wikipedia Zero traffic” from this dataset.
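As a rough illustration of what such an aggregation does measure, the sketch below sums the view counts of all lines whose abbreviation has a "zero" component. The function name zero_site_views is hypothetical, and the four-field layout with the view count in third position is assumed from the sample lines. The result is traffic to the zero site only, not Wikipedia Zero traffic as a whole.

```python
def zero_site_views(lines):
    """Sum the view counts of lines that belong to the zero site."""
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        domain_code, _title, views, _size = fields
        # Match "zero" as a whole component after the first part,
        # e.g. "ms.zero", rather than as a substring.
        if "zero" in domain_code.split(".")[1:]:
            total += int(views)
    return total
```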

Availability

dumps.wikimedia.org

The stream is available as hourly files at http://dumps.wikimedia.org/other/pageviews/.

To maintain compatibility with pagecounts-raw, the date in the file name refers to the end of the capturing period, not the beginning.
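Because the timestamp marks the end of the hour, the covered period has to be derived by subtracting one hour. The sketch below assumes file names of the form pageviews-YYYYMMDD-HHMMSS.gz; that pattern and the capture_period helper are assumptions for illustration and should be checked against the actual directory listing.

```python
import re
from datetime import datetime, timedelta

# Assumed file name pattern, e.g. "pageviews-20150501-010000.gz".
_NAME = re.compile(r"pageviews-(\d{8}-\d{6})")


def capture_period(file_name):
    """Return (start, end) of the hour covered by an hourly file."""
    match = _NAME.search(file_name)
    if match is None:
        raise ValueError("unexpected file name: %r" % file_name)
    # The timestamp in the name is the END of the capturing period.
    end = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
    return end - timedelta(hours=1), end
```

Under this assumption, a file stamped 20150502-000000 holds the last hour of 2015-05-01, not the first hour of 2015-05-02.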

stat1002.eqiad.wmnet

The stream is available as hourly files at /mnt/hdfs/wmf/data/archive/pageviews on stat1002.

To maintain compatibility with pagecounts-raw, the date in the file name refers to the end of the capturing period, not the beginning.

Analytics cluster

The stream is available as hourly files at /wmf/data/archive/pageviews in the Analytics cluster.

To maintain compatibility with pagecounts-raw, the date in the file name refers to the end of the capturing period, not the beginning.

Events and known problems since 2015-05-01

You can follow the feed for these incident updates.

Date from | Date until | Bug | Details
2016-07-20 | today | task TT141506 | An update to Windows caused Chrome 41 user agents to appear to be erroneously requesting the Main page of a few wikis. This caused a large spike in overall traffic.

Idiosyncrasies

Capitalization is split up

Some requests look as though they were made to EN.WIKIPEDIA.ORG... We normalize these requests so that they look like they hit en.wikipedia.org.
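The effect of this normalization can be pictured as merging counts under the lower-cased host name. The following is only a sketch of the idea, not the production pipeline; merge_by_normalized_host is a hypothetical helper operating on a plain mapping from host name to view count.

```python
from collections import Counter


def merge_by_normalized_host(counts):
    """Merge view counts whose host names differ only in capitalization."""
    merged = Counter()
    for host, views in counts.items():
        # Host names are case-insensitive, so lower-casing is safe here.
        merged[host.lower()] += views
    return dict(merged)
```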

Note