Analytics/Pageviews

From Wikitech
This page contains historical information. It may be outdated or unreliable.
See meta:Research:Page view for current information about the definition and datasets being used from 2015 on.
See Analytics/Data Lake/Traffic/Pageviews for technical background information.

Summary

In 2014/15, the Analytics and Research Teams at Wikimedia developed a new and more comprehensive definition and algorithm to count pageviews. "Pageviews" or "Current Pageviews" refers to a tally using the new algorithm. As of 2015, many dashboards and reports continue to use the legacy definition of pageviews; those counts should be referred to as "Legacy Pageviews".

[Note: In the comparison below, "legacy pageviews" refers to pageview numbers aggregated at the project level. Page-level pageview numbers were available based on unsampled webrequest logs even in the pre-2015 version, see e.g. pagecounts-raw.]

Legacy Pageviews
  • Data source: sampled web-request logs
  • Cons:
    • the data source is sampled
    • the definition is several years old
    • excludes the apps
  • Specification: Here

Pageviews (Current Pageviews)
  • Data source: un-sampled web-request logs
  • Pros:
    • the data source is un-sampled
    • better detection and exclusion of automated traffic (spiders, web crawlers, bots, ...)
  • Specification: Here

Examples: WMF Quarterly Report

Uses: Eventually some dashboards will be deprecated or migrated to use the current pageview definition.

Details

  • The Research team has developed a new definition of Pageviews: https://meta.wikimedia.org/wiki/Research:Page_view
  • The new definition is better because:
    • it is based on all the web-request logs (not a 1:1000 sampling of them);
    • it can detect and flag more automated traffic (which can be a significant fraction of total traffic on some wikis);
  • The new definition will evolve over time as new ways of viewing content are developed (e.g. a mobile app using APIs)
  • Counts under the new definition are generated on the analytics cluster using Hadoop and the web-request logs
  • Current pageview counts have been tallied since May 1, 2015
  • Where appropriate, we will gradually migrate uses of legacy pageviews to current pageviews; where that is not done, we will point out that legacy pageviews are being used.
  • The previous definition is documented here: https://phabricator.wikimedia.org/diffusion/ANME/browse/master/pageviews/webstatscollector/pageview_definition.png
  • The previous definition relied on a tool named webstatscollector: Analytics/Webstatscollector
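
The two advantages listed above (counting every request instead of a 1:1000 sample, and flagging automated traffic) can be illustrated with a minimal sketch. This is not the actual pageview algorithm: the function names, the request dictionaries, and the crude user-agent pattern are all hypothetical and chosen only for illustration, and taking every 1000th request is an assumed stand-in for however the legacy logs were sampled.

```python
import re

SAMPLE_RATE = 1000  # the legacy source kept roughly 1 request in 1000 (per the text above)

# Hypothetical, deliberately crude pattern; the real definition is far more detailed.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def take_sample(requests, sample_rate=SAMPLE_RATE):
    """Keep every sample_rate-th request (a deterministic stand-in for sampling)."""
    return requests[::sample_rate]


def estimate_from_sample(sampled_requests, sample_rate=SAMPLE_RATE):
    """Legacy-style estimate: scale the number of sampled requests back up."""
    return len(sampled_requests) * sample_rate


def count_unsampled(requests):
    """Current-style count: look at every request and skip automated-looking ones."""
    return sum(1 for r in requests if not BOT_PATTERN.search(r["user_agent"]))
```

For a toy log of 2500 browser requests followed by 500 requests from a user agent such as "Googlebot/2.1", the sampled estimate is 3000 (bots included), while the unsampled count is 2500. For a small wiki with only 400 requests, the sample keeps a single request and the estimate jumps to 1000, which shows how coarse a 1:1000 sample is at low traffic volumes.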

Comparing current and legacy pageviews

Legacy pageview counts can be, and often are, larger than current pageview counts because they include more automated traffic. The current definition makes a better effort at counting traffic from real persons and excluding automata. Please take this into account when plotting year-over-year changes in traffic. For example, http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm shows a discontinuity in traffic in May 2015 because that is when reporting switched to the current pageview definition.
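
One way to guard against this when computing year-over-year changes is to mark comparisons that straddle the switch. The sketch below is an assumption-laden illustration, not an existing tool: the helper names are hypothetical, the counts are expected in a dictionary keyed by the first day of each month, and the cutoff date is taken from the May 1, 2015 start of current pageview counts mentioned above.

```python
from datetime import date

# Current pageview counts start on this date (per the Details section).
DEFINITION_CHANGE = date(2015, 5, 1)


def definition_for(month):
    """Return which definition a monthly count is assumed to use."""
    return "current" if month >= DEFINITION_CHANGE else "legacy"


def year_over_year(counts, month):
    """Return (ratio, comparable) for `month` versus the same month a year earlier.

    `counts` maps the first day of each month to a pageview count.
    `comparable` is False when the two months use different definitions.
    """
    previous = month.replace(year=month.year - 1)
    ratio = counts[month] / counts[previous]
    comparable = definition_for(month) == definition_for(previous)
    return ratio, comparable
```

With made-up counts for June 2014, June 2015, and June 2016, the 2015 versus 2014 comparison comes back flagged as not comparable, because one side is a legacy count and the other a current count, while the 2016 versus 2015 comparison is flagged as comparable.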