Analytics/Pageviews

Summary

In 2014/15, the Analytics and Research teams at Wikimedia developed a new, more comprehensive definition and algorithm for counting pageviews. "Pageviews" or "Current Pageviews" refers to a tally made with the new algorithm. As of 2015, many dashboards and reports continue to use the legacy definition of pageviews; those counts should be referred to as "Legacy Pageviews".

Legacy Pageviews compared with Pageviews (Current Pageviews)

Data Source
  • Legacy Pageviews: sampled web-request logs
  • Pageviews (Current Pageviews): un-sampled web-request logs
Cons (Legacy Pageviews)
  • the data source is sampled
  • the definition is several years old
  • excludes the apps
Pros (Pageviews)
  • the data source is un-sampled
  • better detection and exclusion of spiders
Examples
  • WMF Quarterly Report
Specification
  • Legacy Pageviews: Here
  • Pageviews (Current Pageviews): Here
Uses
  • Eventually some dashboards will be deprecated or migrated to use the current pageview definition.

Details

  • The Research team has developed a new definition of Pageviews: https://meta.wikimedia.org/wiki/Research:Page_view
  • The new definition is better because:
    • it is based on all the web-request logs (not a 1:1000 sampling of them);
    • it can detect and flag more spiders (which can be a significant fraction of total traffic on some wikis).
  • The new definition will evolve over time as new ways of viewing content are developed (e.g. a mobile app using APIs)
  • Counts under the new definition are generated on the analytics cluster using Hadoop and web-request logs
  • Current pageview counts have been tallied since May 1, 2015
  • Where appropriate, we will gradually migrate uses of legacy pageviews to current pageviews, or point out where legacy pageviews are still used.
  • The previous definition is documented here: https://phabricator.wikimedia.org/diffusion/ANME/browse/master/pageviews/webstatscollector/pageview_definition.png
  • The previous definition relied on a tool named webstatscollector: Analytics/Webstatscollector
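
As a rough illustration of the two differences listed above (sampling and spider flagging), the following Python sketch tallies a handful of made-up log records both ways. The record fields, the SPIDER_PATTERN regex, and the function names are assumptions made for illustration only. The actual definition is the one documented on the Research:Page_view page, and the production tally runs on Hadoop rather than in a script like this.

```python
import re
from collections import Counter

# Hypothetical, deliberately simplistic spider check (an assumption for this
# sketch); the real definition is documented on Research:Page_view.
SPIDER_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

# The legacy data source is a 1:1000 sample of the web-request logs.
SAMPLING_RATE = 1000


def legacy_estimate(sampled_records):
    """Estimate per-page views from sampled logs by scaling each sampled hit."""
    counts = Counter()
    for rec in sampled_records:
        counts[rec["page"]] += SAMPLING_RATE
    return counts


def current_tally(all_records):
    """Count every request exactly, skipping requests flagged as spiders."""
    counts = Counter()
    for rec in all_records:
        if SPIDER_PATTERN.search(rec["user_agent"]):
            continue  # flagged as automated traffic, not counted
        counts[rec["page"]] += 1
    return counts


# Made-up records, for illustration only.
records = [
    {"page": "Main_Page", "user_agent": "Mozilla/5.0"},
    {"page": "Main_Page", "user_agent": "Googlebot/2.1"},
    {"page": "Cat", "user_agent": "Mozilla/5.0"},
]
sample = records[:1]  # pretend this is the one record that survived sampling

print(legacy_estimate(sample))
print(current_tally(records))
```

The scaled estimate can only move in steps of 1000 and sees nothing for pages that happen to miss the sample, which is one reason an un-sampled source gives better figures for low-traffic pages.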

Comparing current and legacy pageviews

Legacy pageview counts can be larger than current pageview counts because they include more traffic from spiders. The current definition makes a better effort at counting traffic from real people and excluding automata. Please take this into account when plotting year-over-year changes in traffic. For example, http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm shows a discontinuity in traffic in May 2015 because that is when reporting switched to the current pageview definition.
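
One way to keep the two definitions apart in a year-over-year calculation is to refuse to compare months that fall on opposite sides of the May 2015 change. The Python sketch below assumes a monthly series keyed by the first day of each month. The function names and the sample figures are hypothetical and are not taken from any Wikimedia tool or report.

```python
from datetime import date

# First month reported with the current pageview definition, per the text above.
CUTOVER = date(2015, 5, 1)


def definition_for(month):
    """Return which pageview definition a monthly figure was reported under."""
    return "current" if month >= CUTOVER else "legacy"


def year_over_year(series, month):
    """Relative change versus the same month one year earlier.

    Returns None when either month is missing or when the two months were
    reported under different definitions and are therefore not comparable.
    """
    previous = month.replace(year=month.year - 1)
    if month not in series or previous not in series:
        return None
    if definition_for(month) != definition_for(previous):
        return None
    return (series[month] - series[previous]) / series[previous]


# Made-up monthly totals, for illustration only.
series = {
    date(2014, 4, 1): 100,
    date(2015, 4, 1): 110,
    date(2014, 6, 1): 100,
    date(2015, 6, 1): 80,
}

print(year_over_year(series, date(2015, 4, 1)))  # both months are legacy
print(year_over_year(series, date(2015, 6, 1)))  # straddles the change
```

Under this approach, comparisons that straddle May 2015 are left blank rather than shown as an apparent drop in traffic.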