User:Milimetric/Notebook/Pageview Hourly

Overview

Dataset Fact Sheet
Update Frequency: Hourly, with a lag of about 2 hours
Trusted Dataset: Yes

Description

A page view is a request for the content of a web page. Page views on Wikimedia projects are our most important content consumption metric.

The Wikimedia Foundation has defined what a Pageview means for the projects we host. The data is extracted from Webrequest and has been retained since May 2015.

Once pageview_hourly data is available, it is used to generate all other pageview-related datasets. The data is sanitized and pushed out to the public via dumps and the Pageview API. It is also processed internally and loaded for querying in various interfaces.

This data is transformed to a coarser granularity in order to better protect user privacy. The rest of this description focuses on that transformation. Additional context on how we handle privacy at the Wikimedia Foundation can be found in documents such as meta:Data_retention_guidelines.

Dimensions and Metrics

One way to think of this data is as a collection of buckets. The size of each bucket is measured by view_count, the only metric here, and the bucket itself is defined by the values of all the dimensions we track. For example:

"In a specific hour, on ro.wikipedia, article X in namespace 0 was viewed N times by probably human users from Spain, through the desktop website, using an iPad"

That would be one row in this dataset.
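
To make this concrete, here is a minimal Python sketch of such a row. The field names are simplified for illustration and are not the exact column names; see the Schema tab for those.

  # A hypothetical, simplified representation of one pageview_hourly row.
  # The dimension values together identify a bucket, and view_count
  # measures the size of that bucket. Field names are illustrative only.
  row = {
      # dimensions
      "project": "ro.wikipedia",
      "page_title": "X",
      "namespace_id": 0,
      "agent_type": "user",          # "probably human"
      "country": "Spain",
      "access_method": "desktop",
      "device_family": "iPad",
      "year": 2015, "month": 5, "day": 1, "hour": 7,
      # the only metric
      "view_count": 42,              # N views in that hour
  }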

Data Transformation Process

As raw webrequest data is transformed into pageview_hourly records, the following types of transformations reduce entropy for the purpose of long-term privacy-protecting storage:

  • Extracting information
  • Annotating data
  • Aggregating

For example, the user_agent_map field is a mapping of properties like "device family" and "browser family" that can be extracted with reasonable certainty from the User Agent string found in a webrequest record.
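
As a rough illustration of this kind of extraction, the sketch below pulls two coarse properties out of a User Agent string with a few toy regular expressions. The real pipeline relies on a full user-agent parsing library with far more rules; the patterns and function here are hypothetical.

  import re

  # Toy patterns for illustration only. Order matters: Chrome User Agent
  # strings also contain "Safari", so Chrome must be checked first.
  BROWSER_PATTERNS = [
      ("Chrome", re.compile(r"Chrome/\d+")),
      ("Firefox", re.compile(r"Firefox/\d+")),
      ("Safari", re.compile(r"Safari/\d+")),
  ]
  DEVICE_PATTERNS = [
      ("iPad", re.compile(r"\biPad\b")),
      ("iPhone", re.compile(r"\biPhone\b")),
  ]

  def user_agent_map(ua: str) -> dict:
      """Map a raw User Agent string to a few coarse properties."""
      browser = next((n for n, p in BROWSER_PATTERNS if p.search(ua)), "Other")
      device = next((n for n, p in DEVICE_PATTERNS if p.search(ua)), "Other")
      return {"browser_family": browser, "device_family": device}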

An example of annotating data is the "automata" agent_type. We use heuristics to determine when a specific user agent is acting less like a human and more like an automaton. Well-behaved automated agents will generally be detected as "spider" by well-established regular expressions. Less well-behaved agents need different approaches (also documented here, and you can find the code here).
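
Here is a minimal sketch of the self-identification check, assuming a toy pattern; the production classifier uses a much larger set of regular expressions, plus behavioral heuristics for agents that do not self-identify.

  import re

  # Illustrative pattern only; the real expressions are far more extensive.
  SPIDER_PATTERN = re.compile(r"bot|crawler|spider|https?://", re.IGNORECASE)

  def classify_agent(ua: str) -> str:
      """Classify a request as 'spider' or 'user' from its User Agent string.
      Detecting less well-behaved automata requires behavioral heuristics
      that are not shown here."""
      return "spider" if SPIDER_PATTERN.search(ua) else "user"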

Once we simplify dimensions and reduce entropy, we can aggregate view_counts. This removes detail like which specific IP accessed which article, and keeps buckets like "X users from Spain accessed a specific article".
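
Conceptually this is a group-by over the simplified dimensions with a sum of counts, as in the Python sketch below. The real job runs at scale on the Hadoop cluster; the field names here are illustrative.

  from collections import Counter

  def aggregate(requests):
      """Count views per bucket of simplified dimensions.
      The IP address is not part of the bucket key, so it does not
      survive into the aggregated output."""
      buckets = Counter()
      for r in requests:
          key = (r["project"], r["page_title"], r["country"],
                 r["access_method"], r["agent_type"])
          buckets[key] += 1
      return buckets

  requests = [
      {"project": "ro.wikipedia", "page_title": "X", "country": "Spain",
       "access_method": "desktop", "agent_type": "user", "ip": "198.51.100.1"},
      {"project": "ro.wikipedia", "page_title": "X", "country": "Spain",
       "access_method": "desktop", "agent_type": "user", "ip": "203.0.113.7"},
  ]
  # Two requests from different IPs collapse into one bucket of size 2:
  # {("ro.wikipedia", "X", "Spain", "desktop", "user"): 2}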

Each field has a detailed explanation of any transformations that apply to it; see the Schema tab.

Examples
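
As one hypothetical example, the PySpark sketch below queries the pageview_hourly table for the top viewed pages in a single hour. The table and column names are assumptions based on the description above; consult the Schema tab for the authoritative schema.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical query: top pages on ro.wikipedia viewed by probably-human
  # users in one hour. Table and column names are assumptions.
  top_pages = spark.sql("""
      SELECT page_title, SUM(view_count) AS views
      FROM wmf.pageview_hourly
      WHERE project = 'ro.wikipedia'
        AND agent_type = 'user'
        AND year = 2024 AND month = 1 AND day = 15 AND hour = 7
      GROUP BY page_title
      ORDER BY views DESC
      LIMIT 10
  """)
  top_pages.show()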

See Also