Data Platform/Data Lake/Traffic

From Wikitech

Traffic refers to pageviews on the pages of a wiki project. This page links to detailed information about the traffic datasets in the Data Lake.

Most of the datasets below are updated at hourly granularity: a new hour of data arrives every hour, with a delay of 2 to 3 hours (the time needed for the hour to finish and for the data to be computed).
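
As a rough illustration of the delay described above, the following Python sketch estimates the most recent hour that should already be loaded. It assumes the worst-case delay of 3 hours and assumes that the delay is counted from the start of the hour (the text says it includes waiting for the hour to finish). The function name `latest_safe_hour` is hypothetical and not part of any Data Lake tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: use the upper bound of the 2-3 hour delay quoted above.
MAX_DELAY = timedelta(hours=3)


def latest_safe_hour(now: datetime, delay: timedelta = MAX_DELAY) -> datetime:
    """Return the start of the most recent hour expected to be available.

    Assumption: data for the hour starting at H is ready at H + delay,
    so we need the latest hour boundary H with H + delay <= now.
    """
    shifted = now - delay
    # Truncate to the hour boundary.
    return shifted.replace(minute=0, second=0, microsecond=0)


if __name__ == "__main__":
    current = datetime.now(timezone.utc)
    print("Latest hour that should be loaded:", latest_safe_hour(current))
```

Using the upper bound keeps the estimate conservative; a job that can tolerate occasionally missing data could pass `timedelta(hours=2)` instead.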

Datasets

Hive tables

These datasets are available as Hive tables and can be queried using one of the available SQL engines, or accessed directly through HDFS.
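
To give an idea of what querying one of these tables looks like, here is a minimal Python sketch that builds a SQL query string for a single hour of wmf.pageview_hourly (listed below). It relies on assumptions not stated on this page: that the table is partitioned by `year`, `month`, `day` and `hour`, and that it has `project`, `page_title`, `agent_type` and `view_count` columns. Check the table's own documentation before relying on them. The helper name `pageview_hourly_query` is hypothetical, and the sketch only builds the query text; running it is left to whichever SQL engine you use.

```python
from datetime import datetime


def pageview_hourly_query(hour_start: datetime, project: str, limit: int = 10) -> str:
    """Build a query for the most viewed pages of one project in one hour.

    Assumption: partition columns (year, month, day, hour) and the other
    column names used here match the actual table schema.
    """
    # The value is interpolated into SQL text, so only accept simple
    # project names such as "en.wikipedia".
    stripped = project.replace(".", "").replace("-", "").replace("_", "")
    if not stripped.isalnum():
        raise ValueError(f"unexpected project name: {project!r}")

    return (
        "SELECT page_title, SUM(view_count) AS views\n"
        "FROM wmf.pageview_hourly\n"
        # Filtering on the partition columns limits the scan to one hour.
        f"WHERE year = {hour_start.year} AND month = {hour_start.month}\n"
        f"  AND day = {hour_start.day} AND hour = {hour_start.hour}\n"
        f"  AND project = '{project}'\n"
        "  AND agent_type = 'user'\n"
        "GROUP BY page_title\n"
        "ORDER BY views DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(pageview_hourly_query(datetime(2024, 1, 1, 9), "en.wikipedia"))
```

Restricting the query to specific partitions matters because these tables are large; a query without partition filters would scan the whole dataset.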

Dataset Name | Description
webrequest hive table | The webrequest stream contains data on all the hits to Wikimedia's servers. This includes requests for page HTML, images, CSS, and JavaScript, as well as requests to the API. See also the separate list of Hive tables derived from webrequest.
pageview_actor hive table | The wmf.pageview_actor table is a smaller version of the webrequest table with fewer columns.
pageview_hourly hive table | The wmf.pageview_hourly table contains 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions.
projectview_hourly hive table | The wmf.projectview_hourly table contains 'pre-aggregated' webrequest data at the project level. It differs from the wmf.pageview_hourly dataset in that it involves fewer dimensions and is therefore smaller in data size (and faster to query).
unique devices | Reports how many distinct devices visit our projects.
browser general | Provides pageview statistics broken down by user-agent-related dimensions such as OS family, OS major version, browser family, and browser major version.
mediawiki_api_request | The mediawiki_api_request table provides the log of API requests to MediaWiki.
mobile apps session metrics | Contains aggregate statistics about pageview sessions on the Android and iOS Wikipedia mobile apps.
mobile apps uniques | Counts how many distinct installs of the Android and iOS Wikipedia mobile apps accessed Wikimedia sites during a given day or month.
inter language | Traffic between different languages within the same project family.
virtualpageview_hourly | Provides data about page previews on desktop Wikipedia.
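
The idea of 'pre-aggregation' used by pageview_hourly and projectview_hourly can be sketched in a few lines of Python: keep only the requests flagged as pageviews, then count them per combination of dimensions. The records and field names below are hypothetical and heavily simplified; this is an illustration of the concept, not the actual pipeline.

```python
from collections import Counter

# Hypothetical, simplified request records; real webrequest rows have many more fields.
requests = [
    {"project": "en.wikipedia", "page_title": "Cat", "access_method": "desktop", "is_pageview": True},
    {"project": "en.wikipedia", "page_title": "Cat", "access_method": "desktop", "is_pageview": True},
    {"project": "en.wikipedia", "page_title": "Dog", "access_method": "desktop", "is_pageview": True},
    # An image or API hit, for example, is not a pageview and is filtered out.
    {"project": "en.wikipedia", "page_title": "Cat", "access_method": "desktop", "is_pageview": False},
    {"project": "de.wikipedia", "page_title": "Katze", "access_method": "desktop", "is_pageview": True},
]


def aggregate(rows, dimensions):
    """Count pageviews per distinct combination of the given dimensions."""
    counts = Counter()
    for row in rows:
        if not row["is_pageview"]:
            continue  # keep only pageviews
        key = tuple(row[d] for d in dimensions)
        counts[key] += 1
    return counts


# Page-level aggregation keeps the page title as a dimension.
page_level = aggregate(requests, ("project", "page_title", "access_method"))
# Project-level aggregation drops it, so fewer rows remain.
project_level = aggregate(requests, ("project", "access_method"))

if __name__ == "__main__":
    print(len(page_level), "page-level rows;", len(project_level), "project-level rows")
```

Dropping a dimension merges rows that differed only in that dimension, which is why the project-level result is smaller while the total view count stays the same.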

Dumps

These datasets are made available as files, updated at regular intervals.

Deprecated or Obsolete Datasets

The following datasets are no longer in use, but the pages are kept to document history:

Access

All data in the Data Lake is private by default. For access details, see Data_Platform/Data access. Some of the data above is public in other systems (see the Analytics main page).

History

Partial information about how the publishing of analytics data at WMF has evolved is recorded in a timeline.