Data Platform/Data Lake/Traffic/Virtualpageview hourly

From Wikitech

The wmf.virtualpageview_hourly table (available on Hive) contains aggregated data about "virtual pageviews", i.e. user actions of consuming content on Wikimedia sites that are not proper pageviews, but are similarly focused on the content of a particular wiki page.

As of mid-2018, the only kind of virtual pageview recorded in this table is the page preview of a Wikipedia article on desktop (limited to preview popups that remain visible for at least one second). The table contains valid data back to April 2018 and is viewable in Turnilo.

Internally, the table is derived from an auxiliary EventLogging table (Schema:VirtualPageView), where more detailed data is kept for 90 days. This is analogous to how wmf.pageview_hourly is generated as a "refinement" of the webrequest table. The format of this table also follows wmf.pageview_hourly as closely as possible (e.g. in the information about the page being previewed, the partitioning, and client information such as whether the client is assumed to be a bot), in order to facilitate joins and other comparative analysis.
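
As an illustration of the comparative analysis mentioned above, the following HiveQL sketch sets page previews next to desktop pageviews per project for a single day. It assumes that wmf.pageview_hourly exposes project, access_method, agent_type, view_count and the same year/month/day partitions; the date is an arbitrary example value, not a recommendation.

```sql
-- Sketch only: the date (2018-05-01) is an arbitrary example value.
-- Assumes wmf.pageview_hourly has project, access_method, agent_type,
-- view_count and year/month/day partition columns.
WITH preview_totals AS (
  SELECT project, SUM(view_count) AS preview_count
  FROM wmf.virtualpageview_hourly
  WHERE year = 2018 AND month = 5 AND day = 1
    AND agent_type = 'user'
  GROUP BY project
),
pageview_totals AS (
  SELECT project, SUM(view_count) AS pageview_count
  FROM wmf.pageview_hourly
  WHERE year = 2018 AND month = 5 AND day = 1
    AND agent_type = 'user'
    AND access_method = 'desktop'   -- previews are desktop only
  GROUP BY project
)
SELECT pv.project,
       pv.pageview_count,
       COALESCE(p.preview_count, 0) AS preview_count
FROM pageview_totals pv
LEFT JOIN preview_totals p
  ON pv.project = p.project
ORDER BY pv.pageview_count DESC
LIMIT 20;
```

Both tables are aggregated to one row per project before the join, so the join cannot multiply rows. The LEFT JOIN keeps projects that have pageviews but no recorded previews.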

Current Schema

> DESCRIBE wmf.virtualpageview_hourly;

col_name	data_type	comment
project	string	Project name from hostname
language_variant	string	Language variant from path (not set if present in project name)
page_title	string	Page title from popup preview (canonical)
access_method	string	Always desktop (virtualpageviews are a desktop only feature for now)
agent_type	string	Agent accessing the pages, can be spider or user
referer_class	string	Always internal (virtualpageviews are always shown in wiki pages)
continent	string	Continent of the accessing agents (maxmind GeoIP database)
country_code	string	Country iso code of the accessing agents (maxmind GeoIP database)
country	string	Country (text) of the accessing agents (maxmind GeoIP database)
subdivision	string	Subdivision of the accessing agents (maxmind GeoIP database)
city	string	City iso code of the accessing agents (maxmind GeoIP database)
user_agent_map	map<string,string>	User-agent map with device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version keys and associated values
record_version	string	Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/virtualpageview_hourly
view_count	bigint	Number of virtualpageviews of the corresponding bucket
page_id	bigint	Page ID from popup preview
namespace_id	int	Namespace ID from popup preview
source_page_title	string	Page title from source page (canonical)
source_page_id	bigint	Page ID from source page
source_namespace_id	int	Namespace ID from source page
year	int	Unpadded year
month	int	Unpadded month
day	int	Unpadded day
hour	int	Unpadded hour
	NULL	NULL
# Partition Information	NULL	NULL
# col_name            	data_type           	comment             
	NULL	NULL
year	int	Unpadded year
month	int	Unpadded month
day	int	Unpadded day
hour	int	Unpadded hour
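
The map-typed user_agent_map column can be read with Hive's bracket syntax. The following is a minimal sketch that counts previews per browser family for one day; the date is an arbitrary example value, and the browser_family key is taken from the column comment above.

```sql
-- Sketch only: the date is an arbitrary example value.
SELECT user_agent_map['browser_family'] AS browser_family,
       SUM(view_count) AS preview_count
FROM wmf.virtualpageview_hourly
WHERE year = 2018 AND month = 5 AND day = 1
  AND agent_type = 'user'
GROUP BY user_agent_map['browser_family']
ORDER BY preview_count DESC
LIMIT 10;
```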

As in Pageview hourly and other traffic tables, the year, month, day, and hour fields are Hive partitions.
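
Because these fields are partitions, restricting them in the WHERE clause lets Hive read only the matching hours instead of scanning the whole table. A minimal sketch follows; the date, the hour, and the project value 'en.wikipedia' are illustrative assumptions, as is the use of namespace_id = 0 for articles.

```sql
-- Sketch only: date, hour and project value are illustrative assumptions.
-- Partition values are unpadded integers (month = 5, not '05').
SELECT page_title,
       SUM(view_count) AS preview_count
FROM wmf.virtualpageview_hourly
WHERE year = 2018 AND month = 5 AND day = 1 AND hour = 12  -- partition predicates
  AND project = 'en.wikipedia'   -- assumed project naming
  AND agent_type = 'user'        -- exclude agents classified as spider
  AND namespace_id = 0           -- assumed: main (article) namespace
GROUP BY page_title
ORDER BY preview_count DESC
LIMIT 10;
```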

Changes and known problems since March 2018

Date from	Task	record_version	Details
2018-03-14			First test events recorded in the aggregate table
2018-04-01	Phab:T189906		Rollout of EL schema completed
2018-04-06	Phab:T190188		DNT fix
2018-04-11	Phab:T191966#4124181		Rollout of the page previews feature to all (IP) users on dewiki
2018-04-17	Phab:T191101#4135462		Rollout of the page previews feature to all (IP) users on enwiki
2018-07-12	Phab:T196904		Fix for a rare issue where no virtual pageviews were logged for certain source pages with very long names
2018-08-20	Phab:T197971		The dataset will no longer include spammy domains, like wikipedia0.com
2019-06-05	Phab:T190840		From now on, events coming from non-wikimedia hostnames (translation services, wiki clones etc.) are filtered out.

See also