Data Platform/Data Lake/Traffic/Virtualpageview hourly
The wmf.virtualpageview_hourly table (available on Hive) contains aggregated data about "virtual pageviews", i.e. user actions of consuming content on Wikimedia sites that are not proper pageviews, but are similarly focused on the content of a particular wiki page.
As of mid-2018, the only kind of virtual pageview recorded in this table is the page preview of a Wikipedia article on desktop (limited to preview popups that remain visible for at least one second). The table contains valid data back to April 2018, and is viewable in Turnilo.
Internally, it is based on an auxiliary EventLogging table (Schema:VirtualPageView) where more detailed data is kept for 90 days, analogous to how wmf.pageview_hourly is generated as a "refinement" of the webrequest table. The format of this table also follows wmf.pageview_hourly as closely as possible (e.g. regarding information about the page being previewed, partitioning, information about the client like whether it is assumed to be a bot), in order to facilitate joins and other comparative analysis.
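As an illustration of the kind of comparative analysis this shared layout allows, the following is a minimal sketch of a helper that builds a HiveQL query comparing desktop pageviews with page previews per project for one day. The function name `previews_vs_pageviews_query` is hypothetical, and the sketch assumes that wmf.pageview_hourly exposes the same `project`, `view_count`, `agent_type`, `access_method` and partition columns as this table.

```python
def previews_vs_pageviews_query(year: int, month: int, day: int) -> str:
    """Build HiveQL comparing desktop pageviews and previews per project.

    Assumption: wmf.pageview_hourly uses the same column names
    (project, view_count, agent_type, access_method, year, month, day)
    as wmf.virtualpageview_hourly.
    """
    # The same partition predicate is applied to both tables.
    part = f"year = {year} AND month = {month} AND day = {day}"
    return f"""
SELECT pv.project,
       pv.pageviews,
       COALESCE(vpv.previews, 0) AS previews
FROM (
    SELECT project, SUM(view_count) AS pageviews
    FROM wmf.pageview_hourly
    WHERE {part} AND agent_type = 'user' AND access_method = 'desktop'
    GROUP BY project
) pv
LEFT JOIN (
    SELECT project, SUM(view_count) AS previews
    FROM wmf.virtualpageview_hourly
    WHERE {part} AND agent_type = 'user'
    GROUP BY project
) vpv ON pv.project = vpv.project
""".strip()


print(previews_vs_pageviews_query(2018, 5, 1))
```

Both sides are aggregated before the join so that each project appears once, and the left join keeps projects that recorded pageviews but no previews.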
Current Schema
> DESCRIBE wmf.virtualpageview_hourly;

col_name | data_type | comment |
---|---|---|
project | string | Project name from hostname |
language_variant | string | Language variant from path (not set if present in project name) |
page_title | string | Page title from popup preview (canonical) |
access_method | string | Always desktop (virtualpageviews are a desktop only feature for now) |
agent_type | string | Agent accessing the pages, can be spider or user |
referer_class | string | Always internal (virtualpageviews are always shown in wiki pages) |
continent | string | Continent of the accessing agents (maxmind GeoIP database) |
country_code | string | Country iso code of the accessing agents (maxmind GeoIP database) |
country | string | Country (text) of the accessing agents (maxmind GeoIP database) |
subdivision | string | Subdivision of the accessing agents (maxmind GeoIP database) |
city | string | City iso code of the accessing agents (maxmind GeoIP database) |
user_agent_map | map<string,string> | User-agent map with device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version keys and associated values |
record_version | string | Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/virtualpageview_hourly |
view_count | bigint | Number of virtualpageviews of the corresponding bucket |
page_id | bigint | Page ID from popup preview |
namespace_id | int | Namespace ID from popup preview |
source_page_title | string | Page title from source page (canonical) |
source_page_id | bigint | Page ID from source page |
source_namespace_id | int | Namespace ID from source page |
year | int | Unpadded year |
month | int | Unpadded month |
day | int | Unpadded day |
hour | int | Unpadded hour |

Partition information:

col_name | data_type | comment |
---|---|---|
year | int | Unpadded year |
month | int | Unpadded month |
day | int | Unpadded day |
hour | int | Unpadded hour |
As in Pageview hourly and other traffic tables, the year, month, day, and hour fields are Hive partitions.
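Because the table is partitioned by these fields, queries should normally restrict them so that Hive reads only the relevant partitions. The following is a minimal sketch of how such a query could be assembled. The helper names `partition_predicate` and `top_previews_query` are hypothetical, and the project value `en.wikipedia` is an assumption about how project names are stored.

```python
from datetime import datetime


def partition_predicate(ts: datetime) -> str:
    """Build a WHERE fragment that pins a single hourly partition.

    The partition columns are ints, so values are emitted unpadded
    (month = 5, not month = 05).
    """
    return (
        f"year = {ts.year} AND month = {ts.month} "
        f"AND day = {ts.day} AND hour = {ts.hour}"
    )


def top_previews_query(project: str, ts: datetime, limit: int = 10) -> str:
    """Return HiveQL listing the most-previewed pages for one project and hour."""
    # Illustrative only: values are interpolated directly into the query,
    # so pass trusted input.
    return (
        "SELECT page_title, SUM(view_count) AS previews\n"
        "FROM wmf.virtualpageview_hourly\n"
        f"WHERE {partition_predicate(ts)}\n"
        f"  AND project = '{project}'\n"
        "  AND agent_type = 'user'\n"
        "GROUP BY page_title\n"
        "ORDER BY previews DESC\n"
        f"LIMIT {limit}"
    )


print(top_previews_query("en.wikipedia", datetime(2018, 5, 1, 13)))
```

Filtering on `agent_type = 'user'` excludes traffic that the table marks as coming from spiders.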
Changes and known problems since March 2018
Date from | Task | record_version | Details |
---|---|---|---|
2018-03-14 | | | First test events recorded in the aggregate table |
2018-04-01 | Phab:T189906 | | Rollout of EL schema completed |
2018-04-06 | Phab:T190188 | | DNT fix |
2018-04-11 | Phab:T191966#4124181 | | Rollout of the page previews feature to all (IP) users on dewiki |
2018-04-17 | Phab:T191101#4135462 | | Rollout of the page previews feature to all (IP) users on enwiki |
2018-07-12 | Phab:T196904 | | Fix for a rare issue where no virtual pageviews were logged for certain source pages with very long names |
2018-08-20 | Phab:T197971 | | The dataset will no longer include spammy domains, like wikipedia0.com |
2019-06-05 | Phab:T190840 | | From now on, events coming from non-wikimedia hostnames (translation services, wiki clones etc.) are filtered out. |
See also
- Phab:T186728 "Record and aggregate page previews" (2018 task about the creation of this table)
- The code that generates the table:
- Phab:T193524 "Publish data on seen page previews" (task about making some of the data public)