Data Platform/Data Lake/Traffic/Pageview hourly

The wmf.pageview_hourly table (available on Hive) contains 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions. It is stored in the Parquet columnar file format and partitioned by (year, month, day, hour). The data goes back to May 1, 2015.

As of 2018, this dataset is not public for privacy reasons, but less granular pageview data derived from it is made public in various forms, see m:Research:Page view.

Notes:

For performance optimization across jobs, the data is actually aggregated from the pageview_actor intermediate dataset, itself being a filtered view of webrequest.
The Projectview hourly table is a much smaller table containing the same data without page_title and page_id (and some other fields like user_agent_map), meaning it is aggregated at the project level.
Projectview data (without page-title and page-id) is also available in Druid fast-querying engine. Many of the classical project-level queries should be answered with either Turnilo or Superset on pageview_daily or pageview_hourly datasources (LDAP access required).

Current Schema

$ hive --database wmf

hive (wmf)> describe pageview_hourly;
OK
col_name    data_type   comment
project                 string                  Project name from requests hostname
language_variant        string                  Language variant from requests path (not set if present in project name)
page_title              string                  Page Title from requests path and query
access_method           string                  Method used to access the pages, can be desktop, mobile web, or mobile app
zero_carrier            string                  Zero carrier if pageviews are accessed through one, null otherwise
agent_type              string                  Agent accessing the pages, can be spider,  user or automated
referer_class           string                  Can be none (null, empty or '-'), unknown (domain extraction failed), internal (domain is a wikimedia project), external (search engine) (domain is one of google, yahoo, bing, yandex, baidu, duckduckgo), external (any other)
continent               string                  Continent of the accessing agents (computed using maxmind GeoIP database)
country_code            string                  Country iso code of the accessing agents (computed using maxmind GeoIP database)
country                 string                  Country (text) of the accessing agents (computed using maxmind GeoIP database)
subdivision             string                  Subdivision of the accessing agents (computed using maxmind GeoIP database)
city                    string                  City iso code of the accessing agents (computed using maxmind GeoIP database)
user_agent_map      	map<string,string>  	User-agent map with device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version keys and associated values
record_version          string                  Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly
view_count              bigint                  number of pageviews
page_id             	int                 	MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.
namespace_id            int                     MediaWiki namespace_id for this page title. This may not always be set, even if the page is actually a pageview.
year                    int                     Unpadded year of pageviews
month                   int                     Unpadded month of pageviews
day                     int                     Unpadded day of pageviews
hour                    int                     Unpadded hour of pageviews
# Partition Information
# col_name              data_type               comment
year                    int                     Unpadded year of pageviews
month                   int                     Unpadded month of pageviews
day                     int                     Unpadded day of pageviews
hour                    int                     Unpadded hour of pageviews

Notice the year, month, day, and hour fields. These are Hive partitions, and are explicit mappings to hourly aggregations in HDFS. You must include at least one partition predicate in the where clause of your queries (even if it is just year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns you results. For example, if are only interested in data during a particular day, you could add where year = 2014 and month = 1 and day = 12. This will instruct Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.

The dataset contains both a page_title and page_id fields.

page_title is extracted from the requested URL (either from path or query). We expect it to be present and correct on most cases. The special value used when the title is not extracted is -.

page_id is received in the X-Analytics header. As of 2017-06-12, page_id is populated on access methods desktop and mobile web requests, but not mobile app. This means that >95% of pageview requests have a page_id so far. In case of redirects, the page_id we received is the one of the redirected-to page. This means that, for instance, on the English Wikipedia the same page_id 534366 is associated with the different page_titles Barack_Obama (original content page), Barack_obama (redirect to main content page), Barack_Hussein_Obama (another redirect), Barack_H._Obama (again another redirect) ...

Sample queries

SELECT
  concat(month,'/',day,'/',year), access_method, sum(view_count)
FROM 
  wmf.pageview_hourly
WHERE
  year = 2015
  AND month = 8
  AND agent_type = "user"
  AND country = "Canada"
  AND project = "en.wikipedia"
  AND page_title = "Justin_Trudeau"
GROUP BY
  year, month, day, access_method;

Changes and known problems since 2015-06-16

Date from	Task	record_version	Details
2022-01-26	task T300164		Data loss has been incurred from June 4th 2021 to January 26th 2022 due to a varnishkafka install problem on some of the caching hosts. Summary of the data loss analysis: Between 2021-06-04 and 2021-11-03 we have lost on average 2.80% of webrequest-text, leading to a similar abount of pageview loss. Between 2021-11-04 and 2022-01-27 we have lost on average 4.34% of webrequest-text, leading to a similar abount of pageview loss. For more information, see Analytics/Data Lake/Data Issues/2021-06-04 Traffic Data Loss.
2020-08-24	task T257860		Mobile pageview definition is corrected to count only pageviews with pageview=1 in the header, this correct issues with mobile pageviews apparent since late 2020 -05-18
2020-07-01	task T256514		Update definition to count new android mobile content consumption.
2020-06-25	task T256049		Aggregate from the intermediate pageview_actor table for performance.
2020-04-29	task T238357	0.0.7	Add the `automated` agent-type. This agent-type is applied to pageviews identified as being requested by bots not self-identified (self-identified bots are categorized as `spider`). See Analytics/Data_Lake/Traffic/BotDetection for more details.
2020-04-21	task T245468		Update title validation to enable pageview data for pages with title that include characters with Unicode character values above 0xFFFF like emoji.
2019-11-22	task T237743		Update of UA parser.
2019-09-18	task T212854		Update user-agent parser to up-to-date version. Main differences are `os_family` bug fixes (`Windows 98` is now `Windows` for instance), `os_major` bug fixes (`8.1`is now `8`), `Google Search App` on IOs is not falsely counted at `Safari` anymore, and various `device_family` improvement.
2019-07-22	task T226730		Special:Something pages removed from pageviews (other than Search pages), these pages should have been removed from day 1 but code was buggy.
2019-04-23	task T144100	0.0.6	We accidentally stopped computing pageviews for pages with a "plus" sign in the title. This was fixed by Task: T224451.
2019-04-23	Task: T144100	0.0.6	Invalid characters now removed from titles on pageview dumps
2019-04-01	task T219842		Kafka jumbo failover produces some data loss for hour 22 UTC.
2019-01-07	task T212862, task T153821		Wikitech added to PageviewDefinition.
2018-04	task T187014		From February 6 to April 16, 2018, the geolocation data for traffic from Opera browsers on mobile web was incorrect (wrongly labeled them as coming from the US, "Unknown" and some other countries rather than their true origin).
2018-03-20	task T154702		Fixed broken referer categorization for Safari (see above), resultingin an increase in pageviews classified as internally referred and decrease in pageviews recorded with "none" during April 2018
2017-06	task T163233		Addition of throttling to cdn layer will reduce traffic spikes coming from the same ips in short time periods
2017-02-09	task T156628, task T155141, task T157528	0.0.5	Update pageview definition to remove previews (POST with "action=submit" in query is now excluded from pageviews). Add DSXS (self-identified bot) to bot regex. Add namespace_id field (not always present though).
From 2016-02	task T148780		Meta referrer tag value not supported by Safari, Safari sessions would appear to be shorter
2016-10	task T145922		Fix to mediawiki that changed requests that were (wrongly) returning 200 to (lawfully) return 404, those would not be counted as pageviews. We removed about 6 million pageviews monthly
2016-08-05	task T141506		Spikes on pageviews for Main_Page in several projects, due to disproportionate number of requests of users in Chrome 41 on Windows. Maybe TLS bug on Windows security update.
2016-02-22	task T148780		Since this date, browsers not implementing the referrer meta tag correctly fail to populate the referrer header when clicking on wiki internal links.
2015-12-01	task T116023	0.0.4	Add the mediawiki page_id for this title when available. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.
2015-08-31	task T110614	0.0.3	Backfilled data from the beginning of pageview_hourly (May 1st, 2015) to correct bugs: in page_title, spaces were not always transformed into underscores wrong hosts were included and are no more (outreach.wikimedia.org, donate.wikipedia.org, arbcom-*.wikipedia.org) We took advantage of this backfill to reorder the user_agent_map fields in a more coherent place.
2015-06-01^[1]	task T107436	0.0.2	Add parsed user agent data (user_agent_map field) to aggregated pageviews to prepare wikistat-2.0 request.
2015-05-01^[2]	task T99931	0.0.1	Create table with pageviews aggregated from 'text' and 'mobile' refined webrequest sources and backfill aggregation from beginning of May.

Notes

↑ Action taken on the 31st of July, but since data is available from beginning of June, the date of this line is set so
↑ Action taken on the 7th of June, but since data is available from beginning of May, the date of this line is set so

[1] Action taken on the 31st of July, but since data is available from beginning of June, the date of this line is set so

[2] Action taken on the 7th of June, but since data is available from beginning of May, the date of this line is set so

[1]

[2]

Current Schema

Sample queries

Changes and known problems since 2015-06-16

See also

Notes