The wmf.pageview_hourly table (available on Hive) contains 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions. It is stored in the Parquet columnar file format and partitioned by (year, month, day, hour). The data goes back to May 1, 2015.
In 2015, a project about sanitizing this dataset was launched, see this page.
$ hive --database wmf
hive (wmf)> describe pageview_hourly;
OK
col_name          data_type           comment
project           string              Project name from requests hostname
language_variant  string              Language variant from requests path (not set if present in project name)
page_title        string              Page Title from requests path and query
access_method     string              Method used to access the pages, can be desktop, mobile web, or mobile app
zero_carrier      string              Zero carrier if pageviews are accessed through one, null otherwise
agent_type        string              Agent accessing the pages, can be spider or user
referer_class     string              Can be none (null, empty or '-'), unknown (domain extraction failed), internal (domain is a wikimedia project), external (search engine) (domain is one of google, yahoo, bing, yandex, baidu, duckduckgo), external (any other)
continent         string              Continent of the accessing agents (computed using maxmind GeoIP database)
country_code      string              Country iso code of the accessing agents (computed using maxmind GeoIP database)
country           string              Country (text) of the accessing agents (computed using maxmind GeoIP database)
subdivision       string              Subdivision of the accessing agents (computed using maxmind GeoIP database)
city              string              City iso code of the accessing agents (computed using maxmind GeoIP database)
user_agent_map    map<string,string>  User-agent map with device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version keys and associated values
record_version    string              Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly
view_count        bigint              number of pageviews
page_id           int                 MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.
namespace_id      int                 MediaWiki namespace_id for this page title. This may not always be set, even if the page is actually a pageview.
year              int                 Unpadded year of pageviews
month             int                 Unpadded month of pageviews
day               int                 Unpadded day of pageviews
hour              int                 Unpadded hour of pageviews

# Partition Information
# col_name        data_type           comment
year              int                 Unpadded year of pageviews
month             int                 Unpadded month of pageviews
day               int                 Unpadded day of pageviews
hour              int                 Unpadded hour of pageviews
Notice the year, month, day, and hour fields. These are Hive partitions, and are explicit mappings to hourly aggregations in HDFS. You must include at least one partition predicate in the where clause of your queries (even if it is just year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns results. For example, if you are only interested in data for a particular day, you could add where year = 2014 and month = 1 and day = 12. This instructs Hive to process only the data in partitions that match that predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
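As a rough illustration of the partition predicate described above, the following Python sketch builds a HiveQL query string for a single day. The helper names (partition_predicate, daily_views_query), the project value en.wikipedia, and the choice of a top-10 query are assumptions made for this example; only the table and column names come from the schema above. The sketch only builds the query text and does not run it.

```python
from datetime import date


def partition_predicate(day: date) -> str:
    """Return a HiveQL predicate selecting the partitions of one calendar day."""
    # Partition columns are unpadded integers, so no zero padding is applied.
    return f"year = {day.year} AND month = {day.month} AND day = {day.day}"


def daily_views_query(project: str, day: date) -> str:
    """Build a query for the most viewed titles of one project on one day.

    Hypothetical helper: the project value is interpolated naively, so it
    must come from trusted input only.
    """
    return (
        "SELECT page_title, SUM(view_count) AS views\n"
        "FROM wmf.pageview_hourly\n"
        f"WHERE {partition_predicate(day)}\n"
        f"  AND project = '{project}'\n"
        "  AND agent_type = 'user'\n"
        "GROUP BY page_title\n"
        "ORDER BY views DESC\n"
        "LIMIT 10"
    )


if __name__ == "__main__":
    # 'en.wikipedia' is an assumed example value for the project column.
    print(daily_views_query("en.wikipedia", date(2015, 6, 1)))
```

Because the predicate names year, month, and day explicitly, Hive only needs to read the 24 hourly partitions of that day rather than the whole table.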
Changes and known problems since 2015-06-16
- See also m:Research:Page view#Change log for changes to the page view definition itself, and Analytics/Data_Lake/Traffic/Webrequest#Changes and known problems since 2015-03-04 for issues and updates affecting all webrequests (including non-pageviews)
|2015-05-01||task T99931||0.0.1||Create table with pageviews aggregated from 'text' and 'mobile' refined webrequest sources and backfill aggregation from the beginning of May.|
|2015-06-01||task T107436||0.0.2||Add parsed user agent data (user_agent_map field) to aggregated pageviews to prepare wikistat-2.0 request.|
|2015-08-31||task T110614||0.0.3||Backfilled data from the beginning of pageview_hourly (May 1st, 2015) to correct bugs. We took advantage of this backfill to reorder the user_agent_map fields in a more coherent place.|
|2015-12-01||task T116023||0.0.4||Add the mediawiki page_id for this title when available. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.|
|2016-02-22||task T148780||Since this date, browsers not implementing the referrer meta tag correctly fail to populate the referrer header when clicking on wiki internal links.|
|2016-08-05||task T141506||Spikes in pageviews for Main_Page in several projects, due to a disproportionate number of requests from users of Chrome 41 on Windows. Possibly caused by a TLS bug in a Windows security update.|
|2016-10||task T145922||Fix to MediaWiki that changed requests that were (wrongly) returning 200 to (lawfully) return 404; those requests are no longer counted as pageviews. This removed about 6 million pageviews monthly.|
|From 2016-02||task T148780||Meta referrer tag value not supported by Safari, so Safari sessions appear to be shorter.|
|2017-02-09||Task T156628, Task T155141, Task T157528||0.0.5||Update pageview definition to remove previews (POST with "action=submit" in query is now excluded from pageviews). Add DSXS (self-identified bot) to bot regex. Add namespace_id field (not always present though).|
|2017-06||Task T163233||Addition of throttling at the CDN layer, which reduces traffic spikes coming from the same IPs in short time periods.|
|2018-04||Task T187014||From February 6 to April 16, 2018, the geolocation data for traffic from Opera browsers on mobile web was incorrect (wrongly labeled them as coming from the US, "Unknown" and some other countries rather than their true origin).|
Page title and id
The dataset contains both a page_title and a page_id field.
- page_title is extracted from the requested URL (either from path or query). We expect it to be present and correct in most cases. The special values used when the title is not extracted is
- page_id is received in the x-analytics header. As of 2017-06-12, page_id is populated on access methods mobile web requests, but not mobile app. This means ~90% of pageview requests have a page_id so far. In case of redirect, the page_id we received is the one of the redirected-to page. This means that, for instance, the same page_id 534366 is associated with the different page_titles Barack_Obama (original content page), Barack_obama (redirect to main content page), Barack_H._Obama (again another redirect) ...
Of interest, how redirects work in mediawiki: Task T53736
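To make the redirect behaviour concrete, here is a small Python sketch that sums view counts per page_id so that redirect titles collapse onto their target page. The rows are invented for illustration (the view counts and the title Example_app_only_title are hypothetical), and the helper name views_by_page_id is an assumption; only the page_id 534366 and its three titles come from the text above.

```python
from collections import defaultdict

# Hypothetical (page_title, page_id, view_count) rows; counts are invented.
rows = [
    ("Barack_Obama", 534366, 120),         # original content page
    ("Barack_obama", 534366, 7),           # redirect to the content page
    ("Barack_H._Obama", 534366, 3),        # another redirect
    ("Example_app_only_title", None, 5),   # page_id not set for this row
]


def views_by_page_id(rows):
    """Sum view counts per page_id, skipping rows where page_id is missing."""
    totals = defaultdict(int)
    for _title, page_id, views in rows:
        if page_id is None:
            # page_id is not always populated, so these rows cannot be
            # attributed to a page this way.
            continue
        totals[page_id] += views
    return dict(totals)


if __name__ == "__main__":
    print(views_by_page_id(rows))
```

Grouping by page_title instead would report the three titles separately, which undercounts the content page whenever readers arrive through redirects.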
- The code that generates it:
- Projectview hourly, a much smaller table containing the same data but page_title and page_id aggregated on the project level.
- Action taken on the 7th of June, but since data is available from the beginning of May, the date of this line is set accordingly.
- Action taken on the 31st of July, but since data is available from the beginning of June, the date of this line is set accordingly.