Data Platform/Data Lake/Traffic/Pagecounts-ez
This dataset is described on its dumps download page.
This dataset stores, in a compressed format, the best pageview data that the Wikimedia Foundation had available at each point in its history:
- From 2007 to December 2015, it compressed the pagecounts-raw dataset, which is now deprecated (providing pageviews per project from December 2007 on, and pageviews per article from late 2011 on)
- From December 2015 to the present, it compresses the pageviews dataset
More information about each of those datasets can be found on their pages.
One hour skewing issue
The data in this dataset, when compared to the canonical Pageviews API, is skewed one hour to the left. This means that the value Pagecounts-EZ reports for midnight in reality corresponds to 11PM of the previous day, and each later value is likewise labeled one hour after the hour it actually covers:
Dataset | 12am | 1am | 2am | 3am | 4am | 5am | 6am | 7am | 8am | 9am | 10am | 11am | 12pm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Pageviews API | 23 | 234 | 43 | 345 | 64 | 12 | 534 | 654 | 43 | 645 | 98 | 65 | 75 |
Pagecounts-EZ | 89 | 23 | 234 | 43 | 345 | 64 | 12 | 534 | 654 | 43 | 645 | 98 | 65 |
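One way to compensate for the skew when comparing the two sources is to move every Pagecounts-EZ timestamp back by one hour. The following is a minimal sketch, not an official tool: the function name `realign_ez_hours`, the dictionary layout, and the date 2016-01-02 are assumptions made for illustration, and the counts are the sample values from the table above.

```python
from datetime import datetime, timedelta


def realign_ez_hours(ez_counts):
    """Shift each Pagecounts-EZ hourly value back by one hour.

    ez_counts maps the hour as labeled by Pagecounts-EZ (a datetime) to
    its count. The returned dict maps the hour the count actually covers,
    per the Pageviews API, to the same count.
    """
    return {hour - timedelta(hours=1): count for hour, count in ez_counts.items()}


# Sample values from the table above; the date itself is hypothetical.
day = datetime(2016, 1, 2)
api_values = [23, 234, 43, 345, 64, 12, 534, 654, 43, 645, 98, 65, 75]
ez_values = [89, 23, 234, 43, 345, 64, 12, 534, 654, 43, 645, 98, 65]

api = {day + timedelta(hours=h): n for h, n in enumerate(api_values)}
ez = {day + timedelta(hours=h): n for h, n in enumerate(ez_values)}

realigned = realign_ez_hours(ez)

# After the shift, every hour present in both sources carries the same count.
shared_hours = sorted(set(api) & set(realigned))
mismatches = [h for h in shared_hours if api[h] != realigned[h]]
```

Note that the shift moves the first value of a day into the previous day (here, the 89 reported for midnight lands on 11PM of the day before), so a full day of realigned data needs the first hour of the following day's Pagecounts-EZ data as well.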
See also
- Erik Zachte's 2011 announcement of this dataset (source for the description on the dumps download page, link is broken there): [1],[2]
- Phabricator task to make per-article data available in pagecounts-ez back to 2008 (instead of 2011)
- m:Learning patterns/Tips for reading project codes from pageviews data files