Analytics/Archive/Data/Pagecounts-raw

From Wikitech
See also the pageviews API, available since the end of 2015, and other sources of pageview data.
This page contains historical information. It may be outdated or unreliable.

NOTE: This dataset has been deprecated since 2016-08-01, see this thread

pagecounts-raw holds the desktop sites' pageview data (separately for every page) for the timespan from 2007 to 2016, in the same format that webstatscollector used to emit and based on the same pageview definition (which differs from the newer definition introduced in 2014/15).

The dataset also contains the "projectcounts" aggregate pageview data, which count traffic to an entire project (e.g. English Wikipedia or Romanian Wikivoyage) and, in contrast to the page-level data, include mobile views.

This stream is owned by the Analytics Engineering Team.

For the timespan from September 2014 on, it is recommended to use pagecounts-all-sites instead, which includes mobile views.

For the timespan from May 2015 on, the pageviews dataset is recommended, which is based on an improved pageview definition (and also includes mobile views).

Contained data

The dataset consists of files with names

${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz
${YEAR}/${YEAR}-${MONTH}/projectcounts-${YEAR}${MONTH}${DAY}-${HOUR}0000


The pagecounts or pageviews are gzipped text files holding hourly per page aggregates of pageviews, and projectcounts or projectviews are plain text files holding hourly per domain-name[1] aggregates at the project level. The pagecounts and projectcounts files also include total response byte sizes at their respective aggregation level, but this was dropped from the pageviews and projectviews files because it wasn't very accurate.

The time used in the filename is in UTC and refers to the end of the aggregation period, not the beginning.
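As a rough illustration of the naming scheme, the following sketch builds the relative file path for the hour that starts at a given UTC time. The function name hourly_file_path is hypothetical. The sketch assumes that month, day and hour are zero-padded to two digits and that the directory is derived from the same end-of-period timestamp as the file name; neither is stated explicitly above.

```python
from datetime import datetime, timedelta


def hourly_file_path(period_start, kind="pagecounts"):
    """Relative path of the file covering the hour that begins at period_start (UTC).

    Hypothetical helper: it follows the name pattern literally and assumes
    zero-padded month, day and hour.
    """
    if kind not in ("pagecounts", "projectcounts"):
        raise ValueError(f"unknown file kind: {kind!r}")
    # The timestamp in the name marks the END of the aggregation period.
    stamp = period_start.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    # Only the page-level files are gzipped.
    extension = ".gz" if kind == "pagecounts" else ""
    return f"{stamp:%Y}/{stamp:%Y-%m}/{kind}-{stamp:%Y%m%d-%H}0000{extension}"


# The last hour of 2014 ends at midnight, so its file is named after 2015-01-01.
print(hourly_file_path(datetime(2014, 12, 31, 23)))
```

Note that the end-of-period convention means the last hour of a day, month or year is filed under the following day, month or year.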

Both page and project files are made up of lines having 4 space-separated fields:

domain_code page_title count_views total_response_size
Field name Description
domain_code Domain name of the request, abbreviated.

The domain coding scheme in pageviews, pagecounts-all-sites, and pagecounts-raw is kept compatible on purpose, thus retaining quirks and inconsistencies in the coding scheme (and perhaps adding to the confusion with newly added complexity). Our apologies if the scheme looks a bit complex (it is), but codes are unambiguous, and are primarily for machine-reading.

Common trailing parts in the domain name have been abbreviated. The main inconsistency is: project 'wikipedia.org' doesn't add a suffix for the project name, whereas 'wikibooks.org' adds .b, 'wiktionary.org' adds .d, etc. (the original scheme predates Wikimedia's mobile site).

domain_code can now also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as the second part of the code (just like in the full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org". (Again, project Wikipedia is not coded in the abbreviation: 'en' stands for "en.wikipedia.org", and 'en.m' stands for "en.m.wikipedia.org".)

Domain trailing part Coded as Database name
.wikipedia.org (no suffix) *wiki

(be careful: other, non-Wikipedia sites also use database names ending in wiki)

.wikibooks.org .b *wikibooks
.wiktionary.org .d *wiktionary
.wikimediafoundation.org .f foundationwiki
.wikimedia.org .m

Only the following domains are considered (database name in parentheses):

  • commons.wikimedia.org (commonswiki)
  • meta.wikimedia.org (metawiki)
  • incubator.wikimedia.org (incubatorwiki)
  • species.wikimedia.org (specieswiki)
  • strategy.wikimedia.org (strategywiki)
  • outreach.wikimedia.org (outreachwiki)
  • usability.wikimedia.org (usabilitywiki)
  • quality.wikimedia.org (qualitywiki)
.m.${WHITELISTED_PROJECT}.org .mw (See explanation below)
.wikinews.org .n *wikinews
.wikiquote.org .q *wikiquote
.wikisource.org .s *wikisource
.wikiversity.org .v *wikiversity
.wikivoyage.org .voy *wikivoyage
.mediawiki.org .w mediawikiwiki
.wikidata.org .wd wikidatawiki
page_title For page-level files, it holds the title of the page, taken from the unnormalized part after /wiki/ in the request URL (e.g. Main_Page, Berlin). The page title may also be extracted from the title or page query parameters, e.g. /w/index.php?title=Main+Page. The title will be URL-decoded and will be formatted as a canonical DBkey with spaces replaced by underscores.

For project-level files or when the title cannot be extracted, it is -.

count_views The number of times this page has been viewed in the respective hour.
total_response_size The total response size caused by the requests for this page in the respective hour. This is a sum over field #7 of Cache log format fields.

So for example a line

en Main_Page 42 50043

means 42 requests to "en.wikipedia.org/wiki/Main_Page", which accounted in total for 50043 response bytes. And

de.m.voy Berlin 176 314159

would stand for 176 requests to "de.m.wikivoyage.org/wiki/Berlin", which accounted in total for 314159 response bytes.
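The abbreviations used in these two examples can be expanded mechanically. The following is a minimal sketch, not the code that produced the files: the function name expand_domain_code and the constant names are hypothetical, and the rule used to tell 'commons.m' (a wikimedia.org site) apart from 'en.m' (mobile Wikipedia) is an assumption based on the list of considered wikimedia.org domains above.

```python
# Suffix table copied from the "Coded as" column above.
SUFFIX_TO_DOMAIN = {
    "b": "wikibooks.org",
    "d": "wiktionary.org",
    "f": "wikimediafoundation.org",
    "m": "wikimedia.org",
    "n": "wikinews.org",
    "q": "wikiquote.org",
    "s": "wikisource.org",
    "v": "wikiversity.org",
    "voy": "wikivoyage.org",
    "w": "mediawiki.org",
    "wd": "wikidata.org",
}

# The wikimedia.org subdomains listed as "considered" above.
WIKIMEDIA_SUBDOMAINS = {
    "commons", "meta", "incubator", "species",
    "strategy", "outreach", "usability", "quality",
}

# Second code part used for mobile and zero sites.
SITE_MARKERS = {"m", "zero"}


def expand_domain_code(code):
    """Return the full domain name for a domain_code, or None for .mw aggregates."""
    parts = code.split(".")
    if parts[-1] == "mw":
        return None  # cross-project mobile aggregate, not a single domain
    if len(parts) == 1:
        return f"{parts[0]}.wikipedia.org"
    if len(parts) == 2:
        first, second = parts
        # Assumption: "<subdomain>.m" is a wikimedia.org site only for listed subdomains.
        if second == "m" and first in WIKIMEDIA_SUBDOMAINS:
            return f"{first}.wikimedia.org"
        if second in SITE_MARKERS:
            return f"{first}.{second}.wikipedia.org"
        return f"{first}.{SUFFIX_TO_DOMAIN[second]}"
    if len(parts) == 3 and parts[1] in SITE_MARKERS:
        first, marker, suffix = parts
        return f"{first}.{marker}.{SUFFIX_TO_DOMAIN[suffix]}"
    raise ValueError(f"unrecognized domain_code: {code!r}")


for code in ("en", "en.m", "en.m.v", "de.m.voy", "commons.m"):
    print(code, "->", expand_domain_code(code))
```

An unknown suffix raises a KeyError in this sketch, which seems preferable to silently guessing a domain.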

Each domain_code and page_title pair occurs at most once.

The file is sorted by domain_code and page_title.
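A single line can be split into its four fields as sketched below. The names CountLine and parse_count_line are hypothetical; the sketch relies on the statement above that titles are DBkeys with spaces replaced by underscores, so a title never contains a space itself.

```python
from typing import NamedTuple


class CountLine(NamedTuple):
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_count_line(line):
    """Parse one line of a page-level or project-level file."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 4:
        raise ValueError(f"expected 4 space-separated fields, got {len(fields)}: {line!r}")
    domain_code, page_title, count_views, total_response_size = fields
    return CountLine(domain_code, page_title, int(count_views), int(total_response_size))


print(parse_count_line("en Main_Page 42 50043"))
print(parse_count_line("de.m.voy Berlin 176 314159"))
```

In project-level files the page_title field comes back as the literal string "-".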


Data not included

This dataset does not contain per-language or per-title counts for a project's mobile or zero sites. See pagecounts-all-sites if you need them.

Aggregation for .mw

Note: this anomaly is retained for backward compatibility! These lines would better belong in project-level files. It is best to ignore .mw lines.

The .mw abbreviation aggregates the mobile sites across all projects per language. The page_title is set to the language code.

Suppose that a given hour sees only the following requests:

https://en.m.wikipedia.org/wiki/Main_Page
https://en.m.wikipedia.org/wiki/Berlin
https://en.m.wiktionary.org/wiki/House

Assuming each request accounted for 100 bytes, the hour's page-level file would consist only of the line

 en.mw en 3 300

The corresponding project-level file would be

 en.mw - 3 300

So while the .mw abbreviation counts the mobile site, it throws wikipedia and wiktionary into the same bucket. It also does not distinguish between page titles.
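The example can be reproduced with a few lines of code. This is only a sketch of the aggregation as described here, not the original webstatscollector logic: the function name aggregate_mw is hypothetical, mobile hosts are recognized purely by the shape language.m.project.org, and the project whitelist hinted at by ${WHITELISTED_PROJECT} is ignored because it is not spelled out on this page.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def aggregate_mw(requests):
    """Aggregate (url, response_bytes) pairs into page-level '.mw' lines."""
    totals = defaultdict(lambda: [0, 0])  # language -> [views, bytes]
    for url, response_bytes in requests:
        labels = urlsplit(url).hostname.split(".")
        # Assumption: a mobile host has the shape <language>.m.<project>.org.
        if len(labels) == 4 and labels[1] == "m":
            language = labels[0]
            totals[language][0] += 1
            totals[language][1] += response_bytes
    # One line per language; page_title is set to the language code.
    return [
        f"{language}.mw {language} {views} {size}"
        for language, (views, size) in sorted(totals.items())
    ]


requests = [
    ("https://en.m.wikipedia.org/wiki/Main_Page", 100),
    ("https://en.m.wikipedia.org/wiki/Berlin", 100),
    ("https://en.m.wiktionary.org/wiki/House", 100),
]
print(aggregate_mw(requests))
```

For the three requests of the example this yields the single line en.mw en 3 300, with the project and the page title both lost.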

Availability

dumps.wikimedia.org

The stream is available unsampled as gzipped hourly files from http://dumps.wikimedia.org/other/pagecounts-raw/.

The date in the file name refers to the end of the capturing period, not the beginning.
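Once an hourly page-level file has been downloaded, it can be processed as a stream without unpacking it to disk first. The sketch below sums the views per domain_code; the function name views_per_domain and the local file name are hypothetical. Undecodable bytes are replaced and lines that do not have exactly four fields are skipped, both as defensive assumptions about old files rather than documented behaviour.

```python
import gzip
from collections import Counter


def views_per_domain(path):
    """Sum count_views per domain_code in one gzipped hourly pagecounts file."""
    totals = Counter()
    # errors="replace" is a defensive choice in case a title is not valid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 4:
                continue  # skip lines that do not match the documented format
            domain_code, _page_title, count_views, _total_response_size = fields
            totals[domain_code] += int(count_views)
    return totals


# Usage with a hypothetical local copy of one hourly file:
# totals = views_per_domain("pagecounts-20140101-000000.gz")
# print(totals.most_common(10))
```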

stat1004 and stat1007

Data from 2007 to 2016 is available as hourly files at /mnt/hdfs/wmf/data/archive/pagecounts-raw/ on stat1004.eqiad.wmnet. Also, the folder /mnt/data/pagecounts/incoming on stat1007 has hourly files with data from 2015 and 2016.

The date in the file name refers to the end of the capturing period, not the beginning.

There is also a Hive table called projectcounts_raw with data from 2007 to 2016 that may be related.

pagecounts-ez

Adapted from a post on Analytics-l, February 2018:

Another option is to download the data in lossless compressed form from https://dumps.wikimedia.org/other/pagecounts-ez/ (see also Analytics/Data Lake/Traffic/Pagecounts-ez). The format is clever and doesn't lose granularity, and downloading it should be a lot quicker than pagecounts-raw (this is basically what stats.grok.se did with the data as well, so downloading this way should be equivalent).

Toolforge

Adapted from a post on Analytics-l, February 2018:

You can also work on Toolforge, a virtual cloud that's on the same network as the data, so getting the data is a lot faster and you can use our compute resources (free, of course): Portal:Toolforge (IRC support: #wikimedia-cloud). See also PAWS.

Events and known problems since 2014-03-01

Date from Date until Bug Details
* 2014-09-02 ~16:19 bug 70140 HTTPS traffic from ulsfo was counted twice.
2014-04-17 2014-07-07 bug 67456 Logs from SSL endpoints were not fed into webstatscollector, hence SSL traffic was not counted by webstatscollector.
2014-07-07 ~16:25 2014-09-02 ~16:19 bug 70295 Requests to Special:CentralAutoLogin/* have been counted.
2014-07-08 19:00 2014-07-08 22:00 bug 67694 A 2014 FIFA World Cup (soccer) related traffic spike caused udp2log overload and led to up to ~10% packet loss during this period of time.
2014-07-13 19:00 2014-07-13 23:00 bug 67694 A 2014 FIFA World Cup (soccer) related traffic spike caused udp2log overload and led to up to ~25% packet loss during this period of time.
2014-07-29 01:35 2014-07-29 01:42 bug 68796 Most of esams missing between 2014-07-29T01:35:45 and 2014-07-29T01:42:00 due to flapping network link (<=11% of total zero traffic around that time)
2014-08-16 ~22:43 2014-08-16 ~22:49 bug 69663 Root mount on oxygen went full, which caused services to panic and udp2log dropped requests during that time
2014-08-17 ~06:26 2014-08-17 ~06:30 bug 69663 Root mount on oxygen went full again, which caused services to panic and udp2log dropped requests during that time
2014-08-24 14:00 2014-08-27 21:00 bug 70118 Resource scarcity on gadolinium caused higher drop rates, and service restarts chopped off part of the data for some hours.
2014-08-28 16:01 2014-08-28 ~20:30 bug 70136 Permission errors on gadolinium prohibited writing of hourly files
2014-10-08 22:00 2014-10-08 24:00 bug 71879 ULSFO having connectivity issues leading to partial message loss
* 2014-10-15 ~19:02:30 bug 66352 Pageviews to “undefined” and “Undefined” pages have been counted
* 2014-10-15 ~19:02:30 bug 71790 Redirects have been counted
2014-10-15 ~19:00:00 2014-10-15 ~19:02:30 bug 72102 No messages collected during deployment of new webstatscollector version
2014-10-15 ~20:22:00 2014-10-15 ~20:23:00 bug 72107 No messages collected during restart of webstatscollector's filter
2014-10-20 13:06 2014-10-20 13:27 bug 72306 ULSFO connectivity issues causing packet loss between 6% and 47% for ulsfo caches.
2014-10-21 ~10:30 2014-10-21 ~11:43 bug 72355 Ulsfo connectivity issues causing packet loss for ulsfo caches.
2014-11-25 ~01:56 2014-12-04 14:03 task T76390 A change of the HTTPS setup caused HTTPS requests from eqiad and esams (not ulsfo) to be counted twice.

On 2014-12-08, backfilling the affected period with good data from pagecounts-all-sites finished. So since then, the pagecounts/projectcounts files for the affected period are good again.

2014-11-30 ~03:50 2014-11-30 ~10:13 task T76334 No data while analytics infrastructure suffered eqiad network issues.

On 2014-12-08, backfilling the affected period with good data from pagecounts-all-sites finished. So since then, the pagecounts/projectcounts files for the affected period are good again.

2015-01-01 00:00 n/a Switch from webstatscollector generated files to Hive generated files (If32afc, stripped-down variant of pagecounts-all-sites), see announcement ("You may see a slight increase in article counts. The webrequest data in HDFS is less lossy than the udp2log data").
2015-01-13 ~22:20 2015-01-13 ~23:18 task T86973 No data due to firewall problems

Other notes:

  • In June 2014, tablet views were switched from the desktop site to the mobile site, causing mobile views to increase and correspondingly the desktop views (i.e. also the per-page numbers from pagecounts-raw) to drop.

Earlier issues (incomplete list):

  • 2013-07-23 to 2013-07-24: some data loss (resulting in empty files for 17 hours)
  • Remarks about some Wikistats issues (2009-) that may or may not have affected pagecounts-raw too
  • Errata list from 2011 on Wikistats

See also

  1. Hence, the “project” in projectcounts is somewhat of a misnomer, but kept for historical compatibility.