Analytics/Archive/Data/Pagecounts-all-sites

From Wikitech
See also the pageviews API, available since the end of 2015.
This page contains historical information. It may be outdated or unreliable.

NOTE: This dataset has been deprecated since 2016-08-01; see this thread.

The pagecounts-all-sites dataset holds output from 2014 to 2016 that mimics the pagecounts-raw files but was generated from Hadoop data using Hive. It also extends[1] the webstatscollector pageview definition to the mobile and zero sites.

This stream is owned by the Analytics Team.

Note: As this dataset extends the webstatscollector pageview definition, its files carry timestamps one hour later than those of any other dataset handled by the Analytics Team (in particular webrequest, pageview-hourly, and projectview-hourly).

For instance, for data between 2018-09-27T13:00:00 and 2018-09-27T14:00:00, pagecounts-all-sites uses 2018-09-27T14:00:00, while the other datasets use 2018-09-27T13:00:00.
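The one-hour shift can be sketched in a few lines of Python. This is only an illustration of the convention described above; the function names are hypothetical and not part of any Analytics tooling.

```python
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)


def all_sites_timestamp(hour_start: datetime) -> datetime:
    """Timestamp pagecounts-all-sites uses for the hour beginning at hour_start.

    The dataset labels an hour by the END of the aggregation period.
    """
    return hour_start + ONE_HOUR


def other_datasets_timestamp(all_sites_ts: datetime) -> datetime:
    """Convert a pagecounts-all-sites timestamp back to the hour-start label
    used by datasets such as webrequest or pageview-hourly."""
    return all_sites_ts - ONE_HOUR


start = datetime(2018, 9, 27, 13, 0, 0)
print(all_sites_timestamp(start).isoformat())  # 2018-09-27T14:00:00
```

When joining this dataset against hour-start-labelled data, subtracting one hour from the pagecounts-all-sites timestamp first avoids an off-by-one-hour comparison.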

Contained data

(If you are familiar with the pagecounts-raw dataset, you might want to look at the differences between those two datasets right away.)


The pagecounts or pageviews are gzipped text files holding hourly per page aggregates of pageviews, and projectcounts or projectviews are plain text files holding hourly per domain-name[2] aggregates at the project level. The pagecounts and projectcounts files also include total response byte sizes at their respective aggregation level, but this was dropped from the pageviews and projectviews files because it wasn't very accurate.

The time used in the filename is in UTC and refers to the end of the aggregation period, not the beginning.

Both page and project files are made up of lines having 4 space-separated fields:

domain_code page_title count_views total_response_size
Field name Description
domain_code Domain name of the request, abbreviated.

The domain coding scheme in pageviews, pagecounts-all-sites, and pagecounts-raw is kept compatible on purpose, thus retaining quirks and inconsistencies in the coding scheme (and perhaps adding to the confusion with new added complexity). Our apologies if the scheme looks a bit complex (it is), but codes are unambiguous, and are primarily for machine-reading.

Common trailing parts of the domain name have been abbreviated. The main inconsistency is that the project 'wikipedia.org' does not add a suffix for the project name, whereas 'wikibooks.org' adds .b, 'wiktionary.org' adds .d, etc. (the original scheme predates Wikimedia's mobile site).

The domain_code can now also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as the second part of the code (just as in the full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org". (Again, project Wikipedia is not coded in the abbreviation: 'en' stands for "en.wikipedia.org", and 'en.m' stands for "en.m.wikipedia.org".)

Domain trailing part Coded as Database name
.wikipedia.org (no suffix) *wiki (be careful, however: other, non-Wikipedia sites also use database names of this form)

.wikibooks.org .b *wikibooks
.wiktionary.org .d *wiktionary
.wikimediafoundation.org .f foundationwiki
.wikimedia.org .m (database names listed below)

Only the following domains are considered (database name in parentheses):

  • commons.wikimedia.org (commonswiki)
  • meta.wikimedia.org (metawiki)
  • incubator.wikimedia.org (incubatorwiki)
  • species.wikimedia.org (specieswiki)
  • strategy.wikimedia.org (strategywiki)
  • outreach.wikimedia.org (outreachwiki)
  • usability.wikimedia.org (usabilitywiki)
  • quality.wikimedia.org (qualitywiki)
.m.${WHITELISTED_PROJECT}.org .mw (See explanation below)
.wikinews.org .n *wikinews
.wikiquote.org .q *wikiquote
.wikisource.org .s *wikisource
.wikiversity.org .v *wikiversity
.wikivoyage.org .voy *wikivoyage
.mediawiki.org .w mediawikiwiki
.wikidata.org .wd wikidatawiki
page_title For page-level files, this holds the title taken from the unnormalized part after /wiki/ in the request URL (e.g. Main_Page, Berlin). The page title may also be extracted from the title or page query parameters, e.g. /w/index.php?title=Main+Page. The title is URL-decoded and formatted as a canonical DBkey, with spaces replaced by underscores.

For project-level files or when the title cannot be extracted, it is -.

count_views The number of times this page has been viewed in the respective hour.
total_response_size The total response size caused by the requests for this page in the respective hour. This is a sum over field #7 of Cache log format fields.

So for example a line

en Main_Page 42 50043

means 42 requests to "en.wikipedia.org/wiki/Main_Page", which accounted in total for 50043 response bytes. And

de.m.voy Berlin 176 314159

would stand for 176 requests to "de.m.wikivoyage.org/wiki/Berlin", which accounted in total for 314159 response bytes.

Each domain_code and page_title pair occurs at most once.

The file is sorted by domain_code and page_title.
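As a minimal sketch of the line format described above, the following Python splits a line into its four space-separated fields. The names PagecountLine and parse_line are hypothetical, and real files may contain malformed lines that need more lenient handling than shown here.

```python
from typing import NamedTuple


class PagecountLine(NamedTuple):
    domain_code: str
    page_title: str          # "-" in project-level files
    count_views: int
    total_response_size: int


def parse_line(line: str) -> PagecountLine:
    """Parse one line of a pagecounts or projectcounts file."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        raise ValueError(f"expected 4 space-separated fields, got {len(parts)}")
    domain_code, page_title, views, size = parts
    return PagecountLine(domain_code, page_title, int(views), int(size))


print(parse_line("en Main_Page 42 50043"))
print(parse_line("de.m.voy Berlin 176 314159"))
```

Splitting on a single space works because titles are stored as DBkeys, with spaces already replaced by underscores.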

Disambiguating abbreviations ending in “.m”

There are two ways for an abbreviation to end in .m: either the domain is a whitelisted project on wikimedia.org (like commons.wikimedia.org being abbreviated to commons.m), or the domain is the mobile site of Wikipedia (like en.m.wikipedia.org being abbreviated to en.m).

Since the whitelisted wikimedia.org projects (see abbreviation table above) never match a language code on wikipedia, the mapping between domain name and abbreviation is bijective.
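The mapping from abbreviation back to domain name can be sketched as follows. This is an illustration built only from the abbreviation table on this page, not the code that generated the dataset: the function name expand_domain_code is hypothetical, and the .mw code and mobile variants of wikimedia.org projects are deliberately not handled.

```python
# Suffix -> trailing domain part, taken from the abbreviation table.
PROJECT_SUFFIXES = {
    "b": "wikibooks.org",
    "d": "wiktionary.org",
    "f": "wikimediafoundation.org",
    "n": "wikinews.org",
    "q": "wikiquote.org",
    "s": "wikisource.org",
    "v": "wikiversity.org",
    "voy": "wikivoyage.org",
    "w": "mediawiki.org",
    "wd": "wikidata.org",
}

# Whitelisted wikimedia.org projects; none of them is a Wikipedia language code.
WIKIMEDIA_PROJECTS = {
    "commons", "meta", "incubator", "species",
    "strategy", "outreach", "usability", "quality",
}


def expand_domain_code(code: str) -> str:
    """Expand an abbreviated domain_code into the full domain name."""
    parts = code.split(".")
    # "commons.m" is commons.wikimedia.org, not a mobile Wikipedia.
    if len(parts) == 2 and parts[1] == "m" and parts[0] in WIKIMEDIA_PROJECTS:
        return parts[0] + ".wikimedia.org"
    # A trailing project suffix such as ".v" or ".voy".
    if len(parts) > 1 and parts[-1] in PROJECT_SUFFIXES:
        return ".".join(parts[:-1] + [PROJECT_SUFFIXES[parts[-1]]])
    # No project suffix: Wikipedia (possibly with .m or .zero).
    return ".".join(parts + ["wikipedia.org"])


for code in ("en", "en.m", "commons.m", "en.m.v", "de.m.voy", "ms.zero"):
    print(code, "->", expand_domain_code(code))
```

The whitelist check has to come first; otherwise commons.m would be read as the mobile site of a Wikipedia with the language code "commons".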

While this solution requires an if for the edge case of "Summing up pageviews across all mobile sites", it stays compatible with pagecounts-raw's abbreviations while keeping the concept and semantics of abbreviating domain names. It also makes it easier to automate comparisons between this dataset and TSVs (like sampled-1000) or Hive data.
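The "if" needed when summing pageviews across all mobile sites might look like the sketch below. It assumes, as described on this page, that .m appears as the second part of a mobile code and that the eight whitelisted wikimedia.org projects are the only non-mobile codes of the form name.m. The function names are hypothetical.

```python
# Whitelisted wikimedia.org projects from the abbreviation table.
WIKIMEDIA_PROJECTS = {
    "commons", "meta", "incubator", "species",
    "strategy", "outreach", "usability", "quality",
}


def is_mobile_site(domain_code: str) -> bool:
    """True if the code denotes a mobile site (".m" as second part)."""
    parts = domain_code.split(".")
    if len(parts) < 2 or parts[1] != "m":
        return False
    # Edge case: "commons.m" etc. are wikimedia.org projects, not mobile sites.
    return not (len(parts) == 2 and parts[0] in WIKIMEDIA_PROJECTS)


def total_mobile_views(lines) -> int:
    """Sum count_views over all lines that belong to a mobile site."""
    total = 0
    for line in lines:
        domain_code, _title, views, _size = line.rstrip("\n").split(" ")
        if is_mobile_site(domain_code):
            total += int(views)
    return total


sample = [
    "commons.m Main_Page 10 1000",
    "de.m.voy Berlin 176 314159",
    "en Main_Page 42 50043",
    "en.m Main_Page 5 500",
    "ms.zero Cinta_Elysa 4 32944",
]
print(total_mobile_views(sample))  # 181
```

The commons.m, en, and ms.zero lines in the sample are invented values used only to exercise each branch.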

Differences to the pagecounts-raw dataset

Every line that is in pagecounts-raw is also in pagecounts-all-sites.

Additionally, pagecounts-all-sites counts the mobile site (E.g.: having a line like

 de.m.voy Berlin 176 314159

for the mobile site page de.m.wikivoyage.org/wiki/Berlin ) and the zero site (E.g.: having a line like

 ms.zero Cinta_Elysa 4 32944

for the zero site page ms.zero.wikipedia.org/wiki/Cinta_Elysa ).

Apart from that, there should be no differences between pagecounts-raw and pagecounts-all-sites.

Requests to the mobile site and requests from mobile devices or apps

“Mobile site” refers to the mobile site (that is, URLs having .m. before the wikipedia.org, … in the URL), not to device identification. Note, however, that mobile phones and tablets are redirected to the mobile sites by default.

Also, traffic from mobile apps is not singled out, and according to the webstatscollector pageview definition, API requests are not counted.

Requests to the zero site and Wikipedia Zero requests

Wikipedia Zero requests can (depending on the setup for the Wikipedia Zero partner) hit either

  • mobile site (having “.m.” in the unabbreviated domain name), or
  • zero site (having “.zero.” in the unabbreviated domain name).

Hence, aggregating all lines that have “.zero” in the domain abbreviation (like

 ms.zero Cinta_Elysa 4 32944

) does not yield the total volume of Wikipedia Zero traffic; it only gives the total volume of traffic to the zero site. The bigger part of Wikipedia Zero traffic goes to the mobile site. Note, however, that the mobile site sees both Wikipedia Zero and non-Wikipedia Zero traffic, so there is no way to compute the “total volume of Wikipedia Zero traffic” from this dataset.

Availability

dumps.wikimedia.org

The stream is available as hourly files at http://dumps.wikimedia.org/other/pagecounts-all-sites/.

To maintain compatibility with pagecounts-raw, the date in the file name refers to the end of the capturing period, not the beginning.
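Once an hourly pagecounts file has been downloaded, it can be processed as a stream without unpacking it first. The following is a rough sketch, assuming a local gzipped file in the four-field format described above; the function name and the path in the usage comment are hypothetical, and lines that do not match the format are simply skipped.

```python
import gzip
from collections import Counter


def views_per_domain(path: str) -> Counter:
    """Sum count_views per domain_code for one gzipped hourly pagecounts file."""
    totals = Counter()
    # errors="replace" keeps the loop going if a title is not valid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4 or not parts[2].isdigit():
                continue  # skip lines that do not match the expected format
            domain_code, _title, views, _size = parts
            totals[domain_code] += int(views)
    return totals


# Usage (hypothetical local path):
#   totals = views_per_domain("/tmp/pagecounts-hour.gz")
#   print(totals.most_common(10))
```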

stat1002.eqiad.wmnet

The stream is available as hourly files at /mnt/hdfs/wmf/data/archive/pagecounts-all-sites on stat1002.

To maintain compatibility with pagecounts-raw, the date in the file name refers to the end of the capturing period, not the beginning.

Analytics cluster

The stream is available as hourly files at /wmf/data/archive/pagecounts-all-sites in the Analytics cluster.

To maintain compatibility with pagecounts-raw, the date in the file name refers to the end of the capturing period, not the beginning.

Events and known problems since 2014-10-01

You can follow the feed for these incident updates.

Date from Date until Bug Details
2014-10-08 23:02 2014-10-08 23:11 bug 71876 ULSFO connectivity issues causing duplicates and missing requests worth <2 minutes
2014-10-13 13:37:15 2014-10-13 13:38:26 bug 72028 analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <2 seconds.
* 2014-10-15 19:00:00 bug 66352 Pageviews to “undefined” and “Undefined” pages have been counted
* 2014-10-15 19:00:00 bug 71790 Redirects have been counted
2014-10-20T02:05:08 2014-10-20T02:05:16 bug 72252 analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <2 seconds.
2014-10-20 13:07 2014-10-20 13:26 bug 72296 ULSFO connectivity issues causing duplicates and missing requests worth ~3 minutes of data.
2014-10-21 11:41 2014-10-21 12:00 bug 72352 ULSFO connectivity issues causing duplicates and missing requests worth ~80 seconds of data.
2014-10-27T07:12:29 2014-10-27T07:12:32 bug 72550 analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <<1 second for the text cluster.
2014-11-23 15:16 2014-11-23 15:28 No bugzilla, no bug :-( ULSFO connectivity issues causing duplicates worth ~10 minutes of data and missing requests worth ~15 seconds of data.
2014-12-04 16:22:36 2014-12-04 16:26:55 task T85312 analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth ~30 seconds of total traffic
2014-12-10 14:18 2014-12-10 14:18 task T85675 analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth ~1 second of total traffic
2014-12-10 15:27 2014-12-10 15:27 task T85675 Leader re-election brought analytics1021 back into set of partition leaders. No duplicates, but missing lines worth <1 second of total traffic
2014-12-11 14:54:33 2014-12-11 14:54:35 task T85712 analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <1 second of total traffic
2014-12-26 06:02:18 2014-12-26 06:02:20 task T85709 analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <1 second of total traffic
2014-12-29 17:23:21 2014-12-29 17:45:22 task T85695 Broken varnishkafka configuration got picked up by three mobile caches and caused missing data worth 50 seconds of total traffic.
2015-01-03 10:21:12 2015-01-03 10:21:14 task T85758 analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <1 second of total traffic
2016-08-15 today task T130656 pagecounts-raw and pagecounts-all-sites are no longer published: https://lists.wikimedia.org/pipermail/analytics/2016-August/005339.html

Idiosyncrasies

Capitalization is split up

Some requests look as though they were made to EN.WIKIPEDIA.ORG... To stay compatible with the original files, we separate the counts per project and per different capitalization. So, for example, you might see:

en - 12345 123456
EN - 1 1234
En - 89 12345

And although the lowercase en entry is the main one and will have most requests, there are other requests to English Wikipedia hiding in these other entries.
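If the split by capitalization is not wanted, the entries can be folded together by lowercasing the domain_code before summing. The sketch below is one possible way to do that, with a hypothetical function name; it leaves page titles untouched, since titles are case-sensitive.

```python
from collections import defaultdict


def merge_capitalization(lines):
    """Merge lines whose domain_code differs only in capitalization.

    Returns a dict mapping (lowercased domain_code, page_title) to
    (count_views, total_response_size).
    """
    merged = defaultdict(lambda: [0, 0])
    for line in lines:
        domain_code, title, views, size = line.rstrip("\n").split(" ")
        key = (domain_code.lower(), title)
        merged[key][0] += int(views)
        merged[key][1] += int(size)
    return {key: tuple(value) for key, value in merged.items()}


project_lines = [
    "en - 12345 123456",
    "EN - 1 1234",
    "En - 89 12345",
]
print(merge_capitalization(project_lines))
# {('en', '-'): (12435, 137035)}
```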

Note

  1. Note that this extension to the mobile and zero sites does not solve the long-standing issues with webstatscollector's pageview definition. It is more of a stop-gap measure, and comes with all the issues of webstatscollector's pageview definition.
  2. Hence, the “project” in projectcounts is somewhat of a misnomer, but it is kept for historical compatibility.