Analytics/AQS/Legacy Pagecounts

From Wikitech

This page documents a public API developed and maintained by the Wikimedia Foundation that serves analytical data about pagecounts of Wikipedia and its sister projects. Pagecount is the legacy definition of what we now call "Pageview". This API makes available the pagecounts aggregated per project from January 2008 to July 2016. The main difference between pagecounts and the current pageview data is that pagecounts do not filter out self-reported bots, so automated and human traffic are reported together.

Quick start

Pagecounts

Daily counts

Get a daily pagecount timeseries of en.wikipedia.org for the month of October 2010:

GET https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/en.wikipedia.org/all-sites/daily/2010100100/2010103100
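The same request can be issued from a script. The following is a minimal sketch using only the Python standard library. The function names and the User-Agent string are placeholders, and the response shape (an "items" list whose entries carry "timestamp" and "count" fields) is an assumption that should be checked against the AQS RESTBase docs.

```python
import json
import urllib.request

URL = (
    "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/"
    "en.wikipedia.org/all-sites/daily/2010100100/2010103100"
)


def parse_counts(body):
    """Map each timestamp to its count.

    Assumes the body is JSON shaped like
    {"items": [{"timestamp": "2010100100", "count": 123, ...}, ...]}.
    """
    payload = json.loads(body)
    return {item["timestamp"]: item["count"] for item in payload.get("items", [])}


def fetch_counts(url=URL):
    """Perform the GET request and return {timestamp: count}."""
    # Placeholder User-Agent: replace it with one that identifies your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "pagecounts-example/0.1 (you@example.org)"}
    )
    with urllib.request.urlopen(request) as response:
        return parse_counts(response.read().decode("utf-8"))
```

Calling fetch_counts() performs the network request; parse_counts() can be exercised offline with a saved response body.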

Monthly counts

Get a monthly pagecount timeseries of de.wikipedia.org from 2010 to 2012:

GET https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/de.wikipedia.org/all-sites/monthly/2010010100/2013010100

Get a monthly pagecount timeseries of de.wikipedia.org from 2010 to 2012 (mobile data only):

GET https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/de.wikipedia.org/mobile-site/monthly/2010010100/2013010100

Get a monthly pagecount timeseries of de.wikipedia.org from 2010 to 2012 (desktop data only):

GET https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/de.wikipedia.org/desktop-site/monthly/2010010100/2013010100
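Note that the monthly requests above end at 2013010100 even though the last month wanted is December 2012. As the Gotchas section explains, the end timestamp is currently exclusive at monthly granularity. A small helper can compute that boundary from the last month to include. This is only a sketch; monthly_range is a hypothetical name, not part of the API.

```python
def monthly_range(first_year, first_month, last_year, last_month):
    """Return (start, end) timestamps covering first..last month inclusive.

    The end timestamp is the first day of the month AFTER the last wanted
    month, because the monthly granularity currently treats it as exclusive.
    """
    if not (1 <= first_month <= 12 and 1 <= last_month <= 12):
        raise ValueError("months must be in 1..12")
    if (last_year, last_month) < (first_year, first_month):
        raise ValueError("last month precedes first month")
    # Roll over to January of the next year after December.
    if last_month == 12:
        end_year, end_month = last_year + 1, 1
    else:
        end_year, end_month = last_year, last_month + 1
    start = f"{first_year:04d}{first_month:02d}0100"
    end = f"{end_year:04d}{end_month:02d}0100"
    return start, end


start, end = monthly_range(2010, 1, 2012, 12)
print(start, end)  # 2010010100 2013010100
```

If the inclusive/exclusive discrepancy is fixed in the API, a helper like this would need to change accordingly.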

Pagecounts for all projects combined

Get a monthly pagecount timeseries for all projects, all sites:

GET https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/all-projects/all-sites/monthly/2007120918/2017040100

Get a monthly pagecount timeseries for all projects, all sites (desktop data only):

GET https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/all-projects/desktop-site/monthly/2007120918/2017040100

The API

What is it?

The API is a collection of REST endpoints that serve analytical data about pagecounts in Wikimedia projects. It is developed and maintained by WMF's Analytics and Services teams, and is implemented using Analytics' Hadoop cluster and RESTBase. It is meant for anyone interested in legacy pagecount statistics on Wikimedia wikis: the Foundation, the communities, and the rest of the world.

How to access

The API is accessible over HTTPS at wikimedia.org/api/rest_v1. As it is public, it requires no authentication, and it supports CORS. URLs are structured like this:

/metrics/legacy/pagecounts/{endpoint}/{parameter 1}/{parameter 2}/.../{parameter N}
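As an illustration of this structure, the sketch below assembles a request URL in Python. build_url is a hypothetical helper, and the choice of "aggregate" as the endpoint followed by project, site, granularity, start and end is inferred from the quick start examples rather than from a formal specification.

```python
from urllib.parse import quote

BASE_URL = "https://wikimedia.org/api/rest_v1"


def build_url(endpoint, *parameters):
    """Join an endpoint and its positional parameters into a request URL."""
    segments = [endpoint, *parameters]
    # Percent-encode each segment so a stray "/" cannot change the path shape.
    path = "/".join(quote(str(segment), safe="") for segment in segments)
    return f"{BASE_URL}/metrics/legacy/pagecounts/{path}"


url = build_url(
    "aggregate", "en.wikipedia.org", "all-sites", "daily", "2010100100", "2010103100"
)
print(url)
```

This reproduces the daily quick start URL for en.wikipedia.org.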

Technical Reference

Please see AQS's RESTBase docs for a complete, interactive technical reference on the API endpoints.

Updates and backfilling

This data is historical: it was loaded once and has not been updated since. Data for 2008 to 2016 may still receive corrections, but nothing will ever be added after July 2016. For newer data, refer to the improved Analytics/AQS/Pageviews metric.

Gotchas

404 may mean zero 
At some point you may get a 404 Not Found response from the API. Sometimes this means that there are 0 pagecounts for the project, timespan and filters specified in the query. For implementation reasons, the API cannot distinguish between actual zeros and data that has not yet been loaded into the database. For now, it is up to the user to handle this case.
404s within timeseries 
Because of the same caveat (404 may mean zero), if you request a timeseries from the API, you might get no data for the dates that have 0 pagecounts. This can leave holes in the timeseries and break charting libraries. For now, it is up to the user to detect the holes and fill in the missing zeros.
429 throttling 
The client has made too many requests and is being throttled. This happens when the storage layer cannot keep up with the request rate from a given IP. Throttling is enforced at the storage layer, so requests for data already in the cache (because another client requested it earlier) are not throttled.
End timestamp inclusive or exclusive? 
For now, the end timestamp in the hourly and daily endpoints is inclusive. So, for hourly, if your end timestamp is 2012010100, the last item in the results will contain data for hour 0 of Jan 1st, 2012. Similarly, for daily, if your end timestamp is 2012010100, the last item's date will be Jan 1st, 2012. However, at monthly granularity, if your end timestamp is 2012010100, the last item in the results will correspond to Dec 2011 (exclusive). We plan to fix this discrepancy ([task]), but for now this is how it behaves.
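A client can handle the 404, missing-zero and 429 gotchas along the following lines. This is a sketch under stated assumptions, not a definitive implementation: the helper names are hypothetical, and the response shape (an "items" list with "timestamp" in YYYYMMDDHH form and "count") is assumed and should be verified against the AQS RESTBase docs.

```python
import json
from datetime import date, timedelta


class Throttled(Exception):
    """Raised when the API answers 429 and the caller should back off."""


def items_from_response(status, body):
    """Turn an HTTP status and body into a list of items.

    A 404 is treated as "no data", which may mean zero or not yet loaded.
    Assumes a 200 body looks like {"items": [{"timestamp": ..., "count": ...}]}.
    """
    if status == 404:
        return []
    if status == 429:
        raise Throttled("too many requests; wait before retrying")
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status}")
    return json.loads(body).get("items", [])


def fill_daily_gaps(items, first_day, last_day):
    """Return [(timestamp, count)] for every day in the inclusive range.

    Days missing from `items` are reported with a count of 0.
    """
    counts = {item["timestamp"]: item["count"] for item in items}
    series = []
    day = first_day
    while day <= last_day:
        timestamp = day.strftime("%Y%m%d") + "00"  # assumed YYYYMMDDHH format
        series.append((timestamp, counts.get(timestamp, 0)))
        day += timedelta(days=1)
    return series


body = '{"items": [{"timestamp": "2010100100", "count": 7}, {"timestamp": "2010100300", "count": 9}]}'
items = items_from_response(200, body)
print(fill_daily_gaps(items, date(2010, 10, 1), date(2010, 10, 3)))
# [('2010100100', 7), ('2010100200', 0), ('2010100300', 9)]
```

Filling with zeros keeps charting code simple, but it hides the difference between a true zero and data that was never loaded, so a caller that cares about that distinction may prefer to mark gaps explicitly.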

Clients

Here are a few clients already available:

Changes and known problems since December 2007

Date from       Date until    Task           record_version   Details
December 2007   end of data   Task T162157   *                No data for metawiki, too many quality issues

See also