Data Platform/Systems/Wikistats 2/Metrics/FAQ
This page documents the reasons why metrics might differ between legacy Wikistats (http://stats.wikimedia.org) and Wikistats 2 (http://stats.wikimedia.org/v2/).
How does data differ between Wikistats 1 and Wikistats 2?
See Analytics/AQS/Wikistats_2/Data_Quality
Why are pageview numbers sometimes 0 and sometimes undefined?
This is tricky! Because the data is so big, we have to store multiple counts in one "cell" of our database. For example, for English Wikipedia's Cat page, we store counts for all combinations of how it was accessed (desktop/mobile app/mobile web/total) and by what type of user (user/bot/total). If a row has a number for at least one of those combinations, then we can be sure that the view count for the other combinations is 0, so we return 0. But if the row is missing entirely for a particular period of time, then we don't know whether there was a problem loading the data or whether the count is 0. In that case we return nothing, "undefined", or a 404, depending on where you're looking at the data.
Does the active editors metric include contributors who were blocked or whose contributions were deleted?
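The difference between a known zero and a missing row can be sketched as follows. This is only an illustration: it assumes a hypothetical in-memory store keyed by (page, day), and the names `lookup_views`, `Row`, and `Store` are invented here, not part of the real storage schema or API.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the database: one row ("cell") per (page, day),
# holding counts keyed by (access method, agent type). Not the real schema.
Row = Dict[Tuple[str, str], int]
Store = Dict[Tuple[str, str], Row]


def lookup_views(store: Store, page: str, day: str,
                 access: str, agent: str) -> Optional[int]:
    """Return a count, 0, or None following the rule described above."""
    row = store.get((page, day))
    if row is None:
        # The whole row is missing: we cannot tell "no views" apart from
        # "data failed to load", so report nothing (None / undefined / 404).
        return None
    # The row exists, so any combination absent from it is a known zero.
    return row.get((access, agent), 0)


store: Store = {("Cat", "2019-01-01"): {("desktop", "user"): 12}}

print(lookup_views(store, "Cat", "2019-01-01", "desktop", "user"))     # 12
print(lookup_views(store, "Cat", "2019-01-01", "mobile-web", "user"))  # 0
print(lookup_views(store, "Cat", "2019-01-02", "desktop", "user"))     # None
```

Returning `None` rather than 0 for a missing row keeps "we don't know" distinguishable from "we know it was zero".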
The Wikistats 2 editors metric (in total or split by various dimensions):
- does not include contributions on pages that have since been deleted
- does include revisions that have been suppressed or hidden, if the revision user has not been suppressed
- does include revisions that have been suppressed with suppressed revision users, but attributes those to the anonymous editors total
- does include contributions from editors that are/have been blocked. At the time of the contribution they would of course not have been blocked, so their contribution counts if the page was not deleted and the revision has not been suppressed.
For an explanation of deleted pages, see: https://www.mediawiki.org/wiki/Help:Deletion_and_undeletion
For an explanation of revision suppression and hiding, see: https://www.mediawiki.org/wiki/Manual:RevisionDelete
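The four rules listed above can be read as a small decision function. The sketch below is illustrative only: the `Revision` fields, the `ANONYMOUS` label, and `attributed_editor` are hypothetical names, not the actual Wikistats 2 data model or pipeline.

```python
from dataclasses import dataclass
from typing import Optional

ANONYMOUS = "<anonymous>"  # hypothetical label for the anonymous editors total


@dataclass
class Revision:
    user: str                      # name of the revision's author
    page_deleted: bool = False     # the page has since been deleted
    revision_hidden: bool = False  # revision suppressed or hidden, user still visible
    user_suppressed: bool = False  # the revision user itself was suppressed
    user_blocked: bool = False     # the editor is or has been blocked


def attributed_editor(rev: Revision) -> Optional[str]:
    """Return who the revision counts for, or None if it is not counted."""
    if rev.page_deleted:
        return None        # contributions on deleted pages are excluded
    if rev.user_suppressed:
        return ANONYMOUS   # counted, but attributed to anonymous editors
    # Hidden revisions and blocked editors do not change the attribution.
    return rev.user


print(attributed_editor(Revision("A", page_deleted=True)))     # None
print(attributed_editor(Revision("B", user_blocked=True)))     # B
print(attributed_editor(Revision("C", revision_hidden=True)))  # C
print(attributed_editor(Revision("D", user_suppressed=True)))  # <anonymous>
```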
Pages created: numbers for project X do not match the data on legacy Wikistats, where the numbers are lower
Legacy Wikistats (stats.wikimedia.org) doesn't include non-content pages in its stats. In Wikistats 2, you can tick the checkbox for splitting by page type and untick the 'non-content' box; the values should then match the ones in legacy Wikistats much more closely.
Total number of pages created as reported by Wikistats 2 is less than the total number of pages on the project
This is probably due to redirects: the new-page metric (as well as the edited-pages one) doesn't include redirect pages.
You can't combine two or more splits and filters
That's right: for now, by design, splits can't be combined. We are weighing two options going forward. One is to simply change the radio buttons to checkboxes. Another is to build a more advanced interface where you can compare multiple wikis, split by as many dimensions as are available, and so on. We will have a consultation about this once the dust from the initial Alpha release settles.
Are splits always exclusive?
For example, splitting by user type gives you bots identified by their user group ("group bots") and bots identified by having "bot" in their name ("name bots"). Both can be true of the same user, so do those numbers overlap? No: all of our splits are built so that they add up to the total. This makes the graphs easier to interpret, and we will add something to the UI explaining this and which split values have priority. In this example, a user who belongs to the bot user group is classified as a "group bot", even if their name is also "somethingBot".
How do we count anonymous users?
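The priority rule in this example can be sketched as a classifier that checks the higher-priority condition first, so every user lands in exactly one bucket. The function name, the bucket labels, and the simple substring check are assumptions made for illustration; the actual detection rules may differ.

```python
from collections import Counter
from typing import Set


def classify_user(name: str, groups: Set[str]) -> str:
    """Assign exactly one split value; earlier checks take priority."""
    if "bot" in groups:
        return "group-bot"  # membership in the bot user group wins
    if "bot" in name.lower():
        return "name-bot"   # simplified name check, assumed for this sketch
    return "user"


users = [
    ("somethingBot", {"bot"}),  # matches both rules, counted once as group-bot
    ("HelperBot", set()),
    ("Alice", {"sysop"}),
]
counts = Counter(classify_user(name, groups) for name, groups in users)

# Because each user gets a single label, the split values add up to the total.
print(sum(counts.values()) == len(users))
```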
It would be more accurate to count them by their User Agent + IP, especially to capture better numbers for mobile editing. However, we discard user agent strings very quickly from our databases for privacy reasons and therefore we only group anonymous users by their IP. This means we probably under-count by around 15% as of this writing.
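A toy example shows why grouping by IP alone can under-count. The addresses and user agent strings below are made up, and the helper names are hypothetical; this is not the production query.

```python
from typing import Iterable, List, Tuple

Edit = Tuple[str, str]  # (ip, user_agent) of one anonymous edit


def count_by_ip(edits: Iterable[Edit]) -> int:
    """Distinct anonymous editors when only the IP is available."""
    return len({ip for ip, _user_agent in edits})


def count_by_ip_and_ua(edits: Iterable[Edit]) -> int:
    """Distinct anonymous editors if the user agent were kept as well."""
    return len(set(edits))


edits: List[Edit] = [
    ("198.51.100.7", "UA-phone"),   # two devices behind the same IP
    ("198.51.100.7", "UA-laptop"),
    ("203.0.113.9", "UA-laptop"),
]

print(count_by_ip(edits))         # 2
print(count_by_ip_and_ua(edits))  # 3
```

Two people sharing one IP address collapse into a single editor under the IP-only grouping, which is the source of the under-count described above.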
Wikistats is Primarily for the Community
The original Wikistats was designed and maintained for years as an inspirational tool for contributors and community. In this spirit, we explicitly stated that our target audience is, in order of priority: contributors, community, and the press. Wikimedia Foundation staff were left off this list, and there was a discussion about it in our Round 1 consultation: mw:Wikistats_2.0_Design_Project/RequestforFeedback/Round1 (see also the related discussion). There are, of course, overlapping needs, and we are more than happy to consider WMF staff as contributors and community in their own right. But where the needs of managers to highlight certain metrics might go against the need of Wikistats to support and celebrate contributors and community, we will side with the latter. The reason is simple: our team, Analytics, spends most of its time meeting the needs of WMF staff with powerful internal-only tools, and Wikistats is an attempt to focus on a different perspective. For example, the data that powers Wikistats is the main value that we have created, and it's available to query in Hive, Druid, and Superset, with dashboards and regular report capabilities.
Why do pageviews API endpoints serve fresh data but edit API endpoints serve monthly data?
Both the edit data and pageview data come from the Hadoop-based Analytics Data Lake. However, because of limitations in the underlying MediaWiki application databases that are the source of the edit data, some complex reconstruction and de-normalization is required, and that takes several days to a week. This mostly affects the historical data, but the reconstruction currently has to be done for all history at once because historical data sometimes changes long after the fact in the MediaWiki databases. So, the entire dataset is regenerated every month, which would be impossible to do daily.
How does Wikistats handle deleted and restored pages?
Right now, we are ignoring some pages with deletes or complicated histories, for example pages that have been deleted and then restored. This is a complicated part of MediaWiki, and there are some efforts to fix the underlying problems (see https://phabricator.wikimedia.org/T196950). In the meantime, we will take another look at the delete/restore cycle and see if we can extract some clean data and metrics from it. Essentially, for the majority of Wikimedia projects' history, a deleted page lost its original ID when it was restored. As of a few years ago, this improved, and the page now retains its original ID, unless a new page was created with the same title in the meantime, in which case the restored page gets a new ID. This gets even more complicated when you restore pages and merge them with existing pages, sometimes partially. In summary, if we can extract meaningful data from this and publish it, we will, but right now our metrics exclude some pages that have gone through delete/restore.
The Editors metric changed in April 2019
Before April 2019, none of our metrics counted deleted revisions except the Editors one. We changed it to be consistent, and from April 2019 the Editors metric does not count deleted revisions either. This change results in a decrease of about 10%, depending on the wiki and the time period.
The wiki I am interested in does not appear in the drop-down menu
Metrics are not available for wikis that are labelled as "private" on the sitematrix. If, however, the wiki you are interested in is public and does not appear in the list, please file a Phabricator ticket.