Analytics/AQS/Wikistats 2/Data Quality/VettingPerProject

This page presents the data quality checks we have made for AQS-Wikistats.

We compared the metrics generated by AQS with those published on http://stats.wikimedia.org, per project (except for new registered users; see the note in the metric descriptions below).

We have not yet compared aggregates such as "edits for all Wikipedias", because that data is not yet available in Wikistats 2. We plan to offer it in the future.

We also have not compared data for "all projects" (for example, "edits for all projects"), because that data is not available in legacy Wikistats.

Metric | Description
Edits | Number of edits per project, per month, by any user type (including bots), on pages belonging to the content namespaces, including redirects [1]
Average of new articles per day in a given month, reported monthly | Pages created per month divided by the number of days in that month, by any user type (including bots), on pages belonging to the content namespaces, excluding redirects
Active editors | Number of editors (registered, not bots) having made more than 5 edits per month on content pages
Very active editors | Number of editors (registered, not bots) having made more than 100 edits per month on content pages
New registered users | New registered users per project, per month. Note that the reference metric used for comparison is generated from the SQL database, as this data is not available in Wikistats.
Total article count | Cumulative sum of pages created per month by any user type, on pages belonging to the content namespaces, excluding redirects

[1] Edit numbers on Wikistats are rounded. To check our metrics we applied the same rounding pattern, but this can cause small differences to go unnoticed.
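
The metric definitions above can be illustrated with a minimal Python sketch. It is not the AQS implementation: the function names and sample numbers are hypothetical, and the inputs are assumed to be already restricted to the right pages and users (content namespaces, registered non-bot editors, and so on).

```python
import calendar
from itertools import accumulate


def average_new_articles_per_day(pages_created, year, month):
    """Pages created in a month divided by the number of days in that month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return pages_created / days_in_month


def total_article_count(pages_created_per_month):
    """Cumulative sum of monthly page creations, in chronological order."""
    return list(accumulate(pages_created_per_month))


def count_editors_above(edits_per_editor, threshold):
    """Count editors with more than `threshold` edits in the month.

    Assumes `edits_per_editor` maps each registered, non-bot editor to
    their number of edits on content pages for a single month.
    """
    return sum(1 for edits in edits_per_editor.values() if edits > threshold)


# Made-up sample data, for illustration only.
print(average_new_articles_per_day(580, 2016, 2))  # February 2016 has 29 days
print(total_article_count([10, 5, 7]))
monthly_edits = {"editor_a": 3, "editor_b": 6, "editor_c": 120}
print(count_editors_above(monthly_edits, 5))    # active editors
print(count_editors_above(monthly_edits, 100))  # very active editors
```

The strict comparison follows the "more than 5" and "more than 100" wording of the descriptions above.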

8 most viewed Wikipedias: English, Russian, German, Spanish, Japanese, French, Chinese and Italian

Metric | Results
Edits | The average monthly difference between AQS and Wikistats is less than 1%, except for Spanish Wikipedia, where it is 1.5%. Early months (before 2006) show higher percentage differences because the numbers are smaller, but we are happy with this result. In three of the projects (German, Spanish and Japanese Wikipedias), AQS tends to report higher numbers than Wikistats; this comes from the way Wikistats counts edits.
Average of new articles per day in a given month, reported monthly | The average monthly difference is also low (less than 1%), with some higher differences in the early months.
Active editors | The average monthly difference is very low for this metric, except for the Spanish, German and Japanese Wikipedias, which suggests that the difference is related to the one observed in edits.
Very active editors | The evaluation of this metric gives the same results as for active editors.
New registered users | AQS matches the original database data almost exactly. For all projects except Japanese Wikipedia, the difference is less than 0.1%. In Japanese Wikipedia, since 2015-07, AQS increasingly under-counts new users compared with the numbers reported by the database. This difference is due to many events in the logging table being marked as deleted, which prevents us from knowing whether a user is self-created or not.
Total article count | The average monthly difference is also low (less than 1%), with some higher differences in the early months.
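
The page does not spell out how the "average monthly difference" is computed. One plausible reading, sketched below as an assumption rather than the actual vetting code, takes the absolute percentage difference between the AQS value and the reference value for each month, then averages over the months present in both series. The function names and the sample numbers are hypothetical.

```python
def percent_difference(aqs_value, reference_value):
    """Signed difference of AQS relative to the reference, in percent."""
    return 100.0 * (aqs_value - reference_value) / reference_value


def average_monthly_difference(aqs_by_month, reference_by_month):
    """Mean absolute percentage difference over the months both series share.

    Months where the reference value is zero are skipped, since a relative
    difference is undefined there.
    """
    shared_months = sorted(set(aqs_by_month) & set(reference_by_month))
    differences = [
        abs(percent_difference(aqs_by_month[m], reference_by_month[m]))
        for m in shared_months
        if reference_by_month[m] != 0
    ]
    return sum(differences) / len(differences) if differences else 0.0


# Made-up monthly edit counts, for illustration only.
aqs = {"2016-01": 1010, "2016-02": 990, "2016-03": 500}
reference = {"2016-01": 1000, "2016-02": 1000}
print(average_monthly_difference(aqs, reference))
```

Using the absolute value keeps over-counts and under-counts from cancelling out; a signed average would instead reveal a systematic trend such as the one noted for the German, Spanish and Japanese Wikipedias.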

Other special big wiki projects: Wikidata and Commons

Metric | Results
Edits | The average monthly difference between AQS and Wikistats is 0.01% for Commons and 0% for Wikidata.
New articles per day | The average monthly difference is also very low for Commons (0.08%), but larger for Wikidata (5.3%; AQS under-counts because Wikistats counts redirects for this specific wiki).
Active editors | The average monthly difference is very low for both Commons and Wikidata (0.15% and 0.04% respectively).
Very active editors | The evaluation of this metric gives almost the same results as for active editors (0.04% average difference for Commons and 0.19% for Wikidata).
New registered users | AQS matches the original database data almost exactly (0.02% average difference for both).
Total article count | The average monthly difference is also very low for Commons (0.06%), but larger for Wikidata (5.3%; AQS under-counts because Wikistats counts redirects for this specific wiki; the difference is 0.01% when redirects are included).

A small Wikipedia project: Tibetan Wikipedia

Metric | Results
Edits | AQS matches the original Wikistats data exactly.
New articles per day | AQS matches the original Wikistats data exactly.
Active editors | There is a regular mismatch of 1 or 2 active editors (AQS reports higher numbers), which represents a high percentage difference because the base numbers are small.
Very active editors | AQS matches the original Wikistats data exactly.
New registered users | AQS matches the original database data almost exactly. For February and March 2012, we observe differences of -8 (AQS serves 17, expected 25) and -2 (AQS serves 18, expected 20) respectively.
Total article count | AQS matches the original Wikistats data exactly.
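
As a worked example of why small base numbers produce large percentage differences, the sketch below applies simple arithmetic to the two new-registered-user figures quoted above for Tibetan Wikipedia. The helper name is hypothetical, and expressing the difference relative to the expected value is an assumption.

```python
def difference(served, expected):
    """Return the absolute difference and the difference relative to expected, in percent."""
    absolute = served - expected
    percent = 100.0 * absolute / expected
    return absolute, percent


# Figures quoted above: AQS serves 17 (expected 25) and 18 (expected 20).
print(difference(17, 25))  # (-8, -32.0)
print(difference(18, 20))  # (-2, -10.0)
```

A gap of only a few users is a double-digit percentage here, whereas the same gap on a large wiki would be negligible.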

Other things

  • For early data (before October 2002), revisions have no user ID (it is set to 0), which prevents us from linking them correctly.
  • There are 9 Wikipedia projects present on stats.wikimedia.org that are not yet included in AQS:
    • zh-yue.wikipedia.org
    • be-x-old.wikipedia.org
    • zh-classical.wikipedia.org
    • bat-smg.wikipedia.org
    • nds-nl.wikipedia.org
    • map-bms.wikipedia.org
    • roa-tara.wikipedia.org
    • mo.wikipedia.org
    • cbk-zam.wikipedia.org