Analytics/AQS/Wikistats 2/DataQuality/VettingPerProjectFamilies


Around the end of 2018 we added support for project families (all wikipedias, all wiktionaries, all wikivoyages...) to a number of contributor-related metrics in Wikistats 2. To verify that the numbers for these metrics are as close as possible to reality we compared them to the canonical source of statistics for Wikipedias (the original version of Wikimedia Statistics). This study is similar to the one conducted at the beginning of 2018 for wiki projects, and is equally restricted to the Wikipedia project family.

Three metrics were vetted: edits, average new articles per day, and total article count. Two other metrics were excluded from this report: New Registered Users is not available in Wikistats 1, and Editors by Activity Level is not yet available in Wikistats 2.

Data comparisons show no significant differences from the findings of the per-project study, with a few caveats described below.

Summary by metric

  • Edits: Taking Nostalgia Wikipedia into account, the numbers remain consistent, including the underreporting of edits in 2004 and 2005.
  • Average of new articles per day: Data is almost the same in Wikistats 1 and Wikistats 2, except for the last two years, when Wikistats 1 wasn't reporting a number of wikis.
  • Total article count: The metric in Wikistats 1 corresponds to content-only pages to date in Wikistats 2.

Analysis

Definition of project families

We consider a project family to be any of the large Wikimedia projects that has a separate version for each language. Sites like wikidata, mediawiki, meta-wiki and wikitech are therefore not considered project families. This is the complete list of project families that can be queried with Wikistats 2:

  • Wikipedia
  • Wikiquote
  • Wikibooks
  • Wiktionary
  • Wikisource
  • Wikiversity
  • Wikivoyage
  • Wikinews

Wikistats 1 only has data for the "Wikipedia" family, so this is the only family for which we can attempt to vet the data.

Obtaining Wikistats 2 data

Tabular data for each of the Wikistats 2 metrics was obtained through the Wikistats 2 UI, selecting the applicable breakdowns as described below (content-only new pages, content-only edits, etc.) and pressing the download button. These files were then imported into the Google Sheets worksheet linked at the bottom of this page.

The Nostalgia Wikipedia case

For all metrics explained below, the January 2001 - January 2002 period shows consistent over-reporting in Wikistats 2, even after taking into account the variation described in the per-project vetting report. After some digging, we concluded that Nostalgia Wikipedia (nostalgiawiki) was never added to Wikistats 1: its content is just a snapshot of English Wikipedia as of December 2001, so all of its editing activity repeats that of enwiki in that period.
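The comparisons below subtract Nostalgia Wikipedia from the Wikistats 2 family totals before comparing them with Wikistats 1. A minimal sketch of that adjustment follows, assuming monthly counts keyed by a "YYYY-MM" string. The function name `subtract_wiki` and all numbers are hypothetical and serve only as illustration; the study itself did this in a spreadsheet.

```python
def subtract_wiki(family_totals, wiki_counts):
    """Return the family totals with one wiki's counts removed, month by month.

    Months missing from wiki_counts are treated as zero for that wiki.
    """
    return {
        month: total - wiki_counts.get(month, 0)
        for month, total in family_totals.items()
    }


# Made-up illustrative numbers, not real Wikistats data.
all_wikipedias = {"2001-11": 1200, "2001-12": 1500, "2002-02": 900}
nostalgiawiki = {"2001-11": 400, "2001-12": 500}

adjusted = subtract_wiki(all_wikipedias, nostalgiawiki)
print(adjusted)
```

The adjusted series, rather than the raw family total, is what would be compared against the Wikistats 1 numbers.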

Edits

Metric definition

Metric page in Wikistats 1

Metric page in Wikistats 2

The numbers on this page in Wikistats 1 correspond to content-only edits in Wikistats 2. Taking Nostalgia Wikipedia into account, the numbers remain consistent, including the underreporting of edits in 2004 and 2005.

All Wikipedia edits - Difference3.png

Editswikipedia200220182.png

Number of edits for All Wikipedias reported by Wikistats 1 vs reported by Wikistats 2 minus Nostalgia Wikipedia


Average of new articles per day

Metric definition

Metric page in Wikistats 1

Metric page in Wikistats 2

The numbers on this page in Wikistats 1 correspond to content-only new pages in Wikistats 2.
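Wikistats 1 reports this metric as a daily average, so a monthly count of new pages has to be converted before the two can be compared. The sketch below assumes the daily average is simply the monthly total divided by the number of days in that month. This is an assumption; the source does not state how the conversion was done. The function name `average_per_day` and the sample values are hypothetical.

```python
import calendar


def average_per_day(month, monthly_count):
    """Convert a monthly count into a per-day average.

    month is a "YYYY-MM" string; the divisor is the real number of
    days in that month, so leap years are handled.
    """
    year, mon = (int(part) for part in month.split("-"))
    days_in_month = calendar.monthrange(year, mon)[1]
    return monthly_count / days_in_month


# Made-up illustrative counts, not real Wikistats data.
print(average_per_day("2004-02", 2900))  # February 2004 has 29 days
print(average_per_day("2018-12", 3100))  # December has 31 days
```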

Vetting avg new pages200220182.png

Present-day variation

In the graph above, the closer the date is to the present, the more Wikistats 2 overestimates creations compared to Wikistats 1. This is because Wikistats 1 tracked 278 Wikipedias by the end of its lifetime, while the Data Lake contains data for 313.

Unreported wikis: aa, ady, azb, bat-smg, be-tarask, be-x-old, cbk-zam, cho, commons, din, dty, gag, gor, ho, hz, ii, inh, jam, kbp, kj, kr, lfn, lrc, map-bms, mh, mus, nds-nl, ng, nostalgia, olo, pfl, roa-tara, sat, shn, ten, test, test2, wg-en, xmf, zh-classical, zh-yue.
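One way to check that the gap comes from the unreported wikis is to recompute the family total over only the wikis that both systems cover. The sketch below assumes per-wiki counts are available as a dictionary. The function name `family_total`, the abbreviated exclusion set, and all counts are hypothetical.

```python
def family_total(per_wiki_counts, excluded=frozenset()):
    """Sum per-wiki counts, skipping any wiki listed in excluded."""
    return sum(
        count
        for wiki, count in per_wiki_counts.items()
        if wiki not in excluded
    )


# A few entries from the unreported list, for illustration only.
unreported = {"azb", "nostalgia", "zh-yue"}

# Made-up per-wiki counts for a single month, not real Wikistats data.
per_wiki = {"en": 5000, "de": 2000, "azb": 300, "zh-yue": 50}

full = family_total(per_wiki)
comparable = family_total(per_wiki, excluded=unreported)
print(full, comparable)
```

The difference between `full` and `comparable` is the share of the total contributed by wikis that Wikistats 1 did not report.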

Full data for unreported wikis here

Screen Shot 2019-02-19 at 6.37.09 PM.png

Total article count

Metric definition

Metric page in Wikistats 1

Metric page in Wikistats 2

Variation on this metric is extremely high in the 2001-2003 period, but from then on stays consistently well below 1%. The metric in Wikistats 1 corresponds to content-only pages to date in Wikistats 2.

Fullrange on var pagestodate3.png

Variation pagestodate fullrange 200320183.png

NewarticlesnotreportedforAllWikipedias.png

Calculations and additional info of interest

This study was conducted using the January 2019 MediaWiki history snapshot. Unless stated otherwise, the date range of the report's data is January 2001 to December 2018, when the last computation batch was done in Wikistats 1. All variation numbers are expressed as the percentage difference between the numbers reported by Wikistats 2 and Wikistats 1.
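As a rough illustration of the variation figure, the sketch below computes the percentage difference relative to the Wikistats 1 value. Using Wikistats 1 as the baseline is an assumption: the source only says the variation is a percentage between the two reported numbers. The function name `variation_pct` and the sample values are hypothetical.

```python
def variation_pct(wikistats2, wikistats1):
    """Percentage difference of the Wikistats 2 value relative to Wikistats 1.

    Positive means Wikistats 2 reports more than Wikistats 1.
    Assumes Wikistats 1 is the baseline (not confirmed by the source).
    """
    if wikistats1 == 0:
        raise ValueError("variation is undefined when the Wikistats 1 value is 0")
    return 100 * (wikistats2 - wikistats1) / wikistats1


# Made-up illustrative values, not real Wikistats data.
print(variation_pct(1010, 1000))  # Wikistats 2 slightly higher
print(variation_pct(990, 1000))   # Wikistats 2 slightly lower
```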

Google Drive worksheet for this study

Go to the main article on per project metrics editing data quality.