ORES/Guide: Visualizing the change in Wikidata Item quality

Deprecation warning: The ORES infrastructure has been deprecated and is no longer present or maintained in the WMF infrastructure.

The change in the quality of Wikidata items over time can be nicely visualized with Sankey diagrams as shown in T261331: Metrics: Get the quality score changes per month for at least 2019 and 2020 in a sankey diagram based on the new model.

There are three main steps necessary to create those diagrams:

Create snapshots from the Wikidata database dumps
Create a file listing the change between those snapshots
Visualize those changes in a Sankey or Alluvial diagram

1. Create Snapshots from Wikidata database dumps

Since the dumps are very large (~1 TB) and take a long time to process, it makes sense to perform this task either on a production stats machine such as stat1007 or on Toolforge.

The scores for the snapshots can be extracted from a dump with extract_scores.py:

month=$(date +"%Y%m")
day="${month}01"

./utility extract_scores /mnt/data/xmldatadumps/public/wikidatawiki/${day}/wikidatawiki-${day}-pages-articles[1234567890]?*.xml-*.bz2 --model models/wikidatawiki.item_quality.gradient_boosting.model --sunset ${day}000000 --processes=10 --score-at monthly --class-weight '"E"=1' --class-weight '"D"=2' --class-weight '"C"=3' --class-weight '"B"=4' --class-weight '"A"=5' --verbose > run_${month}.out 2> run_${month}.err

grep ${month} run_${month}.out > wikidata_quality_snapshot_${month}.tsv
rm -f run_${month}.out
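
Before comparing snapshots, it can be useful to sanity-check a freshly created file by counting how many rows fall into each quality class. The following is a minimal sketch, not part of the original tooling. The function name count_classes is hypothetical, and the column position of the quality class (index 4) is an assumption taken from the comparison script in section 2.

```python
from collections import Counter


def count_classes(path, class_column=4):
    """Count the rows of a tab-separated snapshot file per quality class.

    class_column=4 is an assumption based on the comparison script in
    section 2; adjust it if the snapshot layout differs.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Skip rows that are too short to contain a class column.
            if len(fields) > class_column:
                counts[fields[class_column]] += 1
    return counts
```

Calling count_classes("wikidata_quality_snapshot_202001.tsv") would return a Counter keyed by class label; a class that is missing entirely or has an implausibly small count suggests the extraction did not finish cleanly.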

Make Snapshots available

On the stats machine mentioned above, gzip and copy the monthly snapshots to /srv/published/datasets/wmde-analytics-engineering/Wikidata/WD_QualitySnapshots. This will make them available to the public at: https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/WD_QualitySnapshots/
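
The gzip-and-copy step can also be scripted. The sketch below is one possible way to do it with the Python standard library; the function name publish_snapshots and the file name pattern (derived from the snapshot file name used in section 1) are assumptions, and the target directory is passed in by the caller.

```python
import gzip
import shutil
from pathlib import Path


def publish_snapshots(source_dir, target_dir,
                      pattern="wikidata_quality_snapshot_*.tsv"):
    """Gzip every matching snapshot in source_dir into target_dir.

    The pattern is an assumption based on the file name produced in
    section 1. Returns the list of compressed files written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for tsv in sorted(Path(source_dir).glob(pattern)):
        destination = target / (tsv.name + ".gz")
        # Stream the file through gzip so large snapshots are never
        # loaded into memory at once.
        with open(tsv, "rb") as src, gzip.open(destination, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(destination)
    return written
```

On the stats machine, target_dir would be the published datasets directory named above; the original .tsv files are left in place.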

2. Compare Snapshots

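To create the diagram, we need a dataset describing the changes between two snapshots. The small Python script below has been used to do that; it takes the path of the earlier snapshot as its first argument and the path of the later snapshot as its second: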

import sys

# Quality classes, ordered from lowest (E) to highest (A).
types = ['E', 'D', 'C', 'B', 'A']


def read_snapshot(path):
    """Return a dict mapping each quality class to the set of item IDs in it."""
    classes = {type_: set() for type_ in types}
    with open(path, 'r') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            # Column 4 holds the quality class, column 1 the item ID.
            # The one-character prefix of the ID is dropped ("Q42" -> 42).
            classes[fields[4]].add(int(fields[1][1:]))
    return classes


before_result = read_snapshot(sys.argv[1])
after_result = read_snapshot(sys.argv[2])

# result[old][new] counts the items that moved from class `old` to `new`.
result = {type_: {} for type_ in ['A', 'B', 'C', 'D', 'E', 'New']}

for old_type in before_result:
    for qid in before_result[old_type]:
        # Fast path: the item kept its class.
        if qid in after_result[old_type]:
            result[old_type][old_type] = result[old_type].get(old_type, 0) + 1
            continue
        for new_type in types:
            if qid in after_result[new_type]:
                result[old_type][new_type] = result[old_type].get(new_type, 0) + 1
                break
        else:
            # The item is absent from the later snapshot.
            result[old_type]['Deleted'] = result[old_type].get('Deleted', 0) + 1

# Items in the later snapshot that did not come from any earlier class are new.
for type_ in types:
    carried_over = 0
    for old_type in types:
        carried_over += result[old_type].get(type_, 0)
    result['New'][type_] = len(after_result[type_]) - carried_over

print(result)

(Original pastebin: https://paste.ubuntu.com/p/pfv8PRzw9q/)
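
The same comparison can be expressed more compactly by mapping each item ID to its class and counting (old class, new class) pairs. The sketch below is an alternative formulation, not the script that was actually used. The function names are hypothetical, the column indices are taken from the script above, and unlike the set-based version it keeps only the last row seen for an item ID that occurs more than once in a snapshot.

```python
from collections import Counter


def load_classes(path, id_column=1, class_column=4):
    """Map item ID -> quality class for one snapshot file.

    Column indices follow the script above. If an item ID occurs more
    than once, the last row wins.
    """
    classes = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            classes[fields[id_column]] = fields[class_column]
    return classes


def count_transitions(before, after):
    """Count (old class, new class) pairs between two ID -> class dicts.

    Items missing from `after` are counted as ("<old>", "Deleted");
    items missing from `before` are counted as ("New", "<new>").
    """
    transitions = Counter()
    for qid, old_class in before.items():
        transitions[(old_class, after.get(qid, "Deleted"))] += 1
    for qid, new_class in after.items():
        if qid not in before:
            transitions[("New", new_class)] += 1
    return transitions
```

For example, count_transitions(load_classes(before_path), load_classes(after_path)) yields one counter entry per arrow in the eventual diagram, where before_path and after_path are two snapshot files.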

3. Visualize Change

The generated data can be loaded into rawgraphs.io to produce the desired alluvial diagrams. See also their tutorial.
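
The comparison script prints a nested dictionary, whereas a charting tool usually expects tabular input. The sketch below flattens such a dictionary into CSV with one row per transition. The function name and the column names before, after and count are hypothetical, and how these columns are mapped onto the chart (for instance the two label columns as steps and count as size) is an assumption that should be checked against the tool's own tutorial.

```python
import csv
import io


def transitions_to_csv(result):
    """Flatten {old_class: {new_class: count}} into CSV text.

    The header names are illustrative; rename them as needed.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["before", "after", "count"])
    for before, targets in result.items():
        for after, count in targets.items():
            writer.writerow([before, after, count])
    return buffer.getvalue()
```

The returned text can be written to a file or pasted directly into the data input of the charting tool.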