ORES/Guide: Visualizing the change in Wikidata Item quality

From Wikitech

The change in the quality of Wikidata items over time can be visualized with Sankey diagrams, as shown in Phabricator task T261331 ("Metrics: Get the quality score changes per month for at least 2019 and 2020 in a sankey diagram based on the new model").

There are three main steps necessary to create those diagrams:

  1. Create snapshots from the Wikidata database dumps
  2. Create a file listing the change between those snapshots
  3. Visualize those changes in a Sankey or Alluvial diagram

1. Create Snapshots from Wikidata database dumps

Since the dumps are very large (~1TB) and take a long time to process, it makes sense to perform this task either on a production stats machine such as stat1007 or on Toolforge.

The scores for the snapshots can be extracted from a dump with extract_scores.py:

month=$(date +"%Y%m")
day="${month}01"

./utility extract_scores /mnt/data/xmldatadumps/public/wikidatawiki/${day}/wikidatawiki-${day}-pages-articles[1234567890]?*.xml-*.bz2 --model models/wikidatawiki.item_quality.gradient_boosting.model --sunset ${day}000000 --processes=10 --score-at monthly --class-weight '"E"=1' --class-weight '"D"=2' --class-weight '"C"=3' --class-weight '"B"=4' --class-weight '"A"=5' --verbose > run_${month}.out 2> run_${month}.err

grep ${month} run_${month}.out > wikidata_quality_snapshot_${month}.tsv
rm -f run_${month}.out


Make Snapshots available

On the stats machine mentioned above, gzip and copy the monthly snapshots to /srv/published/datasets/wmde-analytics-engineering/Wikidata/WD_QualitySnapshots. This will make them available to the public at: https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/WD_QualitySnapshots/
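Before publishing a snapshot, it can be useful to check how many items fall into each quality class. The following is a minimal sketch, not part of the original workflow. It assumes the column layout implied by the comparison script in the next section (quality class in the fifth tab-separated column), and the function name `class_distribution` is hypothetical.

```python
import gzip
from collections import Counter


def class_distribution(path):
    """Count items per quality class in a gzipped snapshot TSV.

    Assumption: the predicted class sits in the fifth tab-separated column.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # skip malformed or truncated lines
            counts[fields[4]] += 1
    return counts
```

Calling `class_distribution` on a gzipped snapshot file returns a `Counter` keyed by class letter, which can be compared against the previous month to spot obviously incomplete runs.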

2. Compare Snapshots

To create the diagram, we need a dataset describing the change between two snapshots. The small Python script below has been used to do that. It takes the earlier snapshot as its first argument and the later snapshot as its second:

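import sys

# Usage: python <this script> <earlier snapshot TSV> <later snapshot TSV>

# Item IDs per quality class in the earlier snapshot
before_result = {
    'A': set(),
    'B': set(),
    'C': set(),
    'D': set(),
    'E': set()
}
# Item IDs per quality class in the later snapshot
after_result = {
    'A': set(),
    'B': set(),
    'C': set(),
    'D': set(),
    'E': set()
}
# Quality classes, ordered from lowest (E) to highest (A)
types = ['E', 'D', 'C', 'B', 'A']

# Column 2 holds the item ID with a one-letter prefix (e.g. Q42),
# column 5 holds the quality class.
with open(sys.argv[1], 'r') as f:
    for line in f:
        line = line.replace('\n', '').split('\t')
        before_result[line[4]].add(int(line[1][1:]))

with open(sys.argv[2], 'r') as f:
    for line in f:
        line = line.replace('\n', '').split('\t')
        after_result[line[4]].add(int(line[1][1:]))

# result[before_class][after_class] = number of items that moved
# from before_class to after_class
result = {
    'A': {},
    'B': {},
    'C': {},
    'D': {},
    'E': {},
    'New': {}
}

for type_ in before_result:
    for qid in before_result[type_]:
        # Fast path: the item kept its class
        if qid in after_result[type_]:
            result[type_][type_] = result[type_].get(type_, 0) + 1
            continue
        # Otherwise look for the class the item moved to
        for type__ in types:
            if qid in after_result[type__]:
                result[type_][type__] = result[type_].get(type__, 0) + 1
                break
        else:
            # Not found in any class of the later snapshot
            result[type_]['Deleted'] = result[type_].get('Deleted', 0) + 1

# Items in the later snapshot that did not come from any earlier class are new
for type_ in types:
    type_n = 0
    for type__ in types:
        type_n += result[type__].get(type_, 0)
    result['New'][type_] = len(after_result[type_]) - type_n

print(result)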

(Original pastebin: https://paste.ubuntu.com/p/pfv8PRzw9q/)
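To make the shape of the output concrete, the same counting idea can be sketched on small in-memory data. This is an alternative formulation, not the original script: it uses plain `{item_id: class}` dictionaries instead of per-class sets, and the function name `count_transitions` and the toy IDs are hypothetical.

```python
from collections import defaultdict


def count_transitions(before, after):
    """Count class transitions between two {item_id: class} snapshots."""
    result = defaultdict(lambda: defaultdict(int))
    for qid, old_class in before.items():
        # Items missing from the later snapshot are counted as deleted.
        new_class = after.get(qid, "Deleted")
        result[old_class][new_class] += 1
    for qid, new_class in after.items():
        if qid not in before:
            # Items that only exist in the later snapshot are new.
            result["New"][new_class] += 1
    # Convert nested defaultdicts to plain dicts for readable printing.
    return {src: dict(dst) for src, dst in result.items()}


# Toy data for illustration only.
before = {1: "E", 2: "E", 3: "D", 4: "C"}
after = {1: "E", 2: "D", 3: "D", 5: "E"}
print(count_transitions(before, after))
```

Unlike the script above, this sketch omits classes and transitions with a count of zero, so downstream code should use `.get(..., 0)` when reading the result.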

3. Visualize Change

The generated data can be used with rawgraphs.io to produce the desired alluvial diagrams. See also their tutorial.
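RAWGraphs works on tabular data, so the nested dictionary printed by the comparison script has to be flattened first. A minimal sketch of one way to do this, assuming one row per (before, after) pair; the function name `result_to_csv`, the column names `before`, `after` and `count`, and the sample numbers are made up for illustration.

```python
import csv
import io


def result_to_csv(result):
    """Flatten {before: {after: count}} into CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["before", "after", "count"])
    for before in sorted(result):
        for after in sorted(result[before]):
            count = result[before][after]
            if count > 0:  # zero-width flows add nothing to the diagram
                writer.writerow([before, after, count])
    return buf.getvalue()


# Illustrative numbers only, not real Wikidata results.
sample = {
    "E": {"E": 5, "D": 2, "Deleted": 1},
    "D": {"D": 3, "C": 1},
    "New": {"E": 4, "D": 0},
}
print(result_to_csv(sample))
```

Since the comparison script prints a Python dictionary literal, its output can be read back with `ast.literal_eval` before being passed to a function like this. In RAWGraphs, the two class columns would then presumably be mapped to the steps of the alluvial diagram and the count column to the flow size.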