ORES/Guide: Visualizing the change in Wikidata Item quality
The change in the quality of Wikidata items over time can be nicely visualized with Sankey diagrams as shown in T261331: Metrics: Get the quality score changes per month for at least 2019 and 2020 in a sankey diagram based on the new model.
There are three main steps necessary to create those diagrams:
- Create snapshots from the Wikidata database dumps
- Create a file listing the change between those snapshots
- Visualize those changes in a Sankey or Alluvial diagram
1. Create Snapshots from Wikidata database dumps
Since the dumps are very large (~1TB) and take a long time to process, it makes sense to perform this task either on a production stats machine such as stat1007 or on Toolforge.
The scores for the snapshots can be extracted from a dump with extract_scores.py:
# Current month and first-of-month day, used for the dump path and file names
month=$(date +"%Y%m")
day="${month}01"
# Run extract_scores over the dump files with the item quality model; stdout and stderr go to log files
./utility extract_scores /mnt/data/xmldatadumps/public/wikidatawiki/${day}/wikidatawiki-${day}-pages-articles[1234567890]?*.xml-*.bz2 --model models/wikidatawiki.item_quality.gradient_boosting.model --sunset ${day}000000 --processes=10 --score-at monthly --class-weight '"E"=1' --class-weight '"D"=2' --class-weight '"C"=3' --class-weight '"B"=4' --class-weight '"A"=5' --verbose > run_${month}.out 2> run_${month}.err
# Keep only the rows for the current month, then remove the raw output
grep ${month} run_${month}.out > wikidata_quality_snapshot_${month}.tsv
rm -f run_${month}.out
Make Snapshots available
On the stats machine mentioned above, gzip and copy the monthly snapshots to /srv/published/datasets/wmde-analytics-engineering/Wikidata/WD_QualitySnapshots. This will make them available to the public at: https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/WD_QualitySnapshots/
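A minimal sketch of that publishing step (the snapshot file name follows the naming used above; the exact commands and required permissions may differ):
# Compress the monthly snapshot and place it in the published datasets directory
gzip wikidata_quality_snapshot_${month}.tsv
cp wikidata_quality_snapshot_${month}.tsv.gz /srv/published/datasets/wmde-analytics-engineering/Wikidata/WD_QualitySnapshots/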
2. Compare Snapshots
To create the diagram, we need a dataset describing the change between two snapshots. The small Python script below has been used to do that:
import sys

# Item IDs per quality class in the "before" and "after" snapshots
before_result = {'A': set(), 'B': set(), 'C': set(), 'D': set(), 'E': set()}
after_result = {'A': set(), 'B': set(), 'C': set(), 'D': set(), 'E': set()}
types = ['E', 'D', 'C', 'B', 'A']

# Each snapshot line is tab-separated; the second column is expected to hold
# the item ID (e.g. Q42) and the fifth column the predicted quality class.
with open(sys.argv[1], 'r') as f:
    for line in f:
        line = line.replace('\n', '').split('\t')
        before_result[line[4]].add(int(line[1][1:]))

with open(sys.argv[2], 'r') as f:
    for line in f:
        line = line.replace('\n', '').split('\t')
        after_result[line[4]].add(int(line[1][1:]))

# result[X][Y] counts items that moved from class X to class Y;
# 'Deleted' and 'New' cover items that disappeared or appeared between the snapshots.
result = {'A': {}, 'B': {}, 'C': {}, 'D': {}, 'E': {}, 'New': {}}

for type_ in before_result:
    for qid in before_result[type_]:
        if qid in after_result[type_]:
            # Item stayed in the same class
            result[type_][type_] = result[type_].get(type_, 0) + 1
            continue
        for type__ in types:
            if qid in after_result[type__]:
                result[type_][type__] = result[type_].get(type__, 0) + 1
                break
        else:
            # Item appears in no class of the second snapshot, i.e. it was deleted
            result[type_]['Deleted'] = result[type_].get('Deleted', 0) + 1

# Items in the second snapshot that did not come from any class are new
for type_ in types:
    type_n = 0
    for type__ in types:
        type_n += result[type__].get(type_, 0)
    result['New'][type_] = len(after_result[type_]) - type_n

print(result)
(Original pastebin: https://paste.ubuntu.com/p/pfv8PRzw9q/)
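Assuming the script is saved as compare_snapshots.py (a hypothetical name), it can be run against two snapshot files from step 1; the month suffixes and the output file name below are just examples:
python3 compare_snapshots.py wikidata_quality_snapshot_201912.tsv wikidata_quality_snapshot_202012.tsv > quality_change_2019_2020.txt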
3. Visualize Change
The generated data can be loaded into rawgraphs.io to produce the desired alluvial diagrams. See also their tutorial.
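RAWGraphs expects a table rather than a Python dict, so the printed result still needs to be flattened into rows of before class, after class and item count, which can then be mapped onto the alluvial chart's steps and size. The following is a minimal sketch, assuming the comparison script's output was redirected to a file; the output file name and column headers are only illustrative:
import ast
import csv
import sys

# Read the dict printed by the comparison script (e.g. from a redirected output file)
with open(sys.argv[1]) as f:
    result = ast.literal_eval(f.read())

# Flatten into one row per (before class, after class) pair
with open('quality_change.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['before', 'after', 'items'])
    for before, targets in result.items():
        for after, count in targets.items():
            writer.writerow([before, after, count])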