User:AKhatun/Wikidata Vertical Analysis

From Wikitech

Wikidata is an open knowledge base in the form of a graph, accessible through SPARQL queries (among other things). The graph is made of triples of the form (Subject, Predicate, Object), which connect to each other to form a huge interconnected web of data. Wikidata is growing fast, and it is time to think about scaling. With this aim, this page presents an analysis of Wikidata to find out:

  • Amount of certain kinds of vertical data like labels, descriptions, scientific articles etc
  • How many queries ask for each of these vertical slices
  • Analysis of these queries

Phabricator tickets: Phab:T282790, Phab:T291190.

TL;DR

If Blazegraph (Wikidata's backend) were to fail, what could we remove from Wikidata so that it can keep functioning? Data points found across items in Wikidata, such as labels, descriptions, and identifiers, are possible candidates. The analysis done on this vertical data is described in the following sections.

"Number of days for Wikidata to recover" is the estimated number of days for Wikidata to get back to its current size if some amount of triples is removed from Wikidata. To clarify: Descriptions form ~20% of Wikidata triples. If we were to remove them, then given the rate at which Wikidata is growing, it would take ~500 days for Wikidata to jump back to its current size, despite removing the descriptions. See more about Wikidata growth rate below.
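
The recovery estimate is plain arithmetic: the number of removed triples divided by the daily growth rate. Below is a minimal Python sketch of that calculation, assuming the linear growth of 4.77 million triples per day quoted on this page. The function name `days_to_recover` is illustrative and not part of the original analysis.

```python
# Rough linear estimate; the real growth rate is not constant.
GROWTH_PER_DAY = 4.77e6  # triples per day, as quoted on this page


def days_to_recover(removed_triples: int,
                    growth_per_day: float = GROWTH_PER_DAY) -> float:
    """Days for the graph to regrow by `removed_triples` at a constant rate."""
    if growth_per_day <= 0:
        raise ValueError("growth_per_day must be positive")
    return removed_triples / growth_per_day


# Triple counts taken from the "Distribution of triples" table.
for name, count in [("description", 2_471_378_661), ("label", 499_663_174)]:
    print(f"{name}: ~{days_to_recover(count):.1f} days")
```

Because the growth rate is only an approximation, the results are best read as orders of magnitude rather than exact dates.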

Distribution of triples
Predicates Number of Triples % of Total Triples Number of days for Wikidata to recover
altLabel 102593854 0.8 21.5
description 2471378661 19.5 518
external id 1140577555 8.9 239
label 499663174 3.9 104
name 78785768 0.6 16
Labels dist.png
Distribution of queries that access vertical data
Predicates Number of Queries % of Total Queries Query Time (hr) % of Total Query Time
altLabel 29325216 16 827 5
description 21863454 12 2601 18
external id 55127216 30 5455 39
label 88861469 48 10099 72
name 14965990 8 1829 13
Vertical total qtime qcount.png

Vertical Data Analysis

The Wikidata snapshot of 20210712 was used for this analysis.

Total triples

Before we begin, the total number of triples in this specific snapshot of Wikidata is 12671768950, approximately 12.7 billion. The growth rate of triples is not constant, but treating the growth shown in the Grafana dashboard as an approximately straight line, Wikidata grows at a rate of 4.77 million triples per day. That's a lot! The rate was calculated from the number of triples at the start and end of a 90-day interval (11/3/21 to 6/6/21). During this period Wikidata grew by 3.38%.
The following analyses assume a growth of 4.77M triples per day where applicable, so take the numbers as approximations only. To repeat: the Wikidata growth rate is not constant, and 4.77M per day is a rough approximation.
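
The rate calculation described above is a simple difference quotient over two triple counts. The sketch below shows the arithmetic only. The start and end counts are made-up placeholders, not actual Grafana readings, and the function names are illustrative.

```python
def daily_growth(start_count: int, end_count: int, days: int) -> float:
    """Average number of triples added per day between two measurements."""
    if days <= 0:
        raise ValueError("days must be positive")
    return (end_count - start_count) / days


def relative_growth(start_count: int, end_count: int) -> float:
    """Growth over the interval as a percentage of the starting count."""
    return 100.0 * (end_count - start_count) / start_count


# HYPOTHETICAL counts, chosen only to illustrate the arithmetic.
start, end = 10_000_000_000, 10_450_000_000
print(f"{daily_growth(start, end, 90):,.0f} triples per day")
print(f"{relative_growth(start, end):.2f}% growth over the interval")
```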

Description

The number of triples with the predicate schema:description is 2471378661, 19.5% of all triples.

Description Triple Count Triple % Number of days for Wikidata to recover
English 72609016 0.57 15
Other Languages 2398769645 18.93 502.8
Total 2471378661 19.5 518

Additional Info

Some more information about descriptions.

Number of items that have a description: 87048501
Average number of descriptions per item: 28.4
Maximum description count per item: 258
Number of items with one description: 9910091 (11% of items)
Number of items with more than one description: 77138410 (88% of items)
Number of items that have an English description: 72609016
Number of items that don't have an English description: 14439485 (16.6%)

Therefore, 16.6% of all items that have a description don't have an English description. If we were to remove all non-English descriptions, these items would no longer have any description.
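
The same check can be run on any set of (item, language) pairs: an item loses its description entirely if none of its descriptions is in English. The following is a minimal sketch with toy data. The item IDs, language pairs, and the function name `share_without_language` are invented for illustration.

```python
from collections import defaultdict


def share_without_language(pairs, lang="en"):
    """Percent of described items that have no description in `lang`."""
    langs_by_item = defaultdict(set)
    for item, language in pairs:
        langs_by_item[item].add(language)
    if not langs_by_item:
        return 0.0
    missing = sum(1 for langs in langs_by_item.values() if lang not in langs)
    return 100.0 * missing / len(langs_by_item)


# Toy (item, language) pairs, invented for illustration.
toy = [("Q1", "en"), ("Q1", "nl"), ("Q2", "nl"), ("Q3", "en"), ("Q4", "de")]
print(share_without_language(toy))  # Q2 and Q4 have no English description
```

With the counts on this page, the same ratio is 14439485 / 87048501, which gives the 16.6% quoted above.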

Distribution of descriptions per item

Items by number of descriptions (first 10 rows)
Descriptions per Item Count Count % Cumulative %
1 9910091 11.38 11.38
2 10845750 12.46 23.84
3 13939579 16.01 39.85
4 5221876 6.00 45.85
5 3180051 3.65 49.50
6 2061753 2.37 51.87
7 1456036 1.67 53.54
8 938750 1.08 54.62
9 918864 1.06 55.68
10 886663 1.02 56.70
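
The Count % and Cumulative % columns follow from the raw counts and the total number of items that have a description (87048501). A short sketch of that calculation, using the first five rows of the table, is given below. It should agree with the table to within rounding in the last digit.

```python
from itertools import accumulate

TOTAL_ITEMS = 87_048_501  # items that have at least one description

# Items having exactly 1, 2, 3, 4, 5 descriptions (first rows of the table).
counts = [9_910_091, 10_845_750, 13_939_579, 5_221_876, 3_180_051]

# Share of each row, then the running sum of those shares.
percents = [100.0 * c / TOTAL_ITEMS for c in counts]
cumulative = list(accumulate(percents))

for n, (c, p, cum) in enumerate(zip(counts, percents, cumulative), start=1):
    print(f"{n:>2} {c:>10} {p:6.2f} {cum:6.2f}")
```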

Desc item dist.png

Language distribution of descriptions

There are 440 different language tags in descriptions. 32 languages account for 50% of the descriptions, and 94 languages account for 90%.

Top language tags in descriptions
Language tag Description count Description %
nl 75405965 3.05
en 72609016 2.94
de 61716292 2.50
ar 45939199 1.86
fr 42861255 1.73
es 39989399 1.62
uk 39859846 1.61
ast 38642801 1.56
ca 36901411 1.49
bn 36750936 1.49

Extra distribution figures are in the Jupyter Notebook, section Description, subsection Distribution of language tags.

Label

The number of triples with the predicate rdfs:label is 499663174, 3.9% of all triples.

Label Triple Count Triple % Number of days for Wikidata to recover
English 79778129 0.6 16
Other Languages 419885045 3.3 88
Total 499663174 3.9 104

Additional Info

Some more information about labels.

Number of items that have a label: 93474062
Average number of labels per item: 5.34
Maximum label count per item: 446
Number of items with one label: 20084825 (21% of items)
Number of items with more than one label: 73389237 (78% of items)
Number of items that have an English label: 79778129
Number of items that don't have an English label: 13695933 (14.65%)

Therefore, 14.7% of all items that have a label don't have an English label. If we were to remove all non-English labels, these items would no longer have any label.

Distribution of labels per item

Items by number of labels (first 10 rows)
Labels per Item Count Count % Cumulative %
1 20084825 21.49 21.49
2 41697507 44.61 66.10
3 10030895 10.73 76.83
4 4988361 5.34 82.17
5 2568068 2.75 84.92
6 1857891 1.99 86.91
7 1366863 1.46 88.37
8 1592480 1.70 90.07
9 683102 0.73 90.80
10 731273 0.78 91.58

Language distribution of labels

There are 476 different language tags in labels. Only 6 languages account for 40% of the labels, and 12 languages account for 50%.

Top language tags in labels
Language tag Label count Label %
en 79778129 15.97
nl 56940665 11.40
ast 16106324 3.22
fr 14594937 2.92
de 14352435 2.87
es 13005130 2.60
ga 9162180 1.83
it 9090037 1.82
bn 8531392 1.71
pt 7966495 1.59

More distribution figures are in the Jupyter Notebook, section Labels, subsection Distribution of language tags.

Other predicates like Label

Other predicates are skos:altLabel and schema:name. Note that there are no triples with the predicate skos:prefLabel.

Distribution of label predicates
Predicate Triple Count Triple % Number of days for Wikidata to recover
schema:name 78785768 0.62 16.5
skos:altLabel 102593854 0.81 21.5
rdfs:label 499663174 3.9 104


Language distribution of other label predicates
Label Triple Count Triple % Number of days for Wikidata to recover
English schema:name 13721324 0.11 3
Other Languages schema:name 65064444 0.51 13.6
English skos:altLabel 9157038 0.07 2
Other Languages skos:altLabel 93436816 0.74 19.5
English rdfs:label 79778129 0.6 16
Other Languages rdfs:label 419885045 3.3 88
Total English 102656491 0.8 21.5
Total Other Language 578386305 4.6 121
Total 681042796 5.37 143

More distributions are in the Jupyter Notebook, sections altLabels and schema:name.

External Identifier

Identifiers are properties, like P297. Their type is wikibase:propertyType wikibase:ExternalId, i.e. they are of property type External ID. Example identifiers are UNBIS Thesaurus ID, BBK (library and bibliographic classification), Symptom Ontology ID, Bilibili bangumi ID, etc. There are 6322 distinct external identifiers (as of 10 August 2021).

These properties appear either as /prop, meaning the object is a statement node that holds more information, or as /prop/direct (and /prop/direct-normalized), meaning the object is a single URI or literal and holds no further information.
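
One way to tell the three forms apart is by the namespace prefix of the predicate URI. The sketch below assumes the standard Wikidata RDF namespaces. The function name `classify_predicate` and the form labels it returns are illustrative, not taken from the original analysis.

```python
# Standard Wikidata RDF namespaces for the three predicate forms.
# Ordered longest first, because the /prop/ URI is a prefix of the other two.
NAMESPACES = [
    ("http://www.wikidata.org/prop/direct-normalized/", "direct-normalized"),
    ("http://www.wikidata.org/prop/direct/", "direct"),
    ("http://www.wikidata.org/prop/", "statement"),
]


def classify_predicate(uri: str):
    """Return (form, property_id) for a property predicate URI, else None."""
    for prefix, form in NAMESPACES:
        if uri.startswith(prefix):
            local = uri[len(prefix):]
            # Only a bare property ID such as "P356" counts; this skips
            # deeper namespaces like /prop/statement/ or /prop/qualifier/.
            if local.startswith("P") and local[1:].isdigit():
                return form, local
            return None
    return None


print(classify_predicate("http://www.wikidata.org/prop/direct/P356"))
print(classify_predicate("http://www.wikidata.org/prop/P356"))
```

Counting triples per form then only requires filtering the classified predicates against the list of external ID properties.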

Triples related to external identifiers*
Triple type Triple Count % Triple Number of days for Wikidata to recover
external identifiers as /prop 179679329 1.4 37
external identifiers as /prop/direct 179486550 1.4 37
external identifiers as /prop/direct-normalized 63666217 0.5 13
triples of /prop statement 717745459 5.6 150
Total 1140577555 8.897 239

*Note that triples that define the IDs themselves are not included here. Those amount to about 0.009% of the entire dataset.

Top external IDs
ID ID label Triple Count % Triples with ID % Cumulative
P356 DOI 81479716 19.27 19.27
P698 PubMed ID 63920010 15.12 34.39
P2671 Google Knowledge Graph ID 22127898 5.23 39.62
P3083 SIMBAD ID 16316711 3.86 43.48
P646 Freebase ID 13274336 3.14 46.62
P932 PMCID 12706776 3.01 49.63
P1566 GeoNames ID 11115096 2.63 52.26
P5875 ResearchGate publication ID 9157349 2.17 54.43
P214 VIAF ID 8108557 1.92 56.35
P496 ORCID iD 5222825 1.24 57.59
P846 GBIF taxon ID 4573391 1.08 58.67
P244 Library of Congress authority ID 3894577 0.92 59.59
P227 GND ID 3763072 0.89 60.48
P7859 WorldCat Identities ID 3667589 0.87 61.35
P6179 Dimensions Publication ID 3080555 0.73 62.08
P2326 GNS Unique Feature ID 2935976 0.69 62.77
P5055 IRMNG ID 2717119 0.64 63.41
P213 ISNI 2659310 0.63 64.04
P235 InChIKey 2531375 0.60 64.64
P234 InChI 2516244 0.60 65.24

Around 19% of the triples with external IDs relate to P356 (DOI) and 15% to P698 (PubMed ID). 64 IDs (out of 6322) account for 80% of the triples having external IDs, and 209 account for 90%. See more, with figures, in the Jupyter Notebook, section External Identifiers.
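
Coverage figures of this kind (how many IDs make up 80% or 90% of the triples) can be obtained by sorting the per-ID counts in descending order and accumulating until the target share is reached. The sketch below uses invented toy counts, and the function name `ids_needed` is hypothetical.

```python
def ids_needed(counts, fraction):
    """Smallest number of largest entries whose sum reaches `fraction` of the total."""
    total = sum(counts)
    running = 0
    for n, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= fraction * total:
            return n
    return len(counts)


# Toy triple counts per identifier, invented for illustration.
toy_counts = [520, 260, 110, 60, 30, 20]
print(ids_needed(toy_counts, 0.8))  # 520 + 260 + 110 = 890 of 1000
print(ids_needed(toy_counts, 0.9))  # one more entry is needed to pass 900
```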


Query Analysis

WDQS external queries from 08/2021 were used for this analysis.

Description

Label

AltLabel

Name

External Identifiers

Combined

Figures: Vertical qcount.png, Vertical qtime.png, Vertical total qtime class.png, Vertical qtime class log.png