User:AKhatun/Wikidata Vertical Analysis

From Wikitech

Wikidata is an open knowledge base structured as a graph and accessible through SPARQL queries (among other means). The graph is made of triples of the form subject, predicate, object. These triples connect to one another, forming a huge interconnected web of data. Wikidata is growing quickly, and it is time to think about scaling. To that end, this page presents an analysis of Wikidata to find out:

  • Amount of certain kinds of vertical data like labels, descriptions, scientific articles etc
  • How many queries ask for each of these vertical slices
  • Analysis of these queries

Phabricator tickets: T282790, T291190.

TL;DR

If Blazegraph (Wikidata's backend) were to fail, what could we temporarily remove from Wikidata so that it can keep functioning? Some data found across items in Wikidata, such as labels, descriptions and identifiers, are possible candidates. The analysis of these vertical data is described in the following sections.

"Number of days for Wikidata to recover" is the estimated number of days Wikidata would take to grow back to its current size after a given number of triples is removed. For example, descriptions form ~20% of Wikidata triples. If we removed them, then at the rate at which Wikidata is growing it would take ~500 days for Wikidata to return to its current size. See more about the Wikidata growth rate below.

Distribution of triples
Predicates Number of Triples % of Total Triples Number of days for Wikidata to recover
altLabel 102,593,854 0.8 21.5
description 2,471,378,661 19.5 518
external id 1,140,577,555 9.0 239
label 499,663,174 3.9 104
name 78,785,768 0.6 16
Distribution of queries that access vertical data (monthly)
Predicates Number of Queries % of Total Queries Query Time (hr) % of Total Query Time
altLabel 29.3 M 16 827 5
description 21.8 M 12 2601 18
external id 55 M 30 5455 39
label 88 M 48 10099 72
name 15 M 8 1829 13

Vertical Data Analysis

The Wikidata snapshot of 20210712 was used for this analysis.

Total triples

Before we begin, the total number of triples in this specific snapshot of Wikidata is 12671768950, approximately 12.7 billion. The growth rate of triples is not constant, but if the growth shown in the Grafana dashboard is approximated by a straight line, Wikidata grows at a rate of 4.77 million triples per day. This rate was calculated from the number of triples at the start and end of an approximately 90-day interval (11 March 2021 to 6 June 2021). During this period Wikidata grew by 3.38%.
The following analyses assume a growth of 4.77M triples per day where applicable, so take the numbers as approximations only. To repeat, the Wikidata growth rate is not constant, and the 4.77M-per-day figure is a rough approximation.
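
The "% of total" and "days to recover" columns used throughout this page follow directly from the snapshot size and the linear-growth assumption above. A minimal sketch of that arithmetic (the function and constant names are chosen here for illustration and are not taken from the original notebooks):

```python
# Figures taken from this page; the names below are illustrative only.
TOTAL_TRIPLES = 12_671_768_950   # triples in the 20210712 snapshot
GROWTH_PER_DAY = 4_770_000       # linear approximation, triples per day


def share_of_total(triples: int, total: int = TOTAL_TRIPLES) -> float:
    """Percentage of all triples that a slice represents."""
    return 100 * triples / total


def days_to_recover(triples: int, growth_per_day: int = GROWTH_PER_DAY) -> float:
    """Days of growth needed to add back `triples` triples."""
    return triples / growth_per_day


description_triples = 2_471_378_661
print(round(share_of_total(description_triples), 1))  # 19.5
print(round(days_to_recover(description_triples)))    # 518
```

Because the growth rate is only an approximation, the resulting day counts should be read as rough estimates.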

Description

The number of triples with the predicate schema:description is 2471378661, 19.5% of all triples.

Description Triple Count Triple % Number of days for Wikidata to recover
English 72609016 0.57 15
Other Languages 2398769645 18.93 502.8
Total 2471378661 19.5 518

Additional Info

Some more information about descriptions.

Number of items that have a description 87048501
Average description per item 28.4
Maximum description count per item 258
Number of items with one description 9910091 (11% of items)
Number of items with more than one description 77138410 (88% of items)
Number of items that have an English description 72609016
Number of items that don't have English descriptions 14439485 (16.6%)

Therefore, 16.6% of all items that have a description don't have an English description. If we were to remove all non-English descriptions, 16.6% of the items that currently have a description would no longer have one.
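
The 16.6% figure is the difference between the items that have any description and the items that have an English one, divided by the former. A small sketch of that calculation, with a hypothetical helper name:

```python
def missing_english(items_with_any: int, items_with_english: int):
    """Return (count, percentage) of items that have the field but not in English."""
    missing = items_with_any - items_with_english
    return missing, 100 * missing / items_with_any


# Description figures from the table above.
count, pct = missing_english(87_048_501, 72_609_016)
print(count, round(pct, 1))  # 14439485 16.6
```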

Distribution of descriptions per item

Top 10 number of descriptions per item
Description per Item Count Count % Cumulative %
1 9910091 11.38 11.38
2 10845750 12.46 23.84
3 13939579 16.01 39.85
4 5221876 6.00 45.85
5 3180051 3.65 49.50
6 2061753 2.37 51.87
7 1456036 1.67 53.54
8 938750 1.08 54.62
9 918864 1.06 55.68
10 886663 1.02 56.70

Language distribution of descriptions

Descriptions use 440 different language tags. 32 languages account for 50% of the descriptions and 94 languages account for 90%.
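
Coverage statements such as "N languages account for 50%" can be derived from per-language counts by sorting them in descending order and accumulating until the target share is reached. A minimal sketch, assuming the counts are available as a mapping from language tag to count (the toy numbers below are hypothetical, not Wikidata figures):

```python
from typing import Mapping


def languages_needed(counts: Mapping[str, int], fraction: float) -> int:
    """Smallest number of top languages whose counts cover `fraction` of the total."""
    total = sum(counts.values())
    covered = 0
    for n, count in enumerate(sorted(counts.values(), reverse=True), start=1):
        covered += count
        if covered >= fraction * total:
            return n
    return len(counts)


# Hypothetical toy counts, for illustration only.
toy = {"nl": 50, "en": 30, "de": 12, "fr": 5, "ar": 3}
print(languages_needed(toy, 0.5))  # 1
print(languages_needed(toy, 0.9))  # 3
```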

Top language tags in descriptions
Language tag Description count Description %
nl 75405965 3.05
en 72609016 2.94
de 61716292 2.50
ar 45939199 1.86
fr 42861255 1.73
es 39989399 1.62
uk 39859846 1.61
ast 38642801 1.56
ca 36901411 1.49
bn 36750936 1.49

Extra distribution figures are in the Jupyter Notebook, section "Description", subsection "Distribution of language tags".

Label

The number of triples with the predicate rdfs:label is 499663174, 3.9% of all triples.

Label Triple Count Triple % Number of days for Wikidata to recover
English 79778129 0.6 16
Other Languages 419885045 3.3 88
Total 499663174 3.9 104

Additional Info

Some more information about labels.

Number of items that have a label 93474062
Average label per item 5.34
Maximum label count per item 446
Number of items with one label 20084825 (21% of items)
Number of items with more than one label 73389237 (78% of items)
Number of items that have an English label 79778129
Number of items that don't have English labels 13695933 (14.65%)

Therefore, 14.7% of all items that have a label don't have an English label. If we were to remove all non-English labels, 14.7% of the items that currently have a label would no longer have one.

Distribution of labels per item

Top ten labels per item
Label per Item Count Count % Cumulative %
1 20084825 21.49 21.49
2 41697507 44.61 66.10
3 10030895 10.73 76.83
4 4988361 5.34 82.17
5 2568068 2.75 84.92
6 1857891 1.99 86.91
7 1366863 1.46 88.37
8 1592480 1.70 90.07
9 683102 0.73 90.80
10 731273 0.78 91.58

Language distribution of labels

Labels use 476 different language tags. Only 6 languages account for 40% of the labels and 12 languages account for 50%.

Top language tags in labels
Language tag Label count Label %
en 79778129 15.97
nl 56940665 11.40
ast 16106324 3.22
fr 14594937 2.92
de 14352435 2.87
es 13005130 2.60
ga 9162180 1.83
it 9090037 1.82
bn 8531392 1.71
pt 7966495 1.59

More distribution figures are in the Jupyter Notebook, section "Labels", subsection "Distribution of language tags".

Other predicates like Label

Other predicates are skos:altLabel, schema:name. Note that there are no triples with the predicate skos:prefLabel.

Distribution of predicates
Predicate Triple Count Triple % Number of days for Wikidata to recover
schema:name 78785768 0.62 16.5
skos:altLabel 102593854 0.81 21.5
rdfs:label 499663174 3.9 104


Language distribution of other predicates
Label Triple Count Triple % Number of days for Wikidata to recover
English schema:name 13721324 0.11 3
Other Languages schema:name 65064444 0.51 13.6
English skos:altLabel 9157038 0.07 2
Other Languages skos:altLabel 93436816 0.74 19.5
English rdfs:label 79778129 0.6 16
Other Languages rdfs:label 419885045 3.3 88
Total English 102656491 0.8 21.5
Total Other Language 578386305 4.6 121
Total 681042796 5.37 143

More distributions are in the Jupyter Notebook, sections "altLabels" and "schema:name".

External Identifier

Identifiers are properties, like P297. Their wikibase:propertyType is wikibase:ExternalId, i.e. they are of property type External ID. Example identifiers are UNBIS Thesaurus ID, BBK (library and bibliographic classification), Symptom Ontology ID, Bilibili bangumi ID etc. There are 6322 distinct external identifiers (as of 10 August, 2021).

These properties appear either as /prop, meaning the object is a statement that holds more information, or as /prop/direct (and /prop/direct-normalized), meaning the object is a single URI or literal that holds no further information.
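
To make the three forms concrete, the sketch below classifies a predicate URI by its Wikidata RDF namespace. The namespace URIs are the standard Wikidata ones; the helper name and return shape are assumptions made for illustration and are not taken from the original analysis.

```python
import re

# Wikidata RDF namespaces for the three property forms discussed above.
NAMESPACES = {
    "prop": "http://www.wikidata.org/prop/",
    "prop/direct": "http://www.wikidata.org/prop/direct/",
    "prop/direct-normalized": "http://www.wikidata.org/prop/direct-normalized/",
}
_PROPERTY_ID = re.compile(r"P\d+")


def property_form(predicate_uri: str):
    """Return (form, property id), or None if the URI is not one of the three forms."""
    for form, namespace in NAMESPACES.items():
        if predicate_uri.startswith(namespace):
            local = predicate_uri[len(namespace):]
            # Require a bare property id so that e.g. /prop/statement/P297
            # is not mistaken for the /prop form.
            if _PROPERTY_ID.fullmatch(local):
                return form, local
    return None


print(property_form("http://www.wikidata.org/prop/P297"))
print(property_form("http://www.wikidata.org/prop/direct/P297"))
```

Deciding whether a given property is an external identifier still requires checking its wikibase:propertyType, which this sketch does not do.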

Triples related to external identifiers*
Triple type Triple Count % Triple Number of days for Wikidata to recover
external identifiers as /prop 179679329 1.4 37
external identifiers as /prop/direct 179486550 1.4 37
external identifiers as /prop/direct-normalized 63666217 0.5 13
triples of /prop statement 717745459 5.6 150
Total 1140577555 9.0 239

*Note that triples that define the IDs themselves are not included here. Those are in the range of 0.009% of the entire dataset.

Top external IDs
ID ID label Triple Count % Triples with ID % Cumulative
P356 DOI 81479716 19.27 19.27
P698 PubMed ID 63920010 15.12 34.39
P2671 Google Knowledge Graph ID 22127898 5.23 39.62
P3083 SIMBAD ID 16316711 3.86 43.48
P646 Freebase ID 13274336 3.14 46.62
P932 PMCID 12706776 3.01 49.63
P1566 GeoNames ID 11115096 2.63 52.26
P5875 ResearchGate publication ID 9157349 2.17 54.43
P214 VIAF ID 8108557 1.92 56.35
P496 ORCID iD 5222825 1.24 57.59
P846 GBIF taxon ID 4573391 1.08 58.67
P244 Library of Congress authority ID 3894577 0.92 59.59
P227 GND ID 3763072 0.89 60.48
P7859 WorldCat Identities ID 3667589 0.87 61.35
P6179 Dimensions Publication ID 3080555 0.73 62.08
P2326 GNS Unique Feature ID 2935976 0.69 62.77
P5055 IRMNG ID 2717119 0.64 63.41
P213 ISNI 2659310 0.63 64.04
P235 InChIKey 2531375 0.60 64.64
P234 InChI 2516244 0.60 65.24

Around 19% of the triples with external IDs relate to P356 (DOI) and 15% to P698 (PubMed ID). 64 IDs (out of 6322) account for 80% of the triples with external IDs, and 209 account for 90%. See more, with figures, in the Jupyter Notebook, section "External Identifiers".


Query Analysis

WDQS external queries from August 2021 were used for this analysis. All the following numbers were calculated for one month of data.

Note that:

  • Only the queries that contain direct mention of the predicates were considered. Generic open ended queries that happen to match the predicates were not considered here. For example, queries like ?sub ?pred ?obj or ?sub ?obj "label_string" were not counted, but queries like ?sub rdfs:label ?obj or ?sub rdfs:label "label_string" were taken into consideration.
  • The query counts and percentages are not mutually exclusive across vertical slices. Queries that contain rdfs:label, for example, more often than not also contain skos:altLabel. Such queries increase counts for both categories.
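
The "direct mention" rule above can be approximated with a simple text check on each query. The sketch below is a simplification under stated assumptions: it looks for the prefixed name or the full IRI in the raw query string, whereas the original analysis may have worked on parsed queries. The function name is hypothetical.

```python
import re

# Prefixed names and full IRIs for the vertical slices discussed on this page.
PREDICATES = {
    "label": ("rdfs:label", "http://www.w3.org/2000/01/rdf-schema#label"),
    "altLabel": ("skos:altLabel", "http://www.w3.org/2004/02/skos/core#altLabel"),
    "description": ("schema:description", "http://schema.org/description"),
    "name": ("schema:name", "http://schema.org/name"),
}


def slices_mentioned(query: str) -> set:
    """Return the vertical slices whose predicate is mentioned directly in the query text."""
    found = set()
    for slice_name, (prefixed, iri) in PREDICATES.items():
        # Guard both ends so that e.g. "schema:nameOfThing" is not counted.
        pattern = r"(?<![\w:-])" + re.escape(prefixed) + r"(?![\w-])"
        if re.search(pattern, query) or f"<{iri}>" in query:
            found.add(slice_name)
    return found


query = "SELECT ?s WHERE { ?s rdfs:label ?l ; skos:altLabel ?a }"
print(sorted(slices_mentioned(query)))  # ['altLabel', 'label']
print(slices_mentioned("SELECT * WHERE { ?s ?p ?o }"))  # set()
```

A query that mentions several predicates is returned under every matching slice, which is why the counts across slices are not mutually exclusive. This text-based check would also match mentions inside comments or string literals.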

Total Queries

  • Total number of monthly queries: ~182M
  • Total monthly query execution time: ~14,000 hours
Total query time class distribution
Query Time Class Number of queries
less_10ms 2,832,276
10ms_to_100ms 132,966,668
100ms_to_1s 42,001,720
1s_to_10s 3,144,002
more_10s 879,330
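
The query time classes above can be reproduced by bucketing each query's execution time. A minimal sketch, assuming durations in milliseconds and assuming each class includes its lower bound and excludes its upper bound (the source does not state how boundary values were handled):

```python
from collections import Counter


def query_time_class(duration_ms: float) -> str:
    """Map a query duration in milliseconds to one of the classes used above."""
    if duration_ms < 10:
        return "less_10ms"
    if duration_ms < 100:
        return "10ms_to_100ms"
    if duration_ms < 1_000:
        return "100ms_to_1s"
    if duration_ms < 10_000:
        return "1s_to_10s"
    return "more_10s"


# Hypothetical durations, for illustration only.
durations_ms = [3, 42, 57, 250, 4_000, 61_000]
print(Counter(query_time_class(d) for d in durations_ms))
```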

Description

  • Number of queries where schema:description occurs anywhere in the query (predicate/object/VALUES etc): 21,863,454
  • Number of queries where schema:description is the predicate (subset of the former): 21,862,863
  • Number of queries where schema:description is part of a more complex path: 4
  • Total number of queries with schema:description: 21,863,454, which is 12% of the monthly queries.
  • Queries with descriptions make up 2,600 hours or 18.65% of monthly query time.
Top user agents that use descriptions
User agent Number of queries % of description queries % of all queries
searx/1.0.0 5,000,782 22 2.7
UA-X 3,532,030 16 1.9
Python-urllib/3.6 3,233,151 14.7 1.77
searx/0.18.0 2,313,343 10 1.27
searx/1.0.0-unknown 952,243 4.3 0.52
Top user agents that use descriptions (by time)
User agent Query time (hr) % time of description queries % time of all queries
searx/1.0.0 638 24.5 4.5
Python-urllib/3.6 335 12.8 2.4
searx/0.18.0 256 9.8 1.8
UA-X 204 7.8 1.4
UA-X 156 6.0 1.1

Label

  • Number of queries where rdfs:label occurs anywhere in the query (predicate/object/VALUES etc): 42,883,256
  • Number of queries where rdfs:label is the predicate (subset of the former): 40,532,779
  • Number of queries where wikibase:label service is used to access labels: 72,936,044
  • Number of queries where rdfs:label is part of a more complex path: 2,537,238
  • Total number of queries with rdfs:label: 88,861,469, which is 48.8% of the monthly queries.
  • Queries with labels make up 10,000 hours or 72% of monthly query time.
Top user agents that use labels
User agent Number of queries % of label queries % of all queries
UA-X 7,817,612 8.8 4.3
wikidataintegrator/0.8.4 6,999,877 7.9 3.8
NERBot/0.0 6,118,205 6.9 3.36
searx/1.0.0 5,097,913 5.7 2.8
UA-X 3,532,030 3.9 1.9
Pywikibot/6.1.0 3,347,977 3.7 1.8
Python-urllib/3.6 3,233,689 3.6 1.77
UA-X 2,947,910 3.3 1.62
UA-X 2,502,071 2.8 1.37
searx/0.18.0 2,346,649 2.6 1.29
WikidataQueryServiceR 2,131,186 2.4 1.17
Top user agents that use labels (by time)
User agent Query time (hr) % time of label queries % time of all queries
UA-X 1679 16.63 12.04
UA-X 831 8.22 5.95
searx/1.0.0 646 6.4 4.63
UA-X 471 4.67 3.38
Python-urllib/3.6 335 3.32 2.4
NERBot/0.0 291 2.88 2.08
UA-X 285 2.82 2.04
searx/0.18.0 259 2.56 1.85
UA-X 200 1.99 1.44
UA-X 156 1.54 1.11
searx/1.0.0-unknown 110 1.09 0.79

altLabel

  • Number of queries where skos:altLabel occurs anywhere in the query (predicate/object/VALUES etc): 29,325,216
  • Number of queries where skos:altLabel is the predicate (subset of the former): 25,928,709
  • Number of queries where skos:altLabel is part of a more complex path: 2,470,100
  • Total number of queries with skos:altLabel: 29,325,216, which is 16% of the monthly queries.
  • Queries with altLabels make up 800 hours or 5% of monthly query time.
Top user agents that use altLabels
User agent Number of queries % of altLabel queries % of all queries
Toolforge - mix-n-match 20,215,093 68 11
Python-urllib/3.6 3,233,151 11 1.77
UA-X 919,678 3 0.5
UA-X 655,385 2.2 0.36
UA-X 388,507 1.3 0.2
Top user agents that use altLabels (by time)
User agent Query time (hr) % time of altLabel queries % time of all queries
Python-urllib/3.6 335 40.5 2.4
Toolforge - mix-n-match 141 17.1 1
UA-X 75 9.1 0.54
UA-X 31 3.8 0.22
UA-X 30 3.6 0.21


Name

  • Number of queries where schema:name occurs anywhere in the query (predicate/object/VALUES etc): 14,965,990
  • Number of queries where schema:name is the predicate (subset of the former): 14,964,300
  • Number of queries where schema:name is part of a more complex path: 1,327
  • Total number of queries with schema:name: 14,965,990, which is 8% of the monthly queries.
  • Queries with schema:name make up 1,800 hours or 13% of monthly query time.
Top user agents that use schema:name
User agent Number of queries % of schema:name queries % of all queries
searx/1.0.0 5,001,390 33 2.7
searx/0.18.0 2,313,343 15 1.27
searx/1.0.0-unknown 952,243 6 0.52
searx/1.0.0-200-313a9847 513,220 3.4 0.28
WikidataIdTool/1.0 507,367 3.4 0.27
Top user agents that use schema:name (by time)
User agent Query time (hr) % time of schema:name queries % time of all queries
searx/1.0.0 638 34.9 4.5
searx/0.18.0 256 14 1.8
searx/1.0.0-unknown 110 6 0.8
searx/1.0.0-200-313a9847 66 3.6 0.5
searx/1.0.0-211-968b2899 51 2.8 0.3

External Identifiers

External identifiers are properties whose wikibase:propertyType is wikibase:ExternalId. There are ~6500 such properties (as of September 2021). The IDs most used in queries, counting occurrences anywhere in the query (predicate, object, VALUES table etc.), are shown below. Again, the counts are not mutually exclusive, since the same query can contain several of these properties.


Top External Ids used in queries (monthly)
Id P value Id name Query count % of all queries
P345 IMDb ID 16304253 8.967
P2013 Facebook ID 14233516 7.828
P2002 Twitter username 14040155 7.722
P212 ISBN-13 13646249 7.505
P2003 Instagram username 13518766 7.435
P218 ISO 639-1 code 13512920 7.432
P957 ISBN-10 13455799 7.400
P498 ISO 4217 code 13422212 7.382
P2397 YouTube channel ID 13389689 7.364
P434 MusicBrainz artist ID 13313017 7.322
P1651 YouTube video ID 13291716 7.310
P436 MusicBrainz release group ID 13288620 7.309
P435 MusicBrainz work ID 13259330 7.292
P966 MusicBrainz label ID 13254823 7.290
P846 GBIF taxon ID 6650333 3.658
P300 ISO 3166-2 code 6594479 3.627
P691 NKCR AUT ID 3992781 2.196
P214 VIAF ID 1920419 1.056
P698 PubMed ID 1843036 1.014
P2949 WikiTree person ID 1698142 0.934
  • Total number of queries with External Ids: 55,127,216, which is 30% of the monthly queries.
  • Queries with external ids make up 5,500 hours or 39% of monthly query time.
Top user agents that use external Ids (by counts)
User agent Number of queries % of external Id queries % of all queries
wikidataintegrator/0.8.4 6993803 12 3.8
Rust mediawiki API/0.2.7 6603929 11 3.6
Hub 6231886 11 3.4
searx/1.0.0 5097329 9 2.8
Googlebot/2.1 2614683 4 1.43
Top user agents that use external Ids (by time)
User agent Query time (hr) % time of external Id queries % time of all queries
UA-X 1300 23.8 9.3
searx/1.0.0 646 11.8 4.6
searx/0.18.0 259 4.7 1.8
Needle/0.9.2 205 3.7 1.4
UA-X 204 3.7 1.4

Combined