User:AKhatun/Wikidata Scholarly Articles Subgraph Analysis

From Wikitech

Scholarly articles form a large portion of Wikidata, as has been known for some time. This page presents the baseline analysis of scholarly articles in Wikidata. The aim is to identify not only what portion of Wikidata consists of scholarly articles, but also how connected that portion is to other parts of Wikidata, how many users query this subgraph, and what percentage of all queries those are. The analysis is therefore divided into two parts:

  • Analysis on Wikidata:
    • What are scholarly articles
    • Number and percentage of Wikidata entities that are scholarly articles
    • How many entities connect to scholarly articles
    • How many entities do scholarly articles connect to
    • Rate of growth of scholarly articles
    • Number of authors that would be isolated from Wikidata if scholarly articles were removed (that is, author entities that were probably created only for the articles they wrote)
  • Analysis of the WDQS SPARQL Queries:
    • What defines a query to be associated with scholarly articles
    • The number and percentage of queries associated with scholarly articles
    • The number and percentage of queries that require entities other than scholarly articles vs those that require only scholarly articles

Ticket: T281854

TL;DR

Scholarly Articles

  • Scholarly Articles are a significantly large subset of Wikidata with ~37M entities. Triples related to these entities make up 50% of all Wikidata triples.
  • The number of these entities, counted over a span of 45 days, did not grow much. The number of triples related to these entities is growing, however (by about 100M in those 45 days).
  • An approximate measure of how connected the scholarly article subgraph is to other parts of Wikidata (based on direct connections):
    • Excluding authors and citations, ~85M (0.66%) triples go from scholarly articles to other parts of Wikidata
    • 266M (2%) triples go from other parts of Wikidata to scholarly articles
  • 20% of triples related to scholarly articles are descriptions (descriptions are also ~20% of all Wikidata triples; see User:AKhatun/Wikidata_Vertical_Analysis). Then come ranks, types, and references.
  • 20% (0.36M/1.8M) of the authors of scholarly articles also author things that are not scholarly articles, meaning 80% of the authors appear only in the scholarly articles subgraph.
  • An approximation approach was taken to find queries that are related to scholarly articles using predicates, items, and literals.
  • These queries form ~2% of all monthly queries, in terms of query counts.
  • This number is also somewhat of an overcount, based on analysis of a subset of the relevant queries. For example, a query may contain a property often used with scholarly articles while asking for information about the property itself, rather than using that property to ask about scholarly articles.
  • Total query time for scholarly article queries: 152 hours. This is 1% of total query time for a month.

Scholarly Articles

The Wikidata dump of 20210719 was used for the following analysis.

Definition of Scholarly Articles

Research papers or books can be considered scholarly articles. This includes journal articles, conference papers, books, etc. Wikidata has several types of entities to cover these. More technically speaking, the entities involved are:

Most of these overlap with scholarly articles, and with each other. Scholarly articles have the largest count (37M), while everything else combined amounts to only ~130K (excluding those that are also scholarly articles); the analysis therefore focuses on scholarly articles more than on the others. The 130K could be even less if the other article types overlap among themselves. This is not an exhaustive list, but it covers most of what we can call scholarly articles, papers, or journal articles. A detailed tree diagram of how these entities are related can be found in Scholarly Articles' Tree in Wikidata.

Scholarly Article Stats

Number of Scholarly Articles

  • Total entity count: 94M [1]
  • Number of entities that are instance of scholarly article: 37308158 (37.3M). 40% of all Wikidata entities are therefore scholarly articles.
  • Total triple count: 12819818340 (12.8B)
  • Number of triples included within the scholarly articles: 6399256630 (6.4B). 50% of all triples are directly related to scholarly articles.

Technically, the triples related to scholarly articles come under the 'context' of items that are scholarly articles. An example of an item's context is the Q39790431 dump. The context count includes:

  • + all triples in which the item is a subject
  • + the statement triples that arise from these triples
  • + triples that define the item itself
  • - the refs and vals, as those are re-usable in other items.

Triples per article

i.e., triples related to the articles.

  • Average triple per article: 171
  • Minimum triple per article: 10
  • Maximum triple per article: 41847

See examples of articles with high number of triples in Wikidata_Basic_Analysis#Items

Direct triples per article

i.e., triples where a scholarly article is the subject.

  • Average direct triple per article: 84
  • Minimum direct triple per article: 7
  • Maximum direct triple per article: 16758

See examples of articles with high number of direct triples in Wikidata_Basic_Analysis#Top_Subjects

This raises the question: Are only a few articles responsible for this huge number of triples, or are the triples distributed evenly among the articles? So we look into the distribution of triples.

Distribution of triples per article

Distribution of the count of triples per article
Number of triples Count of articles
10 to 100 10583118
100 to 1k 26626656
1k to 10k 96563
more than 10k 1821

Number of days to recovery

If 6.4B triples were removed from Wikidata, how long would it take, at the current rate of growth, for Wikidata to return to its original size?
The growth rate of triples is not constant, but approximating the growth shown in the Grafana dashboard as a straight line, Wikidata grows at a rate of 4.77M triples per day. This rate was calculated from the number of triples at the start and end of a roughly 90-day interval (11/3/21 to 6/6/21). The actual rate could be somewhat faster or slower.

  • Wikidata would take about 1300 days = 44 months = approx. 3.6 years to get back to its original size. This is a rough approximation, since the growth rate of Wikidata is not constant.
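
The estimate above is a single division. A minimal sketch in Python, using only the figures quoted in this section (the variable names are illustrative, and a 30-day month is assumed):

```python
# Figures quoted in this section.
removed_triples = 6_399_256_630   # triples within the scholarly articles subgraph
growth_per_day = 4.77e6           # approximate linear growth rate of Wikidata

days = removed_triples / growth_per_day
months = days / 30                # assumption: 30-day months
years = days / 365

print(f"{days:.0f} days, about {months:.1f} months, about {years:.1f} years")
```

The unrounded result is a little over 1300 days, which matches the rounded figures given above.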

Entities connected to scholarly articles

  • Links to scholarly articles: Number of triples that have scholarly articles as object is 529702818 (530M)
  • Outside links to scholarly articles: Number of triples that have scholarly articles as object, but subject is not another scholarly article is 266708031 (266M)

Entities scholarly articles connect to

Queries that directly find links from scholarly articles to non-scholarly-article items run into timeouts. Another way to estimate this is to look at predicates that tend to point towards non-scholarly-article items in Wikidata.

The predicates considered were:

  • main subject
  • language of work or name
  • stated as
  • on focus list of Wikimedia project
  • describes a project that uses
  • determination method
  • sponsor
  • genre
  • object has role

Together these account for 85851140 (85M) triples; therefore about 85M triples link directly from scholarly articles to items that are possibly not scholarly articles.
Note that this does not include triples contained in the statements of these triples, if any, but those should not add many more triples.

Rate of growth of scholarly articles

The growth rate of scholarly articles is shown in the figure below. The data is only over a span of 7 weeks.

  • The number of scholarly articles does not seem to grow much.
  • The number of triples related to scholarly articles grew around 100M in this time.


More trend data can be found at wikicite.org/statistics, but it covers all publications, a much larger category than scholarly articles.

Scholarly Article Counts Summary

Predicates of Scholarly Articles

  • Total distinct predicates: scholarly articles use 2107 distinct predicates.
  • Non-wikidata predicates: 17 of these predicates are non-wikidata predicates, i.e., unlike P31, P279, etc., they do not start with the prefix wikidata.org. These predicates form 60% of the scholarly article triples.
  • Non-wiki predicates: 14 of these predicates don't start with wikidata.org or wikiba.se. These are 46% of the scholarly article triples.
  • Descriptions: Descriptions of scholarly articles form 20.5% of the scholarly article triples. This is 10% of all Wikidata triples. Recall that all descriptions (of all items) together form 19.5% of all Wikidata triples.
  • External IDs: There are 1000 external IDs associated with scholarly articles, which form ~50% of the distinct predicates of scholarly articles. External IDs form 4% of scholarly article triples, 2% of all triples.

Top Predicates

Top predicates of scholarly articles
Predicate Predicate label # of Triples % of Scholarly Article Triples % of all Triples
http://schema.org/description 1321691671 20.654 10.252
http://wikiba.se/ontology#rank 773820830 12.092 6.002
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 773788620 12.092 6.002
http://www.w3.org/ns/prov#wasDerivedFrom 691773237 10.810 5.366
http://www.wikidata.org/prop/P2860 cites work 263097842 4.111 2.041
http://www.wikidata.org/prop/statement/P2860 cites work 263097837 4.111 2.041
http://www.wikidata.org/prop/direct/P2860 cites work 263004896 4.110 2.040
http://www.wikidata.org/prop/qualifier/P1545 series ordinal 154088914 2.408 1.195
http://www.wikidata.org/prop/P2093 author name string 134315644 2.099 1.042
http://www.wikidata.org/prop/statement/P2093 author name string 134315587 2.099 1.042
http://www.wikidata.org/prop/direct/P2093 author name string 134227496 2.098 1.041
http://www.w3.org/2000/01/rdf-schema#label 74437634 1.163 0.577
http://www.wikidata.org/prop/statement/P31 instance of 40319296 0.630 0.313
http://www.wikidata.org/prop/P31 instance of 40319296 0.630 0.313
http://www.wikidata.org/prop/direct/P31 instance of 40319268 0.630 0.313
http://www.wikidata.org/prop/P1476 title 37524904 0.586 0.291
http://www.wikidata.org/prop/statement/P1476 title 37524903 0.586 0.291
http://www.wikidata.org/prop/direct/P1476 title 37523899 0.586 0.291
http://www.wikidata.org/prop/P577 publication date 37309627 0.583 0.289
http://www.wikidata.org/prop/statement/P577 publication date 37309626 0.583 0.289
http://www.wikidata.org/prop/statement/value/P577 publication date 37309625 0.583 0.289

Top Properties

Considering only Wikidata predicates, i.e., mainly properties of items, and grouping by property label, we get the following counts for the top properties. The counts include p, s, ps, psv, wdt, wdtn, pq, pqv, and pqn where applicable. They do not include triples that expand from the statements.

Top properties of scholarly articles
Property name # of Triples % of Scholarly Article Triples % of all Triples
cites work 789200575 12.332 6.122
author name string 402859029 6.296 3.125
series ordinal 154089316 2.408 1.195
publication date 149227733 2.332 1.156
DOI 134464365 2.100 1.045
instance of 120957868 1.890 0.939
title 112574265 1.758 0.873
published in 109147370 1.707 0.846
page(s) 104199512 1.629 0.807
volume 103612864 1.620 0.804
PubMed ID 95893499 1.499 0.744
issue 95083032 1.485 0.738
author 59439111 0.929 0.461
main subject 41515900 0.648 0.321
language of work or name 34801291 0.543 0.270
PMCID 19059585 0.297 0.147
ResearchGate publication ID 13734703 0.216 0.108
stated as 8609378 0.135 0.067
exact match 8162235 0.129 0.063
Dimensions Publication ID 4617733 0.072 0.036

Top External IDs

Grouping by property label of the external IDs, we get the following counts for the top external IDs. The counts include p, s, ps, psv, wdt, wdtn, pq, pqv, and pqn where applicable. They do not include triples that expand from the statements.

Top external IDs of scholarly articles
External ID # of Triples % of Scholarly Article Triples % of all Triples
DOI 134464365 2.100 1.045
PubMed ID 95893499 1.499 0.744
PMCID 19059585 0.297 0.147
ResearchGate publication ID 13734703 0.216 0.108
Dimensions Publication ID 4617733 0.072 0.036
CJFD journal article ID 2915210 0.045 0.024
DBLP publication ID 2080695 0.035 0.015
ADS bibcode 1186590 0.018 0.009
OpenCitations bibliographic resource ID 1161475 0.020 0.010
arXiv ID 1040311 0.015 0.009

Distribution of predicates

The distribution of distinct predicates per article is given below:

  • Average distinct predicate per article: 29
  • Maximum distinct predicate per article: 102
  • Minimum distinct predicate per article: 7
Distribution of distinct predicate per scholarly article
Number of distinct predicate Number of articles
less than 10 277
10 to 20 152995
20 to 30 18146794
30 to 40 18294391
40 to 50 711176
50 to 60 2493
60 to 70 26
70 to 80 4
80 to 90 1
90 to 100 0
more than 100 1

Scholarly Articles' Author

The following analysis is done with the predicate P50, which links to an author item in Wikidata. Other author data include 'author name string', but it is not considered here as these are literals and do not link to other Wikidata items.

  • Total number of distinct authors (through P50) in wikidata: 1.9M
    1. Number of distinct authors who wrote scholarly articles: 1.8M
      • Number of distinct authors who wrote ONLY scholarly articles: 1.44M
    2. Number of distinct authors who wrote other kinds of articles: 0.5M
      • Number of distinct authors who did not write any scholarly article: 0.13M
    3. Number of authors who wrote both scholarly and non-scholarly articles: 0.36M

  • Number of direct links to authors in scholarly articles: 19756767 (19.7M)
  • There are ~28M triples where items (not scholarly articles) link to authors.
  • Average author per article: ~2
  • Maximum author per article: 1070
  • Minimum author per article: 1
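
The three author groups above are related by simple set operations. A minimal sketch with toy data, assuming the P50 statements are available as (work, author) pairs and that the set of scholarly article IDs is known; the function name and all IDs below are made up for illustration and are not taken from the original analysis:

```python
def split_authors(authorship, scholarly_articles):
    """Split authors by the kind of work they are linked to.

    authorship: iterable of (work_id, author_id) pairs, e.g. from P50 statements.
    scholarly_articles: set of work IDs that are scholarly articles.
    """
    wrote_sa, wrote_other = set(), set()
    for work, author in authorship:
        if work in scholarly_articles:
            wrote_sa.add(author)
        else:
            wrote_other.add(author)
    return {
        "only_scholarly": wrote_sa - wrote_other,  # isolated if the subgraph is removed
        "only_other": wrote_other - wrote_sa,
        "both": wrote_sa & wrote_other,
    }


# Toy data with made-up IDs.
pairs = [("W1", "A1"), ("W2", "A1"), ("W2", "A2"), ("W3", "A2"), ("W3", "A3")]
groups = split_authors(pairs, scholarly_articles={"W1", "W2"})
print(groups)
```

With the reported figures, the same relations hold approximately: 1.8M authors of scholarly articles minus 1.44M who wrote only scholarly articles leaves 0.36M who wrote both.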

Queries related to scholarly articles

This section aims to find statistics about the queries that touch on the scholarly articles subgraph. Given the inter-connected nature of Wikidata, it is very hard to find every query that somehow relates to scholarly articles, but some approximations are possible. To this end, the possible types of queries are:

  1. The query directly mentions some scholarly article(s) (e.g list the authors of an article)
  2. The query asks for scholarly article in results (e.g list articles published in a specific date)
  3. The query asks for information that has to pass through the scholarly articles (e.g list authors of articles published in a specific date)

To approximate the number of queries that relate to scholarly articles certain assumptions were made. These should practically cover all such queries.

  1. The query will contain the QID of scholarly article: Q13442814
  2. The query will contain the QID of the articles themselves
  3. The query will contain objects, subjects, or predicates that are used most often in the scholarly article subgraph in Wikidata (e.g author, publication date etc). This set of queries should approximately cover 2nd and 3rd category of queries.
    • Query containing predicates that are used mostly in scholarly articles subgraph
    • Query containing subject/object URIs that are used mostly in scholarly articles subgraph
    • Query containing literals that are used mostly in scholarly articles subgraph

Q: What do you mean by subject/predicate/object mostly in scholarly articles subgraph?
A: Some items occur almost always in relation to scholarly articles. Typical examples include the author property, certain author items, publication date, cites work, DOI, and many other external IDs. One way to approximate such items is to compute how often an item is used in the scholarly articles subgraph as a percentage of its use in the whole of Wikidata. The distribution of this percentage and further analysis are given below.

The following analysis uses Wikidata dump of 20210816 and WDQS public SPARQL queries of 07/2021 and 08/2021. All query related values below are monthly counts.

Summary of Query Counts

Count of SPARQL queries related to scholarly articles (monthly)
Category Count % of all queries
Query contains scholarly article QID Q13442814 70K 0.04
Query contains scholarly article instance QID 730K 0.4
Query contains properties mostly relevant to scholarly articles 2.7M 1.4
Query contains subject or object URIs mostly relevant to scholarly articles 750K 0.4
Query contains literals mostly relevant to scholarly articles 825K (max 2.2M) 0.4 (max 1.2)
Total scholarly article related queries 3.7M (max 4.7M) 1.96 (max 2.5)
Total queries 190M -

Query Stats

Queries with scholarly article QID

Although there are other kinds of articles (see Definition of Scholarly Articles for details), scholarly articles outnumber the others significantly, so the first step was to find queries that specify the QID of scholarly article (Q13442814) directly. This includes queries that may ask for a list of scholarly articles with other conditions. For instance: list the scholarly articles published on a specific date or by a specific author.

  • The number of queries that contained the QID of scholarly article came out to be ~70K, which is 0.04% of monthly queries.

Queries with scholarly article instance

Another almost direct way to identify queries related to scholarly articles is to find the queries that mention any scholarly article. For this, the queries are checked for the presence of items that are instances of scholarly article. For example: list the authors of an article.

  • The number of such queries was ~730K, which is 0.4% of monthly queries.
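
The two checks described so far (the class QID and the instance QIDs) can be sketched as a text match on the query string. This is an illustration under stated assumptions, not the method used to produce the counts above: the regular expression, the function names, and the in-memory `article_qids` set are all hypothetical, and a plain text match also picks up QIDs that appear inside string literals or comments.

```python
import re

SCHOLARLY_ARTICLE_CLASS = "Q13442814"
# Word boundaries keep Q13442814 from matching inside a longer QID.
QID_PATTERN = re.compile(r"\bQ[1-9]\d*\b")


def qids_in_query(query_text):
    """Return the set of QIDs mentioned anywhere in a SPARQL query string."""
    return set(QID_PATTERN.findall(query_text))


def classify(query_text, article_qids):
    """article_qids: set of QIDs that are instances of scholarly article."""
    qids = qids_in_query(query_text)
    return {
        "mentions_class": SCHOLARLY_ARTICLE_CLASS in qids,
        "mentions_instance": bool(qids & article_qids),
    }


query = "SELECT ?a WHERE { wd:Q39790431 wdt:P50 ?a . ?a wdt:P31 wd:Q5 }"
result = classify(query, article_qids={"Q39790431"})
print(result)
```

At the scale of the full dump, the set of article QIDs would hold roughly 37M entries, so a real pipeline would more likely join extracted QIDs against a table than hold them in a Python set.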

Queries with properties mostly relevant to scholarly articles

Some properties are used almost always for scholarly articles, such as author (P50), author name string (P2093), cites work (P2860), IDs like P7710, P5875, P818, etc. If these properties were to be used in a query, we can assume the query has to pass through the scholarly articles subgraph.
There are around 2000 distinct properties used in the scholarly articles subgraph (refer to Predicates of Scholarly Articles for details on properties). To get a list of properties most concerned with scholarly articles, the following steps were taken:

  1. The usage count of these properties was counted in the entirety of Wikidata (total_count)
  2. The usage count of properties was counted within scholarly articles subgraph (sa_count)
  3. Then percentage of usage within scholarly article subgraph was calculated (sa_count/total_count)
  4. All properties with >=99% usage solely in scholarly articles subgraph were considered properties most concerned with scholarly articles.
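
The four steps above can be sketched as follows. The function name and the dictionary layout are assumptions made for illustration; the counts are copied from the property tables on this page.

```python
def mostly_scholarly(total_count, sa_count, cutoff=0.99):
    """Return properties whose share of usage inside the subgraph is >= cutoff.

    total_count: property label -> usage count in all of Wikidata
    sa_count:    property label -> usage count in the scholarly articles subgraph
    """
    selected = {}
    for prop, sa in sa_count.items():
        share = sa / total_count[prop]
        if share >= cutoff:
            selected[prop] = share
    return selected


# Counts copied from the tables on this page.
total_count = {"cites work": 789_804_605, "DOI": 135_841_849, "instance of": 294_973_877}
sa_count = {"cites work": 789_200_575, "DOI": 134_464_365, "instance of": 120_957_868}

selected = mostly_scholarly(total_count, sa_count)
print(sorted(selected))
```

With the 99% cut-off only "cites work" is kept from these three. DOI, at just under 99%, is excluded, which shows how sensitive the selection is to the chosen cut-off.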

A histogram of the distribution of usage percentage for all predicates used in the scholarly article subgraph is given below. Note that the predicates were grouped with their P values. That is, wd:P50, wdt:P50 would be considered simply P50.

  • The number of predicates used >=99% in the scholarly article subgraph is 40. If each predicate variant were considered separately (i.e., wd, wdt, wdtn, etc.), then 128 predicates would be counted as used >=99% in scholarly articles.
  • The number of queries that use these 40 predicates is 2.7M, which is 1.4% of monthly queries.
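
Grouping predicate variants by their P value, as described above, can be sketched like this. The regular expression and the function names are illustrative assumptions; the counts are taken from the Top Predicates table.

```python
import re
from collections import Counter

# A Wikidata property URI ends in /P<digits>, whatever its prefix path is.
P_ID = re.compile(r"/(P\d+)$")


def property_id(predicate_uri):
    """Return the P value at the end of a predicate URI, or None."""
    match = P_ID.search(predicate_uri)
    return match.group(1) if match else None


def count_by_property(predicate_counts):
    """Sum triple counts over all variants (p, ps, wdt, ...) of each property."""
    grouped = Counter()
    for uri, n in predicate_counts.items():
        pid = property_id(uri)
        if pid is not None:  # skip non-property predicates such as schema:description
            grouped[pid] += n
    return grouped


counts = {
    "http://www.wikidata.org/prop/P2860": 263_097_842,
    "http://www.wikidata.org/prop/statement/P2860": 263_097_837,
    "http://www.wikidata.org/prop/direct/P2860": 263_004_896,
    "http://schema.org/description": 1_321_691_671,
}
grouped = count_by_property(counts)
print(grouped)
```

The three P2860 variants sum to 789,200,575, which is the "cites work" count reported in the Top Properties table.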
Predicates that occur mostly in scholarly article subgraph (used >=99%)
Property Used in Wikidata Used in scholarly article subgraph % usage in scholarly article subgraph
Rxivist preprint ID 9 9 100
GONIAT paper ID 3 3 100
I-Revues ID 54 54 100
Epistemonikos ID 3 3 100
National Criminal Justice ID 3 3 100
OpenReview.net submission ID 18 18 100
Paperity article ID 42 42 100
Arnet Miner publication ID 9 9 100
PubAg ID 9 9 100
What Works Clearinghouse study ID 30 30 100
CJFD journal article ID 2915224 2915210 100
Anais do Museu Paulista article ID 1656 1656 100
Scilit work ID 36 36 100
ScienceOpen publication ID 21 21 100
PMCID 19060175 19059585 99.997
PubMed ID 95897900 95893499 99.995
COVIDWHO ID 41751 41748 99.993
ResearchGate publication ID 13736027 13734703 99.990
Erudit article ID 26340 26334 99.977
arXiv ID 1040677 1040311 99.965
ADS bibcode 1187030 1186590 99.963
describes a project that uses 240579 240426 99.936
Australian Faunal Directory publication ID 54570 54534 99.934
Dimensions Publication ID 4620838 4617733 99.933
cites work 789804605 789200575 99.924
BHL part ID 8874 8862 99.865
issue 95317006 95083032 99.755
author name string 404079137 402859029 99.698
corrigendum / erratum 94870 94579 99.693
Altmetric ID 38379 38256 99.680
BioStor work ID 180236 179642 99.670
volume 104172639 103612864 99.463
zbMATH work ID 16590 16491 99.403
Fatcat ID 446784 443994 99.376
IEEE Xplore document ID 6072 6030 99.308
Semantic Scholar corpus ID 45123 44802 99.289
Mathematical Reviews ID 8307 8241 99.205
affiliation string 47300 46909 99.173
page(s) 105099096 104199512 99.144
is retracted by 4303 4260 99.001


Top predicates (based on usage-count) in scholarly articles subgraph
Property Used in Wikidata Used in scholarly article subgraph % usage in scholarly article subgraph
cites work 789804605 789200575 99.924
author name string 404079137 402859029 99.698
series ordinal 157124675 154089316 98.068
publication date 159534379 149227733 93.540
DOI 135841849 134464365 98.986
instance of 294973877 120957868 41.006
title 123016603 112574265 91.511
published in 112375873 109147370 97.127
page(s) 105099096 104199512 99.144
volume 104172639 103612864 99.463
PubMed ID 95897900 95893499 99.995
issue 95317006 95083032 99.755
author 63828357 59439111 93.123
main subject 44587072 41515900 93.112
language of work or name 42976957 34801291 80.977
PMCID 19060175 19059585 99.997
ResearchGate publication ID 13736027 13734703 99.990
stated as 9056223 8609378 95.066
exact match 9823282 8162235 83.091
Dimensions Publication ID 4620838 4617733 99.933
CJFD journal article ID 2915224 2915210 100.000
DBLP publication ID 2141650 2080695 97.154
full work available at URL 4566802 1574331 34.473
ADS bibcode 1187030 1186590 99.963
OpenCitations bibliographic resource ID 1237320 1161475 93.870
Predicate usage-count distribution in scholarly articles
mean 3,898,755
std 37,445,470
min 1
25% 3
50% 10
75% 156
90th percentile 17,086
95th percentile 180,075
99th percentile 103,859,256
max 789,200,600

Top properties used in other subgraphs

Top predicates that are used a lot in scholarly articles (>95th percentile of counts), but also used a lot outside. These predicates are possibly susceptible to inadvertent query calls to scholarly articles.
Property Used in Wikidata Used in scholarly article subgraph % usage in scholarly article subgraph Used outside scholarly articles % usage outside scholarly article subgraph
instance of 294973877 120957868 41.00 174016009 58.99
title 123016603 112574265 91.51 10442338 8.48
publication date 159534379 149227733 93.54 10306646 6.46
language of work or name 42976957 34801291 80.97 8175666 19.02
author 63828357 59439111 93.12 4389246 6.87
copyright status 4212816 503159 11.94 3709657 88.05
published in 112375873 109147370 97.12 3228503 2.87
main subject 44587072 41515900 93.11 3071172 6.88
series ordinal 157124675 154089316 98.06 3035359 1.93
full work available at URL 4566802 1574331 34.47 2992471 65.52
exact match 9823282 8162235 83.09 1661047 16.90
DOI 135841849 134464365 98.98 1377484 1.01
on focus list of Wikimedia project 1591312 272541 17.12 1318771 82.87
author name string 404079137 402859029 99.69 1220108 0.30
page(s) 105099096 104199512 99.14 899584 0.85
cites work 789804605 789200575 99.92 604030 0.07
volume 104172639 103612864 99.46 559775 0.53
stated as 9056223 8609378 95.06 446845 4.93
copyright license 626695 319255 50.94 307440 49.05
Internet Archive ID 765078 499905 65.34 265173 34.65
issue 95317006 95083032 99.75 233974 0.24
first line 326508 201612 61.74 124896 38.25
OpenCitations bibliographic resource ID 1237320 1161475 93.87 75845 6.12
DBLP publication ID 2141650 2080695 97.15 60955 2.84
JSTOR article ID 676666 655669 96.89 20997 3.10
PubMed ID 95897900 95893499 99.99 4401 0.00
BHL Page ID 187381 183977 98.18 3404 1.81
Dimensions Publication ID 4620838 4617733 99.93 3105 0.06
Fatcat ID 446784 443994 99.37 2790 0.62
ResearchGate publication ID 13736027 13734703 99.99 1324 0.00
PMCID 19060175 19059585 99.99 590 0.00
ADS bibcode 1187030 1186590 99.96 440 0.03
arXiv ID 1040677 1040311 99.96 366 0.03
describes a project that uses 240579 240426 99.93 153 0.06
CJFD journal article ID 2915224 2915210 100.00 14 0.00

Queries with sub/obj URIs mostly relevant to scholarly articles

Just like properties, some items can occur more in scholarly articles subgraph than other places in wikidata. Following a similar procedure, subject and object URIs usage percentage was calculated and those >=99% were considered more related to scholarly articles. The queries were then searched for presence of these more relevant items.

A histogram of the distribution of usage percentage for all subjects and object URIs used in the scholarly article subgraph is given below.

  • The number of queries that use items more relevant to scholarly articles is 750K, which is 0.4% of monthly queries.
  • Note that number of queries containing instances of scholarly articles was 730K. Therefore, most of the former number actually includes queries that directly mention the scholarly articles.
  • The figure shows that quite a large number of items are used mostly in scholarly articles (>=99% usage), which were then used to sift through queries.

Queries with literals mostly relevant to scholarly articles

Literals were analyzed separately. They always occur as objects, with or without additional language or datatype tags ("label"@en, "2021"^^xsd:integer). A user can construct queries containing literals in the following ways:

  1. match the whole literal string, e.g. "labelstring", optionally using LANG/DTYPE for further filtering. The literal itself appears as a plain string.
  2. match the literal with language or datatype tags, e.g. "labelstring"@en
  3. match part of the literal, e.g. using regex(?g, "matchThisSubstring") or more complex expressions.

For the purposes of this analysis, only 1 and 2 were considered when finding queries, since matching substrings is rather complicated and not as reliable. But we suspect that if a query contains literals related to scholarly articles, it should also contain some predicates or other URIs that are mostly related to scholarly articles. As before, literals that were used >=99% of the time in the scholarly articles subgraph were used to find queries related to scholarly articles.

A histogram of the distribution of usage percentage for all literals used in the scholarly article subgraph is given below.

  • The number of such queries was ~825K, which is 0.4% of monthly queries. Note that this number was obtained from July. In August, it was ~600K.

Removing literals used in references and values

References and values are not considered part of the scholarly article subgraph, since they may be used in other places as well. But this pushes the usage percentage of certain items that are clearly related to scholarly articles below 99%, so they are not included in the query counts. Therefore, query counting was later repeated with all references and values removed. For URIs, removing refs and vals does not significantly change the count of scholarly article related queries. But some differences are seen when literals are matched to count queries.

  • The number of queries with literal usage >=99% in scholarly article subgraph, excluding any occurrences in references or values, is 2.2M for July, which is 1.2% of monthly queries. It is 1.6M for August, which is 0.84% of monthly queries.
  • The total number of queries related to scholarly articles therefore becomes 4.7M, forming 2.5% of monthly queries. These values are marked as max in the summary table.

Queries with labels and descriptions mostly relevant to scholarly articles

While literals already cover labels and descriptions, a separate analysis was done on them. The process of finding scholarly article related queries remains the same, except this time we only look for labels and descriptions of scholarly articles in the queries.

  • The number of queries asking for labels or descriptions of scholarly articles was ~300K, which is 0.16% of monthly queries.

Summary query stats with various cut-offs

As described above, a cut-off of 99% was chosen for predicates, URIs, and literals. This means, for example, that predicates with at least 99% of their occurrences in the scholarly articles subgraph are considered related to scholarly articles. SPARQL queries containing these predicates were then considered queries related to scholarly articles. Counts were also determined for cut-off values of 95% and 90% for comparison. A summary of all these counts for August 2021 is given below.

In order to prevent under-counting (although possibly over-counting), literals were considered by removing references and values. This gives higher count of queries than with references and values. Besides, some values may not match with the summary table initially provided which contains maximum value obtained from July and August for each category.

Count of SPARQL queries related to scholarly articles (monthly) for different cut-off values
Category Count with 99% cut-off Count with 95% cut-off Count with 90% cut-off
Query contains scholarly article QID Q13442814 68,868
Query contains scholarly article instance QID 580,488
Query contains properties mostly relevant to scholarly articles 2,596,672 4,017,406 19,226,263
Query contains subject or object URIs mostly relevant to scholarly articles 594,759 904,441 1,185,951
Query contains literals mostly relevant to scholarly articles 1,610,880 1,653,899 1,806,235
Total scholarly article related queries 3,735,535 6,311,813 21,902,823
Percent of all queries 1.9% 3.2% 11%
Total queries 197,000,537

Given the sharp decrease in query count as the cut-off percentage increases, 99% is a fairly precise value (but still over-counting).

This section analyses the queries that were extracted following the above methods as being related to scholarly articles. A total of ~3.7M queries (out of 190M) were identified as such.

User agent

  • Number of distinct user agents: 3,138
Top user agents for count of scholarly article related queries (Aug 2021)
User Agent Count Percentage of scholarly article queries
wikidataintegrator/0.8.4 2254245 61.3
@wikimedia/kartotherian-geoshapes/1.1.4 199195 5.4
UA # 1 154249 4.2
Scholia 116809 3.2
Toolforge - mix-n-match 110080 2.99
Toolforge - legacy code 109578 2.97
UA # 2 105035 2.9
UA # 3 62926 1.7
PyPoli University Matching 42906 1.2
SearchBot/2.0 32454 0.9
Toolforge - wikidata-terminator 32116 0.87
@wikimedia/kartotherian-geoshapes 24827 0.67
UA # 4 22480 0.6

N.B: Unknown or non-bot user agents were marked UA # x

Query Analysis

Most queries of the top user-agents do not directly ask for articles, rather relate to properties and items related to scholarly articles.

  • There are only 3 types of queries (for August)
  • More than 99% of the queries simply ask for labels and links for some properties and items
  • The properties are mostly external IDs. They are related to scholarly articles, but the query does not directly ask for scholarly articles. Rather just asks for labels and external links for those properties.
  • Similarly, many queries ask for basic information of some items that are related to scholarly articles.
  • Sometimes the items are not directly related to scholarly articles. For instance, geographic names (Asia), topic names (science, nature) etc. These queries were captured because these items probably were used more in the scholarly article context than others.

Query Times

  • Total query time for all queries (with status code 200 and 500) for a month: ~14,000 hours
  • Total query time for scholarly article queries: 152 hours. This is 1% of total query time.
Top User Agents in terms of query execution time for scholarly articles queries
User Agent Query time (hrs) % time of scholarly article queries % time of all queries
wikidataintegrator/0.8.4 28.5 18.7 0.2
@wikimedia/kartotherian-geoshapes/1.1.4 27.3 17.9 0.12
Rust mediawiki API; mediawiki-rust/0.2.7 17.2 11.3 0.12
Toolforge - legacy code 6.7 4.4 0.05
UA # 2 4.3 2.8 0.03
@wikimedia/kartotherian-geoshapes/1.1.3 3.4 2.2 0.02
UA # 5 2.8 1.9 0.02
UA # 6 2.2 1.5 0.016
UA # 7 2.1 1.39 0.015
Toolforge - wikidata-todo 1.9 1.29 0.014
UA # 8 1.9 1.27 0.013
UA # 9 1.6 1.08 0.011
PyPoli University Matching 1.5 1.04 0.011
python-requests/2.21.0 1.5 1.03 0.011
UA # 10 1.5 1.01 0.011
UA # 11 1.4 0.96 0.010
Scholia 1.4 0.9 0.01
C++ WikiAPI 1.3 0.9 0.009
SearchBot/2.0 1.2 0.8 0.009
UA # 12 1.1 0.78 0.008
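
The two percentage columns above can be re-derived from the hours column and the two totals quoted earlier (152 hours for scholarly article queries, roughly 14,000 hours for all queries). The sketch below shows that arithmetic for two rows. Because the published hours are rounded, the recomputed values may differ slightly from the table.

```python
SCHOLARLY_HOURS = 152.0   # total for scholarly article queries (from the text)
ALL_HOURS = 14_000.0      # approximate total for all queries (from the text)


def time_shares(hours):
    """Return (% of scholarly article query time, % of all query time)."""
    return 100 * hours / SCHOLARLY_HOURS, 100 * hours / ALL_HOURS


for agent, hours in [("wikidataintegrator/0.8.4", 28.5),
                     ("Rust mediawiki API; mediawiki-rust/0.2.7", 17.2)]:
    of_scholarly, of_all = time_shares(hours)
    print(f"{agent}: {of_scholarly:.1f}% of scholarly, {of_all:.2f}% of all")
```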
Query time distribution for scholarly article queries
query time class count
less_10ms 40,992
10ms_to_100ms 3,090,231
100ms_to_1s 497,279
1s_to_10s 43,911
more_10s 5,342
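
As an illustration of how such a distribution can be built and read, the sketch below assigns a query time in milliseconds to one of the classes above and converts the published counts into percentages. The handling of boundary values (for example, whether exactly 10 ms falls into the first or second class) is an assumption, since the page does not specify it.

```python
def query_time_class(ms):
    """Map a query time in milliseconds to a class name (lower bound inclusive)."""
    if ms < 10:
        return "less_10ms"
    if ms < 100:
        return "10ms_to_100ms"
    if ms < 1_000:
        return "100ms_to_1s"
    if ms < 10_000:
        return "1s_to_10s"
    return "more_10s"


# Counts copied from the table above.
counts = {
    "less_10ms": 40_992,
    "10ms_to_100ms": 3_090_231,
    "100ms_to_1s": 497_279,
    "1s_to_10s": 43_911,
    "more_10s": 5_342,
}
total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```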

Triples Analysis

The following table shows the top subjects, predicates, and objects used in queries that were identified as being related to scholarly articles. Top items are the top Wikidata items or properties used anywhere within a query. These can occur as part of triples (subject/predicate/object) or outside them (within VALUES).

Top items in queries related to scholarly articles
Subject count Predicate count Object count Item count
property 4572314 wikibase:language 2861468 [AUTO_LANGUAGE],en 2349365 P1630 2264200
bd:serviceParam 2958054 wikibase:propertyType 2271979 propertyType 2262518 P698 184303
item 1056936 wdt:P1630 2264038 formatter_url 2254181 P31 521323
id 991439 rdfs:label 545307 en 475305 P932 509593
hint:Prior 506337 hint:gearing 468550 forward 466818 P279 382700
q 439497 (<http://www.wikidata.org/prop/P31>/<http://www.wikidata.org/prop/statement/P31>)/(<http://www.wikidata.org/prop/direct/P279>)* 302074 id 307873 P356 164921
link 434718 schema:about 275565 idLabel 211336 P131 155524
gas:program 359477 schema:isPartOf 273056 https://en.wikipedia.org/ 208643 P582 134232
work 330934 wdt:P31 241829 ?0 199052 P577 109605
locst 253818 wdt:P356 181389 item 198614 P50 108597
?0 127841 wdt:P50 141939 locst 126906 P361 108377
internal_id 108761 wdt:P698 132169 p 92839 P2093 105217
author_statement 88878 pq:P582 129357 wd:Q4057820 65957 P214 100948
article 82675 p:P131 126906 wd:Q2198484 65957 P1433 96566
?1 72329 <http://www.wikidata.org/prop/statement/P131>/(<http://www.wikidata.org/prop/direct/P131>)* 126333 wd:Q13626398 65957 P176 71575
person 71203 wdt:P577 110947 sitelinks 63155 P1545 68155
statement 66780 p 109503 datetime 59823 Q13626398 65976
prop 65362 wdt:P214 98577 selfauthor 52568 Q4057820 65958
citing_work 62469 wikibase:directClaim 91965 ?1 52506 Q2198484 65957
subject 45541 wdt:P465 90871 3^^http://www.w3.org/2001/XMLSchema#integer 51704 P452 65479
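
The "Item count" column counts Wikidata items and properties wherever they appear in a query. A rough way to approximate such a count is to scan the query text for P and Q identifiers, as sketched below. This regex-based approach is a simplification substituted for illustration, not the parser-based extraction used for the table. It can over-count when an identifier-like string appears inside a literal or a variable name, and the sample queries are made up.

```python
import re
from collections import Counter

# Matches Wikidata property (P...) and item (Q...) identifiers.
ID_PATTERN = re.compile(r"\b([PQ][1-9]\d*)\b")


def count_items(queries):
    """Count every occurrence of a P/Q identifier across all query strings."""
    counts = Counter()
    for query in queries:
        counts.update(ID_PATTERN.findall(query))
    return counts


# Hypothetical queries for illustration only.
queries = [
    "SELECT ?item WHERE { ?item wdt:P31 wd:Q13442814 ; wdt:P50 ?author }",
    "SELECT ?item WHERE { ?item wdt:P31/wdt:P279* wd:Q13442814 }",
    "SELECT ?p ?url WHERE { VALUES ?p { wd:P698 wd:P356 } ?p wdt:P1630 ?url }",
]
counts = count_items(queries)
for item, n in counts.most_common(3):
    print(item, n)
```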

Paths

The following table shows the top paths used in queries related to scholarly articles. Ordinary properties are not considered paths. The list contains not only the paths themselves, but also their breakdown into component paths (as done by Jena ARQ while parsing SPARQL queries). For instance, (p:P31/ps:P31)/(wdt:P279)* is recorded as:

  • (p:P31/ps:P31)/(wdt:P279)*
    • (p:P31/ps:P31)
      • p:P31
      • ps:P31
    • (wdt:P279)*
      • wdt:P279
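
The breakdown above can be reproduced with a small recursive function. The sketch below is a hand-written approximation for illustration, not Jena ARQ's actual implementation. It only handles sequence paths (`/`), parenthesised groups, and the `*`, `+`, `?` modifiers; alternatives (`|`), inverse (`^`), and negated paths are not covered. The function names are hypothetical.

```python
def split_top_level(path):
    """Split on '/' outside parentheses and outside <...> IRIs."""
    parts, depth, in_iri, start = [], 0, False, 0
    for i, ch in enumerate(path):
        if ch == "<":
            in_iri = True
        elif ch == ">":
            in_iri = False
        elif in_iri:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "/" and depth == 0:
            parts.append(path[start:i])
            start = i + 1
    parts.append(path[start:])
    return parts


def is_wrapped(path):
    """True if one pair of outer parentheses encloses the whole path."""
    if not (path.startswith("(") and path.endswith(")")):
        return False
    depth = 0
    for i, ch in enumerate(path):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i < len(path) - 1:
                return False
    return True


def decompose(path):
    """Return the path followed by all of its component paths, in pre-order."""
    path = path.strip()
    if not path:
        return []
    parts = split_top_level(path)
    if len(parts) > 1:                      # sequence path: a/b
        out = [path]
        for part in parts:
            out.extend(decompose(part))
        return out
    if path[-1] in "*+?":                   # modified path: (a)*
        return [path] + decompose(path[:-1])
    if is_wrapped(path):                    # group: (a/b) or (a)
        sub = decompose(path[1:-1])
        if len(sub) > 1:
            sub[0] = path                   # keep parentheses on composite groups
        return sub
    return [path]                           # plain property


for component in decompose("(p:P31/ps:P31)/(wdt:P279)*"):
    print(component)
```

Each component returned this way would be counted once per query occurrence, which is why a single property such as wdt:P279 can appear in the table with a count at least as high as the starred path built from it.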
Top paths in queries related to scholarly articles
Path count
<http://www.wikidata.org/prop/direct/P279> 24524
(<http://www.wikidata.org/prop/direct/P279>)* 22321
<http://www.wikidata.org/prop/direct/P131> 14840
(<http://www.wikidata.org/prop/direct/P131>)* 14839
<http://www.wikidata.org/prop/statement/P31> 12708
<http://www.wikidata.org/prop/P31> 12708
<http://www.wikidata.org/prop/P31>/<http://www.wikidata.org/prop/statement/P31> 12708
(<http://www.wikidata.org/prop/P31>/<http://www.wikidata.org/prop/statement/P31>)/(<http://www.wikidata.org/prop/direct/P279>)* 12693
<http://www.wikidata.org/prop/statement/P131>/(<http://www.wikidata.org/prop/direct/P131>)* 12633
<http://www.wikidata.org/prop/statement/P131> 12633
<http://www.wikidata.org/prop/direct/P31> 12187
<http://www.wikidata.org/prop/direct/P31>/(<http://www.wikidata.org/prop/direct/P279>)* 60883
<http://www.wikidata.org/prop/direct/P361> 51271
(<http://www.wikidata.org/prop/direct/P361>)* 49358
<http://www.wikidata.org/prop/statement/P361> 45422
<http://www.wikidata.org/prop/P361> 45422
<http://www.wikidata.org/prop/P361>/<http://www.wikidata.org/prop/statement/P361> 45418
(<http://www.wikidata.org/prop/P361>/<http://www.wikidata.org/prop/statement/P361>)/(<http://www.wikidata.org/prop/direct/P361>)* 45418
<http://www.wikidata.org/prop/direct/P50> 27444
<http://www.wikidata.org/prop/direct/P2093> 27408
  1. wikidata:Special:Statistics