Wikidata Query Service/WDQS Graph Split Impact Analysis

The scope of this analysis is to assess the impact (on existing SPARQL queries) of splitting the Wikidata graph into two subgraphs: scientific articles (also referred to as scholarly articles) on one side and the rest (called Wikidata main) on the other. We will refer to the existing graph served by WDQS as the full graph.

TL;DR: Data shows that less than 10% of the queries might be affected by the graph split, and that the affected queries come from only a few user-agents. The top 5 most impacted user-agents account for more than 90% of the total impact. In other words, it seems possible to drastically reduce the impact of the split by addressing a few user-agents.

Dataset

We used 6 different samples gathered from the WDQS query logs. Five come from known tools used by the Wikidata community to access WDQS:

  • Listeria
  • MixNMatch
  • Pywikibot
  • WikidataIntegrator
  • SPARQLWrapper

10,000 queries collected in October and November 2023 were extracted from each (except for MixNMatch, where only 8,396 queries were found in the query logs during this period). A sixth sample, called random, contains 100,000 queries representative of the various query lengths and execution times seen in our query logs[1].

Basic sample statistics

Below is a table showing the number of queries per sample versus the number of unique queries.

Sample Total Unique
Listeria 10 000 101
MixNMatch 8 396 626
Pywikibot 10 000 5 455
SPARQLWrapper 10 000 9 919
WikidataIntegrator 10 000 2 507
random 100 000 85 838

For some samples a non-negligible number of queries are repeated. The total counts could be used as a base to assess the impact on a per-query-execution basis, and the unique counts to assess the impact per use-case (boldly considering a unique query to be a use-case). Since both might have value, most metrics analyzed here will be compared against both the total and the de-duplicated view of the samples.

Collecting SPARQL query results

For all the queries we collected the results returned from the full graph and the main graph. For various reasons not all queries ran successfully: some failed either originally (when they initially hit the production servers) or during the analysis when they ran on our experimental endpoints.

Sample Total successful (originally) successful (full) successful (main) % successful (originally) % successful (full) % successful (main)
Listeria 10 000 133 9 466 9 466 1.33 94.66 94.66
MixNMatch 8 396 7 477 8 369 8 369 89.05 99.68 99.68
Pywikibot 10 000 9 812 9 949 9 951 98.12 99.49 99.51
SPARQLWrapper 10 000 9 864 9 907 9 912 98.64 99.07 99.12
WikidataIntegrator 10 000 9 997 9 998 9 999 99.97 99.98 99.99
random 100 000 97 909 99 887 99 869 97.91 99.89 99.87

The success rate is generally good (except for Listeria, which has a very low 1.33% original success rate[2]). For simplicity, and given the high success rate of the queries on the experimental endpoints, the analysis will focus solely on the successful queries.

Methodology

Assessing the impact requires knowing the intent of the SPARQL query: in other words, does the author of the query expect/want scientific articles to impact the results of their queries or not. Knowing this with certainty is not possible without confirmation from the authors themselves, but we can try to approximate it by:

  1. Determining the presence of a scientific article in the query or one of its results[3]
  2. Running the query against the full graph and the Wikidata main graph (loaded from the exact same dump) and finding differences in the results returned. If there is a difference, it is perhaps because the results depended on the presence of scientific articles.

Assessing the first point might give good confidence about the intent of the query, but spotting differences is a less direct signal and might need adjustments to take false positives into account.

Unfortunately some SPARQL queries might return varying results not because scientific articles are missing but because of some of the SPARQL features they use, which can be a cause of many false positives. In the samples used for this analysis we identified 6 SPARQL features[4] that can cause such false positives.

Similarly, comparing the query results might yield false positives because of the order in which the results are returned by Blazegraph (many queries do not include sorting criteria).
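
To reduce this source of false positives, the two result sets can be compared as unordered multisets of rows. Below is a minimal sketch of such a comparison, assuming each result row has been parsed into a Python dict of variable bindings (the helper name is ours, not the actual analysis code):

from collections import Counter

def same_results(full_rows, main_rows):
    # Represent each row as a frozenset of (variable, value) pairs so that
    # neither the row order nor the binding order affects the comparison;
    # the Counter keeps duplicate rows significant (bag semantics).
    def normalize(rows):
        return Counter(frozenset(row.items()) for row in rows)
    return normalize(full_rows) == normalize(main_rows)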

A quick note should be made regarding queries that return zero results: without any results we only have the SPARQL query string to detect the presence of a scientific article, and it is also "easier" for the two graphs to return identical (empty) results.

Overall we combine four criteria, which we name:

  • true positives: for queries where we identified a direct reference to a scientific article
  • same: for queries returning the same results on both graphs (regardless of the order)
  • zero results: for queries returning zero results on both graphs
  • deterministic: for queries without a feature that might affect the results for unrelated reasons

Combining these we can classify the queries into four buckets[5]:

  • unimpacted: for queries with high confidence that they are not impacted by the split (not a true positive, same results on both graphs, at least one result, deterministic or not)
  • zeros: for queries returning zero results, with relatively good confidence that they are not impacted (not a true positive, returning zero results, deterministic or not)
  • unsure: for queries with some uncertainty whether they might be impacted or not (not a true positive, returning different results, not deterministic)
  • impacted: for the queries that are impacted (all other combinations of the criteria above, mainly all true positives and deterministic queries returning different results)
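
As an illustration, this bucketing can be expressed directly over the four boolean criteria. A minimal sketch (the function and flag names are ours; note that zero results on both graphs implies identical results):

def classify(true_positive, same, zero_results, deterministic):
    # Any direct reference to a scientific article marks the query as impacted.
    if true_positive:
        return "impacted"
    # Identical results on both graphs, with or without determinism.
    if same:
        return "zeros" if zero_results else "unimpacted"
    # Results differ: a deterministic query is impacted, otherwise we are unsure.
    return "impacted" if deterministic else "unsure"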

Results

Below is a graph showing the impact in % compared to the total number of successful queries:

and this version shows the impact in % compared to the number of unique queries:

Proper interpretation of these results will require a more qualitative analysis, but we can see that, except for MixNMatch[6], the ratios are roughly identical whether compared against the plain samples or the de-duplicated ones.

It might be interesting to visualize how the impact is distributed across the various use-cases/user-agents. One of the goals of the community consultation phase (tracked under phab:T356773) of the wikidata:Wikidata:SPARQL_query_service/WDQS_graph_split project is to collect use-cases and to help/guide the impacted tools & clients in evaluating the feasibility of using SPARQL federation to make their use-cases compatible with the split.
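
As a sketch of what federation could look like after the split, the query below fetches scholarly-article triples from a separate endpoint while the rest of the query runs against the main graph. This is illustrative only: the scholarly endpoint URL is a placeholder (the actual split endpoints are not part of this analysis), and the client code uses SPARQLWrapper, one of the tools studied above:

from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder: the real URL of the endpoint serving the scholarly subgraph
# will only be known once the split is deployed.
SCHOLARLY_ENDPOINT = "https://query-scholarly.example.org/sparql"

QUERY = """
SELECT ?article ?articleLabel WHERE {
  SERVICE <%s> {
    # Scholarly articles (Q13442814) authored by Charles Darwin (Q1035)
    # come from the scholarly subgraph via federation...
    ?article wdt:P31 wd:Q13442814 ;
             wdt:P50 wd:Q1035 .
  }
  # ...while the label service runs on the main endpoint.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""" % SCHOLARLY_ENDPOINT

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()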

While we cannot display specific user-agents for privacy reasons, we can graph the cumulative share of the overall impact attributable to the various user-agents. This gives a sense of how quickly we could reduce the overall impact of this project by addressing such clients/use-cases (the x axis is the number of user-agents, the y axis is the cumulative proportion, in %, of the queries considered impacted in the bar charts above).
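
The curve is straightforward to derive from per-user-agent counts: sort the user-agents by their number of impacted queries and accumulate. A minimal sketch, assuming a dict mapping each user-agent to its impacted-query count (names are ours):

def cumulative_impact_shares(impacted_by_ua):
    # Largest contributors first, then running share of the total impact in %.
    counts = sorted(impacted_by_ua.values(), reverse=True)
    total = sum(counts)
    shares, running = [], 0
    for count in counts:
        running += count
        shares.append(100.0 * running / total)
    return shares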

The above graph shows that the most impacted user-agent accounts for 45% of the total impact, and the top 5 account for more than 90% of the impacted queries.

And the version on de-duplicated queries which shows roughly the same trend:

In other words it does seem plausible to drastically reduce the negative impact of the graph split. This assumes that we can get in touch with the maintainers of these clients/tools, which might not be possible for those that do not set a proper user-agent.

The tail is not particularly long, with 56 user-agents[7] in these samples.

Annex

Query execution error statistics

In order to understand why a query might have failed when it initially hit WDQS, we can interpret the HTTP status code returned by WDQS. The table below shows the number of queries per sample and status code:

Sample 200 400 403 429 500 503
Listeria 133 <100 9 829 <100 <100 0
MixNMatch 7 477 0 0 0 919 0
Pywikibot 9 812 <100 0 <100 172 0
SPARQLWrapper 9 864 <100 <100 <100 <100 0
WikidataIntegrator 9 997 0 0 <100 <100 0
random 97 909 <100 1 979 <100 <100 <100
  • 200 is a success
  • 400 is a bad request, generally a SPARQL syntax error or a misbehaving HTTP client
  • 403 forbidden, used by WDQS when a client is banned after making too many requests
  • 429 to warn a client and ask it to slow down when it is making too many requests
  • 500 a server failure (a timeout, an internal error)
  • 503 service unavailable (when Blazegraph is unable to respond, could be a timeout or an internal error)

This does seem to indicate that the high failure rate of Listeria is due to the throttling mechanism protecting WDQS.

Detecting a reference to a scientific article

Since we know the list of Wikidata items that are scientific articles, we can search for a reference in the SPARQL query or within the results returned by the query when running against the full graph. The tables below show how many of these we found:

Queries with a reference to a scientific article
Sample Total queries Scientific article in query Scientific article in results Total with a reference
Listeria 10 000 0 0 0
MixNMatch 8 396 0 7 150 7 150
Pywikibot 10 000 <100 <100 100
SPARQLWrapper 10 000 <100 <100 <100
WikidataIntegrator 10 000 <100 <100 <100
random 100 000 4 482 4 298 7 201
Queries (de-duplicated) with a reference to a scientific article
Sample Total queries Scientific article in query Scientific article in results Total with a reference
Listeria 101 0 0 0
MixNMatch 626 0 <100 <100
Pywikibot 5 455 <100 <100 <100
SPARQLWrapper 9 919 <100 <100 <100
WikidataIntegrator 2 507 <100 <100 <100
random 85 838 4 482 4 287 7 190
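
For illustration, the detection can be sketched as a scan of the query string and of the result bindings against the known set of scientific-article item IDs. A minimal sketch (the names below are ours; the actual analysis matched against the full list of such items):

import re

QID_PATTERN = re.compile(r"Q\d+")

def references_scholarly(query_text, result_rows, scholarly_qids):
    # True if the SPARQL text itself mentions a scientific-article item...
    if any(qid in scholarly_qids for qid in QID_PATTERN.findall(query_text)):
        return True
    # ...or if any value bound in the results does (QIDs also appear
    # embedded in entity URIs such as http://www.wikidata.org/entity/Q42).
    return any(
        qid in scholarly_qids
        for row in result_rows
        for value in row.values()
        for qid in QID_PATTERN.findall(str(value))
    )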

Non-deterministic SPARQL features

We consider a SPARQL feature to be non-deterministic if multiple executions of the same query on the exact same graph are not guaranteed to return the same results.

The non-deterministic SPARQL features that we identified are:

wikibase:mwapi service EntitySearch

The use of the MW API may yield different results depending on when the API request is performed (since it calls the Wikidata API, which is constantly being updated).

SELECT ?item ?itemLabel ?type ?typeLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "EntitySearch" .
    bd:serviceParam wikibase:endpoint "www.wikidata.org" .
    bd:serviceParam mwapi:search "something" .
    bd:serviceParam mwapi:language "en" .
    ?item wikibase:apiOutputItem mwapi:item .
    ?num wikibase:apiOrdinal true .
  }
  ?item (wdt:P31|wdt:P279|wdt:P366) ?type .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} ORDER BY ASC(?num) LIMIT 10

GROUP_CONCAT

The GROUP_CONCAT SPARQL feature does not provide a stable ordering when it concatenates the group.

SELECT ?extid (count(?q) AS ?cnt) (GROUP_CONCAT(?q; SEPARATOR = '|') AS ?items)
WHERE { ?q wdt:P973 ?extid }
GROUP BY ?extid HAVING (?cnt>1)
ORDER BY ?extid

Label service AltLabel (lbl_srv_alt_label)

Asking for aliases with the mw:Wikidata_Query_Service/User_Manual#Label_service is not deterministic when there are multiple aliases for the entity.

SELECT ?property ?propertyType ?propertyLabel ?propertyDescription ?propertyAltLabel WHERE {
  ?property wikibase:propertyType ?propertyType .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ASC(xsd:integer(STRAFTER(STR(?property), 'P')))

SAMPLE

The SPARQL SAMPLE aggregate is not required to be deterministic.

SELECT ?universe (SAMPLE(?label) AS ?anyLabel) (COUNT(?planet) AS ?count)
WHERE
{
  ?planet wdt:P31 wd:Q2775969;
          wdt:P1080 ?universe.
  ?universe rdfs:label ?label.
  FILTER(LANG(?label) = "en").
}
GROUP BY ?universe
ORDER BY DESC(?count)

Blazegraph bd:slice service slice.offset

Blazegraph bd:slice with bd:slice.offset is not guaranteed to return the same slice on two different Blazegraph instances, since the offset is applied to the engine's internal statement ordering.

SELECT ?item ?ddate_st WHERE {
  SERVICE bd:slice { ?item p:P570 ?ddate_st . bd:serviceParam bd:slice.offset 100 . bd:serviceParam bd:slice.limit 200 . }
  ?item wdt:P31 wd:Q5 .
}

Using a top level limit on a large result set (top_level_limit)

A SPARQL query using a top-level LIMIT X without a deterministic ordering clause produces X results in an arbitrary order and thus might yield different results across runs.

SELECT ?item {
  ?item wdt:P31 wd:Q5 .
  FILTER NOT EXISTS { ?item wdt:P570 [] }
} LIMIT 100

Distribution in the samples

Number of non-deterministic query features
Sample Total Total deterministic EntitySearch group_concat lbl_srv_alt_label limit_offset sample slice_offset top_level_limit
Listeria 10 000 1 380 0 0 0 5 169 0 1 995 4 080
MixNMatch 8 396 659 0 7 737 0 0 0 0 0
Pywikibot 10 000 7 367 1 138 0 0 1 028 <100 428 530
SPARQLWrapper 10 000 9 434 <100 359 <100 <100 <100 0 205
WikidataIntegrator 10 000 1 765 0 <100 8 225 <100 0 0 <100
random 100 000 67 559 712 17 997 10 543 <100 2 465 <100 2 097
Number of non-deterministic query features over de-duplicated queries
Sample Total Total deterministic EntitySearch group_concat lbl_srv_alt_label limit_offset sample slice_offset top_level_limit
Listeria <100 <100 0 0 0 <100 0 <100 <100
MixNMatch 626 313 0 313 0 0 0 0 0
Pywikibot 5 455 3 212 863 0 0 939 <100 428 327
SPARQLWrapper 9 919 9 357 <100 359 <100 <100 <100 0 201
WikidataIntegrator 2 507 1 559 0 <100 938 <100 0 0 <100
random 85 838 63 969 637 16 526 2 102 <100 2 422 <100 1 469

Combining 4 criteria of a SPARQL query

The 4 criteria of a query can be combined in 12 different ways (4 combinations are not possible, e.g. having different results while both graphs return zero results). To help the analysis we use a simple notation encoded over four characters:

  • s: same
  • t: true positives
  • z: zero results
  • d: deterministic

and using the letter case to indicate whether the query has the criterion (uppercase) or not (lowercase).

For instance Stzd means that a query has obtained similar results on both graphs, has no direct reference to a scientific article, has obtained at least one result, and is not deterministic.
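
A minimal sketch of this encoding, together with the enumeration showing why only 12 of the 16 combinations are possible (different results on the two graphs are incompatible with both graphs returning zero results); the function name is ours:

from itertools import product

def criteria_code(same, true_positive, zero_results, deterministic):
    # Uppercase letter: the criterion holds; lowercase: it does not.
    flags = [(same, "s"), (true_positive, "t"),
             (zero_results, "z"), (deterministic, "d")]
    return "".join(c.upper() if held else c for held, c in flags)

# Enumerate the 12 valid classes: "z" can only be uppercase when "s" is.
valid_classes = [criteria_code(s, t, z, d)
                 for s, t, z, d in product([True, False], repeat=4)
                 if s or not z]
assert len(valid_classes) == 12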

The distribution of these classes over the successful queries is:

Number of queries by criteria group
Sample Total STZD STZd STzD STzd StZD StZd StzD Stzd sTzD sTzd stzD stzd
Listeria 9 466 0 0 0 0 <100 0 765 5 557 0 0 190 2 857
MixNMatch 8 369 0 0 0 0 <100 331 572 260 <100 7 111 0 <100
Pywikibot 9 945 0 0 <100 0 888 1 036 6 440 1 032 <100 <100 <100 442
SPARQLWrapper 9 906 <100 0 <100 0 5 363 326 3 905 202 <100 <100 <100 <100
WikidataIntegrator 9 998 0 0 0 0 371 <100 1 312 7 139 <100 0 0 1 086
random 99 865 1 191 <100 <100 <100 21 825 16 333 37 044 13 797 5 796 177 1 584 2 081
Number of queries (de-duplicated) by criteria group
Sample Total STZD STZd STzD STzd StZD StZd StzD Stzd sTzD sTzd stzD stzd
Listeria <100 0 0 0 0 <100 0 <100 <100 0 0 <100 <100
MixNMatch 619 0 0 0 0 <100 150 266 135 <100 <100 0 <100
Pywikibot 5 410 0 0 <100 0 772 979 2 405 783 <100 <100 <100 403
SPARQLWrapper 9 831 <100 0 <100 0 5 359 326 3 838 198 <100 <100 <100 <100
WikidataIntegrator 2 505 0 0 0 0 218 <100 1 259 870 <100 0 0 <100
random 85 753 1 191 <100 <100 <100 21 259 15 419 35 412 5 545 5 796 166 241 687

MixNMatch impact

One query is the cause of most of the impact on MixNMatch (7000+ hits):

SELECT ?extid (count(?q) AS ?cnt) (GROUP_CONCAT(?q; SEPARATOR = '|') AS ?items)
{ ?q wdt:P973 ?extid }
GROUP BY ?extid HAVING (?cnt>1)
ORDER BY ?extid

P973 is pretty generic, so the query alone does not allow us to identify which MixNMatch catalogs could be the source. Looking at the source code[8], this query does seem useful for identifying issues in Wikidata. Individual issues seem to be ignored when they are too numerous (400), so the individual results of this particular query on P973 do not seem to matter.

Notes

  1. ↑ Please see T349512_representative_wikidata_query_samples for more details on the samples
  2. ↑ Listeria does appear to be heavily throttled/banned by WDQS protection measures; see details in § Query execution error statistics
  3. ↑ see § Detecting a reference to a scientific article
  4. ↑ see § Non-deterministic SPARQL features
  5. ↑ see § Combining 4 criteria of a SPARQL query for more details
  6. ↑ A unique MixNMatch query is the cause of 7000+ impacted query executions, see § MixNMatch impact
  7. ↑ to ease the analysis some grouping was done, e.g. by removing the version number from some user-agents or by grouping web browsers into a single group
  8. ↑ MicroSync.php