Wikidata Query Service/WDQS Graph Split Impact Analysis
The scope of this analysis is to assess the impact (on existing SPARQL queries) of splitting the Wikidata graph into two subgraphs: scientific articles (also referred to as scholarly articles) on one side and the rest (called Wikidata main) on the other. We will refer to the existing graph served by WDQS as the full graph.
TL;DR: Data shows that less than 10% of the queries might be affected by the graph split, and this impact is concentrated in a few user-agents: the top 5 most impacted user-agents account for more than 90% of the total impact. In other words, it seems possible to drastically reduce the impact of the split by addressing a handful of user-agents.
Dataset
We used 6 different samples gathered from the WDQS query logs. Five come from known tools used by the Wikidata community to access WDQS:
- Listeria
- MixNMatch
- Pywikibot
- WikidataIntegrator
- SPARQLWrapper
10,000 queries collected in October and November 2023 were extracted for each tool (except for MixNMatch, where only 8,396 queries were found in the query logs during this period). A sixth sample, called random, containing 100,000 queries representative of the various query lengths and execution times we see in our query logs, was also extracted[1].
Basic sample statistics
Below is a table showing the number of queries per sample versus the number of unique queries.
Sample | Total | Unique |
---|---|---|
Listeria | 10000 | 101 |
MixNMatch | 8396 | 626 |
Pywikibot | 10000 | 5455 |
SPARQLWrapper | 10000 | 9919 |
WikidataIntegrator | 10000 | 2507 |
random | 100000 | 85838 |
For some samples a non-negligible number of queries are repeated. The total counts can be used as a base to assess the impact on a per-query-execution basis, and the unique counts to assess the impact per use-case (boldly considering a unique query to be a use-case). Since both views have value, most metrics analyzed here are compared against both the total and the de-duplicated view of the samples.
Collecting SPARQL query results
For all the queries we collected the results returned from both the full graph and the main graph. For various reasons not all queries ran successfully: some failed originally (when they initially hit the production servers) and some failed during the analysis when they ran on our experimental endpoints.
Sample | Total | successful (originally) | successful (full) | successful (main) | % successful (originally) | % successful (full) | % successful (main) |
---|---|---|---|---|---|---|---|
Listeria | 10 000 | 133 | 9 466 | 9 466 | 1.33 | 94.66 | 94.66 |
MixNMatch | 8 396 | 7 477 | 8 369 | 8 369 | 89.05 | 99.68 | 99.68 |
Pywikibot | 10 000 | 9 812 | 9 949 | 9 951 | 98.12 | 99.49 | 99.51 |
SPARQLWrapper | 10 000 | 9 864 | 9 907 | 9 912 | 98.64 | 99.07 | 99.12 |
WikidataIntegrator | 10 000 | 9 997 | 9 998 | 9 999 | 99.97 | 99.98 | 99.99 |
random | 100 000 | 97 909 | 99 887 | 99 869 | 97.91 | 99.89 | 99.87 |
The success rate is generally good (except for Listeria, which has a very low 1.33%[2]). For simplicity, and given the high success rate of the queries on the experimental endpoints, the analysis will focus solely on the successful queries.
Methodology
Assessing the impact requires knowing the intent of the SPARQL query. In other words, does the author of the query expect/want scientific articles to influence the results of their query or not? Knowing this with certainty is not possible without confirmation from the authors themselves, but we can try to approximate it by:
- Determining the presence of a scientific article in the query or in one of its results[3]
- Running the query against the full graph and the Wikidata main graph (loaded from the exact same dump) and looking for differences in the results returned. If there is a difference, it is perhaps because the result depended on the presence of scientific articles.
The first point might give good confidence about the intent of the query, but spotting differences is a less direct signal and might need adjustments to take false positives into account.
Unfortunately some SPARQL queries might return varying results not because scientific articles are missing but because of some of the SPARQL features they use, which can cause many false positives. In the samples used for this analysis we identified 6 SPARQL features[4] that can cause such false positives.
Similarly comparing the query results might yield false positives because of the order in which the results are returned by Blazegraph (many queries do not include sorting criteria).
A quick note regarding queries that return zero results: without any results we only have the SPARQL query string to detect the presence of a scientific article, and it is also "easier" for the two graphs to return identical (empty) results.
Overall we combine four criteria that we name as follows:
- true positive: queries where we identified a direct reference to a scientific article
- same: queries returning the same results on both graphs (regardless of the order)
- zero results: queries returning zero results on both graphs
- deterministic: queries not using any feature that might affect the results for unrelated reasons
Combining these we can classify the queries into four buckets[5]:
- unimpacted: queries with high confidence that they are not impacted by the split (not a true positive, same results, at least one result, deterministic or not)
- zeros: queries returning zero results, with relatively good confidence that they are not impacted (not a true positive, zero results, deterministic or not)
- unsure: queries with some uncertainty about whether they are impacted (not a true positive, different results, not deterministic)
- impacted: queries that are impacted (all other combinations of the criteria above, mainly all true positives and all deterministic queries returning different results)
Results
Below is a graph showing the impact in % compared to the total number of successful queries:
and this version shows the impact in % compared to the number of unique queries:
Proper interpretation of these results will require a more qualitative analysis, but we can see that, except for MixNMatch[6], the ratios are roughly identical whether compared against the plain samples or the de-duplicated ones.
It might be interesting to visualize how the impact is distributed across the various use-cases/user-agents. One of the goals of the community consultation phase (tracked under phab:T356773) of the wikidata:Wikidata:SPARQL_query_service/WDQS_graph_split project is to collect use-cases and help/guide the impacted tools & clients in evaluating whether SPARQL federation can make their use-cases compatible with the split, as sketched below.
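As a rough illustration of what such a migration could look like, here is a minimal federated query sketch. The scholarly endpoint URL below is a placeholder invented for this example (the actual endpoints of the split are not defined here), and the query itself is made up rather than taken from the samples: it fetches humans from the main graph and joins their authored articles from the scholarly subgraph.
SELECT ?author ?article WHERE {
  ?author wdt:P31 wd:Q5 .                                  # humans, answered by the main graph
  SERVICE <https://query-scholarly.example.org/sparql> {   # placeholder scholarly endpoint
    ?article wdt:P50 ?author .                             # articles authored by ?author
  }
} LIMIT 10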
While we cannot display specific user-agents for privacy reasons, we can graph the cumulative proportion the various user-agents have of the overall impact. This way we can get a sense of how quickly we could reduce the overall impact of this project if we address such clients/use-cases (the x axis is the number of user-agents, the y axis is the cumulative proportion in % of the queries that are considered impacted in the bar charts above).
The above graph shows that the most impacted user-agent accounts for 45% of the total impact, and the top 5 account for more than 90% of the impacted queries.
And the version on de-duplicated queries, which shows roughly the same trend:
In other words it does seem plausible to drastically reduce the negative impact of the graph split. This assumes that we can get in touch with the maintainers of these clients/tools, which might not be possible for those that do not set a proper user-agent.
The tail is not particularly long, with 56 user-agents[7] in these samples.
Annex
Query execution error statistics
In order to understand why a query might have failed when it initially hit WDQS, we can interpret the HTTP status code returned by WDQS. The table below shows the number of queries per sample and status code:
Sample | 200 | 400 | 403 | 429 | 500 | 503 |
---|---|---|---|---|---|---|
Listeria | 133 | <100 | 9 829 | <100 | <100 | 0 |
MixNMatch | 7 477 | 0 | 0 | 0 | 919 | 0 |
Pywikibot | 9 812 | <100 | 0 | <100 | 172 | 0 |
SPARQLWrapper | 9 864 | <100 | <100 | <100 | <100 | 0 |
WikidataIntegrator | 9 997 | 0 | 0 | <100 | <100 | 0 |
random | 97 909 | <100 | 1 979 | <100 | <100 | <100 |
- 200: success
- 400: bad request, generally a SPARQL syntax error or a misbehaving HTTP client
- 403: forbidden, used by WDQS when a client is banned after making too many requests
- 429: too many requests, used to warn a client and ask it to slow down
- 500: a server failure (a timeout, an internal error)
- 503: service unavailable (when Blazegraph is unable to respond; could be a timeout or an internal error)
This does seem to indicate that the high failure rate of Listeria is due to the throttling mechanism protecting WDQS.
Detecting a reference to a scientific article
Since we know the list of Wikidata items that are scientific articles, we can search for a reference in the SPARQL query or within the results returned by the query when run against the full graph. The table below shows how many we found, over all query executions:
Sample | Total | Scientific article in query | Scientific article in results | Either |
---|---|---|---|---|
Listeria | 10 000 | 0 | 0 | 0 |
MixNMatch | 8 396 | 0 | 7 150 | 7 150 |
Pywikibot | 10 000 | <100 | <100 | 100 |
SPARQLWrapper | 10 000 | <100 | <100 | <100 |
WikidataIntegrator | 10 000 | <100 | <100 | <100 |
random | 100 000 | 4 482 | 4 298 | 7 201 |
And over the de-duplicated (unique) queries:
Sample | Total | Scientific article in query | Scientific article in results | Either |
---|---|---|---|---|
Listeria | 101 | 0 | 0 | 0 |
MixNMatch | 626 | 0 | <100 | <100 |
Pywikibot | 5 455 | <100 | <100 | <100 |
SPARQLWrapper | 9 919 | <100 | <100 | <100 |
WikidataIntegrator | 2 507 | <100 | <100 | <100 |
random | 85 838 | 4 482 | 4 287 | 7 190 |
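As an aside, membership in the scholarly subgraph can also be checked directly in SPARQL. The sketch below assumes the subgraph boundary is the scholarly article class (wd:Q13442814); the analysis itself relied on the precomputed list of such items, and wd:Q42 is only an arbitrary stand-in for a candidate item:
SELECT ?item WHERE {
  VALUES ?item { wd:Q42 }         # candidate item(s) found in a query or its results
  ?item wdt:P31 wd:Q13442814 .    # keep only instances of "scholarly article"
}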
Non-deterministic SPARQL features
We consider a SPARQL feature to be non-deterministic if multiple executions of a query using it are not guaranteed to return the same results, even when running on the exact same graph.
The non-deterministic SPARQL features that we identified are:
wikibase:mwapi service EntitySearch
The use of the MW API may yield different results depending on when the API request is performed (since it calls the Wikidata API, which is constantly being updated):
SELECT ?item ?itemLabel ?type ?typeLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "EntitySearch" .
    bd:serviceParam wikibase:endpoint "www.wikidata.org" .
    bd:serviceParam mwapi:search "something" .
    bd:serviceParam mwapi:language "en" .
    ?item wikibase:apiOutputItem mwapi:item .
    ?num wikibase:apiOrdinal true .
  }
  ?item (wdt:P31|wdt:P279|wdt:P366) ?type .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} ORDER BY ASC(?num) LIMIT 10
GROUP_CONCAT
The GROUP_CONCAT SPARQL feature does not provide a stable ordering when it concatenates the group:
SELECT ?extid (count(?q) AS ?cnt) (GROUP_CONCAT(?q; SEPARATOR = '|') AS ?items)
WHERE { ?q wdt:P973 ?extid }
GROUP BY ?extid HAVING (?cnt>1)
ORDER BY ?extid
Label service AltLabel (lbl_srv_alt_label)
Asking for aliases with the mw:Wikidata_Query_Service/User_Manual#Label_service is not deterministic when there are multiple aliases for the entity:
SELECT ?property ?propertyType ?propertyLabel ?propertyDescription ?propertyAltLabel WHERE {
  ?property wikibase:propertyType ?propertyType .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ASC(xsd:integer(STRAFTER(STR(?property), 'P')))
SAMPLE
The SPARQL SAMPLE aggregate is not required to be deterministic:
SELECT ?universe (SAMPLE(?label) AS ?anyLabel) (COUNT(?planet) AS ?count)
WHERE
{
  ?planet wdt:P31 wd:Q2775969;
          wdt:P1080 ?universe.
  ?universe rdfs:label ?label.
  FILTER(LANG(?label) = "en").
}
GROUP BY ?universe
ORDER BY DESC(?count)
Blazegraph bd:slice service (slice_offset)
Blazegraph's bd:slice with bd:slice.offset is not guaranteed to select the same slice of data on two different Blazegraph instances:
SELECT ?item ?ddate_st WHERE {
  SERVICE bd:slice {
    ?item p:P570 ?ddate_st .
    bd:serviceParam bd:slice.offset 100 .
    bd:serviceParam bd:slice.limit 200 .
  }
  ?item wdt:P31 wd:Q5 .
}
Using a top-level limit on a large result set (top_level_limit)
A SPARQL query using a top-level LIMIT X without a deterministic ordering clause produces X results that might differ between runs:
SELECT ?item {
  ?item wdt:P31 wd:Q5 .
  FILTER NOT EXISTS { ?item wdt:P570 [] }
} LIMIT 100
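For comparison, a sketch of a deterministic variant of the same query: adding an explicit total ordering before the limit pins down which 100 rows are returned (ordering by ?item is enough here because each row binds a distinct item IRI):
SELECT ?item {
  ?item wdt:P31 wd:Q5 .
  FILTER NOT EXISTS { ?item wdt:P570 [] }
} ORDER BY ?item LIMIT 100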
Distribution in the samples
Over all query executions:
Sample | Total | Total deterministic | EntitySearch | group_concat | lbl_srv_alt_label | limit_offset | sample | slice_offset | top_level_limit |
---|---|---|---|---|---|---|---|---|---|
Listeria | 10 000 | 1 380 | 0 | 0 | 0 | 5 169 | 0 | 1 995 | 4 080 |
MixNMatch | 8 396 | 659 | 0 | 7 737 | 0 | 0 | 0 | 0 | 0 |
Pywikibot | 10 000 | 7 367 | 1 138 | 0 | 0 | 1 028 | <100 | 428 | 530 |
SPARQLWrapper | 10 000 | 9 434 | <100 | 359 | <100 | <100 | <100 | 0 | 205 |
WikidataIntegrator | 10 000 | 1 765 | 0 | <100 | 8 225 | <100 | 0 | 0 | <100 |
random | 100 000 | 67 559 | 712 | 17 997 | 10 543 | <100 | 2 465 | <100 | 2 097 |
And over the de-duplicated (unique) queries:
Sample | Total | Total deterministic | EntitySearch | group_concat | lbl_srv_alt_label | limit_offset | sample | slice_offset | top_level_limit |
---|---|---|---|---|---|---|---|---|---|
Listeria | <100 | <100 | 0 | 0 | 0 | <100 | 0 | <100 | <100 |
MixNMatch | 626 | 313 | 0 | 313 | 0 | 0 | 0 | 0 | 0 |
Pywikibot | 5 455 | 3 212 | 863 | 0 | 0 | 939 | <100 | 428 | 327 |
SPARQLWrapper | 9 919 | 9 357 | <100 | 359 | <100 | <100 | <100 | 0 | 201 |
WikidataIntegrator | 2 507 | 1 559 | 0 | <100 | 938 | <100 | 0 | 0 | <100 |
random | 85 838 | 63 969 | 637 | 16 526 | 2 102 | <100 | 2 422 | <100 | 1 469 |
Combining 4 criteria of a SPARQL query
The 4 criteria of a query can be combined in 12 different ways (4 of the 16 combinations are not possible, e.g. returning different results while both result sets are empty). To help the analysis we use a simple notation mechanism encoded over four characters:
- s: same
- t: true positive
- z: zero results
- d: deterministic
The case of each letter indicates whether the query has the criterion (uppercase) or not (lowercase).
For instance, Stzd means that a query has obtained similar results on both graphs, has no direct reference to a scientific article, has obtained at least one result, and is not deterministic.
The distribution of these classes over the successful queries, per query execution, is:
Sample | Total | STZD | STZd | STzD | STzd | StZD | StZd | StzD | Stzd | sTzD | sTzd | stzD | stzd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Listeria | 9 466 | 0 | 0 | 0 | 0 | <100 | 0 | 765 | 5 557 | 0 | 0 | 190 | 2 857 |
MixNMatch | 8 369 | 0 | 0 | 0 | 0 | <100 | 331 | 572 | 260 | <100 | 7 111 | 0 | <100 |
Pywikibot | 9 945 | 0 | 0 | <100 | 0 | 888 | 1 036 | 6 440 | 1 032 | <100 | <100 | <100 | 442 |
SPARQLWrapper | 9 906 | <100 | 0 | <100 | 0 | 5 363 | 326 | 3 905 | 202 | <100 | <100 | <100 | <100 |
WikidataIntegrator | 9 998 | 0 | 0 | 0 | 0 | 371 | <100 | 1 312 | 7 139 | <100 | 0 | 0 | 1 086 |
random | 99 865 | 1 191 | <100 | <100 | <100 | 21 825 | 16 333 | 37 044 | 13 797 | 5 796 | 177 | 1 584 | 2 081 |
And over the de-duplicated (unique) queries:
Sample | Total | STZD | STZd | STzD | STzd | StZD | StZd | StzD | Stzd | sTzD | sTzd | stzD | stzd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Listeria | <100 | 0 | 0 | 0 | 0 | <100 | 0 | <100 | <100 | 0 | 0 | <100 | <100 |
MixNMatch | 619 | 0 | 0 | 0 | 0 | <100 | 150 | 266 | 135 | <100 | <100 | 0 | <100 |
Pywikibot | 5 410 | 0 | 0 | <100 | 0 | 772 | 979 | 2 405 | 783 | <100 | <100 | <100 | 403 |
SPARQLWrapper | 9 831 | <100 | 0 | <100 | 0 | 5 359 | 326 | 3 838 | 198 | <100 | <100 | <100 | <100 |
WikidataIntegrator | 2 505 | 0 | 0 | 0 | 0 | 218 | <100 | 1 259 | 870 | <100 | 0 | 0 | <100 |
random | 85 753 | 1 191 | <100 | <100 | <100 | 21 259 | 15 419 | 35 412 | 5 545 | 5 796 | 166 | 241 | 687 |
MixNMatch impact
One query is the cause of most of the impact on MixNMatch (7000+ hits):
SELECT ?extid (count(?q) AS ?cnt) (GROUP_CONCAT(?q; SEPARATOR = '|') AS ?items)
{ ?q wdt:P973 ?extid }
GROUP BY ?extid HAVING (?cnt>1) ORDER BY ?extid
P973 is pretty generic, so the query alone does not allow us to identify which MixNMatch catalogs could be the source. Looking at the source code[8], this query does seem useful for identifying issues in Wikidata, but the individual issues appear to be ignored when they are too numerous (400), so the individual results of this particular query on P973 do not seem to matter.
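Should a query like this one actually need the scholarly hits after the split, a federated variant could union the two subgraphs. This is only a sketch; the scholarly endpoint URL is the same placeholder used earlier, not a real endpoint:
SELECT ?extid (COUNT(?q) AS ?cnt) (GROUP_CONCAT(?q; SEPARATOR = '|') AS ?items)
WHERE {
  { ?q wdt:P973 ?extid }                                      # matches in the main graph
  UNION
  { SERVICE <https://query-scholarly.example.org/sparql> {    # placeholder scholarly endpoint
      ?q wdt:P973 ?extid                                      # matches in the scholarly subgraph
  } }
}
GROUP BY ?extid HAVING (?cnt > 1) ORDER BY ?extid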
Notes
- ↑ Please see T349512_representative_wikidata_query_samples for more details on the samples
- ↑ Listeria does appear to be heavily throttled/banned by WDQS protection measures, see details in § Query execution error statistics
- ↑ see § Detecting a reference to a scientific article
- ↑ see § Non-deterministic SPARQL features
- ↑ see section § Combining 4 criteria of a SPARQL query for more details
- ↑ A unique MixNMatch query is the cause of 7000+ impacted query executions, see § MixNMatch impact
- ↑ to ease the analysis some grouping was done, e.g. removing the version number from some user-agents or grouping web browsers into a single entry
- ↑ MicroSync.php