Wikidata Query Service/WDQS Graph Split Impact Analysis

From Wikitech

The scope of this analysis is to assess the impact (on existing SPARQL queries) of splitting the Wikidata graph into two subgraphs: scientific articles (also referred to as scholarly articles) on one side and the rest (called Wikidata main) on the other. We will refer to the existing graph served by WDQS as the full graph.

TL;DR: Data shows that less than 10% of the queries might be affected by the graph split, and the impact is concentrated on a few user-agents: the top 5 most impacted user-agents account for more than 90% of the total impact. In other words, it seems possible to drastically reduce the impact of the split by addressing a few user-agents.

Dataset

We used 6 different samples gathered from the WDQS query logs. Five come from known tools used by the Wikidata community to access WDQS:

  • Listeria
  • MixNMatch
  • Pywikibot
  • WikidataIntegrator
  • SPARQLWrapper

10,000 queries collected in October and November 2023 were extracted from each (except for MixNMatch, where only 8396 queries were found in the query logs during this period). A sixth sample, called random, with 100,000 queries representative of the various query lengths and execution times seen in our query logs, was also extracted[1].

Basic sample statistics

Below is a table showing the number of queries per sample versus the number of unique queries.

Sample Total Unique
Listeria 10000 101
MixNMatch 8396 626
Pywikibot 10000 5455
SPARQLWrapper 10000 9919
WikidataIntegrator 10000 2507
random 100000 85838

For some samples a non-negligible number of queries are repeated. The total can serve as a base to assess the impact on a per-query-execution basis, while the unique counts assess the impact per use-case (boldly treating each unique query as a use-case). Since both views have value, most metrics analyzed here are compared against both the total and the de-duplicated view of the samples.
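The two views can be derived directly from the raw sample. A trivial sketch in Python (the query strings here are illustrative, not taken from the logs):

```python
from collections import Counter

# Hypothetical raw sample: the same query string can appear in
# multiple executions.
sample = [
    "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }",
    "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }",
    "SELECT * WHERE { ?s ?p ?o } LIMIT 10",
]

counts = Counter(sample)          # query string -> number of executions
total = sum(counts.values())      # per-query-execution view
unique = len(counts)              # per-use-case (de-duplicated) view
assert (total, unique) == (3, 2)
```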

Collecting SPARQL query results

For all the queries we collected the results returned from the full graph and from the main graph. For various reasons not all queries ran successfully: some failed originally (when they initially hit the production servers) and some failed during the analysis when they ran on our experimental endpoints.

Sample Total successful (originally) successful (full) successful (main) % successful (originally) % successful (full) % successful (main)
Listeria 10 000 133 9 466 9 466 1.33 94.66 94.66
MixNMatch 8 396 7 477 8 369 8 369 89.05 99.68 99.68
Pywikibot 10 000 9 812 9 949 9 951 98.12 99.49 99.51
SPARQLWrapper 10 000 9 864 9 907 9 912 98.64 99.07 99.12
WikidataIntegrator 10 000 9 997 9 998 9 999 99.97 99.98 99.99
random 100 000 97 909 99 887 99 869 97.91 99.89 99.87

The success rate is generally good (except for Listeria, which has a very low 1.33%[2]). For simplicity, and given the high success rate of the queries on the experimental endpoints, the analysis focuses solely on the successful queries.

Methodology

Assessing the impact requires knowing the intent of the SPARQL query: does the author of the query expect or want scientific articles to affect its results? Knowing this with certainty is not possible without confirmation from the author, but we can try to approximate it by:

  1. Determining the presence of a scientific article in the query or one of its results[3]
  2. Running the query against the full graph and the Wikidata main graph (loaded from the exact same dump) and finding differences in the results returned. If there is a difference it is perhaps because the result was dependent on the presence of scientific articles.

Assessing the first point gives good confidence in the intent of the query, but spotting differences is a less direct signal and might need adjustments to account for false positives.

Unfortunately, some SPARQL queries might return varying results not because scientific articles are missing but because of the SPARQL features they use, and these can be a cause of many false positives. In the samples used for this analysis we identified 6 SPARQL features[4] that can cause false positives.

Similarly comparing the query results might yield false positives because of the order in which the results are returned by Blazegraph (many queries do not include sorting criteria).

A quick note should be made regarding queries that return zero results: without any results we only have the SPARQL query string to detect the presence of a scientific article, and it is also "easier" for the two graphs to return identical (empty) results.
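The order-insensitive comparison of result sets mentioned above can be sketched as follows. This is a minimal Python illustration, not the actual analysis code; representing each result row as a dict of variable bindings is an assumption:

```python
from collections import Counter

def same_results(rows_a, rows_b):
    """Compare two SPARQL result sets as multisets, ignoring row order.

    Each row is assumed to be a dict of variable bindings; rows are
    canonicalized to sorted tuples so they become hashable and can be
    counted.
    """
    canon = lambda rows: Counter(tuple(sorted(r.items())) for r in rows)
    return canon(rows_a) == canon(rows_b)

# Same rows returned in a different order are considered identical.
full = [{"item": "wd:Q42"}, {"item": "wd:Q64"}]
main = [{"item": "wd:Q64"}, {"item": "wd:Q42"}]
assert same_results(full, main)
# A row missing on the main graph is a difference.
assert not same_results(full, main[:1])
```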

Overall we combine four criteria that we name as:

  • true positives: for queries where we identified a direct reference to a scientific article
  • same: for queries returning the same results on both graphs (regardless of the order)
  • zero results: for queries returning zero results on both graphs
  • deterministic: for queries without a feature that might affect the results for unrelated reasons

Combining these we can classify the queries into four buckets[5]:

  • unimpacted: queries we are highly confident are not impacted by the split (not a true positive, same results, at least one result, deterministic or not)
  • zeros: queries returning zero results, with relatively good confidence that they are not impacted (not a true positive, returning zero results, deterministic or not)
  • unsure: queries with some uncertainty about whether they are impacted (not a true positive, returning different results, not deterministic)
  • impacted: queries that are impacted (all other combinations of the criteria above, mainly all true positives and deterministic queries returning different results)
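The bucketing rules above can be written as a small decision function. A sketch, assuming the four criteria have already been computed as booleans for each query:

```python
def classify(true_positive, same, zero_results, deterministic):
    """Map the four criteria onto the four buckets described above."""
    if true_positive:
        # Direct reference to a scientific article: impacted.
        return "impacted"
    if same:
        # Identical results on both graphs; zero results on both
        # gives slightly lower confidence, hence its own bucket.
        return "zeros" if zero_results else "unimpacted"
    # Results differ between the two graphs: a deterministic query is
    # impacted, a non-deterministic one might differ for unrelated reasons.
    return "impacted" if deterministic else "unsure"

assert classify(False, True, False, True) == "unimpacted"
assert classify(False, True, True, False) == "zeros"
assert classify(False, False, False, False) == "unsure"
assert classify(True, True, False, True) == "impacted"
assert classify(False, False, False, True) == "impacted"
```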

Results

Below is a graph showing the impact in % compared to the total number of successful queries:

and this version shows the impact in % compared to the number of unique queries:

Proper interpretation of these results will require a more qualitative analysis, but we can see that, except for MixNMatch[6], the ratios are roughly identical whether compared against the plain samples or the de-duplicated ones.

It is also interesting to visualize how the impact is shared across various use-cases/user-agents. One of the goals of the community consultation phase (tracked under phab:T356773) of the wikidata:Wikidata:SPARQL_query_service/WDQS_graph_split project is to collect use-cases and help/guide the impacted tools & clients to evaluate the feasibility of using SPARQL federation to make their use-cases compatible with the split.

While we cannot display specific user-agents for privacy reasons, we can graph the cumulative proportion of the overall impact attributable to the various user-agents. This gives a sense of how quickly we could reduce the overall impact of this project by addressing such clients/use-cases (the x axis is the number of user-agents, the y axis the cumulative proportion, in %, of the queries considered impacted in the bar charts above).

The above graph shows that the most impacted user-agent accounts for 45% of the total impact, and the top 5 account for more than 90% of the impacted queries.

And the version on de-duplicated queries which shows roughly the same trend:

In other words, it does seem plausible to drastically reduce the negative impact of the graph split. This assumes that we can get in touch with the maintainers of such clients/tools, which might not be possible for those that do not set a proper user-agent.

The tail is not particularly long with 56 user-agents[7] in these samples.
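The cumulative-proportion curve behind these graphs can be computed as follows. The per-user-agent counts below are made up for illustration; the real figures (45% for the top user-agent, >90% for the top 5) come from the samples described above:

```python
from collections import Counter

# Hypothetical counts of impacted queries per (anonymized) user-agent.
impacted_by_ua = Counter({"ua-1": 500, "ua-2": 250, "ua-3": 150,
                          "ua-4": 60, "ua-5": 40})

def cumulative_share(counts):
    """Cumulative % of the impacted queries covered by the top-N
    user-agents, sorted from most to least impacted."""
    total = sum(counts.values())
    shares, running = [], 0
    for _, n in counts.most_common():
        running += n
        shares.append(100 * running / total)
    return shares

shares = cumulative_share(impacted_by_ua)
assert shares[0] == 50.0    # top user-agent covers half of this toy impact
assert shares[-1] == 100.0  # the full tail covers everything
```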

Annex

Query execution error statistics

In order to understand why a query might have failed when it initially hit WDQS we can interpret the HTTP status code returned by WDQS. The table below shows the number of queries per sample and status code:

Sample 200 400 403 429 500 503
Listeria 133 <100 9 829 <100 <100 0
MixNMatch 7 477 0 0 0 919 0
Pywikibot 9 812 <100 0 <100 172 0
SPARQLWrapper 9 864 <100 <100 <100 <100 0
WikidataIntegrator 9 997 0 0 <100 <100 0
random 97 909 <100 1 979 <100 <100 <100
  • 200 is a success
  • 400 is a bad request, generally a SPARQL syntax error or a misbehaving HTTP client
  • 403 forbidden, used by WDQS when a client is banned after making too many requests
  • 429 to warn a client and ask it to slow down when it is making too many requests
  • 500 a server failure (a timeout, an internal error)
  • 503 service unavailable (when Blazegraph is unable to respond, could be a timeout or an internal error)

This does seem to indicate that the high failure rate of Listeria is due to the throttling mechanism protecting WDQS.

Detecting a reference to a scientific article

Since we know the list of Wikidata items that are scientific articles we can search for a reference in the SPARQL query or within the results returned by the query when running against the full graph. The table below shows how many of these we found:
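This detection can be sketched as a membership test against the known set of scientific-article items. The QIDs below are placeholders, not the real list (which is derived from Wikidata); the row layout is also an assumption:

```python
import re

# Placeholder subset of the scientific-article item IDs; the real set is
# built from the Wikidata items identified as scholarly articles.
SCHOLARLY_QIDS = {"Q100000001", "Q100000002"}

QID_RE = re.compile(r"Q\d+")

def references_scholarly(query_text, result_rows):
    """True if a known scholarly-article QID appears in the SPARQL query
    string or in any binding of the results from the full graph."""
    if SCHOLARLY_QIDS & set(QID_RE.findall(query_text)):
        return True
    for row in result_rows:
        for value in row.values():
            if SCHOLARLY_QIDS & set(QID_RE.findall(str(value))):
                return True
    return False

q = "SELECT ?x WHERE { wd:Q100000001 wdt:P50 ?x }"
assert references_scholarly(q, [])
assert not references_scholarly("SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }", [])
```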

Queries with a reference to a scientific article
Sample Total Scientific article in query Scientific article in results Total with a reference
Listeria 10 000 0 0 0
MixNMatch 8 396 0 7 150 7 150
Pywikibot 10 000 <100 <100 100
SPARQLWrapper 10 000 <100 <100 <100
WikidataIntegrator 10 000 <100 <100 <100
random 100 000 4 482 4 298 7 201
Queries (de-duplicated) with a reference to a scientific article
Sample Total Scientific article in query Scientific article in results Total with a reference
Listeria 101 0 0 0
MixNMatch 626 0 <100 <100
Pywikibot 5 455 <100 <100 <100
SPARQLWrapper 9 919 <100 <100 <100
WikidataIntegrator 2 507 <100 <100 <100
random 85 838 4 482 4 287 7 190

Non deterministic SPARQL features

We consider a SPARQL feature to be non-deterministic if for multiple executions the results returned are not guaranteed to be the same when running on the exact same graph.

The non-deterministic SPARQL features that we identified are:

wikibase:mwapi service EntitySearch

The use of the MW API may yield different results depending on when the API request is performed (since it calls the Wikidata API, which is constantly being updated).

SELECT ?item ?itemLabel ?type ?typeLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "EntitySearch" .
    bd:serviceParam wikibase:endpoint "www.wikidata.org" .
    bd:serviceParam mwapi:search "something" .
    bd:serviceParam mwapi:language "en" .
    ?item wikibase:apiOutputItem mwapi:item .
    ?num wikibase:apiOrdinal true .
  }
  ?item (wdt:P31|wdt:P279|wdt:P366) ?type
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} ORDER BY ASC(?num) LIMIT 10

GROUP_CONCAT

The GROUP_CONCAT SPARQL feature does not provide a stable ordering when it concatenates the group.

SELECT ?extid (count(?q) AS ?cnt) (GROUP_CONCAT(?q; SEPARATOR = '|') AS ?items)
WHERE { ?q wdt:P973 ?extid }
GROUP BY ?extid HAVING (?cnt>1)
ORDER BY ?extid

Label service AltLabel (lbl_srv_alt_label)

Asking for aliases with the mw:Wikidata_Query_Service/User_Manual#Label_service is not deterministic when the entity has multiple aliases.

SELECT ?property ?propertyType ?propertyLabel ?propertyDescription ?propertyAltLabel WHERE {
  ?property wikibase:propertyType ?propertyType .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ASC(xsd:integer(STRAFTER(STR(?property), 'P')))

SAMPLE

The SPARQL SAMPLE aggregate is not required to be deterministic.

SELECT ?universe (SAMPLE(?label) AS ?label) (COUNT(?planet) AS ?count)
WHERE
{
  ?planet wdt:P31 wd:Q2775969;
          wdt:P1080 ?universe.
  ?universe rdfs:label ?label.
  FILTER(LANG(?label) = "en").
}
GROUP BY ?universe
ORDER BY DESC(?count)

Blazegraph bd:slice service slice.offset

Blazegraph bd:slice with bd:slice.offset is not guaranteed to return the same slice on two different Blazegraph instances.

SELECT ?item ?ddate_st WHERE {
  SERVICE bd:slice { ?item p:P570 ?ddate_st . bd:serviceParam bd:slice.offset 100 . bd:serviceParam bd:slice.limit 200 . }
  ?item wdt:P31 wd:Q5 .
}

Using a top level limit on a large result set (top_level_limit)

A SPARQL query using a top-level LIMIT X without a deterministic ordering clause may return any X matching results, and thus might yield different results between runs.

SELECT ?item {
  ?item wdt:P31 wd:Q5 .
  FILTER NOT EXISTS { ?item wdt:P570 [] }
} LIMIT 100

Distribution in the samples

Number of non-deterministic query features
Sample Total Total deterministic EntitySearch group_concat lbl_srv_alt_label limit_offset sample slice_offset top_level_limit
Listeria 10 000 1 380 0 0 0 5 169 0 1 995 4 080
MixNMatch 8 396 659 0 7 737 0 0 0 0 0
Pywikibot 10 000 7 367 1 138 0 0 1 028 <100 428 530
SPARQLWrapper 10 000 9 434 <100 359 <100 <100 <100 0 205
WikidataIntegrator 10 000 1 765 0 <100 8 225 <100 0 0 <100
random 100 000 67 559 712 17 997 10 543 <100 2 465 <100 2 097
Number of non-deterministic query features over de-duplicated queries
Sample Total Total deterministic EntitySearch group_concat lbl_srv_alt_label limit_offset sample slice_offset top_level_limit
Listeria <100 <100 0 0 0 <100 0 <100 <100
MixNMatch 626 313 0 313 0 0 0 0 0
Pywikibot 5 455 3 212 863 0 0 939 <100 428 327
SPARQLWrapper 9 919 9 357 <100 359 <100 <100 <100 0 201
WikidataIntegrator 2 507 1 559 0 <100 938 <100 0 0 <100
random 85 838 63 969 637 16 526 2 102 <100 2 422 <100 1 469

Combining 4 criteria of a SPARQL query

The 4 criteria of a query can be combined in 12 different ways (4 combinations are impossible, e.g. having both different results and zero results on both graphs). To help the analysis we use a simple notation encoded over four characters:

  • s: same
  • t: true positives
  • z: zero results
  • d: deterministic

with the letter case indicating whether the query has the criterion (uppercase) or not (lowercase).

For instance, Stzd means that the query obtained the same results on both graphs, has no direct reference to a scientific article, obtained at least one result, and is not deterministic.
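Building the four-character class string can be sketched as:

```python
def encode_class(same, true_positive, zero_results, deterministic):
    """Encode the four criteria in 's t z d' order: uppercase means the
    query has the criterion, lowercase means it does not."""
    flags = [("s", same), ("t", true_positive),
             ("z", zero_results), ("d", deterministic)]
    return "".join(c.upper() if present else c for c, present in flags)

# The "Stzd" example from the text: same results, no scholarly
# reference, at least one result, not deterministic.
assert encode_class(True, False, False, False) == "Stzd"
assert encode_class(False, False, False, True) == "stzD"
```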

The distribution of these classes over the successful queries is:

Number of queries by criteria group
Sample Total STZD STZd STzD STzd StZD StZd StzD Stzd sTzD sTzd stzD stzd
Listeria 9 466 0 0 0 0 <100 0 765 5 557 0 0 190 2 857
MixNMatch 8 369 0 0 0 0 <100 331 572 260 <100 7 111 0 <100
Pywikibot 9 945 0 0 <100 0 888 1 036 6 440 1 032 <100 <100 <100 442
SPARQLWrapper 9 906 <100 0 <100 0 5 363 326 3 905 202 <100 <100 <100 <100
WikidataIntegrator 9 998 0 0 0 0 371 <100 1 312 7 139 <100 0 0 1 086
random 99 865 1 191 <100 <100 <100 21 825 16 333 37 044 13 797 5 796 177 1 584 2 081
Number of queries (de-duplicated) by criteria group
Sample Total STZD STZd STzD STzd StZD StZd StzD Stzd sTzD sTzd stzD stzd
Listeria <100 0 0 0 0 <100 0 <100 <100 0 0 <100 <100
MixNMatch 619 0 0 0 0 <100 150 266 135 <100 <100 0 <100
Pywikibot 5 410 0 0 <100 0 772 979 2 405 783 <100 <100 <100 403
SPARQLWrapper 9 831 <100 0 <100 0 5 359 326 3 838 198 <100 <100 <100 <100
WikidataIntegrator 2 505 0 0 0 0 218 <100 1 259 870 <100 0 0 <100
random 85 753 1 191 <100 <100 <100 21 259 15 419 35 412 5 545 5 796 166 241 687

MixNMatch impact

One query is the cause of most of the impact on MixNMatch (7000+ hits):

SELECT ?extid (count(?q) AS ?cnt) (GROUP_CONCAT(?q; SEPARATOR = '|') AS ?items)
{ ?q wdt:P973 ?extid }
GROUP BY ?extid HAVING (?cnt>1)            ORDER BY ?extid

P973 is fairly generic, so the query alone does not allow us to identify which MixNMatch catalogs could be the source. Looking at the source code[8], this query does seem useful for identifying issues in Wikidata. Individual issues seem to be ignored when they are too numerous (400), so the individual results of this particular query on P973 do not seem to matter.

Notes

  1. Please see T349512_representative_wikidata_query_samples for more details on the samples
  2. Listeria does appear to be heavily throttled/banned by WDQS protection measures, see details in § Query execution error statistics
  3. see § Detecting a reference to a scientific article
  4. see § Non_deterministic_SPARQL_features
  5. see section § Combining 4 criteria of a SPARQL query for more details
  6. A unique MixNMatch query is the cause of 7000+ impacted query executions, see § MixNMatch impact
  7. to ease the analysis some grouping was made, e.g. by removing the version number of some UA or group web browsers into a single group
  8. MicroSync.php