User:AndreaWest/WDQS Q and A

From Wikitech

The following are various WDQS questions & answers. I appreciate everyone's insights.

Thanks to DCausse, Jheald and AKhatun for comments and inputs.

SPARQL SERVICEs vs Functions

  • Question: Are Wikidata custom SERVICES different than Federated Query?
    • Answer: WDQS custom SERVICEs are extensions implemented through the same SERVICE mechanism as federated queries. Blazegraph supports both SPARQL custom functions and SERVICE/federated query extensions. The two approaches may differ in how much control they give over Blazegraph internals. A likely reason for choosing the SERVICE approach, however, is that custom functions required the use of Blazegraph internal APIs, which were subject to revision.
    • Background:
      • Because they are invoked with the SERVICE keyword, the custom SERVICEs are treated as federated queries. The Wikidata federated endpoints are referenced with the same keyword.
      • Currently, there are SERVICEs for labels, geospatial search, dates, sampling and more.
  • Question: Why were SPARQL custom functions not added/used instead?
    • Answer: Beyond the internal API issue noted above, extension points that use the SERVICE keyword seem to have more information about the overall query and its variables, and can be scheduled to run first or last using query hints. Examining the current set of custom SERVICEs provides insight into their implementations:
      • The label service accesses all query variables in order to match ?item - ?itemLabel pairs. Rewriting this as a function (or a set of functions) would be more verbose but could provide better control over the processing (and therefore the execution time).
      • The mwapi service accesses the MediaWiki API and hence is consistent with a federation endpoint.
      • The gas service was implemented by the Blazegraph developers and was designed as a federation endpoint.
  • Question: Is/was there a reason to prefer the GeoSpatial SERVICE over native GeoSPARQL support? That is, instead of using a SERVICE, you would "natively" write a query such as:
 SELECT ?geom ?feature {
   ?f a :Location ;
      rdfs:label ?feature ;
      geo:hasGeometry ?geom .
   ?geom geof:within (38.855 -77.111 38.885 -77.052) }
    • Answer: Blazegraph does not support GeoSPARQL but (starting with V2.1.0) provided some geospatial capabilities with the addition of latitude/longitude datatypes and a geo:search SERVICE. WDQS extended these with additional SERVICEs for locations within a certain radius of a central point or within a certain bounding box, with support for distance calculations, and more.
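
To make the radius and bounding-box searches mentioned above more concrete, here is a minimal Python sketch of the two tests such a search has to evaluate for each candidate point. It is not how Blazegraph or WDQS implements these SERVICEs. The function names, the mean Earth radius, and the reading of the four numbers in the example query as (south, west, north, east) are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_box(lat, lon, south, west, north, east):
    """True if the point lies inside the box (no antimeridian handling)."""
    return south <= lat <= north and west <= lon <= east


def within_radius(lat, lon, center_lat, center_lon, radius_km):
    """True if the point lies within radius_km of the centre point."""
    return haversine_km(lat, lon, center_lat, center_lon) <= radius_km


# Bounding box taken from the example query, read as (south, west, north, east).
BOX = (38.855, -77.111, 38.885, -77.052)
inside = within_box(38.870, -77.080, *BOX)
outside = within_box(38.900, -77.080, *BOX)
```

A real geospatial index would avoid testing every point this way; the sketch only shows what "within a radius" and "within a bounding box" mean for a single coordinate pair.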

General comments:

  • It would make sense to revisit the current implementations if support for standards like GeoSPARQL is available in the replacement query engine, and/or if equivalent functionality can be provided more flexibly using a mix of SERVICEs and custom functions.
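
As an illustration of why the label service described earlier needs to see all query variables, the following Python sketch pairs derived variables such as ?itemLabel with their base variable ?item purely by naming convention. It is a toy model, not the WDQS implementation; the function name and the set of suffixes handled are assumptions of this sketch.

```python
# Suffixes this sketch treats as "derived from a base variable" (an assumption).
SUFFIXES = ("Description", "AltLabel", "Label")  # longest first


def label_pairs(variables):
    """Map each derived variable name to (base variable, suffix).

    A variable counts as derived only if stripping a known suffix
    yields another variable that also appears in the query.
    """
    known = set(variables)
    pairs = {}
    for var in variables:
        for suffix in SUFFIXES:
            base = var[: -len(suffix)]
            if var.endswith(suffix) and base in known:
                pairs[var] = (base, suffix)
                break
    return pairs


pairs = label_pairs(["item", "itemLabel", "itemAltLabel", "other"])
```

A function-based alternative would instead have the query author name the base variable explicitly for each label, which is more verbose but removes the need to inspect the whole variable list.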

Wikidata Query Questions

  • Question: Across all of Wikidata, what is the count/prevalence of items that are used only as subjects (NOT as objects)?
    • Answer: There are currently a total of 99.7M items. About 60M occur as both subjects and objects; about 40M occur as subjects only. (That is, roughly 40% of items are not currently referenced as objects.)
  • Question: Same question as above, but for scholarly article items only.
    • Answer: There are currently a total of 37.5M items identified as scholarly articles. 20.5M of these occur as both subjects and objects; 17M occur as subjects only. (That is, about 45% are not currently referenced as objects.)
  • Question: Do queries ever use the SPARQL forms CONSTRUCT, INSERT, DELETE and DESCRIBE, or only SELECT and ASK?
    • Answer: 198M queries (for December 2021) were evaluated. The results break down as follows:
      • SELECT : 191,642,376
      • ASK : 1,597,493
      • CONSTRUCT : 1,030,611
      • DESCRIBE : 523,840
      • Unidentified: ~2M
      • Since users do not have write permissions, INSERTs/DELETEs are not performed.
  • Question: Do queries use SPARQL functions such as CONCAT? If so, what functions are used? And, does their use correlate with timeouts occurring?
    • Answer: It can be assumed that all of the SPARQL 1.1 functions have been used at some time. There are no current estimates of usage or any relationship of query patterns to timeouts. This could be a valid line of inquiry in order to establish a baseline for load testing. Also, it could be that 'queries that are more ambitious are more likely to use less common parts of the language, and might also be more likely to hit the 60-second limiter. And ... one does expect REGEX() or string functions generally to be ... more expensive, and so reduce the amount that can be achieved in 60 seconds.' (Jheald)
  • Question: The second table in WDQS Triples Analysis, Node Type Distribution, shows NODE_LITERAL as the SUBJECT 30 times. What is an example of this? It seems wrong for this to occur at all.
    • Answer: Indeed, the queries where a NODE_LITERAL occurs as the SUBJECT are invalid.
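
The source does not say how the December 2021 queries were assigned to a query form. One plausible approach, sketched below in Python, is to skip comment lines and PREFIX/BASE declarations and then read the first keyword. The function name, the regular expressions and the "UNIDENTIFIED" bucket are assumptions of this sketch, and it does not cope with every valid SPARQL query.

```python
import re
from collections import Counter

# Matches one PREFIX or BASE declaration at the start of the text.
_PROLOGUE = re.compile(
    r"\s*(?:PREFIX\s+[^\s:]*:\s*<[^>]*>|BASE\s+<[^>]*>)", re.IGNORECASE
)
_FORMS = ("SELECT", "ASK", "CONSTRUCT", "DESCRIBE")


def query_form(query):
    """Return the SPARQL query form, or 'UNIDENTIFIED' if none is recognised."""
    # Drop whole-line comments; trailing comments are left alone in this sketch.
    lines = [ln for ln in query.splitlines() if not ln.lstrip().startswith("#")]
    text = "\n".join(lines)
    # Strip any number of leading PREFIX/BASE declarations.
    while True:
        match = _PROLOGUE.match(text)
        if not match:
            break
        text = text[match.end():]
    first = re.match(r"\s*([A-Za-z]+)", text)
    word = first.group(1).upper() if first else ""
    return word if word in _FORMS else "UNIDENTIFIED"


sample = [
    "PREFIX wd: <http://www.wikidata.org/entity/>\nSELECT ?p WHERE { wd:Q42 ?p ?o }",
    "ask { ?s ?p ?o }",
    "# who is this?\nDESCRIBE <http://www.wikidata.org/entity/Q42>",
    "INSERT DATA { <urn:a> <urn:b> <urn:c> }",
]
tally = Counter(query_form(q) for q in sample)
```

Update requests such as INSERT fall into the unidentified bucket here, which is consistent with the note above that users cannot perform writes.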

General comments:

  • Regarding items that are not referenced as objects: although not currently referenced, items are created for their potential/eventual use. It may be interesting to evaluate the savings that could be achieved by moving non-referenced items to a different database or graph, and moving them back between databases/graphs when they become referenced. This functionality could become part of the streaming update functionality.
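
The distinction between referenced and subject-only items can be sketched on a toy set of triples. The Python below is only an illustration of the set logic, not the method used to produce the counts above; the function name and the sample triples are assumptions, and a real analysis would run over the full dump or the query service.

```python
def split_by_reference(triples):
    """Split subjects into those also used as objects and those never referenced.

    triples is an iterable of (subject, predicate, object) tuples.
    Returns (referenced, subject_only) as two sets of subjects.
    """
    triples = list(triples)
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    referenced = subjects & objects
    subject_only = subjects - objects
    return referenced, subject_only


toy_triples = [
    ("Q1", "P31", "Q5"),      # Q1 is a subject only
    ("Q5", "P279", "Q100"),   # Q5 is both a subject and an object
    ("Q2", "P50", "Q5"),      # Q2 is a subject only
    ("Q100", "P31", "Q200"),  # Q100 is both; Q200 appears only as an object
]
referenced, subject_only = split_by_reference(toy_triples)
share_subject_only = len(subject_only) / (len(referenced) + len(subject_only))
```

In a split-graph design, the subject-only set would be the candidate for the secondary database, and an item would move back as soon as an update introduces a triple that uses it as an object.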

Other Things to Consider

  • Date/Time considerations: One thing that the user community might definitely appreciate would be a more flexible datatype for representing times than the basic xsd:dateTime, which (AFAIK) cannot represent the precision of dates alone, so dates with only a year appear in query outputs with "1 January" as a spurious day and month. An extended type with more flexibility would be valuable, so long as all operators and functions such as <, =, >, day(), month(), year(), max(), min(), str(), strdt(), etc. were extended/overloaded to be defined for it. (From Jheald)
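
One way to picture such an extended type is a date value that carries its own precision, so that a year-only date never gains a made-up month and day. The Python sketch below is purely illustrative; the class name PartialDate and its behaviour are hypothetical and do not correspond to any existing WDQS or SPARQL datatype.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    """A date known to year, month or day precision (positive years only here)."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def __post_init__(self):
        # A day without a month would be ambiguous, so reject it.
        if self.day is not None and self.month is None:
            raise ValueError("a day requires a month")

    @property
    def precision(self):
        """Return 'year', 'month' or 'day' depending on which parts are set."""
        if self.month is None:
            return "year"
        if self.day is None:
            return "month"
        return "day"

    def __str__(self):
        # Render only the parts that are actually known.
        text = f"{self.year:04d}"
        if self.month is not None:
            text += f"-{self.month:02d}"
        if self.day is not None:
            text += f"-{self.day:02d}"
        return text


year_only = PartialDate(1850)
full_date = PartialDate(1850, 3, 7)
```

With a type like this, month() and day() on a year-only value would return nothing rather than 1, which is the behaviour the comment above asks for. Comparison operators would still need an agreed rule for values of different precision.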