User:AndreaWest/Blazegraph Features and Capabilities

From Wikitech

The following is a list of Blazegraph-specific features and capabilities used by WDQS and its community. Defining alternative implementations that minimize the user impact is of critical importance.

Overview of Blazegraph-Specific Features and Capabilities

  • SPARQL functionality extensions
    • Typically, SPARQL is extended by new datatypes and functions
      • The current Blazegraph implementation has a mix of function extensions (geof:distance, geof:globe, geof:latitude, geof:longitude and wikibase:decodeUri) and SERVICE extensions (such as wikibase:label, wikibase:around or wikibase:mwapi)
      • Whereas functions return a single value, the WDQS SERVICES provide multiple outputs
      • Each of the current datatypes/functions and SERVICEs is discussed below
  • Named subqueries
    • Documentation
    • Note that although support for subqueries is required for SPARQL compliance, naming is not a compliant feature
    • It is likely that subqueries will not be name-able
    • Based on the placement of the subquery in the overall SPARQL and the use of query hints, a subquery's order of execution can be controlled

Named Sub-Queries

Named sub-queries are a readability convenience. The following examples show how naming can be replaced/handled by judicious placement of sub-queries within the overall query. However, there may be cases where sub-query placement does not sufficiently improve performance. These queries should be individually documented.

Example 1

Example 1 demonstrates a named sub-query that is used multiple times. It produces 2 results and executes in 8062 msecs.

SELECT ?status ?count ?total ((xsd:integer(0.5 + (1000 * ?count / ?total)) / 10) AS ?pct)
WITH {
  SELECT ?status (COUNT(DISTINCT(?id)) AS ?count) WHERE {
    ?item wdt:P4638 ?id .
    BIND(xsd:integer(STRAFTER(str(?item), 'Q')) AS ?num) .
    BIND(IF(?num < 75000000, 'matched', 'unmatched') AS ?status) .
  } GROUP BY ?status
} AS %counts
WITH {
  SELECT (SUM(?count) AS ?total) WHERE {
     INCLUDE %counts
  }
} AS %total
WHERE {
    INCLUDE %counts .
    INCLUDE %total .
} ORDER BY ?status

It can be rewritten (yes, this is very ugly) as the following, which also produces 2 results but in 29384 msecs (more than three times as long).

SELECT ?status ?count ?total ((xsd:integer(0.5 + (1000 * ?count / ?total)) / 10) AS ?pct)
WHERE {
  { SELECT ?status (COUNT(DISTINCT(?id)) AS ?count) WHERE {    # SUB-QUERY
      ?item wdt:P4638 ?id .
      BIND(xsd:integer(STRAFTER(str(?item), 'Q')) AS ?num) .
      BIND(IF(?num < 75000000, 'matched', 'unmatched') AS ?status) .
  } GROUP BY ?status }
  { SELECT (SUM(?count) AS ?total) WHERE {
     { SELECT ?status (COUNT(DISTINCT(?id)) AS ?count) WHERE {   # DUPLICATED AS SUB-SUB-QUERY
         ?item wdt:P4638 ?id .
         BIND(xsd:integer(STRAFTER(str(?item), 'Q')) AS ?num) .
         BIND(IF(?num < 75000000, 'matched', 'unmatched') AS ?status) .
     } GROUP BY ?status }
  } }
} ORDER BY ?status

Or, it can be rewritten more elegantly as the following, which produces the same results but in 16389 msecs.

SELECT ?status ?count ?total ((xsd:integer(0.5 + (1000 * ?count / ?total)) / 10) AS ?pct)
WHERE {
  { SELECT (COUNT(DISTINCT(?id)) AS ?total) WHERE {           # Faster calculation of ?total
      ?item wdt:P4638 ?id 
  } }
  { SELECT ?status (COUNT(DISTINCT(?id)) AS ?count) WHERE {   # Original Sub-Query
      ?item wdt:P4638 ?id .
      BIND(xsd:integer(STRAFTER(str(?item), 'Q')) AS ?num) .
      BIND(IF(?num < 75000000, 'matched', 'unmatched') AS ?status) .
  } GROUP BY ?status }
} ORDER BY ?status

Example 2

Example 2 demonstrates a particularly complex query using multiple, named sub-queries. It is documented at Showcase Queries Section 1.13. This query is described as "All oldest living US ex-presidents in chronological order". However, when executed, it returns results that do not seem to make sense. Some are shown below - such as dates with no president, presidents who are not living, presidents listed multiple times, etc.

date president presidentLabel
1797-03-05T00:00:00Z http://www.wikidata.org/entity/Q23 George Washington
1799-12-15T00:00:00Z
1801-03-05T00:00:00Z http://www.wikidata.org/entity/Q11806 John Adams
1826-07-05T00:00:00Z http://www.wikidata.org/entity/Q11813 James Madison
1836-06-29T00:00:00Z http://www.wikidata.org/entity/Q11816 John Quincy Adams
1837-03-05T00:00:00Z http://www.wikidata.org/entity/Q11817 Andrew Jackson
1845-06-09T00:00:00Z http://www.wikidata.org/entity/Q11816 John Quincy Adams
1848-02-24T00:00:00Z http://www.wikidata.org/entity/Q11820 Martin Van Buren
1862-07-25T00:00:00Z http://www.wikidata.org/entity/Q12325 James Buchanan
1868-06-02T00:00:00Z http://www.wikidata.org/entity/Q12306 Millard Fillmore
1874-03-09T00:00:00Z http://www.wikidata.org/entity/Q8612 Andrew Johnson
1875-08-01T00:00:00Z
1877-03-05T00:00:00Z http://www.wikidata.org/entity/Q34836 Ulysses S Grant

A different, much simpler version of this query is the following:

# Oldest US presidents, when ending term
SELECT DISTINCT ?age ?president ?presidentLabel ?president_birthdate ?president_endterm
WHERE {
        ?president wdt:P31 wd:Q5 ; p:P39 ?president_statement.
        ?president_statement ps:P39 wd:Q11696.                          # Held position of US President
        ?president wdt:P569 ?president_birthdate.                       # Validated in advance that all birthdays are known
        OPTIONAL { ?president_statement pq:P582 ?president_endterm. }   # OPTIONAL for current President
        BIND(IF(BOUND(?president_endterm),                              # Calculate age
                (YEAR(?president_endterm) - YEAR(?president_birthdate)
                   - IF(MONTH(?president_endterm) < MONTH(?president_birthdate)
                        || (MONTH(?president_endterm) = MONTH(?president_birthdate)
                            && DAY(?president_endterm) < DAY(?president_birthdate)),1,0)),
                (YEAR(NOW()) - YEAR(?president_birthdate))) AS ?age) .     # Use current age (SHOULD use similar calc as above)
        ?president rdfs:label ?presidentLabel.
        FILTER (LANG(?presidentLabel) = "en")
} ORDER BY DESC(?age)

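The age calculation in the BIND above (subtract the years, then subtract one more if the birthday has not yet been reached) can be sketched outside SPARQL as follows. This is only an illustration of the arithmetic; the function name and parameters are hypothetical. It also applies the same birthday adjustment to the "current age" branch, which the query comment notes it should.

```python
from datetime import date
from typing import Optional


def age_in_years(birth: date, end: Optional[date] = None,
                 today: Optional[date] = None) -> int:
    """Full years elapsed from birth to end (or to today when end is missing)."""
    reference = end if end is not None else (today or date.today())
    # Subtract one year when the birthday has not yet occurred in the reference year.
    before_birthday = (reference.month, reference.day) < (birth.month, birth.day)
    return reference.year - birth.year - (1 if before_birthday else 0)


# George Washington: born 1732-02-22, term ended 1797-03-04.
print(age_in_years(date(1732, 2, 22), date(1797, 3, 4)))  # 65
```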

SPARQL Functional Extensions

In order to support the current Blazegraph functional extensions, creation of similar custom SPARQL functions would be needed. This capability is supported by all the Blazegraph alternative backends.

  • The functions, geof:globe, geof:latitude and geof:longitude, are simple decompositions of the geometry data of a POINT
    • Typically, the "geof:" prefix represents the namespace defined by http://www.opengis.net/def/function/geosparql/
    • Note that since these are NOT valid functions in GeoSPARQL, it seems inappropriate to reference them using the "geof:" namespace
      • Another namespace prefix (such as "wdqs:") is recommended
        • "wdqs:" is recommended (as opposed to "wikibase:") to avoid confusion between the functions, such as wdqs:latitude, which would be used in a SPARQL SELECT, BIND or FILTER clause, and the corresponding value property (wikibase:geoLatitude) which is used in the RDF data as a predicate
          • For example, one might query, "?item p:P625 ?coordinate_statement. ?coordinate_statement psv:P625 ?coordinate_node. ?coordinate_node wikibase:geoLatitude ?lat."
    • In both Wikidata and GeoSPARQL, a geometric POINT is expressed using WKT (well-known text) serialization, which specifies a coordinate system, followed by a longitude and latitude
      • GeoSPARQL also allows other serializations, but WKT is the most-often used
      • The coordinate system (also known as the spatial reference system) is either WGS84, for locations on Earth, or is identified by an item ID (enclosed in angle brackets, '<' and '>') which specifies a non-Earth planetary body
      • geof:globe/wdqs:globe retrieves the coordinate system
      • geof:latitude/wdqs:latitude and geof:longitude/wdqs:longitude functions split the POINT data into its component parts
  • The function, wikibase:decodeUri, can be defined using the logic at https://github.com/wikimedia/wikidata-query-rdf/blob/master/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/constraints/DecodeUriBOp.java
    • Recommend that the function also use the prefix, "wdqs:", for consistency with the above
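
The decomposition described in the list above can be sketched as follows. This is a minimal illustration, not the Blazegraph implementation: the function names are hypothetical, the default globe IRI for Earth is an assumption, and decode_uri assumes plain percent-decoding (the linked Java class should be consulted for the exact behavior).

```python
import re
from typing import NamedTuple
from urllib.parse import unquote

# Assumption: points without an explicit globe are on Earth (Wikidata item Q2).
EARTH = "http://www.wikidata.org/entity/Q2"

# Optional "<globe IRI>" followed by "Point(longitude latitude)".
_POINT = re.compile(
    r"^\s*(?:<(?P<globe>[^>]+)>\s*)?POINT\s*\(\s*"
    r"(?P<lon>[-+0-9.eE]+)\s+(?P<lat>[-+0-9.eE]+)\s*\)\s*$",
    re.IGNORECASE,
)


class Point(NamedTuple):
    globe: str
    longitude: float
    latitude: float


def parse_point(wkt: str) -> Point:
    """Split a WKT POINT literal into globe, longitude and latitude."""
    match = _POINT.match(wkt)
    if match is None:
        raise ValueError(f"not a WKT POINT: {wkt!r}")
    return Point(match.group("globe") or EARTH,
                 float(match.group("lon")),
                 float(match.group("lat")))


def decode_uri(text: str) -> str:
    """Percent-decode a URI component (assumed semantics of decodeUri)."""
    return unquote(text)


berlin = parse_point("Point(13.383333333 52.516666666)")
print(berlin.latitude, berlin.longitude)
```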

Note that the RDF dump contains the properties, wikibase:geoGlobe, wikibase:geoLatitude and wikibase:geoLongitude. These are value properties that can be used to construct the GeoSPARQL POINT geometries. The following query shows how this would be done:

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX sf: <http://www.opengis.net/ont/sf#>   # sf is "Simple Features"

CONSTRUCT {?item a geo:Feature ; geo:hasDefaultGeometry ?itemPtGeom .
           ?itemPtGeom a geo:Geometry ; geo:asWKT ?wkt_loc } 

WHERE {
  ?item wdt:P625 ?wkt_loc .
  BIND(IRI(CONCAT(str(?item), "PtGeom")) as ?itemPtGeom) .
}

The use of geo:hasDefaultGeometry enables the GeoSPARQL query rewriting rules to apply. These enable simpler query expressions since you can substitute Features (like an instance of an airport or school) for their geometries. For example, the following patterns in a query:

SELECT ?subj1 ?subj2 WHERE {
  ... 
  ?subj1 geo:hasDefaultGeometry ?subjGeom1 . ?subjGeom1 geo:asWKT ?subjLoc1 .
  ?subj2 geo:hasDefaultGeometry ?subjGeom2 . ?subjGeom2 geo:asWKT ?subjLoc2 .
  FILTER(geof:sfContains(?subjLoc1, ?subjLoc2))
  ...
}

become:

SELECT ?subj1 ?subj2 WHERE {
  ...
  ?subj1 geo:sfContains ?subj2 .
  ...
}

Geospatial Support Using GeoSPARQL

The last Blazegraph function extension is geof:distance. That is aligned with GeoSPARQL, where it is also identified as geof:distance. Blazegraph's geof:distance takes as input two string-based POINTs and returns the distance between them in kilometers. The GeoSPARQL function, geof:distance, also supports the input of two POINTs and adds a third parameter, units (which could be defaulted in the code base to kilometers). Note that GeoSPARQL 1.0 has only a few basic units of measure defined. These are adequate for Wikidata's use. The proposed GeoSPARQL 1.1 specification indicates the use of the Quantities, Units, Dimensions and Types ontology (QUDT), which is much broader.

The previous paragraph glossed over a detail which is important when migrating Wikidata to GeoSPARQL compliance. Whereas Wikidata uses a GlobeCoordinate declaration for a point location, GeoSPARQL uses geometries. Therefore, when loading the RDF dump into the alternative database, the query shown above should be performed and the new CONSTRUCTed triples added to the dump. Ideally, this could be done in Wikidata itself.
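
To make the discussion of geof:distance and its units parameter more concrete, the following sketch computes a great-circle distance between two longitude/latitude pairs with the haversine formula, defaulting to kilometers. The formula, the Earth radius and the unit names are assumptions made for illustration; the actual Blazegraph and GeoSPARQL implementations may calculate distance differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption)

# Hypothetical unit names mapped to a multiplier applied to kilometers.
UNITS = {"kilometre": 1.0, "metre": 1000.0}


def distance(lon1: float, lat1: float, lon2: float, lat2: float,
             unit: str = "kilometre") -> float:
    """Great-circle (haversine) distance between two points on Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    km = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
    return km * UNITS[unit]


# One degree of longitude along the equator.
print(round(distance(0.0, 0.0, 1.0, 0.0), 3))
```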

Beyond the geof:distance function, there are other valuable GeoSPARQL properties and functions which could be used in Wikidata queries. These include:

  • Specification of geometries/locations beyond POINTs, such as POLYGONs (which are specified as a group of POINTs that define the geometry's boundary)
  • geof:buffer function, which conceptualizes the space around a geometry (such as a POINT), where the space is defined by a radius given by some units
  • geof:envelope function, which returns the minimal bounding box for an input geometry
    • Given a complex POLYGON, the function would return another POLYGON defining the 4 corners of the minimal bounding box
  • Specification of topology vocabulary properties which relate 2 geometries and relation functions which compare two geometries and return a boolean indicating if they meet the criteria of the function:
    • geo:sfEquals (the property) or geof:sfEquals (which returns true if the 2 geometries are equal)
    • geo:sfDisjoint (the property) or geof:sfDisjoint (which returns true if the 2 geometries are disjoint/separate, which is the inverse of geof:sfIntersects)
    • geo:sfIntersects (the property) or geof:sfIntersects (which returns true if any part of the first geometry overlaps with any part of the second)
    • geo:sfTouches (the property) or geof:sfTouches (which returns true if a boundary of the first geometry comes into contact with the boundary of the second, but the interiors of the geometries do NOT intersect)
    • geo:sfCrosses (the property) or geof:sfCrosses (which returns true if the interior of the first geometry comes into contact with the interior or boundary of the second)
    • geo:sfWithin (the property) or geof:sfWithin (which returns true if the second geometry completely encloses the first)
    • geo:sfContains (the property) or geof:sfContains (which returns true if the first geometry completely encloses the second)

Note that some of the above will be used to address the Blazegraph geospatial SERVICEs (wikibase:around and wikibase:box), as explained below.

SERVICE Extensions

This section describes how the WDQS- and Blazegraph-specific SERVICEs (wikibase:label, wikibase:mwapi, wikibase:around, wikibase:box, gas:service, bd:sample and bd:slice) could be supported moving forward.

The geospatial SERVICEs, wikibase:around and wikibase:box, can (and should) be provided by the use of GeoSPARQL. Details and examples are discussed below.

Unfortunately, there is no straightforward, functional approach for supporting wikibase:mwapi, the GAS service and bd:sample. The problem is that these SERVICEs return multiple (possibly many) results, and some execute based on complex parameters that are defined using unique triple patterns. That combination of requirements does not translate into the standard SPARQL function extensions, which take a set of predefined parameters and return a single result. In order to support these SERVICEs, modifications to the backend code bases will be required - to distinguish local SERVICE IRIs from HTTP federated requests, and then invoke appropriate "handlers".

Note that this discussion did not reference the bd:slice and wikibase:label SERVICEs. bd:slice functionality can be provided by a judicious use of sub-queries. On the other hand, label details can be provided using a SPARQL function extension, although that function will be less convenient than (but with equivalent capabilities to) the existing SERVICE approach. The inconvenience will be due to the need to repeat language preferences. The alternatives for bd:slice and wikibase:label are described in more detail below.

wikibase:around and wikibase:box

It is reasonable to replace the wikibase:around and wikibase:box SERVICEs with graph patterns that utilize the GeoSPARQL geometry and topology relation functions discussed above. This approach might be most easily explained by using examples.

Let us first examine a query using the wikibase:around SERVICE, which finds airports within 100km of Berlin:

SELECT ?place ?location ?dist WHERE {
  wd:Q64 wdt:P625 ?berlinLoc .       # Berlin coordinates
  SERVICE wikibase:around { 
      ?place wdt:P625 ?location . 
      bd:serviceParam wikibase:center ?berlinLoc . 
      bd:serviceParam wikibase:radius "100" . 
      bd:serviceParam wikibase:distance ?dist.
  } 
  FILTER EXISTS { ?place wdt:P31/wdt:P279* wd:Q1248784 }    # Is an airport
} ORDER BY ASC(?dist)

This could be written as:

prefix uom: <http://www.opengis.net/def/uom/OGC/1.0/>
prefix geo: <http://www.opengis.net/ont/geosparql#>
prefix geof: <http://www.opengis.net/def/function/geosparql/>
SELECT ?place ?location ?dist WHERE {
  wd:Q64 geo:hasDefaultGeometry [ geo:asWKT ?berlinLoc ] .    # Berlin location
  ?place wdt:P31/wdt:P279* wd:Q1248784 ;               # Get airports
         geo:hasDefaultGeometry [ geo:asWKT ?location ] .     # And their coordinates
  BIND (geof:distance(?berlinLoc, ?location, uom:metre) as ?dist) .
  FILTER (?dist <= 100000)
} ORDER BY ASC(?dist)

But, the above query will have poor performance if there are a large number of "?places" (in this example, airports) that are retrieved. As written, the query is retrieving all relevant places and their locations, then calculating distance and lastly filtering out the results.

Alternately (and better performing since it can make use of geospatial indexing and query rewriting), the check could be accomplished by the following query:

prefix uom: <http://www.opengis.net/def/uom/OGC/1.0/>
prefix geo: <http://www.opengis.net/ont/geosparql#>
prefix geof: <http://www.opengis.net/def/function/geosparql/>
SELECT ?place ?location ?dist WHERE {
  {  # Create a geometric object representing the area surrounding Berlin
     SELECT ?berlinLoc ?aroundBerlinLoc WHERE {
        wd:Q64 geo:hasDefaultGeometry [ geo:asWKT ?berlinLoc ] .       # Berlin location
        BIND (geof:buffer(?berlinLoc, 100000, uom:metre) as ?aroundBerlinLoc) }   
  }                
  ?place wdt:P31/wdt:P279* wd:Q1248784 ;                        # Get airports
         geo:sfWithin ?aroundBerlinLoc ;                        # Limited to the area around Berlin 
         geo:hasDefaultGeometry [ geo:asWKT ?location ] .       # And get airport location
  BIND (geof:distance(?berlinLoc, ?location, uom:metre) as ?dist) .    # Get the distance
} ORDER BY ASC(?dist)

Depending on the capabilities of the backend, this query could be shortened further.

In order to support the wikibase:box functionality, a similar approach is taken - although geof:buffer is replaced by a custom wdqs:box function. For example, this query using wikibase:box finds all schools between San Jose and Sacramento CA:

SELECT * WHERE
{ hint:Query hint:optimizer "None" .
  wd:Q16553 wdt:P625 ?SJloc .
  wd:Q18013 wdt:P625 ?SCloc .
  SERVICE wikibase:box {
      ?place wdt:P625 ?location .
      bd:serviceParam wikibase:cornerWest ?SJloc .
      bd:serviceParam wikibase:cornerEast ?SCloc .
    }
  ?place wdt:P31/wdt:P279* wd:Q3914 .
}

It becomes:

prefix uom: <http://www.opengis.net/def/uom/OGC/1.0/>
prefix geo: <http://www.opengis.net/ont/geosparql#>
SELECT ?place ?location WHERE {
  { 
     SELECT ?boundingBox WHERE {
        wd:Q16553 geo:hasDefaultGeometry [ geo:asWKT ?sjLoc ] .       # San Jose location
        wd:Q18013 geo:hasDefaultGeometry [ geo:asWKT ?sacLoc ] .      # Sacramento location
        BIND (wdqs:box(?sjLoc, ?sacLoc) as ?boundingBox) }  
  }
  ?place geo:sfWithin ?boundingBox ;                                   # Get locations within the box
         wdt:P31/wdt:P279* wd:Q3914 ;                                  # That are schools
         geo:hasDefaultGeometry [ geo:asWKT ?location ] .              # And get their coordinates
}

Note that the above proposes a new function (wdqs:box) that constructs a bounding polygon based on two POINT locations. This is accomplished by decomposing the two POINTs into their latitudes and longitudes, and then creating a POLYGON using the SPARQL STRDT function. The function could be provided for convenience. If it is not provided, the same functionality can be implemented using the following BIND expressions:

BIND (wdqs:latitude(?sjLoc) as ?sjLat) .
BIND (wdqs:longitude(?sjLoc) as ?sjLong).
BIND (wdqs:latitude(?sacLoc) as ?sacLat) .
BIND (wdqs:longitude(?sacLoc) as ?sacLong) .
# Note that a POLYGON ring must be closed (i.e., begin and end at the same POINT)
BIND (CONCAT("POLYGON((", STR(?sacLong), " ", STR(?sacLat), ", ", STR(?sjLong), " ", STR(?sacLat), ", ",
             STR(?sjLong), " ", STR(?sjLat), ", ", STR(?sacLong), " ", STR(?sjLat), ", ",
             STR(?sacLong), " ", STR(?sacLat), "))") as ?polygonString ) .   
BIND (STRDT(?polygonString, geo:wktLiteral) as ?boundingBox) .
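
For comparison, the logic that a wdqs:box function would need to implement can be sketched as follows. This is only an illustration under stated assumptions: the function name and signature are hypothetical, the corners are given as longitude/latitude numbers that have already been extracted from the POINTs, and a box that crosses the antimeridian is not handled.

```python
def box_polygon(lon1: float, lat1: float, lon2: float, lat2: float) -> str:
    """Build a closed WKT POLYGON for the box spanned by two corner points."""
    west, east = sorted((lon1, lon2))
    south, north = sorted((lat1, lat2))
    # A WKT ring must start and end at the same coordinate pair.
    ring = [(west, south), (east, south), (east, north), (west, north), (west, south)]
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON(({coords}))"


# Illustrative corner values, not the actual coordinates of San Jose or Sacramento.
print(box_polygon(-121.9, 37.3, -121.5, 38.6))
```

The resulting string would still need to be typed as a geo:wktLiteral (as STRDT does in the SPARQL version) before being used in a topology test.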

wikibase:label

The wikibase:label SERVICE provides an easy means to retrieve rdfs:label, skos:altLabel and schema:description values for an entity. Its main uses are to simplify the SPARQL query and to provide language preferences for the text that is returned. The latter is the more significant aspect of the SERVICE and is the main focus of the functions defined here.

The label SERVICE could be implemented in 3 new SPARQL functions, each returning a string literal:

string literal wdqs:label (variable var, "string_of_language_codes")
string literal wdqs:altLabel (variable var, "string_of_language_codes")
string literal wdqs:description (variable var, "string_of_language_codes")

These functions would be used in BIND statements to associate specific variable names to the returned texts, which would then be referenced in the query's SELECT clause or used later in the query, for example in a FILTER statement.
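
The language-preference logic behind a function such as wdqs:label can be sketched as follows. This is a minimal illustration, not a proposed implementation: the function name, the dictionary representation of an entity's labels and the sample labels are all hypothetical. The fallback to the Q-ID mirrors the behavior of the existing label SERVICE when no label exists in the requested language(s).

```python
from typing import Dict


def pick_label(entity_id: str, labels: Dict[str, str], languages: str) -> str:
    """Return the label in the first preferred language that has one."""
    for code in (part.strip() for part in languages.split(",")):
        if code in labels:
            return labels[code]
    # No label in any requested language: fall back to the entity's Q-ID.
    return entity_id


# Hypothetical labels for an item, keyed by language code.
labels = {"de": "Haus", "en": "house"}
print(pick_label("Q123", labels, "fr,de,en"))  # Haus
print(pick_label("Q123", labels, "fr,es"))     # Q123
```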

As an example of the use of the current label SERVICE, the following query lists the US presidents and their spouses:

SELECT DISTINCT ?p ?pLabel ?s ?sLabel WHERE {
   wd:Q30 p:P6/ps:P6 ?p .
   ?p wdt:P26 ?s .
   SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
   }
}

This would be (re)written as:

SELECT DISTINCT ?p ?pLabel ?s ?sLabel WHERE {
   wd:Q30 p:P6/ps:P6 ?p .
   ?p wdt:P26 ?s .
   BIND (wdqs:label(?p, "en") as ?pLabel)
   BIND (wdqs:label(?s, "en") as ?sLabel)
}

As another example, consider this query which uses the manual mode of the label SERVICE:

SELECT * WHERE {
     SERVICE wikibase:label {
       bd:serviceParam wikibase:language "fr,de,en" .
       wd:Q123 rdfs:label ?q123Label .
       wd:Q123 skos:altLabel ?q123Alt .
       wd:Q123 schema:description ?q123Desc .
       wd:Q321 rdfs:label ?q321Label .
    }
}

This would be written using the same syntax as above:

SELECT * WHERE {
     BIND (wdqs:label(wd:Q123, "fr,de,en") as ?q123Label) .
     BIND (wdqs:altLabel(wd:Q123, "fr,de,en") as ?q123Alt) .
     BIND (wdqs:description(wd:Q123, "fr,de,en") as ?q123Desc) .
     BIND (wdqs:label(wd:Q321, "fr,de,en") as ?q321Label) .
}

The downside of this approach is the need to repeat the language preferences in each function call.

An alternate approach to defining new functions is to explicitly query for the labels or descriptions in the required languages by using a FILTER clause. Taking this approach, the first example would be written as:

SELECT DISTINCT ?p ?pLabel ?s ?sLabel WHERE {
   wd:Q30 p:P6/ps:P6 ?p .
   ?p wdt:P26 ?s .
   ?p rdfs:label ?pLabel .
   FILTER (lang(?pLabel) = "en") .
   ?s rdfs:label ?sLabel .
   FILTER (lang(?sLabel) = "en") .
}

Note that it is also possible to use the SPARQL function, langmatches. This supports matching based on regional variations of languages (such as en-GB or en-US). If the above query is modified to use this function, as follows:

SELECT DISTINCT ?p ?pLabel ?s ?sLabel WHERE {
   wd:Q30 p:P6/ps:P6 ?p .
   ?p wdt:P26 ?s .
   ?p rdfs:label ?pLabel .
   FILTER (langmatches(lang(?pLabel), "en")) .
   ?s rdfs:label ?sLabel .
   FILTER (langmatches(lang(?sLabel), "en")) .
}

It will return 247 results (versus the 51 results from the original query and the one above)! This is because each individual language variation is a unique literal.
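
The difference between the equality test and langmatches can be sketched as follows. The function below is a simplified illustration of basic language-range matching (a tag matches if it equals the range or extends it with a "-" subtag, and "*" matches any non-empty tag); its name is hypothetical and it is not taken from any SPARQL engine.

```python
def lang_matches(tag: str, language_range: str) -> bool:
    """Simplified basic language-range filtering, compared case-insensitively."""
    if not tag:
        return False  # literals without a language tag never match
    if language_range == "*":
        return True
    tag, language_range = tag.lower(), language_range.lower()
    return tag == language_range or tag.startswith(language_range + "-")


tags = ["en", "en-GB", "en-CA", "de", ""]
print([t for t in tags if t == "en"])               # ['en']
print([t for t in tags if lang_matches(t, "en")])   # ['en', 'en-GB', 'en-CA']
```

Because every regional variant that passes the test contributes its own label literal, a single president/spouse pair can appear in several result rows.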

If it is necessary or desirable to check for language variations, and ONLY ONE result should be returned for each ?p and ?s pair (president and their spouse), add the SAMPLE and GROUP BY features to the query, as follows:

# ?pName and ?sName are used in the WHERE clause because SPARQL 1.1 does not
# allow AS to assign to a variable that is already in scope
SELECT ?p (SAMPLE(?pName) as ?pLabel) ?s (SAMPLE(?sName) as ?sLabel) WHERE {
   wd:Q30 p:P6/ps:P6 ?p .
   ?p wdt:P26 ?s .
   ?p rdfs:label ?pName .
   FILTER (langmatches(lang(?pName), "en")) .
   ?s rdfs:label ?sName .
   FILTER (langmatches(lang(?sName), "en")) .
} GROUP BY ?p ?s

There is one difference between using the wikibase:label SERVICE (or when defined, the new wdqs: functions) and the explicit rdfs:label/skos:altLabel/schema:description triples. That is the fact that the custom wikibase:/wdqs: routines will return the Q-ID of an item, if a label does not exist in the requested language(s). In the presidents and spouses examples above, the query would not return a result if there was no label (specifically, no English label) for EITHER the president or the spouse.

If it is possible that a label may not be defined, use the SPARQL OPTIONAL language feature. For the query above, this would be written as:

SELECT ?p (SAMPLE(?pName) as ?pLabel) ?s (SAMPLE(?sName) as ?sLabel) WHERE {
   wd:Q30 p:P6/ps:P6 ?p .
   ?p wdt:P26 ?s .
   OPTIONAL { ?p rdfs:label ?pName .
              FILTER (langmatches(lang(?pName), "en")) . }
   OPTIONAL { ?s rdfs:label ?sName .
              FILTER (langmatches(lang(?sName), "en")) . }
} GROUP BY ?p ?s

Note that in the case of the presidents and their spouses, there are English labels defined for all of them, and the OPTIONAL is not required.

bd:slice

The functionality of bd:slice is documented in the code (see the Slice Service Factory documentation). In its simplest form, it provides a means to get a subset of results. However, the same functionality can be provided by using a sub-query with a LIMIT/OFFSET.

Let's illustrate this with an example. The query below returned 3743 results in 37074 ms. (The query without bd:slice, with the WHERE clause, "?item wdt:P31 wd:Q13442814. MINUS {?item wdt:P577 ?date}", timed out.)

# Work-around for query for scholarly articles with no date of publication (which times out without bd:slice)
SELECT ?item ?itemLabel 
WHERE 
{
  SERVICE bd:slice {
    ?item wdt:P31 wd:Q13442814.
    bd:serviceParam bd:slice.limit 1000000   # 1M items returned
  }
  minus {
    ?item wdt:P577 ?date.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

The same functionality can be achieved by using a sub-query, as follows:

SELECT ?item ?itemLabel 
WHERE 
{
  { 
    SELECT ?item WHERE { ?item wdt:P31 wd:Q13442814 } LIMIT 1000000
  }
  minus {
    ?item wdt:P577 ?date.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Running this query returned 3743 results but took 49671 ms.

(COMMENT: The two queries above are not in fact equivalent.
bd:slice is guaranteed to always return results in the same order -- the first 10,000 items will always be the same first 10,000 items, the next 10,000 items will always be the same next 10,000 items. But no particular order is guaranteed by just using LIMIT and OFFSET, different runs may return different sets. This is of crucial importance if trying to scan through a large set chunk by chunk without having to realise it all, (eg as adding an ORDER BY clause to sort it all would do), because that would take too long. (Jheald (talk) 21:43, 23 February 2023 (UTC))).

The other use of bd:slice is to return a count (using the bd:slice.range predicate). There were no queries in the February 2022 set that used this predicate, and one query in March (shown below). It returned 1 result (the count of triples = 151295) in 111 ms.

SELECT ?range WHERE
{
  SERVICE bd:slice
  {
    ?item wdt:P6039 ?o .
    bd:serviceParam bd:slice.range ?range .
  }
}

This query can be rewritten using the standard SPARQL COUNT aggregate. It also returns 1 result (151295) but in 205 ms.

SELECT (COUNT(*) as ?range) WHERE
{ 
  ?item wdt:P6039 ?o .  
}

Note that the timings above do vary based on caching of results.

wikibase:mwapi, gas:service and bd:sample

The remainder of the Blazegraph SERVICEs are each described on the following pages:

In order to provide similar functionality, each of the backend code bases would have to be modified to distinguish between a SERVICE invocation addressed to a local IRI (e.g., with the prefix, "urn:", "wdqs:" or similar) and one addressed to an actual, external HTTP endpoint. That checking could occur:

  1. When the SPARQL is being parsed (its algebra/semantics are being defined)
  2. While iterating through/executing the component clauses of the query
  3. By modifying the SERVICE processing itself

The latter two options are likely preferable - since the backend infrastructure would already account for variable bindings and combining results into the final solution.

When executing a local IRI/SERVICE, it is most logical to check a registry of possible "handlers" and then invoke the appropriate code or return an error. The graph patterns of the SERVICE clause and current variable bindings would be passed to the "handler" code, as is done for all SERVICEs. Results would have to be returned consistent with the SPARQL 1.1 Federated Query specification, meaning that they would be an array of variable-RDF term bindings.
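
The registry-and-handler idea described above can be sketched as follows. This is an illustration only, not the design of any particular backend: all names are hypothetical, the "urn:" and "wdqs:" prefixes are taken from the text as examples of local markers, the graph pattern is passed as plain text, and the federated HTTP path is left unimplemented.

```python
from typing import Callable, Dict, List

Binding = Dict[str, str]                            # variable name -> RDF term (as text)
Handler = Callable[[str, Binding], List[Binding]]   # (graph pattern, bindings) -> solutions

LOCAL_PREFIXES = ("urn:", "wdqs:")  # assumed markers for local SERVICE IRIs
HANDLERS: Dict[str, Handler] = {}


def register(iri: str) -> Callable[[Handler], Handler]:
    """Decorator that adds a handler to the registry under a local SERVICE IRI."""
    def decorator(handler: Handler) -> Handler:
        HANDLERS[iri] = handler
        return handler
    return decorator


def is_local(iri: str) -> bool:
    return iri.startswith(LOCAL_PREFIXES)


def call_service(iri: str, pattern: str, bindings: Binding) -> List[Binding]:
    """Dispatch a SERVICE clause to a local handler, or report an error."""
    if not is_local(iri):
        raise NotImplementedError("a federated HTTP request would be sent here")
    handler = HANDLERS.get(iri)
    if handler is None:
        raise LookupError(f"no handler registered for {iri}")
    return handler(pattern, bindings)


@register("urn:example:echo")  # hypothetical local SERVICE used for illustration
def echo(pattern: str, bindings: Binding) -> List[Binding]:
    # Return one solution that extends the incoming bindings.
    return [dict(bindings, pattern=pattern)]


print(call_service("urn:example:echo", "?s ?p ?o", {"x": "1"}))
```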

It is likely that the current SERVICE implementations would need to be adapted to the design points of the specific backends, but the majority of the processing logic should be able to be reused.

Note that one of the backend alternatives (Apache Jena) already has hooks for providing custom SERVICES. This implementation takes the approach of invoking the custom SERVICE while iterating through the query processing (bullet #2, above). Unfortunately, at the time of writing (late April 2022), there is no documentation related to its use. There is, however, a simple test scenario defined.

Frequency of Use of the Blazegraph SERVICE Extensions

Modifying the existing implementations to support local SERVICE extensions and adjusting the logic of those extensions to execute in the particular backend environment may be costly and/or introduce errors. In addition, the Wikidata documentation related to the Blazegraph-specific extensions (gas:service and bd:sample) states that support may be discontinued at some time in the future. Making such a call will require discussion with the community.

To inform discussion, the following shows the usage statistics of these extensions (across all queries issued in February 2022):

Usage of Custom SERVICE Extensions

SERVICE           Percentage of queries
wikibase:mwapi    11.88%
gas:service       0.024%
bd:sample         0.002%

As an aside, the bd:slice SERVICE (discussed above) is used in 0.04% of the February 2022 queries (significantly higher than gas:service or bd:sample).

Of these extensions, the MWAPI SERVICE is the most critical to support.

More Detail on the Use of the GAS SERVICE

The GAS SERVICE is used infrequently to count the number of "hops" between two items and/or to find the shortest path between them (based on a breadth-first or shortest path algorithm). Examples of these types of queries are:

A review by James Heald (noted in https://phabricator.wikimedia.org/T305858#7846300) found that "All of these queries ... use either the BFS "breadth first search" or the SSSP "single source shortest path" gas service, with the two seemingly completely interchangeable -- the only difference seems to be that BFS returns a whole number for the number of hops found to each node, whereas SSSP seems to return a real number. But the performance of the two seems to be entirely similar for the queries they are being used on, with identical results if BFS is replaced by SSSP or vice-versa."

Note that there is a possible alternative to implement counting hops. For example, to find the ancestors of the composer Bach, see this query. It works by first finding the ancestors and then by counting the generations (per the Phabricator ticket listed above, "hat-tip to Tony Bowden and Andrew Gray, 2018"). However, the approach is not usable in general because it is inefficient, cannot be stopped at a certain depth/count or when a specific item is found, and fails when traversing symmetric properties (e.g., it could "circle back" on itself). In addition, the approach assumes that there is only one path from the starting point, which may not be true. Due to the latter restriction, it would NOT work to find a shortest path given different alternatives.
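
For reference, the hop counting that the GAS SERVICE performs can be sketched as a breadth-first search. The sketch below is not Blazegraph's implementation: the function name, the adjacency-dictionary representation of the graph and the sample data are hypothetical. It shows the properties that the SPARQL-only workaround lacks: a depth limit, an early stop when a target is found, and safe handling of cycles.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Optional


def hop_counts(graph: Dict[Hashable, Iterable[Hashable]], start: Hashable,
               max_depth: Optional[int] = None,
               target: Optional[Hashable] = None) -> Dict[Hashable, int]:
    """Breadth-first search returning the fewest hops from start to each node."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            break  # stop as soon as the item of interest is reached
        if max_depth is not None and depths[node] >= max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(node, ()):
            if neighbor not in depths:  # visited check prevents "circling back"
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths


# Small graph with a cycle (a <-> b) and two paths to d.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["e"]}
print(hop_counts(graph, "a"))               # {'a': 0, 'b': 1, 'c': 1, 'd': 2, 'e': 3}
print(hop_counts(graph, "a", max_depth=1))  # {'a': 0, 'b': 1, 'c': 1}
```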

Since there is no general-purpose alternative to the GAS SERVICE, it should be supported. However, the priority of providing that support will be influenced by the low frequency of its use.

As another example of a similar capability, but different implementation, see Stardog's PATH query syntax.