User:AndreaWest/Background on SPARQL Benchmarks

The following sections detail existing benchmark definitions and frameworks, as well as relevant reference material.

Existing Benchmarks

The W3C maintains a web page on RDF Store Benchmarks. Here is background on a few of those as well as several geospatial benchmarks (listed in alphabetical order).

  • BSBM (Berlin SPARQL Benchmark)
    • Dataset is based on an e-commerce use case with eight classes (Product, ProductType, ProductFeature, Producer, Vendor, Offer, Review, and Person) and 51 properties
    • Synthetically-generated data scaled in size based on the number of products
      • For example, a 100M triple dataset has approximately 9M instances across the various classes
      • Both RDF and relational representations are created to allow comparison of backing storage technologies
    • Benchmark utilizes a mix of 12 distinct queries (1 CONSTRUCT, 1 DESCRIBE and 10 SELECT) intended to test combinations of moderately complex queries under concurrent load from multiple clients
      • Queries contain parameterized properties whose values are randomized, so the queries differ across the various test runs
      • SPARQL language features exercised are: Filtering, 9+ graph patterns, unbound predicates, negation, OPTIONAL, LIMIT, ORDER BY, DISTINCT, REGEX and UNION (see the BSBM-style sketch after this list)
    • Performance metrics include dataset load time, query mixes per hour (QMpH), and queries per second (QpS, determined by taking the number of queries of a specific type in a test run and dividing by their total execution time)
      • All performance metrics require reporting the size of the dataset, and the first two metrics should also report the number of concurrent query clients
  • DBpedia Benchmark (deprecated, but the approach to query definition is informative)
    • Dataset uses one or more DBpedia resources, with the possibility to create larger sets by changing namespace names, and to create smaller subsets by selecting a random fraction of triples or by sampling triples across classes
      • The sampling approach attempts to preserve data characteristics of indegree/outdegree (min/max/avg number of edges into/out of a vertex)
    • Queries defined by analyzing requests made against the DBpedia SPARQL endpoint, coupled with specifying SPARQL features to test
      • Analysis process involved 4 steps: Query selection from the SPARQL endpoint log; stripping syntactic constructs (such as namespace prefix definitions); calculation of similarity measures (e.g., Levenshtein string similarity); and clustering based on the similarity measures (as documented in DBpedia SPARQL Benchmark)
      • SPARQL features to test: Number of triple patterns (to exercise JOIN operations, from 1 to 25), plus the inclusion of UNION and OPTIONAL constructs, the DISTINCT solution modifier, and the FILTER, LANG, REGEX and STR operators
      • Result was 25 SPARQL SELECT templates with different variable components (usually an IRI, a literal or a filter condition), with a goal of 1000+ different possible values per component (see the template sketch after this list)
    • Benchmark tests utilize variable DBpedia dataset sizes (10% to 200%) and query mixes based on the 25 templates and parameterized values
    • Performance metrics include query mixes per hour (QMpH), number and type of queries which timed out, and queries per second (QpS, calculated as a mean and geometric mean)
      • Performance is reported relative to the dataset size
  • FedBench (Evaluating federated querying)
    • Three interlinked data collections are defined that differ in size, coverage, types of links and types of data (actual vs. synthetic)
      • First is cross-domain, holding data from DBpedia, GeoNames, Jamendo, Linked-MDB, New York Times and Semantic Web Dog Food (approximately 160M triples)
      • Second is targeted at Life Sciences, holding data from DBpedia, KEGG, DrugBank and ChEBI (approximately 53M triples)
      • Last is the SP2Bench dataset (10M triples)
    • 36 fixed SELECT queries are specified in total, exercising both SPARQL language features and use-case scenarios (see the federation sketch after this list)
      • 7 cross-domain, 7 life-science, 11 SP2Bench and 11 linked-data queries
      • Cross-domain and life-science queries test "federation-specific aspects, in particular (1) number of data sources involved, (2) join complexity, (3) types of links used to join sources, and (4) varying query (and intermediate) result size"
      • SP2Bench queries are discussed below and included in FedBench to exercise SPARQL language features (only the SELECT queries are used)
      • Linked-data queries focused on basic graph patterns (e.g., conjunctive query)
    • Performance metrics based mainly on query execution time
  • Geospatial/GeoSPARQL benchmarks
    • EuroSDR geospatial benchmark
      • Tests performed in two scenarios:
        • Linked data environment integrating geospatial and other data, based on the ICOS Data Portal
          • ICOS uses several backing ontologies, but they are not GeoSPARQL compliant
          • The EuroSDR work redesigned the ontologies for compliance and transformed the ICOS geometry data from GeoJSON to WKT (Well-Known Text)
          • Resulting dataset was generated from the ICOS data in March 2019 and contains over 2M RDF statements
        • Using the Geographica dataset (discussed below)
      • 25 fixed queries are used in both scenarios, selected from or adapted from the Geographica micro-benchmark discussed below (5 queries test non-topological construct functions, 10 evaluate spatial selection, and 10 test spatial joins)
      • Performance metrics include load time, query execution time in each test iteration, and result correctness related to the number of results and the reported geometries
    • GeoFedBench
      • Two scenarios are defined, one based on linking land usage data with water availability data and the other on linking land usage data with ground observations - from the GitHub web page:
        • GSSBench suite ... "is derived from the practical use-case of linking land usage data with water availability data for food security. The federation contains 3 data layers (namely Administrative, Snow cover, and Crop-type data), and each layer is also divided geospatially. Thus, each endpoint contains only one thematic layer and refers to a specific area. This suite is used mainly to evaluate the effectiveness of the source selection mechanism of a federation engine, and, in particular, if the source selector is aware of the geospatial nature of the source endpoints".
          • Uses data from the GADM (Database of Global Administrative Areas), Extreme Earth food security, and Invekos datasets
          • 7 query templates defined with parameterized values
        • GDOBench suite ... "is derived from the practical use-case of linking land usage data with ground observations for the purpose of estimating crop type accuracy. The query load of the suite contains difficult query characteristics (usually not found in current SPARQL benchmarks, such as inner SELECT queries and negation through FILTER NOT EXISTS). Moreover, the bottleneck of the evaluation is the federated geospatial within-distance operator, which is considered a difficult operation since it cannot be evaluated fast using a spatial index (as it is the case for the spatial relationships such as contains, within, etc.)". (See the within-distance sketch after this list.)
          • Uses data from the Invekos and Lucas datasets
          • 4 query templates defined with parameterized values
      • Performance metrics include number of queries that executed and their times, number of queries that timed out, and number of results for queries that completed (to evaluate both performance and correctness)
    • Geographica
      • Two datasets defined - one based on publicly available linked data and the other based on synthetic data
        • Publicly available data focused on Greece and used information from DBpedia, GeoNames, LinkedGeoData (related to road networks and rivers in Greece), Greek Administrative Geography, CORINE Land Use/Land Cover, and wildfire hotspots from the National Observatory of Athens' TELEIOS project
          • Complete dataset contains more than 30K points, 12K polylines and 82K polygons
        • Synthetic data generation produces datasets of different sizes with different thematic and spatial selectivity
      • Two benchmarks defined to exercise the publicly available data - a micro and a macro benchmark
        • Micro benchmark focused on evaluation of primitive spatial functions, testing "non-topological functions, spatial selections, spatial joins and spatial aggregate functions" (see the spatial selection sketch after this list)
          • 29 fixed queries - 6 non-topological queries, 11 spatial selection queries, 10 spatial join queries (joining across different named graphs) and 2 aggregate function queries (one is specific to the stSPARQL language developed for Strabon)
        • Macro benchmark focused on performance in different use cases/application scenarios
          • 16 fixed queries - 4 geocoding queries (related to finding the name of a location, given certain criteria, or finding a city or street closest to a specified point), 3 map queries (related to finding a point of interest given some criteria and then roads or buildings around it), 6 "wildfire" use case queries (related to finding land cover area, primary roads, cities and municipalities within a bounding box, as well as forests on fire or roads which may be damaged) and 3 aggregation/counting of location (CLC) queries
      • For the synthetic data, various queries are generated from two templates using different properties and criteria
        • One template selects a location based on criteria + within or intersecting a bounding box, and the other template selects 2 locations based on criteria + within or intersecting or touching each other
      • Performance metrics for both datasets include statistics on load time, the overall time to execute a test run, and the execution times of individual queries
    • GeoSPARQL Benchmark (Evaluating compliance to the GeoSPARQL specification)
      • Benchmark tests all 30 GeoSPARQL requirements defined in the specification
      • Dataset and concrete queries defined for each requirement, with some requirements only having 1 test and others having multiple tests if there are various sub-requirements defined in the specification
      • Performance metrics indicate the percentage of supported (sub-)requirements of an implementation, out of the 30 overall requirements
  • LUBM (Lehigh University Benchmark)
    • Dataset based on a "university" ontology (Universities, Professors, Students, Courses, etc.) with 43 classes and 32 properties
    • Synthetically-generated data scaled in size
      • Defined datasets range from 1 to 8000 universities, the largest one having approximately 1B triples
    • 14 fixed queries are defined, focused on instance retrieval (SELECT queries) and limited inference (based on subsumption/subclassing, owl:TransitiveProperty and owl:inverseOf); see the inference sketch after this list
      • Factors of importance: Proportion of the instances involved (size and selectivity); Complexity of the query; Requirement for traversal of class/property hierarchies; and Requirement for inference
      • Queries do not include language features such as OPTIONAL, UNION, DESCRIBE, etc.
    • Performance metrics include load time, query response time, answer completeness and correctness, and a combined metric (similar to F-Measure) based on completeness/correctness
  • SP2Bench (SPARQL Performance Benchmark)
    • Dataset based on the structure of the DBLP Computer Science Bibliography with 8 classes and 22 properties
    • Synthetic data (of different sizes) generated based on the characteristics of the underlying DBLP information
    • 17 fixed queries exercising SPARQL language features and JOIN operations, as well as SPARQL complexity and result size
      • 3 ASK queries and 14 SELECT queries defined that test JOINs, FILTER, UNION, OPTIONAL, DISTINCT, ORDER BY, LIMIT, OFFSET, and blank node and container processing (see the bibliographic sketch after this list)
      • Evaluating "long path chains (i.e. nodes linked to ... other nodes via a long path), bushy patterns (i.e. single nodes that are linked to a multitude of other nodes), and combinations of these two"
    • Performance metrics include load time, success rate (separately reporting success rates for all document sizes, and distinguishing between success, timeout, memory issues and other errors), global and per-query performance (where the former combines the per-query results and produces both the mean and geometric mean), and memory consumption (reporting both the maximum consumption and the average across all queries)
  • UOBM (University Ontology Benchmark, very similar to but extending LUBM)
    • Two ontologies defined with different inferencing requirements (OWL Lite and OWL DL)
    • Ontology classes and properties added (69 total classes and 43 properties in the OWL DL ontology)
    • Generation of synthetic data to include links between universities' and departments' data
    • 15 fixed SELECT queries defined
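
To make the query characteristics listed above more concrete, a few illustrative sketches follow. They are sketches only: the ex: vocabulary, the endpoint URLs and the literal values are placeholders, not the benchmarks' actual terms. First, a BSBM-style SELECT query combining several of the language features that benchmark exercises (FILTER with REGEX, OPTIONAL, DISTINCT, ORDER BY and LIMIT):

  # Placeholder vocabulary; BSBM's actual namespaces differ
  PREFIX ex:   <http://example.org/vocabulary/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT DISTINCT ?product ?label ?reviewText
  WHERE {
    ?product a ex:Product ;                # basic graph pattern over the product class
             rdfs:label ?label ;
             ex:producer ?producer .
    OPTIONAL {                             # reviews may or may not exist for a product
      ?review ex:reviewFor ?product ;
              ex:text ?reviewText .
    }
    FILTER regex(?label, "widget", "i")    # case-insensitive REGEX filter on the label
  }
  ORDER BY ?label
  LIMIT 10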
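
A template sketch in the style of the DBpedia SPARQL Benchmark. In the actual benchmark each template has a variable component that is replaced with one of 1000+ possible values at test time; the %%resource%% placeholder notation and the specific pattern shown here are assumptions for illustration:

  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX dbo:  <http://dbpedia.org/ontology/>
  SELECT DISTINCT ?label ?abstract
  WHERE {
    # %%resource%% is the parameterized component, filled with a concrete IRI per run
    %%resource%% rdfs:label ?label ;
                 dbo:abstract ?abstract .
    FILTER ( LANG(?label) = "en" && LANG(?abstract) = "en" )   # LANG and FILTER features
  }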
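
FedBench's queries are plain SELECT queries; the federation engine, not the query author, decides which endpoints can answer which triple patterns. For comparison, the federation sketch below writes a two-source join explicitly using SPARQL 1.1 SERVICE clauses; the endpoint URLs and the ex: vocabulary are illustrative placeholders, and this is not one of the FedBench queries:

  PREFIX ex: <http://example.org/vocabulary/>
  SELECT ?film ?director ?birthPlace
  WHERE {
    SERVICE <http://example.org/movies/sparql> {          # first data source
      ?film a ex:Film ;
            ex:director ?director .
    }
    SERVICE <http://example.org/encyclopedia/sparql> {    # second data source, joined on ?director
      ?director ex:birthPlace ?birthPlace .
    }
  }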
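
A spatial selection sketch in the style of the Geographica/EuroSDR micro-benchmark queries, using the standard GeoSPARQL vocabulary: the geof:sfWithin filter keeps only features whose geometry lies inside a fixed bounding box. The ex:Road class and the polygon coordinates are illustrative placeholders:

  PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
  PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
  PREFIX ex:   <http://example.org/vocabulary/>
  SELECT ?feature ?wkt
  WHERE {
    ?feature a ex:Road ;
             geo:hasGeometry ?geom .
    ?geom geo:asWKT ?wkt .
    # Spatial selection: keep geometries contained in the (illustrative) bounding box
    FILTER ( geof:sfWithin(?wkt,
        "POLYGON((23.5 37.8, 24.2 37.8, 24.2 38.2, 23.5 38.2, 23.5 37.8))"^^geo:wktLiteral) )
  }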
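
The within-distance sketch below expresses the operation that GDOBench highlights, using the GeoSPARQL geof:distance function; the parcel/observation vocabulary and the 100 metre threshold are illustrative assumptions. Unlike spatial relationship functions such as contains or within, this filter has to be evaluated over candidate pairs of geometries, which is why a spatial index alone does not answer it efficiently:

  PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
  PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
  PREFIX uom:  <http://www.opengis.net/def/uom/OGC/1.0/>
  PREFIX ex:   <http://example.org/vocabulary/>
  SELECT ?parcel ?observation
  WHERE {
    ?parcel a ex:Parcel ;
            geo:hasGeometry/geo:asWKT ?parcelWkt .
    ?observation a ex:GroundObservation ;
                 geo:hasGeometry/geo:asWKT ?obsWkt .
    # Within-distance: ground observations no more than 100 metres from a parcel
    FILTER ( geof:distance(?parcelWkt, ?obsWkt, uom:metre) < 100 )
  }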
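
An inference sketch in the LUBM style: if the ontology declares GraduateStudent to be a subclass of Student, a store that only matches stored triples (without subsumption reasoning) will miss graduate students when asked for a department's students. The ub: prefix below is a stand-in for the univ-bench ontology namespace, and the department IRI merely mimics the naming pattern of LUBM's generated data:

  # ub: is a placeholder prefix for the univ-bench ontology namespace
  PREFIX ub: <http://example.org/univ-bench#>
  SELECT ?student
  WHERE {
    # Complete results require inferring that instances of subclasses
    # (e.g., GraduateStudent) are also instances of ub:Student
    ?student a ub:Student ;
             ub:memberOf <http://www.Department0.University0.edu> .
  }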
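
An SP2Bench-style bibliographic sketch combining UNION (journal articles vs. conference papers) with OPTIONAL, FILTER, ORDER BY and LIMIT. The dc: prefix is the standard Dublin Core elements namespace; the ex: class and property names are placeholders rather than SP2Bench's actual vocabulary:

  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  PREFIX ex: <http://example.org/bibliography/>
  SELECT ?doc ?title ?year ?abstract
  WHERE {
    { ?doc a ex:Article . } UNION { ?doc a ex:InProceedings . }   # two publication types
    ?doc dc:title ?title ;
         ex:year ?year .
    OPTIONAL { ?doc ex:abstract ?abstract . }   # abstracts may be missing
    FILTER ( ?year >= 2005 )
  }
  ORDER BY ?year
  LIMIT 20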

Test Frameworks and Tools

The following tools (listed in alphabetical order) will be evaluated for use in Wikidata backend testing:

Background References