User:AndreaWest/RDF SPARQL Technologies

From Wikitech
Jump to navigation Jump to search

There are some excellent articles describing the underlying technologies of various RDF stores and SPARQL engines. This page presents details and thoughts triggered from a review of the following papers:

Storage Structure

  • Relational-table-based, graph-based, key-value-based, tensor-based, binary storage, ...
    • Relational tables can be:
      • Statement tables, aka triple tables (one or more tables that are lists of the RDF triples or quads, indexed in different orders of graph/subject/predicate/object)
        • Save on joins between tables, but incur expensive self-joins
        • So, queries that contain multiple triple patterns are slow to execute
      • Property tables (one or more tables that group commonly accessed nodes/items, with similar properties, where the items identify the row and each property is in a separately identified column)
        • Also typically need a statement table for nodes/items that are not grouped, and need a way to deal with multi-valued properties
        • Efficient for star joins (subject-subject) but not path joins (subject-object, unless specifically indexed), sink joins (object-object) and hybrid joins
      • Vertically partitioned by predicate (tables for each predicate with 2 columns for the subject and object respectively)
  • Distributed architectures may be backed by a NoSQL solution, Hadoop/MapReduce/Spark structure, ...

Indexing

  • Normalization of IRIs and long strings
    • Mapping to integer identifiers/hashes to save space, using dictionary tables for lookups
  • Triple/quad indexes
    • Can cover all possible query patterns (also known as triple-permutation indexing)
        • spo, sop, pso, pos, ops, osp
        • Note that indexes improve query performance but come at the cost of maintenance and storage requirements, which makes them problematic for a data store that has frequent updates (such as Wikidata)
    • A quad index adds a graph element, and therefore increases the number of query patterns to 16
    • Note that for vertically partitioned tables, only the 5 patterns where the predicate is constant are reasonable
  • Entity index
    • Optimize for queries focused on a single subject
      • For example, "star joins" are query patterns with a single, common element that is typically the subject (but could also be the object)
      • Note that property tables can efficiently handle star joins if the appropriate table (for the element) can be found and the relevant predicate/property columns are indexed
  • Property index
    • Modifies an entity index by also indexing by/adding properties/predicates
  • Path index
    • Considers successive subject -> object paths of a query pattern such as "?w predicate1 ?x . ?x predicate2 ?y . ?y predicate3 ?z ."
      • So the path is w-predicate1-x-predicate2-y-predicate3-z
    • Materializes subject-object joins, but could also utilize subject-subject and object-object joins
    • A brute force approach converts the path to a string and uses B+trees, but there are likely a huge number of possible paths (which then requires prioritization)
    • Other path indexes are possible
  • Structural index
    • Based on distance measures, quotient graphs, ... that identify the query-relevant/high-value regions of the graph

Join Processing

  • Pairwise, multiway, worst-case optimal (WCO), GPU-based, ... joins
  • Can involve:
    • Caching, pipelining, swapping and prefetching
    • Considering custom statistics based on the triple structure and counts, or runtime sampling of the data
      • Statistics may speed query but come at the cost of computing and maintaining them
    • Syntactic heuristics for join reordering (such as literals being more selective than IRIs)

SPARQL Query Processing

  • Support for filter expressions, optionals, path queries, ...
    • Use of OPTIONALs can "lead to jumps in computational complexity" and contribute to performance issues
    • Ideally, could recommend that they not be used or processed on a specific/targeted server
  • Categorized into simple star joins, chained joins/navigational graph patterns and complex queries
    • Star joins are based on retrieving one or more predicate values for the same subject
    • Chained joins are based on subject-object and object-subject joins (i.e., an element is a subject in one pattern and an object in another)
    • Navigational graph patterns are a kind of chained join that specifically uses property paths (paths of arbitrary length)
    • Complex queries are a composite of the above
  • Utilizes work-load aware database design to decrease inter-/intra-node and partition communication
    • Lack of awareness can lead to incorrect data fragmentation and localization (in separate graphs, stores or partitions), which increase graph query sizes, cross db/partition communication, and/or large intermediate query result sets

Centralized (Single-Node) vs Distributed

  • What is the criteria for partitioning?
    • Deterministic function(s) with hash-based keys are the most popular
    • Types of partitioning:
      • Horizontal / sharding / row-wise distribution of data
      • Vertical / column-wise distribution by subject, predicate and/or object
        • When partitioning, examine the number of elements per partitioning key and attempt to group for maximized load balancing, OR utilize research/domain insight to determine what elements are queried together
      • Range - distribution by range values of the partitioning key
      • Workload
      • Graph-based which needs to solve the k-way partitioning problem
        • Recursive-Bisection, TCV-Min, Min-Edgecut, ...
      • Hash of subject or predicate IRIs (which is likely to cause partition imbalance)
    • Could utilize multiple partition keys to support different index permutations, triple patterns, ...
    • Need to consider the overall data AND graph structures such that triples storage across nodes and queries across nodes are minimized
    • Goal is to fragment the data and allocate it to processing nodes to minimize communication costs for both updates and queries
      • Requires understanding the access patterns so that the characteristics of the workload can be understood
      • This paper (second one in the top list) partitions by frequent/infrequent properties, where triples with infrequent properties are stored separately (in a "cold graph")
  • How does replication come into play with partitioning?
    • Can frequently used data be separately partitioned and then better replicated/load-balanced to improve performance?
  • Is partitioning over GPUs possible/desirable?

Thoughts on RDF in NoSQL Databases

  • "... storing RDF on NoSQL databases is proven to be complicated ..."
    • "It includes problems such as RDF-to-NoSQL mapping, fragmenting the RDF graph on smaller parts, partitioning the triples and graph fragments among multiple nodes or databases, intelligently indexing triples, and efficiently caching triples to maintain scalability and satisfactory performance"
      • Regarding the RDF-to-NoSQL mapping, it can "potentially generate many tables for storing a single triple ... [and] lead to delays during ingestion operations"
      • "With respect to querying, a document database can be used for storing triples that are accessed by star shaped queries (the most common RDF query type), specially using a hierarchical mapping from RDF to document ... vertical partitioning, [the most common partitioning scheme,] can be a bottleneck when multiple calls are required to solve a join, so it can be used only for a limited number of calls."
      • "Indexing all the permutations of a triple can lead to a big index, so combining the mapping/querying solution with the most straight index permutations (S-PO and O-SP) seems to be the right choice. Partitioning is an important topic also for NoSQL databases. However, to delegate this operation is not a good design choice because NoSQL databases have no metadata information about the RDF graph, which can lead to many cross server joins. Key/value databases are the right choice for caching due to [their] low latency even in a multiple triple permutation."
    • From Persistence of RDF Data into NoSQL: A Survey and a Reference Architecture (link also provided in the bulleted list of articles)
    • Above paper "[highlights] the need [to study] even more complex tasks such as intermediary fragmentation, workload-awareness in different components of the RDF data management solution (e.g., indexer, partitioner, mapper), and more intensive in-memory databases usage"
    • Also, not addressed in the paper (and needing further study) are "triple compression, query plan and optimization, SPARQL translation and join processing in [the] NoSQL database context"