User:AndreaWest/WDQS Blazegraph Analysis

From Wikitech

The following are my learnings and thoughts related to replacing Blazegraph as the RDF store/query engine behind the Wikidata Query Service (WDQS).

Phabricator Ticket: Epic, T206560

Results

Four potential candidates are short-listed for replacing Blazegraph (listed in alphabetical order):

  • Apache Jena with the Fuseki SPARQL Server component
  • QLever
  • RDF4J V4
  • Virtuoso Open-Source

Problem Description

The Wikidata Query Service is part of the overall data access strategy for Wikidata. Currently, the service is hosted on two private and two internal load-balanced clusters, where each server in a cluster runs an (open-source) Blazegraph image and provides a SPARQL endpoint for query access. The Blazegraph infrastructure presents a problem going forward since:

  • There is no short- or long-term maintenance/support strategy, given the acquisition of the Blazegraph personnel by Amazon (for their Neptune product)
  • Blazegraph is a single-server implementation (with replication for availability) that is suffering performance problems at the current database size of ~13B triples
  • Query processing is based on synchronous responses and several-year-old technology, and is experiencing timeouts

Requirements

There are many moving parts and possible alternatives (and combinations of alternatives) for hosting the triple storage and query engine of the Wikidata Query Service. Here are some initial observations on the requirements for an acceptable solution.

A solution MUST:

  • Support database sizes of 25.6B+ triples
    • This number is an estimate based on a linear growth rate and 5 years of growth (using the worst-case slope, from 23 Nov 2021 to 07 Dec 2021)
  • Support SPARQL 1.1 and the "standard" output formats (JSON, CSV/TSV, XML)
  • Support SPARQL Federated Query
    • Current list of federated endpoints and SERVICE extension points
    • Federated query also allows for the possibility of separating Wikidata across multiple DBs
    • Note that the SERVICES should be investigated to determine their performance as SPARQL functions
  • Support both read and write in high density
    • Possibility for spikes of ~1000 added/deleted triples/sec
    • Average update range, 50-200 triples/sec
  • Support monitoring the server and databases via instrumentation and reported metrics
    • Ideally metrics and functionality are accessible via CLI for site reliability and scripting
    • Specific metrics and functionality TBD
  • Provide ability to obtain and tune query plans
  • Utilize indexing schemes that align/can be aligned with the query scenarios of Wikidata
    • Possible to index one or more of: subjects/objects (items), properties/paths, query/join patterns, ...
  • Support all SPARQL query functions and allow for the addition of custom functions
  • Be licensed as open source
    • Be well-commented and actively maintained with a community of users
  • Allow query access without requiring user authentication
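The 25.6B-triple sizing target above follows from a simple linear extrapolation. The sketch below is illustrative only: the ~13B current size comes from the problem description, while the weekly slope is an assumed figure chosen so the five-year projection reproduces the stated 25.6B estimate (the actual worst-case slope was measured over the 23 Nov to 07 Dec 2021 window).

```python
# Linear extrapolation of database size over five years.
# NOTE: both constants are illustrative assumptions, not measured values.
CURRENT_TRIPLES = 13.0e9   # approximate current size (see problem description)
WEEKLY_GROWTH = 48.5e6     # assumed worst-case slope, triples per week
YEARS = 5

projected = CURRENT_TRIPLES + WEEKLY_GROWTH * 52 * YEARS
print(f"Projected size after {YEARS} years: {projected / 1e9:.1f}B triples")
# → Projected size after 5 years: 25.6B triples
```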

A solution SHOULD:

  • Reduce (or at minimum, maintain) query time and timeouts and improve throughput
  • Allow database reload to occur in days (not weeks)
    • Initially, data is loaded from dumps and then updated by the RDF stream updater
    • In case of data drift (there are issues in the update process), the data is re-initialized from dumps a few times per year
  • Support named/stored queries for re-execution and re-use
    • This would be useful from a compatibility perspective with Blazegraph, although it is not part of the SPARQL standard
  • Support geospatial query/GeoSPARQL
  • Have minimal impact on the capabilities of current users and their queries/implementations
    • Need to understand what changes to existing queries would be required
    • This is especially relevant for non-standard Blazegraph SPARQL syntax, such as query hints, named subqueries, ...
    • How onerous do the queries (and their debug) become with changes such as adding federation or splitting services across different interfaces?
  • Be robust in the face of an internet-driven workload - e.g.:
    • Understand thread pools, memory, ... management
    • Understand how a problematic query affects a server’s overall performance
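To make the compatibility concern above concrete, here is a sketch of the kind of non-standard Blazegraph syntax that existing WDQS queries may rely on: named subqueries (WITH ... AS %name / INCLUDE) and query hints, neither of which is part of SPARQL 1.1. This is an illustrative example only (prefixes omitted; wd:/wdt:/hint: are the namespaces predefined on WDQS):

```sparql
SELECT ?item ?label
WITH {
  # Blazegraph named subquery: evaluated once, then reusable via INCLUDE
  SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 100
} AS %people
WHERE {
  hint:Query hint:optimizer "None" .   # Blazegraph-specific query hint
  INCLUDE %people .
  ?item rdfs:label ?label .
}
```

Any replacement engine without these extensions would force such queries to be rewritten (e.g., as standard subqueries), which is the migration cost this requirement is trying to bound.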

A solution MAY:

  • Provide capabilities for distributed computing/processing (versus single-node)
  • Support paged output/query continuation
  • Support other query languages such as GraphQL, Gremlin, ... for improved programmatic and human ease of use
  • Support user roles and/or authentication for throttling
    • This could also be provided in the UI or by load-balanced pre-processing
  • Allow inference and reasoning, including the definition of user rules
  • Provide DevOps and query interface for browser-based maintenance and debug
  • Support RDF*/SPARQL* for annotating/querying statements
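The RDF*/SPARQL* item above refers to the RDF-star proposal, which allows a triple to be quoted and annotated directly. A minimal sketch in Turtle-star syntax, using hypothetical example IRIs:

```sparql
@prefix ex: <http://example.org/> .

ex:douglasAdams ex:educatedAt ex:StJohnsCollege .
# Annotate the statement itself via a quoted triple:
<< ex:douglasAdams ex:educatedAt ex:StJohnsCollege >> ex:startDate "1971" .
```

For Wikidata this is relevant because qualifiers and references annotate statements; today they are modeled with reified statement nodes, and RDF-star support could offer a more compact alternative.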


Unnecessary (?):

  • Provide data integrity/validation support
    • Validation is performed at the time of Wikidata edit
    • However, this functionality could aid in reporting constraint violations
  • Support ACID transactions for updates
    • Writes are done from our RDF stream updater, "accidentally using transactions to write batches of data" (for performance reasons and not following any kind of semantic coherence)
    • Updates align with edits on Wikidata which are done atomically (one property at a time)
    • Transaction boundaries do not have to be conversational with a client

Candidate Alternatives

This section will be expanded with more details. For now, this is a simple list of ideas that will be evaluated and potentially bundled together into a final solution:

  • Move off Blazegraph to a different backend store (candidates are listed in the following section)
  • Split the Wikidata "knowledge graph" into two or more databases AND/OR two or more named graphs (for example, a database/graph holding the RDF for scholarly articles vs another holding all the remaining data, or hosting the truthy data in a separate db/graph)
    • For separate databases, the solution would require either post-query joining of results or federation
  • Add support for (potentially long-running) queued queries with asynchronous reporting of results
  • Execute RDF data cleanup to remove redundancies and segregate "unused" items (items that are not referenced as the object of any triple)
  • Improve/tune RDF indexes and their use in queries to increase performance
    • There are various types of indices based on triple patterns, paths, items, properties, joins (for frequently encountered joins), ...
  • Utilize user roles (different roles automatically execute on different dbs with different loads/performance) and/or authentication for throttling
  • Support saving of both queries and results with the ability to re-execute a query if the results are considered "stale"
  • Better incorporate (via federation?) ElasticSearch with query AND/OR move users to other services such as Wikidata ElasticSearch
    • How to do this in an easy to use/explain/understand fashion?
  • Establish cloud-deployable containers for the different cloud environments (AWS, Azure, ...) to increase the feasibility of local deployments
  • Move custom SERVICES (which are federated queries) to SPARQL functions
    • Note that SERVICEs/federated queries are NOT a general solution to distributed query evaluation; they involve HTTP operations, so the context of the SERVICE, which elements are bound, etc. matter a great deal
    • The current services are documented at https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Extensions and include label, geospatial, date and other functionality
    • Another service that may be (very occasionally) used is bd:slice, which allows a user to extract part of the current solution set in a query in a repeatable way (valuable for stepping through the solution set of a very large query chunk-by-chunk in a systematic way, with a stable repeated ordering, without having to specifically ORDER that large solution set which may be unacceptably expensive)
  • Restrict SPARQL capabilities (perhaps on a subset of servers?) to disallow features that might be dangerous and/or expensive
    • For example, OPTIONAL clauses have performance impacts
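As an illustration of the SERVICE extensions discussed above, the label service is the most heavily used: it fills in human-readable labels without an explicit rdfs:label pattern. This canonical example (find house cats with English labels) is from the WDQS documentation linked above; the wd:/wdt:/wikibase:/bd: prefixes are predefined on the WDQS endpoint:

```sparql
# The wikibase:label SERVICE is a Blazegraph-specific extension,
# not standard SPARQL; moving it to a SPARQL function would mean
# rewriting queries like this one.
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .   # instance of: house cat
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```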


Alternative Data Stores

Note that these stores and query engines are prioritized by number of stars on GitHub, and many will be further investigated. Some background implementation details and statistics are also provided. An interesting (though incomplete) resource is the W3C Large Triple Store web page.

Open-source (note that last updated times are based on examining the GitHub pages on 2 Mar 2022):

  • Apache Jena (last updated 1 Mar 2022, 808 stars, 78 contributors) and Jena Fuseki SPARQL server (last updated 3 days ago)
    • Written in Java
    • ~150 open issues (issues in JIRA)
    • 16.7B triples db size per the W3C web page
    • Replicated for high availability using RDF-Delta (last updated 6 Feb 2022, 44 stars, 10 contributors)
      • Is this needed given that updates are coordinated by the stream updater?
  • Virtuoso Open-Source (stable/7 branch last updated 22 Jun 2021, develop/7 updated 17 Feb 2022, 732 stars, 17 contributors)
    • Note that the paid version has much more functionality and scalability than the open-source version
    • Written in C and XSLT
    • 555 open issues (377 closed)
    • 94.2B triples db size per the W3C web page
  • RDF4J (last updated 27 Feb 2022, 274 stars)
    • Written in Java
    • 253 open issues (1640 closed)
    • "RDF4J Native Store is ... currently aimed at medium-sized datasets in the order of 100 million triples"
    • RDF4J 4.0 development branch moves to LMDB backend for V4 release (date TBD)
  • QLever (last updated 1 Mar 2022, 91 stars, 15 contributors)


Open-source but early in development/for research:

  • gStore (last updated 29 Jan 2022, 469 stars, 25 contributors)
    • Written in C++
    • 5 open issues (65 closed)
    • Still in active development - Would require completion of SPARQL 1.1 support
      • No property paths, limited FILTER and ORDER BY support (ORDER BY is single variable), and some characters (such as tags, '<' and '>') not allowed
      • Issues section on GitHub refers to a website for questions, but with the addendum "currently, only for Chinese"
      • Code compiles and indexes can be built, but binaries for running the gstore server crash with a segmentation fault and the Dockerfile does not work due to various compilation errors (see https://github.com/pkumod/gStore/issues/116)
    • A very old fork of gStore (6+ years ago) was modified to support being distributed; it is no longer viable
    • gStore-D2 has also been referenced but its code base is not available
  • OxiGraph (last updated 3 Feb 2022, 441 stars, 11 contributors)
    • Database based on the RocksDB key-value store, written in Rust
    • 23 open issues (46 closed)
    • Still early in development (Version 0.3)
    • Targeted as embedded store; Identified as "hobby project"
  • Wukong (last updated 10 Dec 2019, 161 stars, 13 contributors)
    • Written in C++
    • 2 open issues (15 closed)
    • Can be distributed and also utilizes GPUs
    • Early in development (Version 0.2)
  • MillenniumDB (last updated 12 Dec 2021, 16 stars, 3 contributors)
    • Written in C++ and developed by the Millennium Institute for Foundational Research on Data (IMFD)
    • 10 open issues (0 closed)
    • Very early in development, and missing ability to modify the database, OPTIONAL functionality, support for dates and lists, FILTER functionality and more

Open-source but no recent updates:

  • Apache Rya (last updated 16 Dec 2020, 102 stars, 27 contributors)
    • Built on top of Accumulo, and implemented as an extension to RDF4J, written in Java
    • ~200 open issues (issues in JIRA)
    • Can be distributed
    • Note that there is no recent activity for this project and JIRA issues appear to be languishing

Open-source but no SPARQL support or problematic SPARQL infrastructure:

  • TerminusDB (last updated today, 1700+ stars, 21 contributors)
    • Toolkit for linking documents via an API, written in Prolog
    • No SPARQL support
  • LevelGraph (last updated 16 Aug 2021, 1400 stars, 25 contributors)
    • LevelDB-backed RDF graph database for Node.js, written in Javascript
    • No SPARQL support
  • CM-Well (last updated 30 Sept 2021, 167 stars, 20 contributors)
    • Written in Scala, accessed by REST APIs; Developed by Thomson Reuters & Refinitiv
    • Non-native support for SPARQL
      • Requires separate load of data to a Jena instance (non-standard and non-performant for Wikidata Query)
      • No support for ASK, DESCRIBE
  • quadstore (last updated 27 Jan 2022, 123 stars, 5 contributors)
    • LevelDB-backed RDF graph database for Node.js, written in TypeScript and JavaScript
    • Designed as a client-side store with local query support
    • No SPARQL endpoint or support for federated query
  • SANSA-Stack (last updated 26 Jan 2022, 118 stars, 37 contributors)
    • Requires Spark 3.x.x with Scala 2.12 setup, written in Java and Scala
    • Can be distributed
    • Incomplete SPARQL 1.1 support
      • No support for federated query, property paths, or functions such as EXISTS/NOT EXISTS and IN/NOT IN
      • SPARQL support based on ontop to translate to SQL
  • Atomic Data Rust (last updated hours ago, 39 stars, 1 contributor)
    • Based on "Atomic Data"
    • No SPARQL support
  • DataCommons Mixer (last updated 6 Feb 2022, 8 stars)
    • Designed for GCP (Google Cloud Platform) and GKE (Kubernetes)
    • SPARQL support is limited to the following (simple) query structure:
      • P (Prologue: prefixes), S (Select: a variable name or DISTINCT), W (Where: an array of triples of the form subject, predicate, array of objects), O (Orderby), and L (Limit)

Open-source but unlikely to scale to billions of triples:

  • Corese (last updated 1 March 2022, 46 stars, 9 contributors)
    • Written in Java by National Institute for Research in Digital Science and Technology (INRIA)
    • 25 open issues (36 closed)
    • Also supports SHACL
    • Expected to scale to 50M-100M triples per server
    • CeCILL-C License
  • Parliament (last updated 26 days ago, 35 stars)
    • Research implementation, written in Java, developed by Raytheon BBN
    • 3 open issues (25 closed)
  • LUPOSDATE (last updated 17 months ago, 18 stars)
    • Academic implementation developed by IFIS at the University of Lübeck, written in Java
    • No issues ever reported

Open-source but no backing store:

  • ontop (last updated 27 Jan 2022, 460 stars)
    • Also incomplete SPARQL support


Proprietary:

  • AllegroGraph, AnzoGraph, GraphDB, MarkLogic, Oracle, Neptune, RDFox, Stardog, TriplyDB, ...
  • Proprietary and with no support for SPARQL: DGraph, Neo4J

Dead, no development within last 2 years or more:

  • 4Store, Redland/RedStore, Mulgara, Jena-HBase, HBase-RDF, H2RDF, CumulusRDF, AdaptRDF, CliqueSquare, RDFDB, Akutan, ...
  • Halyard (last updated 5 Dec 2019, 100 stars, 5 contributors)
    • Developed by Merck, written in Java
    • 21 open issues (40 closed)
    • Can be distributed
    • Project is dead with no response to issues
    • Extension for RDF* forked at https://github.com/pulquero/Halyard (2 stars, 1 contributor)

Other Questions

  • How is the query UI (as seen on https://query.wikidata.org/) impacted by a move off Blazegraph?
    • Note that the UI is a standalone static JavaScript application, NOT hosted on the Blazegraph server
    • It MAY have accidental strong coupling with Blazegraph (TBD)
  • What are the current features and capabilities of Blazegraph that are non-standard (such as named queries)? How might they be supported?
  • What are the specific trade-offs (between local/distributed processing, compression, db indexes, query optimizations, ...) that work best for Wikidata (not necessarily in general)?