Talk:Wikidata query service/ScalingStrategy

From Wikitech

When evaluating the alternative solution, we also need to assess the delta between the existing functionality (in particular the custom services) and the alternative, in terms of:

  • Complexity of covering the existing delta (development time) - e.g. label service, geo-search, MWAPI, custom functions, etc.
  • Complexity of developing new customizations (is it easier or harder to customize?)
  • Complexity of migration for the users, if any (changes in queries, changes in bot procedures, etc.)

My 2 cents on the WDQS scaling issue

I have found this GitHub page on graph technologies useful. It lists a small number of open source graph databases, none of them being RDF triple stores (more on that later). Among those, only 2 advertise horizontal scalability:

  • JanusGraph (a fork of Titan, hosted by the Linux Foundation; Titan's original developer, Aurelius, was acquired by DataStax) - Written in Java, using scalable data storage (Cassandra/HBase) and indexing engines (Elasticsearch/Solr), queryable using Gremlin (Apache TinkerPop graph stack).
  • Dgraph - Written in Go, self-contained, queryable in GraphQL+-, an augmented version of GraphQL.

These findings make me think that keeping SPARQL as the main query language for WDQS will be a tough challenge.

Paving that road in my mind, I can't think of any other way than transforming the Wikidata RDF representation into one more suitable for property-graph engines (they actually differ a lot, and query performance will depend on how the data is structured), and writing a transpiler between SPARQL and the query language of the chosen technology. There has already been some effort on SPARQL transpilation (see this GitHub project), but more analysis is needed on the scope of the SPARQL language covered by the transpilation, and on how that coverage relates to the queries actually sent to WDQS. --Joal (talk) 21:14, 20 February 2019 (UTC)
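
As a rough illustration of the coverage analysis suggested above, here is a minimal Python sketch. It assumes a hypothetical list of keywords that a transpiler supports (`SUPPORTED`) and a hypothetical list of tracked keywords (`TRACKED`); neither comes from any real project. The keyword matching is deliberately crude (it ignores string literals, IRIs and property paths), so a real analysis would need a proper SPARQL parser.

```python
import re

# Hypothetical: keywords the transpiler is assumed to handle.
SUPPORTED = {"SELECT", "WHERE", "FILTER", "OPTIONAL", "LIMIT"}
# Hypothetical: all keywords we look for in a query.
TRACKED = SUPPORTED | {"UNION", "SERVICE", "MINUS", "GROUP", "BIND"}

KEYWORD_RE = re.compile(
    r"\b(" + "|".join(sorted(TRACKED)) + r")\b", re.IGNORECASE
)


def features(query):
    """Return the set of tracked keywords appearing in the query text."""
    return {match.upper() for match in KEYWORD_RE.findall(query)}


def coverage(queries):
    """Fraction of queries that use only supported keywords."""
    if not queries:
        return 0.0
    covered = sum(1 for q in queries if features(q) <= SUPPORTED)
    return covered / len(queries)
```

Running `coverage` over a sample of logged WDQS queries would give a first estimate of how many of them a given transpiler could handle, under the assumptions stated above.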

I personally would consider a non-SPARQL engine a non-starter for the following reasons:
  1. SPARQL is a de-facto data exchange standard in the Linked Data world. When you talk about SPARQL, everybody knows what you're talking about, and there are many resources and libraries to support it, and a standards body behind it. Building on it means much larger appeal than any niche solution.
  2. We have a large set of queries, bots, tools, learning materials, etc. assembled and targeted at SPARQL users. Re-creating all this for a different toolset would be a huge effort.
  3. TinkerPop/Gremlin, as far as I remember, is a programming language. Running a public endpoint that exposes an open programming language, giving every user the full power of Java, is insanely hard. Also, whatever problems people have mastering SPARQL, I am sure Gremlin would not be easier for them, quite the contrary.
  4. GraphQL is nice for data retrieval/API, but I am not sure it can efficiently handle complex queries or is built for them. I am also not sure it's a good fit for the kind of data we have in Wikidata - an (almost) typeless, schemaless dataset.
If the solution has a SPARQL gateway, even though its base language is not SPARQL, I think it may still be acceptable. See though my notes on evaluating the delta on the main page.

Smalyshev (talk) 07:33, 26 February 2019 (UTC)

Comments on Quality Constraints usage of internal cluster

Currently QC runs checks, which may or may not involve hitting the query service, for 25% of all edits on Wikidata using the job queue. We want to boost that to 100% of all edits, but we have stopped at 25% for now until we get persistent storage for the results and have also looked at possibly reducing the number of queries that we send to the query service. ·addshore· talk to me! 09:33, 6 March 2019 (UTC)

Read load comments

I think one thing to do about read load is to shorten the "interactive" query time limit, and have some sort of queue for queries that take longer than that short interactive limit. This is discussed in:

·addshore· talk to me! 09:33, 6 March 2019 (UTC)
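
The idea above of a short interactive limit with a queue for longer queries could look roughly like the following Python sketch. The names `INTERACTIVE_LIMIT_S`, `slow_queue`, `submit` and the `run_query` callable are all hypothetical, and the limit value is arbitrary. Note that a thread-based timeout does not stop the query that is already running; a real query engine would need its own cancellation mechanism.

```python
import queue
from concurrent.futures import ThreadPoolExecutor, TimeoutError

INTERACTIVE_LIMIT_S = 0.05  # hypothetical short interactive limit, in seconds
slow_queue = queue.Queue()  # queries deferred for later batch processing
_pool = ThreadPoolExecutor(max_workers=4)


def submit(query, run_query, limit=INTERACTIVE_LIMIT_S):
    """Try to answer within the interactive limit, otherwise defer the query.

    run_query is a caller-supplied function (hypothetical) that executes
    the query and returns its result.
    """
    future = _pool.submit(run_query, query)
    try:
        return ("done", future.result(timeout=limit))
    except TimeoutError:
        # cancel() only prevents a not-yet-started task from running;
        # it cannot interrupt a query that is already executing.
        future.cancel()
        slow_queue.put(query)
        return ("queued", None)
```

A separate worker would then drain `slow_queue` with a longer time limit and store the results for the user to fetch later.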

Disk usage increases

Another thing on the horizon, other than the regular, semi-predictable increase in Wikidata data, would be the quality constraint check results for all entities on Wikidata. I'm not sure how much disk usage this would add, but I imagine it would be a non-zero amount. ·addshore· talk to me! 09:33, 6 March 2019 (UTC)

New Grant Project

You can find it at https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS and the draft pre-print describing the solution at https://en.wikiversity.org/wiki/WikiJournal_Preprints/Generic_Tuple_Store#Grant

Any sensible ways to split horizontally?

If the RDF could be split up somehow then scaling becomes easier, right? Blazegraph does support federated querying, so if the split could be done in such a way that incoming SPARQL can be translated into the appropriate federated form, and with the hope that most queries only need to go to one (or two?) servers, wouldn't that potentially work? Of course there would be a need for some custom code...

For example, suppose we split the items data based on P31 value, so any RDF with a subject that is P31/P279* Q386724 ("works", or a large subgroup like scholarly articles) goes on one server, species go on another, humans on another, everything else somewhere else. Items linked as RDF objects could have their statements duplicated to where the associated subject items are, so that most queries that don't go too deep would not have to cross servers.

Would that kind of approach actually help? Maybe some analysis of query patterns is needed? ArthurPSmith (talk) 19:37, 8 August 2019 (UTC)
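
To make the P31-based split above more concrete, here is a minimal Python sketch of the routing and of the federated form of a query. The shard URLs, the `CLASS_TO_SHARD` mapping, its priority order and both function names are assumptions for illustration only. The class set of each item (reached via P31/P279*) is assumed to be precomputed elsewhere. Q16521 (taxon) stands in for "species" here.

```python
# Hypothetical shard endpoints.
SHARDS = {
    "works": "https://shard-works.example/sparql",
    "humans": "https://shard-humans.example/sparql",
    "species": "https://shard-species.example/sparql",
    "other": "https://shard-other.example/sparql",
}

# Hypothetical mapping from a class QID to a shard, in priority order.
CLASS_TO_SHARD = [
    ("Q386724", "works"),   # work
    ("Q5", "humans"),       # human
    ("Q16521", "species"),  # taxon, standing in for species
]


def shard_for(classes):
    """Pick the shard for an item given its precomputed set of class QIDs."""
    for qid, shard in CLASS_TO_SHARD:
        if qid in classes:
            return shard
    return "other"


def federated_query(variables, patterns_by_shard):
    """Wrap each shard's graph pattern in a SPARQL 1.1 SERVICE clause."""
    blocks = [
        f"  SERVICE <{SHARDS[shard]}> {{ {pattern} }}"
        for shard, pattern in patterns_by_shard.items()
    ]
    return (
        "SELECT " + " ".join(variables) + " WHERE {\n"
        + "\n".join(blocks)
        + "\n}"
    )
```

For example, `federated_query(["?paper", "?author"], {"works": "?paper wdt:P50 ?author .", "humans": "?author wdt:P19 ?place ."})` produces one query with two SERVICE blocks. Whether most real queries would touch only one or two shards is exactly what the query-pattern analysis would have to show.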