Current problem

The current architecture does not allow us to scale the write load horizontally:

Every update has to be performed on every single node
All the data have to be stored on every node

This leads to 2 problems:

~~Problem 1: lag. Well known and most pressing issue at the moment.~~ (currently solved by the streaming updater)
Problem 2: capacity. How long can we sustain the current increase in number of triples on the current hardware

+1B triples in three months sept -> nov 2019

Possible solutions

Vertical scaling [lag,capacity]

This is the de-facto solution used at the moment; we have no other way to scale today other than increasing the capacity (CPU speed and disk size) of the current hardware. This is obviously not a long-term solution but we have room (compared to mysql master dbs) if the budget allows.

The current hardware is:

4 clusters of 3 machines each (public and private endpoints for each datacenter).

For each server (latest order of machines, older machines might slightly less powerful):

2x Intel Xeon Silver 4215 2.5G, 8C/16T 11M cache, DDR4-2400
128G RAM
8x 960GB SATA, software RAID10
1G NIC

Optimizations [lag]

There is room for optimization in the current process which might help to reduce the lag.

As of today the current update strategy is to reconcile the triple store using the latest data available from the wikibase Special:EntityData endpoint. This process was designed to be robust to the loss of events as all the entity triples are sent back to the triple store.

Increase parallelism

Tracked in task T238045 Small optimization where the goal is to parallelize:

the preparation of the update (fetch some state from the triple store and wikibase)
the update sent to blazegraph

Status: deployed, small impact

Rework the update process

The goal here is to optimize the update process to only send the triple store the required triples by doing a diff between two revisions.

For instance if a revision introduces a new site link we should only communicate to the store the triples required to add this site link. This optimization requires re-working the current updater process by decoupling most of its components.

Status: deployed, big impact: phab:T244590

Backpressure propagation [lag]

Tracked in task T221774.

By reducing the throughput of edit events on wikidata we somehow control the lag on WDQS. This makes the maxlag param (a way to inform bots to throttle their edits) dependent on the lag of the WDQS cluster.

Status: deployed, see impact

Horizontal scaling via sharding [lag,capacity]

Solutions whose goal is to partition the data in a way that a single edit event is sent to a subset of the nodes, not all of them.

Blazegraph does not allow us to do this automatically, although very few triple/graph stores claim to provide such functionality other than the ones based on spark/hdfs or other backends of this kind.

Find a storage solution that can do sharding natively

Most solutions that support real-time updates do not offer graph partitioning.

Neo4j: no, provide multi-cluster facility but sharding happens manually with multiple databases
Dgraph: yes (vertically using predicates)
Garlik 4store (though stale project, but available under GPL-3) seems to support multiple backend nodes.
Virtuoso: no

TODO: investigate…

Sharding does not seem to be a solved problem, even if Dgraph claims to support sharding we should be cautious and understand its limitations.

Find ways/criteria to split the data into multiple distinct groups

These sections involve “product-y” questions that the search team alone may not be able to answer.

Answering these questions depends on the availability of the data in hadoop to run large-scale analysis.

Sparql queries are available in hadoop Done
Import a dump of the triples into hdfs Done

Analysis of the graph structure:

Split direct vs fully reified statements version of the graph

We currently provide the wikidata claims one way using the ‘direct’ namespaces (wdt/wdtn) and another way with full reification.

The truthy (direct) graph is much smaller. By analyzing the query patterns we could determine how often the fully reified graph is required.

TODO: count the number of triples in truthy graph

We could then answer questions like:

Are all/most of the use cases where the lag is expected to be low covered by the truthy graph?
Would all/most of the use cases where full reification is needed be acceptable if run on a less “up to date” and “slower” system?

It should be relatively easy to inspect the query logs.

While this might drastically reduce the size of the graph it is not really solving the problem: all of the “truthy” graph will be stored on a single node. If the truthy graph is not enough, would there be a smaller graph that solves all/most of the use cases?

Split data horizontally (shard per entity)

By analyzing the graph we could try to figure out if there are ways to split entities in multiple buckets that could point to more or less completely separate domains that are rarely queried together.

Analyzing the data and the queries might be helpful but it is unclear if this can actually be done (complexity of the analysis itself) nor if it is useful.

Questions to answer would be:

Would there be some domain in wikidata that could be isolated and where having a domain-query.wikidata.org service would make sense? The data related to the entities of this domain will be queried locally, joining with entities of another domain would have to happen via explicit federation.

Experimentation by splitting scholarly articles: In progress see phab:T337013