Wikidata query service/ScalingStrategy

Wikidata Query Service in its current implementation is limited in its scaling options. At the moment, each WDQS cluster is a group of independent servers, sharing nothing, with each server updated independently and each server holding a full copy of the data set. The objective of this document is to highlight the constraints of the current system and the issues/factors we need to keep in mind when thinking about scaling. Because of this single-server limitation, it is quite possible that the current system cannot be scaled as is and that we will need to revisit the whole stack.

Production Constraints

Storage capacity

The current implementation does not shard data. The size of the data set is therefore limited by the storage available on a single server. While it is possible to increase the storage on the current servers (more / larger SSDs), not being able to shard the data is a severe limitation.

Historically, storage needs have been increasing by about 5% every 3 months; if that trend continues, we will need a disk space upgrade by the end of 2019.
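
To make the trend concrete, here is a minimal sketch of that projection. The 5% per quarter figure comes from the observation above; the starting size and disk capacity are invented example numbers, not actual production figures.

  # Rough projection of WDQS disk usage, assuming ~5% growth per quarter.
  # The starting size and capacity below are made-up example numbers.

  def quarters_until_full(current_tb: float, capacity_tb: float,
                          growth_per_quarter: float = 0.05) -> int:
      """Number of quarters until usage exceeds the available capacity."""
      quarters = 0
      while current_tb <= capacity_tb:
          current_tb *= 1 + growth_per_quarter
          quarters += 1
      return quarters

  if __name__ == "__main__":
      # Hypothetical: 800 GB of Blazegraph journal on a 1 TB volume.
      print(quarters_until_full(0.8, 1.0))  # -> 5 quarters at 5% growth per quarter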

Read load

At the moment, an increase of read load can be addressed by adding servers to a cluster. The read load is spread evenly across all servers. The expectation is that read capacity increases linearly with the number of servers in the cluster.
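
As a back-of-the-envelope illustration of that linear model (the per-server throughput number is hypothetical):

  # Back-of-the-envelope read capacity model: because every server holds a
  # full copy of the data, aggregate read capacity is assumed to scale
  # linearly with the number of servers. The per-server figure is made up.
  import math

  def servers_needed(target_qps: float, per_server_qps: float) -> int:
      """Smallest cluster size whose aggregate capacity covers target_qps."""
      return math.ceil(target_qps / per_server_qps)

  print(servers_needed(target_qps=450, per_server_qps=100))  # -> 5 servers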

Anecdotal evidence (from reading the GC logs) shows that under heavy read load memory allocation climbs to > 2 GB/s; the bottleneck here is probably a combination of CPU and memory bandwidth.

Write load

As the writes are duplicated independently on each server, the write load is limited by the IO bandwidth available on a single server. While there are possibilities to increase that bandwidth (more / faster SSDs), not being able to spread the write load across multiple servers is a severe limitation.

It also implies that if the number of servers is increased to support additional read load, the overall write load is automatically increased as well (though the load per server stays the same).

Graphs

Test Constraints

Validating Evolution of the current Implementation

Obviously, any evolution of the current implementation needs to be tested before being deployed. The test environment has the following constraints:

  • dataset size roughly equivalent to the production dataset: Reducing the dataset while keeping it coherent and connected enough to be representative, so that queries remain meaningful, is a hard problem. Some minor reductions can be done (only importing English labels, skipping sitelinks, etc.; a simplified sketch of this kind of filtering follows this list).
  • write load equivalent to production: The data needs to be kept in sync so that the update process can be tested meaningfully. So full synchronization (possibly with slight restrictions as described above) has to run.
  • read load can be reduced: Testing is done in an ad-hoc fashion, against specific queries.
  • no clustering required: Since in the current implementation the servers in the same cluster share nothing, there is no need to test a multi node cluster.
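
A simplified sketch of the dataset reduction mentioned in the first point, filtering an N-Triples dump so that only English labels are kept and sitelinks are dropped. The filtering rules below are naive assumptions for illustration, not the actual munging done by the WDQS tooling.

  # Naive sketch: shrink an N-Triples dump by keeping only English labels
  # and dropping sitelink triples. This is simple line-based filtering,
  # not the real WDQS munge step.
  import sys

  LABEL_PREDICATE = "<http://www.w3.org/2000/01/rdf-schema#label>"
  SITELINK_MARKER = "<http://schema.org/about>"  # sitelinks point to entities via schema:about

  def keep(line: str) -> bool:
      if SITELINK_MARKER in line:
          return False                            # skip sitelinks entirely
      if LABEL_PREDICATE in line:
          return line.rstrip().endswith("@en .")  # keep only English labels
      return True                                 # keep all other triples

  if __name__ == "__main__":
      for line in sys.stdin:
          if keep(line):
              sys.stdout.write(line)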

Validating a different implementation

When evaluating an alternative solution to Blazegraph, we want to address the current scaling constraints. Thus, in addition to the constraints above, we need to test sharding capabilities on a multi-node cluster. A possible replacement: http://janusgraph.org/
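
As a first smoke test of such an alternative, something along these lines could check basic connectivity and that data is present. This assumes a JanusGraph instance reachable through a Gremlin Server; the host name is a placeholder.

  # Minimal connectivity smoke test against a JanusGraph/Gremlin Server
  # instance using the gremlinpython driver. The host is a placeholder;
  # this only verifies that the server answers a trivial traversal.
  from gremlin_python.driver.client import Client

  client = Client("ws://janusgraph-test.example.org:8182/gremlin", "g")
  try:
      vertex_count = client.submit("g.V().count()").all().result()[0]
      print(f"vertices loaded: {vertex_count}")
  finally:
      client.close()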

In order to evaluate the alternative solution, we would need to do a basic viability check, including:

  • Whether it is possible to load the current Wikidata data set, whether any of the data would need to be modified (e.g. we use a non-standard date format to support long date ranges and an extended geodata format to support non-Earth globes), and how long a full data load takes.
  • See how much space the full data set takes and whether these requirements are acceptable for us.
  • Identify a basic set of queries that we want to evaluate for performance, run these queries and validate that the performance is acceptable (a minimal timing harness is sketched after this list).
  • Check that the Updater can run against this solution (if it is not SPARQL-based, this may require modifying the Updater) and can keep up with the typical Wikidata update load.
  • Check that query performance is still acceptable while running the updater.
  • Capture a chunk of actual query traffic from the production endpoint, replay these queries (translated if the solution is not SPARQL-based) and check that the performance is acceptable.
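
A minimal sketch of the timing harness mentioned above, assuming the candidate store exposes a SPARQL-over-HTTP endpoint. The endpoint URL and queries are placeholders; captured production traffic could be replayed through the same loop.

  # Minimal SPARQL timing harness: run a fixed set of queries against a
  # candidate endpoint and report wall-clock latency. The endpoint URL and
  # queries are placeholders.
  import time
  import requests

  ENDPOINT = "http://wdqs-candidate.example.org/sparql"  # placeholder

  QUERIES = {
      "labels": "SELECT ?l WHERE { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?l } LIMIT 100",
      "spo":    "SELECT * WHERE { ?s ?p ?o } LIMIT 1000",
  }

  def run(query: str) -> float:
      start = time.monotonic()
      r = requests.get(ENDPOINT, params={"query": query},
                       headers={"Accept": "application/sparql-results+json"},
                       timeout=60)
      r.raise_for_status()
      return time.monotonic() - start

  for name, query in QUERIES.items():
      print(f"{name}: {run(query):.2f}s")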

After that, we would need to evaluate migration costs, including:

  • Complexity of covering existing functionality delta (development time) - e.g. label service, geo-search, MWAPI, custom functions, etc.
  • Relative complexity of developing new customization (is it easier or harder to customize)
  • Complexity of migration to the users, if any (changes in queries, changes in bot procedures, etc.)

An ideal testing environment for validation

  • Real hardware, or as close to it as we can get (unshared virtualization has proven not to distort the picture too much, but shared virtualization is too slow)
  • SSDs as storage
  • 128 GB memory minimum
  • CPU specs close to production
  • At least 2 distinct hosts for sharding/clustering - if virtualized, it may be better to have them on separate physical machines both to avoid contention for shared resources and to test the influence of network latency.

Private vs Public clusters

Wikidata Query Service started its life as a public SPARQL endpoint, allowing anyone to freely use the service to experiment with SPARQL queries on the Wikidata graph. This is by nature a hard problem in terms of scaling and stability, very similar to exposing a read-only SQL interface to the whole Internet. The load on this public endpoint is limited only by the resources available: whatever resources we throw at the problem will likely be consumed by our users if we put no limits on query time and frequency. The stability of the public endpoint is ensured by limiting users (throttling, memory limits, timeouts, ...). Limiting the resource consumption of anonymous users is tricky and error prone; at some point we might want to require some kind of authentication, similar to how the MySQL Labs replicas are free for anyone to use but do require authentication. An alternative solution would be a gradual scale of limits, from restrictive ones for unauthenticated users to generous ones for known users with whom we can work personally to ensure their queries behave well.
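
The "gradual scale of limits" could be as simple as a per-tier token bucket. A minimal sketch follows; the tier names and numbers are invented for illustration, not an actual policy.

  # Sketch of a graduated throttling policy: anonymous clients get tight
  # limits, known/trusted clients get progressively more generous ones.
  # Tier names and numbers are invented for illustration.
  import time

  LIMITS_QPM = {            # queries per minute per client, by tier
      "anonymous":     30,
      "authenticated": 120,
      "trusted":       600,
  }

  class Bucket:
      """Simple token bucket refilled at the tier's per-minute rate."""

      def __init__(self, qpm: int):
          self.qpm = qpm
          self.tokens = float(qpm)       # start with a full minute's budget
          self.last = time.monotonic()

      def allow(self) -> bool:
          now = time.monotonic()
          # refill at qpm/60 tokens per second, capped at one minute's worth
          self.tokens = min(self.qpm, self.tokens + (now - self.last) * self.qpm / 60)
          self.last = now
          if self.tokens >= 1:
              self.tokens -= 1
              return True
          return False

  buckets = {tier: Bucket(qpm) for tier, qpm in LIMITS_QPM.items()}
  print(buckets["anonymous"].allow())    # True until the anonymous budget runs out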

The sizing of that cluster is a policy question (how many resources are we willing to give to our users), not a technical one. The users of the public WDQS endpoint understand the stability constraints fairly well.

A private WDQS endpoint has been created to serve controlled queries used for synchronous traffic. The main users of that cluster are the Recommendation API and Wikibase Quality checks. The stability of the private cluster is much easier to achieve, and we have not had any major issue so far, though the quality checks have been hitting the limits from time to time. As more usage of SPARQL services in Wikimedia projects is considered, we may need to upgrade the capacity of the private cluster too. Sizing here will be driven by the requirements provided by the WMF users of the endpoint.

Other Concerns

  • While Blazegraph has some sharding support, it does not seem like a viable solution (citation needed) in its current state.
  • Blazegraph is mostly unsupported and development is frozen. The latest commits to the GitHub repository are from February 2017. Amazon has bought Systap (the company behind Blazegraph) and the team has been hired to work on Amazon's own graph database solution (Neptune), which may share some code with Blazegraph but is completely proprietary. While the former Blazegraph team is still interested in helping us and in keeping the Blazegraph project alive, this is not a viable long-term solution.


Ideas

  • In order to compare platforms, could we create a synthetic query set (based on what Markus found, analysis of the logs, and so on), run it on both platforms and compare the results? (A small sketch of generating such a set follows the links below.)

See: https://iccl.inf.tu-dresden.de/web/Inproceedings3044/en which comes from this research: https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries

WDQS dataset: https://iccl.inf.tu-dresden.de/web/Wissensbasierte_Systeme/WikidataSPARQL/en
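
A small sketch of what generating such a synthetic query set could look like, sampling query templates with weights that would come from the log analysis. The templates, weights and entity IDs below are invented placeholders, not taken from the TU Dresden dataset.

  # Sketch: build a synthetic SPARQL workload by sampling query templates
  # with weights meant to come from query-log analysis. Templates, weights
  # and entity IDs are invented placeholders.
  import random

  TEMPLATES = [
      # (weight, SPARQL template with a {qid} slot; prefixes omitted)
      (0.6, "SELECT ?l WHERE {{ wd:{qid} rdfs:label ?l }} LIMIT 10"),
      (0.3, "SELECT ?o WHERE {{ wd:{qid} ?p ?o }} LIMIT 100"),
      (0.1, "ASK {{ wd:{qid} wdt:P31 wd:Q5 }}"),
  ]
  SAMPLE_QIDS = ["Q42", "Q64", "Q1"]     # placeholder entity IDs

  def synthetic_queries(n: int, seed: int = 0):
      rng = random.Random(seed)
      weights = [w for w, _ in TEMPLATES]
      for _ in range(n):
          _, template = rng.choices(TEMPLATES, weights=weights)[0]
          yield template.format(qid=rng.choice(SAMPLE_QIDS))

  for q in synthetic_queries(3):
      print(q)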


Meeting 2019-02-05

Testing new stack, where?

1. WMCS

  • we could buy servers, part of WMCS, but dedicated to WDQS testing
  • virtualization will isolate tests
  • no support for publishing metrics, so limited use for testing scaling
  • we do have some testing clusters in prod but they are fully managed and fully puppetized, so really not a fit for this test.
  • Having servers where we install untested software?
  • we could start in WMCS and later move to prod; would we need to do puppetization?
  • the hard part of the integration is using the prod infrastructure for metrics.

2. Special VLAN

  • how about testing in a different VLAN? (like Analytics does)
    • costly in terms of setup
    • creating the VLAN is easy, but having the supporting infrastructure (like metrics) is not, so this isn't any better than WMCS

3. Production host, firewalled

A prod host behind a firewall that drops everything by default; the rules can be made more stringent, and we can still install binaries that are not puppetized. We sometimes use decommissioned hardware for this.

Some questions?

Could we externalize the test to AWS or similar?

  • we would lose all integration with the rest of our infrastructure
  • we would gain some standard AWS tooling (like metrics)
  • not that different from building it in the cloud
  • will it work for a short time test?

Adding disks (or disk space) to current WDQS is not that easy

Mark mentions a graph database for other needs (Daniel's project around dependency management); Nuria mentions that it is still at too early a stage. Joseph mentions that JanusGraph is the preferred solution due to its SPARQL endpoint.