Wikidata Query Service/Migration/Development Infrastructure
This page describes the development infrastructure we are putting in place to support the WDQS backend migration. It contains a brief tutorial on setting up triple stores on a local workstation and on the eqiad test instances.
Local development
This section illustrates how to bootstrap a triple store locally on a Linux workstation and ingest data from the RDF mutation streams.
We'll use QLever as an example, but the same steps apply to Virtuoso (modulo appropriate binary names and config changes).
1. Build and start QLever
QLever ships binaries and a Python CLI as a wheel available from PyPI. For testing and development we bootstrap the database from source instead.
We'll be using the build toolchain provided in https://gitlab.wikimedia.org/repos/wikidata-platform/triplestores. As a prerequisite, make sure that the Nix build system is installed (see docs in the linked repo).
The following will set up an x86_64 C++ toolchain, fetch QLever and its dependencies, and build the database:
$ nix flake new qlever -t git+https://gitlab.wikimedia.org/repos/wikidata-platform/triplestores#qlever
$ cd qlever
$ nix build
The indexing and server binaries will be available under ./result/bin.
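If ServerMain refuses to start in an empty directory (it expects index files on disk), an initial empty index can be bootstrapped first. A minimal sketch; the IndexBuilderMain flag names here are assumptions, so check `./result/bin/IndexBuilderMain --help` for the exact options:

```shell
# Sketch: bootstrap an empty index so ServerMain has something to load.
# Flag names below are assumptions; verify with IndexBuilderMain --help.
mkdir -p index && cd index
if [ -x ../result/bin/IndexBuilderMain ]; then
  ../result/bin/IndexBuilderMain --index-basename wikidata -f /dev/null
else
  echo "build QLever first (nix build) so ./result/bin exists" >&2
fi
```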
Start and test QLever with:
$ mkdir index && cd index
$ ../result/bin/ServerMain --index-basename wikidata --port 7001 --memory-max-size 32G --cache-max-size 16G --default-query-timeout 2000s -a localhost
$ curl -X POST "http://localhost:7001/sparql?access-token=localhost" -H "Content-Type: application/sparql-query" --data-binary "SELECT (COUNT(*) AS ?count) WHERE {?s ?p ?o}" | jq
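The response follows the standard W3C SPARQL 1.1 JSON results format. For repeated ad-hoc queries during development, a small wrapper saves typing; `sparql` below is a hypothetical helper, not part of the toolchain:

```shell
# Hypothetical convenience wrapper around the curl call above.
sparql() {
  curl -s -X POST "http://localhost:7001/sparql?access-token=localhost" \
    -H "Content-Type: application/sparql-query" \
    --data-binary "$1"
}

# Example: extract the bare value from the first result binding
# (standard SPARQL 1.1 JSON results structure):
# sparql 'SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' \
#   | jq -r '.results.bindings[0].count.value'
```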
2. Populate the index from RDF mutation streams
Local workstations can't access the WMF production network or the RDF mutation streams in Kafka. We'll use the PoC Golang client, instead of the QLever wikidata update utility, to consume the public event streams and update the triple store.
The project requires Go. If you installed Nix in step 1, the repo provides helpers to bootstrap a Go toolchain.
$ git clone git@gitlab.wikimedia.org:repos/wikidata-platform/go-wikidata-updater.git
$ cd go-wikidata-updater
$ nix develop
$ go build ./cmd/go-wikidata-updater
This compiles a go-wikidata-updater binary. To update the QLever index from the main-graph RDF mutation stream, run:
$ ./go-wikidata-updater -sparql-endpoint "http://localhost:7001/sparql?access-token=localhost" -stream-url "https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation-main.v2"
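While the updater runs, the simplest way to confirm that mutations are being applied is to poll the triple count. `watch_count` below is a hypothetical helper (endpoint and access token match the local setup above):

```shell
# Hypothetical helper: poll the triple count every 30 seconds while the
# updater is running; a growing count means mutations are being applied.
watch_count() {
  while sleep 30; do
    curl -s -X POST "http://localhost:7001/sparql?access-token=localhost" \
      -H "Content-Type: application/sparql-query" \
      --data-binary 'SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }' \
      | jq -r '.results.bindings[0].count.value'
  done
}
```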
eqiad test nodes
We currently have the following test nodes in eqiad. Wikidata Platform engineers have root access. The servers are not yet managed via GitOps, which allows us to deploy and experiment with non-Debian-packaged software (including building from source).
The same toolchain discussed in Local development can be used to build the databases on eqiad. These hosts have access to Kafka and can use the production rdf-streaming-consumer Java tooling for real-time updates.
Wikidata entity dumps are available as NFS shares at /mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/
wdqs1028
Currently hosts QLever and Virtuoso indexes at /srv/wdqs/qlever and /srv/wdqs/virtuoso, respectively.
| Host | wdqs1028.eqiad.wmnet |
| Database | |
| Endpoint | |
| Index | |
| Real-time update | |
wdqs1029
| Host | wdqs1029.eqiad.wmnet |
| Database | |
| Endpoint | |
| Index | |
| Real-time update | |
wdqs1030
| Host | wdqs1030.eqiad.wmnet |
| Database | |
| Endpoint | |
| Index | |
| Real-time update | |
wdqs1031
| Host | wdqs1031.eqiad.wmnet |
| Database | |
| Endpoint | |
| Index | |
| Real-time update | |
wdqs1032
Not yet operational
| Host | wdqs1032.eqiad.wmnet |
| Database | |
| Endpoint | |
| Index | |
| Real-time update | |