Wikidata Query Service/Migration/Development Infrastructure

From Wikitech

This page describes the development infrastructure we are putting in place to support the WDQS backend migration. It contains a brief tutorial on setting up triple stores on a local workstation and on the eqiad test instances.

Local development

This section illustrates how to bootstrap a triple store locally on a Linux workstation and ingest data from the RDF mutation streams.

We'll use QLever as an example, but the same steps apply to Virtuoso (modulo the appropriate binary names and config changes).

1. Build and start QLever

QLever distributes its binaries and a Python CLI as a wheel available from PyPI. For testing and development we bootstrap the database from source instead.

We'll be using the build toolchain provided in https://gitlab.wikimedia.org/repos/wikidata-platform/triplestores. As a prerequisite, make sure that the Nix build system is installed (see the docs in the linked repo).

The following sets up an x86_64 C++ toolchain, fetches QLever and its dependencies, and builds the database:

$ nix flake new qlever -t git+https://gitlab.wikimedia.org/repos/wikidata-platform/triplestores#qlever
$ cd qlever
$ nix build

The indexing and server binaries will be available under ./result/bin.
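
The ServerMain invocation below expects an index with basename wikidata to already exist in its working directory. A minimal sketch of building one from a local Turtle dump follows; the IndexBuilderMain binary name matches QLever upstream, but the exact flags vary between versions, so verify against --help on your build:

```shell
# Build a QLever index with basename "wikidata" from a Turtle dump.
# Run it where ServerMain will later be started, so the index files
# are found. Flags are illustrative -- check IndexBuilderMain --help.
$ ./result/bin/IndexBuilderMain -i wikidata -f wikidata.ttl
```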

Start and test QLever with:

$ mkdir index && cd index
$ ../result/bin/ServerMain --index-basename wikidata --port 7001 --memory-max-size 32G --cache-max-size 16G --default-query-timeout 2000s -a localhost
$ curl -X POST http://localhost:7001/sparql?access-token=localhost -H "Content-Type: application/sparql-query" --data-binary "SELECT (COUNT(*)  AS ?count) WHERE {?s ?p ?o}" | jq

2. Populate the index from RDF mutation streams

A local workstation can't access the WMF production network and the RDF mutation streams from Kafka. We'll use the PoC Golang client, instead of the QLever wikidata update utility, to consume the public event streams and update the triple store.

The project requires Go. If you installed Nix in step 1, helpers are available in the repo to bootstrap a Go toolchain.

$ git clone git@gitlab.wikimedia.org:repos/wikidata-platform/go-wikidata-updater.git
$ cd go-wikidata-updater
$ nix develop
$ go build ./cmd/go-wikidata-updater

This will compile a go-wikidata-updater binary. To update the QLever index from the main graph RDF mutation stream, run:

$ ./go-wikidata-updater -sparql-endpoint "http://localhost:7001/sparql?access-token=localhost" -stream-url "https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation-main.v2"
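
Under the hood, the updater turns each mutation event into SPARQL updates against the endpoint. A single mutation can be reproduced by hand with curl, which is handy for debugging; the triple below is fabricated for illustration, and whether your QLever build accepts SPARQL UPDATE (and which access token it expects) depends on its configuration:

```shell
# Apply a hand-written mutation to the local endpoint, mimicking one
# updater step: delete stale triples, then insert the new ones.
# The example triple is made up; adjust the access token to your setup.
$ curl -X POST "http://localhost:7001/sparql?access-token=localhost" \
    -H "Content-Type: application/sparql-update" \
    --data-binary 'PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
DELETE DATA { wd:Q42 rdfs:label "old label"@en };
INSERT DATA { wd:Q42 rdfs:label "Douglas Adams"@en }'
```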

eqiad test nodes

We currently have the following test nodes in eqiad. Wikidata Platform engineers have root access. The servers are not yet managed via GitOps, which allows us to deploy and experiment with software that is not Debian-packaged (including building from source).

The same toolchain discussed in Local development can be used to build the databases on eqiad. The hosts have access to Kafka and can use the production rdf-streaming-consumer Java tooling for real-time updates.

Wikidata entity dumps are available as an NFS share at /mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/
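
The dumps can be fed straight into an index build on an eqiad host. A sketch, assuming a gzipped Turtle dump is present under that share (file names vary per dump run) and that IndexBuilderMain and its flags match your QLever build:

```shell
# Pick a dump from the NFS share and stream it into the index builder.
# File name and builder flags are illustrative -- check the share
# contents and IndexBuilderMain --help on your build.
$ ls /mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/
$ zcat /mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz \
    | ./result/bin/IndexBuilderMain -i wikidata -f -
```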

wdqs1028

Currently hosts QLever and Virtuoso indexes at /srv/wdqs/qlever and /srv/wdqs/virtuoso, respectively.

Host: wdqs1028.eqiad.wmnet
Database:
Endpoint:
Index:
Real-time update:

wdqs1029

Host: wdqs1029.eqiad.wmnet
Database:
Endpoint:
Index:
Real-time update:

wdqs1030

Host: wdqs1030.eqiad.wmnet
Database:
Endpoint:
Index:
Real-time update:

wdqs1031

Host: wdqs1031.eqiad.wmnet
Database:
Endpoint:
Index:
Real-time update:

wdqs1032

Not yet operational

Host: wdqs1032.eqiad.wmnet
Database:
Endpoint:
Index:
Real-time update: