Nova Resource:Wikidata-query

From Wikitech
Project name: wikidata-query

Wikidata Query Service

Description

Wikidata Query Service - see https://www.mediawiki.org/wiki/Wikidata_query_service

Purpose

The Wikidata Query Service (WDQS) provides a way to access Wikidata data, via a SPARQL API.

Anticipated traffic level

10-100 hits per day

Anticipated time span

months

Project status

currently running

Contact address

smalyshev@wikimedia.org

Willing to take contributors or not

willing

Subject area narrow or broad

broad


Puppetized instance setup

Blazegraph HA journal cluster: implementation log/notes

These notes use this as a guide: http://wiki.blazegraph.com/wiki/index.php/HAJournalServer

  • Each machine has a 60 GB /srv partition
  • The three machines share /data/project
  • The compressed journal is shared at /data/project/blazegraph/bigdata-wikidata-statements.jnl.gz

Prerequisites

Make /data/project/blazegraph, with group write permissions for project-wikidata-query. This is shared between all machines in the cluster.

$ sudo mkdir /data/project/blazegraph
$ sudo chgrp project-wikidata-query /data/project/blazegraph
$ sudo chmod g+rwx /data/project/blazegraph
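The same mkdir/chgrp/chmod pattern recurs for every shared directory in this guide. A minimal sketch of what the resulting mode should look like, demonstrated on a throwaway directory (the real directory and the project-wikidata-query group are not touched here):

```shell
# Sketch: reproduce the permissions pattern on a temp directory and
# inspect the resulting mode string. On the real host the group would
# be project-wikidata-query instead of the caller's primary group.
tmp=$(mktemp -d)
mkdir "$tmp/blazegraph"
chmod g+rwx "$tmp/blazegraph"
# Positions 5-7 of the mode string are the group bits; they should read rwx.
mode=$(stat -c '%A' "$tmp/blazegraph")
echo "$mode"
rm -rf "$tmp"
```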

Install Java:

$ sudo apt-get install openjdk-7-jdk

ZooKeeper

Get ZooKeeper:

$ cd /srv
$ sudo mkdir zookeeper
$ sudo wget http://apache.mirrors.tds.net/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz \
  -O /srv/zookeeper/zookeeper-3.4.6.tar.gz

Extract ZooKeeper and set some useful permissions:

$ cd zookeeper
$ sudo tar -xzf zookeeper-3.4.6.tar.gz
$ sudo chown -R root:project-wikidata-query /srv/zookeeper
$ sudo chmod -R g+rw /srv/zookeeper

Configure ZooKeeper:

$ cd /srv/zookeeper/zookeeper-3.4.6/conf
$ cp zoo_sample.cfg zoo.cfg

Change the data directory in zoo.cfg:

dataDir=/srv/zookeeper

Append to zoo.cfg:

initLimit=5
syncLimit=2
server.1=wdq-bg1:2888:3888
server.2=wdq-bg2:2888:3888
server.3=wdq-bg3:2888:3888
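Each server.N line names an ensemble member with its quorum port (2888) and leader-election port (3888). A quick sanity check of that host:peerPort:electionPort shape, run here against a temp copy of the lines rather than the real zoo.cfg:

```shell
# Sketch: verify every server.N line has the host:port:port shape
# ZooKeeper expects, using a throwaway copy of the config lines.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
server.1=wdq-bg1:2888:3888
server.2=wdq-bg2:2888:3888
server.3=wdq-bg3:2888:3888
EOF
# Count server.N lines that do NOT match the expected format; should be 0.
bad=$(grep '^server\.' "$cfg" | grep -vcE '^server\.[0-9]+=[^:]+:[0-9]+:[0-9]+$')
echo "malformed server lines: $bad"
rm -f "$cfg"
```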

Set ZooKeeper's myid (use 1, 2, or 3 depending on whether the server is wdq-bg1, wdq-bg2, or wdq-bg3):

$ echo 1 > /srv/zookeeper/myid
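The myid must match the N of the node's own server.N line in zoo.cfg. A hypothetical helper (not part of the guide) that derives it from the hostname, so the same snippet can be pasted on all three nodes:

```shell
# Hypothetical helper: map a wdq-bgN hostname to its ZooKeeper myid.
myid_for() {
  case "$1" in
    wdq-bg1) echo 1 ;;
    wdq-bg2) echo 2 ;;
    wdq-bg3) echo 3 ;;
    *) echo "unknown host: $1" >&2; return 1 ;;
  esac
}

# On a real node you would run: myid_for "$(hostname)" > /srv/zookeeper/myid
myid_for wdq-bg2
```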

Start ZooKeeper:

$ cd /srv/zookeeper/zookeeper-3.4.6/bin
$ ./zkServer.sh start

Blazegraph

Create /srv/blazegraph, with group write permissions for project-wikidata-query:

$ sudo mkdir /srv/blazegraph
$ sudo chgrp project-wikidata-query /srv/blazegraph
$ sudo chmod g+rwx /srv/blazegraph

Clone the Blazegraph 1.5.0 release to /srv/blazegraph/BIGDATA_RELEASE_1_5_0:

$ git clone -b BIGDATA_RELEASE_1_5_0 --single-branch git://git.code.sf.net/p/bigdata/git \
  /srv/blazegraph/BIGDATA_RELEASE_1_5_0

Extract the journal to /srv/blazegraph/BIGDATA_RELEASE_1_5_0/bigdata.jnl:

$ gunzip -c -k /data/project/blazegraph/bigdata-wikidata-statements.jnl.gz > \
  /srv/blazegraph/BIGDATA_RELEASE_1_5_0/bigdata.jnl

Install ant:

$ sudo apt-get install ant

Test Blazegraph

This part is optional.

Tunnel port 9999:

$ ssh -L 9999:localhost:9999 wdq-bg1

Test Blazegraph:

$ cd /srv/blazegraph/BIGDATA_RELEASE_1_5_0
$ ant start-blazegraph

On the Namespaces tab (http://localhost:9999/bigdata/#namespaces), select "Use" for the kb namespace.

Run a query through http://localhost:9999/bigdata/#query:

prefix wdq: <http://www.wikidata.org/entity/>
prefix wdo: <http://www.wikidata.org/ontology#>
prefix xs: <http://www.w3.org/2001/XMLSchema#>
select ?entity ?date WHERE {
  ?entity ?relatedTo ?dateS .
  ?dateS wdq:P569v ?dateV .
  ?dateV wdo:preferredCalendar wdq:Q1985727 .
  ?dateV wdo:time ?date .
  FILTER (?date > "1918-04-11"^^xs:date && ?date < "1918-06-11"^^xs:date)
}

Blazegraph HA journal cluster

For more details, follow along with the HAJournalServer guide linked above.

Create the deployment artifacts:

$ cd /srv/blazegraph/BIGDATA_RELEASE_1_5_0
$ ant deploy-artifact

Substitute the following values in config.sh. Alternatively, append them to the end of the file; because the file is read top to bottom, the appended values override any existing ones.

/srv/blazegraph/BIGDATA_RELEASE_1_5_0/dist/bigdata/bin/config.sh:

export FEDNAME=wdq-bg
export FED_DIR=/srv/blazegraph
export LOGICAL_SERVICE_ID=HA-Replication-Cluster-1
export LOCATORS="jini://wdq-bg1/,jini://wdq-bg2/,jini://wdq-bg3/"
export ZK_SERVERS="wdq-bg1:2181,wdq-bg2:2181,wdq-bg3:2181"
export REPLICATION_FACTOR=3
export JAVA_OPTS="${JAVA_OPTS} -Xmx4g"
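A minimal sketch of the append-and-override behavior, run against a temp stand-in rather than the real dist/bigdata/bin/config.sh (the pre-existing REPLICATION_FACTOR=1 line is invented for the demonstration):

```shell
# Sketch: append overrides to a stand-in config.sh and confirm that,
# when the file is sourced, the appended values win.
cfg=$(mktemp)
echo 'export REPLICATION_FACTOR=1' > "$cfg"   # invented pre-existing value
cat >> "$cfg" <<'EOF'
export FEDNAME=wdq-bg
export REPLICATION_FACTOR=3
EOF
. "$cfg"
echo "$FEDNAME $REPLICATION_FACTOR"   # prints: wdq-bg 3
rm -f "$cfg"
```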

Change the write cache buffer count in HAJournal.config.

/srv/blazegraph/BIGDATA_RELEASE_1_5_0/dist/bigdata/var/config/jini/HAJournal.config:

// new NV(Options.WRITE_CACHE_BUFFER_COUNT,ConfigMath.getProperty("WRITE_CACHE_BUFFER_COUNT","2000")),
new NV(Options.WRITE_CACHE_BUFFER_COUNT,ConfigMath.getProperty("WRITE_CACHE_BUFFER_COUNT","6")), 
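The edit above (dropping the default from 2000 to 6) can also be scripted; a sketch using sed on a throwaway copy of the relevant line, rather than the real HAJournal.config:

```shell
# Sketch: change the WRITE_CACHE_BUFFER_COUNT default from 2000 to 6
# with sed, demonstrated on a temp copy of the line in question.
f=$(mktemp)
echo 'new NV(Options.WRITE_CACHE_BUFFER_COUNT,ConfigMath.getProperty("WRITE_CACHE_BUFFER_COUNT","2000")),' > "$f"
sed -i 's/"WRITE_CACHE_BUFFER_COUNT","2000"/"WRITE_CACHE_BUFFER_COUNT","6"/' "$f"
line=$(cat "$f")
echo "$line"
rm -f "$f"
```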

Switch to user blazegraph.

Start ZooKeeper:

$ cd /srv/zookeeper/zookeeper-3.4.6/bin
$ ./zkServer.sh start

Launch Blazegraph:

$ /srv/blazegraph/BIGDATA_RELEASE_1_5_0/dist/bigdata/bin/startHAServices

To be able to access all three servers locally, use ssh tunneling:

$ ssh wdq-bg1 -L 8081:localhost:8080
$ ssh wdq-bg2 -L 8082:localhost:8080
$ ssh wdq-bg3 -L 8083:localhost:8080

Now you can browse to localhost:8081 for wdq-bg1, localhost:8082 for wdq-bg2, and localhost:8083 for wdq-bg3.

DB test server

The DB test server with a big disk (160 GB) is db01. Blazegraph is set up to run under the blazegraph user with its home at /srv/blazegraph. To run it (after su'ing to the blazegraph user):

 
$ cd /srv/blazegraph
$ sh run.sh

Server admin log

2021-02-02

  • 07:29 dcaro: large VM wcqs-beta-01 is exhausting the hosts disk space (cloudvirt-wdqs1001) (T273579)

2020-10-01

  • 18:31 andrewbogott: moving wcqs-beta-01 to cloudvirt-wdqs1001 so I can upgrade 1002 to Buster

2020-07-16

  • 18:58 andrewbogott: removing role::sdoc::cloud from sdcquery01 because that role seems to not exist

2020-07-08

  • 17:06 bd808: Added self (bd808) as projectadmin

2019-06-20

  • 14:12 andrewbogott: moving wdqs-test to a new cloudvirt

2019-01-31

  • 12:07 arturo: VM instances ldfclient-new, were stoppe... (more)