Nova Resource:Wikidata-query
Project Name: wikidata-query
Details, admins/members: openstack-browser
Monitoring: Wikidata Query Service
Description
Wikidata Query service - see https://www.mediawiki.org/wiki/Wikidata_query_service
Purpose
The Wikidata Query Service (WDQS) provides a way to access Wikidata data, via a SPARQL API.
Anticipated traffic level
10-100 hits per day
Anticipated time span
months
Project status
currently running
Contact address
smalyshev@wikimedia.org
Willing to take contributors or not
willing
Subject area narrow or broad
broad
Puppetized instance setup
- Create a labs instance (xlarge recommended)
- Set up role role::puppet::self with puppetmaster name wdqs-puppetmaster (see also Help:Self-hosted_puppetmaster)
- Add role role::wdqs in the role config
- The service install is in /srv/wdqs/blazegraph
- Follow the instructions in /srv/wdqs/blazegraph/docs/getting-started.md to set up the service
Blazegraph HA journal cluster Implementation log/notes
These notes use this as a guide: http://wiki.blazegraph.com/wiki/index.php/HAJournalServer
- Each machine has a 60 GB /srv partition
- The three machines share /data/project
- The compressed journal is shared at /data/project/blazegraph/bigdata-wikidata-statements.jnl.gz
Prerequisites
Make /data/project/blazegraph, with group write permissions for project-wikidata-query. This is shared between all machines in the cluster.
$ sudo mkdir /data/project/blazegraph
$ sudo chgrp project-wikidata-query /data/project/blazegraph
$ sudo chmod g+rwx /data/project/blazegraph
Install Java:
$ sudo apt-get install openjdk-7-jdk
ZooKeeper
Get ZooKeeper:
$ cd /srv
$ sudo mkdir zookeeper
$ sudo wget http://apache.mirrors.tds.net/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz \
    -O /srv/zookeeper/zookeeper-3.4.6.tar.gz
Extract ZooKeeper and set some useful permissions:
$ cd zookeeper
$ sudo tar -xzf zookeeper-3.4.6.tar.gz
$ sudo chown -R root.project-wikidata-query /srv/zookeeper
$ sudo chmod -R g+rw /srv/zookeeper
Configure ZooKeeper:
$ cd /srv/zookeeper/zookeeper-3.4.6/conf
$ cp zoo_sample.cfg zoo.cfg
Change the data directory in zoo.cfg:
dataDir=/srv/zookeeper
Append to zoo.cfg:
initLimit=5
syncLimit=2
server.1=wdq-bg1:2888:3888
server.2=wdq-bg2:2888:3888
server.3=wdq-bg3:2888:3888
Set ZooKeeper's myid (use 1, 2, or 3 depending on which server: wdq-bg1, wdq-bg2, or wdq-bg3):
$ echo 1 > /srv/zookeeper/myid
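If the same setup steps are scripted across all three servers, the id can be derived from the hostname instead of hard-coded. A minimal sketch; the zk_myid helper is hypothetical, and the wdq-bgN hostnames are the ones from the ensemble config above:

```shell
# Hypothetical helper: derive the ZooKeeper id from the trailing digits
# of a host name (wdq-bg1 -> 1, wdq-bg2 -> 2, wdq-bg3 -> 3).
zk_myid() {
  local host=$1
  # Strip the longest prefix ending in a non-digit, leaving trailing digits.
  printf '%s\n' "${host##*[!0-9]}"
}

# On each server (as root or via sudo):
#   zk_myid "$(hostname -s)" > /srv/zookeeper/myid
zk_myid wdq-bg2   # -> 2
```

This only works while the hosts keep a single trailing number in their names; with a different naming scheme the ids must be assigned by hand as above.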
Start ZooKeeper:
$ cd /srv/zookeeper/zookeeper-3.4.6/bin
$ ./zkServer.sh start
Blazegraph
Create /srv/blazegraph, with group write permissions for project-wikidata-query:
$ sudo mkdir /srv/blazegraph
$ sudo chgrp project-wikidata-query /srv/blazegraph
$ sudo chmod g+rwx /srv/blazegraph
Clone the Blazegraph 1.5.0 release to /srv/blazegraph/BIGDATA_RELEASE_1_5_0:
$ git clone -b BIGDATA_RELEASE_1_5_0 --single-branch git://git.code.sf.net/p/bigdata/git \
    /srv/blazegraph/BIGDATA_RELEASE_1_5_0
Extract the journal to /srv/blazegraph/BIGDATA_RELEASE_1_5_0/bigdata.jnl:
$ gunzip -c -k /data/project/blazegraph/bigdata-wikidata-statements.jnl.gz > \
    /srv/blazegraph/BIGDATA_RELEASE_1_5_0/bigdata.jnl
Install Ant:
$ sudo apt-get install ant
Test Blazegraph
This part is optional.
Tunnel port 9999:
$ ssh -L 9999:localhost:9999 wdq-bg1
Test Blazegraph:
$ cd /srv/blazegraph/BIGDATA_RELEASE_1_5_0
$ ant start-blazegraph
Select ("Use") the kb namespace via http://localhost:9999/bigdata/#namespaces.
Run a query through http://localhost:9999/bigdata/#query:
prefix wdq: <http://www.wikidata.org/entity/>
prefix wdo: <http://www.wikidata.org/ontology#>
prefix xs: <http://www.w3.org/2001/XMLSchema#>
select ?entity ?date WHERE {
  ?entity ?relatedTo ?dateS .
  ?dateS wdq:P569v ?dateV .
  ?dateV wdo:preferredCalendar wdq:Q1985727 .
  ?dateV wdo:time ?date .
  FILTER (?date > "1918-04-11"^^xs:date && ?date < "1918-06-11"^^xs:date)
}
Blazegraph HA journal cluster
For more details, follow along with the HAJournalServer wiki guide linked above.
Create the deployment artifacts:
$ cd /srv/blazegraph/BIGDATA_RELEASE_1_5_0
$ ant deploy-artifact
Substitute the following values in config.sh. You can also simply append them to the end of the file; later definitions override any existing values in config.sh.
/srv/blazegraph/BIGDATA_RELEASE_1_5_0/dist/bigdata/bin/config.sh:
export FEDNAME=wdq-bg
export FED_DIR=/srv/blazegraph
export LOGICAL_SERVICE_ID=HA-Replication-Cluster-1
export LOCATORS="jini://wdq-bg1/,jini://wdq-bg2/,jini://wdq-bg3/"
export ZK_SERVERS="wdq-bg1:2181,wdq-bg2:2181,wdq-bg3:2181"
export REPLICATION_FACTOR=3
export JAVA_OPTS="${JAVA_OPTS} -Xmx4g"
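Appending the overrides can be scripted with a heredoc. A sketch, demonstrated against a temporary file so it runs safely anywhere; on the cluster, CONFIG would point at dist/bigdata/bin/config.sh:

```shell
# Sketch: append override exports to a config.sh-style file.
# CONFIG is a throwaway temp file here; on the real hosts it would be
# /srv/blazegraph/BIGDATA_RELEASE_1_5_0/dist/bigdata/bin/config.sh.
CONFIG=$(mktemp)
cat >> "$CONFIG" <<'EOF'
export FEDNAME=wdq-bg
export FED_DIR=/srv/blazegraph
export REPLICATION_FACTOR=3
EOF
grep -c '^export ' "$CONFIG"   # -> 3
```

The quoted heredoc delimiter ('EOF') keeps the ${JAVA_OPTS}-style references from being expanded at append time, which is what you want for lines meant to be evaluated when config.sh is sourced.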
Change the write cache buffer count in HAJournal.config.
/srv/blazegraph/BIGDATA_RELEASE_1_5_0/dist/bigdata/var/config/jini/HAJournal.config:
// new NV(Options.WRITE_CACHE_BUFFER_COUNT,ConfigMath.getProperty("WRITE_CACHE_BUFFER_COUNT","2000")),
new NV(Options.WRITE_CACHE_BUFFER_COUNT,ConfigMath.getProperty("WRITE_CACHE_BUFFER_COUNT","6")),
Switch to the blazegraph user.
Start ZooKeeper:
$ cd /srv/zookeeper/zookeeper-3.4.6/bin
$ ./zkServer.sh start
Launch Blazegraph:
$ /srv/blazegraph/BIGDATA_RELEASE_1_5_0/dist/bigdata/bin/startHAServices
To be able to access all three servers locally, use ssh tunneling:
$ ssh wdq-bg1 -L 8081:localhost:8080
$ ssh wdq-bg2 -L 8082:localhost:8080
$ ssh wdq-bg3 -L 8083:localhost:8080
Now you can browse to localhost:8081 for wdq-bg1, localhost:8082 for wdq-bg2, and localhost:8083 for wdq-bg3.
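The host-to-port mapping follows a simple convention, so scripted tunnels can compute the local port instead of hard-coding it. A sketch; tunnel_port is a hypothetical helper:

```shell
# Hypothetical helper: local tunnel port for a cluster host, following
# the wdq-bgN -> 808N convention used above.
tunnel_port() {
  local host=$1
  echo $((8080 + ${host##*[!0-9]}))   # trailing host digit + 8080
}

# e.g. ssh "$host" -L "$(tunnel_port "$host"):localhost:8080"
tunnel_port wdq-bg3   # -> 8083
```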
DB test server
The DB test server with a big disk (160 GB) is db01. Blazegraph is set up to run under the blazegraph user with its home at /srv/blazegraph. To run it (after su'ing to the blazegraph user):
$ cd /srv/blazegraph
$ sh run.sh
Server admin log
2022-11-17
- 18:38 andrewbogott: committed a local labs/private change on wdqspuppet to get puppet syncing again
2022-01-19
- 17:36 andrewbogott: rebooting wcqs-beta-01.wikidata-query.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move
2022-01-10
- 13:56 dcaro: Replaced the too-big flavor t206636 with one with a smaller disk, t206636v2 (T297454)
2021-02-02
- 07:29 dcaro: large VM wcqs-beta-01 is exhausting the hosts disk space (cloudvirt-wdqs1001) (T273579)
2020-10-01
- 18:31 andrewbogott: ...