Wikidata query service

From Wikitech
[Diagram: Wikidata Query Service components]

Wikidata Query Service is the Wikimedia implementation of a SPARQL server, based on the Blazegraph engine, that services queries for Wikidata and other data sets. See the User Manual for a more detailed description.

See also https://www.mediawiki.org/wiki/Wikidata_query_service/Implementation

Development environment

You will need Java and Maven. Java can be installed from: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Code

The source code is in the Gerrit project wikidata/query/rdf. To start working on the Wikidata Query Service codebase, clone this repository:

git clone https://gerrit.wikimedia.org/r/wikidata/query/rdf 

or the GitHub mirror:

git clone https://github.com/wikimedia/wikidata-query-rdf.git

or if you want to push changes and have a Gerrit account:

git clone ssh://someone@gerrit.wikimedia.org:29418/wikidata/query/rdf 

After cloning, update the submodules:

$ cd wikidata-query-rdf
[../wikidata-query-rdf]$ git submodule update --init
Submodule 'gui' (https://gerrit.wikimedia.org/r/wikidata/query/gui) registered for path 'gui'

Build

Then you can build the distribution package by running:

cd wikidata-query-rdf
./mvnw package

and the package will be in the dist/target directory. Alternatively, to run the Blazegraph service from the development environment (e.g. for testing), use:

bash war/runBlazegraph.sh

Add the "-d" option to run it in debug mode. If your build is failing because your Maven version is different from the expected one, you can skip the version check:

 mvn package -Denforcer.skip=true


To run the Updater, use:

 bash tools/runUpdate.sh

The build relies on Blazegraph packages which are stored in Archiva; their source is in the wikidata/query/blazegraph Gerrit repository. See the instructions on MediaWiki for the case where dependencies need to be rebuilt.

See also documentation in the source for more instructions.

Build Blazegraph

If changes to the Blazegraph source are needed, they should be committed to the wikidata/query/blazegraph repo. After that, a new Blazegraph sub-version should be built, and WDQS should switch to using it. The procedure to follow:

  1. Commit fixes (watch for extra whitespace changes!)
  2. Update README.wmf with descriptions of which changes were done against mainstream
  3. Blazegraph source in master branch will be on snapshot version, e.g. 2.1.5-wmf.4-SNAPSHOT - set it to non-snapshot: mvn versions:set -DnewVersion=2.1.5-wmf.4
  4. Make local build: mvn clean; bash scripts/mavenInstall.sh; mvn -f bigdata-war/pom.xml install -DskipTests=true
  5. Switch the Blazegraph version in the main pom.xml of the WDQS repo to 2.1.5-wmf.4 (do not push it yet!). Build and verify everything works as intended.
  6. Commit the version change in Blazegraph, push it to the main repo. Tag it with the same version and push the tag too.
  7. Run to deploy: mvn -f pom.xml -P deploy-archiva deploy -P Development; mvn -f bigdata-war/pom.xml -P deploy-archiva deploy -P Development
  8. Commit the version change in WDQS, and push to gerrit. Ensure the tests pass (this would also ensure Blazegraph deployment to Archiva worked properly).
  9. After merging the WDQS change, follow the procedure below to deploy new WDQS version.
  10. Bump Blazegraph master version back to snapshot - mvn versions:set -DnewVersion=2.1.5-wmf.5-SNAPSHOT - and commit/push it.
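The version bookkeeping in steps 3 and 10 can be sketched with plain shell string handling (the version numbers are the examples from the list above; the real edits are made with mvn versions:set as shown):

```shell
# Derive the release version and the next snapshot version from the current
# master version. Echo-only sketch; it prints the mvn commands, nothing more.
snapshot="2.1.5-wmf.4-SNAPSHOT"                  # current master version
release="${snapshot%-SNAPSHOT}"                  # 2.1.5-wmf.4: version to release
n="${release##*wmf.}"                            # trailing wmf number: 4
next_snapshot="${release%wmf.*}wmf.$((n + 1))-SNAPSHOT"  # 2.1.5-wmf.5-SNAPSHOT
echo "set release version: mvn versions:set -DnewVersion=$release"
echo "after deploy, bump:  mvn versions:set -DnewVersion=$next_snapshot"
```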

Administration

Hardware

We're currently running on the following servers:

  • public cluster, eqiad: wdqs1006, wdqs1004, wdqs1005
  • public cluster, codfw: wdqs2001, wdqs2002, wdqs2003
  • internal cluster, eqiad: wdqs1003, wdqs1007, wdqs1008
  • internal cluster, codfw: wdqs2004, wdqs2005, wdqs2006

These clusters are in active/active mode (traffic is sent to both), but due to how we route traffic with GeoDNS, the primary cluster (usually eqiad) sees most of the traffic.

Server specs are similar to the following:

  • CPU: dual Intel(R) Xeon(R) CPU E5-2620 v3
  • Disk: 1600 GB of raw RAID SSD space
  • RAM: 128GB

Monitoring

Icinga group

Grafana dashboard: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service

Grafana frontend dashboard: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service-frontend

WDQS dashboard: http://discovery.wmflabs.org/wdqs/

Deployment

Sources

The source code is in the Gerrit project wikidata/query/rdf (GitHub mirror). The GUI source code is Gerrit project wikidata/query/gui (GitHub mirror), which is also a submodule of the main project.

The deployment version of the query service is in the Gerrit project wikidata/query/deploy, with the deployment version of the GUI, wikidata/query/gui-deploy (production branch), as a submodule.

Labs Deployment

Note that deployment is currently done via git-fat (see below), which may require some manual steps after checkout. This can be done as follows:

  1. Check out wikidata/query/deploy repository and update gui submodule to current production branch (git submodule update).
  2. Run git-fat pull to instantiate the binaries if necessary.
  3. rsync the files to deploy directory (/srv/wdqs/blazegraph)

Use the role role::wdqs::labs for installing WDQS. You may also want to enable role::labs::lvm::srv to provide adequate disk space in /srv.

Command sequence for manual install:

git clone https://gerrit.wikimedia.org/r/wikidata/query/deploy
cd deploy
git fat init
git fat pull
git submodule init
git submodule update
sudo rsync -av --exclude .git\* --exclude scap --delete . /srv/wdqs/blazegraph

Production Deployment

Production deployment is done via git deployment repository wikidata/query/deploy. The procedure is as follows:

  1. mvn package the source repository.
  2. mvn deploy -Pdeploy-archiva in the source repository - this deploys the artifacts to Archiva. Note that for this you will need the repositories wikimedia.releases and wikimedia.snapshots configured in ~/.m2/settings.xml with an Archiva username/password.
  3. Install the new files (which will also be in dist/target/service-*-dist.zip) into the deploy repo above and commit them. Note that since git-fat uses Archiva as primary storage, there can be a delay between files being deployed to Archiva and their appearance on rsync, ready for git-fat deployment.
  4. Use scap deploy to deploy the new build.
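As a pre-flight for step 2, the presence of the two repository entries can be checked with a small helper (check_m2_settings is a hypothetical name; it only greps for the repository ids and does not validate the credentials):

```shell
# Hypothetical pre-flight check: make sure settings.xml mentions both Archiva
# repository ids before attempting `mvn deploy -Pdeploy-archiva`.
check_m2_settings() {
    settings="$1"
    for repo in wikimedia.releases wikimedia.snapshots; do
        grep -q "$repo" "$settings" 2>/dev/null || { echo "missing: $repo"; return 1; }
    done
    echo "settings OK"
}
# Typical use (default Maven settings location):
check_m2_settings "$HOME/.m2/settings.xml" || true
```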

The puppet role that needs to be enabled for the service is role::wdqs.

It is recommended to test the deployment checkout on beta (see above) before deploying it in production.

GUI deployment

GUI deployment files are in repository wikidata/query/gui-deploy branch production. It is a submodule of wikidata/query/deploy which is linked as gui subdirectory. The following script can be used to sync GUI deployment repo inside WDQS repo:

#!/bin/bash
set -e
pushd gui
git checkout production
git pull
popd
git commit -m "Update GUI" gui
git push

New deployment GUI versions are automatically built by WDQSGuiBuilder after every merge to the wikidata/query/gui repo, so usually you can just +2 them (query). You can also run grunt deploy in the GUI directory to generate such a patch by hand (it still needs to be merged manually). Either way, after the gui-deploy repo has been updated, update the gui submodule in wikidata/query/deploy to the latest production head and commit/push the change. Deploy as described above.

Data reload procedure

Data preparation

This can be done while the services are running and does not require any downtime. Ensure that there is enough disk space on /srv/wdqs, or use another location for the files. The space required is around 2.5x the size of the compressed dump, currently around 100G.

  1. Download latest Wikidata dump: https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 or https://dumps.wikimedia.your.org/wikidatawiki/entities/latest-all.ttl.bz2 (Wikimedia downloads are bandwidth-limited so using a mirror may be faster)
  2. Download latest Lexeme dump: https://dumps.wikimedia.your.org/wikidatawiki/entities/latest-lexemes.ttl.bz2
  3. Create /srv/wdqs/munged and /srv/wdqs/lex-munged
  4. Run munger for main database: bash munge.sh -f latest-all.ttl.bz2 -d /srv/wdqs/munged
  5. Run munger for lexemes: bash munge.sh -f latest-lexemes.ttl.bz2 -d /srv/wdqs/lex-munged
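The disk-space requirement above (2.5x the compressed dump) can be sanity-checked with a small helper; need_kb is a hypothetical name doing only the arithmetic:

```shell
# Munging needs roughly 2.5x the size of the compressed dump; compute the
# requirement in KB with integer arithmetic.
need_kb() {
    # $1: compressed dump size in KB
    echo $(( $1 * 5 / 2 ))
}
# Typical use on the host (paths as in the list above):
# dump_kb=$(du -k latest-all.ttl.bz2 | cut -f1)
# df -Pk /srv/wdqs   # compare the "Available" column against $(need_kb "$dump_kb")
```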

Data loading

This is the procedure for reloading main service. See the procedure for categories service below.

  1. Go to icinga and schedule downtime: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=wdqs2002
  2. Depool: HOME=/root sudo depool
  3. Remove data loaded flag: rm /srv/wdqs/data_loaded
  4. Stop the updater: sudo service wdqs-updater stop
  5. Turn on maintenance: touch /var/lib/nginx/wdqs/maintenance
  6. Stop Blazegraph: sudo service wdqs-blazegraph stop
  7. Remove old db: rm /srv/wdqs/wikidata.jnl
  8. Start Blazegraph: sudo service wdqs-blazegraph start, and check that /srv/wdqs/wikidata.jnl is created.
  9. Check logs: sudo journalctl -u wdqs-blazegraph -f and less /var/log/wdqs/wdqs-blazegraph.log.
  10. Load data: bash loadData.sh -n wdq -d /srv/wdqs/munged
  11. Load lexeme dump: curl -XPOST --data-binary update="LOAD <file:///srv/wdqs/lex-munged/wikidump-000000001.ttl.gz>" http://localhost:9999/bigdata/namespace/wdq/sparql
  12. Restore data loaded flag: touch /srv/wdqs/data_loaded
  13. Start updater: sudo service wdqs-updater start
  14. Check logs: sudo journalctl -u wdqs-updater -f
  15. Wait for the updater to catch up - look at /var/log/wdqs/wdqs-updater.log
  16. Repool: HOME=/root sudo pool

Categories reload procedure

Categories now live in a separate service, so they should not be reloaded together with the main service.

  1. Depool: HOME=/root sudo depool
  2. Stop categories service: sudo service wdqs-categories stop
  3. Remove old db: rm /srv/wdqs/categories.jnl
  4. Start the categories service: sudo service wdqs-categories start, and check that /srv/wdqs/categories.jnl is created.
  5. Check logs: sudo journalctl -u wdqs-categories -f and less /var/log/wdqs/wdqs-categories.log.
  6. Reload categories from weekly dump: /usr/local/bin/reloadCategories.sh or if needed to be done manually: bash createNamespace.sh categories; bash forAllCategoryWikis.sh loadCategoryDump.sh categories
  7. Reload daily diffs:
    1. For the day next to the weekly dump's day: loadCategoryDaily.sh {TS} fromDump{WEEKLYTS}-, where TS is next day's date, YYYYMMDD format, and WEEKLYTS is the date of the weekly dump (would be one day earlier than TS). E.g.: loadCategoryDaily.sh 20181007 fromDump20181006-. Note the dash at the end of the prefix.
    2. For each following day: loadCategoryDaily.sh {TS} where TS is the day of the diff. E.g. loadCategoryDaily.sh 20181008.
    3. If possible, it is recommended to reload the data close to the date of the weekly dump, to minimize amount of dailies that are needed to load. As an alternative, one can perform weekly reload as soon as the new weekly dump is ready.
  8. Repool: HOME=/root sudo pool
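The daily-diff sequence in step 7 can be sketched as a loop (echo-only: it prints the commands instead of running them, so the output can be reviewed before being piped to sh). The dates are the example values from the list above, and GNU date is assumed:

```shell
# Print the loadCategoryDaily.sh invocations from the day after the weekly
# dump up to an end date. Nothing is executed; the commands are only printed.
weekly=20181006    # date of the weekly dump
end=20181008       # last daily diff to load
ts=$(date -d "$weekly +1 day" +%Y%m%d)
echo "loadCategoryDaily.sh $ts fromDump${weekly}-"   # note the trailing dash
while [ "$ts" -lt "$end" ]; do
    ts=$(date -d "$ts +1 day" +%Y%m%d)
    echo "loadCategoryDaily.sh $ts"
done
```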

Data transfer procedure

Transferring data between nodes is typically faster than recovering from a dump. Port 9876 is open between the wdqs nodes of the same cluster for that purpose. The procedure is as follows.

  1. depool source and destination nodes: sudo HOME=/root depool
  2. shutdown wdqs-blazegraph and wdqs-updater on both source and destination hosts
  3. on destination node: nc -l -p 9876 | pigz -c -d | tee >( sha256sum > /dev/stderr ) | pv -b -r > /srv/wdqs/wikidata.jnl
  4. on source node: cat /srv/wdqs/wikidata.jnl | tee >( sha256sum > /dev/stderr ) | pigz -c | nc -w 3 <destination_fqdn> 9876
  5. verify the transfer with sha256sum
  6. copy /srv/wdqs/aliases.map to the destination node
  7. on destination node: touch /srv/wdqs/data_loaded
  8. on destination node: sudo chown blazegraph: /srv/wdqs/wikidata.jnl
  9. restart wdqs-blazegraph and wdqs-updater on both source and destination
  10. pool source and destination nodes: sudo HOME=/root pool

For copying the categories instance data, use the same procedure with the following changes:

  • The file name is categories.jnl
  • The instance that needs to be stopped/restarted is wdqs-categories (no need to touch the updater)

Updating federation whitelist

Manually updating entities

It is possible to update a single entity or a number of entities on each server, in case data gets out of sync. The command to do it is:

 cd /srv/deployment/wdqs/wdqs; bash runUpdate.sh -n wdq -N -S -- -b 500 --ids Q1234 Q5678 ...

In order to do it on all servers at once, commands like pssh can be used:

 pssh -t 0 -p 20 -P -o logs -e elogs -H "$SERVERS" "cd /srv/deployment/wdqs/wdqs; bash runUpdate.sh -n wdq -N -S -- -b 500 --ids $*"

Where $SERVERS contains the list of servers to update. Note that since this is done via the command line, updating larger batches of IDs will require some scripting to split them into manageable chunks. Doing bigger updates at a moderate pace, with pauses so as not to interfere with regular updates, is recommended.
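The chunking mentioned above can be done with xargs; chunked_updates is a hypothetical helper that only prints the commands, so the output can be reviewed and then fed to sh or to the pssh invocation above:

```shell
# Split a file of item IDs (one per line) into batches of 50 and print one
# runUpdate.sh command per batch. Printing only; nothing is executed.
chunked_updates() {
    xargs -n 50 echo bash runUpdate.sh -n wdq -N -S -- -b 500 --ids < "$1"
}
```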

Updating IDs by timeframe

Sometimes, due to a malfunction, a segment of updates for a certain time period gets lost. If it is a recent segment, the Updater can be reset to start from a certain timestamp using --start TIMESTAMP --init (you have to shut down the regular updater, reset the timestamp, and then start it again). If the missed segment is further in the past, the best way is to fetch the IDs that were updated in that period using the Wikidata recentchanges API, and then update those IDs as described above.

An example of such a script can be found here: https://phabricator.wikimedia.org/P8919. The output should be filtered and duplicates removed, then fed to a script that calls the update command as above.
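A minimal sketch of the ID-extraction step, assuming the standard action=query&list=recentchanges JSON response shape; extract_ids is a hypothetical name, and a JSON-aware tool such as jq would be more robust than this sed-based filter:

```shell
# Pull Q-item titles out of a recentchanges API response and de-duplicate
# them (crude text extraction; assumes the default JSON attribute layout).
extract_ids() {
    tr ',' '\n' | sed -n 's/.*"title":"\(Q[0-9][0-9]*\)".*/\1/p' | sort -u
}
# Typical use against the live API (rcstart/rcend and paging omitted):
# curl -s 'https://www.wikidata.org/w/api.php?action=query&list=recentchanges&rcnamespace=0&format=json' | extract_ids
```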

Updating value or reference nodes

Since value (wdv:hash) and reference (wdref:hash) nodes are supposed to be immutable, they are not touched by updates to items that use them. To fix such nodes, delete all triples with these nodes as the subject (via SPARQL DELETE through production access), then trigger an update (as above) for items which reference those nodes, so they will be recreated; one item per node is sufficient.
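A sketch of the cleanup step, using the same local SPARQL endpoint as the lexeme-load step above. The hash is a placeholder, and the wdv: prefix expands to http://www.wikidata.org/value/ (use http://www.wikidata.org/reference/ for wdref: nodes):

```shell
# Build a SPARQL DELETE for one stale value node and (commented out) POST it
# to the local Blazegraph endpoint. "deadbeef0123" is a placeholder hash.
node_hash="deadbeef0123"
query="DELETE WHERE { <http://www.wikidata.org/value/${node_hash}> ?p ?o }"
echo "$query"
# On the server, with production access:
# curl -XPOST --data-urlencode "update=$query" http://localhost:9999/bigdata/namespace/wdq/sparql
```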

Issues

Scaling strategy

Wikidata_query_service/ScalingStrategy

Contacts

If you need more info, talk to User:Smalyshev, User:Gehel or anybody from mw:Discovery team.

Usage