Wikidata query service/Runbook

Wikidata Query Service component diagram

Wikidata Query Service (WDQS) is the Wikimedia implementation of a SPARQL server, based on the Blazegraph engine, serving queries for Wikidata and other data sets. The goal of this runbook is to provide short instructions on what to do in case of issues with the service. For more detailed descriptions, see the User Manual, Wikidata query service and Wikidata Query Service/Implementation.

Clusters

We operate 3 separate WDQS clusters, with different expectations in terms of availability and stability of response times. All clusters expose a read-only interface. Writes are managed exclusively internally, with WDQS pulling data from different sources (mainly from Wikidata):

  • wdqs-internal: The cluster used by other internal services, mainly MediaWiki. This cluster should be highly available, with predictable response times.
  • wdqs: The public-facing SPARQL endpoint. This cluster is used by external applications and researchers. While it does see "real user traffic", due to the nature of a SPARQL endpoint we expect lower availability and less predictable response times on this cluster. Users of this cluster should be aware of those limitations.
  • wdqs-test: These 2 servers are used for testing and should see no "real user traffic".

Main Services

  • wdqs-blazegraph: the main Blazegraph instance
  • wdqs-categories: Blazegraph instance dedicated to categories
  • wdqs-updater: updates the main Blazegraph instance from Wikidata (see the quick check below)
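
A quick way to confirm that all three units are running on a host is to query systemd directly (a minimal sketch; the unit names are those listed above):

systemctl status wdqs-blazegraph wdqs-categories wdqs-updater
# one-line state per unit:
systemctl is-active wdqs-blazegraph wdqs-categories wdqs-updater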

Main dashboards

Known limitations

Exposing a SPARQL endpoint to the internet is similar to exposing an SQL endpoint. User-generated load can vary widely in per-request complexity, causing widely varying performance. Throttling is applied based on overall computing time and error rate to protect against overly eager clients. Throttling is bucketed by user-agent and IP address. An abusive client which does not back off when receiving HTTP 429 responses will be banned for a longer duration, during which all its requests will be denied with an HTTP 403.
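
For reference, a well-behaved client checks the HTTP status code and backs off when throttled; a minimal sketch of such a check (the query and user-agent string are illustrative, not a recommended client implementation):

# Send one query and report whether we are being throttled (429) or banned (403)
status=$(curl -s -o /dev/null -w '%{http_code}' -G \
  -H 'User-Agent: ExampleBot/1.0 (ops@example.org)' \
  --data-urlencode 'query=SELECT ?s WHERE { ?s ?p ?o } LIMIT 1' \
  https://query.wikidata.org/sparql)
echo "HTTP status: ${status}"    # 429 means slow down; 403 means the client has been banned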

Further analysis

Before restarting wdqs-blazegraph or wdqs-updater, please collect multiple stack traces for further analysis:

# Capture 10 thread dumps, 5 seconds apart, from the Blazegraph JVM (<pid>)
for i in $(seq 1 10); do
  sudo -u blazegraph jcmd <pid> Thread.print > thread.${i}
  sleep 5
done
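
The <pid> of the Blazegraph process can be looked up from systemd; a minimal sketch (assuming the wdqs-blazegraph unit above):

systemctl show -p MainPID wdqs-blazegraph    # prints MainPID=<pid>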

Known issues

Update lag

Synchronisation from Wikidata (and other data sources) can be viewed on Grafana. Common causes of lag are:

  • Too much write load (too many edits on Wikidata)
  • Too much read load on the WDQS systems, competing for I/O with the writes. You can check the queries currently running by going to http://localhost:9999/bigdata/status?showQueries=details (it is an HTML page, so SSH port forwarding to a local browser is a good idea; see the sketch after this list).
  • Updater process crashed or stuck. Check the update progress with journalctl -u wdqs-updater -f. It should report progress regularly, about once a minute. The progress report looks like this:
    Jun 25 22:15:20 wdqs1003 bash[15223]: 22:15:20.654 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-06-25T22:15:16Z at (7.1, 5.4, 5.3) updates per second and (1453.4, 1087.6, 1018.1) milliseconds per second
    Watch the timestamp and the milliseconds-per-second figures: the timestamp should be reasonably close (within a minute) to the current time, unless the Updater is catching up from lag, and the ms/s number should be close to 1000. If it is significantly lower, the server is overloaded and cannot catch up with updates. High numbers in updates/s can indicate high write load.
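
For the status page mentioned above, SSH port forwarding makes it viewable in a local browser; a minimal sketch (the host name is an example, substitute the server you are investigating):

ssh -L 9999:localhost:9999 wdqs1003.eqiad.wmnet
# then open http://localhost:9999/bigdata/status?showQueries=details locally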

Remediation

  • restart the updater process: service wdqs-updater restart
  • restart the blazegraph service: service wdqs-blazegraph restart. Note that the Updater can recover from a Blazegraph restart automatically, but if it is stuck, restarting it after the Blazegraph restart is recommended.
  • depool the problematic server until it catches up on updates, thus reducing the competition on I/O (see the sketch after this list)
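
Depooling is done on the host itself with the standard conftool wrappers; a minimal sketch (assuming the usual pool/depool helpers are installed on WDQS hosts):

sudo depool    # take the server out of rotation
# ... wait for the update lag to recover, then:
sudo pool      # put the server back into rotation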

Issues with Kafka / Recent Change pollers

WDQS supports 2 mechanisms for getting updates from Wikidata: one based on Kafka, one based on the Recent Changes API. In case of issues with one of those backends, switching to the other might help.

Procedure

The updater used is configured in Puppet, via the profile::wdqs::use_kafka_for_updates hiera variable (see 466722 for an example). The updater needs to be restarted for the change to be activated (service wdqs-updater restart). The Kafka based poller stores Kafka offsets in Blazegraph; those need to be reset when switching from Kafka to Recent Changes, to ensure that the Kafka poller will start from the right place when re-enabled (see Resetting Kafka Offsets for details). The following command needs to be run on each server where the poller configuration is changed:

curl -X POST --data-urlencode update@reset_kafka.sparql http://localhost:9999/bigdata/namespace/wdq/sparql

Where reset_kafka.sparql contains this query:

DELETE {
  ?z rdf:first ?head ; rdf:rest ?tail .
}
WHERE {
  [] wikibase:kafka ?list .
  ?list rdf:rest* ?z .
  ?z rdf:first ?head ; rdf:rest ?tail .
} ;
DELETE WHERE {
  <https://www.wikidata.org> wikibase:kafka ?o .
};
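
For reference, the Puppet side of the switch is a single boolean hiera key; a minimal sketch of the hiera change (the exact file and scope depend on the role/host, see 466722):

# hieradata (illustrative location): switch from Kafka to the Recent Changes poller
profile::wdqs::use_kafka_for_updates: false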

Timeouts

Sadly, the public WDQS cluster is subject to varying performance depending on user-generated load. Throttling should protect it, but this is an imperfect solution. Logstash provides a view of the different clients being throttled. So far, we have seen this issue only on the public WDQS cluster. The wdqs-categories service is usually not timing out. Overload is due to a combination of read and write load exceeding the current capacity.

Remediation

  • restart the main Blazegraph service: systemctl restart wdqs-blazegraph
  • restart the Blazegraph categories service: systemctl restart wdqs-categories (usually not necessary)
  • identify the worst offending client and ban its user agent (see 518691 for an example)
  • shut down wdqs-updater: systemctl stop wdqs-updater (this will obviously increase replication lag, but will protect read traffic)

Server de-synchronization

Ideally, all servers should have the same data. However, due to occasional glitches, the data on one server can differ from the data on another. This can happen briefly due to update lag; but if the data is old and still not in sync, query results can be inconsistent between servers, with the same query returning different results at different times. You can see which server returned a result by inspecting the x-served-by header in the response.
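
A quick way to see which backend served a given response is to dump the response headers; a minimal sketch (the query is illustrative):

curl -s -G -D - -o /dev/null \
  --data-urlencode 'query=SELECT ?s WHERE { ?s ?p ?o } LIMIT 1' \
  https://query.wikidata.org/sparql | grep -i x-served-by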

Remediation

In the case of servers being out of sync, the best way is to update the affected entities manually. Of course, the exact entities depend on the nature of the query, so they can be tricky to find. But if the entities are known, run this on each server:

cd /srv/deployment/wdqs/wdqs; bash runUpdate.sh -n wdq -N -- --ids Q1234,Q345,Q45567

where the argument to --ids is the list of affected entity IDs. This can be done at any time without affecting the function of the server.
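
If several servers need the same fix, the command can simply be run over SSH on each of them; a minimal sketch (the host names and entity IDs are illustrative):

# Re-fetch the affected entities on every out-of-sync server
for host in wdqs1004.eqiad.wmnet wdqs1005.eqiad.wmnet; do
  ssh "$host" 'cd /srv/deployment/wdqs/wdqs && bash runUpdate.sh -n wdq -N -- --ids Q1234,Q345,Q45567'
done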

Logstash Spamming

Due to improper sanitization of user requests, Logstash can become overwhelmed with large volumes of logging errors from logback. An example of such an error is: Invalid format for the WKT value: value=11.576,48.216805555556. These errors can be seen here: https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=h@89f0c75&_a=h@8f80f8c

Remediation

Once the specific logger generating the spam is identified, it can be disabled by adding a line like <logger name="<logger_name>" level="off"/> to the logback config for WDQS and submitting the patch for review. The logback configuration is automatically reloaded, so no service restart is required. An example of a submitted patch is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/468958.
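
For illustration, the disabled logger sits alongside the existing logger definitions in the logback configuration; a minimal sketch (the surrounding structure is an assumption, and <logger_name> must be replaced with the logger identified from the Logstash errors):

<configuration scan="true">
  <!-- ... existing appenders and loggers ... -->
  <!-- silence the logger that is spamming Logstash -->
  <logger name="<logger_name>" level="off"/>
</configuration>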

WDQS Updater service crashing due to memory error

Larger than usual updates can make the wdqs-updater service run out of memory. An example can be seen here: https://phabricator.wikimedia.org/T210235. OutOfMemoryErrors are only sent to journald, not to the logback log files. UpdateBuilder.bindValue() running out of memory is a good indication of this issue:

Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.util.Arrays.copyOf(Arrays.java:3332)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:472)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.StringBuffer.append(StringBuffer.java:310)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.StringBuffer.append(StringBuffer.java:97)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:486)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.StringBuffer.append(StringBuffer.java:338)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.util.regex.Matcher.appendReplacement(Matcher.java:890)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.util.regex.Matcher.replaceAll(Matcher.java:955)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.String.replace(String.java:2240)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bindValue(UpdateBuilder.java:40)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.rdf.RdfRepository.syncFromChanges(RdfRepository.java:308)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Updater.handleChanges(Updater.java:214)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Updater.run(Updater.java:129)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Update.run(Update.java:146)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Update.main(Update.java:81)
Nov 23 11:28:14 wdqs1009 systemd[1]: wdqs-updater.service: Main process exited, code=exited, status=1/FAILURE

Remediation

Temporarily reducing the batch size (-b parameter) should allow the large updates to go through. Check phab:T210235 for more details.
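
As a one-off, the updater can also be run by hand with a smaller batch size to push the problematic changes through; a minimal sketch (the batch size value is illustrative; extra arguments after -- are passed to the Updater, as in the resynchronisation example above):

systemctl stop wdqs-updater
cd /srv/deployment/wdqs/wdqs && bash runUpdate.sh -n wdq -- -b 100   # interrupt once it has caught up
systemctl start wdqs-updater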

Overload due to high edit rate

In some cases, a high edit rate will push the load on the service to a point where it cannot keep up with both the read and the write load. In that case, shutting down the wdqs-updater service will provide some breathing room, at the cost of no updates being applied to that cluster.
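
A minimal sketch of that trade-off (stop the updater during the spike, restart it once the read load has subsided; the lag then catches up on its own):

systemctl stop wdqs-updater     # sheds write load; update lag will grow
# ... once read load is back to normal ...
systemctl start wdqs-updater    # resumes updates and lets the lag catch up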