Wikidata Query Service/Runbook

From Wikitech
Jump to navigation Jump to search
Wikidata Query Service component diagram

Wikidata Query Service is the Wikimedia implementation of SPARQL server, based on Blazegraph engine, to service queries for Wikidata and other data sets. This "runbook" goal is to provide short instructions of what to do in case of issue with the service. Please see more detailed description in the User Manual, Wikidata Query Service and Wikidata Query Service/Implementation.

For issues/alerts related to the WDQS Streaming Updater see Wikidata Query Service/Streaming Updater#Runbooks.

Clusters

We operate 3 separate WDQS clusters, with different expectations in term of availability and stability of response times. All clusters only expose a readonly interface. Writes are exclusively managed internally, with WDQS pulling data from different sources (mainly from Wikidata):

  • wdqs-internal: The cluster used by other internal service, mainly mediawiki. This cluster should be highly available with predictable response times.
  • wdqs: The public facing SPARQL endpoint. This cluster is used by external applications and researchers. While it does see "real user traffic", due to the nature of a SPARQL endpoint, we do expect a lower availability and more random response times on this cluster. The users of this cluster should be aware of those limitations.
  • wdqs-test: Those 2 servers are used for testing and should see no "real user traffic".

Main Services

  • wdqs-blazegraph: main blazegraph instance
  • wdqs-categories: blazegraph instance dedicated to categories
  • wdqs-updater: updates the main blazegraph instance from wikidata

Main dashboards

Known limitations

Exposing a SPARQL endpoint to the internet is similar to exposing an SQL endpoint. The user generated load can vary widely in the complexity per request and cause widely varying performances. Throttling is applied based on overall computing time and error rate to protect from overly eager clients. The throttling is bucketed by user-agent and IP address. An abusive client which does not rate limit when receiving HTTP 429 responses will be banned for a longer duration where all its requests will be denied with an HTTP 403.

Further analysis

Before restarting wdqs-blazegraph or wdqs-updater, please collect multiple stack traces for further analysis:

for i in `seq 1 10`;
do
  sudo -u blazegraph jcmd <pid> Thread.print > thread.${i}
  sleep 5
done

Known issues

Update lag

Synchronisation from wikidata (and other data sources) can be viewed on Grafana. Common cause of lag are:

  • Too much write load (too many edits on Wikidata)
  • Too much read load on the WDQS systems, competing for I/O with the writes. You can check the queries currently running by going to http://localhost:9999/bigdata/status?showQueries=details (it's an HTML page so ssh port forwarding to local browser may be a good idea).
  • Updater process crashed or stucked. Check the update progress logs in /var/log/wdqs/wdqs-updater.log. It should report progress regularly, about once a minute or so. The progress report looks like this: Jun 25 22:15:20 wdqs1003 bash[15223]: 22:15:20.654 [main] INFO org.wikidata.query.rdf.tool.Updater - Polled up to 2018-06-25T22:15:16Z at (7.1, 5.4, 5.3) updates per second and (1453.4, 1087.6, 1018.1) milliseconds per second Watch the bold items - the time should be reasonably close (within minute) to current time, unless Updater is catching up from the lag, and the ms/s number should be close to 1000. If it's significantly less, the server is overloaded and can't catch up with updates. High numbers in updates/s can indicate high write load.


Remediation

  • restarting the updater process: service wdqs-updater restart
  • restart of the blazegraph service: service wdqs-blazegraph restart . Note that Updater can recover from Blazegraph restart automatically, but if it's stuck, restarting it after Blazegraph restart is recommended.
  • depooling the problematic server until it catches up on updates (thus reducing the competition on I/O)

Blazegraph deadlock

Occasionally a query will cause the blazegraph process to become deadlocked, rendering it unresponsive. As a result, blazegraph - which functions as the backend for wdqs - will fail to respond, leading to nginx (the frontend) reporting a timeout (as its attempt to talk to blazegraph on the other end has failed).

The cause of deadlocked queries is unknown; it seems to be more likely with queries of higher complexity, but there is no single type of query that can cause the issue.

Symptoms:

  • The graphs Triples, Lag, Queries per seconds and some others stop reporting data for the affected node in the wikidata-query-service graphana dashboard.
  • Blazegraph is unreachable and might die with a OOM at some point

Remediation

If it affects a single node:

  • restart blazegraph on this node and depool it if the lag is too high (>1hour)

If it affect multiple nodes: The main strategy is to take short-term steps to eliminate the pain while working to locate any offenders that are slamming WDQS with large volumes of heavy complexity (read: lots of computation required) queries.

The short-term steps are to restart blazegraph on the affected node(s). This will unstick the process, but it can become deadlocked again from subsequent queries, thus why it is necessary to identify the offender(s). Sometimes the nodes fall back into deadlock so soon after restart that it stops being worth it to do the restarts; in those scenarios focus all attention on finding whose queries are causing the deadlock.

Below we will describe what to do in more detail when presented with a "query of death scenario" which ultimately is triggering blazegraph deadlock.

Identifying which nodes are operational

For Blazegraph deadlock issues (which is the result of a "query of death"), degree of outage can be viewed in the RDF Triples graph: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m&var-cluster_name=wdqs

Nodes that are reporting null values for Triples are locked up and unable to respond to queries.

Identifying the user agent

In a "query of death" scenario, there's generally a certain query type that reliably correlates with the problem.

For example, in the Sep 2 2020 incident, the root cause appeared to be a malformed query that triggered (unsurprisingly) a MalformedQueryException.

In the July 23 2020 incident, the root cause was a poorly performing (large) query.

Kibana (the web frontend for Elasticsearch) is ideally suited to parse logs and perform basic operations like summing, visualizations, etc.

TIP: When adding filters (to pare the log messages down to just the relevant ones), Kibana can sometimes get very laggy, especially if there's a lot of filter options.

To work-around this, after you click Add a filter +, click Edit Query DSL to edit the DSL directly rather than use the graphical interface.

A screenshot of Kibana's query filter view, used to illustrate the process for adding a filter via the Query DSL

Here's an example of a simple filter that just makes sure the field stack_trace contains the string cause=java.util.concurrent.ExecutionException: org.openrdf.query.MalformedQueryException.

(See https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-filter-context.html#query-filter-context-ex for crafting more advanced query filters manually)

{

 "query": {
   "bool": {
     "must": [
       { "match": { "stack_trace":   "cause=java.util.concurrent.ExecutionException: org.openrdf.query.MalformedQueryException" }}
     ]
   }
 }

}


See https://logstash.wikimedia.org/goto/d386fbdc24d9f51c902b1bf50077558a for an example dashboard which filters only for MalformedQueryException (note: requires access to production logs)

Banning the user agent

Once you've isolated the queries causing blazegraph to lock up in the first place, (as opposed to queries which are innocently getting timed out *because* blazegraph has already locked up), you should ban either the user agent or the source IP address in question. Generally a user agent ban is better because the queries might be coming from an application which relies on, say, cloud ec2 instances, in which case the IP address itself is subject to change. Use your judgement.

User agents can be banned at the nginx level like so: https://github.com/wikimedia/puppet/commit/d683630d32855fbe6b167d4d215d87d5aca61366

Add the ban to the nginx config, and run puppet across the relevant wdqs instances, which should automatically reload nginx on completion (make sure to check via sudo systemctl status nginx.service)

Cleanup

Once the correct user agent(s) have been banned, restart blazegraph on all affected nodes.

Take a mental note to circle back in the future to unblock any users whose user agents were banned. (Do this during a time where you'll be around to monitor and make sure the deadlock condition does not re-surface)

Issues with Kafka / Recent Change pollers

WDQS supports 2 mechanism for getting updates from Wikidata. One based on Kafka, one based on the Recent Changes API. In case of issues with one of those backends, switching to the other might help.

Procedure

The updater used if configured in Puppet, via the profile::wdqs::use_kafka_for_updates hiera variable (See 466722 for an example). The updater needs to be restarted for the change to be activated (service wdqs-updater restart). The Kafka based poller stores kafka offsets in Blazegraph, those need to be reset when switching from Kafka to Recent Changes to ensure that Kafka Poller will start from the right place when re-enabled (see Resetting Kafka Offsets for details). The following commands needs to be run on each server where the poller configuration is changed:

curl -X POST --data-urlencode update@reset_kafka.sparql http://localhost:9999/bigdata/namespace/wdq/sparql

Where reset_kafka.sparql contains this query:

DELETE {
?z rdf:first ?head ; rdf:rest ?tail .
}
WHERE {
  [] wikibase:kafka ?list .
  ?list rdf:rest* ?z .
  ?z rdf:first ?head ; rdf:rest ?tail .
} ;
DELETE WHERE {
  <https://www.wikidata.org> wikibase:kafka ?o .
};

Timeouts

Sadly, the public WDQS cluster is subject to varying performances depending on user generated load. Throttling should protect it but this is an imperfect solution. Logstash provides a view of the different clients being throttled. So far, we have seen this issue only on the public WDQS cluster. The wdqs-categories service is usually not timing out. Overload is due to a combination of read and write load exceeding the current capacity.

Remediation

  • restart of the blazegraph main service: systemctl restart wdqs-blazegraph
  • restart of the blazegraph categories service: systemctl restart wdqs-categories (usually not necessary)
  • find the most offending client and ban its user agent (see 518691 as an example).
  • shutdown wdqs-updater: systemctl stop wdqs-updater (this will obviously increase replication lag, but will protect read traffic).

Server de-synchronization

Ideally, all servers should have the same data. However, it can happen, due to some glitches, that the data on one server differs from the data on another server. This can happen for a brief period due to update lags, however if the data is old and still in not in sync, this can lead to query results being inconsistent between servers and thus the same query returning different result at different time. You can see which server returned the result by inspecting x-served-by header in the response.

Remediation

In the case of servers being out of sync, the best way is to update the affected entities manually. Of course the exact entities depend on the nature of the query, so it can be tricky to find out. But if the entities are known, run this on each server:

cd /srv/deployment/wdqs/wdqs; bash runUpdate.sh -n wdq -N -- --ids Q1234,Q345,Q45567

where argument to --ids is the IDs of the affected entities. This can be done at any time without affecting the function of the server.

Logstash Spamming

Due to improper sanitization of user requests, logstash can become overwhelmed with loads of logging errors from logback. An example of this error is: Invalid format for the WKT value: value=11.576,48.216805555556. These errors can be seen here: https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=h@89f0c75&_a=h@8f80f8c

Remediation

Once the specific logger that is generating the spam is identified, that specific logger can be disabled by adding this line <logger name="<logger_name>" level="off"/> to logback config for wdqs and submit the patch for review. The logback configuration is automatically reloaded, no service restart is required. An example of a submitted patch is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/468958.

WDQS Updater service crashing due to memory error

Larger than usual updates can make wdqs-updater service run out of memory. An example can be seen here: https://phabricator.wikimedia.org/T210235. OutOfMemoryErrors are only sent to journald, not to logback log files. UpdateBuilder.bindValue() running out of memory is a good indication of that issue:

Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.util.Arrays.copyOf(Arrays.java:3332)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:472)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.StringBuffer.append(StringBuffer.java:310)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.StringBuffer.append(StringBuffer.java:97)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:486)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.StringBuffer.append(StringBuffer.java:338)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.util.regex.Matcher.appendReplacement(Matcher.java:890)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.util.regex.Matcher.replaceAll(Matcher.java:955)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.String.replace(String.java:2240)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bindValue(UpdateBuilder.java:40)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.rdf.RdfRepository.syncFromChanges(RdfRepository.java:308)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Updater.handleChanges(Updater.java:214)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Updater.run(Updater.java:129)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Update.run(Update.java:146)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Update.main(Update.java:81)
Nov 23 11:28:14 wdqs1009 systemd[1]: wdqs-updater.service: Main process exited, code=exited, status=1/FAILURE

Remediation

Temporarily reducing the batch size (-b parameter) should allow the large updates to go through. Check task T210235 for more details.

Overload due to high edit rate

In some cases, high edit rate will push the load on the service to a point where it can't keep up with both the read and the write load. In that case, shutting down the wdqs-updater service will provide some breathing room, at the cost of no updates being applied to that cluster.

Categories update lag

Categories are updated weekly from Mediawiki dumps. Due to the delay between dumps being created and categories being reloaded, the lag can be larger than a week. Updates are done through loadCategoriesDaily.sh script, run as a cron. Logs are in /var/log/wdqs/reloadCategories.log. Icinga alerts might be too aggressive, in case of alert, re-scheduling the check might be enough to fix it.

Note that the script that is used for reloads, forAllCategoryWikis.sh, will skip wikis that are failing and move to the next ones - you can find which wikis failed by looking for similar logs -

ERROR:loadCategoriesDaily.sh - Failed for wowiktionary, moving on to the next entry...

Free allocators decrease rapidly

BlazegraphFreeAllocatorsDecreasingRapidly

Blazegraph may start burning allocators too rapidly causing the journal to grow too quickly. It might be due:

  • A data-reload is in progress, this alert should have been silenced but check in SAL logs and ping mw:Search_Platform folks, a data-reload is visible on the number of triples as it is likely to be below other machines but also on the affect machine by seeing curl requests like:

root 21321 13356 0 14:27 ? 00:00:00 curl --silent --show-error --retry 10 --retry-delay 30 --retry-connrefused -XPOST --data-binary update=LOAD <file:///srv/wdqs/munged/wikidump-000000295.ttl.gz> http://localhost:9999/bigdata/namespace/wdq/sparql

  • The machine is catching-up on lag, verify that the two graphs are correlated and possibly adjust the alert. Wait for the lag to catchup and see if this alert resolves by itself.
  • Blazegraph is misbehaving, this is the scenario where action is required, if the service looks OK: no lag, no reload:
    • blazegraph must be restarted to stop the bleeding: systemctl restart wdqs-blazegraph
    • if the bleeding continues the next is likely journal corruption and disk full, the solution is to run the data transfer cookbook from a sane source.