Wikidata Query Service/Runbook

Wikidata Query Service component diagram

Wikidata Query Service is the Wikimedia implementation of a SPARQL server, based on the Blazegraph engine, serving queries for Wikidata and other data sets. The goal of this runbook is to provide short instructions on what to do in case of issues with the service. For more detailed descriptions, see the User Manual, Wikidata Query Service and Wikidata Query Service/Implementation.

For issues/alerts related to the WDQS Streaming Updater see Wikidata Query Service/Streaming Updater#Runbooks.

Clusters

We operate 3 separate WDQS clusters, with different expectations in terms of availability and stability of response times. All clusters expose only a read-only interface. Writes are managed exclusively internally, with WDQS pulling data from different sources (mainly Wikidata):

  • wdqs-internal: The cluster used by other internal services, mainly MediaWiki. This cluster should be highly available, with predictable response times.
  • wdqs: The public-facing SPARQL endpoint. This cluster is used by external applications and researchers. While it does see "real user traffic", due to the nature of a SPARQL endpoint we expect lower availability and less predictable response times on this cluster. Users of this cluster should be aware of those limitations.
  • wdqs-test: These two servers are used for testing and should see no "real user traffic".

Main Services

  • wdqs-blazegraph: main blazegraph instance
  • wdqs-categories: blazegraph instance dedicated to categories
  • wdqs-updater: updates the main blazegraph instance from wikidata

Main dashboards

Known limitations

Exposing a SPARQL endpoint to the internet is similar to exposing an SQL endpoint. The user-generated load can vary widely in complexity per request and cause widely varying performance. Throttling is applied based on overall computing time and error rate to protect the service from overly eager clients. Throttling is bucketed by user agent and IP address. An abusive client that does not back off when receiving HTTP 429 responses will be banned for a longer duration, during which all of its requests are denied with an HTTP 403.
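
As an illustration of the expected client behaviour, a well-behaved client backs off when it receives HTTP 429 instead of retrying immediately. The sketch below assumes the 429 response carries a Retry-After header (in seconds); the user agent string and the 60 second fallback are arbitrary examples.

# Sketch of a well-behaved client: back off on HTTP 429 instead of hammering the endpoint.
# Assumes the throttling layer sets a Retry-After header (seconds) on 429 responses.
url='https://query.wikidata.org/sparql'
headers=$(mktemp)
while true; do
  status=$(curl -sG -D "$headers" -o /dev/null -w '%{http_code}' \
                -A 'example-client/1.0 (ops@example.org)' \
                --data-urlencode 'query=ASK {}' "$url")
  [ "$status" != "429" ] && break                          # not throttled: stop retrying
  wait=$(awk 'tolower($1)=="retry-after:" {print $2+0}' "$headers")
  sleep "${wait:-60}"                                      # honour Retry-After, fall back to 60s
done
rm -f "$headers"
echo "final HTTP status: ${status}"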

Further analysis

Before restarting wdqs-blazegraph or wdqs-updater, please collect multiple stack traces for further analysis:

# collect 10 thread dumps, 5 seconds apart, as the blazegraph user
# (<pid> is the PID of the wdqs-blazegraph or wdqs-updater process)
for i in $(seq 1 10); do
  sudo -u blazegraph jcmd <pid> Thread.print > thread.${i}
  sleep 5
done
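
The <pid> above is the PID of the service being investigated. A quick way to find it (a minimal sketch using standard systemd tooling):

# main PID of the service, as tracked by systemd
systemctl show -p MainPID wdqs-blazegraph
systemctl show -p MainPID wdqs-updater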

Known Remediations

Resolving blazegraph deadlock

Remediation

If it affects a single node:

  • restart Blazegraph on this node and depool it if the lag is too high (>1 hour); see the sketch below
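
A minimal sketch of the single-node remediation. It assumes the usual conftool pool/depool wrapper scripts are available on the host; run on the affected node:

sudo depool                              # stop serving traffic while the node catches up (conftool wrapper, assumed present)
sudo systemctl restart wdqs-blazegraph   # restart the main Blazegraph instance
# repool once the update lag has recovered
sudo pool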

If it affects multiple nodes: the main strategy is to take short-term steps to relieve the pain while working to locate any offenders that are hammering WDQS with large volumes of computationally heavy queries.

The short-term step is to restart Blazegraph on the affected node(s). This will unstick the process, but it can become deadlocked again by subsequent queries, which is why it is necessary to identify the offender(s). Sometimes the nodes fall back into deadlock so soon after a restart that repeated restarts stop being worth it; in those scenarios, focus all attention on finding whose queries are causing the deadlock.

Below we describe in more detail what to do when presented with a "query of death" scenario, which ultimately triggers a Blazegraph deadlock.

Identifying which nodes are operational

For Blazegraph deadlock issues (which are the result of a "query of death"), the degree of outage can be viewed in the RDF Triples graph: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m&var-cluster_name=wdqs

Nodes that are reporting null values for Triples are locked up and unable to respond to queries.
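
Besides the dashboard, a quick per-host check can confirm whether a given Blazegraph instance still answers queries. This is only a sketch; the endpoint path is the one used elsewhere on this page, and the 10 second timeout is arbitrary:

# a locked-up Blazegraph typically hangs until the timeout instead of answering this trivial query
curl -s -o /dev/null --max-time 10 --data-urlencode 'query=ASK {}' \
     http://localhost:9999/bigdata/namespace/wdq/sparql \
  && echo "responsive" || echo "no response within 10s"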

Identifying Abusive Traffic

Webrequest Live Method (user agent, query string, IPs, autonomous system numbers, etc.)
Turnilo

The following dashboard breaks down webrequests_live_128 by Time to First Byte. The intention is to filter for requests that are taking an exceedingly long time to finish. Once filtered as such, we can group by variables like AS Number to get an idea of where the bulk of the failing requests are coming from. This should allow the operator(s) to rapidly home in on the source of the problem.

The selected filters / splits are intended to be a starting point, but of course each incident is different so it is expected that the operator will fiddle with the values as needed when investigating a given incident.

Dashboard

In the past Turnilo has been a bit easier to work with, but Superset can be used as well:

Superset

https://superset.wikimedia.org/superset/dashboard/449 (quick links: geo time chart, request tables chart)

https://superset.wikimedia.org/superset/dashboard/454

Kafkacat method (user agent)

From any stats machine:

 kafkacat -b kafka-jumbo1001.eqiad.wmnet -t eqiad.wdqs-external.sparql-query -C -o end 2>/dev/null  | jq '.http.request_headers."user-agent"'

This is a crude method and might produce too much info to be useful.
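
To get a rough ranking instead of a raw stream, the same topic can be sampled for a short window and aggregated by user agent (a sketch; the 60 second window and top-20 cutoff are arbitrary):

# sample the external SPARQL query stream for 60s and count requests per user agent
timeout 60 kafkacat -b kafka-jumbo1001.eqiad.wmnet -t eqiad.wdqs-external.sparql-query -C -o end 2>/dev/null \
  | jq -r '.http.request_headers."user-agent"' \
  | sort | uniq -c | sort -rn | head -20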


Logstash Method

Note: This method may be a bit outdated. We recommend starting with the above Turnilo method and then trying this method if that approach doesn't bear fruit.

In a "query of death" scenario, there's generally a certain query type that reliably correlates with the problem.

For example, in the Sep 2 2020 incident, the root cause appeared to be a malformed query that triggered (unsurprisingly) a MalformedQueryException.

In the July 23 2020 incident, the root cause was a poorly performing (large) query.

Logstash is ideally suited to parse logs and perform basic operations like summing, visualizations, etc.

TIP: When adding filters (to pare the log messages down to just the relevant ones), the OpenSearch Dashboards interface can get very laggy, especially if there are a lot of filter options.

To work around this, after you click Add a filter +, click Edit Query DSL to edit the DSL directly rather than using the graphical interface.

A screenshot of Logstash view, used to illustrate the process for adding a filter via the Query DSL

Here's an example of a simple filter that just makes sure the field stack_trace contains the string cause=java.util.concurrent.ExecutionException: org.openrdf.query.MalformedQueryException.

(See https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-filter-context.html#query-filter-context-ex for crafting more advanced query filters manually)

{
  "query": {
    "bool": {
      "must": [
        { "match": { "stack_trace":   "cause=java.util.concurrent.ExecutionException: org.openrdf.query.MalformedQueryException" }}
      ]
    }
  }
}

See https://logstash.wikimedia.org/goto/d386fbdc24d9f51c902b1bf50077558a for an example dashboard which filters only for MalformedQueryException (note: requires access to production logs)

Known issues

Update lag

See Wikidata_Query_Service/Streaming_Updater for everything related to the update process.

Blazegraph deadlock

Occasionally a query will cause the Blazegraph process to become deadlocked, rendering it unresponsive. As a result, Blazegraph, which functions as the backend for WDQS, fails to respond, and nginx (the frontend) reports a timeout because its attempt to talk to Blazegraph has failed.

The cause of deadlocked queries is unknown; it seems to be more likely with queries of higher complexity, but there is no single type of query that can cause the issue.

Symptoms:

  • The graphs Triples, Lag, Queries per second and some others stop reporting data for the affected node in the wikidata-query-service Grafana dashboard.
  • Blazegraph is unreachable and might die with an OOM at some point


Banning the user agent

The current state-of-the-art method is to use requestctl.

Cleanup

Once the correct user agent(s) have been banned, restart blazegraph on all affected nodes.

Take a mental note to circle back in the future and unblock any users whose user agents were banned. (Do this at a time when you will be around to monitor and make sure the deadlock condition does not resurface.)

Issues with Kafka / Recent Change pollers

WDQS supports 2 mechanisms for getting updates from Wikidata: one based on Kafka, one based on the Recent Changes API. In case of issues with one of these backends, switching to the other might help.

Procedure

The updater in use is configured in Puppet, via the profile::wdqs::use_kafka_for_updates hiera variable (see 466722 for an example). The updater needs to be restarted for the change to take effect (service wdqs-updater restart). The Kafka-based poller stores Kafka offsets in Blazegraph; those need to be reset when switching from Kafka to Recent Changes to ensure that the Kafka poller starts from the right place when re-enabled (see Resetting Kafka Offsets for details). The following command needs to be run on each server where the poller configuration is changed:

curl -X POST --data-urlencode update@reset_kafka.sparql http://localhost:9999/bigdata/namespace/wdq/sparql

Where reset_kafka.sparql contains this query:

DELETE {
  ?z rdf:first ?head ; rdf:rest ?tail .
}
WHERE {
  [] wikibase:kafka ?list .
  ?list rdf:rest* ?z .
  ?z rdf:first ?head ; rdf:rest ?tail .
} ;
DELETE WHERE {
  <https://www.wikidata.org> wikibase:kafka ?o .
};
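
To verify that the stored offsets are gone after running the update above, the same wikibase:kafka predicate can be queried; an empty result set means the offsets were cleared (a sketch, using the same local endpoint):

curl -s --data-urlencode 'query=SELECT ?o WHERE { <https://www.wikidata.org> wikibase:kafka ?o }' \
     http://localhost:9999/bigdata/namespace/wdq/sparql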

Timeouts

Sadly, the public WDQS cluster is subject to varying performance depending on user-generated load. Throttling should protect it, but it is an imperfect solution. Logstash provides a view of the different clients being throttled. So far, we have seen this issue only on the public WDQS cluster. The wdqs-categories service is usually not the one timing out. Overload is due to a combination of read and write load exceeding the current capacity.

Remediation

  • restart of the blazegraph main service: systemctl restart wdqs-blazegraph
  • restart of the blazegraph categories service: systemctl restart wdqs-categories (usually not necessary)
  • find the most offending client and ban its user agent (see 518691 for an example).
  • shut down wdqs-updater: systemctl stop wdqs-updater (this will obviously increase replication lag, but will protect read traffic).

Throttling Implementation

Throttling is implemented in the RDF Java code and can be enabled/disabled via JVM options, specifically by using '-Dwdqs.throttling-filter.enabled=' in Puppet's profile::query_service::blazegraph_extra_jvm_opts. Example hieradata here

Server de-synchronization

Ideally, all servers should have the same data. However, due to occasional glitches, the data on one server can differ from the data on another. This can happen briefly due to update lag; however, if the data is old and still not in sync, query results can be inconsistent between servers, with the same query returning different results at different times. You can see which server returned a result by inspecting the x-served-by header in the response.
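
For example, the header can be checked with a trivial query against the public endpoint (a sketch; the user agent string is a placeholder, and ASK {} is just a cheap query):

# print the x-served-by header to see which backend answered
curl -sG -D - -o /dev/null \
     -A 'wdqs-runbook-check/1.0 (ops@example.org)' \
     --data-urlencode 'query=ASK {}' \
     https://query.wikidata.org/sparql | grep -i '^x-served-by'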

Remediation

In the case of servers being out of sync, the best remedy is to update the affected entities manually. The exact entities depend on the nature of the query, so they can be tricky to identify. But if the entities are known, run this on each server:

cd /srv/deployment/wdqs/wdqs; bash runUpdate.sh -n wdq -N -- --ids Q1234,Q345,Q45567

where the argument to --ids is the list of IDs of the affected entities. This can be done at any time without affecting the functioning of the server.

Logstash Spamming

Due to improper sanitization of user requests, Logstash can become overwhelmed with large volumes of logging errors from logback. An example of such an error is: Invalid format for the WKT value: value=11.576,48.216805555556. These errors can be seen here: https://logstash.wikimedia.org

Remediation

Once the specific logger that is generating the spam has been identified, it can be disabled by adding the line <logger name="<logger_name>" level="off"/> to the logback config for wdqs and submitting the patch for review. The logback configuration is reloaded automatically; no service restart is required. An example of a submitted patch is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/468958.

WDQS Updater service crashing due to memory error

Larger than usual updates can make the wdqs-updater service run out of memory. An example can be seen here: https://phabricator.wikimedia.org/T210235. OutOfMemoryErrors are only sent to journald, not to the logback log files. UpdateBuilder.bindValue() running out of memory is a good indication of this issue:

Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.util.Arrays.copyOf(Arrays.java:3332)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:472)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.StringBuffer.append(StringBuffer.java:310)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.StringBuffer.append(StringBuffer.java:97)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:486)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.StringBuffer.append(StringBuffer.java:338)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.util.regex.Matcher.appendReplacement(Matcher.java:890)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.util.regex.Matcher.replaceAll(Matcher.java:955)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at java.lang.String.replace(String.java:2240)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bindValue(UpdateBuilder.java:40)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.rdf.RdfRepository.syncFromChanges(RdfRepository.java:308)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Updater.handleChanges(Updater.java:214)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Updater.run(Updater.java:129)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Update.run(Update.java:146)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]:         at org.wikidata.query.rdf.tool.Update.main(Update.java:81)
Nov 23 11:28:14 wdqs1009 systemd[1]: wdqs-updater.service: Main process exited, code=exited, status=1/FAILURE
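
To confirm the error on a given host, check the updater's journal directly (standard journalctl usage; the time window is arbitrary):

sudo journalctl -u wdqs-updater --since '2 hours ago' | grep -i 'OutOfMemoryError'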

Remediation

Temporarily reducing the batch size (-b parameter) should allow the large updates to go through. Check task T210235 for more details.
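
If a one-off manual catch-up run is needed, a smaller batch size can in principle be passed to the updater tool the same way --ids is passed in the de-synchronization section above. This is only a hypothetical sketch; the value 100 is arbitrary, and the permanent setting for the wdqs-updater service lives in Puppet:

# hypothetical manual run with a reduced batch size (stop wdqs-updater first to avoid two updaters running)
cd /srv/deployment/wdqs/wdqs; bash runUpdate.sh -n wdq -N -- -b 100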

Overload due to high edit rate

In some cases, high edit rate will push the load on the service to a point where it can't keep up with both the read and the write load. In that case, shutting down the wdqs-updater service will provide some breathing room, at the cost of no updates being applied to that cluster.

Categories update lag

Categories are updated weekly from MediaWiki dumps. Due to the delay between dumps being created and categories being reloaded, the lag can be larger than a week. Updates are done through the loadCategoriesDaily.sh script, run as a cron job. Logs are in /var/log/wdqs/reloadCategories.log. Icinga alerts might be too aggressive; in case of an alert, re-scheduling the check might be enough to clear it.

Note that the script used for reloads, forAllCategoryWikis.sh, will skip wikis that fail and move on to the next ones. You can find which wikis failed by looking for log lines similar to:

ERROR:loadCategoriesDaily.sh - Failed for wowiktionary, moving on to the next entry...
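
A quick way to list the failing wikis from the reload log mentioned above:

grep -F 'Failed for' /var/log/wdqs/reloadCategories.log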

Free allocators decrease rapidly

BlazegraphFreeAllocatorsDecreasingRapidly

Blazegraph may start burning allocators too rapidly, causing the journal to grow too quickly. It might be due to:

  • A data reload is in progress. This alert should have been silenced, but check the SAL logs and ping mw:Search_Platform folks. A data reload is visible in the number of triples (the affected machine is likely to be below the other machines) and also on the affected machine itself, as curl requests like the one below (a quick check is sketched after this list):

root 21321 13356 0 14:27 ? 00:00:00 curl --silent --show-error --retry 10 --retry-delay 30 --retry-connrefused -XPOST --data-binary update=LOAD <file:///srv/wdqs/munged/wikidump-000000295.ttl.gz> http://localhost:9999/bigdata/namespace/wdq/sparql

  • The machine is catching up on lag: verify that the two graphs are correlated and possibly adjust the alert. Wait for the lag to catch up and see if the alert resolves by itself.
  • Blazegraph is misbehaving: this is the scenario where action is required, if the service otherwise looks OK (no lag, no reload in progress):
    • Blazegraph must be restarted to stop the bleeding: systemctl restart wdqs-blazegraph
    • if the bleeding continues, the next step is likely journal corruption and a full disk; the solution is to run the data transfer cookbook from a sane source.
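
To check whether a data reload is currently running on the machine (the curl processes shown in the first bullet above), a minimal sketch:

# a running data reload shows up as curl processes POSTing LOAD statements to the local endpoint
pgrep -af 'data-binary update=LOAD'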