WMDE/Wikidata/Alerts

Alertmanager

Grafana

Alertmanager handles the alerts from the Wikidata alerts dashboard on Grafana.

The dashboard can be found here: https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts

Maxlag: Above 10 for 1 hour

More specific information about the lag can be queried from the API: https://www.wikidata.org/w/api.php?action=query&maxlag=-1
More data on the query service: https://grafana.wikimedia.org/d/000000489/wikidata-query-service
SAL can be used to track down changes to the problematic servers
Dispatch lag is no longer part of maxlag; it is now tracked and alerted on with its own metric
People in #wikimedia-search know more about running and administering the wdqs servers
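
As a rough illustration of querying the lag from the API, the following Python sketch fetches the URL above and summarises the result. It assumes that a request with maxlag=-1 returns a JSON body containing an "error" object with code "maxlag" and the fields "lag", "type" and "host"; those field names, the added format=json parameter and the User-Agent string are assumptions and should be checked against a real response.

```python
import json
import urllib.request

# The URL from this page, with format=json added (an assumption) to get JSON back.
API_URL = "https://www.wikidata.org/w/api.php?action=query&maxlag=-1&format=json"


def summarize_maxlag(response: dict) -> str:
    """Return a one-line summary of the lag reported in a maxlag error response."""
    error = response.get("error", {})
    if error.get("code") != "maxlag":
        return "no maxlag error in response"
    # "lag", "type" and "host" are assumed field names; fall back if they are missing.
    lag = error.get("lag", "unknown")
    lag_type = error.get("type", "unknown")
    host = error.get("host", "unknown")
    return f"lag={lag}s type={lag_type} host={host}"


def fetch_maxlag(url: str = API_URL) -> dict:
    """Fetch and decode the API response; the User-Agent value is a placeholder."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "maxlag-check-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        return json.load(reply)
```

Calling `print(summarize_maxlag(fetch_maxlag()))` would then show which kind of lag and which host the API is currently reporting, which can help decide whether to look at the query service dashboard or at the databases.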

In the past this has been caused by:

lag on query service servers (see: task T302330)
a wdqs host being restarted but not depooled (see: task T322010)

Edits: Wikidata edit rate

The edit rate on Wikidata can be a good indicator that something somewhere is wrong, although it will not always indicate exactly what that is.

You can view the edits dashboard at https://grafana.wikimedia.org/d/000000170/wikidata-edits

If maxlag is high, that might be the reason for a low edit rate.

You may want to investigate what is going on with the API (as all edits go via the API) https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&var-metric=p50&var-module=wb*

API: Max p95 execute time for write modules

Investigate the wb* API modules at https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&var-metric=p50&var-module=wb*

In the past this has been caused by:

the s8 database being overloaded, often for a fixable reason
Memcached being overloaded, which in the past has indicated UBN (Unbreak Now!) issues

Termbox Request Errors

This kind of error occurs when Wikibase is unable to reach the Termbox Service, i.e. the HTTP request itself fails and is unlikely to have reached its destination. This error does *not* get triggered by erroneous responses, so it means there is a problem on the MediaWiki/Wikibase side or network issues.
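
The distinction between a failed request and an erroneous response can be sketched in Python as follows. This is only an illustration of the idea, not Wikibase's actual (PHP) implementation; the function name `counts_as_request_error` is hypothetical.

```python
import urllib.error


def counts_as_request_error(exc: Exception) -> bool:
    """Return True if the failure means the request never got a response."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    # It means the service answered with an error status, which is an
    # erroneous response and not the kind of failure described above.
    if isinstance(exc, urllib.error.HTTPError):
        return False
    # A URLError without a response (DNS failure, refused connection),
    # a timeout, or a dropped connection: the request itself failed.
    return isinstance(exc, (urllib.error.URLError, TimeoutError, ConnectionError))
```

Under this sketch a refused connection or a timeout would count towards the error, while an HTTP 500 returned by the service would not.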

Change Dispatching

See WMDE/Wikidata/Dispatching and Wikibase: Change propagation. More metrics can be found on https://grafana-rw.wikimedia.org/d/hGFN2TH7z/edit-dispatching-via-jobs

See WMDE/Wikidata/Runbooks/Change dispatching/Alert for details on the alerts.

Oozie Job

Sometimes these jobs will fail for random reasons.

They will be restarted, so there is no need to worry about a first failure.

If the jobs continue to fail, contact WMF Analytics on IRC in #wikimedia-analytics to investigate.