OpenSearch Dashboards

From Wikitech

OpenSearch Dashboards (previously known as Kibana) is the frontend for Logstash, available at https://logstash.wikimedia.org.

This page is the user guide for OpenSearch Dashboards at WMF. For information about its operation, see Logstash. To read more about the software and its history, check OpenSearch on Wikipedia.

Quick intro

Where do logs come from?

Logs from MediaWiki end up here.

E.g. $logger = LoggerFactory::getInstance('Flow'); $logger->info(...) in MediaWiki PHP corresponds in Logstash to type:mediawiki channel:Flow level:INFO

For more about how to do that within MediaWiki, see mw:Structured logging.

  • Start from one of the blue Dashboard links near the top; more are available from the Load icon near the top right.
  • In "Events over time" click and drag to zoom in to a specific region.
  • On the top right, you can change the time range, e.g. last 24 hours or last 7 days. Smaller ranges are faster. If you find no results, check whether you accidentally queried the future!
  • If you get lost, start again from the Home icon on the top right, or go to https://logstash.wikimedia.org/

See also slide 11 onwards in the TechTalk on ELK (the stack formerly known as "Logstash" or "Kibana") by Bryan Davis for more highlighted dashboard features.

Team dashboard and triage process talk

Watch "🎥 How to Logstash & Kibana (2020)" by Timo Tijhof (20 minutes, 📙 slides) to learn what production errors are, how we monitor production errors, and how you can create a team dashboard as part of a Phabricator triage workflow.

Beta Cluster Logstash

Web interface
https://beta-logs.wmcloud.org/
Access control
Credentials for Beta's Logstash can be found on officewiki, or by connecting to deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud and reading /root/secrets.txt. Unlike production services, Beta Cluster does not use Developer accounts (LDAP) for authentication.
$ ssh deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud -- sudo cat /root/secrets.txt
service: https://beta-logs.wmcloud.org
user: ************
password: ************
Kafka access to deployment-prep
The security group kafka-logging must allow ingress from the logging collector on port 9093.
When commissioning a new logging collector, the certificate authority keystore (/etc/ssl/localcerts/wmf-java-cacerts) must be manually copied onto the new collector, otherwise Logstash will not start with the error: File does not exist or cannot be opened /etc/ssl/localcerts/wmf-java-cacerts.

How to

How to look up MediaWiki errors

By request ID (PHP non-fatal errors)

This type of error ID is shown to the user when a PHP non-fatal error occurs. An example request ID is b3165253-43e8-4708-88b3-a07ea636d8ed. These typically get caught by a try/catch in MediaWikiEntryPoint->run(), and then are displayed to the user and logged in Logstash.

  • ≡ -> OpenSearch Dashboards -> Discover
  • delete all existing filters by pressing the X next to each one
  • Click "+ Add filter"
    • Field -> reqId
    • Operator -> Is
    • Value -> paste the request ID
    • Save
  • Show dates -> expand the range to 90 days or greater -> Update
  • Scroll down to "Raw Events List"
  • Click the > icon next to one of the log entries to expand it
  • Click on the Phatality tab for sanitized Remarkup code to copy-paste into Phabricator
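The reqId filter built through the UI above can also be expressed as an OpenSearch query-DSL request. This is a minimal sketch, assuming you have credentials for direct API access; the host, index pattern, and request ID are placeholders, not a documented endpoint:

```shell
#!/bin/sh
# Placeholder request ID -- substitute the one shown on the error page.
REQ_ID="b3165253-43e8-4708-88b3-a07ea636d8ed"

# Query-DSL equivalent of the reqId "Is" filter built in the UI.
QUERY='{"query": {"match_phrase": {"reqId": "'"$REQ_ID"'"}}, "size": 10}'

# With credentials, the search itself would look something like this
# (host and index pattern are assumptions):
#   curl -s -u "$USER:$PASS" -H 'Content-Type: application/json' \
#        "https://<opensearch-host>/logstash-*/_search" -d "$QUERY"
echo "$QUERY"
```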

By MediaWiki error type (see if an error type is still being emitted or if it went away)

It can be difficult to search for a specific type of error. For example, multiple different extensions output PHP Warning: Undefined array key 1, and it can be hard to narrow the search further.

  • ≡ -> OpenSearch Dashboards -> Discover
  • delete all existing filters by pressing the X next to each one
  • Click "+ Add filter"
    • Field -> exception.message
    • Operator -> Is
    • Value -> example: PHP Warning: Undefined array key 1
    • Save
  • Show dates -> expand the range to 90 days or greater -> Update
  • Scroll down to "Raw Events List"
  • Click the > icon next to one of the log entries to expand it
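Instead of building the filter through the UI, the same match can also be typed directly into the query bar using Dashboards Query Language (DQL), quoting the phrase so it matches exactly:

```
exception.message:"PHP Warning: Undefined array key 1"
```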

How to look up HTTP error codes

Sometimes the user will get an error page that says "Our servers are currently under maintenance or experiencing a technical issue" and at the bottom contains an HTTP error code and a Varnish XID. This is the error page for when the website returns an HTTP error code. This can be caused by things such as using a blocked IP address (HTTP 403) or by a PHP fatal error (HTTP 500).

Only HTTP 5XX errors are logged. HTTP 4XX are not logged in Logstash.

You cannot search by Varnish XID (although maybe in the future, phab:T176065). One thing you can search for is IP address.

  • ≡ -> OpenSearch Dashboards -> Discover
  • delete all existing filters by pressing the X next to each one
  • Click "+ Add filter"
    • Field -> type
    • Operator -> Is
    • Value -> webrequest
    • Save
  • Click "+ Add filter"
    • Field -> ip
    • Operator -> Is
    • Value -> [theIpAddress]
    • Save
  • Show dates -> expand the range to 90 days or greater -> Update
  • Scroll down to "Raw Events List"
  • Click the > icon next to one of the log entries to expand it

There are no stack traces for these, so the URL may be your best clue.

How to look up MediaWiki JavaScript errors

Use a dashboard such as "mediawiki-client-errors" to browse these. Unlike PHP errors, JS errors do not have a reqId.

How to look up MediaWiki WikimediaDebug browser extension verbose logs

  • Install WikimediaDebug browser extension
  • Tick "Verbose log"
  • Turn it from off to on
  • Visit a page
  • Open the extension again. It will display links to logstash. Click on the top link, which should be the main request. (The other links are API requests.)

Tips

Discovery page

  • On the home page, click on "Discover: Run ad-hoc Logstash queries".
  • Select the index to search within. At the time of writing, the default selected index is "dlq-*". You may prefer to search in "logstash-*" for example.
  • Type some words within the search field, e.g. type "gitlab" to look for GitLab logs.
  • Change the time range to increase the probability of your request returning something.

Homepage

The Home page is itself also a dashboard. It has a single text panel with a Markdown list of links. Add your own for easy access!

Unfortunately it's not possible to simply copy and paste results out of a search window (the formatting will break badly). If you really need the data, you can copy the POST request via the Firefox dev tools (as cURL) and then:

# "log" being the field to print from each document
curl ... | gunzip | jq -r '.rawResponse | .hits | .hits | .[]._source | "\(.timestamp) \(.log)" | gsub("[\\n\\t]"; "")'

Download CSV

For saved searches you can also download a CSV of the first 10,000 results by clicking "Reporting" and then "Generate CSV". Note that navigating to "Discover" and then searching will not enable "Generate CSV"; you first need to save your search to be able to download its results as CSV.

Gotchas

The browser address bar URLs for Logstash navigation are by default personalised to your login session. Sharing such a link with other people leads to an "Unable to restore URL" error.

To share results with others, use the "Share" link from the top right navigation, choose "Permalinks", enable "Short URL", and then press "Copy link".

By default, shared links use the "Snapshot" mode which means it captures the state of the dashboard queries and panels as-is. This includes e.g. the timestamp slider, so if you're viewing "Last 1 hour" then the shared link will show different results an hour from now.

To share a specific result, use the single document link instead. Expand one of the raw events in the feed down on the dashboard, and copy the "View single document" link.

Use "Share" -> "Permalinks" -> "Snaphshot" -> "Copy link" without the "Short URL" to get a link that you can modify e.g. to automatically generate URLs that filter for a certain keyword. Add filters or a search expression in the UI then observe how the URL changes. The _a and _g query parameters are RISON structures (JSON encoded in an URL-friendly format), and you can freely change values in them, and strip out anything you don't need to change from the dashboard's default value. Typically, you'd only keep the time expression in _g and the filters array in _a, and omit the various metadata from filters and only keep the query part.

E.g. this URL query would apply a myField=myValue filter to the given dashboard and set the time range to the last 24 hours:

?_a=(filters:!((query:(match_phrase:(myField:myValue)))))&_g=(time:(from:now-24h,to:now))
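As a sketch of how such links can be generated automatically (the dashboard path, field, and value here are hypothetical placeholders), a small shell script can splice a field/value pair and time range into the _a and _g parameters:

```shell
#!/bin/sh
# Hypothetical dashboard URL and filter values -- replace with your own.
BASE="https://logstash.wikimedia.org/app/dashboards#/view/example-dashboard"
FIELD="myField"
VALUE="myValue"
RANGE="now-24h"

# Splice the filter and time range into the RISON _a and _g parameters.
URL="${BASE}?_a=(filters:!((query:(match_phrase:(${FIELD}:${VALUE})))))&_g=(time:(from:${RANGE},to:now))"
echo "$URL"
```

Values containing RISON-special characters (spaces, colons, parentheses) would additionally need quoting or URL-encoding, which this sketch does not attempt.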

No results

  • If you see no events at all, perhaps you are querying the future only?
  • If you see no results or the results seem unrelated, press the magnifying glass at right of the main query bar to submit again. There is a race condition where if you modify the query while it is running, it ends up re-submitting the last completed query instead of the one you just wrote.
  • If you see events suddenly stop, perhaps the query includes the future (e.g. "Today" and "This week" instead of "Last 24h" or "Last week").
  • If you think you found when results first started to match your query, double-check whether it aligns with 90 days ago, which is our message retention period.

Visualisation panels

  • The visualisation panels are re-usable and thus saved globally.
  • Avoid changing an existing visualisation unless you intend to change all other dashboards that use it at the same time. Otherwise, use "Save as.." under a new name first.