Logs/Runbook

Introduction

This page outlines some useful techniques that can be used to help diagnose issues with Wikimedia sites, based on the various application and infrastructure logs available.

The choice of which technique(s) to employ will largely depend on the nature of the situation.

Ad-Hoc Analysis

Sometimes it is useful to be able to perform ad-hoc analysis of a real-time incident, by viewing a live log file of certain events and filtering it according to your needs. The following examples may be adapted to your specific requirements.

Webrequests Sampled

Superset dashboards

There are two useful dashboards available in Superset for analyzing webrequests:

Live data (the last 24 hours): https://superset.wikimedia.org/superset/dashboard/webrequest-live/

Historical data (from ~1h ago to 1 month ago): https://superset.wikimedia.org/superset/dashboard/webrequest-128/

Each dashboard is organised into tabs showing different data; see the Help tab for details on how to use them.

Log files

The log file with sampled webrequests is available on centrallog1001 and centrallog2002 in the file: /srv/log/webrequest/sampled-1000.json

As the name suggests, 1 in 1,000 requests is extracted from the stream in Kafka and retained in this file. Each file contains one day's logs, and 62 days' worth of old logs are stored in /srv/log/webrequest/archive
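Because only 1 in 1,000 requests is retained, counts taken from this file need to be scaled up to approximate real traffic. The following is a minimal Python sketch, assuming the sampling is uniform. The function names are illustrative and not part of any existing tooling.

```python
SAMPLING_FACTOR = 1000  # sampled-1000.json keeps 1 in 1,000 requests


def estimate_total(sampled_count: int, factor: int = SAMPLING_FACTOR) -> int:
    """Scale a count observed in the sampled log to an estimated real count."""
    return sampled_count * factor


def estimate_rate(sampled_count: int, window_seconds: float,
                  factor: int = SAMPLING_FACTOR) -> float:
    """Estimated real requests per second over the observed time window."""
    return estimate_total(sampled_count, factor) / window_seconds
```

For example, 42 sampled lines within a 60-second window suggest roughly 42,000 real requests, or about 700 requests per second. Small sampled counts carry a large relative error, so such estimates should be treated as rough.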

Aggregated data

On hosts running Debian bullseye or newer (centrallog2002 as of Nov. 2022), the json-webrequests-stats script is installed to gather statistics from the logs.

$ json-webrequests-stats -h
usage: json-webrequests-stats [-h] [-n NUM] [-c {text,upload,all}] [-q [QUERIES ...]]

Script to parse sampled-1000.json or 5xx.json logs and report aggregated results.

The script expects as standard input a subset of lines from /srv/weblog/webrequest/*.json logs.
It uses the Python library `gjson` [1] that is the Python porting of the Go library GJSON [2] and accepts the same
syntax [3] for manipulating JSON objects.

Example usage:

    # Get stats from the live traffic
    tail -n 100000 /srv/weblog/webrequest/sampled-1000.json | json-webrequests-stats

    # Save the current live traffic to work on the same set of data while refining the search
    tail -n 100000 /srv/weblog/webrequest/sampled-1000.json > ~/sampled.json
    cat ~/sampled.json | json-webrequests-stats
    # There is some interesting traffic with a specific pattern, filter by it and get the statistics relative to only
    # that specific traffic
    cat ~/sampled.json | json-webrequests-stats -n 20 -c text -q 'uri_path="/w/api.php"'
    # Apply multiple filters to narrow down the search
    cat ~/sampled.json | json-webrequests-stats -n 20 -c text -q 'uri_path="/w/api.php"' 'user_agent%"SomeBot.*"'

    # Get stats from the live 5xx error logs
    tail -n 10000 /srv/weblog/webrequest/5xx.json | json-webrequests-stats

optional arguments:
  -h, --help            show this help message and exit
  -n NUM, --num NUM     How many top N items to return for each block. (default: 10)
  -c {text,upload,all}, --cdn {text,upload,all}
                        For which CDN to show the stats. (default: all)
  -q [QUERIES ...], --queries [QUERIES ...]
                        A GJSON additional array query to use to pre-filter the data, without the parentheses required
                        by the GJSON syntax [4], e.g.: -q 'uri_path="/w/api.php"'. Accepts multiple values, e.g.:
                        -q 'uri_path="/w/api.php"' 'uri_host="en.wikipedia.org"'. (default: None)

[1]       Python gjson: https://volans-.github.io/gjson-py/index.html
[2]           Go GJSON: https://github.com/tidwall/gjson/blob/master/README.md
[3]       GJSON Syntax: https://github.com/tidwall/gjson/blob/master/SYNTAX.md
[4] GJSON Query syntax: https://github.com/tidwall/gjson/blob/master/SYNTAX.md#queries

The script reports the top N values for a set of request fields. For each field it also reports the top N values ranked by the sum of response_size (how much bandwidth they used) and by the sum of time_firstbyte (how much MediaWiki time they used).
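The kind of aggregation described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the script's actual implementation: the function names are hypothetical, and it assumes one JSON object per line with numeric response_size and time_firstbyte fields (non-numeric values are counted as zero).

```python
import json
from collections import Counter


def to_number(value):
    """Convert a log value to float, treating missing or non-numeric values as 0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def aggregate(lines, field, top_n=10):
    """Top-N values of `field` by request count, summed size and summed first-byte time."""
    counts = Counter()
    size_sum = Counter()
    ttfb_sum = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        key = record.get(field, "-")
        counts[key] += 1
        size_sum[key] += to_number(record.get("response_size"))
        ttfb_sum[key] += to_number(record.get("time_firstbyte"))
    return {
        "count": counts.most_common(top_n),
        "response_size": size_sum.most_common(top_n),
        "time_firstbyte": ttfb_sum.most_common(top_n),
    }
```

Called as aggregate(open_file, "ip", top_n=3) on a saved slice of the log, it returns three ranked lists for the ip field. Ranking by summed response_size or time_firstbyte can surface clients that make few requests but expensive ones.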

This is an example output, with IPs and UAs redacted and using the -n 3 CLI argument for brevity.

Once a specific pattern is found, it is possible to filter the data using the -q/--queries CLI argument to get statistics for only the filtered traffic. This usually makes it possible to find more specific patterns (UAs, IPs, etc.) fairly quickly.
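The effect of stacking several -q filters can be approximated in Python as below. This is a sketch under stated assumptions: it supports only exact matches and shell-style wildcards via fnmatch, which is not the same as the GJSON query syntax the script accepts, and the function names are hypothetical.

```python
import json
from fnmatch import fnmatchcase


def matches(record, equals=None, like=None):
    """True if the record passes every exact-match and every wildcard filter."""
    for field, expected in (equals or {}).items():
        if record.get(field) != expected:
            return False
    for field, pattern in (like or {}).items():
        if not fnmatchcase(str(record.get(field, "")), pattern):
            return False
    return True


def prefilter(lines, equals=None, like=None):
    """Yield parsed records (one JSON object per line) that satisfy all filters."""
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            if matches(record, equals, like):
                yield record
```

Every filter must hold for a record to be kept, so adding filters can only narrow the result set. That is what makes iterative refinement on a saved slice of the log practical.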

Grep-able output

$ jq  -r "[.uri_path,.hostname,.user_agent,.ip] | @csv" /srv/log/webrequest/sampled-1000.json

Select all requests from public_cloud networks that received a 429 response

$ tail -n10000 /srv/weblog/webrequest/sampled-1000.json | jq -r 'select(.http_status == "429") | select(.x_analytics | contains("public_cloud=1"))'

Select all requests with a specific user_agent and referer

$ jq -r 'if .user_agent == "-" and .referer == "-" then [.uri_path,.hostname,.user_agent,.ip] else empty end | @csv' /srv/log/webrequest/sampled-1000.json

List of the top 10 IPs by response size

$ jq -r '.ip + " " + (.response_size | tostring)' /srv/log/webrequest/sampled-1000.json| awk '{ sum[$1] += $2 } END { for (ip in sum) print sum[ip],ip }' | sort -nr | head -10

Select logs matching specific HTTP status, datestamp prefix, host, and uri_path, outputting the top query parameters found

$ tail -n300000 /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | select(.uri_path == "/") | .uri_query' | sort | uniq -c | sort -gr | head
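For longer investigations the same selection can be easier to adjust as a small Python function. The sketch below mirrors the jq pipeline above, using the same field names. The function name is hypothetical, and the timestamp check is a substring match to mirror jq's contains.

```python
import json
from collections import Counter


def top_queries(lines, status, dt_fragment, host, path, top_n=10):
    """Count uri_query values among records matching status, timestamp, host and path."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if (str(record.get("http_status")) == status
                and dt_fragment in str(record.get("dt", ""))  # substring, like jq contains
                and record.get("uri_host") == host
                and record.get("uri_path") == path):
            counts[record.get("uri_query", "")] += 1
    return counts.most_common(top_n)
```

It takes any iterable of log lines, such as an open file or sys.stdin fed from tail, and returns (uri_query, count) pairs ordered by descending count.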

5xx errors

Most of the queries shown for the sampled-1000 log also work here.

Grep-able output

$ tail -f /srv/log/webrequest/5xx.json | jq -r "[.uri_host, .uri_path, .uri_query, .http_method, .ip, .user_agent] | @csv"

MediaWiki

All IPs that have made more than 100 large requests

$ awk '$2>60000 {print $11}' /var/log/apache2/other_vhosts_access.log | sort | uniq -c | awk '$1>100 {print}'
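The awk one-liner can also be written as a Python function when the thresholds need to change often. This sketch assumes whitespace-separated fields, with the value to test in field 2 and the client IP in field 11, exactly as the awk command does. The function name and parameters are hypothetical.

```python
from collections import Counter


def heavy_clients(lines, min_value=60000, min_requests=100):
    """Return {field 11: count} for clients with more than min_requests matching lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 11:
            continue  # skip malformed or truncated lines
        try:
            value = float(fields[1])
        except ValueError:
            continue  # field 2 is not numeric
        if value > min_value:
            counts[fields[10]] += 1
    return {ip: n for ip, n in counts.items() if n > min_requests}
```

Unlike the shell pipeline, this version skips lines with fewer than 11 fields or a non-numeric second field instead of silently miscounting them.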

Retrospective Analysis

When the situation calls for analysis of more historical data, or to access the complete set of data, the Analytics Systems can help.

Turnilo

Turnilo has access to a sampled webrequest dataset, which is loaded into Druid every hour. This dataset samples 1 in 128 requests.

Data Lake

The primary source for webrequest logs is the Data Lake and the Analytics/Data Lake/Traffic/Webrequest tables in Hive.

These tables are updated hourly and may be queried using Hive, Presto, or Spark.

Please see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Sample_queries for some sample queries using Hive.