Performance/Runbook/Webperf-processor services

From Wikitech

This is the run book for deploying and monitoring webperf-processor services.

Hosts

The puppet role for these services is role::webperf::processors_and_site.

Find the current production hosts for this role in puppet: site.pp. Find the current beta host at openstack-browser: deployment-prep.

Hosts as of Jan 2022 (T305460):

navtiming

The navtiming service (written in Python) consumes NavigationTiming and SaveTiming events from EventLogging (over Kafka), and after processing submits them to Graphite (over Statsd) and Prometheus, from which they can be visualised in Grafana.

The events start their life in Extension:NavigationTiming as part of MediaWiki, which beacons them to EventLogging (beacon js source).
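To make the pipeline concrete, here is an illustrative sketch of the kind of transformation navtiming performs. The real service lives in performance/navtiming; the function names (make_stat, handle_event), the metric prefix, and the event shape below are assumptions for illustration, not the actual implementation.

```python
# Hypothetical sketch of navtiming's event -> statsd transformation.
# Real code, metric names, and event fields live in performance/navtiming.

def make_stat(*path_parts, value):
    """Format a Graphite/statsd timing line: 'a.b.c:123|ms'."""
    return "%s:%d|ms" % (".".join(path_parts), int(value))

def handle_event(event):
    """Turn one NavigationTiming event into zero or more statsd lines."""
    stats = []
    e = event["event"]
    # responseStart is milliseconds relative to navigationStart.
    if "responseStart" in e:
        stats.append(
            make_stat("frontend.navtiming", "responseStart",
                      value=e["responseStart"]))
    return stats

print(handle_event({"event": {"responseStart": 180}}))
# → ['frontend.navtiming.responseStart:180|ms']
```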

Meta

Infrastructure diagram.

Monitor navtiming

Application logs

Application logs for this service are available via journalctl (not aggregated by Logstash).

  • SSH to the host you want to monitor.
  • Run sudo journalctl -u navtiming -f -n100

Raw events

To look at the underlying Kafka stream directly, you can use Kafkacat from our webperf host (requires perf-admins shell) or from a stats host (requires analytics-privatedata-users shell).

# Read the last 1000 items and stop
webperf1003$ kafkacat -C -b 'kafka-jumbo1001.eqiad.wmnet:9092' -t eventlogging_NavigationTiming -o '-1000' | head -n1000

# Consume live, stop after 10 new items
webperf1003$ kafkacat -C -b 'kafka-jumbo1001.eqiad.wmnet:9092' -t eventlogging_NavigationTiming | head -n10

# Read the last 1000 items and stop after 10 events match the grep pattern
webperf1003$ kafkacat -C -b 'kafka-jumbo1001.eqiad.wmnet:9092' -t eventlogging_NavigationTiming -o '-1000' | grep largestContentfulPaint | head -n10

Event validation

When our JS client submits events to the EventGate server, they are validated against our schema. Valid messages are sent forward into the Kafka topic. Rejected messages are logged for us to review in the EventGate-validation dashboard in Logstash.
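The real validation is JSON Schema validation performed by EventGate; as a rough illustration of the valid/rejected split, here is a hand-rolled required-field check. The field name and types below are invented for the example.

```python
# Illustrative only: EventGate validates with JSON Schema; this hand-rolled
# check merely shows the valid vs. rejected classification idea.

REQUIRED = {"responseStart": (int, float)}  # hypothetical required field

def validate(event):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for field, types in REQUIRED.items():
        if field not in event:
            errors.append("missing field: %s" % field)
        elif not isinstance(event[field], types):
            errors.append("wrong type for %s" % field)
    return errors

print(validate({"responseStart": 180}))  # → []  (valid, forwarded to Kafka)
print(validate({}))                      # → ['missing field: responseStart']
```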

Deploy navtiming

This service runs on the webperf* hosts.

To update the service on the Beta Cluster:

  1. Connect with ssh deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud
  2. run sudo journalctl -u navtiming -f -n100 and keep this open during the following steps
  3. in a new tab, connect with ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud (or whatever the current deployment-deploy* host is; check before connecting).
  4. cd /srv/deployment/performance/navtiming
  5. git pull
  6. scap deploy
  7. Review the scap output (here) and the journalctl output (on the webperf server) for any errors.

To deploy a change in production:

  1. Before you start, open a terminal window in which you monitor the service on a host in the current primary data center. For example, if Eqiad is primary, ssh to webperf10##.eqiad.wmnet and run sudo journalctl -u navtiming -f -n100.
  2. In another terminal window, ssh to the deployment server: ssh deployment.eqiad.wmnet and navigate to the navtiming directory: cd /srv/deployment/performance/navtiming.
  3. Prepare the working copy:
    • Ensure the working copy is clean, git status.
    • Fetch the latest changes from Gerrit remote, git fetch origin.
    • Review the changes, git log -p HEAD..@{u}.
    • Apply the changes to the working copy, git rebase.
  4. Deploy the changes; this will automatically restart the service afterward.
    • Run scap deploy

Verify a deploy in production:

  1. Check the logs you are tailing in your tab, look for new errors.
  2. Go to the navtiming dashboard in Grafana and verify that metrics are arriving in Graphite. Zoom in, refresh, and verify that new metrics are still being received. Wait a couple of minutes and confirm that metrics continue to come in after your deployment.
  3. Do the same for the Prometheus metrics; you can do that in the response start dashboard, following the same pattern as for Graphite.

Rollback a change in production:

  1. Revert the change in Gerrit: open your change set, click the Revert button, and add the reason why you are reverting.
  2. +2 the revert and wait for the code to be merged.
  3. Follow the instructions for deploying a change in production, and make sure your revert is included when you review the changes.

Restart navtiming

sudo systemctl restart navtiming

Check that Prometheus is running

You can check metrics and verify that Prometheus is running by using curl on the webperf host:

curl localhost:9230/metrics

Then you will see all the collected metrics. If you want to measure how long it takes to fetch the metrics, you can use:

curl -o /dev/null -s -w 'Total: %{time_total}s\n' localhost:9230/metrics
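If you need the values programmatically rather than eyeballing the curl output, a small parser for the Prometheus text exposition format is enough for spot checks. This sketch ignores labels, HELP/TYPE comment lines, and other details of the format; the sample input is invented.

```python
# Minimal sketch: parse Prometheus text exposition output (as returned by
# curl localhost:9230/metrics) into {metric_name: value}.
# Ignores labels and comment lines; fine for quick spot checks only.

def parse_metrics(text):
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = "# HELP up Whether the exporter is up\nup 1\nprocess_cpu_seconds_total 12.5\n"
print(parse_metrics(sample))
# → {'up': 1.0, 'process_cpu_seconds_total': 12.5}
```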

coal

Written in Python.

Application logs are kept locally, and can be read via sudo journalctl -u coal.

Reprocessing past periods

Coal data for an already processed period can be overwritten safely. To backfill a period after an outage, run coal manually on one of the perf hosts (no need to stop the existing process), using a different consumer group, and pass the --start-timestamp option (note that the timestamp is expressed in milliseconds since the Unix epoch). Once you see that the outage gap has been filled, you can safely stop the manual coal process.
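Since --start-timestamp expects milliseconds since the Unix epoch (not seconds), it is easy to get wrong by a factor of 1000. A quick way to compute the value for a given UTC time:

```python
# Compute a --start-timestamp value: milliseconds since the Unix epoch.
from datetime import datetime, timezone

def to_ms_epoch(dt):
    """Convert an aware datetime to integer milliseconds since the epoch."""
    return int(dt.timestamp() * 1000)

# e.g. backfill starting from 2022-01-15 12:00 UTC:
start = datetime(2022, 1, 15, 12, 0, tzinfo=timezone.utc)
print(to_ms_epoch(start))  # → 1642248000000
```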

Restart coal

sudo systemctl restart coal

statsv

The statsv service (written in Python) forwards data from the Kafka stream for /beacon/statsv web requests to Statsd.
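As a rough sketch of that forwarding step: statsv reads the query string of each /beacon/statsv request and emits statsd lines. The exact query-string format and metric types handled below (ms, c, g) are assumptions for illustration; the real code is the statsv service.

```python
# Hedged sketch of statsv's transformation: a /beacon/statsv query string
# (e.g. ?MediaWiki.foo=1234ms) becomes statsd lines ("name:value|type").
# The accepted value format here is an assumption, not the real parser.
import re
from urllib.parse import parse_qsl

def statsv_to_statsd(query_string):
    lines = []
    for name, value in parse_qsl(query_string):
        m = re.match(r"^(\d+)(ms|c|g)$", value)
        if m:  # silently skip malformed values
            lines.append("%s:%s|%s" % (name, m.group(1), m.group(2)))
    return lines

print(statsv_to_statsd("MediaWiki.foo=1234ms"))
# → ['MediaWiki.foo:1234|ms']
```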

Application logs are kept locally, and can be read via sudo journalctl -u statsv.

Restart statsv

sudo systemctl restart statsv

coal-web

Written in Python.

perfsite

This powers the site at https://performance.wikimedia.org/. Beta Cluster instance at https://performance.wikimedia.beta.wmflabs.org/.

Deploy the site

  • Follow instructions in the README to create a commit.
  • Push to Gerrit for review.
  • Once merged, Puppet will update the web servers within 30min.