webperf


webperf is a set of scripts that aggregate data from EventLogging to statsd and Graphite.

Setup

Currently running on:

Source code is in performance/navtiming.git (Gerrit) and deployed using Scap3.

Provisioned by Puppet using role::webperf::processors_and_site.

Each service runs as a systemd unit.

Services

navtiming

The navtiming service (written in Python) extracts information for the NavigationTiming and SaveTiming schemas from EventLogging using Kafka, and submits it to Graphite via Statsd. The original data is submitted to EventLogging by a JS client for MediaWiki (beacon js source, MediaWiki extension).

Application logs are kept locally, and can be read via sudo journalctl -u navtiming.

coal

Written in Python.

Application logs are kept locally, and can be read via sudo journalctl -u coal.

statsv

The statsv service (written in Python) forwards data from the Kafka stream for /beacon/statsv web requests to Statsd.

Application logs are kept locally, and can be read via sudo journalctl -u statsv.
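
As a rough illustration of the forwarding that statsv performs, the sketch below turns the query string of a /beacon/statsv request into StatsD lines. The value shape (digits followed by a type suffix such as ms, c or g) and the function name statsv_to_statsd are assumptions made for this example; the real service reads requests from Kafka and may validate them differently.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Assumed value shape: digits followed by a StatsD type suffix,
# e.g. "1234ms" (timing), "1c" (counter) or "5g" (gauge).
VALUE_RE = re.compile(r"^(\d+)(ms|c|g)$")


def statsv_to_statsd(url):
    """Translate the query string of a /beacon/statsv URL into StatsD lines.

    Hypothetical helper for illustration only.
    """
    lines = []
    for key, raw in parse_qsl(urlsplit(url).query):
        match = VALUE_RE.match(raw)
        if match is None:
            continue  # skip values that do not fit the assumed shape
        value, kind = match.groups()
        lines.append(f"{key}:{value}|{kind}")
    return lines


print(statsv_to_statsd("/beacon/statsv?foo.bar=1234ms&hits=1c"))
```

Each resulting line follows the plain StatsD text format (name:value|type), which is what a StatsD daemon accepts over UDP.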

coal-web

Written in Python.

site

This powers the site at https://performance.wikimedia.org/. Beta Cluster instance at https://performance-beta.wmflabs.org/.

Metrics

navtiming-1 metrics

The navtiming-1 metrics are available in Graphite under the frontend.navtiming prefix.

The "NavigationTiming" data in

See also PerformanceTiming in the W3C Navigation Timing spec.

Offsets

  • fetchStart: From PerformanceTiming, collected relative to navigationStart.
  • responseStart: From PerformanceTiming, collected relative to navigationStart.
  • firstPaint:
    • In MSIE/Edge, this comes from a non-standard msFirstPaint property of PerformanceTiming.
    • In Chromium-based browsers (Chrome, Opera, Android), this uses chrome.loadTimes().firstPaintTime.
  • domInteractive: From PerformanceTiming, collected relative to navigationStart.
  • domComplete: From PerformanceTiming, collected relative to navigationStart.
  • loadEventStart: From PerformanceTiming, collected relative to navigationStart.
  • loadEventEnd: From PerformanceTiming, collected relative to navigationStart.

Deltas

  • dnsLookup: Computed client-side from PerformanceTiming domainLookupEnd - domainLookupStart.
  • redirecting: Computed client-side from PerformanceTiming redirectEnd - redirectStart.
  • mediaWikiLoadComplete: Computed client-side from our custom measures mwLoadEnd - mwLoadStart. mwLoadStart is the start of execution of the JavaScript startup module, and mwLoadEnd is the point where all client-side CSS and JavaScript have finished downloading and executing.
  • waiting: Computed server-side as responseStart - requestStart.
  • connecting: Computed server-side as connectEnd - connectStart.
  • receiving: Computed server-side as responseEnd - responseStart.
  • sslNegotiation: Computed server-side as connectEnd - secureConnectionStart.

Caveats

  • no-zeroes: In navtiming-1, we exclude zero values from all metrics before computing aggregates. This has several implications:
    • In navtiming-1, the average represents only cases where the client actually had to perform a given task. For example, the average "dnsLookup" in navtiming-1 does NOT represent the average time spent on DNS across all page views. Rather, it represents the average time required to perform a DNS lookup when one takes place. This is important because on repeat views, browsers usually re-use TCP or HTTPS sessions, which means no DNS is involved. Even if a new connection is required, the DNS lookup may still be a local cache hit and round to 0. In navtiming-1, only non-zero values are considered in the Graphite aggregates.
    • In navtiming-1, the sample rates effectively vary from metric to metric. Tasks whose result is cacheable or re-usable (e.g. DNS lookup, TCP connection, SSL session) don't happen on all pages, or report as zero. These individual data points are then excluded from those metrics only.
    • This caveat mostly affects fetchStart, dnsLookup, redirecting, connecting, sslNegotiation.
  • sane-filter: In navtiming-1, values that are invalid (negative) or above an upper threshold of 180 seconds (3 minutes) are excluded. Because the exclusion applies to individual metric values rather than to whole events, reporting in Graphite is uneven. For example, a page reporting dnsLookup=0, domInteractive=100s, domComplete=200s will only send domInteractive=100s to Statsd.
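
The two caveats above can be sketched as a per-value filter. This is a minimal illustration, not the navtiming source: the function name navtiming1_filter is hypothetical, values are assumed to be in milliseconds, and treating exactly 180 s as still valid is an assumption.

```python
MAX_MS = 180 * 1000  # upper bound from the text: 180 s (3 min)


def navtiming1_filter(metrics):
    """Keep only the values that would survive the navtiming-1 caveats.

    Hypothetical helper: each value is judged on its own, so one event
    can contribute to some metrics and not to others.
    """
    kept = {}
    for name, value in metrics.items():
        if value == 0:
            continue  # no-zeroes: zero values are dropped
        if value < 0 or value > MAX_MS:
            continue  # sane-filter: negative or above the threshold
        kept[name] = value
    return kept


event = {"dnsLookup": 0, "domInteractive": 100_000, "domComplete": 200_000}
print(navtiming1_filter(event))  # only domInteractive remains
```

Because filtering is per value, the number of samples behind each Graphite metric can differ even though all values came from the same set of page views.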

navtiming-2 metrics

The navtiming-2 metrics are available in Graphite under the frontend.navtiming2 prefix.

Differences from navtiming-1

Notable differences:

  • Offsets are computed relative to fetchStart instead of navigationStart.
  • We no longer filter out zero values.
  • The sanity filter no longer has an upper bound.
  • When the sanity filter encounters negative numbers, it rejects the entire event instead of just the individual data point.

See phab:T104902 for more information about why the metrics were redefined.
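
To make the change in filtering concrete, here is a minimal sketch of the navtiming-2 behaviour as described in the list above: zeroes and large values are kept, and a single negative value discards the whole event. The function name navtiming2_filter is hypothetical and the code is not taken from the navtiming source.

```python
def navtiming2_filter(metrics):
    """Return all metrics of an event, or None if any value is negative.

    Hypothetical helper: unlike navtiming-1, there is no zero filter and
    no upper bound, and rejection applies to the entire event.
    """
    if any(value < 0 for value in metrics.values()):
        return None  # reject the whole event
    return dict(metrics)


ok = {"dnsLookup": 0, "domInteractive": 100_000, "domComplete": 200_000}
bad = {"dnsLookup": 12, "responseStart": -5, "domComplete": 900}
print(navtiming2_filter(ok))   # all three values are kept
print(navtiming2_filter(bad))  # None
```

Rejecting whole events keeps the sample count consistent across metrics, which is one of the effects of the redefinition.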

Offsets:

  • responseStart: From PerformanceTiming, relative to fetchStart.
  • firstPaint: (non-standard)
  • domInteractive: From PerformanceTiming, relative to fetchStart.
  • domComplete: From PerformanceTiming, relative to fetchStart.
  • loadEventStart: From PerformanceTiming, relative to fetchStart.
  • loadEventEnd: From PerformanceTiming, relative to fetchStart.
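
As an illustration of what "relative to fetchStart" means, the sketch below rebases offsets that were collected relative to navigationStart. It assumes the incoming event carries fetchStart alongside the other offsets, all in milliseconds; the function name and the exact field handling are assumptions, and firstPaint is left out because its source is non-standard.

```python
OFFSETS = (
    "responseStart",
    "domInteractive",
    "domComplete",
    "loadEventStart",
    "loadEventEnd",
)


def rebase_to_fetch_start(event):
    """Re-express navigationStart-relative offsets relative to fetchStart.

    Hypothetical helper: offsets missing from the event are skipped.
    """
    fetch_start = event["fetchStart"]
    return {
        name: event[name] - fetch_start
        for name in OFFSETS
        if name in event
    }


event = {"fetchStart": 50, "responseStart": 250, "domComplete": 1050}
print(rebase_to_fetch_start(event))
# {'responseStart': 200, 'domComplete': 1000}
```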

Deltas:

  • dns: Computed client-side from PerformanceTiming domainLookupEnd - domainLookupStart. (Transmitted as "dnsLookup")
  • unload: Computed client-side from PerformanceTiming unloadEventEnd - unloadEventStart.
  • redirect: Computed client-side from PerformanceTiming redirectEnd - redirectStart. (Transmitted as "redirecting")
  • mediaWikiLoad: Computed client-side based on custom mwLoadEnd and mwLoadStart measures. (Transmitted as "mediaWikiLoadComplete").
  • tcp: Computed server-side as connectEnd - connectStart. This includes SSL negotiation.
  • request: Computed server-side as responseStart - requestStart.
  • response: Computed server-side as responseEnd - responseStart.
  • processing: Computed server-side as domComplete - responseEnd.
  • onLoad: Computed server-side as loadEventEnd - loadEventStart.
  • ssl: Computed server-side as connectEnd - secureConnectionStart. This is a subset of tcp.
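
The server-side deltas listed above are plain subtractions over PerformanceTiming fields. The following sketch shows them together; the function name is hypothetical, and it assumes every field is present and uses the same time base, which the real service cannot take for granted (for example, secureConnectionStart is not meaningful for non-HTTPS connections).

```python
def server_side_deltas(t):
    """Compute the navtiming-2 server-side deltas from timing fields.

    Hypothetical helper: `t` maps PerformanceTiming field names to
    millisecond values.
    """
    return {
        "tcp": t["connectEnd"] - t["connectStart"],  # includes SSL negotiation
        "request": t["responseStart"] - t["requestStart"],
        "response": t["responseEnd"] - t["responseStart"],
        "processing": t["domComplete"] - t["responseEnd"],
        "onLoad": t["loadEventEnd"] - t["loadEventStart"],
        "ssl": t["connectEnd"] - t["secureConnectionStart"],  # subset of tcp
    }


timing = {
    "connectStart": 10, "secureConnectionStart": 30, "connectEnd": 60,
    "requestStart": 60, "responseStart": 200, "responseEnd": 260,
    "domComplete": 900, "loadEventStart": 900, "loadEventEnd": 910,
}
print(server_side_deltas(timing))
```

In this sample, ssl (30 ms) is part of tcp (50 ms) because both end at connectEnd, matching the note that ssl is a subset of tcp.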

SaveTiming metrics

SaveTiming metrics are reported to mw.performance.save in statsd. To check that the pipeline is running properly, confirm that the mw.performance.save.sample_rate key is receiving hits.

Former services

mw-js-deprecate

The mw-js-deprecate metrics come from the 'mw.deprecate' event, which MediaWiki core JavaScript fires from mw.log.deprecate() using mw.track(). The data is then reported via statsv and statsd to the mw.js.deprecate.* namespace in Graphite.

Before Statsv, this information was logged via the DeprecatedUsage schema using EventLogging.
