Performance/Essay/Performance survey (2019)


We started the Performance survey with the intention of identifying which metrics best represent a satisfying experience, as defined by real people. We assumed that generally "faster is better" (which we confirmed, no surprises there!), and we hoped that one or more metrics available (or computable) in modern browsers would correlate with a positive experience. Specifically, we wanted to find out whether "page load time" (loadEventEnd) is still a good indicator, and what better indicators might be available in browsers today.

How

To give this hypothesis the best chance, we analyzed both the full range of our long-term RUM metrics (such as Time to first byte, DNS latency, redirect count, First paint time, and page load time) and several new metrics that we collected temporarily. These new metrics included: RUM SpeedIndex, Element Timing for "time to first paragraph", and Resource Timing for the transfer size and download time of the first image.
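As a minimal sketch (not the project's actual instrumentation), most of these metrics can be read from the browser's standard Navigation Timing, Paint Timing, Element Timing, and Resource Timing APIs; RUM SpeedIndex is computed separately and is not shown. The "first-paragraph" identifier below is a hypothetical stand-in for however the page marks its first paragraph:

```typescript
// Read page-level timing metrics from the browser's performance APIs.
// Assumes Navigation Timing Level 2 support.
const [nav] = performance.getEntriesByType(
  "navigation"
) as PerformanceNavigationTiming[];

const metrics = {
  ttfb: nav.responseStart,                          // Time to first byte
  dns: nav.domainLookupEnd - nav.domainLookupStart, // DNS latency
  redirectCount: nav.redirectCount,
  loadTime: nav.loadEventEnd,                       // "page load time"
  firstPaint: performance.getEntriesByName("first-paint")[0]?.startTime,
};

// Element Timing: fires for elements annotated with elementtiming="...".
// "first-paragraph" is a hypothetical identifier for the first paragraph.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const el = entry as PerformanceEntry & { identifier?: string };
    if (el.identifier === "first-paragraph") {
      console.log("time to first paragraph:", entry.startTime);
    }
  }
}).observe({ type: "element", buffered: true });

// Resource Timing: transfer size and download time of the first image.
const firstImage = (
  performance.getEntriesByType("resource") as PerformanceResourceTiming[]
).find((e) => e.initiatorType === "img");
if (firstImage) {
  console.log("image bytes:", firstImage.transferSize);
  console.log(
    "image download ms:",
    firstImage.responseEnd - firstImage.responseStart
  );
}
```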

We also included several pieces of metadata that, while not directly under our control, could help explain or counteract certain biases in the data; for example, a particularly negatively perceived article, or an account experience level or geography that correlated particularly positively or negatively. This permits relative analysis within a given group (e.g. analyzing categorically slower experiences separately). The metadata included: page weight (transferSize), CPU benchmark score, wiki hostname, account metadata such as edit count (<10 edits, 10-100, etc.), article metadata such as page ID and image count, device metadata such as country, time of day, and connection type (3G/WiFi), and whether a CentralNotice banner was presented on the pageview.
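Some of this metadata is available client-side; a hedged sketch follows. Note that navigator.connection is a non-standard (Chromium-only) API, and account and article fields would come from the wiki's own page context rather than the browser:

```typescript
// Page and device metadata to contextualize the timing metrics.
const [navEntry] = performance.getEntriesByType(
  "navigation"
) as PerformanceNavigationTiming[];

// navigator.connection is non-standard (Chromium only), hence the cast.
const connection = (navigator as { connection?: { effectiveType?: string } })
  .connection;

const meta = {
  pageWeight: navEntry?.transferSize,                      // bytes over the wire
  wikiHostname: location.hostname,
  hourOfDay: new Date().getHours(),                        // "time of day"
  connectionType: connection?.effectiveType ?? "unknown",  // e.g. "3g", "4g"
};
```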

The research project revolved mainly around training a machine learning model with all of the above data as input, with the task of predicting how a real user would respond to the perf survey (positive or negative experience), and thereby identifying which metric(s) are most predictive of a good or bad perceived experience.
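The page does not name the model or toolkit that was used. Purely as an illustration of the setup, here is a minimal logistic-regression classifier trained with stochastic gradient descent, where each sample is one surveyed pageview (metric and metadata features, plus a 0/1 label for a negative/positive response):

```typescript
// One surveyed pageview: numeric features (metrics + metadata) and the
// survey response as the label (1 = positive experience, 0 = negative).
type Sample = { features: number[]; label: 0 | 1 };

const sigmoid = (z: number): number => 1 / (1 + Math.exp(-z));

// Train logistic regression with plain stochastic gradient descent.
function train(samples: Sample[], epochs = 200, lr = 0.01): number[] {
  const dim = samples[0].features.length;
  const w = new Array<number>(dim + 1).fill(0); // last slot holds the bias
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const { features, label } of samples) {
      const z = features.reduce((sum, x, i) => sum + w[i] * x, w[dim]);
      const err = sigmoid(z) - label; // gradient of the log-loss
      for (let i = 0; i < dim; i++) {
        w[i] -= lr * err * features[i];
      }
      w[dim] -= lr * err;
    }
  }
  return w;
}
```

With features standardized to comparable scales, the magnitude of each learned weight gives a rough signal of how predictive the corresponding metric is of a positive or negative response.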

Learned

In a nutshell, we learned that (blog post):

  1. the new and upcoming browser metrics did not include one that represented user experience better than page load time (in fact, most were significantly worse).
  2. all metrics, including our existing ones, correlate relatively poorly with survey responses (our "best" metric scored a Pearson coefficient of a mere 0.14, where 1 would mean perfect correlation; see the sketch after this list).
  3. domComplete (page load time) was that "best" one.
  4. firstContentfulPaint (paint timing) was a very close second.
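For reference, the Pearson coefficient cited above measures the linear correlation between a metric and the numerically encoded survey responses. A sketch of the computation:

```typescript
// Pearson correlation between a metric (e.g. page load time per pageview)
// and survey responses encoded as numbers (e.g. positive = 1, negative = 0).
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mean = (a: number[]) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs);
  const my = mean(ys);
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}
```

A coefficient of ±1 would mean a metric perfectly tracks survey responses, so 0.14 in absolute terms indicates only a weak linear relationship.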
