Performance/Essay/Future of the NavigationTiming extension (2021)

Background

The https://www.mediawiki.org/wiki/Extension:NavigationTiming extension is Wikimedia's own real user measurement (RUM) solution for measuring performance as experienced by Wikipedia users. For a couple of years the Wikimedia Performance Team has had a task to look into whether we should replace the solution with something else. The problems we want to solve are the following:

Do we collect relevant metrics?
How can we minimise the time spent for the performance team adding a new metric?
How can we enable developers to use the User Timing and Element Timing APIs to get more valuable metrics? The goal would be to make it as easy as possible: when you as a developer add a new User Timing, our RUM solution automatically picks it up and it is drawn in one of the performance graphs.
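As a minimal sketch of what "automatically picks it up" could look like, the function below turns User Timing entries into a flat payload without needing to know the metric names in advance. The names collectUserTimings and Beacon and the payload shape are hypothetical and are not taken from the extension; only the entry fields (name, entryType, startTime, duration) come from the Performance Timeline API.

```typescript
// Subset of the PerformanceEntry fields that User Timing entries expose.
interface TimingEntry {
  name: string;
  entryType: string; // "mark" or "measure" for User Timing entries
  startTime: number;
  duration: number;
}

// Hypothetical payload shape: metric name -> milliseconds.
type Beacon = { [name: string]: number };

// Collect every mark and measure, whatever its name is.
// Marks report when they happened, measures report how long they took.
// If a name occurs more than once, the last entry wins.
function collectUserTimings(entries: TimingEntry[]): Beacon {
  const beacon: Beacon = {};
  for (const entry of entries) {
    if (entry.entryType === "mark") {
      beacon[entry.name] = Math.round(entry.startTime);
    } else if (entry.entryType === "measure") {
      beacon[entry.name] = Math.round(entry.duration);
    }
    // Any other entry type (resource, paint, ...) is ignored here.
  }
  return beacon;
}

// In a browser the input could come from:
//   performance.getEntriesByType("mark")
//     .concat(performance.getEntriesByType("measure"))
```

With something like this in place, a developer would only call performance.mark() or performance.measure() in their own code, and the new name would show up in the collected data without a change to the collector.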

Current status

The NavigationTiming extension has grown over time. It started out collecting metrics from the Navigation Timing API to give us some insight into performance for Wikipedia users.

Collected metrics

In August 2021 the extension collected the following metrics:

Navigation timing metrics - metrics from the Navigation Timing API that all modern browsers support. We also include the gaps between the metrics, since a couple of years ago we suspected that not all browsers were reporting the metrics correctly. We also get the transfer size of the main document.
Server Timing metrics - using the Server Timing API to get the host and cache type for the main document

Feature policies violations - using the feature-policy-violation observer to get violations. I don't think we collect those in the backend though.

Layout shift - using the Layout Shift API in Chromium browsers. The current version collects every individual layout shift, not the cumulative (or windowed cumulative) score that Google recommends as one of its Web Vitals.

Save timings - our own metric measuring an edit submission

Central notice timings - a user timing, telling us that we at that moment show a central notice banner. Collecting that helps us know if central notice is affecting our metrics.

Top image resource timing - get all Resource Timing API metrics from what we think (by doing some magic) is the article image.

Our own CPU benchmark with battery level - to get a feeling for what happens with our users' hardware over time. We also get the battery level if the user is on an Android phone, since a low battery level slows down the phone.

First Input delay - we collect all information about first input from the First Input API.

Element timing custom metric - we collect all metrics from the Element Timing API.

RUM Speed Index - we use the https://github.com/WPO-Foundation/RUM-SpeedIndex code to try to get the same Speed Index as we can do in synthetic tools.
Paint timings - we collect first paint and first contentful paint through the Paint Timing API. For browsers that support old/non-standardised ways of getting the first paint metrics, we use the old ways.

The extension also shows the performance survey.

Missing metrics

We collect many metrics but we also miss out on some of the latest ones in the web performance community: Largest Contentful Paint, Cumulative Layout Shift and (CPU) long tasks.

Alternatives

We have a couple of ways to move forward with how we collect metrics.

Changing tool

In the past we've been talking about replacing our own NavigationTiming extension with an Open Source alternative. Using an Open Source alternative (e.g. Boomerang) could mean that we do not need to implement every new metric that we want to collect ourselves. There are also many quirks in browsers that we have run into through the years, so using something that others also use could potentially help us avoid those.

Adopting another tool needs a lot of work though: we need to adapt the tool so we can receive the metrics in the backend, we need to review the tool (and every upgrade), we need to configure the tool so it collects what we need, and we potentially need to add missing metrics that we collect today.

Crafting a new more generic tool

Another alternative is to create a new extension where we try to be more generic in how we collect metrics. That way our real user measurement collector could be something other web properties can use. The upside is that we would share our knowledge with the rest of the performance community, and in the best case we could also get help from others to develop it, since that extension would have a bigger audience.

Clean up the current version

There's also a third alternative: clean up the current version of the NavigationTiming extension (stop collecting metrics that we do not use) and add the metrics that we are missing. That brings the extension up to date, and we can then push the decision about moving to another tool or creating a new one to the future.

Tuning the navigation timing extension

Adopting another tool or building a new tool is good for the long term but it will need a lot of work. This would probably take a couple of quarterly goals, but we need to look into it more to give a more exact estimate.

Cleaning up the current version first makes most sense at the moment. By removing the metrics we don't use and adding the metrics we are missing, the extension is up to date with what we need for now. As the next step we can fine-tune it and make it more generic so developers can add their own metrics. This work can be done within one quarter by one person. We can then postpone evaluating Boomerang or creating a more generic tool, and focus on it when we think it's important for the team. We will also create more value faster by collecting the new metrics that we are missing.

Remove unused metrics

We should remove the metrics that we don't use. That will decrease the code in the extension and the amount of data we collect.

Remove RUM Speed Index
Remove Top image resource timing
Remove battery level (once we have published some kind of study)
Remove the fields from Element Timing API that are used to identify the element. The name, identifier, url and render time are enough.
Remove fields from First Input Delay and keep just the FID metric.
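For the First Input Delay item, the single value worth keeping is the delay itself. A minimal sketch, assuming we read the first-input entry as exposed by the Event Timing API (the function name firstInputDelay is hypothetical):

```typescript
// The two fields of a "first-input" performance entry that FID needs.
interface FirstInputEntry {
  startTime: number;       // when the user interacted
  processingStart: number; // when the browser could start running event handlers
}

// FID is the time the input had to wait before it could be processed.
function firstInputDelay(entry: FirstInputEntry): number {
  return entry.processingStart - entry.startTime;
}
```

Everything else on the entry (event name, target, processing end and so on) would then no longer be sent.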

Tune some of the metrics we collect to collect only the bare minimum

Remove the information about which element shifted from Layout Shift and just collect the cumulative metric, the same way Google collects it for the Google Web Vitals.
Look into if we can collect the Central notice timings as part of the User Timing metrics (do a "stop" list of names).
Rewrite Element timings and add a stop list of ids of what we will collect.
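To make the cumulative layout shift item concrete, here is a minimal sketch. It assumes the session window definition that Google has documented for Cumulative Layout Shift since 2021: shifts are grouped into windows where consecutive shifts are less than 1 second apart and a window spans less than 5 seconds, and the score is the largest window sum. The function name and the LayoutShift interface are hypothetical; the fields mirror the layout-shift performance entries.

```typescript
// Fields of a "layout-shift" performance entry that the score needs.
interface LayoutShift {
  startTime: number;       // milliseconds since navigation start
  value: number;           // layout shift score of this single shift
  hadRecentInput: boolean; // true if the shift followed recent user input
}

const MAX_GAP_MS = 1000;    // a window ends after 1 s without a shift
const MAX_WINDOW_MS = 5000; // a window never spans 5 s or more

// Expects shifts ordered by startTime, as an observer delivers them.
function cumulativeLayoutShift(shifts: LayoutShift[]): number {
  let worst = 0;        // largest window sum seen so far
  let windowScore = 0;  // sum of the current window
  let windowStart = 0;  // startTime of the first shift in the current window
  let previous = 0;     // startTime of the last counted shift
  let open = false;     // whether a window has been started

  for (const shift of shifts) {
    // Shifts caused by user input do not count towards the score.
    if (shift.hadRecentInput) continue;

    const fits =
      open &&
      shift.startTime - previous < MAX_GAP_MS &&
      shift.startTime - windowStart < MAX_WINDOW_MS;

    if (fits) {
      windowScore += shift.value;
    } else {
      // Start a new window with this shift.
      windowScore = shift.value;
      windowStart = shift.startTime;
      open = true;
    }
    previous = shift.startTime;
    worst = Math.max(worst, windowScore);
  }
  return worst;
}
```

Only the resulting number needs to be sent, so the per-element details of each shift can be dropped.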

Add missing metrics

Collect Largest Contentful Paint
Collect Long Tasks (total, total length and number before first paint)
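The three long task values could be derived as in the sketch below. It assumes that "before first paint" means the task started before the first paint timestamp; that reading, the function name and the summary field names are assumptions and not an agreed schema.

```typescript
// Fields of a "longtask" performance entry that the summary needs.
interface LongTask {
  startTime: number; // milliseconds since navigation start
  duration: number;  // milliseconds, 50 or more for a long task
}

// Hypothetical summary shape for the three values in the list above.
interface LongTaskSummary {
  total: number;            // number of long tasks
  totalLength: number;      // summed duration in milliseconds
  beforeFirstPaint: number; // long tasks that started before first paint
}

function summarizeLongTasks(tasks: LongTask[], firstPaint: number): LongTaskSummary {
  const summary: LongTaskSummary = { total: 0, totalLength: 0, beforeFirstPaint: 0 };
  for (const task of tasks) {
    summary.total += 1;
    summary.totalLength += task.duration;
    if (task.startTime < firstPaint) {
      summary.beforeFirstPaint += 1;
    }
  }
  return summary;
}
```

In a browser the tasks would come from a PerformanceObserver for the longtask entry type, and firstPaint from the Paint Timing API.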

New schema: perfbeacon?

Evaluate adding a new schema "perfbeacon" where we collect all metrics that happen after loadEventEnd.

Outstanding issues

There are a couple of issues we should discuss within the team before we move on:

Should we remove "Feature policies violations"?

Should we remove the gaps between Navigation Timing metrics?
Should we introduce a new schema for all the metrics?

Then we have the issue with the performance study. Right now no one within the team is responsible for it. Should we remove it?

Next steps

Remove the metrics that we don't use.
Evaluate a new schema for metrics, tune the others that we are going to keep and iterate on them within the team.
Send data to the new schema
Add the new metrics
Validate the data in the new schema against the old schema
Stop sending metrics to the old schema(s)