User:Ori/Performance engineering

Rationale

Performance engineering at the Wikimedia Foundation has been oriented toward the backend, simply because that's where the hard scalability problems were, and because historically that's where we had the ability to profile code and measure performance. Over time, the Wikimedia Foundation has built up a good stack of performance monitoring tools, and staff have accumulated expertise in using these tools to spot bottlenecks & resolve them. At the same time, certain aspects of site performance are not represented in our tooling and practices: we don't have a good grip on users' actual experience of site performance. We know that user experience is shaped by the user's location relative to our data centers, the limitations of their network link, the performance characteristics of the device that they use to view the site, and the behavior of JavaScript code that executes in their user agent. With 92-95% of requests served entirely out of the cache, it's clear that we should pay more attention to these factors.

I think that there's a lot that we can do on this front, and a lot we stand to gain. Getting a better picture of performance from the perspective of user experience can help us

  1. Develop engineering habits & best practices that are attuned to client-side performance considerations

Everybody knows that serious issues with JavaScript performance exist, but only a small handful of individuals (Roan, Timo, Trevor) have experience in identifying them. We can help make these skills more diffuse throughout the engineering department by surfacing clear and reliable data about client-side performance and by writing documentation and giving presentations about how to interpret it.

Recently, two major products (ULS and VE) were significantly set back because they shipped with severe performance issues. Because our JavaScript code runs on devices and JavaScript engines with widely varying performance characteristics, such issues often remain invisible to developers unless they happen to manifest in their own development environment.

  2. Determine where we should locate new data centers

We're already capturing geo-coded network latency measurements from a sample of production traffic. Asher and Faidon both think that we should use this data to inform procurement, peering, etc. Faidon and I have concrete plans for using the data to gauge the impact of routing countries in the Middle East to esams rather than eqiad (the change happened earlier this month); a rough sketch of that analysis follows this list.

  3. Assess the value of new technologies like SPDY and HTTP 2.0

Deploying these technologies often involves making expensive and time-consuming changes to our infrastructure, and it's hard to justify that investment without a way of measuring the impact they will actually have.
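
To make the "gauge the impact" part of item 2 a bit more concrete, here is a hypothetical sketch of the kind of before/after comparison we have in mind. The sample format, field names, and the medianTtfbByCountry function are illustrative assumptions rather than an existing schema or tool: the input is imagined as geo-coded Navigation Timing samples, and the cutoff is the time at which the affected countries were switched from eqiad to esams.

    // Hypothetical sketch: compare median time-to-first-byte per country
    // before and after a routing change. "samples" is assumed to be an array
    // of objects like { country: 'AE', ttfb: 840, timestamp: 1375142400 };
    // this is not a real EventLogging schema.
    function medianTtfbByCountry( samples, countries, cutoff ) {
        var before = {}, after = {}, result = {};

        samples.forEach( function ( s ) {
            if ( countries.indexOf( s.country ) === -1 ) {
                return;
            }
            var bucket = s.timestamp < cutoff ? before : after;
            ( bucket[ s.country ] = bucket[ s.country ] || [] ).push( s.ttfb );
        } );

        // Median of a list of numbers (simple midpoint, good enough for a sketch).
        function median( values ) {
            values.sort( function ( a, b ) { return a - b; } );
            return values[ Math.floor( values.length / 2 ) ];
        }

        countries.forEach( function ( c ) {
            result[ c ] = {
                beforeMs: before[ c ] ? median( before[ c ] ) : null,
                afterMs: after[ c ] ? median( after[ c ] ) : null
            };
        } );
        return result;
    }

    // e.g. medianTtfbByCountry( samples, [ 'AE', 'SA', 'IR' ], routingChangeTime );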

How do we get there?

I think the next step is to continue the work of interpreting Navigation Timing data. Navigation Timing and its complements (User Timing and Performance Timeline) are a set of browser APIs that expose precise latency measurements from the underlying browser engine to JavaScript code. We're currently using the EventLogging and NavigationTiming extensions to capture these measurements from a sample of our traffic. However, the raw data is difficult to interpret. There's a lot of work we need to do to correlate changes in these measurements with changes in the size of our static asset payload, the performance characteristics of JavaScript code, and changes in network configuration. I've been documenting my progress on this front on Wikitech, and I've already received very useful feedback from Dario and Roan. My goal is to identify the set of measurements that give the clearest indication of performance problems and to figure out how best to aggregate and visualize them.
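
As a rough illustration of the kind of client-side measurement involved, the sketch below reads a few intervals from the window.performance.timing object once the page has loaded. The interval names and the final logging call are illustrative only; the actual NavigationTiming extension reports a schema-defined set of fields via EventLogging.

    // Minimal sketch: derive a few latency intervals from the Navigation
    // Timing API after the load event, when loadEventStart is populated.
    window.addEventListener( 'load', function () {
        var t = window.performance && window.performance.timing;
        if ( !t ) {
            return; // browser does not implement Navigation Timing
        }
        var event = {
            dnsLookup:  t.domainLookupEnd - t.domainLookupStart,
            connecting: t.connectEnd - t.connectStart,
            waiting:    t.responseStart - t.requestStart, // time to first byte
            receiving:  t.responseEnd - t.responseStart,
            rendering:  t.loadEventStart - t.responseEnd,
            pageLoad:   t.loadEventStart - t.navigationStart
        };
        // In MediaWiki this would be beaconed with something like:
        // mw.eventLog.logEvent( 'NavigationTiming', event );
        console.log( event );
    }, false );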

One limitation of the Navigation Timing data is that it can indicate the existence of performance problems in client-side code, but it does not identify the exact component that is responsible. The JavaScript payload that accompanies page views concatenates code from different products developed by different teams, so if we simply report that page rendering time has increased, it would still not be clear who is on the hook for investigating it. I think that we can get this data by adding profiling capabilities to ResourceLoader. Because ResourceLoader centralizes the logic for loading and executing modules, we can use it to measure the time it takes to fetch, parse & execute individual modules. (Although many modules defer work to $( document ).ready, it is still possible to profile this code by having ResourceLoader inject a modified jQuery object into module scope that overloads jQuery.fn.on, the method which modules use to bind handlers to events.)
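
To make that idea concrete, here is a rough sketch of what per-module profiling could look like. None of these names are real ResourceLoader internals: executeModuleProfiled stands in for whatever code path actually runs a module's script, and profileHandler is the wrapper that an injected jQuery could apply inside its overloaded jQuery.fn.on so that deferred work is attributed to the right module.

    // Hypothetical sketch only; none of these names are real ResourceLoader
    // internals. The idea is that the one place that executes module code can
    // time that work and attribute the cost to the module by name.
    var moduleProfile = {};

    function now() {
        return ( window.performance && performance.now ) ?
            performance.now() : new Date().getTime();
    }

    // Called in place of executing a module's script directly.
    function executeModuleProfiled( moduleName, runScript ) {
        var start = now();
        runScript(); // execute the module's script
        moduleProfile[ moduleName ] =
            ( moduleProfile[ moduleName ] || 0 ) + ( now() - start );
    }

    // Wrap an event handler so that time spent inside it is also attributed
    // to the module that bound it. The modified jQuery injected into module
    // scope could apply this wrapper inside its overloaded jQuery.fn.on.
    function profileHandler( moduleName, handler ) {
        return function () {
            var start = now(),
                result = handler.apply( this, arguments );
            moduleProfile[ moduleName ] =
                ( moduleProfile[ moduleName ] || 0 ) + ( now() - start );
            return result;
        };
    }

Per-module timings collected this way could then be beaconed through EventLogging alongside the Navigation Timing data, so that a regression shows up against the module that introduced it.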