Query string normalization

Problem

The following two URLs are different from the point of view of the caching software running on a CDN node (ATS, Varnish) but represent the same page:

To make optimum use of cache resources, we want to normalize these requests so they have a single canonical form. We do this by sorting query parameters in Varnish.

Theory

Applications are generally insensitive to the order of query parameters, but there is a class of edge cases involving query parameter keys that appear multiple times in a query string. The way applications handle these varies:

  • Lattermost occurrence wins: for example, this URL should lead to the edit (action=edit) interface:
    https://en.wikipedia.org/w/index.php?title=Varnish&action=history&action=edit
    Swapping the two action parameters would instead take the user to the history interface, which is undesirable.
  • First occurrence wins: for example, JavaScript's URLSearchParams.get() returns the first value for a repeated key:
    const url_with_dupes = 'https://wtf/?foo=b&foo=a';
    console.log(new URL(url_with_dupes).searchParams.get('foo'));  // output: b
  • Values are additive: for example, given the query string ?foo[]=b&foo[]=a, PHP decodes foo to ['b', 'a']. Re-ordering the parameters would change the order of values in the resulting array.

The way to handle all such cases correctly (i.e., ensure that the re-ordering of query parameters is transparent to the backend application) is to preserve duplicate parameters and to use a stable sort that maintains their relative order.
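
The idea can be illustrated with a minimal JavaScript sketch. This is not the libvmod-querysort code: sortQueryStable is a hypothetical helper, it assumes the URL carries no fragment, it compares raw (undecoded) keys, and it does not attempt to cover the PHP array syntax and URL encoding subtleties mentioned below. It relies on Array.prototype.sort being stable, which the language guarantees since ES2019.

```javascript
// Hypothetical helper for illustration only; not the libvmod-querysort implementation.
// Sorts query parameters by key while keeping duplicates in their original relative order.
function sortQueryStable(url) {
  const q = url.indexOf('?');
  if (q === -1) return url; // no query string, nothing to sort

  const base = url.slice(0, q);
  const params = url.slice(q + 1).split('&');

  // The key is everything before the first '=' (or the whole parameter if there is none).
  const keyOf = (param) => {
    const eq = param.indexOf('=');
    return eq === -1 ? param : param.slice(0, eq);
  };

  // Compare keys only. Parameters with equal keys compare as 0, so the
  // stable sort leaves them in the order they arrived in.
  params.sort((a, b) => {
    const ka = keyOf(a);
    const kb = keyOf(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });

  return base + '?' + params.join('&');
}

console.log(sortQueryStable('/w/index.php?title=Varnish&action=history&action=edit'));
console.log(sortQueryStable('/?foo[]=b&foo[]=a&bar=1'));
```

In both calls the duplicate keys (action and foo[]) move as a group but keep their original relative order, so a backend that picks the first value, the last value, or all values sees the same result as before sorting.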

There are some additional subtleties related to PHP array syntax and URL encoding. See the test cases for libvmod-querysort.

MediaWiki

In its default configuration, the CDN expiry code in MediaWiki is sensitive to parameter ordering. Specifically, MediaWiki allows the full Cache-Control: s-maxage= only if the request URL exactly matches one of the URL forms that get purged on article update. The intent is to ensure that we don't cache something with no way to purge it. The canonical forms for a standard article include /w/index.php?title=X&action=history. For example:

These two forms are a problem, because they place title= before action=, and the order of these parameters is flipped when query parameters are sorted, which in turn affects expiry. To handle this, in I3c52ca47e09 we introduced a configuration variable to MediaWiki, $wgCdnMatchParameterOrder, that can be set to false to make the CDN URL matching insensitive to parameter order. This configuration variable is set to false for all Wikimedia wikis.
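
To illustrate what order-insensitive matching means here, the following JavaScript sketch compares a request URL against a purgeable URL form while ignoring parameter order. It is not MediaWiki's implementation (MediaWiki is written in PHP), and splitUrl and matchesIgnoringParamOrder are hypothetical names. The sketch assumes that two URLs match when their paths are equal and their query parameters are equal as multisets of raw key=value strings.

```javascript
// Hypothetical helpers for illustration only; not MediaWiki code.
// Splits a URL into its path and a sorted list of raw "key=value" parameters.
function splitUrl(url) {
  const q = url.indexOf('?');
  if (q === -1) return { path: url, params: [] };
  return { path: url.slice(0, q), params: url.slice(q + 1).split('&').sort() };
}

// Returns true if both URLs have the same path and the same set of
// parameters (including duplicates), regardless of parameter order.
function matchesIgnoringParamOrder(requestUrl, purgeableUrl) {
  const a = splitUrl(requestUrl);
  const b = splitUrl(purgeableUrl);
  return (
    a.path === b.path &&
    a.params.length === b.params.length &&
    a.params.every((param, i) => param === b.params[i])
  );
}

// A request whose parameters were sorted by the CDN still matches the canonical form.
console.log(matchesIgnoringParamOrder(
  '/w/index.php?action=history&title=X',
  '/w/index.php?title=X&action=history'
));
```

An exact string comparison of the same two URLs would fail, which is the situation the configuration variable is meant to avoid.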

Implementation

Varnish

Query parameter normalization is done in Varnish using a custom vmod, libvmod-querysort. The code in this vmod is a fork of Varnish's std.querysort, modified to make the sort stable and to handle PHP array syntax correctly. If you are porting this code to another caching platform, make sure the new implementation passes the same test cases.

Query normalization was rolled out in August 2022.

Purged

purged sorts query parameters as well. This was implemented (Ia6b494662) as a stop-gap when we realized that purge requests are handled by ATS and don't get rewritten by Varnish. A better solution is to apply query string normalization in ATS. Some notes about how to do that are available in task T317064 comment 8213291.

See also

  • task T138093: Investigate query parameter normalization for MW/services