URI path normalization

Problem

The following two URLs are different from the point of view of the caching software running on a CDN node (ATS, Varnish) but do represent the same page:

  • https://en.wikipedia.org/wiki/Steve_Fuller_(sociologist)
  • https://en.wikipedia.org/wiki/Steve_Fuller_%28sociologist%29

Pages like the above, with parentheses and certain other special characters in their titles, have more than one correct URL: one with literal parentheses, one with the parentheses URL-encoded, and any mix of the two. However, when a page changes, a purge is sent for only one of these representations: if the page is cached under a different one, the cached object never gets purged.

We need URLs to be converted to a single, univocal representation before caching and fetching objects from cache. Also, the conversion needs to happen before purging, so that if Steve_Fuller_(sociologist) is cached, a PURGE for Steve_Fuller_%28sociologist%29 invalidates the object.

The question is: given a URL, which characters should be encoded (eg: ! → 0x21), which hex escapes should be decoded (eg: 0x7e → ~), and which characters/hex escapes should be left untouched?

Theory

RFC 3986 section 2 partitions the 256 possible byte values into 3 sets: Unreserved, Disallowed and Reserved:

  • 66 Unreserved: 0-9 A-Z a-z - . _ ~
  • 172 Disallowed: 0x00-0x20 0x7F-0xFF < > | { } " % \ ^ `
  • 18 Reserved: : / ? # [ ] @ ! $ & ' ( ) * + , ; =

Unreserved and Disallowed characters pose no problem when it comes to obtaining a univocal representation: we decode the former and encode the latter.
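
A minimal sketch of this partition (the enum and function names are invented for this example, not taken from any production code), which reproduces the 66/18/172 counts:

#include <stdio.h>
#include <string.h>

/* RFC 3986 section 2 byte classification. */
enum rfc3986_class { UNRESERVED, RESERVED, DISALLOWED };

static enum rfc3986_class classify(unsigned char c)
{
    if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
        (c >= 'a' && c <= 'z') || (c && strchr("-._~", c)))
        return UNRESERVED;
    if (c && strchr(":/?#[]@!$&'()*+,;=", c))
        return RESERVED;
    return DISALLOWED;   /* control bytes, space, " % < > and so on */
}

int main(void)
{
    int counts[3] = { 0, 0, 0 };
    for (int b = 0; b < 256; b++)
        counts[classify((unsigned char)b)]++;
    /* Prints: unreserved=66 reserved=18 disallowed=172 */
    printf("unreserved=%d reserved=%d disallowed=%d\n",
           counts[UNRESERVED], counts[RESERVED], counts[DISALLOWED]);
    return 0;
}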

Trouble begins when Reserved characters are used. According to RFC 3986 section 2.2:

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.

Application-specific knowledge is required to choose what to do with Reserved characters:

If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

For example, when it comes to MediaWiki, 0x2f can be decoded to / given that, as far as MW is concerned, slashes are fine in titles. Note how the %2f in the PURGE request below gets normalized to a literal slash:

$ varnishlog -q 'ReqMethod eq "PURGE" and ReqURL ~ "Profiling_Python"' | grep ReqURL &
$ curl -X PURGE 'http://127.0.0.1:3128/wiki/User:Ema%2fProfiling_Python'
-   ReqURL         /wiki/User:Ema%2fProfiling_Python
-   ReqURL         /wiki/User:Ema/Profiling_Python

This should not happen for RESTBase, however, as that application uses / as a path delimiter: T127387

Without application-specific knowledge, the following rules should be followed (a sketch implementing them appears after the list):

  1. Unreserved hex escapes should always be decoded: 0x7e → ~
  2. Disallowed characters should be encoded to their hex escape representation: > → 0x3e
  3. Reserved characters (and their hex escape representations) should be left as-is
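
A minimal sketch of these three rules, assuming nothing beyond the RFC 3986 sets above (the function names are invented for this example; the production implementations are the Lua and VCL ones shown under Implementation):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static int is_unreserved(unsigned char c)
{
    return isalnum(c) || (c && strchr("-._~", c));
}

static int is_reserved(unsigned char c)
{
    return c && strchr(":/?#[]@!$&'()*+,;=", c) != NULL;
}

/* Apply rules 1-3 to `in`, writing the result to `out`, which must
 * hold at least 3 * strlen(in) + 1 bytes. */
static void normalize_generic(const char *in, char *out)
{
    static const char hex[] = "0123456789ABCDEF";
    while (*in) {
        unsigned char c = (unsigned char)*in;
        if (c == '%' && isxdigit((unsigned char)in[1]) &&
                        isxdigit((unsigned char)in[2])) {
            unsigned char hi = (unsigned char)toupper((unsigned char)in[1]);
            unsigned char lo = (unsigned char)toupper((unsigned char)in[2]);
            unsigned char b = (unsigned char)(
                ((hi <= '9' ? hi - '0' : hi - 'A' + 10) << 4) |
                 (lo <= '9' ? lo - '0' : lo - 'A' + 10));
            if (is_unreserved(b)) {
                *out++ = (char)b;   /* rule 1: decode unreserved escapes */
            } else {
                *out++ = '%';       /* rules 2+3: escapes of reserved   */
                *out++ = in[1];     /* and disallowed bytes are left    */
                *out++ = in[2];     /* exactly as they arrived          */
            }
            in += 3;
        } else {
            if (is_unreserved(c) || is_reserved(c)) {
                *out++ = (char)c;   /* rules 1+3: these literals stay */
            } else {
                *out++ = '%';       /* rule 2: encode disallowed bytes */
                *out++ = hex[c >> 4];
                *out++ = hex[c & 0x0F];
            }
            in++;
        }
    }
    *out = '\0';
}

int main(void)
{
    char out[128];
    /* %7e is an unreserved escape (decoded), the space is disallowed
     * (encoded), and the reserved parentheses are left untouched. */
    normalize_generic("/wiki/Steve_Fuller_(sociologist)%7e test", out);
    printf("%s\n", out);  /* /wiki/Steve_Fuller_(sociologist)~%20test */
    return 0;
}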

With application-specific knowledge, we can carefully normalize Reserved characters too.

Given that we are operating on the Path component, which is delimited by ? or #, those 2 characters should be left unchanged. The remaining 16 characters in the Reserved set can be encoded/decoded with application-specific knowledge:

: / [ ] @ ! $ & ' ( ) * + , ; =

We'll call this set of characters the Customizable set.

When it comes to MediaWiki, each of the 16 characters in the Customizable set can be assigned to a subset that is either always decoded or always encoded, giving us "complete" normalization:

mediawiki_decode_set = : / @ ! $ ( ) * ,
mediawiki_encode_set = [ ] & ' + = ;
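
As an illustrative check (not production code), the two subsets can be written as lookup strings and each Customizable character printed with its hex value and its fate; the same hex values reappear in the ATS remap parameters shown under Implementation:

#include <stdio.h>
#include <string.h>

/* The two MediaWiki subsets from above, as lookup strings. */
static const char mediawiki_decode_set[] = ":/@!$()*,";
static const char mediawiki_encode_set[] = "[]&'+=;";

int main(void)
{
    const char customizable[] = ":/[]@!$&'()*+,;=";
    for (const char *p = customizable; *p; p++) {
        const char *fate = strchr(mediawiki_decode_set, *p) ? "decode"
                         : strchr(mediawiki_encode_set, *p) ? "encode"
                         : "?";  /* unreachable: the sets cover all 16 */
        printf("0x%02X  %c  %s\n", (unsigned char)*p, *p, fate);
    }
    return 0;
}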

RESTBase is similar to MediaWiki, but needs to accept MediaWiki titles with slashes in the %2F form, while still keeping its own functional path-delimiting slashes unencoded as mentioned earlier. The slash therefore appears in neither subset: both / and %2F are left as they arrive.

restbase_decode_set = : @ ! $ ( ) * , ;
restbase_encode_set = [ ] & ' + =

When it comes to upload.wikimedia.org, instead, all file titles go through PHP's rawurlencode() when their storage URL is generated, so the canonical form of every character in the Customizable set is the encoded one. The exception is /, which occurs in upload URLs only as a literal path delimiter and can therefore be decoded. The two subsets should thus be:

upload_decode_set = /
upload_encode_set = : [ ] @ ! $ & ' ( ) * + , ; =

Implementation

The problem has been solved for ATS using a Lua script, normalize-path.lua. Path normalization behavior can be configured per remap rule via Hiera, by specifying which characters to decode and which to encode. Characters need to be specified in hex. For example:

    - type: map
      target: http://upload.wikimedia.org
      replacement: https://swift.discovery.wmnet
      params:
          - '@plugin=/usr/lib/trafficserver/modules/tslua.so'
          - '@pparam=/etc/trafficserver/lua/normalize-path.lua'
          # decode    /
          - '@pparam="2F"'
          # encode    !  $  &  '  (  )  *  +  ,  :  ;  =  @  [  ]
          - '@pparam="21 24 26 27 28 29 2A 2B 2C 3A 3B 3D 40 5B 5D"'
          - '@plugin=/usr/lib/trafficserver/modules/tslua.so'
          - '@pparam=/etc/trafficserver/lua/x-mediawiki-original.lua'

In Varnish, we deal with the issue by using a C function embedded in VCL, normalize_path_encoding(). Behavior can be changed by passing different "decoder rings" to the function. For example:

sub normalize_upload_path { C{
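    // One entry per byte value. Reading the ring against the sets
    // above: 1 marks bytes whose canonical form is the literal
    // character (their hex escapes get decoded), 0 marks bytes whose
    // canonical form is the hex escape (literal occurrences get
    // encoded), and 2 marks the path delimiters ? and #, which are
    // left untouched.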
    static const size_t upload_decoder_ring[256] = {
      // 0x00-0x1F (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      //  ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
        0,0,0,2,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,2,
      //@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,
      //` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ <DEL>
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,
      // 0x80-0xFF (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    };

    normalize_path_encoding(ctx, upload_decoder_ring);
}C }
