
Edge uniques/Canary

From Wikitech
This page is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.

The Edge uniques canary document is a canary that can be used to verify what the Wikimedia Foundation (WMF) does with the Edge Uniques cookie. WMF commits to keeping this page up to date, so that it is easy to discover:

  • Every way the cookie moves through our infrastructure.
  • Changes to how the cookie is formed or transferred, whether intentional or accidental.

Much more detail on everything below is available by following the links. This page should be kept short so that it can serve its stated purpose efficiently.

General Data Flow

Our private Content Delivery Network (CDN) handles user traffic at the edge of our infrastructure around the world. User traffic generally flows through three layers of software in our CDN:

  • HAProxy is the first software in our infrastructure that handles users' connections at Layer 7. It terminates TLS and is the HTTP implementation that the user's agent or browser interacts with directly. It is therefore the first software in our network that has the potential to see sensitive data, such as the WMF-Uniq cookie. Some traffic is handled without ever leaving this layer, such as when HAProxy is used to limit or block easily detectable, extremely aggressive traffic patterns. Most normal traffic is reverse-proxied to the next layer of software (Varnish) on the same server, over a local socket connection.
  • Varnish contains a lot of our complex "business logic" dealing with HTTP processing, and also serves as our fast in-memory cache for hot read-only content. Most normal requests from logged-out readers are served directly from Varnish's content cache, so the request chain ends here and Varnish sends its response back through HAProxy to the user. For requests that miss the cache here, that are for uncacheable content, or that carry an editor session that precludes caching, Varnish reverse-proxies the traffic to the next layer (ATS).
  • Apache Traffic Server (ATS) is also a caching proxy like Varnish, and also encodes some parts of our HTTP processing "business logic". It has a much larger cache, stored on local NVMe disks. For requests that missed Varnish's cache, it tries to provide a deeper level of cache hits, covering some of the long tail of content that did not fit in Varnish's small, fast memory cache. For requests that are uncacheable or that miss the cache at this layer, ATS forwards them to the actual application layer software, such as MediaWiki.
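
The layering above can be summarised as a small model of where a request stops. This is only an illustrative sketch in Python, not code that runs anywhere in the CDN; the `Request` fields and the `layers_visited` function are hypothetical names invented for this example, and real routing decisions involve far more than three boolean flags.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # All three flags are hypothetical simplifications for illustration.
    blocked_at_edge: bool = False  # rejected or rate-limited by HAProxy
    varnish_hit: bool = False      # served from Varnish's in-memory cache
    ats_hit: bool = False          # served from ATS's on-disk cache


def layers_visited(req: Request) -> list[str]:
    """Return the layers a request passes through, in order."""
    path = ["haproxy"]             # every request is first handled by HAProxy
    if req.blocked_at_edge:
        return path                # handled without leaving the first layer
    path.append("varnish")
    if req.varnish_hit:
        return path                # response goes back out through HAProxy
    path.append("ats")
    if req.ats_hit:
        return path
    path.append("application")     # e.g. MediaWiki
    return path
```

For example, `layers_visited(Request(varnish_hit=True))` gives `["haproxy", "varnish"]`, while a request that misses both caches also visits `"ats"` and `"application"`.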

Monitoring for changes

Our initial implementation follows the design. Accordingly, nothing beyond our CDN has access to this cookie. For this to change, at least one of the following layers would need code changes.

HAProxy

  • TLS termination. We can technically access the cookie here, starting after TLS termination, but we currently don't. In the future, it is possible we may access the cookie at this layer to implement some DDoS countermeasures.
  • Main puppet definition

Varnish

ATS

  • This layer does not have access to the cookie, because requests only reach it via Varnish, and Varnish deletes the cookie as shown above. If this were to change, code would probably need to be added here. (TODO: find where ATS code lives and point to it here)
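
To make concrete what "deleting the cookie" before forwarding a request means, here is a minimal sketch in Python. It is not the Varnish configuration actually used and does not claim to mirror it; `strip_cookie` is a hypothetical helper written for this page, and it assumes a simple `name=value; name=value` Cookie header.

```python
def strip_cookie(cookie_header: str, name: str = "WMF-Uniq") -> str:
    """Return cookie_header with every cookie called `name` removed.

    Hypothetical helper for illustration only.
    """
    kept = []
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty segments such as a trailing ";"
        cookie_name = pair.split("=", 1)[0].strip()
        if cookie_name != name:  # cookie names are compared case-sensitively
            kept.append(pair)
    return "; ".join(kept)
```

With this sketch, `strip_cookie("a=1; WMF-Uniq=abc; b=2")` yields `"a=1; b=2"`, so a downstream layer receiving the rewritten header would see the other cookies but not WMF-Uniq.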

Application Software

  • All application software (e.g. MediaWiki) is reached through the CDN software stack above (specifically, through ATS). It is the CDN software's job not to forward any knowledge of the Edge Uniques cookie to any application layer software.
  • The world of application software is vastly more complex and interconnected, and has various forms of persistent database storage. It would be much more difficult to audit and prevent accidental storage of these cookies in this layer.

Analytics Storage

  • The CDN also sends analytics logs about our site traffic to our analytics clusters, using secure Kafka queues to transport that data back to the Analytics databases in our core sites. These logs do contain other PII, such as IP addresses and other request details. We explicitly do not send the WMF-Uniq cookie or its identifier in any of these streams towards Analytics.
  • This mechanism is in flux: