Thumbor

For common tasks related to Thumbor, see Thumbor/Runbook

The Wikimedia media thumbnailing infrastructure is based on Thumbor.

As of June 2017, all thumbnail traffic for public and beta wikis is served and rendered by Thumbor (launch task: T121388).

As of February 2018, all thumbnail traffic for private wikis is served by Thumbor.

As of April 2018, the MediaWiki imagescaler hardware has been repurposed (T188062).

As of April 2023, Thumbor is serving production traffic from Kubernetes.

Rationale

  1. Better security isolation. Thumbor is stateless and connects to Swift, Poolcounter and DC-local Thumbor-specific Memcache instances (see "Throttling" below). In contrast, MediaWiki is connected to many more services, as well as user data and sessions. Considering how common security vulnerability discoveries are in media-processing software, it makes sense to isolate media thumbnailing as much as possible.
  2. Better support. Thumbor has a lively community of its own, and is a healthy open-source project. In contrast, the media-handling code in MediaWiki is supported on a best-effort basis by very few people.
  3. Easier operations. Thumbor is a simple service and should be easy to operate.

Supported file types

We have written Thumbor engines for all the file formats used on Wikimedia wikis. Follow these links for special information about the Thumbor engines for those formats:

These engines are a reimplementation of the logic that resides in MediaWiki core and extensions for the same image formats, often leveraging the same underlying open-source libraries or executables. Whenever possible, reference images generated with MediaWiki are used for the Thumbor integration tests.

Broader ecosystem

In order to understand Thumbor's role in our software stack, one has to understand how Wikimedia production currently serves thumbnail images.

Public wikis

The edge, where user requests first land, is Varnish. Most requests for a thumbnail are a hit on the Varnish frontend or backend caches.

When Varnish can't find a copy of the requested thumbnail - whether it has never been requested before or it has fallen out of the Varnish cache - Varnish hits the Swift proxies. We run a custom plugin on our Swift proxies that parses the thumbnail URL, determines whether a copy of that thumbnail is already stored in Swift, serves it if so, and otherwise asks Thumbor to generate it.
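
A minimal sketch of that decision logic, under assumed names (swift_get and thumbor_fetch are placeholders for illustration, not the actual plugin's API):

 # Hedged sketch of the Swift proxy plugin's decision, not the real middleware code.
 from typing import Optional

 def swift_get(container: str, thumb_path: str) -> Optional[bytes]:
     """Placeholder: return the thumbnail bytes stored in Swift, or None if missing."""
     return None

 def thumbor_fetch(original_path: str, width: int) -> bytes:
     """Placeholder: ask Thumbor to render the thumbnail and return it."""
     return b""

 def handle_thumbnail_request(container: str, original_path: str, width: int) -> bytes:
     thumb_path = f"{original_path}/{width}px"   # illustrative naming scheme
     thumb = swift_get(container, thumb_path)
     if thumb is not None:
         return thumb                            # already stored in Swift: serve it
     # Not stored yet: have Thumbor render it and pass the result back to Varnish.
     return thumbor_fetch(original_path, width)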

There is one exception to this workflow: requests made directly to thumb.php. In that case the request isn't cached by Varnish and is sent to MediaWiki, which then proxies it to Thumbor. This is the same behavior used by private wikis, described below. These requests are undesirable because of their inefficiency (they skip Varnish caching), and they all come from gadgets, not from MediaWiki itself. It would be worthwhile to run a cleanup campaign encouraging gadget owners to migrate their code to the proper way of crafting well-cached thumbnail URLs, and then block thumb.php use on public wikis once the cleanup is complete.

Private wikis

In the case of private wikis, Varnish doesn't cache thumbnails, because MediaWiki-level authentication is required to ensure that the client has access to the desired content (is logged into the private wiki). Therefore, Varnish passes the requests to MediaWiki, which verifies the user's credentials. Once authentication is validated, MediaWiki proxies the HTTP request to Thumbor. A shared secret key between MediaWiki and Thumbor is used to increase security.
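
The Thumbor side of that check might look roughly like the following sketch; the header name and value are purely illustrative, and the real secret and transport are part of the MediaWiki/Thumbor configuration:

 # Hedged sketch of verifying the shared secret on incoming proxied requests.
 import hmac

 SHARED_SECRET = "not-the-real-secret"   # provisioned to both MediaWiki and Thumbor

 def request_is_from_mediawiki(headers: dict) -> bool:
     presented = headers.get("X-Thumbor-Secret", "")   # hypothetical header name
     # Constant-time comparison avoids leaking the secret through timing.
     return hmac.compare_digest(presented, SHARED_SECRET)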

Hitting Thumbor (common)

When Thumbor receives a request, it tries to fetch the original media from Swift. If the original can't be found, a 404 is returned. Otherwise, Thumbor proceeds to generate the requested thumbnail for that media. Once it's done, Thumbor serves the resulting image, which the Swift proxy forwards to Varnish, which serves it to the client. Varnish saves a copy in its own cache, and Thumbor saves a copy in Swift.
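
Put together, the Thumbor-side flow amounts to something like this sketch (all function names are placeholders for the loader, engine and result storage, not the actual APIs):

 # Hedged sketch of the fetch-render-store flow inside Thumbor.
 from typing import Optional, Tuple

 def load_original(path: str) -> Optional[bytes]:
     """Placeholder: fetch the original from Swift, None if it doesn't exist."""
     return None

 def render(original: bytes, width: int) -> bytes:
     """Placeholder: run the appropriate engine to produce the thumbnail."""
     return original

 def store_result(path: str, width: int, thumb: bytes) -> None:
     """Placeholder: write the rendered thumbnail back into Swift."""

 def serve_thumbnail(path: str, width: int) -> Tuple[int, bytes]:
     original = load_original(path)
     if original is None:
         return 404, b""                  # original missing: nothing to render
     thumb = render(original, width)
     store_result(path, width, thumb)     # saved so future requests hit Swift
     return 200, thumb                    # forwarded via the Swift proxy to Varnish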

Ways our use of Thumbor deviates from its original intent

Disk access

Thumbor, in its default configuration, never touches the disk, for performance reasons. Since most image-processing software isn't capable of streaming content, Thumbor keeps the original entirely in memory for the lifetime of a request. This works fine for most websites, which deal with original media files of at most a few megabytes. But the variety of files found on Wikimedia wikis means we deal with some originals that are several gigabytes. Thumbor's core logic of keeping originals in memory doesn't scale to the concurrent large files we can experience.

This logic of passing the whole original around is deeply baked into Thumbor, which makes it difficult to change Thumbor itself to behave differently. This is why we opted for a workaround in the form of custom loaders. Loaders are a class of Thumbor plugins responsible for loading the original media from a given source.

Our custom loaders stream the original media from its source (e.g. Swift) directly to a file on disk. The path of that file is then passed around via a context variable, and the built-in variable in Thumbor that normally contains the whole original holds only the beginning of the file. Passing this extract lets us leverage Thumbor's built-in logic for file type detection, because most file types signal what they are at the beginning of the file.
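
A simplified sketch of that idea (not the real wikimedia_thumbor loader, whose exact signature depends on the Thumbor version; the context key name is illustrative):

 # Hedged sketch: stream the original to disk, keep only its first bytes in memory.
 import tempfile

 HEADER_BYTES = 1024   # enough for magic-number based file type detection

 def load_to_disk(chunks, context: dict) -> bytes:
     """Write the original (an iterable of byte chunks) to a temp file."""
     header = b""
     with tempfile.NamedTemporaryFile(delete=False, prefix="thumbor-") as f:
         for chunk in chunks:
             if len(header) < HEADER_BYTES:
                 header += chunk[: HEADER_BYTES - len(header)]
             f.write(chunk)
         # Engines read the original from disk through this context entry.
         context["original_file_path"] = f.name   # illustrative key name
     # Only the beginning of the file is returned, which is enough for
     # Thumbor's built-in file type detection.
     return header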

Filters

In Thumbor, filters are normally expected to act on the image themselves. We had needs, such as multipage support, that span very different engines. This is why we repurposed filters to simply pass information to each engine; the engine is then responsible for applying the filter's functionality, instead of the filter having logic baked in for every possible engine. This deviates from Thumbor's intent that filters do something themselves, since not every engine has to act on a given filter.
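
A minimal sketch of such an information-only filter, loosely modelled on multipage support (illustrative code, not the actual plugin or Thumbor's filter API):

 # Hedged sketch: the filter records its argument instead of touching pixels.
 class PageFilter:
     def __init__(self, context: dict):
         self.context = context

     def run(self, value: int) -> None:
         # No image operation here: just stash the requested page number.
         self.context["page"] = value

 # An engine that understands pagination (e.g. PDF or DjVu) later reads
 # context["page"] and renders that page; other engines simply ignore it.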

Image processing ordering

Thumbor tends to perform image operations (including filters) right away, as it processes them. For performance and quality-conservation reasons, we often queue those image operations and perform them all at once in a single command. This need is reinforced by our reliance on subprocesses, described below.
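
As an illustration, queued operations can be flushed as a single ImageMagick invocation, so the original is decoded and re-encoded only once (a hedged sketch; the real engines build their command lines differently):

 # Hedged sketch of batching queued operations into one subprocess call.
 import subprocess

 queued_operations = []

 def queue(*args: str) -> None:
     queued_operations.extend(args)

 queue("-resize", "450x")    # requested width
 queue("-rotate", "90")      # e.g. orientation handling
 queue("-quality", "79")     # output quality

 # One decode and one encode for the whole pipeline.
 subprocess.run(
     ["convert", "original.jpg", *queued_operations, "thumb.jpg"],
     check=True,
     timeout=60,
 )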

Subprocesses

Thumbor's default engines do everything with Python libraries. While this has the advantage of cleaner code, and of doing everything in the same process, it has the disadvantage... of doing everything in the same process. On Wikimedia sites we deal with a very wide variety of media: some files would require too much memory to resize and can't be processed, and some take too long. In the default Thumbor way of doing things, we could only set resource limits on the Thumbor process itself, and no time limits, because Thumbor is single-threaded (a call into an operation of a Python library can't be aborted). By doing all our image processing through subprocess commands, we have better control over resource and time limits for image processing. This means a problematic original is much less likely to take down the Thumbor process with it, or hog it, and other content can still be processed.
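
A hedged sketch of what this buys us, using plain subprocess and resource limits (the command and limit values are illustrative, not the actual engine code):

 # Cap memory and wall-clock time for a single conversion without affecting
 # the Thumbor process itself.
 import resource
 import subprocess

 def limit_memory(max_bytes: int = 1 << 30):
     def apply() -> None:
         # Runs in the child just before exec: cap its address space.
         resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
     return apply

 try:
     subprocess.run(
         ["convert", "original.jpg", "-resize", "450x", "thumb.jpg"],
         check=True,
         timeout=60,                 # hung conversions get killed...
         preexec_fn=limit_memory(),  # ...oversized ones hit the memory limit
     )
 except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
     # Only this request fails; Thumbor keeps serving other content.
     pass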

Multi-engine setup

Thumbor doesn't have infrastructure for multiple engines. It expects a single engine in its configuration and has a hardcoded special case for GIF. Due to this lack of generic multi-engine support, we developed our own using a proxy engine, which acts as the default Thumbor engine and routes requests to the various custom engines we've written.
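
Conceptually the proxy engine is a dispatcher keyed on the detected file type; the mapping below is a hedged illustration, not the actual module layout:

 # Hedged sketch of routing requests to format-specific engines.
 ENGINE_BY_MIME = {
     "image/jpeg": "jpeg_engine",        # illustrative engine names
     "image/svg+xml": "svg_engine",
     "application/pdf": "pdf_engine",
 }

 def select_engine(mime_type: str) -> str:
     try:
         return ENGINE_BY_MIME[mime_type]
     except KeyError:
         raise ValueError(f"No engine configured for {mime_type}")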

We've also had to monkey-patch Thumbor's MIME type support to enable the new MIME types supported by our various engines. Overall this is a weak area in Thumbor's extensibility that we had to work around; changes could be made upstream to better accommodate our usage pattern.

Throttling

In order to prevent abuse and to distribute server resources more fairly, Thumbor has a few throttling mechanisms in place. These happen as early as possible in the request handling, in order to avoid unnecessary work.

Memcached-based

Failure throttling requires having a memory of past events; for this we use Memcached. In order to share the throttling information across Thumbor instances, we use a local nutcracker instance running on each Thumbor server, pointing to all the Thumbor servers in a given datacenter. This is configured in Puppet, with the list of servers in Hiera under the thumbor_memcached_servers and thumbor_memcached_servers_nutcracker config variables.

In Thumbor's configuration, the memcached settings used for this are defined in FAILURE_THROTTLING_MEMCACHE and FAILURE_THROTTLING_PREFIX, found in Deployment Charts.

Failure

The failure throttling logic itself is governed by the FAILURE_THROTTLING_MAX and FAILURE_THROTTLING_DURATION Thumbor config variables. This throttling limits retries on failing thumbnails. Some originals are broken or can't be rendered by our thumbnailing software, and there would be no point retrying them every time we encounter them. This limit allows us to avoid rendering problematic originals for a while. We don't want to blacklist them permanently, however, as upgrading media-handling software might suddenly make originals that previously couldn't be rendered start working. Because the limit expires, the benefits of upgrades apply naturally to problematic files, without having to clear a permanent blacklist whenever software is upgraded on the Thumbor hosts.
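
The counting pattern boils down to an expiring per-original counter; a hedged sketch using pymemcache, with illustrative values for the config variables (this is not the actual plugin code):

 # Hedged sketch of failure throttling with an expiring memcached counter.
 from pymemcache.client.base import Client

 FAILURE_THROTTLING_MAX = 4           # illustrative value
 FAILURE_THROTTLING_DURATION = 3600   # seconds; the counter expires on its own
 FAILURE_THROTTLING_PREFIX = "thumbor-failure-"   # illustrative value

 client = Client(("127.0.0.1", 11211))   # in production, the shared memcached setup

 def should_skip(original: str) -> bool:
     count = client.get(FAILURE_THROTTLING_PREFIX + original)
     return count is not None and int(count) >= FAILURE_THROTTLING_MAX

 def record_failure(original: str) -> None:
     key = FAILURE_THROTTLING_PREFIX + original
     # add() is a no-op if the key already exists; the expiry means the
     # entry ages out on its own, e.g. after a software upgrade fixes rendering.
     client.add(key, 0, expire=FAILURE_THROTTLING_DURATION)
     client.incr(key, 1)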

Poolcounter-based

For other forms of throttling we use Poolcounter, both to combat malicious and unintentional DDoS and to regulate resource consumption. The Poolcounter server configuration shared by the different throttling types is defined in the POOLCOUNTER_SERVER, POOLCOUNTER_PORT and POOLCOUNTER_RELEASE_TIMEOUT Thumbor config variables, found in Deployment Charts.
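
Poolcounter speaks a simple line-based TCP protocol; here is a hedged sketch of wrapping a render in an ACQ4ME/RELEASE pair (parameter values are illustrative, and the real plugin drives this through the POOLCOUNTER_* settings):

 # Hedged sketch of concurrency throttling via Poolcounter.
 import socket

 POOLCOUNTER_SERVER = "127.0.0.1"   # illustrative; see POOLCOUNTER_SERVER
 POOLCOUNTER_PORT = 7531

 def throttled_render(key: str, workers: int, maxqueue: int, timeout: int, render):
     s = socket.create_connection((POOLCOUNTER_SERVER, POOLCOUNTER_PORT))
     try:
         s.sendall(f"ACQ4ME {key} {workers} {maxqueue} {timeout}\n".encode())
         reply = s.recv(4096).decode().strip()
         if reply != "LOCKED":
             # e.g. QUEUE_FULL or TIMEOUT: too many concurrent requests for this key.
             return None
         result = render()
         s.sendall(f"RELEASE {key}\n".encode())
         s.recv(4096)               # acknowledgement
         return result
     finally:
         s.close()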

Per-IP

We limit the number of concurrent thumbnail generation requests per client IP address. That throttle is governed by the POOLCOUNTER_CONFIG_PER_IP Thumbor config variable, found in Deployment Charts.

Per-original

We limit the number of concurrent thumbnail generation requests per original media file. That throttle is governed by the POOLCOUNTER_CONFIG_PER_ORIGINAL Thumbor config variable, found in Deployment Charts.

Expensive

Some file types are disproportionately expensive to render thumbnails for (mostly in terms of CPU time). Those expensive types are subject to an extra throttle, defined by the POOLCOUNTER_CONFIG_EXPENSIVE Thumbor config variable, found in Deployment Charts.

Not per-user

Unlike MediaWiki, Thumbor doesn't implement a per-user Poolcounter throttle. First, Thumbor has greater isolation (on purpose) and doesn't have access to any user data, including sessions. Second, the per-IP throttle should cover the same ground, as logged-in users should have little IP address variance during a session.

Logging

Thumbor logs go to stdout on the Thumbor containers. The logging configuration is defined in the deployment-charts repo, under the THUMBOR_LOG_CONFIG Thumbor config variable.

Thumbor logs also go to Logstash; one way to filter for them is host:thumbor*.

Configuration

Thumbor consumes its configuration from the /etc/thumbor.d/ folder. The .conf files found in that folder are parsed in alphabetical order by Thumbor.

Thumbor in Kubernetes is configured using our standard Helmfile patterns. This means that all configuration lives in either the chart defaults or in the helmfile.d directory for the service.

Scaling

To increase the capacity of Thumbor, increase the replicas parameter in the helmfile configuration and redeploy.

Updating custom plugins

Our custom Thumbor plugins have their reference repo at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/thumbor-plugins/

Testing the changes

Before putting anything up for review, you can test your changes locally. You need Docker installed on your machine. Some tests, the so-called online tests, require connecting to the internet, which is why the Docker container running those tests needs network access. All the tests and the flake8 linter are run by calling make docker_test at the root of the thumbor-plugins directory.

Once the tests pass locally and you push the changes to Gerrit for review, both the linter and the tests are run automatically via Jenkins. If you missed something related to the tests or the linter locally, you will see an error message about it in Gerrit.

Deployment

Thumbor uses the standard Kubernetes service deployment pattern.

Restarting

Follow the standard roll-restart method for Kubernetes services.

Dashboards and logs

Grafana thumbor dashboard

Logstash

Manhole

Thumbor runs with python manhole for debugging/inspection purposes. See also T146143: Figure out a way to live-debug running production thumbor processes

To invoke manhole, e.g. on thumbor on port 8827:

 sudo -u thumbor socat - unix-connect:/srv/thumbor/tmp/thumbor@8827/manhole-8827

Local development

As of July 2022, we are running Debian inside the Docker container, which makes it easy to develop Thumbor plugins locally:

https://gerrit.wikimedia.org/g/operations/software/thumbor-plugins

Clone it, install Docker, and build the development version of the project:

mylinux@DESKTOP-DW7K8KS:~/thumbor-plugins$ make build

You can find more information about the local configurations of Thumbor plugins in the README.md file.

Question: how do you create sample thumbnail images to use in a test case?

The Docker image can run the Thumbor server standalone, which defaults to running on port 8800. To try the example below, make sure that the FILE_LOADER_ROOT_PATH configuration variable is set to '/srv/service/tests/integration/originals'. You need to define this in the thumbor.conf file; there is more information about it in the project's README.md.
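
Thumbor configuration files are plain Python; a minimal sketch of the relevant line in thumbor.conf (see the README.md for the full local configuration):

 # Where the standalone test server looks for original files.
 FILE_LOADER_ROOT_PATH = '/srv/service/tests/integration/originals'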

mylinux@DESKTOP-DW7K8KS:~/thumbor-plugins$ wget http://localhost:8800/thumbor/unsafe/450x/Carrie.jpg
--2023-01-20 11:00:22--  http://localhost:8800/thumbor/unsafe/450x/Carrie.jpg
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8800... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52069 (51K) [image/jpeg]
Saving to: ‘Carrie.jpg’

Carrie.jpg                             100%[============================================================================>]  50.85K  --.-KB/s    in 0s 

See also