
Proton is a service that converts Wikipedia articles into PDFs. It uses Puppeteer to fetch the Wikipedia page, render it in headless Chromium, and then calls Puppeteer's page.pdf() to return a PDF version of the article.
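The flow described above can be sketched roughly as follows. This is a minimal illustration and not Proton's actual code: the function name renderArticleToPdf, the waitUntil setting, and the A4 page format are assumptions chosen for the example.

```javascript
// Hypothetical sketch of the fetch-render-print flow; not Proton's actual code.
async function renderArticleToPdf(url) {
  // Required lazily so that loading this file does not need Puppeteer installed.
  const puppeteer = require('puppeteer');

  // Starts a headless Chromium instance.
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Assumption: wait until network activity settles before printing.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // page.pdf() resolves with the binary PDF data.
    return await page.pdf({ format: 'A4', printBackground: true });
  } finally {
    // Always release the browser, even if navigation or printing fails.
    await browser.close();
  }
}
```

A caller would await renderArticleToPdf(articleUrl) and write the returned data to the HTTP response. Launching a browser per request keeps the sketch short; a real service would more likely reuse or queue browser instances.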

Source Code


Deploying changes

For more information: Kubernetes/Deployments

Updating Puppeteer & Chromium

We use a very small set of Puppeteer features, so it is usually safe to update both the Puppeteer library and the Chromium browser. Before you start updating Puppeteer and Chromium, keep in mind that their versions are tightly coupled:

Puppeteer acts as an indivisible entity with Chromium. Each version of Puppeteer bundles a specific version of Chromium – the only version it is guaranteed to work with. This is not an artificial constraint: A lot of work on Puppeteer is actually taking place in the Chromium repository.

For more information please refer to Why doesn’t Puppeteer v.XXX work with Chromium v.YYY?

Puppeteer updates

We pin Puppeteer to a specific version in the package.json file. The latest Puppeteer version can be found on the Puppeteer releases page. The update process is simple: bump the puppeteer version in package.json, run npm install to fetch the new version, and test that the service still renders PDFs correctly. Puppeteer usually ships with a not-yet-stable version of Chrome, so there is no need to update Puppeteer with every release. Because Puppeteer is coupled to a specific Chromium version, update it only when the new version provides useful features or fixes issues related to HTML/PDF rendering.
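Two small checks can support the update steps above. The sketch below is illustrative only: the helper names isPinned and looksLikePdf are hypothetical, and the version strings in the usage note are example values rather than the versions Proton actually uses.

```javascript
// Hypothetical helpers for a Puppeteer upgrade check; names are illustrative.

// True when the dependency spec is an exact version such as "1.11.0",
// i.e. not a range like "^1.11.0", "~1.11.0", "1.x" or "latest".
function isPinned(versionSpec) {
  return /^\d+\.\d+\.\d+$/.test(versionSpec);
}

// True when the data starts with the "%PDF-" marker that opens a PDF file.
// This is only a smoke test; it does not validate the rest of the document.
function looksLikePdf(data) {
  return Buffer.from(data).subarray(0, 5).toString('latin1') === '%PDF-';
}
```

For example, isPinned('1.11.0') is true while isPinned('^1.11.0') is false, and looksLikePdf can be applied to the data returned by page.pdf() after the upgrade. A visual comparison of a few rendered articles is still needed, since the marker check says nothing about layout or fonts.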

Chromium updates

We decided to use the Chromium bundled with the operating system, as this seemed the most reasonable solution. The Chromium packages in Debian (the OS we use) are verified by Debian maintainers and are guaranteed to work and not have any destructive behavior.

We analyzed other ways to ship Chromium, but they were rejected:

  • using the Chromium version bundled with Puppeteer - This was rejected because Puppeteer downloads Chromium from external servers that we do not control. There was no safe way to verify that the downloaded version is safe to use in the WMF environment.
  • storing the Chromium executable in the Proton repository - This was rejected because of the size of the Chromium executable, which is over 100MB. The chromium-render repository would grow too fast and would soon become difficult to maintain.
  • installing Chromium manually (or via a script) - This was rejected because of the higher maintenance cost. The Chromium version shipped with Debian is proven to work properly with the Puppeteer version we currently use.

When you decide to update either Puppeteer or Chromium, pick a Puppeteer version whose bundled Chromium version is close enough to the Chromium version shipped with Debian (or, when updating Chromium, a version close enough to the one bundled with the Puppeteer version in use). We cannot update Chromium often, because the Debian release cycle is slow and the Chromium version it ships is not the latest stable.
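The "close enough" check can be made explicit with a small helper. This is a sketch under stated assumptions: the function names are hypothetical, the version strings in the usage note are examples, and treating "close enough" as a maximum difference in major version is an illustrative rule, not a documented WMF policy.

```javascript
// Hypothetical helper: compare the Chromium major version a Puppeteer release
// bundles with the major version of the system (Debian) package.

// Extracts the major version from strings such as "73.0.3683.75" or
// "Chromium 73.0.3683.75 built on Debian".
function chromiumMajor(versionString) {
  const match = /(\d+)\.\d+\.\d+\.\d+/.exec(versionString);
  if (!match) {
    throw new Error(`Unrecognized Chromium version: ${versionString}`);
  }
  return Number(match[1]);
}

// Assumption: "close enough" means the major versions differ by at most
// maxMajorDrift. The default threshold is illustrative only.
function isCloseEnough(bundledVersion, systemVersion, maxMajorDrift = 1) {
  const drift = Math.abs(chromiumMajor(bundledVersion) - chromiumMajor(systemVersion));
  return drift <= maxMajorDrift;
}
```

With these definitions, isCloseEnough('74.0.3723.0', '73.0.3683.75') accepts the pair, while a Puppeteer release bundling a much newer Chromium would be rejected. The system version string can be taken from the output of the installed Chromium binary's --version option.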

Puppeteer configuration

The Wikimedia environment is very specific and requires special Puppeteer configuration. We need to pass additional configuration options whose rationale is not obvious and which can look like security loopholes:

  • ignoreHTTPSErrors (a Puppeteer launch option rather than a Chromium command-line flag) was introduced because we use self-signed certificates for our internal wiki domains (the certificate authority is our Puppet CA), and using internal domains is the standard way of accessing MediaWiki appservers from REST services. Proton cannot communicate with the outside world, and it should be able to handle even malicious HTML safely, so it is safe to use the `ignoreHTTPSErrors` option. It is set only in the production environment (in the deploy repo). The chromium-render repository doesn't have that option set.
  • --no-sandbox and --disable-setuid-sandbox flags are required to run Chromium inside a Docker environment. Chromium sandboxing requires kernel user namespaces to be set up properly. You can find more information in the "Chrome won't work without --no-sandbox option" issue. The Chrome process runs under Firejail, which means it is already sandboxed by us, so there is no need to use Chrome's built-in sandboxing.
  • --font-render-hinting=medium, --enable-font-antialiasing, and --disable-gpu flags are used to tune font rendering. We want consistent font rendering across all production, staging, beta, and development platforms.
  • --hide-scrollbars and --no-first-run flags are used to improve PDF page rendering. Most probably they are not required, but it is safer to keep them on.
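The options above could be assembled into a Puppeteer launch configuration along these lines. This is a sketch, not the configuration from the deploy repo: the function name buildLaunchOptions, its parameters, and the /usr/bin/chromium path are assumptions, and the ignoreHTTPSErrors option name should be checked against the API documentation of the pinned Puppeteer version.

```javascript
// Hypothetical sketch of assembling Puppeteer launch options from the
// settings listed above; buildLaunchOptions and its parameters are illustrative.
function buildLaunchOptions({ production = false, executablePath = '/usr/bin/chromium' } = {}) {
  return {
    // Use the Chromium package installed by the OS (the path is an assumption).
    executablePath,
    // Only production talks to internal domains with Puppet CA certificates.
    ignoreHTTPSErrors: production,
    args: [
      // Chromium's own sandbox is disabled; isolation comes from Firejail.
      '--no-sandbox',
      '--disable-setuid-sandbox',
      // Font tuning. Chromium's switch list spells this flag font-render-hinting.
      '--font-render-hinting=medium',
      '--enable-font-antialiasing',
      '--disable-gpu',
      // Probably optional, kept on to be safe.
      '--hide-scrollbars',
      '--no-first-run',
    ],
  };
}
```

The result would be passed to Puppeteer as puppeteer.launch(buildLaunchOptions({ production: true })). Keeping the certificate setting behind a parameter mirrors the point above that only the production configuration enables it.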

Data flow