Videoscaling

From Wikitech

When a video is uploaded to Commons, videoscaling jobs are created in the JobQueue to re-encode the file into a variety of wiki-friendly formats. The encoding is performed by the TimedMediaHandler (TMH) extension in MediaWiki. Unlike other jobs on Foundation infrastructure, videoscaling jobs are not read by ChangeProp workers; as part of the MediaWiki On Kubernetes migration they were moved to a more Kubernetes-friendly process. Jobs are read and processed by Mercurius, a fast, lightweight service that runs multiple workers in parallel and passes each job, via a maintenance script in the EventBus extension, to a Shellbox Kubernetes deployment named shellbox-video. TimedMediaHandler runs the encoding inside Shellbox to ensure process isolation while the files are being processed.

Videoscaling work exists in two queues: webVideoTranscode and webVideoTranscodePrioritized.

Implementation

In Kubernetes, we define two Jobs (one per queue, to guarantee we meet the configured concurrency), which in turn create two pods. These pods each run an instance of Mercurius, which passes consumed messages via stdin to a wrapper script. The wrapper parses out the database field and calls a maintenance script that uses the EventBus API to process the job. Inside the mw-videoscaler pod (based on a standard PHP 8.1 mediawiki-multiversion image), TimedMediaHandler is invoked and in turn makes a request to shellbox-video. Shellbox runs the script provided by TMH, with parameters also supplied by TMH via a variety of TMH_* environment variables.
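The wrapper's routing step can be sketched as follows. This is a minimal, hypothetical illustration: the maintenance-script name (RunSingleJob.php), the mwscript entry point, and the event field layout are assumptions, not the real wrapper.

```python
import json
import shlex


def build_maintenance_command(event_json: str) -> str:
    """Parse one Mercurius-consumed job event (delivered on stdin in the
    real wrapper) and build the maintenance-script invocation for the
    target wiki. Script and field names here are illustrative."""
    event = json.loads(event_json)
    database = event["database"]  # the wiki this transcode job belongs to
    return f"mwscript RunSingleJob.php --wiki={shlex.quote(database)}"
```

The real wrapper would then exec the resulting command, handing the event over so the EventBus API can deserialize and run the job.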

Rolling out a new version

When scap runs and a new image is generated, a helmfile apply is run on the mw-videoscaler namespace as on other namespaces, and new Jobs are created with the new version of the shipped code. However, unlike other namespaces, we do not automatically delete old objects: we use Helm annotations to preserve the old Job objects. During the scap rollout, the release.json ConfigMap shared by all Jobs is updated; when older Jobs notice that the version has changed, they stop consuming new jobs, finish any existing work, and then terminate gracefully. In the meantime, workers on the new pods pick up where the old workers left off and continue to process jobs.

mw-videoscaler jobs will be restarted if Mercurius exits with a non-zero status. A zero status means that Mercurius has noticed a change in release version and has finished processing.

Configuring

When Mercurius detects a change of image, the currently running instance of Mercurius will shut down workers as they complete their jobs. A new instance can be started while this happens to pick up where the previous instance left off.
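Mercurius itself is not written in Python, but the drain condition it implies can be sketched roughly as follows, under the assumption that release.json carries a single release identifier (the key name is an assumption):

```python
import json


def should_drain(started_release: str, release_json_path: str) -> bool:
    """Return True when the release ID on disk no longer matches the one
    this instance started with, meaning workers should finish their
    current jobs and shut down gracefully.

    Assumes release.json looks like {"release": "<id>"}; the real file
    layout may differ.
    """
    with open(release_json_path) as f:
        current = json.load(f)["release"]
    return current != started_release
```

Any change at all to the ID triggers draining, which is why (as described under "Stopping mercurius gracefully" below) the new value does not need to correspond to a real version.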

Currently we configure one instance of Mercurius per Kubernetes Job to guarantee the configured concurrency. These instances run in the mw-videoscaler Kubernetes namespace.

Configure workers

The mercurius.workers Helm value (set in mediawiki/values.yaml) dictates the number of queue jobs each Mercurius instance will process in parallel. When increasing it, be mindful of the number of workers in shellbox-video that will be required to pick up the extra work.
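As an illustrative fragment only (the exact key nesting in mediawiki/values.yaml is an assumption; only the mercurius.workers value itself is documented above):

```yaml
# Hypothetical sketch of the relevant part of mediawiki/values.yaml
mercurius:
  workers: 5   # queue jobs processed in parallel by each Mercurius instance
```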

Managing

Stopping Mercurius gracefully

To stop the existing instances without starting replacements, edit the mediawiki-main-mercurius-config ConfigMap in the mw-videoscaler namespace and change the release ID in release.json; the new ID does not need to correspond to a real version, it just needs to differ. Mercurius will gracefully close workers and finish outstanding work before the Job status changes to Completed. To restart workers after this, delete the finished Jobs and run a helmfile apply.
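Assuming direct kubectl access (on WMF deployment hosts this is typically wrapped by helper tooling, and the Job names below are placeholders), the sequence might look like:

```shell
# Bump the release ID inside release.json (any new value will do):
kubectl -n mw-videoscaler edit configmap mediawiki-main-mercurius-config

# Watch the Jobs drain their in-flight work and complete:
kubectl -n mw-videoscaler get jobs -w

# Once finished, delete the completed Jobs and recreate them:
kubectl -n mw-videoscaler delete job <finished-job-name>
helmfile apply
```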

Stopping Mercurius in an emergency

Simply delete all Jobs in the mw-videoscaler namespace. Note that any videos being transcoded at the time will show user-visible errors in their encode status.

Jobs aren't created upon an apply

Sometimes, when Jobs have errored out or completed, new Jobs will not be created by a helmfile apply (apply only syncs releases whose diff is non-empty). Generally all that's needed in this case is a helmfile sync rather than an apply.
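A sync forces Helmfile to run the upgrade regardless of diff output. Assuming the usual deployment-host setup (the environment name is a placeholder):

```shell
# Unconditionally re-sync the release so the Job objects are recreated:
helmfile -e <cluster> sync
```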