Changeprop/Memorandum 2023-11

From Wikitech

Change-Propagation was originally developed by the MediaWiki Services team starting in 2015. The team has since been disbanded (within the past 2 or 3 years), and none of the original developers or maintainers still work at the Wikimedia Foundation.

This document is an overview of change-propagation and WMF's deployments of it. It exists to aid upcoming conversations around what to do about change-propagation.

https://phabricator.wikimedia.org/T350156

Overview

Change-Propagation is a WMF home-grown NodeJS service. In its main configuration, it consumes events from Kafka and responds to them, usually by triggering an HTTP request to a remote service (e.g. a cache purge, or posting jobs to the MediaWiki API). It also supports other types of inputs (e.g. incoming HTTP requests) and triggers (e.g. producing to Kafka).

Change-Propagation is built on two other NodeJS frameworks also created by WMF:

  • HyperSwitch - framework for creating REST web services. Its defining feature is its use of modular Swagger specs for the service configuration.
  • Service-runner - generalized runtime facilities for Node.js services

At its core, Change-Propagation uses HyperSwitch to map Swagger specs to functions that handle incoming requests. Change-Propagation’s own Rule configuration syntax maps inputs (usually events consumed from Kafka) to outgoing HTTP requests. Sometimes these triggered HTTP requests go directly to remote services; in cases where custom logic is required, locally declared HyperSwitch endpoints are triggered instead (e.g. /sys/links -> sys/dep_updates.js).
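The rule syntax described above can be sketched as follows. This is a hypothetical fragment (the rule name, domain, and target URI are invented for illustration; templating notation follows the `{message.page_title}` style used elsewhere in this document):

```yaml
# Hypothetical sketch of a changeprop rule: consume revision-create
# events matching a domain and trigger a rerender via HTTP POST.
templates:
  example_rerender:
    topic: mediawiki.revision-create
    match:
      meta:
        domain: en.wikipedia.org
    exec:
      method: post
      uri: 'https://rerender.example.org/page/{message.page_title}'
```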

See also: Requests for comment/Requirements for change propagation - MediaWiki.

Features

  • Uses Kafka to ensure* (see Otto's Notes below) guaranteed delivery with at-least-once semantics
  • Automatic retries (with Kafka retry topics) with exponential delays
  • Deduplication (via Redis) to avoid reprocessing the same event (does not guarantee exactly once)
  • Rate limiting (via Redis)
  • Concurrency limiting
  • Sampling support (unused in WMF deployments(?))
  • HTCP purge support (unused in WMF deployments, now handled by purged)
  • Persistent error tracking via a dedicated error topic in Kafka
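The retry-with-exponential-delay behaviour can be sketched as below. This is a simplified illustration only; the parameter names are invented and are not necessarily the ones changeprop's configuration actually uses:

```javascript
// Sketch of exponential retry delays (parameter names are illustrative,
// not taken from changeprop's actual configuration).
function retryDelayMs(attempt, baseDelayMs = 500, factor = 2) {
  // attempt 0 -> 500ms, attempt 1 -> 1000ms, attempt 2 -> 2000ms, ...
  return baseDelayMs * Math.pow(factor, attempt);
}
```

In changeprop the delayed re-delivery itself is implemented by producing the failed event to a dedicated Kafka retry topic rather than by sleeping in-process.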

Deployments at WMF

change-propagation is deployed on Kubernetes (wikikube). It is also deployed in deployment-prep (beta) in WMF Cloud VPS.

As of 2023-11, there are 2 deployments of ‘change-propagation’ in WMF production.

changeprop

The default deployment.  Originally created for triggering RESTBase rerenders and varnish CDN cache purges.

changeprop handles (as of 2023-11)

  • HTTP CDN Cache Purging via resource-purge events
  • Direct RESTBase API rerenders on MediaWiki change events (resource_change, revision-create, etc.):
    • MediaWiki Page RESTBase rerender
    • MediaWiki Page RESTBase rerender on null edit
    • MediaWiki Revision RESTBase rerender
    • MediaWiki revision visibility change
      • triggers some RESTBase revision and page rerenders mentioned above by posting resource_change events
    • MediaWiki page delete and page_restore
      • Triggers RESTBase rerenders by posting new relevant resource_change events
      • Triggers backlinks updates, see below
    • MediaWiki Page Summary Rerender
    • MediaWiki Mobile Section Rerender
    • MediaWiki Media List Rerender
    • MediaWiki Mobile HTML Rerender
  • Custom RESTBase Rerenders:
    • MW Dependency Updates for backlinks (e.g. red->blue links), page transcludes (e.g. templates) and Wikidata sitelinks.  
      • Change events listed above call custom local endpoints in sys/dep_updates.js
      • The custom code asks the relevant MW API for dependent resources.  E.g. /sys/links/transcludes/{title} asks for all pages where the triggering page (template, article, image, etc.) is included.  
      • changeprop is configured to subscribe to these topics (e.g. in on_transclusion_update rule) and issues the relevant RESTBase rerenders.
      • If the event is a ‘continue’ event, it calls e.g. /sys/links/transcludes/{original_event.title} with the continue event body.  This causes the code to query the MW API with the previously fetched continue token, continuing to paginate through the results.
  • LiftWing Change Propagation
    • drafttopic and outlink-topic-model endpoints are called.
    • These endpoints will score and produce score change events to EventGate.
  • ORES Change Propagation - DISABLED AND NO LONGER USED.
    • Previously triggered:
      • ORES endpoint score precaching
      • Custom changeprop code in sys/ores_updates.js to emit mediawiki.revision-score events.
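The ‘continue’ pagination flow described above can be sketched as follows. This is an illustrative mock (the function names and event shapes are invented); note that in changeprop the follow-up call happens via a re-emitted ‘continue’ event consumed from Kafka, not a local loop:

```javascript
// Illustrative sketch of MW API continue-token pagination.
// mockApi stands in for the MediaWiki API; names are invented.
function fetchTranscludes(mockApi, title, continueToken) {
  // One "MW API" call: returns a page of results plus an optional
  // continue token when more results remain.
  return mockApi(title, continueToken);
}

function processAllTranscludes(mockApi, title, onBatch) {
  let continueToken = null;
  do {
    // In changeprop, this repeat call is driven by a 'continue'
    // event carrying the previously fetched token.
    const { pages, cont } = fetchTranscludes(mockApi, title, continueToken);
    onBatch(pages);
    continueToken = cont;
  } while (continueToken);
}
```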

changeprop-jobqueue

MediaWiki has a JobQueue interface which is implemented in the EventBus extension.  It serializes MediaWiki Jobs to mediawiki/job events, and emits them to eventgate-main, which validates and produces them to Kafka as the various mediawiki.job.* streams.

changeprop-jobqueue subscribes to mediawiki.job topics in various rules and usually forwards each event to a job runner’s RunSingleJob.php endpoint, e.g. https://jobrunner.discovery.wmnet/rpc/RunSingleJob.php.  Note that this endpoint is NOT part of MediaWiki core; it is a custom WMF endpoint deployed as part of operations/mediawiki/config.

RunSingleJob.php hands the incoming job event off to EventBus\JobExecutor, which deserializes the event into a MediaWiki Job and calls Job::run to execute it.
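A jobqueue rule is conceptually simple. A hypothetical sketch (the rule name and exact field/templating syntax are assumptions; the URI is the one given above):

```yaml
# Hypothetical sketch of a changeprop-jobqueue rule: consume a job
# topic and POST each event to the jobrunner's RunSingleJob.php.
templates:
  example_job:
    topic: mediawiki.job.refreshLinks
    exec:
      method: post
      uri: 'https://jobrunner.discovery.wmnet/rpc/RunSingleJob.php'
      headers:
        content-type: application/json
      body: '{{globals.message}}'
```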

There is some custom logic used by changeprop-jobqueue to handle repartitioning (in sys/partitioner.js) of high-volume job topics, like the jobs that cause CirrusSearch to update Elasticsearch indexes, or to repartition based on wiki database.  This allows the changeprop-jobqueue consumers of these high-volume, partitioned topics to use Kafka consumer parallelism to consume and post jobs faster.
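Repartitioning by wiki database can be sketched as below. This is illustrative only and is not the actual sys/partitioner.js implementation:

```javascript
// Sketch of repartitioning by wiki database (illustrative; not the
// actual sys/partitioner.js logic). Deriving a stable partition from
// the database name keeps all jobs for one wiki on one partition,
// while spreading wikis across partitions for consumer parallelism.
function partitionForEvent(event, numPartitions) {
  const key = event.database || '';
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple string hash
  }
  return hash % numPartitions;
}
```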

Improvement idea: EventGate + EventStreamConfig can handle partitioning, so there should be no need for repartitioning by changeprop-jobqueue.

changeprop-jobqueue configuration is actually much simpler than default changeprop’s.  All rules either POST to RunSingleJob.php, or they repartition the job into a new topic.

changeprop-jobqueue handles (as of 2023-11)

changeprop-jobqueue handles async queuing of all MediaWiki jobs at WMF.  It is difficult to catalog all of them, especially since EventStreamConfig is configured to allow wildcard streams for mediawiki.job.*, and changeprop-jobqueue is configured to consume from any of these topics and post them to RunSingleJob.php.  Here is a list of all mediawiki.job.* topics in the Kafka main cluster.  (NOTE: These have not been filtered for activity or typos.)

Otto’s Notes

  • Auto retries using ‘retry’ Kafka topics: This is a nice feature.
  • 'Continue' pagination. Nice, but I think there is a flaw in its implementation: it produces events of multiple schemas to the same topics.
  • Ratelimiting: Uses ratelimit.js for implementing custom sliding window ratelimiting in Redis.
    • The only implementation I can see is a ‘blacklist’ implementation, used in changeprop to rate limit based on too many errors for a given meta.uri.  That is, if events matching a rule repeatedly trigger errors for the same meta.uri, that meta.uri will be rate limited.
  • Auto deduplication of messages by using Redis to store previously seen meta.domain+meta.id and meta.domain+sha1(message) keys.
  • Custom partitioning logic used by changeprop-jobqueue not really needed.
    • This can be handled by EventStreamConfig + EventGate message_key_fields to have topics partitioned in Kafka correctly before they reach changeprop.
  • Unused code in purge.js, afaict.
  • helm chart needs some love  
  • Decent grafana dashboard and metrics.
  • HyperSwitch (on which change-propagation is based) is actually pretty slick.  I don’t know exactly why we’d want to use http routes for internal function calls, but it is cool, especially if you wanted to build a SOA and make sure OpenAPI specs match what is actually happening.  
    • Using HyperSwitch to produce and consume streams is a little awkward, but is interesting.  It reminds me of knative eventing (Event Platform evaluation & ticket) FaaS type stuff, except that the function deployment and routing is all in local config for the service.
  • There is an undocumented ‘cases’ feature that will take precedence over the usual top level match / match_not rules.
  • There have been no non-SRE/maintenance changes to changeprop code since 2020.  There have been occasional rule updates in helm charts.

Otto's Raw Notes

Changeprop - Wikitech

https://github.com/wikimedia/mediawiki-services-change-propagation/tree/master

Configuration is based on HyperSwitch - MediaWiki

Hooks together routes to javascript functions

E.g.

   /{api:sys}/queue:
     x-modules:
       - path: sys/kafka.js
These define ‘route’ URLs (e.g. /sys/queue) that, when called, invoke the JS module as a function with the provided options.

In the change-prop case, sys/kafka.js is the main used function, and it takes a rule ‘templates’ param.  Templates define the kafka topics to subscribe to, the match conditions on incoming events, and the ‘exec’ to do if the match succeeds.

All potential ‘x-modules’ are defined in the change-propagation repository.

The routes defined are often the ones called by exec, e.g. for ‘page_edit’ rule:

On matched mediawiki.revision-create event, POST to /sys/links/transcludes/{message.page_title}.  

The /sys/links route is defined as:

   x-modules:
     - path: sys/dep_updates.js

This will POST to the local change prop server’s endpoint (I think?) and then that endpoint will POST to the MW API.

There is an undocumented ‘cases’ feature that will take precedence over the usual top level ‘match’/‘match_not’ rules.

There have been no non-SRE/maintenance changes to changeprop code since 2020.  There have been occasional rule updates in helm charts.

Route handler -> rules smells a lot like knative eventing  https://wikitech.wikimedia.org/wiki/Data_Engineering/Evaluations/Event_Platform/Stream_Processing/Framework_Evaluation#Knative_Eventing.  Declaring route endpoints that can be called in response to event subscription, or directly as a URL (I think that this is how changeprop works).

It is also in some ways like Benthos, which has its own configuration language for ‘stream processing’.

Retry logic is conflating events with RPC?

Changeprop and cpjobqueue run in deployment-prep, but their deploy process is VERY messy.