Changeprop

changeprop (or Change Propagation) is the name given to a service that processes change events generated by MediaWiki and stored in Kafka. Various actions are taken based on the messages read from Kafka, most commonly HTTP requests or CDN purges.

What it does

  • Changeprop uses the Apache Kafka message broker to attain at-least-once delivery semantics: once an event is in Kafka, we can be sure that it will be processed and that any follow-up events will be emitted. This allows us to build very long and complex chains of dependent updates without fear of losing events.
  • Automatic retries with exponential delays, large job deduplication, and persistent error tracking via a dedicated error topic in Kafka
  • The config system allows us to add simple update rules with only a few lines of YAML and without code changes or deploys (a sketch of such rule-level options follows this list)
  • Fine-grained monitoring dashboards allow us to track rates and delays for individual topics, rates of event production and much more. Changeprop graphs can occasionally be used to discover bugs in other parts of the surrounding infrastructure.
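
Rule-level behaviour such as retries is itself just YAML. A hypothetical fragment, with option names following the style of existing rules rather than copied from production config:

my_rule:
  topic: resource_change
  retry_limit: 2      # give up after two retries
  retry_delay: 500    # initial retry delay in milliseconds...
  retry_factor: 2     # ...multiplied on each subsequent attempt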

How it works

Changeprop reads events from Kafka. The topics changeprop reads from are defined in config.yaml; the dc_name variable is a prefix applied to the topic defined on a per-rule basis. For example, in eqiad the mw_purge rule uses the resource_change topic, so the full topic will be eqiad.resource_change. Each rule specifies the topic to which it subscribes.
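
A minimal sketch of how this fits together, assuming a simplified layout (the real config.yaml is templated from deployment-charts and considerably more involved):

dc_name: eqiad                 # prepended to every rule's topic
rules:
  mw_purge:
    topic: resource_change     # consumed as eqiad.resource_change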

Rules

A rule defines a list of cases to which it responds. General rule properties allow the definition of things like retries, delays and other features.

The "match" section of a rule dictates a pattern to match, which can include URL matching and tag matching (for example, mw_purge events also contain "tags":["purge"] and will only match if the URL pattern and the URL matches the pattern specified). URL match patterns are frequently used to target specific sites (for example have a rule only apply to Wiktionary) or classes of article. Matches can also be fine tuned to not match using not_match. If the match it satisfied, the exec section is executed. The exec will generally be a HTTP request of a defined method to the specified URI. A rule can have multiple match and corresponding exec sections in its cases list - if a pattern is created where matches are mutually exclusive, a rule can act as a switch statement using the same topic and the same semantics but different matches. Headers and other parameters can be defined for an exec section - see the existing rules for details.

Service interactions

Changeprop talks to Redis to manage rate limiting and exclusion lists for problematic or high-traffic articles. All communication is done via Nutcracker. In Kubernetes, a local Nutcracker sidecar container runs within the changeprop pod, proxying access to a list of redis servers.

Many of changeprop's operations are accomplished by sending HTTP requests to RESTBase.

Where it runs

Changeprop currently runs in Kubernetes in codfw and eqiad. There is also an instance in the staging cluster that does not process prod traffic. In labs, changeprop runs in regular Docker on deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud.

Adding features

Adding a new rule

  1. Add the rule to deployment-charts/charts/changeprop/templates/_config.yaml
  2. Bump the Chart.yaml version
  3. Commit, get review and merge
  4. Deploy changeprop and changeprop-jobqueue from the deployment host using Kubernetes/Deployments#Code_deployment/configuration_changes (see the sketch below)
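
A sketch of the deploy step using the standard helmfile workflow; the deployment host alias and environment list are assumptions, so follow the linked page if they differ:

ssh deployment.eqiad.wmnet            # a deployment host
cd /srv/deployment-charts/helmfile.d/services/changeprop
helmfile -e staging -i apply          # verify in staging first
helmfile -e codfw -i apply
helmfile -e eqiad -i apply
# repeat from helmfile.d/services/changeprop-jobqueue for changeprop-jobqueue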

Deploying

To Kubernetes

Changeprop uses the Kubernetes/Deployments workflow to deploy changes.

To deployment-prep

In the Beta Cluster, Changeprop runs in Docker on deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud. The configuration passed to changeprop is generated by scripts in the deployment-charts repository, in order to use the same templates and avoid deviation. This means that if you want to change the configuration in beta/deployment-prep, you will first need to edit the configuration in deployment-charts. The values for deployment-prep are stored in the values-beta.yaml file.

Generating the configuration

In deployment-charts, cd to charts/changeprop and run ./make_beta_config.py. The output of this command is the configuration to be deployed.

For example, to generate the changeprop configuration from your localhost:

cd /home/somepath/deployment-charts/charts/changeprop && ./make_beta_config.py . changeprop

To generate the jobqueue configuration:

cd /home/somepath/deployment-charts/charts/changeprop && ./make_beta_config.py . jobqueue

Deploying the configuration

The configuration is in config.yaml in a Docker volume on deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud and deployment-docker-cpjobqueue01.deployment-prep.eqiad.wmflabs, named changeprop and cpjobqueue respectively. Configuration needs to be edited within this volume; the host directory can be discovered using `docker volume inspect`.

Ensure that the config is world-readable when copying in a new file, then run service changeprop restart to load the configuration. Files other than config.yaml in this volume are ignored.

For example, to generate and deploy the changeprop configuration from your localhost:

cd /home/somepath/deployment-charts/charts/changeprop && ./make_beta_config.py . changeprop | \
ssh deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud \
  sudo sh -xc \''cat > $(docker volume inspect changeprop -f {{.Mountpoint}})/config.yaml && systemctl restart changeprop'\'

To generate and deploy the cpjobqueue configuration:

cd /home/somepath/deployment-charts/charts/changeprop && ./make_beta_config.py . jobqueue | \
ssh deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud \
  sudo sh -xc \''cat > $(docker volume inspect cpjobqueue -f {{.Mountpoint}})/config.yaml && systemctl restart cpjobqueue'\'

Ideally the Docker volume would have been pre-created with a fixed host path.
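
For illustration, a volume pinned to a fixed host path could be pre-created with the local driver's bind options (the /srv path is an invented example):

docker volume create --driver local \
  --opt type=none --opt o=bind --opt device=/srv/changeprop-config \
  changeprop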

Testing

Changeprop can be tested by producing events to Kafka for changeprop to consume. An example test command against the resource_change topic for the k8s staging cluster:

cat mw_purge_example.json | kafkacat -b localhost:9092 -p 0 -t 'staging.resource_change'

All IDs in these examples are random UUIDs. Reusing the same UUID across test runs risks the event being treated as a duplicate and skipped. The "dt" field should also be changed to be close to the current date and time, as changeprop will not take action on older events.
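
For example, a fresh id and dt can be spliced in before producing the event. A sketch assuming jq, uuidgen and kafkacat are available on the host:

jq -c --arg id "$(uuidgen)" --arg dt "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  '.meta.id = $id | .meta.dt = $dt' mw_purge_example.json \
  | kafkacat -b localhost:9092 -p 0 -t 'staging.resource_change'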

mw_purge

{"$schema":"/resource_change/1.0.0","meta":{"dt": "2020-04-02T17:16:25Z", "uri":"https://en.wikipedia.org/wiki/Draft:Editta_Braun","id":"22350141-bbe2-488d-9f73-a1aa6094ac5c","domain":"en.wikipedia.org","stream":"resource_change"},"tags":["purge"]}

null_edit

{"$schema":"/resource_change/1.0.0","meta":{"uri":"https://fr.wikipedia.org/wiki/Oribiky","id":"b92d40b0-3206-469d9615-2fbf61a04418","dt":"2020-04-02T17:16:28Z","domain":"fr.wikipedia.org","stream":"resource_change"},"tags":["null_edit"]}

How to monitor it

There is a Grafana dashboard for Changeprop. The various graphs provide information such as the execution rate and backlog of each rule across the various streams.

Rule backlog is the time between the creation of an event and the beginning of its processing. If the backlog grows over time, change propagation can't keep up with the event rate, and either concurrency should be increased or some other action taken. Backlogs can have occasional spikes, but steady backlog growth is a clear indication of a problem.

Debugging

Querying configuration

Changeprop's configuration can be queried if you have access to deploy1001:

  1. ssh to the deploy server for the datacenter
  2. cd to the appropriate directory (for example /srv/deployment-charts/helmfile.d/services/staging/changeprop)
  3. run kube_env changeprop $CLUSTER to set up your Kubernetes environment
  4. show the configuration via kubectl describe configmap changeprop-staging-base-config

Configmaps with the suffixes nutcracker-config and metrics-config are also available.
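
Put together, a session might look like the following; the hostname, cluster and configmap name follow the examples above and may change over time:

ssh deploy1001.eqiad.wmnet
cd /srv/deployment-charts/helmfile.d/services/staging/changeprop
kube_env changeprop staging
kubectl describe configmap changeprop-staging-base-config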

Non-issues

Periodically Changeprop will log a message along the lines of the following:

{"name":"change-propagation","hostname":"changeprop-staging-684b9ddbd-4wdkn","pid":141,"level":"ERROR","err":{"message":"Local: Broker transport failure","name":"changeprop-staging","stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/kafka-consumer.js:448:29","code":-195,"errno":-195,"origin":"kafka","rule_name":"page_create","executor":"RuleExecutor","levelPath":"error/consumer"},"msg":"Local: Broker transport failure","time":"2020-04-29T13:10:17.443Z","v":0}

This can be ignored as long as the occurrences aren't too close together (currently they happen roughly once every hour in staging); they will not interrupt normal operation of changeprop.

Where it lives

See also