Changeprop

From Wikitech

changeprop (or Change Propagation) is the name given to a service that processes change events generated by MediaWiki and stored in Kafka. It takes various actions based on the messages it reads from Kafka; common actions take the form of HTTP requests or CDN purges.

What it does

  • Changeprop uses Kafka to ensure guaranteed delivery. We use the Apache Kafka message broker to attain at-least-once delivery semantics: once an event is in Kafka, we can be sure that it will be processed and that any follow-up events will be produced. This allows us to build very long and complex sequences of dependencies without fear of losing events.
  • Automatic retries with exponential delays, large job deduplication, and persistent error tracking via a dedicated error topic in Kafka
  • The config system allows us to add simple update rules with only a few lines of YAML and without code changes or deploys
  • Fine-grained monitoring dashboards allow us to track rates and delays for individual topics, rates of event production and much more. Changeprop graphs can occasionally be used to discover bugs in other parts of the surrounding infrastructure.

How it works

Changeprop reads events from Kafka. The topics changeprop reads from are defined in config.yaml: the dc_name variable is prepended as a prefix to the topic name defined on a per-rule basis. For example, in eqiad the mw_purge rule uses the resource_change topic, so the full topic name is eqiad.resource_change. Each rule specifies the topic to which it subscribes.
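As an illustrative sketch of the prefixing scheme described above (field names simplified and hypothetical; the deployed config.yaml is the authoritative reference):

```yaml
# Hypothetical fragment, not the production config.
dc_name: eqiad            # prepended to every per-rule topic name
rules:
  mw_purge:
    topic: resource_change   # consumed as "eqiad.resource_change"
```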

Rules

Rules define a list of cases to which changeprop is to respond. General rule properties allow the definition of features such as retries and delays.

The "match" section of a rule dictates a pattern to match, which can include URL matching and tag matching (for example, mw_purge events also contain "tags":["purge"] and will only match if both the tag and the URL match the patterns specified). URL match patterns are frequently used to target specific sites (for example, to have a rule apply only to Wiktionary) or classes of article. Matches can also be fine-tuned to exclude patterns using not_match. If the match is satisfied, the exec section is executed. The exec will generally be an HTTP request of a defined method to the specified URI. A rule can have multiple match and corresponding exec sections in its cases list; if the matches are mutually exclusive, a rule can act as a switch statement using the same topic and the same semantics but different matches. Headers and other parameters can also be defined for an exec section; see the existing rules for details.
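The match/not_match/exec structure described above can be sketched as follows. This is a hypothetical rule, not one from the production config: the field layout follows the patterns described here, and the RESTBase URI is invented for illustration; consult the existing rules for the exact schema.

```yaml
# Hypothetical rule sketch; names and URIs are illustrative only.
mw_purge:
  topic: resource_change
  cases:
    - match:
        meta:
          uri: 'https://[^/]+/wiki/(?<title>.+)'
        tags:
          - purge
      not_match:
        meta:
          domain: 'www\.wikidata\.org'
      exec:
        method: post
        uri: 'https://restbase.example.org/{domain}/v1/page/{title}'
```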

Service interactions

Changeprop talks to Redis to manage rate limiting and exclusion lists for problematic or high-traffic articles. All communication is done via Nutcracker. In Kubernetes, a local Nutcracker sidecar container runs within the changeprop pod, proxying access to a list of redis servers.

Many of changeprop's operations are accomplished by sending HTTP requests to RESTBase.

Where it runs

Changeprop currently runs in Kubernetes in a limited capacity, with plans to move there fully in future; it previously ran in the scb cluster, from which it has since been removed.

Adding features

Adding a new rule

  1. TODO

Deploying

To scb

Changeprop has been removed from scb and cannot be redeployed there.

To Kubernetes

Depending on what needs to be changed in a Kubernetes deploy of changeprop, edits might need to take place in one of two locations: the Helmfile or the Helm chart. Whether the change is to the Helm chart itself or to the Helmfile that configures it, the deploy process to Kubernetes is the same.

Applying changes

For the purposes of this section, $env means one of eqiad, codfw or staging. Once your change has been reviewed and merged (a +2 will merge it automatically when no rebase is required):

  • a user with root will need to ssh to deploy1001 and sudo to root
  • cd to /srv/deployment-charts/helmfile.d/services/changeprop/.
  • Do a git log -n1 to ensure that your change has been merged and is present in the local checkout of the repo.
  • Check the impact of your changes on configuration files etc by running helmfile -e $env diff
  • If everything looks okay, run a helmfile -e $env sync and then monitor kubectl get pods to ensure everything comes back up healthy

Helmfile changes

For the purposes of this section we'll assume that all changes will be against the staging environment. Helmfile changes happen in the helmfile.d section of the deployment-charts repository. Typically a change to this section will relate to changing an existing configured value for deployed instances (e.g. adding Kafka or Redis servers, or changing the Varnish multicast IP address).

Helm chart changes

  1. Make your changes to the chart
  2. Bump the version flag in the changeprop/Chart.yaml file.
  3. Add the Chart.yaml file to your change for review
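The version bump in step 2 might look like the following. This is an illustrative Chart.yaml fragment with invented version numbers, not the actual file contents:

```yaml
# changeprop/Chart.yaml (illustrative): bump "version" with every
# chart change so the new release is picked up on deploy.
apiVersion: v1
name: changeprop
version: 0.0.2   # was 0.0.1 before this change
```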

To deployment-prep

Changeprop runs in Docker in deployment-prep on deployment-docker-changeprop01.deployment-prep.eqiad.wmflabs. The configuration passed to changeprop is generated by scripts in the deployment-charts repository, in order to use the same templates and avoid deviation. This means that if you want to change the configuration in beta/deployment-prep, you will first need to edit the configuration in deployment-charts. The values for deployment-prep are stored in the values-beta.yaml file.

Generating the configuration

In deployment-charts, cd to charts/changeprop and run ./make_beta_config.py ..; the output of this command is the configuration to be deployed.

Deploying the configuration

The configuration lives in a docker volume on deployment-docker-changeprop01.deployment-prep.eqiad.wmflabs, named changeprop. Configuration needs to be edited within this volume. To edit, run sudo docker run -it -v changeprop:/srv/changeprop alpine /bin/sh and edit /srv/changeprop/config.yaml as required. Then run service changeprop restart to load the configuration. Files other than config.yaml in this volume will be ignored.

Testing

changeprop can be tested by issuing events to Kafka that changeprop will consume. An example test command against the resource_change topic for the k8s staging cluster is: cat mw_purge_example.json | kafkacat -b localhost:9092 -p 0 -t 'staging.resource_change'.

All IDs in these examples are random UUIDs. Reusing the same UUID between test runs risks the event being treated as a duplicate and skipped. The "dt" field should also be changed to be close to the current date and time, as changeprop will not act on older events.
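Refreshing the id and dt fields by hand is error-prone, so a small helper can be sketched in Python. The field names match the example events below; the helper itself is not part of changeprop and is offered only as an illustration:

```python
import json
import uuid
from datetime import datetime, timezone

def freshen_event(event_json: str) -> str:
    """Give a test event a new random UUID and a current "dt" timestamp
    so changeprop does not skip it as a duplicate or as too old."""
    event = json.loads(event_json)
    event["meta"]["id"] = str(uuid.uuid4())
    event["meta"]["dt"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps(event)

# Example: refresh the mw_purge sample before piping it to kafkacat.
sample = '{"$schema":"/resource_change/1.0.0","meta":{"dt":"2020-04-02T17:16:25Z","uri":"https://en.wikipedia.org/wiki/Draft:Editta_Braun","id":"22350141-bbe2-488d-9f73-a1aa6094ac5c","domain":"en.wikipedia.org","stream":"resource_change"},"tags":["purge"]}'
print(freshen_event(sample))
```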

mw_purge

{"$schema":"/resource_change/1.0.0","meta":{"dt": "2020-04-02T17:16:25Z", "uri":"https://en.wikipedia.org/wiki/Draft:Editta_Braun","id":"22350141-bbe2-488d-9f73-a1aa6094ac5c","domain":"en.wikipedia.org","stream":"resource_change"},"tags":["purge"]}

null_edit

{"$schema":"/resource_change/1.0.0","meta":{"uri":"https://fr.wikipedia.org/wiki/Oribiky","id":"b92d40b0-3206-469d-9615-2fbf61a04418","dt":"2020-04-02T17:16:28Z","domain":"fr.wikipedia.org","stream":"resource_change"},"tags":["null_edit"]}

How to monitor it

There is a Grafana dashboard for Changeprop. The various graphs provide information about things such as rule execution rate and rule backlogs for each rule for various streams.

Rule backlog is the time between the creation of an event and the beginning of its processing. If the backlog grows over time, change propagation cannot keep up with the event rate, and either concurrency should be increased or some other action taken. Backlogs can have occasional spikes, but steady backlog growth is a clear indication of a problem.
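The backlog metric described above is simply the gap between an event's creation time (the "dt" field in its meta) and the moment processing begins. A minimal sketch, with timestamps chosen purely for illustration:

```python
from datetime import datetime, timezone

def backlog_seconds(event_dt: str, processing_start: datetime) -> float:
    """Backlog = time between event creation ("dt" in the event's meta)
    and the moment changeprop begins processing it."""
    created = datetime.strptime(event_dt, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (processing_start - created).total_seconds()

# Illustration: an event created at 17:16:25 and picked up at 17:16:40
# has a 15-second backlog.
start = datetime(2020, 4, 2, 17, 16, 40, tzinfo=timezone.utc)
print(backlog_seconds("2020-04-02T17:16:25Z", start))  # → 15.0
```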

Debugging

Querying configuration

Changeprop's configuration can be queried if you have access to deploy1001:

  1. ssh to deploy1001.eqiad.wmnet
  2. cd to the appropriate directory (for example /srv/deployment-charts/helmfile.d/services/staging/changeprop)
  3. run source .hfenv to set up your environment
  4. show the configuration via kubectl describe configmap changeprop-staging-base-config

The suffixes nutcracker-config and metrics-config are also available as configmaps.

Non-issues

Periodically Changeprop will log a message along the lines of the following:

{"name":"change-propagation","hostname":"changeprop-staging-684b9ddbd-4wdkn","pid":141,"level":"ERROR","err":{"message":"Local: Broker transport failure","name":"changeprop-staging","stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/kafka-consumer.js:448:29","code":-195,"errno":-195,"origin":"kafka","rule_name":"page_create","executor":"RuleExecutor","levelPath":"error/consumer"},"msg":"Local: Broker transport failure","time":"2020-04-29T13:10:17.443Z","v":0}

This can be ignored as long as the occurrences aren't too close together (currently they happen roughly once every hour in staging); they will not interrupt normal operation of changeprop.
