Envoy


What is Envoy proxy

Envoy (GitHub) is an L7 proxy and communication bus designed for large modern service-oriented architectures. It provides several features for a reverse proxy including but not limited to:

  • HTTP/2 support.
  • L3/L4 filter architecture, so it can be used for TLS termination, traffic mirroring, and other use cases.
  • Good observability and tracing, supporting statsd, Zipkin, etc.
  • Rate limiting and circuit breaking support.
  • Dynamic configuration through the xDS protocol.
  • Service discovery.
  • gRPC, Redis, MongoDB proxy support.

Envoy at WMF

There are two main use cases for envoy at WMF.

  • Act as a TLS terminator / proxy for internal services. This is done for services:
    • In the deployment pipeline (via the TLS helpers in the deployment charts), where it runs as a sidecar container to the service if TLS is enabled for the specific chart.
    • Not in the pipeline, using profile::tlsproxy::envoy.
  • Act as a local proxy to other services for MediaWiki (for now), via profile::services_proxy::envoy.

TLS termination

If you want to add TLS termination to a new deployment chart, use the scaffold script. It will create your starting chart with TLS termination primitives already in place. If you want to add TLS termination to an existing chart, you have to:

  • Link common_templates/<version>/_tls_helpers.tpl in the templates directory of the chart
  • Insert the appropriate calls to those templates across the configmap, deployment, service and networkpolicy templates.

See https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/558092/ as an example.

If you want to add TLS termination to a service in puppet, include profile::tlsproxy::envoy in its role in puppet, and add the hiera configuration following the suggestions in the class documentation.

Services Proxy

The services proxy is installed on all servers that run MediaWiki and exposes the configured upstream services via HTTP on localhost:<PORT>. Some endpoints might also define a specific Host header.

The service proxy offers:

  • Persistent connections
  • Advanced TLS tunneling (envoy supports TLS 1.3)
  • Retry logic
  • Circuit breaking (still not implemented)
  • Header rewriting
  • Telemetry for all backends
  • Tracing (still not implemented)
  • Precise timeouts (microsecond resolution)

You can find an intro presentation on the service proxy in the "SRE Sessions" Google Drive.

Add a new service (listener)

The current services are defined in hieradata/common/profile/services_proxy/envoy.yaml.

You can point your proxy at any valid DNS record, which will be re-resolved periodically. This means it works with discovery records in DNS.

To add a new service you just need to add an entry to that list. A basic example may look like:

- name: mathoid
  port: 6013
  timeout: "5s"
  service: mathoid
  keepalive: "4.5s"
  retry:
    retry_on: "5xx"
    num_retries: 1

Please refer to the class documentation in puppet for details: modules/profile/manifests/services_proxy/envoy.pp.

Use a listener

To make use of a configured listener, it needs to be enabled for your host or within your kubernetes helm chart.

For hosts:

  • Include profile::services_proxy::envoy in your puppet role
  • Add the listener(s) you would like to enable in hiera key profile::services_proxy::envoy::enabled_listeners (like this example for MW installations)

For kubernetes:

  • Include common_templates/0.2/_tls_helpers.tpl in your helm chart (you probably already have, this comes with the default scaffold)
  • Add the listener(s) you would like to enable in helm key .Values.discovery.listeners

You then need to configure the application to use http://localhost:<listener_port> to connect to the upstream service via the envoy listener.
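
As a minimal sketch of the last step, an application could keep the ports of its enabled listeners in one place and build the local URLs from them. The helper name listener_url and the LISTENER_PORTS mapping are hypothetical; the two ports are the ones mentioned on this page for mathoid and mwapi-async.

```python
# Hypothetical mapping of enabled listeners to their local ports.
# A real application would read these from its own configuration.
LISTENER_PORTS = {
    "mathoid": 6013,
    "mwapi-async": 6500,
}


def listener_url(name: str, path: str = "/") -> str:
    """Return the http://localhost:<listener_port> URL for a listener."""
    try:
        port = LISTENER_PORTS[name]
    except KeyError:
        raise ValueError(f"listener {name!r} is not enabled here") from None
    return f"http://localhost:{port}/{path.lstrip('/')}"
```

Failing early on an unknown listener name makes a missing entry in enabled_listeners (or .Values.discovery.listeners) visible in the application instead of surfacing as a connection error.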

Example (calling mw-api)

To call the MediaWiki API from your application, add the "mwapi-async" listener as described above and send your requests to http://localhost:6500. Because you are now connecting to localhost, you need to add a proper Host header to your request to reach the wiki you need:

import requests


def getPageDict(title: str, wiki_id: str, api_url: str) -> dict:
    [...]
    # This will only work for wikipedias, but it's just an example
    mwapi_host = "{0}.wikipedia.org".format(
        wiki_id.replace("wiki", "").replace("_", "-")
    )
    # The request goes to localhost, so the Host header selects the wiki
    headers = {
        "User-Agent": "mwaddlink",
        "Host": mwapi_host,
    }
    req = requests.get(api_url, headers=headers, params=params)

[...]

getPageDict(page_title, wiki_id, "http://localhost:6500/w/api.php")

Please note: wikipedia.org, wikidata.org, and wikimedia.org hosts all use MediaWiki, and one might expect them to use one of the mw-api-* envoy listeners. However, what matters is the actual service that serves the endpoint you are trying to access. For example, in the table below, the language_pairs and pageviews endpoints both have wikimedia.org as part of their Host header, yet they use different envoy listeners:

endpoint name | endpoint host header | endpoint URI | envoy listener
language_pairs | cxserver.wikimedia.org | http://localhost:6015/v1/languagepairs | cxserver
pageviews | wikimedia.org | http://localhost:6033/wikimedia.org/v1/metrics/pageviews | rest-gateway
wikipedia | {source}.wikipedia.org | http://localhost:6500/w/api.php | mw-api-int-async-ro
wikidata | www.wikidata.org | http://localhost:6500/w/api.php | mw-api-int-async-ro
event_logger | intake-analytics.wikimedia.org | http://localhost:6004/v1/events?hasty=true | eventgate-analytics
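
One way to keep this mapping straight in application code is a small lookup table. The following is only a sketch: the Endpoint type and build_request helper are hypothetical names, and the values are copied from the table above. It builds a request object with the right Host header but does not send it.

```python
import urllib.request
from typing import NamedTuple


class Endpoint(NamedTuple):
    host_header: str
    url: str
    listener: str


# Values copied from the table above.
ENDPOINTS = {
    "language_pairs": Endpoint(
        "cxserver.wikimedia.org",
        "http://localhost:6015/v1/languagepairs",
        "cxserver",
    ),
    "pageviews": Endpoint(
        "wikimedia.org",
        "http://localhost:6033/wikimedia.org/v1/metrics/pageviews",
        "rest-gateway",
    ),
    "wikipedia": Endpoint(
        "{source}.wikipedia.org",
        "http://localhost:6500/w/api.php",
        "mw-api-int-async-ro",
    ),
    "wikidata": Endpoint(
        "www.wikidata.org",
        "http://localhost:6500/w/api.php",
        "mw-api-int-async-ro",
    ),
    "event_logger": Endpoint(
        "intake-analytics.wikimedia.org",
        "http://localhost:6004/v1/events?hasty=true",
        "eventgate-analytics",
    ),
}


def build_request(name: str, **host_vars: str) -> urllib.request.Request:
    """Build (without sending) a request carrying the endpoint's Host header."""
    endpoint = ENDPOINTS[name]
    # Fill placeholders such as {source} in the host header template.
    host = endpoint.host_header.format(**host_vars)
    return urllib.request.Request(endpoint.url, headers={"Host": host})
```

For example, build_request("wikipedia", source="en") targets http://localhost:6500/w/api.php with Host set to en.wikipedia.org.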

Runtime configuration

Envoy allows you to change parts of its configuration at runtime, using the administration interface. You will find that exposed via localhost:9631 on instances and localhost:1666 or /var/run/envoy/admin.sock in kubernetes pods.

The following example increases the log level for the http logger to debug and configures the logger for the mwapi-async listener to log all requests (instead of just errors) in an Apache combined-like log format (it is not identical, though; see https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage#config-access-log and https://blog.getambassador.io/understanding-envoy-proxy-and-ambassador-http-access-logs-fee7802a2ec5).

# Via the admin port (quotes keep the shell from interpreting the "?")
curl -XPOST 'localhost:1666/logging?http=debug'
curl -XPOST 'localhost:1666/runtime_modify?mwapi-async_min_log_code=200'

# Via the admin socket
curl -XPOST --unix-socket /var/run/envoy/admin.sock 'http://localhost/logging?http=debug'
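
If you want to script such calls, the URLs can be assembled programmatically. This is a rough sketch only: admin_url is a hypothetical helper, and the default base assumes the localhost:1666 admin port mentioned above (use localhost:9631 on instances).

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def admin_url(
    endpoint: str,
    params: Optional[Mapping[str, str]] = None,
    base: str = "http://localhost:1666",
) -> str:
    """Build the URL of an Envoy admin endpoint with optional query parameters."""
    url = f"{base}/{endpoint.lstrip('/')}"
    if params:
        # urlencode escapes values, so keys like "mwapi-async_min_log_code" are safe
        url += "?" + urlencode(params)
    return url
```

The resulting URL would then be sent as a POST, for example with urllib.request.Request(url, method="POST").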

For easier access to the port inside of kubernetes pods/containers, use nsenter on the kubernetes node the container runs on or take a look at k8sh.

From a kubernetes host you can do the following to find the socket path and then use curl (i.e. without nsenter):

  • docker ps # find the container id (first column)
  • docker inspect <id> --format '{{.GraphDriver.Data.MergedDir}}'
  • cd to the directory above
  • curl -XPOST --unix-socket run/envoy/admin.sock http://localhost/logging?http=debug

Telemetry

Envoy telemetry data is embedded in a bunch of service dashboards in Grafana.wikimedia.org already. For generic dashboards, go to:

Building envoy for WMF

The Envoy community recently introduced https://www.getenvoy.io/, an Envoy proxy distribution that offers deb packages amongst other artifacts. That distribution channel didn't exist when we started to consider Envoy. Unfortunately, the deb packages they provide are quite incomplete.

Prepare a new version

The operations/debs/envoyproxy repository includes just the debian control files (starting from envoy version 1.26.1). We don't actually build envoy but package it from upstream binary releases. Part of the process is to download the release tarball and verify its sha512 hash against what upstream provides. A trusted source for the pubkey of their signature could not be found.

Because the build downloads the release from upstream, you will need to set HTTP proxy variables for internet access on the build host.
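
To illustrate what the hash verification amounts to, here is a minimal sketch of a SHA-512 check. It is not the code used by the packaging; the function name and the way the expected digest is passed in are assumptions made for the example.

```python
import hashlib


def sha512_matches(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA-512 digest equals the expected hex digest."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large tarballs are not loaded into memory at once
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

Note that a matching hash only shows the download is intact relative to the published digest; without a trusted signature key it does not prove who produced the release.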

The general process to follow is:

  • Check out operations/debs/envoyproxy on your workstation
  • Decide if you want to update an existing version (switch to the corresponding vX.Y branch) or add a new version (create a new vX.Y branch based off of the latest one)
  • Create a patch to bump the debian changelog
export NEW_VERSION=1.26.1 # envoy version you want to package
dch -v ${NEW_VERSION}-1 -D buster-wikimedia "Update to v${NEW_VERSION}"
git commit debian/changelog

# Make sure to submit the patch to the correct branch
git review vX.Y
  • Merge
git checkout vX.Y

# Ensure you allow networking in pbuilder
# This option needs to be in the file, an environment variable will *not* work!
echo "USENETWORK=yes" >> ~/.pbuilderrc

# Build the package
https_proxy=http://webproxy.$(hostname -d):8080 DIST=buster pdebuild
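
The naming used in the steps above (a vX.Y branch per minor release, and a package version of ${NEW_VERSION}-1) can be sketched as two small helpers. Both function names are hypothetical and only illustrate the convention.

```python
def branch_for(version: str) -> str:
    """Map an envoy version such as 1.26.1 to its vX.Y packaging branch."""
    major, minor, _patch = version.split(".")
    return f"v{major}.{minor}"


def deb_version(version: str, revision: int = 1) -> str:
    """Return the Debian package version, as passed to dch -v above."""
    return f"{version}-{revision}"
```

With these, version 1.26.1 maps to branch v1.26 and package version 1.26.1-1.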


Import with reprepro

# On apt1001, copy the packages from the build host
rsync -vaz build2001.codfw.wmnet::pbuilder-result/buster-amd64/envoyproxy*<PACKAGE VERSION>* .

sudo -i reprepro -C main include buster-wikimedia $HOME/envoyproxy*<PACKAGE VERSION>*.changes

# Copy the package over to other distributions if needed (this is possible because they only contain static binaries)
sudo -i reprepro copy bullseye-wikimedia buster-wikimedia envoyproxy
sudo -i reprepro copy bookworm-wikimedia buster-wikimedia envoyproxy

# If you want to test out a new version without rolling it out to production, you may import to the "envoy-future" component instead of "main"
# although this one only exists for buster currently
sudo -i reprepro -C component/envoy-future include buster-wikimedia $HOME/envoyproxy*<PACKAGE VERSION>*.changes

Copying the envoy-future package to main

If the exact version of Envoy you want is already available in the envoy-future component, and you want to make it available in main, you don't have to rebuild it.

On apt1001, find the existing .deb file under /srv/wikimedia/pool/component/envoy-future/e/envoyproxy. Then import it to main:

sudo -i reprepro -C main includedeb buster-wikimedia path/to/envoyproxy_1.XX.X-1_amd64.deb

# Copy to other distributions as needed
sudo -i reprepro copy bullseye-wikimedia buster-wikimedia envoyproxy
sudo -i reprepro copy bookworm-wikimedia buster-wikimedia envoyproxy

Build the envoy docker image

  • Bump the changelog of the envoy image (example)
    # in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/
    cd images/envoy
    # or for envoy-future
    cd images/envoy-future/
    
    # Bump changelog
    dch -D wikimedia --force-distribution -c changelog -v <envoy version number>-1
    
  • Log in to one build server (role role::builder in puppet) and run:
$ cd /srv/images/production-images
# If someone's been naughty and hand patched the repo, this will alert you before messing with the local git history
$ sudo git pull --ff-only
$ sudo build-production-images

The script will only build the images not present on our docker registry, so in your case presumably only the envoy image.

Update envoy

In CI

We're using envoy in operations/deployment-charts to lint and verify auto-generated envoy config.

To determine the envoy version, run "envoy --version" within the helm-linter image. You can do this on your laptop:

docker run --pull always --rm -it --entrypoint /usr/bin/envoy docker-registry.wikimedia.org/releng/helm-linter --version

To update the envoy version used there, bump the changelog at dockerfiles/helm-linter/changelog which triggers an update to the latest version:

dch -D wikimedia --force-distribution -c changelog

And add the new version to jjb/operations-misc.yaml in a second patch (example)

Once this is merged and built, run CI (maybe just rebuild last at https://integration.wikimedia.org/ci/job/helm-lint/ ?) to verify the new envoy version against our config.

envoy update rollout

Just use Debdeploy as usual. It is advised that a new version is rolled out as follows:

  • Start with one mwdebug node
    • Check curl -s localhost:9631/server_info to ensure the expected version is running
    • sudo tail -f /var/log/envoy/*.log
    • Try to navigate Wikipedia via the mwdebug instance you chose (X-Wikimedia-Debug)
    • Check the envoy telemetry and appservers dashboard
  • On one mediawiki and one restbase node (to see if everything is okay with real traffic)
  • On the mediawiki and restbase canaries 'A:mw-canary or A:restbase-canary'
  • One (smaller) Kubernetes service (staging, passive DC, active DC)

Keep it like that for a while. If everything goes well, continue with:

  • The rest of tls-terminated proxies (cumin query P{R:Package = envoyproxy} and not (P{O:mediawiki::common} or P{C:profile::restbase}) or use debmonitor)
  • The rest of mediawiki and restbase nodes
  • The rest of the Kubernetes deployments
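
The server_info check from the first rollout step can be scripted. The sketch below assumes the response is JSON with a "version" string of the form <commit>/<release>/<state>/<build type>/<SSL library>; verify that against the output of your envoy version before relying on it. The function name is hypothetical.

```python
import json


def envoy_version(server_info_json: str) -> str:
    """Extract the release number from a /server_info response body."""
    info = json.loads(server_info_json)
    # Assumption: the second "/"-separated field is the release, e.g. "1.26.1"
    fields = info["version"].split("/")
    if len(fields) < 2:
        raise ValueError(f"unexpected version string: {info['version']!r}")
    return fields[1]
```

The input would be the body returned by curl -s localhost:9631/server_info, and the result can be compared with the version you just deployed.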

Kubernetes/deployment pipeline

Once the image is published (you can verify this by running docker pull from your computer), you should deploy one or more low-traffic services with the new image to gain some confidence. If that goes well, change the default for all deployments to the new version.

Don't forget to remove the hardcoded image_version from the deployments you used for verification after updating the default.

Deploy single services with a new envoy version

Set new envoy version as default for all chart deployments