Event Platform/EventGate/Administration

From Wikitech

EventGate is deployed using the WMF Helm & Kubernetes deployment pipeline. This page describes how to build and deploy EventGate services, and how to administer and debug EventGate in beta and production.

Our deployments of EventGate are done using the eventgate-wikimedia repository. This is an npm module that implements a WMF specific EventGate factory and specifies EventGate as a dependency. It launches service-runner via the EventGate module, with a config that sets eventgate_factory_module to eventgate-wikimedia.js.
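
For orientation, the service-runner wiring looks roughly like this (a sketch; only the eventgate_factory_module setting comes from the description above, the other keys and the path are illustrative):

# Illustrative excerpt of the service-runner config used to launch EventGate.
services:
  - name: eventgate
    # Load the generic EventGate service module...
    module: eventgate
    conf:
      # ...and tell it to build its EventGate instance with the WMF-specific factory.
      eventgate_factory_module: ./eventgate-wikimedia.js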


Mediawiki Vagrant development

See: Event_Platform/EventGate#Development_in_Mediawiki_Vagrant


Beta / deployment-prep

Since we deploy Docker images to Kubernetes in production, we want to run these same images in beta. This is done by including the role::beta::docker_services class on a deployment-prep node via the Horizon Puppet Configuration interface. The configuration of the service and image is done by editing Hiera config in the same Horizon interface. deployment-eventgate-3 is a good example. The EventBus Mediawiki extension in beta is configured with $wgEventServices that point to these instances.

For details on how to access and work with deployment-prep, refer to Event Platform/Beta Deployments.

Production

Primary documentation for Kubernetes Deployments is here: Deployments_on_kubernetes

Production deployments of EventGate use WMF's Service Deployment Pipeline. Deploying new code and configuration to this pipeline currently has several steps. You should first be familiar with the various technologies and phases of this pipeline. Here's some reading material for ya!

  • Deployment pipeline
  • Deployment Pipeline Design (AKA Streamlined Service Delivery Design)
  • Blubber - Dockerfile generator, ensures consistent Docker images.
  • Helm - Manages deployment releases to Kubernetes clusters. Helm Charts describe e.g. Docker images and versions, service config templating, automated monitoring, metrics and logging, service replica scaling, etc.
  • Kubernetes - Containerized cloud clusters made up of 'pods'. Each pod can run multiple containers.


Deployment Pipeline Overview

Here's a general overview of how a code change and then a Helm chart change in EventGate make it to production. Code changes require Docker image rebuilds, and eventgate Helm chart changes require a new chart version and release upgrade.

Each EventGate service is deployed via the same eventgate Helm chart. Each service runs in its own Kubernetes namespace and has a distinct release name. The services are configured and deployed using helmfile custom values files and commands.

Current services (as of 2020-03)

  • eventgate-main - Produces lower volume 'production' events to Kafka main-* clusters.
  • eventgate-analytics - Produces high volume 'analytics' events to Kafka jumbo-eqiad cluster.
  • eventgate-analytics-external - Produces medium volume client side 'analytics' events to Kafka jumbo-eqiad cluster.
  • eventgate-logging-external - Produces client side error logs to the Kafka logging-* cluster for use in logstash.

In case you get confused, here are the Helm and Kubernetes terms for the eventgate-analytics service:

  • main app (service) name: eventgate-analytics
  • docker image name: eventgate-wikimedia (built from the eventgate-wikimedia gerrit repository)
  • Helm chart: eventgate
  • Helm release name: canary or production
  • Kubernetes cluster name: staging, eqiad or codfw
  • Kubernetes namespace: eventgate-analytics

In the eventgate-analytics service examples below, you will be deploying the eventgate-wikimedia docker image using the eventgate Helm chart, deploying and applying values via helmfile.
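
For orientation, a typical deployment from the deployment server looks roughly like this (a sketch; the exact directory layout and environments are covered at Deployments_on_kubernetes and in the Troubleshooting section below):

# Each service has its own helmfile.d directory of values files
cd /srv/deployment-charts/helmfile.d/services/eventgate-analytics
# Apply the eventgate chart plus these values to each cluster ("environment")
helmfile -e staging apply
helmfile -e eqiad apply
helmfile -e codfw apply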

There are 3 repositories that may need changes.

  • EventGate - This is the generic pluggable library & service
  • eventgate-wikimedia - Wikimedia specific implementation code and deployment pipeline Blubber files.
  • deployment-charts - Helm charts and helmfile values, specifies configs for service deployment.

If you make a change to EventGate or eventgate-wikimedia, you must trigger a rebuild of the eventgate Docker image, then change the image version in the eventgate chart and deploy. If you just need to make a config or chart change, then you only need to build a new chart and deploy.

EventGate / eventgate-wikimedia Code Change

EventGate repository change

  1. First merge/push the change to the EventGate repository.
  2. Push a version bump to EventGate's package.json, and publish to npm (https://www.npmjs.com/package/eventgate). You will need npm permissions to publish a new EventGate version.
  3. Then change the eventgate dependency SHA version in eventgate-wikimedia package.json.
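
A rough sketch of steps 2 and 3 (the npm commands are one common way to do the bump; the package.json dependency line is illustrative only and may not match the real format):

# In the EventGate repo: bump the version and publish to npm (requires npm permissions)
npm version patch
npm publish

# In eventgate-wikimedia/package.json, point the eventgate dependency at the new
# version or git SHA. Illustrative format only:
#   "eventgate": "git+https://gerrit.wikimedia.org/r/eventgate#<new sha>"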

eventgate-wikimedia repository change

  1. Change is merged to eventgate-wikimedia. This will trigger a service pipeline build.
  2. Jenkins trigger-service-pipeline-test-and-publish is triggered and launches the service-pipeline-test-and-publish job.
  3. Once service-pipeline-test-and-publish finishes, the image will be available in our Docker registry https://docker-registry.wikimedia.org. You can list existing image tags with curl https://docker-registry.wikimedia.org/v2/wikimedia/eventgate-wikimedia/tags/list.
  4. Once the image is available, upgrade the appropriate release(s) in Kubernetes clusters. Edit the appropriate helm values.yaml file(s) in the deployment-charts repo, e.g. helmfile.d/services/eventgate-analytics/values.yaml, and update the image version (see the illustrative values excerpt after this list). Merge this change. About 1 minute later, the updated values file will be pulled onto the deployment server.
  5. Jump to deployment.eqiad.wmnet. Upgrade the eventgate-analytics service in Kubernetes and verify that it works. Again, to do this follow the instructions at Deployments_on_kubernetes#Code_deployment/configuration_changes.
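
An illustrative values.yaml excerpt for step 4 (a sketch; the exact key path may differ in the eventgate chart, and the image tag shown is a placeholder):

# deployment-charts/helmfile.d/services/eventgate-analytics/values.yaml (excerpt)
main_app:
  # Placeholder image tag; substitute the version published to docker-registry.wikimedia.org
  version: 2020-03-01-000000-production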

eventgate-wikimedia schema repository change

Most eventgate services at wikimedia use remote schema repositories, so they do not require an image rebuild and deploy to pick up a new schema. However, if you modify an existing schema version (hopefully you never have to do this), or if you need eventgate-main to use a new schema or schema version, you'll need an image rebuild and a deploy/restart of the eventgate service.

To bump the schema repo, edit eventgate-wikimedia/.pipeline/blubber.yaml and change the git SHA(s) associated with the schema repository you want to update.

    builder:
      # Clone Wikimedia event schema repositories into /srv/service/schemas/event/*
      # If you update schema repository, you'll need to update
      # the SHAs that are checked out here, and then rebuild docker images.
      command:
        - >-
            mkdir -p /srv/service/schemas/event &&
            git clone --single-branch -- https://gerrit.wikimedia.org/r/schemas/event/primary /srv/service/schemas/event/primary && cd /srv/service/schemas/event/primary && git reset --hard d725698 &&
            git clone --single-branch -- https://gerrit.wikimedia.org/r/schemas/event/secondary /srv/service/schemas/event/secondary && cd /srv/service/schemas/event/secondary && git reset --hard 7405981 # <-- change these SHAs

Commit and merge this change. The deployment pipeline will automatically build a new docker image version and post a comment on the gerrit change with the image version.

Follow steps 4 and 5 above to deploy the change.


eventgate service values config change

Service specific configs are kept in values.yaml files inside of helmfile.d. To make a simple config value change, edit the appropriate service / cluster(s) values.yaml files, e.g. deployment-charts/helmfile.d/services/eventgate-analytics/values*.yaml. Commit and merge the change, wait up to 1 minute for the change to be synced on the deployment server, then follow the upgrade process described in step 5 above.
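
Once the change has synced, a typical sequence on the deployment server is (a sketch; helmfile diff previews the rendered changes before anything is applied):

cd /srv/deployment-charts/helmfile.d/services/eventgate-analytics
helmfile -e staging diff    # preview what would change
helmfile -e staging apply   # roll it out, then repeat for eqiad and codfw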

eventgate chart change

To modify the Helm chart to e.g. change a template or default values, do the following:

1. Edit the eventgate chart in the deployment-charts repository.

2. Test locally in Minikube (more below).

3. Once satisfied, bump the chart version in Chart.yaml. (NOTE: The chart version is independent of the EventGate code version.)
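
For example (the version numbers here are illustrative; the chart version follows its own sequence, independent of EventGate's code version):

# deployment-charts/charts/eventgate/Chart.yaml (excerpt)
name: eventgate
version: 0.4.1   # bumped from 0.4.0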

4. Commit and submit the changes to gerrit for review. Once merged, the new chart release should show up at https://helm-charts.wikimedia.org/api/stable/charts/eventgate.

5. Follow the above instructions at Deployments_on_kubernetes#Code_deployment/configuration_changes to upgrade your service to the new deployment.

EventStreamConfig change

EventGate instances are configured to request stream configuration from the MediaWiki EventStreamConfig API, but the way they do so varies depending on configuration. For most 'production' instances, stream configuration is not edited often. To avoid runtime coupling of production EventGate instances, these instances are configured to look up their pertinent stream configs only when the service starts. However, eventgate-wikimedia also supports 'dynamic' runtime stream config lookup, meaning that if a stream is being produced for which EventGate does not have stream configuration, it will attempt to look up that configuration from the remote EventStreamConfig API.

eventgate-analytics-external is meant for feature instrumentation, and has a higher rate of stream configuration changes. It is the only EventGate instance (as of 2020-08) that looks up event stream configuration at runtime.
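
To see which stream configs a running instance currently has loaded (static or dynamic), you can curl its stream-configs endpoint, as in the Troubleshooting example further down:

# from the deployment server, against a pod IP (see "curl a specific pod" below)
curl http://<pod_ip>:8192/v1/stream-configs | jq .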

To make a change to stream config, either to add a new stream or to change a setting:

1. Edit wgEventStreams in mediawiki-config/wmf-config/ext-EventStreamConfig.php. This might look like:

        'resource-purge' => [
            'schema_title' => 'resource_change',
            'destination_event_service' => 'eventgate-main',
        ],

The stream config entry is keyed by stream name, and must minimally specify the schema_title setting (the title field of the event schemas that will be allowed in this stream) and the destination_event_service setting (the name of the EventGate service that is allowed to produce this event stream). Other stream config settings may be used by services other than EventGate (e.g. the EventLogging extension). Some default settings are set for all streams in wgEventStreamsDefaultSettings, but can be overridden for specific streams.

2. Merge and sync this change.

What happens next depends on whether the EventGate instance uses static or dynamic stream config:

3a. If this stream config change is for an EventGate instance that uses dynamic stream config, no action is needed; the new stream config will be automatically looked up when it is used.

3b. If this was a change for an EventGate that uses static stream config, you'll have to restart the pods to get them to look up the change.

See Event_Platform/EventGate/Administration#Roll_restart_all_pods
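
For reference, that roll restart looks like (run from the service's helmfile.d directory on the deployment server):

helmfile -e eqiad --state-values-set roll_restart=1 sync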

Troubleshooting in production

All helmfile and kubectl commands below assume your CWD is a helmfile.d service directory on the deployment server, e.g. /srv/deployment-charts/helmfile.d/services/staging/eventgate-analytics

curl a specific pod

# Get pods, copy an IP address
kube_env eventgate-analytics staging; kubectl get pods -o wide

# curl the http (not https) port (usually 8192 for all eventgates)
curl 10.64.75.101:8192/v1/stream-configs

Get detailed status of Helm release

See Migrating_from_scap-helm#Seeing_the_current_status

Upgrade a Helm release

See Migrating_from_scap-helm#Code_deployment/configuration_changes

Rollback to a previous Helm chart version

See Migrating_from_scap-helm#Rolling_back_changes

Targeting a specific release with Helmfile (e.g. canary)

 helmfile -e eqiad --selector name=canary ...

List k8s pods and their k8s host nodes

kube_env eventgate-analytics eqiad; kubectl get pods -o wide

Roll restart all pods

 helmfile -e eqiad --state-values-set  roll_restart=1 sync

Delete a specific k8s pod

sudo -i; kube_env admin <CLUSTER>; kubectl -n <tiller_namespace> delete pod <pod_name>

(<tiller_namespace> is likely the service name, e.g. eventgate-main.)

Delete all k8s pods in a cluster

You shouldn't do this in production!

sudo -i; kube_env admin <CLUSTER>; kubectl delete pod -n <tiller_namespace> --all

(<tiller_namespace> is likely the service name, e.g. eventgate-main.)

Tail stdout/logs on all pods in a service

 for pod in $(kube_env eventgate-analytics eqiad; kubectl get pods -o wide | grep eventgate | awk '{print $1}'); do kube_env eventgate-analytics eqiad; kubectl logs -f --since 1h -c $TILLER_NAMESPACE $pod & done | jq .

Tail stdout/logs on a specific k8s pod container

In staging (automatically using the single active pod id):

kube_env eventgate-analytics eqiad; kubectl logs -c $TILLER_NAMESPACE -f --since 60m $(kube_env eventgate-analytics eqiad; kubectl get pods -l app=$TILLER_NAMESPACE  -o wide | tail -n 1 | awk '{print $1}') | jq .

For a specific pod:

kube_env eventgate-analytics eqiad; kubectl logs -c $TILLER_NAMESPACE -f --since 60m <pod_name> | jq .

Get a shell on a specific k8s pod container

In staging (automatically using the single active pod id):

kube_env eventgate-analytics eqiad; sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl exec -ti -n $TILLER_NAMESPACE -c $TILLER_NAMESPACE $(kube_env eventgate-analytics eqiad; kubectl get pods -l app=$TILLER_NAMESPACE  -o wide | tail -n 1 | awk '{print $1}') bash

For a specific pod:

CLUSTER=eqiad # or codfw
kube_env eventgate-analytics $CLUSTER; sudo KUBECONFIG=/etc/kubernetes/admin-$CLUSTER.config kubectl exec -ti -n $TILLER_NAMESPACE -c $TILLER_NAMESPACE <pod_name> bash

strace on a process in a specific pod container

First find the host node your pod is running on. See above for kubectl get pods. ssh into that node.

# Get the docker container id in your pod.  This will be $1 in the output.
sudo docker ps | grep <pod_name> | grep nodejs
# now get the pid
sudo docker top <container_id> | grep '/usr/bin/node'
# strace it:
sudo strace -p <node_pid>

Or, all in one command (after finding your pod_name and logging into the k8s node):

pod_name=eventgate-analytics-7b6fbdf7b6-bmlh6
sudo strace -p $(sudo docker top $(sudo docker ps | grep $pod_name | grep nodejs | head -n 1 | awk '{print $1}')  | grep /usr/bin/node | head -n 1 | awk '{print $2}')

Get a root shell on a specific k8s pod container

Again, find the node where your pod is running and log into that node. Then:

sudo docker exec -ti -u root $(sudo docker ps |grep <pod_name> | grep nodejs | tail -n 1 | awk '{print $1}') /bin/bash

Profiling nodejs

The eventgate chart has a 'debug mode' that will allow you to enable and collect stack profiling information from the running NodeJS processes. You'll need to have root permissions to gather the profiling results.

To enable debug mode on a canary release:

helmfile -e eqiad --selector name=canary apply --set debug_mode_enabled=true

This should enable v8 profiling logging in the container, as well as add some configs that will allow you to use perf to generate flamegraphs.

Find the node that the relevant (canary) pod is running on.

$ kube_env eventgate-analytics eqiad
$ kubectl get pods -o wide | grep canary
# eventgate-canary-6b5758794-8s2dj        2/2     Running   21 (13m ago)   116m   10.67.139.214   kubernetes1036.eqiad.wmnet   <none>           <none>

ssh into the relevant node, in this case kubernetes1036.eqiad.wmnet and become root.

Generating a flamegraph with perf

Find the relevant container PID. We'll be working with two PIDs, the root namespace PID, as well as the container's namespace PID. The mapping from root -> container PID can be found in /proc/$pid/status.

$ sudo -s
$ ps auxff | grep inspect= # you're looking for your PID here.  debug_mode_enabled adds this to the CLI, so good enough to grep for it

# eventgate uses service-runner, which has master and worker pids.  
# You'll likely want a worker pid, so a 'child' process.
900       951865  5.3  0.0 1729532 65444 ?       Ssl  16:28   0:01  \_ nodejs --inspect=0.0.0.0:9229 --prof --no-logfile-per-isolate --logfile=/tmp/eventgate-analytics-external-v8.log --perf-basic-prof --no-turbo-inlining --interpreted-frames-native-stack /srv/service/node_modules/.bin/eventgate -c /etc/eventgate/config.yaml
900       952133 23.3  0.0 3919036 130948 ?      Sl   16:28   0:05      \_ /usr/bin/node --inspect=0.0.0.0:9229 --prof --no-logfile-per-isolate --logfile=/tmp/eventgate-analytics-external-v8.log --perf-basic-prof --no-turbo-inlining --interpreted-frames-native-stack --inspect-port=9230 /srv/service/node_modules/service-runner/service-runner.js -c /etc/eventgate/config.yaml

# Our worker PID in root namespace is 952133

# Find the container namespace PID in /proc:
$ pid=952133
$ cat /proc/$pid/status | grep NSpid
NSpid:	952133	7173

# The worker container namespace PID is 7173.

Run perf record on the root namespace pid to collect some tracing data:

# record perf data for 30 seconds
perf record -F 99 -p  $pid  -g -- sleep 30s

Copy the 'map' files from the container filesystem. These provide nodejs function name mappings so that the output is easier to read.

# Find the docker container ID
$ docker ps | grep inspect=
525ea21a930e   72eb2231932b                                                                                 "nodejs --inspect=0.…"   47 seconds ago      Up 46 seconds                k8s_eventgate-analytics_eventgate-canary-b64b4b55c-nqmgx_eventgate-analytics_0b7fef68-967b-4092-ab1d-c68667b3ee85_4

# Our container ID is 525ea21a930e

# list the files in /tmp that we want to grab
$ docker exec -t 525ea21a930e  ls /tmp
eventgate-analytics-v8.log  perf-1522.map  perf-7173.map

# use docker cp to copy the relevant .map file to local filesystem.
# From above, we saw that the container namespace PID is 7173.
# Also rename the file to the root namespace pid so perf will know to read it.
$ docker cp 525ea21a930e:/tmp/perf-7173.map ./perf-$pid.map

# Now we should have perf.data and perf-$pid.map files in the cwd.  
# Generate trace output file:
perf script --header > stacks.$pid.out

Now that we've got the trace output file, we can use FlameGraph to generate a .svg image. Clone the FlameGraph repo:
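
# FlameGraph scripts by Brendan Gregg
git clone https://github.com/brendangregg/FlameGraph.git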

./FlameGraph/stackcollapse-perf.pl < stacks.$pid.out | ./FlameGraph/flamegraph.pl > stacks.$pid.svg

You can then open the .svg file in a browser.

See also: https://nodejs.org/en/docs/guides/diagnostics-flamegraph

Viewing v8 profiler output

debug_mode_enabled also creates a v8 profile data file. You can copy this from the docker container and run node --prof-process to generate a text report.

# Copy the v8 profile log file from the docker container
$ docker cp 525ea21a930e:/tmp/eventgate-analytics-v8.log ./

# Use node --prof-process.  
# Requires nodejs installed, so you probably should 
# copy this log file to your local machine.
$ node --prof-process eventgate-analytics-v8.log > eventgate-analytics-v8.processed.txt

See also: https://nodejs.org/en/docs/guides/simple-profiling

Helm Chart Development

User:Alexandros_Kosiaris/Benchmarking_kubernetes_apps has some instructions on setting up Minikube and Helm for chart development and then benchmarking. This section provides some EventGate specific instructions.

EventGate Helm development environment setup

1. Install Minikube. Follow instructions at https://kubernetes.io/docs/tasks/tools/install-minikube/. Minikube is a virtualized, single-host local development Kubernetes cluster.

If Minikube is not started, you can start it with:

minikube start

You'll also need to turn on promiscuous mode so that the Kafka pod will work properly:

minikube ssh
sudo ip link set docker0 promisc on
exit

(See: https://stackoverflow.com/questions/45748536/kafka-inaccessible-once-inside-kubernetes-minikube/52792288#52792288).

Networking for macOS users

On macOS, minikube setup with --driver=docker (the default) won't allow access to services from other hosts on the same network. One workaround is to start minikube with the virtualbox driver.

Alternatively it is possible to port forward services to the host machine with:

kubectl port-forward svc/<service>  <local port>:<remote port>

$(minikube ip) should be replaced with localhost in all examples that follow.


2. Install Helm. Follow instructions at https://docs.helm.sh/using_helm/#installing-helm. You will need to download the appropriate version for your OS and place it in the $PATH (or %PATH% if you are on Windows).

3. Install Blubber. Follow the Blubber as a (micro)Service instructions at https://wikitech.wikimedia.org/wiki/Blubber/Download.

4. Use Minikube as your Docker host:

eval $(minikube docker-env)

This command will configure the local environment to reuse the docker daemon inside the minikube instance. This is necessary, among other things, to use locally built docker images.

5. clone the eventgate-wikimedia repository

git clone https://gerrit.wikimedia.org/r/eventgate-wikimedia
cd eventgate-wikimedia

6. Build a local eventgate-wikimedia development Docker image using Blubber:

 blubber .pipeline/blubber.yaml development > Dockerfile && docker build -t eventgate-dev .

There are several variants in the blubber.yaml file. Here development is selected, and the Docker image is tagged with eventgate-dev.

7. If you don't already have it, clone the operations/deployment-charts repository.

 git clone https://gerrit.wikimedia.org/r/operations/deployment-charts

8. Install the Kafka development Helm chart into Minikube:

cd deployment-charts/charts
helm install kafka-dev ./kafka-dev

This will install a Zookeeper and Kafka pod and keep it running.

9. Install a development chart release into Minikube:

helm install eventgate-dev --set main_app.image=eventgate-dev --set main_app.conf.name=eventgate ./eventgate

10. Test that it works:

# Consume from the Kafka test event topic
kafkacat -C -b $(minikube ip):30092 -t datacenter1.test.event
# In another shell, define a handy service alias:
alias service="echo $(minikube ip):$(kubectl get svc --namespace default eventgate-development -o jsonpath='{.spec.ports[0].nodePort}')"
 
# POST to the eventgate-development service in Minikube
curl -v -H 'Content-Type: application/json' -d '{"$schema": "/test/event/0.0.2", "meta": {"stream": "test.event", "id": "12345678-1234-5678-1234-567812345678", "dt": "2019-01-01T00:00:00Z", "domain": "wikimedia.org"}, "test": "specific test value"}'  $(service)/v1/events

You should see some output from curl like:

...
< HTTP/1.1 201 All 1 out of 1 events were accepted.
...

Networking for macOS users

macOS users running minikube with --driver=docker should port forward Kafka with:

kubectl port-forward svc/kafka  30092:30092

And access Kafka at localhost:30092 instead of $(minikube ip):30092.

11. Now that the development release is running, you can make local changes to it and re-deploy those changes in Minikube:

helm uninstall eventgate-dev && helm install eventgate-dev --set main_app.image=eventgate-dev ./eventgate

Benchmarking

Benchmarking of EventGate was done during its initial estimation of production k8s pod counts, following User:Alexandros_Kosiaris/Benchmarking_kubernetes_apps. The initial results are not documented, but a phabricator comment indicates that a single instance (with certain resource settings) can handle around 1800 events per second.
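
A rough local load test can be improvised against the Minikube development release from the setup above (a sketch using ApacheBench; the payload is the same test event as the curl example, and the request count/concurrency values are arbitrary):

# Write a valid test event to a file
cat > /tmp/test-event.json <<'EOF'
{"$schema": "/test/event/0.0.2", "meta": {"stream": "test.event", "id": "12345678-1234-5678-1234-567812345678", "dt": "2019-01-01T00:00:00Z", "domain": "wikimedia.org"}, "test": "specific test value"}
EOF

# POST it repeatedly to the development service (uses the 'service' alias from step 10)
ab -n 10000 -c 50 -p /tmp/test-event.json -T 'application/json' http://$(service)/v1/events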