Event Platform/EventGate/Administration

EventGate is deployed using the WMF Helm & Kubernetes deployment-pipeline. This page describes how to build and deploy EventGate services, and how to administer and debug them in beta and production.

Our deployments of EventGate are done using the eventgate-wikimedia repository. This is an npm module that implements a WMF specific EventGate factory, and specifies EventGate as a dependency. It launches service-runner via the EventGate module with config provided here that sets eventgate_factory_module to eventgate-wikimedia.js.


Mediawiki Vagrant development

See: Event_Platform/EventGate#Development_in_Mediawiki_Vagrant


Beta / deployment-prep

Since we deploy Docker images to Kubernetes in production, we want to run these same images in beta. This is done by including the role::beta::docker_services class on a deployment-prep node via the Horizon Puppet Configuration interface. The service and image are configured by editing Hiera config in the same Horizon interface. deployment-eventgate-1 is a good example. The EventBus MediaWiki extension in beta is configured with $wgEventServices that point to these instances.

Production

Production deployments of EventGate use WMF's Service Deployment Pipeline. Deploying new code and configuration to this pipeline currently takes several steps. You should first be familiar with the various technologies and phases of this pipeline. Here's some reading material for ya!

  • Deployment pipeline
  • Deployment Pipeline Design (AKA Streamlined Service Delivery Design)
  • Blubber - Dockerfile generator, ensures consistent Docker images.
  • Helm - Manages deployment releases to Kubernetes clusters. Helm Charts describe e.g. Docker images and versions, service config templating, automated monitoring, metrics and logging, service replica scaling, etc.
  • Kubernetes - Containerized cloud clusters made of 'pods'. Each pod can run multiple containers.


Deployment Pipeline Overview

Here's a general overview of how a code change, and then a Helm chart change, in EventGate makes it to production. Code changes require Docker image rebuilds, and eventgate Helm chart changes require a new chart version and release upgrade.

Each EventGate service is deployed via the same eventgate Helm chart. Each service runs in its own Kubernetes namespace and has a distinct release name. The services are configured and deployed using helmfile custom values files and commands.

Current services (as of 2019-07):

  • eventgate-analytics - Produces high volume 'analytics' events to Kafka jumbo-eqiad cluster.
  • eventgate-main - Produces lower volume 'production' events to Kafka main-* clusters.

In case you get confused, here are the Helm and Kubernetes terms for the eventgate-analytics service:

  • service name: eventgate-analytics
  • docker image name: eventgate-wikimedia (built from the eventgate-wikimedia gerrit repository)
  • Helm chart: eventgate
  • Helm release name: analytics
  • Kubernetes namespace: eventgate-analytics

In the eventgate-analytics service examples below, you will be deploying the eventgate-wikimedia docker image from the eventgate Helm chart with the release name 'analytics', deploying and applying values via helmfile.
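
The naming terms above can be sketched as a small shell helper. This is only an illustration: `describe_service` is a hypothetical name, and it assumes that every service follows the eventgate-analytics pattern (release name is the service name without the `eventgate-` prefix, namespace equals the service name), which this page only confirms for eventgate-analytics.

```shell
# Print the Helm and Kubernetes terms for an EventGate service.
# Assumption: the naming pattern shown for eventgate-analytics applies to the
# service passed in; this is not confirmed for other services.
describe_service() {
  service="$1"
  printf 'service name: %s\n' "$service"
  printf 'docker image name: eventgate-wikimedia\n'
  printf 'Helm chart: eventgate\n'
  # Strip the leading "eventgate-" to get the release name.
  printf 'Helm release name: %s\n' "${service#eventgate-}"
  printf 'Kubernetes namespace: %s\n' "$service"
}
```

Calling `describe_service eventgate-analytics` prints the five terms listed above.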

There are 3 repositories that may need changes.

  • EventGate - This is the generic pluggable library & service
  • eventgate-wikimedia - Wikimedia specific implementation code and deployment pipeline Blubber files.
  • deployment-charts - Helm charts and helmfile values, specifies configs for service deployment.

If you make a change to EventGate or eventgate-wikimedia, you must trigger a rebuild of the eventgate-wikimedia Docker image, then update the image version used by the release and deploy. If you only need a config or chart change, no image rebuild is needed: a config change is a helmfile values edit followed by a deploy, and a chart change requires packaging a new chart version before deploying.

EventGate / eventgate-wikimedia Code Change

If this is an EventGate change, first push the change to the EventGate repository, then change the eventgate dependency SHA version in eventgate-wikimedia package.json.

1. Change is merged to eventgate-wikimedia. This will trigger a service-pipeline-build.

2. Jenkins trigger-service-pipeline-test-and-publish is triggered and launches the service-pipeline-test-and-publish job.

3. Once service-pipeline-test-and-publish finishes, the image will be available in our Docker registry https://docker-registry.wikimedia.org. You can list existing image tags with:

 curl https://docker-registry.wikimedia.org/v2/wikimedia/eventgate-wikimedia/tags/list
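
The tags/list endpoint returns a JSON document with a `tags` array. A minimal sketch for printing one tag per line from that response follows. `list_tags` is a hypothetical helper, and it assumes a compact single-line response whose tags contain no commas, quotes, or brackets. Where jq is available, `jq -r '.tags[]'` is the more robust choice.

```shell
# Read a Docker Registry v2 tags/list JSON document on stdin and print one
# tag per line.
# Assumption: the response is on a single line and tags are simple strings.
list_tags() {
  # Capture the contents of the "tags" array, drop quotes and spaces,
  # then split on commas.
  sed -n 's/.*"tags":[[:space:]]*\[\([^]]*\)].*/\1/p' | tr -d '" ' | tr ',' '\n'
}

# Usage (not executed here):
#   curl -s https://docker-registry.wikimedia.org/v2/wikimedia/eventgate-wikimedia/tags/list | list_tags | sort
```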

Once the image is available, we can upgrade the appropriate release(s) in Kubernetes clusters.

4. Edit the appropriate environment and service helmfile.d values.yaml file(s) in the deployment-charts repo, e.g. helmfile.d/services/{staging,codfw,eqiad}/eventgate-analytics/values.yaml, and update the image version. Merge this change. About 1 minute later the updated values file will be pulled on the deployment server.

5. Upgrade the analytics release in the Kubernetes staging cluster and verify that it works:

cd /srv/deployment-charts/helmfile.d/services/staging/eventgate-analytics
# diff to see what you'll be changing
source .hfenv; helmfile diff
# Apply the helmfile; THIS WILL ACTUALLY DEPLOY!
source .hfenv; helmfile apply
# Wait about a minute. You can check on upgrade status with:
source .hfenv; helmfile status
# POST to the service
time curl -v -X POST -H 'Content-Type: application/json' -d@/srv/scap-helm/eventgate/test_event_0.0.2.json 'http://kubestage1001.eqiad.wmnet:31192/v1/events?hasty=true'
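
The POST above reads its body from a prepared test event file. A minimal sketch for generating such a body follows. `make_test_event` is a hypothetical helper, and it assumes the event has the same fields as the /test/event/0.0.2 example in the Minikube section of this page. The contents of the file on the deployment server are not shown here and may differ.

```shell
# Print a test event as JSON. Arguments: <event id> <ISO-8601 timestamp>.
# Assumption: field names follow the /test/event/0.0.2 example used in the
# Minikube section; the file used in production may contain other values.
make_test_event() {
  printf '{"$schema": "/test/event/0.0.2", "meta": {"stream": "test.event", "id": "%s", "dt": "%s", "domain": "wikimedia.org"}, "test": "specific test value"}\n' "$1" "$2"
}

# Usage (not executed here; assumes uuidgen is installed):
#   make_test_event "$(uuidgen)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > /tmp/test_event.json
```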

6. Upgrade the analytics release in the Kubernetes codfw and eqiad clusters:

# codfw
cd /srv/deployment-charts/helmfile.d/services/codfw/eventgate-analytics
source .hfenv; helmfile diff
source .hfenv; helmfile apply
# eqiad
cd /srv/deployment-charts/helmfile.d/services/eqiad/eventgate-analytics
source .hfenv; helmfile diff
source .hfenv; helmfile apply
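
The per-cluster commands above follow one pattern, so they can be generated instead of retyped. The sketch below only prints the commands for review and executes nothing. `rollout_plan` is a hypothetical helper, and it assumes that other services use the same helmfile.d/services/<cluster>/<service> layout shown for eventgate-analytics.

```shell
# Print the helmfile commands for one service in the given clusters, in order.
# Nothing is executed; the output is meant to be reviewed and run by hand.
# Assumption: every service lives under helmfile.d/services/<cluster>/<service>.
rollout_plan() {
  service="$1"
  shift
  for cluster in "$@"; do
    printf '# %s\n' "$cluster"
    printf 'cd /srv/deployment-charts/helmfile.d/services/%s/%s\n' "$cluster" "$service"
    printf 'source .hfenv; helmfile diff\n'
    printf 'source .hfenv; helmfile apply\n'
  done
}
```

For example, `rollout_plan eventgate-analytics codfw eqiad` prints the same eight lines as the block above.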

eventgate service values config change

Service specific configs are kept in values.yaml files inside of helmfile.d. To make a simple config value change, edit the appropriate service / cluster(s) values.yaml files, e.g. deployment-charts/helmfile.d/services/*/eventgate-analytics/values.yaml. Commit and merge the change, wait up to 1 minute for the change to be synced on the deployment server, then follow the upgrade process described in steps 5-6 above.

eventgate chart change

To modify the Helm chart to e.g. change a template or default values, do the following:

1. Edit the eventgate chart in the deployment-charts repository.

2. Test locally in Minikube (more below).

3. Once satisfied, bump the chart version in Chart.yaml. (NOTE: The chart version is independent of the EventGate code version.)

4. Package the new chart version, and reindex the Helm repository:

cd charts/
helm package eventgate && helm repo index .
# Add the (newly packaged) chart artifact to git:
git add eventgate-$(cat eventgate/Chart.yaml  | grep 'version: ' | awk '{print $2}').tgz
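
The version lookup in the git add command above can be made a little stricter. The sketch below matches only the top-level `version:` key, so an indented `version:` (for example under a dependency entry) is ignored. `chart_version` and `chart_artifact` are hypothetical helper names, and `chart_artifact` assumes the chart directory name equals the chart name, as it does for eventgate here.

```shell
# Print the top-level "version:" value of a Chart.yaml file.
# Anchoring on the start of the line skips indented keys.
chart_version() {
  awk '/^version:/ { gsub(/[^0-9A-Za-z.+-]/, "", $2); print $2; exit }' "$1"
}

# Print the artifact file name expected from "helm package <chart dir>".
# Assumption: the directory name matches the chart name in Chart.yaml.
chart_artifact() {
  printf '%s-%s.tgz\n' "$(basename "$1")" "$(chart_version "$1/Chart.yaml")"
}

# Usage from the charts/ directory (not executed here):
#   git add "$(chart_artifact eventgate)"
```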

5. Commit and submit the changes to gerrit for review. Once merged, the new chart release should show up at https://releases.wikimedia.org/charts/.

6. Upgrade the analytics release in the Kubernetes staging cluster and verify that it works:

cd /srv/deployment-charts/helmfile.d/services/staging/eventgate-analytics
# diff to see what you'll be changing
source .hfenv; helmfile diff
# Apply the helmfile; THIS WILL ACTUALLY DEPLOY!
source .hfenv; helmfile apply
# Wait about a minute. You can check on upgrade status with:
source .hfenv; helmfile status
# POST to the service
time curl -v -X POST -H 'Content-Type: application/json' -d@/srv/scap-helm/eventgate/test_event_0.0.2.json 'http://kubestage1001.eqiad.wmnet:31192/v1/events?hasty=true'

7. Upgrade the analytics release in the Kubernetes codfw and eqiad clusters:

# codfw
cd /srv/deployment-charts/helmfile.d/services/codfw/eventgate-analytics
source .hfenv; helmfile diff
source .hfenv; helmfile apply
# eqiad
cd /srv/deployment-charts/helmfile.d/services/eqiad/eventgate-analytics
source .hfenv; helmfile diff
source .hfenv; helmfile apply

Troubleshooting in production

All helmfile and kubectl commands below assume your CWD is a helmfile.d service directory on the deployment server, e.g. /srv/deployment-charts/helmfile.d/services/staging/eventgate-analytics

Get detailed status of Helm release

See Migrating_from_scap-helm#Seeing_the_current_status

Upgrade a Helm release

See Migrating_from_scap-helm#Code_deployment/configuration_changes

Rollback to a previous Helm chart version

See Migrating_from_scap-helm#Rolling_back_changes

List k8s pods and their k8s host nodes

source .hfenv; kubectl get pods -o wide
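
When only the pod-to-node mapping is needed (for example before the strace steps further down), the wide output can be reduced to two columns. A minimal sketch follows. `pod_nodes` is a hypothetical helper that locates the NODE column from the header line, and it assumes whitespace-separated columns with no spaces inside the values that come before NODE.

```shell
# Read "kubectl get pods -o wide" output on stdin and print "<pod> <node>"
# for every pod.
# Assumption: no column before NODE contains a space in its value.
pod_nodes() {
  awk '
    NR == 1 {
      # Find the first header field named NODE ("NOMINATED NODE" comes later).
      for (i = 1; i <= NF; i++) {
        if ($i == "NODE") { col = i; break }
      }
      next
    }
    col { print $1, $col }
  '
}

# Usage (not executed here):
#   source .hfenv; kubectl get pods -o wide | pod_nodes
```

kubectl can also emit just these fields itself with `-o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName`.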

Delete a specific k8s pod

source .hfenv; kubectl delete pod <pod_name>

Delete all k8s pods in a cluster

You shouldn't do this in production! The command below deletes every pod in the eventgate-analytics namespace of the staging cluster.

sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl delete pod -n eventgate-analytics --all

Tail stdout/logs on a specific k8s pod container

In staging (automatically using the single active pod id):

source .hfenv; kubectl logs -c eventgate-analytics -f --since 60m $(source .hfenv; kubectl get pods -l app=eventgate-analytics  -o wide | tail -n 1 | awk '{print $1}') | jq .

For a specific pod:

source .hfenv; kubectl logs -c eventgate-analytics -f --since 60m <pod_name> | jq .

Get a shell on a specific k8s pod container

In staging (automatically using the single active pod id):

sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl exec -ti -n eventgate-analytics -c eventgate-analytics $(source .hfenv; kubectl get pods -l app=eventgate-analytics  -o wide | tail -n 1 | awk '{print $1}') bash

For a specific pod:

CLUSTER=eqiad # or codfw
sudo KUBECONFIG=/etc/kubernetes/admin-$CLUSTER.config kubectl exec -ti -n eventgate-analytics -c eventgate-analytics <pod_name> bash

strace on a process in a specific pod container

First find the host node your pod is running on with kubectl get pods -o wide (see above), then ssh into that node.

# Get the docker container id in your pod.  This will be $1 in the output.
sudo docker ps | grep <pod_name> | grep nodejs
# now get the pid
sudo docker top <container_id> | grep '/usr/bin/node'
# strace it:
sudo strace -p <node_pid>

Or, all in one command (after finding your pod_name and logging into the k8s node):

pod_name=eventgate-analytics-7b6fbdf7b6-bmlh6
sudo strace -p $(sudo docker top $(sudo docker ps | grep $pod_name | grep nodejs | head -n 1 | awk '{print $1}')  | grep /usr/bin/node | head -n 1 | awk '{print $2}')
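
The one-liner above is easier to debug when its two lookups are separated. The sketch below splits it into two hypothetical helpers that read `docker ps` and `docker top` output on stdin. Like the original commands, it assumes the container line mentions both the pod name and nodejs, and that the PID is the second column of `docker top`.

```shell
# Read "docker ps" output on stdin; print the id of the first container whose
# line mentions both the given pod name and "nodejs".
container_for_pod() {
  awk -v pod="$1" 'index($0, pod) && index($0, "nodejs") { print $1; exit }'
}

# Read "docker top <container_id>" output on stdin; print the PID (column 2)
# of the first /usr/bin/node process.
node_pid() {
  awk 'index($0, "/usr/bin/node") { print $2; exit }'
}

# Usage on the k8s node (not executed here):
#   cid=$(sudo docker ps | container_for_pod "$pod_name")
#   sudo strace -p "$(sudo docker top "$cid" | node_pid)"
```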

Get a root shell on a specific k8s pod container

Again, find the node where your pod is running and log into that node. Then:

sudo docker exec -ti -u root $(sudo docker ps |grep <pod_name> | grep nodejs | tail -n 1 | awk '{print $1}') /bin/bash

Helm Chart Development

User:Alexandros_Kosiaris/Benchmarking_kubernetes_apps has some instructions on setting up Minikube and Helm for chart development and then benchmarking. This section provides some EventGate specific instructions.

EventGate Helm development environment setup

1. Install Minikube. Follow instructions at https://kubernetes.io/docs/tasks/tools/install-minikube/. Minikube is a virtualized, single-host Kubernetes cluster for local development.

If Minikube is not started, you can start it with:

minikube start

You'll also need to turn on promiscuous mode so that the Kafka pod will work properly:

minikube ssh
sudo ip link set docker0 promisc on
exit

(See: https://stackoverflow.com/questions/45748536/kafka-inaccessible-once-inside-kubernetes-minikube/52792288#52792288)

2. Install kubectl. Follow instructions on https://kubernetes.io/docs/tasks/tools/install-kubectl/

3. Install Helm. Follow instructions at https://docs.helm.sh/using_helm/#installing-helm. You will need to download the appropriate version for your OS and place it in the $PATH (or %PATH% if you are on Windows)

4. Install Blubber. Follow instructions at https://wikitech.wikimedia.org/wiki/Blubber/Download.

5. Use Minikube as your Docker host:

eval $(minikube docker-env)

6. Clone the eventgate-wikimedia repository:

git clone https://gerrit.wikimedia.org/r/eventgate-wikimedia
cd eventgate-wikimedia

7. Build a local eventgate-wikimedia development Docker image using Blubber:

 blubber .pipeline/blubber.yaml development > Dockerfile && docker build -t eventgate-dev .

There are several variants in the blubber.yaml file. Here development is selected, and the Docker image is tagged with eventgate-dev.

8. If you don't already have it, clone the operations/deployment-charts repository.

 git clone https://gerrit.wikimedia.org/r/operations/deployment-charts

9. Install the Kafka development Helm chart into Minikube:

cd deployment-charts/charts
helm install ./kafka-dev

This will install a Zookeeper and Kafka pod and keep it running.

10. Install a development chart release into Minikube:

helm install -n development --set main_app.image=eventgate-dev ./eventgate

11. Test that it works:

# Consume from the Kafka test event topic
kafkacat -C -b $(minikube ip):30092 -t datacenter1.test.event
# In another shell, define a handy service alias:
alias service="echo $(minikube ip):$(kubectl get svc --namespace default eventgate-development -o jsonpath='{.spec.ports[0].nodePort}')"

# POST to the eventgate-development service in Minikube
curl -v -H 'Content-Type: application/json' -d '{"$schema": "/test/event/0.0.2", "meta": {"stream": "test.event", "id": "12345678-1234-5678-1234-567812345678", "dt": "2019-01-01T00:00:00Z", "domain": "wikimedia.org"}, "test": "specific test value"}'  $(service)/v1/events

You should see some output from curl like:

...
< HTTP/1.1 201 All 1 out of 1 events were accepted.
...

12. Now that the development release is running, you can make local changes to it and re-deploy those changes in Minikube:

helm delete --purge development && helm install -n development --set main_app.image=eventgate-dev ./eventgate