Event Platform/EventGate/Administration


EventGate is deployed using the WMF Helm & Kubernetes deployment-pipeline. This page describes how to build and deploy EventGate services, and documents how to administer and debug EventGate in beta and production.

Mediawiki Vagrant development

See: Event_Platform/EventGate#Development_in_Mediawiki_Vagrant


Beta / deployment-prep

Since we deploy Docker images to Kubernetes in production, we want to run those same images in beta. This is done by including the role::beta::docker_services class on a deployment-prep node via the Horizon Puppet Configuration interface. The service and image are configured by editing Hiera Config in the same Horizon interface. deployment-eventgate-1 is a good example. The EventBus Mediawiki extension in beta is configured with $wgEventServices entries that point to these instances.
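
To quickly check that a beta instance is up, you can POST a test event to it directly. A minimal smoke test sketch, reusing the test event from the Minikube example later on this page; the FQDN suffix and port 8192 are assumptions, so verify both against the instance's Hiera config in Horizon:

curl -v -X POST -H 'Content-Type: application/json' \
  -d '{"$schema": "/test/event/0.0.2", "meta": {"stream": "test.event", "id": "12345678-1234-5678-1234-567812345678", "dt": "2019-01-01T00:00:00Z", "domain": "wikimedia.org"}, "test": "beta smoke test"}' \
  'http://deployment-eventgate-1.deployment-prep.eqiad.wmflabs:8192/v1/events'  # hostname and port are assumptions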

Production

Production deployments of EventGate use WMF's Service Deployment Pipeline. Deploying new code and configuration through this pipeline currently takes several steps. You should first be familiar with the various technologies and phases of the pipeline. Here's some reading material for ya!

  • Deployment pipeline
  • Deployment Pipeline Design (AKA Streamlined Service Delivery Design)
  • Blubber - Dockerfile generator, ensures consistent Docker images.
  • Helm - Manages deployment releases to Kubernetes clusters. Helm charts describe e.g. Docker images and versions, service config templating, automated monitoring, metrics and logging, service replica scaling, etc.
  • Kubernetes - Containerized cloud clusters made up of 'pods'. Each pod can run multiple containers.


Deployment Pipeline Overview

Here's a general overview of how a code change, and then a Helm chart change, makes its way into production for EventGate. Code changes require a Docker image rebuild; eventgate Helm chart changes require a new chart version and a release upgrade.

Each EventGate service is deployed via the same eventgate Helm chart. Each service runs in its own kubernetes namespace and has a distinct release name. The services are configured using Helm values files during scap-helm install/upgrade.
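
All of the scap-helm commands on this page follow the same general shape. This is just a summary of the concrete examples below; the angle-bracketed parts vary per service, release and cluster:

CLUSTER=<staging|eqiad|codfw> scap-helm <service_name> <helm_subcommand> <release_name> \
  -f /srv/scap-helm/<chart>/<release>/<cluster>-values.yaml --reset-values stable/<chart>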

Current services (as of 2019-05):

  • eventgate-analytics - Produces high volume 'analytics' events to Kafka jumbo-eqiad cluster.
  • eventgate-main - Produces lower volume 'production' events to Kafka main-* clusters.

In case you get confused, here are the Helm and Kubernetes terms for the eventgate-analytics service:

  • docker image name: eventgate-ci (built from the eventgate-ci gerrit repository)
  • Helm chart: eventgate
  • Helm release name: analytics
  • scap-helm service name: eventgate-analytics
  • Kubernetes namespace: eventgate-analytics

In the eventgate-analytics service examples below, you will be deploying the eventgate-ci docker image using the eventgate Helm chart with the release name 'analytics'.


EventGate Code Change

1. Change is pushed to (or merged into) https://github.com/wikimedia/eventgate.

git push origin master

2. Master is pushed to https://gerrit.wikimedia.org/r/#/admin/projects/eventgate-ci. (Master must be pushed for tags to work)

git push gerrit master

3. Annotated tag is pushed to gerrit to trigger a CI pipeline image build:

tag=v1.0.9-wmf2; git tag -am "$tag" $tag && git push gerrit $tag

4. Jenkins trigger-service-pipeline-test-and-publish is triggered and launches the service-pipeline-test-and-publish job.

5. Once service-pipeline-test-and-publish finishes, the image will be available in our Docker registry https://docker-registry.wikimedia.org. You can list existing image tags with:

 curl https://docker-registry.wikimedia.org/v2/wikimedia/eventgate-ci/tags/list

Once the image is available, we can upgrade the appropriate release(s) in Kubernetes clusters.

6. On deployment.eqiad.wmnet, edit each /srv/scap-helm/eventgate/analytics/*-values.yaml file and change main_app.version:

  main_app:
    # ...
    version: v1.0.9-wmf2

Note that there are 3 *-values.yaml files, one for each Kubernetes cluster: staging, eqiad and codfw.
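
For reference, the three files live side by side on deployment.eqiad.wmnet (an illustrative listing; the directory may contain other files as well):

ls /srv/scap-helm/eventgate/analytics/
# codfw-values.yaml  eqiad-values.yaml  staging-values.yaml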

7. Upgrade the analytics release in the Kubernetes staging cluster and verify that it works:

CLUSTER=staging scap-helm eventgate-analytics upgrade analytics -f /srv/scap-helm/eventgate/analytics/staging-values.yaml --reset-values stable/eventgate
 
# Wait about a minute. You can check on upgrade status with:
CLUSTER=staging scap-helm eventgate-analytics status analytics
# POST to the service
time curl -v -X POST -H 'Content-Type: application/json' -d@/srv/scap-helm/eventgate/test_event_0.0.2.json 'http://kubestage1001.eqiad.wmnet:31192/v1/events?hasty=true'
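# Expect a 2xx response. Without ?hasty=true, a fully successful POST returns
# "HTTP/1.1 201 All 1 out of 1 events were accepted." (see the Minikube example output below).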

8. Upgrade the analytics release in the Kubernetes codfw and eqiad clusters:

CLUSTER=codfw scap-helm eventgate-analytics upgrade analytics -f /srv/scap-helm/eventgate/analytics/codfw-values.yaml --reset-values stable/eventgate
 
CLUSTER=eqiad scap-helm eventgate-analytics upgrade analytics -f /srv/scap-helm/eventgate/analytics/eqiad-values.yaml --reset-values stable/eventgate

eventgate chart release change

Changing a templated value is as easy as editing the *-values.yaml files and upgrading. To modify the Helm chart to e.g. change a template or default values, do the following:

1. Edit the eventgate chart in the operations/deployment-charts repository.

2. Test locally in Minikube (more below).

3. Once satisfied, bump the chart version in Chart.yaml. (NOTE: The chart version is independent of the EventGate code version.)
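
For example, a hypothetical version bump (0.1.1 is a made-up value; use the next appropriate chart version, and note this is GNU sed syntax):

# Bump only the chart's version line, leaving main_app.version alone:
sed -i 's/^version: .*/version: 0.1.1/' eventgate/Chart.yaml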

4. Package the new chart version, and reindex the Helm repository:

cd charts/
helm package eventgate && helm repo index .
# Add the (newly packaged) chart artifact to git:
git add eventgate-$(awk '/^version:/ {print $2}' eventgate/Chart.yaml).tgz

5. Commit and submit the changes to gerrit for review. Once merged, the new chart release should show up at https://releases.wikimedia.org/charts/.

6. Upgrade the analytics release in the Kubernetes staging cluster and verify that it works:

CLUSTER=staging scap-helm eventgate-analytics upgrade analytics -f /srv/scap-helm/eventgate/analytics/staging-values.yaml --reset-values stable/eventgate
 
# Wait about a minute. You can check on upgrade status with:
CLUSTER=staging scap-helm eventgate-analytics status analytics
# POST to the service
time curl -v -X POST -H 'Content-Type: application/json' -d@/srv/scap-helm/eventgate/test_event_0.0.2.json 'http://kubestage1001.eqiad.wmnet:31192/v1/events?hasty=true'

7. Upgrade the analytics release in the Kubernetes codfw and eqiad clusters:

CLUSTER=codfw scap-helm eventgate-analytics upgrade analytics -f /srv/scap-helm/eventgate/analytics/codfw-values.yaml --reset-values stable/eventgate
 
CLUSTER=eqiad scap-helm eventgate-analytics upgrade analytics -f /srv/scap-helm/eventgate/analytics/eqiad-values.yaml --reset-values stable/eventgate

Troubleshooting in production

Helm and kubectl commands can get rather verbose, and it can be hard to remember what goes where. One day we'll hopefully have improvements to scap-helm that help abstract this more. Until then, here are some helpful commands and tips for troubleshooting service deployments of EventGate.

NOTE: All scap-helm commands that have a CLUSTER environment variable set can be run with values of staging, codfw or eqiad. If CLUSTER is not set, the helm command will be run for both codfw and eqiad production Kubernetes clusters.

List deployed Helm releases

In staging:

CLUSTER=staging scap-helm eventgate-analytics list

In production:

scap-helm eventgate-analytics list

Get detailed status of a Helm release

In staging:

CLUSTER=staging scap-helm eventgate-analytics status analytics

In production:

scap-helm eventgate-analytics status analytics

Upgrade a Helm release

In staging:

CLUSTER=staging scap-helm eventgate-analytics upgrade analytics -f /srv/scap-helm/eventgate/analytics/staging-values.yaml --reset-values stable/eventgate

In codfw:

CLUSTER=codfw scap-helm eventgate-analytics upgrade analytics -f /srv/scap-helm/eventgate/analytics/codfw-values.yaml --reset-values stable/eventgate

In eqiad:

CLUSTER=eqiad scap-helm eventgate-analytics upgrade analytics -f /srv/scap-helm/eventgate/analytics/eqiad-values.yaml --reset-values stable/eventgate

Note that eqiad and codfw must be upgraded separately. This is because they each have a different values setting for main_app.topic_prefix.
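
You can compare the per-cluster settings directly; this assumes the key appears literally as topic_prefix in the values files:

grep topic_prefix /srv/scap-helm/eventgate/analytics/*-values.yaml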


Rollback to a previous Helm chart version

Rollbacks revert the deployed release to an exact previous deployment revision #, with the values that were provided at the time that revision was deployed. If you want to revert to a previous version of a chart, but with different values, you must do a helm upgrade with a chart --version flag, as described below.

In staging:

CLUSTER=staging scap-helm eventgate-analytics history analytics
# Find the revision # you want to rollback to and then:
CLUSTER=staging scap-helm eventgate-analytics rollback analytics <revision #>

In codfw:

CLUSTER=codfw scap-helm eventgate-analytics history analytics
# Find the revision # you want to rollback to and then:
CLUSTER=codfw scap-helm eventgate-analytics rollback analytics <revision #>

In eqiad:

CLUSTER=eqiad scap-helm eventgate-analytics history analytics
# Find the revision # you want to rollback to and then:
CLUSTER=eqiad scap-helm eventgate-analytics rollback analytics <revision #>

NOTE: release revisions might not be equivalent in eqiad and codfw since we upgrade them individually.


Upgrade/revert a Helm release to a specific Helm chart version

In staging:

CLUSTER=staging scap-helm eventgate-analytics upgrade analytics --version <chart_version> -f /srv/scap-helm/eventgate/analytics/staging-values.yaml --reset-values stable/eventgate

In codfw:

CLUSTER=codfw scap-helm eventgate-analytics upgrade analytics --version <chart_version> -f /srv/scap-helm/eventgate/analytics/codfw-values.yaml --reset-values stable/eventgate

In eqiad:

CLUSTER=eqiad scap-helm eventgate-analytics upgrade analytics --version <chart_version> -f /srv/scap-helm/eventgate/analytics/eqiad-values.yaml --reset-values stable/eventgate

List k8s pods and their k8s host nodes

In staging:

KUBECONFIG=/etc/kubernetes/eventgate-analytics-staging.config kubectl get pods -n eventgate-analytics -o wide

In codfw:

KUBECONFIG=/etc/kubernetes/eventgate-analytics-codfw.config kubectl get pods -n eventgate-analytics -o wide

In eqiad:

KUBECONFIG=/etc/kubernetes/eventgate-analytics-eqiad.config kubectl get pods -n eventgate-analytics -o wide

Delete a specific k8s pod

In staging:

sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl delete pod -n eventgate-analytics <pod_name>

In codfw:

sudo KUBECONFIG=/etc/kubernetes/admin-codfw.config kubectl delete pod -n eventgate-analytics <pod_name>

In eqiad:

sudo KUBECONFIG=/etc/kubernetes/admin-eqiad.config kubectl delete pod -n eventgate-analytics <pod_name>

Delete all k8s pods in a cluster

You shouldn't do this in production!

In staging:

sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl delete pod -n eventgate-analytics --all

Tail stdout/logs on a specific k8s pod container

In staging:

# Find a pod name, then tail its logs as JSON:
pod_name=$(KUBECONFIG=/etc/kubernetes/eventgate-analytics-staging.config kubectl -n eventgate-analytics get pods -l app=eventgate-analytics -o wide | tail -n 1 | awk '{print $1}')
KUBECONFIG=/etc/kubernetes/eventgate-analytics-staging.config kubectl logs -f --since 60m -n eventgate-analytics -c eventgate-analytics $pod_name | jq .

In codfw:

KUBECONFIG=/etc/kubernetes/eventgate-analytics-codfw.config kubectl logs -f --since 60m -n eventgate-analytics -c eventgate-analytics <pod_name> | jq .

In eqiad:

KUBECONFIG=/etc/kubernetes/eventgate-analytics-eqiad.config kubectl logs -f --since 60m -n eventgate-analytics -c eventgate-analytics <pod_name> | jq .

Get a shell on a specific k8s pod container

In staging:

# Find a pod name (read access), then exec into it with the admin config:
pod_name=$(KUBECONFIG=/etc/kubernetes/eventgate-analytics-staging.config kubectl -n eventgate-analytics get pods -l app=eventgate-analytics -o wide | tail -n 1 | awk '{print $1}')
sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl exec -ti -n eventgate-analytics -c eventgate-analytics $pod_name bash

In codfw:

sudo KUBECONFIG=/etc/kubernetes/admin-codfw.config kubectl exec -ti -n eventgate-analytics -c eventgate-analytics <pod_name> bash

In eqiad:

sudo KUBECONFIG=/etc/kubernetes/admin-eqiad.config kubectl exec -ti -n eventgate-analytics -c eventgate-analytics <pod_name> bash

strace on a process in a specific pod container

First, find the node your pod is running on (see kubectl get pods above) and ssh into that node.

# Get the docker container id in your pod. It's the first column ($1) in the output.
sudo docker ps | grep <pod_name> | grep nodejs
# now get the pid
sudo docker top <container_id> | grep '/usr/bin/node'
# strace it:
sudo strace -p <node_pid>

Or, all in one command (after finding your pod_name and logging into the k8s node):

pod_name=eventgate-analytics-7b6fbdf7b6-bmlh6
sudo strace -p $(sudo docker top $(sudo docker ps | grep $pod_name | grep nodejs | head -n 1 | awk '{print $1}')  | grep /usr/bin/node | head -n 1 | awk '{print $2}')

Get a root shell on a specific k8s pod container

Again, find the node where your pod is running and log into that node. Then:

sudo docker exec -ti -u root $(sudo docker ps | grep <pod_name> | grep nodejs | tail -n 1 | awk '{print $1}') /bin/bash

Helm Chart Development

User:Alexandros_Kosiaris/Benchmarking_kubernetes_apps has some instructions on setting up Minikube and Helm for chart development and then benchmarking. This section provides some EventGate specific instructions.

EventGate Helm development environment setup

1. Install Minikube. Follow the instructions at https://kubernetes.io/docs/tasks/tools/install-minikube/. Minikube is a virtualized, single-host local development Kubernetes cluster.

If Minikube is not started, you can start it with:

minikube start

You'll also need to turn on promiscuous mode so that the Kafka pod will work properly:

minikube ssh
sudo ip link set docker0 promisc on
exit

(See: https://stackoverflow.com/questions/45748536/kafka-inaccessible-once-inside-kubernetes-minikube/52792288#52792288)

2. Install kubectl. Follow instructions on https://kubernetes.io/docs/tasks/tools/install-kubectl/

3. Install Helm. Follow instructions at https://docs.helm.sh/using_helm/#installing-helm. You will need to download the appropriate version for your OS and place it in the $PATH (or %PATH% if you are on Windows)

4. Install Blubber. Follow instructions at https://wikitech.wikimedia.org/wiki/Blubber/Download.

5. Use Minikube as your Docker host:

eval $(minikube docker-env)

6. Build a local EventGate development Docker image using Blubber:

 blubber .pipeline/blubber.yaml development > Dockerfile && docker build -t eventgate-dev .

There are several variants in the blubber.yaml file. Here development is selected, and the Docker image is tagged with eventgate-dev.

7. If you don't already have it, clone the operations/deployment-charts repository.

8. Install the Kafka development Helm chart into Minikube:

cd deployment-charts/charts
helm install ./kafka-dev

This will install a Zookeeper and Kafka pod and keep it running.
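
Before installing EventGate, it's worth checking that Kafka and Zookeeper came up:

kubectl get pods
# The kafka-dev pod(s) should reach STATUS Running after a minute or so.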

9. Install a development chart release into Minikube:

helm install -n development --set main_app.image=eventgate-dev ./eventgate

10. Test that it works:

# Consume from the Kafka test event topic
kafkacat -C -b $(minikube ip):30092 -t datacenter1.test.event
# In another shell, define a handy service alias:
alias service="echo $(minikube ip):$(kubectl get svc --namespace default eventgate-development -o jsonpath='{.spec.ports[0].nodePort}')"
 
# POST to the eventgate-development service in Minikube
curl -v -H 'Content-Type: application/json' -d '{"$schema": "/test/event/0.0.2", "meta": {"stream": "test.event", "id": "12345678-1234-5678-1234-567812345678", "dt": "2019-01-01T00:00:00Z", "domain": "wikimedia.org"}, "test": "specific test value"}'  $(service)/v1/events

You should see some output from curl like:

...
< HTTP/1.1 201 All 1 out of 1 events were accepted.
...

11. Now that the development release is running, you can make local changes to it and re-deploy those changes in Minikube:

helm delete --purge development && helm install -n development --set main_app.image=eventgate-dev ./eventgate