Event Platform/EventStreams/Administration

See EventStreams for an overview of the EventStreams service.

EventStreams is a service-template-node based service. It glues KafkaSSE together with common Wikimedia service features such as logging, error reporting, metrics, configuration, and deployment.

Internally, EventStreams is available at eventstreams.svc.${::site}.wmnet. Public traffic to stream.wikimedia.org is routed to it by Varnish and LVS.

Configuration

EventStreams is configured in Puppet via role::eventstreams::* hiera variables. The production configuration lives in the scb role hiera.

role::eventstreams::streams maps stream routes to composite topics in Kafka. Our event topics are prefixed by datacenter name; this mapping abstracts the prefixes away from EventStreams consumers. Any combination of stream name -> composite topic list is possible, e.g.:

role::eventstreams::streams:
  recentchange:
    topics:
      - eqiad.mediawiki.recentchange
      - codfw.mediawiki.recentchange
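
With this mapping, the recentchange stream is exposed as a single merged stream of both datacenter-prefixed topics. A quick smoke test of a configured stream route, using the public endpoint and the /v2/stream API path:

# Consume a few events from the merged recentchange stream, then stop:
curl -s -H 'Accept: text/event-stream' https://stream.wikimedia.org/v2/stream/recentchange | head -n 8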

Kafka

EventStreams is backed by the main Kafka clusters. As of 2018-08, EventStreams is multi-DC capable. By default, EventStreams in codfw still consumes from main-eqiad across DC boundaries, but the backing Kafka cluster can be switched at any time. It does this so that the events themselves don't have to rely on MirrorMaker propagation.
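
To confirm that both datacenter-prefixed topics exist on the backing cluster, you can inspect Kafka metadata with kafkacat. The broker hostname below is an assumption; use any broker in the main Kafka cluster:

# List topic metadata and filter for the recentchange topics (broker host is an assumption):
kafkacat -L -b kafka-main1001.eqiad.wmnet:9092 | grep 'mediawiki.recentchange'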

NodeJS Kafka Client

KafkaSSE uses node-rdkafka (as do other production NodeJS services that use Kafka).

Repositories

KafkaSSE (github): Generic Kafka Consumer -> SSE NodeJS library.
eventstreams (github): EventStreams implementation using KafkaSSE and service-template-node.
eventstreams/deploy: Deploy repository for EventStreams; contains scap3 config and node dependencies.

Deployment

EventStreams is deployed by Scap3 to the scb production service cluster. In deployment-prep, EventStreams is deployed to sca hosts.

Submitting changes

Change to KafkaSSE library

KafkaSSE is hosted on GitHub, so you must either submit a pull request or push a change there.

kafka-sse is an npm dependency of EventStreams.

If you update kafka-sse, you should bump the package version and publish to npm: https://www.npmjs.com/package/kafka-sse
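
A typical release flow, assuming you have publish rights for the kafka-sse npm package, looks like this:

# In your KafkaSSE working copy, after your change is merged:
npm version patch    # bump the package version and create a git tag
git push && git push --tags
npm publish          # publish the new kafka-sse release to npm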

Change to eventstreams repository

EventStreams is hosted in Gerrit. Use git review to submit patches. If you've modified the KafkaSSE repository, update the kafka-sse dependency version in package.json.
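
For example, a sketch of picking up a new kafka-sse release in EventStreams (pin the exact version as appropriate):

# In your eventstreams working copy:
npm install --save kafka-sse@latest   # updates the dependency version in package.json
git commit -a -m 'Bump kafka-sse dependency'
git review                            # submit the patch to gerrit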

Update eventstreams/deploy repository

Once you've made changes to either of the above two repositories, you'll need to rebuild the eventstreams/deploy repository. The easiest way to do this is to use service-runner's docker builder. Follow the instructions at https://www.mediawiki.org/wiki/ServiceTemplateNode/Deployment#Local_git to do so.
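
As a rough sketch, the build is driven from the eventstreams working copy; the exact invocation and flags are documented at the link above, so treat the following as an assumption:

# In your eventstreams working copy (invocation is an assumption; see the docs above):
./server.js build --deploy-repo --force   # builds node dependencies in Docker and updates the deploy repo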

Deploy

SSH to the deploy server and run the following commands to deploy the latest commit in the eventstreams/deploy repository.

ssh deployment.eqiad.wmnet # or deployment-tin.deployment-prep.eqiad.wmflabs
cd /srv/deployment/eventstreams/deploy
git pull && git submodule update
scap deploy
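
After scap reports success, you can verify that a target host is serving events locally. The service port below is an assumption; check the service configuration:

# On a target host (port is an assumption):
curl -s -H 'Accept: text/event-stream' http://localhost:8092/v2/stream/recentchange | head -n 3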


Logs

Logs are output to disk on target hosts in /srv/log/eventstreams/.
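
To follow the service logs on a target host (the exact file names are an assumption; list the directory first):

# On a target host:
ls /srv/log/eventstreams/
tail -f /srv/log/eventstreams/main.log   # file name is an assumption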

Metrics

https://grafana.wikimedia.org/dashboard/db/eventstreams

Throughput limits

As of 2019-07, the public EventStreams endpoint at stream.wikimedia.org is configured in Varnish to allow only 25 concurrent connections per Varnish backend. There are 10 text varnishes in codfw and 8 in eqiad, so the Varnish concurrent connection limit for EventStreams is 200 in eqiad and 250 in codfw, for a total of 450 concurrent connections. We have had incidents where a rogue client spawns too many connections. EventStreams has some primitive logic that tries to reduce the number of concurrent connections from the same X-Client-IP, but this will not fully prevent the issue. If new connections receive a 502 error from Varnish, check the total number of connections in https://grafana.wikimedia.org/dashboard/db/eventstreams.
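
The Grafana dashboard is the authoritative view, but as a rough on-host check you can count established connections to the service port (the port is an assumption):

# On a service host, count established connections to the service port (port is an assumption):
ss -tn state established '( sport = :8092 )' | tail -n +2 | wc -l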

Alerts

EventStreams has a monitoring check that verifies the /v2/stream/recentchange URL has data on it. The check runs on each service node via localhost, as well as against the public stream.wikimedia.org endpoint. If the local check fails, something is likely wrong with that node's service process. If the public check fails, then all backend service processes likely have the same issue.

Incidents