EventBus/Administration

Hosts and Services

Please double-check the following information against puppet and update this page if necessary. EventBus runs on four hosts:

  • kafka100[12] in eqiad
  • kafka200[12] in codfw

Two services run on every host:

  • eventlogging-service-eventbus, an HTTP service that accepts events and publishes them to Kafka
  • Kafka

The hosts are defined in Conftool (https://wikitech.wikimedia.org/wiki/Conftool) under "eventbus", so keep in mind that they need to be pooled and depooled from their related pybal domains. The only active domain is eventbus.svc.eqiad.wmnet.

HTTP proxy restart (eventlogging-service-eventbus)

Warning: Before starting, please sync with the Services team to coordinate the restart.

eventlogging-service-eventbus stops accepting connections while restarting, so some events might get lost. Anything that is not an emergency operation should follow this procedure:

Check the status of the pybal/confdata managed domain at http://config-master.wikimedia.org/conftool/eqiad/eventbus. You should see:

{ 'host': 'kafka1001.eqiad.wmnet', 'weight':10, 'enabled': True }
{ 'host': 'kafka1002.eqiad.wmnet', 'weight':10, 'enabled': True }
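The same check can be scripted. The following is a minimal sketch, assuming the status page returns one record per line in the format shown above. enabled_hosts is a hypothetical helper name, and fetching the page with curl is an assumption.

```shell
# Sketch: print the host names that the status output reports as enabled.
# Assumes one record per line, e.g.
#   { 'host': 'kafka1001.eqiad.wmnet', 'weight':10, 'enabled': True }
enabled_hosts() {
  grep "'enabled': True" | sed -n "s/.*'host': '\([^']*\)'.*/\1/p"
}

# Hypothetical usage (assumes curl is available on the host):
#   curl -s http://config-master.wikimedia.org/conftool/eqiad/eventbus | enabled_hosts
```

Both eqiad hosts should be listed before you depool anything.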

Remove one host at a time from the pybal/confdata managed domain:

confctl --find --action set/pooled=no kafka1001.eqiad.wmnet

Check that eventlogging-service-eventbus is no longer receiving connections on your depooled host on port 8085:

# Show rate of tcp SYNs on port 8085
sudo tcpdump dst eventbus.svc.eqiad.wmnet and port 8085 and "tcp[tcpflags] & tcp-syn != 0" | pv -l > /dev/null

Restart the HTTP proxy:

sudo systemctl restart eventlogging-service-eventbus.service

Check logs to make sure that everything looks good:

sudo journalctl -f -u eventlogging-service-eventbus.service
sudo tail -f /var/log/eventlogging/eventlogging-service-eventbus.log

Example of problems:

(MainThread) 400 POST /v1/events (10.64.0.71) 1.87ms
(MainThread) Failed processing event: Failed validating <Event xxxxxxxxxx of schema yyyyy>
(MainThread) 0 out of 1 events were accepted.

In this case please contact the Analytics team and the Services team.
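To look for such failures without reading the whole log, the matching lines can be counted. This is a minimal sketch, assuming failures are logged with the "Failed processing event" text shown above. count_failed_events is a hypothetical helper name.

```shell
# Sketch: count log lines on stdin that report a failed event.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
count_failed_events() {
  grep -c 'Failed processing event' || true
}

# Hypothetical usage after a restart:
#   sudo journalctl -u eventlogging-service-eventbus.service --since "10 minutes ago" | count_failed_events
```

A non-zero count right after the restart is a reason to look at the log in detail.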

Host Reboot

Warning: Before starting, please sync with the Services team to coordinate the reboot, since they might need to temporarily stop services like ChangeProp to avoid any risk of causing an outage.

Check the status of the pybal/confdata managed domain at http://config-master.wikimedia.org/conftool/eqiad/eventbus. You should see:

{ 'host': 'kafka1001.eqiad.wmnet', 'weight':10, 'enabled': True }
{ 'host': 'kafka1002.eqiad.wmnet', 'weight':10, 'enabled': True }

Remove one host at a time from the pybal/confdata managed domain:

confctl --find --action set/pooled=no kafka1001.eqiad.wmnet

Check that eventlogging-service-eventbus is no longer receiving connections on your depooled host on port 8085:

# Show rate of tcp SYNs on port 8085
sudo tcpdump dst eventbus.svc.eqiad.wmnet and port 8085 and "tcp[tcpflags] & tcp-syn != 0" | pv -l > /dev/null

Check Kafka topic/partition status:

elukey@kafka1001:~$ kafka topic --describe
Topic:mediawiki.page_delete	PartitionCount:1	ReplicationFactor:2	Configs:
	Topic: mediawiki.page_delete	Partition: 0	Leader: 1001	Replicas: 1001,1002	Isr: 1001,1002
Topic:mediawiki.page_edit	PartitionCount:1	ReplicationFactor:2	Configs:
	Topic: mediawiki.page_edit	Partition: 0	Leader: 1001	Replicas: 1001,1002	Isr: 1001,1002
Topic:mediawiki.page_move	PartitionCount:1	ReplicationFactor:2	Configs:
	Topic: mediawiki.page_move	Partition: 0	Leader: 1001	Replicas: 1001,1002	Isr: 1001,1002
Topic:mediawiki.page_restore	PartitionCount:1	ReplicationFactor:2	Configs:
	Topic: mediawiki.page_restore	Partition: 0	Leader: 1002	Replicas: 1002,1001	Isr: 1001,1002
Topic:mediawiki.repage_move	PartitionCount:1	ReplicationFactor:2	Configs:
	Topic: mediawiki.repage_move	Partition: 0	Leader: 1002	Replicas: 1002,1001	Isr: 1001,1002
Topic:mediawiki.revision_visibility_set	PartitionCount:1	ReplicationFactor:2	Configs:
	Topic: mediawiki.revision_visibility_set	Partition: 0	Leader: 1002	Replicas: 1002,1001	Isr: 1001,1002
Topic:test	PartitionCount:2	ReplicationFactor:2	Configs:
	Topic: test	Partition: 0	Leader: 1001	Replicas: 1001,1002	Isr: 1001,1002
	Topic: test	Partition: 1	Leader: 1002	Replicas: 1002,1001	Isr: 1001,1002
Topic:test.event	PartitionCount:1	ReplicationFactor:2	Configs:
	Topic: test.event	Partition: 0	Leader: 1002	Replicas: 1002,1001	Isr: 1001,1002
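The Replicas and Isr columns of this output can also be compared mechanically. This is a minimal sketch, assuming the per-partition line layout shown above. under_replicated is a hypothetical helper name.

```shell
# Sketch: print every partition line whose ISR has fewer brokers than its
# replica list. No output means all partitions are fully in sync.
under_replicated() {
  awk '
    /Partition:/ && /Replicas:/ && /Isr:/ {
      replicas = $0
      sub(/.*Replicas:[ \t]*/, "", replicas)   # drop everything before the list
      sub(/[ \t].*/, "", replicas)             # drop everything after the list
      isr = $0
      sub(/.*Isr:[ \t]*/, "", isr)
      sub(/[ \t].*/, "", isr)
      if (split(isr, a, ",") < split(replicas, b, ",")) print
    }
  '
}

# Hypothetical usage:
#   kafka topic --describe | under_replicated
```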

Every partition should list both brokers in its ISR, and partition leadership should be spread evenly across the brokers.

Stop Kafka on the host:

sudo service kafka stop

Check the topic partition status again to make sure that the host is no longer among the partition leaders:

kafka topic --describe
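This check can be scripted as well. The following is a minimal sketch, assuming the broker id matches the number in the host name (1001 for kafka1001), as the output earlier on this page suggests. led_by is a hypothetical helper name.

```shell
# Sketch: print the partition lines whose leader is the given broker id.
# No output means the broker leads no partitions.
led_by() {
  awk -v id="$1" '
    /Leader:/ {
      leader = $0
      sub(/.*Leader:[ \t]*/, "", leader)   # drop everything before the id
      sub(/[ \t].*/, "", leader)           # drop everything after the id
      if (leader == id) print
    }
  '
}

# Hypothetical usage on kafka1001 after stopping Kafka there:
#   kafka topic --describe | led_by 1001
```

The same helper can be used after the preferred replica election, where the expectation is reversed and the host should appear as a leader again.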

At this point it is safe to reboot! When the host comes back up, run a preferred replica election:

kafka preferred-replica-election

Check the topic partition status again to make sure that the host is back among the partition leaders:

kafka topic --describe

Last step! Re-add the host to the confd pool:

confctl --find --action set/pooled=yes kafka1001.eqiad.wmnet

Check the logs to make sure that everything looks good:

sudo journalctl -u eventlogging-service-eventbus.service
sudo tail -f /var/log/kafka/kafka.log

Deploying

EventBus runs out of an eventlogging repository deployed via scap. On the deploy server, deployment is done out of /srv/deployment/eventlogging/eventbus. If you do a straight-up scap deploy, eventlogging-service-eventbus will be restarted on the targets after the deploy finishes. To avoid that, you might want to run through the HTTP proxy restart procedure described above for each node. That is:

For each target node:

  1. depool target_node
  2. scap deploy -l target_node
  3. repool target_node
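The loop above can be sketched in shell using the confctl and scap commands from this page. This is only an illustration: rolling_deploy and the RUN variable are hypothetical names, and the sketch assumes confctl and scap can both be run from the same shell, which may not hold in practice. By default it only prints the commands.

```shell
# Sketch: depool, deploy to, and repool one node at a time.
# RUN is a hypothetical switch: left unset, commands are only printed;
# set RUN="" to actually execute them.
RUN="${RUN-echo}"

rolling_deploy() {
  for node in "$@"; do
    # Stop at the first failure so a broken node is not repooled blindly.
    $RUN confctl --find --action set/pooled=no "$node" || return 1
    $RUN scap deploy -l "$node" || return 1
    $RUN confctl --find --action set/pooled=yes "$node" || return 1
  done
}

# Hypothetical usage from /srv/deployment/eventlogging/eventbus:
#   rolling_deploy kafka1001.eqiad.wmnet kafka1002.eqiad.wmnet
```

The sketch leaves out the connection-drain and log checks from the restart procedure. Those still need to be done between the depool and the deploy.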

Adding worker processes

EventBus is puppetized via the eventbus role in puppet, using the eventlogging::service::service define. Setting num_processes on it will increase the number of worker processes Tornado will spawn to handle HTTP requests. See also http://tornado.readthedocs.io/en/latest/process.html. NOTE: We may want to try manually configuring multiple eventlogging-service-eventbus processes on different ports, and load balance to them via nginx (or a fancier LVS that can do this now?).


Failed events

eventlogging-service-eventbus is configured to write any event failures to a local log file. This is done using eventlogging's file:// handler, but could easily be changed to any other eventlogging handler. We write failed events to a local file instead of Kafka to make it more likely that we are able to capture events that fail due to Kafka issues. We can then choose to manually re-produce the events.

Currently, events are written out as EventErrors. The JSON string of the failed event can be found in the rawEvent field.

Failed events can be found in /srv/log/eventlogging/eventlogging-service-eventbus.failed_events.log.

Replaying failed events

The eventlogging-replay script can be used to replay streams of EventError events. The failed_events.log configured for EventBus is such a stream. To replay one of these log files in eqiad:

# The following can be run on any eventbus host, e.g. kafka1001.
export PYTHONPATH=/srv/deployment/eventlogging/eventbus
cat /srv/log/eventlogging/eventlogging-service-eventbus.failed_events.log | \
  /srv/deployment/eventlogging/eventbus/bin/eventlogging-replay \
  --output-invalid file:///tmp/failed-replay-events.log \
  stdin:// \
  'kafka:///kafka1001.eqiad.wmnet:9092,kafka1002.eqiad.wmnet:9092?async=False&topic=eqiad.{meta[topic]}'

The above pipes the contents of eventlogging-service-eventbus.failed_events.log into eventlogging-replay, which reads from stdin and produces to main-eqiad Kafka. The --output-invalid parameter tells eventlogging-replay to write any events that it fails to produce itself to /tmp/failed-replay-events.log. Hopefully there won't be any.

Note that each input and output is specified as an eventlogging URI. See the EventLogging URIs documentation for more information. For EventBus, the proper Kafka output URI can be found in /etc/eventlogging.d/services/eventbus. You can copy and paste it out of that config file and use it for eventlogging-replay. Just remember to wrap it in single quotes, or else the & will be interpreted by your shell.
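As a small illustration of the quoting point, the URI from the command above can be held in a shell variable. Inside single quotes the & and the {meta[topic]} placeholder are kept literally; unquoted, the shell would treat & as "run in the background" and cut the command short. The variable name uri is only for this example.

```shell
# Single quotes keep "&" and "{meta[topic]}" exactly as written.
uri='kafka:///kafka1001.eqiad.wmnet:9092,kafka1002.eqiad.wmnet:9092?async=False&topic=eqiad.{meta[topic]}'

# Print the value to confirm that nothing was interpreted by the shell.
printf '%s\n' "$uri"
```

When the variable is later passed to a command, it should be expanded inside double quotes ("$uri") so that it stays a single argument.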


Throughput

In August 2016, some throughput tests were run in main-codfw. main-codfw consists of 2 Kafka brokers, each with 32G RAM, 16 cores, and a 4-disk RAID 10 for Kafka data. eventlogging-service-eventbus is colocated with the Kafka brokers. The throughput test used eventlogging with kafka-python 1.3.0 and async=False.

Entire service (2 eventbus nodes, 8 processes each), single-partition Kafka topic:

ab -n 10000 -c 16 -T 'application/json' -p ./test-event.json http://eventbus.svc.codfw.wmnet:8085/v1/events

...
Requests per second:    4425.03 [#/sec] (mean)
Time per request:       3.616 [ms] (mean)
Time per request:       0.226 [ms] (mean, across all concurrent requests)
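The three figures in the ab report are tied together by the concurrency level: requests per second is roughly the concurrency divided by the mean time per request, and the "across all concurrent requests" value is the mean time divided by the concurrency. A minimal sketch of that arithmetic follows. ab_rates is a hypothetical helper name, and the inputs are the -c 16 and 3.616 ms values from the run above.

```shell
# Sketch: derive the other two ab figures from the concurrency level
# and the mean time per request in milliseconds.
ab_rates() {
  awk -v c="$1" -v t_ms="$2" 'BEGIN {
    printf "requests per second: %.0f\n", c / (t_ms / 1000)
    printf "time per request across concurrency: %.3f ms\n", t_ms / c
  }'
}

ab_rates 16 3.616
```

This prints 4425 requests per second and 0.226 ms, in line with the report. The small gap to the reported 4425.03 comes from the rounding of the 3.616 ms figure.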