Data Platform/Systems/Varnishkafka

Varnishkafka is a daemon that runs on all the Wikimedia frontend caching hosts. Since Varnish logs HTTP requests in its own format in shared memory, Varnishkafka uses the Varnish Log API to read data, format it following some user input and finally send the result to a specific Kafka topic (using librdkafka).

Testing a code change

Use the Docker image as outlined in https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/varnish/varnishkafka/testing. The wise reader might think "and what about unit tests? These ones are integration tests!". The Analytics team tried in T147432 to explore the possibility of adding unit tests to Varnishkafka but the estimated amount of time for the code refactoring, testing, releasing etc.. (this software is critical for the team) was not worth the benefits of having some code tests. The team preferred instead to spend more time on creating a flexible and simple integration testing suite.

Varnishkafka instances

You might see some reference in puppet of Varnishkafka instances, and this is because the same Varnish request data can be sliced and formatted in different ways for different scopes:

- Webrequest

- Statsv

- Eventlogging

On all the cache segments (text, upload, misc and maps) we run the Webrequest instance, meanwhile the Statsv and Eventlogging ones are only running in text.

Monitoring

The varnishkafka grafana dashboard uses Prometheus metrics, and hence it follows its conventions of grouping metrics by datacenter. This means that we'll have a separate dashboard for eqiad, codfw, eqsin, ulsfo and esams. You can switch source of data by selecting a value in the top left corner of the grafana dashboard (datasource dropdown).

Delivery errors

This error means that Varnishkafka failed to send messages to Kafka Jumbo, and hence data has been lost. Usually this means that something is not working properly at the caching layer, or that Kafka Jumbo is in trouble for some reason. Start by checking the Kafka jumbo dashboard, and see if there is any correlation with Varnishkafka metrics. If everything looks good, check with the SRE/Traffic team if anything is ongoing at the caching layer.

Important note: the delivery alerts are per datacenter, so this can give you some information about the source of the problem. If you get alarms for webrequest text on all datacenters, it probably means that something global is happening to Kafka or Varnish. If you get only an alarm for say esams, it is probably a local-only problem with that datacenter.

No Messages Delivered

As part of task T300164 it was discovered that there was a potential failure mode whereby varnishkafka could be running, but unable to connect to Kafka to deliver messages. This condition would potentially result in lost data for webrequest, statsv, and event streams. Therefore we decided to implement task T300246 which checks Prometheus to see how many messages have been delivered by each varnishkafka instance. If the number of messages is zero for five minutes, then a critical alert is triggered for the Data Engineering team.

Note that at the moment, this alarm will trigger if a varnish server is intentiallly depooled. Work is under way to integrate the desired pooled/depooled state in conftool with Prometheus, in order that we can reduce and eliminate any false positives like this.