Data Engineering/Systems/Druid/Alerts

We have a number of alerts set up in Icinga and Alertmanager that relate to Druid and its ingestion jobs.

This page exists as a set of instructions or runbooks to help identify what courses of action might be needed if one or more of these alerts is triggered.

Druid Netflow Supervisor

This alert triggers if the realtime netflow ingestion job receives fewer events than a certain threshold over a 30-minute period.

The critical value is 0 and the warning value is 30.

There is a Grafana dashboard showing the trend data.
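
If this alert fires, one quick sanity check is whether the netflow supervisor itself is healthy. Below is a minimal sketch (Python with requests) that queries the Druid Overlord's supervisor status API; the Overlord host and the supervisor id are placeholders, not the real values for the analytics cluster, and the status payload fields may vary slightly between Druid versions.

  # Minimal sketch: query the Druid Overlord's supervisor API to check that the
  # realtime netflow supervisor is running and not lagging. The Overlord host
  # and the supervisor id ("netflow") are placeholders, not the real values.
  import requests

  OVERLORD = "http://druid-overlord.example.org:8090"   # hypothetical host
  SUPERVISOR_ID = "netflow"                              # hypothetical supervisor id

  resp = requests.get(
      f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/status",
      timeout=10,
  )
  resp.raise_for_status()
  payload = resp.json().get("payload", {})

  print("state:", payload.get("state"))                 # expect RUNNING
  print("healthy:", payload.get("healthy"))             # expect True
  print("aggregate lag:", payload.get("aggregateLag"))  # a growing lag suggests stuck ingestion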

Druid webrequest_sampled_live Supervisor

This alert triggers if the realtime webrequest_sampled_live ingestion job receives fewer events than a certain threshold over a 30-minute period.

The pipeline is composed of multiple parts:

  • Benthos webrequest data enrichment - on centrallog nodes we run a dedicated Benthos instance that pulls messages/events from Kafka Jumbo's webrequest_{text,upload} topics and enriches/modifies them with meaningful data (geolocation, etc.). The end result is a new topic called webrequest_sampled (a spot-check sketch follows the note below).
  • Druid indexation - the Druid analytics cluster ingests the webrequest_sampled messages/events in real time, indexing them into segments that can be queried from various tools (like Turnilo and Superset).
  • Turnilo and Superset dashboards.

Note: the use case is similar to Turnilo's webrequest_sampled_128, but the main difference is that in this case we care only about the past 24h of data (and that it is indexed live as it is published to the Kafka topics, as opposed to waiting for Data Engineering's batch jobs, etc.).
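
Since the first stage of the pipeline is the Benthos enrichment that writes to webrequest_sampled, a quick way to confirm that stage is still producing data is to spot-check the topic for new messages. The sketch below uses the kafka-python client; the broker address (and any TLS/authentication settings required by Kafka Jumbo) are assumptions to be replaced with the real configuration.

  # Minimal sketch: consume a handful of fresh messages from the
  # webrequest_sampled topic to confirm that the Benthos enrichment stage is
  # still producing data. The bootstrap server is a placeholder; the real
  # Kafka Jumbo brokers (and any TLS/auth settings) must be filled in.
  from kafka import KafkaConsumer

  consumer = KafkaConsumer(
      "webrequest_sampled",
      bootstrap_servers=["kafka-jumbo.example.org:9092"],  # hypothetical broker
      auto_offset_reset="latest",    # only look at messages produced from now on
      consumer_timeout_ms=30_000,    # stop iterating after 30s with no messages
  )

  count = 0
  for _message in consumer:
      count += 1
      if count >= 10:
          break

  print(f"received {count} message(s) while listening")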

Since the pipeline is split into multiple parts, there may be multiple reasons why Druid is not indexing enough events:

  • The Benthos instances on centrallog nodes may be down or misbehaving. Please check the status and logs of benthos@webrequest_live.service on those nodes. Useful dashboards are the webrequest_sampled topic's metrics and the Benthos dashboard.
  • Druid is not indexing Kafka events/messages correctly. In this case, start from this Grafana view and ping Data Engineering! A quick way to confirm the symptom is to count recently indexed events, as shown in the sketch after this list.
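
To confirm the second failure mode, one option is to count how many events the live datasource has indexed recently, via Druid's SQL endpoint on a broker or router. In the sketch below the broker host is a placeholder, and the datasource is assumed to be named webrequest_sampled_live like the supervisor.

  # Minimal sketch: count events indexed into the live datasource over the last
  # 30 minutes (the same window the alert looks at) using the Druid SQL API.
  # The broker host is a placeholder; the datasource name is assumed to match
  # the supervisor name.
  import requests

  BROKER = "http://druid-broker.example.org:8082"   # hypothetical host

  query = """
  SELECT COUNT(*) AS events_last_30_min
  FROM "webrequest_sampled_live"
  WHERE __time > CURRENT_TIMESTAMP - INTERVAL '30' MINUTE
  """

  resp = requests.post(f"{BROKER}/druid/v2/sql", json={"query": query}, timeout=30)
  resp.raise_for_status()
  print(resp.json())   # e.g. [{"events_last_30_min": 12345}]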

Druid Segments Unavailable

This alert triggers for each data source if the cluster reports more than a certain number of unavailable segments over a 15-minute period.

The critical value is 30 segments unavailable for each data source.

The warning value is 20 segments unavailable.

There is a Grafana dashboard showing the trend.

It may well be that Druid heals this unavailability automatically, so check the general Troubleshooting techniques before taking any action.
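
If you want to see which data sources are affected without going through Grafana, one option is to query the sys.segments metadata table through Druid's SQL endpoint on a broker or router. The sketch below uses a placeholder broker host and counts published segments that are currently not loaded anywhere; the exact sys.segments columns may differ slightly between Druid versions.

  # Minimal sketch: count unavailable (published but not loaded) segments per
  # data source via the sys.segments metadata table on the Druid SQL endpoint.
  # The broker host is a placeholder.
  import requests

  BROKER = "http://druid-broker.example.org:8082"   # hypothetical host

  query = """
  SELECT "datasource", COUNT(*) AS unavailable_segments
  FROM sys.segments
  WHERE is_published = 1 AND is_available = 0 AND is_overshadowed = 0
  GROUP BY "datasource"
  ORDER BY unavailable_segments DESC
  """

  resp = requests.post(f"{BROKER}/druid/v2/sql", json={"query": query}, timeout=30)
  resp.raise_for_status()
  for row in resp.json():
      print(row["datasource"], row["unavailable_segments"])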