Data Platform/Systems/Druid/Alerts
We have a number of alerts set up in Icinga and Alertmanager that relate to Druid and its ingestion jobs.
This page exists as a set of instructions or runbooks to help identify what courses of action might be needed if one or more of these alerts is triggered.
Druid Netflow Supervisor
This alert triggers if the realtime netflow ingestion job receives fewer events than a certain threshold over a 30-minute period.
The critical threshold is 0 events and the warning threshold is 30.
There is a Grafana dashboard showing the trend data.
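To get a first sense of whether the supervisor itself is healthy, you can ask the Druid Overlord for its supervisor status. The snippet below is a minimal sketch, not the exact procedure used on our clusters: the Overlord host/port and the supervisor id are assumptions and need to be replaced with the real values.

```python
# Minimal sketch: query the Druid Overlord for a supervisor's status.
# The host, port and supervisor id are assumptions; adjust them for the
# analytics cluster before running this.
import requests

OVERLORD = "http://druid-overlord.example.wmnet:8090"  # hypothetical host/port
SUPERVISOR_ID = "netflow"                              # hypothetical supervisor id

resp = requests.get(f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/status")
resp.raise_for_status()
payload = resp.json().get("payload", {})

# The payload reports the overall supervisor state (e.g. RUNNING) plus
# per-partition details that can hint at stuck or lagging tasks.
print("state:", payload.get("state"))
print("detailedState:", payload.get("detailedState"))
```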
Druid webrequest_sampled_live Supervisor
This alert triggers if the realtime webrequest_sampled_live ingestion job receives fewer events than a certain threshold over a 30-minute period.
The pipeline is composed of multiple parts:
- Benthos webrequest data enrichment - on centrallog nodes we run a special Benthos instance that pulls messages/events from the Kafka Jumbo webrequest_{text,upload} topics and enriches/modifies them with meaningful data (geolocation, etc.). The end result is a new topic called webrequest_sampled.
- Druid indexation - the Druid analytics cluster ingests the webrequest_sampled messages/events in real time, indexing them into segments that can be queried from various sources (like Turnilo and Superset).
- Turnilo and Superset dashboards.
Note: the use case is similar to Turnilo's webrequest_sampled_128, but the main difference is that in this case we care only about the past 24h of data (and that it is indexed live as it is published to the Kafka topics, as opposed to waiting for the Data Engineering batch jobs, etc.).
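If you need to confirm that the Benthos enrichment step is still producing events, one option is to sample a few messages directly from the webrequest_sampled topic. The snippet below is a minimal sketch using kafka-python; the broker address is an assumption and should point at a Kafka Jumbo broker.

```python
# Minimal sketch: sample messages from webrequest_sampled to confirm that the
# Benthos enrichment step is still producing events.
# The broker address is an assumption; replace it with a Kafka Jumbo broker.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "webrequest_sampled",
    bootstrap_servers="kafka-jumbo.example.wmnet:9092",  # hypothetical broker
    auto_offset_reset="latest",
    enable_auto_commit=False,
    consumer_timeout_ms=30000,  # stop waiting if nothing arrives within 30s
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

count = 0
for message in consumer:
    count += 1
    if count <= 3:
        print(message.value)  # spot-check a few enriched events
    if count >= 100:
        break
consumer.close()

print(f"received {count} messages")  # 0 suggests Benthos is not producing
```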
Since the pipeline is split into multiple parts, there may be multiple reasons why Druid is not indexing enough events:
- The Benthos instances on centrallog nodes may be down or misbehaving. Please check the status and logs of benthos@webrequest_live.service on those nodes. Useful dashboards are the webrequest_sampled topic's metrics and the Benthos dashboard.
- Druid is not indexing Kafka events/messages correctly. In this case start from this Grafana view and ping Data Engineering! A quick way to check the ingestion rate yourself is sketched after this list.
Druid Segments Unavailable
This alert triggers for each data source if the cluster reports more than a certain number of unavailable segments over a 15-minute period.
The critical value is 30 segments unavailable for each data source.
The warning value is 20 segments unavailable.
There is a Grafana dashboard showing the trend.
It may well be that Druid heals this unavailability automatically, so check the general Troubleshooting techniques before taking any action.
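A quick way to see how many segments are still unavailable, per datasource, is to ask the Coordinator for its load status. The snippet below is a minimal sketch; the Coordinator host/port are assumptions.

```python
# Minimal sketch: ask the Druid Coordinator how many segments per datasource
# are still waiting to be loaded. The Coordinator host/port are assumptions.
import requests

COORDINATOR = "http://druid-coordinator.example.wmnet:8081"  # hypothetical host/port

# With ?simple the loadstatus endpoint returns, per datasource, the number of
# segments that are not yet loaded (0 means everything is available).
resp = requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadstatus?simple")
resp.raise_for_status()

for datasource, pending in sorted(resp.json().items()):
    print(f"{datasource}: {pending} segments not yet available")
```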