SLO/logstash

SLO Worksheet - Logstash

Service

Logstash is a free and open server-side data processing pipeline that ingests data from multiple sources, transforms it, and then outputs it for search. In our infrastructure, Logstash is a component of the logging pipeline, which consists of Kafka -> Logstash -> OpenSearch <- OpenSearch Dashboards.

Teams

Logstash is owned by the SRE Observability team, which is responsible for operation, scalability, and software updates. Contact: sre-observability@wikimedia.org and https://office.wikimedia.org/wiki/Contact_list#Observability

Architectural

Logstash consists of two clusters per site:

A production cluster which consumes logs from Kafka, transforms them, and outputs to OpenSearch.
A barebones legacy cluster which ingests logs directly via TCP/UDP and outputs them to Kafka for consumption by the production cluster.

Hard Dependencies

OpenSearch - This is where log data is stored. Logstash will block if OpenSearch becomes unavailable.
Kafka - Logstash ingests log messages from the kafka-logging cluster.
Hardware - Dedicated servers, Ganeti instances, and networking.

Soft Dependencies

None

Client-facing

Clients

software	use	connection	interval	failure mode (Logstash down)
Kafka	Aggregates and queues log messages for consumption by Logstash	Pull via TCP	Continuous	Kafka consumer lag will spike and alarm
OpenSearch	Storage/archival of log data for search	Push via TCP	Continuous	Logstash will block and stop consuming log events; Kafka consumer lag will spike and alarm
SCAP	Pre-flight error checks to support deployments	Pull via TCP using logstash_checker.py in Puppet	Manual	False negative/positive result during the deploy pre-flight check

Service Level Indicators (SLIs)

Errors - Percentage of logs which fail to be indexed by OpenSearch

Availability - Percentage of time Logstash is handling logs minute-to-minute
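
As an illustration only, the two SLIs above could be computed from raw counters roughly as follows. This is a minimal sketch, not the production implementation: the function names and inputs are hypothetical, and it assumes that "handling logs" in a given minute means Logstash processed at least one event in that minute.

```python
from typing import Sequence


def error_percentage(failed_events: int, total_events: int) -> float:
    """Errors SLI: percentage of events that failed to be indexed."""
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing failed
    return 100.0 * failed_events / total_events


def availability_percentage(events_per_minute: Sequence[int]) -> float:
    """Availability SLI: percentage of one-minute buckets with activity.

    Assumption: a minute counts as "handled" if at least one event
    was processed during it.
    """
    if not events_per_minute:
        raise ValueError("need at least one minute of data")
    good_minutes = sum(1 for count in events_per_minute if count > 0)
    return 100.0 * good_minutes / len(events_per_minute)
```

For example, 5 failed events out of 1000 give an error percentage of 0.5, and per-minute counts of [10, 0, 3, 7] give an availability of 75.0.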

Monitoring

Logstash is monitored via a suite of health checks and metrics, including:

Icinga checks - Host based service up/down checks
Kafka consumer lag - Is Logstash able to consume logs from the Kafka queue as fast as (or faster than) they appear, or is the Kafka queue growing faster than Logstash (and OpenSearch) can process it?
OpenSearch indexing failures - Is Logstash able to output events to OpenSearch, or do a significant number of log messages fail to be stored in OpenSearch?
Logstash event rate today vs. yesterday - Is the overall log volume significantly higher or lower than 24h ago?
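
The consumer lag and event rate questions above can be sketched as simple comparisons. This is an illustrative example under stated assumptions, not the actual alerting rules: the function names, the sampling of lag values, and the 50% tolerance are all hypothetical.

```python
from typing import Sequence


def lag_is_growing(lag_samples: Sequence[int], min_increase: int = 0) -> bool:
    """True if consumer lag rose over the sample window by more than min_increase.

    lag_samples is a hypothetical series of Kafka consumer lag readings,
    oldest first.
    """
    if len(lag_samples) < 2:
        return False  # not enough data to see a trend
    return lag_samples[-1] - lag_samples[0] > min_increase


def rate_is_anomalous(rate_today: float, rate_yesterday: float,
                      tolerance: float = 0.5) -> bool:
    """True if today's event rate differs from the rate 24h ago by more
    than the given fraction (0.5 means 50%; a hypothetical threshold)."""
    if rate_yesterday <= 0:
        raise ValueError("yesterday's rate must be positive")
    ratio = rate_today / rate_yesterday
    return abs(ratio - 1.0) > tolerance
```

With these definitions, lag readings of [100, 250, 900] count as growing, and an event rate of 2000 against 1000 a day earlier counts as anomalous, while 1100 against 1000 does not.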

Deployment

Logstash is installed via a Debian package and its configuration is deployed via Puppet.

Service Level Objectives

Errors - 99.5% of events are indexed successfully, per datacenter. Log producers may emit invalid log messages which cannot be parsed and are dropped, may exceed rate limits, or may output excessive amounts of logs that cannot reasonably be ingested.
Availability - 99.95% of the time, per datacenter, Logstash is operational and actively processing logs.
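
To make these targets concrete, the sketch below converts each objective into an error budget. The 30-day window and the example event volume are assumptions chosen for illustration (this worksheet does not state a reporting window), and the function names are hypothetical.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of unavailability permitted by an availability SLO
    over a window of the given length (the window is an assumption)."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def allowed_failed_events(slo_percent: float, total_events: int) -> int:
    """Number of events that may fail indexing without breaching
    the errors SLO, rounded down to a whole event."""
    return int(total_events * (100.0 - slo_percent) / 100.0)


# Availability: 99.95% over an assumed 30-day window.
downtime_budget = allowed_downtime_minutes(99.95, 30)

# Errors: 99.5% of an assumed 1,000,000 events.
failure_budget = allowed_failed_events(99.5, 1_000_000)
```

Under these assumptions, the availability objective allows roughly 21.6 minutes of downtime per 30 days per datacenter, and the errors objective allows 5000 failed events per million.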