SLO Worksheet - Logstash
Logstash is a free and open server-side data processing pipeline that ingests data from multiple sources, transforms it, and then outputs it for search. In our infrastructure Logstash is a component of the logging pipeline, which consists of Kafka -> Logstash -> Elasticsearch <- Kibana.
Logstash is owned by the SRE Observability team, which is responsible for operation, scalability, and software updates. Contact: firstname.lastname@example.org and https://office.wikimedia.org/wiki/Contact_list#Observability
Logstash runs as two clusters per site:
- A production cluster which consumes logs from Kafka, transforms them, and outputs to Elasticsearch.
- A barebones legacy cluster which ingests logs directly via TCP/UDP and outputs them to Kafka for consumption by the production cluster.
Logstash depends on the following:
- Elasticsearch - This is where log data is stored. Logstash will block if Elasticsearch becomes unavailable.
- Kafka - Logstash ingests log messages from the kafka-logging cluster.
- Hardware - Dedicated servers, Ganeti instances, and networking.
|Kafka||Aggregates and queues log messages for consumption by Logstash||Pull via TCP||Continuous||Kafka consumer lag will spike and alarm.|
|Elasticsearch||Storage/archival of log data for search||Push via TCP||Continuous||Logstash will block and stop consuming log events; Kafka consumer lag will spike and alarm.|
|SCAP||Pre-flight error checks to support deployments||Pull via TCP using logstash_checker.py in Puppet||Manual||False negative/positive result during the pre-flight deploy check.|
Service Level Indicators (SLIs)
Errors - Percentage of logs which fail to be indexed by Elasticsearch
Latency - Messages ingested from Kafka logging by Logstash without consumer lag (as defined by Kafka Burrow)
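As a rough illustration, the two SLIs above could be computed as follows. This is a minimal sketch under stated assumptions, not the team's actual tooling: the function names, the input counters, and the choice to approximate "without consumer lag" by the share of periodic lag samples at or below a threshold are all hypothetical.

```python
from typing import Sequence


def error_sli(indexed_ok: int, index_failed: int) -> float:
    """Percentage of events that failed to be indexed by Elasticsearch.

    Both counters are hypothetical inputs, e.g. read from a metrics system.
    """
    total = indexed_ok + index_failed
    if total == 0:
        return 0.0  # no traffic, so nothing failed
    return 100.0 * index_failed / total


def latency_sli(lag_samples: Sequence[int], max_lag: int = 0) -> float:
    """Percentage of lag samples at or below max_lag.

    Assumption: consumer lag is sampled periodically, and a sample counts as
    "without lag" when it does not exceed max_lag (a hypothetical threshold).
    """
    if not lag_samples:
        return 100.0  # no samples, so no lag was observed
    ok = sum(1 for lag in lag_samples if lag <= max_lag)
    return 100.0 * ok / len(lag_samples)
```

For example, 5 failures out of 1,000 events gives an error SLI of 0.5%, and three lag-free samples out of four gives a latency SLI of 75%.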
Logstash is monitored via a suite of health checks and metrics, including:
- Icinga checks - Host-based service up/down checks
- Kafka consumer lag - Is Logstash able to consume logs from the Kafka queue at least as fast as they appear, or is the Kafka queue growing faster than Logstash (and Elasticsearch) can process it?
- Elasticsearch indexing failures - Is Logstash able to output events to Elasticsearch, or do a significant number of log messages fail to be stored in Elasticsearch?
- Logstash event rate today vs. yesterday - Is the overall log volume significantly higher or lower than 24h ago?
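The consumer lag and event rate checks listed above boil down to simple comparisons. The sketch below shows one way to express them; the function names and the 50% tolerance are hypothetical and do not reflect the thresholds configured in the real alerts.

```python
from typing import Sequence


def lag_is_growing(lag_samples: Sequence[int]) -> bool:
    """True if the newest lag sample exceeds the oldest one in the window."""
    if len(lag_samples) < 2:
        return False  # not enough data to see a trend
    return lag_samples[-1] > lag_samples[0]


def rate_deviation(today_rate: float, yesterday_rate: float) -> float:
    """Relative change of today's event rate versus the rate 24h ago."""
    if yesterday_rate <= 0:
        raise ValueError("yesterday_rate must be positive")
    return (today_rate - yesterday_rate) / yesterday_rate


def rate_is_anomalous(today_rate: float, yesterday_rate: float,
                      tolerance: float = 0.5) -> bool:
    """True if the rate moved by more than tolerance in either direction.

    The default tolerance of 0.5 (50%) is an assumed value for illustration.
    """
    return abs(rate_deviation(today_rate, yesterday_rate)) > tolerance
```

With these definitions, a drop from 1,000 to 300 events per second is flagged (a 70% decrease), while a rise from 1,000 to 1,500 sits exactly at the tolerance and is not.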
Logstash is installed via a Debian package and its configuration is deployed via Puppet.
Service Level Objectives
- Errors - 99.5% of events are indexed successfully, per datacenter. Log producers may emit invalid log messages that cannot be parsed and are dropped, exceed rate limits, or output more logs than can reasonably be ingested.
- Latency - 99.5% of events are ingested from Kafka logging without consumer lag (as defined by Kafka Burrow), per datacenter.
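To make the 99.5% targets concrete, the sketch below checks an objective against event counts for one datacenter and reports how much of the error budget is left. The function names and the counts used in the example are hypothetical; only the 99.5% target comes from the objectives above.

```python
SLO_TARGET = 99.5  # percent, from the objectives above


def success_percentage(good_events: int, total_events: int) -> float:
    """Percentage of events that met the objective."""
    if total_events == 0:
        return 100.0  # no events, so nothing violated the objective
    return 100.0 * good_events / total_events


def meets_slo(good_events: int, total_events: int,
              target: float = SLO_TARGET) -> bool:
    """True if the success percentage is at or above the target."""
    return success_percentage(good_events, total_events) >= target


def error_budget_remaining(good_events: int, total_events: int,
                           target: float = SLO_TARGET) -> float:
    """Number of additional bad events tolerated before the SLO is missed.

    A negative result means the budget is already exhausted.
    """
    allowed_bad = total_events * (100.0 - target) / 100.0
    actual_bad = total_events - good_events
    return allowed_bad - actual_bad
```

For instance, with 1,000,000 events in the reporting window, the budget allows 5,000 bad events; if 3,000 have already failed, 2,000 remain.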