SLO Worksheet - Logstash
Service
Logstash is a free and open server-side data processing pipeline that ingests data from multiple sources, transforms it, and then outputs it for search. In our infrastructure, Logstash is one component of the logging pipeline, which consists of Kafka -> Logstash -> OpenSearch <- OpenSearch Dashboards.
Teams
Logstash is owned by the SRE Observability team, which is responsible for operation, scalability, and software updates. Contact: sre-observability@wikimedia.org and https://office.wikimedia.org/wiki/Contact_list#Observability
Architectural
Logstash runs as two clusters per site:
- A production cluster which consumes logs from Kafka, transforms them, and outputs to OpenSearch.
- A barebones legacy cluster which ingests logs directly via TCP/UDP and outputs them to Kafka for consumption by the production cluster.
Hard Dependencies
- OpenSearch - This is where log data is stored. Logstash will block if OpenSearch becomes unavailable.
- Kafka - Logstash ingests log messages from the kafka-logging cluster.
- Hardware - Dedicated servers, Ganeti instances, and networking.
Soft Dependencies
none
Client-facing
Clients
software | use | connection | interval | failure mode (Logstash down) |
---|---|---|---|---|
Kafka | Aggregates and queues log messages for consumption by Logstash | Pull via TCP | Continuous | Kafka consumer lag will spike and alarm |
OpenSearch | Storage/archival of log data for search | Push via TCP | Continuous | Logstash will block and stop consuming log events, Kafka consumer lag will spike and alarm. |
SCAP | Pre-flight error checks to support deployments | Pull via TCP using logstash_checker.py in Puppet | Manual | False negative/positive result during the deploy pre-flight check |
Service Level Indicators (SLIs)
Errors - Percentage of logs which fail to be indexed by OpenSearch
Availability - Percentage of time Logstash is handling logs minute-to-minute
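As an illustration only, the two SLIs could be computed from per-minute counters along the following lines. This is a minimal sketch: the `MinuteSample` type and its field names are hypothetical, and the rule that a minute counts as "handling logs" when at least one event was processed is an assumption, not something this worksheet defines.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MinuteSample:
    """One minute of pipeline counters (hypothetical field names)."""
    events_sent: int       # events Logstash tried to index in OpenSearch
    events_failed: int     # events that failed to be indexed
    events_processed: int  # events Logstash handled during this minute


def error_sli(samples: Iterable[MinuteSample]) -> float:
    """Percentage of events that failed to be indexed."""
    samples = list(samples)
    sent = sum(s.events_sent for s in samples)
    failed = sum(s.events_failed for s in samples)
    if sent == 0:
        return 0.0  # nothing was sent, so nothing failed
    return 100.0 * failed / sent


def availability_sli(samples: Iterable[MinuteSample]) -> float:
    """Percentage of minutes in which Logstash handled at least one event.

    Assumption: a minute is 'available' if events_processed > 0.
    """
    samples = list(samples)
    if not samples:
        return 100.0  # no observations, report no downtime
    up_minutes = sum(1 for s in samples if s.events_processed > 0)
    return 100.0 * up_minutes / len(samples)


if __name__ == "__main__":
    window = [
        MinuteSample(1000, 5, 1000),
        MinuteSample(1000, 0, 1000),
        MinuteSample(0, 0, 0),        # a minute with no processing
        MinuteSample(2000, 5, 2000),
    ]
    print(f"errors: {error_sli(window):.2f}%")
    print(f"availability: {availability_sli(window):.2f}%")
```

In practice these values would more likely come from queries against the metrics system than from in-process counters; the sketch only shows the arithmetic.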
Monitoring
Logstash is monitored via a suite of health checks and metrics, including:
- Icinga checks - Host based service up/down checks
- Kafka consumer lag - Is Logstash able to consume logs from the Kafka queue at least as fast as they appear, or is the Kafka queue growing faster than Logstash (and OpenSearch) can process it?
- OpenSearch indexing failures - Is Logstash able to output events to OpenSearch, or do a significant number of log messages fail to be stored in OpenSearch?
- Logstash event rate today vs. yesterday - Is the overall log volume significantly higher or lower than 24h ago?
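The questions behind the consumer lag and event rate checks can be expressed as simple comparisons. The sketch below is illustrative only: the function names and the 50% tolerance are hypothetical and are not taken from the actual alerting rules.

```python
from typing import Sequence


def lag_is_growing(lag_samples: Sequence[int]) -> bool:
    """True if consumer lag rose strictly between every pair of consecutive samples.

    A steadily growing lag suggests Logstash (or OpenSearch) cannot keep up
    with the rate at which messages arrive in Kafka.
    """
    if len(lag_samples) < 2:
        return False  # not enough data to see a trend
    return all(later > earlier
               for earlier, later in zip(lag_samples, lag_samples[1:]))


def rate_is_anomalous(today_rate: float, yesterday_rate: float,
                      tolerance: float = 0.5) -> bool:
    """True if today's event rate differs from yesterday's by more than `tolerance`.

    `tolerance` is a fraction of yesterday's rate; 0.5 (hypothetical) means
    more than 50% higher or lower than 24 hours ago.
    """
    if yesterday_rate <= 0:
        raise ValueError("yesterday_rate must be positive")
    ratio = today_rate / yesterday_rate
    return abs(ratio - 1.0) > tolerance


if __name__ == "__main__":
    print(lag_is_growing([10, 20, 35]))
    print(rate_is_anomalous(2000.0, 1000.0))
```

A real check would typically smooth both series over a time window before comparing, so that a single noisy sample does not trigger an alert.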
Deployment
Logstash is installed from a Debian package, and its configuration is deployed via Puppet.
Service Level Objectives
- Errors - 99.5% of events are indexed successfully, per datacenter. The remaining budget allows for log producers that emit invalid messages which cannot be parsed and are dropped, that exceed rate limits, or that output more logs than can reasonably be ingested.
- Availability - 99.95% of the time, per datacenter, Logstash is operational and actively processing logs.
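To make these targets concrete, the corresponding error budgets can be worked out with simple arithmetic. The sketch below is illustrative: the 90-day window and the event count are assumptions chosen for the example, since this worksheet does not state a reporting period or a log volume.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of unavailability permitted by an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def allowed_failed_events(slo_percent: float, total_events: int) -> float:
    """Number of events that may fail indexing under an error SLO."""
    return total_events * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # Assumption: a 90-day reporting window (not stated in this worksheet).
    downtime = allowed_downtime_minutes(99.95, window_days=90)
    print(f"availability budget: {downtime:.1f} minutes per 90 days")

    # Assumption: one million events, purely for illustration.
    failures = allowed_failed_events(99.5, total_events=1_000_000)
    print(f"error budget: {failures:.0f} events per million")
```

Under these assumptions, the availability target leaves roughly 65 minutes of downtime per datacenter in 90 days, and the error target leaves 5,000 failed events per million.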