SLO/Event Platform
Service
This SLO covers Event Platform services and the EventBus extension. From an architecture point of view, EventBus is a client of Event Platform services and should be covered by a dedicated SLO. In practice, however, it is tightly coupled to the operational health of the system. EventBus is used to produce page state changes in MediaWiki, and it produces streams for which we need to provide quality and completeness guarantees. Performance degradation of EventBus will have a direct impact on the service SLIs. As such, EventBus SLIs are included in this SLO as a special case.
The Event Platform service boundaries are determined by:
- Input: Event intake and ingestion (EventBus mediawiki extension and EventGate service).
- Output: Stream publishing (EventStreams service).
Events are published to Event Platform by clients either POSTing to the EventGate service directly (mobile client app, canary events pipeline), or indirectly by implementing an EventBus hook (e.g. events sourced from MediaWiki).
Services
The following services act as input/output boundaries for Event Platform.
Input: EventGate
EventGate is an HTTP service for ingestion of events, written in Node.js. It takes JSON messages over HTTP POST requests, optionally validates them against a JSONSchema, and then produces them to a backend. The default backend (and the one used at WMF) is Kafka.
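As an illustration of the intake path described above, the following minimal sketch builds (but does not send) an HTTP POST request carrying one JSON event. The host name, the `/v1/events` path, the `$schema` and `meta.stream` field names, and the example stream and schema values are all assumptions made for this example, not a statement of the deployed API.

```python
import json
import urllib.request

# Hypothetical endpoint: the real host, port and path depend on the
# EventGate instance being targeted.
EVENTGATE_URL = "http://eventgate.example.org/v1/events"


def build_event_request(stream: str, schema_uri: str, payload: dict) -> urllib.request.Request:
    """Build (without sending) an HTTP POST request carrying one JSON event."""
    event = {
        "$schema": schema_uri,       # assumed field: schema used for validation
        "meta": {"stream": stream},  # assumed field: destination stream name
        **payload,
    }
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        EVENTGATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_event_request(
    stream="example.stream",             # hypothetical stream name
    schema_uri="/example/schema/1.0.0",  # hypothetical schema URI
    payload={"message": "hello"},
)
# urllib.request.urlopen(request) would send the event; it is left out here
# so that the sketch has no network side effects.
```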
At Wikimedia we have four instances of EventGate deployed on wikikube (eqiad and codfw):
- eventgate-main
- eventgate-analytics
- eventgate-analytics-external
- eventgate-logging-external
Output: EventStreams
EventStreams is a web service that exposes continuous streams of structured event data. It does so over HTTP using chunked transfer encoding following the Server-Sent Events protocol (SSE). EventStreams can be consumed directly via HTTP, but is more commonly used via a client library.
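To make the Server-Sent Events framing concrete, here is a minimal sketch of a parser that turns SSE text lines into events. It handles only the `event`, `id` and `data` fields and reads from an in-memory sample rather than a live connection; the sample payload is invented for the example, and a real client library would also deal with reconnection.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per Server-Sent Event found in an iterable of text lines.

    Comment lines (starting with ':') are skipped, and an event is emitted
    when a blank line is reached, provided it carried at least one data line.
    """
    fields: dict = {}
    data_lines: list = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                fields["data"] = "\n".join(data_lines)
                yield fields
            fields, data_lines = {}, []
            continue
        if line.startswith(":"):
            continue  # comment, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if name == "data":
            data_lines.append(value)
        elif name in ("event", "id"):
            fields[name] = value


# Invented sample input, standing in for lines read from an HTTP response.
sample = [
    ": keep-alive comment\n",
    "event: message\n",
    "id: 1\n",
    'data: {"title": "Example"}\n',
    "\n",
]
events = list(parse_sse(sample))
payload = json.loads(events[0]["data"])
```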
At Wikimedia we have two instances of EventStreams deployed on wikikube (eqiad and codfw):
- eventstreams
- eventstreams-internal (not covered by this SLO).
EventBus
EventBus propagates state changes (edit, move, delete, revision visibility, etc.) to an EventGate instance, giving consumers of the service the means to track changes to MediaWiki content.
Teams
The team responsible for stewardship and development of the service is Data Engineering (Event Platform group).
Architectural
Environmental dependencies
The service runs on Node.js and is deployed on Kubernetes via the Deployment pipeline. These global SLOs depend on the state of Kafka and of stream processing (Flink) producers which, while bypassing EventGate, can still generate streams that are published by EventStreams.
Service dependencies
The system has hard dependencies on:
- Kafka (main and jumbo), which are message brokers the service produces to and consumes from.
The system has soft dependencies on:
- Stream processing consumers/producers (Flink)
- Schema services, either bundled or via API calls.
- MediaWiki, via the EventStreamConfig extension
Client-facing
Clients
- Most internal Kafka consumers are (indirectly) EventGate clients. Performance degradation on EventGate will result in performance degradation for all of these clients.
- All external stream consumers are EventStreams clients (e.g. Wikimedia Enterprise, ML Community).
- Non-MediaWiki (EventBus) producers (e.g. mobile apps instrumentation, canary events producers) that POST events directly to EventGate.
Request Classes
Event Platform APIs serve both write requests (event intake via EventGate) and read requests (stream consumption via EventStreams). These requests are distinguished by their HTTP method.
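A minimal sketch of how requests could be bucketed into request classes by HTTP method follows. The mapping of `POST` to writes and `GET` to reads is an assumption made for this example, as is the `other` fallback class.

```python
def request_class(http_method: str) -> str:
    """Return the request class for an HTTP method (assumed mapping)."""
    method = http_method.upper()
    if method == "POST":
        return "write"  # assumed: event intake
    if method == "GET":
        return "read"   # assumed: stream consumption
    return "other"      # anything else is left unclassified in this sketch


classes = [request_class(m) for m in ("POST", "get", "DELETE")]
```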
Service Level Indicators (SLIs)
EventBus is orthogonal to Event Platform, but in practice it is the change data capture mechanism that feeds foundational streams (mediawiki.page_change.v1). SLI degradation on EventBus will have a direct impact on the availability and data quality of Event Platform streams. As such, we include its performance as an SLI for the Event Platform API.
EventBus request response error rate: an increase in 5xx errors indicates a failure of the service to deliver events to an assigned EventGate gateway. This will result in data loss (events not available in Kafka). Depending on the nature of the error, EventGate might not be aware of the failed request. This SLI must be reconciled with EventGate request response error rate.
These increases can be seen on the EventBus Logstash dashboard for error rate.
EventGate latency increase: an increase in latency will result in performance degradation for clients and an increased lag between event emission and availability in Kafka. High latency might result in client-side timeouts, and thus data loss.
EventGate request response error rate: an increased error rate indicates a failure of the service to process incoming events, and will result in data loss (events not available in Kafka). These increases can be seen on the EventGate dashboard for error rate.
EventGate schema validation error rate: an increase in the validation error rate can be a symptom of clients posting malformed payloads, but it could also be a symptom of the service lacking up-to-date schemas.
EventStreams request response error rate: an increase in 5xx and 400 responses indicates a failure of the service to deliver streams of events to clients.
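The error-rate SLIs above share one shape: the fraction of responses whose status code counts as an error. The sketch below computes that fraction from a mapping of status code to response count. The function names and the sample counts are invented for illustration; in practice the counts would come from the monitoring stack.

```python
from typing import Callable, Mapping


def error_rate(status_counts: Mapping[int, int], is_error: Callable[[int], bool]) -> float:
    """Fraction of responses whose status code is classified as an error."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0  # no traffic: report no errors rather than divide by zero
    errors = sum(count for status, count in status_counts.items() if is_error(status))
    return errors / total


def is_5xx(status: int) -> bool:
    """Server-side errors, as used for the 5xx-based error rates."""
    return 500 <= status <= 599


def is_eventstreams_error(status: int) -> bool:
    """5xx and 400 responses, as named in the EventStreams SLI."""
    return status == 400 or is_5xx(status)


# Invented sample: 1000 responses in some time window.
counts = {200: 980, 400: 5, 503: 15}
rate_5xx = error_rate(counts, is_5xx)
rate_eventstreams = error_rate(counts, is_eventstreams_error)
```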
Operational
Monitoring
EventGate and EventStreams are monitored via Grafana and Logstash. EventGate has been onboarded on alertmanager, and its SLI can be coupled to alerts. Alerts are routed to the data-engineering group (which covers Event Platform and all of Data Platform Engineering). EventStreams has not been onboarded on alertmanager yet.
EventBus can be monitored via Grafana and Logstash. It does not expose metrics to Prometheus yet, and has not been onboarded on alertmanager.
All EventGate instances can be monitored in Grafana and Logstash.
| EventGate deployment | Grafana dashboard | Logstash dashboard |
| eventgate-main | EventGate Main | |
| eventgate-analytics | EventGate Analytics | |
| eventgate-analytics-external | EventGate Analytics External | |
| eventgate-logging-external | EventGate Logging External | |
EventBus can be monitored via Grafana and Logstash.
| EventBus deployments | Grafana dashboard | Logstash dashboard |
| EventBus | EventBus | |
All EventStreams instances can be monitored in Grafana and Logstash.
| EventStreams Deployments | Grafana dashboard | Logstash dashboard |
| eventstreams | EventStreams | EventStreams status |
Troubleshooting
As of 2023-09 no support SLA is provided. File a bug on the Data Engineering and Event Platform board and tag Event Platform (CC: gmodena, ottomata, tchin). The Event Platform team will follow up within 24 hours (on work days).
Deployment
EventGate and EventStreams are deployed following the recommended Kubernetes helmfile deploy pattern for services.
EventBus is deployed following the mediawiki deployment train practices.
Service Level Objectives
Realistic targets
Currently we provide SLOs only for EventGate:
- 99% of requests to EventGate will be successful (any HTTP response other than 504). This results in an error budget of 1% of requests.
- 99% of requests to EventGate will be successful with a latency below 500 ms at p99.
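The error budget arithmetic behind an availability target like the one above can be sketched as follows, using the stated success definition (any HTTP response other than 504). The function names and the sample traffic are invented for illustration.

```python
from typing import Sequence


def is_success(status: int) -> bool:
    """A request counts as successful unless the response is a 504."""
    return status != 504


def error_budget(target: float, total_requests: int) -> int:
    """Number of failed requests allowed for a given target (e.g. 0.99)."""
    return int(round((1.0 - target) * total_requests))


def budget_remaining(target: float, statuses: Sequence[int]) -> int:
    """Allowed failures minus observed failures; negative means the SLO is missed."""
    failed = sum(1 for status in statuses if not is_success(status))
    return error_budget(target, len(statuses)) - failed


# Invented sample: 1000 requests, 5 of which ended in a 504.
statuses = [200] * 995 + [504] * 5
allowed = error_budget(0.99, len(statuses))
remaining = budget_remaining(0.99, statuses)
```

With a 99% target, 1000 requests allow 10 failures, so 5 observed failures leave half of the budget unspent.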
Ideal targets
We provide SLOs for all service components (EventBus, EventGate, EventStreams).
- 99.9% of requests to Event Platform will be successful (any HTTP response other than 504). This results in an error budget of 0.1% of requests.
- 99.9% of requests will be successful with a latency below 500 ms at p99.
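A latency target of the form "below 500 ms at p99" can be checked against a set of observed latencies as in the sketch below. It uses the nearest-rank percentile definition, which is an assumption: monitoring systems often estimate percentiles from histogram buckets instead. The sample latencies are invented.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of values, for p in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_latency_target(latencies_ms: Sequence[float],
                         threshold_ms: float = 500.0,
                         p: float = 99.0) -> bool:
    """True when the p-th percentile latency is strictly below the threshold."""
    return percentile(latencies_ms, p) < threshold_ms


# Invented samples: one slow request in 100 still meets the target,
# two slow requests in 100 do not.
one_slow = [100.0] * 99 + [900.0]
two_slow = [100.0] * 98 + [900.0] * 2
ok = meets_latency_target(one_slow)
not_ok = meets_latency_target(two_slow)
```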
Reconciliation
- Reconciliation of realistic and ideal targets will require improvements on observability, knowledge sharing and a degree of platform evolution for the services.
- EventBus should be onboarded to alertmanager.
- We might need dedicated SRE support.