User:VGutiérrez (WMF)/ncredir SLO draft
Status: draft
Organizational
Service
This service is a simple HTTP service that handles redirections from non-canonical Wikimedia domains (e.g., wikipedia.com) to canonical domains (e.g., wikipedia.org). It covers all relevant binaries, hosts, and clusters involved in serving these HTTP redirects.
Teams
The teams responsible for this SLO include:
- SRE Team: Responsible for responding to alerts, incident management, and ensuring the operational health of the service.
- Engineering Team: Responsible for developing, maintaining, and improving the service's codebase to meet reliability and performance demands.
- Release Engineering Team: Responsible for managing deployments and executing rollbacks to prevent or mitigate SLO impacts.
Representatives from these teams have been involved in drafting this SLO and will be involved in its finalization to ensure commitment. Client services may also be consulted for input on their needs.
Architectural
Environmental dependencies
This section will describe the underlying infrastructure where the service runs (e.g., specific data centers, cloud providers, container orchestration platforms). It has not yet been filled in for this draft.
Service dependencies
Your service has a dependency on another service if your service can't work correctly when that service isn't working.
- Hard Dependencies:
  - Canonical Domain Resolution Service: A backend service that resolves or provides the mapping from non-canonical domains to canonical ones. If this service is completely broken, the redirect service would be unable to determine the correct redirect target and would likely serve errors or fail to respond.
  - DNS Services: Core DNS infrastructure that allows the redirect service to resolve domain names. If DNS is unavailable, the service cannot function.
  - Rationale: The service's availability cannot be higher than that of its hard dependencies, and its latency cannot be lower than the latency they add to each request.
- Soft Dependencies:
  - Logging/Monitoring Services: Services used for collecting logs and metrics. If these break, the redirect service would operate in a degraded mode (e.g., logs might not be collected), but the core redirect functionality should persist.
- Indirect Dependencies:
  - Caching Layer (e.g., Varnish, ATS): The redirect service itself doesn't query a caching layer for redirects (it sends the redirect response to the client), but a caching layer sitting in front of the service could act as an indirect dependency. For example, if a misconfigured caching layer sent all traffic to the redirect service instead of serving cached responses, it could overwhelm the service, leading to an outage.
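The rationale given for hard dependencies can be made concrete with a little arithmetic. The following is a minimal sketch, assuming dependency failures are independent of one another; the function name and the availability figures are hypothetical and are not measurements of any real dependency.

```python
from math import prod


def availability_upper_bound(hard_dependency_availabilities):
    """Upper bound on a service's availability given its hard dependencies.

    Assumes failures are independent, so the probability that every hard
    dependency is up at the same time is the product of their individual
    availabilities (each expressed as a fraction between 0 and 1).
    """
    return prod(hard_dependency_availabilities)


# Hypothetical figures for two hard dependencies (not real measurements).
bound = availability_upper_bound([0.9995, 0.9999])
print(f"Best achievable availability: {bound:.4%}")
```

Under that assumption, the bound is never higher than the availability of the weakest hard dependency, which is why an SLO target above it could not be supported.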
Client-facing
Clients
The service's clients include:
- Other internal Wikimedia Foundation services: Services that might programmatically use or rely on the redirect functionality.
- External users: This includes both human users attempting to access Wikimedia content via non-canonical domains and automated users like search engine crawlers. Their reliability needs are assessed in terms of what makes for a satisfactory user experience.
Request Classes
A simple HTTP redirect service might initially have only one request class, since all requests aim to perform a redirect. If distinctions are needed, they would be defined here. For example:
- Permanent Redirects (HTTP 301): Requests for domains requiring a permanent move.
- Temporary Redirects (HTTP 302/307): Requests for domains requiring a temporary redirect.
- Criteria: Requests can be classified based on the incoming domain and possibly the intended canonical domain type.
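If the two example classes were adopted, one way to tell them apart in logs would be by the status code of the response. The sketch below assumes that approach; the class names and the `request_class` helper are hypothetical, and classifying by incoming domain (as the criteria above suggest) would need a domain mapping that this draft does not define.

```python
# Status codes grouped by the example request classes above.
# 308 is the permanent counterpart of 307 in the HTTP specification.
PERMANENT_REDIRECTS = {301, 308}
TEMPORARY_REDIRECTS = {302, 307}


def request_class(status_code):
    """Return the hypothetical request class for a response status code."""
    if status_code in PERMANENT_REDIRECTS:
        return "permanent"
    if status_code in TEMPORARY_REDIRECTS:
        return "temporary"
    return "other"


for code in (301, 302, 307, 308, 404):
    print(code, request_class(code))
```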
Service Level Indicators (SLIs)
Service Level Indicators (SLIs) are the metrics used to evaluate the service's performance as clients perceive it. They should be directly client-visible, comprehensive, under our control, aligned with service health, and fully defined.
For the HTTP Redirect Service, the following SLIs are appropriate:
- Availability SLI: The percentage of all requests receiving a successful redirect response, defined as an HTTP status code in the 3xx range (e.g., 301, 302, 307, 308). This excludes any 4xx or 5xx HTTP status codes.
  - Rationale: This SLI measures the success rate from the client's perspective for a redirect. If a client receives a 4xx or 5xx, it means the redirect failed.
- Latency SLI, acceptable fraction: The percentage of all requests that complete within [e.g., 100] milliseconds, measured at the server side.
  - Rationale: Fast redirects are crucial for user experience and SEO. This SLI measures how quickly the service responds with the redirect.
Links to Grafana graphs for each SLI would be included here once implemented.
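As an illustration of how the two SLIs defined above could be computed from raw data, here is a minimal sketch. It assumes per-status request counts and a list of server-side latencies are available; the function names, input shapes, and sample numbers are hypothetical and do not describe the service's actual metrics pipeline.

```python
def availability_sli(status_counts):
    """Percentage of requests answered with a 3xx status code.

    status_counts maps an HTTP status code to the number of requests that
    received it. Returns None when there were no requests at all.
    """
    total = sum(status_counts.values())
    if total == 0:
        return None
    good = sum(n for code, n in status_counts.items() if 300 <= code <= 399)
    return 100.0 * good / total


def latency_sli(latencies_ms, threshold_ms):
    """Percentage of requests that completed within threshold_ms."""
    if not latencies_ms:
        return None
    fast = sum(1 for t in latencies_ms if t <= threshold_ms)
    return 100.0 * fast / len(latencies_ms)


# Made-up sample data for illustration only.
print(availability_sli({301: 990, 302: 5, 404: 3, 503: 2}))
print(latency_sli([10, 50, 120, 250], threshold_ms=100))
```

Both functions count every request in the denominator, matching the "percentage of all requests" wording of the SLI definitions.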
Operational
These answers reflect the service as it is, not as it ideally should be, to arrive at a realistically supportable SLO.
Monitoring
The service is monitored [e.g., 24x7 with paging alerts on all critical SLIs]. This means that the expected time between an outage starting and a responding engineer investigating is [e.g., less than 5 minutes], accounting for alert delays.
Troubleshooting
The complexity of troubleshooting is [e.g., moderate]. Engineers responding to pages [e.g., generally understand its internals, but some complex issues may require escalation to developers]. Documentation for interpreting monitoring, diagnosing problems, and taking mitigative action is [e.g., mostly complete and discoverable, but could benefit from more detail].
Deployment
The service is deployed [e.g., via automated CI/CD pipelines]. Production incidents are often resolved by rolling out code or configuration changes. A rollback process to a known-safe version exists and can be executed [e.g., quickly, skipping canary checks in an emergency].
Service Level Objectives
The reporting period for this SLO will be three calendar months, phased one month earlier than the fiscal quarter (e.g., December 1 - February 28/29, March 1 - May 31, June 1 - August 31, September 1 - November 30).
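The reporting periods described above start on the first day of December, March, June, and September. The following sketch finds the period that contains a given date; the `reporting_period` function is a hypothetical helper written for this illustration and is not part of any existing tooling.

```python
from datetime import date, timedelta


def reporting_period(day):
    """Return (start, end) dates of the SLO reporting period containing day."""
    # Periods begin in months 12, 3, 6 and 9, so month % 3 is the number of
    # months elapsed since the start of the current period.
    start_month = day.month - (day.month % 3)
    start_year = day.year
    if start_month == 0:
        # January or February: the period began the previous December.
        start_month, start_year = 12, day.year - 1
    start = date(start_year, start_month, 1)

    # The period ends the day before the next period starts, three months on.
    end_month, end_year = start_month + 3, start_year
    if end_month > 12:
        end_month -= 12
        end_year += 1
    end = date(end_year, end_month, 1) - timedelta(days=1)
    return start, end


print(reporting_period(date(2024, 1, 15)))
```

Computing the end date as "one day before the next start" handles February 28 and 29 without a special case.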
Realistic targets
Based on past performance, current operational capabilities, and considering the SLOs of its dependencies (assuming they meet their targets), the realistic targets are:
- Availability: 99.9% of all requests receiving a successful redirect response (HTTP 3xx).
  - Rationale: Past performance indicates the service generally operates at this level, and our incident response time estimates suggest we can recover from typical outages within the allowed error budget for this target.
- Latency: 99% of all requests complete within 200 milliseconds.
  - Rationale: This target aligns with observed performance, considering the time spent on internal processing and backend dependency calls.
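To show what "the allowed error budget" means for the availability target above, here is a minimal sketch. It assumes a 90-day reporting period and evenly spread traffic when converting the budget into minutes of total outage; the function names and the request volume are hypothetical.

```python
def error_budget_requests(target_percent, total_requests):
    """Number of requests that may fail while still meeting the target."""
    return total_requests * (100.0 - target_percent) / 100.0


def error_budget_minutes(target_percent, period_days):
    """Minutes of complete outage that would use up the whole budget.

    Assumes traffic is spread evenly over the period, which real traffic
    is not, so treat the result as a rough guide only.
    """
    return period_days * 24 * 60 * (100.0 - target_percent) / 100.0


# Hypothetical volume of one million requests in the period.
print(round(error_budget_requests(99.9, 1_000_000)))
# Roughly 129.6 minutes of full outage over a 90-day period.
print(round(error_budget_minutes(99.9, 90), 1))
```

Comparing the minutes figure with the expected detection and recovery times from the Operational section is one way to sanity-check whether a target is supportable.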
Ideal targets
Based on client needs (both internal services and external users), the ideal targets for this service are:
- Availability: 99.99% of all requests receiving a successful redirect response (HTTP 3xx).
  - Rationale: While 99.9% is acceptable, a higher availability target would reduce user-visible errors more effectively, minimizing lost traffic from non-canonical domains.
- Latency: 99.9% of all requests complete within 100 milliseconds.
  - Rationale: Faster redirects significantly improve user experience and SEO performance, especially for a foundational service like this. External users would consider this level of latency "basically satisfactory".
Reconciliation
Comparing the realistic and ideal targets, there is a gap in both availability and latency. While we can currently support 99.9% availability and 99% of requests within 200ms, the ideal is 99.99% availability and 99.9% of requests within 100ms.
To close this gap, we will prioritize engineering work focused on improving reliability and performance. This may include:
- Optimizing code paths for latency-sensitive requests.
- Implementing additional caching layers for canonical domain mappings.
- Improving redundancy for critical hard dependencies.
- Refining monitoring and alerting to reduce detection and resolution times.
Agreed-upon SLOs:
- Availability: 99.9% of all requests receiving a successful redirect response (HTTP 3xx).
- Latency: 99% of all requests complete within 200 milliseconds.
This SLO reflects the promises we can keep right now. We will publish these currently-realistic targets and continuously measure performance against them. We acknowledge the aspirational targets of 99.99% availability and 99.9% of requests within 100 milliseconds, which will guide our longer-term planning and reliability improvement efforts.
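Measuring performance against the agreed targets could be as simple as comparing each measured SLI with its threshold at the end of a reporting period. The sketch below assumes the SLIs are already available as percentages; the dictionary keys, the `slo_report` function, and the measured values are hypothetical.

```python
# Agreed-upon targets from this document, expressed as percentages.
AGREED_SLOS = {
    "availability_percent": 99.9,
    "latency_percent_within_200ms": 99.0,
}


def slo_report(measured):
    """Map each SLO name to True if the measured SLI meets its target."""
    return {
        name: measured[name] >= target
        for name, target in AGREED_SLOS.items()
    }


# Made-up measurements for one reporting period.
report = slo_report({
    "availability_percent": 99.95,
    "latency_percent_within_200ms": 98.7,
})
for name, met in report.items():
    print(f"{name}: {'met' if met else 'missed'}")
```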