SLO/API Gateway

From Wikitech

Organisational

What is the service?

The service is known as the API Gateway. It provides the unified interface to the MediaWiki REST APIs and the Wikifeeds API (as well as future services that are outside the scope of this document), indirectly serves the pages of the API Portal, and implements rate limiting, JWT verification and routing of API URLs. Its main documentation can be found on Wikitech. The API Gateway product consists of the following services, all running in Kubernetes containers:

  • Envoy
  • The Envoy ratelimit service (a distinct service from Envoy itself)
  • Nutcracker (for proxying ratelimit connections to Redis hosts)
  • Fluentd logstreamer (streams logs from the Envoy container to the eventgate service)
  • The standard prometheus-statsd-exporter container

This service runs in the codfw and eqiad Kubernetes clusters. There is an additional set of API Gateway instances in the staging cluster that are not covered by this SLO and are not expected to be part of any critical path.

Who are the responsible teams?

The Platform Engineering Team (PET) is responsible for maintaining and configuring the API Gateway. In future, responsibility will pass to an SRE team that is yet to be defined.

Architectural

What are the service’s hard dependencies?

The API Gateway has a hard dependency upon the normal operation of the application servers (appservers): it only serves content from the appservers, for both the REST APIs and the API Portal. It has a hard dependency upon the availability of the RestBase REST APIs, which provide access to the Wikifeeds APIs. It also has a hard dependency upon the availability and stability of the WMF Kubernetes cluster, as this is the only environment in which it runs.

What are the service’s soft dependencies?

The API Gateway has a soft dependency upon the Redis hosts in eqiad and codfw, which are used by the ratelimit service. If these hosts are unavailable, rate limits will not be applied but API requests will be served as normal. The API Gateway also streams access logs through fluentd to the Eventgate service, and from there to analytics platforms. This is a soft dependency because a failure of Eventgate or any other part of the analytics pipeline will interrupt insight into usage of the Gateway but will not interfere with the basic operations of the platform.

What other dependencies might the service have?

The API Gateway has a dependency upon its own ratelimit service. If this service fails, rate limiting fails open: rate limits will not be applied, but API requests will be served as normal. (For now, the ratelimit service is operated as a part of the API Gateway service and monitored by the same overall SLOs, rather than being treated as an external dependency.)
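The fail-open behaviour described above can be illustrated with a minimal Python sketch. It is not how Envoy implements the behaviour; the function name `allow_request` and the `check_limit` callable are hypothetical stand-ins for the lookup against the ratelimit service.

```python
from typing import Callable


def allow_request(check_limit: Callable[[str], bool], key: str) -> bool:
    """Return True if the request should be forwarded upstream.

    `check_limit` is a hypothetical callable standing in for a lookup
    against the ratelimit service; it returns False when the caller
    identified by `key` is over its limit.
    """
    try:
        return check_limit(key)
    except (ConnectionError, TimeoutError):
        # Fail open: the ratelimit lookup itself failed, so the request
        # is served as normal and no limit is applied.
        return True
```

With this shape, an outage of the ratelimit service degrades to "no limits applied" rather than to rejected requests.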

Client-facing

Who are the service’s clients?

The primary customers of the API Gateway are external developers. External developers integrate the API Gateway’s APIs in order to use Wikimedia project information in their applications and to facilitate OAuth 2.0 authorisation management. The importance to an individual application of the information it consumes from the API Gateway is unbounded. Integrations of the API Gateway may take any form an API consumer might take: mobile applications, big data/data scraping applications, Twitter bots and others. It is expected that some internal services will also begin to use the unified API offered by the API Gateway, but the degree of adoption is hard to predict. Depending on how other teams within the Foundation develop their APIs, it is likely that the API Gateway will become the sole point of entry for new APIs. This is entirely contingent on community adoption of api.wikimedia.org and on interest from teams deploying services within the Foundation.

What are the request classes?

The API Gateway primarily serves read requests and write requests for the MediaWiki REST API. These requests are distinguished by their HTTP method. The read requests take the form of ordinary HTTP GET and HEAD requests against the API and the API Portal itself. The write requests take the form of HTTP PUT, PATCH, DELETE and POST operations.
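As an illustration, the two request classes can be derived from the HTTP method alone. This is a minimal sketch; the function name `request_class` and the `"other"` fallback (for methods such as OPTIONS that this document does not classify) are assumptions made for the example.

```python
READ_METHODS = frozenset({"GET", "HEAD"})
WRITE_METHODS = frozenset({"PUT", "PATCH", "DELETE", "POST"})


def request_class(method: str) -> str:
    """Map an HTTP method to the request class used in this SLO."""
    normalised = method.upper()
    if normalised in READ_METHODS:
        return "read"
    if normalised in WRITE_METHODS:
        return "write"
    # Assumption: methods not named in the SLO are reported separately.
    return "other"
```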

Classes within the context of rate limiting

When discussing request classes, it is worth noting that the API Gateway itself has an internal conception of user classes. These classes do not define request classes in the spirit of this section of the SLO but are outlined here for clarity. At the time of writing, two of these classes exist:

  • Anonymous users: users whose requests do not include a MediaWiki-issued JWT. These users are currently classified by their IP address. Limit: 500 requests per hour, per IP.
  • Credentialed users: users whose requests are accompanied by a valid, MediaWiki-issued JWT. Limit: 5000 requests per hour, per client ID (the client ID is encoded within the JWT; limits are not per-token).

Note: the listed rate limits are provisional and will change as the service moves towards public use. If a user is over their rate limit, they will receive an HTTP 429 error. JWTs are currently issued by Meta. Apart from rate limiting, the API Gateway makes no distinction between the two classes of user in terms of service delivery: it sees all requests as similar in priority, and latency and availability depend entirely upon the APIs offered by the application servers.
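The two user classes and their provisional limits can be sketched as follows. This is an in-memory illustration only: the real limits are enforced by the Envoy ratelimit service backed by Redis, and the fixed one-hour window, the class name `HourlyLimiter` and the method `allowed` are assumptions for the example. JWT verification is out of scope here; `client_id` is assumed to come from an already verified token.

```python
import time
from collections import defaultdict
from typing import Callable, Optional

# Provisional limits from this document, in requests per hour.
LIMITS_PER_HOUR = {"anonymous": 500, "credentialed": 5000}
WINDOW_SECONDS = 3600


class HourlyLimiter:
    """Fixed-window counter keyed by IP (anonymous) or client ID (credentialed)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # Old windows are never purged; acceptable for a sketch only.
        self._counts = defaultdict(int)

    def allowed(self, ip: str, client_id: Optional[str] = None) -> bool:
        """Return False when the gateway would answer with HTTP 429."""
        user_class = "credentialed" if client_id else "anonymous"
        key = client_id if client_id else ip
        window = int(self._clock() // WINDOW_SECONDS)
        bucket = (user_class, key, window)
        self._counts[bucket] += 1
        return self._counts[bucket] <= LIMITS_PER_HOUR[user_class]
```

Because credentialed requests are keyed by client ID rather than by token or IP, several tokens issued to the same client share one 5000-request allowance.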

Service level indicators

Request response error rate

An increase in 504 responses served by the API Gateway to clients indicates that the service is failing to serve user requests appropriately and that communication with the application or API servers is interrupted in some manner. These increases can be seen on the API Gateway’s dashboard for response codes. SLI: The percentage of all requests that return 504 errors.
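As a worked illustration of this SLI, the percentage can be computed from per-status request counts. The input shape (a mapping from status code to count) and the function name are assumptions for the example; in practice the figures would come from the dashboard's underlying metrics.

```python
from typing import Mapping


def sli_504_percentage(status_counts: Mapping[int, int]) -> float:
    """Percentage of all requests that returned HTTP 504."""
    total = sum(status_counts.values())
    if total == 0:
        # No traffic in the interval: report zero errors rather than divide by zero.
        return 0.0
    return 100.0 * status_counts.get(504, 0) / total
```

For example, 5 responses with status 504 out of 10,000 total requests gives an SLI of 0.05%.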

Operational

How is the service monitored?

Caveat: the API Gateway is currently in a beta state, and for this reason there are no paging alerts for the service. This can and should change as the service support model changes.

The API Gateway serves pages to the open internet as an LVS service. As with all other LVS services in the Foundation, it can be monitored, with paging, for a total service outage. This feature is currently disabled for the testing phase of the API Gateway, but this will be changed as the Gateway moves out of beta. The API Gateway is also subject to the monitoring offered by our Kubernetes configuration. This does not lead to alerting or paging but ensures some service reliability. The Grafana dashboards linked to the service level indicators listed above are prime candidates for configuration as alerts.

At present the API Gateway service is maintained by the Platform Engineering Team, which means that support and incident response are only available within working hours IST. This situation will change as future arrangements around support are made.

How complex is the service to troubleshoot?

At its core, the API Gateway is not a particularly complex service to troubleshoot in its base operation. Envoy has an expansive and complicated configuration language, but only a few core components are of interest for the API Gateway’s normal operation. The most critical aspect of the API Gateway is more or less that of any reverse proxy: the highlighted service level indicators all refer to the ability of the Gateway to serve wiki and API content to users. Anyone familiar with the operation of webservers and reverse proxies should be able to understand how the system processes requests and responses between users and upstream services.

However, failures in Envoy can be complicated. Envoy’s circuit breaking system can be difficult for new users to grasp, and the verbosity of Envoy’s logs can also pose a problem for users unfamiliar with the service.

Failures within the rate-limiting component of the Gateway are another matter. The operation of the ratelimit service itself is relatively simple, but it is terse at best when it comes to logging, and the limited metrics it offers are not particularly informative. The impact of this lack of insight is somewhat limited by Envoy’s fail-open behaviour when ratelimit lookups fail. In the case of a complex failure of the ratelimit service, it is likely that a developer from PET will be required to debug it.

How is the service deployed?

The API Gateway is deployed following the recommended Kubernetes helmfile deploy pattern for services. A staging instance exists to test changes. The deployment process is documented on the Gateway’s Wikitech page.

Service level objectives

Reporting period

Like all SLOs at the Wikimedia Foundation, the reporting period will be three months, offset from the fiscal quarter by one month, meaning that the reporting period would be March to May and so on.

Targets

Without existing SLOs upon which to base our reporting and calculations, our realistic targets must be conservative so that they can be tuned upwards as we establish SLOs for our baseline. Additionally, as the API Gateway is a new service, it is important to set expectations low to allow for a data-gathering period in which we can establish reasonable expectations of performance. Our ideal target is close to, or slightly better than (assuming reliable retries), that of the MediaWiki APIs themselves. At present there is no SLO for the MediaWiki REST APIs.

Request error rate

99.9% of requests will be successful (any HTTP response other than 504). This results in an error budget of 0.1% of requests.
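To make the error budget concrete, the sketch below converts the 99.9% target into a number of permitted 504 responses for a reporting period and reports how much of that budget remains. The function name and the request volumes in the usage note are illustrative assumptions, not measured figures.

```python
from fractions import Fraction
from math import floor
from typing import Tuple

# 99.9% target from this SLO; Fraction avoids floating-point rounding.
SLO_TARGET = Fraction("0.999")


def error_budget(total_requests: int, observed_504: int) -> Tuple[int, int]:
    """Return (allowed_504, remaining_504) for one reporting period.

    A negative remaining value means the budget has been exceeded.
    """
    allowed = floor(total_requests * (1 - SLO_TARGET))
    return allowed, allowed - observed_504
```

For a hypothetical period with 1,000,000 requests, the budget is 1,000 responses with status 504; if 250 have been observed, 750 remain.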