API Gateway

From Wikitech

The API Gateway is an Envoy-based service that runs in Kubernetes. It implements many features central to serving the unified API and the API Portal.

What it does

The API Gateway serves pages for api.wikimedia.org. It does this by rewriting requests for the unified API to URIs that are understood by the respective APIs on the application servers, and also by serving pages for the API Portal wiki in the same way other wikis would be served. The API Gateway also uses metadata from JSON Web Tokens (JWTs) to apply rate limits to clients using the APIs.

How it works

Wikimedia API Gateway architecture diagram

Envoy functions much like any other API router or reverse proxy: it answers requests for the configured domain, selectively manipulates them, and then requests page content from the appservers via their LVS endpoint.

Rate limiting

The API Gateway applies rate limits based on whether the client provides a JWT. During the beta, unauthenticated clients are limited to 500 requests an hour, and authenticated clients that pass a valid JWT are limited to 5000 requests an hour. These values are temporary and will change as the platform moves towards general release. Clients obtain JWTs by requesting OAuth 2.0 clients on Meta and, in future, on the API Portal.

The API Gateway also supports setting rate limits on a per-domain basis via envoyproxy/ratelimiter and on a per-service basis via discovery records, using values.yaml.

Routing

The API Gateway maps API URIs passed to the Gateway's hostname (api.wikimedia.org) to the relevant APIs understood by the application servers. For example, the Gateway's configuration maps https://api.wikimedia.org/core/v1/wikipedia/en/page/pizza to https://en.wikipedia.org/w/rest.php/v1/page/pizza. As of September 2020, this requires a relatively complex rewriting method using Lua and multiple URL pattern definitions in the values.yaml file; this is expected to be fixed in Envoy 1.16.0. Currently, all APIs offered by the API Gateway are also directly accessible via the traditional API routes at the per-service level.

JSON Web Tokens

The API Gateway verifies the signatures of JWTs sent in the Authorization header of requests. If a JWT is valid, a different rate limit is applied than for anonymous requests. Both limits can be configured per environment via the Helm values file (Values.main_app.ratelimiter.default_limit.unit for valid JWTs and Values.main_app.ratelimiter.anon_limit.unit for anonymous users).

API Portal

The API Gateway is the means by which all clients access the API Portal. The API Portal is simply a customised MediaWiki instance, and the API Gateway serves requests to it by proxying them to the appservers. Unlike other wikis, however, the API Portal is only accessible via the API Gateway.

Logs and analytics

Logs are shipped from the API Gateway to EventGate using Fluentd. Fluentd runs in its own container, continuously parsing JSON request log output and reposting these logs to EventGate.

Where it runs

The API Gateway runs in Kubernetes in staging, eqiad and codfw. The instance in staging does not receive external traffic but can be accessed internally at https://api-gateway.svc.eqiad.wmnet:8087. Changes should be deployed to staging and tested via curl on this endpoint.

How to design your API

This section is for teams who want to add their API to the API gateway and API Portal.

Before you build

Style

All APIs in the API gateway follow a RESTful architectural style. For general guidance on our interpretation of REST and recommended API best practices, visit the design principles.

Authentication and authorization

The API gateway supports OAuth 2.0 as the sole method for authentication and authorization. The gateway requires that write requests include an OAuth token, effectively prohibiting writes without a registered consumer.

At this time, the API Portal supports only the MediaWiki rights covered by the basic, createeditmovepage, and editprotected grants. To add a grant, open a task in Phabricator tagged with #API-Portal.

Rate limits

Requests to api.wikimedia.org are subject to centrally defined rate limits. Rate limits are based on the type of OAuth 2.0 workflow used by the consumer. Requests without an OAuth 2.0 token are subject to a significantly lower rate limit. Visit the documentation for specific rate limits.

Criteria: URL structure

API gateway endpoints follow a consistent URL structure:

# Base URL
https://api.wikimedia.org

# Structure
{base URL}/{namespace}/v{version number}/{project name}/{subdomain}/{endpoint}

# Example: Get the Earth article from English Wikipedia
https://api.wikimedia.org/core/v1/wikipedia/en/page/Earth

Namespace

Endpoints MUST be grouped under a namespace as the initial URL element after the base URL. The namespace should represent a logical grouping of endpoints by function or origin. For a list of namespaces currently in use, visit the API Portal.

For APIs originating outside of MediaWiki, the namespace may include both the general /service namespace and a service-specific subnamespace (for example: /service/linkrecommendation). For more information about this pattern, see the other API routes section.
Note: this pattern is currently in use but should be reconsidered and/or discontinued per phab:T280087.

Version

Following the namespace, endpoints MUST include a version in the URL. Experimental or unstable APIs MUST use v0. Endpoints using v1 and above MUST follow the stability policy.

Supported projects

Endpoints MUST address project routing (Wikipedia, Wiktionary, etc.) through a URL element following the version, using the unabbreviated name of a project supported by the gateway pathing map. If your API has limited project support, you MUST document (either in the API Portal, error message, or other location) which projects are supported by your API.

Supported subdomains

Endpoints MUST address subdomain routing (en, zh, etc.) per project through a URL element following the project, using the subdomain. These subdomains are usually ISO-standard language codes, but there are exceptions.

If your API has limited subdomain support, you MUST document (either in the API Portal, error message, or other location) which subdomains are supported for each project (Wikipedia, Wiktionary, etc.). Note that multilingual projects (commons, mediawiki, meta, wikidata, and wikispecies) do not accept a subdomain parameter.
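The URL criteria above can be sketched as a small parser. This is a minimal illustration, not gateway code: the names GatewayPath and parse_gateway_path are hypothetical, and the sketch assumes a single-segment namespace, so it does not handle the /service/{subnamespace} pattern.

```python
import re
from typing import NamedTuple, Optional

# Multilingual projects listed above; they take no subdomain element.
MULTILINGUAL_PROJECTS = {"commons", "mediawiki", "meta", "wikidata", "wikispecies"}


class GatewayPath(NamedTuple):
    namespace: str
    version: int
    project: str
    subdomain: Optional[str]
    endpoint: str


def parse_gateway_path(path: str) -> GatewayPath:
    """Split a gateway path into namespace, version, project, subdomain, endpoint."""
    parts = path.strip("/").split("/")
    if len(parts) < 4:
        raise ValueError(f"too few URL elements: {path!r}")
    namespace, version_part, project = parts[0], parts[1], parts[2]
    match = re.fullmatch(r"v(\d+)", version_part)
    if match is None:
        raise ValueError(f"expected a version such as v0 or v1, got {version_part!r}")
    rest = parts[3:]
    subdomain = None
    if project not in MULTILINGUAL_PROJECTS:
        # Single-language projects carry the subdomain right after the project.
        subdomain, rest = rest[0], rest[1:]
    if not rest:
        raise ValueError(f"missing endpoint: {path!r}")
    return GatewayPath(namespace, int(match.group(1)), project, subdomain, "/".join(rest))
```

For the example URL above, parse_gateway_path("/core/v1/wikipedia/en/page/Earth") yields namespace core, version 1, project wikipedia, subdomain en, and endpoint page/Earth.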

Criteria: Conditional requests

Your API MUST support conditional requests using ETag or Last-Modified headers.
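As a rough sketch of what supporting conditional requests means on the server side, the function below decides between 200 and 304 from the standard If-None-Match and If-Modified-Since request headers. The function name and the plain dictionary of headers (with exact-case keys) are assumptions made for illustration; a real service would normally rely on its web framework for this.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping


def _opaque(tag: str) -> str:
    """Drop the weak-validator prefix so tags are compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(headers: Mapping[str, str], etag: str, last_modified: datetime) -> int:
    """Return 304 if the client's cached copy is still current, else 200."""
    if_none_match = headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = [_opaque(tag) for tag in if_none_match.split(",")]
        return 304 if "*" in candidates or _opaque(etag) in candidates else 200
    if_modified_since = headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # an unparsable date is ignored
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200
```

A client that resends the ETag it received earlier in If-None-Match gets 304 and can reuse its cached body; any other tag gets 200 and a fresh response.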

Next steps

How it's configured

The API Gateway uses the reserved port 8087 internally and is registered in Service ports.

The core configuration for the API Gateway helm chart is documented in the default values.yaml file. Note that there are configuration overrides for production in general, and also for eqiad and codfw specifically (and staging, which does not serve public requests).

JWTs are verified using the public key of the keypair used to sign OAuth tokens on meta.wikimedia.org. This key has been converted to the JWKS format required for JWT support and is distributed as a secret via Puppet.

Rate limiting internals

A rate limit table entry is based on three factors, which together make up what Envoy calls a "descriptor". A descriptor is essentially a key that represents something that should be counted and potentially rate limited. The API Gateway distinguishes two kinds of request for rate limiting: authenticated and unauthenticated, where authentication is done via JWTs. Authenticated requests are keyed by the client ID and user ID extracted from the JWT itself (the aud and sub claims respectively). Unauthenticated requests are keyed by the X-Client-IP header, which is passed to the gateway by our Varnish layer. These descriptions generalise the internal logic; see the config.yaml template in the service chart for more specific explanations.
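The keying described above can be illustrated with a short sketch. It is a simplification rather than the gateway's actual Envoy configuration: the function name, the tuple layout and the "authenticated"/"anonymous" labels are hypothetical, and the JWT is assumed to have been verified and decoded into a claims dictionary already.

```python
from typing import Mapping, Optional, Tuple


def rate_limit_key(claims: Optional[Mapping[str, object]],
                   headers: Mapping[str, str]) -> Tuple[str, ...]:
    """Build the key under which a request is counted.

    claims is the payload of an already verified JWT, or None when the
    request carried no valid JWT.
    """
    if claims is not None:
        # aud carries the client ID and sub the user ID; a missing user ID
        # still forms a pair with the client ID.
        user_id = claims.get("sub")
        return ("authenticated", str(claims["aud"]), "" if user_id is None else str(user_id))
    # Anonymous requests fall back to the client IP supplied by Varnish.
    return ("anonymous", headers["X-Client-IP"])
```

Two requests made with the same client ID but different user IDs produce different keys, so they are counted separately.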

In addition, unless otherwise specified, all endpoints are rate limited as one bucket. For example, with a rate limit of 100 requests an hour, a client that makes 99 requests to one API within an hour and then two requests to any other API will receive a 429 on the last of those requests. To avoid this behaviour, a discovery service can specify a custom ratelimit_config to override the existing rate limits and also create a custom bucket for the service itself, separating its rate limits from the global rate limits.
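The shared-bucket behaviour can be reproduced with a toy counter. This sketch is only an illustration of the arithmetic: the class name is hypothetical, it ignores the hourly window reset, and it assumes a service with its own bucket keeps the same numeric limit, which a real ratelimit_config may override.

```python
from collections import Counter
from typing import Iterable


class HourlyLimiter:
    """Count requests per (client, bucket) within a single hour."""

    def __init__(self, limit: int, own_bucket_services: Iterable[str] = ()):
        self.limit = limit
        self.own_bucket_services = set(own_bucket_services)
        self.counts: Counter = Counter()

    def check(self, client: str, service: str) -> int:
        """Record one request and return the HTTP status it would receive."""
        # Services without a custom configuration share the global bucket.
        bucket = service if service in self.own_bucket_services else "global"
        self.counts[(client, bucket)] += 1
        return 200 if self.counts[(client, bucket)] <= self.limit else 429
```

With HourlyLimiter(100), 99 requests to one API followed by two to another return 200 for the first 100 requests and 429 for the 101st. If the second API is listed in own_bucket_services, both of its requests return 200.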

How to add an API route

Appserver API routes

API routes for services running on MediaWiki application servers are defined in the values file for the API Gateway chart under the pathing_map.

  • The keys of the per-cluster dictionary are the paths served on api.wikimedia.org. Each key is interpreted as a regular expression with support for group matching.
  • The sub-dictionary keys path and host are the rewritten path and the HTTP Host header used to make the request to the corresponding backend. Any groups matched by the API path regex are substituted into these parameters; see the existing pathing_map for examples.
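The regex-and-substitution idea behind the pathing_map can be sketched as follows. The entry shown is hypothetical and uses Python's backreference syntax; the real chart has its own substitution syntax and patterns, so treat this only as an illustration of how matched groups feed the rewritten path and host.

```python
import re
from typing import Optional, Tuple

# Hypothetical pathing_map entry modelled on the description above.
PATHING_MAP = {
    r"/core/v1/wikipedia/([^/]+)/(.*)": {
        "path": r"/w/rest.php/v1/\2",
        "host": r"\1.wikipedia.org",
    },
}


def rewrite(request_path: str) -> Optional[Tuple[str, str]]:
    """Return (host, path) for the backend request, or None if nothing matches."""
    for pattern, target in PATHING_MAP.items():
        match = re.fullmatch(pattern, request_path)
        if match is not None:
            # Groups captured from the API path are substituted into both values.
            return match.expand(target["host"]), match.expand(target["path"])
    return None
```

Here rewrite("/core/v1/wikipedia/en/page/pizza") gives the host en.wikipedia.org and the path /w/rest.php/v1/page/pizza, matching the routing example given earlier on this page.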

Currently requests are routed to different locations depending on the path requested.

  • All requests under /core/ with the exception of queries ending “description” are routed to appservers.
  • /core/ requests ending in “description” are routed to the mobileapps service for the description API.
  • Requests for paths starting with /service/ are routed to service discovery endpoints. This relation is defined in the Helmfile configuration for the gateway.
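The routing rules in the list above amount to a small decision function. This sketch only restates those three rules; the function name and the returned labels are hypothetical, and paths not covered by the list are reported as "other" because the list does not describe them.

```python
def select_backend(path: str) -> str:
    """Name the backend that a request path would be routed to."""
    if path.startswith("/service/"):
        return "discovery"
    if path.startswith("/core/"):
        # Description requests go to mobileapps, everything else to appservers.
        return "mobileapps" if path.endswith("description") else "appservers"
    return "other"
```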

Other API routes

Discovery services

Routing to a custom service is a little more involved than adding an appserver API route. Currently only services with a discovery configuration can be routed to. The host is added to the discovery_endpoints dictionary in the configuration under its discovery hostname, with a port and an endpoint specified. The service will then be accessible at https://api.wikimedia.org/service/my_endpoint. See the existing dictionary for pointers.

Versioning

Unstable APIs that may undergo backwards-incompatible changes should be added to the API Gateway with /v0/ in the path. Stable APIs that can comply with the API Gateway versioning policy should be added with /v1/ in the path.

Documentation

To add API routes to the API Portal, see the API Portal page.

How to deploy changes

The API Gateway's configurable components all live within the deployment-charts repository. The components that are of interest are the api-gateway chart itself and the aforementioned helmfile.d configuration for the service. Note: when changing configuration in the API Gateway chart, make sure to bump the version in Chart.yaml. Not bumping this value will lead to your changes not being deployed.

Changes to the API Gateway chart or configuration files follow a standard code review process. Once you have received a +1 in Gerrit, submitting a +2 will trigger the auto-merge process for the deployment-charts repository. Once the change is merged, always deploy it to staging first and then deploy to the production environments using the standard deployment process.

There are currently no specific deployment windows for the API Gateway but if deploying a change ad hoc without PET's knowledge, it is best to both !log liberally and make sure that someone from the team is on hand if you're doing something risky.

How to roll back changes

Follow the standard rollback procedures. If a change is affecting user experience in any way (increases in error codes served, timeouts etc - always refer to the dashboards when deploying), use the emergency procedure to limit the public impact of a change.

How to test changes

In development

Given the API Gateway's interactions with the appservers, testing changes locally can be difficult. However, there is limited support for it: if you have a local setup such as minikube, you can install a local version of the API Gateway by running helm install -f api-gateway/values-devel.yaml api-gateway in the charts directory. Once the install is complete and you have forwarded the requisite ports, requests are passed to a fake backend service that responds to any request by returning the headers and parameters of the request and response. This can be used to check that basic behaviour changes match what you expect, that the Envoy syntax is valid and that URL mappings behave as intended, amongst other things.

In staging

When changes have been deployed to staging, they can be tested using curl from any internal host. This can make it difficult to test changes that rely on MediaWiki changes, but it is unlikely that helm will be used to change the API Portal's behaviour in lieu of the standard mediawiki-config deployment process.

For example, to test a change to the API routing, run curl -k https://staging.svc.eqiad.wmnet:8087/core/v1/wikipedia/ga/page/Veigeat%C3%B3ireachas -v. When deploying new changes to staging, it should be verified that the change has had no impact on the API in general and specifically any API paths that have been modified or added. The normal operation of the API Portal should be tested - nothing too extensive but make sure that the main page loads okay.

How to debug it

Logs

Logstash

The easiest way to look at logs for the API gateway is via Logstash - simply filter for the kubernetes.namespace_name attribute of api-gateway or view the api-gateway dashboard.

Kubernetes

To read and follow the logs for an API Gateway instance from the deploy server, use kubectl as shown below (codfw in this example). This is useful if you want to debug a specific instance, but it is usually more hassle than using Logstash directly:

hnowlan@deploy1001:~$ kube_env "api-gateway" "codfw"
hnowlan@deploy1001:~$ kubectl get pods | grep Running
api-gateway-production-5cd8c54ddb-rcg77   5/5     Running   0          5d5h
tiller-deploy-77f47486d6-fxhpx            1/1     Running   0          6d3h
hnowlan@deploy1001:~$ kubectl logs api-gateway-production-5cd8c54ddb-rcg77 api-gateway-production --tail 10 -f

This will show the last 10 lines of the logs and then follow output.

Note that Envoy's log format is extremely verbose and dumping whole logs may take a few seconds. Following logs can be challenging because output from many concurrent requests is interleaved and can seem non-linear. One aid is the [Cxxxxx] field in each log line: these are unique connection IDs that can be used to follow requests as they are received and answered.

The kubectl logs command above can also be used to monitor the ratelimit service: in place of api-gateway-production, simply substitute production-ratelimit. This pattern applies to the other services within the pod, but their log output is not always useful.
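When reading saved log output, the [Cxxxxx] connection IDs mentioned above can be used to regroup interleaved lines. The following is a small sketch under the assumption that each relevant line contains a field of the form [C followed by digits and a closing bracket; the function name is hypothetical and the exact Envoy line layout may differ.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches a connection ID field such as [C12345].
CONNECTION_ID = re.compile(r"\[C(\d+)\]")


def group_by_connection(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group log lines by connection ID, preserving their original order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        match = CONNECTION_ID.search(line)
        if match is not None:
            grouped[match.group(1)].append(line)
    return dict(grouped)
```

Lines without a connection ID are skipped, so the result contains only per-connection sequences that can be read in order.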

WikimediaDebug

The WikimediaDebug plugin is supported for accessing the API Portal. It is not currently supported for routing API requests.

How to monitor it

There is a Grafana dashboard available that monitors many features of the API Gateway.

How to assign a client to a rate limit tier

All clients are assigned to the default ratelimit tier. To change the tier, use the setClientTierName.php maintenance script.

Log in to the mwmaint host in the active datacenter, usually mwmaint1002, and execute:

mwscript extensions/OAuthRateLimiter/maintenance/setClientTierName.php --wiki metawiki --client <client_id> --tier <tier_name>

At the time of writing, four tiers exist:

  • Default rate limit class: 5,000 API calls/hour per client ID/user ID pair (with null user ID counting as a pair here)
  • Preferred rate limit class: 25,000 API calls/hour per client ID/user ID pair
  • Internal rate limit class: 100,000 API calls/hour per client ID/user ID pair
  • WME rate limit class: 250,000 API calls/hour per client ID/user ID pair (this is only to be used for Wikimedia Enterprise tokens)

Known issues

  • If a user receives an error of {"httpCode":401,"httpReason":"Jwt issuer is not configured"} it is because the "iss" field in the token does not match the one configured on the API Gateway. Depending on what has been changed this could be a misconfiguration of the OAuth token creation process or of the Gateway itself. Envoy is very strict about issuer being set (although this is changing) and a mismatch will lead to tokens being rejected.
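When debugging an issuer mismatch, it can help to look at the iss claim of the rejected token. The sketch below decodes the payload segment of a JWT without verifying its signature, so it is suitable for inspection only and never for authentication. The function names are hypothetical, and the expected issuer must be taken from the gateway's own configuration.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (debugging only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    payload = parts[1]
    # JWT segments are base64url encoded with the padding stripped; restore it.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def issuer_matches(token: str, expected_issuer: str) -> bool:
    """Compare the token's iss claim with the issuer the gateway expects."""
    return jwt_claims(token).get("iss") == expected_issuer
```

If issuer_matches returns False, compare the decoded iss value with the configured issuer to decide whether the token creation process or the Gateway configuration needs correcting.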

Related