User:GLavagetto (WMF)/Robot policy update draft
The following rules are an evolution of the previous version of this policy, originally published in 2009. They cover more systems, define limits more precisely, and should help any Bot Operator who acts in good faith limit their impact on our systems.
Failure to follow these guidelines may result in your bot being blocked or heavily rate-limited.
If you operate a program that automatically consumes the content of the wikis, certain rules apply depending on the type of content you’re accessing and how you access it. Failing to comply with these rules will generally result in your operation being rate-limited and, eventually, in you as an Operator being banned from accessing our resources.
Please note: while the guidelines apply to any bot that connects to our environment, rate limiting is not enforced on bots in Toolforge.
Generally applicable rules
The following rules apply to any activity on our websites; rules in the following sections will be specific to sites/URLs instead.
- Consider whether dumps are more efficient than live requests. Check whether you could use our dumps or other forms of offline collection of our data instead of making live requests. If that’s a viable option for your use case, it will reduce the strain on our very limited resources and make your life easier.
- Accurately identify your user-agent. Always identify your bot clearly via its User-Agent HTTP header, following our User-Agent policy.
- Because user-agents are easily forged, if you’re making a significant number of requests to us you should do one of the following to avoid impersonation (and being blocked because of the misdeeds of others):
  - Provide a URL where we can download a list, in JSON format, of the IP spaces from which your requests will originate, as a list of CIDRs. See the “What to do if these limits are too strict for me?” section below.
  - If you’re making only API requests, create an on-wiki account and authenticate your requests (preferably using OAuth).
- Honor robots.txt. Respect every directive in our robots.txt file.
- Default to gzip. Always request content with the Accept-Encoding header set to “gzip” to reduce bandwidth usage, unless you’re requesting media files (e.g. images or video) that are already in a compressed format.
- Respect our HTTP status codes. When we reply with a 429 status code, respect the Retry-After response header you received.
- Cached interfaces are preferred and more efficient. If you need the HTML content of pages, fetch it either via the /wiki/Article_name URL format or via the corresponding REST API endpoint (/api/rest_v1/page/html/Article_name). Both of these interfaces are cached in our CDN; your requests will be cheaper for us, and faster for you. More information about these can be found below.
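Several of the rules above (a clearly identifying user-agent, gzip by default, honoring Retry-After on 429) can be combined in a single standard-library sketch. The bot name and contact details are placeholders, and Retry-After is assumed here to carry a value in seconds (per the HTTP spec it may also be an HTTP date):

```python
import gzip
import time
import urllib.error
import urllib.request

# Placeholder identifier; per the User-Agent policy, use your own bot
# name, version, and contact information.
USER_AGENT = "ExampleBot/1.0 (https://example.org/bot; bot@example.org)"

def build_request(url: str) -> urllib.request.Request:
    """A request that identifies itself and asks for gzip-compressed content."""
    return urllib.request.Request(url, headers={
        "User-Agent": USER_AGENT,
        "Accept-Encoding": "gzip",
    })

def fetch(url: str, max_attempts: int = 3) -> bytes:
    """Fetch a URL, honoring Retry-After when the server replies 429."""
    for _ in range(max_attempts):
        try:
            with urllib.request.urlopen(build_request(url)) as resp:
                body = resp.read()
                if resp.headers.get("Content-Encoding") == "gzip":
                    body = gzip.decompress(body)
                return body
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            # Sleep for the server-requested delay; default to 60s if absent.
            # This sketch assumes a seconds value, not an HTTP date.
            time.sleep(float(err.headers.get("Retry-After", 60)))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```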
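The two cached interfaces can be derived from a page title; a sketch (spaces become underscores, and titles are percent-encoded — fully so for the REST form, where a literal slash would change the path):

```python
from urllib.parse import quote

def wiki_url(title: str, host: str = "en.wikipedia.org") -> str:
    """CDN-cached wiki page URL, e.g. https://en.wikipedia.org/wiki/Main_Page."""
    return f"https://{host}/wiki/{quote(title.replace(' ', '_'))}"

def rest_html_url(title: str, host: str = "en.wikipedia.org") -> str:
    """CDN-cached REST API HTML endpoint for the same page."""
    # safe='' also encodes '/', which the REST API expects inside titles.
    return f"https://{host}/api/rest_v1/page/html/{quote(title.replace(' ', '_'), safe='')}"
```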
Website (e.g. https://en.wikipedia.org/wiki/Main_Page)
- Always crawl the website via the /wiki/Article_name URLs, with no query parameters. This ensures that if the content is CDN-cached you’ll get a faster response, allowing you to crawl the site faster and more efficiently.
- Do not emulate a browser: do not store cookies or execute JavaScript.
- Even if you follow all of our best practices, keep the maximum number of concurrent requests fewer than 10 overall, and keep the average rate below 20 requests per second.
- Avoid accessing content that is not current, or via non-canonical URLs: do not crawl the site using the oldid or curid parameters, and only use the /wiki/Title format URLs.
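The concurrency and rate limits above can be enforced client-side. A minimal sketch: a semaphore caps concurrency at 10, and a shared minimum inter-request interval keeps the average below 20 requests per second; the actual fetch function is left to the caller:

```python
import threading
import time

MAX_CONCURRENCY = 10       # fewer than 10 concurrent requests overall
MIN_INTERVAL = 1.0 / 20    # at most 20 requests per second on average

_slots = threading.BoundedSemaphore(MAX_CONCURRENCY)
_lock = threading.Lock()
_last_start = 0.0

def throttled(fetch, url):
    """Run fetch(url) while respecting both the concurrency and rate limits."""
    global _last_start
    with _slots:                      # blocks if 10 requests are in flight
        with _lock:                   # serialize the rate bookkeeping
            wait = _last_start + MIN_INTERVAL - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            _last_start = time.monotonic()
        return fetch(url)
```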
REST API (e.g. https://en.wikipedia.org/api/...)
- You can use this interface to fetch the HTML content of the pages, or their summary.
- Always limit the total number of concurrent requests to fewer than 5 overall, and keep the total rate below 10 requests per second.
- Do not send authentication cookies with your request.
- Avoid accessing content that is not current, so avoid requesting a specific revision in your URL.
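For illustration, a summary request shaped to these rules: an unversioned URL (so you always get current content), no cookie handling (urllib sends none unless you add a cookie processor), and an identifying user-agent. The bot name is a placeholder:

```python
import urllib.request

def summary_request(title: str, host: str = "en.wikipedia.org") -> urllib.request.Request:
    """REST API summary request for the current revision of a page."""
    # No revision in the URL: we always fetch current content.
    url = f"https://{host}/api/rest_v1/page/summary/{title.replace(' ', '_')}"
    return urllib.request.Request(url, headers={
        # Placeholder identity; use your own per the User-Agent policy.
        "User-Agent": "ExampleBot/1.0 (https://example.org/bot; bot@example.org)",
        "Accept-Encoding": "gzip",
    })
```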
Action API (e.g. https://en.wikipedia.org/w/api.php?...)
- Avoid using the action API for HTML content of pages. Use the website and/or the REST API instead.
- Keep the concurrency of your requests to 1 at a time, and below 5 requests per second overall, if unauthenticated.
- If authenticated, you can raise the concurrency to 3 overall, and the number of requests per second to 10.
- Avoid using expensive API endpoints: if your request takes more than 1 second to serve, please wait 5 seconds before making another request.
- Where supported, use batch requests.
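Where batching is supported, one query can cover many pages instead of one request per page. A sketch of a batched info query; most Action API endpoints accept up to 50 titles per request for unauthenticated clients:

```python
from urllib.parse import urlencode

def batch_query_url(titles, host="en.wikipedia.org"):
    """One Action API query for many pages, pipe-separating the titles."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),  # batch: one request, many pages
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"
```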
Media API (e.g. https://upload.wikimedia.org/...)
- Always keep a total concurrency of at most 2, and limit your total download speed to 25 Mbps (as measured over 10 second intervals).
- Only use originals or one of our pregenerated thumbnail sizes, which you can fetch here in JSON format.
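The 25 Mbps cap, measured over 10-second intervals, can be tracked with a small accounting object the downloader feeds after every chunk; a sketch:

```python
import time

LIMIT_BITS_PER_SEC = 25_000_000  # 25 Mbps
WINDOW = 10.0                    # measurement interval, in seconds

class BandwidthCap:
    """Sleep out the rest of a 10s window once its bit budget is spent."""

    def __init__(self):
        self.window_start = time.monotonic()
        self.bits = 0

    def account(self, nbytes: int) -> None:
        """Record a downloaded chunk, pausing if the window budget is exhausted."""
        now = time.monotonic()
        if now - self.window_start >= WINDOW:
            # A new measurement window has begun; reset the counter.
            self.window_start, self.bits = now, 0
        self.bits += nbytes * 8
        if self.bits >= LIMIT_BITS_PER_SEC * WINDOW:
            # Budget spent: wait until the current window ends.
            time.sleep(self.window_start + WINDOW - now)
            self.window_start, self.bits = time.monotonic(), 0
```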
Other resources (e.g. GitLab, Gerrit, Phabricator, etc.)
- Always keep a total concurrency of at most 1, and use a delay between requests of at least 1 second.
- Pause crawling for at least 15 minutes if you receive a 5xx status code.
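A sketch of this pacing: 1 second between requests normally, and a pause of at least 15 minutes after any 5xx response:

```python
PAUSE_ON_5XX = 15 * 60  # pause at least 15 minutes after a server error
NORMAL_DELAY = 1.0      # otherwise, at least 1 second between requests

def next_delay(status: int) -> float:
    """Seconds to wait before the next request, given the last status code."""
    return PAUSE_ON_5XX if 500 <= status <= 599 else NORMAL_DELAY
```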
What to do if these limits are too strict for me?
These limits are not per-domain but global across all Wikimedia properties, with an exception for community projects of the kind that would be eligible to run on the Wikimedia Foundation’s hosted infrastructure (Wikimedia Cloud Services offerings).
Specifically, bots running in Toolforge or any other Wikimedia Cloud Services offering are explicitly exempted from these limits. We still reserve the right to temporarily rate-limit or block individual bots that compromise the stability of the websites.
If you are an external entity and you need a higher volume of requests, please refer to the High-Volume access page on the Wikimedia Developer Portal.