Bot traffic

To get help with questions about bot traffic and high-volume access to Wikimedia infrastructure, email bot-traffic@wikimedia.org.

How to ensure your bot is identified

We generally require everyone to clearly identify their bot using the User-Agent HTTP header. However, since the User-Agent is supplied by the client, we have repeatedly seen malicious actors forge it, impersonating a known crawler in the hope of evading bans. This isn't a new problem; every large crawler operator has had to deal with it, and the simplest solution they converged on is to let clients download a file containing their IP ranges[1] in a simple format:

{
    "creationTime": "2025-04-08T14:46:14.000000",
    "prefixes": [
        {
            "ipv6Prefix": "2001:1234:1234:1::/64"
        },
        {
            "ipv4Prefix": "1.2.3.4/27"
        },
        ...
    ]
}

This format has become the de facto standard on the internet.
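
To illustrate how a consumer of such a document can check that traffic claiming to be a given bot really originates from its published ranges, here is a minimal Python sketch using only the standard library. The file name foobot-ranges.json and the function names are placeholders invented for this example, not part of any Wikimedia tooling:

import ipaddress
import json

def load_prefixes(path):
    """Parse an IP-ranges file in the format shown above into network objects."""
    with open(path) as f:
        data = json.load(f)
    networks = []
    for entry in data["prefixes"]:
        # Each entry carries either an "ipv4Prefix" or an "ipv6Prefix" key.
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        # strict=False tolerates prefixes with host bits set, like 1.2.3.4/27.
        networks.append(ipaddress.ip_network(prefix, strict=False))
    return networks

def ip_is_published(ip, networks):
    """Return True if the address falls inside one of the published ranges."""
    addr = ipaddress.ip_address(ip)
    # Only compare within the same address family; mixing versions raises TypeError.
    return any(addr in net for net in networks if net.version == addr.version)

networks = load_prefixes("foobot-ranges.json")
print(ip_is_published("1.2.3.5", networks))  # True with the example document above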

If you are making requests in large volumes, or your crawler gets banned and you suspect the offending traffic wasn't actually yours, we give you the option of providing us with a URL where we can download such a file.

There are a few rules you need to follow:

  • One document per User-Agent you use. The User-Agent needs to follow our User Agent Policy.
  • The document needs to conform to the format shown above. You can check the file you produce against this JSON schema file (see the validation sketch after this list). If your document doesn't validate against the schema, we won't import it.
  • The URL must be in the same second-level domain as the one used in the user-agent string. This means that if the bot has user agent FooBot/126 https://www.foobar.com, the URL where we download the file must be under foobar.com.
  • The IP ranges should be as small as possible. Simply put, we won't allow you to provide the whole IP space as a range, or even all of AWS.
  • We will download the IP ranges daily. We expect the ranges to remain stable over time.
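
As a rough illustration of the validation step, here is a sketch using the jsonschema Python package. The schema below is only an approximation reconstructed from the example document for illustration; the authoritative definition is the schema file linked above:

import json

from jsonschema import validate, ValidationError  # pip install jsonschema

# Approximate schema written for this example only; validate against the
# official schema file linked in the list above instead.
SCHEMA = {
    "type": "object",
    "required": ["creationTime", "prefixes"],
    "properties": {
        "creationTime": {"type": "string"},
        "prefixes": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "ipv4Prefix": {"type": "string"},
                    "ipv6Prefix": {"type": "string"},
                },
                # Each prefix entry holds exactly one of the two keys.
                "minProperties": 1,
                "maxProperties": 1,
            },
        },
    },
}

with open("foobot-ranges.json") as f:
    document = json.load(f)

try:
    validate(instance=document, schema=SCHEMA)
    print("document validates")
except ValidationError as err:
    print(f"invalid document: {err.message}")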

References
