HTTP proxy

For the Cloud VPS web proxy feature, see Help:Using a web proxy to reach Cloud VPS servers from the internet.

Never use this functionality to reach endpoints belonging to Wikimedia. You should be able to use the service mesh/services proxy to reach most internal endpoints in a safer, more reliable and more performant way. If you can't find out how, please reach out to SRE.

To allow HTTP requests to reach the outside world, we maintain a caching HTTP proxy in each datacenter. The proxies run on the install* servers and are exposed through service entries of the form webproxy.<datacenter>.wmnet.

How-to?

You can set the http_proxy and https_proxy environment variables to make many command-line tools use the site-specific proxy automatically.

The no_proxy and NO_PROXY variables are configured automatically across the infrastructure by the profile::environment Puppet module and Hiera settings.

Helper commands

In your terminal, run set_proxy. This sets the required environment variables for the active session.

unset_proxy will do the opposite.

Manual config

export http_proxy=http://webproxy:8080
export https_proxy=http://webproxy:8080
export no_proxy=127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org,.wikibooks.org,.wikiquote.org,.wiktionary.org,.wikisource.org,.wikispecies.org,.wikiversity.org,.wikidata.org,.mediawiki.org,.wikinews.org,.wikivoyage.org
export HTTP_PROXY=$http_proxy
export HTTPS_PROXY=$https_proxy
export NO_PROXY=$no_proxy
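When a script launches other tools, the same variables can be prepared for a single child process instead of being exported for the whole session. The following is a minimal sketch, not an official helper: the function name proxy_env is hypothetical, the port 8080 and the webproxy.<datacenter>.wmnet naming are taken from this page, and no_proxy is deliberately left untouched on the assumption that it is already managed centrally.

```python
import os

PROXY_PORT = 8080  # port used in the examples on this page


def proxy_env(datacenter, base_env=None):
    """Return a copy of the environment with proxy variables set.

    `datacenter` is a site name such as "eqiad". Existing no_proxy /
    NO_PROXY values in the base environment are kept as they are.
    """
    env = dict(os.environ if base_env is None else base_env)
    proxy_url = f"http://webproxy.{datacenter}.wmnet:{PROXY_PORT}"
    for name in ("http_proxy", "https_proxy"):
        env[name] = proxy_url          # lowercase form
        env[name.upper()] = proxy_url  # uppercase form, for tools that only read it
    return env
```

The returned mapping can be passed to a child process, for example subprocess.run(["wget", url], env=proxy_env("eqiad")), so that the proxy applies to that command only.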

"no_proxy" MUST be explicitly set
- Prevents unnecessary load on the proxies (to fetch internal resources)
- Prevents stale data cached on the proxies
- Prevents unnecessary dependencies
HTTP proxies SHOULD NOT be configured by default, but only case by case, where there is a need
- Prefer setting these variables for your current session only, using the helper commands at the terminal prompt
- Services should use Puppet to configure proxies
These proxies MUST NOT be used from Cloud VPS instances (enforced by ACLs)
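To see which hosts a no_proxy list is meant to keep away from the proxy, the check can be sketched as below. This is a simplified illustration under stated assumptions: the function name bypasses_proxy is hypothetical, the list is shortened from the manual config above, and the matching rule (exact match, or suffix match for entries starting with a dot) is an approximation, since real clients differ in how they interpret no_proxy.

```python
# Shortened copy of the no_proxy value from the manual config, for illustration.
NO_PROXY = "127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org,.wikidata.org"


def bypasses_proxy(host, no_proxy=NO_PROXY):
    """Return True if `host` should be contacted directly (simplified rules)."""
    host = host.lower().rstrip(".")
    for entry in no_proxy.split(","):
        entry = entry.strip().lower()
        if not entry:
            continue
        if entry.startswith("."):
            # ".wmnet" matches "foo.wmnet" and also the bare "wmnet"
            if host.endswith(entry) or host == entry[1:]:
                return True
        elif host == entry:
            return True
    return False
```

With this list, internal names such as mw-api-int-ro.discovery.wmnet are reached directly, while an external host such as www.google.com goes through the proxy.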

Internal endpoints

It is better to use internal endpoints instead of public ones; a list of reasons is given in this comment.

API

Use e.g. https://mw-api-int-ro.discovery.wmnet:4446 and set the HTTP Host header to the domain of the site you want to access, e.g. curl -H "Host: www.wikidata.org" https://mw-api-int-ro.discovery.wmnet:4446

MediaWiki On Kubernetes internal API endpoints:

Direct usage
- Read-only: https://mw-api-int-ro.discovery.wmnet:4446
- Read-write: https://mw-api-int.discovery.wmnet:4446
Listeners to use through the Envoy Services Proxy:
- Read-only: mw-api-int-async-ro
- Read-write: mw-api-int or mw-api-int-async

For examples in Python and R, refer to these notes.

LiftWing

See Machine Learning/LiftWing/Usage#Internal endpoints

A complete list exists at: https://config-master.wikimedia.org/discovery/discovery-basic.yaml

Example usage

curl

If you are using curl, you can use the --proxy flag:

curl --proxy http://webproxy.eqiad.wmnet:8080 http://www.google.com
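A script can select the proxy explicitly in the same way, without relying on environment variables. The following is a minimal sketch using the Python standard library; the function name build_proxied_opener is hypothetical, and the proxy URL is the one from the curl example.

```python
import urllib.request

PROXY_URL = "http://webproxy.eqiad.wmnet:8080"  # same proxy as the curl example


def build_proxied_opener(proxy_url=PROXY_URL):
    """Return an opener that sends both HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxied_opener()
# Sending is left commented out because it needs access to the proxy:
# with opener.open("http://www.google.com", timeout=10) as response:
#     print(response.status)
```

Because the proxy is attached to this opener only, other requests made by the same process are unaffected.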

wget

wget has no --proxy flag; set the appropriate environment variable instead.

https_proxy=http://webproxy:8080 wget https://www.google.com

Maven proxy configuration example

You can reference the proxy in your Maven settings file, ~/.m2/settings.xml, so that Maven fetches packages through it at build time.

<settings>
  <proxies>
    <proxy>
      <id>http-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>webproxy.eqiad.wmnet</host>
      <port>8080</port>
    </proxy>
    <proxy>
      <id>https-proxy</id>
      <active>true</active>
      <protocol>https</protocol>
      <host>webproxy.eqiad.wmnet</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>

ant

In addition to setting the environment variables defined above, invoke ant with the -autoproxy argument.

Spark

If your Spark job pulls dependencies via spark.jars.packages, you can point it to a settings file that takes care of proxying automatically by mirroring through our Archiva instance:

conf={
    ...
    "spark.jars.packages": "...",  # packages to pull go here
    "spark.driver.extraJavaOptions": "-Divy.cache.dir=/tmp/ivy_spark3/cache -Divy.home=/tmp/ivy_spark3/home ",
    "spark.jars.ivySettings": "/etc/maven/ivysettings.xml"
}

Monitoring

Access log dashboard: https://logstash.wikimedia.org/app/dashboards#/view/58c908a0-a394-11ec-bf8e-43f1807d5bc2

Requests: https://grafana.wikimedia.org/d/i5YA-BXWz/squid

Future/possible improvements

~~Helper script to correctly configure the proxies for the current user session - T278315 - global http_proxy setting~~
~~Centrally managed global no_proxy settings - T278315 - global http_proxy setting~~
Maybe restrict domains accessible by webproxy
Improve proxies redundancy - T242715

Reference

Gerrit change adding the DNS entries