HTTP proxy
To allow HTTP requests to reach the outside world, we maintain a caching HTTP proxy in each datacenter. The proxies are exposed through service entries of the form webproxy.<datacenter>.wmnet and run on the install* servers.
How-to?
You can set the http_proxy and https_proxy environment variables to make many command-line scripts use the site-specific proxy automatically.
The no_proxy and NO_PROXY variables are configured automatically across the infrastructure by the profile::environment Puppet module and Hiera settings.
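As an illustration of what "automatically" means for one common case, the sketch below shows how Python's standard library picks up http_proxy and https_proxy. The proxy URL is the one from the Manual config section; the variables are set inside the script only to keep the example self-contained, which is an assumption of this sketch and not the recommended way to configure them.

```python
import os
import urllib.request

# Set here for illustration only; normally the shell session provides these
# (for example via the set_proxy helper).
os.environ["http_proxy"] = "http://webproxy:8080"
os.environ["https_proxy"] = "http://webproxy:8080"

# getproxies() scans the environment for <scheme>_proxy variables and
# returns a mapping of scheme to proxy URL.
proxies = urllib.request.getproxies()
print(proxies["http"])
print(proxies["https"])
```

urllib.request.urlopen() consults the same environment variables by default, so a script using it needs no proxy-specific code once the variables are set.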
Helper commands
In your terminal, run set_proxy. This sets the needed environment variables for the active session. unset_proxy does the opposite.
Manual config
export http_proxy=http://webproxy:8080
export https_proxy=http://webproxy:8080
export no_proxy=127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org,.wikibooks.org,.wikiquote.org,.wiktionary.org,.wikisource.org,.wikispecies.org,.wikiversity.org,.wikidata.org,.mediawiki.org,.wikinews.org,.wikivoyage.org
export HTTP_PROXY=$http_proxy
export HTTPS_PROXY=$https_proxy
export NO_PROXY=$no_proxy
- "no_proxy" MUST be explicitly set
  - Prevents unnecessary load on the proxies (to fetch internal resources)
  - Prevents stale data cached on the proxies
  - Prevents unnecessary dependencies
- HTTP proxies SHOULD NOT be configured by default, but on a case-by-case (as-needed) basis
  - It's preferred to set these variables for your current session only by using the helper commands at the terminal prompt
  - Services should leverage Puppet to configure proxies
- These proxies MUST NOT be used from Cloud VPS instances (enforced by ACLs)
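To make the no_proxy rule concrete, here is a minimal sketch of how a client might decide whether a host bypasses the proxy. The function name bypasses_proxy is hypothetical, and the matching rule (exact match or domain-suffix match) is a simplification: real tools differ in how they treat ports, leading dots, and IP ranges, so treat this as an illustration of the idea, not as the behavior of any specific tool.

```python
# The no_proxy value from the Manual config section.
NO_PROXY = (
    "127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org,"
    ".wikibooks.org,.wikiquote.org,.wiktionary.org,.wikisource.org,"
    ".wikispecies.org,.wikiversity.org,.wikidata.org,.mediawiki.org,"
    ".wikinews.org,.wikivoyage.org"
)


def bypasses_proxy(host: str, no_proxy: str = NO_PROXY) -> bool:
    """Return True if host matches an entry of a comma-separated no_proxy list.

    Simplified rule (an assumption of this sketch): an entry matches when
    the host equals it or is a subdomain of it.
    """
    host = host.lower()
    for entry in no_proxy.split(","):
        bare = entry.strip().lower().lstrip(".")
        if not bare:
            continue
        if host == bare or host.endswith("." + bare):
            return True
    return False


print(bypasses_proxy("mw-api-int-ro.discovery.wmnet"))  # True: internal host
print(bypasses_proxy("www.wikidata.org"))               # True: our own site
print(bypasses_proxy("www.google.com"))                 # False: goes via the proxy
```

Under this rule, requests to internal and Wikimedia-hosted names never touch the proxy, which is what keeps load, stale cache entries, and extra dependencies off the proxies.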
Internal endpoints
Prefer internal endpoints over public ones; a list of reasons is given in this comment.
API
Use an internal endpoint such as https://mw-api-int-ro.discovery.wmnet:4446 and set the HTTP Host header to the domain of the site you want to access, for example:
curl -H "Host: www.wikidata.org" https://mw-api-int-ro.discovery.wmnet:4446
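The same Host-header approach can be sketched in Python with the standard library. The helper name build_internal_request and the /w/api.php siteinfo query are illustrative assumptions, not part of this page; only the endpoint and the Host header come from the text above. The request is built but not sent, since sending it only works from a host that can reach the internal endpoint.

```python
import urllib.request

# Read-only internal endpoint named in this section.
ENDPOINT = "https://mw-api-int-ro.discovery.wmnet:4446"


def build_internal_request(site, path="/w/api.php?action=query&meta=siteinfo&format=json"):
    """Build a request to the internal endpoint, addressed to `site`.

    `path` is an illustrative MediaWiki API query (an assumption of this sketch).
    """
    return urllib.request.Request(ENDPOINT + path, headers={"Host": site})


req = build_internal_request("www.wikidata.org")
print(req.full_url)
print(req.get_header("Host"))

# From a production host, the request could then be sent with:
# with urllib.request.urlopen(req) as resp:
#     body = resp.read()
```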
MediaWiki On Kubernetes internal API endpoints:
- Direct usage
  - Read-only: https://mw-api-int-ro.discovery.wmnet:4446
  - Read-write: https://mw-api-int.discovery.wmnet:4446
- Listeners to use through the Envoy Services Proxy:
  - Read-only: mw-api-int-async-ro
  - Read-write: mw-api-int or mw-api-int-async
For examples in Python and R refer to these notes.
LiftWing
See Machine Learning/LiftWing/Usage#Internal endpoints
A complete list exists at: https://config-master.wikimedia.org/discovery/discovery-basic.yaml
Example usage
curl
If you are using curl, you can use the --proxy flag:
curl --proxy http://webproxy.eqiad.wmnet:8080 http://www.google.com
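For a Python script, a rough equivalent of passing --proxy explicitly is to build an opener with a ProxyHandler, so the proxy applies to that opener only and the process environment stays untouched. This is a minimal sketch using the eqiad proxy from the curl example; the actual request is left commented out because it only succeeds from a host that can reach the proxy.

```python
import urllib.request

PROXY = "http://webproxy.eqiad.wmnet:8080"

# Route both schemes through the proxy for this opener only.
handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(handler)

print(handler.proxies)

# From a host with access to the proxy:
# with opener.open("http://www.google.com", timeout=10) as resp:
#     print(resp.status)
```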
wget
wget has no --proxy flag; set the appropriate environment variable instead:
https_proxy=http://webproxy:8080 wget https://www.google.com
Maven proxy configuration example
You can reference the proxy in your Maven settings file, ~/.m2/settings.xml, so that packages fetched at build time go through it.
<settings>
  <proxies>
    <proxy>
      <id>http-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>webproxy.eqiad.wmnet</host>
      <port>8080</port>
    </proxy>
    <proxy>
      <id>https-proxy</id>
      <active>true</active>
      <protocol>https</protocol>
      <host>webproxy.eqiad.wmnet</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>
ant
In addition to the environment variables defined above, invoke ant with the -autoproxy argument.
Spark
If your Spark job pulls dependencies via spark.jars.packages, you can point it to a settings file that takes care of proxying automatically by mirroring through our Archiva instance:
conf={
    ...
    "spark.jars.packages": "...",  # packages to pull go here
    "spark.driver.extraJavaOptions": "-Divy.cache.dir=/tmp/ivy_spark3/cache -Divy.home=/tmp/ivy_spark3/home ",
    "spark.jars.ivySettings": "/etc/maven/ivysettings.xml"
}
Monitoring
Access log dashboard: https://logstash.wikimedia.org/app/dashboards#/view/58c908a0-a394-11ec-bf8e-43f1807d5bc2
Requests: https://grafana.wikimedia.org/d/i5YA-BXWz/squid
Future/possible improvements
- Helper script to correctly configure the proxies for the current user session - T278315 - global http_proxy setting
- Centrally managed global no_proxy settings - T278315 - global http_proxy setting
- Maybe restrict domains accessible by webproxy
- Improve proxies redundancy - T242715
Reference
See also
- url-downloader (another set of squid proxies for slightly different use cases)
- T254011: Why do we have 2 sets of squid proxies?
- We need to talk: Can we standardize NO_PROXY? - useful blogpost about proxy settings support across tools