HTTP proxy

From Wikitech

To allow HTTP requests to reach the outside world, we maintain a caching HTTP proxy in each datacenter. Each proxy runs on the install* servers and is exposed through a service entry of the form webproxy.<datacenter>.wmnet.

How-to?

You can set the http_proxy and https_proxy environment variables to make many command-line tools use the site-specific proxy automatically.

The no_proxy and NO_PROXY variables are configured automatically across the infrastructure by the profile::environment Puppet module and Hiera settings.

Helper commands

In your terminal, run set_proxy. This sets the needed environment variables for the current session.

unset_proxy will do the opposite.

Manual config

export http_proxy=http://webproxy:8080
export https_proxy=http://webproxy:8080
export no_proxy=127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org,.wikibooks.org,.wikiquote.org,.wiktionary.org,.wikisource.org,.wikispecies.org,.wikiversity.org,.wikidata.org,.mediawiki.org,.wikinews.org,.wikivoyage.org
export HTTP_PROXY=$http_proxy
export HTTPS_PROXY=$https_proxy
export NO_PROXY=$no_proxy
  • "no_proxy" MUST be set explicitly
    • Prevents unnecessary load on the proxies from fetching internal resources
    • Prevents stale cached data from being served by the proxies
    • Prevents unnecessary dependencies
  • HTTP proxies SHOULD NOT be configured by default, only case by case where there is a need
    • Prefer setting these variables for your current session only, using the helper commands at the terminal prompt
    • Services should use Puppet to configure proxies
  • These proxies MUST NOT be used from Cloud VPS instances (enforced by ACLs)
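
The variables above can also be scoped to a single child process instead of being exported for a whole shell, which fits the case-by-case guidance. The following Python sketch is illustrative only: the helper names proxy_env and bypasses_proxy are hypothetical, the no_proxy list is shortened, and the suffix matching is a simplified approximation of how many clients interpret no_proxy (individual tools differ in the details).

```python
import os

PROXY_URL = "http://webproxy:8080"  # same value as in the manual config
# Shortened, illustrative subset of the no_proxy list from the manual config.
NO_PROXY = "127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org"


def proxy_env(base=None):
    """Return a copy of the environment with the six proxy variables set."""
    env = dict(os.environ if base is None else base)
    for name, value in (
        ("http_proxy", PROXY_URL),
        ("https_proxy", PROXY_URL),
        ("no_proxy", NO_PROXY),
    ):
        env[name] = value
        env[name.upper()] = value  # many tools read only one of the two spellings
    return env


def bypasses_proxy(host, no_proxy=NO_PROXY):
    """Simplified check: True if host equals, or is a subdomain of, a no_proxy entry."""
    host = host.lower()
    for entry in no_proxy.split(","):
        suffix = entry.strip().lower().lstrip(".")
        if not suffix:
            continue
        if host == suffix or host.endswith("." + suffix):
            return True
    return False
```

A caller could then pass the result to a single command, for example subprocess.run(["wget", "https://www.google.com"], env=proxy_env()), leaving the parent session untouched.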

Internal endpoints

It is better to use internal endpoints instead of public ones. A list of reasons is given in this comment.

API

Use, for example, https://mw-api-int-ro.discovery.wmnet:4446 and set the HTTP Host header to the domain of the site you want to access:

curl -H "Host: www.wikidata.org" https://mw-api-int-ro.discovery.wmnet:4446
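
The same Host-header technique can be sketched in Python with the standard library. This is a minimal sketch rather than an endorsed client: the function names are hypothetical, the default path is an assumed example of a MediaWiki API query (the curl command shown here specifies no path), and it assumes the host's trust store accepts the endpoint's certificate. The empty ProxyHandler makes the request go directly to the internal endpoint, regardless of any proxy environment variables.

```python
import urllib.request

INTERNAL_ENDPOINT = "https://mw-api-int-ro.discovery.wmnet:4446"
# Assumed example path; adjust to the API call you actually need.
DEFAULT_PATH = "/w/api.php?action=query&meta=siteinfo&format=json"


def build_internal_request(site, path=DEFAULT_PATH):
    """Build a request to the internal endpoint, addressed to `site` via the Host header."""
    return urllib.request.Request(INTERNAL_ENDPOINT + path, headers={"Host": site})


def fetch(request, timeout=10):
    """Send the request directly, ignoring proxy environment variables."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as response:
        return response.read()
```

For example, fetch(build_internal_request("www.wikidata.org")) would query Wikidata through the internal endpoint.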

MediaWiki On Kubernetes internal API endpoints:

For examples in Python and R refer to these notes.

LiftWing

See Machine Learning/LiftWing/Usage#Internal endpoints

A complete list exists at: https://config-master.wikimedia.org/discovery/discovery-basic.yaml

Example usage

curl

If you are using curl, you can use the --proxy flag:

curl --proxy http://webproxy.eqiad.wmnet:8080 http://www.google.com
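
A per-request proxy, comparable to curl's --proxy flag, can also be set from Python without touching environment variables. The sketch below uses only the standard library; the helper name make_proxy_opener is hypothetical and the eqiad proxy URL is copied from the curl example.

```python
import urllib.request

PROXY_URL = "http://webproxy.eqiad.wmnet:8080"  # taken from the curl example


def make_proxy_opener(proxy_url=PROXY_URL):
    """Return an opener that sends both HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Calling make_proxy_opener().open("http://www.google.com", timeout=10) would then fetch the page through the proxy for that opener only.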

wget

wget has no --proxy flag; set the appropriate environment variable instead.

https_proxy=http://webproxy:8080 wget https://www.google.com

Maven proxy configuration example

You can reference the proxy in your Maven configuration file, ~/.m2/settings.xml, so that packages fetched at build time go through it.

<settings>
  <proxies>
    <proxy>
      <id>http-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>webproxy.eqiad.wmnet</host>
      <port>8080</port>
    </proxy>
    <proxy>
      <id>https-proxy</id>
      <active>true</active>
      <protocol>https</protocol>
      <host>webproxy.eqiad.wmnet</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>

ant

In addition to the environment variables defined above, invoke ant with the -autoproxy argument.

Spark

If your Spark job pulls dependencies via spark.jars.packages, you can point it to a settings file that automatically takes care of proxying by mirroring through our Archiva instance:

conf = {
    # ...
    "spark.jars.packages": "...",  # packages to pull go here
    "spark.driver.extraJavaOptions": "-Divy.cache.dir=/tmp/ivy_spark3/cache -Divy.home=/tmp/ivy_spark3/home",
    "spark.jars.ivySettings": "/etc/maven/ivysettings.xml"
}
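
As a small illustration of how these settings might be assembled, the sketch below builds the three entries from a list of Maven coordinates. The helper spark_package_conf is hypothetical, and the coordinates in the usage note are placeholders. The only fact it relies on beyond the snippet above is that spark.jars.packages takes a comma-separated list of coordinates.

```python
IVY_SETTINGS = "/etc/maven/ivysettings.xml"
IVY_OPTS = "-Divy.cache.dir=/tmp/ivy_spark3/cache -Divy.home=/tmp/ivy_spark3/home"


def spark_package_conf(packages):
    """Return the dependency-related Spark settings for a list of Maven coordinates."""
    return {
        # Spark expects a comma-separated list of group:artifact:version strings.
        "spark.jars.packages": ",".join(packages),
        "spark.driver.extraJavaOptions": IVY_OPTS,
        "spark.jars.ivySettings": IVY_SETTINGS,
    }
```

For instance, spark_package_conf(["org.example:lib-a:1.0", "org.example:lib-b:2.0"]) yields a dictionary that can be merged into the rest of a job's configuration.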

Monitoring

Access log dashboard: https://logstash.wikimedia.org/app/dashboards#/view/58c908a0-a394-11ec-bf8e-43f1807d5bc2

Requests: https://grafana.wikimedia.org/d/i5YA-BXWz/squid
