News/CloudVPS NAT wikis

The user-facing changes proposed on this page are currently "on hold" pending further investigation of possible side effects for the wikis.

This page describes a network change in the Cloud VPS service that affects how processes running inside the cloud reach WMF-hosted wiki services, including Wikidata, Commons, etc.

What is changing?

The Cloud VPS network has a general egress NAT public IPv4 address, 185.15.56.1. This public address is used to translate the private addresses of internal virtual machines, which belong to the 172.16.0.0/21 range.

Historically, however, a networking policy exception prevented the private range from being translated when the destination of the network connection was a WMF-hosted wiki. This means that, prior to this change, WMF-hosted wikis saw the private address of the internal virtual machine.

The change covered on this page is to drop this networking exception, so that WMF-hosted wikis will see network traffic originating from Cloud VPS as coming from the general NAT address.
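As a rough illustration of the policy described above, the following sketch models which source address a WMF-hosted wiki would observe with and without the exception. The helper name `source_seen_by_wiki` and the sample VM address are hypothetical; only the 172.16.0.0/21 range and the 185.15.56.1 address come from this page. It models the policy only, not the actual router configuration.

```python
import ipaddress

# Values taken from this page.
VM_RANGE = ipaddress.ip_network("172.16.0.0/21")
NAT_ADDRESS = ipaddress.ip_address("185.15.56.1")


def source_seen_by_wiki(vm_address: str, exception_enabled: bool) -> str:
    """Return the source address a WMF-hosted wiki would observe.

    Hypothetical helper for illustration; not the real router logic.
    """
    addr = ipaddress.ip_address(vm_address)
    if addr not in VM_RANGE:
        raise ValueError(f"{vm_address} is not in {VM_RANGE}")
    # With the exception in place the private address is left untranslated;
    # without it, the general egress NAT address is used.
    return str(addr) if exception_enabled else str(NAT_ADDRESS)


if __name__ == "__main__":
    # Before the change (exception enabled): the private address is visible.
    print(source_seen_by_wiki("172.16.1.23", exception_enabled=True))
    # After the change (exception dropped): the NAT address is visible.
    print(source_seen_by_wiki("172.16.1.23", exception_enabled=False))
```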

The following diagrams may help visualize what is changing.

Before this change:

After this change:

What is not changing?

Anything not explicitly described above is not part of the change. In particular, the following are not affected:

  • access to dumps. Not affected by this change.
  • access to Wiki replicas. Not affected by this change.
  • access to ToolsDB. Not affected by this change.
  • ssh access to bastions (for example, Toolforge bastions). Not affected by this change.
  • how internet users access webservices running in Cloud VPS or Toolforge. Not affected by this change.
  • how your Cloud VPS virtual machine or your Toolforge tool contacts other internet endpoints. Not affected by this change.

Timeline

This change will follow this timeline:

  • 2021-01-25: announce the change to the community. Ask for feedback.
  • 2021-02-01: after collected feedback was evaluated, it was decided to delay this project.

What should I do?

If you are a Cloud VPS project owner / user

If your virtual machine instances contact WMF-hosted wikis in any way, be aware of this change. You don't have to do anything specific other than monitor that your services keep working as expected.

If you want to be on the safe side, we suggest you review the OAuth and/or bot passwords configuration for your tool to ensure it also allows the new IP address 185.15.56.1.

In case you detect your service no longer works due to a WMF-hosted wiki block, ratelimit or similar, please contact the WMCS team.

If you are a Toolforge developer / user

It is very likely that your Toolforge tool interacts in some way with WMF-hosted wikis. You don't have to do anything specific other than monitor that your tool keeps working as expected.

If you want to be on the safe side, we suggest you review the OAuth and/or bot passwords configuration for your tool to ensure it also allows the new IP address 185.15.56.1.

In case you detect your tool no longer works due to a WMF-hosted wiki block, ratelimit or similar, please contact the WMCS team.

If you are a CheckUser

Make sure that you and other CheckUsers are aware that edits from WMCS will be coming from 185.15.56.1 instead of from 172.16.0.0/21. Most WMF-hosted wikis require that all edits from WMCS addresses are made by logged-in users, with a few exceptions (testwiki, test2wiki, testwikidatawiki and testcommonswiki).

If you are a WMF engineer or SRE involved with the wikis

There will be many new connections to the wikis coming from 185.15.56.1, potentially reaching all kinds of wiki endpoints and HTTP actions, including APIs, downloads, uploads, etc. There is a risk that service ratelimits on your side affect the new traffic, so you may need to monitor closely for a few days. Be ready to tune some of the ratelimiting values and configurations.

Beware that this address will now be responsible for about 30%-40% of wiki edits.

Some mechanisms that will potentially require adjusting:

  • varnish (or caching) level ratelimits
  • varnish (or caching) level UA blocks -- shouldn't be a problem if they are exclusively based on UserAgent and not IP address
  • DoS alerts
  • Anti Harassment Tools
  • mediawiki bad logins limit
  • some others we are not aware of just yet

Note that, at the time of this writing, the following wikis seem to accept anonymous writes from WMCS and may need extra consideration:

  • testwiki
  • test2wiki
  • testwikidatawiki
  • testcommonswiki

Solutions to common problems

Here is a list of solutions to common problems we are collecting.

Tools with restricted OAuth or password

Some bots and tools use OAuth or bot passwords in a security-conscious way, restricting the IP range they may be used from. This restriction is stored in the MediaWiki database.

See mw:Manual:Bot_passwords_table and https://doc.wikimedia.org/mediawiki-core/master/php/classMWRestrictions.html for a better understanding of this restriction.

As of this writing we are still evaluating the impact, but the information stored in the database will very likely need to change to allow the new IP address 185.15.56.1. We will post more information here as it becomes clear what to do.
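To check whether an existing restriction would still admit the new address, a small sketch like the one below may help. It assumes you have already extracted the allowed ranges as a plain list of CIDR strings; the function name `is_allowed` and the sample lists are hypothetical, and the way the ranges are read from the database is out of scope here.

```python
import ipaddress

# The general egress NAT address named on this page.
NAT_ADDRESS = "185.15.56.1"


def is_allowed(address: str, allowed_ranges: list[str]) -> bool:
    """Return True if `address` falls inside any of the given CIDR ranges."""
    ip = ipaddress.ip_address(address)
    # strict=False tolerates entries with host bits set, e.g. "172.16.0.1/21".
    return any(
        ip in ipaddress.ip_network(cidr, strict=False) for cidr in allowed_ranges
    )


if __name__ == "__main__":
    # A restriction that only lists the Cloud VPS private range (sample data).
    print(is_allowed(NAT_ADDRESS, ["172.16.0.0/21"]))
    # The same restriction extended with the NAT address.
    print(is_allowed(NAT_ADDRESS, ["172.16.0.0/21", "185.15.56.1/32"]))
```

A restriction that only lists the private range would reject requests after the change, which is why the stored ranges may need updating.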

MediaWiki has a strict limit on bad logins

The MediaWiki deployments serving WMF-hosted wikis may have a strict limit on bad logins, per source IP address. The potential impact of this is a misbehaving bot blocking login for other bots running on CloudVPS/Toolforge.
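The toy model below illustrates why a per-source-IP limit becomes a shared limit once every client sits behind the same NAT address. The threshold and the private addresses are made up for illustration, and the counting logic is an assumption, not MediaWiki's actual throttle implementation.

```python
from collections import Counter

# Hypothetical threshold, chosen only for illustration.
MAX_FAILED_LOGINS_PER_IP = 5


def throttled_sources(failed_login_sources, limit=MAX_FAILED_LOGINS_PER_IP):
    """Return the source IPs whose failed-login count reached the limit."""
    counts = Counter(failed_login_sources)
    return {ip for ip, failures in counts.items() if failures >= limit}


# One misbehaving bot fails 5 times; a well-behaved bot fails once.
before = ["172.16.1.10"] * 5 + ["172.16.2.20"]
# After the change, the wiki sees both bots as the NAT address.
after = ["185.15.56.1"] * len(before)

if __name__ == "__main__":
    print(throttled_sources(before))  # only the misbehaving bot's address
    print(throttled_sources(after))   # the NAT address shared by all bots
```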

We are currently investigating this issue and gathering more data. We will post more information here as it becomes clear what to do.

NAT overflow

We currently hold roughly 300K to 500K connections in the neutron router. NAT overloading could be a risk, especially if every internal VM instance were to contact the exact same destination address/port.

The nf_conntrack/nf_nat engine uses 4-tuples, and therefore the maximum number of connections allowed to a single destination IP/port is 65535 - 1024 + 1 = 64512.

This means that we will support a maximum of roughly 64K connections to the same wiki endpoint (like text-lb:443).

At the time of this writing, we observe the following:

aborrero@cloudnet1004:~ $ sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a conntrack -L --dst 208.80.154.224 | wc -l
conntrack v1.4.5 (conntrack-tools): 21527 flow entries have been shown.
21527
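The arithmetic above can be reproduced with a short sketch. The port bounds and the observed flow count are taken from this section; the variable and function names are illustrative only.

```python
# Source port range available to the NAT, as used in the text above.
FIRST_PORT = 1024
LAST_PORT = 65535

# Maximum number of simultaneous flows to one destination IP/port.
MAX_FLOWS_PER_DESTINATION = LAST_PORT - FIRST_PORT + 1

# Flow count reported by the conntrack command above.
OBSERVED_FLOWS = 21527


def utilization(observed: int, capacity: int) -> float:
    """Return the fraction of the per-destination capacity in use."""
    return observed / capacity


if __name__ == "__main__":
    print(MAX_FLOWS_PER_DESTINATION)
    print(f"{utilization(OBSERVED_FLOWS, MAX_FLOWS_PER_DESTINATION):.1%}")
```

With these numbers, the observed flows use roughly a third of the per-destination limit.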

Why are we doing this?

There are several technical reasons that suggest this change should be done as soon as possible.

One of the most important is realm separation. WMF-hosted wikis run in a realm which we can call wikis production, whereas Cloud VPS runs in a realm called cloud production. Network connections between the two realms should not receive any special network treatment or exception, hence the need to introduce the NAT that was previously disabled. When the NAT is in place, WMF-hosted wikis will see and handle connections from Cloud VPS as they would those from any other internet client.

Another reason for this change is that this exception has been identified as a significant burden to maintain properly. By removing it we are trying to reduce engineering technical debt and therefore ease maintenance.

This change is one of the smaller pieces of a bigger architectural change that we will be introducing in the upcoming months, as part of the 2020 network refresh project.

See also