Vlan migration

From Wikitech

Migration to per rack vlans

As you might have heard before, we're migrating our network infrastructure to a per rack vlan design (also known as L3 to the Top Of Rack).

And we need your help!

Why ?

Our "legacy" design consists of vlans shared across all the racks of a given row (usually 8 racks per row in core DCs). This had the upside of easier management, but now that we have proper automation, it is not that much of an upside.

However the downsides are significant:

  • Shared fate (a network flood or platform bug would impact all the hosts in the row)
  • Higher cost (requires special licenses)
  • Higher complexity to troubleshoot

Where ?

Now:

  • codfw (baremetal servers still in the legacy "private" vlans, see "How ?" below)

No migration needed, already 100% on per rack vlans:

  • eqiad rows E and F
  • drmrs
  • magru
  • esams


In the future :

  • eqiad rows A-D
  • eqsin
  • ulsfo

How ?

If you're provisioning a new server, you have nothing to do. Hurray.

For now let's focus on the 286 baremetal codfw servers that are still in the legacy "private" vlans (down from 341 in March and 557 in October 2024). You can see them grouped on this Netbox report : https://netbox.wikimedia.org/extras/scripts/19/

You can also query them by team-ownership with (adapt the first alias for your own team):

sudo cumin 'A:owner-collaboration-services and A:codfw and P{F:fqdn ~ ".wmnet$"} and not A:vms and not P{F:netmask = "255.255.255.0"}'
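Since only the first alias changes from team to team, the query can be wrapped in a small helper. This is a minimal sketch: the function name legacy_vlan_query is hypothetical, and it only builds the query string shown above without running cumin. The netmask filter presumably works because hosts already on a per rack vlan have a 255.255.255.0 netmask; that is an inference from the query itself, not something stated on this page.

```shell
# Hypothetical helper: print the cumin query from this page for a given
# owner alias (for example "owner-collaboration-services").
# Only the first alias is parametrized; the rest is copied from the query above.
legacy_vlan_query() {
    printf '%s\n' "A:$1 and A:codfw and P{F:fqdn ~ \".wmnet\$\"} and not A:vms and not P{F:netmask = \"255.255.255.0\"}"
}

# Usage, on a cumin host:
#   sudo cumin "$(legacy_vlan_query owner-collaboration-services)"
```

Keeping the query in one place avoids copy-paste drift when several people on the same team run it.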

To avoid waiting for the regular 5-year refresh cycle, we're kindly asking SREs to re-image their servers using the --move-vlan argument.
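For a batch of hosts, it can help to print the commands first and review them before running anything. The sketch below only prints commands, it does not execute them. Only --move-vlan comes from this page: the cookbook name sre.hosts.reimage and the --os and -t options are assumptions to check against the cookbook's --help, and the hostnames, OS name and task ID are placeholders.

```shell
# Hypothetical helper: print (do not run) one re-image command per host.
# Assumed, not confirmed by this page: the cookbook name "sre.hosts.reimage"
# and its --os / -t options. Only --move-vlan is taken from the text above.
print_reimage_commands() {
    os="$1"
    task="$2"
    shift 2
    for host in "$@"; do
        printf 'sudo cookbook sre.hosts.reimage --os %s -t %s --move-vlan %s\n' \
            "$os" "$task" "$host"
    done
}

# Placeholder values; replace with the real OS, task ID and hostnames.
print_reimage_commands bookworm T000000 example2001 example2002
```

Printing instead of executing keeps the sketch safe to copy, and leaves the decision of when to re-image each host to the service owner.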

But beware: as we're changing the host's vlan, we're also changing its IP!

For a lot of servers this doesn't matter and will work out of the box, for example if the server is referenced by its hostname in other parts of the infra. But for others (like databases), where IPs are hard-coded or cached in exotic places, it can be more complex.

The cookbook will run some validations before doing the re-numbering, as well as check for the potential presence of the server's "old" IP in various repositories.
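A similar check can be done by hand on a local checkout before starting. The sketch below is not the cookbook's implementation, just a plain grep under stated assumptions: the function name find_ip_references is hypothetical, and the hostname and path in the usage comment are placeholders.

```shell
# Hypothetical helper: list files under a directory that mention an IP literally.
#   -r  recurse into the directory
#   -l  print only the names of matching files
#   -F  treat the IP as a fixed string (dots are not regex wildcards)
#   -w  match whole words only, so 192.0.2.1 does not match 192.0.2.10
# grep exits with status 1 when nothing matches.
find_ip_references() {
    grep -rlFw -- "$1" "$2"
}

# Usage (placeholder hostname and checkout path):
#   find_ip_references "$(dig +short example2001.codfw.wmnet)" ~/src/puppet
```

An empty result only means the IP is absent from that checkout; IPs cached in running services will not show up this way.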

If the host has BGP configured, the migration should take care of creating the new sessions automatically, but won't clean up the old ones. Some edge cases might also exist where the BGP peers' IPs are hard-coded.

Don't hesitate to contact Netops or the Infrastructure Foundation team if you need help preparing this migration.

Kudos to the Service Operations team, which has been massively and successfully migrating its mw (wikikube-worker) servers.

What's next ?

The above only applies to baremetal servers in the private vlans.

Ganeti is being tackled separately, see Ganeti#Routed Ganeti.

Baremetal hosts in the public vlans have nothing to do for now.