Wikimedia Cloud Services team/EnhancementProposals/Decision record T377467 cloud vps vxlan ipv6 migration

From Wikitech

Origin task: phab:T377467

Date of the decision: 2024-11-06

No decision meeting was needed; agreement was reached in the task.

Decision taken

Option 2 was chosen, to base the migration on VM rebuilds.

Rationale

  • no additional WMCS engineering time needs to be invested in migration scripts and such
  • no artificial downtime. A project admin explicitly creates a new virtual machine via horizon. Clean.
  • the introduction of the new IPv6 is fully under the control of the project admin
  • a shiny new IPv6 may be a good incentive for users to do the migration soon.

The migration will be triggered by project admins in a self-service fashion.

We will start with 3 network definitions in neutron, available via horizon:

  • VLAN/legacy
  • VXLAN/IPv4-only
  • VXLAN/IPv6-dualstack

Then we will have a migration timeline similar to this:

  • 2024-12-01: announcement about the transition. 3 network options available in horizon
  • 2025-02-01: (2 months later) the option to create VMs in VLAN/legacy is disabled in horizon. Only VXLAN/IPv4-only and VXLAN/IPv6-dualstack remain available in horizon.
  • [ .. from this point on the migration is progressing organically .. ]
  • 2025-12-01: (1 year later) we evaluate how the migration is progressing, and maybe automate some of it with a script if we need to accelerate it.
  • 2026-12-01: (2 years later) we expect no VMs in the legacy VLAN to exist. If some exist, we will evaluate what to do.
  • 20XX-XX-XX: (at some point TBD) we may want to disable the VXLAN/IPv4-only VM creation option, or keep it only for special cases upon request.
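
As an illustration only, the first two milestones of this timeline can be written as a small lookup that answers "which networks can a new VM be created in on a given date?". This is a minimal sketch, not WMCS tooling: the function name `creatable_networks` and the constants are hypothetical, the dates are copied from the timeline, the assumption that only VLAN/legacy is offered before the announcement is ours, and the TBD step for VXLAN/IPv4-only is left out because it has no date.

```python
from datetime import date

# Network names as listed in this decision record.
LEGACY = "VLAN/legacy"
V4_ONLY = "VXLAN/IPv4-only"
DUALSTACK = "VXLAN/IPv6-dualstack"

# Milestones copied from the timeline. The undated (20XX-XX-XX) step is not modelled.
ANNOUNCEMENT = date(2024, 12, 1)
LEGACY_CREATION_DISABLED = date(2025, 2, 1)


def creatable_networks(today: date) -> list[str]:
    """Return the networks offered for new VMs on the given date."""
    if today < ANNOUNCEMENT:
        # Assumption: before the announcement only the legacy network is offered.
        return [LEGACY]
    if today < LEGACY_CREATION_DISABLED:
        # Transition window: all 3 network options are available.
        return [LEGACY, V4_ONLY, DUALSTACK]
    # Legacy creation is disabled; existing legacy VMs are unaffected by this check.
    return [V4_ONLY, DUALSTACK]


if __name__ == "__main__":
    for day in (date(2024, 12, 15), date(2025, 3, 1)):
        print(day.isoformat(), creatable_networks(day))
```

The sketch only covers VM creation. Existing VMs in VLAN/legacy keep running until their project admins rebuild them, which is why the later milestones are evaluation points rather than hard cut-offs.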


Problem

Per phab:T364725, we need to migrate virtual machines from the old VLAN-based subnet to the new VXLAN-based subnet (which includes IPv6).

There are, however, different ways in which this can be done, depending on a number of factors, such as:

  • how much effort we want to put into it
  • how fast we want the migration to happen
  • what level of disruption is acceptable for our users
  • how confident we are that everything will just "work" (e.g., Toolforge migrating to IPv6: here be dragons)

Constraints and risks

Migrating a virtual machine to the new network requires downtime, in one of two forms:

  • the existing VM is rebooted with a new neutron port
  • the VM is replaced by a completely new one

Because the VM gets a new IP address, the migration also involves DNS changes.

Options

Option 1

Based on VM migration. Triggered by the WMCS team, with no project admin intervention.

Write a script that takes a VM and 'moves' it to the new network setup.

This is phab:T377346.

Pros:

  • can be effective in completing the migration somewhat "fast"

Cons:

  • crafting the script can be a costly task (in terms of engineering time)
  • it may involve introducing artificial downtime for user VMs
  • it may involve modifying the VM filesystem, which sounds scary
  • it is less "clean" compared to option 2
  • risk of introducing IPv6, outside the project admin's control, on systems that may break if they are not ready


Option 2

Based on VM rebuilds. Triggered by project admins in a self-service fashion.

If a VM needs to move to the new network setup, it needs to be rebuilt. Users do this themselves, self-service, through their normal workflows (e.g., horizon, tofu).

We could start with 3 network definitions in neutron, available via horizon:

  • VLAN/legacy
  • VXLAN/IPv4-only
  • VXLAN/IPv6-dualstack

Then we could have a migration timeline similar to this:

  • 2024-12-01: announcement about the transition. 3 network options available in horizon
  • 2025-02-01: (2 months later) the option to create VMs in VLAN/legacy is disabled in horizon. Only VXLAN/IPv4-only and VXLAN/IPv6-dualstack remain available in horizon.
  • [ .. from this point on the migration is progressing organically .. ]
  • 2025-12-01: (1 year later) we evaluate how the migration is progressing, and maybe automate some of it with a script if we need to accelerate it.
  • 2026-12-01: (2 years later) we expect no VMs in the legacy VLAN to exist. If some exist, we will evaluate what to do.
  • 20XX-XX-XX: (at some point TBD) we may want to disable the VXLAN/IPv4-only VM creation option, or keep it only for special cases upon request.

Pros:

  • no additional WMCS engineering time needs to be invested in migration scripts and such
  • no artificial downtime. A project admin explicitly creates a new virtual machine via horizon. Clean.
  • the introduction of the new IPv6 is fully under the control of the project admin
  • a shiny new IPv6 may be a good incentive for users to do the migration soon.

Cons:

  • not automated: it requires project admin intervention, so we depend on action from the community
  • will delay completion of the network migration

Option 3

Mixed approach. Focus on the self-service VM rebuild approach, but create a script to handle some other complex cases.