User:Nskaggs/draft-cloudgw/implementation details
Implementation details
Some details about what the implementation will look like.
Risks
- Given the differences in hardware between codfw and eqiad, some of the changes / experiments will have to occur with limited testing outside of eqiad before deployment.
eqiad
In the eqiad datacenter, related to the eqiad1 openstack deployment.
specs for eqiad1
On the cloudgw side, each server:
- Hardware, misc box
- CPU: 16 CPU
- RAM: 32 GB
- Disk: 500GB
- 2 x 10Gbps NICs. NICs are bonded/teamed/aggregated for redundancy.
- Software
- standard puppet management
- prometheus metrics, icinga monitoring
- netfilter for NAT/firewalling
- keepalived or corosync/pacemaker for HA
On the cloudsw side, each device:
- Juniper QFX5100 switches with L3 routing licenses
network setup in eqiad1
allocations
IPv4 allocations:
185.15.56.0/24
185.15.56.0/25 - Openstack instances NAT
185.15.56.128/26 - reserved for the above growth
185.15.56.192/27 - unused
185.15.56.224/28 - unused
185.15.56.240/28 - infrastructure
185.15.56.240/29 - 1120 - cloud-instances-transport1
185.15.56.248/31 - 1104 - cloudsw1-c8<->cloudsw1-d5 - cloud-xlink1
185.15.56.250/31 - unused
185.15.56.252/30 - loopbacks
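The allocation plan above can be sanity-checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the helper name `tiles_exactly` and the two list names are illustrative assumptions, not part of any existing tooling. It checks that a set of child prefixes covers a parent prefix completely, with no gaps and no overlaps.

```python
import ipaddress


def tiles_exactly(parent, children):
    """Return True if the child prefixes cover the parent with no gaps or overlaps."""
    parent_net = ipaddress.ip_network(parent)
    child_nets = [ipaddress.ip_network(c) for c in children]
    # Every child must sit inside the parent.
    if not all(c.subnet_of(parent_net) for c in child_nets):
        return False
    # Matching address counts rule out overlaps once full coverage is confirmed.
    if sum(c.num_addresses for c in child_nets) != parent_net.num_addresses:
        return False
    # Children that cover everything collapse back into the parent prefix.
    return list(ipaddress.collapse_addresses(child_nets)) == [parent_net]


# Prefixes copied from the eqiad allocation list.
EQIAD_TOP_LEVEL = [
    "185.15.56.0/25",    # Openstack instances NAT
    "185.15.56.128/26",  # reserved for growth
    "185.15.56.192/27",  # unused
    "185.15.56.224/28",  # unused
    "185.15.56.240/28",  # infrastructure
]
EQIAD_INFRASTRUCTURE = [
    "185.15.56.240/29",  # vlan 1120, cloud-instances-transport1
    "185.15.56.248/31",  # vlan 1104, cloud-xlink1
    "185.15.56.250/31",  # unused
    "185.15.56.252/30",  # loopbacks
]

print(tiles_exactly("185.15.56.0/24", EQIAD_TOP_LEVEL))
print(tiles_exactly("185.15.56.240/28", EQIAD_INFRASTRUCTURE))
```

Both checks pass for the prefixes as listed, so the /24 and the infrastructure /28 are each fully accounted for.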
VLAN allocations:
1102 - cr1<->cloudsw1-c8 - cloud-transit1-eqiad
1103 - cr2<->cloudsw1-d5 - cloud-transit2-eqiad
1104 - cloudsw1-c8<->cloudsw1-d5 - cloud-xlink1-eqiad
1105 - cloud-instances1-eqiad
1106 - cloud-storage1-eqiad
1107 - cloudsw1<->cloudgw - cloud-gw-transport-eqiad ?
1118 - cloud-hosts1-eqiad
1120 - cloud-instances-transport1-eqiad
stage 0
starting network setup
VLAN | Switched on | L2 Members | L3 Gateway (“to internet”) |
cloud-hosts1-eqiad | asw2-b | cr1/2, all cloudvirt eth0, all Ceph OSD eth0 | cr1/2 (via asw2-b) |
cloud-instances2-eqiad | asw2-b | all cloud VPS, all cloudvirt eth1, cloudnet1003/1004 eth1 | cloudnet1003/1004 eth1 |
cloud-instances-transport1-eqiad | asw2-b | cloudnet1003/1004 eth0 | cr1/2 |
cloud-storage1-eqiad | asw2-b | all cloudcephosd eth1 | (none) |
stage 1: Route cloud-hosts vlan through cloudsw
The cloud-hosts vlan, which is part of the production realm, is currently routed on cr1/2-eqiad:ae2.1118, the interfaces facing asw2-b-eqiad.
To better separate the WMCS and production realms, that routing should be moved to cr1/2-eqiad:xe-3/0/4.1118, the interfaces facing cloudsw.
This will contribute to goals (A) and (C) of the cloudsw project.
This is a low complexity change. See https://phabricator.wikimedia.org/T261866 for the implementation.
stage 2A: enable L3 routing on cloudsw nodes
This will contribute to goals (A), (B), (C) and (D) of the cloudsw project.
Steps (to be moved to a task for implementation):
- Baseline configuration
- Cloudsw vlans (L2) - 1102, 1103, 1104, 1120
- iBGP and OSPF between cloudsw
- eBGP between core routers and cloudsw (advertise 208.80.155.88/29, 185.15.56.0/24 and 172.16.0.0/21, receive 0/0)
- Static route for 185.15.56.0/25 and 172.16.0.0/21 on cloudsw
- Firewall filters - lo, cloud-in4 (on core routers)
- Test connectivity
- cloud-instances-transport migration (downtime required [!])
- Ensure cr1 is VRRP master for all vlans, including 1120
- Move cr2:ae2.1120 to cloudsw1-d5:irb.1120
- Test cr1:ae2.1120 to cloudsw1-d5:irb.1120 connectivity (and VRRP sync)
- [!] Move vlan 1120 VRRP master to cloudsw1-d5:irb.1120
- [!] Remove static routes for 185.15.56.0/25 and 172.16.0.0/21 on core routers
- Test connectivity
- Move cr1:ae2.1120 to cloudsw1-c8:irb.1120
- Cleanup (remove passive OSPF, trunked vlans, update Netbox)
- Renumber cloud-instances-transport (downtime required [!]) [Could be done when introducing cloudgw] similar to https://phabricator.wikimedia.org/T207663
- Configure 185.15.56.240/29 IPs on all devices
- [!] Reconfigure cloudnet with new gateway IP (to be confirmed)
- Update static routes on cloudsw to point to new VIP
- Cleanup 208.80.155.88/29 IPs and advertisement (+Netbox)
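To illustrate why the more specific static routes on cloudsw matter alongside the default route received over eBGP, here is a minimal longest-prefix-match sketch. The prefixes come from the steps above; the next-hop labels and the `lookup` helper are illustrative placeholders, not actual device configuration.

```python
import ipaddress

# Simplified view of a cloudsw routing table after the baseline configuration.
# Next-hop labels are descriptive placeholders (assumptions), not real addresses.
ROUTES = [
    ("0.0.0.0/0", "core routers (default via eBGP)"),
    ("185.15.56.0/25", "cloudnet (static route)"),
    ("172.16.0.0/21", "cloudnet (static route)"),
]


def lookup(destination, routes=ROUTES):
    """Return the next hop of the most specific route matching the destination."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net.prefixlen, next_hop))
    if not matches:
        return None
    # Longest prefix wins: pick the match with the largest prefix length.
    return max(matches, key=lambda m: m[0])[1]


print(lookup("185.15.56.10"))   # inside the NAT /25
print(lookup("172.16.3.4"))     # inside the instances /21
print(lookup("198.51.100.1"))   # anything else follows the default route
```

In this sketch, traffic for the instance ranges is handed to cloudnet while everything else follows the default route towards the core routers.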
At this stage:
VLAN | Switched on | L2 Members | L3 Gateway (“to internet”) |
cloud-hosts1-eqiad | asw2-b*, cloudsw | cr1/2, all cloudvirt eth0, all Ceph OSD eth0 | cr1/2 (via cloudsw) |
cloud-instances2-eqiad | asw2-b*, cloudsw | all cloud VPS, all cloudvirt eth1, cloudnet1003/1004 eth1 | cloudnet1003/1004 eth1 |
cloud-instances-transport1-eqiad | asw2-b*, cloudsw | cloudsw, cloudnet1003/1004 eth0 | cloudsw |
cloud-transit1/2-eqiad | cloudsw | cr1/2, cloudsw | cr1/2 |
cloud-storage1-eqiad | asw2-b*, cloudsw | all cloudcephosd eth1 | (none) |
* To be removed when hosts are moved away from that device
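As a reading aid, the L3 gateway column of the stage 0 table and of the table above can be compared programmatically. This is a minimal sketch: the dictionaries are transcribed from the two tables, and the `gateway_changes` helper is a hypothetical name introduced only for illustration.

```python
# L3 gateway per VLAN, transcribed from the stage 0 and stage 2A tables.
# None stands for "(none)" in the tables.
STAGE_0 = {
    "cloud-hosts1-eqiad": "cr1/2 (via asw2-b)",
    "cloud-instances2-eqiad": "cloudnet1003/1004 eth1",
    "cloud-instances-transport1-eqiad": "cr1/2",
    "cloud-storage1-eqiad": None,
}
STAGE_2A = {
    "cloud-hosts1-eqiad": "cr1/2 (via cloudsw)",
    "cloud-instances2-eqiad": "cloudnet1003/1004 eth1",
    "cloud-instances-transport1-eqiad": "cloudsw",
    "cloud-transit1/2-eqiad": "cr1/2",
    "cloud-storage1-eqiad": None,
}

ABSENT = "(vlan not present)"


def gateway_changes(before, after):
    """Map each VLAN whose gateway differs to an (old, new) tuple."""
    changes = {}
    for vlan in sorted(set(before) | set(after)):
        old = before.get(vlan, ABSENT)
        new = after.get(vlan, ABSENT)
        if old != new:
            changes[vlan] = (old, new)
    return changes


for vlan, (old, new) in gateway_changes(STAGE_0, STAGE_2A).items():
    print(f"{vlan}: {old} -> {new}")
```

The comparison highlights three differences: the cloud-hosts gateway is now reached via cloudsw, the cloud-instances-transport gateway moves from cr1/2 to cloudsw, and the cloud-transit vlans are new.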
stage 2B: enable L3 routing on cloudgw nodes
TBD
stage 3
final status for all main network components
TBD
- connectivity between cloudgw and the cloud-hosts1-b-eqiad subnet.
- L3:
- a single IP address allocated by standard methods for ssh management, puppet, monitoring, etc. The gateway for this subnet lives on the core routers, but is switched through cloudsw after stage 1.
- L2:
- cloudgw has 2 NICs bonded/teamed/aggregated and then trunked with 3 vlans:
- cloud-hosts1-b-eqiad (vlan 1118) 10.64.20.0/24
- cloud-instances-transport1-b-eqiad (vlan 1120) 208.80.155.88/29
- cloud-new-transport-eqiad (vlan 11XX) final CIDR TBD
- connectivity between Neutron (cloudnet) and cloudgw:
- L3:
- keep the current cloud-instances-transport1-b-eqiad (vlan 1120) 208.80.155.88/29
- keep the current cloud-instances2-b-eqiad (vlan 1105) 172.16.0.0/21
- L2:
- cloudnet keeps 2 NICs, each with a different setup:
- cloud-hosts1-b-eqiad (vlan 1118) 10.64.20.0/24
- other trunked with vlan 1105 and vlan 1120 (cloud-virt-instance-trunk).
- cloudgw has 2 NICs bonded/teamed/aggregated and then trunked with 3 vlans:
- cloud-hosts1-b-eqiad (vlan 1118) 10.64.20.0/24
- cloud-instances-transport1-b-eqiad (vlan 1120) 208.80.155.88/29
- cloud-new-transport-eqiad (vlan 11XX) final CIDR TBD
- connectivity between cloudgw and cloudsw:
- L3:
- allocate new transport range and vlan 11XX.
- static routes between cloudgw and cloudsw
- L2:
- cloudsw has ports aggregated and trunked with vlan 11XX to connect with cloudgw.
- cloudgw has 2 NICs bonded/teamed/aggregated and then trunked with 3 vlans:
- cloud-hosts1-b-eqiad (vlan 1118) 10.64.20.0/24
- cloud-instances-transport1-b-eqiad (vlan 1120) 208.80.155.88/29
- cloud-new-transport-eqiad (vlan 11XX) final CIDR TBD
- connectivity between cloudsw and prod core router:
- L1: cloudsw are directly connected to the prod core routers using 1x10G port each
- L2: 2 vlans are trunked between the two sides: vlan 1118 (cloud-hosts) and 1102 (public interconnect vlan)
- L3: allocate two new interconnect /31 prefixes (208.80.154.210/31 and 208.80.154.212/31), configure eBGP in stage 2A
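A /31 holds exactly two addresses, which is why one such prefix is enough for each point-to-point link between a core router and a cloudsw device. The sketch below lists the two endpoint addresses of each proposed prefix using the standard `ipaddress` module. The helper name `point_to_point_ends` is an assumption for illustration, and which side of the link receives which address is not specified here.

```python
import ipaddress

# Interconnect prefixes proposed for the cloudsw <-> core router links.
INTERCONNECTS = ["208.80.154.210/31", "208.80.154.212/31"]


def point_to_point_ends(prefix):
    """Return the two addresses of a /31 as a (low, high) tuple of strings."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen != 31:
        raise ValueError(f"{prefix} is not a /31")
    # Iterating over a network yields every address it contains, so a /31 gives two.
    low, high = list(net)
    return str(low), str(high)


for prefix in INTERCONNECTS:
    print(prefix, point_to_point_ends(prefix))
```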
codfw
In the codfw datacenter, related to the codfw1dev openstack deployment.
specs for codfw1dev
For cloudgw, repurpose labtestvirt2003 as cloudgw2001-dev.
For cloudsw, we assume we won't have the device anytime soon.
network setup in codfw1dev
Specific configuration details for each stage.
allocations
IPv4 allocations:
185.15.57.0/24
185.15.57.0/29 - Openstack instances NAT (floating IPs)
185.15.57.8/29 - reserved for the above growth
185.15.57.16/28 - unused
185.15.57.32/27 - unused
185.15.57.64/26 - unused
185.15.57.128/25 - infrastructure
185.15.57.128/29 - 2120 - cloud-instances-transport1-b-codfw (cr-codfw <-> cloudgw)
185.15.57.144/29 - 2107 - cloud-gw-transport-codfw (cloudgw <-> neutron)
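Only two /29s of the infrastructure /25 are assigned so far. The remaining space can be derived with the standard `ipaddress` module, as in this minimal sketch. The `free_space` helper is a hypothetical name for illustration, and it assumes the used prefixes do not overlap each other.

```python
import ipaddress


def free_space(parent, used):
    """Return the sorted list of prefixes inside parent not covered by used."""
    free = [ipaddress.ip_network(parent)]
    for used_net in map(ipaddress.ip_network, used):
        remaining = []
        for block in free:
            if used_net.subnet_of(block):
                # Split the block around the used prefix and keep the rest.
                remaining.extend(block.address_exclude(used_net))
            elif block.subnet_of(used_net):
                # The whole block is consumed by the used prefix.
                continue
            else:
                remaining.append(block)
        free = remaining
    return sorted(free)


# Prefixes copied from the codfw allocation list.
CODFW_INFRASTRUCTURE = "185.15.57.128/25"
CODFW_ASSIGNED = [
    "185.15.57.128/29",  # vlan 2120, cloud-instances-transport1-b-codfw
    "185.15.57.144/29",  # vlan 2107, cloud-gw-transport-codfw
]

for net in free_space(CODFW_INFRASTRUCTURE, CODFW_ASSIGNED):
    print(net)
```

With the allocations as listed, this reports 185.15.57.136/29, 185.15.57.152/29, 185.15.57.160/27 and 185.15.57.192/26 as still unassigned.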
VLAN allocations:
2105 - cloud-instances1-codfw (172.16.128.0/24)
2107 - cloud-gw-transport-codfw (cloudgw <-> neutron) (185.15.57.144/29)
2118 - cloud-hosts1-codfw (10.192.20.0/24)
2120 - cloud-instances-transport1-codfw (cr-codfw <-> cloudgw) (185.15.57.128/29)
stage 0
starting network setup
stage 1: Route cloud-hosts vlan through cloudsw
We don't have hardware for cloudsw in codfw. This stage is a no-op.
stage 2A: enable L3 routing on cloudsw nodes
We don't have hardware for cloudsw in codfw. This stage is a no-op.
stage 2B: enable L3 routing on cloudgw nodes
stage 3
final status for all main network components
- connectivity between cloudgw and the cloud-hosts1-b-codfw subnet.
- L3:
- a single IP address allocated by standard methods for ssh management, puppet, monitoring, etc. Gateway for this subnet lives in cloudsw.
- L2:
- cloudgw has 2 NICs bonded/teamed/aggregated and then trunked with 3 vlans:
- cloud-hosts1-b-codfw (vlan 2118) 10.192.20.0/24
- cloud-instances-transport1-b-codfw (cloudsw<->cloudgw) (vlan 2120) 208.80.153.184/29
- cloud-gw-transport-codfw (cloudgw <-> neutron) (vlan 2107) 185.15.57.144/29
- connectivity between Neutron (cloudnet) and cloudgw:
- L3:
- cloudnet keeps the current connection to the cloud-hosts1-b-codfw subnet for ssh management, puppet, monitoring, etc. Gateway for this subnet lives in cloudsw.
- drop the current cloud-instances-transport1-b-codfw (vlan 2120) 208.80.153.184/29
- add cloud-gw-transport-codfw (cloudgw <-> neutron) (vlan 2107) 185.15.57.144/29
- keep the current cloud-instances2-b-codfw (vlan 2105) 172.16.128.0/24
- L2:
- cloudnet keeps 2 NICs, each with a different setup:
- cloud-hosts1-b-codfw (vlan 2118) 10.192.20.0/24
- other trunked with vlan 2105 and vlan 2107 (cloud-virt-instance-trunk).
- cloudgw has 2 NICs bonded/teamed/aggregated and then trunked with 3 vlans:
- cloud-hosts1-b-codfw (vlan 2118) 10.192.20.0/24
- cloud-instances-transport1-b-codfw (cloudsw<->cloudgw) (vlan 2120) 208.80.153.184/29
- cloud-gw-transport-codfw (cloudgw <-> neutron) (vlan 2107) 185.15.57.144/29
- connectivity between cloudgw and cr-codfw:
- L3:
- relocate cloud-instances-transport1-codfw (cr-codfw <-> cloudgw) (185.15.57.128/29) vlan 2120
- L2:
- cloudgw has 2 NICs bonded/teamed/aggregated and then trunked with 3 vlans:
- cloud-hosts1-b-codfw (vlan 2118) 10.192.20.0/24
- cloud-instances-transport1-b-codfw (cloudsw<->cloudgw) (vlan 2120) 208.80.153.184/29
- cloud-gw-transport-codfw (cloudgw <-> neutron) (vlan 2107) 185.15.57.144/29
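The VLAN membership described above for codfw can be summarised with plain set operations. This is a minimal sketch: the VLAN IDs and names are taken from the lists above, while the dictionary and helper names are illustrative assumptions.

```python
# VLANs present on each host type in the final codfw layout, per the lists above.
CLOUDGW_VLANS = {
    2118: "cloud-hosts1-b-codfw",
    2120: "cloud-instances-transport1-b-codfw",
    2107: "cloud-gw-transport-codfw",
}
CLOUDNET_VLANS = {
    2118: "cloud-hosts1-b-codfw",
    2105: "cloud-instances2-b-codfw",
    2107: "cloud-gw-transport-codfw",
}


def shared_vlans(a, b):
    """VLAN IDs present on both hosts."""
    return sorted(set(a) & set(b))


def only_on(a, b):
    """VLAN IDs present on the first host but not on the second."""
    return sorted(set(a) - set(b))


print("shared:", shared_vlans(CLOUDGW_VLANS, CLOUDNET_VLANS))
print("cloudgw only:", only_on(CLOUDGW_VLANS, CLOUDNET_VLANS))
print("cloudnet only:", only_on(CLOUDNET_VLANS, CLOUDGW_VLANS))
```

Read this way, vlan 2107 is the transport between cloudgw and Neutron, vlan 2118 is the management network on both, vlan 2120 exists only on the cloudgw upstream side, and vlan 2105 stays behind cloudnet.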