This page may be outdated or contain incorrect details. Please update it if you can.

Introduction

The OpenStack cloud, which is made up of tenant virtual machines (instances), runs on top of our OpenStack provider equipment. This provider equipment itself exists in production VLANs and provides services such as Designate, PowerDNS, Keystone, and others necessary for a functional tenant cloud.

The tenant network(s) is where virtual machines are created and connected. Tenants on the tenant network(s) have an instance that is bridged on a hypervisor to the relevant cloud-instances[12]-b-$site VLANs. Instances in these VLANs have their DHCP, DNS, and connectivity dynamically managed by the OpenStack control plane services. Instances in these VLANs are considered to exist 'in the cloud'. The gateway address for instances in the tenant network(s) exists on a cloudnet* (or cloudnet*-dev) host.

The services and equipment that power the tenant cloud have their own networking needs. These networks allow the OpenStack control plane components to communicate with each other and the rest of the Wikimedia production networks. These are called provider networks in OpenStack design and topology terms.

Ideally, no tenant instance accesses any privately addressed provider network directly and all transported tenant traffic is separated until a pre-determined point of egress.

Policy Statements

These guidelines establish shared understanding and best practice. These will evolve over time as capabilities and prudence changes. These statements should guide architecture and implementation to create a consistent topology.

An ACL must exist between the instance network and the rest of production (which includes the provider networks) to minimize exposure. This ACL should be as restrictive as possible between any privately addressed production equipment and the tenant networks. As of this writing, there are 2 relevant devices that do or can do this network policing: the production core routers and the cloudsw routers. There are plans for introducing another firewall in this network path, cloudgw.

Any service in the production address range exposed directly for instances to consume should exist in the public address space and be treated as if it were serving external clients. This means iptables restricting the service to the cloud range and required clients only. Functionally, we trust the instances hosted by Wikimedia Cloud Services as much as we do the Internet at large. Administratively, we acknowledge we have more control and ability to determine outcome. There are several current outliers that exist in the historic cloud-support* VLANs.

Any service exposed directly for instances which can be brought into the Cloud network itself should be. In the past, solutions like OpenStack Ironic for a bare-metal cloud have been considered without success.

Host Type and Networking Requirements

cloudvirt*

These hosts require 2 NICs. Eth0 is an access port in the relevant cloud-hosts* VLAN (generally vlan-cloud-hosts1-b-eqiad) which is determined by physical site and row. Eth1 is a trunk that allows tenant instance networks (cloud-virt-instance-trunk/cloud-instance1-eqiad).

We use 'flat' network OpenStack logic with the VLAN package which creates eth1.XXX subinterfaces to bridge the physical networking infrastructure to the logical.

The iptables setup on these hosts is managed dynamically by OpenStack.

cloudnet*

These hosts require at least 2 NICs. Eth0 is an access port in the relevant cloud-hosts* VLAN which is determined by physical site and row. Eth1 is a trunk that allows tenant networks, including the cloud-transport* VLAN. Cloud-transport* is meant to separate tenant and provider traffic in the coming Neutron based OpenStack deployments.

The iptables setup on these hosts is managed dynamically by OpenStack.

cloudcontrol*

These hosts require 1 NIC. Eth0 is an access port in the public VLAN.

These hosts should be heavily administratively firewalled by iptables. These hosts have both services that tenant instances should legitimately query and those they should not.

cloudmetrics*

These hosts require 1 NIC. Eth0 is an access port in a cloud-support* VLAN.

These hosts should be moved to public address space and have their iptables rules evaluated.

cloudservices*

These hosts require 1 NIC. Eth0 is an access port in the relevant public* VLAN by physical datacenter and row.

Recursive DNS functions that serve instances here should be brought into the tenant network(s) and be virtualized. [Need task]

TODO: clarify what happens with tcp/udp/53. Shall they be open for the wide internet?

cloud[lab]web*

These hosts require 1 NIC. Eth0 is an access port in the relevant public* VLAN by physical datacenter and row.

These hosts are behind the misc varnish cluster and could be considered for moving into private address space at a later date.

TODO: It is unclear whether current defined best practice requires these hosts to be in the public address space. They are now because of connectivity requirements. Cloud[lab)web* requires the ability to query nova-api which is restricted from private production VLANs. Because of this requirement they are in public address space. The cloudcontrol* refresh and Neutron deployment moved the nova-api service to cloudcontrol* hosts instead of cloudnet*. This should be reevaluated.

cloud[lab]store*

store* hosts do not share a common function other than providing large storage so they may be varied in their requirements.

cloud[lab]store100[45]

These hosts require 2 NICs. Eth0 is currently an access port in the labs-support* VLAN. Eth1 is a directly connected port for DRBD replication between hosts. The 'secondary' cluster (so named when it was actually the secondary) provides NFSd on top of DRBD for Toolforge and an array of projects.

These hosts should be moved to public address space, or be brought internal to the tenant network, and have their iptables rules evaluated.

cloud[lab]store1003

This host requires 1 NIC. Eth0 is an access port in the labs-support* VLAN.

This host is being deprecated by a combination of Labstore100[67] for dumps, and Labstore100[89] for scratch/maps. This host is way over warranty is should sunset as soon as possible. These functions are being transitioned to hosts that exist in the public address space.

cloud[lab]store100[67]

These hosts require 1 NIC. Eth0 is an access port in the relevant public* VLAN by physical datacenter and row.

These are meant to be independent archives of the same data and should exist in separate physical rows. Maintenance on these hosts should be performed separately for this reason.

cloud[lab]store100[89]

These hosts require 2 NICs. Eth0 is an access port in the relevant public* VLAN by physical datacenter and row. Eth1 is a directly connected port for DRBD replication between hosts.

These hosts are meant assume the NFSd scratch and maps duties for Labstore1003 so it can be decommissioned finally.

cloud[lab]db* and dbproxy*

These hosts do not share a common function other than providing a database server endpoint (they are not even all mysql or mariadb), or acting as a loadbalancer for tenants to access the database endpoints. These may be varied in their requirements.

labsdb10[09|10|11]

These hosts hosts require 1 NIC. Eth0 is an access port in a labs-support* VLAN.

Whether these database servers (which provide wikireplica database services to Wikimedia Cloud VPS users) should be moved into the public, private, or other VLAN needs discussion. It may be that these remain in some private VLAN as actual connections from tenant networks are made to the relevant dbproxy* host instead of these directly.

dbproxy10[09|10]

These hosts hosts require 1 NIC. Eth0 is an access port in a labs-support* VLAN.

Whether these HAProxy servers (which provide a termination point for load balancing to the wiki replicas) should be moved into the public, private, or other VLAN needs discussion. It may be prudent to move these to public address space.

Examples of tenant resolution for accessing a wikireplica endpoint:

enwiki.analytics.db.svc.eqiad.wmflabs is an alias for s1.analytics.db.svc.eqiad.wmflabs

s1.analytics.db.svc.eqiad.wmflabs has address 10.64.37.14

host 10.64.37.14 is dbproxy1010.eqiad.wmnet.

dbproxy1010.eqiad.wmnet HAProxy on port 3306 has labsdb1010 in its pool.

labsdb100[4567]

These hosts require 1 NIC. Eth0 is an access port in a labs-support* VLAN.

These hosts are out of warranty and are being replaced by virtual machines inside of the tenant network with dedicated cloud[lab]virt hypervisors. This work is being tracked in T193264

cloud*-dev

These hosts belong to a long-lived PoC and staging environment codfw1dev for provider and tenant integration and upgrade testing.

The network and policy requirements for these hosts should be consistent with non-test deployments with a few special considerations:

We do not allow non-administrative users access to this cluster. This means access to any portal or SSH for provider equipment and instances. Anyone with access must have an NDA and be managed via the admin module.

We do not allow registration on the related wikitech (LDAP registration) site. This can be enabled as needed or a user can be created manually on a per-case basis. Registration should be disabled during normal operations.

Currently, cloudcontrol*-dev, cloudservices*-dev (which includes LDAP for the existing deployment) and cloudweb*-dev* are all in public address space with ferm settings.