Netbox/ProvisionServerNetwork Script

From Wikitech

The Netbox ProvisionServerNetwork script is used by DC-Ops to add the elements Netbox requires to represent a server's network connection and setup.

It is maintained by Infrastructure Foundations with the other Netbox scripts, in the netbox-extras Gerrit repo.

NOTE: The remainder of this page relates to an updated version of the script that is currently in development and, at the time of writing, is only available on netbox-next (CM 2023-12-15).

Versions of Script

There are 3 variants of the script available:

Server Provision is the normal version DC-Ops should run when configuring a new server. It will look for a Host Profile matching the hostname prefix (for instance 'ganeti' for a server called 'ganeti1065') to gather detail on what vlan, IP and bridge configuration the new host should be assigned.

CSV Server Provision is a variant of the script that allows the user to upload a CSV file with the relevant details of multiple hosts that need to be provisioned at once. It also uses the Host Profile definitions to determine how to set up each host.

Custom Server Provision is a variant which allows the user to enter more details, such as selecting a particular switch to connect to, and a custom set of tagged and untagged vlans. It is not expected to be needed under normal circumstances, but is provided to give flexibility if non-standard cases arise.

Host Profiles

The host-profiles data structure defines how the networking should be set up for a host of a given type. If no matching profile is found for a given hostname, the special "default" profile is used when provisioning the host.
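The profile lookup can be sketched as follows. This is a minimal illustration rather than the script's actual code: the contents of `HOST_PROFILES`, the function name `find_profile`, and the rule that the prefix is the hostname with its trailing digits removed are all assumptions made for the example.

```python
import re

# Hypothetical profile data, for illustration only.
HOST_PROFILES = {
    "default": {"untagged_vlan": "private1-{location}-{site}"},
    "ganeti": {
        "untagged_vlan": "private1-{location}-{site}",
        "tagged_vlans": ["public1-{location}-{site}"],
        "no_ip_vlans": ["public1-{location}-{site}"],
    },
}


def find_profile(hostname, profiles=HOST_PROFILES):
    """Return (profile_name, profile) for a hostname, falling back to 'default'."""
    # Assumption: the prefix is the hostname with its trailing digits removed,
    # e.g. 'ganeti1065' -> 'ganeti'.
    prefix = re.sub(r"\d+$", "", hostname)
    name = prefix if prefix in profiles else "default"
    return name, profiles[name]
```

With this data, `find_profile("ganeti1065")` selects the 'ganeti' profile, while a hostname with no matching prefix, such as `dns1005`, falls back to 'default'.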


Example

The example below shows a host profile with every option included. It is intended to illustrate all the options; no existing host actually needs all of them set:

    'sretest': {
        'untagged_vlan': 'private1-{location}-{site}',
        'tagged_vlans': [
            'public1-{location}-{site}',
            'analytics1-{location}-{site}'
        ],
        'no_ip_vlans': [
            'public1-{location}-{site}',
            'analytics1-{location}-{site}'
        ],
        'cassandra_instances': 4,
        'ipv6_dns': False,
        'bridges': {
            'private':   ['private1-{location}-{site}'],
            'public':    ['public1-{location}-{site}'],
            'analytics': ['analytics1-{location}-{site}']
        }
    }

Adding a device matching the above profile, in codfw rack C1 and port specified as 9, would result in the below host interface configuration (example script log):

Note that the analytics vlan is not present as none exists in codfw.

Attributes

untagged_vlan: This is the only mandatory parameter. The value should be a string formatted as a vlan template. It determines the primary vlan the host is added to, and from which its IPs are assigned.

tagged_vlans: Optional parameter to specify one or more vlans that should be trunked to the host over its primary network connection in addition to the untagged vlan. These vlans will use 802.1q tagging on the wire to separate traffic from each other and from the primary untagged vlan. The value of this parameter should be a list of strings, each of which is a valid vlan template. For each trunked vlan a child interface of the primary link on the host will be created, with naming convention 'vlan<vlan-id>'. An IP address from the relevant subnet will be added for each tagged vlan and assigned to the vlan sub-interface on the host. If a given vlan template fails to resolve to a vlan name in a given location, the allocation of that sub-interface is skipped, but provisioning continues.

Normal DNS names will be assigned as per the vlan domain definition for each vlan. However, if the resulting FQDN matches the primary dns_name for the host the FQDN for the sub-interface will be prefixed with the vlan interface name. For instance if a host has a primary dns_name of ganeti1068.eqiad.wmnet, and the host profile has a tagged vlan that would also generate the same name, the IP on the vlan interface is instead called something like 'vlan1234.ganeti1068.eqiad.wmnet'. This ensures the primary dns_name will only ever resolve to a single IP in a given address family.
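The naming rule for tagged vlan interfaces can be sketched like this. The function name and its parameters are hypothetical, chosen only to illustrate the collision check; the script itself may structure this differently.

```python
def interface_dns_name(hostname, suffix, primary_dns_name, iface_name):
    """Return the DNS name for an IP on a tagged vlan sub-interface.

    All parameter names are illustrative assumptions:
      hostname         - short host name, e.g. 'ganeti1068'
      suffix           - DNS suffix from the vlan domain definition
      primary_dns_name - FQDN already used by the host's primary IP
      iface_name       - sub-interface name, e.g. 'vlan1234'
    """
    fqdn = f"{hostname}.{suffix}"
    if fqdn == primary_dns_name:
        # Avoid two IPs in one address family sharing the primary name.
        return f"{iface_name}.{fqdn}"
    return fqdn
```

For example, `interface_dns_name("ganeti1068", "eqiad.wmnet", "ganeti1068.eqiad.wmnet", "vlan1234")` gives `vlan1234.ganeti1068.eqiad.wmnet`, while a vlan with a different suffix keeps the plain `hostname.suffix` form.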

no_ip_vlans: Optional parameter to specify one or more vlans that should not have any IP addresses added to them. Every vlan listed here must also appear in the 'tagged_vlans' list. This can be used where a host needs to be connected to a vlan, but we don't want to assign it any IPs on that vlan. An example is our current Ganeti host config, which needs to be connected to the public and analytics vlans, but only uses them as part of an internal bridge to connect VMs, without any IP from the associated subnets being configured at the host layer.
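The constraint that 'no_ip_vlans' entries must also be tagged vlans is easy to check when editing profiles. The helper below is a hypothetical sketch for that purpose and is not part of the script as described here.

```python
def invalid_no_ip_vlans(profile):
    """Return the 'no_ip_vlans' entries that are missing from 'tagged_vlans'."""
    tagged = set(profile.get("tagged_vlans", []))
    return [vlan for vlan in profile.get("no_ip_vlans", []) if vlan not in tagged]
```

An empty list means the profile satisfies the constraint.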

cassandra_instances: This can be used for hosts that run multiple Cassandra instances, where each of the instances needs a separate IPv4 address to bind to. The value should be an integer reflecting the number of additional IPs that should be allocated to matching hosts. The DNS name of matching hosts will be set to $HOSTNAME-a, $HOSTNAME-b, etc.
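The per-instance naming can be illustrated with a short sketch. The function name is hypothetical, and the upper limit of 26 instances is an assumption that follows from using single letters as suffixes.

```python
import string


def cassandra_instance_names(hostname, instances):
    """Return per-instance names: hostname-a, hostname-b, and so on."""
    # Assumption: one lowercase letter per instance, so at most 26 instances.
    if not 0 <= instances <= len(string.ascii_lowercase):
        raise ValueError("instances must be between 0 and 26")
    return [f"{hostname}-{letter}" for letter in string.ascii_lowercase[:instances]]
```

With 'cassandra_instances' set to 4 as in the example profile, a host named sretest1001 would get the names sretest1001-a to sretest1001-d.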

ipv6_dns: This parameter can be used to disable the generation of AAAA DNS records for any IPv6 addresses assigned to the host. The value is a boolean (True/False). If not included for a given profile the script defaults to 'True' and will assign DNS names for IPv6 addresses. The toggle is global and will affect all IPv6 addresses added to a host (i.e. the script currently cannot disable IPv6 record creation on a per-vlan basis if there are multiple). This is needed for some hosts which don't fully support IPv6, ensuring anything connecting to them only gets IPv4 records returned for DNS queries, and so only uses that protocol to connect to them.

bridges: This parameter can be used to define one or more grouped bridge interfaces on the host. On end-systems these will typically be configured as Linux bridge devices, which act as a virtual Ethernet switch connecting multiple host network devices. The value of this parameter should be a hash/dictionary with keys corresponding to the names of the bridge devices, and values being a list of vlan templates to define the interfaces that should be members of the bridge. Every vlan listed should be either the 'untagged_vlan' or one of the 'tagged_vlans'. If a vlan is listed that no interface on the host is connected to, it is skipped and provisioning continues. If that would result in an empty bridge, i.e. one with no members at all, the bridge is not created.

Typically only one vlan should be listed for each bridge, as we don't want servers to act as Ethernet switches, bridging between two vlans that are configured on the network side. Typically a host bridge will have one vlan member, connecting the bridge to the outside world, as well as virtual members added by daemons running on the host. Such virtual devices might be tap or veth interfaces connecting VMs or containers.

If an interface is made a member of a bridge, any IPs that had been assigned to the member port are moved to the parent bridge device instead. This is always the correct way to configure IPs on a bridge, as member ports are effectively L2 switch ports, and the bridge device should act like a routed Vlan/SVI/IRB interface.
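The bridge rules above (skip vlans the host is not connected to, drop bridges that end up empty, and move member IPs to the bridge) can be sketched as follows. The function name, the shape of the `connected_vlans` argument, and the vlan, interface and IP values in the usage example are all assumptions made for illustration.

```python
def plan_bridges(bridges, connected_vlans):
    """Work out bridge membership and the IPs each bridge should carry.

    bridges:         bridge name -> list of resolved vlan names
    connected_vlans: vlan name -> {"interface": str, "ips": [str]} for every
                     vlan the host is actually connected to
    """
    plan = {}
    for bridge_name, vlan_names in bridges.items():
        members, ips = [], []
        for vlan in vlan_names:
            iface = connected_vlans.get(vlan)
            if iface is None:
                continue  # host has no interface in this vlan: skip it
            members.append(iface["interface"])
            # IPs of a member port are listed on the bridge instead.
            ips.extend(iface["ips"])
        if members:  # a bridge with no members is not created
            plan[bridge_name] = {"members": members, "ips": ips}
    return plan
```

A usage example modelled on the codfw case, with made-up vlan names, interface names and a documentation-range IP:

```python
connected = {
    "private1-c-codfw": {"interface": "eno1", "ips": ["192.0.2.10/24"]},
    "public1-c-codfw": {"interface": "vlan2003", "ips": []},
}
bridges = {
    "private": ["private1-c-codfw"],
    "public": ["public1-c-codfw"],
    "analytics": ["analytics1-c-codfw"],  # not present on this host
}
plan = plan_bridges(bridges, connected)
```

Here `plan` contains the 'private' and 'public' bridges only, and the IP from eno1 is listed on the 'private' bridge.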

VLAN Templates

A "vlan template" is merely a string which is used to define a vlan in a host profile. It has two special elements, '{site}' and '{location}', which get replaced based on the properties of a given device when the script runs.

'{site}' always gets replaced with the site the host belongs to in netbox, i.e. eqiad, codfw, drmrs etc.

'{location}' in the first instance gets replaced by the host's rack, for instance E4. If the vlan name rendered by the template is not found with location set to the rack, the row is used instead. Lastly, if no vlan is returned when using the row, the location is left out completely to search for a site-wide vlan.

For example, given the vlan template "private1-{location}-{site}", for a server in Eqiad rack E4, the script will try to fetch the following vlan names in order:

  1. private1-e4-eqiad
  2. private1-e-eqiad
  3. private1-eqiad

In reality, for rack E4 there is a vlan matching number 1, so that would be used. But some rows still only have row-wide vlans, and some POPs, like ulsfo and eqsin, only have site-wide vlans matching number 3.
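The rack, row, site-wide search order can be sketched as below. This is an illustration under stated assumptions, not the script's code: the function names are hypothetical, the row is derived by stripping trailing digits from the rack name (the script may instead read it from Netbox), and "leaving the location out" is modelled by removing the `-{location}` segment from the template.

```python
def candidate_vlan_names(template, site, rack):
    """Return the vlan names to try, most specific first."""
    rack = rack.lower()
    # Assumption: the row is the rack name minus its trailing digits (E4 -> e).
    row = rack.rstrip("0123456789")
    locations = [rack]
    if row and row != rack:
        locations.append(row)
    names = [template.format(location=loc, site=site) for loc in locations]
    # Site-wide fallback: drop the location segment entirely.
    names.append(template.replace("-{location}", "").format(site=site))
    return names


def resolve_vlan(template, site, rack, existing_vlans):
    """Return the first candidate found in existing_vlans, or None."""
    for name in candidate_vlan_names(template, site, rack):
        if name in existing_vlans:
            return name
    return None
```

For the template "private1-{location}-{site}" and a server in eqiad rack E4, `candidate_vlan_names` returns the three names listed above, in the same order. Returning None from `resolve_vlan` corresponds to the case where a template does not resolve at a given location.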

VLAN Domains

The VLAN Domains dictionary is a small data structure included in the provision script code, which is used to define what DNS suffix is used for Netbox DNS entries on IPs attached to that vlan. For example:

VLAN_DOMAINS = {
    'private': '{site}.wmnet',
    'public': 'wikimedia.org',
    'analytics': '{site}.wmnet',
    'cloud-private': 'private.{site}.wikimedia.cloud'
}

When a DNS name needs to be assigned to an IP, the script iterates over the above dict, and if the vlan the IP belongs to has a name starting with one of the keys, the corresponding DNS suffix is used. The '{site}' element is again a special one, which is automatically replaced with the site name when the suffix is added. If the vlan name does not start with any of the keys defined in this dict, then no DNS records are added for IPs created in it.
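A minimal sketch of that prefix match, reusing the dict shown above. The function name `dns_suffix` is hypothetical, and returning None to signal "no DNS record" is an assumption about how the result might be represented.

```python
VLAN_DOMAINS = {
    'private': '{site}.wmnet',
    'public': 'wikimedia.org',
    'analytics': '{site}.wmnet',
    'cloud-private': 'private.{site}.wikimedia.cloud'
}


def dns_suffix(vlan_name, site, domains=VLAN_DOMAINS):
    """Return the DNS suffix for a vlan, or None if no key matches."""
    for prefix, suffix in domains.items():
        if vlan_name.startswith(prefix):
            # Suffixes without '{site}' are returned unchanged by format().
            return suffix.format(site=site)
    return None  # no matching key: no DNS records for IPs in this vlan
```

For instance, `dns_suffix("private1-e4-eqiad", "eqiad")` gives `eqiad.wmnet`, while a vlan whose name matches none of the keys gives None.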

CSV Format

The CSV Server Provision interface allows a CSV formatted file to be uploaded so that multiple devices can be configured at once. The CSV provisioning is based on the standard script, in other words it uses the Host Profiles to find vlan and other information.

CSV files need to be formatted as follows:

device,sw_port,int_speed,cable_id
ganeti2009,9,10G,12345
ganeti2010,24,10G,
dns1005,12,1G,5678

These fields need to be populated as follows:

device: A netbox device of type 'server' which has a status of 'planned' or 'inventory' and has its rack location set.
sw_port: The physical switch port number the device is cabled to (matching the numbers shown on the physical device).
int_speed: One of '1G', '10G' or '25G', reflecting the speed at which the host has been connected.
cable_id: The number/id on the label applied to the physical cable for identification.

The 'cable_id' is an optional field, however the comma should always be included after the "int_speed" regardless (like in the middle line in the example). This ensures the cable_id field is there - just empty - for the script to read.
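A file can be sanity-checked before upload with a short script like the one below. It is a sketch based only on the format described above; the function name and the exact checks are assumptions, and it does not reproduce the validation the Netbox script itself performs (for example, it cannot check device status or rack location).

```python
import csv
import io

EXPECTED_FIELDS = ["device", "sw_port", "int_speed", "cable_id"]
VALID_SPEEDS = {"1G", "10G", "25G"}


def parse_provision_csv(text):
    """Parse and validate CSV text, returning one dict per host."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_FIELDS:
        raise ValueError(f"header must be {','.join(EXPECTED_FIELDS)}")
    rows = []
    for row in reader:
        line = reader.line_num
        # DictReader fills missing trailing fields with None, which is what
        # happens when the comma after int_speed is left out.
        if row["cable_id"] is None:
            raise ValueError(f"line {line}: missing field or trailing comma")
        if not row["sw_port"].isdigit():
            raise ValueError(f"line {line}: sw_port must be a number")
        if row["int_speed"] not in VALID_SPEEDS:
            raise ValueError(f"line {line}: int_speed must be 1G, 10G or 25G")
        rows.append({
            "device": row["device"],
            "sw_port": int(row["sw_port"]),
            "int_speed": row["int_speed"],
            "cable_id": row["cable_id"] or None,  # empty field -> no cable id
        })
    return rows
```

Run against the example file above, this yields three entries, with `cable_id` set to None for ganeti2010. A line such as `ganeti2010,24,10G` with no trailing comma raises a ValueError.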